{"id":2917,"date":"2025-08-01T22:35:55","date_gmt":"2025-08-01T22:35:55","guid":{"rendered":"https:\/\/violethoward.com\/new\/new-vision-model-from-cohere-runs-on-two-gpus-beats-top-tier-vlms-on-visual-tasks\/"},"modified":"2025-08-01T22:35:55","modified_gmt":"2025-08-01T22:35:55","slug":"new-vision-model-from-cohere-runs-on-two-gpus-beats-top-tier-vlms-on-visual-tasks","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/new-vision-model-from-cohere-runs-on-two-gpus-beats-top-tier-vlms-on-visual-tasks\/","title":{"rendered":"New vision model from Cohere runs on two GPUs, beats top-tier VLMs on visual tasks"},"content":{"rendered":" \r\n<br><div>\n\t\t\t\t<p>The rise of Deep Research features and other AI-powered analysis has spurred more models and services that aim to simplify that process and read more of the documents businesses actually use.\u00a0<\/p>\n\n\n\n<p>Canadian AI company Cohere is banking on its models, including a newly released visual model, to make the case that Deep Research features should also be optimized for enterprise use cases.\u00a0<\/p>\n\n\n\n<p>The company has released Command A Vision, a visual model specifically targeting enterprise use cases, built on the back of its Command A model. The 112 billion parameter model can \u201cunlock valuable insights from visual data, and make highly accurate, data-driven decisions through document optical character recognition (OCR) and image analysis,\u201d the company says. 
<\/p>\n\n\n\n<p>\u201cWhether it\u2019s interpreting product manuals with complex diagrams or analyzing photographs of real-world scenes for risk detection, Command A Vision excels at tackling the most demanding enterprise vision challenges,\u201d the company said in a blog post.\u00a0<\/p>\n\n\n\n<p>This means Command A Vision can read and analyze the most common types of images enterprises need: graphs, charts, diagrams, scanned documents and PDFs.\u00a0<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter\"><div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"twitter-tweet\" data-width=\"500\" data-dnt=\"true\"><p lang=\"en\" dir=\"ltr\">? <a href=\"https:\/\/twitter.com\/cohere?ref_src=twsrc%5Etfw\">@cohere<\/a> just dropped Command A Vision on <a href=\"https:\/\/twitter.com\/huggingface?ref_src=twsrc%5Etfw\">@huggingface<\/a>  ?<\/p><p>Designed for enterprise multimodal use cases: interpreting product manuals, analyzing photos, asking about charts\u2026 \u2753??<\/p><p>A 112B dense vision-language model with SOTA performance \u2013 check out the benchmark metrics in\u2026 <a href=\"https:\/\/t.co\/ORMfM5f8cF\">pic.twitter.com\/ORMfM5f8cF<\/a><\/p>\u2014 Jeff Boudier ? 
(@jeffboudier) <a href=\"https:\/\/twitter.com\/jeffboudier\/status\/1950937716385886659?ref_src=twsrc%5Etfw\">July 31, 2025<\/a><\/blockquote>\n<\/div><\/figure>\n\n\n\n<p>Since it\u2019s built on Command A\u2019s architecture, Command A Vision requires two or fewer GPUs, just like the text model. The vision model also retains Command A\u2019s text capabilities, allowing it to read words on images and understand at least 23 languages. Cohere said that, unlike other models, Command A Vision reduces the total cost of ownership for enterprises and is fully optimized for business retrieval use cases.\u00a0<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-how-cohere-is-architecting-command-a\">How Cohere is architecting Command A<\/h2>\n\n\n\n<p>Cohere said it followed a LLaVA architecture to build its Command A models, including the visual model. This architecture turns visual features into soft vision tokens, which can be divided into different tiles.\u00a0<\/p>\n\n\n\n<p>These tiles are passed into the Command A text tower, \u201ca dense, 111B parameters textual LLM,\u201d the company said. \u201cIn this manner, a single image consumes up to 3,328 tokens.\u201d<\/p>\n\n\n\n<p>Cohere said it trained the visual model in three stages: vision-language alignment, supervised fine-tuning (SFT) and post-training reinforcement learning with human feedback (RLHF).<\/p>\n\n\n\n<p>\u201cThis approach enables the mapping of image encoder features to the language model embedding space,\u201d the company said of the alignment stage. 
\u201cIn contrast, during the SFT stage, we simultaneously trained the vision encoder, the vision adapter and the language model on a diverse set of instruction-following multimodal tasks.\u201d<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-visualizing-enterprise-ai-nbsp\">Visualizing enterprise AI\u00a0<\/h2>\n\n\n\n<p>Benchmark tests showed Command A Vision outperforming other models with similar visual capabilities.\u00a0<\/p>\n\n\n\n<p>Cohere pitted Command A Vision against OpenAI\u2019s GPT 4.1, Meta\u2019s Llama 4 Maverick, Mistral\u2019s Pixtral Large and Mistral Medium 3 in nine benchmark tests. The company did not say whether it tested the model against Mistral\u2019s OCR-focused API, Mistral OCR.\u00a0<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter\"><div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"twitter-tweet\" data-width=\"500\" data-dnt=\"true\"><p lang=\"en\" dir=\"ltr\">It enables agents to securely see inside your organization\u2019s visual data, unlocking the automation of tedious tasks involving slides, diagrams, PDFs, and photos. <a href=\"https:\/\/t.co\/iHZnUWekrk\">pic.twitter.com\/iHZnUWekrk<\/a><\/p>\u2014 cohere (@cohere) <a href=\"https:\/\/twitter.com\/cohere\/status\/1950920613767340189?ref_src=twsrc%5Etfw\">July 31, 2025<\/a><\/blockquote>\n<\/div><\/figure>\n\n\n\n<p>Command A Vision outscored the other models in tests such as ChartQA, OCRBench, AI2D and TextVQA. 
Overall, Command A Vision scored an average of 83.1%, compared with 78.6% for GPT 4.1, 80.5% for Llama 4 Maverick and 78.3% for Mistral Medium 3.\u00a0<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcSCnFTJrO6NF-S-eFsFHUhCdJaM1xHBNqkNcTjYv8igsOjBmtZYa8IAo0WiTapRQkSAP92LzXGE77B5EOWjlwPksCdsatx10o6SlehYtg9JDwBQNOa2E8cRdKQ2BRz-l5xbWBosg?key=Vfq9RrNgfcLbbqBNE4F1QQ\" alt=\"Benchmark chart comparing Command A Vision with other vision-language models\"\/><\/figure>\n\n\n\n<p>Most large language models (LLMs) are now multimodal, meaning they can understand or generate visual media such as photos and videos. Enterprises, however, rely heavily on graphical documents such as charts and PDFs, so extracting information from these unstructured data sources often proves difficult.\u00a0<\/p>\n\n\n\n<p>With Deep Research on the rise, the importance of bringing in models capable of reading, analyzing and even downloading unstructured data has grown.<\/p>\n\n\n\n<p>Cohere also said it\u2019s releasing Command A Vision with open weights, in hopes that enterprises looking to move away from closed or proprietary models will start using its products.\u00a0So far, there is some interest from developers.<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter\"><div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"twitter-tweet\" data-width=\"500\" data-dnt=\"true\"><p lang=\"en\" dir=\"ltr\">Very impressed at its accuracy extracting hand handwritten notes from an image!<\/p>\u2014 Adam Sardo (@sardo_adam) <a href=\"https:\/\/twitter.com\/sardo_adam\/status\/1950964437126619515?ref_src=twsrc%5Etfw\">July 31, 2025<\/a><\/blockquote>\n<\/div><\/figure>\n\n\n\n<figure class=\"wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter\"><div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"twitter-tweet\" data-width=\"500\" data-dnt=\"true\"><p lang=\"en\" dir=\"ltr\">Finally, an AI that 
won\u2019t judge my terrible doodles.<\/p>\u2014 Martha Wisener ? (@martwisener) <a href=\"https:\/\/twitter.com\/martwisener\/status\/1951146777815441883?ref_src=twsrc%5Etfw\">August 1, 2025<\/a><\/blockquote>\n<\/div><\/figure>\n\t\t\t<\/div>\r\n<br>\r\n<br><a href=\"https:\/\/venturebeat.com\/ai\/new-vision-model-from-cohere-runs-on-two-gpus-beats-top-tier-vlms-on-visual-tasks\/\">Source link <\/a>","protected":false},"excerpt":{"rendered":"<p>Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. 
Subscribe Now The rise in Deep Research features and other AI-powered analysis has given rise to more models and services looking to simplify that process and read more of the documents [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":2918,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[33],"tags":[],"class_list":["post-2917","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-automation"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/violethoward.com\/new\/wp-content\/uploads\/2025\/08\/Robot-poring-over-documents.jpg","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/2917","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/comments?post=2917"}],"version-history":[{"count":0,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/2917\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media\/2918"}],"wp:attachment":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media?parent=2917"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/categories?post=2917"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/tags?post=2917"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}