{"id":2195,"date":"2025-06-30T13:05:31","date_gmt":"2025-06-30T13:05:31","guid":{"rendered":"https:\/\/violethoward.com\/new\/the-rise-of-prompt-ops-tackling-hidden-ai-costs-from-bad-inputs-and-context-bloat\/"},"modified":"2025-06-30T13:05:31","modified_gmt":"2025-06-30T13:05:31","slug":"the-rise-of-prompt-ops-tackling-hidden-ai-costs-from-bad-inputs-and-context-bloat","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/the-rise-of-prompt-ops-tackling-hidden-ai-costs-from-bad-inputs-and-context-bloat\/","title":{"rendered":"The rise of prompt ops: Tackling hidden AI costs from bad inputs and context bloat"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p><em>This article is part of VentureBeat\u2019s special issue, \u201cThe Real Cost of AI: Performance, Efficiency and ROI at Scale.\u201d Read more from this special issue.<\/em><\/p>\n<p>Model providers continue to roll out increasingly sophisticated large language models (LLMs) with longer context windows and enhanced reasoning capabilities.\u00a0<\/p>\n<p>This allows models to process and \u201cthink\u201d more, but it also increases compute: The more a model takes in and puts out, the more energy it expends and the higher the costs.\u00a0<\/p>\n<p>Couple this with all the tinkering involved with prompting \u2014 it can take a few tries to get to the intended result, and sometimes the question at hand simply doesn\u2019t need a model that can think like a PhD \u2014 and compute spend can get out of control.\u00a0<\/p>\n<p>This is giving rise to prompt ops, a whole new discipline in the dawning age of AI.\u00a0<\/p>\n<p>\u201cPrompt engineering is kind of like writing, the actual creating, whereas prompt ops is like publishing, where you\u2019re evolving the content,\u201d Crawford Del Prete, IDC president, told VentureBeat. \u201cThe content is alive, the content is changing, and you want to make sure you\u2019re refining that over time.\u201d<\/p>\n<h2 class=\"wp-block-heading\" id=\"h-the-challenge-of-compute-use-and-cost\">The challenge of compute use and cost<\/h2>\n<p>Compute use and cost are two \u201crelated but separate concepts\u201d in the context of LLMs, explained David Emerson, applied scientist at the Vector Institute. Generally, the price users pay scales based on both the number of input tokens (what the user prompts) and the number of output tokens (what the model delivers). However, they are not changed for behind-the-scenes actions like meta-prompts, steering instructions or retrieval-augmented generation (RAG).\u00a0<\/p>\n<p>While longer context allows models to process much more text at once, it directly translates to significantly more FLOPS (a measurement of compute power), he explained. Some aspects of transformer models even scale quadratically with input length if not well managed. Unnecessarily long responses can also slow down processing time and require additional compute and cost to build and maintain algorithms to post-process responses into the answer users were hoping for.<\/p>\n<p>Typically, longer context environments incentivize providers to deliberately deliver verbose responses, said Emerson. For example, many heavier reasoning models (o3 or o1 from OpenAI, for example) will often provide long responses to even simple questions, incurring heavy computing costs.\u00a0<\/p>\n<p>Here\u2019s an example:<\/p>\n<p><strong>Input<\/strong>: <em>Answer the following math problem. 
Longer context environments also typically incentivize providers to deliver deliberately verbose responses, said Emerson. For example, many heavier reasoning models (o3 or o1 from OpenAI, for example) will often provide long responses to even simple questions, incurring heavy compute costs.

Here's an example:

**Input**: *Answer the following math problem. If I have 2 apples and I buy 4 more at the store after eating 1, how many apples do I have?*

**Output**: *If I eat 1, I only have 1 left. I would have 5 apples if I buy 4 more.*

The model not only generated more tokens than it needed to, it buried its answer. An engineer may then have to design a programmatic way to extract the final answer, or ask follow-up questions like "What is your final answer?" that incur even more API costs.

Alternatively, the prompt could be redesigned to guide the model to produce an immediate answer. For instance:

**Input**: *Answer the following math problem. If I have 2 apples and I buy 4 more at the store after eating 1, how many apples do I have? Start your response with "The answer is"...*

Or:

**Input**: *Answer the following math problem. If I have 2 apples and I buy 4 more at the store after eating 1, how many apples do I have? Wrap your final answer in bold tags `<b></b>`.*

"The way the question is asked can reduce the effort or cost in getting to the desired answer," said Emerson. He also pointed out that techniques like few-shot prompting (providing a few examples of what the user is looking for) can help produce quicker outputs.
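A redesigned prompt like either of the two above also makes the answer trivially machine-readable. Here is a minimal extraction sketch; the patterns mirror the example prompts, and everything else is illustrative:

```python
import re

def extract_answer(response: str) -> str | None:
    """Pull the final answer from a response that was prompted to
    wrap it in <b></b> tags or to begin with "The answer is"."""
    match = re.search(r"<b>(.*?)</b>", response, re.DOTALL)
    if match:
        return match.group(1).strip()
    match = re.search(r"The answer is[:\s]+(.+)", response)
    if match:
        return match.group(1).strip()
    # No recognizable answer: the fallback is a follow-up query,
    # which is exactly the extra API cost the prompt redesign avoids.
    return None

print(extract_answer("The answer is 5 apples."))            # 5 apples.
print(extract_answer("Reasoning first... then <b>5</b>."))  # 5
```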
One danger is not knowing when to use sophisticated techniques like chain-of-thought (CoT) prompting (generating answers in steps) or self-refinement, which directly encourage models to produce many tokens or go through several iterations when generating responses, Emerson pointed out.

Not every query requires a model to analyze and re-analyze before providing an answer, he emphasized; models may be perfectly capable of answering correctly when instructed to respond directly. Additionally, incorrect API configurations (such as calling OpenAI's o3 with a high reasoning-effort setting) will incur higher costs when a lower-effort, cheaper request would suffice.

"With longer contexts, users can also be tempted to use an 'everything but the kitchen sink' approach, where you dump as much text as possible into a model context in the hope that doing so will help the model perform a task more accurately," said Emerson. "While more context can help models perform tasks, it isn't always the best or most efficient approach."

## Evolution to prompt ops

It's no big secret that AI-optimized infrastructure can be hard to come by these days; IDC's Del Prete pointed out that enterprises must be able to minimize GPU idle time and fit more queries into the idle cycles between GPU requests.

"How do I squeeze more out of these very, very precious commodities?" he noted. "Because I've got to get my system utilization up, because I just don't have the benefit of simply throwing more capacity at the problem."

Prompt ops can go a long way toward addressing this challenge, as it ultimately manages the lifecycle of the prompt. While prompt engineering is about the quality of the prompt, prompt ops is where you iterate, Del Prete explained.

"It's more orchestration," he said. "I think of it as the curation of questions and the curation of how you interact with AI to make sure you're getting the most out of it."

Models can tend to get "fatigued," cycling in loops where the quality of outputs degrades, he said. Prompt ops helps manage, measure, monitor and tune prompts. "I think when we look back three or four years from now, it's going to be a whole discipline. It'll be a skill."

While it's still very much an emerging field, early providers include QueryPal, Promptable, Rebuff and TrueLens. As prompt ops evolves, these platforms will continue to iterate, improve and provide real-time feedback to give users more capacity to tune prompts over time, Del Prete noted.

Eventually, he predicted, agents will be able to tune, write and structure prompts on their own. "The level of automation will increase, the level of human interaction will decrease, you'll be able to have agents operating more autonomously in the prompts that they're creating."

## Common prompting mistakes

Until prompt ops is fully realized, there is ultimately no perfect prompt. Some of the biggest mistakes people make, according to Emerson:

- Not being specific enough about the problem to be solved. This includes how the user wants the model to provide its answer, what should be considered when responding, constraints to take into account and other factors. "In many settings, models need a good amount of context to provide a response that meets users' expectations," said Emerson.
- Not taking into account the ways a problem can be simplified to narrow the scope of the response. Should the answer fall within a certain range (0 to 100)? Should the answer be phrased as a multiple-choice problem rather than something open-ended? Can the user provide good examples to contextualize the query? Can the problem be broken into steps for separate and simpler queries?
- Not taking advantage of structure. LLMs are very good at pattern recognition, and many can understand code. While bullet points, itemized lists or bold indicators (****) may seem "a bit cluttered" to human eyes, Emerson noted, these callouts can be beneficial for an LLM. Asking for structured outputs (such as JSON or Markdown) can also help when users are looking to process responses automatically.

There are many other factors to consider in maintaining a production pipeline, based on engineering best practices, Emerson noted. These include:

- Making sure that the throughput of the pipeline remains consistent;
- Monitoring the performance of the prompts over time (potentially against a validation set);
- Setting up tests and early warning detection to identify pipeline issues.

Users can also take advantage of tools designed to support the prompting process. For instance, the open-source DSPy can automatically configure and optimize prompts for downstream tasks based on a few labeled examples.
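In rough terms, that workflow looks something like the sketch below; it assumes DSPy's Python API, and the model name, toy training examples and metric are placeholders rather than a recommended setup:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Placeholder model; swap in whichever LM (and API key) you actually use.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# A simple question-answering module.
qa = dspy.Predict("question -> answer")

# A few labeled examples (toy data, purely for illustration).
trainset = [
    dspy.Example(question="2 + 2?", answer="4").with_inputs("question"),
    dspy.Example(question="Capital of France?", answer="Paris").with_inputs("question"),
]

# Metric the optimizer uses to judge a candidate prompt's outputs.
def answer_match(example, prediction, trace=None):
    return example.answer.lower() in prediction.answer.lower()

# Bootstrap few-shot demonstrations that score well on the metric,
# rather than hand-tuning the prompt by trial and error.
optimized_qa = BootstrapFewShot(metric=answer_match).compile(qa, trainset=trainset)
print(optimized_qa(question="3 + 4?").answer)
```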
While this may be a fairly sophisticated example, there are many other offerings (including some built into tools like ChatGPT, Google and others) that can assist in prompt design.

And ultimately, Emerson said, "I think one of the simplest things users can do is to try to stay up-to-date on effective prompting approaches, model developments and new ways to configure and interact with models."

[Source link](https://venturebeat.com/ai/the-rise-of-prompt-ops-tackling-hidden-ai-costs-from-bad-inputs-and-context-bloat/)