{"id":1922,"date":"2025-06-08T11:29:10","date_gmt":"2025-06-08T11:29:10","guid":{"rendered":"https:\/\/violethoward.com\/new\/how-much-information-do-llms-really-memorize-now-we-know-thanks-to-meta-google-nvidia-and-cornell\/"},"modified":"2025-06-08T11:29:10","modified_gmt":"2025-06-08T11:29:10","slug":"how-much-information-do-llms-really-memorize-now-we-know-thanks-to-meta-google-nvidia-and-cornell","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/how-much-information-do-llms-really-memorize-now-we-know-thanks-to-meta-google-nvidia-and-cornell\/","title":{"rendered":"How much information do LLMs really memorize? Now we know, thanks to Meta, Google, Nvidia and Cornell"},"content":{"rendered":" \r\n
Most people interested in generative AI likely already know that Large Language Models (LLMs), like those behind ChatGPT, Anthropic's Claude, and Google's Gemini, are trained on massive datasets: trillions of words pulled from websites, books, codebases, and, increasingly, other media such as images, audio, and video. But why?

From this data, LLMs develop a statistical, generalized understanding of language, its patterns, and the world, encoded in the form of billions of parameters, or "settings," in a network of artificial neurons (mathematical functions that transform input data into output signals).
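
To make "mathematical functions that transform input data into output signals" concrete, here is a minimal sketch of a single artificial neuron in Python; the specific weights, bias, and sigmoid activation are illustrative assumptions, not the internals of any particular LLM:

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: a weighted sum of inputs plus a bias,
    squashed by a nonlinear activation (a sigmoid here)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-z))

# Toy call: three input signals and three learned weights.
# In a real LLM, the weights are the "parameters" adjusted during training,
# and there are billions of them arranged in layers.
print(neuron([0.5, -1.2, 3.0], weights=[0.8, 0.1, -0.4], bias=0.05))
```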

By being exposed to all this training data, LLMs learn to detect and generalize patterns that are reflected in the parameters of their neurons. For instance, the word "apple" often appears near terms related to food, fruit, or trees, and sometimes computers. The model picks up that apples can be red, green, or yellow (or occasionally other colors when rotten or from rare varieties), that the word is spelled "a-p-p-l-e" in English, and that apples are edible. This statistical knowledge influences how the model responds when a user enters a prompt, shaping the output it generates based on the associations it "learned" from the training data.
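
The kind of statistical association described above can be illustrated with a toy co-occurrence count. The miniature corpus and two-word window below are invented purely for illustration and are unrelated to the study's data; real models compress such regularities into continuous weights rather than storing counts:

```python
from collections import Counter

# Hypothetical miniature corpus, purely for illustration.
corpus = [
    "the apple fell from the tree",
    "she ate a red apple with lunch",
    "the apple computer shipped in 1977",
    "green apple pie is a classic dessert",
]

window = 2  # count words within two positions of "apple"
neighbors = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok == "apple":
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            neighbors.update(context)

# Words like "tree", "red", and "computer" surface as frequent neighbors.
print(neighbors.most_common(5))
```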

But a big question, even among AI researchers, remains: how much of an LLM's training data is used to build generalized representations of concepts, and how much is instead memorized verbatim, stored in a way that is identical or nearly identical to the original data?

This is important not only for better understanding how LLMs operate (and when they go wrong), but also as model providers defend themselves in copyright infringement lawsuits brought by data creators and owners, such as artists and record labels. If LLMs are shown to reproduce significant portions of their training data verbatim, courts could be more likely to side with plaintiffs arguing that the models unlawfully copied protected material. If not, and the models are instead found to generate outputs based on generalized patterns rather than exact replication, developers may be able to continue scraping and training on copyrighted data under existing legal defenses such as fair use.

Now, we finally have an answer to the question of how much LLMs memorize versus generalize: a new study released this week by researchers at Meta, Google DeepMind, Cornell University, and NVIDIA finds that GPT-style models have a fixed memorization capacity of approximately 3.6 bits per parameter.
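
A rough back-of-envelope calculation helps translate that figure into familiar units; the model sizes below are illustrative examples, not the models evaluated in the study:

```python
def memorization_capacity_mb(num_parameters, bits_per_param=3.6):
    """Total raw memorization capacity implied by a fixed bits-per-parameter budget."""
    total_bits = num_parameters * bits_per_param
    return total_bits / 8 / 1e6  # bits -> bytes -> megabytes

# Illustrative parameter counts (1B, 8B, 70B), chosen only as round numbers.
for size in (1e9, 8e9, 70e9):
    print(f"{size:.0e} params -> ~{memorization_capacity_mb(size):,.0f} MB of raw capacity")
```

By this arithmetic, a 1-billion-parameter model tops out at roughly 450 MB of raw memorized content, far less than the trillions of words such models are typically trained on.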

To understand what 3.6 bits means in practice: