{"id":1922,"date":"2025-06-08T11:29:10","date_gmt":"2025-06-08T11:29:10","guid":{"rendered":"https:\/\/violethoward.com\/new\/how-much-information-do-llms-really-memorize-now-we-know-thanks-to-meta-google-nvidia-and-cornell\/"},"modified":"2025-06-08T11:29:10","modified_gmt":"2025-06-08T11:29:10","slug":"how-much-information-do-llms-really-memorize-now-we-know-thanks-to-meta-google-nvidia-and-cornell","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/how-much-information-do-llms-really-memorize-now-we-know-thanks-to-meta-google-nvidia-and-cornell\/","title":{"rendered":"How much information do LLMs really memorize? Now we know, thanks to Meta, Google, Nvidia and Cornell"},"content":{"rendered":" \r\n
Most people interested in generative AI likely already know that large language models (LLMs), like those behind ChatGPT, Anthropic's Claude, and Google's Gemini, are trained on massive datasets: trillions of words pulled from websites, books, codebases and, increasingly, other media such as images, audio, and video. But why do they need so much data?

From this data, LLMs develop a statistical, generalized understanding of language, its patterns, and the world, encoded in the form of billions of parameters, or "settings," in a network of artificial neurons (mathematical functions that transform input data into output signals).

By being exposed to all this training data, LLMs learn to detect and generalize patterns, which are reflected in the parameters of their neurons. For instance, the word "apple" often appears near terms related to food, fruit, or trees, and sometimes computers. The model picks up that apples can be red, green, or yellow (or, more rarely, other colors), that the word is spelled "a-p-p-l-e" in English, and that apples are edible. This statistical knowledge influences how the model responds when a user enters a prompt, shaping the output it generates based on the associations it "learned" from the training data.

But a big question, even among AI researchers, remains: how much of an LLM's training data is used to build generalized representations of concepts, and how much is instead memorized verbatim, stored in a way that is identical or nearly identical to the original data?

This matters not only for understanding how LLMs operate, and when they go wrong, but also because model providers are defending themselves in copyright infringement lawsuits brought by data creators and owners, such as artists and record labels. If LLMs are shown to reproduce significant portions of their training data verbatim, courts could be more likely to side with plaintiffs arguing that the models unlawfully copied protected material. If the models instead generate outputs based on generalized patterns rather than exact replication, developers may be able to continue scraping and training on copyrighted data under existing legal defenses such as fair use.

Now we finally have an answer to the question of how much LLMs memorize versus generalize: a new study released this week by researchers at Meta, Google DeepMind, Cornell University, and NVIDIA finds that GPT-style models have a fixed memorization capacity of approximately 3.6 bits per parameter.

To understand what 3.6 bits means in practice: a single parameter's 3.6 bits can distinguish roughly 12 distinct values (2^3.6 is about 12.1), around the information needed to pick a month of the year or the result of rolling a 12-sided die. It is less than the roughly 4.7 bits needed to encode one letter of the 26-letter English alphabet, and far less than a full 8-bit byte.

This number is model-independent within reasonable architectural variations: different depths, widths, and precisions produced similar results, and the estimate held steady across model sizes, with full-precision models reaching slightly higher values (up to 3.83 bits per parameter).

More training data DOES NOT lead to more memorization: in fact, a model will be less likely to memorize any single data point

One key takeaway from the research is that models do not memorize more when trained on more data. Instead, a model's fixed capacity is distributed across the dataset, meaning each individual datapoint receives a smaller share of it.
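To make the scale of that fixed budget concrete, here is a minimal back-of-the-envelope sketch in Python. It relies only on the paper's headline figure of 3.6 bits per parameter; the model sizes echo the study's 500K-to-1.5B range, while the example dataset sizes are purely illustrative assumptions.

```python
# Back-of-the-envelope capacity math using the reported 3.6 bits per parameter.
# The dataset sizes below are illustrative assumptions, not from the paper.
BITS_PER_PARAM = 3.6

def capacity_bits(num_params: int) -> float:
    """Total memorization capacity implied by the 3.6 bits/parameter estimate."""
    return num_params * BITS_PER_PARAM

def capacity_megabytes(num_params: int) -> float:
    """Convert that capacity from bits to megabytes."""
    return capacity_bits(num_params) / 8 / 1_000_000

for name, params in [("500K-parameter model", 500_000),
                     ("1.5B-parameter model", 1_500_000_000)]:
    print(f"{name}: ~{capacity_megabytes(params):.2f} MB of total memorization capacity")

# The fixed budget is spread over the training set, so each example's share
# shrinks as the dataset grows (the "memorize less per-sample" effect).
params = 1_500_000_000
for num_examples in (1_000_000, 100_000_000, 10_000_000_000):
    per_example = capacity_bits(params) / num_examples
    print(f"{num_examples:>14,} examples -> ~{per_example:.2f} bits of capacity per example")
```

Run as written, this prints roughly 0.23 MB for the 500K-parameter model and 675 MB for the 1.5B-parameter one, and it shows the per-example budget collapsing from thousands of bits to well under one bit as the hypothetical dataset grows.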
Jack Morris, the lead author, explained via the social network X that "training on more data will force models to memorize less per-sample."

These findings may help ease concerns about large models memorizing copyrighted or sensitive content.

If memorization is limited and diluted across many examples, the likelihood of reproducing any one specific training example decreases. In essence, more training data leads to safer generalization behavior, not increased risk.

How the researchers identified these findings

To precisely quantify how much language models memorize, the researchers used an unconventional but powerful approach: they trained transformer models on datasets composed of uniformly random bitstrings. Each of these bitstrings was sampled independently, ensuring that no patterns, structure, or redundancy existed across examples.

Because each sample is unique and devoid of shared features, any ability the model shows in reconstructing or identifying these strings during evaluation directly reflects how much information it retained, or memorized, during training.

The key reason for this setup was to completely eliminate the possibility of generalization. Unlike natural language, which is full of grammatical structure, semantic overlap, and repeating concepts, uniform random data contains no such information. Every example is essentially noise, with no statistical relationship to any other. In such a setting, any above-chance performance the model achieves on these examples must come purely from memorization of the training data, since there is no distributional pattern to generalize from.

The authors argue their method is perhaps one of the only principled ways to decouple memorization from learning in practice: when LLMs are trained on real language, even an output that matches the training data does not reveal whether the model memorized the input or merely inferred the underlying structure from the patterns it has observed.

This method let the researchers map a direct relationship between the number of model parameters and the total information stored. By gradually increasing model size and training each variant to saturation, across hundreds of experiments on models ranging from 500K to 1.5 billion parameters, they observed a consistent result: 3.6 bits memorized per parameter, which they report as a fundamental measure of LLM memory capacity.
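To give a flavor of how such a setup can be operationalized, here is a simplified Python sketch. It is not the authors' code or their exact estimator; it simply builds a uniformly random bitstring dataset and counts memorized bits as the reduction in code length (negative log-likelihood, in bits) that a trained model achieves over the uniform baseline. The transformer itself is stubbed out, and all sizes are made-up values for illustration.

```python
import numpy as np

# Illustrative sizes, not taken from the paper.
NUM_SEQUENCES = 10_000
SEQ_LEN_BITS = 256  # each training example is a 256-bit uniformly random string

rng = np.random.default_rng(0)
dataset = rng.integers(0, 2, size=(NUM_SEQUENCES, SEQ_LEN_BITS), dtype=np.uint8)

# Because every bit is uniform and independent, each sequence carries exactly
# SEQ_LEN_BITS bits of information and there is nothing to generalize from.
total_information_bits = NUM_SEQUENCES * SEQ_LEN_BITS

def memorized_bits(model_nll_bits_per_sequence: np.ndarray) -> float:
    """Estimate memorized information as the model's reduction in code length.

    model_nll_bits_per_sequence[i] is the trained model's negative
    log2-likelihood of training sequence i. A model that has learned nothing
    needs SEQ_LEN_BITS bits to encode each sequence; any savings below that
    baseline can only come from memorization.
    """
    savings = SEQ_LEN_BITS - model_nll_bits_per_sequence
    return float(np.clip(savings, 0, None).sum())

# Stub standing in for a trained transformer's per-sequence NLL (in bits).
# Here we pretend the model "compresses" its training set by 40 percent.
fake_nll = np.full(NUM_SEQUENCES, SEQ_LEN_BITS * 0.6)

estimate = memorized_bits(fake_nll)
print(f"Estimated memorized bits: {estimate:,.0f} of {total_information_bits:,} total")
```

Dividing an estimate like this by the model's parameter count is what yields a bits-per-parameter figure; in the study, that ratio converged to roughly 3.6 across hundreds of runs.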
The team applied their methodology to models trained on real-world datasets as well. When trained on natural-language text, models exhibited a balance of memorization and generalization.

Smaller datasets encouraged more memorization, but as dataset size increased, models shifted toward learning generalizable patterns. This transition was marked by a phenomenon known as "double descent," where performance temporarily dips before improving once generalization kicks in.

The study also examined how model precision affects memorization capacity, comparing training in bfloat16 versus float32. The researchers observed a modest increase from 3.51 to 3.83 bits per parameter when switching to full 32-bit precision. However, this gain is far smaller than the doubling of available bits would suggest, implying diminishing returns from higher precision.

The paper also proposes a scaling law that relates a model's capacity and dataset size to the effectiveness of membership inference attacks, which attempt to determine whether a particular data point was part of a model's training set. The research shows that such attacks become unreliable as dataset size grows, supporting the argument that large-scale training helps reduce privacy risk.
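The paper's specific scaling law is not reproduced here, but the general shape of a membership inference attack is easy to sketch. The toy Python example below uses the classic loss-threshold heuristic (flag an example as a training member if the model assigns it unusually low loss); the threshold value and the stubbed loss function are illustrative assumptions, not anything from the study.

```python
from typing import Callable, Sequence

def loss_threshold_mia(
    examples: Sequence[str],
    model_loss: Callable[[str], float],
    threshold: float,
) -> list[bool]:
    """Flag each example as a suspected member of the training set.

    Heuristic: training members tend to receive lower loss (higher likelihood)
    than unseen data, so anything below `threshold` is flagged. As the
    per-example memorization budget shrinks on very large datasets, that loss
    gap narrows and the attack's accuracy falls toward chance.
    """
    return [model_loss(x) < threshold for x in examples]

# Stub loss function so the sketch runs on its own: pretend short strings
# were memorized (low loss) and long ones were not.
def fake_model_loss(text: str) -> float:
    return 0.5 if len(text) < 25 else 3.2

candidates = ["a memorized snippet", "a long, generic sentence the model has never seen before"]
print(loss_threshold_mia(candidates, fake_model_loss, threshold=1.0))  # [True, False]
```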
Unique data is more likely to be memorized

While the paper focuses on average-case behavior, some researchers have pointed out that certain types of data, such as highly unique or stylized writing, may still be more susceptible to memorization.

The authors acknowledge this limitation and emphasize that their method is designed to characterize general trends rather than edge cases.

Moving toward greater human understanding of LLM understanding

By introducing a principled and quantifiable definition of memorization, the study gives developers and researchers new tools for evaluating the behavior of language models. This helps not only with model transparency but also with compliance, privacy, and ethical standards in AI development. The findings suggest that more data, not less, may be the safer path when training large-scale language models.

To put total model memorization in perspective: at 3.6 bits per parameter, even the largest models in the study, at 1.5 billion parameters, top out at roughly 675 MB of raw memorized information, a small amount next to the terabytes of text such models are typically trained on.

I'm no lawyer or legal expert, but I would fully expect this research to be cited in the numerous ongoing lawsuits between AI providers and data creators and rights owners.