Do AI Programs Store Personal Data? German Regulator Weighs In

by John DiGiacomo

Partner

Internet Law

There are several ongoing legal controversies relating to AI computer software models, such as chatbots, and whether the training and output of such models violate copyright laws and data privacy laws and endanger personal and social freedoms. We wrote recently about the pending case of Andersen v. Stability AI, Ltd. (N.D. Cal.) involving whether AI-generated images infringe upon copyrights. Recently, the federal judge in the case allowed the case to proceed beyond the motion-to-dismiss phase because it was alleged that the AI program involved stored or contained compressed copies of billions of copyrighted images that had been downloaded and used for training. This was an allegation in the Amended Complaint that the judge was required to accept “as true.” Because this allegation was “taken as true,” the court allowed the case to go forward on claims of direct infringement and induced infringement.

Relevant to this issue is the recently released report by a German regulator that AI models do NOT memorize or store personal data like names and birth dates. The regulator in question is the Hamburg Commissioner for Data Protection and Freedom of Information. The report itself involves personal data privacy rights but presents a factual finding that might be relevant to questions of copyright infringement.

The Hamburg Commissioner noted that generative AI software programs contain several interacting components, one of which is generally called a Large Language Model (“LLM”). LLMs are used for text-generative AI programs, and similar components are used for image and video-generative AI models. The Hamburg Commissioner’s ultimate finding was that LLMs do not store or memorize personal data. Rather, “LLMs store highly abstracted and aggregated data points from training data and their relationships to each other, without concrete characteristics or references that ‘relate’ to individuals.” (p. 6). Because of this, LLMs are not storing “personal data,” as defined by EU personal and data privacy jurisprudence, because what is stored “lacks the necessary direct, targeted association to individuals….” In overly simplified terms, the LLMs store data that is disaggregated, abstracted, and disconnected. As such, there is no “personal data.”

Now, it must be said that the Hamburg Commissioner’s report is focused on a couple of very narrow questions: are LLMs engaged in the “processing” of personal data, and are they, themselves, subject to EU data privacy regulations? On that very narrow set of questions, the Hamburg Commissioner is suggesting that the answer is “no.” However, there is, or will be, a different answer when the whole AI program is considered, since it is admitted that the output of the AI program “… may contain information relating to natural persons, especially if the prompt specifically asks for it.” Again, in overly simplified terms, the disaggregated data is re-aggregated, and the resulting output can contain information that identifies natural persons. That is “personal data” subject to EU privacy regulations.
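To make the “disaggregated, then re-aggregated” idea concrete, here is a deliberately tiny sketch in Python. It is not how a real LLM works: real models store learned numerical weights, not word counts, and nothing here comes from the Hamburg report. The names, dates, and the functions `train` and `generate` are all invented for illustration. The sketch only shows that a table of aggregated word-to-word statistics holds no complete record about anyone, yet a prompt can still pull out text that looks like a statement about a person.

```python
from collections import Counter, defaultdict

# Hypothetical training sentences; the people and dates are invented.
TRAINING_TEXT = [
    "alice example was born in 1980",
    "bob sample was born in 1975",
    "alice example lives in hamburg",
]

def train(sentences):
    """Aggregate the corpus into next-word counts; no sentence is kept whole."""
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for current, following in zip(tokens, tokens[1:]):
            table[current][following] += 1
    return table

def generate(table, prompt, max_tokens=5):
    """Greedily extend the prompt with the most frequent next word."""
    tokens = prompt.split()
    for _ in range(max_tokens):
        candidates = table.get(tokens[-1])
        if not candidates:
            break
        tokens.append(candidates.most_common(1)[0][0])
    return " ".join(tokens)

model = train(TRAINING_TEXT)

# The stored "model" is only a table of single-word statistics.
for word, followers in model.items():
    print(word, dict(followers))

# Output produced on request, re-assembled from those statistics.
print(generate(model, "alice"))
print(generate(model, "bob"))
```

In this toy, the table links “born” to “in” and “in” to several possible continuations, but it has no entry tying a specific person to a specific date. The generated sentence is assembled word by word, so it may or may not match what the training text said about that person. That gap between what is stored and what is output is the distinction the report draws, although whether it holds for production-scale models is the factual question the courts and regulators still have to resolve.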

In any event, it will be interesting to see how information and data are stored with respect to image-generating AI programs. The outcome of the various copyright cases may turn on the answer to that question.

Contact the AI, Internet Law, and Copyright Attorneys at Revision Legal

For more information, contact the experienced AI, Internet Law, and Copyright Lawyers at Revision Legal. You can contact us through the form on this page or call (855) 473-8474.

Why the Hamburg Report’s Scope Matters

The Hamburg Commissioner’s report is deliberately narrow in its framing. The Commissioner was not asked to resolve whether AI training constitutes copyright infringement, whether AI-generated output violates privacy rights, or whether AI companies should face liability under GDPR for the personal data embedded in their training datasets. The report was focused on a specific technical question: does an LLM, considered as a standalone component, process or store “personal data” within the meaning of GDPR Article 4(1)?

The report’s answer — that LLMs store “highly abstracted and aggregated data points” without “direct, targeted association to individuals” — is a technical finding about how transformer-based neural networks represent and compress information from training data. It is not a finding that AI systems are generally exempt from data protection law, and the Commissioner explicitly acknowledged that the full AI system — including training pipelines, API infrastructure, and output — does implicate GDPR when its outputs contain identifiable information about natural persons.

The U.S. Copyright Cases and How They Differ

The Hamburg report has attracted attention from U.S. copyright litigants because the question of what an LLM “stores” is also at issue in the Andersen v. Stability AI litigation and a cluster of related cases involving generative image models. The theory advanced by copyright plaintiffs is that generative AI models contain compressed or “memorized” copies of the training data — copyrighted images or text — that can be reconstructed or closely approximated through targeted prompting.

If the Hamburg Commissioner’s technical conclusion is correct — that LLMs store only aggregated, abstracted representations of training data rather than compressed copies — that finding would undermine the direct infringement theory in U.S. copyright cases. However, the legal standards governing copyright infringement and GDPR personal data processing are not the same. A finding that training data is too abstracted to constitute “personal data” under GDPR does not necessarily mean it is too abstracted to constitute infringing copying under 17 U.S.C. § 106. The copyright inquiry focuses on whether the work was reproduced, not whether the stored representation retains a reference to a specific natural person.

Courts in the Northern District of California have allowed copyright claims against Stability AI, Midjourney, and related defendants to proceed past the motion-to-dismiss stage, at least in part. Andersen v. Stability AI, No. 3:23-cv-00201 (N.D. Cal.). Judge Orrick permitted claims of direct infringement and induced infringement to proceed based on the allegation — taken as true at the pleading stage — that the model contains compressed copies of training images. Whether plaintiffs can prove that allegation at the merits stage, in light of technical evidence about how the model actually stores representations, remains to be seen.

EU AI Act and GDPR: The Regulatory Landscape for AI and Personal Data

In Europe, the Hamburg report sits within a broader regulatory framework that is more developed than its U.S. equivalent. The EU’s General Data Protection Regulation creates obligations for data controllers and processors regardless of whether the processing occurs through a traditional database or an AI model. The EU AI Act, which entered into force in August 2024, adds a layer of requirements specifically for AI systems, tiered by risk level. High-risk AI systems — which include applications in biometric identification, critical infrastructure, employment, and law enforcement — face the most demanding requirements, including mandatory conformity assessments and registration.

The Hamburg report addressed whether an LLM, at the model level, is itself engaged in the “processing” of personal data under GDPR. The conclusion, that LLMs do not independently process personal data, has the practical consequence of locating GDPR obligations at the level of the company that deploys the AI system and feeds user data into it, rather than at the level of the underlying model. That is a commercially significant finding for companies that use third-party foundation models, because it suggests their primary compliance obligations under GDPR arise from their own data handling practices, not from the structure of the AI model they are using.

U.S. AI Privacy Law: What Currently Applies

The United States has no federal AI-specific privacy statute comparable to the EU AI Act. Privacy obligations for U.S. AI systems arise from the patchwork of state consumer data privacy statutes and sector-specific federal statutes like HIPAA, FERPA, and COPPA. Several states — California, Colorado, Connecticut, and Virginia among them — have amended or are amending their consumer data privacy laws to specifically address automated decision-making and profiling, which are the primary ways that AI systems generate outputs that affect individuals.

California’s CPRA requires businesses to conduct data protection impact assessments for high-risk processing activities, which can include AI-driven profiling. Colorado’s privacy law contains an opt-out right for profiling in furtherance of decisions that produce legal or similarly significant effects. These provisions are modeled on GDPR but are less prescriptive, and their application to specific AI use cases will be developed through regulatory guidance and enforcement actions over the next several years.

What Businesses Using or Building AI Should Monitor

  • Track the outcomes of the U.S. copyright cases involving generative AI — the technical findings about model storage will directly shape the legal risk profile of training on web-scraped data
  • Evaluate whether your AI use cases involve automated decision-making that triggers state opt-out rights under Colorado, Virginia, or Connecticut law
  • If you are using a third-party AI foundation model and feeding it user data, audit your data sharing agreement with the model provider to ensure it meets your obligations as a GDPR data controller (for any EU users)
  • Do not assume that because no U.S. federal AI privacy law exists, your AI practices are unregulated — state consumer privacy statutes, FTC Section 5 authority, and sector-specific federal statutes all apply

AI law is evolving at a pace that makes annual compliance reviews insufficient. If your business trains, deploys, or depends on AI systems that process user data, you need regular legal counsel on the intersection of AI, copyright, and privacy. Contact the AI, internet law, and copyright attorneys at Revision Legal through the form on this page or call (855) 473-8474.
