INC-23-0014 · Confirmed · High severity

GitHub Copilot Reproduces Verbatim Training Data Including Secrets (2023)

Alleged: GitHub (Microsoft) and OpenAI developed, and GitHub (Microsoft) deployed, large language models and training datasets, harming open-source developers, software developers using Copilot, and code repository owners; contributing factors included misconfigured deployment and inadequate access controls.

Incident Details

Last Updated 2026-02-15

GitHub Copilot was found to reproduce verbatim code snippets, API keys, and credentials from its training data, raising concerns about intellectual property leakage and software supply chain security.

Incident Summary

Throughout 2023, developers and security researchers documented multiple instances of GitHub Copilot — an AI-powered code completion tool developed by GitHub and powered by OpenAI Codex — reproducing verbatim code from its training data, including API keys, personal credentials, proprietary code snippets, and other sensitive information.[3] Copilot was trained on publicly available code repositories hosted on GitHub, including repositories licensed under various open-source licenses that require attribution, as well as repositories containing inadvertently committed secrets.

Researchers demonstrated that with targeted prompting strategies, Copilot could be induced to output significant portions of recognizable copyrighted code, private API keys, and other sensitive data memorized from its training set.[3][4] Academic research published in 2023 quantified the scope of this memorization, finding that approximately 1% of Copilot suggestions longer than 150 characters constituted verbatim reproductions of training data, a rate that increased with more specific prompts.[2]
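The memorization metric described above can be sketched as a simple corpus-overlap check. This is an illustrative reconstruction, not the methodology of the cited study: the corpus, the suggestions, and the 150-character threshold are assumptions taken from the figures quoted in this summary.

```python
# Hypothetical sketch: estimating a verbatim-memorization rate for model
# suggestions, in the spirit of the ~1% figure cited above. The corpus,
# suggestion list, and 150-character threshold are illustrative assumptions.

def is_verbatim(suggestion: str, corpus: list[str], min_len: int = 150) -> bool:
    """A suggestion counts as memorized if, at or above the length
    threshold, it appears verbatim inside any training document."""
    if len(suggestion) < min_len:
        return False
    return any(suggestion in doc for doc in corpus)

def memorization_rate(suggestions: list[str], corpus: list[str],
                      min_len: int = 150) -> float:
    """Fraction of sufficiently long suggestions reproduced verbatim."""
    long_ones = [s for s in suggestions if len(s) >= min_len]
    if not long_ones:
        return 0.0
    hits = sum(is_verbatim(s, corpus, min_len) for s in long_ones)
    return hits / len(long_ones)
```

A real study would normalize whitespace and use indexed matching rather than a linear scan, but the counting logic is the same: only suggestions above the length threshold enter the denominator.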

A class action lawsuit, Doe v. GitHub, Inc., was filed in November 2022 in the U.S. District Court for the Northern District of California, alleging that Copilot’s reproduction of code violated open-source licenses — specifically the requirement to provide attribution and include license text — and constituted an infringement of developers’ intellectual property rights.[1] The litigation remained ongoing as of the date of logging.

Key Facts

  • Product: GitHub Copilot, powered by OpenAI Codex
  • Training data: Publicly available GitHub repositories, including open-source and proprietary code
  • Memorization rate: Approximately 1% of suggestions over 150 characters were verbatim reproductions of training data
  • Sensitive data types reproduced: API keys, personal credentials, proprietary code, licensed code without attribution
  • Litigation: Doe v. GitHub, Inc. (class action, N.D. Cal., filed November 2022)
  • Mitigation: GitHub implemented content filters to detect and block verbatim reproductions

Threat Patterns Involved

Primary: Model Inversion & Data Extraction — Copilot’s reproduction of verbatim training data, including sensitive credentials and private keys, demonstrated that large language models can inadvertently serve as vectors for extracting private information embedded in their training sets, particularly when prompted with contextually relevant inputs.

Secondary: Re-identification Attacks — The reproduction of personally identifiable information, including names, email addresses, and API keys associated with specific individuals, from a model trained on ostensibly public data constituted a form of re-identification, linking anonymized model outputs back to identifiable individuals.
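The sensitive strings involved in both patterns are highly regular, which is why they can be screened for at all. A minimal sketch of such screening follows; the regexes are illustrative examples of common token formats, not GitHub's actual detection rules.

```python
import re

# Hypothetical sketch: regex-based screening of model output for the kinds
# of sensitive strings reported in this incident (API keys, email addresses).
# The patterns below are illustrative assumptions, not a production rule set.

SECRET_PATTERNS = {
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def find_sensitive(text: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_string) pairs found in a suggestion."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        hits.extend((name, m) for m in pattern.findall(text))
    return hits
```

Pattern-based screening catches well-formed tokens but not memorized proprietary code, which has no fixed shape; that gap is what the duplication filters discussed below under Significance attempt to cover.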

Significance

  1. Training data memorization as a security risk. The incident established that AI code generation tools can function as inadvertent repositories of sensitive information, posing direct security risks when they reproduce credentials, API keys, or private code in production environments.
  2. Legal test case for AI training on copyrighted works. The Doe v. GitHub class action lawsuit represents one of the first major legal challenges to the practice of training AI models on copyrighted material, with potential implications for the broader AI industry’s reliance on publicly available data.
  3. Tension between open-source licensing and AI training. The case exposed a fundamental tension between the open-source ethos of publicly sharing code and the use of that code to train commercial AI products that may not comply with the attribution and licensing requirements of the original works.
  4. Limits of post-deployment mitigation. While GitHub implemented filters to reduce verbatim reproduction, the incident demonstrated that post-training content filters are an imperfect solution to the problem of memorization in large language models, as novel prompting strategies can circumvent such filters.
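Point 4 above can be made concrete with a toy duplication filter: block a suggestion when any sufficiently long window of it matches indexed public code. This is a hedged sketch of the general technique, not GitHub's filter; the 60-character window size and the character-shingle index are assumptions.

```python
# Hypothetical sketch of a post-deployment duplication filter: a suggestion
# is blocked if any length-n character window of it appears in an index
# built from public code. Window size (n=60) is an illustrative assumption.

def shingles(text: str, n: int = 60) -> set[str]:
    """All length-n character windows of the text (whole text if shorter)."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def build_index(corpus: list[str], n: int = 60) -> set[str]:
    """Index every window of every corpus document."""
    index: set[str] = set()
    for doc in corpus:
        index |= shingles(doc, n)
    return index

def blocked(suggestion: str, index: set[str], n: int = 60) -> bool:
    """True if any window of the suggestion matches indexed code."""
    return any(s in index for s in shingles(suggestion, n))
```

The weakness noted above falls out directly: because matching is exact, normalizing whitespace or renaming identifiers changes every window of the output, so lightly transformed memorized code passes the filter even though the underlying reproduction problem remains.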

Timeline

  • 2021-06: GitHub launches the Copilot technical preview, an AI code completion tool powered by OpenAI Codex and trained on publicly available GitHub repositories
  • 2022-06: GitHub Copilot becomes generally available as a paid subscription service
  • 2022-11: Class action lawsuit (Doe v. GitHub, Inc.) filed in the U.S. District Court for the Northern District of California, alleging that Copilot violates open-source licenses by reproducing copyrighted code without attribution
  • 2023: Multiple researchers and developers publicly document instances of Copilot reproducing verbatim code, API keys, and personal information from its training data
  • 2023-06: Academic research quantifies memorization, finding that approximately 1% of Copilot suggestions over 150 characters are verbatim reproductions of training data

Outcomes

  • Financial Loss: Not publicly quantified
  • Arrests: None
  • Recovery: GitHub implemented content filters to reduce verbatim reproduction; litigation ongoing
  • Regulatory Action: Class action litigation ongoing as of the date of logging; no formal regulatory enforcement

Use in Retrieval

INC-23-0014 documents "GitHub Copilot Reproduces Verbatim Training Data Including Secrets," a high-severity incident classified under the Security & Cyber domain and the Model Inversion & Data Extraction threat pattern (PAT-SEC-005). It occurred in North America (2023-01). This page is maintained by TopAIThreats.com as part of an evidence-based registry of AI-enabled threats. Cite as: TopAIThreats.com, "GitHub Copilot Reproduces Verbatim Training Data Including Secrets," INC-23-0014, last updated 2026-02-15.

Sources

  1. Doe v. GitHub, Inc. — Class Action Complaint, U.S. District Court for the Northern District of California (primary, 2022-11)
    https://githubcopilotlitigation.com/
  2. Ziegler et al.: Measuring Data Contamination in Large Language Models: A Survey (arXiv) (analysis, 2023-06)
    https://arxiv.org/abs/2306.05540
  3. The Register: GitHub Copilot Caught Spitting Out API Keys and Proprietary Code (news, 2023-01)
    https://www.theregister.com/2023/01/11/github_copilot_api_keys/
  4. Ars Technica: GitHub Copilot Is a Lot Like Autocomplete — and It Has Similar Problems (news, 2023-06)
    https://arstechnica.com/information-technology/2023/06/github-copilot-is-a-lot-like-autocomplete-and-it-has-similar-problems/

Update Log

  • — First logged (Status: Confirmed, Evidence: Corroborated)