Private API Keys and Passwords Leaked in AI Training Dataset

  • Nearly 12,000 private API keys and passwords were leaked in the Common Crawl dataset
  • The leaked secrets include AWS, Mailchimp, and WalkScore API keys
  • 63% of the leaked secrets were found on multiple pages
  • One WalkScore API key appeared 57,029 times across 1,871 subdomains
  • The Common Crawl dataset is used to train many popular large language models
  • Cybercriminals could exploit these secrets to gain unauthorized access to sensitive information
  • Truffle Security helped impacted vendors revoke compromised keys

Introduction to the Common Crawl Dataset

The Common Crawl dataset is a massive, freely available archive of web data collected through large-scale web crawling. Maintained by the nonprofit Common Crawl Foundation, it contains over 250 petabytes of data, with new crawls adding several petabytes each month.

Recently, security researchers at Truffle Security analyzed roughly 400 terabytes of data from 2.67 billion web pages archived in 2024. They found almost 12,000 valid secrets, including API keys, passwords, and similar credentials, hardcoded directly in the archived pages.
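
To make the failure mode concrete, here is a minimal sketch of the kind of pattern-based scanning such research relies on. The patterns and helper function are illustrative only; production scanners such as Truffle Security's TruffleHog use hundreds of detectors and verify each candidate against the provider's live API before counting it as valid.

```python
import re

# Illustrative patterns for two of the secret types named in the report.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
}

def scan_page(html: str) -> list[tuple[str, str]]:
    """Return (secret_type, match) pairs found in a page's raw HTML/JS."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        hits.extend((name, match) for match in pattern.findall(html))
    return hits

# AKIAIOSFODNN7EXAMPLE is AWS's documented placeholder key, not a live one.
page = '<script>var cfg = { awsKey: "AKIAIOSFODNN7EXAMPLE" };</script>'
print(scan_page(page))  # [('aws_access_key_id', 'AKIAIOSFODNN7EXAMPLE')]
```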

Types of Leaked Secrets

The researchers identified more than 200 distinct secret types, but the majority belonged to Amazon Web Services (AWS), Mailchimp, and WalkScore. Nearly 1,500 unique Mailchimp API keys were hardcoded in front-end HTML and JavaScript, where anyone viewing the page source can read them, and many of the same secrets appeared repeatedly across the crawl.
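
The safer pattern is to keep such keys server-side and expose only a narrow endpoint to the browser. Below is a minimal sketch, assuming Flask and the Mailchimp Marketing API v3; the route name and LIST_ID are placeholders for illustration, not details from the report.

```python
import os

import requests
from flask import Flask, request, jsonify

app = Flask(__name__)

# The key lives in a server-side environment variable and is never
# shipped to the browser in HTML or JavaScript.
MAILCHIMP_KEY = os.environ["MAILCHIMP_API_KEY"]
DATACENTER = MAILCHIMP_KEY.rsplit("-", 1)[-1]  # e.g. "us21"

@app.post("/api/subscribe")
def subscribe():
    # The browser posts only an email address; the credential stays here.
    resp = requests.post(
        f"https://{DATACENTER}.api.mailchimp.com/3.0/lists/LIST_ID/members",
        auth=("anystring", MAILCHIMP_KEY),  # Mailchimp uses HTTP basic auth
        json={"email_address": request.json["email"], "status": "subscribed"},
        timeout=10,
    )
    return jsonify(resp.json()), resp.status_code
```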

In fact, almost two-thirds (63%) of the leaked secrets were found on multiple pages, with one WalkScore API key appearing 57,029 times across 1,871 subdomains. This compounds the risk: a single exposed key can grant an attacker access to the associated account from wherever it appears.
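
Measuring that reuse is straightforward once a scan produces (secret, page) pairs; a toy illustration with invented data:

```python
from collections import defaultdict

# Hypothetical scan output as (secret, page_url) pairs; the values here
# are made up for illustration, not drawn from the actual findings.
findings = [
    ("walkscore-key-A", "https://a.example.com/contact"),
    ("walkscore-key-A", "https://b.example.com/about"),
    ("walkscore-key-A", "https://c.example.com/"),
    ("aws-key-B", "https://d.example.com/app.js"),
]

pages_per_secret = defaultdict(set)
for secret, url in findings:
    pages_per_secret[secret].add(url)

reused = {s for s, urls in pages_per_secret.items() if len(urls) > 1}
print(f"{len(reused)} of {len(pages_per_secret)} secrets appear on multiple pages")
# -> 1 of 2 secrets appear on multiple pages
```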

Implications for AI Training

The Common Crawl dataset is used to train many of the world's most popular large language models (LLMs), including those from OpenAI, DeepSeek, Google, Meta, and others. While these models aren't trained on entirely raw data, and training pipelines apply filters intended to strip sensitive information, it remains unclear how effective those filters are in practice and how many secrets slip through.
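
A minimal sketch of what such a pre-training filter might look like, assuming simple regex-based redaction; real pipelines would pair a much larger detector set with entropy heuristics and allow-lists.

```python
import re

# Two illustrative credential patterns; a production filter would cover
# far more secret types than this.
PATTERNS = {
    "AWS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "MAILCHIMP_KEY": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
}

def redact(text: str) -> str:
    """Replace likely credentials before text enters a training corpus."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text

doc = 'const cfg = { awsKey: "AKIAIOSFODNN7EXAMPLE" };'
print(redact(doc))  # const cfg = { awsKey: "[REDACTED_AWS_KEY]" };
```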

Cybercriminals could also use generative AI tools to surface login credentials and other secrets absorbed during training, which makes addressing this exposure all the more urgent. Truffle Security says it reached out to the impacted vendors and helped them revoke the compromised keys, but the incident underscores the need for better secret-handling practices on the web and more rigorous filtering when assembling AI training data.