12,000 API Keys and Passwords Found in AI Training Datasets

Nearly 12,000 API keys and passwords were discovered in the Common Crawl dataset used for training AI models, highlighting significant security risks for enterprises. Many were hardcoded into front-end code.

Mitchell Langley
March 6, 2025

Researchers have found nearly 12,000 valid API keys and passwords in the Common Crawl dataset, a massive open repository of web-crawl data used to train numerous artificial intelligence models. The exposure poses a substantial risk to enterprise security.

The Discovery

Truffle Security used its open-source TruffleHog scanner to analyze 400 terabytes of data from 2.67 billion web pages in the Common Crawl December 2024 archive. The scan found 11,908 secrets that authenticated successfully.

These secrets, often hardcoded by developers, highlight a critical flaw in software development practices.
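
To illustrate the kind of pattern matching a secret scanner performs, here is a minimal Python sketch. It is not TruffleHog's implementation: the two regular expressions are simplified assumptions about key formats (AWS access key IDs beginning with AKIA, Mailchimp keys ending in a data-center suffix such as -us12), and find_candidate_secrets is a hypothetical helper. A real scanner would also try to authenticate each candidate, which is how the researchers counted only live secrets.

```python
import re

# Simplified, assumed patterns; real scanners ship many more detectors.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
}


def find_candidate_secrets(page_text: str) -> list[tuple[str, str]]:
    """Return (detector_name, matched_text) pairs found in a page."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(page_text):
            hits.append((name, match.group(0)))
    return hits


if __name__ == "__main__":
    # AWS's documented example key ID, not a real credential.
    html = '<script>var key = "AKIAIOSFODNN7EXAMPLE";</script>'
    print(find_candidate_secrets(html))
```

A match here is only a candidate. Whether it is a working credential can be known only by testing it against the issuing service.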

Impact of the API Key Leak on Enterprises

The exposed secrets included API keys for major services such as Amazon Web Services (AWS), Mailchimp, and WalkScore.

AWS root key in front-end HTML
Source: Truffle Security

This poses a severe risk to enterprises that rely on these services. Anyone who finds a live key can use it for data exfiltration and for malicious activities such as phishing campaigns and brand impersonation.

“Nearly 1,500 unique Mailchimp API keys were hard coded in front-end HTML and JavaScript,” Truffle Security reported.

This highlights a common mistake: developers hardcoding sensitive information into front-end code instead of using secure server-side environment variables.

Mailchimp API key leaked in front-end HTML
Source: Truffle Security
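
The server-side alternative to hardcoding can be sketched as follows. This is a minimal illustration, not code from the report: the variable name MAILCHIMP_API_KEY, the load_api_key helper, and the MissingSecretError class are all hypothetical.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is not configured on the server."""


def load_api_key(var_name: str = "MAILCHIMP_API_KEY") -> str:
    """Read an API key from the server's environment instead of source code."""
    value = os.environ.get(var_name, "").strip()
    if not value:
        # Fail loudly rather than falling back to a default baked into the code.
        raise MissingSecretError(f"environment variable {var_name} is not set")
    return value
```

With this approach the key never appears in HTML or JavaScript sent to the browser. The front end calls the server, and the server calls the third-party API on its behalf.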

One particularly concerning finding was a WalkScore API key that appeared 57,029 times across 1,871 subdomains, showing how widely a single credential can be reused. The researchers also found a webpage containing 17 unique live Slack webhooks.

Data Preprocessing Limitations

LLM training data goes through a preprocessing stage that filters out unwanted content, but removing every piece of sensitive information from a dataset of this size is extremely difficult. The findings show the limits of current data sanitization techniques.
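
A small sketch shows why pattern-based sanitization falls short. It assumes a single simplified regular expression for AWS access key IDs. The redact_secrets function is hypothetical and does not represent any particular training pipeline.

```python
import re

# Assumed, simplified pattern for AWS access key IDs.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def redact_secrets(text: str, placeholder: str = "[REDACTED]") -> tuple[str, int]:
    """Replace pattern matches with a placeholder; return (text, match_count)."""
    return AWS_KEY_ID.subn(placeholder, text)


if __name__ == "__main__":
    sample = 'key="AKIAIOSFODNN7EXAMPLE" password="hunter2"'
    print(redact_secrets(sample))
```

The key with a recognizable format is redacted, but the free-form password passes through untouched because no pattern describes it. Multiplied across billions of pages, that gap is hard to close.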

Following the discovery, Truffle Security contacted the affected vendors to revoke the compromised keys. This mitigates the immediate risk, but it also underscores the need for better security practices throughout the software development lifecycle.

High Reuse Rate and Slack Webhook Vulnerability

A particularly concerning aspect of the report is how often the discovered secrets were reused: 63% of the compromised keys appeared on more than one page. One WalkScore API key, for instance, “appeared 57,029 times across 1,871 subdomains.”
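
A reuse figure of this kind could be computed along the following lines. This is a sketch under an assumed definition (the share of distinct secrets seen on more than one page); the report may measure reuse differently, and reuse_rate is a hypothetical helper.

```python
from collections import defaultdict


def reuse_rate(findings: list[tuple[str, str]]) -> float:
    """Fraction of distinct secrets seen on more than one page.

    Each finding is a (secret, page_url) pair.
    """
    pages_by_secret: dict[str, set[str]] = defaultdict(set)
    for secret, page_url in findings:
        pages_by_secret[secret].add(page_url)
    if not pages_by_secret:
        return 0.0
    reused = sum(1 for pages in pages_by_secret.values() if len(pages) > 1)
    return reused / len(pages_by_secret)
```

Grouping by secret first means a key repeated many times on the same page is not counted as reused; only appearances on distinct pages matter.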

The researchers also uncovered a webpage containing 17 unique live Slack webhooks. These webhooks, which allow applications to post messages to Slack, should be kept strictly confidential.

“Keep it secret, keep it safe. Your webhook URL contains a secret. Don’t share it online, including via public version control repositories,” Slack warns.
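
The reason the URL itself must stay secret becomes clear from what a webhook call looks like. The sketch below builds, but does not send, the HTTP request for posting a message. The build_webhook_request helper is hypothetical and the URL in the example is a placeholder.

```python
import json
import urllib.request


def build_webhook_request(webhook_url: str, message: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that delivers a Slack message."""
    body = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder URL in the documented shape; not a live webhook.
    url = "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX"
    request = build_webhook_request(url, "Deployment finished")
    print(request.get_method(), request.full_url)
```

The request carries no separate token or Authorization header. Whoever holds the URL can post to the channel, so a webhook published on a web page is effectively a published credential.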

Following the research, Truffle Security proactively contacted the affected vendors and collaborated with them to revoke the compromised keys.

“We successfully helped those organizations collectively rotate/revoke several thousand keys,” the researchers stated.

Long-Term Implications and Best Practices

Even if an AI model was trained on older archives than the one Truffle Security scanned, these findings serve as a warning. Models trained on code that hardcodes secrets may learn to suggest the same insecure pattern, potentially leading to future vulnerabilities. Enterprises must prioritize secure coding practices, including:

Avoiding hardcoding sensitive information: Utilize secure server-side environment variables instead.
Regular security audits: Conduct frequent scans to identify and address potential vulnerabilities.
Prompt key rotation: Regularly update and revoke API keys to minimize the impact of potential breaches.

Protecting Your Enterprise

This incident shows the importance of robust security measures for enterprise businesses. Regular security audits, secure coding practices, and the use of secure environment variables are crucial in preventing similar breaches.

