12,000 API Keys and Passwords Found in AI Training Datasets

Nearly 12,000 API keys and passwords were discovered in the Common Crawl dataset used for training AI models, highlighting significant security risks for enterprises. Many were hardcoded into front-end code.

    Researchers have found nearly 12,000 valid API keys and passwords in the Common Crawl dataset, a massive open repository of web-crawl data used to train numerous artificial intelligence models. The exposure poses a substantial risk to enterprise security.

    The Discovery

    Truffle Security, using its open-source TruffleHog scanner, analyzed 400 terabytes of data from 2.67 billion web pages in the Common Crawl December 2024 archive. The scan turned up 11,908 secrets that still authenticated successfully.
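
    As a rough illustration of what pattern-based secret detection looks like, here is a minimal Python sketch. It is not TruffleHog's implementation: the three patterns are simplified assumptions based on the publicly documented shapes of AWS access key IDs, Mailchimp API keys, and Slack webhook URLs, and the function name is hypothetical. A real scanner uses many more detectors and then checks each candidate against the provider to confirm it is live.

```python
import re

# Simplified, illustrative patterns (assumptions, not TruffleHog's detectors).
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
    "slack_webhook": re.compile(
        r"https://hooks\.slack\.com/services/"
        r"T[A-Za-z0-9_]+/B[A-Za-z0-9_]+/[A-Za-z0-9_]+"
    ),
}


def find_candidate_secrets(text):
    """Return (kind, value) pairs for every pattern hit, in order of position."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), kind, match.group(0)))
    return [(kind, value) for _, kind, value in sorted(hits)]


# AWS's documented placeholder key ID, embedded the way the article describes.
page = '<script>var key = "AKIAIOSFODNN7EXAMPLE";</script>'
print(find_candidate_secrets(page))
```

    A match here is only a candidate. The figure of 11,908 in the report counts secrets that also authenticated successfully, which is a separate verification step this sketch does not attempt.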

    These secrets, often hardcoded by developers, highlight a critical flaw in software development practices.

    Impact of the API Key Leak on Enterprises

    The exposed secrets included API keys for major services such as Amazon Web Services (AWS), Mailchimp, and WalkScore.

    AWS root key in front-end HTML
    Source: Truffle Security

    This poses a severe risk to businesses that rely on these services. Exposed credentials open the door to data exfiltration and to malicious activity such as phishing campaigns and brand impersonation.

    “Nearly 1,500 unique Mailchimp API keys were hard coded in front-end HTML and JavaScript,” Truffle Security reported.

    This highlights a common mistake: developers hardcoding sensitive information into front-end code instead of using secure server-side environment variables.
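
    The server-side alternative can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the variable name MAILCHIMP_API_KEY and the helper names are hypothetical, and the key is assumed to be injected into the server process's environment at deploy time rather than written into any file served to browsers.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def load_secret(name, env=None):
    """Read a secret from the environment instead of hardcoding it."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise MissingSecretError(f"environment variable {name} is not set")
    return value


def mask(secret, visible=4):
    """Hide all but the last few characters so logs never contain the key."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Hypothetical usage; a dict stands in for the real process environment.
fake_env = {"MAILCHIMP_API_KEY": "0123456789abcdef-us6"}
key = load_secret("MAILCHIMP_API_KEY", env=fake_env)
print("loaded key ending in", mask(key))
```

    With this arrangement the browser calls the site's own back end, and only the back end talks to the third-party API, so the key never appears in HTML or JavaScript that a crawler can archive.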

    Mailchimp API key leaked in front-end HTML
    Source: Truffle Security

    Data Preprocessing Limitations

    Although LLM training data goes through a preprocessing stage that filters out unwanted content, removing every piece of sensitive information from a dataset of this size is extremely difficult. The findings highlight the limits of current data sanitization techniques.
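
    To make the difficulty concrete, here is a minimal Python sketch of one common sanitization heuristic: redacting long tokens whose Shannon entropy exceeds a threshold. This is not a description of any particular model's preprocessing pipeline. The token pattern, the 4.0 bits-per-character threshold, and the function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Candidate tokens: runs of 20 or more key-like characters (an assumption).
TOKEN = re.compile(r"[A-Za-z0-9+/=_-]{20,}")


def shannon_entropy(s):
    """Shannon entropy of s in bits per character."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def redact_high_entropy(text, threshold=4.0, placeholder="[REDACTED]"):
    """Replace long tokens whose entropy exceeds the threshold."""
    def _sub(match):
        token = match.group(0)
        return placeholder if shannon_entropy(token) > threshold else token
    return TOKEN.sub(_sub, text)


sample = 'token = "abcdefghijklmnopqrstuvwxyz012345"'
print(redact_high_entropy(sample))
```

    The heuristic is easy to defeat in both directions. AWS's documented placeholder key ID, AKIAIOSFODNN7EXAMPLE, contains only 14 distinct characters, so its entropy cannot exceed log2(14), about 3.81 bits, and it would pass a 4.0 threshold unredacted. Lowering the threshold catches it but also starts redacting ordinary long identifiers.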

    Following the discovery, Truffle Security contacted the affected vendors to revoke the compromised keys. This proactive measure helps mitigate the immediate risks, but underscores the need for improved security practices throughout the software development lifecycle.

    High Reuse Rate and Slack Webhook Vulnerability

    The report also found a high rate of reuse: 63% of the discovered secrets appeared on more than one page. One WalkScore API key, for instance, “appeared 57,029 times across 1,871 subdomains.”
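
    For scale, 57,029 occurrences across 1,871 subdomains works out to roughly 30 occurrences per subdomain on average. The sketch below shows one way a reuse rate like the 63% figure could be computed from scan output. The input format (pairs of secret identifier and page URL), the sample data, and the function name are assumptions for illustration, not Truffle Security's methodology.

```python
from collections import defaultdict


def reuse_stats(sightings):
    """Summarize reuse from (secret_id, page_url) pairs.

    A secret counts as reused when it was seen on more than one distinct page.
    """
    pages_by_secret = defaultdict(set)
    for secret_id, page_url in sightings:
        pages_by_secret[secret_id].add(page_url)

    total = len(pages_by_secret)
    reused = sum(1 for pages in pages_by_secret.values() if len(pages) > 1)
    return {
        "unique_secrets": total,
        "reused": reused,
        "reuse_rate": reused / total if total else 0.0,
    }


# Made-up sightings: k1 appears on two pages, k2 and k3 on one page each.
sightings = [
    ("k1", "https://a.example/"),
    ("k1", "https://b.example/"),
    ("k2", "https://a.example/"),
    ("k3", "https://c.example/"),
    ("k3", "https://c.example/"),
]
print(reuse_stats(sightings))

# Average occurrences per subdomain for the WalkScore key in the report.
print(round(57029 / 1871, 1))
```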

    The researchers also uncovered a webpage containing 17 unique live Slack webhooks. These webhooks, which allow applications to post messages to Slack, should be kept strictly confidential.

    “Keep it secret, keep it safe. Your webhook URL contains a secret. Don’t share it online, including via public version control repositories,” Slack warns.

    Following the research, Truffle Security proactively contacted the affected vendors and collaborated with them to revoke the compromised keys.

    “We successfully helped those organizations collectively rotate/revoke several thousand keys,” the researchers stated.

    Long-Term Implications and Best Practices

    Even if a given AI model was trained on older archives than the one Truffle Security scanned, the findings serve as a warning. LLMs trained on code that contains hardcoded secrets may reproduce the same insecure patterns, potentially introducing future vulnerabilities. Enterprises should prioritize secure coding practices, including:

    • Avoiding hardcoding sensitive information: Utilize secure server-side environment variables instead.
    • Regular security audits: Conduct frequent scans to identify and address potential vulnerabilities.
    • Prompt key rotation: Rotate API keys on a regular schedule and revoke any exposed key immediately to limit the impact of a leak.

    Protecting Your Enterprise

    This incident shows the importance of robust security measures for enterprise businesses. Regular security audits, secure coding practices, and the use of secure environment variables are crucial in preventing similar breaches.
