Finding Leaked API Keys Through Entropy Analysis

Hardcoded API tokens in GitHub repositories are not a new problem. In fact, mitigation efforts are shrinking the scope of the issue, with organizations like Facebook actively monitoring public repositories. However, the automated solutions in use are tailored to platforms like GitHub and fall short in other arenas. Try scanning JavaScript files found in the wild, for example: there are too many false positives for it to be practical at scale. To understand why, it helps to first understand which techniques are used to locate these leaked tokens.

Open-source projects employ one of three tactics to locate leaked API keys:

  1. High-risk variable names
  2. Regular Expressions
  3. High entropy (highly random) strings

High-risk variable names like “API_SECRET” and “PRIVATE_KEY” can indicate information disclosure, but this detection method is too unreliable for anything beyond pre-push hooks. For example, many production JavaScript files are minified for browser optimization, which renames identifiers and effectively erases these keywords. Relying on naming conventions alone is ineffective; it should only be used to supplement a more dynamic solution.
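A minimal sketch of why name-based detection breaks on minified code. The name list, the JavaScript snippets, and the helper has_risky_name are all hypothetical and written by hand for illustration; they are not taken from any particular tool.

```python
import re

# Hypothetical list of "high-risk" identifier names; real tools use longer lists.
RISKY_NAME = re.compile(r"\b(API_SECRET|PRIVATE_KEY)\b")

# Illustrative JavaScript before and after minification (hand-written, not tool output).
original = 'const API_SECRET = "example-value"; send(API_SECRET);'
minified = 'const a="example-value";send(a);'


def has_risky_name(source: str) -> bool:
    """Return True if the source mentions a high-risk identifier name."""
    return RISKY_NAME.search(source) is not None


print(has_risky_name(original))  # the name is present, so the line is flagged
print(has_risky_name(minified))  # the name was renamed away, so the value is missed
```

The value being protected is identical in both snippets; only the label that the scanner depends on has changed.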

Regular expressions can detect API tokens that follow strict patterns. Stripe, for example, uses the “sk_live_” prefix, so its keys can be identified with a pattern like sk_live_[0-9a-zA-Z]{24} [source]. While the tokens of select services can be matched with confidence, those of other services, like AWS, are essentially base64 strings. This means a regular expression that matches an AWS Secret Key will also match non-secret base64 strings, like those found in images. As with antivirus software, signature-based detection is only good for token types you are specifically looking for and are already familiar with. While this reduces false positives, it also gives scans tunnel vision, excluding valid API keys that are not in the signature table. Despite these drawbacks, the majority of open-source secret discovery tools still rely heavily on regex.
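The contrast between a strict signature and a loose one can be sketched as follows. The Stripe pattern is the one quoted above; the 40-character base64 pattern is an assumption standing in for a typical AWS-style signature, not an official format definition, and the key and "image" data are fabricated placeholders.

```python
import base64
import re

# Pattern quoted in the text for Stripe live secret keys.
STRIPE_KEY = re.compile(r"sk_live_[0-9a-zA-Z]{24}")

# Assumption: a loose "40 base64 characters" signature of the kind often used
# for AWS secret keys. Illustrative only.
BASE64_40 = re.compile(r"[A-Za-z0-9/+]{40}")

# Obviously fake key built at runtime: the prefix plus 24 filler characters.
fake_stripe_key = "sk_live_" + "0aB1" * 6

# Non-secret data: 60 arbitrary bytes encoded as base64, as an inline image might be.
image_blob = base64.b64encode(bytes(range(60))).decode("ascii")

print(bool(STRIPE_KEY.search(f'key = "{fake_stripe_key}"')))  # strict prefix: matched
print(bool(STRIPE_KEY.search(image_blob)))  # image data does not look like a Stripe key
print(bool(BASE64_40.search(image_blob)))  # not a secret, yet the loose pattern matches
```

The strict prefix keeps the first pattern precise, while the second cannot tell a key from any other base64 payload.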

To avoid relying on signature-based detection and to locate tokens for services that may be unfamiliar, a more dynamic method is needed. This has been attempted with a formula created by the mathematician Claude Shannon, who figured out how to quantify entropy.
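For reference, Shannon's entropy formula is usually written as below. The notation is an assumption about how it is applied to strings: the x_i are the n distinct characters in the string, p(x_i) is the relative frequency of each character, and the base-2 logarithm gives a result in bits per character.

```latex
H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)
```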


Using this formula, it is possible to build a so-called “Shannon Entropy Calculator” and mathematically determine how “random” a string is. If the entropy is above a particular threshold, say 4, the string is treated as a valid finding. Since API tokens are designed to be high-entropy strings, they are easy to spot among English-based programming languages like JavaScript. So easy, in fact, that people with zero programming experience can accurately point to the line an API token is on without even knowing what an API token is - seriously, try it. The concept is fairly intuitive: find what does not match, i.e., find what is random. Yet as shown in the table above, entropy is not used nearly as much as regex. Simply put, there are too many false positives. Take the string “AppendWebUICSSTextDefaults”, for example, which has an entropy score of ~4.1. Now compare that to “tLD3Lq2BjPjjPzxBB3qLDxDMju”, which looks far more anomalous but has an entropy score of ~3.6 - mathematically less random than the previous string.
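A minimal sketch of such a calculator, assuming per-character frequencies and a base-2 logarithm. The function name shannon_entropy and the THRESHOLD constant are illustrative choices, and exact scores may differ slightly from the rounded figures quoted above depending on how a given tool computes them.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of a string in bits per character, from character frequencies."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


THRESHOLD = 4.0  # the example threshold mentioned in the text

for candidate in ("AppendWebUICSSTextDefaults", "tLD3Lq2BjPjjPzxBB3qLDxDMju"):
    score = shannon_entropy(candidate)
    print(f"{candidate}: {score:.2f} flagged={score > THRESHOLD}")
```

Under these assumptions the ordinary identifier lands above the threshold and is flagged, while the token-like string lands below it and is missed.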

So what gives? This is a perfect example of a cross-industry semantic misunderstanding. Applied to a string, this math does not quantify entropy but instead redundancy: it only measures how often each character repeats. The probability of a character appearing in a sequence has nothing to do with the order of those characters (or lack thereof). Without taking character relationships into account, this math cannot be applied to strings for the purposes of locating secrets.
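As a worked example of that order-independence, take any string made of two A's, two B's, and two C's. Assuming the per-character frequency form of the formula with a base-2 logarithm, every arrangement of those six characters gives the same score:

```latex
p(A) = p(B) = p(C) = \tfrac{2}{6} = \tfrac{1}{3}
\qquad\Rightarrow\qquad
H = -3 \cdot \tfrac{1}{3} \log_2 \tfrac{1}{3} = \log_2 3 \approx 1.585
```

Only the character counts enter the calculation, so shuffling the string changes nothing.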

Please consider this before adding Shannon entropy to your suite.

Too long; did not read: AABBCC has the same entropy score as CABCBA.