Data Mining finds JASBUG, a Critical Security Vulnerability

We explain how JASBUG, a critical Microsoft security vulnerability that had existed for 15 years, was detected with similarity search and regular expression inference.

By Arnoldo Muller-Molina (simMachines).

If you pay attention to security news, perhaps you have heard of the recent discovery of the so-called JASBUG.

See, for example, the coverage here, here, or the CNN report.

By using similarity search and large-scale clustering, we discovered one of the worst security vulnerabilities of the past 15 years. In what follows, I explain the context of the issue and the gist of the approach used to discover it.

The Internet Corporation for Assigned Names and Numbers (ICANN) engaged us to research potential technical issues relating to the rollout of new Generic Top Level Domains (New gTLDs) on the Internet.

During the course of the research, we uncovered a vulnerability not directly related to ICANN’s New gTLD Program nor to new TLDs in general. Once we understood the seriousness of the vulnerability, we notified the affected vendor and withheld additional disclosure until the vendor addressed the vulnerability. Information from Microsoft relating to this issue is available here and here.

Microsoft has classified this vulnerability as “Critical” because “…exploitation could allow code execution without user interaction.” This is the most serious rating in Microsoft’s classification taxonomy.

The vulnerability impacts core components of the Microsoft Windows Operating System. All domain-joined Windows Clients and Servers (i.e. Members of a corporate Active Directory) may be at risk. The vulnerability is remotely exploitable and may grant the attacker administrator level privileges on the target machine/device. Roaming machines – domain-joined Windows devices that connect to corporate networks via the public Internet (e.g. from hotels and coffee shops) – are at heightened risk.

Unlike recent high-profile vulnerabilities such as Heartbleed, Shellshock, Gotofail, and POODLE, this is a design problem rather than an implementation problem, which makes this type of vulnerability unusual and much more difficult to fix. The fix required Microsoft to re-engineer core components of the operating system and to add several new features. Furthermore, the problem has been present for at least 15 years.

How come this problem was not noticed before?

The answer is simple: nobody was looking at the data collected by domain name servers through the lens of clustering and similarity. Every year, the Domain Name System Operations Analysis and Research Center (DNS-OARC) collects data from DNS nameservers through initiatives such as the annual Day In the Life of the Internet (DITL) collection effort. The data set consists of billions of records that include various measurements, among them the domain name being queried, for example:


Many of these queries contain randomly generated components, so a typical query might look like:


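For illustration only (the actual queries from the data set are not reproduced here), machine-generated queries of this kind can be mimicked with a short script; the domain name and label length below are hypothetical:

```python
import random
import string

def random_query(domain="example.com", length=8):
    """Generate a hypothetical randomized DNS query name, mimicking the
    machine-generated queries seen in DITL-style captures (illustrative only)."""
    label = "".join(random.choices(string.ascii_lowercase + string.digits, k=length))
    return f"{label}.{domain}"

for _ in range(3):
    print(random_query())
```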
With arbitrarily complex string-generation algorithms, it is necessary to extract the gist of each string in the form of a regular expression (or an equivalent representation) in order to explain a potentially large number of strings that share similar elements. In the case of the example above, we would like to infer the regular expression:


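As a toy illustration of this inference step (the queries below are hypothetical, not the ones from the study), a simple per-position approach can turn a set of similar, equal-length strings into a regular expression: conserved positions become literals, variable positions become character classes:

```python
import re

def infer_regex(strings):
    """Infer a simple regular expression from equal-length similar strings.
    Positions where all strings agree become literals; variable positions
    become character classes. (Toy sketch, not the production algorithm.)"""
    assert len({len(s) for s in strings}) == 1, "sketch assumes equal lengths"
    pattern = []
    for chars in zip(*strings):
        uniq = sorted(set(chars))
        if len(uniq) == 1:
            pattern.append(re.escape(uniq[0]))
        else:
            pattern.append("[" + "".join(re.escape(c) for c in uniq) + "]")
    return "".join(pattern)

queries = [
    "update-a1b2.example.com",  # hypothetical randomized queries
    "update-c3d4.example.com",
    "update-e5f6.example.com",
]
rx = infer_regex(queries)
print(rx)
```

The real pipeline must also handle strings of different lengths, which is where alignment (discussed below) comes in.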
Our objective, then, is to discover over-represented regular expressions among the queries received by domain name servers. The input is a very large set of strings, and the output is an enumeration of regular expressions, or a suitable visualization such as a Position Weight Matrix (PWM), that explains over-represented patterns. Please review this article for a description of a PWM. A visualization of a PWM looks like:

[Sequence logo for the pattern]

Here the X axis (Position) represents a character position in the pattern, and the Y axis represents the frequency of each character. Each ball contains a character, and by looking at characters with similar frequencies it is possible to identify sub-patterns that may not be apparent in the regular expression. For example, “compatibility-at-addons” is a sub-pattern that does not emerge clearly from the original regular expression:


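To make the PWM construction concrete, here is a minimal sketch that computes column-wise character frequencies for a set of aligned strings (the strings and frequencies below are hypothetical, not drawn from the DITL data):

```python
from collections import Counter

def build_pwm(aligned):
    """Build a Position Weight Matrix from equal-length aligned strings.
    pwm[i][c] is the relative frequency of character c at position i."""
    n = len(aligned)
    pwm = []
    for column in zip(*aligned):
        counts = Counter(column)
        pwm.append({c: k / n for c, k in counts.items()})
    return pwm

aligned = ["abc1", "abd2", "abc3"]  # hypothetical aligned cluster members
pwm = build_pwm(aligned)
print(pwm[0])  # position 0 is fully conserved
print(pwm[2])  # position 2 is only partially conserved
```

A sequence logo like the one above is just a rendering of this matrix, with letter height proportional to frequency (or information content) at each position.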
The technique itself is inspired by motif discovery, commonly used in computational biology to uncover transcription factor binding site patterns. These techniques fundamentally employ sampling methodologies that rely heavily on randomness. They work well because they are executed on a constrained set of genetic sequences. The DITL data set, however, is at least an order of magnitude larger than the human genome, and since the analysis is applied at a wide scale, we must use alternative procedures that converge faster.

Similarity search, or nearest neighbor search, is the right choice for this problem. Traditionally, similarity search has been considered a very expensive operation, especially for metric and pseudometric spaces. During the past 10 years, I have been working on a similarity index that I call “Ramiel”, which improves on other techniques by an order of magnitude (some benchmarks here). My company simMachines commercializes the technology. I used this index to cluster the entire data set (about 2 billion records). The clustering algorithm returns many groups of similar strings: it discovers an arbitrary number of quality clusters, where quality is a function of density and size. The number of clusters is determined in real time, so you do not have to impose one in advance. The algorithm also provides a “center” for each group. This center is then compared against all the other elements in the cluster by using the Smith-Waterman algorithm to align the center with each member. These alignments are used to build a PWM, and from the PWM it is trivial to infer a regular expression.
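The alignment step can be sketched with a toy Smith-Waterman implementation (linear gap penalty and a single traceback; the scoring parameters and example strings are illustrative, not those of the production system):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Toy Smith-Waterman local alignment: returns the best local score
    and the two aligned substrings (linear gaps, single traceback)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best-scoring cell until the score drops to zero.
    i, j = best_pos
    aa, bb = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch):
            aa.append(a[i - 1]); bb.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            aa.append(a[i - 1]); bb.append("-"); i -= 1
        else:
            aa.append("-"); bb.append(b[j - 1]); j -= 1
    return best, "".join(reversed(aa)), "".join(reversed(bb))

score, x, y = smith_waterman("query-abc.example", "query-xbc.example")
print(score, x, y)
```

In the pipeline described above, each cluster member would be aligned against the cluster center in this way, and the stacked alignments would feed the PWM from which the regular expression is read off.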

The process generated many regular expressions, but one particular expression stood out as an anomaly. In fact, several extracted patterns contained a similar substring. This led to the conclusion that a DNS query could be configured to hit the root servers in a very consistent way. The set of patterns led us to believe that there was a problem in the Active Directory service of Microsoft Windows. In effect, it was possible to remotely control a client by performing a man-in-the-middle attack: when a person working remotely tries to log in to the company's network, DNS is used improperly during the authentication process. It is here that the client can be attacked and arbitrary code may be executed on their machine. Evidence of this improper use of DNS was sitting in the DITL data set, but nobody noticed it until we applied large-scale clustering and similarity algorithms.

The vulnerability went unnoticed for 15 years because similarity search is a computationally expensive operation, and it was first necessary to build fast similarity and clustering techniques that scale to billions of records. Now that we have the required performance, we were able to create the enumeration of regular expressions that helped us find the bug. Similarity search has been the key to uncovering a security issue bigger than anything we have seen in the past 15 years. From my point of view, a distance function is the single most flexible tool that a data scientist can have at his or her disposal.

Arnoldo Müller-Molina (@amuller) is the founder and CEO of simMachines, a company that applies metric similarity search to data problems in order to discover, recommend, and predict. He was recognized by MIT Technology Review as one of the top innovators under 35 in his region.