KDnuggets Home » News » 2017 » Sep » Tutorials, Overviews » Fundamental Breakthrough in 2 Decade Old Algorithm Redefines Big Data Benchmarks ( 17:n38 )

Fundamental Breakthrough in 2 Decade Old Algorithm Redefines Big Data Benchmarks

Read on to find out how the two-decade-old minwise hashing computational barrier has been overcome with a significantly more efficient alternative.

By Anshumali Shrivastava, Rice University.

Search is a fundamental operation that lies at the heart of every large data-processing system. Given an object of interest, the ability to quickly find relevant objects in a large repository is the first prerequisite of a variety of Big Data applications, including web search, recommendation systems, question answering and more.

Most online applications require that search queries be answered within a few milliseconds to be practical. Therefore, the trivial idea of comparing the object of interest against every element in the data -- also known as the "brute force" way -- is prohibitively expensive.

A related problem is filtering near-duplicate mirror pages from the web. In near-duplicate detection, for each page we need to find all very similar pages to filter out. The brute force way of comparing every pair of pages to identify near duplicates is a quadratic operation. Thus, if we have a repository of a trillion pages, the brute force method requires around a trillion trillion operations, an infeasible number.
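The quadratic cost is easy to see in code. Below is a minimal sketch of the brute-force approach, treating each page as its set of words and using Jaccard similarity; the function names, threshold and toy pages are illustrative, not from the article:

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

def brute_force_near_duplicates(pages, threshold=0.7):
    """Compare every pair of pages: n pages cost O(n^2) comparisons."""
    shingles = [set(p.split()) for p in pages]
    return [(i, j)
            for i, j in combinations(range(len(pages)), 2)
            if jaccard(shingles[i], shingles[j]) >= threshold]

pages = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "an entirely different page about hashing",
]
print(brute_force_near_duplicates(pages))  # [(0, 1)]
```

With three pages this is 3 comparisons; with a trillion pages it is on the order of 10^23, which is exactly why the pairwise approach is infeasible.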

Minwise hashing, popularly known as minhash, developed in 1997, was one of the first successful algorithms to make duplicate detection over the web practical. It's no surprise this was also when search companies like Google came into existence. The elegant algorithm takes a given page as input and generates a set of probabilistic numbers as output. These numbers have a special property: if two pages are duplicates, then their corresponding minwise hashes are very likely to be identical. For non-duplicate pages, the corresponding minwise hashes have a low chance of being identical. These numbers, or minwise hashes, therefore act as fingerprints of the page. Fingerprint matches are probabilistic evidence of high similarity.

It turns out the possibility of computing such fingerprints makes search computationally easier. In particular, whenever we see a page p, we can index the page with its fingerprints. In other words, we can store a reference to page p at the memory location pointed to by its minwise hashes or fingerprint. In the future, if we see a page q that is a near-duplicate of p, then minwise hashes of p and q will likely match. Therefore, hashes of near-duplicate page q will point to page p. (See Figure 1.) The process does not require comparing every page with q to land at p. In fact, a few memory lookups are sufficient.
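The indexing scheme in Figure 1 can be sketched as follows. This is an assumed, simplified banding-style index (class and parameter names are my own, not from the article): each page is stored in buckets keyed by short bands of its minwise hashes, and a query only touches a handful of buckets instead of scanning every page:

```python
import hashlib
from collections import defaultdict

def minhash_signature(tokens, num_hashes):
    """Simulated minwise hashes: minimum salted-MD5 value per salt."""
    def h(i, t):
        return int(hashlib.md5(f"{i}:{t}".encode()).hexdigest(), 16)
    return [min(h(i, t) for t in tokens) for i in range(num_hashes)]

class FingerprintIndex:
    """Buckets keyed by short bands of minwise hashes: a page is stored
    at the locations its fingerprints point to, and near-duplicates
    are likely to collide in at least one bucket."""

    def __init__(self, bands=10, rows=2):
        self.bands, self.rows = bands, rows
        self.buckets = defaultdict(set)

    def _keys(self, tokens):
        sig = minhash_signature(tokens, self.bands * self.rows)
        r = self.rows
        return [(b, tuple(sig[b * r:(b + 1) * r])) for b in range(self.bands)]

    def index(self, page_id, tokens):
        for key in self._keys(tokens):
            self.buckets[key].add(page_id)

    def query(self, tokens):
        hits = set()
        for key in self._keys(tokens):  # a few bucket lookups, no scan
            hits |= self.buckets[key]
        return hits

p = set("the quick brown fox jumps over the lazy dog".split())
q = set("the quick brown fox jumps over the lazy cat".split())

idx = FingerprintIndex()
idx.index("p", p)
print(idx.query(q))  # the near-duplicate q is very likely to hit "p"
```

An exact duplicate is guaranteed to be found, since identical pages produce identical bucket keys; near-duplicates are found with high probability, matching the lookup behavior described in the text.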

The idea of indexing based on probabilistic fingerprints was formalized into the famous theory of locality sensitive hashing (LSH). The theory sheds new light on the difficulty of proximity search problems. Later, the same ability to compute fingerprints was found to be directly applicable in reducing the computational bottleneck of several basic data-mining and machine-learning subroutines. The wide applicability of minwise hashing makes it arguably the most successful randomized algorithm in the literature.

Figure 1: Illustration of Hash (Fingerprint) Based Search.

Due to their simplicity and effectiveness, search algorithms based on minwise hashing are heavily deployed in commercial search engines. In 2012, minwise hashing and LSH were recognized as a key breakthrough, and their inventors were awarded the ACM Paris Kanellakis Theory and Practice Award.

However, minwise hashing has one major known caveat: computing a single fingerprint requires one complete pass over the data. In practice, with massive datasets, we need hundreds of fingerprints (or hashes) for good accuracy, which requires hundreds of passes over the dataset. Thus, with huge volumes of data, fingerprint computation, the very first step, becomes prohibitively expensive. This hashing time is the known fundamental computational barrier of LSH and related algorithms.

Owing to the significance of this problem, improving hashing time attracted interest as early as 2003. The most success was achieved with hashes based on random projections, while minwise hashing remained a costly procedure. It was further known that, for high-dimensional and sparse binary-like datasets, minwise hashing is provably superior to random-projection-based LSH. A faster alternative to minwise hashing was therefore seen as important.

The first concrete success came in 2014, when the idea of binning combined with a "densification" trick led to a procedure that could, in one pass over the data, compute hundreds of hashes that were nearly identical to minwise hashes. The technique suffered a slight loss in accuracy, especially for very sparse datasets, but this problem was fixed in a 2017 follow-up paper. The resulting procedure is surprisingly simple and can be written in fewer than 20 lines of code in C++.
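The binning-plus-densification idea can be sketched as follows. This is a simplified illustration in Python rather than the exact 2014/2017 procedure (the real densification uses a more careful rotation scheme): every element is hashed once, routed into one of k bins, each bin keeps its minimum, and empty bins are filled by borrowing from the next non-empty bin:

```python
import hashlib

def densified_minhash(tokens, k=16):
    """One pass over the data yields k hashes at once: each token is
    hashed a single time and routed to a bin, and each bin keeps the
    minimum hash value it sees. Assumes tokens is non-empty."""
    M = 1 << 32
    bins = [None] * k
    for t in tokens:                      # the single pass over the data
        hv = int(hashlib.md5(t.encode()).hexdigest(), 16) % M
        b = hv * k // M                   # bin determined by the hash value
        if bins[b] is None or hv < bins[b]:
            bins[b] = hv
    # Densification: every empty bin copies the value of the next
    # non-empty bin (circularly), a simplified stand-in for the
    # rotation trick, so that all k hashes are defined.
    out = []
    for i in range(k):
        j = i
        while bins[j] is None:
            j = (j + 1) % k
        out.append(bins[j])
    return out

doc_a = set("the quick brown fox jumps over the lazy dog".split())
doc_b = set("the quick brown fox jumps over the lazy cat".split())

sa = densified_minhash(doc_a)
sb = densified_minhash(doc_b)
print(sum(x == y for x, y in zip(sa, sb)), "of", len(sa), "hashes agree")
```

The contrast with classical minwise hashing is the whole point: here k hashes cost one pass over the tokens, whereas the classical scheme would cost k passes.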

Consequently, the two-decade-old minwise hashing computational barrier has been overcome with a significantly more efficient alternative. The computational cost of classical minwise hashing, which required several passes over the data, was brought down to just one pass, amounting to several orders of magnitude in computational savings. For most practical purposes, this improvement came without any noticeable difference in the statistical properties of the hashes. It is rare to see such significant gains without any noticeable tradeoffs.

Previously, LSH algorithms carried many overheads, such as the hashing cost, which is why hashing-based algorithms never achieved state-of-the-art performance in head-to-head comparisons with other highly optimized implementations. Recent benchmarking shows that the hierarchical navigable small world (HNSW) algorithm significantly outperforms all existing implementations of similarity search algorithms. Hashing-based implementations were not even second best, losing to HNSW by a significant margin.

However, FLASH, a new implementation that combines the efficient minwise hashing scheme with some simple randomized algorithms, turns the tables completely. Minwise hashing is now at the top of similarity search benchmarks. The benchmarking was done on real, ultra-high-dimensional datasets that are ubiquitous in latency-critical applications, including click-through data, recommendation systems, documents, text and more. In a recent head-to-head comparison with a state-of-the-art implementation of HNSW, FLASH turned out to be an order of magnitude more efficient in both computation and memory.

It is fascinating to see how a fundamental algorithmic improvement makes a huge difference in practical benchmarks. This success is an encouraging illustration that there is still room for better and simpler alternatives to textbook algorithms as fundamental as minwise hashing.

Bio: Anshumali Shrivastava is an Assistant Professor, Department of Computer Science, Rice University. His research interests include Machine Learning, Randomized Algorithms for Big-Data, and Graph Mining.