How to Remove Duplicates in Large Datasets

Dealing with huge datasets can be tricky, especially the data cleaning process. One of such processing is de-duplication, find out how you can solve this using the statistical techniques.

comments

By Suresh Kondamudi, CleverTap.

Dealing with large datasets is often daunting. With limited computing resources, particularly memory, it can be challenging to perform even basic tasks like counting distinct elements, membership check, filtering duplicate elements, finding minimum, maximum, top-n elements, or set operations like union, intersection, similarity and so on.

Probabilistic data structures to the rescue

Probabilistic data structures can come in pretty handy in these cases, in that they dramatically reduce memory requirements, while still providing acceptable accuracy. Moreover, you get time efficiencies, as lookups (and adds) rely on multiple independent hash functions, which can be parallelized. We use structures like Bloom filters, MinHash, Count-min sketch, HyperLogLog extensively to solve a variety of problems. One fairly straightforward example is presented below.

The problem

We manage mobile push notifications for our customers, and one of the things we need to guard against is sending multiple notifications to the same user for the same campaign. Push notifications are routed to individual devices/users based on push notification tokens generated by the mobile platforms. Because of their size (anywhere from 32b to 4kb), it’s non-performant for us to index push tokens or use them as the primary user key.

On certain mobile platforms, when a user uninstalls and subsequently re-installs the same app, we lose our primary user key and create a new user profile for that device. Typically, in that case, the mobile platform will generate a new push notification token for that user on the reinstall. However, that is not always guaranteed. So, in a small number of cases we can end up with multiple user records in our system having the same push notification token.

As a result, to prevent sending multiple notifications to the same user for the same campaign, we need to filter for a relatively small number of duplicate push tokens from a total dataset that runs from hundreds of millions to billions of records. To give you a sense of proportion, the memory required to filter just 100 Million push tokens is 100M * 256 = 25 GB!

The solution – Bloom filter

The idea is very simple.

Allocate a bit array of size
Choose independent hash functions whose range is
For each data element, compute hashes and turn on bits
For membership query , apply hashes and check if all the corresponding bits are ‘on’

Note that bits might be turned ‘on’ by hash collisions leading to false positives i,e a non-existing element may be reported to exist and the goal is to minimise this.

On hash functions

Hash functions for Bloom filter should be independent and uniformly distributed. Cryptographic hashes like MD5 or SHA-1 are not good choices for performance reasons. Some of the suitable fast hashes are MurmurHash, FNV hashes and Jenkin’s Hashes.

We use MurmurHash –

It’s fast – 10x faster than MD5
Good distribution – passes chi-squared test for uniformity
Avalanche effect – sensitive to even slightest input changes
Independent enough

Sizing the Bloom filter

Sizing the bit array involves choosing optimal number of hash functions to minimise false-positive probability.

With bits, hash functions and elements, the false positive probability i,e the probability of all the corresponding bits are ‘on’ falsely when the element doesn’t exist