How to Remove Duplicates in Large Datasets
Dealing with huge datasets can be tricky, especially during data cleaning. One such task is deduplication — find out how to solve it using probabilistic data structures.
By Suresh Kondamudi, CleverTap.
Dealing with large datasets is often daunting. With limited computing resources, particularly memory, it can be challenging to perform even basic tasks like counting distinct elements, membership checks, filtering duplicate elements, finding minimum, maximum, or top-n elements, or set operations like union, intersection, and similarity.
Probabilistic data structures to the rescue
Probabilistic data structures can come in pretty handy in these cases, in that they dramatically reduce memory requirements while still providing acceptable accuracy. Moreover, you get time efficiencies, as lookups (and adds) rely on multiple independent hash functions, which can be parallelized. We use structures like Bloom filters, MinHash, Count-min sketch, and HyperLogLog extensively to solve a variety of problems. One fairly straightforward example is presented below.
The problem
We manage mobile push notifications for our customers, and one of the things we need to guard against is sending multiple notifications to the same user for the same campaign. Push notifications are routed to individual devices/users based on push notification tokens generated by the mobile platforms. Because of their size (anywhere from 32 bytes to 4 KB), it’s non-performant for us to index push tokens or use them as the primary user key.
On certain mobile platforms, when a user uninstalls and subsequently reinstalls the same app, we lose our primary user key and create a new user profile for that device. Typically, in that case, the mobile platform will generate a new push notification token for that user on the reinstall. However, that is not always guaranteed. So, in a small number of cases we can end up with multiple user records in our system having the same push notification token.
As a result, to prevent sending multiple notifications to the same user for the same campaign, we need to filter out a relatively small number of duplicate push tokens from a total dataset that runs from hundreds of millions to billions of records. To give you a sense of proportion, the memory required to hold just 100 million push tokens at 256 bytes each is 100M * 256 bytes ≈ 25 GB!
The solution – Bloom filter
The idea is very simple:
 Allocate a bit array of size m
 Choose k independent hash functions whose range is [0, m − 1]
 For each data element, compute the k hashes and turn on the corresponding bits
 For a membership query, apply the k hashes and check whether all the corresponding bits are ‘on’
Note that bits might be turned ‘on’ by hash collisions, leading to false positives, i.e. a non-existent element may be reported as existing; the goal is to minimise this probability.
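The steps above can be sketched in a few lines of Python. This is an illustrative toy, not CleverTap’s production code; the class name and the use of the standard library’s blake2b (salted per hash function) are my own choices:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k salted hash functions."""

    def __init__(self, m, k):
        self.m = m                          # size of the bit array
        self.k = k                          # number of hash functions
        self.bits = bytearray(m // 8 + 1)   # bit array, packed into bytes

    def _index(self, item, i):
        # The i-th "hash function": personalize the digest with i.
        digest = hashlib.blake2b(item.encode(),
                                 person=i.to_bytes(8, "big")).digest()
        return int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Turn on the k bits chosen by the k hashes.
        for i in range(self.k):
            idx = self._index(item, i)
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        # Present only if all k corresponding bits are 'on'.
        for i in range(self.k):
            idx = self._index(item, i)
            if not self.bits[idx // 8] >> (idx % 8) & 1:
                return False
        return True

bf = BloomFilter(m=1 << 20, k=7)
bf.add("push-token-abc123")
print("push-token-abc123" in bf)   # True
print("push-token-xyz789" in bf)   # almost certainly False (false positives are possible)
```

Note that `__contains__` can answer “definitely not present” with certainty, but “present” only with high probability — exactly the trade-off described above.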
On hash functions
Hash functions for a Bloom filter should be independent and uniformly distributed. Cryptographic hashes like MD5 or SHA-1 are not good choices for performance reasons. Some suitable fast hashes are MurmurHash, FNV hashes, and Jenkins hashes.
We use MurmurHash –
 It’s fast – 10x faster than MD5
 Good distribution – passes the chi-squared test for uniformity
 Avalanche effect – sensitive to even the slightest input changes
 Independent enough
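In practice, you rarely need k truly independent hash functions: two base hashes can simulate any number of them via double hashing (g_i(x) = h1(x) + i·h2(x) mod m), with negligible loss of accuracy. The sketch below uses the standard library’s blake2b as a dependency-free stand-in for a fast hash like MurmurHash (in production you would use a MurmurHash binding such as the mmh3 package):

```python
import hashlib

def k_hashes(item, k, m):
    """Simulate k hash functions from two base hashes (double hashing):
    g_i(x) = h1(x) + i * h2(x) mod m."""
    digest = hashlib.blake2b(item.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1   # force odd stride
    return [(h1 + i * h2) % m for i in range(k)]

print(k_hashes("push-token-abc123", k=5, m=1000))  # 5 indexes in [0, 1000)
```

This keeps the per-lookup cost at two hash computations regardless of k.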
Sizing the Bloom filter
Sizing the bit array involves choosing the optimal number of hash functions to minimise the false-positive probability.
With m bits, k hash functions, and n inserted elements, the false-positive probability, i.e. the probability that all k corresponding bits are ‘on’ even though the element was never added, is

p ≈ (1 − e^(−kn/m))^k

For given m and n, the optimal k that minimises p is

k = (m/n) ln 2

i.e. the number of bits required for a target false-positive probability p is

m = −n ln p / (ln 2)²
So, for 100 million push tokens with a 0.001 error probability, m = −10⁸ × ln(0.001) / (ln 2)² ≈ 1.44 billion bits ≈ 171 MB, with k ≈ 10 hash functions. This is a significant improvement over 25 GB.
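The sizing arithmetic above is easy to script (the function name is mine, the formulas are the standard ones):

```python
import math

def bloom_size(n, p):
    """Optimal Bloom filter parameters for n elements
    and target false-positive probability p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)  # bits
    k = round(m / n * math.log(2))                      # hash functions
    return m, k

m, k = bloom_size(n=100_000_000, p=0.001)
print(f"{m:,} bits ≈ {m / 8 / 2**20:.0f} MiB with k = {k} hash functions")
```

Running this confirms the numbers above: roughly 1.44 billion bits (about 171 MiB) and 10 hash functions.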