How to discover stolen data using Hadoop and Big data?

We discuss recent data breaches and present an approach that uses Hadoop and data fingerprint matching techniques to discover stolen data.

A few companies also call the entire process Fingerprint Matching. The hash codes of the data chunks are known as fingerprints of data and the action of matching hash codes is called fingerprint matching.
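The chunk-and-hash idea can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the function names `fingerprints` and `match_ratio`, the fixed 16-byte chunk size, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
from typing import Set

CHUNK_SIZE = 16  # bytes per chunk; an illustrative value, not taken from the article


def fingerprints(data: bytes, chunk_size: int = CHUNK_SIZE) -> Set[str]:
    """Split data into fixed-size chunks and return the SHA-256 hex digest of each."""
    return {
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }


def match_ratio(protected: bytes, candidate: bytes) -> float:
    """Fraction of the protected data's fingerprints that also appear in candidate."""
    known = fingerprints(protected)
    if not known:
        return 0.0
    return len(known & fingerprints(candidate)) / len(known)
```

A high ratio suggests that parts of the protected data appear in the candidate. Only the hash codes are compared, so the matching side does not need to see the original data. Fixed-size chunks are sensitive to alignment, so a production system would likely need a more robust chunking scheme than this sketch uses.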

The data theft solutions are quite powerful because they are able to crawl even the Dark Web, where websites can hide their identity. In fact, vendors present Dark Web crawling as one of the central capabilities of these solutions.

Some data theft solutions also offer analytics and reporting capabilities for their clients. These solutions can be integrated with almost any security information and event management (SIEM) system, which can then receive alerts from them.

The following is a typical workflow diagram for a standard security application.


Role of Hadoop and Big Data in finding stolen data

Obviously, matching data fingerprints requires handling an enormous volume of data.

The entire process of breaking data into chunks and generating hash codes involves enormous volumes of data. One can imagine that the database of every data theft management company is overflowing with data. To process such a huge amount of data, these companies need a reliable Hadoop platform, and not just any Hadoop solution will do. It needs to be something like an enterprise-grade distribution of Hadoop that is implemented in native code rather than on the Java Virtual Machine, which makes it more resource-efficient.

The data theft solutions on the market depend completely on data chunks, or datasets. The more datasets there are, the higher the chance of matching fingerprints. So there is a need for a system that can handle large volumes of data, and Hadoop and Big Data technologies are built for exactly that. According to Danny Rogers, “We are only as good as the data we collect, and our ability to collect more data depends on this key piece of technology.”

The role of Hadoop described above can serve as a template for tracking stolen data. You need large-scale, cloud-based automation with an enterprise-grade distribution to find stolen data. Hadoop plays two roles in this context: dataset manager and dataset processor. Any organization that attempts to match dataset fingerprints to find stolen data will have to store and process huge volumes of datasets, and for that it will need a sound data management and processing system.
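To make the "dataset processor" role more concrete, here is a small in-memory simulation of how fingerprint matching could be phrased as a MapReduce-style job. It does not use the Hadoop API; the phases are plain Python functions, and the source tags `"client"` and `"crawl"` as well as all function names are hypothetical choices for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Set, Tuple

# Hypothetical input record: (source_tag, fingerprint).
# "client" marks fingerprints of data a customer wants watched,
# "crawl" marks fingerprints of data collected from the web.
Record = Tuple[str, str]


def map_phase(records: Iterable[Record]) -> Iterator[Tuple[str, str]]:
    """Emit (fingerprint, source_tag) so that equal fingerprints share a key."""
    for source_tag, fingerprint in records:
        yield fingerprint, source_tag


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group values by key, as the framework does between map and reduce."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for key, value in pairs:
        grouped[key].add(value)
    return grouped


def reduce_phase(grouped: Dict[str, Set[str]]) -> List[str]:
    """Keep fingerprints seen in both the client data and the crawled data."""
    return sorted(fp for fp, tags in grouped.items() if {"client", "crawl"} <= tags)


def find_matches(records: Iterable[Record]) -> List[str]:
    """Run the three phases in sequence on an in-memory list of records."""
    return reduce_phase(shuffle(map_phase(records)))
```

Because every fingerprint is routed by its own value, the grouping step can be spread across many machines, which is the reason this kind of job suits a cluster once the fingerprint sets no longer fit on a single server.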


The development of data theft detection systems represents, in a sense, a change in the approach towards data theft. It is good that enterprises are realizing the potential of Hadoop in detecting stolen data. Hadoop complements data theft tracking systems, because fingerprint matching techniques need to be supported by adequate data storage and processing capabilities. However, as stated earlier in this article, these are early days for developments like this. Another perspective is the security of the data storage systems themselves: because they store a huge amount of data, they could become targets for future attacks. In that case, enterprise databases and Hadoop could equally face attacks from hackers.

Bio: Kaushik Pal has 16 years of experience as a technical architect and software consultant in enterprise application and product development. He is interested in new technology and innovation, along with technical writing. His main focus areas are web architecture, web technologies, Java/J2EE, open source, big data and semantic technologies.