How to discover stolen data using Hadoop and Big Data?

We discuss recent data breaches and present an approach that uses Hadoop and data fingerprint matching techniques to discover stolen data.

Data theft has been a serious problem for a long time, and it is made worse by how long it takes to detect. The longer a theft goes unnoticed, the harder it is to respond. Hadoop and Big Data can help organizations reduce the time needed to identify data theft and act on it. As this article will show, a few organizations are already using Hadoop and Big Data to detect data theft quickly. Still, workable solutions have only just started to appear, and it will be a long time before sound defenses against data theft are available.

Data theft: some scary statistics

Well-known brands worldwide have suffered heavy losses of reputation and money because of data theft. Consider the following statistics:

  • In the US, over 8 years, a hacking group targeted banks, department stores and payment processors and stole more than 160 million credit and debit card numbers.
  • KT Corp, the Korean mobile carrier, suffered a huge loss of reputation when two suspects reportedly earned more than $850,000 by selling the plan details and contact information of more than 8.7 million KT subscribers.
  • Experian, one of the biggest data monitoring companies in the world, disclosed a major breach of data belonging to customers who had applied for services at T-Mobile. The data included names, addresses, Social Security Numbers, passport details and driving license details.
  • JPMorgan Chase lost more than 76 million customer records when hackers stole customer account numbers, names and email addresses. What added to the problem was that the theft was detected almost a month later.
  • Home Depot faced a massive loss of sensitive data when credit card details of up to 56 million customers were stolen from its cash register systems. The breach was carried out with malware that Russian and Ukrainian hackers had installed on those systems.

Many more such incidents happen every day. The following observations can be drawn from the examples above:

  • Data theft can breach the strongest of systems because theft techniques evolve alongside the defenses built against them.
  • Data theft cannot be eliminated but it can be managed better.
  • If the systems of brands as well known as JPMorgan Chase and Experian can be breached, then almost nothing is safe.
  • Data theft protection should not focus only on protecting data. It also needs to identify a theft quickly and trace the footprints it leaves.


Role of Hadoop and Big Data in recovering stolen data

It is not possible to wipe out data theft, and it can strike anytime, anywhere. The approach to data theft therefore needs to change: while data security systems are upgraded, early theft detection and recovery of lost data should also get attention. Hadoop and Big Data can play a role in quickly identifying an incident of data theft. A few companies have been working on data theft solutions. They are not even trying to prevent data theft, because that is not possible. They are working on the following two things:

  • Identifying data theft as quickly as possible so that the data can be tracked without wasting time.
  • Tracking stolen data on the Internet and the Dark Web.

The concept behind data theft solutions

The assumption behind data theft solutions is that it is almost impossible to stop data theft. The best way to approach a situation of data theft is to assume that it is inevitable and to quickly start looking for the data before it is lost.

There is a fundamental difference between stealing a tangible good and stealing data. A stolen tangible good is gone, but data thieves can only take a copy of the data. The original stays with its owner and can be used to track the copy on the web. It is a matter of comparing the original with its copy.

To match the original and its copy, you generate a hash code of the original and compare it with that of the copy. A hash code is a fixed-length value computed from a chunk of data; for practical purposes, it uniquely identifies that chunk. The technique used to generate it is known as cryptographic hashing. According to experts at a data intelligence company that specializes in data theft solutions, “It’s not code that’s embedded in the data so much as a computation done on the data itself”. You first divide the data into several chunks and then run each chunk through a mathematical function to generate a hash code. After that, you crawl the web, hash the data you find in the same way, and compare the results with the hash codes of the original. If a hash code of the original matches that of any data found on the web, you have found your stolen data.
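
The chunk, hash and match steps described above can be sketched in a few lines of Python. This is only a minimal illustration and not the method of any particular vendor. It assumes that each chunk is one record, that SHA-256 is the hash function, and that records are trimmed and lowercased before hashing. The function names fingerprint and match_ratio and the sample records are hypothetical.

```python
import hashlib
from typing import Iterable, Set


def fingerprint(records: Iterable[str]) -> Set[str]:
    """Return one SHA-256 hex digest per non-empty record.

    Assumption: a "chunk" is a single record, normalized by trimming
    whitespace and lowercasing so trivial edits do not hide a match.
    """
    digests = set()
    for record in records:
        normalized = record.strip().lower()
        if normalized:
            digests.add(hashlib.sha256(normalized.encode("utf-8")).hexdigest())
    return digests


def match_ratio(original_hashes: Set[str], candidate_records: Iterable[str]) -> float:
    """Fraction of the original fingerprints that appear in the candidate data."""
    if not original_hashes:
        return 0.0
    found = original_hashes & fingerprint(candidate_records)
    return len(found) / len(original_hashes)


# Made-up records standing in for the data owner's original data set.
original = [
    "alice@example.com,plan-a",
    "bob@example.com,plan-b",
    "carol@example.com,plan-c",
]
original_hashes = fingerprint(original)

# Made-up text standing in for content found by a web crawler.
crawled = [
    "unrelated text",
    "BOB@example.com,plan-b ",
    "alice@example.com,plan-a",
]
ratio = match_ratio(original_hashes, crawled)
print(f"Share of original records found in crawled data: {ratio:.0%}")
```

Only the hash codes need to be compared against crawled content, so the search side never has to hold the original records. An exact hash match breaks as soon as a record is altered beyond the normalization shown here, so a real system would need a more tolerant chunking scheme. At web scale the same per-record hashing and set intersection could be distributed across a Hadoop cluster.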