KDnuggets Home » News » 2016 » Sep » Tutorials, Overviews » Automating Data Ingestion: 3 Important Parts ( 16:n33 )

Automating Data Ingestion: 3 Important Parts


In this day and age of ‘Big Data’, data ingestion has to be automated on some level. How best to automate it?



By Claudia Perlich, Dstillery.


In this day and age of ‘Big Data’, data ingestion has to be automated on some level - anything else is out of the question.

The more interesting question is how best to automate it, and which parts of the data preparation stages can be done during ingestion. I have a very strong opinion on wanting my data as ‘raw’ as possible. You should, for instance, NOT automate how to deal with missing data - I’d much rather know that a value was missing than have the system silently replace it. Likewise, I prefer that the highest granularity of information be maintained - consider for instance keeping the full URL of the webpage that a consumer went to vs. keeping only the hostname (less desirable, but OK) vs. keeping only some content category. From a privacy perspective there are good arguments against keeping the full URL - but tools like hashing can mitigate some of these concerns.
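As a sketch of that hashing idea: a salted hash keeps distinct URLs distinct (so full granularity survives) without storing the raw address in the clear. The salt value and function name here are illustrative assumptions, not a description of any particular production system.

```python
import hashlib

def hash_url(url: str, salt: str = "example-salt") -> str:
    """Hash a full URL so granularity is preserved (distinct URLs stay
    distinct) without keeping the raw address in the clear.
    The salt is an illustrative placeholder; in practice it would be
    a secret kept out of the data store."""
    return hashlib.sha256((salt + url).encode("utf-8")).hexdigest()

# Two different pages on the same host remain distinguishable,
# unlike a hostname-only representation.
a = hash_url("https://example.com/products/shoes")
b = hash_url("https://example.com/products/hats")
assert a != b
# Hashing is deterministic: the same URL always maps to the same token,
# so histories can still be aggregated per page.
assert a == hash_url("https://example.com/products/shoes")
```

Note that hashing trades human readability for privacy: you can still count, join, and model on the hashed tokens, but you can no longer eyeball which page a token refers to.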

So let’s talk about the how: There are 3 really important parts of the automation process:

  1. Flexibility in sampling if the full data stream is too large: if you are dealing with 50 billion events per day, just stuffing it all into a Hadoop system is nice, but it makes later manipulation tedious. Instead, it is great to have, in addition, a process that ‘fishes out’ events of specific interest. See some of the details in a recent blog we wrote on this.
  2. Annotation of histories on the fly: having event logs of everything is great, but for predictive modeling I usually need features that capture the entity’s history. Joining over billions of rows every time to create a history is impossible. So part of the ingestion process is an annotation process that appends vital historical information to each event.
  3. Having statistical tests that evaluate whether the properties of the incoming data flow are changing, and send alarms if, for instance, some data sources go temporarily dark. Some of this is covered here.
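A minimal sketch of the first part: always keep the events of specific interest, and deterministically downsample the rest by hashing the entity id, so that a given user’s events are either all kept or all dropped. The field names (`host`, `user_id`) and the hash-based scheme are illustrative assumptions, not the actual pipeline described above.

```python
import hashlib

def keep_event(event: dict, sample_rate: float = 0.01,
               interesting_hosts: frozenset = frozenset({"brand-site.example"})) -> bool:
    """Decide whether to keep an event from a high-volume stream.

    Events of specific interest are always kept ('fished out'); the
    rest are deterministically downsampled by hashing the user id, so
    each user's events are all kept or all dropped - which preserves
    complete per-user histories within the sample."""
    if event.get("host") in interesting_hosts:
        return True
    digest = hashlib.md5(event["user_id"].encode("utf-8")).hexdigest()
    return int(digest, 16) % 10_000 < sample_rate * 10_000
```

Hashing the user id (rather than flipping a random coin per event) is the key design choice: it makes the sample reproducible and keeps entity histories intact for the history-annotation step.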
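The second part can be sketched as a streaming annotator that carries per-entity state forward and appends it to each event as it arrives, so no join over billions of rows is ever needed. The specific features and field names here are hypothetical examples.

```python
from collections import defaultdict

class HistoryAnnotator:
    """Append simple per-entity historical features to each event at
    ingestion time, instead of joining the full event log later."""

    def __init__(self):
        self.counts = defaultdict(int)   # events seen so far per entity
        self.last_seen = {}              # last event timestamp per entity

    def annotate(self, event: dict) -> dict:
        uid = event["user_id"]
        # Attach history *before* updating state, so the features
        # reflect what was known just prior to this event.
        event["events_so_far"] = self.counts[uid]
        event["prev_timestamp"] = self.last_seen.get(uid)
        self.counts[uid] += 1
        self.last_seen[uid] = event["timestamp"]
        return event
```

In a real system this state would live in a key-value store rather than in memory, but the pattern is the same: one cheap lookup and update per event, instead of a massive retrospective join.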
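And a toy version of the third part: a z-score check on per-source event volume, which would flag a source going temporarily dark as a large drop relative to its recent history. Real monitoring would use richer distributional tests on the data’s properties, not just volume; the threshold and function name are assumptions.

```python
import statistics

def volume_alarm(history: list, current: int, z_threshold: float = 3.0) -> bool:
    """Alarm if the current event count from a source deviates from its
    recent history by more than z_threshold standard deviations.
    A source going dark shows up as a large negative deviation."""
    mean = statistics.mean(history)
    # Guard against zero variance (perfectly constant history).
    stdev = statistics.stdev(history) or 1.0
    return abs(current - mean) / stdev > z_threshold
```

The same skeleton extends naturally to other monitored properties - missing-value rates, category distributions, feature means - each with its own history and alarm threshold.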

Original. Reposted with permission.

Dstillery is a data analytics company that uses machine learning and predictive modeling to provide intelligent solutions for brand marketing and other business challenges. Drawing from a unique 360 degree view of digital, physical and offline activity, we generate insights and predictions about the behaviors of individuals and discrete populations.
