

Let’s Be Honest: We’re Drowning in Data


The fields of Big Data, Data Analytics/Science, and Data Integration need to face a new truth: we are drowning in data, and more so with every passing second.



By Jonathan Martini, CTO at PixelTitan

 

The rise of smart devices, new and ever more complex streams of information, and the proliferation of tools to catalog, process, and interpret all of it have brought us to a place where “keeping up” is in some cases the best we can hope for. Technology suffers from a problem of inheritance: as we innovate, we tend to introduce more complexity than is necessary. As systems evolve, those of us who help design them are constantly weighing options against the “necessity” of technical debt. To understand what options we have going forward, it may help to first briefly look back at the limitations and forces that shaped where we stand now.

 

How We Got Here

 
For roughly the first 30 years in the evolution of data, the world was structured. Relational databases were the norm, and there was a limited set of languages an engineer had to know in order to derive a result (SQL dominated). The tools needed to interact with databases were efficient and could count on mostly clean data. Reports in the form of dashboards were relatively easy to interpret and provided meaningful measures upon which to base decisions. These results became even more insightful with the rise of statistical analysis and data mining in the 1990s.

As the 1990s came to a close we saw the rise of the Internet, and with it a new kind of data: unstructured. The previous systems of structured content were still present, with their mature analysis tools and their comparatively easy-to-understand, clearly defined data types. But unstructured data required a new way of doing things, and it also opened new opportunities for insight to those who knew how to use it. Web use created a constant stream of media uploaded to social media sites. We saw the beginnings of a network of sensors and smart devices (which would come to be known as the Internet of Things), all generating more data by the day. A new way of processing and understanding all of it was required. Old systems and tools evolved into new ones to keep up, and those who could wield them became what we now know as Data Scientists. We also saw long-term storage begin to move from physical servers and data warehouses to cloud storage, shifting our industry from a data-center focus toward an increasingly diverse mix of cloud solutions and services.

Through the 2010s we continued to see exponential growth in unstructured data as mobile devices became more and more commonplace. With their adoption, geospatial data and device-generated behavioral data were added to the mix of unstructured data that had to be stored. Further, the growing adoption of “smart devices” only accelerated the creation of data and the need to store it.

 

The Current State

 
For many senior leaders who deal with technology infrastructure and/or services, a detailed view of how their systems are structured would run dozens of pages and include dependencies among multiple vendors and service providers. In most cases it would also include separate procedures and processes for each class of data they handle, and it would have to account for any regulations and laws that bear on that data. Further, these procedures require specialists at each step of the operational chain. Whether it’s analysts and data scientists cleaning data to derive valuable insights from it, or engineers integrating disparate systems so they can speak to each other, each role both contributes to and is constrained by the overall system it works within.

This system, as it has evolved, inherently leads to three major problems. First, data tends to sit idle unless someone or something explicitly needs it. Second, to manage these idle data sets, a company must hire more people to meet the demands its system generates, whether analysts and data scientists or engineers who design new systems to automate work currently done by hand. Each new system added to a workflow in turn requires secondary systems to feed the output of those automated processes back into the existing data flow. Third, both of these problems will continue to grow at a compounding rate for as long as the company’s data sources keep producing more and more information.
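
To make the compounding point concrete, here is a rough Python sketch of how idle data can pile up when data creation compounds while processing capacity grows only in fixed steps (for example, one more hire or tool per year). The function name `idle_backlog`, its parameters, and the numbers in the usage line are all hypothetical, chosen for illustration rather than taken from any real system.

```python
def idle_backlog(initial_volume, growth_rate, initial_capacity, capacity_step, years):
    """Return the unprocessed (idle) data volume at the end of each year.

    Assumptions (illustrative only):
    - data created per year compounds by `growth_rate` (0.5 means +50% a year)
    - processing capacity grows by a fixed `capacity_step` each year
    - anything not processed in a year carries over as idle backlog
    """
    backlog = 0.0
    created = float(initial_volume)
    capacity = float(initial_capacity)
    history = []
    for _ in range(years):
        # Backlog grows by whatever was created beyond what could be processed.
        backlog = max(0.0, backlog + created - capacity)
        history.append(backlog)
        created *= 1.0 + growth_rate   # compounding data creation
        capacity += capacity_step      # linear growth in people and tools
    return history


# Hypothetical numbers in arbitrary units (say, terabytes per year).
print(idle_backlog(initial_volume=100, growth_rate=0.5,
                   initial_capacity=100, capacity_step=20, years=4))
```

Even though capacity matches demand in the first year of this made-up scenario, the backlog keeps widening afterward, because adding capacity in fixed increments cannot keep pace with a volume that compounds.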

 

How It Could Be

 
If we could start from scratch, knowing what we now know about the current state of the world and how data is being created, would we continue to use the systems in place as they are now? What would a data flow need in order to be considered an “ideal” solution for the majority of large data creators and consumers?

At a high level we would need to adhere to 6 basic principles:

  1. Efficiency of data processing should drive the design
  2. The system should be data type agnostic (structured vs. unstructured, and also with regard to the language of processing)
  3. Human interaction should be kept as simple and minimal as possible
  4. Scalability and adaptability should drive architecture decisions
  5. Security and privacy should be the foundation of the entire system
  6. Financially, it has to cost less than the current status quo

If we envision a system designed around these 6 guiding principles, what would it look like for a modern enterprise, and what advantages would that enterprise gain? First and foremost, it would have the flexibility to realign its work groups into areas of the business where human creativity and input can bring greater value. This may lead to exploring new lines of business or services that would previously have been cost prohibitive, or simply impossible to execute given the constraints of the previous system. Further, re-imagining our approach to unstructured data may uncover brand new use cases for data we already possess but currently do nothing with.

From time to time, an objective look at both how and why we do things the way we do can lead to valuable insights. We stand at the end of a long evolution in computing, technology, and their integration into our daily lives. It remains to be seen how the future will unfold with regard to our use of the data we are constantly creating, though we do know 2 things with absolute certainty:

  1. We will continue to create data, and at a rate far higher than we do today
  2. Without a new approach to the systems that process this data, we will eventually be unable to see the full scope of what it is telling us

 
Bio: Jonathan Martini is CTO at PixelTitan. PixelTitan is a new startup whose mission is to solve the puzzle of unstructured data. Founded by a group of events, photography, and technology veterans, the company has unique patent-pending technology that allows it to process the definitions of data instead of having to work with the data itself. This means PixelTitan is not held back by file types or file size, and can deliver datasets at the volume necessary for AI.
