Data Science and Big Data, Explained
What is data science? What is big data? What do these terms mean and why is it important to find out? These are hot topics indeed, but they are often misunderstood. Further, the industries involved don’t have a universally agreed upon definition for either.
These are extremely important fields and concepts that are becoming increasingly critical. The world has never collected or stored as much data, or done so as quickly, as it does today. In addition, the variety and volume of data are growing at an alarming rate.
Why should you care about data science and big data? Data is analogous to gold in many ways. It is extraordinarily valuable and has many uses, but you often have to pan for it in order to realize its value.
Are these new fields? There are many debates as to whether data science is a new field. Many argue that similar practices have been used and branded as statistics, analytics, business intelligence, and so forth. In either case, data science is a very popular and prominent term used to describe many different data-related processes and techniques that will be discussed here. Big data, on the other hand, is relatively new in the sense that the amount of data collected and the associated challenges continue to require new and innovative hardware and techniques for handling it.
This article is meant to give the non-data scientist a solid overview of the many concepts and terms behind data science and big data. While related terms will be mentioned at a very high level, the reader is encouraged to explore the references and other resources for additional detail. Another post will follow as well that will explore related technologies, algorithms, and methodologies in much greater detail.
With that, let’s begin!
Data Science Defined
Data science is complex and involves many specific domains and skills, but the general definition is that data science encompasses all the ways in which information and knowledge are extracted from data.
Data is everywhere, and is found in huge and exponentially increasing quantities. Data science as a whole reflects the ways in which data is discovered, conditioned, extracted, compiled, processed, analyzed, interpreted, modeled, visualized, reported on, and presented regardless of the size of the data being processed. Big data (as defined soon) is a special application of data science.
Data science is a very complex field, which is largely due to the diversity and number of academic disciplines and technologies it draws upon. Data science incorporates mathematics, statistics, computer science and programming, statistical modeling, database technologies, signal processing, data modeling, artificial intelligence and learning, natural language processing, visualization, predictive analytics, and so on.
Data science is highly applicable to many fields including social media, medicine, security, health care, social sciences, biological sciences, engineering, defense, business, economics, finance, marketing, geolocation, and many more.
Big Data Defined
Big Data is essentially a special application of data science, in which the data sets are enormous and require overcoming logistical challenges to deal with them. The primary concern is efficiently capturing, storing, extracting, processing, and analyzing information from these enormous data sets.
Processing and analysis of these huge data sets is often not feasible or achievable due to physical and/or computational constraints. Special techniques and tools (e.g., software, algorithms, parallel programming, etc.) are therefore required.
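As a rough illustration of working within memory constraints, here is a minimal Python sketch that computes an average by streaming through the data in fixed-size chunks instead of loading everything at once. The function names (read_in_chunks, streaming_mean), the chunk size, and the generated readings are all made up for this example; real big data tools typically spread this kind of work across many machines.

```python
from typing import Iterable, Iterator, List


def read_in_chunks(values: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield successive lists of at most chunk_size values."""
    chunk: List[float] = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter, chunk
        yield chunk


def streaming_mean(values: Iterable[float], chunk_size: int = 1000) -> float:
    """Compute the mean while holding only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in read_in_chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("cannot compute the mean of an empty data set")
    return total / count


# A generator stands in for a data source too large to hold in memory.
readings = (float(i) for i in range(1, 1_000_001))
mean_reading = streaming_mean(readings, chunk_size=10_000)
```

Only running totals are kept between chunks, so memory use depends on the chunk size and not on the size of the data set.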
Big Data is the term that is used to encompass these large data sets, specialized techniques, and customized tools. It is often applied to large data sets in order to perform general data analysis and find trends, or to create predictive models.
You may be wondering why the term Big Data has become so buzzworthy. We’ve collected a lot of data of various types on a large variety of data storage mechanisms for a long time, right? Yes we have, but we’ve never before enjoyed such inexpensive data collection, storage capabilities, and computational power as we do today. Further, we’ve not previously had such easy access to inexpensive and capable raw data sensing technologies, instrumentation, and so forth, which lead to the generation of today’s massive data sets.
So where exactly does this data come from? Large amounts of data are gathered from mobile devices, remote sensing, geolocation, software applications, multimedia devices, radio-frequency identification readers, wireless sensor networks, and so on.
A primary component of big data is the so-called Three Vs (3Vs) model. This model represents the characteristics and challenges of big data as dealing with volume, variety, and velocity. Companies such as IBM include a fourth “V”, veracity, while Wikipedia also notes variability.
Big data essentially aims to solve the problem of dealing with enormous amounts of varying-quality data, often of many different types, that is being captured and processed sometimes at tremendous (real-time) speeds. No easy task to say the least!
So in summary, Big Data can be thought of as a relative term that applies to huge data sets that require an entity (person, company, etc.) to leverage specialized hardware, software, processing techniques, visualization, and database technologies in order to solve the problems associated with the 3Vs and similar characteristic models.
Types of Data and Data Sets
As mentioned earlier, data is collected in many different ways. The life cycle of usable data usually involves capture, pre-processing, storage, retrieval, post-processing, analysis, visualization, and so on.
Once captured, data is usually referred to as being structured, semi-structured, or unstructured. These distinctions are important because they’re directly related to the type of database technologies and storage required, the software and methods by which the data is queried and processed, and the complexity of dealing with the data.
Structured data refers to data that is stored as a model (or is defined by a structure or schema) in a relational database or spreadsheet. Often it’s easily queryable using SQL (Structured Query Language) since the “structure” of the data is known. A sales order record is a good example. Each sales order has a purchase date, items purchased, purchaser, total cost, etc.
Unstructured data is data that’s not defined by any schema, model, or structure, and is not organized in a specific way. In other words, it’s just stored raw data. Think of a seismometer (earthquakes are a big fear of mine by the way!). You’ve probably seen the squiggly lines captured by such a device, which essentially represent energy data as recorded at each seismometer location. The recorded signal (i.e., data) represents a varying amount of energy over time. There is no structure in this case; it’s just variations of energy represented by the signal.
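To make the sales order example concrete, here is a small sketch using Python’s built-in sqlite3 module. The table name, column names, and rows are invented for illustration, and the items purchased are left out to keep it short (they would normally live in a second table).

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")

# The schema is the "structure": every sales order has the same typed fields.
conn.execute(
    """
    CREATE TABLE sales_order (
        order_id      INTEGER PRIMARY KEY,
        purchase_date TEXT NOT NULL,
        purchaser     TEXT NOT NULL,
        total_cost    REAL NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO sales_order (purchase_date, purchaser, total_cost) VALUES (?, ?, ?)",
    [
        ("2015-01-05", "Alice", 120.50),
        ("2015-01-06", "Bob", 35.00),
        ("2015-01-09", "Alice", 19.50),
    ],
)

# Because the structure is known, a question like "how much has each
# purchaser spent?" is a single declarative query.
rows = conn.execute(
    """
    SELECT purchaser, COUNT(*), SUM(total_cost)
    FROM sales_order
    GROUP BY purchaser
    ORDER BY purchaser
    """
).fetchall()
conn.close()
```

Here rows holds one tuple per purchaser with the number of orders and the total amount spent.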
It follows naturally that semi-structured data is a combination of the two. It’s basically unstructured data that also has structured data (a.k.a. metadata) appended to it. Every time you use your smartphone to take a picture, the camera sensor captures light information as a bunch of binary data (i.e., ones and zeros). This data has no structure to it, but the camera also appends additional data that includes the date and time the photo was taken, the last time it was modified, the image size, etc. That’s the structured part. Data formats such as XML and JSON are also considered to be semi-structured data.
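The photo example can be sketched in Python with a JSON record. The field names and values below are hypothetical (real photos usually carry their metadata in formats such as EXIF), and the “pixels” are just eight placeholder bytes encoded as base64 so they fit in a JSON string.

```python
import base64
import json

# Hypothetical photo record: named metadata fields plus an opaque binary payload.
photo_record = """
{
  "taken_at": "2015-03-14T09:26:53",
  "modified_at": "2015-03-15T18:02:10",
  "width": 3264,
  "height": 2448,
  "pixels_base64": "AAECAwQFBgc="
}
"""

photo = json.loads(photo_record)

# The structured part: fields that can be read and queried by name.
metadata = {key: value for key, value in photo.items() if key != "pixels_base64"}
megapixels = photo["width"] * photo["height"] / 1_000_000

# The unstructured part: raw bytes with no schema of their own.
raw_pixels = base64.b64decode(photo["pixels_base64"])
```

The metadata can be filtered and sorted like any structured record, while the raw bytes only become meaningful once image-processing code interprets them.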
Data Mining, Description, Modeling, and Visualization
For data to be used in a meaningful way, it’s initially captured, pre-processed, and stored. After this process, the data can be mined, processed, described, analyzed, and used to build models that are both descriptive and predictive.
Descriptive statistics is a term used to describe the application of statistics to a data set in order to describe and summarize the information that the data contains. Basically it includes describing data in the context of a distribution that has a mean, median, mode, variance, standard deviation, and so on. Descriptive statistics also covers other forms of summary analysis and visualization.
Inferential statistics and data modeling, on the other hand, are very powerful tools that can be used to gain a deep understanding of the data, as well as to extrapolate (i.e., predict) meaning and results for conditions outside of those for which data has been collected. Using certain techniques, models can be created and decisions can be made dynamically based on the data involved.
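As a quick illustration, the measures named above can be computed with Python’s standard statistics module. The order totals are invented sample data.

```python
import statistics

# Made-up order totals for illustration.
order_totals = [12.0, 15.0, 15.0, 18.0, 20.0, 22.0, 38.0]

summary = {
    "mean": statistics.mean(order_totals),
    "median": statistics.median(order_totals),
    "mode": statistics.mode(order_totals),
    # variance() and stdev() treat the data as a sample (n - 1 denominator).
    "variance": statistics.variance(order_totals),
    "standard_deviation": statistics.stdev(order_totals),
}
```

Note how the single large value (38.0) pulls the mean above the median, which is exactly the kind of thing these summaries are meant to reveal.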
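One of the simplest predictive models is a straight line fitted by least squares. The sketch below fits such a line and then predicts a value outside the observed range. The variable names and numbers are invented, and a real analysis would also quantify the uncertainty of the prediction, which this sketch does not do.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: advertising spend versus sales, in arbitrary units.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(ad_spend, sales)

# Extrapolate to a spend level (8.0) that was never observed.
predicted_sales = slope * 8.0 + intercept
```

The further a prediction sits from the observed data, the less it should be trusted, since nothing guarantees that the straight-line relationship continues to hold.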
In addition to descriptive statistics and inferential statistics, another field called computational statistics (a subset of computational science) can often play a large role in data science and big data applications. Computational statistics involves leveraging computer science, statistics, and algorithms in order for computers to implement statistical methods. Many of these methods are utilized heavily in fields called predictive analytics or predictive modeling. Machine learning can be considered an application of certain algorithms in the context of predictive modeling.
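The bootstrap is one example of a method that only became practical with cheap computation: it estimates the uncertainty of a statistic by resampling the data many times. The sketch below is a minimal percentile bootstrap for the mean; the function name, sample values, number of resamples, and fixed seed are all choices made for this illustration.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_mean_interval(
    sample: Sequence[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Approximate a confidence interval for the mean via the percentile bootstrap."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    means = sorted(
        # Resample with replacement, keeping the original sample size.
        statistics.fmean(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    )
    lower_index = int((1 - confidence) / 2 * n_resamples)
    upper_index = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Made-up response times in milliseconds.
response_times = [120.0, 135.0, 128.0, 160.0, 110.0, 142.0, 131.0, 155.0, 125.0, 138.0]
low, high = bootstrap_mean_interval(response_times)
```

The computer does the heavy lifting here by repeating a simple calculation thousands of times, in place of a closed-form formula.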
Often data is also mined in order to be analyzed visually. Many people are able to understand data more quickly, more deeply, and in a more natural way through the strategic use of appropriate graphs, charts, diagrams, and tables. These methods of displaying information can be used to show both categorical and quantitative data. The application of these display types to represent data is known as data visualization.
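Even a plain-text bar chart shows the idea. The sketch below uses only the Python standard library; the function name and the counts are invented, and in practice a dedicated plotting library would be used.

```python
from typing import Dict


def text_bar_chart(counts: Dict[str, int], width: int = 40) -> str:
    """Render categorical counts as horizontal bars scaled to the largest count."""
    largest = max(counts.values())
    label_width = max(len(label) for label in counts)
    lines = []
    for label, count in counts.items():
        bar = "#" * round(count / largest * width)
        lines.append(f"{label.ljust(label_width)} | {bar} {count}")
    return "\n".join(lines)


# Hypothetical number of records received from each data source.
source_counts = {"mobile": 40, "web": 20, "sensor": 10}
chart = text_bar_chart(source_counts, width=20)
print(chart)
```

The relative sizes of the categories are visible at a glance, which is harder to see from the raw numbers alone.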
These techniques, methodologies, statistics, and visualization topics will be covered to a much greater extent in upcoming posts.
Data Management and Tools of the Trade
Data science and big data handling rely on many software and database technologies. Many databases are designed to adhere to the ACID principles: Atomicity, Consistency, Isolation, and Durability.
Let’s begin by discussing database technologies. Relational database management systems (RDBMS) have been the most widely used type of database management system (DBMS) since the 1980s. They are generally very good for transaction-based operations and for adhering to the ACID principles.
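Atomicity, the “A” in ACID, means that a transaction either happens completely or not at all. Here is a minimal sketch using Python’s built-in sqlite3 module. The account table, the transfer function, and the balances are hypothetical; the credit is deliberately applied before the debit so that a failed debit has something to undo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn: sqlite3.Connection, source: str, target: str, amount: int) -> bool:
    """Move amount between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was rolled back too.
        return False


failed = transfer(conn, "alice", "bob", 500)  # alice cannot cover this
after_failed = dict(conn.execute("SELECT name, balance FROM account"))

succeeded = transfer(conn, "alice", "bob", 30)
after_success = dict(conn.execute("SELECT name, balance FROM account"))
conn.close()
```

After the failed transfer both balances are unchanged, even though the credit to bob had already been executed inside the transaction.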
The downside to relational systems is that these databases are relatively static and biased heavily towards structured data, represent data in non-intuitive and non-natural ways, and can incur significant processing overhead, which hurts performance. Another downside is that table-based data does not usually represent the actual data (i.e., domain/business objects) very well. This is known as the object-relational impedance mismatch, and it requires a mapping between the table-based data and the actual objects of the problem domain. Relational database management systems include Microsoft SQL Server, Oracle, MySQL, and so on.
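The mismatch is easier to see in code. In the sketch below, a sales order that the application treats as one object with a nested list of items is spread across two tables, so a hand-written mapping function has to reassemble it. The classes, tables, and load_order function are hypothetical; object-relational mapping libraries exist to automate this kind of work.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import List


@dataclass
class OrderItem:
    product: str
    quantity: int


@dataclass
class SalesOrder:
    order_id: int
    purchaser: str
    items: List[OrderItem] = field(default_factory=list)


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales_order (order_id INTEGER PRIMARY KEY, purchaser TEXT);
    CREATE TABLE order_item (order_id INTEGER, product TEXT, quantity INTEGER);
    INSERT INTO sales_order VALUES (1, 'Alice');
    INSERT INTO order_item VALUES (1, 'keyboard', 1), (1, 'cable', 3);
    """
)


def load_order(conn: sqlite3.Connection, order_id: int) -> SalesOrder:
    """Rebuild one domain object from rows stored in two tables."""
    row = conn.execute(
        "SELECT order_id, purchaser FROM sales_order WHERE order_id = ?",
        (order_id,),
    ).fetchone()
    if row is None:
        raise KeyError(order_id)
    order = SalesOrder(order_id=row[0], purchaser=row[1])
    item_rows = conn.execute(
        "SELECT product, quantity FROM order_item WHERE order_id = ? ORDER BY rowid",
        (order_id,),
    )
    for product, quantity in item_rows:
        order.items.append(OrderItem(product, quantity))
    return order


order = load_order(conn, 1)
conn.close()
```

Every additional nested structure in the domain object means another table and another piece of mapping code, which is where the overhead comes from.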
NoSQL database technologies have become very trendy these days, and for good reason. NoSQL is a term used to describe database systems that are non-relational, highly scalable, allow dynamic schemas, and handle large volumes of data access with high frequency. They also represent data in a more natural way, can easily deal with the three types of data mentioned earlier, and are often very performant.
NoSQL databases are therefore largely used for high-scale transactions. NoSQL database systems include MongoDB, Redis, Cassandra, and CouchDB to name a few. Note that there are multiple types of NoSQL databases, which include document, graph, key-value, and wide-column.
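To give a feel for what a dynamic schema means, the sketch below uses a plain Python dictionary as a stand-in for a document collection. It is not the API of any real NoSQL database; the events collection, the helper functions, and the documents are all invented for illustration.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for a document collection: document id -> document.
events: Dict[str, Dict[str, Any]] = {}


def insert_document(doc_id: str, document: Dict[str, Any]) -> None:
    """Store a document as-is; no schema is declared or enforced."""
    events[doc_id] = document


def find_ids(predicate: Callable[[Dict[str, Any]], bool]) -> List[str]:
    """Return the ids of all documents matching the predicate."""
    return [doc_id for doc_id, document in events.items() if predicate(document)]


# Documents in the same collection can have different fields and nesting.
insert_document("e1", {"type": "photo", "width": 3264, "height": 2448})
insert_document("e2", {"type": "checkin", "location": {"lat": 37.77, "lon": -122.42}})
insert_document("e3", {"type": "photo", "width": 640, "height": 480, "tags": ["cat"]})

photo_ids = find_ids(lambda doc: doc.get("type") == "photo")
tagged_ids = find_ids(lambda doc: "tags" in doc)
```

The flexibility comes at a price: because nothing enforces a common structure, the querying code has to cope with fields that may be missing.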
NewSQL is a relatively new type of database management system. These systems try to blend the best characteristics (e.g., ACID) and querying language (i.e., SQL) of relational database management systems with the highly scalable performance of NoSQL databases. The jury is still out on NewSQL as to whether it will garner enough popularity to gain adoption and traction like relational and NoSQL databases have.
Practitioners of Big Data have seen the creation and proliferation of specific technologies needed for high-scale data storage, processing capabilities, and analytics of enormous amounts of data. The most popular systems include Apache Hadoop and the commercial distributions built around it from Cloudera, Hortonworks, and MapR. There are many others trying to compete in this space as well.
For statistical and algorithmic-based data processing and visualization, R, Python, and MATLAB are some popular choices.
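Hadoop is closely associated with the MapReduce programming model, in which work is split into a map step, a grouping (shuffle) step, and a reduce step. The sketch below runs a word count on a single machine to show the shape of the idea. It does not use Hadoop’s actual API, and the function names and documents are made up; in a real cluster the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(document: str) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Combine each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


documents = ["big data is big", "data science uses data"]

# Each document could be mapped independently, which is what allows parallelism.
mapped = [pair for document in documents for pair in map_phase(document)]
word_counts = reduce_phase(shuffle_phase(mapped))
```

Because each document is mapped on its own and each word is reduced on its own, the work can be divided among as many machines as are available.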
We have never before collected as much varying data as we do today, nor have we needed to handle it as quickly. The variety and amount of data that we collect through many different mechanisms is growing exponentially. This growth requires new strategies and techniques by which the data is captured, stored, processed, analyzed, and visualized.
Data science is an umbrella term that encompasses all of the techniques and tools used during the life cycle stages of useful data. Big data on the other hand typically refers to extremely large data sets that require specialized and often innovative technologies and techniques in order to efficiently “use” the data.
Both of these fields are going to get bigger and become much more important with time. The demand for qualified practitioners in both fields is growing at a rapid pace, and they are becoming some of the hottest and most lucrative fields to work in.
Hopefully this article has provided a relatively simple explanation of the major concepts involved with data science and big data. Armed with this knowledge, you should be better able to understand what the latest industry headlines mean, or at least not feel completely out of the loop in a discussion on either topic.
Original. Reposted with permission.