The What, Where and How of Data for Data Science
Here we will take data science apart and build it back up to a coherent and manageable concept. Bear with us!
Data science is a term that escapes any single complete definition, which makes it difficult to use, especially if the goal is to use it correctly. Most articles and publications use the term freely, on the assumption that it is universally understood. However, data science, including its methods, goals, and applications, evolves with time and technology. Twenty-five years ago, data science referred to gathering and cleaning datasets and then applying statistical methods to that data. In 2018, data science has grown into a field that encompasses data analysis, predictive analytics, data mining, business intelligence, machine learning, and so much more.
In fact, because no one definition fits the bill seamlessly, it is up to those who do data science to define it.
Recognizing the need for a clear-cut explanation of data science, the 365 Data Science Team designed the What-Where-Who infographic. In it, we define the key processes in data science and dissect the field. Here is our interpretation of data science.
Of course, this might look like a lot of overwhelming information, but it really isn’t.
Data science, 'explained in under a minute', looks like this.
You have data. To use this data to inform your decision-making, it needs to be relevant, well-organized, and preferably digital. Once your data is coherent, you proceed with analyzing it, creating dashboards and reports to understand your business’s performance better. Then you set your sights on the future and start generating predictive analytics. With predictive analytics, you assess potential future scenarios and predict consumer behavior in creative ways.
But let’s begin at the beginning.
The Data in Data Science
Before anything else, there is always data. Data is the foundation of data science; it is the material on which all the analyses are based. In the context of data science, there are two types of data: traditional, and big data.
Traditional data is data that is structured and stored in databases which analysts can manage from one computer; it is in table format, containing numeric or text values. Actually, the term “traditional” is something we are introducing for clarity. It helps emphasize the distinction between big data and other types of data.
Big data, on the other hand, is… bigger than traditional data, and not in the trivial sense. From variety (numbers, text, but also images, audio, mobile data, etc.), to velocity (retrieved and computed in real time), to volume (measured in tera-, peta-, exa-bytes), big data is usually distributed across a network of computers.
That said, let’s define the What, the Where, and the Who that characterize each type of data in data science.
What do you do to data in data science?
Traditional data in Data Science
Traditional data is stored in relational database management systems.
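To make the idea of tabular, single-machine data concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers table, its columns, and its rows are hypothetical and exist only for illustration; a production system would typically use a full relational database management system.

```python
import sqlite3

# An in-memory database stands in for a real relational database here.
conn = sqlite3.connect(":memory:")

# Hypothetical table: every column holds either text or numeric values.
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT, city TEXT, total_spent REAL)"
)
conn.executemany(
    "INSERT INTO customers (name, city, total_spent) VALUES (?, ?, ?)",
    [("Ann", "Sofia", 120.0), ("Bo", "Varna", 75.5), ("Cy", "Sofia", 40.0)],
)

# Structured data can be queried directly: customers and spend per city.
rows = conn.execute(
    "SELECT city, COUNT(*), SUM(total_spent) "
    "FROM customers GROUP BY city ORDER BY city"
).fetchall()
conn.close()

print(rows)
```

Everything lives in one table on one machine, which is the sense in which we call this data traditional.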
Before it is ready for processing, however, all data goes through pre-processing. This is a necessary group of operations that converts raw data into a format that is more understandable and hence more useful for further processing. Common steps are:
- Collect raw data and store it on a server
This is untouched data that scientists cannot analyze straight away. This data can come from surveys, or through the more popular automatic data collection paradigm, like cookies on a website.
- Class-label the observations
This consists of arranging data by category or labelling each data point with the correct data type, for example, numerical or categorical.
- Data cleansing / data scrubbing
Dealing with inconsistent data, like misspelled categories and missing values.
- Data balancing
If the categories contain unequal numbers of observations, the data is unbalanced and may not be representative. Data balancing methods, like extracting an equal number of observations for each category before processing, fix the issue.
- Data shuffling
Re-arranging data points to eliminate unwanted patterns and improve predictive performance later on. This applies when, for example, the first 100 observations in the data come from the first 100 people who used a website: the data isn’t randomized, and patterns due to sampling emerge.
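The cleansing, balancing, and shuffling steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the "plan" category, the KNOWN_FIXES spelling table, and the choice to drop rows with missing values are all hypothetical, and real projects may prefer other strategies (for example, imputing missing values instead of dropping them).

```python
import random
from collections import defaultdict

# Hypothetical raw survey records; None marks a missing value.
raw = [
    {"age": 34, "plan": "basic"},
    {"age": None, "plan": "premium"},
    {"age": 41, "plan": "Premum"},  # misspelled category
    {"age": 29, "plan": "basic"},
    {"age": 52, "plan": "basic"},
    {"age": 38, "plan": "premium"},
]

KNOWN_FIXES = {"Premum": "premium"}  # assumed spelling corrections


def cleanse(records):
    """Drop rows with missing values and fix known misspellings."""
    cleaned = []
    for row in records:
        if any(value is None for value in row.values()):
            continue
        plan = KNOWN_FIXES.get(row["plan"], row["plan"])
        cleaned.append({**row, "plan": plan})
    return cleaned


def balance(records, key, rng):
    """Undersample so every category keeps as many rows as the smallest one."""
    groups = defaultdict(list)
    for row in records:
        groups[row[key]].append(row)
    size = min(len(rows) for rows in groups.values())
    balanced = []
    for rows in groups.values():
        balanced.extend(rng.sample(rows, size))
    return balanced


def shuffle(records, rng):
    """Return a randomly re-ordered copy of the records."""
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(0)  # fixed seed so the sketch is repeatable
cleaned = cleanse(raw)
balanced = balance(cleaned, "plan", rng)
prepared = shuffle(balanced, rng)
```

After these steps, prepared holds the same number of "basic" and "premium" observations in random order.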
Big Data in Data Science
When it comes to big data and data science, there is some overlap of the approaches used in traditional data handling, but there are also a lot of differences.
First of all, big data is stored on many servers and is far more complex.
In order to do data science with big data, pre-processing is even more crucial, as the data is a lot more complex. You will notice that conceptually, some of the steps are similar to traditional data pre-processing, but that’s inherent to working with data.
- Collect the data
- Class-label the data
Keep in mind that big data is extremely varied; therefore, instead of ‘numerical’ vs ‘categorical’, the labels are ‘text’, ‘digital image data’, ‘digital video data’, ‘digital audio data’, and so on.
- Data cleansing
The methods here are massively varied, too; for example, you can verify that a digital image observation is ready for processing; or a digital video, or…
- Data masking
When collecting data on a mass scale, this aims to ensure that any confidential information in the data remains private, without hindering the analysis and extraction of insight. The process involves concealing the original data with random and false data, allowing the scientist to conduct their analyses without compromising private details. Naturally, the scientist can do this to traditional data too, and sometimes does, but with big data the information can be much more sensitive, which makes masking a lot more urgent.
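As a rough illustration of data masking, the sketch below replaces confidential values with random stand-ins while leaving the fields needed for analysis untouched. The records, the SENSITIVE_FIELDS choice, and the "masked_" token format are assumptions made for this example; real masking schemes are usually more elaborate and must also protect or discard the mapping between original and masked values.

```python
import secrets

# Hypothetical records; "name" and "email" are treated as confidential.
records = [
    {"name": "Ann Lee", "email": "ann@example.com", "purchase": 120.0},
    {"name": "Bo Chen", "email": "bo@example.com", "purchase": 75.5},
    {"name": "Ann Lee", "email": "ann@example.com", "purchase": 40.0},
]

SENSITIVE_FIELDS = ("name", "email")  # assumed choice of fields to mask


def mask(rows, fields):
    """Replace sensitive values with random tokens, consistently per value."""
    replacements = {}  # original value -> random stand-in
    masked = []
    for row in rows:
        new_row = dict(row)
        for field in fields:
            original = row[field]
            if original not in replacements:
                replacements[original] = "masked_" + secrets.token_hex(8)
            new_row[field] = replacements[original]
        masked.append(new_row)
    return masked


masked_records = mask(records, SENSITIVE_FIELDS)
```

Because the same original value always maps to the same token, an analyst can still see that the first and third purchases belong to one customer without learning who that customer is.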
Where does data come from?
Traditional data may come from basic customer records, or historical stock price information.
Big data, however, is all around us. A consistently growing number of companies and industries use and generate big data. Consider online communities, for example, Facebook, Google, and LinkedIn; or financial trading data. Temperature measuring grids in various geographical locations also amount to big data, as well as machine data from sensors in industrial equipment. And, of course, wearable tech.
Who handles the data?
The data specialists who deal with raw data and pre-processing, and with creating and maintaining databases, go by different names. Although their titles sound similar, there are palpable differences in the roles they occupy. Consider the following.
Data Architects and Data Engineers (and Big Data Architects and Big Data Engineers, respectively) are crucial in the data science market. The data architect creates the database from scratch and designs the way data will be retrieved, processed, and consumed. The data engineer then uses the data architect’s work as a stepping stone and processes (pre-processes) the available data. Data engineers are the people who ensure the data is clean, organized, and ready for the analysts to take over.
The Database Administrator, on the other hand, is the person who controls the flow of data into and from the database. Of course, with Big Data almost the entirety of this process is automated, so there is no real need for a human administrator. The Database Administrator deals mostly with traditional data.
That said, once data processing is done, and the databases are clean and organized, the real data science begins.
There are two ways of looking at data: with the intent to explain behavior that has already occurred and for which you have gathered data, or with the intent to use the data you already have to predict behavior that has not yet happened.
Data Science explaining the past
Before data science jumps into predictive analytics, it must look at the patterns of behavior the past provides, analyze them to draw insight and inform the path for forecasting. Business intelligence focuses precisely on this: providing data-driven answers to questions like: How many units were sold? In which region were the most goods sold? Which type of goods sold where? How did the email marketing perform last quarter in terms of click-through rates and revenue generated? How does that compare to the performance in the same quarter of last year?
Although Business Intelligence does not have “data science” in its title, it is part of data science, and not in any trivial sense.
What does Business Intelligence do?
Of course, Business Intelligence Analysts can apply Data Science to measure business performance. But in order for the Business Intelligence Analyst to achieve that, they must employ specific data handling techniques.
The starting point of all data science is data. Once the relevant data is in the hands of the BI Analyst (monthly revenue, customer, sales volume, etc.), they must quantify the observations, calculate KPIs and examine measures to extract insights from their data.
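To make "quantify the observations and calculate KPIs" a little more tangible, here is a small sketch that answers a few of the questions raised earlier, such as how many units were sold and in which region most goods were sold. All figures, field names, and the chosen KPIs are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records for one quarter.
sales = [
    {"region": "North", "units": 120, "revenue": 2400.0},
    {"region": "South", "units": 80, "revenue": 1760.0},
    {"region": "North", "units": 60, "revenue": 1200.0},
    {"region": "West", "units": 150, "revenue": 2850.0},
]

# How many units were sold?
total_units = sum(row["units"] for row in sales)

# In which region were the most goods sold?
units_by_region = defaultdict(int)
for row in sales:
    units_by_region[row["region"]] += row["units"]
top_region = max(units_by_region, key=units_by_region.get)

# A simple KPI: average revenue per unit sold.
total_revenue = sum(row["revenue"] for row in sales)
revenue_per_unit = total_revenue / total_units

# Hypothetical email campaign figures for a click-through rate.
emails_delivered = 5000
emails_clicked = 350
click_through_rate = emails_clicked / emails_delivered

print(total_units, top_region, round(revenue_per_unit, 2), click_through_rate)
```

In practice a BI analyst would more likely run such aggregations in SQL or a BI tool, but the logic is the same.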
Data Science is about telling a story
Apart from handling strictly numerical information, data science, and specifically business intelligence, is about visualizing the findings and creating easily digestible images supported only by the most relevant numbers. After all, all levels of management should be able to understand the insights from the data and use them to inform their decision-making.
Business intelligence analysts create dashboards and reports, accompanied by graphs, diagrams, maps, and other comparable visualizations to present the findings relevant to the current business objectives.
Where is business intelligence used?
Price optimization and data science
Notably, analysts apply data science to inform things like price optimization techniques. They extract the relevant information in real time, compare it with historical data, and take action accordingly. Consider hotel management behavior: management raises room prices during periods when many people want to visit the hotel and reduces them when the goal is to attract visitors in periods of low demand.
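The hotel example can be reduced to a toy pricing rule: compare current demand with the historical average and move the price in the same direction. The function below is a simplified illustration, not an established pricing method; adjust_price, sensitivity, and max_change are hypothetical names and parameters chosen for this sketch.

```python
def adjust_price(base_price, current_bookings, historical_bookings,
                 sensitivity=0.5, max_change=0.3):
    """Scale the base price by how far demand deviates from its historical average.

    sensitivity and max_change are assumed tuning knobs: the first sets how
    strongly price reacts to demand, the second caps the adjustment.
    """
    historical_avg = sum(historical_bookings) / len(historical_bookings)
    deviation = (current_bookings - historical_avg) / historical_avg
    change = max(-max_change, min(max_change, sensitivity * deviation))
    return round(base_price * (1 + change), 2)


history = [50, 60, 70]  # hypothetical bookings for the same week in past years

high_season_price = adjust_price(100.0, 90, history)  # demand above average
low_season_price = adjust_price(100.0, 30, history)   # demand below average
```

With these made-up numbers, the price rises when bookings exceed the historical average and falls when they lag behind it, which is the behavior described above.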
Inventory management and data science
Data science and business intelligence are invaluable for handling over- and undersupply. In-depth analyses of past sales transactions identify seasonality patterns and the times of the year with the highest sales, which supports inventory management techniques that meet demand at minimum cost.
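A very basic way to spot seasonality in past sales is to average the units sold per calendar month across years and look for the peak. The transactions below are invented for illustration, and this sketch ignores trends and other effects that a real analysis would account for.

```python
from collections import defaultdict
from datetime import date

# Hypothetical past sales transactions: (date of sale, units sold).
transactions = [
    (date(2016, 12, 5), 40),
    (date(2016, 12, 20), 55),
    (date(2017, 3, 9), 9),
    (date(2017, 3, 14), 12),
    (date(2017, 7, 2), 18),
    (date(2017, 12, 11), 61),
    (date(2017, 12, 23), 47),
]

units_by_month = defaultdict(int)   # calendar month -> total units
years_by_month = defaultdict(set)   # calendar month -> years observed
for day, units in transactions:
    units_by_month[day.month] += units
    years_by_month[day.month].add(day.year)

# Average units per calendar month across the years in which it was observed.
avg_units_by_month = {
    month: units_by_month[month] / len(years_by_month[month])
    for month in units_by_month
}

peak_month = max(avg_units_by_month, key=avg_units_by_month.get)
print(peak_month, avg_units_by_month[peak_month])
```

Knowing the peak month, a business can plan to stock up ahead of it and hold less inventory in slower months.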
Who does the BI branch of data science?
A BI analyst focuses primarily on analyses and reporting of historical data.
The BI consultant is often just an ‘external BI analyst’. Many companies outsource their data science departments as they don’t need or want to maintain one. BI consultants would be BI analysts had they been employed in-house; however, their job is more varied as they hop on and off different projects. The dynamic nature of their role provides the BI consultant with a different perspective: whereas the BI analyst has highly specialized knowledge (i.e., depth), the BI consultant contributes to the breadth of data science.
The BI developer is the person who handles more advanced programming tools, such as Python and SQL, to create analyses specifically designed for the company. It is the third most frequently encountered job position in the BI team.
So, is this all data science is?
Data science is a slippery term that encompasses everything from handling data, traditional or big, to explaining patterns and predicting behavior.
Data science is done through traditional methods like regression and cluster analysis or through unorthodox machine learning techniques.
We discuss data science forecasting methods in the second part of this article. Hopefully, once you read that, all pieces of the data science puzzle will fit together well!
Bio: Iliya Valchanov is a Co-founder at 365 Data Science.
Parts of this blog were published on 365 Data Science Blog. Reposted with permission.