The Great Big Data Science Glossary
To help those new to the field stay on top of industry jargon and terminology, we've put together this glossary of data science terms.
By Wolf Howard, Dataquest
Getting started in data science can be overwhelming, especially when you consider the variety of concepts and techniques a data scientist needs to master in order to do her job effectively. Even the term "data science" can be somewhat nebulous, and as the field gains popularity it seems to lose definition.
We hope this glossary will serve as your handy quick reference whenever you're working on a project, or reading an article and find you can't quite remember what "ETL" means.
Are we missing a term? Get in touch.
These are some baseline concepts that are helpful to grasp when getting started in data science. While you probably won't have to work with every concept mentioned here, knowing what the terms mean will help when reading articles or discussing topics with fellow data lovers.
An algorithm is a set of instructions we give a computer so it can take values and manipulate them into a usable form. This can be as easy as finding and removing every comma in a paragraph, or as complex as building an equation that predicts how many home runs a baseball player will hit in 2018.
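To make the simple end of that range concrete, here is a minimal sketch of the comma-removal example in Python. The function name and the sample paragraph are made up for illustration.

```python
def remove_commas(text: str) -> str:
    """Return a copy of text with every comma removed."""
    return text.replace(",", "")


# Hypothetical sample paragraph.
paragraph = "First, we load the data, then we clean it, and finally, we plot it."
print(remove_commas(paragraph))
```

The instructions are tiny (find each comma, drop it), but they are still an algorithm: values go in, and a more usable form comes out.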
The back end is all of the code and technology that works behind the scenes to populate the front end with useful information. This includes databases, servers, authentication procedures, and much more. You can think of the back end as the frame, the plumbing, and the wiring of an apartment.
Big data is a term that suffers from being too broad to be useful. It’s more helpful to read it as, “so much data that you need to take careful steps to avoid week-long script runtimes.” Big data is more about strategies and tools that help computers do complex analysis of very large (read: 1+ TB) data sets. The problems we must address with big data are categorized by the 4 V's: volume, variety, veracity, and velocity.
Classification is a supervised machine learning problem. It deals with categorizing a data point based on its similarity to other data points. You take a set of data where every item already has a category and look at common traits between each item. You then use those common traits as a guide for what category the new item might have.
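One simple way to act on that idea is a nearest-neighbors vote: find the labeled items most similar to the new one and pick the most common category among them. The sketch below is only one of many classification techniques, and the animals, measurements, and function name are hypothetical.

```python
import math
from collections import Counter


def classify(new_point, labeled_points, k=3):
    """Predict a category by majority vote among the k closest labeled points."""
    by_distance = sorted(
        labeled_points, key=lambda item: math.dist(item[0], new_point)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical data: (height_cm, weight_kg) paired with a known category.
labeled = [
    ((20, 4), "cat"), ((25, 5), "cat"), ((23, 4.5), "cat"),
    ((60, 30), "dog"), ((55, 25), "dog"), ((65, 35), "dog"),
]

print(classify((24, 5), labeled))
print(classify((58, 28), labeled))
```

Here "similarity" is plain straight-line distance between measurements, which is an assumption that suits this toy data and not a general rule.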
As simply as possible, a database is a storage space for data. We mostly use databases with a Database Management System (DBMS), like PostgreSQL or MySQL. These are computer applications that allow us to interact with a database to collect and analyze the information inside.
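As a rough illustration of talking to a database through a DBMS, the sketch below uses SQLite via Python's built-in sqlite3 module. SQLite is swapped in here only because it needs no server; PostgreSQL and MySQL need their own drivers and connection settings. The table and its contents are hypothetical.

```python
import sqlite3

# An in-memory database that disappears when the connection closes.
conn = sqlite3.connect(":memory:")

# Collect: create a table and store a few hypothetical rows.
conn.execute("CREATE TABLE players (name TEXT, home_runs INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?)",
    [("Ada", 31), ("Grace", 44), ("Alan", 12)],
)

# Analyze: ask the DBMS for players with more than 20 home runs.
rows = conn.execute(
    "SELECT name, home_runs FROM players "
    "WHERE home_runs > 20 ORDER BY home_runs DESC"
).fetchall()
print(rows)

conn.close()
```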
A data warehouse is a system used to do quick analysis of business trends using data from many sources. Data warehouses are designed to make it easy for people to answer important statistical questions without a Ph.D. in database architecture.
The front end is everything a client or user gets to see and interact with directly. This includes data dashboards, web pages, and forms.
Fuzzy algorithms are algorithms that use fuzzy logic to decrease the runtime of a script. They tend to be less precise than algorithms that use Boolean logic, but they also tend to be faster, and the gain in computational speed sometimes outweighs the loss in precision.
Fuzzy logic is an abstraction of Boolean logic that replaces the usual True and False with a range of values between 0 and 1. That is, fuzzy logic allows statements like "a little true" or "mostly false."
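A small sketch may help show what "a little true" looks like in code. It uses one common set of fuzzy operators (minimum for AND, maximum for OR, one minus the value for NOT); other definitions exist, and the truth values below are made up.

```python
def fuzzy_and(a: float, b: float) -> float:
    """Both statements hold only as much as the weaker one does."""
    return min(a, b)


def fuzzy_or(a: float, b: float) -> float:
    """Either statement holds as much as the stronger one does."""
    return max(a, b)


def fuzzy_not(a: float) -> float:
    """The more true a statement is, the less true its opposite."""
    return 1.0 - a


# Hypothetical truth values: "it is warm" is mostly true,
# "it is humid" is only a little true.
warm = 0.75
humid = 0.25

print(fuzzy_and(warm, humid))  # how true is "warm and humid"?
print(fuzzy_or(warm, humid))   # how true is "warm or humid"?
print(fuzzy_not(warm))         # how true is "not warm"?
```

When every value is exactly 0 or 1, these operators behave just like ordinary Boolean AND, OR, and NOT.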
A greedy algorithm breaks a problem down into a series of steps and picks the best available option at each step, in the hope that these local choices add up to the best overall solution. That outcome is not guaranteed for every problem. A good example is Dijkstra's algorithm, which finds the shortest path in a graph whose edge weights are non-negative.
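For the curious, here is a compact sketch of Dijkstra's algorithm using Python's built-in heapq module. The graph, its node names, and its edge weights are hypothetical, and the sketch assumes no edge has a negative weight.

```python
import heapq


def dijkstra(graph, start):
    """Return the shortest distance from start to every reachable node."""
    distances = {start: 0}
    queue = [(0, start)]  # (distance so far, node)
    visited = set()
    while queue:
        # Greedy step: always expand the closest node not yet finalized.
        dist, node = heapq.heappop(queue)
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            new_dist = dist + weight
            if new_dist < distances.get(neighbor, float("inf")):
                distances[neighbor] = new_dist
                heapq.heappush(queue, (new_dist, neighbor))
    return distances


# Hypothetical graph: each node maps to its neighbors and edge weights.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 2, "D": 5},
    "C": {"D": 1},
    "D": {},
}
print(dijkstra(graph, "A"))
```

In this graph the direct edge from A to C costs 4, but going through B costs only 3, so the algorithm settles on the cheaper route.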
Machine learning is a process where a computer uses an algorithm to gain understanding about a set of data, then makes predictions based on its understanding. There are many types of machine learning techniques; most are classified as either supervised or unsupervised techniques.
Overfitting happens when a model considers too much information, so that it picks up on the quirks of its training data and struggles with anything new. It’s like asking a person to read a sentence while looking at a page through a microscope. The patterns that enable understanding get lost in the noise.
Regression is another supervised machine learning problem. It focuses on how a target value changes as other values within a data set change. Regression problems generally deal with continuous variables, like how square footage and location affect the price of a house.
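As a minimal sketch of the house-price idea, the code below fits a straight line by ordinary least squares using a single input, square footage. The numbers are invented and perfectly linear so the result is easy to check by hand; real data would be noisier and would usually involve more inputs, such as location.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: price = 150 per square foot + 50,000.
square_feet = [1000, 1500, 2000, 2500]
prices = [200_000, 275_000, 350_000, 425_000]

slope, intercept = fit_line(square_feet, prices)
predicted = slope * 1800 + intercept  # estimate for an 1,800 sq ft house
print(slope, intercept, predicted)
```

The slope is the part that answers the regression question: it estimates how much the price changes for each extra square foot.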
Statistic vs. Statistics
Statistics (plural) is the entire set of tools and methods used to analyze a set of data. A statistic (singular) is a value that we calculate or infer from data. We get the median (a statistic) of a set of numbers by using techniques from the field of statistics.
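Python's built-in statistics module happens to mirror this distinction nicely: the module is the toolbox, and each value it returns is a statistic. The numbers below are arbitrary.

```python
import statistics

# Arbitrary sample values.
values = [3, 1, 4, 1, 5, 9, 2, 6]

# Each result is a single statistic computed from the data.
median = statistics.median(values)
mean = statistics.mean(values)
print(median, mean)
```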
Training and Testing
This is part of the machine learning workflow. When making a predictive model, you first offer it a set of training data so it can build understanding. Then you pass the model a test set, where it applies its understanding and tries to predict a target value.
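The workflow can be sketched in a few lines. Everything here is hypothetical: the helper name split_rows, the 25% test fraction, the fixed seed, and the study-hours data, which follows an exact pattern so the result is predictable.

```python
import random


def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (training, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data: (hours_studied, exam_score), where score = 10 * hours.
rows = [(hours, 10 * hours) for hours in range(1, 9)]
training, test = split_rows(rows)

# Training: estimate a single slope using the training rows only.
slope = sum(x * y for x, y in training) / sum(x * x for x, _ in training)

# Testing: predict the target for rows the model has never seen.
errors = [abs(slope * x - y) for x, y in test]
print(len(training), len(test), max(errors))
```

Keeping the test rows out of the training step is the whole point: it shows how the model handles data it has not already seen.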
Underfitting happens when you don’t offer a model enough information. An example of underfitting would be asking someone to graph the change in temperature over a day and only giving them the high and low. Instead of the smooth curve one might expect, you only have enough information to draw a straight line.
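To see underfitting next to its opposite, overfitting, the toy comparison below scores three models on the same made-up data. The data, the three models, and the error measure are all assumptions chosen for illustration: the true pattern is y = 2x, and the training values carry a little noise.

```python
def mean_abs_error(model, xs, ys):
    """Average absolute gap between predictions and true values."""
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: training values have alternating +1 / -1 noise.
train_x = [0, 2, 4, 6, 8, 10]
train_y = [1, 3, 9, 11, 17, 19]
test_x = [0.4, 2.4, 4.4, 6.4, 8.4]
test_y = [2 * x for x in test_x]

# Underfit: ignores x and always predicts the training average.
average = sum(train_y) / len(train_y)

def constant_model(x):
    return average

# Reasonable fit: a least-squares straight line.
mean_x = sum(train_x) / len(train_x)
slope = (
    sum((x - mean_x) * (y - average) for x, y in zip(train_x, train_y))
    / sum((x - mean_x) ** 2 for x in train_x)
)
intercept = average - slope * mean_x

def line_model(x):
    return slope * x + intercept

# Overfit: memorizes the training set and repeats the value of the
# closest training point, noise included.
def memorizing_model(x):
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

models = [
    ("constant", constant_model),
    ("line", line_model),
    ("memorizing", memorizing_model),
]
for name, model in models:
    train_error = mean_abs_error(model, train_x, train_y)
    test_error = mean_abs_error(model, test_x, test_y)
    print(name, round(train_error, 2), round(test_error, 2))
```

On this toy data the memorizing model has zero error on the training set yet does worse than the straight line on the test set, while the constant model does poorly on both.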
Fields of Focus
As businesses become more data-focused, new opportunities open up for people of various skill sets to become part of the data community. These are some of the areas of specialization that exist within the data science realm.
Artificial Intelligence (AI)
A discipline involving research and development of machines that are aware of their surroundings. Most work in AI centers on using machine awareness to solve problems or accomplish some task. In case you didn’t know, AI is already here: think self-driving cars, robot surgeons, and the bad guys in your favorite video game.
Business Intelligence (BI)
Similar to data analysis, but more narrowly focused on business metrics. The technical side of BI involves learning how to effectively use software to generate reports and find important trends. It’s descriptive, rather than predictive.
Data analysis is the little brother of data science. It is focused more on answering questions about the present and the past. It uses less complex statistics and generally tries to identify patterns that can improve an organization.
Data engineering is all about the back end. Data engineers are the people who build systems that make it easy for data scientists to do their analysis. In smaller teams, a data scientist may also be a data engineer. In larger groups, engineers are able to focus solely on speeding up analysis and keeping data well organized and easy to access.
Data journalism is all about telling interesting and important stories with a data-focused approach. It has come about naturally as more information becomes available as data. A story may be about the data or informed by data. There’s a full handbook if you’d like to learn more.
Given the rapid expansion of the field, the definition of data science can be hard to nail down. Basically, it’s the discipline of using data and advanced statistics to make predictions. Data science is also focused on creating understanding among messy and disparate data. The “what” a scientist is tackling will differ greatly by employer.
Data visualization is the art of communicating meaningful data visually. This can involve infographics, traditional plots, or even full data dashboards. Nicholas Felton is a pioneer in this field, and Edward Tufte literally wrote the book.
Quantitative analysis is highly focused on using algorithms to gain an edge in the financial sector. These algorithms either recommend or make trading decisions based on a huge amount of data, often within microseconds. Quantitative analysts are often called "quants."