277 Data Science Key Terms, Explained
This is a collection of 277 data science key terms, explained with a no-nonsense, concise approach. Read on to find terminology related to Big Data, machine learning, natural language processing, descriptive statistics, and much more.
This post presents a collection of data science-related key terms with concise, no-nonsense definitions, organized into 12 distinct topics. Starting with Big Data and progressing through to natural language processing, this definition train has stops at machine learning, databases, Apache Hadoop, and several more. It may take some time, but once you get through the terminology presented herein, you should have a good idea of the key terms of importance in data science. And don't worry if the definitions are too slim for you; links abound for expanded related reading opportunities where appropriate.
Big Data. If somehow you've made it to this website and have not heard the term since it first gained momentum toward becoming a popular term at least a decade and a half ago, I really don't know what to say.
But just because one has heard the term, or has taken part in (or opposed) its flippant usage, doesn't mean one knows what it actually means, or what it fully encompasses. Indeed, trying to exhaustively describe what Big Data is in a single post would be nonsensical, not least because there is no agreed-upon exhaustive description, nor should there be. Collecting some key terms associated with Big Data is not a bad idea, however, as it lays a common foundation from which to work forward.
This is the first in a series of such posts on KDnuggets which will offer concise explanations of a related set of terms (machine learning, in this case), specifically taking a no-frills approach for those looking to isolate and define. After some thought, it was determined that these foundational-yet-informative types of posts have not been given enough exposure in the past.
So, let's start with a look at machine learning and related topics.
Clustering is a method of data analysis which groups data points together by "maximizing the intraclass similarity and minimizing the interclass similarity" (Han, Kamber & Pei), without using predefined labels for the points (i.e., it is an unsupervised learning technique). This post introduces key terms for common techniques in cluster analysis.
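To make the idea concrete, here is a minimal sketch that assumes k-means as the technique (only one of many clustering algorithms, and not one singled out above). The toy points, the starting centroids, and the function name `kmeans` are all made up for illustration.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Naive k-means: alternate assignment and centroid-update steps."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid alone
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return centroids, clusters


# Hypothetical toy data: two visibly separate groups, with no labels supplied.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```

Points that sit close together end up in the same cluster (high intraclass similarity) while the two clusters stay far apart (low interclass similarity), and at no point is the algorithm told which group a point belongs to.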
Deep learning is a relatively new term, although it existed well before the dramatic uptick in online searches of late. Enjoying a surge in research and industry, due mainly to its incredible successes in a number of different areas, deep learning is the process of applying deep neural network technologies - that is, neural network architectures with multiple hidden layers - to solve problems. Like data mining, deep learning is a process; the deep neural network architectures it employs are particular types of machine learning algorithms.
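As a rough illustration of what "multiple hidden layers" means in practice, the sketch below pushes one input through a small, untrained network in plain Python. The layer sizes, the ReLU and sigmoid activations, and all function names are assumptions chosen for the example; no training is shown.

```python
import math
import random


def relu(vector):
    return [max(0.0, v) for v in vector]


def sigmoid(z):
    # Numerically stable form that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dense(inputs, weights, biases):
    """One fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def random_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(inputs, layers):
    """Apply every hidden layer with ReLU, then a sigmoid output layer."""
    activation = inputs
    for weights, biases in layers[:-1]:
        activation = relu(dense(activation, weights, biases))
    weights, biases = layers[-1]
    return [sigmoid(z) for z in dense(activation, weights, biases)]


rng = random.Random(0)  # fixed seed so the untrained weights are repeatable
layer_sizes = [4, 8, 8, 1]  # 4 inputs, two hidden layers of 8 units, 1 output
layers = [random_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
output = forward([0.5, -1.2, 3.0, 0.7], layers)
```

The "deep" part is simply the stack of hidden layers between input and output; a real system would also learn the weights from data rather than drawing them at random.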
Data needs to be curated, coddled, and cared for. It needs to be stored and processed, so that it may be transformed into information, and further refined into knowledge. The mechanism for storing data, subsequently facilitating these transformations, is, clearly, the database.
This post presents 16 key database concepts and their corresponding concise, straightforward definitions.
Statistics, though a central set of tools for data science, are often overlooked in favor of more solidly technical skills like programming. Even machine learning algorithms, with their reliance on mathematical concepts such as algebra and calculus -- not to mention statistics! -- are often treated at a higher level than is required to appreciate the underlying math, leading, perhaps, to "data scientists" who lack a fundamental understanding of one of the key aspects of their profession.
This article compiles the key definitions included throughout PAW Founder Eric Siegel’s popular, award-winning book, Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (Revised and Updated, 2016), which has been adopted as a textbook at over 35 universities—but reads like pop science, dubbed “The Freakonomics of big data.”
Cloud computing makes it possible for companies to deploy their applications faster and without excessive maintenance, since maintenance is handled by the service provider. It also leads to better use of computing resources, which can be adjusted as the needs and requirements of a business change over time.
While the Internet is full of terms related to the cloud, here are some pretty basic, but important ones, that one should definitely have some knowledge about. Knowing these key terms will help you understand industry developments and future trends in cloud computing.
Hadoop is a very powerful open source platform managed by the Apache Software Foundation. The Hadoop platform is built on Java technologies and is capable of processing huge volumes of heterogeneous data in a distributed, clustered environment. Its scaling capability makes it a perfect fit for distributed computing.
The Hadoop ecosystem consists of the Hadoop core components and other associated tools. Among the core components, the Hadoop Distributed File System (HDFS) and the MapReduce programming model are the two most important concepts. Among the associated tools, Hive for SQL, Pig for dataflow, and Zookeeper for managing services are important. We will explain these terms in detail.
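For readers new to the MapReduce programming model, the following is a single-process sketch of its three conceptual phases (map, shuffle, reduce) applied to the classic word-count task. It is plain Python, not the Hadoop API; the function names and sample documents are hypothetical, and a real job would distribute these phases across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input records; in Hadoop these would be blocks of files on HDFS.
documents = ["big data big clusters", "data in clusters", "big deal"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Because each map call looks at only one record and each reduce call at only one key, both phases can in principle run in parallel on different machines, which is what makes the model a natural fit for distributed data.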
One of the reasons Apache Spark has become so popular is that it provides data engineers and data scientists with a powerful, unified engine that is both fast (by the project's own claims, up to 100x faster than Apache Hadoop MapReduce for large-scale, in-memory data processing) and easy to use. This allows data practitioners to solve their machine learning, graph computation, streaming, and real-time interactive query processing problems interactively and at much greater scale.
In this blog post, we will discuss some of the key terms one encounters when working with Apache Spark.
The Internet of Things (IoT) is the concept of allowing internet-based communication between physical objects, sensors, and controllers. This post defines 12 key terms for the Internet of Things in a straightforward manner.
This post aims to serve an introductory role, taking a no-nonsense approach to defining some key NLP terminology. While you certainly won't be a linguistic expert after reading this, we hope that you are better able to understand some of the NLP-related discourse, and gain perspective as to how to proceed with learning more on the topics herein.
So here they are, 18 select natural language processing terms, concisely defined, with links to further reading where appropriate.