Big Data Key Terms, Explained

Just getting started with Big Data, or looking to iron out the wrinkles in your current understanding? Check out these 20 Big Data-related terms and their concise definitions.




12. Database

Data needs to be curated, coddled, and cared for. It needs to be stored and processed, so that it may be transformed into information, and further refined into knowledge. The mechanism for storing data, subsequently facilitating these transformations, is, clearly, the database.
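
To make that concrete, here is a minimal sketch of storing data and refining it into information, using Python's built-in sqlite3 module; the table, columns, and values are purely illustrative.

```python
import sqlite3

# A minimal sketch: store raw data in a database, then query it to
# turn data into information. Table and values are illustrative only.
conn = sqlite3.connect(":memory:")  # in-memory database for the example
cur = conn.cursor()

# Store raw event data...
cur.execute("CREATE TABLE events (user_id INTEGER, action TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "purchase", 19.99), (2, "purchase", 5.00), (1, "refund", -19.99)],
)

# ...then query it for something more informative: net spend per user.
cur.execute("SELECT user_id, SUM(amount) FROM events GROUP BY user_id")
print(cur.fetchall())  # e.g. [(1, 0.0), (2, 5.0)]

conn.close()
```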

13. Data Warehouse

Data warehouse is another potentially elusive term. Han, Kamber & Pei define a data warehouse as a data storage architecture that allows "business executives to systematically organize, understand, and use their data to make strategic decisions." Vague, to be sure, but generally speaking, a data warehouse exhibits these characteristics:

  • it is maintained separately from an organization's operational and transactional databases, which are characterized by frequent access and are used for day-to-day organizational operations
  • it allows for the integration of various disparate application systems
  • it houses, and provides access to, consolidated historical data for processing and analysis


Bill Inmon, the Godfather of the Data Warehouse, gave this original and lasting definition, with which we will conclude:

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.
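
As a toy illustration of those characteristics, the sketch below (using the pandas library; the source systems and figures are hypothetical) integrates exports from two disparate operational systems into a single, time-variant store and queries it historically rather than transactionally.

```python
import pandas as pd

# Hypothetical exports from two separate operational systems.
eu_orders = pd.DataFrame({"order_date": ["2023-01-15", "2023-04-02"],
                          "revenue": [120.0, 80.0], "region": "EU"})
us_orders = pd.DataFrame({"order_date": ["2023-02-20", "2023-05-11"],
                          "revenue": [200.0, 150.0], "region": "US"})

# Integrate them into one consolidated, warehouse-style table.
warehouse = pd.concat([eu_orders, us_orders], ignore_index=True)
warehouse["order_date"] = pd.to_datetime(warehouse["order_date"])
warehouse["quarter"] = warehouse["order_date"].dt.to_period("Q")

# A historical, strategic view rather than a transactional one.
print(warehouse.groupby(["quarter", "region"])["revenue"].sum())
```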

14. ETL

ETL stands for Extract, Transform, and Load: the process of extracting data from source systems, such as transactional databases, transforming it, and loading it into a data warehouse. If you are familiar with online transaction processing (OLTP) and online analytical processing (OLAP), ETL can be thought of as the bridge between these two system types.
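
Here is a minimal sketch of the idea, with a hypothetical source system and warehouse both played by in-memory SQLite databases; the table and column names are made up for illustration.

```python
import sqlite3

def extract(source_conn):
    # Extract: pull raw rows from the (hypothetical) source system.
    return source_conn.execute(
        "SELECT order_id, amount_cents, country FROM orders").fetchall()

def transform(rows):
    # Transform: convert cents to dollars and normalise country codes.
    return [(order_id, cents / 100.0, country.upper())
            for order_id, cents, country in rows]

def load(warehouse_conn, rows):
    # Load: place the cleaned rows into the warehouse fact table.
    warehouse_conn.executemany(
        "INSERT INTO fact_orders (order_id, amount_usd, country) VALUES (?, ?, ?)",
        rows)
    warehouse_conn.commit()

if __name__ == "__main__":
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (order_id INTEGER, amount_cents INTEGER, country TEXT)")
    source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                       [(1, 1999, "us"), (2, 500, "de")])

    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE fact_orders (order_id INTEGER, amount_usd REAL, country TEXT)")

    load(warehouse, transform(extract(source)))
    print(warehouse.execute("SELECT * FROM fact_orders").fetchall())
```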

15. Business Intelligence

And perhaps the most ambiguous term of all (an incredible feat in a set of Big Data terminology definitions) is business intelligence (BI). BI is a loosely defined set of tools, technologies, and concepts that support business by providing historical, current, and predictive views of its operations. The relationship between BI and data mining, in particular, is a curious one, with various definitions proposing that BI is a subset of data mining; that data mining is a subset of BI; that BI is driven by data mining; or that BI and data mining are separate and mutually exclusive. So, that settles that.

In the age of data science and Big Data, BI is generally thought to include OLAP, competitive intelligence, benchmarking, reporting, and other business management approaches (all of which tend toward ambiguity in definition as well), and is heavily influenced by the dashboard culture.

16. Apache Hadoop

Apache Hadoop can almost single-handedly be credited with the rise of the Big Data Revolution, at least from a software point of view.


Apache Hadoop is an open-source framework for processing large volumes of data in a clustered environment. It uses the simple MapReduce programming model for reliable, scalable, and distributed computing. Both storage and computation are distributed in this framework.

(From Kaushik Pal's Hadoop Key Terms, Explained)
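
The MapReduce model itself is simple enough to sketch in a few lines. The following is an illustrative word count in the spirit of Hadoop Streaming, where mappers and reducers are plain scripts reading standard input; it is not Hadoop itself, and on a real cluster the two phases would run as separate, distributed tasks.

```python
#!/usr/bin/env python3
# Illustrative MapReduce-style word count. Both phases are shown in one
# file for brevity; Hadoop would distribute them across a cluster.
import sys
from itertools import groupby

def mapper(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce: sum the counts for each word (pairs grouped by key).
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    for word, count in reducer(mapper(sys.stdin)):
        print(f"{word}\t{count}")
```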

17. Apache Spark

Apache Spark is a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics, with APIs in Java, Scala, Python, R, and SQL. Spark runs programs up to 100x faster than Apache Hadoop MapReduce in memory, or 10x faster on disk. It can be used to build data applications as a library, or to perform ad-hoc data analysis interactively. Spark powers a stack of libraries including SQL, DataFrames, and Datasets, MLlib for machine learning, GraphX for graph processing, and Spark Streaming, and you can combine these libraries seamlessly in the same application. Spark runs on a laptop, on Apache Hadoop, on Apache Mesos, standalone, or in the cloud, and can access diverse data sources including HDFS, Apache Cassandra, Apache HBase, and S3.

(From Denny Lee and Jules Damji's Apache Spark Key Terms, Explained)
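
For a flavour of the API, here is a minimal PySpark word count sketch; it assumes the pyspark package is available and that input.txt is a hypothetical local text file.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read lines of text, then count words with the classic map/reduce pattern.
lines = spark.read.text("input.txt").rdd.map(lambda row: row[0])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```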

18. Internet of Things

The Internet of Things (IoT) is a growing source of Big Data. IoT is:

The concept of allowing internet-based communication between physical objects, sensors, and controllers.

(From Geethika Bhavya Peddibhotla's Internet of Things Key Terms, Explained)
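
A toy sketch of the idea: a "physical object" reads a sensor value and sends it over the internet to a controller. The endpoint URL, payload fields, and reading below are entirely hypothetical.

```python
import json
import time
import urllib.request

# Hypothetical controller endpoint; a real device would use its own
# service (often over MQTT or HTTP).
ENDPOINT = "http://example.com/sensors/livingroom"

def read_temperature():
    # Stand-in for a real hardware sensor read.
    return 21.5

def publish(reading):
    # Send the reading as JSON to the controller over HTTP.
    payload = json.dumps({"sensor": "livingroom", "temp_c": reading,
                          "ts": time.time()}).encode("utf-8")
    req = urllib.request.Request(ENDPOINT, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    publish(read_temperature())
```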

19. Machine Learning

Machine learning can be employed for predictive analysis and pattern recognition in Big Data. According to Mitchell, machine learning is "concerned with the question of how to construct computer programs that automatically improve with experience." Machine learning is interdisciplinary in nature, and employs techniques from the fields of computer science, statistics, and artificial intelligence, among others. The main artefacts of machine learning research are algorithms that facilitate this automatic improvement from experience, and these algorithms can be applied in fields as diverse as computer vision, artificial intelligence, and data mining.
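
Mitchell's "improve with experience" can be seen directly in a small experiment. The sketch below (assuming scikit-learn is installed) trains the same model on progressively larger slices of a toy dataset, and its accuracy on held-out data generally rises.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A toy dataset of handwritten digits, split into train and test sets.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# More "experience" (training examples) generally means better accuracy.
for n in (50, 200, 800, len(X_train)):
    model = LogisticRegression(max_iter=2000)
    model.fit(X_train[:n], y_train[:n])
    print(n, "training examples -> accuracy",
          round(model.score(X_test, y_test), 3))
```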

20. Data Mining

Fayyad, Piatetsky-Shapiro & Smyth define data mining as "the application of specific algorithms for extracting patterns from data." This demonstrates that, in data mining, the emphasis is on the application of algorithms, as opposed to the algorithms themselves. We can define the relationship between machine learning and data mining as follows: data mining is a process, during which machine learning algorithms are used as tools to extract potentially valuable patterns held within datasets.
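
As a small illustration of "applying specific algorithms for extracting patterns from data", the sketch below (again assuming scikit-learn) applies k-means clustering to synthetic data to recover the groups hidden in it.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three hidden groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Apply a specific algorithm (k-means) to extract the pattern.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Discovered cluster centres:")
print(kmeans.cluster_centers_)
```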
