Apache Drill Makes Big Data Analysis Easier for Everyone
Apache Drill is an open source query engine that provides interactive and secure SQL analytics at the scale of petabytes. Provides data querying and exploring capabilities from varied NoSQL databases and file formats.
By Kaushik Pal, (Techalpine)
Many people consider big data analysis as something that can be done only by the data scientists. According to these people, big data processing and analysis require special skills such as statistics, technical knowledge and prior experience that are possessed by a select group of people. The Apache Drill has the potential to change the perception. The Apache Drill is simple and straightforward to use and almost anyone can process and analyse big data with it. The Apache Drill can query multiple sets of structured and unstructured data with the help of ANSI SQL. The user just needs to use SQL and any Business Intelligence (BI) tool to run the run queries on data from multiple sources. Apache Drill can be used on laptops and there is no need to define any schema. There is no doubt that Apache Drill has in it to make big data analytics more accessible to wider groups of people.
What is Apache Drill?
Apache Drill is an open source query engine that provides interactive and secure SQL analytics at the scale of petabytes. Drill is the first of its kind in providing data querying and exploring capabilities from varied sources such as the NoSQL database or multiple file formats. Since the Apache Drill follows the ANSI SQL standards, there is no effort required in fresh learning. All you need to do is to have the knowledge of SQL and you can get started.
The main benefits of Apache Drill are:
- Enterprises do not need to depend on a select talent pool to access and analyse data. The existing talent pool and resources on ANSI SQL can be utilized to get results quickly. The SQL Analysts and Business Intelligence specialists can query and analyse data quickly and enterprises do not have to depend on data preparation by the IT department for a long time.
- The IT department can bypass the unnecessary schema maintenance tasks and ETL cycles and still have simple and straightforward governance with the help of granular access mechanisms that are easy to deploy.
The use case below describes how you can easily use the Data Drill to quickly query and analyse data for your business purposes.
Let us assume that there is a retail giant named T Retail which has its offices and showrooms across countries. Like all other commercial enterprises, T Retail wants to constantly improve its revenue and profits and to do that, it relies a lot on data analytics because that helps it to better understand customer preferences, purchasing habits and behaviour vis-à-vis its products and services. Now, T Retail is launching a large campaign to promote and sell its new and existing products. Data analytics is going to be a significant part of its sales strategy. Data analytics will help T Retail identify the following:
- The time of the year the promotional activities are likely to yield the maximum results.
- The regions more likely to buy its products.
To find out the above information, T Retail needs to find out the following data first:
- The top months based on gross sales.
- The top countries or regions based on gross sales.
- The top products based on gross sales.
T Retail will use the Apache Drill to perform this task and more. To find out the above information, T Retail will use the following table.
Table name: Orders
To enable data analysis with Apache Drill, T Retail has offloaded its customer data from an Oracle system to a Hive table that resides on a Hadoop cluster.
As you can see, the table above contains the following columns:
- order_id: Unique identification number of an order placed.
- month: The month the order was placed.
- cust_id: Unique identification number of the customer that placed the order.
- county: State where the order was fulfilled.
- prod_id: Unique identification number of the item sold.
- order_total: Total number of orders for this customer and item.
Now, let us find out how to use query by Apache Drill to find out the following information:
The top months based on gross sales
SELECT `month`, SUM(order_total) as sales
GROUP BY `month`
ORDER BY sales desc;
The top countries or regions based on gross sales
SELECT `month`, `state`, SUM(order_total) as sales
GROUP BY `month`, `state`
ORDER BY sales desc;
The top 20 products based on gross sales
SELECT `prod_id`, SUM(order_total) as sales
GROUP BY `prod_id`
ORDER BY 2 desc limit 20;
So, the above queries will give T Retail the required information that it can use in its marketing campaigns. But the main point here is that it is extremely easy to run SQL query on a set of data using Apache Drill.
A Quick Look inside Apache Drill
This is how a typical Apache Drill database looks like.
This is how a typical query with Apache Drill looks like
Additional Stuff you can do
There is more to Apache Drill than just querying information from a database. For example, T Retail can analyse potential customer choices and purchasing habits by analysing website visiting patterns. Let us see how T Retail can do this with the help of the Apache Drill.
- T Retail collects click data which gives information about the clicks of website visitors to different pages. The patterns of clicks such as navigational pattern, types of pages visited can reveal a lot about a consumer’s preferences.
- The click data is collected as flat text in log files. The log file maintains different information such as IP addresses, bytes served, HTTP code, browser types, and page requests.
However, the challenge with the above assignment is that the data is collected in semi-structured or unstructured format. This is where Apache Drill comes in. With Drill, you can easily query and analyse semi structured data. The main advantages of Drill are:
- Drill can identify the schema of any data on the fly, anytime. This allows the data analysts to query the data from any source without needing to create any type of schema definition.
- Data analysts can use the Apache Drill to query the Parquet and JSON files just with the help of normal ANSI SQL format, just like in the case of structured database.
- Apache Drill has a flexible data model that makes it easy for anyone to manipulate or query data from almost any type of source.
The main benefit or advantage of Apache Drill is that it is going to significantly reduce the investment towards big data analysis. Now, enterprises do not have a good reason to invest in complex technology or skill sets always to access and analyse big data. With Apache Drill, big data analysis has become accessible to more people. It seems that Apache Drill marks the beginning of a trend when more tools and technologies are going to follow suit by making big data analysis much easier. That will indeed be a defining moment in the history of big data.
Bio: Kaushik Pal (www.techalpine.com) has 16 years of experience as a technical architect and software consultant in enterprise application and product development. He has interest in new technology and innovation area along with technical writing. His main focuses are on web architecture, web technologies, Java/j2ee, Open source, big data and semantic technologies.