Training a Champion: Building Deep Neural Nets for Big Data Analytics
Introducing Sisense Hunch, the new way of handling Big Data sets that uses AQP technology to construct Deep Neural Networks (DNNs) which are trained to learn the relationships between queries and their results in these huge datasets.
By Nir Regev, Sisense.
The world of data is now the world of Big Data. The genie is out of the bottle and there’s no going back. We produce more and more data every day and the datasets being generated are getting more and more complex. Traditionally, the way to handle this has been to scale up computing resources to handle these bigger datasets. That’s not feasible for the long term on a global scale, nor is it tenable in the near term for smaller organizations who may have limited resources to deal with their analytics challenges.
Further complicating matters, in order to facilitate true interactivity with large datasets, user queries need near-real-time response rates. This is a challenge even for the most powerful large-scale systems. Approximate Query Processing (AQP) removes the need to query the entire Big Data set and serves up usable results rapidly.
Sisense Hunch™ is a new way of handling Big Data sets that uses AQP technology to construct Deep Neural Networks (DNNs) which are trained to learn the relationships between queries and their results in these huge datasets. This provides users with a fast, scalable inference layer above the data that they can interactively query to get actionable insights. Large batches of data can be approximated by the DNN which utilize concurrent processing in a Graphical Processing Unit (GPU). In addition, since Hunch’s DNNs are typically on the Mb scale, they can be easily deployed and distributed to thousands of users or IOT devices, putting incredibly fast Big Data analytics almost anywhere. Embedding Big Data analytics at the edge or anyplace that users can benefit from them is a huge leap forward when dealing with these immense datasets. Sisense Hunch balances speed, resource consumption, and accuracy and gives users unparalleled flexibility in where they put these insights.
The power of the Hunch system lies in the DNN training process, which ultimately yields the query approximation model. The process is divided into three phases: generating artificial SQL queries, obtaining the labels for the training set (i.e., executing the queries on the database), and finally encoding the queries into numeric tensor using the on-the-fly generated query encoder. Once this training set is created, Hunch utilizes a supervised approach to learn to approximate the training set queries.
The Training Process
Generating SQL Queries
To approximate the wide array of queries that could come in for a given dataset, Hunch self-generates a robust set of queries.
This is the formal query structure description:
An example of a Hunch-generated query is:
“Select AVG(sales) from table where store_type in (‘online’) and computer_type in (‘Mac’) and hour between 20 and 23 and hdisk_tb_size between 1 AND 5”
The ultimate goal of this phase is to generate a large representative set of aggregated SQL queries to cover many (user-oriented) aspects of the raw data. In order to do this, the system first extracts statistical values (data distribution) from the raw data.
For each column with continuous numeric values, the algorithm calculates quartile distributions (min, 25%, median, 75%, max). Then, it uses this information to sample values from a uniform distribution which fits to the column numeric boundaries. The system also executes “Group By” queries, which accelerates training set generation, mainly because all permutations of nominal columns’ values can be obtained in one path to the data. Additionally, in the query format our method supports, “WHERE” clause statements are connected with an “AND” operator, so the order of the statement does not make a difference. To ensure the DNN does not learn specific statements’ order, our method randomly shuffles the statements before constructing the query. This is done intentionally to force the DNN to learn to approximate semantically identical queries that occur with different “WHERE” clause ordering.
Labeling the Training Set
Building a supervised data set for training the DNN requires real query results. For this, our method runs the set of generated queries against the dataset. Since the DNN requires a relatively large number of training samples, the system generates and executes hundreds of thousands of queries using concurrent technology to optimize costs. However, this process takes place only once for each dataset. When new data arrives, necessitating the generation of new queries, Hunch uses incremental (transfer) learning (more on that later). This approach saves time, effort, and costs both in the training set generation phase and in the DNN training phase.
Since the DNN can digest only numeric input, we developed an encoding model to encode the queries into numeric matrices. This goal is obtained by an encoder model which is constructed on the fly (during SQL queries generation), making use of vector embedding technique.
The encoder is designed to condense data sets with millions of distinct values into light memory footprint numeric tensors by utilizing proprietary embedding techniques.
Tackling New Data with Incremental Learning
When new data is added to the dataset, Hunch needs to adapt the DNN to approximate queries based on new data. Building on the previous learning processes, Hunch utilizes transfer learning, meaning that the DNN will be trained from its last weights state, against a new training set which is generated using the previous DNN and the new data.
Evaluation and Distribution
Once the training is finalized, we use a set of accuracy metrics to evaluate the model’s approximation. We use Normalized Root Mean Squared Error (NRMSE) to measure the model accuracy on a hold out testing set of queries. Once the NRMSE is converged below a threshold, Hunch publishes a cloud API end-point around the DNN. Then, users or IOT devices can send query requests. The end-user benefits with from prompt response fixed query latency (which is independent of the raw data size!) and interactivity when dealing with Big Data in a wide variety of settings.
Just because we now live in a world of Big Data does not mean we need to surrender our ability to interact with that data and truly understand it. Deep neural networks using Approximate Query Processing offer users a way to work with these huge datasets, pull out powerful insights, and put those insights at the edge for IoT and human usage—all without requiring huge technological investments or costs. Nothing will stop the Big Data from getting even bigger, but technology like Hunch will help humans stay in the driver’s seat.
Graph A: Query Results and Predictions for a 1,000,000,000-Row, 250-Gb Dataset
Original. Reposted with permission.