SlamData Open Source Analytics Tool for MongoDB


SlamData is an open source, SQL-based tool that makes data in MongoDB accessible to developers and non-developers alike, with the goal of simplifying application intelligence.



By John A. De Goes (SlamData), Dec 2014.

SlamData is an open source tool that makes analytics on MongoDB easy and accessible to developers and non-developers alike. We just launched v1.1, which greatly increases the power of the tool and fixes a number of issues identified in the 1.0 release.

Why SlamData?

MongoDB is currently the fastest-growing and most successful NoSQL database. Companies use it primarily to build web and mobile applications.

Successful applications built on MongoDB end up capturing or generating large amounts of data. The process of understanding this data, which I call Application Intelligence, is vital to multiple stakeholders in the business:

  • Product. How can we use this data to improve the product or better understand users? How can we allow users to learn from the data themselves?
  • Marketing. What does this data tell us about how users are using the application? Can this application data help us better understand marketing ROI?
  • Support. How can this data be used to help identify and resolve issues that users are having?
  • Managers. What does this data tell us about resource allocation? How can we tie this data to sales and other data sets?
  • IT. What type of data is being generated by the application, and how might we tune the database for this kind of data?

MongoDB does not have rigid schemas (every “row” may have a different structure from every other “row”), and it allows arbitrary nesting of data (“rows” can contain other “tables”).
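To make this concrete, here is a minimal Python sketch of what such data can look like. The documents and the `field_paths` helper are invented for illustration and are not part of SlamData or the MongoDB API. The helper lists every field path and value type in a document, which shows how two "rows" of the same collection can disagree in shape.

```python
# Hypothetical documents, invented for illustration only.
documents = [
    {"user_name": "alice", "age": 31},
    {"user_name": "bob",
     "music": [{"genre": "rock",
                "likes": [{"name": "david bowie", "strength": 5}]}]},
    {"user_name": "carol", "age": "unknown"},  # same field, different type
]


def field_paths(value, prefix=""):
    """Return the set of (path, type name) pairs found in a nested value."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else key
            paths |= field_paths(child, child_prefix)
    elif isinstance(value, list):
        # Array elements share one path, marked with [*].
        for child in value:
            paths |= field_paths(child, prefix + "[*]")
    else:
        paths.add((prefix, type(value).__name__))
    return paths


shapes = [field_paths(doc) for doc in documents]
for doc, shape in zip(documents, shapes):
    print(doc["user_name"], sorted(shape))
```

Running it prints a different set of paths for each document, including a nested path such as `music[*].likes[*].name` that has no counterpart in a flat relational row.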

While this flexibility leads to faster application development and better performance and scaling properties, it comes at a cost: existing analytics tools don't work with MongoDB.

In the relational world, to answer these types of questions you’d simply use a data discovery and ad hoc analytics tool. But in the world of MongoDB, if you need Application Intelligence, you have exactly two choices:

  1. Coding. Have a developer write low-level, one-off code to interact with the MongoDB API to discover the data stored there and answer relevant questions.
  2. ETL. Develop a workflow to migrate, homogenize, and normalize the data into an RDBMS, where you can use existing analytics tooling (albeit on a data model which does not accurately represent the underlying data).

Ultimately, neither approach is scalable, which is why we started SlamData, an open source project based on the premise that NoSQL data is here to stay, and analytics tooling needs to catch up with modern data.

Introducing SlamData

SlamData provides a standard SQL interface to NoSQL data stored in MongoDB.

Every SQL query is executed 100% in the database (or in a replica set), and operates on the actual structure of the data.

This approach differs substantially from other solutions to the problem, which stream data from the database to handle complex queries, and which superimpose a fake relational view of the underlying data (even when it is not relational).

SlamData’s dialect of SQL (called SlamSQL) extends ANSI SQL to support nested data, heterogeneous data, and aggregation over nested dimensions (for example, summing elements in an array stored inside a row).

An example SlamSQL query is shown below:

SELECT DISTINCT user_name, SUM(music[*].likes[*].strength)
    AS strength
FROM collection WHERE music[*].likes[*].name = 'david bowie'
GROUP BY user_name
ORDER BY strength DESC
LIMIT 10

In this query, documents nested two levels deep inside arrays are used both to filter the results and to compute the summed values. This query would be impossible in an RDBMS, and the equivalent code for the MongoDB API would be very difficult to write, troubleshoot, and understand.
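For readers who want to see what the query computes, the following is a rough in-memory Python sketch. It is not how SlamData executes the query (SlamData runs it inside MongoDB), and the sample data and the `top_fans` function are hypothetical. It also assumes one particular reading of the query: only the `strength` values of likes whose `name` matches are summed.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the MongoDB collection.
collection = [
    {"user_name": "alice", "music": [
        {"likes": [{"name": "david bowie", "strength": 3},
                   {"name": "queen", "strength": 9}]},
        {"likes": [{"name": "david bowie", "strength": 4}]},
    ]},
    {"user_name": "bob", "music": [
        {"likes": [{"name": "david bowie", "strength": 5}]},
    ]},
    {"user_name": "carol", "music": [
        {"likes": [{"name": "queen", "strength": 8}]},
    ]},
    {"user_name": "dave"},  # no music field at all
]


def top_fans(docs, artist, limit=10):
    """Sum the strength of matching likes per user, strongest first."""
    totals = defaultdict(int)
    for doc in docs:
        for entry in doc.get("music", []):        # music[*]
            for like in entry.get("likes", []):   # likes[*]
                if like.get("name") == artist:    # WHERE ... = artist
                    totals[doc["user_name"]] += like.get("strength", 0)
    # ORDER BY strength DESC, then LIMIT
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


result = top_fans(collection, "david bowie")
print(result)
```

Even in this simplified form, the two nested loops and the missing-field handling have to be written by hand, which is the kind of one-off code the SQL version avoids.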

[Image: SlamData Rabid Fans]

By leveraging industry standard SQL, SlamData makes it possible for a wide range of users and tools to interface with MongoDB, and helps teams quickly and easily understand the data generated or collected by their MongoDB applications.

In the current 1.1 release, all standard SQL clauses are supported, including SELECT, AS, FROM, JOIN, WHERE, GROUP BY, HAVING, OUTER JOIN, CROSS JOIN, and more.

[Image: SlamData interactive prompt example]

Opening up the Box

The SlamData project innovates in several key ways:

  1. Structural type inference. SlamData does not scan the database to learn the structure of the data. Instead, SlamData uses a structural type system, complete with bidirectional type inference, which allows SlamData to parse the intent of a query and generate an execution plan consistent with that intent. For example, if your query uses a field as if it were a string, then SlamData will look for documents in which the field is a string. SlamData will also warn you when you attempt to do nonsensical things, like adding 4 to a string, because even though SlamData doesn’t know what’s in the database, it does know what operations make sense on what data types.
  2. Multi-dimensional relational algebra. SlamSQL is built on a formal extension of relational algebra called multi-dimensional relational algebra (MRA). This more powerful (but backward-compatible) foundation allows slicing, dicing, and aggregating nested, non-uniform data. As a pleasant side effect, it also gives sensible semantics to many SQL queries which are not allowed in ANSI SQL (for example, SELECT price / SUM(price) AS percent FROM ORDERS).
  3. Advanced multi-staged compilation. MongoDB has three distinct mechanisms for executing a query (one of them being full-fledged map/reduce), and each has different strengths and weaknesses. In general, efficiently executing a complex query might require a combination of all three. SlamData has an advanced multi-stage, optimizing planner which attempts to find the optimal combination of all three mechanisms.
  4. In-database execution. SlamData is extremely aggressive about pushing execution of queries into the database. In fact, 100% of every query will be executed directly in the database, with no streaming back to the client for post-processing. Other attempts at solving this problem rely on client-side processing for most queries, because executing every part of every query inside the database is extremely difficult to do in a performant way (hence the need for the advanced, multi-staged compilation).
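Two of the ideas in this list can be illustrated with a small Python sketch. Everything here is hypothetical: the `orders` rows, the `percent_of_total` function, and the `check_addition` function are invented for illustration and do not reflect SlamData's internals. The sketch assumes that `price / SUM(price)` means each row's price divided by the total over all rows, and it models type checking as a simple lookup on type names.

```python
# Hypothetical order rows, invented for illustration only.
orders = [{"price": 10.0}, {"price": 30.0}, {"price": 60.0}]


def percent_of_total(rows):
    """Assumed meaning of: SELECT price / SUM(price) AS percent FROM ORDERS.

    Each row is kept, and its price is divided by the sum over all rows.
    """
    total = sum(row["price"] for row in rows)
    return [{"percent": row["price"] / total} for row in rows]


def check_addition(left_type, right_type):
    """Return the result type of left + right, or raise if it makes no sense.

    Only type names are inspected, so no data needs to be scanned.
    """
    numeric = {"int", "float"}
    if left_type in numeric and right_type in numeric:
        return "float" if "float" in (left_type, right_type) else "int"
    raise TypeError(f"cannot add {left_type} and {right_type}")


percents = percent_of_total(orders)
print(percents)
print(check_addition("int", "float"))

try:
    check_addition("int", "str")  # like adding 4 to a string
except TypeError as error:
    print("rejected:", error)
```

The point of the second function is that the check needs only the types implied by the query, not the contents of the database.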

The combination of these features makes SlamData “point and query”: point SlamData at your MongoDB database, and run whatever queries you want on any kind of data. SlamData will generate the optimal query plan and execute it 100% in the database.

Learning More

If you are using MongoDB and would like to try SlamData, you can find installers on the official website, or you can compile the project from source code on GitHub.

SlamData is a 100% open source project, so if you like what you see, please consider supporting the project.

We also have a newsletter you can sign up for on the official website. Enjoy!

John A. De Goes, @jdegoes, is a founder and CTO of SlamData, and a contributor to the open source SlamData project. Previously, he was General Manager of DataMesh, Principal Architect at RichRelevance, and CEO/CTO of Precog.
