Submit a blog to KDnuggets -- Top Blogs Win A Reward

Topics: AI | Data Science | Data Visualization | Deep Learning | Machine Learning | NLP | Python | R | Statistics

KDnuggets Home » News » 2021 » Jul » Tutorials, Overviews » Shareable data analyses using templates ( 21:n27 )

Shareable data analyses using templates


We've been using shared data analyses in production for three years. Here's how you can create reusable templates for common metrics and analyses.



By Cedric Dussud, Cofounder of Narrator.ai

Shareable data analyses using templates
Photo by Joanna Kosinska / Unsplash

 
Our friend Benn Stancil recently wrote a great post about templates — his term for sharable, pre-built dashboards and reports. Do yourself a favor and read it.

The basic idea is that shared, reusable analyses for data have been a pipe dream for years and aren't yet on their way:

Even though our data is the same, and our companies are the same, there’s no one-click way to spin out an entire suite of dashboards


Templates do seem inevitable: the concept of reusable code is something software developers have relied on for literally decades. It's fundamental to how all software is built. The data community has been borrowing best practices from the software world since the beginning, from version control in Git to staging environments to testing. But we still can't use their single most powerful technique.

Each time an analyst switches jobs they have to rebuild the same metrics from scratch. Even if they've done it (lots of times) before, the data is so different that the SQL needs to be fully rewritten. Software engineers would just call a function like monthly_recurring_revenue(data) and move on with their lives.

At this point it's literally easier to set up a data warehouse and populate it from scratch than it is to get a single metric into production.

We've been using templates in production at Narrator for three years — we'll walk through why they're not common yet and how they could be used more broadly.

 

Why don't we have reusable dashboards or analyses?

 
The problem with templates are that they require some consistency. The reason an analyst has to rebuild a monthly recurring revenue metric at a new company is because the data is structured differently, and even if it wasn't (maybe both are exclusively using Shopify), the business might interpret revenue slightly differently.

Benn points out that templates rely on these two core assumptions

  1. everyone's data is mostly the same
  2. everyone's business problems are mostly the same

Obviously we don't live in that world. But even though source data can be wildly different, they fundamentally use similar concepts.
 

A data model for templates

 
Even though no one's source data is the same, there's no reason we can't build consistent data models from them. Benn suggests this approach in his post

A better method—which is similar to the one taken by Narrator—may be to invert this: ask for meaning, and tell people to provide a table that matches it.


In other words, get agreement on what an 'order' table looks like. To use a template first transform your source data (Stripe or Shopify or whatever) into that format, and configure and apply the template.

The insight here is pretty important: even though companies have vastly different data, some high-level concepts like 'orders' or 'invoices' fit nearly everyone's business. As we said before, even though business source data can be pretty unique, the business concepts underlying it are fairly common.

Take, for instance, Salesforce. They define a conceptual model of leads, accounts, opportunities, and contacts. No two businesses run their sales team the same way, but a vast number of them have aligned their processes to this framework.

Companies with completely different data will still think in terms of opportunities going through stages when analyzing data. The specific business problems may differ, but the base concepts are the same.

If we can agree on a common data model then we can build templates.

 

A universal data model

 
Modeling standardized concepts like 'opportunities', or 'emails sent' seems like a promising approach. The only problem is that there are too many, and we'll never get agreement on them anyway. It's easier to make templates for Stripe and Shopify and call it a day.

What we need is a universal way to model any business concept. In other words we're looking for

  1. A fixed table structure with known columns → so templates can be built on it
  2. Flexibility to represent anything → works on anyone's data

At Narrator we call it the activity schema.

 

Activity Schema

 
The activity schema is an open specification for modeling data. Ahmed, our CEO, first proposed it back at WeWork and we've built Narrator's business around it.

It's an 11-column table that represents a customer doing an activity over time.

Activity schema table

 

In our experience all data for analytics can be expressed this way. It's also a fairly natural way to think of things

customer doing an activity over time literally represents how customers interact with a business.


The activities can be anything relevant to a business. For example, Uber might have 'booked a ride'. Netflix might have 'rated a video'. The point is no one has to agree on the activities. Just the data model.

So what does this have to do with templates? Let's make up an example and see how it works.

Say I've got all my data in a single table (called activity_stream) modeled as an activity schema. I've built two activities: viewed_marketing_page and inbound_sales_call.

I've decided to count the unique number of visitors to the marketing page per month:

SELECT 
    month,
    COUNT (DISTINCT customer) 
FROM ( 
    SELECT
        customer,
        date_trunc('month', ts) as month 
    FROM activity_stream a
    WHERE a.activity = 'viewed_marketing_page'
)
GROUP BY month
ORDER BY month DESC


It's pretty straightforward. Now imagine I wanted to count the unique number of people who called us. It's the same query — I just have to substitute inbound_sales_call for viewed_marketing_page

This query is pretty close to a template — we can switch activities without changing its structure at all. A template built from it could perhaps be written as function with a single parameter.

monthly_unique_customers(activity_name)

The point is that a predictable table structure opens up a ton of possibilities.

 

Templates in practice

 
This toy example is neat, but does it work in practice? We think so. At Narrator we've been using templates in production for the past three years.

Our templates are full in-depth analyses instead of functions, but the overall approach is the same. After we build an analysis for customer A, we'll configure it for customer B, then customer C, and so on.

Templated analyses are more complex to write, but the benefits outweigh the cost. Just like open source software we can iterate on them and version them over time. Each time we pick up an analysis again we're making it better, instead of rewriting it from scratch to fit a totally new set of data models.

The activity schema has a ton of modeling benefits beyond supporting templating. I'll talk more in-depth about it in a future post, but the full specification is open and available.

 

Will templates become mainstream?

 
It's hard to say. They work for us, but we'll be the first to admit that picking up a whole new modeling approach is a high bar to clear. That said, it's definitely possible to do. And we're obviously biased.

If anyone knows of other templating implementations I'd love to hear about them. Or really any feedback or thoughts you might about the activity schema.

If you want to try the activity schema out on your data let us know.

 
Bio: Cedric Dussud is Cofounder of Narrator.ai. Formerly Engineering Director at WeWork and Senior Software Engineer at Microsoft.

Original. Reposted with permission.

Related:


Sign Up

By subscribing you accept KDnuggets Privacy Policy