Meet whale! The stupidly simple data discovery tool

Finding data and understanding what it means is the traditional "daily grind" of a data scientist. Whale, a new lightweight data discovery, documentation, and quality engine for your data warehouse under development by Dataframe, lets your data science team search for data more efficiently and automate its metric calculations.



By Robert Yi, CDO at Dataframe.

 

First, how Airbnb’s data discovery tool changed my life

 

In my career, I’ve been fortunate enough to work on some fun problems: I studied the mathematics of rivers during my Ph.D. at MIT, worked on uplift models and open-sourced pylift at Wayfair, and implemented novel homepage targeting models & CUPED improvements at Airbnb. But in all of this work, the day-to-day has never been glamorous — in reality, I’d often spend the bulk of my time finding, learning about, and validating data. Though this had always been the state of my work life, it hadn’t occurred to me that it was a problem until I got to Airbnb, where it had been solved by their data discovery tool, Dataportal.

Where can I find {{ data }}? Dataportal.
What does this column mean? Dataportal.
How is {{ metric }} doing today? Dataportal.
What is the meaning of life? Dataportal, probably.

Well, you get the picture.

It would take just minutes (not hours) to find data and understand what it meant, how it was made, and how to use it.

I could spend my time crafting homespun analyses or building novel algorithms (…or responding to random data questions), not scrounging through notes, writing repetitive SQL queries, and Slack-tagging my coworkers to try to recreate context that someone else already had.

So what’s the problem?

Well, I realized most of my friends didn’t have access to a tool like this.

Few companies want to dedicate vast resources to building and maintaining a platform tool like Dataportal. And although there are a few open source solutions, they are generally built for scale, making setup and maintenance difficult without a DevOps engineer dedicated to the job.

So I thought I’d build something new.

 

Enter whale: the stupidly simple data discovery tool

 

And yes, by stupidly simple, I mean stupidly simple. Whale has just two components:

  1. A Python library that scrapes metadata and formats it as markdown.
  2. A Rust CLI interface to search over that data.

In terms of backend infrastructure, all you have to maintain is a bunch of text files and a program that updates them. That's it, which makes whale trivial to host on a git server like Github. There's no new query language to learn, no infra to manage, no backups. Everyone already knows git, so sync and collaboration come for free.
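To make the two-component design concrete, here is a rough, hypothetical sketch of what the scrape-and-write step looks like conceptually: pull table and column metadata from the warehouse's information schema and write one markdown stub per table. This is not whale's actual code or file layout (see the repo for that); the connection string, output directory, and query are all assumptions for illustration.

```python
# Illustrative only: scrape table/column metadata and write one markdown
# stub per table. Whale's real Python library differs; the connection
# string, output path, and information_schema query here are assumptions.
from pathlib import Path
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@host/db")  # hypothetical warehouse
out_dir = Path("metadata")                                # hypothetical output dir

query = text("""
    select table_schema, table_name, column_name, data_type
    from information_schema.columns
    order by table_schema, table_name, ordinal_position
""")

stubs = {}
with engine.connect() as conn:
    for schema, table, column, dtype in conn.execute(query):
        stubs.setdefault(f"{schema}.{table}", []).append(f"* [{dtype}] `{column}`")

out_dir.mkdir(parents=True, exist_ok=True)
for full_name, columns in stubs.items():
    # One markdown file per table: a heading plus a column list.
    body = f"# {full_name}\n\n## Columns\n" + "\n".join(columns) + "\n"
    (out_dir / f"{full_name}.md").write_text(body)
```

Because the output is plain markdown in a directory, the "database" is just files you can commit, diff, and review like any other code.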

Let’s take a closer look at the features of Whale v1.0.

 

A fully-featured git-based GUI

 

Whale was born to sit on a git remote server like Github. It's super easy to set up: define some connections, copy our Github Actions script (or write one for your CI/CD platform of choice), and you'll immediately have a web-based data discovery tool where you can search, view, document, and share your tables directly on Github.

A sample table stub generated through Github actions. For a fully working demo, see https://github.com/dataframehq/whale-bigquery-public-data.

 

Blazingly fast CLI search against your warehouse

 

Whale lives and breathes on the command line, providing rich, millisecond search over your tables. Even with millions of tables, whale stays incredibly performant, thanks to some clever caching mechanisms and a backend rebuilt in Rust. You won't notice any search latency [probably — hey there, Google DS].

A demo of whale, searching through one million tables.
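The speed comes from the fact that search never has to touch the warehouse at all: it is just a scan over the locally cached text stubs. Whale's real search backend is the Rust CLI described above; the Python sketch below is only a conceptual illustration, and the cache path and ranking logic are assumptions.

```python
# Conceptual illustration of local search: scan the cached markdown stubs
# on disk rather than querying the warehouse. Whale's real backend is a
# Rust CLI; the cache path and scoring below are assumptions for this sketch.
from pathlib import Path

CACHE_DIR = Path("metadata")  # hypothetical local metadata cache

def search(term: str, limit: int = 10) -> list[str]:
    """Return table stubs whose names contain the search term."""
    term = term.lower()
    hits = []
    for stub in CACHE_DIR.glob("*.md"):
        name = stub.stem.lower()
        if term in name:
            # Rank matches where the term appears earlier in the name higher.
            hits.append((name.index(term), stub.stem))
    return [name for _, name in sorted(hits)[:limit]]

if __name__ == "__main__":
    for table in search("orders"):
        print(table)
```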

 

Automatic metric calculations [beta]

 

One of my least favorite things as a data scientist was running the same queries over and over simply to QA the data I was using. Whale now lets you define metrics in plain SQL that are scheduled alongside your metadata scraping pipelines. Simply add a ```metrics block to your table stub in the YAML format below, and whale will automatically schedule and run the enclosed queries.

```metrics
metric-name:
  sql: |
    select count(*) from table
```

 

In conjunction with Github, this means whale can serve as a lightweight central source of truth for metric definitions. Whale even saves these values, along with a timestamp, to a ~/.whale/metrics directory in case you want to do some plotting or deeper exploration.
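To see what "scheduled and saved with a timestamp" amounts to, here is a rough sketch of the idea: run a metric's SQL on a schedule and append the result with a timestamp so it can be plotted later. This is not whale's implementation or its on-disk format; the connection string and CSV layout are assumptions, and in practice whale handles the scheduling for you alongside the scraping pipeline.

```python
# Illustration of the metrics idea: run a metric's SQL and append a
# timestamped value for later plotting. Whale's actual on-disk format in
# ~/.whale/metrics may differ; the connection string and CSV layout are assumptions.
import csv
from datetime import datetime, timezone
from pathlib import Path
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@host/db")  # hypothetical warehouse
metrics_dir = Path.home() / ".whale" / "metrics"

def run_metric(name: str, sql: str) -> None:
    """Execute a metric query and append (timestamp, value) to a CSV file."""
    with engine.connect() as conn:
        value = conn.execute(text(sql)).scalar()
    metrics_dir.mkdir(parents=True, exist_ok=True)
    with open(metrics_dir / f"{name}.csv", "a", newline="") as f:
        csv.writer(f).writerow([datetime.now(timezone.utc).isoformat(), value])

# Same metric as the ```metrics block above.
run_metric("metric-name", "select count(*) from table")
```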

 

The future

 

After talking with users of our pre-release versions of whale, we realized people wanted deeper functionality. Why just a table search tool? Why not metrics? Why not monitoring? Why not a SQL runner? Though originally conceived as a simple CLI Dataportal/Amundsen companion tool, whale v1 has already grown into a fully-featured standalone platform, and we hope to see it become an essential part of the data scientist’s toolbelt.

If there’s something you want to see us build, join our Slack community, open an issue on Github, or even reach out directly on LinkedIn. We have a number of exciting features in mind already — Jinja templating, bookmarks, search filters, Slack alerts, Jupyter integration, even a CLI dashboard for metrics — and we’d love your input.

Check out the repo here.

Original. Reposted with permission.

 
