What is a Data Catalog and Why Should You Care?
Learn why data catalogs could be just the thing you need to meet the challenges of data and metadata management and collaboration.
By Kushal Saini, Content Strategist, Atlan.
Two data scientists walk into a library at the end of a long day...
Data scientist #1 to the librarian: “Can I get a copy of this book on statistical methods?” Goes on to share the name of the obscure book.
Data scientist #2 to Data scientist #1: “They’ll never be able to find that book.”
The librarian clacks away on the keyboard for a couple of seconds before replying: “Found it! Here are the details of its author, publishing house, and borrowing history. Oh, and someone left a comment saying they found it super useful for understanding logistic regressions. I can grab it for you in a jiffy.”
Data scientist #1 to Data scientist #2: “Ummmm... why can’t the same thing happen with our data?”
But what if it could? Enter data catalogs: the missing link in your data lake. Now get the data you need with the context you need!
First... what is a data catalog?
As seen in the chat above, a data catalog is a library or inventory of all your data sets—a place where all your data is neatly indexed, organized, and kept ready for use.
(If Monica from Friends made a data catalog, this would be it—neat to the T!)
According to leading research firm Gartner: “A data catalog creates and maintains an inventory of data assets through the discovery, description, and organization of distributed datasets. The data catalog provides context to enable data stewards, data/business analysts, data engineers, data scientists and other lines of business (LOB) data consumers to find and understand relevant datasets for the purpose of extracting business value.”
But more importantly, Gartner goes on to say: “Modern machine-learning-augmented data catalogs automate various tedious tasks involved in data cataloging, including metadata discovery, ingestion, translation, enrichment and the creation of semantic relationships between metadata. These next-generation data catalogs can, therefore, propel enterprise metadata management projects by allowing business users to participate in understanding, enriching, and using metadata to inform and further their data and analytics initiatives.”
Thus, modern data catalogs can help you manage your metadata (aka metadata management) in a way that you can easily curate and access important business context around your data—along with your data itself.
Sounds like a dream? Well, it’s possible!
A truly powerful data catalog can help you:
- Create a repository for all your data, including the structure, quality, definitions and stats on usage of the data
- Allow users to access the metadata alongside the data itself
- View and understand the lineage of the data—including the source, the transformations applied, and who has been using it
- Ensure data consistency and accuracy by updating itself automagically, while allowing humans to edit and remain in the loop
- Simplify data governance and compliance by providing a graphical representation of the lineage of the data assets—tracing it across its lifecycle.
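To make the list above concrete, here is a minimal in-memory sketch of what a catalog entry and a lineage lookup might look like. It is only an illustration: the class names (`CatalogEntry`, `DataCatalog`), the fields, and the sample datasets are all hypothetical, and a real catalog tool would persist this metadata and populate much of it automatically.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata for one dataset. All field names are illustrative assumptions."""
    name: str
    description: str
    columns: dict[str, str]                              # column name -> data type
    upstream: list[str] = field(default_factory=list)    # datasets this one is derived from
    ratings: list[int] = field(default_factory=list)     # user ratings, e.g. 1 to 5
    access_count: int = 0                                # simple usage statistic

    def average_rating(self) -> float | None:
        return sum(self.ratings) / len(self.ratings) if self.ratings else None


class DataCatalog:
    """A toy catalog: register entries, search them, and trace lineage."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def get(self, name: str) -> CatalogEntry:
        entry = self._entries[name]
        entry.access_count += 1          # record that someone looked this dataset up
        return entry

    def search(self, keyword: str) -> list[str]:
        """Return names of entries whose name or description mentions the keyword."""
        kw = keyword.lower()
        return sorted(
            e.name for e in self._entries.values()
            if kw in e.name.lower() or kw in e.description.lower()
        )

    def lineage(self, name: str) -> list[str]:
        """Walk upstream links breadth-first, nearest ancestors first."""
        seen: list[str] = []
        queue = deque(self._entries[name].upstream)
        while queue:
            current = queue.popleft()
            if current in seen:
                continue
            seen.append(current)
            if current in self._entries:
                queue.extend(self._entries[current].upstream)
        return seen


# Hypothetical sample data
catalog = DataCatalog()
catalog.register(CatalogEntry("raw_orders", "Raw order events from the web shop",
                              {"order_id": "string", "total": "float"}))
catalog.register(CatalogEntry("clean_orders", "Deduplicated orders with validated totals",
                              {"order_id": "string", "total": "float"},
                              upstream=["raw_orders"]))
catalog.register(CatalogEntry("revenue_report", "Monthly revenue summary",
                              {"month": "date", "revenue": "float"},
                              upstream=["clean_orders"]))
catalog.get("clean_orders").ratings.extend([5, 4])
```

Calling `catalog.lineage("revenue_report")` traces the report back through `clean_orders` to `raw_orders`, which is the kind of question ("where did this data come from?") a catalog is meant to answer without asking around.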
But wait, still not convinced on...
Why you even need a data catalog?
Here’s the short of it. If you need to use data, understand your data and where it came from, and share this data with your team securely—you need a data catalog.
Too oversimplified? Well, here’s the long of it.
If you’re reading this article, you know that companies today are dealing with vast amounts of data.
- According to IDC’s Data Age 2025 report, the Global Datasphere will grow from 33 Zettabytes (ZB) in 2018 to 175 ZB by 2025. IDC defines the datasphere as the sum of all data created, captured or replicated across core sources (traditional and cloud data centers), edge sources (like cell towers and branch offices) and data endpoints (such as PCs, smartphones, and Internet of Things or IoT devices).
“If it feels like we’re all drinking from a data firehose, it’s because we are,” said Mary Meeker about data growth while releasing the Internet Trends 2019 report.
And companies that can harness the enormous signal power of this data are expected to win.
- According to Booz Allen Hamilton’s Data Science Playbook, businesses that deploy analytics across most of the organization, align daily operations with senior management’s goals, and incorporate big data will see a 1,000 percent increase in ROI.
- As expected, companies are investing heavily in big data to gain a competitive edge—they were expected to invest $114 billion in big data in 2018, up from $31 billion in 2013.
But there are many challenges in this process of becoming data-driven.
One of the primary challenges is enabling teams to discover, understand, govern and consume the data they need to make better decisions.
“The two biggest challenges in data management are centered around data catalogs—finding and identifying data that delivers value, and supporting data governance and data security.” - Gartner Data Management Strategy Survey 2017
And the stakes are high.
“By 2022, over 60% of traditional IT-led data catalog projects that do not use ML to assist in finding and inventorying data distributed across a hybrid/multicloud ecosystem will fail to be delivered on time, leading to derailed data management, analytics, and data science projects.” - Gartner research
But don’t take our or Gartner’s word for it. The pain of siloed and missing data is real. Here’s what we saw on Reddit.
But the proof of the pudding lies in the eating, so we ask you—yes, the business analyst trying to uncover the mystery of column latest_Kirk_02122019_keep and the IT admin who’s tired of asking for email permissions to access data.
- How much time do you spend just looking for the data you need?
- How much do you even know about your data?
- Do you know the source of your data?
- Do you know the quality of your data?
- Can you rate your data assets?
- Can you get and give data access easily and securely?
If your answer to any of the above is a big resounding “UMMMMM,” the writing’s on the wall. It’s time to get a data catalog.
The need of the hour is to remove data silos, let analytics flow at the speed of thought and create a single source of truth for your entire team.
Oh, already out of the door to shop for the latest shiny data catalog tool? Not so fast, mister, because...
Beware, there are metadata silos everywhere
Simply plugging in an isolated data catalog tool within your data lake may not be the answer to your data woes. Today’s business mandates that data be available for whoever needs it, wherever and whenever they need it (read more on DataOps here).
That’s why it’s essential for a data catalog tool to keep its data updated automagically by crowdsourcing updates and knowledge (such as versions, lineage, user ratings) AND to allow updated data to be plugged in across your data applications/analytics tools and platforms, thus creating one source of truth for your data. So that everyone stays on the same data page! (And knows how to switch between pages or even other books!)
“By 2024, machine-learning-augmented data preparation, data catalogs, data unification, and data quality tools will converge into a consolidated modern enterprise information management platform used for the majority of new analytics projects.” - Gartner 2019 Market Guide for Data Preparation
Stay ahead of the curve. Don’t get yet another data catalog tool that will create siloed metadata catalogs. Adopt a data catalog tool that will let you bring your data, human tribal knowledge, and business context together in one place. And get brownie points from your compliance team!
Going beyond traditional data catalogs with Atlan
As seen above, gone are the days when you could create one single catalog for your company via the IT Team and then direct everyone to use it. Today, the sources, users and use cases of data have multiplied and become dynamic. And data catalogs need to keep up with the times. That’s why yet another data catalog tool won’t make the cut.
We’ve put all these principles into action with Atlan—the home for data teams.
Atlan helps create a ‘living’ catalog that grows as your data and team grow. With a smart data catalog you can:
Give data its own profile: You can now create a snapshot of your data with Atlan, just like sales leads on Salesforce or code on GitHub. You can easily discover, understand, share and collaborate on data in one place. Shareable via a simple URL!
Bring human tribal knowledge alongside data: As the business context is just as important as the data itself, Atlan helps you add and share your metadata right where your data exists via README summaries, ratings, discussions, and notes.
Integrate with all the tools you already love and use: Atlan helps you create one source of truth across your entire data ecosystem by letting you seamlessly plug in data from your catalog into any other downstream or upstream application, from Microsoft Excel to Power BI.
As always, keep your humans first. Consider their needs and challenges.
As Randy Bean and Thomas H. Davenport write for HBR, “Many companies have invested heavily in technology as a first step toward becoming data-oriented, but this alone clearly isn’t enough. Firms must become much more serious and creative about addressing the human side of data if they truly expect to derive meaningful business benefits.”
Quick recap, before you go
A modern data catalog will help you:
- Create a single source of truth for your data across all its applications
- Make data cataloging a part of your data processes, not an isolated activity
- Quickly access and share the insights you need via a centralized repository
- Enforce and simplify data security and compliance (GDPR, CCPA, etc.)
And that’s it! Time to go forth and jumpstart your data management strategy—create one source of truth for your data.
Original. Reposted with permission.
Bio: Kushal Saini (@AtlanHQ) believes that everyone has a story. She currently works as a Lead Content Strategist at Atlan, a home for data teams.