Accessing Data Commons with the New Python API Client

Easily acquire data for your analysis via reliable sources.




 

Introduction

 
Data is at the core of any data professional's work. Without useful and valid data sources, we cannot do our jobs, and poor-quality or irrelevant data wastes whatever effort we put in. That’s why access to reliable datasets is an important starting point for data professionals.

Data Commons is an open-source initiative by Google to organize the world's available data and make it accessible for everyone to use. Anyone can query its publicly available data for free. What sets Data Commons apart from other public dataset projects is that it already handles the schema work, so the data is ready to use much sooner.

Because Data Commons is so useful, knowing how to access it matters for many data tasks. Fortunately, Data Commons provides a new Python API client for exactly that purpose.

 

Accessing Data Commons with Python

 
Data Commons works by organizing data into a queryable knowledge graph that unifies information from diverse sources. At its core, it uses the schema-based model from schema.org to standardize data representations.

Using this schema, Data Commons can connect data from various sources into a single graph where nodes represent entities (such as cities, locations, and people), events, and statistical variables. Edges depict the relationships between these nodes. Each node is unique and identifiable by a DCID (Data Commons ID), and many nodes include observations, which are measurements linked to a variable, an entity, and a period.
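To make the node, edge, and observation vocabulary concrete, here is a minimal toy model in plain Python. It is only a sketch: the class names, the `observations_for` helper, and the stored value are illustrative assumptions, not the way Data Commons stores its graph.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node identified by a unique DCID."""
    dcid: str
    name: str
    # Outgoing edges: property name -> list of target identifiers
    edges: dict[str, list[str]] = field(default_factory=dict)


@dataclass(frozen=True)
class Observation:
    """A measurement of one variable for one entity in one period."""
    variable: str  # DCID of the statistical variable
    entity: str    # DCID of the entity being measured
    date: str      # period of the measurement
    value: float


# Toy graph: the DCIDs are the ones used later in this article
graph = {
    "country/IDN": Node("country/IDN", "Indonesia", {"typeOf": ["Country"]}),
    "worldBank/GFDD_AI_25": Node(
        "worldBank/GFDD_AI_25",
        "ATMs per 100,000 adults",
        {"typeOf": ["StatisticalVariable"]},
    ),
}

# Placeholder value for illustration only, NOT a real data point
observations = [Observation("worldBank/GFDD_AI_25", "country/IDN", "2020", 0.0)]


def observations_for(entity_dcid: str, variable_dcid: str) -> list[Observation]:
    """Return the observations linking one entity to one variable."""
    return [
        o for o in observations
        if o.entity == entity_dcid and o.variable == variable_dcid
    ]
```

The point of the sketch is that an observation only makes sense as a combination of variable, entity, and period, which is why the queries later in the article ask for all three.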

With the Python API, we can easily access the knowledge graph to acquire the necessary data. Let’s see how.

First, we need to acquire a free API key to access Data Commons. Create a free account and copy the API key to a secure location. You can also use the trial API key, but access is more limited.

Next, install the Data Commons Python library. We will use the V2 API client, as it is the most recent version. Run the following command to install the client with optional support for Pandas DataFrames.

pip install "datacommons-client[Pandas]"

 

With the library installed, we are ready to fetch data using the Data Commons Python client.

To create the client that will access the data from the cloud, run the following code.

from datacommons_client.client import DataCommonsClient

client = DataCommonsClient(api_key="YOUR-API-KEY")
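Since the API key should stay in a secure location, one option is to read it from an environment variable instead of hard-coding it. The sketch below assumes a variable named `DC_API_KEY`; both that name and the `load_api_key` helper are hypothetical choices for this example, not something the library requires.

```python
import os


def load_api_key(env_var: str = "DC_API_KEY") -> str:
    """Read the API key from an environment variable, failing loudly if unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return key


# Usage (requires the datacommons-client package installed earlier):
# from datacommons_client.client import DataCommonsClient
# client = DataCommonsClient(api_key=load_api_key())
```

Failing early with a clear message tends to be easier to debug than an authentication error returned by the first query.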

 

One of the most important concepts in Data Commons is the entity, which refers to a persistent, physical thing in the real world, such as a city or a country. Entities matter when fetching data, as most datasets require you to specify one. You can visit the Data Commons Place page to learn about all available entities.

For most users, the data that we want to acquire is more specific: the statistical variables stored in Data Commons. To select the data we want to retrieve, we need to know the DCID of the statistical variables, which you can find via the Statistical Variable Explorer.

 
[Image: the Statistical Variable Explorer]
 

You can filter variables and select a dataset from the options above. For example, choose the World Bank dataset for “ATMs per 100,000 adults.” In this case, you can obtain the DCID by examining the information provided in the explorer.

 
[Image: the DCID shown in the explorer for the selected variable]
 

If you click on the DCID, you can see all the information related to the node, including how it connects to other information.

 
[Image: the node page for the selected DCID]
 

Along with the statistical variable DCID, we also need to specify the entity DCID for the geography. We can explore the Data Commons Place page mentioned above, or we can use the following code to see the available DCIDs for a given place name.

# Look up DCIDs by place name (returns multiple candidates)
resp = client.resolve.fetch_dcids_by_name(names="Indonesia").to_dict()
dcid_list = [c["dcid"] for c in resp["entities"][0]["candidates"]]
print(dcid_list)

 

With output similar to the following:

['country/IDN', 'geoId/...' , '...']

 

Using the code above, we fetch the DCID candidates available for a specific place name. For example, among the candidates for “Indonesia,” we can select country/IDN as the country DCID.
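When resolving several names at once, it can help to collect the candidates per name and to tolerate names with no match. The following is a minimal sketch that works on a plain dictionary; it assumes the response has the `entities` / `candidates` / `dcid` shape used above and that each entry carries the queried name under a `node` key. The `candidates_by_name` helper and the sample data are hypothetical, written by hand rather than taken from real API output.

```python
from __future__ import annotations


def candidates_by_name(resp: dict) -> dict[str, list[str]]:
    """Map each queried name to its candidate DCIDs, tolerating missing keys."""
    result = {}
    for entity in resp.get("entities", []):
        name = entity.get("node", "")  # assumed key holding the queried name
        result[name] = [
            c["dcid"] for c in entity.get("candidates", []) if "dcid" in c
        ]
    return result


# Hand-written sample in the assumed response shape (not real API output)
sample = {
    "entities": [
        {"node": "Indonesia", "candidates": [{"dcid": "country/IDN"}]},
        {"node": "Atlantis", "candidates": []},
    ]
}

print(candidates_by_name(sample))
```

With the live client, the same helper could be applied to the dictionary returned by `.to_dict()`, after checking that the real response matches the assumed shape.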

We now have everything we need, so we can fetch the observations with the following code:

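# DCIDs of the statistical variable and the entity, each passed as a list
variable = ["worldBank/GFDD_AI_25"]
entity = ["country/IDN"]

# Fetch every available observation as a Pandas DataFrame
df = client.observations_dataframe(
    variable_dcids=variable,
    date="all",
    entity_dcids=entity
)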

 

The result is returned as a Pandas DataFrame, shown below.

 
[Image: the DataFrame returned by the query]
 

Because we set date="all", this code returns all available observations for the selected variables and entities across the entire time frame. In the code above, you will also notice that we are using lists instead of single strings.

This is because we can pass multiple variables and entities simultaneously to acquire a combined dataset. For example, the code below fetches two distinct statistical variables and two entities at once.

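# Two statistical variables and two entities, queried in one call
variable = ["worldBank/GFDD_AI_25", "worldBank/SP_DYN_LE60_FE_IN"]
entity = ["country/IDN", "country/USA"]

# Returns one combined DataFrame covering every variable/entity pair
df = client.observations_dataframe(
    variable_dcids=variable,
    date="all",
    entity_dcids=entity
)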

 

With output like the following:

 
[Image: the combined DataFrame for both variables and both entities]
 

You can see that the resulting DataFrame combines the variables and entities you set previously. With this method, you can acquire the data you need without executing separate queries for each combination.
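A combined result like this is often easier to analyze with one column per variable. The sketch below reshapes a small hand-made DataFrame with Pandas; it assumes the result is in long format with columns named `date`, `entity`, `variable`, and `value`, so check `df.columns` on your own result first. The values are placeholders, not real observations.

```python
import pandas as pd

# Hand-made stand-in for the returned DataFrame; column names are assumed
# and the values are placeholders, NOT real observations
df_sample = pd.DataFrame(
    {
        "date": ["2019", "2019", "2019", "2019"],
        "entity": ["country/IDN", "country/USA", "country/IDN", "country/USA"],
        "variable": [
            "worldBank/GFDD_AI_25",
            "worldBank/GFDD_AI_25",
            "worldBank/SP_DYN_LE60_FE_IN",
            "worldBank/SP_DYN_LE60_FE_IN",
        ],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

# One row per (entity, date), one column per statistical variable
wide = df_sample.pivot_table(
    index=["entity", "date"], columns="variable", values="value"
)
print(wide)
```

Note that `pivot_table` averages duplicate rows by default, so if several sources report the same variable, entity, and date, you may prefer to filter to one source before reshaping.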

That’s all you need to know about accessing Data Commons with the new Python API client. Use this library whenever you need reliable public data for your work.

 

Wrapping Up

 
Data Commons is an open-source project by Google aimed at democratizing data access. The project is inherently different from many public data projects, as the datasets are built on top of a knowledge graph schema, which makes the data easier to unify.

In this article, we explored how to access datasets within the graph with Python, using statistical variables and entities to retrieve observations.

I hope this has helped!
 
 

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.

