Querying the Most Granular Demographics Dataset
Access to broad, detailed population data can offer enormous value to any organization that wants to reach specific demographics. However, access alone is not enough without the means to explore and visualize the data.
By Matti Grotheer, startup enthusiast and Co-Founder of Kuwala.
Many use cases require detailed population data. For example, a detailed breakdown of the demographic structure is a significant factor in predicting real estate prices or finding the right location for a retail outlet. Humanitarian projects such as vaccination campaigns or rural electrification plans also depend heavily on good population data.
It is very challenging to find high-quality and up-to-date data on a global scale for these use cases. Census data is usually published only every few years (in many countries just once a decade), so those datasets become outdated quickly. Arguably the best datasets available for population densities and demographics are published by Facebook under their Data for Good initiative. They combine official census data with their internal data and use machine learning algorithms for image recognition to determine the location and type of buildings.
Facebook Data for Good and Kuwala (2021).
Combining those sources yields a detailed statistical breakdown of demographic groups in 1-arcsecond blocks, a resolution of approximately 30 meters. Each square contains statistical values for the following demographic groups:
- Children under 5
- Youth 15–24
- Elderly 60 plus
- Women of reproductive age 15–49
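The 30-meter figure for a 1-arcsecond block can be checked with a little arithmetic. The sketch below assumes a spherical Earth with the mean radius, which is a simplification; the function name `arcsecond_size_m` is ours and not part of any dataset tooling.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters (spherical assumption)


def arcsecond_size_m(latitude_deg: float) -> tuple[float, float]:
    """Return the (north-south, east-west) extent in meters of one arcsecond."""
    one_arcsecond_rad = math.radians(1 / 3600)
    # Along a meridian, arc length is simply radius times angle.
    north_south = EARTH_RADIUS_M * one_arcsecond_rad
    # Parallels shrink with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


ns_equator, ew_equator = arcsecond_size_m(0.0)
ns_berlin, ew_berlin = arcsecond_size_m(52.5)  # roughly the latitude of Berlin
print(f"equator: {ns_equator:.1f} m x {ew_equator:.1f} m")
print(f"52.5 N:  {ns_berlin:.1f} m x {ew_berlin:.1f} m")
```

At the equator a block is close to 31 meters on each side. Further north or south the east-west extent shrinks, so "approximately 30 meters" is best read as the north-south size.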
For each country, Facebook provides one file per demographic group, as either a GeoTIFF or a CSV. The CSV contains the latitude and longitude of each cell and the respective population value.
Because the data is stored at 1-arcsecond resolution, a single country amounts to gigabytes and millions of rows of data. If you want to prototype or visualize the data for a single city, you have to dig through these large files and parse out the relevant rows yourself.
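To make the CSV layout concrete, here is a minimal sketch of reading such a file with the Python standard library. The column names `latitude`, `longitude`, and `population` are assumptions for illustration (the real headers may differ per file), and the sample values are made up.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Cell(NamedTuple):
    latitude: float
    longitude: float
    population: float


def read_cells(handle) -> Iterator[Cell]:
    """Yield one Cell per CSV row; column names are assumed, not official."""
    for row in csv.DictReader(handle):
        yield Cell(
            float(row["latitude"]),
            float(row["longitude"]),
            float(row["population"]),
        )


# Made-up sample rows standing in for a real country file.
SAMPLE = """latitude,longitude,population
35.8989,14.5146,12.5
35.8992,14.5146,8.25
"""

cells = list(read_cells(io.StringIO(SAMPLE)))
total = sum(cell.population for cell in cells)
print(len(cells), total)
```

Because `read_cells` is a generator, it can also stream a multi-gigabyte file row by row from `open(path, newline="")` instead of loading it into memory.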
That is why we created an open-source wrapper that exposes the data through a package. You can download the data for entire countries directly through a CLI. We preprocess the data to make it easily queryable. For that, we use Uber's H3 spatial indexing.
Thanks to the H3 indexing, it is easy to build queries on top of the database. Using either H3 cells or coordinate pairs, you can retrieve the population based on a point, a given radius, or a polygon. Furthermore, it is straightforward to aggregate the population on a zip code level, for example.
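To illustrate what a radius query and an area-level aggregation compute, here is a brute-force sketch in plain Python. It deliberately does not use H3 or the Kuwala API: it measures the haversine distance to every cell, which is exactly the per-row work that a spatial index lets you avoid by looking up precomputed cell IDs. All names, the sample cells, and the `area_of` lookup are hypothetical.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two coordinate pairs."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def population_within(cells, lat, lon, radius_m):
    """Sum the population of all cells whose center lies within radius_m."""
    return sum(
        pop for clat, clon, pop in cells
        if haversine_m(lat, lon, clat, clon) <= radius_m
    )


def aggregate_by_area(cells, area_of):
    """Sum population per area; area_of maps (lat, lon) to an area key."""
    totals = defaultdict(float)
    for clat, clon, pop in cells:
        totals[area_of(clat, clon)] += pop
    return dict(totals)


# Made-up cells as (latitude, longitude, population).
cells = [
    (35.8989, 14.5146, 10.0),
    (35.8992, 14.5146, 5.0),   # about 33 m north of the first cell
    (36.0000, 14.3000, 7.0),   # more than 20 km away
]

nearby = population_within(cells, 35.8989, 14.5146, radius_m=100)
# Hypothetical stand-in for a zip code lookup.
by_area = aggregate_by_area(cells, lambda lat, lon: "north" if lat >= 35.95 else "south")
print(nearby, by_area)
```

With an index like H3, each cell would carry a precomputed ID, so the radius query becomes a lookup of the IDs around the query point rather than a distance test against every row.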
Uber H3 and Kuwala (2021).
The data integration follows a sequential process. Spark automatically loads and links the CSV files for each country and demographic group and stores the result efficiently in a Parquet file. The Parquet file is then automatically loaded into Neo4j, a graph database. From there, Cypher queries can target specific polygons, points with a given radius, and different aggregations using H3. For a medium-sized country like Germany (approximately 7-8 GB), the data is processed locally in less than 30 minutes and is then ready for your spatial analysis.
We chose Neo4j as the database because it makes it straightforward to connect the other pipelines of the Kuwala ecosystem. In a similar process, POI information from OpenStreetMap can be loaded and directly related to the demographics data. Many more geo-related sources, such as Google Trends, location-based urban events, or social media data, will follow as connectors, enabling fast and holistic queries on comparable worldwide datasets.
For quick data exploration and visualization, you can directly create datasets compatible with Kepler.gl or Unfolded.ai to make beautiful maps. We published an example map for Malta. It immediately shows where the highly populated regions are and where the heart of the city is.
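The "linking" step in this pipeline can be pictured as a join of the per-group files on the cell coordinates. The sketch below shows that idea with the standard library only; it is a small-scale stand-in for the Spark job, not its actual code, and the column names, group names, and sample values are assumptions.

```python
import csv
import io
from collections import defaultdict


def link_groups(sources):
    """Join per-group CSVs into one record per cell.

    sources maps a group name to a file-like CSV object with the assumed
    columns latitude, longitude, and population.
    """
    linked = defaultdict(dict)
    for group, handle in sources.items():
        for row in csv.DictReader(handle):
            # Coordinates are matched as exact strings, which assumes that
            # all files of one country share the same grid and formatting.
            key = (row["latitude"], row["longitude"])
            linked[key][group] = float(row["population"])
    return dict(linked)


# Made-up sample files for two demographic groups.
children = io.StringIO(
    "latitude,longitude,population\n"
    "35.8989,14.5146,1.5\n"
    "35.8992,14.5146,0.5\n"
)
elderly = io.StringIO(
    "latitude,longitude,population\n"
    "35.8989,14.5146,2.0\n"
)

linked = link_groups({"children_under_5": children, "elderly_60_plus": elderly})
for cell, groups in linked.items():
    print(cell, groups)
```

A cell that is missing from one file simply lacks that group's key, which corresponds to an outer join. At country scale, the same join is what a distributed engine and a columnar format such as Parquet make practical.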
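As a rough idea of what a map-ready dataset can look like, the sketch below writes cells to a plain CSV with coordinate and value columns, a format that Kepler.gl accepts as an upload. This is not the export produced by the Kuwala tooling; the function name, column names, and sample values are assumptions, and whether the coordinate columns are picked up automatically depends on the tool's naming conventions.

```python
import csv
import io


def to_map_csv(cells):
    """Serialize (latitude, longitude, population) tuples to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["latitude", "longitude", "population"])
    writer.writerows(cells)
    return buffer.getvalue()


# Made-up sample cells.
cells = [
    (35.8989, 14.5146, 12.5),
    (35.8992, 14.5146, 8.25),
]

output = to_map_csv(cells)
print(output)
```

Writing `output` to a `.csv` file and dragging it into the map tool should be enough for a first look at the population distribution.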
With Facebook's population data now directly queryable, creating predictive models or visualizations is much faster, so data teams can spend their time on value-adding tasks. That is also the main reason why we are building an open-source community for third-party data integration with Kuwala. So if you want to get your hands on more connectors like these, star us on GitHub and join our Slack community.
But our open-source project does not stop here. Our big goal is to facilitate access to external data sources, ensure data quality, and help data scientists quickly develop features that they can incorporate into their modeling. For example, we are planning a Jupyter notebook that can be used to manipulate and observe the data swiftly. So stay tuned for that!