Graph Analytics Using Big Data
An overview and a small tutorial showing how to analyze a dataset using Apache Spark, GraphFrames, and Java.
By Rajat Mehta.
Graphs are one of the most popular concepts in computer science, and they are used extensively in real-world applications. Whether it is the GPS on your phone or in your car showing the shortest path to your destination, or a social network suggesting friends to add to your list, graphs are everywhere. As the amount of data increases, the graph concepts (breadth-first search, Dijkstra’s algorithm, etc.) remain the same, but the way the graphs are actually built changes. Take a social network: a person can have hundreds of connections, and those connections might in turn be connected to hundreds of other users, who may be in a different country altogether. Storing all this information in a typical relational database does not scale well. We therefore need technologies that cater to this scale of data, which is where big data systems come in.
So, what will we cover in this article?
- Building graphs on big data stored in HDFS using GraphFrames on top of Apache Spark.
- Analyzing a real-world flights dataset using graphs on top of big data.
To build and analyze graphs on big data with Apache Spark, we use GraphFrames, an open source library. At the time of writing, this is the only option available for building and analyzing graphs on Apache Spark using Java. Spark has an excellent built-in library, GraphX, but it is tightly coupled to Scala and I did not try using it from Java. GraphFrames is also massively scalable, as it is built on top of Spark datasets, and it is much easier to use, as you will see.
Graph Analytics on Airports and Flights dataset
This is a very popular real-life dataset that we are using for our analysis. It is obtained from the OpenFlights airports database (https://openflights.org/data.html). It consists of three datasets: airports, routes, and airlines.
The airports dataset contains information about airports, as shown below:
|Field|Description|
|---|---|
|Airport ID|The id given to the airports per row in this dataset|
|Airport IATA Code|3-letter IATA code. Null if not assigned/unknown.|
|Airport ICAO Code|4-letter ICAO code. Null if not assigned.|
|Airport Name|Name of the airport|
|Country|Country in which the airport is located|
|State|State in which the airport is located|
The routes dataset contains information about the routes between airports, as shown below:
|Field|Description|
|---|---|
|Airline|2-letter (IATA) or 3-letter (ICAO) code of the airline.|
|Airline ID|Unique OpenFlights identifier for airline|
|Source airport|3-letter (IATA) or 4-letter (ICAO) code of the source airport|
|Source airport ID|Unique OpenFlights identifier for source airport|
|Destination airport|3-letter (IATA) or 4-letter (ICAO) code of the destination airport|
|Destination airport ID|Unique OpenFlights identifier for destination airport|
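To make the link between this table and a graph concrete, here is a minimal plain-Java sketch of how one route row could map to a directed edge (source airport to destination airport). It assumes the rows are comma-separated and follow the six-column order listed above; the class name `RouteEdgeSketch` and the sample rows are made up for illustration and are not taken from the real data.

```java
class RouteEdgeSketch {

    /** A directed edge between two airports, identified by their airport codes. */
    static final class RouteEdge {
        final String airline;
        final String src;
        final String dst;

        RouteEdge(String airline, String src, String dst) {
            this.airline = airline;
            this.src = src;
            this.dst = dst;
        }
    }

    /**
     * Parses one route row. Assumed column order (from the table above):
     * airline, airline ID, source airport, source airport ID,
     * destination airport, destination airport ID.
     */
    static RouteEdge parse(String line) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        if (fields.length < 6) {
            throw new IllegalArgumentException("Expected at least 6 fields: " + line);
        }
        return new RouteEdge(fields[0], fields[2], fields[4]);
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not real OpenFlights records.
        String[] rows = {"XX,100,AAA,1,BBB,2", "XX,100,BBB,2,CCC,3"};
        for (String row : rows) {
            RouteEdge edge = parse(row);
            System.out.println(edge.src + " -> " + edge.dst + " (" + edge.airline + ")");
        }
    }
}
```

In graph terms, the airports are the vertices and each route row contributes one edge from its source airport to its destination airport.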
The airlines dataset contains information about the airlines represented in this data.
|Field|Description|
|---|---|
|Airline ID|Unique OpenFlights identifier for this airline.|
|Name|Name of the airline.|
|IATA|2-letter IATA code, if available.|
|ICAO|3-letter ICAO code, if available.|
|Country|Country or territory where airline is incorporated.|
Let’s start our analysis using Apache Spark and GraphFrames.
Analysis of Flights data
Before we run any analysis, we will write the usual Spark boilerplate code to get started. We create the Spark session, which we then use to load our datasets.
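The boilerplate described above might look roughly like the following sketch. It assumes the Spark SQL library is on the classpath; the class name, the application name `FlightsGraphAnalysis`, and the `local[*]` master setting are illustrative choices for running on a single machine, not values taken from the original article.

```java
import org.apache.spark.sql.SparkSession;

class FlightsAnalysis {

    /** Builds (or reuses) a Spark session; the names and settings here are assumptions. */
    static SparkSession createSession() {
        return SparkSession.builder()
                .appName("FlightsGraphAnalysis") // hypothetical application name
                .master("local[*]")              // run locally on all cores; change for a cluster
                .getOrCreate();
    }

    public static void main(String[] args) {
        SparkSession spark = createSession();
        System.out.println("Running Spark version " + spark.version());
        spark.stop();
    }
}
```

On a real cluster the master is usually supplied by `spark-submit` rather than hard-coded, so the `.master(...)` call would be dropped.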
Let’s now load the airports dataset. Although this file is stored locally, it could just as well reside in HDFS or in Amazon S3; Apache Spark is flexible enough to read it from any of these.
Now let’s see the first few rows of this data. Spark has a handy show() method for this.
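The load and preview steps above can be sketched together as follows. This is a sketch under stated assumptions rather than the article's original code: the file path `data/airports.dat` is hypothetical, the file is assumed to be a headerless CSV, and missing values are assumed to be written as `\N`.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class AirportsPreview {

    /**
     * Reads a headerless CSV file into a Dataset<Row>.
     * Without a header, Spark names the columns _c0, _c1, and so on.
     */
    static Dataset<Row> loadAirports(SparkSession spark, String path) {
        return spark.read()
                .option("header", "false")
                .option("nullValue", "\\N") // assumption: missing values appear as \N
                .csv(path);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("AirportsPreview") // hypothetical application name
                .master("local[*]")         // local mode, for experimentation
                .getOrCreate();

        // Hypothetical local path; an HDFS or S3 URI could be passed here instead.
        Dataset<Row> airports = loadAirports(spark, "data/airports.dat");

        // Print the first 5 rows in tabular form.
        airports.show(5);

        spark.stop();
    }
}
```

Calling `show()` with no argument prints the first 20 rows; passing a number limits the output to that many rows.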