Graph Analytics Using Big Data

An overview and a small tutorial showing how to analyze a dataset using Apache Spark, GraphFrames, and Java.



By Rajat Mehta.

Graphs are one of the most popular concepts in computer science, and they are used extensively in real-world applications: the GPS on your phone or in your car that shows you the shortest path to your destination, or a social network that suggests friends you can add to your list. As the amount of data grows, the graph concepts themselves (breadth-first search, Dijkstra's algorithm, and so on) stay the same, but the way the graphs are built changes. Take the case of a social network: a single person can have hundreds of connections, and each of those connections may in turn be connected to hundreds of other users who may physically be in a different country altogether. Storing all of this information in a typical relational database would not scale, so we need technologies that cater to data at this scale; hence the use of big data systems.

Here is what we will cover in this article:

  • Building graphs on big data stored in HDFS using GraphFrames on top of Apache Spark.
  • Analyzing a real-world flights dataset using graphs on top of big data.

GraphFrames

To build and analyze graphs on big data with Apache Spark, we use the open source GraphFrames library. It is currently the only option for building and analyzing graphs on Spark from Java. Spark ships with an excellent built-in graph library, GraphX, but it is tightly coupled to Scala, and I did not try using it from Java. GraphFrames is also massively scalable because it is built on top of DataFrames, and, as you will see, it is much easier to use.
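Before moving on, here is a minimal, self-contained sketch of the core GraphFrames API from Java. The airport codes and the single route below are made up purely for illustration, and the graphframes package (which is distributed separately from Spark) is assumed to be on the classpath, for example added via spark-submit --packages with the coordinate matching your Spark and Scala versions.

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.graphframes.GraphFrame;

public class GraphFramesSketch {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("graphframes-sketch")   // placeholder app name for a local run
                .master("local[*]")
                .getOrCreate();

        // Vertices: GraphFrames requires an "id" column.
        StructType vSchema = new StructType()
                .add("id", DataTypes.StringType)
                .add("name", DataTypes.StringType);
        Dataset<Row> vertices = session.createDataFrame(Arrays.asList(
                RowFactory.create("JFK", "John F. Kennedy Intl"),
                RowFactory.create("LAX", "Los Angeles Intl")), vSchema);

        // Edges: GraphFrames requires "src" and "dst" columns referencing vertex ids.
        StructType eSchema = new StructType()
                .add("src", DataTypes.StringType)
                .add("dst", DataTypes.StringType);
        Dataset<Row> edges = session.createDataFrame(Arrays.asList(
                RowFactory.create("JFK", "LAX")), eSchema);

        // Build the graph; GraphFrame.apply is the factory on the Scala companion
        // object and is callable from Java. Both sides remain ordinary DataFrames.
        GraphFrame graph = GraphFrame.apply(vertices, edges);
        graph.vertices().show();
        graph.edges().show();

        session.stop();
    }
}

The key convention to remember is that the vertex DataFrame must expose an id column and the edge DataFrame must expose src and dst columns that reference those ids.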

Graph Analytics on the Airports and Flights Dataset

For our analysis we use a very popular real-life dataset obtained from the OpenFlights database (https://openflights.org/data.html). It consists of three datasets:

Airports Dataset

This dataset contains information about airports, as shown below:

Attribute | Description
Airport ID | The id given to each airport (one per row in this dataset)
Airport IATA Code | 3-letter IATA code; null if not assigned/unknown
Airport ICAO Code | 4-letter ICAO code; null if not assigned
Airport Name | Name of the airport
Country | Country in which the airport is located
State | State in which the airport is located

Routes Dataset
This dataset contains information about the routes between the airports as shown:

Attribute | Description
Airline | 2-letter (IATA) or 3-letter (ICAO) code of the airline
Airline ID | Unique OpenFlights identifier for the airline
Source airport | 3-letter (IATA) or 4-letter (ICAO) code of the source airport
Source airport ID | Unique OpenFlights identifier for the source airport
Destination airport | 3-letter (IATA) or 4-letter (ICAO) code of the destination airport
Destination airport ID | Unique OpenFlights identifier for the destination airport

Airlines Dataset
This dataset contains information about the airlines that appear in the routes data:

Attribute | Description
Airline ID | Unique OpenFlights identifier for the airline
Name | Name of the airline
IATA | 2-letter IATA code, if available
ICAO | 3-letter ICAO code, if available
Country | Country or territory where the airline is incorporated

Let's start our analysis using Apache Spark and GraphFrames.

Analysis of Flights data
Before we run any analysis, we will write the regular Spark boilerplate code to get started: we create a SparkSession so we can load our datasets.

// The app name and master URL below are placeholders for a local run; adjust them for your cluster.
SparkConf conf = new SparkConf().setAppName("FlightGraphAnalytics").setMaster("local[*]");
SparkSession session = SparkSession.builder().config(conf).getOrCreate();

 

Let's now load the airports dataset. Even though this file is stored locally here, it could just as well reside in HDFS or in Amazon S3; Apache Spark is flexible enough to let us pull it from any of these.

Dataset<Row> rawDataAirport = session.read().csv("data/flight/airports.dat");

 

Now let's look at the first few rows of this data. Spark has a handy show() method for this:

rawDataAirport.show();
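
Since airports.dat has no header row, Spark assigns generic column names such as _c0 and _c1. As a small follow-up sketch, we can give the columns we care about meaningful names; the positional mapping used below is an assumption for illustration and should be checked against the attribute table above and your copy of the file.

import static org.apache.spark.sql.functions.col;

// NOTE: the _c0/_c1/_c3 positions are assumptions for illustration;
// verify them against your airports.dat before relying on them.
Dataset<Row> airports = rawDataAirport.select(
        col("_c0").alias("id"),
        col("_c1").alias("name"),
        col("_c3").alias("country"));
airports.show(5);

Naming the identifier column id is deliberate: as the GraphFrames sketch earlier showed, the vertex DataFrame of a graph must expose an id column.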