Populating a GRAKN.AI Knowledge Graph with the World

This updated article describes how to move SQL data into a GRAKN.AI knowledge graph.



By Jo Stichbury, Grakn Labs.

Grakan.ai
Image by davecito is licensed under CC BY 2.0

Highly interconnected data from complex domains is challenging to model, maintain and query. With GRAKN.AI, which is an open source, distributed knowledge base (think a graph database with extra punch), building intelligent systems becomes dramatically easier.

If you are thinking of trying GRAKN.AI, the chances are that you already have some data and want to see what our stack can do. This post shows how to move data from a SQL database to GRAKN.AI, and how to use our visualiser to explore the data.

I’m going to use a simple, well-known, example set of data about cities and countries of the world. In the course of this post, I will go through the basics of how to set up a SQL database and make some simple queries (which you can omit if you’re already a SQL user), then I will explain how to migrate the contents of the database into GRAKN.AI. I’ll use our declarative query language, Graql, to make the same queries as shown in SQL. You can find the code for this example on Github.

I have used MySQL for this example, although the Grakn Labs team have also tested with Oracle and PostgresQL. The version of GRAKN.AI I used was 0.12.1 but, as ever, I recommend that you use the latest version.

 

Setting up MySQL

 
If you already have a MySQL setup, you may want to skip ahead to “Hello World” section.

Getting MySQL installed is relatively straightforward. I followed these instructions for MacOS X (downloading the installation package here). It wasn’t totally clear how to start the MySQL Server (I eventually worked out that installation puts an icon in your System Preferences menu). Having started the server, you need to start the MySQL shell from the terminal

mysql -u root -p


On first use, change the temporary password it was installed with (I set my new password to root, which is the same as the username, so it was memorable).

ALTER USER ‘root’@’localhost’ IDENTIFIED BY ‘root’;


 

“Hello World”

 
Once I had MySQL set up, it was time to open the example world database from the command line, as described here.

SOURCE [sql-path]/world.sql;


It is a classic example dataset published by MySQL that is easily understood and is commonly used by beginners to introduce them to SQL queries (however, I cannot guarantee that the data it contains is accurate).

Using the following query to retrieve information about the columns of the city table:

DESCRIBE city;


As you can see, there are 5 columns in the table (confusingly shown as rows in the output): ID, Name, CountryCode, District and Population. The ID is the Primary Key for the table, which uniquely identifies each record. This will be important later, when we migrate to Grakn.

The following statement gives us a peek at the first 10 items of city data:

SELECT * from city LIMIT 10;


Let’s get the information about a specific city: Sydney, Australia

SELECT * FROM City WHERE Name = 'Sydney';


You can experiment in a similar way with the country and countrylanguage tables.

 

Migrating to GRAKN.AI

 
OK, you should have a reasonable handle on the data, so now let’s migrate it into GRAKN.AI. The first thing, if you’ve not done it already, is to follow the quickstart guide to download GRAKN.AI and start the Grakn engine.

Ontology

There are limitations on the SQL format that prevent it from expressing the semantics of the data. By “semantic”, I mean that the meaning of the data cannot easily be encoded alongside the data itself. In contrast, a knowledge graph is self-descriptive, or, simply put, it provides a single place to find the data and understand what it’s all about.To have the full benefit of a knowledge graph, we must write the ontology for the dataset.

Writing an ontology for every field in the SQL tables will be too unwieldy so we will take just some of the data contained in the table, as follows:

insert 

country sub entity    
  has countrycode    
  has name    
  has surfacearea    
  has indepyear    
  has population    
  has lifeexpectancy    
  has gnp    
  has gnpold    
  has localname    
  has governmentform    
  has headofstate    
  plays speaks-language     
  plays contains-city; 

city sub entity    
  has population    
  has name    
  plays in-country; 

language sub entity    
  has name    
  plays language-spoken; 

name sub resource datatype string;
countrycode sub resource datatype string;
surfacearea sub resource datatype double; 
indepyear sub resource datatype long; 
population sub resource datatype long;
lifeexpectancy sub resource datatype double;
gnp sub resource datatype double;
gnpold sub resource datatype double;
localname sub resource datatype string;
governmentform sub resource datatype string;
headofstate sub resource datatype string;
iscapital sub resource datatype boolean;
isofficial sub resource datatype boolean;
percentage sub resource datatype double; 

speaks sub relation    
  relates speaks-language    
  relates language-spoken    
  has percentage    
  has isofficial; 
 
has-city sub relation    
  relates contains-city    
  relates in-country    
  has iscapital;

speaks-language sub role;
language-spoken sub role;
contains-city sub role;
in-country sub role;


We define 3 entities to represent country, city and language, and some relations between them (between language and country and country and city).

In the terminal, load the ontology as follows:

<path-to-GRAKN.AI>/bin/graql.sh -f ./ontology.gql


Here’s a visual representation of the ontology:

Migration Templates

Once we have written and loaded the ontology for the dataset, we need to use the Graql templating language to instruct the SQL migrator on how the SQL data is mapped to the above ontology. The SQL migrator applies the templates we provide to the results of an SQL query, to each row of results in turn, replacing the indicated sections in the template with the corresponding data. The column header is the key, while the content of each row at that column is the value.

To migrate the country data, the template code is as follows:

insert $country isa country    
  has name <Name>    
  has countrycode <Code>    
  if(<IndepYear> != null) do { has indepyear <IndepYear> }
  if(<Population> != null) do { has population <Population> }
  if(<LifeExpectancy> != null) do { has lifeexpectancy <LifeExpectancy> }    
  if(<GNP> != null) do { has gnp <GNP> }    
  if(<GNPold> != null) do { has gnpold <GNPold> }    
  if(<LocalName> != null) do { has localname <LocalName>  }
  if(<GovernmentForm> != null) do { has governmentform <GovernmentForm> }    
  if(<HeadOfState> != null) do { has headofstate <HeadOfState> };


The language migration template:

insert $language isa language has name <language>;


Then, to insert a relation between the language and the countries in which it is spoken, there is a match-insert query, which matches a language and country, then builds a relation between them:56

match    
  $language isa language has name <Language>;    
  $country isa country has countrycode <CountryCode>;
insert    
 $relation (speaks-language: $country, language-spoken: $language) isa speaks        
  has isofficial if(<IsOfficial> = "F") do { false } else { true }
  has percentage <Percentage>;t6


For city migration, the template is as follows:

match    
  $country isa country has countrycode <CountryCode>;    
insert 
  $city isa city    
  has name <Name>    
  has population <Population>; 
  (contains-city: $country, in-country: $city) isa has-city;


To determine if it is the capital city:

if(<capital> != null) do {
match 
  $country isa country has countrycode <code>;    
  $city isa city has name <capital>;    
  $rel (in-country: $country, contains-city: $city) isa has-city;
insert    
  $rel has iscapital true;}


Migration Tools

There are two ways in which you can migrate SQL data into GRAKN.AI. You can use Java to perform the migration in a few lines of code, which is described further in our SQL migration example.

Alternatively, there is a shell script that you can call to apply the templates above to the SQL data. In effect, the shell script calls a set of Java functions so you don’t have to. I like this option, as Java is not my natural habitat. The migration documentation shows the script options, which are as follows:

usage: migration.sh sql -template <arg> -driver <arg> -user <arg> -pass <arg> -location <arg> -keyspace <arg> [-help] [-no] [-batch <arg>] [-uri <arg>] [-retry <arg>] [-v] 
 
 -a,--active <arg>     Number of tasks (batches) running on the server at any one time. Default 25.
 -b,--batch <arg>      Number of rows to execute in one                                                     transaction. Default 25.
 -c,--config <arg>     Configuration file.
 -driver <arg>         JDBC driver
 -h,--help             Print usage message.
 -k,--keyspace <arg>   Grakn graph. Required.
 -location <arg>       JDBC url (location of DB)
 -n,--no               Write to standard out.
 -pass <arg>           JDBC password
 -q,--query <arg>      SQL Query
 -r,--retry <arg>      Retry sending tasks if engine is not available
 -t,--template <arg>   Graql template to apply to the data.
 -u,--uri <arg>        Location of Grakn Engine.
 -user <arg>           JDBC username
 -v,--verbose          Print counts of migrated data.


Whether you use the shell script or Java code, what you are doing in either case is extracting SQL data using the JDBC API and importing it into a graph.

For this example, running from within the examples/example-sql-migration/shell-migration directory of the GRAKN.AI installation, I called Grakn’s migration.sh script on each of the templates shown above, passing in the appropriate query. For example, for the countries migration:

<path-to-GRAKN.AI>/bin/migration.sh sql -q "SELECT * FROM country;" -location jdbc:mysql://localhost:3306/world -user root -pass root -t ./templates/countries-template.gql -k grakn


However, to make it simpler, you can chain them all together in a batch file, which I’ve done — you can find it on Github — and simply call that:

./loader.sh <path-to-GRAKN.AI>


 

Querying the Graph

 

We can now make queries on the graph as follows. Let’s reproduce some of the queries we made previously in SQL.

SQL
SELECT * from city LIMIT 10;

Graql
match $x isa city; offset 0; limit 10;


If successful: you should see a list of 10 countries following the query.

Similarly, query for information about the city of Sydney:

SQL
SELECT * FROM city WHERE Name = ‘Sydney’;

Graql
match $x isa city, has name "Sydney";


 

Problems?

 
The Grakn team is super helpful in sorting out when things went wrong and, if you have any problems, please get in touch for help too. Just make contact via our Community page, or leave a comment below or on our Slack channel.

One issue that I hit on initially is that you need to make sure that you download the JDBC driver from here and place the .jar file (mysql-connector-java-5.1.40-bin.jar) in the /lib directory of the Grakn environment that you downloaded and set up.

 

Visualising the Data

 
The Grakn visualiser is a cool way to look at the resulting graph and explore the data. With the Grakn engine running and the graph loaded, in your browser, navigate to http://localhost:4567/ which will allow you to make queries on the graph. It’s a nice way of seeing how data is connected, which fits much better — in my view — with how I think about cities and countries, than a table with rows and columns. Let’s explore …

We can show 10 countries and their cities:

match $x isa has-city; offset 0; limit 10;


In the GRAKN.AI visualiser:

Further guidance on using the visualiser, which is rapidly evolving, can be found in the GRAKN.AI documentation.

 

Why?

 
You may be wondering why I’ve bothered moving the data from a relational database into a Grakn graph. After all, isn’t it fine as it is? Well yes, and no. Although relational databases have benefits that include simplicity and familiarity, they also have limitations. If you are just doing basic read/writes on straightforward data, SQL may well be adequate for your needs. But, remember that I chose a simple, familiar example specifically to make this article easy to follow. If you have a more complex domain with highly interconnected data, which is very probable in today’s information landscape, you will quickly see significant benefits, since describing the relationships within data is the primary characteristic of a graph database. In this aspect, a relational database cannot begin to provide the equivalent speed or flexibility as a graph.

As a knowledge base, GRAKN.AI has an additional benefit over standard graph databases, since it allows complex data modeling, verification, scaling, querying and analysis. A key step is the definition of an ontology, which facilitates the modeling of complex datasets and guarantees information consistency. Inference rules allow the extraction of implicit information from explicit data, to achieve logical reasoning over the represented knowledge.

 

Conclusion

 
At the beginning of this article, I introduced GRAKN.AI as a graph database with extra punch. I have hardly scratched the surface of what it can do, but I hope I have at least shown that it is easy to set up a Grakn graph with familiar data, and how to query and visualise it.

There are a number of blog posts on blog.grakn.ai that will give you a flavour of what GRAKN.AI can do, and an FAQ that possibly answers some of the questions you may have! If it does not, please ask away in the comments below!

 
Jo Stichbury is a Technical Author at Grakn Labs. Former Head of Technical Publications / Communications at Nokia and Symbian. PhD from the University of Cambridge. Interests include data science & machine learning, cats, cakes, driverless cars & Manchester City.

GRAKN.AI is an open source distributed knowledge graph platform to power the next generation of intelligent applications. Using the power of machine reasoning, we provided a platform to help manage and make sense of highly interconnected big data. Grakn performs machine reasoning through Graql, a graph query language capable of reasoning and graph analytics.

Original. Reposted with permission.

Related: