Salaries in IT – Scrape, refine, and plot case study

A very good case study showing how to scrape data from a message board, refine it with OpenRefine, and plot the results. You will also learn about salaries versus age in Belgium.

By Stijn Diependaele, Oct 2014.

I’m currently active in IT. At one point I was curious what people of my age are earning, so I could compare. The easiest way is to check standard job sites or google it. Still, not all data is released there, only some figures. And most of all, there is no fun in doing that. Luckily we have message boards where users can anonymously post their salaries and a bunch of other interesting data. And luckily we have some easy tools to collect, refine and plot this data, enabling us to do our own research.

In the picture below you find an example of a post on the message board. As you can see, it’s divided into topics: personal, labour agreement, terms of employment and working conditions. It’s interesting to know what one person is making, but it would be more interesting to know the average, how earnings relate to age, and how the salaries are distributed when plotted in a histogram. The people posting on this forum are mostly young(er) and more active in engineering jobs, so the data is interesting for me to compare against but not representative of the entire population. Below is an example of the template people could fill in (in Dutch).

Template post

The techniques and tools that are described below require minimal technical knowledge and almost no programming skills. This was my first time using any of the mentioned tools and they were easy to learn.

Scrape the data

To collect the data I used a scraping tool that turns any website into a table of data or an API with no coding required. It can be freely downloaded from its website. In the example used here the following steps were taken:

1. Create an extractor and start by training the rows; in this case every post is a row. There are 20 rows per page here, but this can differ depending on your posts-per-page setting.


2. Next up is training the columns, which is the time-consuming part. Add a column and mark where the data is to be found. Most of the time you’ll have to train one column a few times with different rows, so the extractor really learns where to find that particular data.


Click add column


Name your column and click done


Mark the data to be extracted


Repeat for other columns

3. Do this for two example pages, training both rows and columns. Training two pages should be enough.

4. Create a Dataset and select the extractor you created


select dataset

5. Profit
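For readers who prefer to see the idea in code, the row and column training above can be pictured roughly as follows. This is only a sketch of the concept, not what the scraping tool does internally. The field labels in `FIELDS` and the example post are invented for illustration; the real template on the forum is in Dutch and uses its own labels.

```python
import csv
import io

# Hypothetical column labels, invented for this sketch.
FIELDS = ["age", "job title", "gross salary", "net salary"]

def post_to_row(post_text):
    """Turn one 'label: value' post into a dict with one entry per column."""
    row = {field: "" for field in FIELDS}
    for line in post_text.splitlines():
        label, sep, value = line.partition(":")
        label = label.strip().lower()
        # Only keep lines that look like "known label: value".
        if sep and label in row:
            row[label] = value.strip()
    return row

def posts_to_csv(posts):
    """Write every post as one CSV row; each template field is a column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for post in posts:
        writer.writerow(post_to_row(post))
    return buffer.getvalue()

# A made-up post, not taken from the forum.
example_post = """Age: 27
Job title: Java developer
Gross salary: 2800 EUR
Net salary: 1900 EUR"""

example_row = post_to_row(example_post)
```

Fields that a poster left out simply stay empty, which is also what you tend to see in the extracted table.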

In the screenshot below we now have our dataset for the first two pages of the thread. On the left you can add more pages from which you want to extract data. There is a way to paginate automatically through more pages, but unfortunately, and understandably, this is limited to 10 pages. Beyond that, you can add pages manually.
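If the thread has many pages, typing each address by hand gets tedious. A minimal sketch for generating the list to paste in, assuming the forum numbers its pages with a query parameter: the URL below and the parameter name `page` are placeholders, so check how your forum actually builds its page addresses.

```python
def page_urls(base_url, first_page, last_page, param="page"):
    """Build one URL per thread page, from first_page to last_page inclusive."""
    # Use "&" if the base URL already carries a query string.
    separator = "&" if "?" in base_url else "?"
    return [
        f"{base_url}{separator}{param}={number}"
        for number in range(first_page, last_page + 1)
    ]

# Placeholder address, not the real forum thread.
urls = page_urls("https://forum.example.com/thread/12345", 1, 12)
```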

Table with the data

Clean the data with OpenRefine

Even though they use templates, these posts on the message board are still like free-text fields. There are lots of different ways of saying how much you make, whether you have a company car, and so on. So before anything useful can be done with the data we need to “refine” it. For this I used OpenRefine, formerly known as Google Refine. OpenRefine offers a nice toolset to clean your data. Their website:

What I did in this example was clean the gross and net wages columns. OpenRefine offers basic transform actions. It also lets you write your own custom transforms in Clojure, Jython or GREL (Google Refine Expression Language).


Basic transforms


Custom transformations
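To give an idea of what such a wage clean-up involves, here is a small sketch in plain Python, written outside OpenRefine. It is not the transform used for this article. It assumes Belgian number notation (a dot or space between thousands, a comma before decimals), and the function name `parse_amount` is made up for the example.

```python
import re

# Either a number with thousands groups ("2.800" or "2 800"), or a plain
# run of digits ("2800"); both may be followed by a decimal part (",50").
AMOUNT = re.compile(r"\d{1,3}(?:[.\s]\d{3})+(?:,\d+)?|\d+(?:,\d+)?")

def parse_amount(text):
    """Return the first amount found in free text as a float, or None.

    Assumption: amounts follow Belgian notation. English-style decimals
    such as "2800.50" are not handled and lose their decimal part.
    """
    match = AMOUNT.search(text)
    if match is None:
        return None
    # Drop thousands separators, then turn the decimal comma into a dot.
    digits = re.sub(r"[.\s]", "", match.group(0))
    return float(digits.replace(",", "."))
```

Cells that contain no number at all come back as `None`, so they are easy to filter out or inspect by hand afterwards.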

OpenRefine has some other great features, for example logging all the actions you did on a dataset and then replaying them on a similar dataset. You can also extend your data with more data to derive more from it. Check to learn how.
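The replay idea is easy to picture in code: keep the clean-up steps as an ordered list and run the same list over a second dataset. The sketch below only illustrates that concept in plain Python. It does not use OpenRefine or its file formats, and the column name `gross` and the sample rows are invented.

```python
def apply_steps(rows, steps):
    """Apply each (column, function) step in order, leaving the input untouched."""
    result = []
    for row in rows:
        new_row = dict(row)  # work on a copy so the raw data stays as it was
        for column, transform in steps:
            if column in new_row:
                new_row[column] = transform(new_row[column])
        result.append(new_row)
    return result

# The recorded "history": trim whitespace, drop the currency, make it a number.
steps = [
    ("gross", str.strip),
    ("gross", lambda value: value.replace("EUR", "").strip()),
    ("gross", float),
]

# Two made-up datasets with the same layout, e.g. from two forum threads.
first_thread = [{"gross": " 2800 EUR "}]
second_thread = [{"gross": "3100 EUR"}]

cleaned_first = apply_steps(first_thread, steps)
cleaned_second = apply_steps(second_thread, steps)
```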

Plot your data

As said in the introduction, now that we have collected the data and cleaned some of it, we can get some statistics. The plotting tool lets you plot your data in a lot of insightful ways. For this example I made some basic plots: a histogram showing the distribution of the salaries, and salary versus age to see whether the two correlate. As you can see here, you can embed your plots into your blog or any website you want. They are interactive, and you can click through if you want to have a look at the data or the code itself. No knowledge of programming is required to make plots like these.

Distribution of gross salaries:

IT Salary distribution

Gross salary versus age:

IT Salary vs. Age
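The numbers behind these two plots can also be computed by hand. Below is a minimal sketch using only the Python standard library: the histogram counts salaries per bin, and the Pearson coefficient measures how strongly salary and age move together. The ages and salaries are invented sample values, not the scraped data, and the bin width of 500 is an arbitrary choice.

```python
from collections import Counter
from math import sqrt
from statistics import mean

def histogram(values, bin_width):
    """Count values per bin; each key is the lower bound of its bin."""
    counts = Counter((int(v) // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

# Invented sample values for illustration only.
ages = [24, 27, 30, 35, 41]
gross = [2200, 2600, 2900, 3400, 3900]

salary_bins = histogram(gross, 500)
age_salary_correlation = pearson(ages, gross)
```

A coefficient close to 1 means salary rises almost linearly with age, close to 0 means there is no linear relation, and a negative value means salary drops as age rises.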

Thank you for reading; hopefully you’ll put some of it to use.

Content from

Bio: Stijn Diependaele is an IT Project manager and blogger in Belgium.