Hitchhiker's Guide to Azure Machine Learning Studio

Learn Azure ML Studio through this brief hands-on tutorial. This step-by-step guide will help you get a quick start and grasp the basics of this predictive modeling tool.

Let’s now move on to the next step.

As you are well aware, the world is full of noise, dirt and mess. Data rarely comes with flowers; most of it comes with tangible thorns. Missing values are a cancer to the data, and we should always get rid of them before modeling.

Azure ML provides a Missing Values Scrubber module for that.
Drag the module to the workspace and connect the dataset node to it.
Click on it, and on the right-hand side you will see its properties.

Select Custom Substitution Value and replace all the missing values with 0.
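
If you are curious what this substitution does to the data, here is a rough sketch in plain Python. It is not how the module is implemented; the column names and values are made up for illustration, and None stands in for a missing cell.

```python
# Toy rows standing in for the dataset; None marks a missing value.
# Column names and values are hypothetical, not taken from the tutorial's data.
rows = [
    {"age": 39, "hours_per_week": 40, "capital_gain": None},
    {"age": None, "hours_per_week": 13, "capital_gain": 0},
    {"age": 28, "hours_per_week": None, "capital_gain": 2174},
]

def scrub_missing(rows, substitute=0):
    """Return new rows with every missing (None) value replaced by `substitute`."""
    return [
        {col: (substitute if val is None else val) for col, val in row.items()}
        for row in rows
    ]

cleaned = scrub_missing(rows, substitute=0)
for row in cleaned:
    print(row)
```

Keep in mind that a substituted 0 looks exactly like a genuine 0 afterwards, so this choice is simple but not always harmless.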

Let's continue with the data transformation steps. One of the most important modules that you will be working with all your life is Project Columns. This module is used to project columns, that is, to include or exclude them.

For income prediction, we have to decide which columns are useful to us and which are not.

First we search for Project Columns, add it to the workspace and connect it to the Missing Values Scrubber node.
We click Launch Column Selector in the properties pane and a window pops up.
We exclude these columns, as we believe they may not be good features for predicting income.

To learn more about feature engineering, I would recommend taking a detailed edX course on Feature Engineering.

Now if you right-click the Project Columns node and visualize the data, you will see that the above-mentioned columns are excluded from the dataset.
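
For readers who think in code, excluding columns amounts to something like the sketch below. The column names record_id and notes are hypothetical placeholders, not the columns excluded in this tutorial.

```python
def project_columns(rows, exclude):
    """Return new rows that keep every column except those listed in `exclude`."""
    excluded = set(exclude)
    return [
        {col: val for col, val in row.items() if col not in excluded}
        for row in rows
    ]

# Hypothetical rows; "record_id" and "notes" are made-up columns to drop.
rows = [
    {"record_id": 1, "age": 39, "notes": "n/a", "income": "<=50K"},
    {"record_id": 2, "age": 52, "notes": "n/a", "income": ">50K"},
]

projected = project_columns(rows, exclude=["record_id", "notes"])
print(projected)
```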


Now that we are done with data transformation, it’s time to kick into modeling.

We go back to the search pane and search for Split Data. Drag the module into the workspace and connect it to the Project Columns node.

Splitting data into train and test sets is an important concept in machine learning. When we build a predictive model, we split the data into partitions, for example 60% for training the model and 40% for testing it. The ratio can differ between problems and datasets, but the idea stays the same: the model is tested on data it has not seen during training. For more details, you can have a quick read on machine learning concepts on the Internet.
For this tutorial, I have split the data 50-50.
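
As a minimal sketch of what a random split does (an illustration, not the module's actual implementation), the function below shuffles the rows with a fixed seed and cuts them at the requested fraction. The function name, the seed value and the integer "rows" are assumptions made for the example.

```python
import random

def split_rows(rows, train_fraction=0.5, seed=42):
    """Shuffle a copy of `rows` and cut it into (train, test) partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))  # stand-in for ten dataset rows

train, test = split_rows(rows, train_fraction=0.5)      # the 50-50 split used here
train60, test40 = split_rows(rows, train_fraction=0.6)  # the 60-40 split mentioned above
print(len(train), len(test), len(train60), len(test40))
```

Shuffling before cutting matters: if the rows happen to be sorted, a plain first-half/second-half cut would give the model a skewed picture of the data.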

It’s time to train the model. Click on Machine Learning > Train and drag the Train Model module into the workspace.
Remember, the Split Data module has two output nodes: one goes into Train Model, and I will explain the other one later in the post.

Click on Train Model and go to the properties pane on the right-hand side, where you will see Launch Column Selector. Click it and a window will pop up. You have to specify your target variable here. In our case, it’s income.
Now that you have specified the target variable, it’s time to choose an algorithm for training. This is a classification problem: we want to predict whether the income of an individual is greater than 50K or not. So we use a classification algorithm; I will use Boosted Decision Trees.
We connect the algorithm to the other input node of Train Model.
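
To give a feel for what "boosting" means, here is a toy sketch in plain Python. It is AdaBoost over one-split decision stumps, a simple stand-in that I chose for illustration; it is not the algorithm Azure ML runs behind its Boosted Decision Trees module. The features, labels and function names are all made up, with +1 meaning income above 50K and -1 meaning otherwise.

```python
import math

def stump_predict(stump, row):
    """A stump predicts `polarity` when the feature exceeds the threshold."""
    _, feature, threshold, polarity = stump
    return polarity if row[feature] > threshold else -polarity

def train_stump(X, y, weights):
    """Try every (feature, threshold, polarity) and keep the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                candidate = (0.0, feature, threshold, polarity)
                error = sum(w for w, row, label in zip(weights, X, y)
                            if stump_predict(candidate, row) != label)
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best

def train_adaboost(X, y, rounds=5):
    """Fit stumps one after another, reweighting rows the previous ones got wrong."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, weights)
        error = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - error) / error)   # vote strength of this stump
        ensemble.append((alpha, stump))
        weights = [w * math.exp(-alpha * label * stump_predict(stump, row))
                   for w, row, label in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, row):
    """Weighted vote of all stumps: +1 for income above 50K, -1 otherwise."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score > 0 else -1

# Made-up training data: [age, hours_per_week] and a made-up income label.
X = [[25, 20], [30, 40], [45, 50], [50, 60], [23, 30], [52, 45]]
y = [-1, -1, 1, 1, -1, 1]

model = train_adaboost(X, y, rounds=3)
print([predict(model, row) for row in X])
```

The two inputs of Train Model map onto this sketch quite directly: one carries the training rows with the target column, and the other carries the untrained algorithm.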

The next step is to score the model. Scoring the model gives us an output column with the predicted results.
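
In code terms, scoring can be pictured roughly like this: every held-out row is passed through the trained model and the prediction is appended as a new column. In the sketch below, toy_model is a hypothetical fixed rule standing in for a trained model, and the column name scored_label and the rows are made up for the example.

```python
def score_rows(rows, predict_fn, output_column="scored_label"):
    """Return copies of `rows` with the model's prediction added as a new column."""
    return [{**row, output_column: predict_fn(row)} for row in rows]

def toy_model(row):
    """Hypothetical stand-in for a trained model: a fixed rule, not a real classifier."""
    return ">50K" if row["hours_per_week"] > 45 else "<=50K"

# Made-up held-out rows, each with its true income label.
test_rows = [
    {"age": 41, "hours_per_week": 50, "income": ">50K"},
    {"age": 23, "hours_per_week": 20, "income": "<=50K"},
    {"age": 35, "hours_per_week": 60, "income": "<=50K"},
]

scored = score_rows(test_rows, toy_model)

# With the true label next to the prediction, accuracy is a simple count.
accuracy = sum(r["scored_label"] == r["income"] for r in scored) / len(scored)
print(scored)
print(accuracy)
```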
