
Introduction to Machine Learning for Developers


 

Whether you are integrating a recommendation system into your app or building a chat bot, this guide will help you get started in understanding the basics of machine learning.



Naïve Bayes classification is an algorithm that makes predictions from previously labeled data using a probabilistic model. It assumes that features are independent of each other, meaning that the value of one feature doesn’t affect the value of another, and it requires the set of labels to be chosen and assigned to the training data in advance.

Some examples of labels used in classifiers are sentiment scores (which can be strings, integers, or floats for a scaled score) or, for object detection, labels such as chair, table or desk that describe objects in images. The features are also decided in advance, such as the appearance of key words or the email length in spam detection.

Code screenshot of naive bayes example

This example shows code modified from the NLTK book chapter Learning to Classify Text. It walks through the steps to train the model on known data, with the last letter of a name as the feature.
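Since the screenshot is not reproduced here, the following is a minimal from-scratch sketch of the same idea using only the Python standard library. It is not the NLTK code: the tiny `TRAIN` list, the function names, and the add-one smoothing are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data for illustration; this is NOT the NLTK names corpus.
TRAIN = [
    ("Anna", "female"), ("Maria", "female"), ("Julia", "female"),
    ("Sophie", "female"), ("Emma", "female"), ("Olivia", "female"),
    ("John", "male"), ("Peter", "male"), ("Mark", "male"),
    ("David", "male"), ("Kevin", "male"), ("Robert", "male"),
]


def last_letter(name):
    """Feature extractor: the final letter of the name, lowercased."""
    return name[-1].lower()


def train_naive_bayes(examples):
    """Count how often each label and each (label, feature) pair occurs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    for name, label in examples:
        label_counts[label] += 1
        feature_counts[label][last_letter(name)] += 1
    return label_counts, feature_counts


def classify(name, label_counts, feature_counts, alphabet_size=26):
    """Return the label with the highest log posterior (add-one smoothing)."""
    total = sum(label_counts.values())
    feature = last_letter(name)
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        log_prior = math.log(count / total)
        # Add-one smoothing keeps unseen letters from getting zero probability.
        log_likelihood = math.log(
            (feature_counts[label][feature] + 1) / (count + alphabet_size)
        )
        score = log_prior + log_likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label


label_counts, feature_counts = train_naive_bayes(TRAIN)
print(classify("Laura", label_counts, feature_counts))
print(classify("Frank", label_counts, feature_counts))
```

With a single feature the independence assumption is trivially satisfied; with several features, the log likelihoods for each feature would simply be added together.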

conventional validation methods machine learning

The basic steps for using a classification model when you have a large dataset:

  • Training Set: Fit the model based on known data
  • Validation Set: Used for parameter tuning – choose model complexity
    • Hyperparameters: can be tuned by trying different values and choosing whichever tests better, or via statistical methods
      • Number of clusters in k-means: in our K-means example we used the elbow method.
      • Number of leaves in a decision tree
  • Test Set: Assess the model after it has been fit on the training set – run a confusion matrix to find errors and compare models
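The three-way split above can be sketched roughly as follows. This is an illustration under stated assumptions, not the article's code: the synthetic data, the 60/20/20 proportions, and the tiny one-dimensional k-nearest-neighbour classifier (with `k` as the hyperparameter) are all invented for the example.

```python
import random


def train_validation_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test


def knn_predict(train, value, k):
    """Majority vote among the k training rows closest to value."""
    neighbors = sorted(train, key=lambda row: abs(row[0] - value))[:k]
    votes = sum(1 for _, label in neighbors if label)
    return votes * 2 > k


def accuracy(train, rows, k):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train, value, k) == label for value, label in rows)
    return hits / len(rows)


# Hypothetical data: label is True for values >= 50, with some labels flipped as noise.
data = [(value, (value >= 50) != (value % 10 == 3)) for value in range(100)]
train, validation, test = train_validation_test_split(data)

# Tune the hyperparameter k on the validation set only.
candidates = [1, 3, 5, 7]
best_k = max(candidates, key=lambda k: accuracy(train, validation, k))

# Touch the test set once, at the end, to assess the chosen model.
test_accuracy = accuracy(train, test, best_k)
print(best_k, test_accuracy)
```

Keeping the test set out of the tuning loop is the point of the design: if `k` were chosen on the test set, the reported accuracy would be optimistically biased.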

cross validation machine learning methods

Cross-validation methods help you understand how a model will generalize to unseen data and are especially useful for smaller datasets. For example, K-fold cross-validation follows these steps:

  • The data set is split into K subsets (folds). In each round, one fold is held out as the test set and the remaining folds are used for training, so every fold serves as the test set exactly once.
  • Calculate the error rate for each round, along with the standard deviation across rounds to see how much the results vary.
  • Average the error rate over the rounds to estimate model performance.
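As a rough sketch of K-fold cross-validation, assuming a toy nearest-centroid classifier on one-dimensional data (the data, the model, and all function names here are hypothetical and chosen only to keep the example self-contained):

```python
import random
import statistics


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k folds of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_centroids(rows):
    """'Train' by computing the mean value of each label."""
    by_label = {}
    for value, label in rows:
        by_label.setdefault(label, []).append(value)
    return {label: statistics.mean(values) for label, values in by_label.items()}


def predict(centroids, value):
    """Predict the label whose centroid is closest to value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def cross_validate(rows, k=5, seed=0):
    """Return the mean and standard deviation of the per-fold error rates."""
    error_rates = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held_out_set = set(held_out)
        # Each fold is the test set once; all other rows are training data.
        train = [row for i, row in enumerate(rows) if i not in held_out_set]
        test = [rows[i] for i in held_out]
        model = fit_centroids(train)
        errors = sum(predict(model, value) != label for value, label in test)
        error_rates.append(errors / len(test))
    return statistics.mean(error_rates), statistics.stdev(error_rates)


# Hypothetical data: two overlapping groups of values.
rows = [(v, "small") for v in range(0, 25)] + [(v, "large") for v in range(20, 45)]
mean_error, error_stdev = cross_validate(rows, k=5)
print(mean_error, error_stdev)
```

A large standard deviation relative to the mean suggests the estimate depends heavily on which rows landed in which fold.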

Machine learning resources in R
R is great for statistical/data analysis and machine learning, but it is generally considered less suitable for production systems or utility functions because of performance and security concerns.

Regression diagnostics: Outlier Tests (p-value), Influential Observations, Evaluating nonlinearity, Correlations, descriptive stats.

All the things statistics: ANOVA, Resampling Techniques, Clustering, PCA for unsupervised ML, Decision Trees and more.

Pandas resources for machine learning
Pandas is a Python library that uses data frames, like R. While it is slow to use in production (NumPy arrays would be faster), Pandas is a favorite for data analysis and machine learning in a Python environment.

The benefit of using Pandas is that it can reduce your code by two-thirds or more, and you can use really cool SQL-like features such as joins, merges, pivots and aggregating functions.
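To give a feel for those SQL-like features, here is a small sketch using `DataFrame.merge`, `groupby`, and `pivot_table`. The `users` and `orders` tables and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical tables, invented for this example.
users = pd.DataFrame({"user_id": [1, 2, 3], "plan": ["free", "pro", "pro"]})
orders = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "amount": [10.0, 25.0, 30.0, 15.0],
})

# SQL-style inner join on the shared key column.
joined = orders.merge(users, on="user_id", how="inner")

# Aggregation, similar to SELECT plan, SUM(amount) ... GROUP BY plan.
revenue_by_plan = joined.groupby("plan")["amount"].sum()

# Pivot: one row per plan, one column per month, summed amounts in the cells.
monthly = joined.pivot_table(
    index="plan", columns="month", values="amount", aggfunc="sum", fill_value=0
)

print(revenue_by_plan)
print(monthly)
```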

There are also many I/O methods that make importing and exporting your data easy, such as DataFrame.to_excel, .to_json, .to_csv, and more.
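A brief sketch of a round trip through two of those methods, using in-memory strings rather than files so it runs without touching disk. The sample frame is hypothetical, and `to_excel` is left out because it depends on an extra Excel engine being installed.

```python
import io
import json

import pandas as pd

# Hypothetical data for illustration.
frame = pd.DataFrame({"name": ["Ada", "Grace"], "score": [0.9, 0.8]})

# With no path argument, to_csv and to_json return the serialized text.
csv_text = frame.to_csv(index=False)
json_text = frame.to_json(orient="records")

# Read the CSV text back into a new data frame.
restored = pd.read_csv(io.StringIO(csv_text))
records = json.loads(json_text)

print(csv_text)
print(records)
```

Passing a file path as the first argument to `to_csv` or `to_json` writes to disk instead of returning a string.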

Scikit-learn machine learning resources
Scikit-learn is another favorite Python library and is a great place to find machine learning models with tutorials and documentation that have been vetted by many Python developers. It has everything from image classification algorithms to natural language processing ones.

Machine learning resources

Here is a list of clickable links of the above slide that lists tools, tutorials and videos:

Check out the rest of the blog for more resources on natural language processing and machine learning algorithms, such as LDA for text classification, increasing the accuracy of a Nudity Detection algorithm, and a beginner’s tutorial on using Scikit-learn to solve FizzBuzz.

Bio: Stephanie Kim is a Developer Evangelist at Algorithmia.

Original. Reposted with permission.
