LIONbook Chapter 17: Semi-supervised learning
The LIONbook on machine learning and optimization, written by co-founders of LionSolver software, is provided free for personal and non-profit use. Chapter 17 looks at semi-supervised learning.
Here is the latest chapter from LIONbook, a new book dedicated to the "LION" combination of Machine Learning and Intelligent Optimization, written by the developers of LionSolver software, Roberto Battiti and Mauro Brunato.
This book is freely available on the web.
Here are the previous chapters:
- Chapters 1-2: Introduction and nearest neighbors.
- Chapter 3: Learning requires a method.
- Chapter 4: Linear models.
- Chapter 5: Mastering generalized linear least-squares.
- Chapter 6: Rules, decision trees, and forests.
- Chapter 7: Ranking and selecting features.
- Chapter 8: Specific nonlinear models.
- Chapter 9: Neural networks, shallow and deep.
- Chapter 10: Statistical Learning Theory and Support Vector Machines (SVM).
- Chapter 11: Democracy in machine learning: how to combine different methods.
- Chapter 12: Top-down clustering: K-means.
- Chapter 13: Bottom-up (agglomerative) clustering.
- Chapter 14: Self-organizing maps.
- Chapter 15: Dimensionality reduction by linear transformations (projections).
- Chapter 16: Visualizing Graphs and Networks.
You can also download the entire book here.
The latest chapter is Chapter 17: Semi-supervised learning.
Let us consider the international airport example that motivated unsupervised learning methods in Chapter 14: you walk through a gate and clearly identify clusters of people speaking different languages, even if the language names are unknown. Now, if some people's languages are identified, for example because they are waving flags or wearing the costumes of their countries, we could certainly select only the labeled speakers and run a supervised learning algorithm to map phonetic characteristics to languages.
The question now is: can one also use some information from the unlabeled people to improve language classification? Let us note that people in the same cluster usually speak the same language ("birds of a feather flock together"), so we may be tempted to label some of the unknown speakers with the language spoken by at least one identified member of the same cluster. If the assumption is true, one greatly increases the number of examples and can improve the overall generalization capability of the trained classifier. For example, young children clustered with their older and identified parents can be added to the database so that even young people's voices (usually with higher frequencies) can be correctly classified.
In a similar manner, one can use some supervised data to aid unsupervised learning and clustering. This is the underlying idea of semi-supervised learning: use both the labeled examples and (some of) the unlabeled ones to improve the overall classification accuracy.
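The airport idea of labeling unknown speakers from their cluster can be illustrated with a minimal Python sketch. It is not taken from the book: the function name `propagate_labels`, the toy cluster assignments, and the language names are all hypothetical, and the clusters are assumed to have been found already by some clustering method. The text only requires one identified member per cluster; the sketch additionally assumes a majority vote to settle clusters whose identified members disagree.

```python
from collections import Counter, defaultdict

def propagate_labels(cluster_ids, labels):
    """Give each unlabeled point (None) the most common known label of its cluster.

    Majority voting is an assumption of this sketch; clusters without any
    labeled member are left unlabeled.
    """
    # Count the known labels inside each cluster.
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        if label is not None:
            votes[cluster][label] += 1

    # Fill in the missing labels, keeping the known ones untouched.
    result = []
    for cluster, label in zip(cluster_ids, labels):
        if label is None and votes[cluster]:
            label = votes[cluster].most_common(1)[0][0]
        result.append(label)
    return result

# Toy data: six speakers in three clusters, only two of them identified.
cluster_ids = [0, 0, 0, 1, 1, 2]
labels = ["Italian", None, None, "Finnish", None, None]

augmented = propagate_labels(cluster_ids, labels)
print(augmented)
```

The speakers in clusters 0 and 1 inherit the language of their identified neighbor, while the lone speaker in cluster 2 stays unlabeled because nobody in that cluster was identified. The augmented list could then be passed to any supervised learner, with the caveat from the text that the gain depends on the cluster assumption actually holding.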