Data Structures Related to Machine Learning Algorithms
If you want to solve some real-world problems and design a cool product or algorithm, then having machine learning skills is not enough. You would need good working knowledge of data structures.
A heap is another hierarchical, ordered data structure similar to a tree except instead of a horizontal ordering, it has a vertical ordering. This ordering applies along the hierarchy, but not across it: the parent is always larger than both its children, but a node of higher rank is not necessarily larger than a lower one that’s not directly beneath it.
Both insertion and retrieval are performed by promotion. An element is first inserted in the highest available position. Then it is compared with its parent and promoted until it reaches the right rank. To take an element off the heap, the larger of the two children is promoted to the missing position, then the larger of those two children is promoted and so on until everything has trickled up the ranks.
Typically, the highest ranking value at the top is pulled off the heap in order to sort a list. Unlike a tree, most heaps are simply stored in an array with the relationships between elements only implicit.
A stack is defined as “first in, last out.” An element is pushed onto the top of the stack where it covers the previous element. The top element must be popped off before any of the others can be accessed.
Stacks are mainly useful for parsing grammars and implementing computer languages.
There are many machine learning applications for which a domain specific language (DSL) is the perfect solution. For instance, the libAGF library uses a recursive control language to generalize binary classification to multi-class. A special character is used to repeat a previous option, but because the language is recursive, the option must be taken from the same hierarchical level or higher. This is implemented by a stack.
A queue is defined as “first in, first out.” Think of the line at the bank teller (for those of us still old enough to remember a time before internet banking). Queues are useful in real time programming so that the program can maintain a list of jobs to be processed.
Consider an application to record split times of athletes. You type in the bib number and hit enter, except in the time it took you to do that the next athlete behind has also passed. So you type in a list of bib numbers of the nearest approaching athletes, then hit a separate key to register the next in the queue as having passed.
A set consists of an un-ordered list of non-repeating elements. If you add an element that’s already in the set, there will be no change. Since much of the mathematics of machine learning deals with sets, they are very useful data structures.
In an associative array, there are two types of data which are stored in pairs: the key and its associated value. The data structure is relational in nature: the value is addressed by its key. Since much of the training data is also relational, this type of data structure would seem ideally suited to machine learning problems.
In practice, it’s not used so much, in part because most associative arrays are only one-dimensional, whereas machine learning data is typically multi-dimensional.
Associative arrays are good for building dictionaries.
Suppose you are building a DSL, want to store a list of functions and variables, and need to distinguish between the two.
“sin” → function
“var” → variable
“exp” → function
“x” → variable
“sqrt” → function
“a” → variable
Querying the array on “sqrt” would return, “function.”
Custom data structures
As you work on more problems, you are sure to encounter those for which the standard recipe box does not contain optimal structures. You will need to design your own data structure.
Consider a multi-class classifier, which generalizes a binary classifier to work with classification problems having more than two classes. An obvious solution is bisection: recursively split the classes into two groups. You could use something similar to a binary tree to organize the binary classifiers, except that a hierarchical solution is not the only method of solving for multi-class.
Consider several partitions that are then used to solve for all the class probabilities simultaneously.
The most general solution would combine the two, thus each hierarchical partition need not be binary but could be solved by a non-hierarchical multi-class classifier. This is the approach taken in the libAGF library.
More complex data structures can also be composed of the basic structures. Consider a sparse matrix class. In a sparse matrix, most of the elements are zero and only the non-zero elements are stored. We could store the position and value for each element as a triplet and have a list of them in an extensible array.
Consider the 3 by 3 identity:
Data structures are only occasionally interesting in their own right. What makes them truly interesting are the kinds of problems you can solve with them.
For most of the work I do, I’m using a lot of basic fixed-length arrays. I mostly use more sophisticated data structures to make the programs a little smoother in how they run and interface with the outside world and a little more user friendly. Less like the Fortran programs of yore where you had to endure a compile cycle of close to half an hour just to change the grid sizes (I actually worked on a program like this!).
Even if you can’t come up with an application off the top of your head I still I think it’s good to know about things like stacks and queues. You never know when one might come in handy.
Really sophisticated artificial intelligence applications might use things like directed and undirected graphs, which are really just generalizations of trees and linked lists. How are you going to build things like the former if you can’t cope with the latter?
If you want to practice and realize data structures for ML algorithm yourself, try to solve some of problems below:
- Encapsulate the matrix-vector multiplication code snippet into a subroutine called matrix_times_vector. Design the calling syntax for the subroutine.
- Using struct, typedef or class, encapsulate both vectors and matrices into a pair of abstract types called vect and matrix, respectively. Design an API for the types.
- Find at least three libraries online that do the above.
- Download and install the LIBSVM library. Consider the method Kernel::k_function on line 316 of “svm.cpp”. What are the advantages and disadvantages of the data structure used to hold vectors?
- How would you re-factor calculation of kernel functions in the LIBSVM library?
- Which data structures described in the text are abstract types?
- What internal representation or data structure could you use to implement the abstract data types? Are there any that are not included in the list above?
- Using a binary tree, design an associative array.
- Consider the vector type in LIBSVM. How can this be used to represent a sparse matrix? Contrast this with the sparse matrix class described above. Look at the complete type. What are the advantages and disadvantages of each representation?
- Implement a treesort and a heapsort. Now use the same data structures to find the top k elements. What common machine learning algorithm is this good for?
- Implement your favorite data structure in your favourite language.
Bio: Peter Mills is passionate about science, and is interested in atmospheric physics, chaos theory and machine learning.
Original. Reposted with permission.
- Top 10 Machine Learning Algorithms for Beginners
- How Much Mathematics Does an IT Engineer Need to Learn to Get Into Data Science?
- Data scientists mostly just do arithmetic and that’s a good thing