Foundations of Data Science: The Free eBook

As has become tradition on KDnuggets, let's start a new week with a new eBook. This time we check out a survey-style text covering a variety of topics, Foundations of Data Science.

We're back at it with a new free eBook again this week. This time we will be covering a text with a name that speaks for itself, Foundations of Data Science, written by Avrim Blum, John Hopcroft, and Ravindran Kannan. A book with such a name is making a pretty big statement. Luckily, its content backs it up.


First off, it should be noted that this book is not structured like a typical data science book. In my view, neither its chapters nor their progression fits the mold of a standard contemporary data science text. You can see from the table of contents listed below that the text surveys a wide array of disparate topics, as opposed to simply treating data science as equivalent to machine learning, for example, and progressing as such:

  1. Introduction
  2. High-Dimensional Space
  3. Best-Fit Subspaces and Singular Value Decomposition (SVD)
  4. Random Walks and Markov Chains
  5. Machine Learning
  6. Algorithms for Massive Data Problems: Streaming, Sketching, and Sampling
  7. Clustering
  8. Random Graphs
  9. Topic Models, Nonnegative Matrix Factorization, Hidden Markov Models, and Graphical Models
  10. Other Topics
  11. Wavelets
  12. Appendix

The varied high-level topics, and the early inclusion of chapters on high-dimensional space, subspaces, and random walks and Markov chains, reinforce this survey style. The book also reminds me of another classic data science text with which you may be familiar, Mining of Massive Datasets. Because this text focuses on foundations, you won't find the latest neural network architectures covered herein. However, if you want to eventually be able to understand the whys and hows of some of these more complex approaches to data science problem solving, you should find Foundations of Data Science useful.

Matrix factorization, graph theory, kernel methods, clustering theory, streaming, gradient descent, data sampling: these are all concepts that will serve you well later when it comes to solving data science problems, and they are all essential building blocks for implementing more complex approaches as well. You won't be able to understand neural networks without gradient descent. You can't analyze social media networks without graph theory. The models you build won't be of value if you can't understand when and why you would sample from data.
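To make one of these building blocks concrete, here is a minimal sketch of gradient descent in plain Python. It is not taken from the book, which contains no code. The function name `gradient_descent`, the toy objective f(x) = (x - 3)^2, and the learning rate and step count are all assumptions chosen purely for illustration.

```python
from typing import Callable


def gradient_descent(
    grad: Callable[[float], float],
    x0: float,
    learning_rate: float = 0.1,
    steps: int = 100,
) -> float:
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        # Move a small distance in the direction that decreases the function.
        x -= learning_rate * grad(x)
    return x


# Illustrative objective (an assumption): f(x) = (x - 3)^2,
# whose derivative is f'(x) = 2 * (x - 3) and whose minimum is at x = 3.
def toy_gradient(x: float) -> float:
    return 2.0 * (x - 3.0)


minimum = gradient_descent(toy_gradient, x0=0.0)
print(minimum)  # a value very close to 3
```

The same loop, applied to a loss function over many parameters rather than a single number, is the core idea behind how neural networks are trained.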

Similar to some other books we have recently profiled (such as The Elements of Statistical Learning and Understanding Machine Learning), this book is unabashedly theoretical. There is no code. There are no Python libraries being leaned on. There is no hand-waviness. There are only thorough explanations leading to understanding of these varied topics, should you spend the necessary time reading.

The motivation of the authors for writing such a book has been captured in this excerpt from the book's introduction:

While traditional areas of computer science remain highly important, increasingly researchers of the future will be involved with using computers to understand and extract usable information from massive data arising in applications, not just how to make computers useful on specific well-defined problems. With this in mind we have written this book to cover the theory we expect to be useful in the next 40 years, just as an understanding of automata theory, algorithms, and related topics gave students an advantage in the last 40 years. One of the major changes is an increase in emphasis on probability, statistics, and numerical methods.

In many contemporary books, data science has been reduced to a series of programming tools which, if mastered, promise to do the data science for you. There seems to be less emphasis on the underlying concepts and theory, divorced from code. This book runs counter to that trend, and it should arm you with the theoretical knowledge needed to approach a career in data science with a strong set of foundations.