15 Minute Guide to Choose Effective Courses for Machine Learning and Data Science

Advice for young professionals in non-CS fields who want to learn and contribute to data science/machine learning. Curated from personal experience.

The motivation

Bill Gates proclaimed at a recent graduation ceremony that artificial intelligence (AI), energy, and bioscience are the three most exciting and rewarding career choices open to today’s young college graduates.

I couldn’t agree more.

I have come to believe strongly that some of the most important questions of our generation (those related to sustainability, energy generation and distribution, transportation, access to the basic amenities of life, and so on) depend on how intelligently we can mix the first two branches of knowledge Mr. Gates mentions.

In other words, the world of physical electronics (of which the semiconductor industry is a central part) must do more to fully embrace the fruits of information technology and new developments in AI and data science.

I wanted to learn, but where to start?

I am a semiconductor professional with 8+ years of post-PhD experience in a top technology company. I take pride in the fact that I work in an area of physical electronics that directly contributes to the energy sector. I develop power semiconductor devices. They are built to carry electrical power efficiently and reliably, and they power everything from the tiny sensor inside your smartphone to the large industrial motor drives that process food or cloth for everyday consumption.

Therefore, naturally, I want to learn and apply the techniques of modern data science and machine learning to improve the design, reliability, and operation of such devices and systems.

But I am no computer science graduate. I could not tell a linked list from a heap. Support vector machines sounded (until a few months back) like some special equipment for people with disabilities. And the only AI keyword I remembered (from my junior-year elective course) was ‘first-order predicate calculus’, a remnant of the so-called ‘old AI’ or knowledge-engineering approach, as opposed to the newer machine-learning-based approach.

I had to start somewhere to learn the basics and then work my way deeper. The obvious choice was MOOCs (Massive Open Online Courses). I am still very much in the learning phase, but I believe I have at least gathered some good experience in choosing the right MOOCs for this path. In this article, I want to share my insights on that aspect.

Know your ‘Chi’ and your ‘Enemy’

Sorry for the bad analogy :-) It’s from Netflix’s latest superhero ensemble saga, The Defenders.

But it’s true that you should know your strengths, weaknesses, and technical inclination very well before you start the learning-through-MOOC process.

Because, let’s face it, time and energy are limited, and you cannot afford to waste your precious resources on something you are highly unlikely to practice in your current work or future job. And this is assuming that you want to take the (almost) free learning path, i.e. auditing the MOOCs rather than paying for the certificates. I say ‘almost’ because, at the end of this article, I would like to list a few MOOCs that I think you should pay for in order to showcase the certificates. And, in my personal journey, I had to pay for the few Udemy courses I took because they are never free, but you can buy them for the cost of a good lunch sandwich when a promotion is running.

What you can and cannot learn from MOOCs

In this picture, I just want to show the possibilities and impossibilities of this process, i.e. what you can hope to learn through self-study and practice, what must be learned on the job, and what kind of mentality must be cultivated no matter what your profession is. Having said that, these circles broadly encompass the core skills that one can study to venture into the field of data science/machine learning from a non-CS background. Please note that even if you are in the information technology (IT) sector, you may have a steep learning curve ahead, because traditional IT is being disrupted by these new fields and the core skills and good practices are often different.

I, for one, view data science as a more democratic field than many other professional domains (e.g. my own area of work, semiconductor technology): the entry barrier is low and, with sufficient hard work and zeal, anybody can acquire meaningful skills. Personally, I have no burning desire to ‘break into’ this field; I just have a passion to borrow its fruits and apply them to my own area of expertise. However, that end goal does not change the initial learning curve that one has to traverse. So, you could be aiming to be a data engineer, a business analyst, a machine learning scientist, or a visualization expert; the field and the choices are wide open. And if your aim is like mine (stay in your current domain of expertise and apply the newly learned techniques), you are fine too.

You can start with the real basics, no shame there :)

I started with the real basics: learning Python on Codecademy. In all likelihood, you cannot go more basic than this :-). It worked, though. I had an aversion towards coding, but the simple and fun interface and the right pace of Codecademy’s free course were enough to excite me and keep me going. I could have picked a Java or C++ course on Coursera, Datacamp, or Udacity, but some reading and research told me that Python is the optimal choice, balancing learning complexity and utility (especially for data science), and I decided to trust that insight.

After a while, you crave deeper knowledge (but at a gentle pace)

Codecademy’s introduction was a fine base to start from. I had choices from so many online MOOC platforms and, predictably enough, I signed up for multiple courses at the same time. However, after dabbling with a Coursera class for a few days, I realized I was not ready to learn Python from a professor! I was looking for a course taught by an enthusiastic instructor who would take the time to go over the concepts in great detail, teach me other essential tools like Git and the Jupyter notebook system, and maintain the right balance between basic concepts and advanced topics in the curriculum. And I found the right man for the job: Jose Marcial Portilla. He offers multiple courses on Udemy and is one of the most popular and positively reviewed instructors on that platform. I signed up for and completed his Python Bootcamp course. It was an amazing introduction to the language, with the right pace, depth, and rigor. I highly recommend this course for new learners even though you have to fork out $10 (Udemy courses are generally not free and their regular price is $190 or $200, but you can always wait a few days for the recurring promotion cycle and sign up for $10 or $15).

It’s important to keep your focus on data science

The next step proved crucial for me. I could have gone astray and tried to study anything and everything I could about Python, especially the object-oriented and class-definition part, which can easily suck you in for a long and arduous journey. Now, taking nothing away from that key sphere of the Python universe, one can safely say that you can practice deep learning and good data science without being able to define your own classes and methods in Python. One of the fundamental reasons for Python’s ever-increasing popularity as the de facto language of choice for data science is the availability of a large number of high-quality, expert-written, community-reviewed libraries, classes, and methods, just waiting to be downloaded in a nicely packaged form and unwrapped for seamless integration into your code.

Therefore, it was important for me to quickly jump into the packages and methods used most widely for data science: NumPy, Pandas, and Matplotlib.
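To give a flavor of what working with these packages looks like, here is a minimal sketch that relies only on ready-made NumPy and Pandas functionality, with no custom classes. The measurement values and column names are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Invented example values: five hypothetical device measurements.
voltage = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # volts
current = np.array([0.5, 1.0, 1.5, 2.0, 2.5])   # amperes

# NumPy: element-wise arithmetic on whole arrays, no explicit loop.
power = voltage * current                        # watts

# Pandas: collect the arrays into a labeled table and summarize it.
df = pd.DataFrame(
    {"voltage_V": voltage, "current_A": current, "power_W": power}
)
mean_power = df["power_W"].mean()

# Boolean indexing: keep only the rows whose power exceeds the mean.
high_power = df[df["power_W"] > mean_power]

print(df)
print("Mean power (W):", mean_power)
print("Rows above the mean:", len(high_power))
```

Everything here is a function call or an operator on objects the libraries already provide, which is the point: you can get useful work done long before you need to write a class of your own.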

I was introduced to those by a neat little course from edX. Although most courses on edX are from universities and are rigorous (and longish) in nature, there are a few short, more hands-on and less theoretical courses offered by technology companies like Microsoft. One such offering is the Microsoft Professional Program in Data Science. You can register for as many courses under this program as you want. However, I took only the following courses (and I intend to come back for other courses in the future):

  • Data Science Orientation: Discusses everyday life of a typical data scientist and touches upon the core skills one is expected to have in this role along with basic introduction to the constituting subjects.
  • Introduction to Python for Data Science: Teaches basics of Python (data structures, loops, functions) and then introduces NumPy, Matplotlib, and Pandas.
  • Introduction to Data Analysis using Excel: Teaches basic and a few advanced data analysis functions, plotting, and tools with Excel (e.g. pivot table, power pivot, and solver plug-in).
  • Introduction to R for Data Science: Introduces R syntax, data types, vector and matrix operations, factors, functions, data frames, and graphics with ggplot2.

Although these courses present the material in a rudimentary fashion and cover only the most basic examples, they were enough to light the spark! Boy, I was hooked!

I switched to learning R in detail — for some time

The last course made me realize a few important things: (a) statistics and linear algebra are at the core of the data science process, (b) I did not know, or had forgotten, enough of either, and (c) R is naturally suited for the kind of work I want to do with my data sets, which are a few MB in size, generated by controlled wafer fab experiments or TCAD simulations, and primed for basic inferential analysis.

This prompted me to search for a solid introductory course in the R language, and who better to turn to than Jose Portilla again! I signed up for his “Data Science and Machine Learning Bootcamp with R” class. This was a ‘buy one, get another free’ deal, as the course covered the essentials of the R language in the first half and then switched to teaching basic machine learning concepts (all the important concepts expected in an introductory course were covered with sufficient care). Unlike the edX Microsoft course, which used a server-based hands-on lab environment, this course covered the installation and setup of RStudio and the necessary packages, introduced me to Kaggle, and gave me the push I needed to graduate from being a passive learner (aka MOOC video watcher) to a person who is not afraid of playing with data. It also followed the great “An Introduction to Statistical Learning with Applications in R” (ISLR) book by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, chapter by chapter.

If you are allowed to read only one book in your lifetime to learn machine learning and nothing else, pick this book and read all the chapters, no exception. By the way, there is no neural network or deep learning material in this book, so there’s that…

Armed with the course materials, the ISLR book, and practice on random data sets downloaded from Kaggle or even my own electricity usage data from PG&E, I was no longer afraid of writing small bits of code that can actually model something interesting or useful. I analyzed some US county-level crime data, explored why a large design-of-experiment can lead to spurious correlation, and even looked at my apartment’s electricity usage over the past 3 months. I also successfully used R to build predictive models based on some real-world data sets from my work. The statistical/functional nature of the language and its ready-made estimates of significance and uncertainty (p-values, z-scores, confidence intervals) for a variety of models (regression or classification) really help a new learner gain an easy foothold in the domain of statistical modeling.
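For readers who have not seen such output before, the sketch below shows roughly what those numbers mean for a simple straight-line fit. It is written in Python with only the standard library, the data are invented (a hypothetical temperature versus electricity-use series, not my PG&E records), and it assumes a normal approximation for the p-value and interval, whereas R's lm() uses a t-distribution. Treat it as an illustration of the idea rather than a replacement for R's summary output.

```python
from math import sqrt
from statistics import NormalDist, mean

# Invented data: average temperature (x) vs. electricity use in kWh (y).
x = [10.0, 12.0, 15.0, 18.0, 21.0, 24.0, 27.0, 30.0]
y = [210.0, 215.0, 228.0, 240.0, 262.0, 275.0, 298.0, 310.0]

n = len(x)
x_bar, y_bar = mean(x), mean(y)

# Ordinary least squares for y = intercept + slope * x.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual variance (n - 2 degrees of freedom) and the slope's standard error.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
resid_var = sum(r ** 2 for r in residuals) / (n - 2)
slope_se = sqrt(resid_var / sxx)

# Assumption: normal approximation instead of the t-distribution.
z_score = slope / slope_se
p_value = 2 * (1 - NormalDist().cdf(abs(z_score)))
z_crit = NormalDist().inv_cdf(0.975)
ci_low = slope - z_crit * slope_se
ci_high = slope + z_crit * slope_se

print(f"slope = {slope:.3f}, standard error = {slope_se:.3f}")
print(f"z-score = {z_score:.2f}, approximate p-value = {p_value:.3g}")
print(f"approximate 95% interval for the slope: [{ci_low:.3f}, {ci_high:.3f}]")
```

A small p-value and an interval that excludes zero both say the same thing: the fitted slope is unlikely to be an artifact of noise, under the stated assumptions.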