
The Doing Part of Learning Data Science


Consider this a beginner’s answer to “Studied Basics, What Next?”



By Aparna C Shastry

My first article on learning data science got me many new LinkedIn connections, followed by requests like, “I am in the same state as you are/were. Can you guide me on the next step?” It is practically impossible to address such questions individually because they range from “What should I refer to for Linear Algebra?” to “What dataset should I take for my project?” I think “Follow”ing the LinkedIn profiles of a few experienced data scientists gets you loads of information, much more than I could ever cover with my limited experience of a few months. That is my simple generic advice!

Image 1: Learning is Fun, How about Doing Machine Learning?

If you are not satisfied with the generic advice above, read on. Here is an attempt to address the project-related questions I received about what to do after crossing the initial study phase. I might publish a more detailed post once I am done with my current project. My views are based on the path I am taking; another aspirant may well have succeeded by taking an entirely different approach. Find out what works best for you and just do that.

Ready, Steady, Go: Once done with a few books and MOOCs, it is important to take action. However, don’t fall for “Go for messy data, because that is what real projects look like”. It can easily demotivate you. For a beginner, what matters more is knowing whether you can visualize the data, draw insights, make inferences from exploratory data analysis, and write a story. Once you know that you are thoroughly familiar with the data you are working with, try some predictive modeling. Then it becomes fun; otherwise it is a pain. “Familiarity breeds love” in this case.

Where to find Projects / Ideas: I know that this type of advice from pros, “If you want to impress the hiring manager, don’t say what you know or how you know, but demonstrate by showcasing projects”, sounds overwhelming at first! Trust me, I struggled for one whole month thinking of ‘ideas that no one has done before’. Here’s some good news. Recently, DataCamp announced a few beginner-level projects. I have not worked on them, but I had a quick look. I think it is a good place to try, as it splits a project into several subtasks and gives instructions without too much hand-holding. I think it is good value for money, with a low monthly fee of $29.

Image 2: A screenshot of the BabyNames project from DataCamp

Some of my friends also pointed out that Kaggle has launched free learning through hands-on projects; that is another resource to consider.

Where to find the Real data? Take toy datasets and play with them for a while, say one month. The UCI repository can be a good place to start, as it has nice filters on the left. Besides, it also mentions whether the datasets are real. Kaggle challenges are good if you don’t get addicted. (Kaggle normally has either clean data or contrived messy data, and the focus is almost entirely on machine learning and ANN algorithms.) Become confident, and then go for messy data. Take it step by step, as if climbing a ladder.

My Story So Far: When I did what I wrote about in the previous article, I became pretty sure that being a data scientist is not easy. I also sort of understood what makes potential employers hesitant to hire people without any ‘data science industry experience’. I think realizing these two things by reflecting on what you have read and learned is very important. I know the value of paid advice, be it on technical matters or career tips. I don’t mind paying a couple of thousand dollars if it can get me a job at least 2 weeks earlier than if I had done everything on my own. That is not all. Doing projects with guidance, or getting tips on landing the first job from a person who has trodden the same path, has added advantages like,

  • Knowing one’s own blind spots and gaps
  • Knowing what to learn and what to master to make a start, and learning it all the right way the first time
  • Getting rid of impostor syndrome, being more confident in the approach
  • Learning to work in a team of two people and to present the results
  • Better chances of landing an interview:
    • Being visible in the network of the person from whom you took these services
    • Standing out from the crowd in interviews
  • Adding value from the first day of joining the first job!

and so on.

Therefore, I did pay for two separate kinds of advice: one for technical matters and one for getting that first job. I cannot recommend them right now, because I have not fully gone through the experience. Side by side, I download a few intermediate-size datasets (a few thousand rows/records and a few tens of columns) and play with them. An example using the IBM telecom dataset:

Image 3: Screenshot of my approach to a project

One can compare the results by installing the IBM Watson Analytics tool. It also teaches you a few other things, like predictive power and feature importance.

Is the paid advice for you? Paid advice does not work the same way for everyone. It depends on the approach taken by the individual. Stated in the language of electronic communication,

A transmitter can only transmit at a rate that can be demodulated and decoded by the receiver. That depends on a lot of factors, and the top two are the channel and the receiver design. A transmitter, once designed, is more or less always reliable.

Ask yourself, are you a good receiver? How can you design yourself to be a good one? If you are not sure, take the course I already mentioned in my first article. Then go for paid advice if you like. The channel of communication between the student and the guide gets better along the way. While looking for a guide, beware of people on LinkedIn who claim to be data science mentors. Look at their profiles and see whether they are just educators or practicing data scientists.

What else? Side by side, I do a few other things:

  • “Follow” data scientists who post great content on LinkedIn. Try connecting with local data scientists with a personal note stating the purpose. My suggestion is to avoid a generic note like, “It will be good to stay in touch”. Add some humor. For example, I recently connected with people using notes like, “Hi, we both live in <> area. Perhaps we can convene and plan to start meetups?”, and “I like your posts, want to connect with you; won’t ask for referrals (not right away!), just kidding”
  • Quickly parse any GitHub repository that I come across. KDnuggets articles on data science often give me access to repositories of people who are slightly ahead of me. [I have a GitHub repository, which I am not comfortable sharing right now, as it is a ‘work in progress’]
  • Publish my journey as I reach certain milestones. It is satisfying. It is an act of paying forward the help and encouragement I received from what is aptly called the “data scientists community”. Publishing articles also hopefully lets recruiters and potential employers know about my genuine interest and effort in breaking into data science.

My experience with datasets and machine learning exercises so far has taught me:

  1. To make a good start, understanding the basics of statistics (population mean, sample mean, different ways of measuring the center of a distribution, conducting hypothesis tests, p-value, alpha, confidence interval, etc.) is more important than learning Machine Learning algorithms. It is too easy to misinterpret inferential statistics and use the statistical tools wrongly, yet believe that the results are correct. Doing this course should be enough to digest all the essentials.
  2. Andrew Ng often states in his courses, “These techniques are easy to learn, hard to Master”. True. The MOOC courses on Machine Learning, or even initial hands-on experience with toy datasets, can very well give an illusion of competence unless one has really gone into the matters in point 1 above in depth and correlated Machine Learning with statistics. In fact, Machine Learning is nothing but statistical learning. Many experts in the industry suggest referring to this book if questions come up while doing projects: http://www-bcf.usc.edu/~gareth/ISL/  The book also has Lab sections, done in R.
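To make the terms in point 1 concrete, here is a minimal sketch in Python using only the standard library. It is an illustration and not taken from any course mentioned here: the function names and the sample values are made up, and it uses a normal approximation, which is reasonable only for fairly large samples (for small samples a t-distribution would be the usual choice).

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    sample_mean = statistics.fmean(sample)
    # Standard error: sample standard deviation divided by sqrt(n).
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Critical value, e.g. about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return sample_mean - z * std_err, sample_mean + z * std_err


def z_test_p_value(sample, hypothesized_mean):
    """Two-sided p-value for H0: population mean == hypothesized_mean."""
    n = len(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.fmean(sample) - hypothesized_mean) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical measurements, invented purely for illustration.
sample = [50 + (i % 7) - 3 for i in range(70)]
alpha = 0.05  # significance level chosen before looking at the data

low, high = mean_confidence_interval(sample)
p_value = z_test_p_value(sample, hypothesized_mean=52)

print("sample mean:", statistics.fmean(sample))
print("95% confidence interval:", (low, high))
print("reject H0 (mean == 52)?", p_value < alpha)
```

The point of writing it out by hand once is to see that the p-value and the confidence interval both come from the same standard error, which is where misinterpretation usually starts.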

I was about to publish this post, and my predictive model suggested that the next question from readers would be, “Okay, I’ve done toy datasets, but at this time I really can’t afford to pay, and I don’t mind the delay or not having those added factors. Can you please this one time tell me what I can do?”. Well, the only option is to honestly evaluate your work so far and put yourself in a potential hiring manager’s shoes. Would he/she say “Wow! Awesome!”? I think you would want to take a dataset with a few MBs of data and 50+ features that contain rich information. You might have to scrape websites for this. Real-world projects will be largely specific to the business. Normally there will be a business question you need to answer. That means you should demonstrate your skill at collecting relevant data by defining a problem first. Try doing that and you will never need anyone’s advice, because you will eventually land a data scientist job or you will know that it is not for you. Either way, the learning is yours to cherish for a lifetime.

I would like to end this post with one last takeaway: No one knows why one is born or how one should live. One can only be grateful for being alive, be kind, and try to have fun.

P.S.:

  1. Thanks for reading this article. Please give your feedback by commenting or asking questions. Asking questions here will help in addressing them in one place, also helping others who might have similar questions.
  2. Let us share ideas through my LinkedIn profile.

 
Bio: Aparna C Shastry is a Data Science Aspirant, Learner, Parent of 2 lovely kids (teachers).
