Insights from The Data Science Handbook

Here you can find the perspectives of leading data scientists on topics ranging from the definition of data science and metrics selection while solving a problem to work ethic, the art of storytelling, and why data science is important in today’s world.



By Vasanth Gopal.

For the last few years I have been trying to get a handle on the field that encompasses analytics, big data, modeling, prediction, machine learning, algorithms, data mining techniques, rules, computational complexity, latency, data products, data engineering, statistical inference, R programming, data wrangling, data hacking, statistical modeling, supervised / unsupervised learning, data visualization, unstructured data and the many other subjects that make up the world of data science. Someone said that the term itself is an umbrella term. There are many books out there, written by their respective masters, that deal in great detail with some of the topics mentioned above.

The moment I picked up the book, “The Data Science Handbook”, I sensed that it would be different and would impact me profoundly. I must confess that my understanding of this beautiful art/science of drawing insights from data has gone up to a whole new level.

Definition

The book offers many definitions of data science, such as the one from Josh Wills, and the conversations carry on from there. They follow the transitions the interviewees made from their rich academic backgrounds to the real world, where they hone their skills practicing the craft of seducing information, insights and signal out of complex, unyielding and noisy datasets, and of creating data products that capture data from a captive audience and in the process build rich data sources for further analysis.

It is from these conversations that my understanding of this science has deepened, and my longing to keep pursuing the craft has only grown. Every artiste in this book has his or her own view on the definition of data science, but broadly speaking they seem to agree that it lies at the convergence of math/statistics, computer science and domain expertise.

Target Audience

The ideal target audience for this book includes the following:

  1. An aspiring data scientist
  2. A practicing data scientist
  3. A leader of a team of data scientists
  4. An entrepreneur or business owner
  5. A data-curious citizen

The 25 artistes themselves come from varying disciplines, and there could hardly be a better representation of different backgrounds than this list. You are treated to some deep-dive sessions, and one can only marvel at the brilliance of these artistes performing seamlessly. There is something for everyone, from the aspiring data scientist to the elite, the merely curious, or the connoisseur.

Transitioning and upgrading

The book also covers the important topic of learning and upgrading the skills required for practicing data science, including making the transition from a purely academic background to the real world of business. Organizations like Insight Data Science, founded by Jake Klamka, are specifically designed to help PhDs transition into industry. At the other end of the spectrum, aspiring data scientists who have enough domain expertise and are keen to pursue this art can take heart from the example of Clare Corthell, who embarked on a self-crafted journey to learn data science purely through online courses (MOOCs). In fact, she has put together her own data science curriculum, the Open Source Data Science Masters (OSDSM) program. These courses can help you bridge the gaps in your learning and in practicing the craft.

The OSDSM is a collection of open source resources that will help you acquire the skills necessary to be a competent entry-level data scientist. You can access the curriculum here.

You have to be adept at learning and upgrading on the job and on the fly.

Jace Kohlmeier, the data scientist at Khan Academy who joined the company after listening to Salman Khan’s TED talk and who has a finance background in high-frequency trading, adds to the discussion on learning new skills and climbing the data science learning curve on the job:

There is not a steady rate at which you learn new techniques and employ them; it definitely comes in waves. When I made the transition into this new domain of education and internet-generated data, I went through a period of needing to learn new modeling techniques. I wasn’t familiar with probabilistic graphical models; that wasn’t something that I had used in high frequency trading. Once I got past that initial learning curve, learning came very much in waves. There will be a very concrete and motivating need or goal.

Finally, Joe Blitzstein, a Harvard professor who teaches statistics, hits the nail on the head when he observes:

You have to be energetic and work really hard, but not get discouraged just because you don’t know everything.

Metrics

The metrics for data science depend on the problem that your customer wants you to solve. As Diane Wu, a data scientist at Palantir, says:

Some may want a streaming solution while others may want a static model based off the information from their databases. These can range from one to several dozen.

She says that in her role, success is very measurable: it is the accuracy, or the precision and recall, of your model. It also depends on the questions you ask: you have to find the right questions and try to make an impact with the answers.
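For readers new to these terms, here is a minimal sketch in Python of how accuracy, precision and recall can be computed for a binary classifier. It is not taken from the book; the function name `classification_metrics` and the toy labels are my own illustrative assumptions.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for binary labels (1 = positive).

    Hypothetical helper for illustration only; not from the book.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    # Guard against division by zero when a denominator is empty.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Made-up, imbalanced example: only 2 of the 10 cases are positive.
y_true = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

On this toy data the model is right 9 times out of 10 and every positive it predicts is correct, yet it finds only one of the two real positives. That is why a single number rarely tells the whole story, and why the metric has to follow from the customer’s problem.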

Industrious

The book also touches upon the value of industry, hard work and discipline. You have to be prepared to put in long hours, not only to bridge the gaps in your understanding of data science and fill the missing links in your armory, but also on the problem at hand that you are trying to decipher. DJ Patil, who led a team of data scientists at LinkedIn and has since gone on to become the first Chief Data Scientist of the United States, talks about this very eloquently when he says:

One of the first things I tell new data scientists when they get into the organization is that they better be the first ones in the building and the last ones out… You’re not putting in your time because of some mythical ten thousand hours thing (I don’t buy that argument at all, I think it’s false because it assumes linear serial learning rather than parallelized learning that accelerates). You put in your time because you can learn a lot more about disparate things that fit into the puzzle together. It’s like a stew, it only becomes good if it’s been simmering for long time.

Finding the relevant question and the art of storytelling

The book also emphasizes the ability to narrate a problem and to communicate its solution in the form of a story, without losing the sense of passion and curiosity. This is just as important as the usual skills. Hilary Mason, the New York-based data scientist and founder of Fast Forward Labs, says:

For each data project you’re working on, you need to ask yourself these questions: what are you working on? How will I know when it’s done? What does it impact?

She has very valuable advice for aspiring data scientists:

Try to do a project that plays to your strengths. In general, I divide the work of a data scientist into three buckets: Stats, Code, and Storytelling/Visualization. Whichever one of those you’re best at, do a project that highlights that strength. Then, do a project using whichever one of those you’re worst at. This helps you grow, learn something new, and figure out what you need to learn next. Keep going from there.

John Foreman, the data scientist at MailChimp, makes an important point when he says:

For me, a core skill that any data scientist should possess is the ability to communicate with the business. It’s dangerous to rely on others at a business to actively identify and throw problems at the data scientist while he or she passively waits to receive work.

Rock stars

As for the rock stars themselves, as I have already mentioned in this post, the list is a heady mix of who’s who in the field.

DJ Patil, Hilary Mason, Pete Skomoroch, Riley Newman, Jonathan Goldman, Michael Hochster, George Roumeliotis, Kevin Novak, Jace Kohlmeier, Chris Moody, Erich Owens, Luis Sanchez, Eithon Cadag, Sean Gourley, Clare Corthell, Diane Wu, Joe Blitzstein, Josh Wills, Bradley Voytek, Michelangelo D’Agostino, Mike Dewar, Kunal Punera, William Chen, John Foreman, Drew Conway

Conclusion

The diversity of these artistes’ backgrounds, whether academic, career or domain-wise, is what makes the book so interesting, yet what ties them all together is curiosity and the hunger to satisfy it. These artistes make you think and contemplate.

  • Why is data science so important in today’s world and economy?
  • How does one master the triple disciplines of programming, statistics and domain expertise to become an effective data scientist?
  • How do you transition from academia, or other fields, to a position in data science?
  • What separates the work of a data scientist from that of a statistician or a software engineer? How can they work together?
  • What should you look for when evaluating data science roles at companies?
  • What does it take to build an effective data science team?
  • What mindsets, techniques and skills distinguish a great data scientist from the merely good?
  • What lies in the future for data science?

Apart from the rock stars above, you can also follow these grand masters, who are not featured in the book but are working equally hard and tirelessly for the growth of this industry. You can simply Google them and follow them on Twitter or through their websites.

Vincent Granville, Gregory Piatetsky, Kirk Borne, Eric Colson, Marck Vaisman, Milind Bhandarkar, Monica Rogati, Simon Zhang, Dean Abbott, Nate Silver

You will never miss out on their rays of insight and the sprinkle of stardust on you.

If your hunger is still not sated, you can follow the lists of the top 50/100 influencers and brands in the industry, which will surely get you going.

Finally, I leave you with another gem from the book, this time from Sean Gourley, co-founder and CTO of Quid:

I think data science is really going to become more of a product design process; actually an algorithm design process. Algorithms take information and direct us; whether it’s the information we read, the music we listen to, the places we drink coffee, the friends we meet, or the updates in our lives.

Bio: Vasanth Gopal is an aspiring data scientist with 25 years of sales experience in pharma and 5 years in business intelligence and advanced analytics. He is currently pursuing the Data Science Specialization through Coursera.

This post is a revised version of this article.
