Is Data Scientist the right career path for you? Candid advice
Tags: Advice, Career, Data Science, Data Scientist, Hadoop, Paco Nathan, Recommendation, Visualization
Candid advice from an industry veteran reveals the true picture behind the much-talked-about Data Scientist "glamour" and helps people have the right expectations for a Data Science career.
Drawn by tremendous opportunities, great compensation, and visibility to business leaders, many people are moving toward the Data Scientist career path without carefully assessing the day-to-day responsibilities of the role, the attitude it requires, and the balance of technical and business skills it demands.
To give data science aspirants a clear, realistic picture of the data scientist role, one they can assess against their own personality and career ambitions, I recently discussed the question with Paco Nathan, a data science expert with 25+ years of industry experience. His candid, detailed response is very likely to be an eye-opener for many.
Paco Nathan’s short bio is provided at the end of the post.
Anmol Rajpurohit: Data Scientist has been termed the sexiest job of the 21st century. Do you agree? What advice would you give to people thinking of a long career in Data Science?
Paco Nathan: I don’t agree. Not many people have the breadth of skills to perform the role, nor the patience that is absolutely needed to acquire those skills, nor the desire to get there.
As a self test:
- prepare an analysis and visualization of an unknown data set, while impatient stakeholders watch over your shoulder and ask pointed questions; be prepared to make quantitative arguments about the confidence of the results
- describe “loss function” and “regularization term” each in 25 words or less, with a compare/contrast of several examples, and show how to structure a range of tradeoffs for model transparency, predictive power, and resource requirements
- pitch a reorg proposal to an executive staff session which implies firing some ranking people
- interview 3-4 different departments that are hostile to your project, to tease out the metadata for datasets that they’ve been reluctant to release
- build, test, and deploy a mission-critical app with real-time SLAs, efficiently across a 1000+ node cluster
- troubleshoot intermittent bugs in somebody else’s code which is at least 2000 lines long, without their assistance
- leverage ensemble approaches to enhance a predictive model that you’re working on
- work on a deadline in paired programming with people from 3-4 different fields completely disjoint from the work that you’ve done
If one doesn’t feel absolutely comfortable performing each of those listed above, right now, then my advice is to avoid “Data Science” as a career.
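For readers working through the second self-test item, here is a minimal editorial sketch (an illustration, not part of Paco Nathan's response) of how a loss function and a regularization term combine into a single training objective. It assumes squared error as the loss and an L2 (ridge) penalty as the regularizer; the function names and the toy data are hypothetical.

```python
def squared_error_loss(weights, rows, targets):
    """Loss: mean squared gap between predictions and targets."""
    total = 0.0
    for features, target in zip(rows, targets):
        prediction = sum(w * x for w, x in zip(weights, features))
        total += (prediction - target) ** 2
    return total / len(rows)


def l2_penalty(weights, strength):
    """Regularization term: penalizes large weights (L2 / ridge)."""
    return strength * sum(w * w for w in weights)


def objective(weights, rows, targets, strength):
    """What training minimizes: data fit plus complexity penalty."""
    return squared_error_loss(weights, rows, targets) + l2_penalty(weights, strength)


# Hypothetical toy data where target = 2 * x exactly.
rows = [[1.0], [2.0], [3.0]]
targets = [2.0, 4.0, 6.0]

perfect_fit = [2.0]  # zero loss, but a larger weight
shrunk = [1.5]       # some loss, but a smaller weight

for strength in (0.0, 1.0):
    print(strength,
          objective(perfect_fit, rows, targets, strength),
          objective(shrunk, rows, targets, strength))
```

With no penalty the perfect fit scores best; with a penalty strength of 1.0 the smaller weight wins despite fitting the data worse. That shift is the tradeoff between predictive power and model simplicity that the self-test asks you to explain.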
The term Data Scientist was “sexy” as a new role circa 2012 in the sense of DJ Patil, Hilary Mason, et al. However, not everyone gets a chunk of a $4B IPO! (full disclosure: I got invited 3x to join LinkedIn prior to their IPO but stubbornly pursued other opportunities; what an excellent team there!)
Circa 2012: that was then, this is now. Actual work in Data Science entails:
- some opportunities to innovate from a “greenfield” state, but not often
- mostly being called into an existing project — which is somehow at risk
- having to speak truth to power (not fun, but the essence of the role)
To echo what DJ and others have articulated so well before: most data-related problems are social/organizational (e.g., data silos, lack of metadata, matrix org infighting, etc.) or else the key insights probably would have been apparent within that organization already.
I have a hunch that much of the interesting work in ecommerce has played out already — big players will continue to reap big revenue, but the work to be done now is mostly outside of Silicon Valley. Or rather, other industries coming here to learn, partner, purchase, etc.
For example, Monsanto launched a private equity firm in SF that, practically speaking, can invest more money at more favorable terms into Ag data ventures than just about any VC firm. Meanwhile, VCs in the area have all but ignored data-related ventures in domains that matter, with the exception of Khosla. In the past several months Monsanto has acquired business units within SV: Climate Corp, Solum, etc., which by the way were funded by Khosla. Expect more of that trend.
From my perspective, the big issues in data now are not in adtech, but instead real issues: food supply, drought/flooding, energy security, health care, telecom, transportation apart from oil dependency, smarter manufacturing, deforestation monitoring, oceanographic analysis, etc.
Also, IT budgets are still enormously flawed w.r.t. data insights. Too much budget goes into the priesthood of “data engineering”, and far too much budget tends to be earmarked for data that’s already cleaned up. Also, I find that the notion of “Product Management” in SV is almost antithetically opposed to effective use of data: in many cases product managers are incentivized to discourage use of data within companies.
Hence our value is generally going to be realized at:
- writing code to prepare data
- automating process to improve feature engineering and model tournaments
- speaking truth to power
The first speaks to IT budgets earmarked the wrong way, and the second speaks to Product Management being almost systematically hostile to effective use of data. The third speaks to the fact that several of my biggest contributions as a data scientist have been to provide exec staff with hard evidence to fire other executives and get the company back on track. Again, industry disruptions have impact.
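As an editorial aside on the second item in the list above, a "model tournament" can be read as scoring several candidate models under the same held-out evaluation and keeping the winner. The sketch below is one minimal interpretation, not Paco Nathan's own tooling; the two candidates, the interleaved fold split, and the toy data are all assumptions made for illustration.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    avg = mean(ys)
    return lambda x: avg


def fit_line(xs, ys):
    """Candidate: one-variable least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    spread = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / spread
    return lambda x: y_bar + slope * (x - x_bar)


def cross_val_error(fit, xs, ys, folds=3):
    """Mean squared error over held-out folds (interleaved split)."""
    errors = []
    for k in range(folds):
        train = [i for i in range(len(xs)) if i % folds != k]
        test = [i for i in range(len(xs)) if i % folds == k]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.extend((model(xs[i]) - ys[i]) ** 2 for i in test)
    return mean(errors)


def tournament(candidates, xs, ys):
    """Score every candidate the same way; return (winner, scores)."""
    scores = {name: cross_val_error(fit, xs, ys) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores


# Hypothetical toy data: roughly linear with small noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

winner, scores = tournament({"mean": fit_mean, "line": fit_line}, xs, ys)
print(winner, scores)
```

Automating this loop means new candidates and new features can be added as entries in a dictionary and compared on equal terms, rather than being argued over by hand.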
For people just starting out, be really careful about where you go to work. If a firm claims to have “excellent engineering” but insufficient use of data circa 2014, then they are *not* the sharpest tools on the workbench; pick some other firm in which to start. Find mentors. Join teams that have strong sponsorship from Finance or Operations (which generally understand data and variance) while perhaps avoiding teams that have sponsorship from Engineering or Marketing (which generally do not understand effective use of data).
Recommendations, not necessarily in order:
- learn to leverage the evolving Py data stack: IPython, Pandas, scikit-learn, etc.
- learn how to lead an interdisciplinary team
- get experience in 1+ domains outside of data/analytics/programming
- get a good grounding in design and apply it to data visualization
- do everything you can to become a better writer and speaker (outside of academic confs)
- participate in meetups; publish blogs, presentations, etc. (hiring managers ignore resumes and look for published content online)
- get a good grounding in abstract algebra, Bayesian stats, linear algebra, convex optimization
- study up on algorithms and frameworks for streaming data (the bigger use cases on the horizon are not batch)
- learn Scalding and functional programming with type safety
- avoid Business Intelligence (like the plague)
- avoid anything referred to as “The Hadoop Ecosystem” or “Hadoop as an OS”
Paco Nathan is a "player/coach" in the field of Big Data, having led innovative Data teams on large-scale apps for 10+ years. An expert in distributed systems, machine learning, and Enterprise data workflows, Paco is an O'Reilly author and an advisor for several firms including The Data Guild, Mesosphere, Marinexplore, Agromeda, and TagThisCar. Paco received his BS Math Sci and MS Comp Sci degrees from Stanford University, and has 25+ years of technology industry experience ranging from Bell Labs to early-stage start-ups.