Scott Nicholson is a data scientist & economist who works with data to help people and businesses make better decisions. He is currently Chief Data Scientist at Accretive Health, and previously worked at LinkedIn and Adara Media. He has a Ph.D. in Economics from Stanford University.
Gregory Piatetsky: How is the perspective of an applied economist different from that of data scientists who come from a machine learning/data mining background?
Scott Nicholson: In terms of applied work, economists are primarily concerned with establishing causation. This is key to understanding what influences individual decision-making and how certain economic and public policies impact the world, and it tells a much clearer story of the effects of incentives. With this in mind, economists care much less about the accuracy of the predictions from their econometric models than they do about properly estimating the coefficients, which gets them closer to understanding causal effects.
At Strata NYC 2011, I summed this up by saying:
"If you care about prediction, think like a computer scientist, if you care about causality, think like an economist." It was a bit controversial as many machine learning folks pointed out the work of Judea Pearl, and many economists pointed out forecasting exercises, but for the most part I think it is an accurate representation.
GP: There are many situations when randomized A/B trials are not possible, and so economists (and other scientists) turn to natural experiments. One of the more interesting tools is regression discontinuity - can you give an example where it was used for learning about the world?
SN: One example that I've always liked is in a paper by Marc Meredith, an Assistant Professor of Political Science at UPenn. He wanted to understand how voting can be habit-forming, but the standard problem is that people who choose to vote in the first place not only form a voting habit but are also just different from those who don't vote in some unobservable way. The regression discontinuity used in the paper is a standard one (birthdays), but in a novel context (voting). People who turn 18 just in time to register for an election are similar to those who turn 18 just after the deadline. Comparing these two groups is essentially a natural experiment, and you can longitudinally track people to properly estimate the effect of voting in this election on voting in future elections.
(GP: this paper is Persistence in Political Participation.)
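To make the regression-discontinuity idea concrete, here is a minimal sketch using entirely simulated data (the numbers, the 0.10 "habit" effect, and the bandwidth are all invented for illustration and are not from Meredith's paper). The running variable is age in days relative to the registration cutoff; the estimate is the jump in the outcome at the cutoff, from separate local linear fits on each side.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: running variable is age (in days) relative to the
# registration cutoff at the 18th birthday; x >= 0 means eligible to vote.
n = 2000
x = rng.uniform(-365, 365, n)
eligible = (x >= 0).astype(float)

# Simulated future turnout: a smooth trend in x plus a jump at the cutoff.
# The jump (set here to 0.10) is the habit-formation effect we want back.
y = 0.30 + 0.0001 * x + 0.10 * eligible + rng.normal(0, 0.05, n)

# Local linear fit on each side of the cutoff, within a bandwidth h.
h = 180
left = (x < 0) & (x > -h)
right = (x >= 0) & (x < h)
b_left = np.polyfit(x[left], y[left], 1)
b_right = np.polyfit(x[right], y[right], 1)

# The RD estimate is the discontinuity in the fitted values at x = 0.
rd_effect = np.polyval(b_right, 0) - np.polyval(b_left, 0)
print(f"Estimated discontinuity at cutoff: {rd_effect:.3f}")
```

The key design choice is that only observations near the cutoff are compared, which is exactly the "just made the deadline vs. just missed it" logic described above.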
GP: Can you tell us more about how bad weather helps establish the usefulness of LinkedIn?
SN: I've mentioned this in some talks and it's more of a hypothetical, but in terms of the intuition I think it's a nice illustration of the instrumental variables technique which is used to attempt to identify some level of causation. Let's say we wanted to measure whether and by how much activity on LinkedIn predicts someone getting a new job. The problem is that if you were to predict one with the other, you end up with a selection problem given the people who decide to update their LinkedIn profiles.
So, you need some source of exogenous (random) variation. If it's raining outside, then wouldn't you expect more people to stay in and stream Netflix? Or use the internet in general? Assuming this is true, you can then use the random variation in the weather as an 'instrument' to correlate it back to LinkedIn activity (note that the weather is not correlated with my ability to get a job) and thus how LinkedIn activity affects positive job outcomes. This example is a bit of a stretch and I'm not sure if the weather is a strong enough instrument for LinkedIn activity, but it's a nice example.
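The instrumental-variables logic above can be sketched with two-stage least squares on fully simulated data (the variable names, effect sizes, and the "ambition" confounder are all hypothetical, chosen only to mirror the selection problem described). A naive regression is biased by the confounder; instrumenting activity with rain recovers something close to the true effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical setup: an unobserved confounder ("ambition") drives both
# LinkedIn activity and job outcomes, which biases a naive regression.
ambition = rng.normal(0, 1, n)
rain = rng.binomial(1, 0.3, n).astype(float)   # instrument: exogenous

# Rain nudges activity up; ambition also raises it (the selection problem).
activity = 1.0 * rain + 1.0 * ambition + rng.normal(0, 1, n)

# True causal effect of activity on job outcomes is set to 0.5.
jobs = 0.5 * activity + 1.0 * ambition + rng.normal(0, 1, n)

# Naive OLS: biased upward because ambition is omitted.
X = np.column_stack([np.ones(n), activity])
ols = np.linalg.lstsq(X, jobs, rcond=None)[0][1]

# Two-stage least squares: (1) predict activity from the instrument,
# (2) regress jobs on the predicted (exogenous) part of activity.
Z = np.column_stack([np.ones(n), rain])
activity_hat = Z @ np.linalg.lstsq(Z, activity, rcond=None)[0]
X2 = np.column_stack([np.ones(n), activity_hat])
iv = np.linalg.lstsq(X2, jobs, rcond=None)[0][1]

print(f"OLS estimate: {ols:.2f}")
print(f"IV  estimate: {iv:.2f}")
```

Note the parenthetical in the answer is exactly the exclusion restriction: the instrument (weather) must affect job outcomes only through LinkedIn activity, never directly.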
GP: Tell us about your experience with LinkedIn Analytics team - what was most interesting?
SN: LinkedIn was a fascinating place to work. The data infrastructure, tools, and talent were all on the frontier. The most interesting part of that team, and something that other companies try to emulate, is that you have such a diverse group of folks. Everyone has a different expertise and view on how to approach a problem, which at the end of the day makes everyone smarter. This was the big fingerprint left by former Chief Scientist DJ Patil. I got a lot out of my experience there, and this approach is something I am actively doing with my team at Accretive Health.
GP: What convinced you to join Accretive and what are some of the more interesting problems you are working on now?
SN: The ability to have huge societal impact and solve some fundamental problems in a very screwed up industry. I chose Accretive and not a healthcare startup or some other larger company because of the rich dataset that they have and the ability to implement and scale solutions at the large set of hospitals that Accretive works with. It's not easy to find companies that are working jointly on behalf of hospitals, doctors, and patients to increase the quality of care, decrease cost, all while helping hospitals be smarter about how they bill out to insurance companies.
It's a set of very practical problems, both from a business perspective as well as something that I can tell my family I'm working on. We're working on a bunch of problems, but some of the more interesting ones are identifying the patients that doctors should be proactive in reaching out to (rather than a reactive system), and using machine learning to help hospitals be smarter about how they send out bills. They are both fascinating problems from the data science perspective, while one being more about patient care and the other being a more practical business problem. Very fun stuff for real data nerds, but I really think of it all as the "good work".
GP: What are some of your favorite analytics tools, for you and your teams?
SN: R, Python, Gephi, Excel. We have some experience with other proprietary statistical languages but for several reasons I very much prefer to use R and Python.
GP: What recent book have you read and liked?
SN: Overtreated: Why Too Much Medicine Is Making Us Sicker and Poorer, by Shannon Brownlee. You've got to read this book if you're a health care newcomer or thinking of getting into the industry, as I was. It will blow your mind at how screwed up health care is, and it will massively underscore how data and data science are at the core of the solutions.
GP: What advice do you have for aspiring data scientists?
SN: Focus less on algorithms and fancy technology & more on identifying questions, and extracting/cleaning/verifying data. People often ask me how to get started, and I usually recommend that they start with a question and follow through with the end-to-end process before they think about implementing state-of-the-art technology or algorithms. Grab some data, clean it, visualize it, and run a regression or some k-means before you do anything else. Surprisingly, that basic set of skills is something a lot of people are just not good at, but it is crucial.
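The "clean it, run a regression or some k-means" workflow can be sketched end-to-end in plain NumPy on toy data (the dataset, the missing entries, and the cluster locations are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical raw data: a noisy linear relationship with a few
# missing values, the kind of mess you hit before any modeling.
x = rng.normal(5, 2, 300)
y = 3.0 * x + rng.normal(0, 1, 300)
x[[10, 50]] = np.nan                     # simulate missing entries

# Step 1: clean - drop rows with missing values.
mask = ~np.isnan(x)
x, y = x[mask], y[mask]

# Step 2: a simple regression (the slope should come out near 3).
A = np.column_stack([np.ones_like(x), x])
intercept, slope = np.linalg.lstsq(A, y, rcond=None)[0]
print(f"fitted slope: {slope:.2f}")

# Step 3: plain k-means (Lloyd's algorithm) on two obvious clusters.
pts = np.vstack([rng.normal(0, 0.5, (100, 2)),
                 rng.normal(5, 0.5, (100, 2))])
centers = pts[rng.choice(len(pts), 2, replace=False)]
for _ in range(20):
    labels = np.argmin(((pts[:, None] - centers) ** 2).sum(-1), axis=1)
    centers = np.array([pts[labels == k].mean(axis=0) for k in range(2)])
print("cluster centers:", centers.round(1))
```

Nothing here is state of the art, which is the point: question, data, cleaning, and a basic model first.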
GP: Your opinion on the hype around Big Data - how much is real?
SN: Overhyped. Big data is more of a sudden realization of all of the things that we can do with the data than it is about the data themselves. Of course, it is also true that there is just more data accessible for analysis, and that starts a powerful and virtuous spiral. For most companies, more data is a curse: they can barely figure out what to do with what they had in 2005.