Interview: Hobson Lane, SHARP Labs on the Beauty of Simplicity in Analytics

We discuss Predictive Analytics projects at Sharp Labs of America, common myths, value of simplicity, tools and technologies, and notorious data quality issues.

Hobson Lane is Principal Data Scientist at Sharp Labs of America. He has designed, simulated, patented, and built terrestrial and space robotic systems, often stretching his optical, thermal, mechanical, chemical, and electrical engineering knowledge. He has chased his unreliable software across a field and into a street sign (1st and 2nd DARPA autonomous vehicle Grand Challenge competitor). He has also renovated and skippered a fiberglass sailboat halfway around the world with his wife. He can't wait to see what automation technology will make possible next year.

Here is my interview with him:

Anmol Rajpurohit: Q1. Can you share some of the prominent use cases of Predictive Analytics at Sharp Laboratories of America? What are some projects that you are currently working on or have recently completed?

Hobson Lane: SHARP contracted me to mine ERP (Enterprise Resource Planning) data and predict product return rates for consumer electronics products. The objective was to find "business-actionable intelligence." For one product line the data revealed savings of millions of dollars at the cost of a minor process change. This change will also significantly improve the customer experience and product quality and thus increase future revenue.

For a second project, SHARP needed predictions of commercial building daily power consumption profiles and peaks. We delivered a neural net and quantile filter with predictions that will enable fully autonomous operation of a system that dramatically reduces the energy bill for commercial buildings where it is deployed.

AR: Q2. Based on your extensive experience, what do you observe as the most common myths (or errors) prevalent in Predictive Analytics?


HL: One persistent myth is that complex, inscrutable models are required to deliver valuable predictions. We data scientists are often responsible for propagating that myth, due to an obvious conflict of interest.

For example, on that project at SHARP that I mentioned, the non-technical sales team came up with a database query and statistical measure that was sufficiently accurate to monitor the effect of process improvements and forecast return rates well into the future. And it was in place, integrated into their process long before my slightly more accurate, precise, and complicated model was ready. And we continued developing and implementing "value-add" features such as natural language processing and interactive visualizations long after the lion's share of the value had been extracted from the data.
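To illustrate how simple such a measure can be, here is a minimal sketch in pandas. The interview does not describe the sales team's actual query or statistic, so the column names, the figures, and the choice of a 3-month rolling mean are all assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical monthly totals; the real ERP schema and numbers are not given in the interview.
monthly = pd.DataFrame({
    "month": ["2014-01", "2014-02", "2014-03", "2014-04", "2014-05", "2014-06"],
    "units_sold": [1000, 1200, 1100, 1300, 1250, 1400],
    "units_returned": [50, 66, 44, 52, 50, 42],
})

# Return rate: the fraction of units sold in a month that were returned.
monthly["return_rate"] = monthly["units_returned"] / monthly["units_sold"]

# A 3-month rolling mean (an assumed smoothing choice) makes the trend easier to see;
# the first two months have no value because the window is not yet full.
monthly["return_rate_3mo"] = monthly["return_rate"].rolling(window=3).mean()

print(monthly)
```

A falling smoothed rate after a process change is the kind of signal a non-technical team could monitor without a more complicated model.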

AR: Q3. In general, what approach do you follow for high-impact Predictive Analytics projects? How do you measure the success of your projects?

HL: I talk with people who understand the business area and technology I'm analyzing. I use data to put statistical weight behind their hunches, or, sometimes, steer misperceptions back onto sound scientific ground.

We brainstorm while visualizing and slicing data from various angles until a trend emerges. Only then do the executives have strong supporting evidence and team buy-in to support them as they begin the challenging task of redirecting a large, complex organization.

AR: Q4. What tools and technologies are used most often by your team?

HL: I default to open source. Fortunately, SHARP's Big Data project supported my preference for Python, Django, Postgres, numpy, pandas, d3.js, and bootstrap.js, on Linux. These are the most flexible and effective tools I know of for predictive data analytics and they've not let me down in the 4 years since I settled into that stack. On the devops side, I'm a big fan of GitHub and Travis CI. For presentations I use reveal.js, sometimes with a Choose-Your-Own-Adventure slide wired up to a twilio SMS number.

AR: Q5. What have been the most notorious data quality issues you have come across? How did you deal with them?

HL: Over 21 years of munging data I've seen, and generated, a wide variety of data errors, like CSVs with unquoted strings and delimiters, misspelled and multilingual categorical (ENUM) values, unenforced database rules, churning database schemas, mutating primary keys, impossible dates, insidiously misidentified units of measure, even Easter eggs encoded in data.

Fortunately, Django, Postgres, pandas, and Python itself have features that make it straightforward to identify outliers and impute or delete troublesome records.
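As a rough sketch of what that can look like in pandas, the example below handles two of the problems mentioned above: an impossible date and a misidentified unit of measure. The records, the column names, and the choice of the interquartile-range rule with median imputation are all assumptions for illustration, not the method used at SHARP.

```python
import numpy as np
import pandas as pd

# Hypothetical records: one impossible date (Feb 30), one missing weight,
# and one weight that was apparently entered in grams instead of kilograms.
records = pd.DataFrame({
    "shipped": ["2014-01-15", "2014-02-30", "2014-03-10", "2014-04-01", "2014-05-20"],
    "weight_kg": [2.1, 2.3, np.nan, 2200.0, 2.2],
})

# Impossible dates become NaT (missing) instead of raising an error.
records["shipped"] = pd.to_datetime(records["shipped"], errors="coerce")

# Flag outliers with the interquartile-range rule: more than 1.5 * IQR beyond the quartiles.
q1, q3 = records["weight_kg"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (records["weight_kg"] < q1 - 1.5 * iqr) | (records["weight_kg"] > q3 + 1.5 * iqr)

# Treat flagged outliers as missing, then impute all missing weights with the median.
records.loc[is_outlier, "weight_kg"] = np.nan
records["weight_kg"] = records["weight_kg"].fillna(records["weight_kg"].median())

# Delete the records whose date could not be parsed.
clean = records.dropna(subset=["shipped"]).reset_index(drop=True)

print(clean)
```

Whether to impute or delete a troublesome record depends on the analysis; here the weight is imputed because a plausible value can be estimated from the rest of the column, while the record with the unparseable date is dropped.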

Second part of the interview