KDnuggets Exclusive: Part 2 of the interview with Paco Nathan

We discuss Paco's upcoming book "Just Enough Math", problems with the current university curriculum around Math for Data Science, and Big Data trends.

Paco Nathan recently delivered a presentation at Strata 2014 on “Apache Mesos as an SDK for Building Distributed Frameworks”. He walked through the unique benefits of Apache Mesos, an open source cluster manager, and shared case studies of Mesos used in production at scale (at Twitter, Airbnb, etc.). Besides his presentation, Paco also gave a tutorial on “Big Data Workflows on Mesos Clusters”.

Here is Part 1 of the interview.

Anmol Rajpurohit: 4. Can you talk a bit about your upcoming book "Just Enough Math" - advanced math for business people? What do you see as the major problems with the current university curriculum around Maths for Data Science students?

Paco Nathan: That’s a fun one! For example, here’s a workshop coming up in Washington DC, based on that material: http://ow.ly/tRsWw

So many people get into college-level math but stop at Calculus. Most colleges have a rather lengthy track for Calculus before moving beyond it, 2-3 years in some cases. In particular, I find that it’s rare to encounter business people who have studied math beyond Calculus. That’s an artifact of Cold War priorities: highly-trained mechanical engineers were needed to build ICBMs (Intercontinental Ballistic Missiles), etc.

Priorities have changed enormously. These days, people in business need more understanding of how data at scale is being leveraged for their organization’s competitive advantage. They need to understand more about the high-ROI apps. “Computational Thinking” has emerged to describe this shift in thinking: away from Cold War era precepts, onto more practical everyday use of data and analytics.

I’ve been working on “Just Enough Math”, along with co-author Allen Day. Our goal is to teach “just enough” advanced math for business people, especially for use cases that depend on ML and large-scale data. Actually, I’m already teaching full-day workshops based on the material, and a large part of my livelihood depends on that, so I hope the project is working out well!

Our approach to this material follows a simple formula. We introduce a “morsel” of advanced math, describing the mathematical properties in accessible terms, and then we show a business use case as an illustration. We’re trying to work from popular business frameworks, the kind most MBAs would recognize immediately. Then we show 20-50 lines of Python code to solve the business problem, being careful that variable names align with the problem description so that the code is easy to follow. From that, we show results, visualize the insights obtained, and then suggest various open source frameworks to consider for further study. Throughout all of this, we bring in historical context, LOTS of links for subsequent reading, some mini interviews with key people in the field, and suggested books.

I’ve encountered many business execs in my workshops, eager to educate themselves about ML before making strategic decisions for their organizations. I find that MOOCs (Massive Open Online Courses) about ML are great, but they tend to lack accessibility and business context. So we’re providing those missing pieces! Having lots of fun doing it, too.

We’ve got an amazing review board, with 20+ experts: mathematicians, physicists, business executives, etc. The book is 3/4 complete, and we’re expecting galley copies to begin circulating this summer.

AR: 5. You have had a long career in Data Science, spanning more than 25 years. What trends have you witnessed that you would never have expected?

PN: I led an in-depth analysis of large companies (transnationals) circa 2000 and found unexpected trends (risk) that indicated two of the largest firms at the time were outliers and ostensibly at risk: Walmart and ATT. Clearly, Amazon went after Walmart at that point, while Google took the lead in areas that had belonged historically to ATT (which then became a hostile acquisition by its own spin-off SBC).

At the time, my analysis was considered a bit over the top. A friend who’s a science fiction author, Bruce Sterling, helped post it to a broader audience, but otherwise the material was marginalized. However, within 2-3 years those disruptions became painfully apparent. One can follow the trail of litigation between those firms in the early 2000s to see *how* starkly the two couplets of competitors fell into conflict, as industries experienced an enormous sea change based on leveraging data at scale.

Those two use cases, AMZN and GOOG, in turn helped drive the landscape for Big Data, Data Science as a practice (see Leo Breiman on “Algorithmic Modeling”), oh-so-many open source frameworks, and cloud computing that made many new use cases feasible.

When I was in grad school in the mid-1980s I’d started out in Machine Learning, then was forced to switch by department changes, so I moved into Distributed Systems. At the time ML was considered “not academic enough” at Stanford. Not so many years later, some of those same Stanford CS profs became quite wealthy (at least one billionaire) thanks to Machine Learning use cases!

On the one hand, I’m struck by how much “business as usual” became overturned, so rapidly and at such large magnitude. We could talk about technology trends, but the drivers come from use cases, and industry disruptions are what get noticed.

On the other hand, I’m grateful to have pursued a dual background in both math/data and systems engineering; that combination comes in quite handy now. I would not have expected that as a grad student 30 years ago. At the time it felt like a compromise.

Same question, looking ahead:

For the past 10 years (since 2004) I’ve been doing research in agriculture data+analytics, almost at the level of dissertation work, as a side project, waiting for proper timing. Two years ago I mentioned this among investors at a cocktail party and drew much laughter. Frankly, agriculture as a sector is extremely conservative and has about a 10-year cycle for new technology adoption. However, now with extensive drought in California there’s been an industry-wide wake-up call. Consider that 40% of the world’s population are farmers, that nearly $15T of global GDP is in agriculture, and also let’s not forget the impact of snowpack variance, water shortages, energy crisis, weather-related risks, GMOs, pesticides, etc. It is difficult to imagine any other field being quite as imperative as agriculture.

Needing to wait 10 years to work on something that really matters (not just some other ad-tech venture) is the part that was unexpected.

AR: 6. What interesting book have you recently read and liked?

PN: Not a book, but a new periodical is one of my favorite recent reads: BioCoder http://www.oreilly.com/biocoder/news.html
Think: Maker movement meets Radical Science.

In case you missed it, here is Part 1 of the interview.