Strata Data Conference, NYC – Key Trends and Highlights

Strata is a conference I very much enjoyed attending. This year, I observed a few common themes that ran across much of the conference content: Data Science Collaboration, Data Ethics, and Platform Optimization.

By Sean Livelsberger, Target Data.

This was my second consecutive year attending the O’Reilly Strata Conference in New York City. This year, I was fortunate enough to be the winner of a free pass courtesy of KDnuggets.

This is a conference that I have very much enjoyed attending. Fundamentally, it is focused on the intersection of business and data. The event covers a wide range of tools, technologies, and techniques in the big data space. It also covers a number of important topics for data-driven businesses, such as artificial intelligence, machine learning, data strategy, collaboration, reproducibility, emerging architecture, and building data teams. The topics are presented through formats that include case studies, tutorials, panel discussions, and training sessions.

From my experience this year, I observed a few common themes that ran across much of the conference content.

  1. Data Science Collaboration
  2. Data Ethics
  3. Platform Optimization

I want to run through each theme, highlighting some of the more interesting presentations that I attended.

Data Science Collaboration

The challenge inherent to this concept was directly discussed in the session, “Data science at team scale: Considerations for sharing, collaborating, and getting to production” by Tristan Zajonc (Cloudera), Thomas Dinsmore (Cloudera), and Lucas Glass (QuintilesIMS).

This talk highlighted some common issues on data science teams. The first, and arguably most important, was that analytics executives frequently report that collaboration is not a focus for them. On top of that, team members face many collaboration-related issues of their own. Surveys report that Data Scientists frequently spend time on projects that are ultimately never used by their companies, leaving their work outside the mainstream. Data Scientists also frequently encounter a disconnect between project goals and feasibility.

In light of these issues, the presenters explored why collaboration can be so difficult. First, they cited the number of people who can touch a data science project. The list is naturally extensive and can range from the business analyst to the data scientist to the engineer, and close collaboration is challenging on very large teams. A second challenge arises in balancing the agility of a data science team against more conservative, structured IT requirements. Data science teams also face disagreements over the use of particular packages, libraries, and tools, which complicates code sharing.

With such challenges in mind, the presenters described some key, high-level workflow elements that would support collaboration. At the base of the workflow, there needs to be secure access to one's data lake. This should be in compliance with any IT measures, but also easily accessible to data science team members. Against the data lake, the presenters advised taking a code-first approach. While they felt this approach has proven most successful in their experience, they noted that it should be executed within a workbench where the Data Scientists are coding in R or Python and using version control. Doing so allows the Data Scientists to compute in the same environment, collaborating over both code and results. Relatedly, the use of containerization technology was suggested as a way to aid the collaborative process. Once the exploratory and development phase is complete, the team would then pass off their unified results to an engineering team.
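One way to picture the hand-off the presenters describe is to bundle results with their provenance, so engineering receives something reproducible rather than an opaque artifact. The sketch below is my own illustration, not code from the talk; every name in it (the function, the field layout, the commit hash) is hypothetical.

```python
import json
import platform
import sys
from pathlib import Path

def package_results(params: dict, metrics: dict, code_revision: str, out_path: str) -> dict:
    """Bundle model parameters and metrics with provenance metadata.

    `code_revision` would typically be the git commit hash of the script
    or notebook that produced `params`, so an engineering team can check
    out that exact revision (hypothetical workflow, not from the talk).
    """
    bundle = {
        "params": params,
        "metrics": metrics,
        "provenance": {
            "code_revision": code_revision,
            "python_version": sys.version.split()[0],
            "platform": platform.system(),
        },
    }
    # Write the bundle to disk as the hand-off artifact.
    Path(out_path).write_text(json.dumps(bundle, indent=2))
    return bundle

# Example: a data scientist records the model alongside the commit that
# produced it; engineering reproduces the result from that revision.
bundle = package_results(
    params={"alpha": 0.1, "max_depth": 6},
    metrics={"auc": 0.87},
    code_revision="a1b2c3d",  # placeholder commit hash
    out_path="model_bundle.json",
)
print(bundle["provenance"]["code_revision"])  # → a1b2c3d
```

In a containerized workbench, the same idea extends naturally: the container image tag joins the commit hash as part of the recorded provenance, pinning both code and environment.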

This talk ultimately provided a good overview of the challenges inherent to collaboration in the data science space as well as a conceptual overview of a framework more conducive to collaboration.

Data Ethics

Artist Sam Lavigne (The New Inquiry) referenced this topic throughout his keynote presentation, “White Collar Crime Risk Zones.” Lavigne showcased a street crime prediction application, noting that using machine learning without careful consideration of the data source can create problematic results. In the case of street crime prediction, applying algorithms carelessly to data from police departments allows the algorithms to train on, and ultimately reinforce, the biases inherent to the institutions providing the data, which in this case are often race and class based.

Lavigne then went on to explain a brilliant statement piece that he had built on this topic. Essentially, he mimicked the street crime prediction methodology, but instead aimed to predict white-collar crime, using a number of financial and regulatory data sources for training. In displaying the resulting prediction map, as well as an algorithmically generated composite image of the average white-collar criminal, Lavigne highlighted the problematic nature of these prediction efforts: each tends to criminalize class and race divisions. With a relevant and important example, Lavigne illustrated the ethical obligation to give careful consideration to the biases inherent within data sources themselves.
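The feedback loop Lavigne warns about can be shown with a toy model. This is not his code or methodology, just a minimal sketch under an invented assumption: two neighborhoods with identical true offense rates, one patrolled twice as heavily, so its arrest records are twice as numerous.

```python
from collections import Counter

def train_hotspot_model(arrest_records):
    """Toy 'predictive policing' model: predict future crime wherever
    past arrests were logged. arrest_records is a list of
    (neighborhood, arrested) pairs drawn from historical data."""
    return Counter(n for n, arrested in arrest_records if arrested)

def predicted_hotspots(model, top_k=1):
    """Return the top_k neighborhoods by recorded arrest count."""
    return [n for n, _ in model.most_common(top_k)]

# Hypothetical data: true offense rates are equal in A and B, but A was
# patrolled twice as heavily, so twice as many arrests were logged there.
biased_history = [("A", True)] * 20 + [("B", True)] * 10

model = train_hotspot_model(biased_history)
print(predicted_hotspots(model))  # → ['A']
# The model directs still more patrols to A, which generates still more
# arrest records there: a feedback loop that entrenches the original bias.
```

The model faithfully learns the patrol pattern, not the crime pattern, which is exactly the failure mode Lavigne's piece makes visible.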

Platform Optimization and Integration

That this is a crucial industry need was clear both from the amount of coverage it received in various talks and from my conversations with vendors in the expo hall.

On a conceptual level, this idea was covered in Kurt Brown’s (Netflix) talk, “24 Netflix-style principles and practices to get the most out of your data platform.” On a personal note, the Netflix employee presentations are, in my opinion, among the best at Strata. In both years that I have attended, their talks have stood out from the rest of the conference content, and I suspect this is primarily due to how the company’s unique corporate culture comes through in its employees’ presentations. With this in mind, here are a few key takeaways from the talk.

Principle 4: Avoid analysis paralysis.

This principle refers to making critical changes and choices about the data platform. Brown pointed out that there may be good arguments on all sides, but ultimately a decision has to be made and tradeoffs accepted. The advice here was to put a good structure in place for making these decisions, with the suggestion that the final call typically go to the person most affected by the potential changes.

Principle 10: The users own the platform.

This point was made in reference to who gets to decide the features of a given platform. Brown noted that if the platform team owns its own product, it will very likely make locally focused decisions. Such decisions may make the most sense for the platform team but might not help the user base. Since nobody wants a scenario where users abandon the product, Brown suggested it is better to fight the few small fires that come with heavy platform usage than to have few fires to fight because the platform goes unused.

Principle 12: You can’t have it all.

This was a very simple point: for a platform, one can choose only two of the following three options: flexibility, speed, and scale.

Principle 16: Building blocks.

Rather than pouring massive effort into perfect end-state products that theoretically suit all needs, focus on creating base building blocks that allow for future movement, expansion, and change.

Principle 21: Make it a shared decision.

Brown suggested making decisions together, alongside the users of the platform. It is important to let the users decide what they care about. With those needs in mind, the platform team can analyze the consequences, and from there the team and the users can agree on the tradeoffs together.

Bio: Sean Livelsberger is an analyst at Target Data located in Chicago, Illinois. His professional and educational interests lie within the fields of machine learning and artificial intelligence.