5 Key Takeaways from Strata London 2018
Five highlights and thoughts from attending Strata London 2018.
1. Regulation and Ethics.
The General Data Protection Regulation (GDPR) is here to stay, and many of the topics throughout the conference were related to it. It affects how Data Science will be performed, forcing companies and professionals who handle information to follow guidelines on how user data should be used. However, this regulation will also create a new opportunity for companies to develop products based on trust. “GDPR, one of the most significant regulations in the past 20 years” Alison Howard – Microsoft.
On the Internet, there are plenty of tutorials, workshops, MOOCs, etc. covering everything from introductions to programming languages, to preparing and visualizing data, to creating models and operationalizing them. However, how many MOOCs are there on ethics? It would be interesting to find a report on how much attention these Machine Learning resources give to ethics. We shouldn’t be concerned only with how robust our model is; the ethical implications of the use case should be considered a critical part of the Machine Learning pipeline.
Fig. 1: Know how your work is impacted by GDPR
2. Optimize for transparency and trust.
“Make sure users trust your solution from the start” Jean Francois Puget – IBM. That’s a challenge when most users want to understand how a model works before they trust it. For instance, a person would probably have more confidence in their lawyer than in an AI suggestion. Currently, there are many black-box models doing amazing things in the field, and the good news is that we’re also making progress on methods to explain them. One of them is Local Interpretable Model-Agnostic Explanations (LIME), which consists of perturbing the inputs of a model to see how its predictions change. This can help explain the model's behaviour in more human terms. Another is Causal Models to Explain Learning (CAMEL), a program run by DARPA with two main goals:
- Maintain a high level of learning performance while being more explainable.
- Enable human practitioners to manage, understand and trust the output of artificially intelligent partners.
Fig. 2: CAMEL
These methods are laying the groundwork for a future in which professionals in today's fields will work alongside AI agents.
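To make the perturbation idea behind LIME more concrete, here is a minimal sketch of a local explanation. It is not the LIME library itself: the `black_box` model, the `explain_locally` helper, and all parameter values are hypothetical and chosen only for illustration. The sketch perturbs one input, queries the model on the perturbed points, weights them by proximity, and fits a weighted linear surrogate whose coefficients indicate how much each feature matters near that input.

```python
import numpy as np


def black_box(X):
    """Hypothetical opaque model: a nonlinear score of two features."""
    return np.sin(X[:, 0]) + X[:, 1] ** 2


def explain_locally(predict, x, n_samples=2000, scale=0.1,
                    kernel_width=0.25, seed=0):
    """Return one local linear weight per feature around the instance x.

    All defaults are illustrative assumptions, not tuned values.
    """
    rng = np.random.default_rng(seed)
    # 1. Perturb the instance with small Gaussian noise.
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.size))
    # 2. Ask the black box for predictions on the perturbed points.
    y = predict(Z)
    # 3. Weight each perturbed point by how close it is to x.
    distances = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)
    # 4. Fit a weighted linear surrogate (intercept + one slope per feature).
    A = np.hstack([np.ones((n_samples, 1)), Z - x])
    sqrt_w = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(A * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return coef[1:]  # drop the intercept, keep the feature weights


x0 = np.array([0.0, 1.0])
local_weights = explain_locally(black_box, x0)
print(local_weights)
```

Near `x0` the surrogate's weights should roughly track the local slope of the model in each feature, which is the kind of "which input mattered here" answer a user can reason about.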
3. Data Science Process = Teamwork + Pipeline + Documentation
One important message, delivered over and over again across all kinds of sessions, was the importance of the Data Science process. I would say that Pipeline + Teamwork + Documentation are the three components needed to deliver good projects. A good pipeline and a well-documented record of the outcome of every step let you deliver results faster and avoid mistakes made in previous projects. In a talk on lessons learned from Data Science projects at Microsoft, Danielle Dean shared key points they’ve learned from more than 100 projects. Among them were:
- Keep a human in the loop. This was mentioned in other sessions as well; let’s remember that AI is extremely good at one individual task.
- Accumulate a toolbox of tricks. This is already done with tools such as StackOverflow, but it’s important to always document things that have worked for us in the past and reuse them.
- Adopt a process. She referred to a good resource, Microsoft's Team Data Science Process.
Fig. 3: Team Data Science Process lifecycle
4. Deep Learning and Streaming Data.
Deep Learning is the new Machine Learning; this is not news. If we look at the most cited papers in Machine Learning, most of them are related to Deep Learning. There were sessions on applications of Deep Learning in manufacturing, video streams, visual inspection, recommender systems, etc.
Another interesting concept is Streaming Data, which can be seen as the new trend. Before, we talked about Big Data vs Data; now the conversation revolves around Streaming Data vs Batch Data.
Fig. 4: Neural Network Demo
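As a small illustration of the Streaming Data vs Batch Data distinction, the sketch below computes the same statistic both ways. The function names and the `readings` values are hypothetical. The batch version needs the whole dataset before it can answer, while the streaming version updates its answer as each value arrives and never stores past values.

```python
from typing import Iterable, Iterator, Sequence


def batch_mean(values: Sequence[float]) -> float:
    """Batch style: requires the complete dataset up front."""
    return sum(values) / len(values)


def streaming_mean(stream: Iterable[float]) -> Iterator[float]:
    """Streaming style: yields an updated mean after every new value."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: shift the mean toward the new value.
        mean += (value - mean) / count
        yield mean


readings = [2.0, 4.0, 6.0, 8.0]  # illustrative data only
running = list(streaming_mean(readings))
print(running)               # one estimate per incoming value
print(batch_mean(readings))  # a single answer once all data is available
```

Both approaches agree once all the data has been seen; the difference is that the streaming version has a usable answer at every step.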
5. Keep it simple.
A lot of data-related tools were covered at the conference. It’s essential to bear in mind that most of the technologies used today may not be the same in a few years. Hadoop, Spark, RStudio, Python, TensorFlow, Kafka, Neo4j, Docker, and Apache Flink were some of the examples mentioned across the event. However, in his session, Jeroen Janssens focused on one tool that has been around for about 40 years and is still relevant: the Bourne shell. He gave 50 reasons to use the command line for Data Science. This makes me think of the KISS principle (Keep it simple, stupid), which was also mentioned in one of the sessions. With all the tools mentioned above, we should always remember that a simpler process can bring extraordinary results as well.
Fig. 5: Cowsay command in Bash