Open Data Science in Collaborative Workflows – IBM June 6 event
On June 6, IBM will share important announcements for making R, Spark, and open data science a sustainable business reality at the Apache Spark Maker Community Event in San Francisco. Attend in person or watch live.
Data science is a team activity. Ideally, you want to build a team in which diverse specialists pool their skills and knowledge, wield a core set of sophisticated productivity tools, and collaborate flexibly and efficiently.
Data scientists produce a steady stream of machine learning, predictive, segmentation, and other advanced analytic models. As a team, their effectiveness depends on having advanced tools such as Spark and languages such as R and Python at their disposal. It also depends on having tools to support creative design, agile collaboration, and workflow management of data, algorithms, models, and other artifacts. In addition, high-performance teams need distributed environments that can accelerate automation, to the maximum extent possible, of many pipeline functions, ranging from upfront data discovery and acquisition to downstream data wrangling, refinement, modeling, exploration, and governance.
Open environments are revolutionizing the data-science landscape. Increasingly, the more innovative ideas for data-science projects are emanating from new participants, such as self-taught “citizen data scientists,” as well as from crowdsourcing communities. In the process of opening to fresh thinking, data-science teams are becoming inexorably more dynamic, creative, and productive. These 21st-century data-science collaborations will thrive on open tools and platforms that enable collaborative sharing of ideas, samples, templates, models, requests, and feedback across geographies, projects, and platforms.
As data-science initiatives open up to include diversified knowledge ecosystems, team performance will grow by orders of magnitude. In open data-science environments, team productivity will expand along all of the following dimensions:
- produce, refine, and deploy a far wider range of machine learning and other statistical models and applications more rapidly than ever;
- develop these artifacts in a much wider range of tools and languages;
- design a greater number of models that incorporate more complex feature engineering and a wider range of predictors;
- construct these models from much larger and more diversified libraries of algorithms;
- train and score the models from larger volumes and varieties of data sources more rapidly;
- accelerate data acquisition, transformation, and preparation in a more automated fashion; and
- deploy models into a much wider range of business applications more rapidly and efficiently.
Though the productivity potential is undeniable, there is a flipside risk. Open data-science teams may become too productive for their own good. High-performance teams may become swamped with more models (and with more versions of those models in various stages of iterative refinement) than they can easily and securely track and manage.
To mitigate these risks, teams will need collaborative workflow environments that support continual tracking and control of their collective work product. These governance features will be essential for maximizing effectiveness and reducing wasted resources throughout the data-science lifecycle.
How will you tap into the exciting promise of open data science in collaborative development initiatives while also ensuring strong workflow, governance, security, and management of these efforts?
On June 6, IBM will share important announcements for making R, Spark, and open data science a sustainable business reality. At the Apache Spark Maker Community Event, IBM will host a stimulating evening of keen interest to data scientists, data application developers, and data engineers. The event will feature special announcements, a keynote, and maker awards. Leading industry figures who have already committed to participate include John Akred, CTO, Silicon Valley Data Science; Ritika Gunnar, Vice President of Offering Management, IBM Analytics; Todd Holloway, Director of Content Science and Algorithms, Netflix; Matthew Conley, Data Scientist, Tesla Motors; and Nick Pentreath, Spark Technology Center.