Boosting Productivity of the Next-Generation Data Scientist: IBM June 6 event
On June 6, IBM will share important announcements for making R, Spark, and open data science a sustainable business reality at the Apache Spark Maker Community Event in San Francisco. Attend in person or watch live.
Data science has become an axis of the 21st century economy.
Data scientists are key developers in the era of cognitive computing and open data. Their core focus is on developing repeatable data-application artifacts, such as machine learning and other statistical models, for deployment within always-on business environments.
Boosting the productivity of data scientists requires teams with the right mix of individuals of diverse aptitudes, skills, and roles, as I discussed in this post from late last year. It also requires that teams build processes and collaboration environments for accelerating repeatable pipelines of patterned tasks across the data-science lifecycle. These tasks range from the largely manual, such as building statistical models, visualizing their performance against real-world data, and explaining the results, to those that can be automated to a considerable degree. The latter include such traditionally labor-intensive tasks as data discovery, profiling, sampling, and preparation, as well as model building, scoring, and deployment.
Open-source efforts are beginning to address the need for standardized machine-learning pipelines to boost the productivity of data scientists within complex team environments. In this post from last year, O’Reilly’s Ben Lorica discusses one such effort, which promises the following productivity benefits:
- Automating machine-learning ingest and analysis of new data types, especially the image, audio, and video content so fundamental to the streaming-media and cognitive-computing revolutions, through sample pipelines for computer vision and speech as well as more data loaders for other data types
- Scaling machine-learning algorithm execution on massively parallel Spark-based runtimes, thereby accelerating the training, iteration, and refinement of more sophisticated models on vision, speech, and other media data
- Streamlining end-to-end machine-learning processing across myriad steps (data input through model training and deployment) and diverse tools and platforms through support for a standard API
- Enabling richly multifunctional machine-learning processing pipelines through extensible incorporation of diverse data loaders, memory allocators, featurizers, optimizers, libraries, and other components
- Benchmarking the results of machine-learning projects, in keeping with well-defined error bounds, to enable iterative refinement and reproducibility of model and algorithm performance
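The pattern behind these benefits, composable pipeline components (data loaders, featurizers, models) that share a standard API, can be sketched in plain Python. All class and function names below are illustrative inventions for this post, not the API of any specific framework; real systems such as Spark ML follow a similar fit/transform contract.

```python
# A minimal sketch of a composable machine-learning pipeline with a
# standard fit/transform API. All names are illustrative assumptions.

class Stage:
    """Base interface that every pipeline component implements."""
    def fit(self, data):
        return self  # stateless stages simply return themselves
    def transform(self, data):
        raise NotImplementedError

class Featurizer(Stage):
    """Turns raw records into numeric feature vectors (here: just lengths)."""
    def transform(self, data):
        return [[len(record)] for record in data]

class MeanModel(Stage):
    """Toy 'model': learns the mean feature value during fit, then
    scores new records by their distance from that mean."""
    def fit(self, features):
        values = [row[0] for row in features]
        self.mean = sum(values) / len(values)
        return self
    def transform(self, features):
        return [abs(row[0] - self.mean) for row in features]

class Pipeline(Stage):
    """Chains stages: each stage's output feeds the next stage's input."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        for stage in self.stages:
            stage.fit(data)
            data = stage.transform(data)
        return self
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

pipeline = Pipeline([Featurizer(), MeanModel()])
pipeline.fit(["spark", "kafka", "hadoop"])
scores = pipeline.transform(["r", "python"])
```

Because every component honors the same interface, swapping in a different featurizer or model, or extending the chain with new loaders and optimizers, requires no change to the pipeline driver, which is what makes such designs extensible and benchmarkable end to end.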
Open source tools of all sorts—ranging from Spark, R, and Python to Hadoop, Kafka, and beyond—are the essential foundation for today’s data-science teams. Deepening the productivity of data-scientist teams requires that all specialists—including statistical modelers, data engineers, data application developers, and subject matter experts—share an open-source productivity environment that enables them to:
- Source data from diverse data lakes, big-data clusters, cloud data services, and other sources;
- Discover, acquire, aggregate, curate, prepare, pipeline, model, and visualize complex, multi-structured data;
- Prototype and program data applications in Spark, R, Python, and other languages for execution in in-memory, streaming, and other low-latency runtime environments;
- Tap into a rich library of algorithms and models for statistical exploration, data mining, predictive analytics, machine learning, natural language processing, and other functions;
- Develop, share, and reuse data-driven analytic applications as composable microservices for deployment in hybrid cloud environments;
- Secure, govern, track, audit, and archive data, algorithms, models, metadata, and other assets throughout their lifecycles.
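The "composable microservices" item above can be made concrete with a small sketch: a model's scoring logic wrapped as a stateless JSON-in/JSON-out handler. In a hybrid-cloud deployment this function would sit behind an HTTP endpoint; keeping it a pure function makes it easy to test, version, and compose with other services. The payload shape, scoring rule, and threshold here are illustrative assumptions, not from any IBM offering.

```python
import json

# Hypothetical sketch: a trained model's scoring logic packaged as a
# stateless microservice handler with a JSON contract.

THRESHOLD = 0.5  # illustrative decision threshold

def score(features):
    """Toy scoring rule standing in for a real model's predict()."""
    return min(1.0, sum(features) / (len(features) * 10))

def handle_request(body: str) -> str:
    """JSON-in / JSON-out contract for the scoring microservice."""
    payload = json.loads(body)
    s = score(payload["features"])
    return json.dumps({"score": s, "accept": s >= THRESHOLD})

response = handle_request(json.dumps({"features": [4, 7, 9]}))
```

Because the handler depends only on its input body, the same artifact can be deployed unchanged across on-premises and cloud environments, and its inputs and outputs can be logged for the governance and audit needs listed above.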
How will you achieve continued improvements in the productivity of your data scientists?
At the Apache Spark Maker Community Event in San Francisco on June 6, IBM will host a stimulating evening of keen interest to data scientists, data application developers, and data engineers. The event will feature special announcements, a keynote, and maker awards. Leading industry figures who have already committed to participate include John Akred, CTO, Silicon Valley Data Science; Ritika Gunnar, Vice President of Offering Management, IBM Analytics; Todd Holloway, Director of Content Science and Algorithms, Netflix; Matthew Conley, Data Scientist, Tesla Motors; and Nick Pentreath, Spark Technology Center.