
The Acceleration of Data Science Excellence


If you’re a working data scientist, data engineer, or data application developer, attend the IBM DataFirst Launch Event on Sep 27 in New York City. Engage with open-source community leaders and practitioners and learn how to accelerate your processes for putting data to work.



Data scientists do painstakingly complex work under relentless time pressures. Their ability to deliver a consistent level of excellence from one project to the next depends on providing every team member with easy access to all necessary modeling, data management, and other productivity tools. It also depends on fostering a collaborative environment within which all statistical modelers, data engineers, app developers, and other specialists can do their best work.


Accelerating the data science lifecycle without compromising the quality of team deliverables can be a challenge. The typical day in the working life of a data science professional often involves any or all of the following detail-oriented tasks:

  • Acquiring data from diverse data lakes, big data clusters, cloud data services and more
  • Discovering, acquiring, aggregating, curating, preparing, pipelining, modeling and visualizing complex, multistructured data
  • Tapping into libraries of algorithms and models for statistical exploration, data mining, predictive analytics, machine learning, natural language processing, and interactive visualization, among other functions
  • Prototyping and programming data applications to be executed in in-memory, streaming, cloud, and other runtime environments
  • Securing, managing, tracking, auditing and archiving data, algorithms, models, metadata and other assets throughout their lifecycles

The best data science teams engineer their collaborative process to produce a steady flow of repeatable artifacts—for example, machine learning and other statistical models—that are designed to be deployed within always-on business environments. The key pipeline tasks to be accelerated include:

  • Discovery and acquisition of data from diverse data lakes, big data clusters, cloud data services and other stores
  • Accelerated ingestion and analysis of new data types—especially the image, audio and video content that is so fundamental to the streaming media and cognitive computing revolutions—through sample pipelines for computer vision and speech as well as data loaders for other data types
  • Prototyping, programming, and modeling of data applications in Spark, R, Python and other languages for execution in in-memory, streaming and other low-latency runtime environments
  • Development of data-driven applications using a reusable, composable library of algorithms and models for statistical exploration, data mining, predictive analytics, machine learning, natural language processing and other functions
  • Scaled machine learning algorithm execution on massively parallel Spark-based runtimes, accelerating the training, iteration and refinement of sophisticated models for vision, speech, and other media data
  • Streamlined end-to-end machine learning processing across myriad steps (data input through model training and deployment) and diverse tools and platforms through support for a standard API
  • Richly multifunctional machine learning processing pipelines through extensible incorporation of diverse data loaders, memory allocators, featurizers, optimizers and libraries, among other components
  • Benchmarking of the results of machine learning projects, in keeping with well-defined error bounds, to enable iterative refinement and reproducibility of model and algorithm performance
  • Security, governance, tracking, auditing, and archival of data, algorithms, models, metadata and other assets throughout their lifecycles
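
As a rough illustration of the benchmarking and reproducibility tasks listed above, the sketch below chains a data loader, a model-fitting step and an error metric, then repeats the run over several fixed seeds to report a mean error and its spread. It is a minimal example in plain Python, not any particular product's API: the synthetic data, the function names (`load_data`, `fit_line`, `rmse`, `run_once`, `benchmark`) and the 80/20 split are all assumptions made for illustration.

```python
import random
import statistics
from typing import List, Sequence, Tuple

Row = Tuple[float, float]  # (feature, target)


def load_data(seed: int, n: int = 200) -> List[Row]:
    """Hypothetical data loader: synthetic points scattered around y = 3x + 1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        rows.append((x, 3.0 * x + 1.0 + rng.gauss(0, 0.5)))
    return rows


def fit_line(rows: Sequence[Row]) -> Tuple[float, float]:
    """Ordinary least-squares fit of a line; returns (slope, intercept)."""
    mean_x = statistics.fmean(x for x, _ in rows)
    mean_y = statistics.fmean(y for _, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(model: Tuple[float, float], rows: Sequence[Row]) -> float:
    """Root-mean-square error of the fitted line on the given rows."""
    slope, intercept = model
    squared = [(slope * x + intercept - y) ** 2 for x, y in rows]
    return (sum(squared) / len(squared)) ** 0.5


def run_once(seed: int) -> float:
    """One end-to-end run: load, split 80/20, train, score on held-out rows."""
    rows = load_data(seed)
    random.Random(seed).shuffle(rows)  # seeded shuffle keeps the split repeatable
    split = int(0.8 * len(rows))
    model = fit_line(rows[:split])
    return rmse(model, rows[split:])


def benchmark(seeds: Sequence[int]) -> Tuple[float, float]:
    """Mean held-out error and its standard deviation across seeded runs."""
    scores = [run_once(seed) for seed in seeds]
    return statistics.fmean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    mean_error, spread = benchmark(range(10))
    print(f"held-out RMSE: {mean_error:.3f} +/- {spread:.3f}")
```

Because every source of randomness is tied to an explicit seed, rerunning `benchmark` with the same seeds yields the same numbers, which is what makes the reported spread usable as a working error bound when comparing one model iteration against the next.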

If you’re a working data scientist, data engineer, or data application developer, register here to attend the IBM DataFirst Launch Event on Tuesday, September 27 in New York. Engage with open-source community leaders and practitioners and learn how to accelerate your processes for putting data to work in your burgeoning cognitive business.