
KDnuggets News, November 2016 (16:n42)

RCloud – DevOps for Data Science


 

After almost two decades of software development, the term DevOps was coined, giving official recognition to the importance of collaboration between the development and deployment of software systems. At this early stage of the data science field, adopting standardized, empirical practices like DevOps can speed up its evolution.



By Jo Frabetti, AT&T Research.

DevOps, the practice of getting to data analysis insights faster by reducing the time between coding and deployment, has become more relevant with the emerging role of data science teams in large organizations. Effective data science teams must share their findings with each other and the organization at large, be agile enough to embed new features or address additional goals during development, and move results from data wrangling, exploratory data analysis (EDA) and predictive analytics into automated visualizations, diagnostics and reports intended for wider consumption. In the recent past, data wrangling, EDA and predictive analytics were done with one set of tools, while automated visualizations, recommendations and reports were done with another.

This separation often extended to the very systems where the tools were located (e.g., a development environment versus a production environment). Separating the tools and the environments hinders mid-process feedback and development modifications, and by its very nature creates time lags between discovering results and sharing them. In addition, reproducing or modifying a project could become a project in itself if the original development environment no longer existed or the data scientist who created it had left the firm. RCloud is open-source software created at AT&T Labs by Simon Urbanek, Gordon Woodhull and Carlos Scheidegger to solve the collaboration, sharing, scalability and reproducibility issues that arise between data analysis development and deployment.


Fig. 1: Directory of user notebooks; every notebook may be viewed, run, forked (copied), favorited (starred) and shared.

Collaboration
RCloud is browser-based software that is installed on a server or a distributed system (e.g., Hadoop, Cassandra, Spark). It provides a web-based R, Python or shell session, text and equation capabilities (markdown), web page layout and a notebook integrated development environment (IDE) interface. Notebooks are project containers that hold all the components and dependencies of a data analysis, including code, data, code comments, equations, data analysis narrative, visualizations and deployment capabilities, in a manner similar to other systems such as Mathematica, Jupyter and Sage.

RCloud differs from these systems by providing browser-based access to extensive social coding functionality, including notebook search, automated version control and, most importantly, a user directory that provides access to every registered user’s notebooks. Having access to all the notebooks in your cloud (RCloud can be limited to a single organization) means that new users can view or fork (copy) any other user’s notebook, then edit or update the analysis with new data or a different visualization technique. They can also search on any term to learn how other data scientists have applied code, libraries or supporting languages such as CSS or JavaScript. Code and widget libraries are an ancillary benefit that comes with the creation of every RCloud notebook. Rather than reinventing the wheel, teams can reuse these “knowledge assets” on similar projects or use them to train other employees.

In this environment, data science teams benefit from easier sharing of scripts and data feeds, experiments, annotations and automated recommendations, well beyond what traditional individual or locally based development environments provide. In addition, since RCloud is accessed through a web browser, data science teams may work from anywhere with an internet connection. RCloud promotes data science by allowing Data Scientists to easily share ideas and techniques with each other.

Sharing

In RCloud, every notebook is named by a URL, so converting collaborative work into reports and recommendations is accomplished by sharing (e.g., emailing or texting) the notebook URL. RCloud Data Scientists can switch from an executed notebook view (“view.html”), a dashboard display (“shiny.html” or “mini.html”) or a web-service provider (“notebook.R”) to the underlying code by amending the URL in the browser to open the notebook in edit mode (“edit.html”).


Fig. 2: Go from executed notebook (“view.html”) to the underlying code (“edit.html”) by editing the URL in the browser.
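The URL switch described above can be sketched in a few lines of Python. This is only an illustration: the page names come from the article, but the host, the `?notebook=<id>` query layout and the `switch_mode` helper are assumptions made for the example, and “notebook.R” is left out because its URL layout is not described here.

```python
from urllib.parse import urlsplit, urlunsplit

# Page names are taken from the article; the overall URL layout is an assumption.
MODES = {
    "view": "view.html",
    "edit": "edit.html",
    "shiny": "shiny.html",
    "mini": "mini.html",
}


def switch_mode(url: str, mode: str) -> str:
    """Return the same notebook URL with its page swapped for another mode."""
    parts = urlsplit(url)
    # Split the path into everything before the last "/" and the page name.
    head, _, page = parts.path.rpartition("/")
    if page not in MODES.values():
        raise ValueError(f"not a recognised notebook page: {page!r}")
    # Keep scheme, host and query string (the notebook ID) untouched.
    return urlunsplit(parts._replace(path=f"{head}/{MODES[mode]}"))


# Hypothetical host and notebook ID, for illustration only.
shared = "https://rcloud.example.org/view.html?notebook=abc123"
print(switch_mode(shared, "edit"))
```

Because only the page name changes, the link a colleague receives by email already contains everything needed to reach the code behind the result.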

Unlike similar systems, RCloud shared notebooks are not static webpages, but rather code that is being executed “live”. RCloud’s unique “notebook.R” web-service interface means that any notebook asset, such as data, images or code (.R, .py, .css or .js), may be integrated with other web technologies by URL referencing. This gives developers an enormous amount of flexibility to create any type of complex interactive widget, notebook or dashboard (e.g., custom CSS or JavaScript UI widgets). In addition, as a fundamental feature of DevOps, RCloud notebooks may be “published” so that both registered and unregistered users (e.g., non-developers such as business executives and domain experts) may view and interact with an executed (live) notebook.

RCloud’s unique URL sharing functionality not only allows Data Scientists to deliver value faster by reducing the time between coding and deployment, but also makes them agile enough to deliver results at any stage of the analysis in order to obtain feedback from domain experts, validate code and/or leverage other data scientists’ recommendations.

Scalability

RCloud was specifically designed to leverage existing systems and standards, so communication between most parts of the system happens through HTTP. The communication between a web browser and an active R session as a user edits a notebook is performed by a combination of HTTP and WebSockets, and RCloud can open parallel connections to multi-server systems. This means Data Scientists can run big data packages such as iotools (fast data import using chunk loading), H2O and PySpark/SparkR without having to write complex code. AT&T’s Data Scientists are using RCloud for big data applications, including consumer experience visualizations, urban traffic anomaly detection and topic model analysis for domain name data structure discovery.


Fig. 3: RCloud high-level architecture.

RCloud supports efficient, secure client/server connections via the FastRWeb package and a security discipline known as object capabilities (ocap). RCloud uses the Rserve protocol to implement both the ocap methodology and the client-server communication. As a result, web browsers never directly instruct the RCloud backend to execute arbitrary code, which prevents unauthenticated clients from making unauthorized calls to the RCloud runtime environment. RCloud notebooks may also be encrypted for added security; read more about RCloud security features on our Documentation pages. An RCloud session runs on both client and server, and R functions on the server can call JavaScript functions on the client and vice versa. In the notebook product space, authenticated client-server channeling and notebook encryption are unique to RCloud.
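To make the object-capability idea concrete, here is a minimal Python sketch of the general pattern. It is not RCloud's or Rserve's implementation. The `CapabilityTable` class, the `login` function and the `run_cell` name are all hypothetical. The point it illustrates is that the client never sends code or function names, only opaque tokens that the server handed out earlier.

```python
import secrets
from typing import Any, Callable, Dict


class CapabilityTable:
    """Server-side registry mapping unguessable tokens to callables (hypothetical)."""

    def __init__(self) -> None:
        self._caps: Dict[str, Callable[..., Any]] = {}

    def grant(self, func: Callable[..., Any]) -> str:
        """Register a function and return the opaque token that refers to it."""
        token = secrets.token_hex(16)
        self._caps[token] = func
        return token

    def invoke(self, token: str, *args: Any) -> Any:
        """Call the function behind a token; unknown tokens are rejected."""
        try:
            func = self._caps[token]
        except KeyError:
            raise PermissionError("unknown capability") from None
        return func(*args)


table = CapabilityTable()


def login(user: str) -> Dict[str, str]:
    # Assumption: authentication has already succeeded at this point.
    # The session receives tokens only for the operations it may perform.
    return {"run_cell": table.grant(lambda cell_id: f"{user} ran cell {cell_id}")}


caps = login("alice")
print(table.invoke(caps["run_cell"], 3))
```

A client that was never granted a token has nothing valid to send, so a request carrying a guessed or forged token fails with `PermissionError` instead of reaching the runtime.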

Reproducibility

A data analysis in RCloud may be verified and executed (reproduced) by anyone with access to the notebook, without concern for differences between local environments. Local development environment parameters no longer create hurdles for viewing and running analyses, since RCloud is platform independent, and user access and controls remain constant, which promotes user confidence and engagement.

In addition, all of the elements of an RCloud notebook are stored on a GitHub (or Git) server; it is a core design principle of RCloud that no command should be run without being saved, even if it gets deleted later. The GitHub server supports continual and automatic background versioning of notebooks as they are developed and modified. This feature provides the benefits of basic Git version control for casual users and the option to use more sophisticated Git features for advanced users. RCloud also maintains an automatic Git-based trail of code modifications, which documents the development history of an RCloud-based project.
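The “save before run” principle can be illustrated with a small Python sketch. It is a stand-in only: RCloud stores versions on a GitHub or Git server, whereas this example keeps an in-memory list, and the `VersionedNotebook` class and its methods are hypothetical names introduced for illustration.

```python
import hashlib
from typing import Any, Dict, List


class VersionedNotebook:
    """Toy notebook that records every command before executing it (hypothetical)."""

    def __init__(self) -> None:
        # In-memory stand-in for the Git history described in the article.
        self.history: List[Dict[str, str]] = []

    def run(self, source: str) -> Dict[str, Any]:
        # Save first, so the command is on record even if execution fails.
        version = hashlib.sha1(source.encode("utf-8")).hexdigest()[:8]
        self.history.append({"version": version, "source": source})
        # Then execute the command in a fresh namespace and return it.
        namespace: Dict[str, Any] = {}
        exec(source, namespace)
        return namespace


nb = VersionedNotebook()
result = nb.run("x = 1 + 1")
print(result["x"], len(nb.history))
```

Because the snapshot is appended before `exec` runs, a command that raises an error still leaves a trace in the history, which is the property the design principle asks for.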

Interested?

To learn more about RCloud, you can take our public instance for a test drive by creating an account or viewing examples on our Gallery page. If you would like to set up your own private, organization-wide RCloud, it is free, open-source software, so all you have to do is visit our Download page for installation instructions. If you have any questions or would like to request a demo, please email us at info@rcloud.social.

Bio: Jo Frabetti has over ten years of experience in analytics for healthcare, energy, insurance and financial clients using big data technologies. This experience includes developing powerful visualizations from both structured and unstructured data sets, creating prototypes with the available ecosystem to support and communicate strategies, and collaborating with leads and team members to produce high-quality deliverables.
