RCloud – DevOps for Data Science
After almost two decades of software development, the term DevOps was coined, giving formal recognition to the importance of collaboration between the development and deployment of software systems. At this early stage of the data science field, adopting standardized, empirically tested practices like DevOps should speed up its evolution.
By Jo Frabetti, AT&T Research.
Accelerating data analysis insights by reducing the time between coding and deployment, a process now called DevOps, has become more relevant with the emerging role of data science teams in large organizations. Effective data science teams must share their findings with each other and the organization at large, be agile enough to embed new features or address additional goals during development, and move results from data wrangling, exploratory data analyses (EDA) and predictive analytics into automated visualizations, diagnostics and reports intended for wider consumption. In the recent past, data wrangling, EDA and predictive analytics were done with one set of tools, while automated visualizations, recommendations and reports were done with another.
This separation often extended to the very systems where the tools were located (e.g., a development environment versus a production environment). Separating the tools and the environments hinders mid-process feedback and development modifications, and by its very nature creates time lags between discovering results and sharing them. In addition, reproducing or modifying a project could become a project in itself if the original development environment no longer existed or the data scientist who created it had left the firm. RCloud is open-source software created at AT&T Labs by Simon Urbanek, Gordon Woodhull and Carlos Scheidegger to solve the development-to-deployment issues of data analysis: collaboration, sharing, scalability and reproducibility.
Fig. 1: Directory of user notebooks; every notebook may be viewed, run, forked (copied), favorited (starred) and shared.
RCloud is browser-based software which is installed on a server or a distributed system (e.g., Hadoop, Cassandra, Spark) that provides a web-based R, Python or shell session, text and equation capabilities (markdown), web page layout and a notebook integrated development environment (IDE) interface. Notebooks are project containers which include all the components and dependencies of a data analysis, including code, data, code comments, equations, data analysis narrative, visualizations and deployment capabilities in a manner similar to other systems including Mathematica, Jupyter and Sage.
In this environment, data scientist teams benefit from easier sharing of scripts and data feeds, experiments, annotations and automated recommendations which are well beyond what traditional individual or locally based development environments provide. In addition, since RCloud access is provided in a web-browser, data scientist teams may work from anywhere with an internet connection. RCloud promotes data science by allowing Data Scientists to easily share ideas and techniques with each other.
In RCloud, every notebook is named by a URL, so converting collaborative work into reports and recommendations is accomplished by sharing (e.g., emailing or texting) the notebook URL. RCloud Data Scientists can switch from an executed notebook view (“view.html”), a dashboard display (“shiny.html” or “mini.html”) or a web-service provider (“notebook.R”) to the underlying code by amending the URL in the browser to put the notebook in edit mode (“edit.html”).
Fig. 2: Go from executed notebook (“view.html”) to the underlying code (“edit.html”) by editing the URL in the browser.
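The URL switch described above can be sketched in a few lines of Python. This is only an illustration: the host name, the notebook ID and the `?notebook=` query layout are assumptions made for the example, and `switch_page` is a hypothetical helper, not part of RCloud. Only the page names come from the article.

```python
from urllib.parse import urlsplit, urlunsplit

# HTML page names mentioned in the article.
PAGES = {"view.html", "edit.html", "shiny.html", "mini.html"}


def switch_page(url: str, page: str) -> str:
    """Return `url` with its final path segment replaced by `page`.

    Hypothetical helper: it assumes the page name is the last path
    segment and that the notebook is identified in the query string.
    """
    if page not in PAGES:
        raise ValueError(f"unknown page: {page!r}")
    parts = urlsplit(url)
    # Keep everything before the last "/" and swap only the page name.
    head, _, _old_page = parts.path.rpartition("/")
    return urlunsplit(parts._replace(path=f"{head}/{page}"))


# Hypothetical host and notebook ID, for illustration only.
view_url = "https://rcloud.example.org/view.html?notebook=abc123"
edit_url = switch_page(view_url, "edit.html")
print(edit_url)
```

The query string passes through unchanged, which mirrors the point of the figure: the same notebook is reached in a different mode by changing one part of the address.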
RCloud's unique URL sharing functionality not only allows Data Scientists to deliver value faster by reducing the time between coding and deployment, but also makes them agile enough to deliver results at any stage of the analysis in order to obtain feedback from domain experts, validate code and/or leverage other data scientists' recommendations.
RCloud was specifically designed to leverage existing systems and standards, so communication between most parts of the system happens through HTTP. The communication between a web browser and an active R session as a user edits a notebook is performed by a combination of HTTP and WebSockets; RCloud can also open parallel connections to multi-server systems. This means Data Scientists can run big data packages like iotools (fast data import using chunk loading), H2O and PySpark/SparkR, for example, without having to write complex code. AT&T's Data Scientists are using RCloud for big data applications, including consumer experience visualizations, urban traffic anomaly detection and topic model analysis for domain name data structure discovery.
Fig. 3: RCloud high-level architecture.
A data analysis in RCloud may be verified and executed (reproduced) by anyone with access to the notebook, without concern for differences between local environments. Local development environment parameters no longer create hurdles for viewing and running analyses, since RCloud is platform independent and user access and controls remain constant, which promotes user confidence and engagement.
In addition, all of the elements of an RCloud notebook are stored on a GitHub (or Git) server; it is a core design principle of RCloud that no command should be run without being saved, even if it gets deleted later. The GitHub server supports continual and automatic background versioning of notebooks as they are developed and modified. This feature provides the benefits of basic Git version control for casual users and the option to use more sophisticated Git features for advanced users. RCloud also maintains an automatic Git-based trail of code modifications, which documents the development history of an RCloud-based project.
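The "save before run" principle can be illustrated with a toy sketch. This is not RCloud's implementation: the `Notebook` class, its methods and the in-memory list standing in for the Git-backed store are all hypothetical, chosen only to show the ordering of the two steps.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Notebook:
    """Toy notebook; a plain list stands in for the Git-backed store."""

    history: list = field(default_factory=list)  # (version_id, source) pairs

    def save(self, source: str) -> str:
        """Append a new version and return its short ID."""
        parent = self.history[-1][0] if self.history else ""
        # Chain each version to its parent, loosely like a commit hash.
        digest = hashlib.sha1((parent + source).encode("utf-8")).hexdigest()
        version_id = digest[:8]
        self.history.append((version_id, source))
        return version_id

    def run(self, source: str) -> dict:
        """Save the code first, then execute it."""
        self.save(source)
        namespace: dict = {}
        exec(source, namespace)  # acceptable in a toy; never for untrusted code
        return namespace


nb = Notebook()
nb.run("x = 1 + 1")
result = nb.run("x = 6 * 7")
print(result["x"], len(nb.history))
```

Because `save` runs before `exec`, every command that was ever executed remains in `history`, even if a later version of the notebook no longer contains it.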
To learn more about RCloud, you can take our public instance for a test drive by creating an account or viewing examples on our Gallery page. If you would like to set up your own private organization-wide RCloud, RCloud is free / open-source software so all you have to do is visit our Download page for installation instructions. If you have any questions or you would like to request a demo, please email us at firstname.lastname@example.org.
Bio: Jo Frabetti has over ten years of experience in analytics for healthcare, energy, insurance and financial clients, working with big data technologies. That work includes developing powerful visualizations from both structured and unstructured data sets, building prototypes with the available ecosystem to support and communicate strategies, and collaborating with leads and team members to produce high-quality deliverables.