Data Science Toolbox virtual environment

Data Science Toolbox: a new virtual environment for command-line data science, and how it compares with similar environments: Mining the Social Web, Data Science Toolkit, and Data Science Box.

In the post "Lean, mean data science machine", Jeroen Janssens compares data science environments and describes his proposed solution: an environment created and configured using Vagrant, a wrapper around VirtualBox and other virtualization providers such as AWS EC2. With a few commands, a fresh virtual machine is spun up and configured according to a simple script.

Jeroen Janssens writes:

Data scientists love to create interesting models and exciting data visualizations. However, before they get to that point, usually much effort goes into obtaining, scrubbing, and exploring the required data. I argue that the *nix command-line, although invented decades ago, remains a powerful environment for processing data. It provides a read-eval-print loop (REPL) that is often much more convenient for exploratory data analysis than the edit-compile-run-debug cycle associated with large programs and even scripts.

Unfortunately, setting up a workable environment and installing the latest command-line tools can be quite a pain. This post describes how to alleviate that pain and how to get you started doing data science on the command-line in a matter of minutes.

You can install Data Science Toolbox (DST) from
GitHub: jeroenjanssens/data-science-toolbox

This installs R, the Python scientific stack, and many command-line tools for processing data. It uses Vagrant and, for now, can be deployed only on VirtualBox.
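
As a rough illustration of the "few commands" involved, here is a minimal Python sketch of a typical Vagrant workflow. It only prints the commands unless told otherwise. The box name is a hypothetical placeholder and the helper functions are invented for this example; the subcommands `vagrant init`, `vagrant up`, and `vagrant ssh` are standard Vagrant, but the exact steps for DST may differ, so follow the project's own instructions.

```python
import shlex
import subprocess

# Hypothetical box name, used only for illustration; the real name
# should be taken from the project's own instructions.
BOX_NAME = "example/data-science-box"


def vagrant_commands(box_name):
    """Return a typical Vagrant sequence: create a Vagrantfile, boot the VM, log in."""
    return [
        ["vagrant", "init", box_name],  # write a Vagrantfile that references the box
        ["vagrant", "up"],              # fetch the box if needed, then boot and provision the VM
        ["vagrant", "ssh"],             # open a shell inside the running VM
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print("$ " + shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Dry run by default, so nothing is installed or started.
    run(vagrant_commands(BOX_NAME))
```

Calling `run(vagrant_commands(BOX_NAME), dry_run=False)` would execute the commands, assuming Vagrant and VirtualBox are already installed.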

See more detailed instructions at

Other notable environments for data science include

Mining the Social Web (MTSW)
Created by Matthew Russell, @ptwobrussell
GitHub: ptwobrussell/Mining-the-Social-Web-2nd-Edition
Uses Vagrant and can be deployed on both VirtualBox and AWS. Installs IPython Notebook, NumPy, MongoDB, and NLTK, which lets you follow along with the examples in the book. An AWS AMI is available as well.

Data Science Toolkit (DSTK)
Created by Pete Warden, @petewarden
GitHub: petewarden/dstk
The website provides a sandbox from which you can try out many interesting APIs. These APIs can also be accessed from the command line. An AWS AMI is available.

Data Science Box (DSB)
Created by Drew Conway, @drewconway
GitHub: drewconway/data_science_box
This is a bash script that requires a running AWS EC2 instance. It installs R, Shiny, IPython Notebook, and the Python scientific stack.

Here is Jeroen Janssens' comparison of data science environments:

Data Science Environments comparison
