Big Data Revolution Analytics Interview
"A data scientist will often combine transactional data from a NoSQL system, demographic data from an RDBMS, unstructured data from Hadoop, and social data from a streaming API" - ODBMS Editor Roberto Zicari interviews David Smith, VP at Revolution Analytics, on Big Data, Analytics, R, Data Scientists, and more.
ODBMS blog, Feb 27, 2013, Roberto V. Zicari.
RVZ: Q1. How would you define the job of a data scientist?
David Smith: A data scientist is someone charged with analyzing and communicating insight from data.
It's someone with a combination of skills: computer science, to be able to access and manipulate the data; statistical modeling, to be able to make predictions from the data; and domain expertise, to be able to understand and answer the question being asked.
... Q3. R is an open source programming language for statistical analysis. Is R useful for Big Data as well? Can you analyze petabytes of data with R, and at the same time ensure scalability and performance?
David Smith: Petabytes? That's a heck of a lot of data: even Facebook has "only" 70 PB of data, total. The important thing to remember is that "Big Data" means different things in different contexts: while raw data in Hadoop may be measured in petabytes, by the time a data scientist selects, filters and processes it, you're more likely to be in the terabyte or even gigabyte range when the data's ready to be applied to predictive models.
Open Source R, with its in-memory, single-threaded engine, will still struggle even at this scale, though. That's why Revolution Analytics added scalable, parallelized algorithms to R, making predictive modeling on terabytes of data possible. With Revolution R Enterprise, you can use SMP servers or MPP grids to fit powerful predictive models to hundreds of millions of rows of data in just minutes.
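To make the idea of a scalable, parallelizable model fit concrete, here is a minimal sketch in Python (chosen only for illustration; the interview is about R). It is not Revolution R Enterprise's implementation, which the interview does not describe. It assumes a simple one-predictor linear regression and uses hypothetical names (`Moments`, `summarize_chunk`, `fit`). Each chunk of rows is reduced to a handful of sums, so no chunk needs to stay in memory, and because the summaries merge by addition they could be computed on separate cores or nodes.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Moments:
    """Sufficient statistics for fitting y = intercept + slope * x."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def merge(self, other: "Moments") -> "Moments":
        # Summaries combine by plain addition, so chunk order does not matter.
        return Moments(
            self.n + other.n,
            self.sx + other.sx,
            self.sy + other.sy,
            self.sxx + other.sxx,
            self.sxy + other.sxy,
        )


def summarize_chunk(chunk: Iterable[Tuple[float, float]]) -> Moments:
    """Reduce one chunk of (x, y) rows to its sums in a single pass."""
    m = Moments()
    for x, y in chunk:
        m.n += 1
        m.sx += x
        m.sy += y
        m.sxx += x * x
        m.sxy += x * y
    return m


def fit(chunks: Iterable[Iterable[Tuple[float, float]]]) -> Tuple[float, float]:
    """Return (intercept, slope) of the least-squares line over all chunks."""
    total = Moments()
    for chunk in chunks:
        # In a parallel setting, summarize_chunk would run on each worker
        # and only the small Moments objects would be sent back and merged.
        total = total.merge(summarize_chunk(chunk))
    denom = total.n * total.sxx - total.sx ** 2
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (total.n * total.sxy - total.sx * total.sy) / denom
    intercept = (total.sy - slope * total.sx) / total.n
    return intercept, slope
```

Calling `fit([[(0, 1), (1, 3)], [(2, 5), (3, 7)]])` recovers the line y = 1 + 2x from two chunks. The same pattern (summarize each chunk, then merge) extends to multiple predictors by accumulating cross-product matrices instead of scalar sums.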
... Q7. In your opinion, is there a technology that is best suited to building an analytics platform? An RDBMS, or instead a non-relational database technology, such as a columnar database engine? Something else?
David Smith: The data you're likely to need for any real-world predictive model today is unlikely to be sitting in any one data management system. A data scientist will often combine transactional data from a NoSQL system, demographic data from an RDBMS, unstructured data from Hadoop, and social data from a streaming API. That's one of the reasons the R language is so powerful: it provides interfaces to a variety of data storage and processing systems, instead of being dependent on any one technology.
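The kind of combination described in this answer can be sketched as follows, in Python for illustration only. Everything here is a stand-in and an assumption: a list of dicts plays the NoSQL document store, an in-memory SQLite database plays the RDBMS, tab-separated text lines play output from a Hadoop job, and an iterator of dicts plays the streaming API. The function name `build_features`, the table name `demographics`, and all field names are hypothetical.

```python
import sqlite3
from collections import defaultdict
from typing import Dict, Iterable


def build_features(
    transactions: Iterable[dict],
    db: sqlite3.Connection,
    log_lines: Iterable[str],
    social_events: Iterable[dict],
) -> Dict[str, dict]:
    """Merge four sources into one feature row per customer_id."""
    features = defaultdict(
        lambda: {"total_spend": 0.0, "age": None,
                 "support_mentions": 0, "social_posts": 0}
    )

    # Transactional data: documents from a NoSQL-style store.
    for doc in transactions:
        features[doc["customer_id"]]["total_spend"] += doc["amount"]

    # Demographic data: rows from a relational table.
    for customer_id, age in db.execute(
        "SELECT customer_id, age FROM demographics"
    ):
        features[customer_id]["age"] = age

    # Unstructured data: "customer_id<TAB>free text" lines.
    for line in log_lines:
        customer_id, text = line.rstrip("\n").split("\t", 1)
        if "support" in text.lower():
            features[customer_id]["support_mentions"] += 1

    # Social data: events consumed one at a time from a stream.
    for event in social_events:
        features[event["customer_id"]]["social_posts"] += 1

    return dict(features)
```

The result is a single table keyed by customer, which is the shape a predictive model typically expects. In practice each loop would be replaced by the client library for the actual system involved.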