Book Review: Data Just Right

An introduction to technology and software at play in the current quest to define the Big Data Analytics computing paradigm, the book Data Just Right is reviewed in detail here.

Data Just Right
By Ajay Ohri, Apr 3, 2014.

Data Just Right: Introduction to Large-Scale Data & Analytics
by Michael Manoochehri (Addison-Wesley, Dec 2013), is a useful book for people who feel overwhelmed by the comparatively recent rise of the buzzwords "Big Data" and "Data Science". It is supremely readable for both technical and non-technical readers. The more technically minded among you may feel disappointed by the superficial glance given to the examples, but the area covered is broad. More importantly, the book aims to define an evolving technology landscape, including a history of the computer science paradigms that have led to the current state of the Data Universe.

The book would thus be useful even to seasoned CTOs and CIOs who find the current rapid changes in Big Data and deluge of information a bit confusing, and it would certainly be useful for aspiring data scientists wanting to know more than just the code and software ruling this era.

Chapter 1 deals with Big Data pipelines, their historical evolution, and some ground rules for the trade-offs involved in choosing the components of such a pipeline. Chapter 2 deals with the storage and sharing of Big Data. The comparison between XML, JSON and CSV is particularly lucid in its simplicity and appeal. It then moves to the latest data serialization software like Apache Avro, Thrift and Protocol Buffers.

Occasionally the difference between simple examples and high-level overviews of the latest software in Big Data can be a bit jarring. If the book suffers from a defect, even a mild one, it would be that the scale and breadth of its ambition can dilute the focus for a technical reader. Perhaps this could have been addressed by richer examples of all the software mentioned, or even by enhancing the appendix.

Chapter 3 deals with building a NoSQL web application for crowdsourcing data. The chapter name itself points to a huge ambition in explaining the breadth of topics. It manages to do so with adequate skill and dexterous writing, and I particularly enjoyed the conceptual review of normalization, SQL, ACID and the CAP Theorem. The chapter also covers key-value and document stores conceptually, in application, and through lucid examples. It then explains Redis and Twemproxy for sharding Redis (though one wonders if an example of sharding MySQL might have been useful too in the earlier chapter).

Chapter 4 deals with the traditional view and modern approach on the phenomenon known as Data Silos. There are no examples here and it reads more like a cross between a history and philosophy of data systems. Perhaps the author assumes that examples of ETL are not relevant or may be distracting or cumbersome.

Chapter 5 is where the book moves into a higher gear by addressing Hadoop, Hive and the relatively newer Shark. There are no examples of MapReduce, perhaps because there are too many examples of MapReduce already. Fortunately the book returns to its earlier excellent practice of lucid examples by giving us a showcase of a Hive query. Spark is again given a rather short treatment with a page of explanation and no examples.

Chapter 6 deals with building a data dashboard using Google BigQuery. It talks of the evolution of OLTP and a bit on OLAP, Dremel and its difference from MapReduce. The historic evolution is useful for the reader, and it is adequately supplemented by an example of Google BigQuery using its customized SQL. The part on API authentication and access is particularly delicious for the reader seeking a wider overview. One drawback I felt in the chapter was a lack of images and flowcharts and a reliance on verbose text to explain things.

Chapter 7 on data visualization starts with a promising quote from Tufte (all chapters are preceded by quotes, a practice that makes them enjoyable reading, though one sometimes wonders about their context and suitability). It disappoints in giving us the historical cholera epidemic and Napoleon's march visualizations without any recent data visualizations to match them. R and ggplot2 are dispatched in a page or two, rapidly followed by Python's matplotlib. The chapter ends with a rather broad and verbose explanation of both D3.js and JavaScript code, though it disappoints with no color graphs at all. A chapter on data visualization should contain more data visualizations, you might think, or at least a wee bit more in the way of links and references than a single quote from Edward Tufte.

Chapter 8 deals with MapReduce data pipelines and Chapter 9 with Pig data transformation workflows. There are enough examples here to give a flavor, but the topics are hardly covered extensively enough for any actionable follow-through. Part 5 consists of a single chapter on Machine Learning using Mahout, which is clearly a shame given the plethora of approaches aside from Mahout for dealing with it. The author's apparently computer-scientist-like enthusiasm for machine learning bubbles over, leaving much less room for other Mahout-specific things. The lack of other chapters can be considered a major drawback of this part.

Chapter 11 on using R for large data sets ignores the Revolution Analytics package RevoScaleR, a major omission. It does, however, name all the other packages, with a flavor of examples of each that are just right to tease the interest of a reader who does not yet know R well (disclaimer: this reviewer wrote a book on R). I do wish that much less page space had been wasted on the oft-repeated "data science is sexy" theme and on the claim that R won't be sufficient without RAM, especially in these days of cloud instances. The choice between entertaining the reader, being readable for the not-so-technical audience, and giving actual examples that are actionable seems a bit skewed here.

The author may be biased towards Python, though, as he clearly elucidates it in Chapter 12 with much more gusto (including a heading called "The Snakes Are Loose In The Data Zoo"), just as the reviewer may be sentimental about R. However, the author correctly identifies the ongoing and interesting contest between these two software camps for Big Data statistical supremacy, and the comparison is one of the very few and indeed very valuable comparisons between R and Python within the pages of the same book. I only wish there were more technical examples in it, and less of the New York Times style verbose pomp. The cursory and superficial treatment of NumPy, SciPy, Pandas and IPython can only leave the technical audience wanting more.

Chapter 13 moves squarely into managerial territory, with metered cloud computing and cost-benefit-measured open source strategies being the recommended go-to options. Chapter 14 is an exercise in futurism with crystal ball gazing, an exercise that dilutes some of the rigor of an otherwise appetizing book.

Indeed that is the shame of the book: it had great to legendary potential, but gets lost in "just another good book to have" territory. However, we recommend it for the reader struggling to cope with the Big changes in Big Data, and can only appeal to the writer to add a bit more meat (technical examples) and a bit less sauce (NYT style tech prose) in the next technology-flavored book or edition he chooses to author. Hopefully some of these slight imperfections in an otherwise beautifully thought-out and authored book will be addressed in its accompanying blog.

Read it as an appetizer for the next round of technology platforms fighting to catch your attention in the next generation of computing, and you won't be disappointed.