STRATA + Hadoop World 2014 NYC Report

Strata + Hadoop World this year included workshops on subjects like Spark, R, and Python, interesting keynotes, and impressively detailed technical talks on Hadoop and new trends in big data.

By Sheamus McGovern, Nov 2014 (special report for KDnuggets).

Making the annual trip to Strata + Hadoop World can be an expensive undertaking. For many, however, the trip is a necessary part of belonging to the big data scene and immersing oneself in the latest data trends.

Strata Conference + Hadoop World,
Tools and Techniques that make data Work,
Oct 15-17, 2014.
New York, NY, USA

This pilgrimage has gone global, with Strata conferences now in New York, San Jose, London, and Barcelona. By the numbers, the NYC Strata was certainly an impressive event, with 5,500 attendees and around 135 exhibitors at the massive Javits conference center. The Thursday talk series alone consisted of ten keynotes and sixty-seven talks--eighteen of which were sponsored. Friday had similar numbers. Thus the three-day data extravaganza is a feast for the mind for any data scientist or data professional!

The event kicked off Wednesday with tutorial sessions, which continued to be a big draw, with over a dozen on offer and many whole-day workshops. Sessions included a full-day Spark camp, which was introduced this year and reflects the momentum Spark has gained over the last year or so. R, Python, and D3.js continued to be strongly represented.

However, it was Thursday when things really got into high gear, with about ten keynotes stretched over an hour and a half. The keynote format kept things interesting by restricting each talk to fifteen minutes and ensuring an impressive lineup of speakers. That’s about enough time to deliver a meaningful takeaway. Keynotes are a bit like icing on a cake: not essential, but they leave a pleasant taste in your mouth, especially when the speaker speaks to your inner data scientist.

You can review the keynotes for yourself at

Once the keynotes were complete, it was on to what for many was the main event: the talks themselves. This posed quite a dilemma. For any given hour, attendees had eleven different talks to choose from. Quite a few talks were by vendors, although the conference app did give you the option of filtering these out. Thus one had to be quite judicious in choosing which talk to attend.

Obviously, many talks had a Hadoop and big data focus, but others were focused on the less technical side, such as “The Great Debate: If You Can’t Code You Can’t Be a Data Scientist”. Interesting as that talk may have been, I decided to skip it, since I’ve listened to that debate many times before and consider myself a decent coder, so I already know the answer to that one ;)

Instead, I sank my teeth into a presentation by Goldman Sachs & Co entitled “How Goldman Sachs is Using Knowledge to Create an Information Edge”. The talk was presented by Peter Fems, who works in GS compliance. Compliance plays a big role in any financial institution, especially when it comes to identifying potential conflicts of interest. Peter gave a thorough overview of the GS architecture and how they use graph databases to show relationships, for example, between a trader, a research analyst, and a banker working on a deal. It was a compelling use case for the nodes and weighted relationships that graph databases were built to model. The key takeaway was that one can have many raw outputs for analysis, but sometimes you just need to see where those paths take you.

The next talk I attended had the enticing title of “Big Data Architectural Patterns”. The title was about the best part of the talk. Going to a vendor talk is a bit like a recently converted vegetarian going to a steakhouse. It may look good, but if you are not using that platform then chances are you are not going to touch it! This talk was sponsored by Splunk, the makers of the powerful analytics platform. Splunk is an impressive platform, especially with its offering of Hunk, which is basically Splunk analytics for Hadoop. Since I have not used the platform myself, the talk was only somewhat interesting, but it did make me realize I needed to focus on non-vendor sessions.

Having learned that lesson, I next attended Greg Rahn’s talk. He gave an insightful comparison of open source SQL-on-Hadoop engines. He reviewed various benchmarks, including the "Big Data Benchmark".

He made the observation that one should distinguish between benchmarking and "benchmarketing" and entertainingly quoted what he referred to as Gregorio's Benchmarking Theorem:
"Given any benchmarketing claim C, there exists at least one workload W or at least one query Q that will prove claim C correct".
That was one of the most observant quotes I’d heard in a while!

The exhibitor space could only be described as impressive, with 135 exhibitor booths of various sizes.  Truth be told, I’m usually more interested in the talks than conversing with vendors.  However, the better booths are usually staffed with experts in addition to the usual sales and marketing teams.  I had an illuminating conversation at the MongoDB booth with Edouard Servan-Schreiber, their Director for Solution Architecture, regarding MongoDB performance.

At the MapR booth I had the chance to speak with the always impressive Ted Dunning, who is their Chief Application Architect. We chatted a little about his new book, Time Series Databases, which he co-authored with Ellen Friedman. Apart from MapR, Ted is also involved with such great projects as Apache Mahout, Drill, and ZooKeeper, and is also a mentor for the Storm and Spark projects.

Unfortunately, work commitments brought me back to Boston early on Friday. In retrospect, it was well worth the trip, and I hope to make it back next year.

Sheamus McGovern is the CTO and founder of Startup Code Works, a company that specializes in building online platforms for startups. He is also an organizer of Boston Analytics Meetups and Boston Datafest. He has many years of experience building complex software platforms, particularly in quantitative finance and business intelligence.