Exclusive: Interview with Tom H. C. Anderson, the leader in Big Data, Market Analytics, and Text Analytics
KDnuggets talks with Tom H. C. Anderson, a pioneer in text mining, founder of the NGMR Market Research group, a leader in Big Data and in social media, an award-winning blogger, and a very cool guy. Part 1 of the interview.
By Gregory Piatetsky, May 20, 2013.
I recently had the pleasure of interviewing Tom H. C. Anderson, whom I have known for quite a few years. Besides his many qualities listed above, Tom is also a very eloquent writer, so enjoy the interview.
Short Bio: Tom H. C. Anderson is the managing partner of Anderson Analytics, the first market research firm to leverage text analytics. A true pioneer in the text mining field, his firm's OdinText software was recently named a key challenger in the text analytics space and has received much praise from Fortune 500 consumer insights and customer loyalty professionals. Tom is an award-winning blogger and frequent speaker on data and text mining. He was named one of the industry's "Four under 40" market research leaders by the American Marketing Association (2010). Tom also manages NGMR, the most active Market Research group on LinkedIn, and has almost 50,000 followers on Twitter for @tomhcanderson.
Gregory Piatetsky: 1. Congratulations on your #1 rank in Big Data 100 Most Influential list by Big Data Republic. What does Big Data mean to you?
Tom H. C. Anderson: Thank you Gregory, it was quite a pleasant surprise. That is a great question because the term "Big Data" really is relative.
Depending on how you look at it, for me the term "Big Data" may be a bit smaller than average because at Anderson Analytics we deal with so much unstructured (text) data, and this type of data tends to increase in size when you start working with it.
So, depending on what we're doing, a data file that may have been considered big by a client because it had a lot of text data becomes an order of magnitude more complex once it's processed by one of our OdinText servers. This size is one of the reasons analysts who try to do text analysis on desktop applications face so many performance problems.
Having said that, when I think about data size, I'm usually thinking about how much data is necessary before text analytics becomes applicable (a good ROI). We sometimes get requests about working with data from a single focus group. I have to explain that text analytics or text mining, which I believe is really a better term, is really about finding patterns in data. If you don't have a sufficient amount of data, you're not going to have very many patterns to find.
That's not to say you need "Big Data" for text analytics; you don't, but here's a case where bigger usually is better. While you can certainly apply some text analytics approaches to what even just one person says, unless that person is extremely important, it's generally not the best use of resources.
GP: 2. How much hype and reality do you see in "Big Data" - the buzzword and the trend?
THCA: LOTS. A great deal of it is surrounding "social media monitoring", which is basically just tweets. I'm not going to get into the meaning or representativeness of Twitter to the general population here, as that's quite a long discussion in itself (though we've done quite a bit of research on this). I also don't think we have time to discuss the value of that data as a viable signal, as it really depends on the specific domain. However, even from a basic game theory perspective, anyone should see that the current state of affairs is unsustainable.
Anderson Analytics has done a lot of work with social media analytics, even before it was known as "social media"; we were the first to mine large-scale discussion boards, before Twitter, Facebook or LinkedIn existed. While we have subsequently worked with both LinkedIn and Facebook, I have to admit we do comparatively less with Twitter data (though a few of our clients are using it in OdinText).
But I do understand the attraction of social media data to developers. One of the nicest things about 'social media' (aka mainly Twitter and blogs) is that it's just one single large data set. It's so easy to customize your software for one single data set that never changes. That also means that, in the end, there technically is really only room for one player here. Whoever is best (actually, you don't even need to be best, you just need to execute fastest) will win. I think this is partly what we saw going on with Radian6 and is the only reason they were purchased for $360M. In that case, it wasn't that they were best, but they had been executing faster than anyone else.
Obviously though, it didn't end with Radian6. There are still more competitors in the social media monitoring space than I can keep track of and more entering.
It may be that this data is just too simple (140 characters) and that there will be many providers who can do it perfectly. I think OdinText does an excellent job with Twitter for those who want to use it that way. However, what we have with Twitter is a commodity situation with questionable value, where low price will end up being the main component, and that's why it's a bubble.
I, for one, want to compete in an area where we have expertise and can add real unique value, whether it's Big, Mid or Small data.
GP: 3. You wear many different hats - the head of Anderson Analytics, a market researcher, a text analytics expert, OdinText developer, an active blogger, a social media leader, and chairman of the Foundation for Transparency in Offshoring (FTO). How do you combine these activities and which are more important to you?
THCA: Text analytics has always been and will continue to be most important to me; everything else is a hobby. Anderson Analytics used to offer many other types of digital research, but OdinText analytics software development and support for our clients is now our one mission.
However, I enjoy sharing and giving back to the market research community when possible. The NGMR group on LinkedIn and the FTO are a means of doing that.
GP: 4. Can you explain what "Market Research" is to KDnuggets readers?
THCA: Ironically, I think the market research industry is struggling a bit itself with a definition of what it is right now. Traditionally, it has meant any research carried out to help with the 4 P's of Marketing (Price, Product, Place and Promotion). Usually it has meant surveys and focus groups, but also syndicated data, from grocery store sales scanners to TV viewing monitoring. Since the early days, when the focus was on just collecting the data, it has expanded to online ad tracking, eye tracking, video ethnography, use of EKG, CAT and MRI scans, you name it. The biggest change is that the focus has shifted away from collecting data to the value of analysis, and I think that's great.
The problem now is that, while market researchers have become very good at understanding and leveraging various marketing-related data for analysis, we've started losing control of the data that we owned. So now "IT" or "BI" etc. have access to some of the larger data because they have the engineering skills needed to unlock this data, and most market researchers do not. Conversely, the IT/BI folks often do not have the 'business ~ market ~ consumer' knowledge, nor, in many cases, the statistical understanding needed to make best use of this data that they store and have easier access to.
This gap is the main issue in my mind. It's why I saw a need to embrace what I called Next Gen Market Research (NGMR) back in 2005. What I originally meant by the term "NGMR" is probably closest to what outside of market research has come to be called "Data Science". In other words, it requires an understanding of the technical/engineering side (access) and of statistical analysis (methodology), as well as experience in solving business problems and communicating this work effectively.
Here is the second part of the interview.