KDnuggets : News : 2009 : n07 : item5


From: Gregory Piatetsky-Shapiro
Subject: Interview with Seth Grimes: part 1, Text Analytics

It is my pleasure to interview Seth Grimes, who has agreed to write a monthly column for KDnuggets on Text Analytics.

Seth Grimes is an analytics strategy consultant, a recognized expert on business intelligence and text analytics. He is contributing editor at Intelligent Enterprise magazine, founding chair of the Text Analytics Summit, Data Warehousing Institute (TDWI) instructor, and text analytics channel expert at the Business Intelligence Network. Seth founded Washington DC-based Alta Plana Corporation in 1997. He consults, writes, and speaks on information-systems strategy, data management and analysis systems, industry trends, and emerging analytical technologies.

Gregory Piatetsky-Shapiro: What is text analytics?

Seth Grimes: For me, functionally, text analytics is the same as text mining: applying statistical, linguistic, and machine-learning methods to extract information from text, improve search and information retrieval, and automate document processing. With text analytics, there is a sense that you may want to integrate these functions with or into BI and line-of-business systems. Contrast that with (text) data mining, where the workbench or programming interface is the only way to go.

GPS: Are there other ways text analytics differs from information retrieval and from text mining?

SG: Focusing on applications, "text mining" is perhaps used more often in scientific and technical contexts and "text analytics," a newer term, is found more often in business contexts. I also wonder if people don't think primarily of statistical and machine-learning approaches, extended from the data mining world, when they picture text mining. When you bring in computational linguistics for, say, building better indexes to support conceptual or semantic search and information retrieval (IR), that's not considered text mining.

IR does differ from text mining/analytics. IR is about bringing back documents, where text analytics additionally aims to automate their processing and make sense of their contents. Automated processing: that could involve routing e-mail to the right customer-service rep, it could involve identifying legally discoverable documents, it could involve tagging and selectively forwarding news articles. And sense making involves information extraction -- pulling named and pattern-based entities (e.g., phone and Social Security numbers), topics, concepts, facts, and attitudinal information from text -- and diverse analytical steps such as clustering and classification, data integration, and visualization.

Alright, that was a dense response.

Obviously, text analytics also operates not only on retrieved documents but also on text that comes to you, for instance, e-mail, stuff you get via RSS or Atom feeds, text in databases and enterprise systems, and so on. Similarly, IR covers all information sources and not just textual documents.
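The pattern-based entity extraction mentioned in this answer (phone and Social Security numbers) can be sketched with regular expressions. This is only a minimal illustration, not a method described in the interview: the patterns assume simple US-style formats, and the names PATTERNS and extract_entities, along with the sample sentence, are hypothetical.

```python
import re

# Illustrative patterns only (assumption): simple US-style formats,
# far looser than what a production recognizer would need.
PATTERNS = {
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def extract_entities(text):
    """Return (entity_type, matched_text) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    found.sort()  # order by position in the text
    return [(label, value) for _, label, value in found]


# Made-up sample text for demonstration.
sample = "Call 202-555-0143 about the claim filed under SSN 078-05-1120."
print(extract_entities(sample))
```

Named entities, topics, and attitudinal information generally need linguistic or statistical models rather than fixed patterns, so this sketch covers only the simplest case in the list.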

GPS: How did you become interested in text analytics?

SG: In '96-'97, I had a gig with a Web developer as the database guy and technical director. We used Illustra, Michael Stonebraker's Postgres commercialization, which provided character large objects and a couple of text-search options. Illustra let us manage Web-page templates in-database and query text and conventional fields together, so I got a taste of a certain style of unified analysis.

I've done a lot of work with governmental statistics. Coming off a long contract at the Census Bureau -- I designed the 2000 Census analysis system, which the Bureau used to produce hundreds of billions of statistical tables -- I saw text as an area with huge potential. This was back in 2002-3. I started writing on the text-analytics topic. Text analytics resonates with a lot of people so I've focused on it more and more.

My involvement has been really rewarding. I feel I've been able to contribute to making benefits visible to a broad range of potential users, which has accelerated business uptake.

GPS: Why is text analytics not yet widely adopted - is it just a matter of time?

SG: It's perhaps more widely used than you'd think. Consider that Google, Yahoo!, and Live search respond to "43+99", "map philadelphia", and "ORCL" appropriately, as requests for arithmetic help, a map, and Oracle share-price information, rather than as requests for a long list of URLs. That is, they all recognize named entities and query patterns and infer user intent. That's basic text analytics.
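To illustrate the kind of recognition described here, the following is a rough sketch of rule-based query-intent detection for the three example queries. It makes no claim about how any search engine actually works: the rules, the KNOWN_TICKERS table, and the classify_query function are all hypothetical.

```python
import re

# Hypothetical ticker table (assumption); a real system would consult
# a full securities listing.
KNOWN_TICKERS = {"ORCL": "Oracle", "IBM": "IBM"}

# Matches simple expressions such as "43+99" or "3.5 * 2 - 1".
ARITHMETIC = re.compile(r"^\d+(\.\d+)?(\s*[-+*/]\s*\d+(\.\d+)?)+$")


def classify_query(query):
    """Guess the user's intent from the shape of the query string."""
    q = query.strip()
    if ARITHMETIC.match(q):
        return "arithmetic"
    if q.lower().startswith("map "):
        return "map"
    if q in KNOWN_TICKERS:
        return "stock_quote"
    return "web_search"  # fall back to an ordinary list of results


for example in ["43+99", "map philadelphia", "ORCL", "text analytics"]:
    print(example, "->", classify_query(example))
```

The fallback branch matters as much as the rules: anything that does not match a recognized entity or pattern is still answered as a conventional search.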

I estimate a $350 million 2008 market for text analytics software and vendor support and professional services, up 40% from 2007. Three of the five biggest BI vendors, IBM, SAP, and SAS, have serious text-analytics capabilities, albeit not yet widely deployed in their product lines. Applications such as Voice of the Customer, used for media monitoring and brand and reputation management, are getting a huge amount of attention. So (even) wider adoption is just a matter of time.

(Interview part 2: Text Analytics Future)

Copyright © 2009 KDnuggets.