KDnuggets : News : 2009 : n07 : item6


From: Gregory Piatetsky-Shapiro
Subject: Interview with Seth Grimes, part 2: Text Analytics Future

Gregory Piatetsky-Shapiro: Where is text analytics most used right now? Where do you see the biggest potential?

Seth Grimes: The greatest potential for text analytics is in enabling a computer to pass the Turing Test. Text analytics will decode whatever the tester sends to the computer, it will mine a corpus for response material, and it will support generation of a credible and convincing response. It will do all this throughout a contextualized, conversational exchange complete with noise, external references and anaphora, and multiple topics and voices. That is, text analytics has the potential to enable a computer to understand and talk to people.

Alright, that's visionary stuff. The current hot topics are long-standing applications to life sciences, for instance pharmaceutical drug discovery, and intelligence, and newer applications for functions that include customer support; marketing; media and publishing; insurance, risk, and fraud; search enrichment, etc. As I said, I see a $350 million 2008 market, and that figure does not consider the value created by university and industrial research, systems integrators, and custom development, nor the value of the products and capabilities enabled by text analytics.

GPS: What is the relationship between Text Analytics and the Semantic Web and XML?

SG: The Semantic Web is a concept that is being implemented with technologies that include XML, notably RDF (Resource Description Framework) and OWL (Web Ontology Language). But the Semantic Web is about a lot more than those encoding technologies and XML is of course used for many applications, most of which have nothing to do with the Semantic Web.

But it's funny you should ask about the Semantic Web, or perhaps you did so knowing that I recently published a blog article entitled Semantic Web Snake Oil. I got a lot of fire for that one but I stick to my main point, that intentional publication of semantic mark-up is largely not happening -- most of the strongest proponents still aren't doing it -- which is a telling indicator. I need to post a follow-up clarifying my views and looking at Linked Data, a worthy effort but one that I believe will have modest adoption and utility.

The answer to information findability issues isn't an intentional and, I believe, rigid and difficult approach like the Semantic Web's. It's analytics, which a) can make sense of information in whatever form it comes in and b) does not pre-judge the use that the information will be put to.

GPS: I understand that you like Python. Is it better than Perl and other scripting languages for text analysis?

SG: Perl, historically, was obscure. Python code is clear(er). Perl is great for pattern matching via regular expressions, but so is Python. I do know that the Natural Language Toolkit for Python, nltk, has a great reputation -- I've played with it only a bit -- and that I find Python powerful and easier to program.
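
[Editor's illustration] The kind of regular-expression pattern matching mentioned here might look roughly like the following in Python. This is only a minimal sketch using the standard-library re module; the sample text, the two patterns, and the helper names word_counts and find_emails are hypothetical and do not come from the interview.

```python
import re
from collections import Counter

# Hypothetical sample text, invented for illustration only.
SAMPLE = (
    "Text analytics turns text into data. "
    "Contact: info@example.com, sales@example.org."
)

# A deliberately simple word pattern: runs of letters and apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")

# A rough e-mail pattern; real-world address matching is more involved.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def word_counts(text):
    """Count case-folded word occurrences found by WORD_RE."""
    return Counter(word.lower() for word in WORD_RE.findall(text))


def find_emails(text):
    """Return all substrings that look like e-mail addresses."""
    return EMAIL_RE.findall(text)


if __name__ == "__main__":
    print(word_counts(SAMPLE).most_common(3))
    print(find_emails(SAMPLE))
```

A toolkit such as nltk would typically replace the naive word pattern with a proper tokenizer; the standard library is used here only to keep the sketch self-contained.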

GPS: Any advice for people who want to enter the text analytics area?

SG: Same as for any computing arena: Just do it. There are a couple of open-source options out there: GATE if you prefer a linguistic approach and RapidMiner if you prefer statistics.

Professionally, if you have a steady analytics job, look for ways you can extend your analyses to textual sources and then for appropriate tools and then, again, just give it a shot. It shouldn't be hard for most data miners to extend their work in this fashion.

Text analytics is on a fast growth curve with interesting challenges and clear business benefits, however you define "business." Now's a good time to get started.

(part 3 of interview with Seth Grimes: Beyond Text Analytics)

Copyright © 2009 KDnuggets.