KDnuggets News 06:23, item 1, Features


Features

From: Gregory Piatetsky-Shapiro
Date: 29 Nov 2006
Subject: Inaugural KDD Webcast on Web Content Mining - recording, Q&A

SIGKDD (www.kdd.org) is the premier association for Knowledge Discovery and Data Mining, and part of its mission is to support data mining education.

As Chair of SIGKDD, I am pleased to report that the Inaugural SIGKDD Webcast was very successful.

Here is the recording of the Web Content Mining webcast, presented by Bing Liu on Nov 29, 2006.

A total of 113 people from 24 countries attended this free webcast. The breakdown by country is: USA (54), Canada (8), Germany (7), Spain (5), Brazil (4), India (4), Israel (4), Italy (3), Taiwan (3), UK (3), Ecuador (2), Netherlands (2), Portugal (2), Turkey (2), and one each from China, Colombia, Czech Republic, France, Japan, Peru, Poland, Slovakia, Switzerland, and Tunisia.

Additional information about this webcast is at www.cs.uic.edu/~liub/WCM-Refs.html.

Here are questions asked during the webcast and answers provided by Bing Liu.

Q1: Are there any programming frameworks to make Web content mining easier to do?

Answer: Although many Web content mining problems have the same framework of extraction and integration, the current techniques for dealing with them are very different. One does not deal with structured data in the same way as unstructured text.

I am not aware of any common programming framework for Web content mining, or even for each specific task. Our research work was done mainly in C and C++. However, for structured data extraction, there are tools on the market that either help you extract data or make it easy for you to write rules to extract data.

For opinion mining, there are natural language processing packages that are helpful, e.g., part-of-speech taggers and parsers.

Q2: What are the main statistical algorithms used for the techniques you described?

Answer: For data extraction, one can use machine learning methods (if you consider them as statistical methods). For example, the wrapper induction technique that we discussed in the talk is based on supervised learning, specifically, rule learning using the sequential covering strategy. Note that although the general principle is similar, the actual algorithm for extraction rule induction is quite different from rule learning in normal machine learning.
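As a rough illustration of the sequential covering strategy mentioned above, here is a minimal Python sketch on toy attribute-value data. It learns one rule that covers only positive examples, removes the positives that rule covers, and repeats until none remain. This is generic rule learning, not the extraction-rule induction used in wrapper induction, which (as noted) is quite different. The toy data, the function names, and the greedy precision heuristic are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Example = Tuple[Dict[str, str], bool]  # (attribute values, is_positive)
Rule = Dict[str, str]                  # conjunction of attribute == value tests


def covers(rule: Rule, attrs: Dict[str, str]) -> bool:
    """A rule covers an example if every test in the rule holds."""
    return all(attrs.get(a) == v for a, v in rule.items())


def learn_one_rule(examples: List[Example]) -> Rule:
    """Greedily add the most precise test until no negatives are covered."""
    rule: Rule = {}
    covered = list(examples)
    while any(not label for _, label in covered):
        best = None  # (precision, positives covered, attribute, value)
        for attrs, label in covered:
            if not label:
                continue  # candidate tests come from positive examples only
            for a, v in attrs.items():
                if a in rule:
                    continue
                subset = [(x, y) for x, y in covered if x.get(a) == v]
                pos = sum(1 for _, y in subset if y)
                cand = (pos / len(subset), pos, a, v)
                if best is None or cand > best:
                    best = cand
        if best is None:
            break  # no test left to add (inconsistent data)
        _, _, a, v = best
        rule[a] = v
        covered = [(x, y) for x, y in covered if x.get(a) == v]
    return rule


def sequential_covering(examples: List[Example]) -> List[Rule]:
    """Learn rules one at a time, removing covered positives after each."""
    rules: List[Rule] = []
    remaining = list(examples)
    while any(label for _, label in remaining):
        rule = learn_one_rule(remaining)
        if not any(y and covers(rule, x) for x, y in remaining):
            break  # safety guard: the rule made no progress
        rules.append(rule)
        remaining = [(x, y) for x, y in remaining
                     if not (y and covers(rule, x))]
    return rules


# Hypothetical toy data: is a page token a price (True) or not (False)?
data: List[Example] = [
    ({"prev": "<b>", "next": "</b>", "kind": "number"}, True),
    ({"prev": "<b>", "next": "</b>", "kind": "number"}, True),
    ({"prev": "<i>", "next": "</i>", "kind": "number"}, True),
    ({"prev": "<b>", "next": "</b>", "kind": "word"}, False),
    ({"prev": "<td>", "next": "</td>", "kind": "number"}, False),
    ({"prev": "<td>", "next": "</td>", "kind": "word"}, False),
]

rules = sequential_covering(data)
for r in rules:
    print(r)
```

On consistent data such as this, the learned rule set covers every positive example and no negative one. Real extraction rules operate on token sequences and landmarks around the target item rather than on flat attribute tests.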

For information integration, all kinds of learning methods have been tried, including decision trees, naive Bayes, and SVMs. For opinion mining, researchers have also tried various supervised learning methods together with natural language processing techniques. However, it is not clear whether the current learning or statistical methods are sufficiently effective.

Q3: First, thank you very much... well done and informative. You have made several references to the Web data mining book (2006). Would you say this is a good book and starting point for someone new to the subject?

Answer: Thanks. Yes, the book is a good starting point as it does not assume any prior knowledge of data mining, machine learning, or web mining.

It is meant to be a textbook for senior undergraduate students, graduate students, researchers and practitioners.

Q4: How does collaborative filtering, such as Bayesian inference and Netflix- or Amazon-style data mining, fit in practice with the approach you discussed here?

Answer: Collaborative filtering and Netflix- or Amazon-style data mining are generally considered Web usage mining, which uses user ratings of products, user purchase histories, and click data to mine patterns.

It is quite different from the typical Web content mining that we discussed in the talk. Web usage mining usually does not study page contents. Even when it does, it treats them in the IR style, i.e., as keywords. There are many papers on Web usage mining. My book also has a chapter on Web usage mining.
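To make the contrast concrete, here is a minimal sketch of user-based collaborative filtering in Python. It works only from a table of user ratings and never looks at page content. The ratings, user names, and item names are hypothetical, and cosine similarity over co-rated items is just one common choice; nothing here is claimed to be what Netflix or Amazon actually use.

```python
from math import sqrt
from typing import Dict, Optional

Ratings = Dict[str, Dict[str, float]]  # user -> item -> rating

# Hypothetical ratings on a 1-5 scale.
ratings: Ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "D": 4},
    "cat": {"A": 1, "C": 5, "D": 2},
}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity restricted to items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict(ratings: Ratings, user: str, item: str) -> Optional[float]:
    """Similarity-weighted average of other users' ratings for the item."""
    num = 0.0
    den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        w = cosine(ratings[user], their)
        num += w * their[item]
        den += abs(w)
    return num / den if den else None  # None: no usable neighbour


prediction = predict(ratings, "ann", "D")
print(prediction)
```

Because ann's ratings resemble bob's more than cat's, the prediction for item D lands closer to bob's rating than to cat's.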

Q5: Do we need to know the features before we extract opinions from web pages? How do we identify them?

Answer: There are different sub-problems in opinion mining (there is an in-depth discussion of this in my book). In the most general case, we need to solve all of them, including discovering features. However, it is also possible to provide features if the application domain is narrow. For example, if a product manufacturer is only interested in opinions on its own products, e.g., cellphones, then coming up with a set of features of the products may not be hard. With given features, the opinion mining problem is significantly simplified. Of course, it is still very challenging because it touches on natural language understanding.
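To show how much the problem shrinks when features are given, here is a deliberately crude Python baseline. For each sentence that mentions a known feature, it counts opinion words from a small lexicon and flips the polarity of a word directly preceded by a negation. The feature list, word lists, and reviews are all hypothetical, and this is a sketch of a simple lexicon approach, not the method presented in the talk.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Hypothetical inputs: features supplied by the manufacturer, tiny lexicons.
FEATURES = {"battery", "screen", "camera"}
POSITIVE = {"great", "good", "excellent", "sharp"}
NEGATIVE = {"poor", "bad", "short", "blurry"}
NEGATIONS = {"not", "never", "no"}


def sentence_orientation(words: List[str]) -> int:
    """Sum word polarities; a directly preceding negation flips the sign."""
    score = 0
    for i, w in enumerate(words):
        polarity = int(w in POSITIVE) - int(w in NEGATIVE)
        if polarity and i > 0 and words[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    return score


def summarize(reviews: List[str]) -> Dict[str, Dict[str, int]]:
    """Count positive and negative sentences per mentioned feature."""
    counts = defaultdict(lambda: {"positive": 0, "negative": 0})
    for review in reviews:
        for sentence in re.split(r"[.!?]+", review.lower()):
            words = re.findall(r"[a-z']+", sentence)
            score = sentence_orientation(words)
            if score == 0:
                continue  # no opinion words, or they cancel out
            label = "positive" if score > 0 else "negative"
            for feature in FEATURES & set(words):
                counts[feature][label] += 1
    return {f: dict(c) for f, c in counts.items()}


reviews = [
    "The screen is sharp. The battery life is short!",
    "Great camera. The battery is not good.",
]

summary = summarize(reviews)
print(summary)
```

Even this toy version hints at why the task remains hard: a sentence that praises one feature and criticizes another gets a single score here, and negation handling only looks one word back.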

Q6: Do you know any studies of opinion mining in different languages?

Answer: Yes, I am aware of opinion mining research in Chinese, Japanese, and Dutch. I do not have a reference for Dutch. I learned of it because a Dutch researcher came to talk to me at a conference and mentioned that his group was analyzing voter opinions in a local election. For Chinese and Japanese, you can check out the two papers below. These researchers also have subsequent papers in AAAI and other places.

L.-W. Ku, Y.-T. Liang and H.-H. Chen. Opinion Extraction, Summarization and Tracking in News and Blog Corpora. In Proc. of AAAI-CAAW'06, 2006.

N. Kobayashi, R. Iida, K. Inui and Y. Matsumoto. Opinion Mining on the Web by Extracting Subject-Attribute-Value Relations. In Proc. of AAAI-CAAW'06, 2006.
