Web Crawling Social Media for Topic Shift Detection and Expert Spotting

The best asset of open source software is not its price but its community. This paper is a case study in analyzing community forums for topic shift detection and expert spotting.

By Rosaria Silipo, 23 Sep 2013.

What is the best asset of open source software? The price, of course, and the openness and flexibility as well; but, if you ask me, it is the community. This study analyzes the KNIME community using the KNIME data mining tool. Here is the PDF:
Analyzing the Web from Start to Finish: Knowledge Extraction from a Web Forum using KNIME.

Social Media Data. A web community is usually best studied through its forum discussions. For this study, we therefore crawled and downloaded all KNIME Forum discussions from 2007 through the end of 2012.

Web Crawling. To download the KNIME Forum pages we needed a web crawling tool. Such a tool is not available in the KNIME Desktop core. However, the Palladian project has made a web crawling node available through the KNIME community extensions, and we used this extension to collect the necessary data. In other words, we used the community to extend the KNIME core software and then used the result to investigate the KNIME community itself. That already says how important the KNIME community is to KNIME!
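To make the crawling step concrete, here is a minimal Python sketch of a breadth-first crawl restricted to one host. It is not the Palladian node and says nothing about how that node works internally; the names `LinkParser` and `crawl`, the `fetch` callback, and the `max_pages` limit are all assumptions made for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl that stays on the host of start_url.

    fetch is a hypothetical callback: it takes a URL and returns the page
    HTML as a string, or None if the page cannot be retrieved.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` in as a parameter keeps the sketch testable without network access; in practice it could wrap `urllib.request.urlopen`, ideally with rate limiting and respect for the site's robots.txt.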

[Figure: User Answers: KNIME vs Community]

Basic Statistical Measures. First of all, let's get the facts straight. How many people actively take part in the KNIME community? How active are they? How many questions actually get an answer? How many answers does it take to close a question? How fast do the answers arrive? Basically, how reliable is the KNIME community as a support tool? All of these questions can be answered with a few counts, percentages, averages, and standard deviations.
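The measures listed above amount to simple aggregations over threads. The following Python sketch shows one way to compute them; the data layout (a list of question times paired with answer times) and the function name `forum_stats` are assumptions for illustration, not the structure used in the KNIME workflow.

```python
from statistics import mean, stdev


def forum_stats(threads):
    """Basic support statistics for a forum.

    threads is a hypothetical layout: a list of (asked_at, answer_times)
    tuples, where asked_at is a datetime and answer_times is a list of
    datetimes (empty if the question was never answered).
    """
    if not threads:
        raise ValueError("no threads to analyze")
    answered = [(q, a) for q, a in threads if a]
    # Hours between the question and its first answer, answered threads only.
    hours = [(min(a) - q).total_seconds() / 3600 for q, a in answered]
    return {
        "threads": len(threads),
        "pct_answered": 100.0 * len(answered) / len(threads),
        "avg_answers": mean(len(a) for _, a in threads),
        "avg_hours_to_first": mean(hours) if hours else None,
        # The sample standard deviation needs at least two values.
        "std_hours_to_first": stdev(hours) if len(hours) > 1 else None,
    }
```

The same pattern extends to per-user counts (posts per author) by grouping on the author instead of the thread.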

Text Mining for Topic Detection. Having established that many KNIME users talk a lot via the KNIME Forum, the next question is: what do they have so much to talk about? Topic detection is by now a well-established procedure, and it is easy to implement if the topic classes have already been defined. In the KNIME Forum, however, posts belong to wide categories, like KNIME General, that cannot be reduced to a single topic. We therefore needed an ontology of topics to use for the topic classification. If an ontology is not available and you cannot create one, you borrow it from a similar data space. We borrowed ours from the hierarchy of node description XML files available within the KNIME Desktop software. From this part of the study, the most popular topics and the shift in their popularity over time emerged.
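As a rough illustration of classifying posts against a borrowed ontology and then tracking topic popularity over time, consider the Python sketch below. The topic names and keyword sets in `ONTOLOGY` are invented for the example and are not taken from the KNIME node descriptions; plain keyword counting is also an assumption, and the paper's text mining workflow may score topics differently.

```python
import re
from collections import Counter, defaultdict

# Hypothetical ontology: topic -> keywords. In the study the topics come
# from the node description hierarchy; these entries are made up.
ONTOLOGY = {
    "Data Manipulation": {"filter", "join", "pivot", "groupby"},
    "Mining": {"cluster", "tree", "regression"},
    "IO": {"reader", "writer", "database", "csv"},
}


def classify(text, ontology=ONTOLOGY):
    """Return the topic with the most keyword hits, or None if no hits."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = {
        topic: sum(1 for tok in tokens if tok in keywords)
        for topic, keywords in ontology.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def topic_counts_by_year(posts):
    """posts: iterable of (year, text). Returns {year: Counter of topics}."""
    counts = defaultdict(Counter)
    for year, text in posts:
        topic = classify(text)
        if topic is not None:
            counts[year][topic] += 1
    return counts
```

Comparing the per-year counters, for example with `counts[2012].most_common(3)`, gives a simple view of how topic popularity shifts from one year to the next.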

Network Analytics to Single Out Experts. Who are the KNIME users talking to? Is it possible to detect one or more experts for each topic? When network analytics is applied, experts emerge very clearly as the central points of the user interaction graph for each discussed topic.
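One simple way to find the "central point" of a user interaction graph is degree centrality: the share of other users in the graph that a given user has interacted with. The Python sketch below applies it per topic. The choice of degree centrality, the `(user_a, user_b, topic)` input layout, and the function name `experts_by_topic` are assumptions for illustration; the paper may rely on a different centrality measure.

```python
from collections import defaultdict


def experts_by_topic(interactions):
    """Pick the most central user per topic by degree centrality.

    interactions is a hypothetical layout: an iterable of
    (user_a, user_b, topic) tuples, one per reply exchanged between
    two users in a thread on that topic.
    """
    # topic -> user -> set of distinct users they interacted with
    neighbors = defaultdict(lambda: defaultdict(set))
    for a, b, topic in interactions:
        if a != b:  # ignore self-replies
            neighbors[topic][a].add(b)
            neighbors[topic][b].add(a)

    experts = {}
    for topic, graph in neighbors.items():
        n = len(graph)  # always >= 2, since every edge adds two users
        centrality = {user: len(nbrs) / (n - 1) for user, nbrs in graph.items()}
        experts[topic] = max(centrality, key=centrality.get)
    return experts
```

Returning the top few users per topic instead of a single one would be a small change: sort `centrality` by value and slice.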

Moving into Production. The last whitepaper sections are dedicated to moving the whole workflow set into production, through access to remote cached data, meta-node sharing, GUI insertion, and remote execution from a web browser.

Read the paper at www.knime.com/files/knime_web_knowledge_extraction.pdf