KDnuggets : News : 2006 : n24 : item17

Jobs

From: Jed Harris
Date: 13 Dec 2006
Subject: Headquarters Berkeley, CA; contractor can be worldwide: Statistical modeling software developer -- contractor at Mindloom

Contact: modeling_job@peerworks.org

We're an open source project building content classification tools to support online browsing, discussion, and social discovery. We're seeking part-time or full-time contractors to build statistical modeling components for an online content filtering service. We are funded by a foundation and can pay competitive consulting rates. As a distributed, web-coordinated development team, we have no geographical constraints and already have a non-US developer. Since we're developing open source software, developers who work for us will be able to use the code we create in other environments.

We expect to have a basic content classification (tagging) mechanism in early 2007, using more or less "cookbook" naive Bayes methods. Beyond that we see lots of interesting extensions and problems that can only be tackled with far more statistical modeling expertise than we have.

Conversely, we expect the system as it gets deployed to produce a great deal of user tagging data for various content. We'd be interested in working with statistical research projects that would like to use that data. Of course this raises privacy issues, but we will try to deploy the system under terms that will give researchers access to as much data as possible.

Context / Goals

All of our work is aimed at making online peer production (broadly defined) more productive, enjoyable, and easier to learn and set up.

Why do we need statistical modeling?

Most online peer production involves a lot of content, both externally and internally generated, and a lot of social interaction. However, both the content and the interaction are often hard for potential participants to understand and manage.

Helping users navigate and manage the content and the social environment will contribute a lot to our overall goal. The vast majority of content important in online peer production is semi-structured text, with formatting, headers, links, etc. -- e.g. discussion groups, mailing list archives, blogs, web pages, etc. Statistical models can be derived from this content, plus user interaction, in at least two important ways:

  • Content models: We are modeling content from the point of view of each *individual user*, not the consensus "view from nowhere", to help each user find the content most relevant to them, which will vary with their goals, experience, mood, etc.
  • Social models: We want to model the latent social structure implied by the data *plus* the individual user perspectives, to help users initiate and maintain productive and enjoyable social relationships.

Our ideal candidate
We are looking for a pragmatic, creative, productive developer with an excellent ability to implement statistical modeling for textual and social data. Given our situation, anyone we hire will need to be able to work well in a loosely coupled team, communicate well online, and manage their own time and (short term) priorities. Experience in web-coordinated open source development is a plus.

Breaking this down:

Pragmatic: Whoever we hire needs to be continually guided in their decisions by anticipated user benefits, development feasibility, scalability, etc. Academic rigor, cool technology, development methods, etc. are all good, but our first priorities are getting things working, keeping them working, and making them work better.

Creative: We're doing exploratory development, and we'll often need to find cost-effective ways to solve problems that haven't been solved before. Furthermore we'll often have opportunities to address potential new user benefits. We need someone who has the mental flexibility and judgment to identify these problems and opportunities, and adapt available technology and redefine our development goals to address them.

Productive: We need someone who is good at building production quality code. We aren't requiring specific language skills, since the statistical modeling is fairly separable, and may anyway end up implemented in whatever language has the best libraries and/or performance for a given task. However candidates need to have a good development track record.

Project details and status
Our first service is individual content classification -- i.e. auto-tagging. Each user creates and assigns their own tags, and we train classifiers to assign tags for them. We have an initial implementation of classifiers, and are improving them using cross-validation.
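To make the per-user setup concrete: each (user, tag) pair gets its own small binary classifier. Below is a minimal plain naive Bayes sketch in Python -- names like `TagClassifier` are illustrative, and this is deliberately simpler than the SpamBayes-style variant the project actually uses:

```python
import math
from collections import Counter

class TagClassifier:
    """A binary naive Bayes classifier for one (user, tag) pair:
    should this item get the tag or not?  A minimal sketch, not
    the project's actual implementation."""

    def __init__(self):
        self.tokens = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}

    def train(self, item_tokens, tagged):
        """Record one item the user did (or did not) assign the tag to."""
        self.docs[tagged] += 1
        self.tokens[tagged].update(item_tokens)

    def log_odds(self, item_tokens):
        """log P(tag | item) - log P(no tag | item), with add-one
        smoothing; a positive score means 'assign the tag'."""
        vocab = len(set(self.tokens[True]) | set(self.tokens[False]))
        score = math.log((self.docs[True] + 1.0) / (self.docs[False] + 1.0))
        for cls, sign in ((True, 1.0), (False, -1.0)):
            total = sum(self.tokens[cls].values())
            for t in item_tokens:
                p = (self.tokens[cls][t] + 1.0) / (total + vocab)
                score += sign * math.log(p)
        return score

# Train from a (made-up) user's existing tag assignments:
clf = TagClassifier()
clf.train(["ruby", "rails", "web"], True)
clf.train(["rails", "deploy"], True)
clf.train(["election", "senate"], False)
assert clf.log_odds(["rails", "web"]) > 0   # tag it
assert clf.log_odds(["senate"]) < 0         # leave it untagged
```

Because each classifier is trained only on one user's assignments, two users with the same tag name ("politics", say) can end up with completely different classifiers -- which is the point of the individual-perspective design.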

Testing and tuning classifiers with cross-validation needs a lot of tagged content, and RSS feeds provide a lot of diverse, freely available content of the right general sort. So we built a feed aggregator and have been accumulating a big corpus. Our technology has many more uses than just tagging blog posts, but that is actually a useful application domain.

We've carved out a smaller "clean" corpus (about 7000 items that meet various criteria) and multiple people are now tagging it. At the same time we're adding new feeds to the collection process to improve item diversity. We can easily carve out new corpuses using different criteria, but of course it is a big effort to get them fully tagged.

We have a cross-validation testing and reporting framework set up. We're gradually enhancing it as we better understand our requirements.
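For readers less familiar with the mechanics, the core of any such framework is a k-fold split of the tagged corpus -- every item serves as test data exactly once. A generic sketch (the project's actual framework is not described in this posting):

```python
import random

def k_fold(items, k=10, seed=42):
    """Yield k (train, test) pairs for cross-validation.  Each item
    appears in exactly one test set; the rest form its train set.
    Shuffling with a fixed seed keeps runs reproducible."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Example: 20 tagged items, 5 folds of 4 test items each.
for train, test in k_fold(range(20), k=5):
    assert len(train) == 16 and len(test) == 4
```

Accuracy (or precision/recall per tag) is then averaged across the k folds, which is what makes a ~7000-item tagged corpus so valuable: it supports many such splits.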

Our classifiers are naive Bayes -- more or less the SpamBayes variant, by no means pure naive Bayes. However they work reasonably well already (on smaller corpuses) and we think that by fairly simple enhancements we can get them working "adequately" (we haven't established definite criteria yet).

The main area where we may need help in this phase, if the straightforward enhancements aren't adequate, is stronger feature extraction. Right now we are just using simple tokenizing. We may also need help with ongoing performance work since the classification is compute intensive enough to be an issue.
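As a rough illustration of the gap between "simple tokenizing" and stronger features: the baseline just splits lowercased text into word tokens, and one cheap upgrade is adding adjacent-word bigrams so phrases like "naive bayes" become features in their own right. A sketch (the regex and function names are illustrative, not the project's code):

```python
import re

TOKEN = re.compile(r"[a-z0-9']+")

def features(text, bigrams=False):
    """Baseline feature extraction: lowercase word tokens, i.e.
    'simple tokenizing'.  With bigrams=True, also emit adjacent
    word pairs -- one inexpensive 'stronger feature' candidate."""
    tokens = TOKEN.findall(text.lower())
    feats = list(tokens)
    if bigrams:
        feats += [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return feats

assert features("Naive Bayes works") == ["naive", "bayes", "works"]
assert "naive_bayes" in features("Naive Bayes works", bigrams=True)
```

Richer features for semi-structured text -- weighting title or header tokens, keeping link targets as features -- follow the same pattern: emit more informative strings, and the classifier layer stays unchanged.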

Unless we hit major rocks, which looks unlikely at this point, we expect to have adequate individual classifiers working by sometime in January. If we need fancier features that would delay things, but quite likely not by more than a month or two.

Once we've got adequate classifiers and start getting a significant user base, we'll want to model the structure of the semantic space across users. Modeling the semantic space exceeds our "cookbook" level understanding of statistical modeling technology, so when we get near this point we will very much need people with a lot more expertise.

A simple example is that we'd like to tile the semantic space with clusters of users who have sufficiently similar interests. Then we could let new users pick an initial cluster, so they get a set of classifiers "pre-built" and can just tune and extend the set, rather than having to build one from scratch. Of course these clusters would change over time as the user population evolves.
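One naive way to form such clusters is to represent each user by their tag-usage counts and group users whose vectors are cosine-similar. The greedy single-pass scheme below is only a sketch of the idea -- a real deployment would want proper clustering (k-means, agglomerative) over a better embedding, which is exactly the expertise being sought:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two users' tag-usage count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def cluster_users(profiles, threshold=0.5):
    """Greedy sketch: each user joins the first cluster whose seed
    member is similar enough, else starts a new cluster."""
    clusters = []  # list of (seed_profile, member_ids)
    for uid, prof in profiles.items():
        for seed, members in clusters:
            if cosine(prof, seed) >= threshold:
                members.append(uid)
                break
        else:
            clusters.append((prof, [uid]))
    return [members for _, members in clusters]

# Hypothetical users: two Rails enthusiasts and one politics reader.
profiles = {
    "a": Counter({"ruby": 5, "rails": 3}),
    "b": Counter({"rails": 4, "ruby": 2}),
    "c": Counter({"politics": 7}),
}
assert cluster_users(profiles) == [["a", "b"], ["c"]]
```

A new user who picks the first cluster would then start from pre-built "ruby"/"rails" classifiers and tune from there.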

Trickier examples are (1) collaborative training -- learning from other users' similar classifiers and (2) identifying sufficiently strong "interest groups" -- clusters of users with enough in common to perhaps enjoy shared discussions, etc.

There are probably a lot of exciting things to do with the user data that we aren't clever enough to imagine. We are also interested in contributing data (with appropriate privacy protection) to research projects.

Contact:
modeling_job@peerworks.org



Copyright © 2006 KDnuggets.