KDnuggets : News : 2002 : n12 : item18

Briefs

GeneViz 1.1 implementing Double Conjugated Clustering

Contentsoft AG releases GeneViz 1.1 and offers free academic software.

GeneViz features the first publicly available implementation of Double Conjugated Clustering (DCC), delivering ease and precision in the analysis of high-dimensional data such as microarray data.

Background: Current biochip (microarray) technology allows simultaneous measurement of thousands of gene expression levels in a given organism. Two-way hierarchical clustering methods are popular with microarray researchers because both sample classification and gene classification are often uncertain. However, sorting the data matrix is only partially helpful: typical datasets have thousands of features (gene expressions) that still need to be visually inspected, so the agglomeration of information is left to the user. Agglomerative techniques such as SOM or K-Means, however, cannot agglomerate samples and features simultaneously.

Double Conjugated Clustering: DCC is an agglomerative, two-way, node-driven clustering technique that clusters samples and features (genes) on two separate node maps. The clustering of the two maps is coupled by conjugated projections between coupled nodes. Each projection is essentially one step of the power method for eigenanalysis.
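To make the reference to the power method concrete, here is a minimal sketch in plain Python. It is not the DCC algorithm and not GeneViz code: it only shows what one power-method step looks like when a vector in feature space is projected into sample space and back through the data matrix. The function names and the toy matrix are assumptions made for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def project_to_samples(data, feature_vec):
    """Map a feature-space vector into sample space (X v)."""
    return [sum(x * v for x, v in zip(row, feature_vec)) for row in data]


def project_to_features(data, sample_vec):
    """Map a sample-space vector into feature space (X^T u)."""
    n_features = len(data[0])
    return [
        sum(row[j] * u for row, u in zip(data, sample_vec))
        for j in range(n_features)
    ]


def power_step(data, feature_vec):
    """One power-method step: features -> samples -> features."""
    sample_vec = normalize(project_to_samples(data, feature_vec))
    new_feature_vec = normalize(project_to_features(data, sample_vec))
    return sample_vec, new_feature_vec


# Hypothetical toy matrix: rows are samples, columns are features.
# Samples 0-1 express features 0-1; samples 2-3 express feature 2.
data = [
    [2.0, 2.0, 0.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]

feature_vec = normalize([1.0, 0.5, 0.25])
for _ in range(50):
    sample_vec, feature_vec = power_step(data, feature_vec)
```

Repeating the step drives both vectors toward the dominant pattern in the matrix: here the weight concentrates on samples 0-1 and features 0-1, which form the strongest correlated block. How DCC embeds such a step in its node maps is described in the vendor's papers and is not reproduced here.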

What it achieves: The coupling projection causes sample clusters and their correlated feature clusters to be attracted to their respective coupled nodes on the maps, matching sample clusters with their correlated features. Noisy features that are uncorrelated with any of the sample clusters are attracted to "junk" nodes that have no corresponding sample cluster. This offers the opportunity to construct low-dimensional classifiers.

Essentially, the method partitions the data matrix, without supervision, into submatrices of closely clustered samples and their correlated features, plus "junk". The amount of information that researchers need to inspect manually is thereby greatly reduced.

In contrast to one-space node-driven techniques such as SOM or K-Means, the method is not sensitive to the number of nodes chosen, as long as that number is above a minimum threshold. The researcher therefore does not need to make assumptions about the number of classes in the data.
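The partition into submatrices and "junk" described above can be pictured with a small Python sketch. The node labels are supplied by hand, so nothing here is computed by DCC or taken from GeneViz; the function name, the `JUNK` marker, and the toy data are all hypothetical and only show what the resulting blocks might look like.

```python
JUNK = -1  # hypothetical marker for features assigned to a "junk" node


def extract_blocks(data, sample_labels, feature_labels):
    """Split a data matrix into one submatrix per coupled node.

    data           -- list of rows (samples), each a list of feature values
    sample_labels  -- node index assigned to each sample (assumed given)
    feature_labels -- node index assigned to each feature, or JUNK
    """
    blocks = {}
    for node in sorted(set(sample_labels)):
        rows = [i for i, lab in enumerate(sample_labels) if lab == node]
        cols = [j for j, lab in enumerate(feature_labels) if lab == node]
        blocks[node] = [[data[i][j] for j in cols] for i in rows]
    junk_cols = [j for j, lab in enumerate(feature_labels) if lab == JUNK]
    return blocks, junk_cols


# Hypothetical toy data: 4 samples x 5 features.
data = [
    [5.0, 4.8, 0.1, 0.3, 0.9],
    [5.2, 5.1, 0.2, 0.8, 0.1],
    [0.1, 0.2, 3.9, 0.5, 0.4],
    [0.3, 0.1, 4.1, 0.2, 0.7],
]
sample_labels = [0, 0, 1, 1]          # assumed output of a two-way clustering
feature_labels = [0, 0, 1, JUNK, JUNK]

blocks, junk_cols = extract_blocks(data, sample_labels, feature_labels)
```

With these labels, only the two small blocks need to be inspected; the columns listed in `junk_cols` can be set aside.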

These properties make DCC highly desirable for many types of high-dimensional clustering problems such as the analysis of microarray data. The method is patent-pending.

Free GeneViz versions are available for qualifying non-commercial research institutions. The software is provided with biochip / microarray demo projects and the Fisher Iris data set. Papers and software may be downloaded from Contentsoft's website: http://www.contentsoft.de

Contentsoft AG, Schwanthaler Strasse 81, 80336 Munich, Germany, +49 89 5445 989 0



Copyright © 2002 KDnuggets.