University of Konstanz
Graduiertenkolleg / PhD Program
Computer and Information Science

Colloquium of the Department and the PhD Program


Taxonomy-based Similarity Measures for Gene Clustering and Knowledge Discovery


Prof. Dr. James Keller, University of Missouri
Missouri, USA

date & place

Wednesday, 24.05.2006, 16:15 h
Room C252


In clustering and subsequent knowledge discovery on unknown gene products, the primary features to date are the gene sequence and the expression values obtained from a microarray experiment. One major goal is to determine the function of a gene product and its similarity in function or structure to other up-regulated or down-regulated gene products. Many measures have been proposed to calculate the closeness of sequences. However, for many gene products, additional information comes from the set of Gene Ontology (GO) annotations and the set of journal abstracts related to the gene product. For these genes, it is reasonable to include similarity measures based on the terms found in the GO and/or the index term sets of the related documents (MeSH annotations). In both cases we compare two sets of terms arranged in a taxonomy (GO or MeSH). Several measures have been constructed to assess the closeness of terms in a taxonomy, including the shortest path length between terms and information-theoretic values in which node probabilities are estimated from a corpus of relevant documents. Utilizing such factors in addition to sequence and expression should aid the process of knowledge discovery. It will be easier to annotate clusters, for example, when their members share common descriptive terms. When an unknown gene product joins a cluster on the basis of sequence and expression, it is reasonable to conjecture that this gene product will also share the cluster annotations (at least partially). In this talk we propose a fuzzy measure-based similarity (FMS) for computing the similarity of two sets of terms found in a taxonomy (and hence of the two gene products annotated with terms from the taxonomy). The advantage of FMS is that it takes the context of the whole set into consideration when computing the similarity.
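The two term-level measures mentioned above, shortest path length and an information-theoretic value derived from corpus-estimated node probabilities, can be illustrated with a minimal sketch. This is not the FMS itself, which the abstract does not specify. The taxonomy, the term names, and the annotation counts below are hypothetical toy data, not real GO or MeSH content, and the information-content score follows the commonly used idea of scoring two terms by the information content of their lowest common ancestor.

```python
import math

# Hypothetical toy taxonomy (child -> parent); not real GO or MeSH terms.
PARENT = {
    "binding": "function",
    "catalysis": "function",
    "dna_binding": "binding",
    "rna_binding": "binding",
    "kinase": "catalysis",
}

# Hypothetical annotation counts per term, standing in for a document corpus.
COUNTS = {
    "function": 0, "binding": 2, "catalysis": 1,
    "dna_binding": 4, "rna_binding": 2, "kinase": 1,
}


def ancestors(term):
    """Return the term followed by its ancestors up to the root."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(a, b):
    """First term on a's path to the root that also lies on b's path."""
    up_b = ancestors(b)
    return next(t for t in ancestors(a) if t in up_b)


def path_length(a, b):
    """Number of edges on the shortest path between a and b in the tree."""
    common = lowest_common_ancestor(a, b)
    return ancestors(a).index(common) + ancestors(b).index(common)


def probability(term):
    """Share of all annotations that fall on the term or its descendants."""
    total = sum(COUNTS.values())
    subtree = sum(c for t, c in COUNTS.items() if term in ancestors(t))
    return subtree / total


def ic_similarity(a, b):
    """Information content of the lowest common ancestor of a and b."""
    return math.log(1.0 / probability(lowest_common_ancestor(a, b)))


if __name__ == "__main__":
    print(path_length("dna_binding", "rna_binding"))
    print(path_length("dna_binding", "kinase"))
    print(ic_similarity("dna_binding", "rna_binding"))
    print(ic_similarity("dna_binding", "kinase"))
```

In this toy setup, sibling terms under a specific parent score higher than terms whose only shared ancestor is the root, since the root has probability 1 and therefore carries no information.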
Initial testing on a group of 194 sequences representing three protein families shows promising results, both when the similarity is presented visually and from the standpoint of hierarchical clustering utilizing the FMS. The visual representation of the similarities can help the human curator assess the consistency of the members of an automatically extracted family. In our experiments, for example, we discovered incomplete annotations and substructures indicating potential problems in the family definition. When dealing with large groups of terms and/or documents describing the objects under consideration, we not only determine the similarity between document pairs but, by introducing the Choquet integral, also fuse this partial agreement function on pairs of documents into a single value relating the gene products. The measures for the final integral fusion can be tailored to produce ordered weighted average (OWA) operators (e.g., "at least two documents must support the connection") or can be based on assessments of the "worth" of individual documents and subsets of documents towards building the strength of connection.
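The fusion step described above can be sketched with a discrete Choquet integral. The example below is a minimal illustration under stated assumptions, not the speaker's implementation: the function names and the support scores are hypothetical, and the fuzzy measures shown depend only on the size of a document subset, which is the case in which the Choquet integral reduces to an OWA operator.

```python
def choquet(values, measure):
    """Discrete Choquet integral of scores with respect to a fuzzy measure.

    values  : list of partial support scores, one per document (pair)
    measure : function mapping a frozenset of indices to a value in [0, 1],
              with measure(empty set) = 0 and measure(all indices) = 1
    """
    # Visit sources from strongest to weakest support.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    total, previous = 0.0, 0.0
    chosen = []
    for i in order:
        chosen.append(i)
        current = measure(frozenset(chosen))
        # Weight each score by the gain in measure from adding its source.
        total += values[i] * (current - previous)
        previous = current
    return total


def at_least_two(subset):
    """Hypothetical measure: a subset counts only if it has two or more members."""
    return 1.0 if len(subset) >= 2 else 0.0


def make_uniform(n):
    """Hypothetical measure proportional to subset size (yields the plain mean)."""
    return lambda subset: len(subset) / n


if __name__ == "__main__":
    scores = [0.9, 0.2, 0.6]  # hypothetical per-document support values
    print(choquet(scores, at_least_two))
    print(choquet(scores, make_uniform(len(scores))))
```

With the `at_least_two` measure the result equals the second-largest score, so a single strongly supporting document cannot establish the connection on its own. A measure that assigns different worth to particular documents or subsets would be passed in the same way, as long as it is monotone and normalized.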