University of Konstanz
Graduiertenkolleg / PhD Program
Computer and Information Science

Guest Talks

title

Graph-Based Methods for Text and Web Mining

speaker

Dr. Mark Last, Ben-Gurion University of the Negev
Be'er Sheva, Israel

date & place

Monday, 05.05.2008, 10:00 h
Room C252

abstract

Most text mining and web content mining methods are based on the "classical" vector-space model of information retrieval. However, this popular method of document representation does not capture important structural information, such as the order and the proximity of word occurrence or the location of a word within a document. It also makes no use of the mark-up information that can easily be extracted from HTML tags of web documents. One may expect that a representation that contains more information about a document should enhance the performance of text mining techniques.
In this talk, I will present the recently developed graph-based document representation models, which preserve this structural information. In the graph-based approach, a document in any language is represented by a directed graph, where the nodes stand for words/phrases and the edges represent syntactic relationships between them (like the order of occurrence). The new models are shown to outperform the traditional vector representation using the k-Nearest Neighbor (k-NN) classification algorithm. Since the eager (model-based) classifiers cannot work with the graph-based representation directly, we have also developed a hybrid approach to web document representation, built upon both graph and vector space models, thus preserving the benefits and overcoming the limitations of each. The empirical results demonstrate that the hybrid methods outperform, in most cases, existing approaches in terms of classification accuracy, and in addition, achieve a significant reduction in the classification time.
Currently, we are trying to utilize the graph-based models for generating visual summaries of text documents. A visual document summary can be obtained by identifying all salient subgraphs in a graph of a complete document and then pruning nodes and edges that do not belong to any important sub-graph. In our research, we study two approaches: supervised, where we train classification algorithms on collections of annotated document graphs and unsupervised, where we use graph-based ranking algorithms, such as HITS and PageRank, for visual summary extraction.