University of Konstanz
Graduiertenkolleg / PhD Program
Computer and Information Science

Graduation Talks

title

Document Structure Analysis for Large Document Collections

speaker

Andreas Stoffel, University of Konstanz
Konstanz, Germany

date & place

Wednesday, 24.06.2009, 16:15 h
Room C 252

abstract

Worldwide, libraries, archives, museums, and companies have collected large sets of printed documents over the years. Accessing information within these archives is laborious, and finding information implicitly hidden in the documents is impossible. To simplify access to the information in such collections, the archives are digitized and the text is recognized for retrieval purposes. Besides document retrieval, information extraction and data mining tasks are important as well. In these cases the traditional bag-of-words approach is of limited use and additional information is required, for instance the structure of the documents.
Examples of such collections are reports gathered by different organizations: companies collect reports about defects and repairs of their products, and physicians or hospitals record the results of examinations and operations of their patients. These collections have in common that they contain multiple types of documents originating from multiple sources. In the case of medical records, up to 1200 different document types are reported.
A problem that arises with large archives is the variety of document types. Not only do the documents change their layout and content over time, but different types of documents are also collected in the same archive. For instance, medical records contain reports, examination results, and so forth. Structure analysis methods for such collections must be able to handle these problems. Contemporary approaches use different techniques such as manually created rules, document grammars, or optimization approaches, but they all focus on a particular type of document and a certain task. Applying such methods to a more complex document collection means that rules, grammars, or cost functions must be adapted and reference data must be created manually.