The World behind a PDF Document

Jour Fixe talk by Tamir Hassan on June 20, 2013

Many of us regularly encounter PDF files in our day-to-day work, and their great advantage is that they always look the same, regardless of whether they are printed, viewed on screen or displayed on a smartphone or tablet. But this fidelity also has its drawbacks: it is very difficult to edit a PDF once it has been created. In fact, many non-computer scientists even see this difficulty as an advantage, although it must be pointed out that minor edits, such as fiddling a few figures on an invoice, are actually relatively easy to perform.

As a computer scientist, Tamir Hassan has often been asked why it is so difficult to edit PDF documents. One of his research topics is logical structure analysis, which he sees as the first step towards making PDF and other print-oriented documents editable. In his Jour Fixe talk on “Rediscovering the structure of PDF documents”, Tamir Hassan first explained the general characteristics of print-oriented formats such as PDF: every document has both a physical and a logical structure. The physical structure represents the document's visual appearance, i.e. the layout and typographic conventions used in its presentation. The logical structure refers to elements such as headings, paragraphs, captions, headers and footers, as well as the reading order of the physical blocks. Unlike a Word document, a PDF file does not usually contain adequate machine-readable information about its logical structure, yet many applications depend on this structure: search and text mining, repurposing for small-screen devices, archiving, accessibility and information extraction.
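
To make the distinction concrete, here is a minimal sketch in Python of how the two structures might be modelled; the type names and fields are illustrative assumptions, not taken from the talk:

    from dataclasses import dataclass, field

    @dataclass
    class PhysicalBlock:
        """A block as it appears on the page: geometry and typography only."""
        page: int
        bbox: tuple   # (x0, y0, x1, y1) in points
        font: str
        size: float
        text: str

    @dataclass
    class LogicalElement:
        """A node in the logical structure: role and reading order, no geometry."""
        role: str                                     # e.g. "heading", "paragraph", "caption"
        blocks: list = field(default_factory=list)    # the physical blocks this element spans
        children: list = field(default_factory=list)  # nested logical elements, in reading order

    # A Word file stores something like the LogicalElement tree; a typical PDF
    # stores only the PhysicalBlocks, so the logical tree must be rediscovered
    # from layout and typographic cues.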

To rediscover this structure, Tamir Hassan applies an evaluation model which, in its simplest form, is composed of nested rules for each type of object, such as words, lines and paragraphs. Each rule contains a search method, e.g. “find the N most likely combinations of paragraphs on this page”, and an evaluation method to score each result, e.g. “how likely is it that this combination of objects forms a paragraph?” To retain computational feasibility, these evaluation results are re-used at the higher levels, and the number of hypotheses at each level is restricted, as the sketch below illustrates. In this way he hopes to mimic how humans analyse documents, taking into account how large and small objects on the page interact with each other, and thereby increase accuracy. By structuring the rules and separating hereditary knowledge from acquired knowledge about document structure, he aims to make it easy to introduce improvements and customizations, e.g. publication-specific rules, and to adjust the trade-off between accuracy and computation time. In general, Tamir Hassan wants to find out how knowledge about the conventions governing a document's appearance can be efficiently represented and reasoned over, and how the rediscovered structure can be represented in a standardized format that enables its reuse.
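
The following simplified Python sketch illustrates the idea; the scoring rule and data are illustrative placeholders, not Hassan's actual implementation. Each level generates candidate groupings of the objects from the level below, scores each with that level's evaluation rule, and keeps only the N best hypotheses for the next level up:

    from itertools import combinations

    def evaluate_paragraph(lines):
        """Illustrative rule: how plausible is this group of lines as a paragraph?
        A real rule would inspect alignment, spacing and font consistency."""
        if not lines:
            return 0.0
        consistent_font = len({ln["font"] for ln in lines}) == 1
        return (1.0 if consistent_font else 0.3) * len(lines)

    def find_best_groupings(objects, evaluate, beam_width=5, max_group=4):
        """Generate candidate groupings of lower-level objects, score each with
        the level's evaluation rule, and keep only the beam_width best.
        Restricting hypotheses at each level keeps the search feasible."""
        candidates = []
        for size in range(1, min(max_group, len(objects)) + 1):
            for group in combinations(objects, size):
                candidates.append((evaluate(list(group)), list(group)))
        candidates.sort(key=lambda scored: scored[0], reverse=True)
        return candidates[:beam_width]  # the N most likely combinations

    # Usage: these paragraph hypotheses, and their scores, would in turn be
    # re-used when evaluating hypotheses at the next level up.
    lines = [{"font": "Times", "text": "First line"},
             {"font": "Times", "text": "Second line"},
             {"font": "Helvetica", "text": "A caption"}]
    top_hypotheses = find_best_groupings(lines, evaluate_paragraph, beam_width=3)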

At the end of his talk the computer scientist outlined his visions for the future: besides PDF, the methods being developed can also be applied to image and HTML documents. Furthermore, he sees his work broadening into the areas of stylesheet reconstruction and automatic document layout, thus completing the repurposing cycle. He also introduced a further interdisciplinary research project, currently being planned in collaboration with theatre scholars from Vienna, in which he plans to develop methods that support the digitization of historic theatre playbills, many of which are in poor condition and were printed in the Fraktur (German Gothic) script, and to extract their information in a structured form, enabling machine-aided analysis of the entire collection of more than 500,000 playbills.