# Dawid Piątek

Doctoral Student in the PhD program from 01.03.2011 to 30.06.2013.
Currently working as Analyst at Accenture, Warsaw, Poland.

1. Prof. Dr. Michael Berthold
2. Prof. Dr. Ulrik Brandes

## organisational data

 Room : Z 809 Tel. : (+49) 07531 / 88-5075 Fax : (+49) 07531 / 88 5132 E-mail : dawid.piatek(at)uni-konstanz.de Other Resources : Personal Page at Workgroup

## project description

Data analysis plays in the contemporary world role greater than ever before. Increasing amounts of data we have to our disposal allow gaining new insights and making better decisions in fields ranging from archeology to financial markets. However, the data are increasing not only in terms of their volume, but also in their complexity. An important kind of data, which attracted a lot of attention recently is network data. Networks are a natural way of representing, besides the objects themselves, relations between the objects. Examples may be webpages referring one to another by hyperlinks or concept graphs representing information units and relations between them. One of the possible analytical goals of network analysis is searching for new, unusual patterns to support creative discoveries. A novel approach to supporting creativity is looking for bisociations , i.e. objects associating different, seemingly unrelated domains.

In the initial period of my work I collaborated in research carried out within BISON Project devoted to analyzing information networks in order to identify domains and potentially bisociative connections between them. As an example of such an information network one may think of graph representation of Wikipedia where nodes represent articles and (undirected) edges represent hyperlinks between the articles. A subset of Wikipedia was also used in the research to exemplify developed methods. To identify the domains in the dataset first distances between all articles were computed using a spreading activation based similarity measure developed earlier within the BISON Project. These similarities were used as an input for agglomerative clustering which returns a set of hierarchically organized sets of articles. These sets may be interpreted as domains. Having identified the domains, the next step of the analysis was developing an index which evaluates the subgraphs connecting different domains (bisociation candidates ) with respect to their potential of representing an interesting bisociative discovery. Proposed index consists of 3 components, each of them evaluating one, in opinion of the authors important, aspect of a bisociation. These aspects are exclusiveness , size , and balance . Exclusiveness means that the two domains should be connected by a bisociation candidate which is small relatively to the domains and the bisociation candidate should have few connections except of that to the domains of interest. Size means that the two domains should be possibly large. Preference for large domains stems from the assumption that domains which are large and due to that fact are high in the hierarchy of domains are likely to represent abstract, general concepts. In the opinion of authors such general bisociations are more valuable. The balance requirement means that the two connected domains should be of similar size and as a result represent similar level of abstraction. Highly unbalanced connections are likely to represent connections between a domain and its subdomain.

The application of methods described above results in a list of connected domain pairs with a score representing their potential to be a surprising discovery. Applying these methods to a subset of Wikipedia articles resulted in finding connections between e.g. the domain of classical music composers and computer operating systems (Jet Set Willy - a computer game with Beethoven's Moonlight Sonata as a soundtrack) or astronomy and south-east Asia (Nine Million Bicycles - a song comparing Beijing and a couple of astronomical facts). All connections with high score were surprising to the authors confirming their potential for being a creative discovery.

My further work was focused on improving the similarity measure used above to establish the distances between Wikipedia articles. A shortcoming of this measure is its high computational cost, which is $O\left(n3\right)$. This limits the applicability of this measure to datasets of at most moderate size. To overcome this limitation I noticed that the spreading activation process which is the essential component of the similarity measure may be seen as exploring all possible paths of some fixed length \$k\$ starting from the query node. Number of such paths increases fast both with increasing path's length \$k\$ and the size of the graph. To overcome this problem I suggested two modifications of the spreading activation process. The first one, which I termed random walks spreading activation , consists in exploring only a fixed number $r$ of randomly selected paths originating from the query node. In this way I lessen the dependence of the computational cost on the graph size because the number of analyzed paths is independent of the graph size. A shortcoming of this approach is its random character. In order to diminish the effect of the noise caused by random selection of the paths the number of paths $r$ must be relatively high. As a remedy to this problem I proposed another modification of spreading activation. In this variant I again fix the number of paths to explore, but instead of selecting the paths randomly I choose paths in a deterministic way starting from the shortest ones until I reach the required number of paths ~$r$. This allows to keep the number of analyzed paths constant without introducing randomness into the value of the measure. Both similarity measures were presented during the GK Summer School 2012 in Gaschurn, Austria.

After completing this research I decided to quit the PhD Programme. The reason for this is my willingness to develop in the direction of statistical inference, which is the field I have background in. Unfortunately, when entering the PhD Programme I was not aware that although there is an overlap between statistics and data mining these are still distinct fields with quite different research focus.

## publications

The following list of publications covers only those, which are or were published during participation at the Graduiertenkolleg / PhD program.

id should be a number

## curriculum vitae

 since March 2011 Member of PhD program Explorative Analysis and Visualization of Large Information Spaces, University of Konstanz October 2008 - January 2011 M.Sc. in Statistics, Ludwig-Maximilians-University of Munich, Germany October 2008 - January 2011 Holder of a scholarship of The Bavarian Academic Center for Central, Eastern and Southeastern Europe (BAYHOST) August 2006 - August 2008 Data Analyst, StatConsulting, Warsaw, Poland January 2005 - May 2005 Data Analyst, Market Research Institute ISM Global Dynamics, Frankfurt am Main – Kronberg, Germany April 2004 - March 2005 Socrates-Erasmus exchange student, Johannes Gutenberg-University of Mainz, Germany September 2000 - June 2006 M.A. in Quantitative Methods in Economics & Information Systems, Warsaw School of Economics, Poland