## 004 Datenverarbeitung; Informatik

### Refine

#### Year of publication

#### Document Type

- Doctoral Thesis (29)
- Conference Proceedings (1)
- Master's Thesis (1)
- Other (1)

#### Keywords

- Internet (3)
- Middleware (3)
- Abfrageverarbeitung (2)
- Ad-hoc-Netz (2)
- Algorithmus (2)
- Mobile Computing (2)
- OBDD (2)
- OBDDs (2)
- Peer-to-Peer-Netz (2)
- Selbstorganisation (2)

#### Institute

- Informatik (26)
- Computerlinguistik und Digital Humanities (1)
- Fachbereich 1 (1)
- Medienwissenschaft (1)
- Sinologie (1)
- Universitätsbibliothek (1)
- Wirtschaftswissenschaften (1)

With the advent of highthroughput sequencing (HTS), profiling immunoglobulin (IG) repertoires has become an essential part of immunological research. The dissection of IG repertoires promises to transform our understanding of the adaptive immune system dynamics. Advances in sequencing technology now also allow the use of the Ion Torrent Personal Genome Machine (PGM) to cover the full length of IG mRNA transcripts. The applications of this benchtop scale HTS platform range from identification of new therapeutic antibodies to the deconvolution of malignant B cell tumors. In the context of this thesis, the usability of the PGM is assessed to investigate the IG heavy chain (IGH) repertoires of animal models. First, an innovate bioinformatics approach is presented to identify antigendriven IGH sequences from bulk sequenced bone marrow samples of transgenic humanized rats, expressing a human IG repertoire (OmniRatTM). We show, that these rats mount a convergent IGH CDR3 response towards measles virus hemagglutinin protein and tetanus toxoid, with high similarity to human counterparts. In the future, databases could contain all IGH CDR3 sequences with known specificity to mine IG repertoire datasets for past antigen exposures, ultimately reconstructing the immunological history of an individual. Second, a unique molecular identifier (UID) based HTS approach and network property analysis is used to characterize the CLLlike CD5+ B cell expansion of A20BKO mice overexpressing a natural short splice variant of the CYLD gene (A20BKOsCYLDBOE). We could determine, that in these mice, overexpression of sCYLD leads to unmutated subvariant of CLL (UCLL). Furthermore, we found that this short splice variant is also seen in human CLL patients highlighting it as important target for future investigations. Third, the UID based HTS approach is improved by adapting it to the PGM sequencing technology and applying a custommade data processing pipeline including the ImMunoGeneTics (IMGT) database error detection. Like this, we were able to obtain correct IGH sequences with over 99.5% confidence and correct CDR3 sequences with over 99.9% confidence. Taken together, the results, protocols and sample processing strategies described in this thesis will improve the usability of animal models and the Ion Torrent PGM HTS platform in the field if IG repertoire research.

This thesis considers the general task of computing a partition of a set of given objects such that each set of the partition has a cardinality of at least a fixed number k. Among such kinds of partitions, which we call k-clusters, the objective is to find the k-cluster which minimises a certain cost derived from a given pairwise difference between objects which end up the same set. As a first step, this thesis introduces a general problem, denoted by (||.||,f)-k-cluster, which models the task to find a k-cluster of minimum cost given by an objective function computed with respect to specific choices for the cost functions f and ||.||. In particular this thesis considers three different choices for f and also three different choices for ||.|| which results in a total of nine different variants of the general problem. Especially with the idea to use the concept of parameterised approximation, we first investigate the role of the lower bound on the cluster cardinalities and find that k is not a suitable parameter, due to remaining NP-hardness even for the restriction to the constant 3. The reductions presented to show this hardness yield the even stronger result which states that polynomial time approximations with some constant performance ratio for any of the nine variants of (||.||,f)-k-cluster require a restriction to instances for which the pairwise distance on the objects satisfies the triangle inequality. For this restriction to what we informally refer to as metric instances, constant-factor approximation algorithms for eight of the nine variants of (||.||,f)-k-cluster are presented. While two of these algorithms yield the provably best approximation ratio (assuming P!=NP), others can only guarantee a performance which depends on the lower bound k. With the positive effect of the triangle inequality and applications to facility location in mind, we discuss the further restriction to the setting where the given objects are points in the Euclidean metric space. Considering the effect of computational hardness caused by high dimensionality of the input for other related problems (curse of dimensionality) we check if this is also the source of intractability for (||.||,f)-k-cluster. Remaining NP-hardness for restriction to small constant dimensionality however disproves this theory. We then use parameterisation to develop approximation algorithms for (||.||,f)-k-cluster without restriction to metric instances. In particular, we discuss structural parameters which reflect how much the given input differs from a metric. This idea results in parameterised approximation algorithms with parameters such as the number of conflicts (our name for pairs of objects for which the triangle inequality is violated) or the number of conflict vertices (objects involved in a conflict). The performance ratios of these parameterised approximations are in most cases identical to those of the approximations for metric instances. This shows that for most variants of (||.||,f)-k-cluster efficient and reasonable solutions are also possible for non-metric instances.

Digital libraries have become a central aspect of our live. They provide us with an immediate access to an amount of data which has been unthinkable in the past. Support of computers and the ability to aggregate data from different libraries enables small projects to maintain large digital collections on various topics. A central aspect of digital libraries is the metadata -- the information that describes the objects in the collection. Metadata are digital and can be processed and studied automatically. In recent years, several studies considered different aspects of metadata. Many studies focus on finding defects in the data. Specifically, locating errors related to the handling of personal names has drawn attention. In most cases the studies concentrate on the most recent metadata of a collection. For example, they look for errors in the collection at day X. This is a reasonable approach for many applications. However, to answer questions such as when the errors were added to the collection we need to consider the history of the metadata itself. In this work, we study how the history of metadata can be used to improve the understanding of a digital library. To this goal, we consider how digital libraries handle and store their metadata. Based in this information we develop a taxonomy to describe available historical data which means data on how the metadata records changed over time. We develop a system that identifies changes to metadata over time and groups them in semantically related blocks. We found that historical meta data is often unavailable. However, we were able to apply our system on a set of large real-world collections. A central part of this work is the identification and analysis of changes to metadata which corrected a defect in the collection. These corrections are the accumulated effort to ensure data quality of a digital library. In this work, we present a system that automatically extracts corrections of defects from the set of all modifications. We present test collections containing more than 100,000 test cases which we created by extracting defects and their corrections from DBLP. This collections can be used to evaluate automatic approaches for error detection. Furthermore, we use these collections to study properties of defects. We will concentrate on defects related to the person name problem. We show that many defects occur in situations where very little context information is available. This has major implications for automatic defect detection. We also show that properties of defects depend on the digital library in which they occur. We also discuss briefly how corrected defects can be used to detect hidden or future defects. Besides the study of defects, we show that historical metadata can be used to study the development of a digital library over time. In this work, we present different studies as example how historical metadata can be used. At first we describe the development of the DBLP collection over a period of 15 years. Specifically, we study how the coverage of different computer science sub fields changed over time. We show that DBLP evolved from a specialized project to a collection that encompasses most parts of computer science. In another study we analyze the impact of user emails to defect corrections in DBLP. We show that these emails trigger a significant amount of error corrections. Based on these data we can draw conclusions on why users report a defective entry in DBLP.

Automata theory is the study of abstract machines. It is a theory in theoretical computer science and discrete mathematics (a subject of study in mathematics and computer science). The word automata (the plural of automaton) comes from a Greek word which means "self-acting". Automata theory is closely related to formal language theory [99, 101]. The theory of formal languages constitutes the backbone of the field of science now generally known as theoretical computer science. This thesis aims to introduce a few types of automata and studies then class of languages recognized by them. Chapter 1 is the road map with introduction and preliminaries. In Chapter 2 we consider few formal languages associated to graphs that has Eulerian trails. We place few languages in the Chomsky hierarchy that has some other properties together with the Eulerian property. In Chapter 3 we consider jumping finite automata, i. e., finite automata in which input head after reading and consuming a symbol, can jump to an arbitrary position of the remaining input. We characterize the class of languages described by jumping finite automata in terms of special shuffle expressions and survey other equivalent notions from the existing literature. We could also characterize some super classes of this language class. In Chapter 4 we introduce boustrophedon finite automata, i. e., finite automata working on rectangular shaped arrays (i. e., pictures) in a boustrophedon mode and we also introduce returning finite automata that reads the input, line after line, does not alters the direction like boustrophedon finite automata i. e., reads always from left to right, line after line. We provide close relationships with the well-established class of regular matrix (array) languages. We sketch possible applications to character recognition and kolam patterns. Chapter 5 deals with general boustrophedon finite automata, general returning finite automata that read with different scanning strategies. We show that all 32 different variants only describe two different classes of array languages. We also introduce Mealy machines working on pictures and show how these can be used in a modular design of picture processing devices. In Chapter 6 we compare three different types of regular grammars of array languages introduced in the literature, regular matrix grammars, (regular : regular) array grammars, isometric regular array grammars, and variants thereof, focusing on hierarchical questions. We also refine the presentation of (regular : regular) array grammars in order to clarify the interrelations. In Chapter 7 we provide further directions of research with respect to the study that we have done in each of the chapters.

In dieser Arbeit präsentieren wir einen systematischen Ansatz zur Multithread Implementierung von Divide und Conquer sowie inkrementellen Algorithmen. Wir stellen für beide Kategorien von Algorithmen ein Framework vor. Beide Frameworks sollen die Behandlung von Threads erleichtern, die Implementierung von parallelen Algorithmen beschleunigen und die Rechenzeit verringern. Mit Hilfe der Frameworks parallelisieren wir beispielhaft zahlreiche Algorithmen insbesondere aus dem Bereich Computational Geometry, unter anderem: Sortieralgorithmen, konvexe Hülle Algorithmen und Triangulierungen. Der Programmcode zu diese Arbeit ist in C++ unter Verwendung templatisierter Klassen implementiert und basiert auf der LEDA Bibliothek.

We are living in a connected world, surrounded by interwoven technical systems. Since they pervade more and more aspects of our everyday lives, a thorough understanding of the structure and dynamics of these systems is becoming increasingly important. However - rather than being blueprinted and constructed at the drawing board - many technical infrastructures like for example the Internet's global router network, the World Wide Web, large scale Peer-to-Peer systems or the power grid - evolve in a distributed fashion, beyond the control of a central instance and influenced by various surrounding conditions and interdependencies. Hence, due to this increase in complexity, making statements about the structure and behavior of tomorrow's networked systems is becoming increasingly complicated. A number of failures has shown that complex structures can emerge unintentionally that resemble those which can be observed in biological, physical and social systems. In this dissertation, we investigate how such complex phenomena can be controlled and actively used. For this, we review methodologies stemming from the field of random and complex networks, which are being used for the study of natural, social and technical systems, thus delivering insights into their structure and dynamics. A particularly interesting finding is the fact that the efficiency, dependability and adaptivity of natural systems can be related to rather simple local interactions between a large number of elements. We review a number of interesting findings about the formation of complex structures and collective dynamics and investigate how these are applicable in the design and operation of large scale networked computing systems. A particular focus of this dissertation are applications of principles and methods stemming from the study of complex networks in distributed computing systems that are based on overlay networks. Here we argue how the fact that the (virtual) connectivity in such systems is alterable and widely independent from physical limitations facilitates a design that is based on analogies between complex network structures and phenomena studied in statistical physics. Based on results about the properties of scale-free networks, we present a simple membership protocol by which scale-free overlay networks with adjustable degree distribution exponent can be created in a distributed fashion. With this protocol we further exemplify how phase transition phenomena - as occurring frequently in the domain of statistical physics - can actively be used to quickly adapt macroscopic statistical network parameters which are known to massively influence the stability and performance of networked systems. In the case considered in this dissertation, the adaptation of the degree distribution exponent of a random, scale-free overlay allows - within critical regions - a change of relevant structural and dynamical properties. As such, the proposed scheme allows to make sound statements about the relation between the local behavior of individual nodes and large scale properties of the resulting complex network structures. For systems in which the degree distribution exponent cannot easily be derived for example from local protocol parameters, we further present a distributed, probabilistic mechanism which can be used to monitor a network's degree distribution exponent and thus to reason about important structural qualities. Finally, the dissertation shifts its focus towards the study of complex, non-linear dynamics in networked systems. We consider a message-based protocol which - based on the Kuramoto model for coupled oscillators - achieves a stable, global synchronization of periodic heartbeat events. The protocol's performance and stability is evaluated in different network topologies. We further argue that - based on existing findings about the interrelation between spectral network properties and the dynamics of coupled oscillators - the proposed protocol allows to monitor structural properties of networked computing systems. An important aspect of this dissertation is its interdisciplinary approach towards a sensible and constructive handling of complex structures and collective dynamics in networked systems. The associated investigation of distributed systems from the perspective of non-linear dynamics and statistical physics highlights interesting parallels both to biological and physical systems. This foreshadows systems whose structures and dynamics can be analyzed and understood in the conceptual frameworks of statistical physics and complex systems.

This thesis centers on formal tree languages and on their learnability by algorithmic methods in abstractions of several learning settings. After a general introduction, we present a survey of relevant definitions for the formal tree concept as well as special cases (strings) and refinements (multi-dimensional trees) thereof. In Chapter 3 we discuss the theoretical foundations of algorithmic learning in a specific type of setting of particular interest in the area of Grammatical Inference where the task consists in deriving a correct formal description for an unknown target language from various information sources (queries and/or finite samples) in a polynomial number of steps. We develop a parameterized meta-algorithm that incorporates several prominent learning algorithms from the literature in order to highlight the basic routines which regardless of the nature of the information sources have to be run through by all those algorithms alike. In this framework, the intended target descriptions are deterministic finite-state tree automata. We discuss the limited transferability of this approach to another class of descriptions, residual finite-state tree automata, for which we propose several learning algorithms as well. The learnable class by these techniques corresponds to the class of regular tree languages. In Chapter 4we outline a recent range of attempts in Grammatical Inference to extend the learnable language classes beyond regularity and even beyond context-freeness by techniques based on syntactic observations which can be subsumed under the term 'distributional learning', and we describe learning algorithms in several settings for the tree case taking this approach. We conclude with some general reflections on the notion of learning from structural information.

This work is concerned with two kinds of objects: regular expressions and finite automata. These formalisms describe regular languages, i.e., sets of strings that share a comparatively simple structure. Such languages - and, in turn, expressions and automata - are used in the description of textual patterns, workflow and dependence modeling, or formal verification. Testing words for membership in any given such language can be implemented using a fixed - i.e., finite - amount of memory, which is conveyed by the phrasing finite-automaton. In this aspect they differ from more general classes, which require potentially unbound memory, but have the potential to model less regular, i.e., more involved, objects. Other than expressions and automata, there are several further formalisms to describe regular languages. These formalisms are all equivalent and conversions among them are well-known.However, expressions and automata are arguably the notions which are used most frequently: regular expressions come natural to humans in order to express patterns, while finite automata translate immediately to efficient data structures. This raises the interest in methods to translate among the two notions efficiently. In particular,the direction from expressions to automata, or from human input to machine representation, is of great practical relevance. Probably the most frequent application that involves regular expressions and finite automata is pattern matching in static text and streaming data. Common tools to locate instances of a pattern in a text are the grep application or its (many) derivatives, as well as awk, sed and lex. Notice that these programs accept slightly more general patterns, namely ''POSIX expressions''. Concerning streaming data, regular expressions are nowadays used to specify filter rules in routing hardware.These applications have in common that an input pattern is specified in form a regular expression while the execution applies a regular automaton. As it turns out, the effort that is necessary to describe a regular language, i.e., the size of the descriptor,varies with the chosen representation. For example, in the case of regular expressions and finite automata, it is rather easy to see that any regular expression can be converted to a finite automaton whose size is linear in that of the expression. For the converse direction, however, it is known that there are regular languages for which the size of the smallest describing expression is exponential in the size of the smallest describing automaton.This brings us to the subject at the core of the present work: we investigate conversions between expressions and automata and take a closer look at the properties that exert an influence on the relative sizes of these objects.We refer to the aspects involved with these consideration under the titular term of Relative Descriptional Complexity.

Virtuelle Umgebungen erfreuen sich in den letzten Jahren einer großen Beliebtheit und können kontinuierlich wachsende Nutzerzahlen vorweisen. Deshalb wird erwartet, dass die ohnehin große Nutzergemeinde weiterhin wächst. Dieses Wachstum stellt jedoch die Entwickler und die Betreiber von virtuellen Umgebungen vor eine Herausforderung. Die meisten heutzutage erfolgreichen virtuellen Umgebungen basieren auf einer Client-Server-Infrastruktur, in der der Server für die Verwaltung der gesamten Umgebung zuständig ist. Bei einer gleichzeitigen Inanspruchnahme durch viele Nutzer werden die serverbasierten Umgebungen jedoch massiven Skalierbarkeits- und Lastverteilungsproblemen ausgesetzt. Diese Probleme führen beispielsweise dazu, dass Nutzer Wartezeiten beim Einloggen in Kauf nehmen müssen bzw. selbst im Falle eines erfolgreichen Logins die "überfüllten" virtuellen Regionen nicht betreten dürfen. Deshalb wird in dieser Arbeit untersucht, ob eine Verbesserung der Skalierbarkeit und der Lastverteilung in virtuellen Umgebungen durch die Verwendung alternativer Infrastrukturansätze erreicht werden kann. Außerdem werden Maßnahmen zur weiteren Entlastung der Basis-Infrastruktur entwickelt. Diese Verbesserung soll zum einen dadurch erzielt werden, dass anstelle eines Servers ein Server-Verbund (Peer-to-Peer-System) als eine Art Management-Schicht agieren soll, in der jeder Knoten nur einen bestimmten Teil der virtuellen Umgebung verwaltet. Dabei wird die gewünschte Lastverteilung durch die Aufteilung der Verantwortungsbereiche auf die Knoten der Management-Schicht erreicht. Gleichzeitig entsteht aber ein Mehraufwand für die Konstruktion bzw. Erhaltung einer solchen Management-Schicht. Deshalb werden in dieser Arbeit die Vor- und Nachteile eines Infrastrukturwechsels hin zu dezentralisierten Infrastruktur-Ansätzen untersucht. Zum anderen wird eine spürbare Entlastung der Management-Schicht und somit eine Verbesserung der Skalierbarkeit und der Lastverteilung durch die Übertragung einiger Verwaltungsaufgaben an die Clients angestrebt. Zu diesen Verwaltungsaufgaben zählen die Gruppenkommunikation und das Konsistenz-Handling. Dazu wird in dieser Arbeit erforscht, inwiefern die Clients, die sich innerhalb derselben virtuellen Region befinden, selbst die Gruppenkommunikation über eine Multicast-Infrastruktur regeln können bzw. in die Wahrung der Konsistenz eines gemeinsamen Zustandes und der gemeinsam genutzten Objekte involviert werden können. Bei dem clientbasierten Konsistenz-Handling sollte vor allem auf die besonderen Interaktivitäts- und Konsistenzanforderungen, die Nutzer an die virtuellen Umgebungen stellen, geachtet werden. Dazu wird ein neues Konsistenzmodell definiert, welches die Einhaltung der strengen Interaktivitätsanforderungen erzwingt und die Konsistenz mit einer bestimmten Wahrscheinlichkeit garantieren kann.