Muhammad Elhossary (Supervisors: Förstner, Stoye)
In this project, we aim to identify, annotate and characterize novel sRNAs from a diverse set of microbes from the class of Gammaproteobacteria (a total of 20 species). Samples will be collected from four different growth conditions, including iron limitation and cell membrane stress, to ensure that sRNAs expressed under a broad range of environmental settings will be detected. Focusing on sRNA regulators that depend on the RNA chaperone Hfq for their function and act through base-pairing with target mRNAs, we will study their distribution as well as their evolution.
The figure shows an example of a small RNA regulatory network (source: Wagner and Romby 2015).
Benedikt Osterholz (Supervisors: Sczyrba, Schlüter)
In anaerobic digestion of biomass, a huge number of microbial species with a wide variety of metabolic properties are involved. However, a major part of the species that can be detected in biogas reactors has not been adequately characterized, either in terms of specific substance conversion properties or in terms of ecological role in the microbiological system. Accordingly, the trophic network responsible for the degradation of crop biomass in biogas reactors is understood only piecemeal and only in terms of basic microbial processes.
The aim of this project is to use high-throughput molecular data for a detailed representation of microbial networks by means of a comprehensive bioinformatic evaluation (meta-analysis) that includes abiotic process factors. Established bioinformatics solutions and concepts will be reimplemented to make optimal use of available de.NBI cloud resources in order to identify the core microbiome of biogas communities, determine unique taxa for specific communities and elucidate relationships between taxonomic units. The results are expected to contribute to the identification and characterization of key organisms, helping to better understand and improve the biogas process as a whole.
Andreas Rempel (Supervisors: Stoye, Förstner)
A pangenome is a collection of genomic sequences from different individuals. It holds information on conserved regions, local polymorphisms, and structural variations and can provide insights into genomic differences and evolutionary relationships. There are different data structures used for the storage and comparison of the sequences, such as colored De Bruijn Graphs, Variation Graphs, or Sequence Bloom Trees. The aim of this project is to compare existing software tools for computational pangenomics, to define a common standard interface for the data structures, and to set up an automated (cloud-based) test environment to evaluate their performance and to support users in finding the tool that suits their demands best.
Tom Tubbesing (Supervisors: Sczyrba, Schlüter)
Analysing the transcriptome of microbial strains to identify Differentially Expressed Genes (DEGs) is a common approach. The DESeq2 package (Love et al., 2014) is well established for carrying out this kind of analysis based on count data from RNA sequencing experiments. However, when studying microbial communities, reliably identifying DEGs based on metatranscriptome sequencing datasets is complicated by the fact that the abundances of microbial taxa vary between sampling conditions. This project is aimed at implementing a comprehensive software workflow for the analysis of such datasets and using it to expand the Elastic MetaGenome Browser (EMGB) platform.
Emanuel Lange (Supervisors: Heyer, Nattkemper)
Physiological processes in living cells are controlled by metabolism, signaling, and regulatory networks. Existing knowledge of these biological networks can be compiled into mechanistic models, which facilitate comprehension, prediction, and optimization of cellular processes. Model predictions can be improved by incorporating omics data into them. At the same time, analysis and interpretation of omics data can benefit from model predictions. However, models are rarely used in experimental studies generating omics data.
The first objective of my project is to develop strategies to integrate models and omics data with two potential applications: the investigation of cancer metabolism and the investigation of signaling in neutrophil granulocyte migration. My second objective is to make modeling more accessible for experimentalists so that models can be established for experimental studies. To achieve this goal, I plan to implement user-friendly capabilities for modeling and data analysis, as well as algorithms for network visualization, in our “MPA-Pathway-Tool” (Walke et al., 2021). The “MPA-Pathway-Tool” is a web application that already supports pathway mapping and metabolic modeling.
Manuel Feser (Supervisors: Scholz, Sczyrba)
The main objective of this project is to convert the molecular passport data (diversity matrix) of selected genebank accessions (Plant Genetic Resources) into a data structure that can be stored and used for analyses. For example, a user may have a diversity vector of a genotype of interest and wish to find the phylogenetically closest genebank material from a particular geographical region or with particular traits. To increase the power of the analysis, an imputation service called DivImpute is being developed. This will increase the marker density and enrich the input for the subsequent phylogenetic similarity search. DivImpute is designed as a cloud-enabled pipeline that minimizes the cost of imputation by distributing the computational load, with the input split into overlapping genome windows.
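The splitting of the input into overlapping genome windows can be illustrated with a minimal sketch. The function name `overlapping_windows` and the parameters `window_size` and `overlap` are hypothetical and chosen only for illustration; they are not taken from DivImpute, whose actual windowing scheme is not described here.

```python
def overlapping_windows(n_markers, window_size, overlap):
    """Return half-open (start, end) index ranges that cover n_markers.

    Consecutive windows share `overlap` markers, so each window could be
    processed independently (e.g. on a separate worker) and still see some
    context from its neighbour. All parameter names are assumptions.
    """
    if window_size <= 0 or not 0 <= overlap < window_size:
        raise ValueError("need window_size > 0 and 0 <= overlap < window_size")
    if n_markers <= 0:
        return []
    step = window_size - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window_size, n_markers)
        windows.append((start, end))
        if end == n_markers:
            break
        start += step
    return windows


# Ten markers, windows of four markers, one marker shared between neighbours.
example_windows = overlapping_windows(10, 4, 1)
print(example_windows)
```

Each window could then be handed to a separate job, and results in the shared regions reconciled afterwards; how that reconciliation is done is outside the scope of this sketch.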
Leonard Bohnenkämper (Supervisors: Stoye, Bräutigam)
Genome rearrangements have been studied extensively in theoretical works of Comparative Genomics. These results, however, have only been applied on a limited scale to real genomes. The continuing progress of sequencing projects and technology has made more and more high-quality genomes available and has even enabled Pangenomic analyses, that is, analyses that include all available genomes of a species. Pangenomics and theoretical Rearrangement Studies utilize remarkably similar graph data structures. Given the abundance of theoretical results in Comparative Genomics, it is likely that many of these results can be applied in Pangenomics. Conversely, the abundance of practical results in the construction of Pangenome graphs can likely help these theoretical results see more real-world applications.
Luna Pianesi (Supervisor: Schönhuth)
Drug discovery has long needed a faster process, and computational drug design might come to the rescue. By exploiting the vast potential of artificial neural networks, and particularly Graph Neural Networks, one could devise a fully computational drug design pipeline to address the huge amount of work that goes into pre-clinical studies. Developing fruitful partnerships between artificial intelligence and biology can lead to a flexible method that is able to produce hundreds of novel drug candidates for a large variety of biological targets. This study currently focuses on de novo design of ATP-competitive small-molecule inhibitors for the cancer-inducing dysregulation of the Aurora protein kinase.
Luca Parmigiani (Supervisor: Stoye)
Pangenomics aims to capture the collective genomic diversity of taxonomically related genomes, typically from the same species, offering potential insights for medicine and biology. Initially defined as collections of genes, pangenomes are now more accurately represented through variation graphs. These graphs illustrate genomes by using nodes to depict shared sequences, edges to denote sequential connections, and paths to reconstruct the original genomes. Despite their benefits, there is a lack of scalable software for essential analyses, such as estimating the core genome size and evaluating the extent of genomic variability (pangenome growth). To address this need, we developed Panacus (pangenome-abacus), a tool designed for efficient data extraction from GFA files, the standard for pangenome graphs. Panacus facilitates quick generation of pangenome growth and core curves, handling millions of nodes in variation graphs within minutes.
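To make the notions of pangenome growth and core curves concrete, the following is a minimal sketch that reads path (`P`) lines from GFA 1 text and counts nodes for one fixed order of genomes. It is not the Panacus implementation: the function names and the toy graph are assumptions made for illustration, node orientation is ignored, and counts are in nodes rather than base pairs. A curve that does not depend on the input order would require aggregating over several orderings, which is omitted here.

```python
def parse_gfa_paths(gfa_text):
    """Map each path name to the set of segment IDs it visits."""
    paths = {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "P" and len(fields) >= 3:
            # Each step is a segment name followed by '+' or '-'.
            paths[fields[1]] = {step[:-1] for step in fields[2].split(",") if step}
    return paths


def growth_and_core(paths, order):
    """Cumulative node counts (growth) and shared node counts (core)."""
    seen = set()      # nodes present in at least one genome so far
    shared = None     # nodes present in every genome so far
    growth, core = [], []
    for name in order:
        nodes = paths[name]
        seen |= nodes
        shared = set(nodes) if shared is None else shared & nodes
        growth.append(len(seen))
        core.append(len(shared))
    return growth, core


# Toy graph with four segments and three genome paths (hypothetical data).
EXAMPLE_GFA = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tG",
    "S\t3\tT",
    "S\t4\tCCA",
    "P\tgenomeA\t1+,2+,4+\t*",
    "P\tgenomeB\t1+,3+,4+\t*",
    "P\tgenomeC\t1+,2+\t*",
])

paths = parse_gfa_paths(EXAMPLE_GFA)
growth, core = growth_and_core(paths, ["genomeA", "genomeB", "genomeC"])
print(growth, core)
```

In this toy example the growth curve flattens once no new nodes appear, while the core curve shrinks as each added genome removes nodes that it does not contain.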
Vanessa Scharf (Supervisors: Förstner, Cimiano)
Bacterial small non-coding RNAs are important molecular regulators and help bacteria as well as archaea to adapt efficiently to environmental conditions. Alongside other regulators, they are part of complex molecular networks in those species but are often neglected when these networks are modelled. Their detection and characterization, as well as the description of interaction partners (other RNAs or proteins), is very costly, and the results of such studies are often only deposited in small databases or found individually in scholarly articles. This project aims to model sRNAs and their features as well as the interaction topology in an open knowledge graph. This will make the knowledge about these regulators sustainably accessible, easy to query and connected to further databases. Having the knowledge of these regulators for several species in such a machine-readable form and linked to other databases will make it possible to apply graph machine learning approaches and thereby to perform link prediction and further analysis methods. With this approach, the project aims to conduct network inference and to reveal so far unknown regulatory interactions while also making information about sRNAs efficiently searchable and accessible.
Hanna Marie Schilbert (Supervisors: Weisshaar, Holtgräwe, Stoye)
The increasing demand for high-quality plant-based food products requires the generation of improved crops. To achieve this aim, the molecular basis of relevant seed quality parameters needs to be analysed in relevant crop species. By harnessing large genomic and transcriptomic data sets, loci and genes associated with seed oil, seed protein, and antinutrient content will be identified in rapeseed (Brassica napus). The development of dedicated tools will help to facilitate the automatic analysis of the genes involved and the enzymes they encode, and will provide predictions of their functions. Tools will be made freely available on GitHub, e.g. KIPEs (Knowledge-Based Identification of Pathway Enzymes).
Figure source: Pucker, Reiher, Schilbert 2020
Lisa Kühnel (Supervisors: Fluck, Cimiano)
The enormous growth in electronic research data requires semantic interoperability and computational methods to generate information and knowledge. However, heterogeneous, access-restricted, non-standardized and low-quality data hamper data-driven research in the (bio)medical area. This thesis investigates the application of computational methods to convert (bio)medical data and information into accessible, machine-readable formats with the aim of supporting researchers. The focus lies on two data types: 1) For biomedical literature, this work investigates the robustness of state-of-the-art NLP methods to allow the transfer from science to services. 2) For clinical data, the reliability of synthetic data generation algorithms is examined based on a defined use case.
Machine readability and access to data, information and knowledge are core requirements for data-driven research. Furthermore, the enormous growth in freely available, electronic research data increases the need for semantic interoperability as well as computational methods to generate new information and knowledge from the data.
Tizian Schulz (Supervisors: Stoye, Hach)
The increasing number of individual genomes sequenced per species motivates the use of pangenomic approaches. Pangenomes may be represented as graph structures, e.g., compacted colored de Bruijn graphs, which offer low memory usage and facilitate reference-free sequence comparisons. Here we develop a new heuristic method to find all maximum scoring local alignments between a DNA query sequence and a pangenome represented as a compacted colored de Bruijn graph. Furthermore, we introduce the notions of quorum and search color set, which allow searches to be concentrated on any part of the pangenome. The source code of our implementation and test data are available on GitLab.
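The idea of colors, a search color set and a quorum can be sketched on a plain (uncompacted) k-mer index. This is not the alignment method described above and performs no alignment at all: it only shows exact k-mer seed lookup. It also assumes one plausible reading of the two notions, namely that the search color set restricts which genomes are considered and that the quorum is the minimum number of those genomes in which a k-mer must occur. All names and the toy sequences are hypothetical.

```python
from collections import defaultdict


def build_colored_kmer_index(genomes, k):
    """Map every k-mer to the set of genome names (colors) containing it."""
    index = defaultdict(set)
    for color, sequence in genomes.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].add(color)
    return index


def seed_hits(query, index, k, search_colors, quorum):
    """Return (position, k-mer, colors) for query k-mers that occur in at
    least `quorum` genomes of the search color set (assumed semantics)."""
    hits = []
    for i in range(len(query) - k + 1):
        kmer = query[i:i + k]
        colors = index.get(kmer, set()) & search_colors
        if len(colors) >= quorum:
            hits.append((i, kmer, sorted(colors)))
    return hits


# Toy pangenome with three genomes (hypothetical sequences).
genomes = {"g1": "ACGTAC", "g2": "ACGTTT", "g3": "GGGTAC"}
index = build_colored_kmer_index(genomes, k=3)

# Restrict the search to g1 and g2 and require presence in both.
hits = seed_hits("ACGTA", index, k=3, search_colors={"g1", "g2"}, quorum=2)
print(hits)
```

Here the k-mer `GTA` is dropped because, within the chosen search color set, it occurs only in `g1`, even though it is also present in `g3`.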
Donat Wulf (Supervisors: Bräutigam, Sczyrba)
Gene regulation is an important mechanism for organisms to react to changing environmental conditions. These regulatory mechanisms are governed by transcription factors and organized in a gene regulatory network (GRN). Machine learning enables the inference of these networks. In this project, I develop methods to establish GRNs and to analyze them using gene ontology enrichment tools. I compare GRNs and transcription factor binding sites by high-throughput construction of phylogenetic trees and intragroup analyses. GRNs are validated by DAP-seq and EMSA.