• Graduate School DILS

    "Digital Infrastructure for the Life Sciences" — Graduate School of the Bielefeld Institute for Bioinformatics Infrastructure (BIBI) more about BIBI


Research Projects

Functional genomics of and bioinformatic analysis tools for seed quality parameters in rapeseed

Hanna Marie Schilbert (Supervisors: Weisshaar, Holtgräwe, Stoye)

The increasing demand for high-quality plant-based food products requires the generation of improved crops. To achieve this aim, the molecular basis of relevant seed quality parameters needs to be analysed in relevant crop species. By harnessing large genomic and transcriptomic data sets, loci and genes associated with seed oil, seed protein, and antinutrient content will be identified in rapeseed (Brassica napus). The development of dedicated tools will facilitate the automatic analysis of the genes involved and their encoded enzymes and provide predictions of their functions. Tools, e.g. KIPEs (Knowledge-Based Identification of Pathway Enzymes), will be made freely available on GitHub.

Figure source: Pucker, Reiher, Schilbert 2020


From hidden data and information towards data-driven research

Lisa Kühnel (née Langnickel) (Supervisors: Fluck, Cimiano)

The enormous growth in electronic research data requires semantic interoperability and computational methods to generate information and knowledge. However, heterogeneity, restricted access, and non-standardized, low-quality data hamper data-driven research in the (bio)medical area. This thesis investigates the application of computational methods to convert (bio)medical data and information into accessible, machine-readable formats with the aim of supporting researchers. The focus lies on two data types: 1) For biomedical literature, this work investigates the robustness of state-of-the-art NLP methods to enable the transfer from science to services. 2) For clinical data, the reliability of synthetic data generation algorithms is examined on the basis of a defined use case.

Machine readability and access to data, information and knowledge are core requirements for data-driven research. Furthermore, the enormous growth in freely available, electronic research data increases the need for semantic interoperability as well as computational methods to generate new information and knowledge from the data.


Comparative analysis of small RNA regulatory networks in Gammaproteobacteria

Muhammad Elhossary (Supervisors: Förstner, Stoye)


In this project, we aim to identify, annotate and characterize novel sRNAs from a diverse set of microbes from the class Gammaproteobacteria (a total of 20 species). Samples will be collected from four different growth conditions, including iron limitation and cell membrane stress, to ensure that sRNAs expressed under a broad range of environmental settings will be detected. Focusing on sRNA regulators that depend on the RNA chaperone Hfq for their function and act through base-pairing with target mRNAs, we will study their distribution as well as their evolution.

The figure shows an example of a small RNA regulatory network (source: Wagner and Romby 2015).

Characterizing biogas microbiomes by meta-analysis of metagenomes

Benedikt Osterholz (Supervisors: Sczyrba, Schlüter)

The anaerobic digestion of biomass involves a huge number of microbial species with a wide variety of metabolic properties. However, a major part of the species that can be detected in biogas reactors has not been adequately characterized, either in terms of specific substance conversion properties or in terms of ecological role in the microbiological system. Accordingly, the trophic network responsible for the degradation of crop biomass in biogas reactors is understood only piecemeal and only in terms of basic microbial processes.

The aim of this project is to use high-throughput molecular data for a detailed representation of microbial networks by means of a comprehensive bioinformatic evaluation (meta-analysis) that includes abiotic process factors. Established bioinformatics solutions and concepts will be reimplemented to make optimal use of available de.NBI cloud resources in order to identify the core microbiome of biogas communities, determine unique taxa for specific communities, and elucidate relationships between taxonomic units. The results are expected to contribute to the identification and characterization of key organisms, and thus to a better understanding and improvement of the biogas process as a whole.
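The two set-based questions named above (which taxa form the core microbiome, and which taxa are unique to one community) can be illustrated with a minimal Python sketch. It is not the project's pipeline: the function names, the prevalence threshold, and the placeholder taxa and reactor names are assumptions made for illustration only.

```python
def core_taxa(samples, min_prevalence=1.0):
    """Return taxa found in at least `min_prevalence` (fraction) of all samples."""
    counts = {}
    for taxa in samples.values():
        for taxon in set(taxa):
            counts[taxon] = counts.get(taxon, 0) + 1
    n_samples = len(samples)
    return {t for t, c in counts.items() if c / n_samples >= min_prevalence}


def unique_taxa(samples):
    """For each sample, return the taxa that occur in no other sample."""
    result = {}
    for name, taxa in samples.items():
        others = set().union(
            *(set(v) for other, v in samples.items() if other != name)
        )
        result[name] = set(taxa) - others
    return result


# Toy presence data with placeholder names (hypothetical, not measured).
samples = {
    "reactor_1": {"taxon_a", "taxon_b", "taxon_c"},
    "reactor_2": {"taxon_a", "taxon_b"},
    "reactor_3": {"taxon_a", "taxon_d"},
}

strict_core = core_taxa(samples)        # present in every reactor
relaxed_core = core_taxa(samples, 0.6)  # present in at least 60% of reactors
unique = unique_taxa(samples)
```

A relaxed prevalence threshold is often more practical than strict presence in every sample, because rare taxa can be missed at low sequencing depth; the value 0.6 here is arbitrary.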

Software for computational pangenomics

Andreas Rempel (Supervisors: Stoye, Förstner)

A pangenome is a collection of genomic sequences from different individuals. It holds information on conserved regions, local polymorphisms, and structural variations and can provide insights into genomic differences and evolutionary relationships. Different data structures are used for the storage and comparison of the sequences, such as colored de Bruijn graphs, variation graphs, or sequence Bloom trees. The aim of this project is to compare existing software tools for computational pangenomics, to define a common standard interface for the data structures, and to set up an automated (cloud-based) test environment to evaluate their performance and to help users find the tool that best suits their demands.

Applications of colored de Bruijn graphs

Tizian Schulz (Supervisors: Stoye, Hach)

The increasing number of individual genomes sequenced per species motivates the use of pangenomic approaches. Pangenomes may be represented as graph structures, e.g., compacted colored de Bruijn graphs, which offer low memory usage and facilitate reference-free sequence comparisons. Here we develop a new heuristic method to find all maximum-scoring local alignments between a DNA query sequence and a pangenome represented as a compacted colored de Bruijn graph. Furthermore, we introduce the notions of quorum and search color set, which allow searches to be concentrated on any part of the pangenome. The source code of our implementation and test data are available on GitLab.
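To make the terms concrete, the following Python sketch stores the color information of a (non-compacted) de Bruijn graph as a map from each k-mer to the set of genomes containing it, and then filters k-mers by a search color set and a quorum. It is a simplified illustration, not the project's implementation: it ignores reverse complements and compaction, and the reading of "quorum" as the minimum number of genomes from the search color set that must share a k-mer is an assumption. All names are hypothetical.

```python
from collections import defaultdict


def build_colored_kmers(genomes, k):
    """Map every k-mer to the set of genome names (colors) that contain it."""
    colors = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return dict(colors)


def filter_by_quorum(colors, search_colors, quorum):
    """Keep k-mers found in at least `quorum` genomes of `search_colors`.

    Assumption: this interpretation of quorum is chosen for illustration.
    """
    search_colors = set(search_colors)
    return {
        kmer for kmer, cols in colors.items()
        if len(cols & search_colors) >= quorum
    }


# Toy sequences (hypothetical).
genomes = {"g1": "ACGTAC", "g2": "ACGTTC", "g3": "TTCGTA"}
colors = build_colored_kmers(genomes, k=3)

# Restrict the search to g1 and g2 and require both to contain the k-mer.
shared = filter_by_quorum(colors, {"g1", "g2"}, quorum=2)
```

Edges of the graph are implicit in this representation: two k-mers are adjacent when the last k-1 characters of one equal the first k-1 characters of the other.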

Bioinformatics solutions for microbiome meta-transcriptome analyses

Tom Tubbesing (Supervisors: Sczyrba, Schlüter)

Analysing the transcriptome of microbial strains to identify Differentially Expressed Genes (DEGs) is a common approach. The DESeq2 package (Love et al., 2014) is well established for carrying out this kind of analysis based on count data from RNA sequencing experiments. However, when studying microbial communities, reliably identifying DEGs from metatranscriptome sequencing datasets is complicated by the fact that the abundances of microbial taxa vary between sampling conditions. This project aims to implement a comprehensive software workflow for the analysis of such datasets and to use it to expand the Elastic MetaGenome Browser (EMGB) platform.
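One way to see why varying taxon abundance is a problem, and one possible remedy, is to scale each gene's count by the total count of its own taxon in the same sample. The Python sketch below shows this taxon-wise scaling on toy numbers. It is an assumption made for illustration, not a description of the workflow developed in this project, and it does not replace the statistical model of DESeq2; all names and counts are hypothetical.

```python
def taxon_scaled_counts(counts, gene_to_taxon):
    """Divide each gene count by the total count of its taxon in that sample.

    counts: {sample: {gene: raw_count}}
    gene_to_taxon: {gene: taxon}
    """
    scaled = {}
    for sample, gene_counts in counts.items():
        # Total reads assigned to each taxon in this sample.
        totals = {}
        for gene, count in gene_counts.items():
            taxon = gene_to_taxon[gene]
            totals[taxon] = totals.get(taxon, 0) + count
        scaled[sample] = {
            gene: (count / totals[gene_to_taxon[gene]]
                   if totals[gene_to_taxon[gene]] else 0.0)
            for gene, count in gene_counts.items()
        }
    return scaled


# Toy data: taxon A is twice as abundant in s2, but its expression
# profile is unchanged.
gene_to_taxon = {"geneA1": "A", "geneA2": "A", "geneB1": "B"}
counts = {
    "s1": {"geneA1": 10, "geneA2": 30, "geneB1": 5},
    "s2": {"geneA1": 20, "geneA2": 60, "geneB1": 5},
}
scaled = taxon_scaled_counts(counts, gene_to_taxon)
```

On the raw counts, geneA1 appears to double between s1 and s2; after taxon-wise scaling, its share of taxon A's expression is identical in both samples, so the change is attributed to abundance rather than regulation.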

Machine learning based analysis of crop regulatory networks

Donat Wulf (Supervisors: Bräutigam, Sczyrba)

Gene regulation is an important mechanism for organisms to react to changing environmental conditions. These regulatory mechanisms are governed by transcription factors and organized in a gene regulatory network (GRN). Machine learning enables the inference of these networks. In this project, I develop methods to establish GRNs and analyze them using gene ontology enrichment tools. I compare GRNs and transcription factor binding sites by high-throughput construction of phylogenetic trees and intragroup analyses. GRNs are validated by DAP-seq and EMSA.

Modeling of biological networks and development of user-friendly software for modeling applications

Emanuel Lange (Supervisors: Heyer, Nattkemper)

Physiological processes in living cells are controlled by metabolism, signaling, and regulatory networks. Existing knowledge of these biological networks can be compiled into mechanistic models, which facilitate comprehension, prediction, and optimization of cellular processes. Model predictions can be improved by incorporating omics data into them. At the same time, analysis and interpretation of omics data can benefit from model predictions. However, models are rarely used in experimental studies generating omics data.
The first objective of my project is to develop strategies to integrate models and omics data with two potential applications: The investigation of cancer metabolism, and the investigation of signaling in neutrophil granulocyte migration. My second objective is to make modeling more accessible for experimentalists to establish models for experimental studies. To achieve this goal, I plan to implement user-friendly capabilities for modeling, data analysis, and algorithms for network visualization into our “MPA-Pathway-Tool” (Walke et al., 2021). The “MPA-Pathway-Tool” is a web application already supporting pathway mapping and metabolic modeling.

Conversion of molecular passport data for phylogenetic analysis and accession selection

Manuel Feser (Supervisors: Scholz, Sczyrba)

The main objective of this project is to convert the molecular passport data (diversity matrix) of selected genebank accessions (Plant Genetic Resources) into a data structure that can be stored and used for analyses. For example, a user may have a diversity vector of a genotype of interest and wishes to find the phylogenetically closest genebank material from a particular geographical region or with particular traits. To increase the power of the analysis, an imputation service called DivImpute is developed. This will increase the marker density and enrich the input for the subsequent phylogenetic similarity search. DivImpute is designed as a cloud-enabled pipeline, minimizing the cost of imputation by distributing the computational load, with the input split into overlapping genome windows.
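The idea of splitting the input into overlapping genome windows, so that each window can be imputed as an independent job, can be sketched in a few lines of Python. This is an illustration under assumed parameters, not the DivImpute code: the function name, the coordinate convention (half-open, zero-based), and the window and overlap sizes are hypothetical.

```python
def overlapping_windows(length, window, overlap):
    """Return half-open (start, end) intervals that cover [0, length)."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if length <= 0:
        return []
    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, length)
        windows.append((start, end))
        if end == length:  # the last window reaches the end of the sequence
            break
        start += step
    return windows


# Toy example: a sequence of 10 positions, windows of 4 with 1 shared position.
windows = overlapping_windows(length=10, window=4, overlap=1)
```

Each interval could then be submitted as a separate cloud job. The overlap gives markers near a window border some flanking context on both sides; how the overlapping results are merged afterwards is left open here.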

Comparative Pangenomics

Leonard Bohnenkämper (Supervisors: Stoye, Bräutigam)

Genome rearrangements have been studied extensively in theoretical works of Comparative Genomics. These results, however, have only been applied on a limited scale to real genomes. The continuing progress of sequencing projects and technology has made more and more high-quality genomes available and has even enabled Pangenomic analyses, that is, analyses that include all available genomes of a species. Pangenomics and theoretical Rearrangement Studies utilize remarkably similar graph data structures. Given the abundance of theoretical results in Comparative Genomics, it is likely that many of these results can be applied in Pangenomics. Conversely, the abundance of practical results in the construction of Pangenome graphs can likely help these theoretical results see more real-world applications.

Graph Neural Networks and Explainable AI for a new Aurora kinase inhibitor

Luna Pianesi (Supervisor: Schönhuth)

Drug Discovery has long needed a speed-up of some sort in its process, and computational drug design might come to the rescue. By exploiting the vast potential of artificial neural networks - and particularly Graph Neural Networks - one could devise a fully computational drug design pipeline to address the huge work that goes into pre-clinical studies. Developing fruitful partnerships between artificial intelligence and biology can lead to a flexible method that is able to produce hundreds of novel drug candidates targeting a large variety of biological targets. This study currently focuses on de novo design of ATP-competitive small-molecule inhibitors for the cancer-inducing dysregulation of the Aurora protein kinase.