• Graduate School DILS

    "Digital Infrastructure for the Life Sciences" — Graduate School of the Bielefeld Institute for Bioinformatics Infrastructure (BIBI) more about BIBI

    Model of DNA double helix in front of a student.
    © iStock.com/carloscastilla

Research Projects

Functional genomics of and bioinformatic analyses tools for seed quality parameters in rapeseed

Hanna Marie Schilbert (Supervisors: Weisshaar, Holtgräwe, Stoye)

The increasing demand in high quality plant based food products requires the generation of improved crops. To achieve this aim, the molecular basis of relevant seed quality parameters needs to be analysed in relevant crop species. By harnessing large genomic and transcriptomic data sets, loci and genes associated with seed oil-, seed protein-, and antinutrients content will be identified in rapeseed (Brassica napus). The development of dedicated tools will help to facilitate the automatic analyses of involved genes and encoded enzymes and provide predictions for their functionalities. Tools will be made freely available on github, e.g. KIPEs (Knowledge-Based Identification of Pathway Enzymes).

Figure source: Pucker, Reiher, Schilbert 2020

© shutterstock.com/pingebat

From hidden data and information towards data-driven research

Lisa Kühnel (née Langnickel) (Supervisors: Fluck, Cimiano)

The enormous growth in electronic research data requires semantic interoperability and computational methods to generate information and knowledge. However, heterogeneity, restricted access, non-standardized and low-quality data hamper data-driven research in the (bio)medical area. This thesis investigates the application of computational methods to convert (bio)medical data and information into accessible, machine-readable formats with the aim to support researchers. Thereby, the focus lies on two data types: 1) For biomedical literature, this work investigates the robustness of state-of-the-art NLP methods to allow the transfer from science to services. 2) For clinical data, the reliability of synthetic data generation algorithms based on a defined use case is examined.

Machine readability and access to data, information and knowledge are core requirements for data-driven research. Furthermore, the enormous growth in freely available, electronic research data increases the need for semantic interoperability as well as computational methods to generate new information and knowledge from the data.


Comparitive analysis of small RNA regulatory networks in Gammaproteobacteria

Muhammad Elhossary (Supervisors: Förstner, Stoye)


In this project, we aim to identify, annotate and characterize novel sRNAs from a diverse set of microbes from the class of Gammaproteobacteria (a total of 20 species). Samples will be collected from four different growth conditions including iron-limitation and cell membrane stress to ensure that sRNAs expressed under a broad range of environmental settings will be detected. Focusing on sRNA regulators that depend on the RNA chaperon Hfq for their function and act through base-pairing with target mRNAs, we will study their distribution as well as their evolution.

The figure shows an example of small RNA regulatory network,

source: Wagner and Romby 2015

Computational pangenomics in plants

Katharina Sielemann (Supervisors: Weisshaar, Stoye)

Understanding the complete genomic diversity of larger taxonomic groups requires the effective comparison of genomes of many species. We follow a pangenomic approach. Comparative annotation of related high quality genome sequence assemblies will provide power to the identification of genes, pseudogenes, transposable elements and structural variation. Differences concerning gene copy number, presence/absence variation, centromeric/telomeric repeats and epigenetic DNA modifications will be assessed automatically. Integration of phylogenetic methods will allow the characterization of genomic features that confer unique properties (traits, phenotypes, etc.) to particular species.

Computational quality assessment of sequencing data and artifacts in the field of amplicon-based phylomic

Sebastian Jünemann (Supervisors: Stoye, Goesmann)

My dissertation covers three main topics. The first main topic is the design, application and comparison of bioinformatics pipelines for the analysis of 16S rRNA gene-based amplicon data sets. The second main topic is the systematic investigation of the data quality obtained by different next-generation sequencing techniques and the evaluation of assembly procedures. Furthermore, I focus on a specific quality aspect of amplicon data, the artificial chimeric formation of DNA strands during PCR, a fundamental step in amplicon sequencing. A newly developed bioinformatic method for the detection of such chimeras and the evaluation of these chimeras on data sets generated by current longread sequencing technologies will be the third main aspect of my dissertation.

Characterizing biogas microbioms by meta analysis of metagenomes

Benedikt Osterholz (Supervisors: Sczyrba, Schlüter)

In anaerobic digestion of biomass, a huge number of microbial species is involved possessing a wide variety of metabolic properties. However, a major part of the species that can be detected in biogas reactors has not been adequately characterized either in terms of its specific substance conversion properties or in terms of its respective ecological role in the microbiological system. Accordingly, the trophic network responsible for the degradation of crop biomass in biogas reactors is understood to be only piecemeal and only in terms of basic microbial processes.

The aim of this project is the use of high-throughput molecular data for a detailed representation of microbial networks by means of a comprehensive bioinformatic evaluation (meta-analysis) including abiotic process factors. Established bioinformatics solutions and concepts will be reimplemented to make optimal use of available de.NBI cloud resources to identify the core-microbiome of biogas communities, determine unique taxa for specific communities and elucidate relationships between taxonomic units. It is expected that obtained results will contribute to the identification and characterization of key organisms to better understand and improve the biogas process at a whole.

Software for computational pangenomics

Andreas Rempel (Supervisors: Stoye, Förstner)

A pangenome is a collection of genomic sequences from different individuals. It holds information on conserved regions, local polymorphisms, and structural variations and can provide insights into genomic differences and evolutionary relationships. There are different data structures used for the storage and comparison of the sequences, such as colored De Bruijn Graphs, Variation Graphs, or Sequence Bloom Trees. The aim of this project is to compare existing software tools for computational pangenomics, to define a common standard interface for the data structures, and to set up an automated (cloud-based) test environment to evaluate their performance and to support users in finding the tool that suits their demands best.

Applications of colored de Bruijn graphs

Tizian Schulz (Supervisors: Stoye, Hach)

Increasing amounts of individual genomes sequenced per species motivate the usage of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g., compacted colored de Bruijn graphs, which offer a low memory usage and facilitate reference-free sequence comparisons. Here we develop a new, heuristical method to find all maximum scoring local alignments between a DNA query sequence and a pangenome represented as a compacted colored de Bruijn graph. Furthermore, we introduce the notions of quorum and search color set allowing to concentrate searches on any part of the pangenome. The source code of our implementation and test data are available on gitlab.

Machine learning approaches for the characterization of biological systems

Janik Sielemann (Supervisors: Bräutigam, Hammer)

The interaction between transcription factors and the genome is a key factor of transcriptional regulation in all domains of life. Recent studies focusing on the characterization of single transcription factors elucidate the importance of those proteins. Although the exact mechanism of binding site recognition is still under debate, most experimentally studied protein-DNA binding events were traced down to binding sequences consisting of 6 to 10 base pairs. Within my doctoral project, I concentrate on broadening the understanding of the protein-DNA binding mechanism with the help of machine learning models. By this I want to enable more precise computational predictions of protein-DNA binding events.

Bioinformatics solutions for microbiome meta-transcriptome analyses

Tom Tubbesing (Supervisors: Sczyrba, Schlüter)

Analysing the transcriptome of microbial strains to identify Differentially Expressed Genes (DEGs) is a common approach. The DESeq2-package (Love et al., 2014) is well established for carrying out this kind of analyses based on count data from RNA sequencing experiments. However, when studying microbial communities, reliably identifying DEGs based on a metatranscriptome sequencing datasets is compounded by the fact that the abundances of microbial taxa vary between sampling conditions. This project is aimed at implementing a comprehensive software workflow for the analysis of such datasets and use it to expand the Elastic MetaGenome Browser (EMGB) platform.

Machine learning based analysis of crop regulatory networks

Donat Wulf (Supervisors: Bräutigam, Sczyrba)

Gene regulation is an important mechanism for organisms to react to changing environmental conditions. These regulatory mechanisms are governed by transcription factors and organized in a gene regulatory network (GRN). Machine learning enables the inference of these networks. In this project, I develop methods to establish and analyze GRNs by gene ontology enrichment tools. I compare GRNs and transcription factor binding sites by high throughput construction of phylogenetic trees and intragroup analyses. GRNs are validated by DAP-seq and EMSA.