
Open thesis topics

Within our group we can offer various topics in the field of applied bioinformatics, high-throughput data analysis, genome and metagenome research as well as postgenomics and systems biology. Below you can find a list of suggested open topics for BSc and MSc theses and student projects. For further details on each topic or alternative projects please contact us.

Exploring the Role of Nasal Microbiota in Neurological Diseases (M.Sc.)

Background
Microbial communities, including those in the human nasal cavity, contribute to the stability
and functionality of their environment. Recent research suggests a potential link between the nasal microbiota and
neurological diseases such as Parkinson’s disease (PD), Alzheimer’s disease (AD), and
multiple sclerosis (MS) (1). However, the nature of this relationship remains unclear due to
a limited number of studies.
While much focus has been on the gut-brain axis, the influence of the nose-brain axis on
the immune system and respiratory homeostasis requires further investigation (2). Some
studies have indicated that altering the nasal microbiota could potentially prevent or treat
neurological diseases, highlighting the need to understand the complex interactions
between the nasal microbiota and the brain. Evidence suggests that the nasal microbiome
may travel through the olfactory pathway to the brain (2, 3). The diversity of bacteria in the
nasal cavity is highly dynamic and can vary depending on age, physiology, and lifestyle.
This project will investigate how nasal microbiota stability impacts the blood-brain barrier
(BBB) and its potential role in the development and progression of neurological diseases.
Our goal is to gain a comprehensive understanding of the nasal microbial community, the
conditions under which it remains stable, and how disruptions in nasal homeostasis might
contribute to neurodegeneration.

Objective
The primary objective of this project is to explore the conditions under which nasal
microbiota stability or instability is associated with neurological diseases, focusing on
potential diagnostic and therapeutic applications.

Methodology
1. Literature Review: Conduct a thorough review of existing studies on the nasal
microbiota and its potential impact on neurological diseases.
2. Data Comparison and Analysis: Compare data gathered from literature on the nasal
microbiota, analyzing differences in composition and diversity, and identifying potential
patterns.
3. Mechanistic Studies: Explore how alterations in the nasal microbiota might influence
the BBB and contribute to the pathology of neurological diseases.
4. Model Creation and Analysis: Develop a model based on literature data to analyze
the stability of the nasal microbiota and its potential role in modulating the risk of
neurological diseases.
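
The comparison of composition and diversity in step 2 could start from standard ecological measures. The following is a minimal sketch, assuming genus-level count tables extracted from the literature; the function names and the toy counts are hypothetical illustrations, not project data.

```python
import math


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity of two count profiles (0 = identical, 1 = no shared taxa)."""
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0), b.get(t, 0)) for t in taxa)
    total = sum(a.values()) + sum(b.values())
    return 1.0 - 2.0 * shared / total if total else 0.0


# Hypothetical toy profiles (counts per genus), for illustration only
control = {"Corynebacterium": 40, "Staphylococcus": 30, "Dolosigranulum": 20, "Moraxella": 10}
patient = {"Corynebacterium": 10, "Staphylococcus": 80, "Moraxella": 10}

print(f"H' control: {shannon_index(control):.3f}")
print(f"H' patient: {shannon_index(patient):.3f}")
print(f"Bray-Curtis: {bray_curtis(control, patient):.3f}")
```

A lower Shannon index in one group together with a high Bray-Curtis dissimilarity between groups would be one possible signal of altered community composition; which measures are actually suitable depends on the data available in the literature.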

Expected Outcome
This project aims to shed light on the role of the nasal microbiota in neurological diseases,
potentially leading to novel diagnostic and therapeutic strategies. By understanding the
dynamics of nasal microbiota stability, we hope to uncover new insights into preventing
and treating neurodegenerative conditions.

References
1. García-Jiménez, Beatriz, et al. Computational and Structural Biotechnology Journal 19
(2021): 226-246.
2. Xie, Jin, et al. Pharmacological Research 179 (2022): 106189.
3. Thangaleela, Subramanian, et al. Microorganisms 10.7 (2022): 1405.

Contact: Dr. Reihaneh Mostolizadeh

 

Automated Reconstruction of High-Quality Genome-Scale Models Using Machine Learning (B.Sc. or M.Sc.)

Background

Genome-scale metabolic models (GEMs) are essential in biological research and
biotechnological development, as they enable the comprehensive analysis of metabolic
networks and fluxes. Reconstructing a high-quality genome-scale model (GEM) involves a
detailed workflow of 96 steps (6).
Despite the standard protocols and operating procedures available for GEM construction,
the process remains time-consuming. This has led to recent efforts aimed at automating
the reconstruction steps. Researchers have developed various protocols that combine
automated steps to streamline the reconstruction and refinement of GEMs.
In recent years, machine learning (ML) has played a significant role in the reconstruction
and analysis of GEMs, enhancing their quality and accuracy (1, 4). 

Objective:
This project aims to develop an automated protocol for reconstructing high-quality
genome-scale models using available ML approaches. We have compiled all available
literature focusing on the application of ML in the reconstruction of GEMs. By integrating
these ML-based methods into a cohesive automated procedure, we intend to facilitate the
reconstruction and refinement of GEMs.

Methodology:
1. Literature Review and Compilation: Gather and analyze literature on ML approaches
used in GEM reconstruction.
2. Automation Protocol Development: Combine the identified ML-based steps into an
automated workflow.
3. Comparison and Selection: As a first step, for organisms with multiple annotated
genomes, compare the annotations and select the most comprehensive
one.
4. GEM Reconstruction: Apply the automated protocol to reconstruct the GEM.
5. Refinement Using ML: To refine the reconstructed GEM, employ ML-based approaches such
as gap filling, pathway prediction (2), gene essentiality prediction (5), and EC number prediction (3).

Expected Outcome:
This project will result in an automated, ML-based protocol for GEM reconstruction. It will
allow for comparing different ML approaches and improve the efficiency and quality of
GEMs. 

References:
1. Kim, Yeji, Gi Bae Kim, and Sang Yup Lee. "Machine learning applications in genome-scale metabolic modeling." Current Opinion in Systems Biology 25 (2021): 42-49.
2. Dale, Joseph M., Liviu Popescu, and Peter D. Karp. "Machine learning methods for
metabolic pathway prediction." BMC bioinformatics 11 (2010): 1-14.
3. Ryu, Jae Yong, Hyun Uk Kim, and Sang Yup Lee. "Deep learning enables high-quality
and high-throughput prediction of enzyme commission numbers." Proceedings of the
National Academy of Sciences 116.28 (2019): 13996-14001.
4. Zampieri, Guido, et al. "Machine and deep learning meet genome-scale metabolic
modeling." PLoS computational biology 15.7 (2019): e1007084.
5. Hasibi, Ramin, Tom Michoel, and Diego A. Oyarzún. "Integration of graph neural
networks and genome-scale metabolic models for predicting gene essentiality." npj
Systems Biology and Applications 10.1 (2024): 24.
6. Thiele, Ines, and Bernhard Ø. Palsson. "A protocol for generating a high-quality
genome-scale metabolic reconstruction." Nature protocols 5.1 (2010): 93-121.

Contact: Dr. Reihaneh Mostolizadeh

 

Comparative genome analysis of Streptococcus agalactiae (GBS) from elephants (M.Sc.)

Background

Group B streptococci (GBS) are fairly common. In livestock, they are a causative agent of udder inflammation, most often seen in dairy cows.

In elephants, S. agalactiae is associated with paronychia.
Under human care, elephants are known to reach an advanced age. This comes with an age-related decline of their immune system, which can cause usually harmless skin or foot diseases to become chronic. Better knowledge of the bacterial infections is a vital foundation for optimized treatments and therapeutic approaches.

In a recent study by the Hessisches Landeslabor (Hessian State Laboratory, LHL), several S. agalactiae isolates were compared using microbiological methods, and extensive biochemical profiles were created.
Notably, the serotype could not be determined for a large number of isolates. For this reason, some isolates were sequenced so that a full comparative genome analysis can be carried out using the latest bioinformatics methods.

Thesis aims

  • Implementation of typical bioinformatic analyses (assembly, mapping, annotation...)
  • Comparative analysis of GBS isolates (ABR, pan- and core genome, virulence factors...)
  • Closer inspection of genes for serotyping
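
To illustrate the pan- and core genome comparison named above: once genes have been clustered into families, the pan genome is the union and the core genome the intersection of the per-isolate gene family sets. The sketch below assumes such a clustering already exists; the isolate names and gene family identifiers are hypothetical placeholders.

```python
def pan_and_core(gene_sets):
    """Return (pan, core, accessory) gene family sets for a dict isolate -> gene families."""
    genomes = [set(families) for families in gene_sets.values()]
    if not genomes:
        return set(), set(), set()
    pan = set().union(*genomes)          # families present in at least one isolate
    core = set.intersection(*genomes)    # families present in every isolate
    accessory = pan - core               # families missing from at least one isolate
    return pan, core, accessory


# Hypothetical isolates and gene family IDs, for illustration only
isolates = {
    "isolate_1": {"fam_1", "fam_2", "fam_3"},
    "isolate_2": {"fam_1", "fam_2", "fam_4"},
    "isolate_3": {"fam_1", "fam_3", "fam_4"},
}

pan, core, accessory = pan_and_core(isolates)
print("pan:", sorted(pan))
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

In practice, the gene family assignment would come from a dedicated orthology or clustering step rather than from exact identifier matches as assumed here.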

Prerequisites

  • Interest in solving biological/veterinary questions using bioinformatics
  • Extensive knowledge of the Linux command line
  • Ability to work independently and methodically

Contact: Linda Fenske

 

Workflow Design (Nextflow) (M.Sc.)

 

Background

Analysing (bacterial) sequence data for biological/medical questions often means repeating certain standard processes (QC, assembly, annotation etc.).

For better reproducibility and simplification of these processes, flexible pipelines with a wide palette of tools are used. Often Nextflow (or a similar workflow tool) is used to support a variety of environments or to simplify installation.

With DSL2, Nextflow recently introduced a significant revision of the Nextflow language, which promises better scalability and modularization of pipelines, along with a better design of workflows.

Thesis aims

  • Revision and updating of an existing workflow for analysing bacterial data
  • Migration of the workflow from Nextflow DSL1 to DSL2
  • Visualising the results (creating a GUI)

Prerequisites 

  • Extensive knowledge of the Linux command line
  • Knowledge of Nextflow or motivation to become acquainted with Nextflow
  • Programming knowledge in Python, Groovy (Nextflow) or similar
  • Knowledge and interest in visualisation and processing of data

Contact: Linda Fenske

 

Platon Bioinformatics Tool Enhancement for Faster Plasmid Identification (M.Sc.) - taken

Background

Modern high-throughput sequencing devices enable the rapid determination of sequence data obtained from interacting microbial communities without a prior cultivation step. This provides easy access to genetic information from otherwise unculturable microbiota. Computational interpretation of such data relies either on assigning raw sequencing reads to their source organisms in order to infer taxonomic origin or gene-coding content, or on assembling the metagenome datasets, thereby recovering longer contiguous DNA stretches of the underlying microbial genomes.

Assembled metagenomic contigs are typically clustered (most often based on coverage or nucleotide composition), yielding individual draft or complete genomes of novel bacterial species. In this process, however, contigs of non-chromosomal origin such as plasmids are often overlooked.

Still, the analysis of plasmids is of utmost importance, since they constitute a key mechanism of horizontal gene transfer between microbial hosts. They are known to harbor genes that are beneficial or essential for microbial fitness or survival under certain environmental conditions (e.g. in the presence of certain antimicrobial agents) or that enable their hosts to perform metabolic processes they otherwise could not (e.g. degradation of novel substrates).

Several bioinformatics applications have been developed for the computational identification of plasmid-borne contigs, most typically focusing on the extraction of plasmid contigs from the assemblies of individual draft genomes. Among these tools are Platon (Schwengers et al., 2020), PlasClass (Pellow et al., 2020) and PlasFlow (Krawczyk et al., 2018), of which Platon exhibits excellent performance, but its runtime characteristics currently impede its application to potentially large metagenome assemblies.

 

Thesis aims

  • Overhaul of the Platon code base, switching from a contig-centered approach to one based on bulk data processing in order to significantly decrease overall runtime.
  • Inlining of certain sub-analysis steps such as circularity testing into the Python codebase instead of relying on the invocation of external tools (e.g. via Pyrodigal, PyHMMER, PytrimAl)
  • Conditional tool execution: Do not invoke additional tools if preceding steps already exclude a sequence from being a plasmid
  • Runtime and performance assessment with regard to the original implementation
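
As an illustration of what inlining a step such as circularity testing might look like, the following sketch checks in pure Python whether the end of a contig repeats its start, which is a common signature of a circular molecule that was linearized during assembly. It is a simplified assumption (exact matches only, no mismatches or gaps) and not Platon's actual implementation; the function names and thresholds are hypothetical.

```python
def terminal_overlap(seq, min_overlap=20, max_overlap=500):
    """Length of the longest exact overlap between the contig end and its start, or 0."""
    seq = seq.upper()
    limit = min(max_overlap, len(seq) // 2)
    # Try the longest candidate overlap first and shrink until one matches
    for k in range(limit, min_overlap - 1, -1):
        if seq[-k:] == seq[:k]:
            return k
    return 0


def is_putatively_circular(seq, min_overlap=20):
    """Flag a contig as putatively circular if its ends overlap by at least min_overlap bases."""
    return terminal_overlap(seq, min_overlap=min_overlap) > 0


# Synthetic toy contigs, for illustration only
repeat = "ACGTACGTTGCAAGCTTGCATGCC"            # 24 nt shared by both ends
circular_contig = repeat + "T" * 60 + repeat
linear_contig = "A" * 30 + "C" * 30 + "G" * 30

print(terminal_overlap(circular_contig))       # overlap length of the circular toy contig
print(is_putatively_circular(linear_contig))
```

Because the check is an in-process function call, it can also be skipped cheaply for sequences that earlier filters have already excluded, which fits the conditional execution aim listed above.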

 

Requirements

  • Familiarity with Linux and (modular) Python programming (incl. unit testing)
  • Methodical way of working
  • Able to work independently

Contact: Oliver Schwengers

 

Reconstruction and visualization of KEGG metabolic pathways in the EDGAR platform (M.Sc.)

Background

EDGAR is a web-based platform for analyzing microbial data. It is developed by employees of the Bioinformatics and Systems Biology department at JLU Giessen and provides multifaceted methods for investigating genomes.

KEGG (Kyoto Encyclopedia of Genes and Genomes) provides curated databases and resources for (among other things) the functional annotation and classification of genes. In previous projects, KEGG functional categories for all organisms and their corresponding genes were computed in the EDGAR platform. These are currently displayed directly in two analysis modules, in purely quantitative terms.

MinPath is a program for reconstructing biological/metabolic pathways. It attempts to infer the minimal set of metabolic pathways that can explain the genes found in a given dataset, thereby excluding redundant pathways. The above-mentioned KEGG categories will be used as input for this program.
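
The parsimony idea behind such a reconstruction can be illustrated with a small set-cover sketch. Note the assumptions: this uses a greedy heuristic for illustration, which is not how MinPath itself solves the problem, and the pathway and gene family identifiers are hypothetical placeholders rather than real KEGG entries.

```python
def greedy_minimal_pathways(observed, pathway_members):
    """Greedily pick pathways until every observed family that maps to any pathway is explained."""
    members = {p: set(fams) for p, fams in pathway_members.items()}
    # Only families that belong to at least one pathway can be explained at all
    uncovered = {f for f in observed if any(f in fams for fams in members.values())}
    chosen = []
    while uncovered:
        # Pathway explaining the most still-unexplained families; ties broken alphabetically
        best = max(sorted(members), key=lambda p: len(members[p] & uncovered))
        gained = members[best] & uncovered
        if not gained:
            break
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical mapping of pathways to gene families, for illustration only
pathways = {
    "path_A": {"fam_1", "fam_2", "fam_3"},
    "path_B": {"fam_3", "fam_4"},
    "path_C": {"fam_1"},
    "path_D": {"fam_2", "fam_5"},
}
observed_families = {"fam_1", "fam_2", "fam_3", "fam_4"}

print(greedy_minimal_pathways(observed_families, pathways))
```

In this toy case, path_C and path_D are dropped because every family they could explain is already explained by path_A, which mirrors the exclusion of redundant pathways described above.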

The goal of the project is to develop a comparative analysis module, based on KEGG pathway information, for the EDGAR platform.

Thesis Aims

  • Parse the available KEGG data in a structured manner and compute KEGG metabolic pathways for all given genomes in EDGAR using MinPath.
  • Design comparative visualizations for the EDGAR frontend using the resulting data, allowing users to interactively explore their data (see fig. 4 here as an example)
  • Adjust the project scope in consultation with the student depending on the project status to accommodate shared ideas, as EDGAR incorporates a wide selection of data with potential for creative analysis methods.

Requirements 

  • Programming skills in Python and JavaScript (can also be learned during the process)

  • Basic SQL database knowledge

 

PlasmidHunter: Validation of a metagenome-based plasmid search using public plasmid sequences (M.Sc.)

Background

Plasmids play an important role in the genetic variability of organisms. They replicate independently and can be transferred between organisms, both within and between species. Therefore, plasmids are key drivers of horizontal gene transfer. Often, they are the only effective difference between commensal and pathogenic bacterial strains. In recent years, it has become clear that plasmids are among the main mechanisms for the dissemination of antimicrobial resistance and hence are of special interest in medical microbiology. Detecting plasmids and analyzing their dissemination is an important epidemiological and scientific topic that might help to detect current and prevent future outbreaks of antibiotic resistance.

One promising data source containing known and unknown plasmids is whole-metagenome datasets of samples from different sources (soil, wastewater, the human gut). For many of these samples, sequencing data is freely accessible in public databases, often annotated with additional meta information such as date, source and location of each sample.

Our project processes these datasets from the MGnify database in a standardized way via modern cloud technologies and makes them accessible to users for a fast search of new plasmids within this huge amount of data.

This master thesis should validate this search via existing plasmid databases (such as PLSDB) and analyze search results including comprehensive visualizations.

Thesis Aims

  • Implementation of a workflow to process PLSDB entries with our existing search workflow
  • Statistical analysis of the results, and screen for potential interesting candidates for further analysis
  • Visualization of the results

Prerequisites 

  • Knowledge of command line tools and Python
  • Interest in cloud technologies
  • Prior experience with workflow systems, like Nextflow or Snakemake

Contact: Sebastian Beyvers

 

Webservice for searching gene families in plants (M.Sc.)

 

Background

The input is a list of protein sequences. In step 1a, a Pfam search is performed with the sequences to find common domains. In step 1b, a multiple sequence alignment of the sequences is calculated. The conserved regions are automatically extracted from the alignment to calculate HMMs. In step 2, the HMMs of the domains from 1a and 1b are used to search a database of plant proteins.
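
The extraction of conserved regions in step 1b could, for example, be based on a per-column identity threshold. The following sketch is one possible approach under that assumption; the source does not specify the actual criterion, and the function names, thresholds and toy alignment are hypothetical. Building HMMs from the extracted blocks would be a separate step that is not shown.

```python
from collections import Counter


def column_conservation(column):
    """Fraction of sequences sharing the most common non-gap residue in one column."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    return Counter(residues).most_common(1)[0][1] / len(column)


def conserved_blocks(alignment, min_identity=0.8, min_length=3):
    """Return (start, end) half-open column ranges of consecutive conserved columns."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    blocks, start = [], None
    for i in range(length + 1):
        conserved = i < length and column_conservation([s[i] for s in alignment]) >= min_identity
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_length:
                blocks.append((start, i))
            start = None
    return blocks


# Toy alignment, for illustration only
toy_alignment = [
    "MKV-LLAGGTWQ",
    "MKVALLSGGTWQ",
    "MRV-LLAGGTWE",
]

for start, end in conserved_blocks(toy_alignment):
    print(start, end, [seq[start:end] for seq in toy_alignment])
```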

Thesis Aims

  • The results are visualized and made available for download
  • Steps 1 and 2 are also provided as a command-line tool

Prerequisites

  • The programming language(s) and frameworks can be freely chosen
  • Test data will be provided

Contact: Oliver Rupp

 

Ribosomal binding site prediction based on 16S-rRNA (M.Sc.)

 

Background

Bacterial translation is initiated by the assembly of ribosomal proteins as part of the translation initiation complex at the coding sequence (CDS) start site. For most CDS, there is a ribosomal binding site (RBS) immediately upstream of the gene, consisting of a 5-10 bp spacer and a (partial or complete) Shine-Dalgarno sequence (SD) 5’-AGGAGG-3’ to which the ribosome binds. However, some genes have neither an SD nor a known RBS and are still expressed (Omotajo, D. et al., 2015). The Shine-Dalgarno sequence was first described in E. coli but is found in many bacterial genomes and is complementary to the anti-SD sequence at the 3’-end of 16S-rRNA.

The exact Shine-Dalgarno and spacer sequences vary between bacterial species. However, because the anti-Shine-Dalgarno sequence is present in the 16S-rRNA of each bacterial genome, it can be used to predict RBS in a species-independent manner.  Therefore, a deep learning approach using the 16S-rRNA sequences and the sequence upstream of the CDS is promising for accurately predicting the presence of RBS independent of species-specific variants.
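
The complementarity between the 16S-rRNA 3' end and the Shine-Dalgarno sequence can be illustrated with a simple rule-based baseline, against which a learned model could later be compared. This is a sketch under stated assumptions, not the proposed deep learning approach: it only finds exact complementary stretches, the function names are hypothetical, the 3' tail used is the commonly cited E. coli sequence written as DNA, and the upstream region is a made-up example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def best_sd_match(upstream, rrna_tail, min_len=4):
    """Longest stretch of the upstream region exactly complementary to part of the 16S 3' tail.

    Returns (position, motif) for the first longest match, or None if no stretch
    of at least min_len bases is found.
    """
    target = reverse_complement(rrna_tail)   # what the mRNA must contain to pair with the tail
    upstream = upstream.upper()
    for k in range(len(target), min_len - 1, -1):
        for i in range(len(target) - k + 1):
            motif = target[i:i + k]
            pos = upstream.find(motif)
            if pos != -1:
                return pos, motif
    return None


ecoli_16s_tail = "ACCTCCTTA"                 # assumed 3' end of E. coli 16S rRNA, as DNA
upstream_region = "GCTTTACAGGAGGTTAACAT"     # hypothetical 20 nt directly upstream of a start codon

match = best_sd_match(upstream_region, ecoli_16s_tail)
if match:
    pos, motif = match
    spacer = len(upstream_region) - (pos + len(motif))
    print(f"SD-like motif {motif} at position {pos}, spacer {spacer} nt")
```

Swapping in the 3' tail of another species' 16S-rRNA changes the motif that is searched for, which is the species-independent aspect described above.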

Thesis Aims

  • Design and implementation of a neural network for ribosomal binding site prediction in bacteria,
  • evaluation of the features used by the neural network, and
  • analysis of the presence of RBS in exemplary bacterial genomes

Prerequisites 

  • Prior experience with deep learning frameworks such as Tensorflow/Keras, or willingness to learn them
  • Prior experience in the development of documented code and dependency management or willingness to learn them

Contact: Julian Hahnfeld

 

Integrative Omics FAIR Workflow (M.Sc.)

Background

Processing and analysing 'omics data often requires applying predefined building blocks of code, e.g. for performing quality control, statistical analysis or machine learning. However, biologists and ecologists are often overwhelmed by the technical complexity of programmatic approaches and interfaces. Hence, scientific workflows can not only automate but also facilitate important recurring processes in high-throughput 'omics analysis.

The existing modularized iESTIMATE pipeline aims at automating and facilitating the complex analysis of ecological metabolomics data and the integration with other phenomics and preparation for sequencing and (meta-)genomics data. The central aim of the pipeline is to extract so-called molecular traits that explain molecular mechanisms in plants or microorganisms.

Thesis Aims

  • Revision and modularisation of existing code to create the R package "iESTIMATE"
  • Implementing a workflow in Nextflow or Common Workflow Language (CWL) using test data, implementing unit tests and capturing provenance information
  • Publishing the R package and the workflow following the FAIR principles

Prerequisites 

  • Knowledge of R and a bit of Python
  • Knowledge of Linux command line, containers, Nextflow (Groovy), YAML, or motivation to become acquainted with them
  • Keen interest in analysis of integrative 'omics data and in topics in molecular ecology

Contact: Kristian Peters