Context-based RNA-seq mapping ~ Fakultät für Mathematik, Informatik und Statistik - Digitale Hochschulschriften der LMU

In recent years, the sequencing of RNA (RNA-seq) using next
generation sequencing (NGS) technology has become a powerful tool
for analyzing the transcriptomic state of a cell. Modern NGS
platforms allow for performing RNA-seq experiments in a few days,
resulting in millions of short sequencing reads. A crucial step in
analyzing RNA-seq data generally is determining the transcriptomic
origin of the sequencing reads (= read mapping). In principle, read
mapping is a sequence alignment problem, in which the short
sequencing reads (30 - 500 nucleotides) are aligned to much larger
reference sequences such as the human genome (3 billion
nucleotides). In this thesis, we present ContextMap, an RNA-seq
mapping approach that evaluates the context of the sequencing reads
for determining the most likely origin of every read. The context
of a sequencing read is defined by all other reads aligned to the
same genomic region. The ContextMap project started with a proof of
concept study, in which we showed that our approach is able to
improve already existing read mapping results provided by other
mapping programs. Subsequently, we developed a standalone version
of ContextMap. This implementation no longer relied on mapping
results of other programs, but determined initial alignments itself
using a modification of the Bowtie short read alignment program.
However, the original ContextMap implementation had several
drawbacks. In particular, it could neither map reads spanning more
than two exons nor detect insertions or deletions (indels).
Furthermore, ContextMap depended on a modification of a specific
Bowtie version. Thus, it could benefit neither from Bowtie updates
nor from novel developments (e.g. improved running times) in
the area of short read alignment software. For addressing these
problems, we developed ContextMap 2, an extension of the original
ContextMap algorithm. The key features of ContextMap 2 are the
context-based resolution of ambiguous read alignments and the
accurate detection of reads crossing an arbitrary number of
exon-exon junctions or containing indels. Furthermore, a plug-in
interface is provided that allows for the easy integration of
alternative short read alignment programs (e.g. Bowtie 2 or BWA)
into the mapping workflow. The performance of ContextMap 2 was
evaluated on real-life as well as synthetic data and compared to
other state-of-the-art mapping programs. We found that ContextMap 2
had very low rates of misplaced reads and incorrectly predicted
junctions or indels. Additionally, recall values were as high as
for the top competing methods. Moreover, the runtime of ContextMap
2 was at least twofold lower than that of the best competitors. In
addition to the mapping of sequencing reads to a single reference,
the ContextMap approach allows the investigation of several
potential read sources (e.g. the human host and infecting
pathogens) in parallel. Thus, ContextMap can be applied to mine for
infections or contaminations or to map data from
meta-transcriptomic studies. Furthermore, we developed methods
based on mapping-derived statistics that allow assessing the
confidence of mappings to identified species and detecting false positive
hits. ContextMap was evaluated on three real-life data sets and
results were compared to metagenomics tools. Here, we showed that
ContextMap can successfully identify the species contained in a
sample. Moreover, in contrast to most other metagenomics
approaches, ContextMap also provides read mapping results to
individual species. As a consequence, read mapping results
determined by ContextMap can be used to study the gene expression
of all species contained in a sample at the same time. Thus,
ContextMap might be applied in clinical studies, in which the
influence of infecting agents on host organisms is investigated.
The methods presented in this thesis allow for an accurate and fast
mapping of RNA-seq data. As the amount of available sequencing data
increases constantly, these methods will likely become an important
part of many RNA-seq data analyses and thus contribute valuably to
research in the field of transcriptomics.
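
The idea of using a read's context to resolve ambiguous alignments can be illustrated with a small toy sketch. This is not the ContextMap algorithm itself: the function name `resolve_by_context`, the fixed `window` parameter, and the simple "count the other reads nearby" score are all assumptions made for illustration only. Each read comes with one or more candidate alignments, and for a multi-mapped read the sketch keeps the candidate that has the most other reads aligned in the same genomic region.

```python
from collections import defaultdict


def resolve_by_context(candidates, window=100):
    """Pick one alignment per read based on nearby reads (toy illustration).

    candidates: dict mapping read_id -> list of (chrom, pos) candidate alignments.
    window:     hypothetical region half-width in nucleotides (an assumption).
    Returns a dict mapping read_id -> the chosen (chrom, pos).
    """
    # Group every candidate alignment by chromosome, remembering its read.
    by_chrom = defaultdict(list)
    for read_id, alignments in candidates.items():
        for chrom, pos in alignments:
            by_chrom[chrom].append((pos, read_id))

    def support(read_id, chrom, pos):
        # Context score: number of distinct *other* reads that have a
        # candidate alignment within `window` nucleotides of this position.
        return len({
            other
            for other_pos, other in by_chrom[chrom]
            if other != read_id and abs(other_pos - pos) <= window
        })

    # For each read, keep the candidate with the highest context score.
    # Ties are broken by the order in which candidates were listed.
    return {
        read_id: max(alignments, key=lambda a: support(read_id, a[0], a[1]))
        for read_id, alignments in candidates.items()
    }


if __name__ == "__main__":
    example = {
        "r1": [("chr1", 1000)],
        "r2": [("chr1", 1040)],
        "r3": [("chr1", 1020), ("chr2", 5000)],  # ambiguous read
        "r4": [("chr2", 9000)],
    }
    print(resolve_by_context(example)["r3"])  # prints ('chr1', 1020)
```

In the example, read `r3` has two candidate positions. The candidate on `chr1` lies next to `r1` and `r2`, whereas the candidate on `chr2` has no other read nearby, so the `chr1` alignment is kept. A real mapper would additionally have to deal with spliced alignments, indels, and alignment quality, none of which this sketch models.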
