Data Structures for Efficient String Algorithms
Beschreibung
vor 18 Jahren
This thesis deals with data structures that are mostly useful in
the area of string matching and string mining. Our main result is
an O(n)-time preprocessing scheme for an array of n numbers such
that subsequent queries asking for the position of a minimum
element in a specified interval can be answered in constant time
(so-called RMQs for Range Minimum Queries). The space for this data
structure is 2n+o(n) bits, which is shown to be asymptotically
optimal in a general setting. This improves all previous results on
this problem. The main techniques for deriving this result rely on
combinatorial properties of arrays and so-called Cartesian Trees.
For compressible input arrays we show that further space can be
saved, while not affecting the time bounds. For the two-dimensional
variant of the RMQ-problem we give a preprocessing scheme with
quasi-optimal time bounds, but with an asymptotic increase in space
consumption of a factor of log(n). It is well known that algorithms
for answering RMQs in constant time are useful for many different
algorithmic tasks (e.g., the computation of lowest common ancestors
in trees); in the second part of this thesis we give several new
applications of the RMQ-problem. We show that our preprocessing
scheme for RMQ (and a variant thereof) leads to improvements in the
space- and time-consumption of the Enhanced Suffix Array, a
collection of arrays that can be used for many tasks in pattern
matching. In particular, we will see that in conjunction with the
suffix- and LCP-array 2n+o(n) bits of additional space (coming from
our RMQ-scheme) are sufficient to find all occ occurrences of a
(usually short) pattern of length m in a (usually long) text of
length n in O(m*s+occ) time, where s denotes the size of the
alphabet. This is certainly optimal if the size of the alphabet is
constant; for non-constant alphabets we can improve this to
O(m*log(s)+occ) locating time, replacing our original scheme with a
data structure of size approximately 2.54n bits. Again by using
RMQs, we then show how to solve frequency-related string mining
tasks in optimal time. In a final chapter we propose a space- and
time-optimal algorithm for computing suffix arrays on texts that
are logically divided into words, if one is just interested in
finding all word-aligned occurrences of a pattern. Apart from the
theoretical improvements made in this thesis, most of our
algorithms are also of practical value; we underline this fact by
empirical tests and comparisons on real-word problem instances. In
most cases our algorithms outperform previous approaches by all
means.
the area of string matching and string mining. Our main result is
an O(n)-time preprocessing scheme for an array of n numbers such
that subsequent queries asking for the position of a minimum
element in a specified interval can be answered in constant time
(so-called RMQs for Range Minimum Queries). The space for this data
structure is 2n+o(n) bits, which is shown to be asymptotically
optimal in a general setting. This improves all previous results on
this problem. The main techniques for deriving this result rely on
combinatorial properties of arrays and so-called Cartesian Trees.
For compressible input arrays we show that further space can be
saved, while not affecting the time bounds. For the two-dimensional
variant of the RMQ-problem we give a preprocessing scheme with
quasi-optimal time bounds, but with an asymptotic increase in space
consumption of a factor of log(n). It is well known that algorithms
for answering RMQs in constant time are useful for many different
algorithmic tasks (e.g., the computation of lowest common ancestors
in trees); in the second part of this thesis we give several new
applications of the RMQ-problem. We show that our preprocessing
scheme for RMQ (and a variant thereof) leads to improvements in the
space- and time-consumption of the Enhanced Suffix Array, a
collection of arrays that can be used for many tasks in pattern
matching. In particular, we will see that in conjunction with the
suffix- and LCP-array 2n+o(n) bits of additional space (coming from
our RMQ-scheme) are sufficient to find all occ occurrences of a
(usually short) pattern of length m in a (usually long) text of
length n in O(m*s+occ) time, where s denotes the size of the
alphabet. This is certainly optimal if the size of the alphabet is
constant; for non-constant alphabets we can improve this to
O(m*log(s)+occ) locating time, replacing our original scheme with a
data structure of size approximately 2.54n bits. Again by using
RMQs, we then show how to solve frequency-related string mining
tasks in optimal time. In a final chapter we propose a space- and
time-optimal algorithm for computing suffix arrays on texts that
are logically divided into words, if one is just interested in
finding all word-aligned occurrences of a pattern. Apart from the
theoretical improvements made in this thesis, most of our
algorithms are also of practical value; we underline this fact by
empirical tests and comparisons on real-word problem instances. In
most cases our algorithms outperform previous approaches by all
means.
Weitere Episoden
vor 12 Jahren
vor 12 Jahren
vor 12 Jahren
In Podcasts werben
Kommentare (0)