A Corpus-based Approach to the Chinese Word Segmentation ~ Fakultät für Sprach- und Literaturwissenschaften - Digitale Hochschulschriften der LMU Podcast

For a society based upon laws and reason, it has become too easy
for us to believe that we live in a world without them. And given
that our linguistic wisdom was originally motivated by the search
for rules, it seems strange that we now consider these rules to be
the exceptions and take exceptions as the norm. The current task of
contemporary computational linguistics is to describe these
exceptions. In particular, for most language processing needs it
suffices to describe the argument and predicate within an
elementary sentence, under the framework of local grammar.
Therefore, a corpus-based approach to the Chinese Word Segmentation
problem is proposed, as the first step towards a local grammar for
the Chinese language. The two main issues with existing
lexicon-based approaches are (a) the classification of unknown
character sequences, i.e. sequences that are not listed in the
lexicon, and (b) the disambiguation of situations where two
candidate words overlap. For (a), we propose an automatic method of
enriching the lexicon by comparing candidate sequences to
occurrences of the same strings in a manually segmented reference
corpus, and using methods of machine learning to select the optimal
segmentation for them. These methods are developed in the course of
the thesis specifically for this task. The possibility of applying
these machine learning methods to the NP-extraction and alignment
domains is also discussed. Issue (b) is approached by designing a general
processing framework for Chinese text, which will be called
multi-level processing. Under this framework, sentences are
recursively split into fragments, according to language-specific
but domain-independent heuristics. The resulting fragments then
define the ultimate boundaries between candidate words and
therefore resolve any segmentation ambiguity caused by overlapping
sequences. A new shallow semantic annotation is also proposed
under the framework of multi-level processing. A word segmentation
algorithm based on these principles has been implemented and
tested; results of the evaluation are given and compared to the
performance of previous approaches as reported in the literature.
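
The two issues named above can be made concrete with a small sketch. It is not the thesis's algorithm: the function names, the toy lexicon, and the simplification of ignoring single-character words are all assumptions made for illustration. It only shows how a lexicon lookup exposes (a) character runs that no lexicon entry covers and (b) pairs of candidate words that overlap.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def lexicon_matches(text: str, lexicon: Set[str], max_len: int = 4) -> List[Span]:
    """Return the span of every multi-character lexicon word found in text.

    Single-character words are ignored here purely to keep the example small.
    """
    spans = []
    for start in range(len(text)):
        for end in range(start + 2, min(len(text), start + max_len) + 1):
            if text[start:end] in lexicon:
                spans.append((start, end))
    return spans


def overlapping_pairs(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Issue (b): pairs of candidates that share characters, neither containing the other."""
    pairs = []
    for i, (s1, e1) in enumerate(spans):
        for s2, e2 in spans[i + 1:]:
            if s1 < s2 < e1 < e2:
                pairs.append(((s1, e1), (s2, e2)))
    return pairs


def uncovered_runs(text: str, spans: List[Span]) -> List[str]:
    """Issue (a): maximal character runs that no lexicon word covers."""
    covered = [False] * len(text)
    for s, e in spans:
        for i in range(s, e):
            covered[i] = True
    runs, start = [], None
    for i, is_covered in enumerate(covered):
        if not is_covered and start is None:
            start = i
        elif is_covered and start is not None:
            runs.append(text[start:i])
            start = None
    if start is not None:
        runs.append(text[start:])
    return runs


# Hypothetical toy lexicon, chosen only to trigger both issues.
toy_lexicon = {"研究", "研究生", "生命", "起源"}
spans = lexicon_matches("研究生命起源", toy_lexicon)
print(overlapping_pairs(spans))  # 研究生 (0, 3) overlaps 生命 (2, 4)
print(uncovered_runs("王小明研究生命", lexicon_matches("王小明研究生命", toy_lexicon)))
```

In this toy input, 研究生 and 生命 compete for the character 生, which is the overlap situation of issue (b), while the name 王小明 is absent from the lexicon and surfaces as an unknown sequence, as in issue (a).
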
The first chapter of this thesis discusses the goals of
segmentation and introduces some background concepts. The second
chapter analyses the current state-of-the-art approach to Chinese
language segmentation. Chapter 3 proposes a new corpus-based
approach to the identification of unknown words. Chapter 4 proposes
a new shallow semantic annotation under the framework of
multi-level processing.
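
The idea of recursively splitting sentences into fragments, and letting fragment boundaries rule out candidate words, can be sketched as follows. The abstract does not state the actual heuristics, so this sketch substitutes a deliberately simple stand-in: two hypothetical delimiter levels based on punctuation (`LEVELS`). The function names are likewise assumptions, and the thesis's own heuristics are presumably richer than this.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive

# Hypothetical delimiter levels: sentence-final marks first, then clause-level marks.
LEVELS: Tuple[str, ...] = ("。！？", "，；：、")


def split_fragments(text: str, offset: int = 0,
                    levels: Sequence[str] = LEVELS) -> List[Span]:
    """Recursively split text into fragments, one delimiter level per recursion step."""
    if not text:
        return []
    if not levels:
        return [(offset, offset + len(text))]
    fragments: List[Span] = []
    start = 0
    for i, ch in enumerate(text):
        if ch in levels[0]:
            if i > start:
                fragments.extend(
                    split_fragments(text[start:i], offset + start, levels[1:]))
            start = i + 1  # skip the delimiter itself
    if start < len(text):
        fragments.extend(
            split_fragments(text[start:], offset + start, levels[1:]))
    return fragments


def within_one_fragment(span: Span, fragments: List[Span]) -> bool:
    """A candidate word is kept only if it lies entirely inside a single fragment."""
    s, e = span
    return any(fs <= s and e <= fe for fs, fe in fragments)


fragments = split_fragments("我来了，你走吧。好！")
print(fragments)
print(within_one_fragment((2, 5), fragments))  # crosses the comma at offset 3
print(within_one_fragment((4, 6), fragments))  # inside the second fragment
```

Any candidate word whose span crosses a fragment boundary is discarded, so two overlapping candidates cannot both survive when a boundary falls between them. How many overlaps this resolves depends entirely on how fine-grained the splitting heuristics are.
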
