Abstract:
This thesis studies the segmentation of ambiguous words in the Thai language,
where the ambiguity lies in whether a given word should be segmented as a single
combined (whole) word or as multiple separate words. Correct segmentation
requires an understanding of the structure of the language and of the context in
which the word is used in the sentence; this makes automatic word segmentation a
challenging task for a computer. Three main approaches have been taken in the
research community: rule-based, dictionary-based, and machine-learning-based
approaches. In this thesis, we propose a machine-learning-based word segmentation
algorithm that learns from a large, existing database of pre-segmented words. We
count how many times the ambiguous word in question appears in combined form
and in separated form. We also examine the collocations of words surrounding
the ambiguous word, allowing us to take into account the context in which the
word is used. To decide how to segment an
ambiguous word, each segmented form is assigned a score, or weight, computed
using one of four proposed methods: weight with probability, weight with
frequency, weight with frequency and distance, and weight with term
frequency-inverse document frequency (TF-IDF). In the experimental evaluation,
the four weighting methods gave similar performance, up to 84% in terms of
segmentation correctness, but the weight-with-probability method required the
least runtime. In
summary, the thesis proposes an ambiguous-word-segmentation method that
achieves 84.40% segmentation correctness.
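To illustrate the probability-based weighting mentioned above, the following Python sketch scores the two candidate forms of an ambiguous word by their relative frequency in a pre-segmented corpus. All names and the two-form setup here are assumptions for illustration only; the thesis's actual method also incorporates surrounding collocations, which this sketch omits.

```python
# Minimal sketch of "weight with probability" (assumed interpretation):
# each segmentation form is weighted by its relative frequency among all
# occurrences of the ambiguous word in a pre-segmented corpus.

def probability_weights(combined_count: int, separated_count: int) -> dict:
    """Return a weight in [0, 1] for each segmentation form."""
    total = combined_count + separated_count
    if total == 0:
        # No corpus evidence for either form.
        return {"combined": 0.0, "separated": 0.0}
    return {
        "combined": combined_count / total,
        "separated": separated_count / total,
    }

def choose_form(combined_count: int, separated_count: int) -> str:
    """Pick the segmentation form with the higher weight."""
    weights = probability_weights(combined_count, separated_count)
    return max(weights, key=weights.get)
```

For example, if a word appeared 84 times in combined form and 16 times in separated form, `choose_form(84, 16)` returns `"combined"` with a weight of 0.84. Because this weight needs only two counts and one division, it is plausible that such a method requires less runtime than the frequency-and-distance or TF-IDF variants.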