Title: Building a Morphological Analyzer for Laz
Abstract: This thesis is part of the efforts to document and revitalize the endangered Laz language, a member of the South Caucasian language family spoken mainly on the northeastern coastline of Turkey, from the perspective of computational linguistics (CL). It introduces the first implementation of an automatic language analysis tool for Laz, specifically for the Pazar dialect: a rule-based morphological analyzer built with finite-state networks and two-level morphology. Additional language resources, namely a lexicon and a corpus, were collected to increase the coverage of the analyzer and to evaluate its performance, respectively. The finite-state transducer, which serves both morphological synthesis and analysis, is written in the Xerox lexc and twolc formalisms and compiled with the lexicon and two-level compilers of the open-source Helsinki Finite State Toolkit (HFST).
Laz is a highly inflectional language whose verb conjugation is very complex in terms of both morphotactics and morphosyntax. It exhibits characteristics of templatic morphology, in which a language deviates from concatenative patterns. Prefixes and suffixes form long-distance dependencies at the word level, and they both convey and change the semantics and argument structure of the predicate. As a result, the information most critical to understanding the language is encoded inside its words.
Morphologically rich languages pose many challenges for natural language processing (NLP) tasks. Without access to the information encoded in word forms, it is impossible to develop NLP systems, whether low-level or high-level, for tasks such as lemmatization, part-of-speech tagging, spelling correction and machine translation. Therefore, the first aim in any NLP pipeline is to perform some form of morphological analysis on text or speech. There are different methods and approaches to the computational study of morphology. In this study, because few language and computational resources are available, I have chosen the rule-based approach and a method that is widely accepted and used among computational linguists for formalizing the morphotactics and morphophonemics of a language, namely finite-state transducers and two-level morphology written with regular expressions. I also used techniques such as flag diacritics for non-concatenative morphological patterns.
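To illustrate how flag diacritics can enforce a long-distance dependency between a prefix and a suffix, the following is a minimal Python sketch of the usual flag semantics (P for set, R for require, D for disallow, C for clear, U for unify). It is not the analyzer described in this thesis and does not use HFST; the function name accept_path, the feature AGR and the placeholder morphemes pfxA-, pfxB-, root and -sfxA are hypothetical and do not represent actual Laz forms. The N (negative set) operation is left out for brevity.

```python
import re

# Matches flag symbols such as @P.AGR.A@ or @R.AGR@ (the value is optional).
FLAG = re.compile(r"^@([PRDCU])\.([^.@]+)(?:\.([^.@]+))?@$")


def accept_path(symbols):
    """Walk one path of symbols; return the surface string if every
    flag constraint holds, otherwise None. Flags produce no output."""
    features = {}   # current feature settings, e.g. {"AGR": "A"}
    surface = []
    for sym in symbols:
        m = FLAG.match(sym)
        if m is None:
            surface.append(sym)          # ordinary symbol: emit it
            continue
        op, feat, val = m.groups()
        current = features.get(feat)
        if op in "PU" and val is None:
            raise ValueError(f"{sym}: this operation needs a value")
        if op == "P":                    # set feature unconditionally
            features[feat] = val
        elif op == "C":                  # clear feature
            features.pop(feat, None)
        elif op == "R":                  # require feature (optionally a value)
            if current is None or (val is not None and current != val):
                return None
        elif op == "D":                  # disallow feature (optionally a value)
            if (val is None and current is not None) or (val is not None and current == val):
                return None
        elif op == "U":                  # unify: set if unset, else must match
            if current is None:
                features[feat] = val
            elif current != val:
                return None
    return "".join(surface)


# Hypothetical paths: the suffix requires the agreement value set by the prefix.
good = ["@P.AGR.A@", "pfxA-", "root", "@R.AGR.A@", "-sfxA"]
bad = ["@P.AGR.B@", "pfxB-", "root", "@R.AGR.A@", "-sfxA"]
print(accept_path(good))
print(accept_path(bad))
```

In a compiled transducer the same check happens at lookup time, so the prefix and suffix lexicons do not have to be duplicated for every compatible combination.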
The evaluation is based on the naive coverage of the analyzer over text data and on error analysis. The results show 83.4% coverage over the Pazar Laz Corpus (PLC), 95.9% coverage over the Laz Treebank, and 71.3% coverage over texts compiled from other dialects (the Laz Corpus, LC). Error analysis of the PLC results indicates that most of the unanalyzed word forms are either Turkish words occurring in the corpus or lexical items not yet defined in the lexicon.
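As a rough illustration of the naive coverage measure, the sketch below computes the percentage of tokens that receive at least one analysis, regardless of whether that analysis is correct. It assumes coverage is counted over tokens rather than types, which the abstract does not specify. The names naive_coverage, toy_lexicon and toy_analyze are hypothetical, and the toy lookup table merely stands in for a real transducer lookup.

```python
def naive_coverage(tokens, analyze):
    """Return (coverage in percent, list of unanalyzed tokens).

    A token counts as covered if `analyze` returns at least one
    analysis for it; correctness of the analysis is not checked.
    """
    tokens = list(tokens)
    if not tokens:
        return 0.0, []
    unanalyzed = [t for t in tokens if not analyze(t)]
    coverage = 100.0 * (len(tokens) - len(unanalyzed)) / len(tokens)
    return coverage, unanalyzed


# Hypothetical stand-in for the analyzer: a lookup table of placeholder words.
toy_lexicon = {"w1": ["w1+N"], "w2": ["w2+V"], "w3": ["w3+N"]}


def toy_analyze(token):
    return toy_lexicon.get(token, [])


coverage, missed = naive_coverage(["w1", "w2", "w3", "w4"], toy_analyze)
print(coverage, missed)
```

The list of unanalyzed tokens is the natural starting point for error analysis, for example sorting the misses into foreign words and missing lexicon entries.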
 
            

