Linguistic features in Turkish word representations


Distributed word representations learned with unsupervised methods are employed in many Natural Language Processing (NLP) tasks and have led to state-of-the-art results for many languages. Previous studies report that word representations encode morphological and semantic information, and other work proposes word representations that handle morphological and syntactic information better. However, studies that evaluate the quality of word representations for morphologically rich languages such as Turkish are limited. In this study, we explore the syntactic and morphological information captured by Turkish word representations learned with the skip-gram method on a large corpus. To assess the quality of the information found in relations between Turkish word embeddings, we perform an analogical reasoning task using pairs consisting of root words and their inflected or derived forms. We contribute detailed experiments and show that word embeddings trained with the skip-gram method differ in how well they capture information across inflection and derivation groups in Turkish. We make the test sets and word embeddings publicly available for further research.
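The analogical reasoning task over root and inflected word pairs can be illustrated with a minimal sketch. It assumes the common vector-offset formulation (the answer is the word closest in cosine similarity to a* - a + b), which the abstract does not specify. The toy three-dimensional vectors, the example words, and the helper names `solve_analogy` and `accuracy` are hypothetical and are not the released embeddings or test sets.

```python
import numpy as np

# Toy 3-dimensional embeddings, invented for illustration only.
# The third dimension plays the role of a "plural" direction.
EMB = {
    "ev": np.array([1.0, 0.0, 0.0]),        # house
    "evler": np.array([1.0, 0.0, 1.0]),     # houses
    "kitap": np.array([0.0, 1.0, 0.0]),     # book
    "kitaplar": np.array([0.0, 1.0, 1.0]),  # books
    "masa": np.array([0.5, 0.5, 0.0]),      # table (distractor)
}


def solve_analogy(a, a_star, b, emb):
    """Return the word whose vector is closest (by cosine) to a_star - a + b.

    The three question words are excluded from the candidates, as is
    customary for this kind of evaluation (an assumption here).
    """
    target = emb[a_star] - emb[a] + emb[b]
    target = target / np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word, vec in emb.items():
        if word in (a, a_star, b):
            continue
        score = float(vec @ target / np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


def accuracy(questions, emb):
    """Fraction of (a, a_star, b, b_star) questions answered correctly."""
    correct = sum(
        solve_analogy(a, a_star, b, emb) == b_star
        for a, a_star, b, b_star in questions
    )
    return correct / len(questions)


questions = [
    ("ev", "evler", "kitap", "kitaplar"),
    ("kitap", "kitaplar", "ev", "evler"),
]
print(solve_analogy("ev", "evler", "kitap", EMB))
print(accuracy(questions, EMB))
```

Grouping such questions by suffix type (for example, one group per inflectional or derivational suffix) and reporting accuracy per group would give the kind of per-group comparison the study describes.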

2017 25th Signal Processing and Communications Applications Conference (SIU)