Onur Güngör

PhD Student

Boğaziçi University

Welcome

I received my PhD from the Computer Engineering Department at Boğaziçi University. My research focuses on named entity recognition for morphologically rich languages [1, 2, 3], but I also write papers about explaining NLP predictions [4], compiling interesting corpora [5], and correcting annoying spelling errors [6].

I also work as a senior data scientist leading a small team at sahibinden.com developing systems that solve business problems using machine learning methods. For details of my industrial experience, please refer to my LinkedIn profile.

Interests

  • Named entity recognition
  • Morphologically rich languages
  • Morphological disambiguation
  • Games with a purpose

Education

  • PhD in Computer Eng., 2020

    Boğaziçi University

  • MS in Computer Eng., 2009

    Boğaziçi University

  • BS in Computer Eng., 2006

    Boğaziçi University

News

  • I presented my PhD defense on 19 February 2021 and was awarded the PhD title!
  • We received a lot of media attention for our latest online experiment (BOUN Newsletter, Milliyet Pazar)!
  • We are running an experiment that compares human and machine attention when correcting spelling errors related to the "-de/-da" clitics in Turkish. Turkish speakers are invited to participate and to try out our error correction model on new sentences.

Recent Publications

EXSEQREG: Explaining sequence-based NLP tasks with regions with a case study using morphological features for named entity recognition

The state-of-the-art systems for most natural language engineering tasks employ machine learning methods. Despite the improved performances of these systems, there is a lack of established methods for assessing the quality of their predictions. This work introduces a method for explaining the predictions of any sequence-based natural language processing (NLP) task implemented with any model, neural or non-neural. Our method named EXSEQREG introduces the concept of region that links the prediction and features that are potentially important for the model. A region is a list of positions in the input sentence associated with a single prediction. Many NLP tasks are compatible with the proposed explanation method as regions can be formed according to the nature of the task. The method models the prediction probability differences that are induced by careful removal of features used by the model. The output of the method is a list of importance values. Each value signifies the impact of the corresponding feature on the prediction. The proposed method is demonstrated with a neural network based named entity recognition (NER) tagger using Turkish and Finnish datasets. A qualitative analysis of the explanations is presented. The results are validated with a procedure based on the mutual information score of each feature. We show that this method produces reasonable explanations and may be used for i) assessing the degree of the contribution of features regarding a specific prediction of the model, ii) exploring the features that played a significant role for a trained model when analyzed across the corpus.
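The core idea in the abstract above, scoring a feature by how much the prediction probability changes when that feature is removed inside a region, can be illustrated with a minimal sketch. This is not the EXSEQREG implementation: the function `feature_importances`, the stand-in model `toy_predict_proba`, the feature names `case` and `proper`, and the example sentence are all hypothetical, and the sketch reports raw probability differences rather than the modeling step the paper applies to them.

```python
from typing import Callable, Dict, List, Sequence

# One token is a mapping from feature name to feature value (an assumption of this sketch).
Token = Dict[str, str]


def feature_importances(
    predict_proba: Callable[[Sequence[Token]], float],
    sentence: Sequence[Token],
    region: Sequence[int],
    feature_names: Sequence[str],
) -> Dict[str, float]:
    """Return, for each feature, the probability drop caused by removing it inside the region."""
    base = predict_proba(sentence)
    region_set = set(region)
    scores: Dict[str, float] = {}
    for name in feature_names:
        # Copy the sentence, dropping the feature only at positions that belong to the region.
        perturbed: List[Token] = [
            {k: v for k, v in tok.items() if not (i in region_set and k == name)}
            for i, tok in enumerate(sentence)
        ]
        scores[name] = base - predict_proba(perturbed)
    return scores


def toy_predict_proba(sentence: Sequence[Token]) -> float:
    """Hypothetical stand-in for a trained tagger: probability of a location entity."""
    score = 0.2
    for tok in sentence:
        if tok.get("case") == "Loc":
            score += 0.5
        if tok.get("proper") == "Yes":
            score += 0.2
    return min(score, 1.0)


# Hypothetical two-token sentence; the region is the single position 0.
sentence = [
    {"surface": "Ankara'da", "case": "Loc", "proper": "Yes"},
    {"surface": "yaşıyorum", "case": "None", "proper": "No"},
]
importances = feature_importances(
    toy_predict_proba, sentence, region=[0], feature_names=["case", "proper"]
)
for name, value in importances.items():
    print(f"{name}: {value:.2f}")
```

In this toy setting the `case` feature receives a larger importance than `proper`, because removing it lowers the stand-in model's probability more. Any model that exposes a prediction probability could be plugged in as `predict_proba`, which matches the model-agnostic framing of the abstract.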

Detecting Clitics Related Orthographic Errors in Turkish

For the spell correction task, vocabulary-based methods have been replaced with methods that take morphological and grammar rules into account. However, such tools are fairly immature and, worse, non-existent for many low-resource languages. Checking only if a word is well-formed with respect to the morphological rules of a language may produce false negatives due to the ambiguity resulting from the presence of numerous homophonic words. In this work, we propose an approach to detect and correct the “de/da” clitic errors in Turkish text. Our model is a neural sequence tagger trained with a synthetically constructed dataset consisting of positive and negative samples. The model’s performance with this dataset is presented according to different word embedding configurations. The model achieved an F1 score of 86.67% on a synthetically constructed dataset. We also evaluated the model on a manually curated dataset of challenging samples, where it proved superior to other spelling correctors, with 71% accuracy compared to 34% for the second-best (Google Docs).