Research Projects

 

·      Universality, Diversity and Idiosyncrasy in Language Technology (UniDive)

 

o  COST Action CA21167, 2022. (Secondary Proposer) (continuing)

 

o  Efficient access to the constantly growing quantities of data, especially of language data, largely relies on advances in data science. This domain includes natural language processing (NLP), which is currently booming, to the benefit of many end users. However, this optimization-based technological progress poses an important challenge: accounting for and fostering language diversity. The UniDive Action takes two original stands on this challenge. Firstly, it aims at embracing both inter- and intra-language diversity, i.e. a diversity understood both in terms of the differences among the existing languages and of the variety of linguistic phenomena exhibited within a language. Secondly, UniDive does not assume that linguistic diversity is to be protected against technological progress but strives for both of these aims jointly, to their mutual benefit. Its approach is to: (i) pursue NLP-applicable universality of terminologies and methodologies, (ii) quantify inter- and intra-linguistic diversity, (iii) boost and coordinate universality- and diversity-driven development of language resources and tools. UniDive builds upon previous experience of European networks which provided a proof of concept for language modelling and processing, unified across many languages but preserving their diversity. The main benefits of the action will include, on the theoretical side, a better understanding of language universals, and on the practical side, language resources and tools covering, in a unified framework, a bigger variety of language phenomena in a large number of languages, including low-resourced and endangered ones.

 

·     Literary Machine Translation to Produce Translations that Reflect Translators’ Style and Generate Retranslations

(Edebi Makine Çevirisi Yoluyla Çevirmenlerin Üslubunu Yansıtan Çeviriler Üretme ve Yeniden Çeviriler Oluşturma)

 

o  Tübitak 1001 Research Project 121K221, 2021. (Fund: 504,207 TL) (Researcher) (continuing)

 

·      Linguistics-supported Turkish Natural Language Processing Platform

(Dilbilim Temelli Türkçe Doğal Dil İşleme Platformu)

 

o  Boğaziçi University Research Fund 16909, Boğaziçi University, İstanbul, 2022. (Fund: 193,836 TL) (Project Leader)

o  The proposed project has two main goals. The first is to improve the Turkish natural language processing resources, developed mainly by researchers in the project team, in line with current technology. The second is to compile all of these resources in an open platform and thus contribute to new research. Within the university, there are many resources developed by researchers working on natural language processing. In this project, these resources will be collected, improved with the support of linguistics researchers, and combined into a natural language processing platform. The platform will be open source and open to further development. In this way, the resources required for Turkish natural language processing will be made available to researchers.

 

·      Language in the Human-Machine Era (LITHME)

 

o  COST Action CA19102, 2020. (Secondary Proposer) (continuing)

 

o  "Within the next 10 years, many millions of people will be … wearing relatively unobtrusive ... devices that offer an immersive and high-resolution view of a visually augmented world" (Perlin 2016: 85). This is the ‘human-machine era’, a time when our senses are not just supplemented by handheld mobile devices, but thoroughly augmented. The language we see, hear and produce will be mediated in real time by technology. This has major implications for language use, and ultimately language itself. Are linguists ready for this? Can our theory, methods, and epistemology handle it?

 

LITHME has two aims: to prepare linguistics and its subdisciplines for what is coming; and to facilitate longer term dialogue between linguists and technology developers. How will pervasive augmentation technology affect language in areas such as international law, translation, and other forms of language work? What will this mean for how people identify with specific languages? Could increasing reliance on real-time language technologies actually change the structure of language? Longer term, could developments in brain-machine interfaces serve to complement or even supersede language altogether? Linguistics would be far stronger for robust technological foresight, while developers would benefit from better understanding potential linguistic and societal consequences of their creations.

 

Meanwhile LITHME would shine a light on the ethical implications of emerging language technologies. Inequality of access to technologies, questions of privacy and security, new vectors for deception and crime; these and other critical issues would be kept to the fore.

 

LITHME would equip linguists and stakeholders for the human-machine era.

 

·      Named Entity Recognition System based on Deep Neural Networks

 

o  Boğaziçi University Research Fund, Boğaziçi University, İstanbul, 2019. (Fund: 31,553 TL) (Researcher)

 

·     Compiling a Verbal Multiword Expression Corpus for Turkish and Developing a Multilingual Deep Learning based System for Verbal Multiword Expression Identification

(Türkçe Çok Sözcüklü Fiil İfadeleri Derlemi Oluşturma ve Farklı Dillerdeki Çok Sözcüklü Fiil İfadelerini Saptayan Derin Öğrenme Tabanlı bir Sistem Geliştirme)

 

o  Boğaziçi University Research Fund 18A01P4, Boğaziçi University, İstanbul, 2019. (Fund: 35,899 TL) (Project Leader)

o    The first aim of this project is to clarify the definitions of Turkish verbal multiword expressions (VMWEs) and to compile a text corpus. For this purpose, we will start with the corpus developed for the PARSEME Shared Task 1.0 and then update and expand it. The VMWEs in the corpus will be labeled according to the published guidelines. The text corpus consists of newspaper articles on politics, world news, everyday life, and art. The resulting corpus will be published at PARSEME Shared Task 1.1 and will be a valuable resource for Turkish natural language processing studies, including syntactic parsing, machine translation, and n-gram language modeling. The second and main aim is to use gold-standard corpora for different languages to develop a VMWE identification system that is based on deep learning techniques and suited to multilingual settings. Processing multiword expressions is an important challenge in natural language processing, and the use of deep learning frameworks for this task is relatively new. We will try to improve on the VMWE identification accuracy of the multilingual systems in the literature.
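As a rough illustration of how such span annotations can feed a sequence-labelling model, the sketch below converts VMWE spans into per-token IOB-style tags; the tag scheme, function name, and toy example are assumptions for illustration, not the project's actual code or data.

```python
# Illustrative sketch: converting VMWE span annotations into per-token
# IOB-style tags for a sequence-labelling model. The tag scheme and the
# toy example are assumptions, not the project's actual code or data.

def spans_to_iob(tokens, vmwe_spans):
    """vmwe_spans: list of (token_index_list, category) pairs."""
    tags = ["O"] * len(tokens)
    for indices, category in vmwe_spans:
        first, *rest = sorted(indices)
        tags[first] = "B-" + category
        for i in rest:
            tags[i] = "I-" + category
    return tags

if __name__ == "__main__":
    # Toy sentence with a light-verb construction ("soz verdi" = "promised").
    tokens = ["O", "bana", "soz", "verdi"]
    spans = [([2, 3], "LVC")]
    print(list(zip(tokens, spans_to_iob(tokens, spans))))
```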

 

·     Developing a Comprehensive Sentiment Analysis Framework for Turkish

(Türkçe için Kapsamlı bir Duygu Analizi Çatısı Oluşturulması)

 

o  Boğaziçi University Research Fund 18A01D2, Boğaziçi University, İstanbul, 2019. (Fund: 7,044 TL) (Project Leader)

o    Sentiment analysis is the process of extracting the sentiment (positive, negative, neutral) in texts using natural language processing and machine learning techniques. This process has five components: the sentiment, the person holding the sentiment, the time period of the sentiment, the object of the sentiment, and the aspect of the object it relates to. There is no sentiment analysis work for Turkish that takes all five components into account. In this project, we will build a sentiment analysis framework for Turkish. The work conducted for other languages cannot be adapted to Turkish easily. Moreover, most sentiment analysis studies for Turkish employ supervised methods and do not use unsupervised or semi-supervised methods, which poses problems when labeled data are limited. Turkish studies also usually yield less satisfactory results than those for other languages. Other open issues are forming sentiment vectors and building an aspect-based sentiment analysis system. In this project, a framework for Turkish sentiment analysis will be developed that incorporates the polarity, its score, the related object, its time, and its holder. For this purpose, neural networks and unsupervised/semi-supervised methods will be used. Finally, domain-specific sentiment lexicons will be built.
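A minimal sketch of the record implied by these components is shown below; the class and field names are illustrative assumptions rather than the project's actual schema.

```python
# Minimal sketch of a sentiment record covering the components listed above.
# Class and field names are illustrative assumptions, not the project schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SentimentRecord:
    polarity: str               # "positive", "negative", or "neutral"
    score: Optional[float]      # strength of the polarity, if available
    holder: Optional[str]       # person holding the sentiment
    time: Optional[str]         # time period of the sentiment
    target: Optional[str]       # object the sentiment is about
    aspect: Optional[str]       # aspect of that object

print(SentimentRecord("negative", 0.8, "a reviewer", "2019",
                      "the phone", "battery life"))
```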

 

·     A Deep Learning based Turkish Dependency Parser

(Derin Öğrenme Tabanlı Türkçe Bağlılık Ayrıştırıcısı)

 

o  Tübitak 1005 Research Project 117E971, 2013-2015. (Fund: 201,802 TL) (Researcher)

o    The purpose of this study is to develop a novel, state-of-the-art dependency parsing system for Turkish. Dependency parsing is a crucial step for many natural language processing (NLP) systems such as question answering, machine translation, and information extraction. Turkish, being a morphologically rich language with a complex grammar, is challenging for automatic processing, and the limited number of NLP tools and resources for Turkish makes the task even harder.

With the help of deep learning, which has recently become very successful in natural language processing, we built a novel, state-of-the-art Turkish dependency parser. The amount of data used to train a data-driven dependency parser directly affects its performance, and deep learning-based systems are observed to need large amounts of data to perform well. As a solution to this challenge, the project contributed in two main areas. First, Turkish grammar rules and the morphological features of words were integrated into the deep learning model. Thus, despite the limited training data, the proposed hybrid deep learning model achieved higher accuracy than existing methods. A downloadable software program and an online tool have also been created to make this hybrid deep learning based parser available. Many language processing applications that need dependency parsing will be able to achieve better performance for Turkish thanks to this parser.

The second main contribution of the project is increasing the quantity and quality of data by creating the largest corpus for Turkish dependency parsing, consisting of 20,000 annotated sentences. To create the corpus, the IMST-UD and PUD treebanks were manually reannotated by following the grammar rules consistently, and 13,500 newly annotated sentences were added. This new treebank, which is publicly available, will be a basis for many new natural language processing studies in Turkish.
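For context, the sketch below reads head indices from CoNLL-U-formatted treebank text (the layout used by UD treebanks such as IMST-UD and PUD) and computes the unlabeled attachment score of a predicted parse; it is an illustrative stand-in, not the project's evaluation code.

```python
# Illustrative sketch: reading head indices from CoNLL-U text and computing
# unlabeled attachment score (UAS). Not the project's actual evaluation code.

def read_heads(conllu_text):
    """Return one list of gold head indices per sentence."""
    sentences, heads = [], []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            if heads:
                sentences.append(heads)
                heads = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if "-" not in cols[0] and "." not in cols[0]:  # skip range/empty ids
                heads.append(int(cols[6]))                 # 7th column = head id
    if heads:
        sentences.append(heads)
    return sentences

def uas(gold_sents, pred_sents):
    pairs = [(g, p) for gs, ps in zip(gold_sents, pred_sents)
             for g, p in zip(gs, ps)]
    return sum(g == p for g, p in pairs) / len(pairs) if pairs else 0.0
```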

 

·      Parsing and Multi-word Expressions: Towards Linguistic Precision and Computational Efficiency in Natural Language Processing (PARSEME)

 

o  COST Action IC1207, 2013-2017. (Secondary Proposer)

 

o  The main objective of the Action is to increase and enhance the ICT support of the European multilingual heritage by bringing about substantial progress in the understanding and modelling of Multi-Word Expressions (MWEs) within advanced multilingual Natural Language Processing (NLP) techniques, notably deep parsing.

 

·     Developing an Adaptive Question Answering System Enabling Primary and Secondary Education Students to Access Accurate and Reliable Information

(İlköğretim ve Ortaöğretim Öğrencilerinin Doğru ve Güvenilir Bilgiye Erişimlerine Yönelik Uyarlamalı bir Soru Cevaplama Sisteminin Geliştirilmesi)

 

o  Tübitak 1003 Research Project 113E036 (FATIH Project), 2013-2015. (Fund: 566,783 TL) (Project Leader)

o    The number of information sources and the amount of information available in electronic environments and on the Web are increasing steadily. Users meet their information needs on these resources using technologies such as search engines. However, such technologies may fall short because they are general purpose, are not tailored to a particular domain, and make only very limited use of natural language processing methods.

 

In this project, a question answering system was developed that meets students’ information needs (questions) using reliable resources, enables students to use resources in foreign languages, and presents the results in the form of a combined and coherent summary. The theoretical side of the project draws on natural language processing, question answering, text summarization, and machine translation formalisms. The system consists of the following phases: students ask questions in natural language (Turkish); the questions are analyzed using natural language processing techniques; resources that can answer the questions are identified; foreign-language resources are translated into Turkish; and the relevant resources are analyzed, combined, and summarized according to the requested information need. In this respect, the project aimed to give students question-driven access to accurate and reliable information through an environment specialized for answering their information needs.
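Read as a pipeline, the phases above could be wired together roughly as in the sketch below; every function here is a hypothetical stub used only to show the flow of data, not code from the project.

```python
# Hypothetical outline of the described pipeline; all functions are stubs
# that only illustrate the flow of data between the phases.

def analyze_question(question_tr):        # NLP analysis of the Turkish question
    return {"keywords": question_tr.lower().split()}

def retrieve_documents(analysis):         # identify reliable resources
    return ["document mentioning " + kw for kw in analysis["keywords"]]

def translate_to_turkish(documents):      # handle foreign-language resources
    return documents                      # stub: assume already in Turkish

def summarize(documents):                 # combine into one coherent summary
    return " ".join(documents)

def answer(question_tr):
    analysis = analyze_question(question_tr)
    documents = translate_to_turkish(retrieve_documents(analysis))
    return summarize(documents)

print(answer("Su dongusu nedir?"))
```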

 

·     Developing Concept Mining Methods for Turkish Document Analysis

(Türkçe Metinlerin Analizine Yönelik Olarak Kavram Madenciliği Yöntemlerinin Geliştirilmesi)

 

o  Tübitak 1001 Research Project 110E162, 2011. (Fund: 115,301 TL) (Project Leader)

o    In this project, funded by the Scientific and Technological Research Council of Turkey, research was conducted to extract concepts from Turkish documents using natural language processing algorithms and the Turkish Language Institution (TDK) dictionary. Concept mining is a process that aims at extracting expressive concepts from textual, visual, or audio artifacts. Although methodologies have been developed for this purpose in many Western languages, mainly English, no successful work exists for Turkish in this domain so far. The project covered six main areas: 1. Compiling the documents and preparing the corpora to be used for concept mining; 2. Obtaining the terms from these documents through preprocessing steps; 3. Weighting the terms in the documents that are likely to be concepts using the TDK dictionary; 4. Extracting the concepts according to the weights of the terms; 5. Clustering the documents based on the extracted concepts; 6. Developing a concept-based search engine. The outputs of the project are theses, technical reports, scientific papers, databases, and software programs.
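As a rough illustration of steps 3 and 4 (weighting candidate terms and selecting concepts), the sketch below uses a plain TF-IDF weighting as a stand-in; the project's TDK-dictionary-based weighting is not modelled here.

```python
# Rough illustration of term weighting and concept selection (steps 3-4).
# Plain TF-IDF is used as a stand-in for the TDK-dictionary-based weighting.
import math
from collections import Counter

def top_concepts(documents, top_k=3):
    doc_terms = [doc.lower().split() for doc in documents]
    df = Counter(term for terms in doc_terms for term in set(terms))
    n_docs = len(documents)
    result = []
    for terms in doc_terms:
        tf = Counter(terms)
        weight = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        result.append(sorted(weight, key=weight.get, reverse=True)[:top_k])
    return result

docs = ["ekonomi piyasa enflasyon faiz artisi",
        "sanat sergi resim muze gezisi",
        "ekonomi faiz karari merkez bankasi"]
print(top_concepts(docs))
```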

 

·     Design, Development and Evaluation of a Customized Web-Based Portfolio System to Improve Preservice Teachers’ Reflective Skills

(Hizmet Öncesi Öğretmenlerinin Yansıtma Becerilerini Geliştirmek için Özelleştirilmiş Web Tabanlı Portfolyo Sistemi Tasarımı, Geliştirilmesi ve Değerlendirilmesi)

 

o  Tübitak 3501 Research Project, 2012. (Advisor)

o    This study aims to design, develop, and evaluate a customized web-based portfolio system. The system will be integrated into teaching practicum courses and primarily used for improving preservice teachers’ reflective skills. The web-based system will provide an effective platform both for preservice teachers to upload their course assignments and for the course instructors to follow up on their students’ work. In addition, the system will allow instructors to structure their students’ teaching experience around the notion of reflection, which is one of the most important skills that preservice teachers ought to develop.

Research in teacher education maintains that reflection is one of the most important skills that preservice teachers need to develop (Braun & Crumpler, 2004; Freese, 1999, 2006; Loughran, 2002; Wedman & Martin, 1986; Willard-Holt & Bottomley, 2000; Zeichner & Liston, 1996). Teaching portfolios are recognized to be the most effective tools for improving preservice teachers’ reflective skills. Electronic portfolios, which offer several advantages over paper-based portfolios (e.g., easy access, editing, and sharing), are now becoming a new trend in higher education.

As the proposed system will be web-based, it will offer students various opportunities, such as being able to: (1) access their work anytime from anywhere, (2) receive and provide continuous feedback, and (3) revise and edit their documents based on the feedback. Thus, the portfolio assignments will become open learning documents that change continuously and are visible to more people, rather than documents that are completed at once and shared only with the course instructor. The portfolio system will also benefit instructors by letting them follow up on teacher candidates’ work easily and efficiently.

The existing electronic portfolio systems show a wide range, varying from commercial systems (content management systems that include electronic portfolio modules) to open-source electronic portfolio software (Stefani, Mason, & Pegler, 2007). In the previous studies carried out by the principal investigator, some of these portfolio systems were used as portfolio tools. The results of these studies showed that tasks developed around the theoretical framework helped improve preservice teachers’ reflection skills. In addition, they also revealed the need for a more user-friendly electronic portfolio system, designed especially for improving preservice teachers’ reflection skills.

The e-portfolio reflection tasks will be constructed around the model developed by Fuller and Bown (1975).  In this model, a teacher’s life space includes three basic elements: (1) goals, (2) experiences, and (3) observations of themselves, or others.  Fuller and Bown stated that for beginning teachers there will be discrepancies among these three components, and the experience of becoming a teacher entails coping with all the discrepancies.  In order to cope with such differences, preservice teachers first need to become aware of such inconsistencies among the three constituent elements. For this reason, the reflection tasks of this study will require preservice teachers to consider all three elements of their teacher life space and to identify the discrepancies among them after each teaching experience cycle.

The development and evaluation of the customized web-based portfolio system involve the processes of design, programming, implementation, and evaluation. During the design and programming phase, the aim is to develop a system that is easily accessible and user-friendly.  In addition, the system will be designed in a way that will allow structuring of preservice teachers’ teaching practice assignments using certain scripts. For example, to complete a lesson plan assignment, preservice teachers will be able to edit a page that includes the components of an expected lesson plan.  In the design of the system, the main goal is to create a platform that will easily allow preservice teachers to upload and edit their artifacts (such as lesson plans or teaching practice videos), to give/receive feedback to/from their peers and instructors, and to complete their reflection tasks.

In the implementation and evaluation phases of the project, the usability of the web-based portfolio system and its contribution to the development of preservice teachers’ reflective skills will be examined. Data will be collected by using the customized web-based portfolio system within teaching practicum courses. The sample of the study will include preservice teachers who take a teaching practicum course in the final semester of their program in the Faculty of Education. The main data collection sources are preservice teachers’ responses to the reflection tasks, which also comprise the content of the web-based portfolios, as well as surveys and semi-structured interviews. The change in preservice teachers’ reflective skills will be examined using mixed-method analysis of the data, that is, with quantitative and qualitative analyses. The data will be coded in an iterative manner with respect to Fuller and Bown’s (1975) model. The frequency count of each theme for each participant will be calculated, and the statistical analyses will be conducted after the qualitative data are quantified in this manner. The usability of the system will be evaluated through qualitative analyses of the data coming mainly from surveys and interviews.

 

·     Concept Extraction from Turkish Texts by Automatic Methods

(Türkçe Metinlerden Otomatik Yöntemlerle Kavram Çıkarılması)

 

o  Boğaziçi University Research Fund 5187, Boğaziçi University, İstanbul, 2010. (Fund: 9,456.83 TL) (Project Leader)

o    Concept extraction is a subtopic of concept mining, which forms an important branch of data mining. Concept mining can be defined as the study of extracting the important concepts that appear in documents. A basic point in concept mining is processing the words to obtain the concepts; thesauri are usually used during this process. Word-concept matching is usually ambiguous, and context is used to resolve the ambiguity. The relationships between the concepts and the context are extracted using semantic similarity. Formal concept analysis, which makes this relationship explicit, is another important topic. Concept extraction aims at obtaining efficient solutions to some problems that are harder to solve with data mining alone.

The two basic approaches to concept extraction are the expert-based approach and the statistical approach. The first, also called the rule-based or information engineering approach, relies on a set of pattern matching rules compiled by experts in the field; its main disadvantage is the difficulty of finding such experts. The second, the statistical approach, is also known as the automatic learning approach. It makes use of statistical information gathered from available corpora for learning, mostly with Hidden Markov Models (HMMs), whose transition probabilities are estimated from the training data. Its advantage is domain portability; its disadvantages are the difficulty and cost of building corpora, the need to retrain the system for different concepts, and slower execution.

Our goal in this project is to build an automated concept extraction system for Turkish. Since it is more appropriate for the model we build, we will use the statistical approach rather than the expert-based approach. The PASW software developed by SPSS for data mining is a successful concept extraction application for English and other widely studied languages. In this project, we aim to implement for Turkish the functionality of the PASW Text Analytics module. The system will work as follows: first, ‘noun’ words will be extracted from the documents; then sub-dictionary groups will be formed by clustering similar words; these sub-dictionaries will be labeled manually; and finally data mining techniques will be applied to these concepts. For extracting words from documents and disambiguating them, the morphological parser and the morphological disambiguator developed in the scope of the Boğaziçi University BAP 08M103 and TÜBİTAK 107E261 projects will be used. The cosine similarity metric will be employed for discovering term similarities. To group similar terms, we will use k-means or a related clustering approach, and the resulting term groups will be labeled manually. In this way, word-concept matching will be obtained, and the concept extraction phase will be completed after labeling. Finally, interesting information will be extracted from the concepts by employing data mining methods.
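A minimal sketch of the cosine-similarity and k-means steps mentioned above is given below, using toy term vectors and scikit-learn's KMeans; the vectors and the number of clusters are assumptions for illustration, not the project's data.

```python
# Minimal sketch of the cosine-similarity and k-means grouping steps described
# above, using toy term vectors. Not the project's actual code or data.
import numpy as np
from sklearn.cluster import KMeans

terms = ["banka", "faiz", "kredi", "resim", "sergi", "muze"]
vectors = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],   # finance-like contexts
                    [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]])  # art-like contexts

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vectors[0], vectors[1]))   # similar terms give a value near 1

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for term, label in zip(terms, labels):
    print(term, "-> sub-dictionary", label)
```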

 

·     Developing a General-Purpose Turkish Handwriting Recognition System using a Large Lexicon

(Büyük Sözlüklü ve Genel Amaçlı Türkçe El Yazısı Tanıma Sistemi Geliştirilmesi)

 

o  Boğaziçi University Research Fund 09A107D, Boğaziçi University, İstanbul, 2009. (Fund: 10,800 TL) (Project Leader)

o    The aim of this project is to develop algorithms for identifying patterns in periodic sequences and to apply these algorithms to the handwritten character recognition problem. Handwritten character recognition approaches are usually divided into two groups: with segmentation (general purpose) and without segmentation (special purpose). In this work, instead of these methods, it was decided to apply a cognitive approach that has recently become popular. As the short sequences obtained from each handwritten sample are combined, a periodic sequence is generated. These sequences are learned and matched using compression algorithms.

 

·     Morphology Based Language Modeling for Turkish Speech Recognition

(Biçimbilimsel Tabanlı Dil Modeli ile Türkçe Konuşma Tanıma Sistemi)

 

o  Boğaziçi University Research Fund 08M103, Boğaziçi University, İstanbul, 2008. (Fund: 8,805 YTL) (Project Leader)

o    In this project, we aimed at developing a high-performance large-vocabulary continuous speech recognition system for Turkish. The most important contribution of this work has been the development of a morphology-based language model for Turkish. In our previous work, we had built several language resources for Turkish, such as a morphological parser, a morphological disambiguator, and a web corpus. Using these resources, in this project we developed an effective morphology-based language model for Turkish. We also replaced the static lexicon with a dynamic one based on the morphological parser, which greatly alleviated the out-of-vocabulary problem for Turkish. In addition, we developed a speech decoder that can perform decoding on morphology-integrated search networks.

 

·     Morphology Based Language Modeling for Turkish Speech Recognition

(Biçimbilimsel Tabanlı Dil Modeli ile Türkçe Konuşma Tanıma Sistemi)

 

o  Tübitak 1001 Research Project 107E261, 2007. (Fund: 46,490 YTL) (Project Leader)

o    In this project, we aimed at developing a large-vocabulary continuous speech recognition system for Turkish, with the support of the Scientific and Technological Research Council of Turkey. State-of-the-art speech recognition systems are basically composed of three main components: the acoustic model, the language model, and the speech decoder. To build the acoustic model, hidden Markov models were trained using a previously compiled speech corpus. During the research on building the language model, a number of resources were produced: a morphological parser, a morphological disambiguator, and a web corpus. By parsing the web corpus with the morphological parser and then disambiguating these parses with the morphological disambiguation system, a statistical language model based on the lexical morphemes in the parses was estimated. The lexical morpheme language model was composed with the lexical transducer of the morphological parser to construct a morphology-integrated search network. A real-time speech decoder for Turkish that can perform decoding on these finite-state transducer-based networks was developed. The output of the project takes the form of a PhD thesis, technical reports, journal papers, conference papers, databases, and software programs. The databases and programs were made publicly available for research purposes.
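A toy sketch of the lexical-morpheme n-gram idea is shown below: bigram probabilities are estimated over morph sequences instead of surface words. The segmentations are invented for illustration and do not come from the project's morphological parser.

```python
# Toy sketch of a morpheme-level bigram language model. The segmentations
# below are invented for illustration, not output of the project's parser.
from collections import Counter

corpus = [["ev", "+ler", "+de", "otur", "+du", "+k"],
          ["ev", "+de", "otur", "+du", "+m"],
          ["okul", "+lar", "+da", "otur", "+du", "+k"]]

bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    seq = ["<s>"] + sent + ["</s>"]
    unigrams.update(seq[:-1])
    bigrams.update(zip(seq[:-1], seq[1:]))

def prob(next_morph, prev_morph):
    """Maximum-likelihood estimate of P(next_morph | prev_morph)."""
    if unigrams[prev_morph] == 0:
        return 0.0
    return bigrams[(prev_morph, next_morph)] / unigrams[prev_morph]

print(prob("+du", "otur"))   # 1.0 in this toy corpus
print(prob("+ler", "ev"))    # 0.5 in this toy corpus
```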

 

·     Developing Structure-preserving and Query-biased Automated Summarization Methods for Web Search Engines

(Arama Motorları için Bilgi İsteği ve Doküman Yapısı Tabanlı Otomatik Özetleme Metotlarının Geliştirilmesi)

o  Boğaziçi University Research Fund 07A106, Boğaziçi University, İstanbul, 2007. (Fund: 6,065 YTL) (Project Leader)

o    In this project, a new two-stage summarization approach was developed to improve the effectiveness of Web search. In the first stage, a rule-based approach and a machine learning approach were implemented to identify the sectional hierarchies of Web documents. In the second stage, query-biased summaries were created based on the document structure. The evaluation results show that the system yields significant improvements over unstructured summaries and Google snippets.
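A simplified sketch of the query-biased, structure-aware scoring idea follows; the overlap-based weights and the heading bonus are assumptions for illustration, not the formulas used in the project.

```python
# Simplified sketch of query-biased sentence scoring over a sectioned document.
# The overlap weights and the heading bonus are assumptions for illustration.

def query_biased_summary(sections, query, max_sentences=2):
    """sections: list of (heading, [sentences]); returns top-scoring sentences."""
    query_terms = set(query.lower().split())
    scored = []
    for heading, sentences in sections:
        heading_bonus = len(query_terms & set(heading.lower().split()))
        for sentence in sentences:
            overlap = len(query_terms & set(sentence.lower().split()))
            scored.append((overlap + 0.5 * heading_bonus, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for score, sentence in scored[:max_sentences] if score > 0]

document = [("Installation", ["Download the package.", "Run the setup script."]),
            ("Usage", ["Start the server.", "Open the web interface."])]
print(query_biased_summary(document, "run the setup"))
```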

 

·     Morphotactic based Statistical Language Modeling for Large Vocabulary Continuous Speech Recognition Systems

(Geniş Sözcük Dağarcıklı Sürekli Konuşma Tanıma Sistemleri için Morfotaktik Tabanlı İstatistiki Dil Modelleme)

 

o  Boğaziçi University Research Fund 06A102, Boğaziçi University, İstanbul, 2006. (Fund: 14,405 YTL) (Project Leader)

o    In this project, we aim to develop a new language model for large vocabulary continuous speech recognition (LVCSR) systems in agglutinative languages such as Turkish. Since an unlimited number of words can be generated in agglutinative languages, building a language model for speech recognition is difficult, and the lack of a good language model significantly limits the effectiveness of these systems. One of the most important reasons why speech recognition systems have been developed successfully for relatively non-agglutinative languages such as English, while the same success has not yet been achieved for languages like Turkish, is the lack of an effective language model. Speech recognition systems commonly use n-gram language models, which model the language statistically: the model is built from a large text corpus by taking into account how frequently words follow one another, and it is used to compute the probabilities of word sequences. Another problem in Turkish is the relatively free word order within sentences. Combined with the agglutinative nature of Turkish, this reduces the effectiveness of a simple word-based n-gram language model in speech recognition systems. This work aims to build an effective language model by combining the morphotactic information of Turkish (the ordering rules of its morphs) with an n-gram language model. In this way, large vocabulary speech recognition systems with many application areas can be developed for Turkish.
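For reference, the n-gram approximation referred to above can be written as follows over a morph sequence m_1, ..., m_T (the notation is chosen here for illustration, not taken from the project report):

```latex
% n-gram approximation over a morph sequence; notation chosen for illustration.
P(m_1,\dots,m_T) \;=\; \prod_{t=1}^{T} P(m_t \mid m_1,\dots,m_{t-1})
\;\approx\; \prod_{t=1}^{T} P(m_t \mid m_{t-n+1},\dots,m_{t-1})
```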

 

·     Developing Natural Language Processing-based Methods for Text Classification

(Metin Sınıflandırmada Doğal Dil İşleme Tabanlı Metotların Geliştirilmesi)

 

o  Boğaziçi University Research Fund 05A103, Boğaziçi University, İstanbul, 2005. (Fund: 10,700 YTL) (Project Leader)

o    In this project, natural language processing techniques will be applied to the text categorization problem. Many studies on text categorization are being carried out today, and some of them have practical applications. However, the success rates of these studies cannot exceed a certain limit. The main reason is that almost all of them use only syntactic information and do not exploit semantic information; in other words, the words in the texts are handled independently of their meanings. In this project, new methods will be proposed to overcome this shortcoming and to classify texts by taking their meaning into account.

 

·     Developing Dynamic and Adaptive Methods for Turkish Spam Filtering

(Türkçe Spam Mesajların Filtrelenmesine Yönelik Dinamik ve Uyarlamalı Metotların Geliştirilmesi)

 

o  Boğaziçi University Research Fund 04A101, Boğaziçi University, İstanbul, 2004. (Fund: 6,600,000,000 TL) (Project Leader)

o    In this project, anti-spam filtering methods for Turkish will be developed to block spam e-mail messages. Today, spam makes up about 10% of all e-mail messages and causes significant time loss for users. Filtering algorithms exist for widespread languages such as English, but no such work has yet been done for Turkish messages. Such work must take the complex morphological structure of Turkish into account. The methods to be developed in this project will be dynamic and will be based on artificial neural network and Bayesian network techniques. The proposed algorithms are expected to contain two main components: a morphology module that performs morphological analysis of message contents, and a learning module that classifies messages as legitimate or spam.
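As a minimal illustration of the learning-module side, the sketch below trains a naive Bayes classifier, a very simple member of the Bayesian family mentioned above; the toy messages and the whitespace tokenization standing in for the morphology module are assumptions for illustration.

```python
# Minimal naive Bayes sketch of the described learning module. Whitespace
# tokenization stands in for the morphology module; the toy data is invented.
import math
from collections import Counter

def train(messages):
    """messages: list of (text, label) pairs with label in {"spam", "normal"}."""
    counts = {"spam": Counter(), "normal": Counter()}
    priors = Counter()
    for text, label in messages:
        priors[label] += 1
        counts[label].update(text.lower().split())
    return counts, priors

def classify(text, counts, priors):
    vocabulary = set(counts["spam"]) | set(counts["normal"])
    scores = {}
    for label in counts:
        total = sum(counts[label].values())
        score = math.log(priors[label] / sum(priors.values()))
        for word in text.lower().split():   # add-one smoothed log-likelihoods
            score += math.log((counts[label][word] + 1) / (total + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

data = [("bedava kampanya kazandiniz", "spam"),
        ("toplanti yarin saat onda", "normal"),
        ("bedava hediye icin tikla", "spam"),
        ("rapor ekte bilginize", "normal")]
counts, priors = train(data)
print(classify("bedava hediye kazandiniz", counts, priors))   # -> "spam"
```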

 

·     Statistical Analysis of Turkish

(Türkçe'nin İstatistiksel İncelenmesi)

 

o  Boğaziçi University Research Fund 02A107, Boğaziçi University, İstanbul, 2002-2003. (Fund: 6,350,000,000 TL) (Project Leader)

o    In this project, the statistical processing of natural languages, a relatively new approach, will be applied to Turkish. Research on this topic is being carried out for some widespread languages, but no such work exists yet for Turkish. For this purpose, a comprehensive literature survey will be carried out by the project members. Based on this survey, an infrastructure for the statistical processing of Turkish will be established and a program will be developed. The program will be designed, implemented, and tested.