Research Projects
· Universality, Diversity and Idiosyncrasy in Language Technology (UniDive)
o COST Action CA21167, 2022. (Secondary Proposer) (continuing)
o Efficient access to the constantly growing quantities
of data, especially of language data, largely relies on advances in data
science. This domain includes natural language processing (NLP), which is
currently booming, to the benefit of many end users. However, this
optimization-based technological progress poses an important challenge:
accounting for and fostering language diversity. The UniDive Action takes two
original stands on this challenge. Firstly, it aims at embracing both inter-
and intra-language diversity, i.e. a diversity understood both in terms of the
differences among the existing languages and of the variety of linguistic
phenomena exhibited within a language. Secondly, UniDive does not assume that
linguistic diversity is to be protected against technological progress but
strives for both of these aims jointly, to their mutual benefit. Its approach
is to: (i) pursue NLP-applicable universality of terminologies and
methodologies, (ii) quantify inter- and intra-linguistic diversity, (iii) boost
and coordinate universality- and diversity-driven development of language
resources and tools. UniDive builds upon previous experience of European
networks which provided a proof of concept for language modelling and
processing, unified across many languages but preserving their diversity. The
main benefits of the action will include, on the theoretical side, a better
understanding of language universals, and on the practical side, language
resources and tools covering, in a unified framework, a bigger variety of
language phenomena in a large number of languages, including low-resourced and
endangered ones.
· Literary Machine Translation to Produce Translations that Reflect Translators' Style and Generate Retranslations
(Edebi Makine Çevirisi Yoluyla Çevirmenlerin Üslubunu Yansıtan Çeviriler Üretme ve Yeniden Çeviriler Oluşturma)
o Tübitak 1001 Research Project 121K221, 2021. (Fund: 504,207 TL) (Researcher) (continuing)
o The proposed project has two main goals. The first is to improve the Turkish natural language processing resources developed mainly by the researchers in the project team, based on current technology. The second is to compile all these resources in an open platform and thus contribute to new research. Within the university, there are many resources developed by researchers working on natural language processing. In the project, these resources will be collected and improved with the support of linguistics researchers, and a natural language processing platform will be built. The platform will be open-sourced and open to further development. In this way, the resources required for Turkish natural language processing will be made available to researchers.
· Linguistics-supported Turkish Natural Language Processing Platform
(Dilbilim Temelli Türkçe Doğal Dil İşleme Platformu)
o Boğaziçi University Research Fund 16909, Boğaziçi University, İstanbul, 2022. (Fund: 193,836 TL) (Project Leader)
· Language in the Human-Machine Era (LITHME)
o COST Action CA19102, 2020. (Secondary Proposer) (continuing)
o "Within the next 10 years, many millions of
people will be
wearing relatively unobtrusive ... devices that offer an
immersive and high-resolution view of a visually augmented world" (Perlin
2016: 85). This is the human-machine era, a time when our senses are not just
supplemented by handheld mobile devices, but thoroughly augmented. The language
we see, hear and produce will be mediated in real time by technology. This has
major implications for language use, and ultimately language itself. Are
linguists ready for this? Can our theory, methods, and epistemology handle it?
LITHME
has two aims: to prepare linguistics and its subdisciplines for what is coming;
and to facilitate longer term dialogue between linguists and technology
developers. How will pervasive augmentation technology affect language in areas
such as international law, translation, and other forms of language work? What
will this mean for how people identify with specific languages? Could
increasing reliance on real-time language technologies actually change the
structure of language? Longer term, could developments in brain-machine
interfaces serve to complement or even supersede language altogether?
Linguistics would be far stronger for robust technological foresight, while
developers would benefit from better understanding potential linguistic and
societal consequences of their creations.
Meanwhile
LITHME would shine a light on the ethical implications of emerging language
technologies. Inequality of access to technologies, questions of privacy and security,
new vectors for deception and crime; these and other critical issues would be
kept to the fore.
LITHME
would equip linguists and stakeholders for the human-machine era.
· Named Entity Recognition System based on Deep Neural Networks
o Bogaziçi University Research Fund, Bogaziçi University, Istanbul, 2019. (Fund: 31,553 TL) (Researcher)
· Compiling
a Verbal Multiword Expression Corpus for Turkish and Developing a Multilingual
Deep Learning based System for Verbal Multiword Expression Identification
(Türkçe Çok
Sözcüklü Fiil İfadeleri Derlemi Oluşturma ve Farklı Dillerdeki
Çok Sözcüklü Fiil İfadelerini Saptayan Derin Öğrenme Tabanlı bir
Sistem Geliştirme)
o Bogaziçi University Research Fund 18A01P4, Bogaziçi University, Istanbul, 2019. (Fund: 35,899 TL) (Project Leader)
o The first aim of this project is to disambiguate the definitions of Turkish verbal multiword expressions (VMWEs) and to compile a text corpus. For this purpose, we will start with the corpus developed for the PARSEME Shared Task 1.0, and then update and expand it. The VMWEs in the corpus will be labeled according to the published guidelines. The text corpus consists of newspaper articles on politics, world news, life, and art. The resulting corpus will be published at the PARSEME Shared Task 1.1 and will be a valuable resource for Turkish natural language processing studies including syntactic parsing, machine translation, and n-gram language modeling. The second and main aim is to develop, using gold-standard corpora for different languages, a VMWE identification system that is based on deep learning techniques and applicable to multiple languages. Processing multiword expressions is an important challenge in natural language processing, and the use of deep learning frameworks for this task is relatively new. We will aim to improve on the VMWE identification accuracy of the multilingual systems in the literature.
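VMWE identification is commonly cast as sequence labeling with BIO-style tags. A minimal sketch (with an illustrative tag set and hypothetical model output, not the project's actual system) of decoding tagged tokens into VMWE spans:

```python
# Minimal sketch: a model (e.g. a BiLSTM) would predict BIO-style tags per
# token; here we only show how tagged output is decoded into VMWE spans.
# The tag set and example data are illustrative.

def extract_vmwe_spans(tokens, tags):
    """Decode BIO tags into lists of tokens forming each VMWE."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-VMWE":                 # start of a new expression
            if current:
                spans.append(current)
            current = [token]
        elif tag == "I-VMWE" and current:   # continuation of an open span
            current.append(token)
        else:                               # an O tag closes any open span
            if current:
                spans.append(current)
            current = []
    if current:
        spans.append(current)
    return spans

# Example: "kafa yormak" (to rack one's brain) as a Turkish VMWE
tokens = ["Bu", "soruya", "kafa", "yordum"]
tags   = ["O", "O", "B-VMWE", "I-VMWE"]
print(extract_vmwe_spans(tokens, tags))  # [['kafa', 'yordum']]
```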
· Developing
a Comprehensive Sentiment Analysis Framework for Turkish
(Türkçe
için Kapsamlı bir Duygu Analizi Çatısı Oluşturulması)
o Bogaziçi University Research Fund 18A01D2, Bogaziçi University, Istanbul, 2019. (Fund: 7,044 TL) (Project Leader)
o Sentiment analysis is the process of extracting the sentiment (positive, negative, neutral) in texts using natural language processing and machine learning techniques. This process has five components: the sentiment itself, the person holding the sentiment, the time period of the sentiment, the target object, and the aspect of the object the sentiment is about. There is no sentiment analysis work for Turkish that takes all five components into account. In this project, we will build a sentiment analysis framework for Turkish. The work conducted for other languages cannot be adapted to Turkish easily. Moreover, most sentiment analysis studies for Turkish employ supervised methods and do not use unsupervised or semi-supervised methods, which poses problems when labeled data are limited. Turkish studies also usually yield less satisfactory results compared to other languages. Other open issues are forming sentiment vectors and building an aspect-based sentiment analysis system. In this project, a framework for Turkish sentiment analysis will be developed that incorporates the polarity, its score, the related object, the time, and the sentiment holder. For this purpose, neural networks and unsupervised/semi-supervised methods will be used. Finally, domain-specific sentiment lexicons will be built.
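The five-component view described above can be sketched as a simple record type; the field names and example below are illustrative, not the project's actual schema:

```python
# Minimal sketch of the five-component sentiment representation:
# polarity (with a score), holder, time, target object, and aspect.
# All names and values here are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SentimentRecord:
    polarity: str          # "positive" | "negative" | "neutral"
    score: float           # polarity strength, e.g. in [-1, 1]
    holder: Optional[str]  # who holds the sentiment
    time: Optional[str]    # when the sentiment applies
    target: Optional[str]  # the object the sentiment is about
    aspect: Optional[str]  # the aspect of that object

# "The battery life of this phone was great last week," says Ali.
rec = SentimentRecord("positive", 0.8, "Ali", "last week", "phone", "battery life")
print(rec.polarity, rec.aspect)
```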
· A
Deep Learning based Turkish Dependency Parser
(Derin
Öğrenme Tabanlı Türkçe Bağlılık
Ayrıştırıcısı)
o Tübitak 1005 Research Project 117E971, 2013-2015. (Fund: 201,802 TL) (Researcher)
o The purpose of this study is to develop a novel state-of-the-art dependency parsing system for Turkish. Dependency parsing is a crucial step for many natural language processing (NLP) systems such as question answering, machine translation, and information extraction. Turkish, being a morphologically rich language with a complex grammar, is challenging for automatic processing. The limited number of natural language processing tools and resources for Turkish makes the task even harder.
With the help of deep learning, which has recently become very successful in natural language processing, we built a novel, state-of-the-art Turkish dependency parser. The amount of training data directly affects the performance of a data-driven dependency parser, and deep learning-based systems are known to need large amounts of data to perform well. In response to this challenge, the project contributed in two main areas. First, Turkish grammar rules and morphological features of words were integrated into the deep learning model. Thus, despite the limited training data, the proposed hybrid deep learning model achieved higher accuracy than existing methods. A downloadable software program and an online tool have also been created to enable the use of this hybrid deep learning based parser. Many language processing applications that need dependency parsing will be able to achieve better performance for Turkish thanks to this parser.
The second main contribution of the project is increasing the quantity and quality of data by creating the largest corpus for Turkish dependency parsing, consisting of 20,000 annotated sentences. To create the corpus, the IMST-UD and PUD treebanks were manually reannotated by following grammar rules in a consistent way, and 13,500 new sentences were annotated and added. This new treebank, which is made publicly available, will be a basis for many new natural language processing studies in Turkish.
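UD treebanks such as IMST-UD and PUD are distributed in the CoNLL-U format, where each token line carries HEAD and DEPREL columns. A minimal sketch (on a toy sentence) of recovering dependency arcs from such a file:

```python
# Minimal sketch: read one CoNLL-U sentence and recover its
# (head, dependent, relation) arcs. Columns: ID, FORM, LEMMA, UPOS,
# XPOS, FEATS, HEAD, DEPREL, DEPS, MISC. The sentence is a toy example.

def conllu_arcs(sentence):
    arcs = []
    for line in sentence.strip().split("\n"):
        if line.startswith("#"):             # skip comment lines
            continue
        cols = line.split("\t")
        tok_id, head, deprel = cols[0], cols[6], cols[7]
        if "-" in tok_id or "." in tok_id:   # skip multiword/empty tokens
            continue
        arcs.append((int(head), int(tok_id), deprel))
    return arcs

sample = """# text = Ali geldi .
1\tAli\tAli\tPROPN\t_\t_\t2\tnsubj\t_\t_
2\tgeldi\tgel\tVERB\t_\t_\t0\troot\t_\t_
3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_"""
print(conllu_arcs(sample))  # [(2, 1, 'nsubj'), (0, 2, 'root'), (2, 3, 'punct')]
```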
· Parsing and Multi-word Expressions: Towards Linguistic Precision and Computational Efficiency in Natural Language Processing (PARSEME)
o COST Action IC1207, 2013-2017. (Secondary Proposer)
o The main objective of the Action is to increase and enhance the ICT support of the European multilingual heritage by bringing about substantial progress in the understanding and modelling of Multi-Word Expressions (MWEs) within advanced multilingual Natural Language Processing (NLP) techniques, notably deep parsing.
· Developing an Adaptive Question Answering System Enabling Primary and Secondary Education Students to Access Accurate and Reliable Information
(İlköğretim ve Ortaöğretim Öğrencilerinin Doğru ve Güvenilir Bilgiye Erişimlerine Yönelik Uyarlamalı bir Soru Cevaplama Sisteminin Geliştirilmesi)
o Tübitak 1003 Research Project 113E036 (FATIH Project), 2013-2015. (Fund: 566,783 TL) (Project Leader)
o The number of information sources and the amount of information on electronic media and the Web are increasing steadily. Users meet their information needs on these resources using technologies such as search engines. However, such technologies may fall short because they are general purpose, not tailored to a particular domain, and make use of natural language processing methods only in a very restricted manner.
In this project, a question answering system was developed that meets the information needs (questions) of students using reliable resources, that enables students to use resources in foreign languages, and that presents the results in the form of a combined and coherent summary. The theoretical basis of the project comprises natural language processing, question answering, text summarization, and machine translation formalisms. The system consists of the following phases: students ask questions in natural language (Turkish); the questions are analyzed using natural language processing techniques; resources that can answer the questions are identified; foreign-language resources are translated to Turkish; and the related resources are analyzed, combined, and summarized according to the requested information need. In this respect, the project aimed at enabling students to access accurate and reliable, question-driven information in an environment specialized for answering students' information needs.
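The phases of such a system can be sketched as a pipeline skeleton; the stage implementations below are crude placeholders for illustration only, not the project's actual components:

```python
# Illustrative skeleton of a question answering pipeline: analyze the
# question, retrieve matching resources, and summarize them. Every stage
# here is a toy placeholder (translation is omitted entirely).

def analyze_question(q):
    """NLP analysis of the question; here just tokenization."""
    return {"query": q.lower().rstrip("?").split()}

def retrieve(analysis, corpus):
    """Identify resources that mention any query word."""
    return [d for d in corpus if any(w in d.lower() for w in analysis["query"])]

def summarize(docs):
    """Combine related resources into one short summary."""
    return " ".join(docs)[:200]

def answer(question, corpus):
    return summarize(retrieve(analyze_question(question), corpus))

corpus = ["Ankara is the capital of Turkey.", "Istanbul is the largest city."]
print(answer("capital?", corpus))  # Ankara is the capital of Turkey.
```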
· Developing
Concept Mining Methods for Turkish Document Analysis
(Türkçe
Metinlerin Analizine Yönelik Olarak Kavram Madenciliği Yöntemlerinin
Geliştirilmesi)
o Tübitak 1001 Research Project 110E162, 2011. (Fund: 115,301 TL) (Project Leader)
o In this project, funded by the Scientific and Technological Research Council of Turkey, research was conducted on extracting concepts from Turkish documents using natural language processing algorithms and the Turkish Language Institution (TDK) dictionary. Concept mining is a process that aims at extracting expressive concepts from textual, visual, or audio artifacts. Although methodologies have been developed for many Western languages, mainly English, there has been no successful work for Turkish in this domain so far. In this project, research was carried out in six main areas: 1. compiling the documents and preparing the corpora to be used for concept mining; 2. obtaining the terms from these documents through preprocessing steps; 3. weighting the terms in the documents that are likely to be concepts using the TDK dictionary; 4. extracting the concepts according to the weights of the terms; 5. clustering the documents based on the extracted concepts; 6. developing a concept-based search engine. The outputs of the project are theses, technical reports, scientific papers, databases, and software programs.
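The term-weighting step in such a pipeline is often realized with tf-idf, sketched here on a toy corpus (the dictionary lookup is omitted; this is a generic illustration, not the project's actual weighting scheme):

```python
# Minimal tf-idf sketch: weight a term by its frequency in a document,
# discounted by how many documents it appears in. Toy corpus only.
import math

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [["kavram", "madencilik", "kavram"], ["metin", "analiz"]]
# "kavram" is frequent in doc 0 and absent from doc 1, so it scores high there.
print(tf_idf("kavram", corpus[0], corpus))
```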
· Design, Development and Evaluation of a Customized Web-Based Portfolio System to Improve Preservice Teachers' Reflective Skills
(Hizmet Öncesi Öğretmenlerinin
Yansıtma Becerilerini Geliştirmek için Özelleştirilmiş Web
Tabanlı Portfolyo Sistemi Tasarımı, Geliştirilmesi ve
Değerlendirilmesi)
o Tübitak 3501 Research Project, 2012. (Advisor)
o This study aims to design, develop, and evaluate a customized web-based portfolio system. The system will be integrated into teaching practicum courses and primarily used for improving preservice teachers' reflective skills. The web-based system will provide an effective platform both for preservice teachers to upload their course assignments and for the course instructors to follow up on their students' work. In addition, the system will allow instructors to structure their students' teaching experience around the notion of reflection, which is one of the most important skills that preservice teachers ought to develop.
Research in teacher education maintains that reflection is one of the most important skills that preservice teachers need to develop (Braun and Crumpler, 2004; Freese, 1999, 2006; Loughran, 2002; Wedman and Martin, 1986; Willard-Holt and Bottomley, 2000; Zeichner and Liston, 1996). Teaching portfolios are recognized as among the most effective tools to improve preservice teachers' reflective skills. Electronic portfolios, which offer several advantages over paper-based portfolios (e.g., easy access, editing, and sharing), are now becoming a new trend in higher education.
As the proposed system will be web-based, it will offer various opportunities to students, such as being able to: (1) access their work anytime from anywhere, (2) receive and provide continuous feedback, (3) revise and edit their documents based on the feedback. Thus, the portfolio assignments will become open learning documents that change continuously and are visible to more people, rather than documents that are completed at once and shared only with the course instructor. The portfolio system will also benefit the instructors, who will be able to follow up on teacher candidates' work easily and efficiently.
The existing electronic portfolio systems show a wide range, varying from commercial systems (content management systems that include electronic portfolio modules) to open-source electronic portfolio software (Stefani, Mason, and Pegler, 2007). In previous studies carried out by the principal investigator, some of these portfolio systems were used as portfolio tools. The results of these studies showed that the tasks developed around the theoretical framework helped improve preservice teachers' reflection skills. In addition, they also revealed a need for a more user-friendly electronic portfolio system, designed especially for improving preservice teachers' reflection skills.
The e-portfolio reflection tasks will be constructed around the model developed by Fuller and Bown (1975). In this model, a teacher's life space includes three basic elements: (1) goals, (2) experiences, and (3) observations of themselves or others. Fuller and Bown stated that for beginning teachers there will be discrepancies among these three components, and that the experience of becoming a teacher entails coping with all the discrepancies. In order to cope with such differences, preservice teachers first need to become aware of the inconsistencies among the three constituent elements. For this reason, the reflection tasks of this study will require preservice teachers to consider all three elements of their teacher's life space and to identify the discrepancies among them after each teaching experience cycle.
The development and evaluation of the customized web-based portfolio system involve the processes of design, programming, implementation, and evaluation. During the design and programming phase, the aim is to develop a system that is easily accessible and user-friendly. In addition, the system will be designed to allow structuring of preservice teachers' teaching practice assignments using certain scripts. For example, to complete a lesson plan assignment, preservice teachers will be able to edit a page that includes the components of an expected lesson plan. In the design of the system, the main goal is to create a platform that will easily allow preservice teachers to upload and edit their artifacts (such as lesson plans or teaching practice videos), to give and receive feedback to and from their peers and instructors, and to complete their reflection tasks.
In the implementation and evaluation phases of the project, the usability of the web-based portfolio system and its contribution to the development of preservice teachers' reflective skills will be examined. Data will be collected through the use of the customized web-based portfolio system within teaching practicum courses. The sample of the study will include preservice teachers who take a teaching practicum course in the final semester of their program in the Faculty of Education. The main data collection sources are preservice teachers' responses to the reflection tasks, which also comprise the content of the web-based portfolios, surveys, and semi-structured interviews. The change in preservice teachers' reflective skills will be examined using mixed-method analysis of the data, that is, with quantitative and qualitative analyses. The data will be coded in an iterative manner with respect to Fuller and Bown's (1975) model. The frequency count of each theme for each participant will be calculated. The statistical analyses will then be conducted after the qualitative data are quantified in this manner. The usability of the system will be evaluated through qualitative analyses of the data, mainly coming from surveys and interviews.
· Concept Extraction from Turkish Texts by Automatic Methods
(Türkçe Metinlerden Otomatik Yöntemlerle Kavram Çıkarılması)
o Bogaziçi University Research Fund 5187, Bogaziçi University, Istanbul, 2010. (Fund: 9,456.83 TL) (Project Leader)
o Concept extraction is a subtopic of concept mining, which forms an important branch of data mining. Concept mining can be defined as the study of extracting important concepts that appear in documents. A basic step in concept mining is processing words to obtain concepts; usually thesauri are used in this process. Word-concept matching is usually ambiguous, and context is used to resolve the ambiguity. The relationships between the concepts and the context are extracted using semantic similarity. Formal concept analysis, which makes this relationship explicit, is another important topic. Concept extraction aims at obtaining efficient solutions to some problems that are harder to solve using data mining alone.

The two basic approaches in concept extraction are the expert-based approach and the statistical approach. The first is also called the rule-based or information engineering-based approach. These systems include a set of pattern matching rules compiled by experts in the field; the main disadvantage of this method is finding such experts. The second, the statistical approach, is also known as the automatic learning approach. It makes use of statistical information gathered from available corpora for learning. Mostly HMMs (Hidden Markov Models) are used for this purpose, with transition probabilities estimated from the training data. The advantage is domain portability. The disadvantages are the difficulty and cost of building corpora, the need to retrain the system for different concepts, and slower execution.
Our goal in this project is to build an automated concept extraction system for Turkish. Since it is more appropriate for the model we build, we will use the statistical approach rather than the expert-based approach. The PASW software developed by SPSS for data mining is a successful concept extraction application for English and other well-known languages. In this project, we aim at implementing the functionality of the PASW Text Analytics module for Turkish. The system will work as follows: first, nouns will be extracted from the documents; then sub-dictionary groups will be formed by clustering similar words; these sub-dictionaries will be labeled manually; and finally data mining techniques will be applied to the resulting concepts. For extracting words from documents and disambiguating them, the morphological parser and the morphological disambiguator developed in the scope of the Boğaziçi University BAP 08M103 and TÜBİTAK 107E261 projects will be used. The cosine similarity metric will be employed for discovering term similarities. To group similar terms, we will use k-means or a related clustering approach, and the terms will be labeled manually. In this way, word-concept matching will be obtained. The concept extraction phase will be completed after labeling. Finally, interesting information will be obtained from the concepts by employing data mining methods.
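The term-grouping step can be illustrated with cosine similarity. The project proposes k-means or a related method; a simpler greedy single-pass clustering is shown here for brevity, with illustrative term vectors:

```python
# Minimal sketch of grouping similar terms by cosine similarity.
# Vectors and the threshold are illustrative, not the project's values.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def group_terms(vectors, threshold=0.8):
    """Greedy single-pass clustering: join a term to the first cluster
    whose seed term is similar enough, else open a new cluster."""
    clusters = []  # list of lists of term names
    for term, vec in vectors.items():
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(term)
                break
        else:
            clusters.append([term])
    return clusters

# "araba" and "otomobil" (both "car") should fall in one sub-dictionary.
vectors = {"araba": [1.0, 0.1], "otomobil": [0.9, 0.2], "elma": [0.0, 1.0]}
print(group_terms(vectors))  # [['araba', 'otomobil'], ['elma']]
```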
· Developing a General-Purpose Turkish Handwritten Recognition System using a Large Lexicon
(Büyük Sözlüklü ve Genel Amaçlı Türkçe El Yazısı Tanıma Sistemi Geliştirilmesi)
o Bogaziçi University Research Fund 09A107D, Bogaziçi University, Istanbul, 2009. (Fund: 10,800 TL) (Project Leader)
o The aim of this project is to develop algorithms for identifying patterns in periodic sequences and to apply these algorithms to the handwritten character recognition problem. Handwritten character recognition is usually divided into two groups: with segmentation (general purpose) and without segmentation (special purpose). In this work, instead of these methods, it was decided to apply a cognitive approach that has recently become popular. As the short sequences obtained from each handwritten sample are combined, a periodic sequence is generated. These sequences are learned and matched using compression algorithms.
· Morphology Based Language Modeling for Turkish Speech Recognition
(Biçimbilimsel Tabanlı Dil Modeli ile Türkçe Konuşma Tanıma Sistemi)
o Bogaziçi University Research Fund 08M103, Bogaziçi University, Istanbul, 2008. (Fund: 8,805 YTL) (Project Leader)
o In this project, we aimed at developing a high-performance large vocabulary continuous speech recognition system for Turkish. The most important contribution of this work has been to develop a morphology-based language model for Turkish. As a result of our previous work, we had built language resources for Turkish such as a morphological parser, a morphological disambiguator, and a web corpus. Using these language resources, in this project, we developed an effective morphology-based language model for Turkish. We also replaced the static lexicon with a dynamic one based on the morphological parser, which greatly alleviated the out-of-vocabulary problem for Turkish. We also developed a speech decoder that can perform speech decoding on morphology-integrated search networks.
· Morphology Based Language Modeling for Turkish Speech Recognition
(Biçimbilimsel Tabanlı Dil Modeli ile Türkçe Konuşma Tanıma Sistemi)
o Tübitak 1001 Research Project 107E261, Tübitak 2007. (Fund: 46,490 YTL) (Project Leader)
o In this project, we aimed at developing a large vocabulary continuous speech recognition system for Turkish, with the support of the Scientific and Technological Research Council of Turkey. State-of-the-art speech recognition systems are basically composed of three main components: an acoustic model, a language model, and a speech decoder. To build the acoustic model, hidden Markov models were trained using a speech corpus compiled previously. During the research for building the language model, a number of resources were obtained: a morphological parser, a morphological disambiguator, and a web corpus. By parsing the web corpus with the morphological parser and then disambiguating these parses with the morphological disambiguation system, a statistical language model based on the lexical morphemes in the parses was estimated. The lexical morpheme language model was composed with the lexical transducer of the morphological parser to construct a morphology-integrated search network. A real-time speech decoder for Turkish that can perform speech decoding on the finite-state transducer-based networks was developed. The outputs of the project take the form of a PhD thesis, technical reports, journal papers, conference papers, databases, and software programs. The databases and programs were made publicly available for research purposes.
· Developing Structure-preserving and Query-biased Automated Summarization Methods for Web Search Engines
(Arama Motorları için Bilgi İsteği ve Doküman Yapısı Tabanlı Otomatik Özetleme Metotlarının Geliştirilmesi)
o Bogaziçi University Research Fund 07A106, Bogaziçi University, Istanbul, 2007 (Fund: 6,065 YTL) (Project Leader)
o In this project, a new two-stage summarization approach was developed to improve the effectiveness of Web search. In the first stage, a rule-based approach and a machine learning approach were implemented to identify the sectional hierarchies of Web documents. In the second stage, query-biased summaries were created based on document structure. The evaluation results show that the system yields a significant improvement over unstructured summaries and Google snippets.
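Query-biased summarization can be sketched by scoring each sentence on its word overlap with the query and keeping the top-scoring ones; the scoring scheme and data below are illustrative, not the project's actual method:

```python
# Minimal sketch of query-biased sentence selection: rank sentences by
# word overlap with the query, break ties by document order.

def query_biased_summary(sentences, query, k=1):
    q = set(query.lower().split())
    scored = [(len(q & set(s.lower().rstrip(".").split())), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:k]
    return [s for _, _, s in sorted(top, key=lambda t: t[1])]

sents = ["The cat sat on the mat.",
         "Search engines rank web documents.",
         "Snippets summarize documents for a query."]
print(query_biased_summary(sents, "query documents", k=1))
# ['Snippets summarize documents for a query.']
```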
· Morphotactic based Statistical Language Modeling for Large Vocabulary Continuous Speech Recognition Systems
(Geniş Sözcük Dağarcıklı Sürekli Konuşma Tanıma Sistemleri için Morfotaktik Tabanlı İstatistiki Dil Modelleme)
o Bogaziçi University Research Fund 06A102, Bogaziçi University, Istanbul, 2006 (Fund: 14,405 YTL) (Project Leader)
o In this project, the aim is to develop a new language model to be used in large vocabulary continuous speech recognition (LVCSR) systems for agglutinative languages such as Turkish. As is known, the ability to produce an unlimited number of words in agglutinative languages causes difficulties in building language models for speech recognition systems, and the lack of a good language model significantly affects the effectiveness of these systems. One of the most important reasons why speech recognition systems have been developed successfully for relatively non-agglutinative languages such as English, while the same success has not yet been achieved for languages such as Turkish, is the lack of an effective language model. The n-gram language model is widely used in speech recognition systems. This model attempts to model the language statistically: built from the frequencies with which words follow one another in a large text corpus, it is used to compute the probabilities of word sequences. Another problem in Turkish is that word order within a sentence is relatively free. When this free word order is combined with the agglutinative nature of Turkish, a simple word-based n-gram language model loses effectiveness in speech recognition systems. This work aims to build an effective language model by combining the morphotactic knowledge of Turkish (the ordering rules of morphs) with an n-gram language model. In this way, large vocabulary speech recognition systems with many application areas can be developed for Turkish.
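The core idea, an n-gram model estimated over morpheme sequences rather than whole words, can be sketched as follows; the segmentations are illustrative, and a real system would obtain them from a morphological parser:

```python
# Minimal sketch: maximum-likelihood bigram probabilities over morph
# sequences instead of words, which keeps the vocabulary small for an
# agglutinative language. Toy, hand-segmented corpus.
from collections import Counter

def bigram_probs(sequences):
    bigrams, unigrams = Counter(), Counter()
    for seq in sequences:
        seq = ["<s>"] + seq                      # sentence-start symbol
        unigrams.update(seq[:-1])
        bigrams.update(zip(seq[:-1], seq[1:]))
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

# "evlerde" = ev +ler +de ("in the houses"); "evlerden" = ev +ler +den
corpus = [["ev", "+ler", "+de"], ["ev", "+ler", "+den"]]
p = bigram_probs(corpus)
print(p[("ev", "+ler")])  # 1.0: "+ler" always follows "ev" here
```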
· Developing Natural Language Processing-based Methods for Text Classification
(Metin Sınıflandırmada Doğal Dil İşleme Tabanlı Metotların Geliştirilmesi)
o Bogaziçi University Research Fund 05A103, Bogaziçi University, Istanbul, 2005 (Fund: 10,700 YTL) (Project Leader)
o In this project, the use of natural language processing techniques for the text categorization problem is considered. Today, much research is being conducted on text classification, and some of it also has practical applications. However, the success rates of these studies cannot exceed a certain limit. The main reason is that almost all of them use only syntactic information and do not exploit semantic information. In other words, the words in the texts are handled independently of their meanings. In this project, new methods will be proposed to remedy this deficiency and to classify texts by taking the meaning they convey into account.
· Developing Dynamic and Adaptive Methods for Turkish Spam Filtering
(Türkçe Spam Mesajların Filtrelenmesine Yönelik Dinamik ve Uyarlamalı Metotların Geliştirilmesi)
o Bogaziçi University Research Fund 04A101, Bogaziçi University, Istanbul, 2004. (Fund: 6,600,000,000 TL) (Project Leader)
o In this project, anti-spam filtering methods for Turkish will be developed to prevent spam e-mail messages. Today, spam messages constitute 10% of all e-mail messages and cause significant time losses for users. Filtering algorithms exist for widespread languages such as English, but no such work has yet been done for Turkish messages. Such a study must take into account the complex morphological structure of Turkish. The methods to be developed in this project will be dynamic and will be based on artificial neural network and Bayesian network techniques. The algorithms to be developed are expected to contain two main components: a morphology module that performs morphological analysis of message contents, and a learning module that classifies messages as legitimate or spam.
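The learning module can be illustrated with a naive Bayes classifier over message tokens; the project proposes neural and Bayesian network techniques, so plain naive Bayes is shown here only as a minimal sketch on toy data:

```python
# Minimal naive Bayes sketch for spam/ham classification with Laplace
# smoothing. The tiny training set is illustrative only; a real system
# would first run the messages through a morphology module.
import math
from collections import Counter

def train(messages):
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in messages:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    vocab = set(counts["spam"]) | set(counts["ham"])
    best, best_lp = None, float("-inf")
    for label in ("spam", "ham"):
        lp = math.log(totals[label] / sum(totals.values()))  # class prior
        n = sum(counts[label].values())
        for w in text.lower().split():
            lp += math.log((counts[label][w] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# "bedava"/"kazan" ("free"/"win") as typical spam words, toy examples
data = [("bedava kampanya kazan", "spam"), ("toplanti yarin saat onda", "ham")]
counts, totals = train(data)
print(classify("bedava kazan", counts, totals))  # spam
```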
· Statistical Analysis of Turkish
(Türkçe'nin İstatistiksel İncelenmesi)
o Bogaziçi University Research Fund 02A107, Bogaziçi University, Istanbul, 2002-2003. (Fund: 6,350,000,000 TL) (Project Leader)
o In this project, statistical processing of natural languages, a new approach, will be applied to Turkish. Research on this topic is being carried out for some widespread languages, but no such work exists yet for Turkish. To this end, a comprehensive literature survey will be conducted by the project members. Based on this survey, an infrastructure for the statistical processing of Turkish will be established and a program will be developed. The design, implementation, and testing of the program will be carried out.