Possible PhD Research Areas
- Developing a machine translation (MT) system
- Developing an automated theorem prover (ATP)
Possible Cmpe491/492 and MS Projects
- Evaluation of Text Categorization Performance on Different Datasets (Cmpe492)
The goal of this project is to evaluate the performance of several text categorization methods on a few datasets, then to develop a new method (a slight modification of a known method), and finally to repeat the evaluation with the new method on the same datasets. The scope of the project is clearly defined: the instructor will initially guide the student(s) by providing papers on the methods to be considered, the datasets to be used, and an idea for the new method to be developed. The project will roughly proceed as follows:
- Read papers about some methods on text categorization.
- Get acquainted with the tools and techniques used (e.g. support vector machines, chi-squared, information gain, etc.).
- Prepare the datasets for processing.
- Apply some of the methods to these datasets and evaluate their performance.
- Develop a new method.
- Apply the new method to the same datasets, evaluate its performance, and compare it with the other methods.
See the papers "An Extensive Empirical Study of Feature Selection Metrics for Text Classification", "Text Categorization with Class-Based and Corpus-Based Keyword Selection", and "Classification of Skewed and Homogenous Document Corpora with Class-Based and Corpus-Based Keywords", which will be used in the project.
Resources required: Datasets
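As an illustration of one of the techniques named above, the following is a minimal sketch of chi-squared term scoring for feature selection. It assumes documents are given as sets of tokens with one class label each; the function names and the toy data are hypothetical and are not taken from the papers listed above.

```python
def chi_squared(docs, labels, term, target):
    """Chi-squared score of `term` for class `target`.

    Builds the usual 2x2 contingency table:
      a = docs in the class that contain the term
      b = docs outside the class that contain the term
      c = docs in the class without the term
      d = docs outside the class without the term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or class is degenerate; no evidence either way
    return n * (a * d - c * b) ** 2 / denom


def top_terms(docs, labels, target, k):
    """Rank the vocabulary by chi-squared score for one class (ties broken alphabetically)."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: -chi_squared(docs, labels, t, target))
    return ranked[:k]


# Toy corpus (hypothetical data).
docs = [{"goal", "match"}, {"match", "team"}, {"bank", "rate"}, {"rate", "team"}]
labels = ["sports", "sports", "economy", "economy"]
print(chi_squared(docs, labels, "match", "sports"))  # 4.0
print(chi_squared(docs, labels, "team", "sports"))   # 0.0
```

A term that occurs equally often inside and outside the class ("team") scores zero, while a term that perfectly separates the class ("match") gets the maximum score for this corpus size.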
- Developing a Software for Turkish Spam Mail Filtering (Cmpe492)
Research on identifying and filtering Turkish spam mail has been initiated in the department, and some algorithms have been developed. The goal of this project is two-fold: i) building professional software that uses the developed algorithms for Turkish spam mail filtering, and ii) integrating this system into Outlook Express, Microsoft Outlook, and/or other e-mail environments. The student(s) therefore need(s) to be fluent in programming. The scope of the project is as follows:
- Get acquainted with some machine learning techniques (Bayesian learning, artificial neural networks, hidden Markov models).
- Analyze and understand the algorithms already developed in the department.
- Investigate some of the professional spam mail filters currently used in practice.
- Incorporate into the algorithms some simple but effective features found in current spam mail filters that greatly improve performance (e.g. holding a list of known spammers).
- Develop professional spam filtering software. The program will be parametric and adaptive, in the sense that the end user will be able to change the configuration and the program will behave differently for different users.
- Integrate the system into Outlook Express, Microsoft Outlook, and/or other e-mail environments. The user interface is an important part: the user should be able to use the filter as if it were already a component of the e-mail environment, and the interface should be clear and easy to use.
The system will mainly aim at Turkish spam filtering, but it will also include a component for English mail.
See the papers "Adaptive Anti-spam Filtering for Agglutinative Languages - A Special Case for Turkish" and "Developing Methods and Heuristics with Low Time Complexities for Filtering Spam Messages", which explain the algorithms developed and which will be used in the project.
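To make the combination of Bayesian learning and a list of known spammers concrete, here is a minimal sketch of a naive Bayes filter with a sender blacklist. It is not the department's algorithm: the class name, the add-one smoothing, and the toy tokens are assumptions chosen for illustration, and a real system for Turkish would first apply morphological analysis to the message text.

```python
import math
from collections import Counter


class SpamFilter:
    """Naive Bayes over tokens, with a blacklist check applied first (illustrative only)."""

    def __init__(self, blacklist=()):
        self.blacklist = set(blacklist)  # known spammer addresses
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, tokens, label):
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1

    def is_spam(self, sender, tokens):
        # Cheap heuristic first: mail from a known spammer is always rejected.
        if sender in self.blacklist:
            return True
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total = sum(counts.values())
            # Log prior plus add-one smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / (total + len(vocab)))
            scores[label] = score
        return scores["spam"] > scores["ham"]


# Hypothetical training data; at least one message of each class must be seen
# before classifying, otherwise the log prior is undefined.
f = SpamFilter(blacklist={"spammer@example.com"})
f.train(["bedava", "kazan", "tikla"], "spam")
f.train(["toplanti", "yarin", "rapor"], "ham")
print(f.is_spam("a@example.com", ["bedava", "kazan"]))
```

Keeping the blacklist as a constructor parameter is one simple way to make the filter configurable per user, in the spirit of the "parametric and adaptive" requirement above.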
- Semantic Categorization of Turkish Language Elements (Cmpe491,MS)
In this project, a set of semantic categories will be determined and assigned to Turkish words. For instance, the semantic category of "computer" may be "elektronik cihaz" (electronic device), while those of "kedi" (cat) may be "hayvan" (animal) and "canlı" (living being). The categories may be arranged hierarchically. For instance, "elektronik cihaz" may be a subcategory of "cihaz" (device); "hayvan" and "insan" (human) may be subcategories of "canlı".
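A hierarchy of this kind can be sketched as a parent map from each category to its supercategory. The representation below is only one possible choice and the dictionary names are hypothetical; the category names are the examples given above.

```python
# Each category points to its immediate supercategory (top-level categories have none).
PARENT = {
    "elektronik cihaz": "cihaz",
    "hayvan": "canlı",
    "insan": "canlı",
}

# Most specific category assigned to each word.
WORD_CATEGORY = {
    "kedi": "hayvan",
    "computer": "elektronik cihaz",
}


def categories_of(word):
    """Return the word's categories from most specific to most general."""
    chain = []
    category = WORD_CATEGORY.get(word)
    while category is not None:
        chain.append(category)
        category = PARENT.get(category)
    return chain


def share_category(word_a, word_b):
    """True if the two words have at least one category in common."""
    return bool(set(categories_of(word_a)) & set(categories_of(word_b)))


print(categories_of("kedi"))      # ['hayvan', 'canlı']
print(categories_of("computer"))  # ['elektronik cihaz', 'cihaz']
```

Storing only the most specific category per word and deriving the rest from the hierarchy avoids repeating the general categories in every lexicon entry.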
Roughly, the project will consist of three phases:
i) Building a Turkish lexicon
A Turkish lexicon in electronic form will be prepared, matching an official Turkish dictionary (e.g. the TDK Dictionary). The lexicon will not be built from scratch; instead, either the lexicon currently used in the department will be modified or lexicons available on the Internet will be used.
Suffixes that will not be entered as separate words ("lik", "ci", etc.) will be determined. If a word has several syntactic categories, each will be entered as a separate entry in the lexicon (e.g. "sevgi" may be a noun and a proper noun). Likewise, if a word has different senses in the same category, each will be entered as a separate entry (e.g. "kalem" means "pencil" and "item"), because the suffixes that can be attached to them and their semantic categories may differ. Compound words separated by spaces (e.g. "bata çıka") will be entered as they appear in the dictionary. The properties of the words will also be recorded in the lexicon, and some additional information (e.g. origins of words) may be included.
Thus, the lexicon will look like the following:
| word | syntactic category | suffixes | properties | origin |
|---|---|---|---|---|
| kitap | noun | ci (noun), lik (noun) | ... | Persian |
| sevgi | noun | ... | ... | ... |
| sevgi | proper noun | ... | ... | ... |
| saat | noun | ... | violates vowel harmony | ... |
| kalem | noun | ... | ... | ... |
| kalem | noun | ... | ... | ... |
| bata çıka | adverb | ... | ... | ... |
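One way to hold such a lexicon in memory is sketched below. The field names follow the columns of the table; the `sense` field and the `Lexicon` class are assumptions added only to tell apart entries that share a word and a syntactic category.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class LexiconEntry:
    word: str
    syntactic_category: str
    suffixes: list = field(default_factory=list)    # e.g. ["ci (noun)", "lik (noun)"]
    properties: list = field(default_factory=list)  # e.g. ["violates vowel harmony"]
    origin: str = ""
    sense: str = ""  # hypothetical extra field, not a column of the table


class Lexicon:
    """Maps a surface word to all of its entries (one per category and sense)."""

    def __init__(self):
        self._entries = defaultdict(list)

    def add(self, entry):
        self._entries[entry.word].append(entry)

    def lookup(self, word):
        return list(self._entries.get(word, []))


lexicon = Lexicon()
lexicon.add(LexiconEntry("sevgi", "noun"))
lexicon.add(LexiconEntry("sevgi", "proper noun"))
lexicon.add(LexiconEntry("kalem", "noun", sense="pencil"))
lexicon.add(LexiconEntry("kalem", "noun", sense="item"))
lexicon.add(LexiconEntry("saat", "noun", properties=["violates vowel harmony"]))

print([e.syntactic_category for e in lexicon.lookup("sevgi")])  # ['noun', 'proper noun']
```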
ii) Assigning semantic categories
A literature survey on semantic categorization and semantic networks will be carried out. Possible semantic categories will be identified and, for each lexicon entry, its semantic categories will be determined. Previous research on the topic, Turkish WordNet, and semantic networks for other languages may aid this process.
The categorization should be as meaningful as possible. In determining semantic categories, it should be useful to check whether two words having the same category can be used interchangeably. If not, a new category that discriminates between these words may be created. For instance, assigning the words "kuş" (bird) and "köpek" (dog) both to the same category "hayvan" may be incorrect since they have different properties: the first is a flying animal while the other is not.
Depending on the progress of the project, categorization may be limited to a few thousand words rather than the whole lexicon.
iii) Developing an application
A small-scale application program will be developed in order to test the correctness of the semantic categorization.
- Turkish Word Sense Disambiguation Using WordNet (Cmpe491/492,MS)
A literature survey on word sense disambiguation (WSD) using WordNet will be carried out. Turkish WordNet will be used in this project, so its structure will be studied in full. Existing WSD methods will be adapted to Turkish using Turkish WordNet and, in addition, a new method will be proposed. Some characteristics peculiar to Turkish may be taken into account, and heuristics may be formulated to increase the success of the method (e.g. the large number of suffixes in Turkish may give a clue for selecting the correct sense of a word). The performance of the methods will be compared in detail.
If the need arises, other resources (Turkish corpora, etc.) may be used in addition to WordNet. The existing methods and the proposed method may also be applied to English.
Resources required: Turkish WordNet, Turkish morphological analyzer/POS tagger
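As a point of reference for dictionary-based WSD, the sketch below implements a simplified Lesk-style overlap measure: the sense whose gloss shares the most words with the context is chosen. The sense inventory and gloss words are invented toy data, not entries from Turkish WordNet, and the function name is hypothetical.

```python
def simplified_lesk(context_tokens, senses):
    """Pick the sense whose gloss overlaps most with the context.

    `senses` maps a sense identifier to a list of gloss tokens.
    Ties are resolved in favour of the sense listed first.
    """
    context = set(context_tokens)
    best_sense, best_overlap = None, -1
    for sense_id, gloss_tokens in senses.items():
        overlap = len(context & set(gloss_tokens))
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense


# Toy sense inventory for "kalem" (hypothetical glosses, ASCII-folded for brevity).
KALEM_SENSES = {
    "kalem/pencil": ["yazi", "yazmak", "kagit", "arac"],
    "kalem/item": ["liste", "butce", "madde", "tutar"],
}

print(simplified_lesk(["butce", "liste", "kalem"], KALEM_SENSES))  # kalem/item
```

For Turkish, the context and gloss tokens would normally be reduced to their roots by a morphological analyzer before the overlap is computed, since inflected forms rarely match exactly.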
- Turkish Part of Speech Tagging (Cmpe492)
A literature survey on part of speech (POS) tagging will be carried out. The morphology of Turkish and finite state machines will also be studied. Using an available Turkish lexicon, a POS tagger for Turkish will be developed, and its performance will be analyzed in detail. Some characteristics peculiar to Turkish may be taken into account, and heuristics may be formulated to increase the success of the system.
Resources required: Turkish lexicon
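A very rough sketch of a lexicon-driven tagger is given below: if a word is not in the lexicon, inflectional suffixes are stripped one at a time until a known root is found, and the root's tag is returned. The tiny lexicon and suffix list are hypothetical, and the approach ignores derivational suffixes, vowel harmony, and ambiguity, all of which a real tagger would have to handle.

```python
# Hypothetical toy resources; a real system would use a full Turkish lexicon.
LEXICON = {"kitap": "noun", "ev": "noun", "gel": "verb", "oku": "verb"}
SUFFIXES = ["lar", "ler", "iyor", "yor", "de", "da"]


def tag(word, lexicon=LEXICON, suffixes=SUFFIXES):
    """Return the POS tag of `word`, stripping known suffixes if needed."""
    if word in lexicon:
        return lexicon[word]
    for suffix in suffixes:
        # Only strip when something is left to serve as a candidate root.
        if word.endswith(suffix) and len(word) > len(suffix):
            result = tag(word[: -len(suffix)], lexicon, suffixes)
            if result != "unknown":
                return result
    return "unknown"


for w in ["kitaplar", "geliyor", "evlerde", "masa"]:
    print(w, tag(w))
```

The recursion mirrors the way Turkish suffixes stack on a root ("ev" + "ler" + "de"), which is also why finite state machines are a natural fit for the morphology.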
- Syntactic Parsing of Turkish (Cmpe491,MS)
A literature survey on grammars and parsing will be carried out. The characteristics of the Turkish language and the structure of its sentences will be analyzed. A comprehensive grammar for Turkish will be built. An important requirement is that the grammar accept grammatical sentences and reject ungrammatical ones.
Syntactic parsing requires a sophisticated morphological analyzer or POS tagger. One of the available components will be used and improved if necessary.
The performance of the parser will be analyzed in detail. Turkish WordNet may be used for this purpose.
Resources required: Turkish morphological analyzer/POS tagger
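The idea of a grammar that accepts some token sequences and rejects others can be illustrated with a CYK recognizer over a grammar in Chomsky normal form. The three-rule grammar below is a toy that only covers canonical subject-object-verb order; it is an assumption made for illustration and says nothing about the grammar formalism the project would adopt. Real Turkish allows much freer word order than this toy accepts.

```python
def cyk_accepts(tokens, lexical, binary, start="S"):
    """CYK recognition for a grammar in Chomsky normal form.

    lexical: maps a token to the set of nonterminals that produce it.
    binary:  maps a pair (B, C) to the set of nonterminals A with A -> B C.
    """
    n = len(tokens)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive tokens[i:j].
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][i + 1] = set(lexical.get(tok, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for left in table[i][k]:
                    for right in table[k][j]:
                        table[i][j] |= binary.get((left, right), set())
    return start in table[0][n]


# Toy grammar: S -> NP VP, VP -> NP V (hypothetical).
BINARY = {("NP", "VP"): {"S"}, ("NP", "V"): {"VP"}}
LEXICAL = {"kedi": {"NP"}, "sütü": {"NP"}, "içti": {"V"}}

print(cyk_accepts("kedi sütü içti".split(), LEXICAL, BINARY))  # True
print(cyk_accepts("kedi sütü".split(), LEXICAL, BINARY))       # False
```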
- Text Summarization for Turkish (MS)
A literature survey on text summarization/sentence extraction will be carried out. Existing methods will be applied to Turkish texts and, in addition, a new method will be proposed. Some characteristics peculiar to Turkish may be taken into account, and heuristics may be formulated to increase the success of the method. The performance of the methods will be compared in detail. Turkish WordNet may be used in developing the method and heuristics.
Resources required: Turkish morphological analyzer/POS tagger
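A common baseline for sentence extraction scores each sentence by the average corpus frequency of its words and keeps the top-scoring sentences in their original order. The sketch below assumes this baseline; the function name, the regular-expression sentence splitter, and the example text are illustrative choices, not part of the project description.

```python
import re
from collections import Counter


def summarize(text, num_sentences=1, stopwords=frozenset()):
    """Extract the `num_sentences` highest-scoring sentences, in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [
        [w for w in re.findall(r"\w+", s.lower()) if w not in stopwords]
        for s in sentences
    ]
    freq = Counter(w for tokens in tokenized for w in tokens)

    def score(i):
        tokens = tokenized[i]
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=score, reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(ranked)]


text = "Spam filters learn from mail. Spam filters block spam mail. The weather is nice."
print(summarize(text, 1))  # ['Spam filters block spam mail.']
```

For Turkish text, counting roots instead of surface forms (using the morphological analyzer listed under the required resources) would be a natural refinement, because the same root appears with many different suffixes. Note also that Python's `str.lower()` does not apply Turkish casing rules for dotted and dotless "i".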
- Unsupervised Segmentation of Words into Morphemes (Cmpe491,MS)
The objective of this project is to design a statistical machine learning algorithm that segments words into morphemes, the smallest meaning-bearing units of language. Ideally, morphemes are basic vocabulary units suitable for different tasks such as text understanding, machine translation, information retrieval, and statistical language modeling. The algorithm should be unsupervised, i.e. no pre-segmented training data will be available. Some sample Turkish words and their segmentations are given below:
| word (input) | segmentation (output) |
|---|---|
| bitirenlerin | bitir-en-ler-in |
| defterimdeki | defter-im-de-ki |
| giyiliyor | giy-il-iyor |
The project will begin with a literature survey on supervised/unsupervised learning and word segmentation. Some available word segmentation algorithms will be studied, and then new methods and algorithms will be developed. The input to the algorithms will be a list of unsegmented words and their frequencies of use. Three corpora (Turkish, English, Finnish) are already available and will be used as input. A small sample of segmented words is also available for each language and will be used to evaluate the algorithms. The performance of the developed algorithms will be analyzed in detail.
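One way to evaluate a segmentation algorithm against the small gold sample is to compare morpheme boundary positions and report precision, recall, and F1. The choice of this metric is an assumption, as are the function names and the predicted segmentations in the usage example; the gold segmentations are the three sample words given above.

```python
def boundaries(segmentation):
    """Character offsets of the morpheme boundaries in a string like 'giy-il-iyor'."""
    positions, offset = set(), 0
    for morph in segmentation.split("-")[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_prf(predicted, gold):
    """Boundary precision, recall and F1 of `predicted` against `gold`.

    Both arguments map a word to its hyphen-separated segmentation.
    A word missing from `predicted` counts as left unsegmented.
    """
    tp = fp = fn = 0
    for word, gold_seg in gold.items():
        g = boundaries(gold_seg)
        p = boundaries(predicted.get(word, word))
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = {
    "bitirenlerin": "bitir-en-ler-in",
    "defterimdeki": "defter-im-de-ki",
    "giyiliyor": "giy-il-iyor",
}
# Hypothetical system output that misses two boundaries.
predicted = {
    "bitirenlerin": "bitir-enler-in",
    "defterimdeki": "defter-im-de-ki",
    "giyiliyor": "giyil-iyor",
}
precision, recall, f1 = boundary_prf(predicted, gold)
print(precision, recall)  # 1.0 0.75
```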
- Translation Memory System for Turkish-English (Cmpe491,MS)
A translation memory (TM) is a component used in machine translation that efficiently searches a database of previously translated text fragments in order to locate one that closely resembles the fragment being translated. In this project, a literature survey on translation memory concepts and currently used translation memories will be carried out. A corpus, as large as possible, of Turkish and English sentences that are translations of each other will be built. An algorithm for searching this corpus will be developed and implemented. The algorithm may also be applied to other corpora. The performance of the algorithm will be analyzed in detail.
In the case of an MS thesis, the algorithm to be developed is expected to be sophisticated and to make a contribution beyond existing algorithms.
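The core lookup of a translation memory can be sketched as a fuzzy search over stored sentence pairs. The version below uses `difflib.SequenceMatcher` from the Python standard library as the similarity measure and scans the memory linearly; both are simplifying assumptions, and the sentence pairs, function name, and threshold are hypothetical. An efficient system would need an index rather than a linear scan.

```python
from difflib import SequenceMatcher


def best_match(query, memory, threshold=0.7):
    """Return (source, target, score) for the most similar stored source sentence.

    `memory` is a list of (source, target) pairs. Returns None when no
    stored sentence reaches the similarity threshold.
    """
    best = None
    best_score = threshold
    for source, target in memory:
        score = SequenceMatcher(None, query.lower(), source.lower()).ratio()
        if score >= best_score:
            best, best_score = (source, target, score), score
    return best


# Hypothetical English-Turkish memory.
MEMORY = [
    ("The meeting starts at nine.", "Toplantı dokuzda başlıyor."),
    ("Please send the report.", "Lütfen raporu gönderin."),
]

match = best_match("The meeting starts at ten.", MEMORY)
if match is not None:
    print(match[1])  # Toplantı dokuzda başlıyor.
```

The returned score lets a translator see how close the retrieved fragment is to the query, which is how fuzzy matches are usually presented in practice.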
- Alignment of Turkish-English Sentences (Cmpe492)
Given a text in one language and its translation in another language, sentence alignment is the task of matching translation correspondences in the texts. The difficulties in this task come from the fact that the relative order of corresponding pairs may differ in the two texts, or a single sentence in one text may correspond to more than one sentence in the other.
In this project, a literature survey on sentence alignment will be carried out. Most of the work will be dedicated to compiling a large Turkish-English corpus from different resources. The sentences will be aligned manually. Then a simple sentence alignment algorithm will be implemented and tested on this corpus.
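A simple alignment algorithm of the kind mentioned above can be based on sentence lengths and dynamic programming. The sketch below allows one-to-one, one-to-two, and two-to-one matches and assumes that sentence order is preserved, so it does not handle reordered sentences. The cost function (relative length difference plus a small penalty for merged sentences) is an ad hoc assumption, not a published model, and the example sentences are invented.

```python
def align(src, tgt, merge_penalty=0.1):
    """Align two sentence lists; returns a list of (src_indices, tgt_indices) pairs."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    moves = [(1, 1), (1, 2), (2, 1)]  # 1-1, 1-2 and 2-1 matches
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                step = abs(len_s - len_t) / max(len_s, len_t, 1)
                step += merge_penalty * (di + dj - 2)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    if cost[n][m] == inf:
        return None  # no alignment exists with the allowed match types
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return pairs[::-1]


english = ["I went home.", "It was late and I was very tired."]
turkish = ["Eve gittim.", "Geç olmuştu.", "Çok yorgundum."]
print(align(english, turkish))  # [([0], [0]), ([1], [1, 2])]
```

The manually aligned corpus built in the project would serve as the reference against which the output of such an algorithm is checked.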
- Developing an Automated Theorem Prover (MS)
In this project, a first-order, resolution-based automated theorem prover (ATP) will be developed. A literature survey on ATPs, resolution, resolution refinements, and first-order predicate calculus will be carried out. The structures and internal workings of some currently available ATP systems (e.g. Otter) will also be studied. A new ATP will be designed and implemented. At least one contribution to current research (e.g. a new refinement or a heuristic) is required. The system will be tested on standard theorem sets, and its performance will be measured (in seconds or some other metric) and compared in detail with that of other ATPs.
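The resolution principle can be illustrated on the propositional case, which avoids unification. The sketch below is a naive saturation procedure that proves a goal by refutation; it is only a starting point under that simplifying assumption and not a first-order prover. Literals are strings, with "~" marking negation, and all function names are hypothetical.

```python
from itertools import combinations


def negate(literal):
    return literal[1:] if literal.startswith("~") else "~" + literal


def resolve(c1, c2):
    """All resolvents of two clauses (each clause is a frozenset of literals)."""
    resolvents = set()
    for literal in c1:
        if negate(literal) in c2:
            resolvents.add(frozenset((c1 - {literal}) | (c2 - {negate(literal)})))
    return resolvents


def is_unsatisfiable(clauses):
    """Saturate the clause set under resolution; True if the empty clause is derived."""
    clauses = {frozenset(c) for c in clauses}
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for resolvent in resolve(c1, c2):
                if not resolvent:
                    return True  # empty clause: contradiction found
                new.add(resolvent)
        if new <= clauses:
            return False  # nothing new can be derived
        clauses |= new


def entails(premises, goal_literal):
    """Proof by refutation: add the negated goal and look for a contradiction."""
    return is_unsatisfiable(list(premises) + [[negate(goal_literal)]])


# From (P implies Q) and P, derive Q. The clause form of "P implies Q" is {~P, Q}.
print(entails([["~P", "Q"], ["P"]], "Q"))  # True
print(entails([["~P", "Q"], ["P"]], "R"))  # False
```

Resolution refinements and heuristics of the kind the project asks for typically restrict which clause pairs are resolved, replacing the exhaustive `combinations` loop used here.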
- Development of an Adaptive Chess Program (Cmpe491/492,MS)
The prerequisite for this project is that the student can play chess at least at an intermediate level. A literature survey on methods, techniques, and heuristics used in chess playing and chess programming will be carried out. How the pieces, the board, and positions on the board are evaluated will be analyzed. An algorithm for a computer program that plays chess will be proposed. An important point is that the algorithm should be able to learn and adapt itself to a changing environment. The algorithm is required to make at least one contribution (e.g. a new technique, a heuristic, or a new board evaluation approach) to chess programming. A program will be developed and implemented. The success of the program will be compared with that of other chess-playing programs.
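As a small illustration of board evaluation and adaptation, the sketch below scores a position by material balance and adjusts the piece weights with a simple delta rule when the evaluation disagrees with a target value. The starting weights are the commonly used textbook piece values; the update rule, the learning rate, and the function names are assumptions for illustration, not a method prescribed by the project. Positions are given as the piece-placement field of a FEN string.

```python
# Commonly used starting values; the king is not counted as material.
PIECE_VALUES = {"p": 1.0, "n": 3.0, "b": 3.0, "r": 5.0, "q": 9.0}


def material_features(fen_placement):
    """Piece-count differences (White minus Black) for each piece type."""
    features = {piece: 0 for piece in PIECE_VALUES}
    for ch in fen_placement:
        kind = ch.lower()
        if kind in features:
            # Uppercase letters are White pieces, lowercase are Black.
            features[kind] += 1 if ch.isupper() else -1
    return features


def evaluate(fen_placement, weights=PIECE_VALUES):
    """Linear evaluation from White's point of view."""
    features = material_features(fen_placement)
    return sum(weights[p] * features[p] for p in features)


def update_weights(weights, fen_placement, target, learning_rate=0.01):
    """Delta-rule step: move the evaluation of this position toward `target`."""
    features = material_features(fen_placement)
    error = target - evaluate(fen_placement, weights)
    return {p: weights[p] + learning_rate * error * features[p] for p in weights}


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
no_black_queen = "rnb1kbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
print(evaluate(start))           # 0.0
print(evaluate(no_black_queen))  # 9.0
```

In a full program, an evaluation function like this would be called at the leaves of a game-tree search, and the target values for the weight update could come from game outcomes or deeper searches.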