In this project, you will develop a simple HTML parser and a stemmer for Turkish web pages. The parser will just remove the HTML formatting commands (tags) from a web page and produce the contents of the document in text form (i.e., extract the tokens). The stemmer should determine whether a given word has been derived from a stem and, if so, report that stem. For example, the token "gittim" has been derived from the stem "git".
Stemming algorithms can be quite complex, and this is especially true for Turkish, so you will implement a very simple stemmer. You will be given a Turkish dictionary. To determine the stem of a word, first check whether the word is in the dictionary. If it is, you are done. If it is not, remove one letter from the end of the word and check the dictionary again. Repeat this process until you either find a dictionary word that matches the truncated word or the truncated word is down to one letter. In the latter case, conclude that the word is not in the dictionary. This algorithm does not give good results, but we will use it anyway, since developing a good Turkish stemmer could be a topic for a master's thesis.
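The two steps described above can be sketched in C roughly as follows. This is only an illustration under several assumptions, not the required design: the names strip_tags, stem, in_dictionary, MAX_LEN and DICT_SIZE are hypothetical, the four-word in-memory dictionary stands in for the real dictionary file, the text is assumed to use a single-byte encoding so that one letter is one char, and a tag is taken to be anything between '<' and '>'. The lookup is a plain linear search to keep the sketch short.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LEN   64   /* assumed upper bound on word length, including '\0' */
#define DICT_SIZE 4    /* size of the stand-in dictionary below */

/* Stand-in for the dictionary file (hypothetical contents). */
static const char *dict[DICT_SIZE] = { "ev", "gel", "git", "okul" };

/* Linear lookup: returns 1 if word is in the dictionary, 0 otherwise. */
int in_dictionary(const char *word)
{
    int i;
    for (i = 0; i < DICT_SIZE; i++)
        if (strcmp(dict[i], word) == 0)
            return 1;
    return 0;
}

/* Copy text to out, replacing every <...> tag with a single space so
   that words on either side of a tag stay separate tokens.
   out must have room for strlen(text) + 1 bytes. */
void strip_tags(const char *text, char *out)
{
    int in_tag = 0;
    for (; *text != '\0'; text++) {
        if (*text == '<') {
            in_tag = 1;
            *out++ = ' ';
        } else if (*text == '>' && in_tag) {
            in_tag = 0;
        } else if (!in_tag) {
            *out++ = *text;
        }
    }
    *out = '\0';
}

/* Truncation stemmer. Writes the stem to out and returns 1 if the word
   or one of its truncations is in the dictionary; otherwise writes the
   original word to out and returns 0. Truncation stops once removing
   another letter would leave a single letter (one reading of the rule
   in the text). out must have room for MAX_LEN bytes. */
int stem(const char *word, char *out)
{
    size_t len;

    strncpy(out, word, MAX_LEN - 1);
    out[MAX_LEN - 1] = '\0';
    len = strlen(out);

    if (in_dictionary(out))
        return 1;
    while (len > 2) {
        out[--len] = '\0';          /* remove one letter from the end */
        if (in_dictionary(out))
            return 1;
    }

    /* No match: report the original word as not found. */
    strncpy(out, word, MAX_LEN - 1);
    out[MAX_LEN - 1] = '\0';
    return 0;
}
```

With this stand-in dictionary, stem("gittim", buf) tries "gittim", "gitti", "gitt" and then "git", so it returns 1 and leaves "git" in buf. A word with no matching prefix, such as "xyz", returns 0 and is reported unchanged.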
In your project, you will implement a single program called parstem which incorporates both the parser and the stemmer. Your program will be invoked as follows:
parstem dictionaryfile htmlfile1 htmlfile2 htmlfile3 ....
The dictionaryfile will contain Turkish words in alphabetical order, one word per line. The htmlfileX arguments are the names of the HTML files to be parsed. The output of your program should be the list of stem words together with the words that could not be found in the dictionary. The list should be output in alphabetical order, one word per line. After each word you should also print (i) the number of times the word appeared in the files and (ii) a 1 if the word stem appeared in the dictionary or a 0 if it did not. Additionally, after the list of words, you should print the following statistics:
Implementation Notes
Since we have not yet covered the topic of dynamic memory allocation, you
will use statically defined data structures in this project:
Miscellaneous Notes
HTML parsers and stemming procedures are used in real-life search engines
(such as http://www.google.com and
http://www.altavista.com).