Assignments
LING 467
New assignments will be added here each week.
Assignment 1 -- Due 23 Jan 08
1. Reading:
a. Chapter 1 of Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto
b. Article by Grefenstette and Tapanainen, "What is a Word? What is a Sentence?" (1994)
2. Send me an email with the following information:
Your name
Your email address
Your status (undergrad, grad, which program you are in)
What prior programming experience do you have?
What would you like to get from this course?
Do you have a personal computer? If so, what kind?
3. Download Perl and install it on your machine.
Assignment 2 -- Due 30 Jan 08
1. Reading
Chapter 2 of Information Retrieval by van Rijsbergen.
2. Program
Write a Perl program that finds the words in a text file, counts them, and
displays a frequency list for the words in the text. Your program
should use a subroutine that accepts a string as input and returns
a list of the words found. Try to make your subroutine robust,
not just the simplest thing that works.
Here is some Test Data that you can use to test your program.
3. Find at least one example of a string that would be problematic
for an English word-finding program, e.g. "4x4" or "$1million".
(Send me an email with your example(s)).
Bonus Points! Send at least one problematic example from a language
other than English. Please include a translation and a full description
of why your example might give an IR system difficulties.
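As a starting point for the word-finding program in Assignment 2, here is a minimal sketch, not a model solution. The subroutine names find_words and count_words and the sample sentence are made up for illustration. The letters-only pattern is deliberately naive so that the problem cases from item 3 show up in the output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# find_words (hypothetical name): accepts a string and returns a list of
# the words found. Here a "word" is just a run of the letters a-z after
# lowercasing, which is the simplest thing that works.
sub find_words {
    my ($text) = @_;
    my @words = lc($text) =~ /[a-z]+/g;
    return @words;
}

# count_words (hypothetical name): takes a list of words and returns a
# hash that maps each word to the number of times it occurs.
sub count_words {
    my %freq;
    $freq{$_}++ for @_;
    return %freq;
}

# Read the files named on the command line; fall back to a sample
# sentence so the script also runs with no arguments.
my $sample = 'The cat saw the 4x4. The cat paid $1million.';
my $text   = @ARGV ? do { local $/; <> } : $sample;
$text = '' unless defined $text;

my %freq = count_words(find_words($text));

# Most frequent words first; ties are listed alphabetically.
for my $word (sort { $freq{$b} <=> $freq{$a} or $a cmp $b } keys %freq) {
    printf "%6d  %s\n", $freq{$word}, $word;
}
```

Run on the sample sentence, this sketch reports "4x4" as the single word "x" and "$1million" as "million". Deciding what a good subroutine should do with strings like these is the point of the exercise.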
Assignment 3 -- Due 6 Feb 08
1. Reading - Chapter 3 of Baeza-Yates/Ribeiro-Neto.
Also read the web pages on Zipfian distributions
and stopwords.
2. Program
Refine your word-finding program. This time, you should
write a subroutine called TokenizeWords that takes a
text string as an argument and returns an array of the
words in the string. Your program should work with
the driver program provided HERE.
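One possible shape for TokenizeWords is sketched below. The pattern is an assumption about what counts as a word, not the required definition: it keeps internal apostrophes and hyphens, keeps digits, and keeps a dollar sign directly in front of a digit. It also preserves case, which may or may not be what the driver program expects.

```perl
use strict;
use warnings;

# TokenizeWords: takes a text string and returns an array of the words
# found in it. The token pattern is one assumed definition of "word".
sub TokenizeWords {
    my ($text) = @_;
    return () unless defined $text;

    my @words = $text =~ /
        (?: \$ (?=\d) )?              # optional dollar sign, only before a digit
        [A-Za-z0-9]+                  # a run of letters and digits
        (?: ['-] [A-Za-z0-9]+ )*      # internal apostrophes or hyphens
    /gx;

    return @words;
}

# Small usage example; the sentence is made up for illustration.
my @words = TokenizeWords(q{Don't pay $1million for a 4x4, state-of-the-art or not.});
print join('|', @words), "\n";
# prints: Don't|pay|$1million|for|a|4x4|state-of-the-art|or|not
```

Trailing punctuation is dropped because an apostrophe or hyphen is only accepted when more letters or digits follow it.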
Assignment 4 -- Due 12 Mar 08
1. Study for midterm
The midterm will cover everything in the course so far.
This includes anything covered in lecture, the
readings, and the assignments.
2. Program
Begin work on your indexer. For this assignment, your indexer
need only process the documents far enough to create the
Document Information File. Later assignments will enhance
this version to create the other two files.
Assignment 5 -- Due 19 Mar 08
1. Enhance your indexer so that it generates all three files of the index.
Assignment 6 -- Due 26 Mar 08
1. Read the Wikipedia page on the Google PageRank algorithm.
2. Continue working on your indexer. Add in the additional information
that you need to support TF-IDF and any other features that you may want
to use for your search engine.
3. Write up a preliminary idea of what you will do to make your search
engine uniquely yours. This is only preliminary. If you
find that you want to change it later, that will be all right.
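To make the TF-IDF requirement in item 2 concrete, the sketch below shows the information an index has to hold to compute a weight: per-document term counts, per-term document counts, and the number of documents. It uses one common variant, raw term frequency times log(N / df). That choice, the name tf_idf, and the three toy documents are assumptions for illustration and may differ from the formula used in lecture.

```perl
use strict;
use warnings;

# Toy collection: document id => list of tokens (made-up data).
my %docs = (
    d1 => [qw(perl is a language)],
    d2 => [qw(perl perl indexer)],
    d3 => [qw(a search engine)],
);
my $n_docs = scalar keys %docs;

# $tf{$doc}{$term}: how often $term occurs in $doc.
# $df{$term}:       how many documents contain $term.
my (%tf, %df);
for my $doc (keys %docs) {
    $tf{$doc}{$_}++ for @{ $docs{$doc} };
    $df{$_}++ for keys %{ $tf{$doc} };
}

# tf_idf (hypothetical name): weight of $term in $doc, computed as
# tf * log(N / df). Returns 0 when the term does not occur in the document.
sub tf_idf {
    my ($term, $doc) = @_;
    return 0 unless exists $tf{$doc} && exists $tf{$doc}{$term};
    return $tf{$doc}{$term} * log($n_docs / $df{$term});
}

printf "%-8s in d2: %.4f\n", $_, tf_idf($_, 'd2') for qw(perl indexer engine);
```

In this toy collection "indexer" outweighs "perl" in d2 even though "perl" occurs twice, because "perl" appears in two of the three documents and "indexer" in only one.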