Here is some Test Data
that could be used to test your program.
3. Find at least one example of some string that would be problematic
for a word finding program, e.g. "4x4" or "$1million".
(Send me an email with your example(s)).
Assignment 3 -- Due 10 Feb 04
General Instructions
Each time I make a programming assignment, everyone in
the class will be turning in a program. In order to keep
track of each student's programs, I would like you to
adopt the following naming convention. Please make the
name of each program start with your initials followed
by _HWn.pl (where n is the number of the assignment
to which you are responding). Thus, for assignment 2,
my program would be called GVW_HW2.pl. If you need to
turn in more than one program for a single assignment,
place A, B, ... after the assignment number (e.g.
GVW_HW2A.pl).
1. Program
Refine your word finding program. This time, you should
make a subroutine called TokenizeWords that takes a
text string as an argument and returns an array of the
words in the string. Your program should work with
the driver program provided HERE. The text that I used
for testing your previous assignments and discussed in
class is available HERE.
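
As a starting point, the subroutine might be sketched as follows. This is
only a minimal sketch: the definition of a "word" used here (a run of
letters, digits, or apostrophes) is an assumption, not the required one,
and the sample string is made up for illustration.

```perl
use strict;
use warnings;

# Sketch of TokenizeWords: takes one text string, returns a list of words.
# Assumption: a word is any run of letters, digits, or apostrophes.
sub TokenizeWords {
    my ($text) = @_;
    # In list context, a /g match returns every matching substring.
    my @words = $text =~ /[A-Za-z0-9']+/g;
    return @words;
}

# q{} avoids interpolating the "$1" in the sample text.
my @sample = TokenizeWords(q{The cat's 4x4 cost $1million.});
print join("|", @sample), "\n";    # prints "The|cat's|4x4|cost|1million"
```

Note that the "$" is silently dropped from "$1million" under this
definition, which is one reason such strings are problematic.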
2. Program
Make a subroutine called IndexableWords that takes
a list of words (from TokenizeWords) and post-processes
them into a list of words for indexing. It should work
with the same driver program as above.
Note: We will talk about this assignment in class.
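
One possible shape for this subroutine is sketched below. The specific
post-processing steps (lowercasing, trimming stray apostrophes, dropping
stop words) and the tiny stop list are assumptions chosen for illustration.
The sketch also assumes the words arrive as a flat list rather than an
array reference, so check how the driver program calls it.

```perl
use strict;
use warnings;

# Hypothetical stop list; a real one would be much longer.
my %STOP = map { $_ => 1 } qw(a an and of the to);

# Sketch of IndexableWords: takes a list of words, returns index terms.
sub IndexableWords {
    my (@words) = @_;
    my @terms;
    for my $word (@words) {
        my $term = lc $word;              # fold case
        $term =~ s/^'+|'+$//g;            # trim leading/trailing apostrophes
        next if $term eq '' || $STOP{$term};
        push @terms, $term;
    }
    return @terms;
}

print join(" ", IndexableWords(qw(The Cat and the 'Hat'))), "\n";    # prints "cat hat"
```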
Assignment 4 -- Due 17 Feb 04
1. Reading
Chapter 4 of Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto
2. Program
Write a program that will process a file. Your program should:
a. Identify documents (document boundaries)
b. Identify metadata like title and date
c. Identify the text body of the document
Your output does not need to be elaborate. This will become a part
of your indexer where the output will be the index files.
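
The three steps above might be sketched as follows. The markup used here
(<DOC>, <TITLE>, <DATE>, and <TEXT> tags) is a hypothetical format assumed
for illustration only. Look at the actual test file to see how its
documents and fields are delimited, and adjust the patterns to match.

```perl
use strict;
use warnings;

# Sketch: split a file's content into documents and pull out the fields.
# Assumption: documents look like
#   <DOC><TITLE>...</TITLE><DATE>...</DATE><TEXT>...</TEXT></DOC>
sub ParseDocuments {
    my ($content) = @_;
    my @docs;
    while ($content =~ m{<DOC>(.*?)</DOC>}gs) {       # a. document boundaries
        my $doc = $1;
        my ($title) = $doc =~ m{<TITLE>(.*?)</TITLE>}s;   # b. metadata
        my ($date)  = $doc =~ m{<DATE>(.*?)</DATE>}s;
        my ($body)  = $doc =~ m{<TEXT>(.*?)</TEXT>}s;     # c. text body
        for ($title, $date, $body) {
            $_ = '' unless defined $_;                # missing field
            s/^\s+|\s+$//g;                           # trim whitespace
        }
        push @docs, { title => $title, date => $date, body => $body };
    }
    return @docs;
}

my $sample_file = "<DOC>\n<TITLE>First</TITLE>\n<DATE>17 Feb 04</DATE>\n"
                . "<TEXT>\nHello world.\n</TEXT>\n</DOC>\n";
for my $d (ParseDocuments($sample_file)) {
    print "Title: $d->{title}\nDate: $d->{date}\nBody: $d->{body}\n";
}
```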
Assignment 5 -- Due 22 Feb 04
1. Reading
Chapter 8 of Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto
2. Program
In assignments 2 and 3, you wrote a subroutine to find words in a text.
In assignment 4, you wrote a program to handle multiple documents. For
this assignment, you should merge the two. Your program should accept a
list of files, each of which may contain multiple documents. It should
loop through the documents and extract relevant metadata (at least the
title) and identify the body of the text. For each document, your program
should print out the title, the total number of words (tokens) in the
text and the number of distinct words (types) in the text.
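
The token and type counts can be sketched with a hash, as below. Treating
words that differ only in case as the same type is an assumption. The
subroutine name CountTokensAndTypes is made up for this example.

```perl
use strict;
use warnings;

# Sketch: given a document's words, return (token count, type count).
# Assumption: case is folded, so "The" and "the" are one type.
sub CountTokensAndTypes {
    my (@words) = @_;
    my %seen;
    $seen{lc $_}++ for @words;
    return (scalar @words, scalar keys %seen);
}

my ($tokens, $types) = CountTokensAndTypes(qw(the cat saw the other cat));
print "tokens=$tokens types=$types\n";    # prints "tokens=6 types=4"
```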
Assignment 6 -- Due 23 March 04
Program: Enhance your program from Assignment 5 to build your indexer.
You should either use the three-file index structure discussed in
class or submit a description of the structure that you are using.
You will need to keep track of things like the byte position of the
documents to support retrieval later.
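
Recording byte positions might be sketched with Perl's built-in tell and
seek, as below. This does not show the three-file index structure itself.
The <DOC> and </DOC> marker lines and both subroutine names are
assumptions for illustration, and an in-memory filehandle stands in for a
real file. With a real file, consider calling binmode on the handle so
that positions are true byte offsets.

```perl
use strict;
use warnings;

# Sketch: return [byte offset, byte length] for each document in a file.
# Assumption: documents start with a <DOC> line and end with a </DOC> line.
sub DocumentOffsets {
    my ($fh) = @_;
    my (@offsets, $start);
    while (1) {
        my $pos  = tell($fh);             # position before reading the line
        my $line = <$fh>;
        last unless defined $line;
        $start = $pos if $line =~ /^<DOC>/;
        if ($line =~ m{^</DOC>} && defined $start) {
            push @offsets, [$start, tell($fh) - $start];
            undef $start;
        }
    }
    return @offsets;
}

# Sketch: jump straight to a stored offset and read one document back.
sub FetchDocument {
    my ($fh, $offset, $length) = @_;
    seek($fh, $offset, 0) or die "seek failed: $!";
    read($fh, my $buffer, $length);
    return $buffer;
}

my $data = "<DOC>\nfirst\n</DOC>\n<DOC>\nsecond\n</DOC>\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
my @offsets = DocumentOffsets($fh);
print "document at byte $_->[0], $_->[1] bytes long\n" for @offsets;
print FetchDocument($fh, @{ $offsets[1] });
```

Storing the offset and length at indexing time means the search engine can
later display a matching document without rescanning the whole file.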
Assignment 7 -- Due 30 March 04
Write a preliminary version of your search engine.
Use the index files that you generated last time to
find files that contain words from a user query.
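
The lookup step might be sketched as follows. A small in-memory hash
stands in for the index files here, since their layout depends on the
structure you chose in Assignment 6. The file names, the Search subroutine,
and the choice to rank files by the number of distinct query words they
contain are all assumptions for illustration.

```perl
use strict;
use warnings;

# Stand-in for an index loaded from your index files:
# term => { file name => 1 }. The entries are made up.
my %index = (
    perl   => { 'doc1.txt' => 1, 'doc3.txt' => 1 },
    search => { 'doc2.txt' => 1, 'doc3.txt' => 1 },
    engine => { 'doc2.txt' => 1 },
);

# Sketch: return files containing at least one query word,
# those matching the most distinct query words first.
sub Search {
    my ($index, $query) = @_;
    my (%seen, %hits);
    my @terms = grep { !$seen{$_}++ } map { lc $_ } $query =~ /\w+/g;
    for my $term (@terms) {
        my $postings = $index->{$term} or next;   # word not in the index
        $hits{$_}++ for keys %$postings;
    }
    return sort { $hits{$b} <=> $hits{$a} || $a cmp $b } keys %hits;
}

my @results = Search(\%index, "Perl search");
print "$_\n" for @results;    # doc3.txt first, then doc1.txt and doc2.txt
```

For the results to be consistent, the query words should go through the
same tokenizing and post-processing as the words that were indexed.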