Assignments
LING 467

New assignments will be added here each week.

Assignment 1 -- Due 19 Jan 04

1. Reading:
 a. Chapter 1 of Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto
 b. Article by Grefenstette and Tapanainen
    What is a Word? What is a Sentence? (1994)

2. Send me an email with the following information:

Your name
Your email address
Your status (undergrad, grad, which program you are in)
What prior programming experience do you have?
What would you like to get from this course?
Do you have a personal computer? If so, what kind?

3. Download perl and install it on your machine.


Assignment 2 -- Due 26 Jan 04

1. Reading
 a. Chapter 2 of Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto
 b. Steve Lawrence and C. Lee Giles, NEC Research Institute
   Searching the World Wide Web - From Science Magazine

2. Program
Write a Perl program to find the words in a text file, count them, and
display a frequency list for the words in the text. Your program
should use a subroutine that accepts a string as input and returns
a list of the words found. Try to make your function a good one,
not just the simplest thing that works.
Here is some Test Data that could be used to test your program.

3. Find at least one example of a string that would be problematic for a
word-finding program, e.g. "4x4" or "$1million". (Send me an email with
your example(s).)
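As a rough illustration of the shape item 2 asks for (a subroutine that turns a string into a list of words, plus a frequency count), here is a minimal sketch. The names find_words and count_words and the word pattern are assumptions made for this example, not requirements. The sketch works on a string rather than a file to stay self-contained, and it is not meant as a model answer.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the list of words found in a string.
# Assumed rule: a word is a run of letters that may contain internal
# apostrophes or hyphens; words are lowercased.
sub find_words {
    my ($text) = @_;
    return map { lc $_ } $text =~ /[A-Za-z]+(?:['-][A-Za-z]+)*/g;
}

# Hypothetical helper: counts how often each word occurs in a string.
sub count_words {
    my ($text) = @_;
    my %count;
    $count{$_}++ for find_words($text);
    return %count;
}

my %freq = count_words("The cat saw the other cat. Don't panic!");

# Most frequent words first; ties are broken alphabetically.
for my $word (sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq) {
    printf "%-10s %d\n", $word, $freq{$word};
}
```

A pattern this simple will mishandle strings such as "4x4" or "$1million", which is the kind of problem item 3 asks you to look for.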

Assignment 3 -- Due 10 Feb 04

General Instructions
Each time I give a programming assignment, everyone in
the class will be turning in a program. To keep
track of each student's programs, I would like you to
adopt the following naming convention. Please make the
name of each program start with your initials followed
by _HWn.pl (where n is the number of the assignment
to which you are responding). Thus, for assignment 2,
my program would be called GVW_HW2.pl. If you need to
turn in more than one program for a single assignment,
place A, B, ... after the assignment number (e.g.
GVW_HW2A.pl).
    
1. Program
Refine your word-finding program. This time, you should
make a subroutine called TokenizeWords that takes a
text string as an argument and returns an array of the
words in the string. Your program should work with
the driver program provided HERE. The text that I used
for testing your previous assignments and discussed in
class is available HERE.
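The interface described above (text string in, array of words out) might look like the sketch below. The token rule is only an assumption chosen to show how strings like "4x4" and "$1million" can survive as single tokens. It is not the rule discussed in class, and the driver program is not reproduced here.

```perl
use strict;
use warnings;

# Sketch of TokenizeWords: takes a text string, returns an array of words.
# Assumed token rule: an optional leading "$", then letters and digits;
# an apostrophe, hyphen, period or comma is kept only when it sits
# between two alphanumeric characters.
sub TokenizeWords {
    my ($text) = @_;
    my @words = $text =~ /\$?[A-Za-z0-9]+(?:['\-.,][A-Za-z0-9]+)*/g;
    return @words;
}

my @words = TokenizeWords(q{A 4x4 truck costs $1million, doesn't it? Yes: 3.5 tons.});
print join("|", @words), "\n";
```

Sentence-final punctuation is dropped because the pattern requires an alphanumeric character after any internal punctuation mark.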
    
2. Program
Make a subroutine called IndexableWords that takes
a list of words (from TokenizeWords) and post-processes
them into a list of words for indexing. It should work
with the same driver program as above.

Note: We will talk about this assignment in class.
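One possible reading of "post-processes" is sketched below. The steps (lowercasing, stripping a possessive 's, dropping stop words) and the tiny stop list are assumptions for illustration. The processing actually expected is the one we discuss in class.

```perl
use strict;
use warnings;

# Hypothetical stop list; a real one would be chosen for the collection.
my %STOP = map { $_ => 1 } qw(a an and in of the to);

# Sketch of IndexableWords: takes a list of tokens and returns the list
# of terms to index. Assumed steps: lowercase, strip a possessive "'s",
# drop tokens with no letter or digit, drop stop words.
sub IndexableWords {
    my @tokens = @_;
    my @terms;
    for my $token (@tokens) {
        my $term = lc $token;
        $term =~ s/'s$//;
        next unless $term =~ /[a-z0-9]/;
        next if $STOP{$term};
        push @terms, $term;
    }
    return @terms;
}

print join(" ", IndexableWords(qw(The Dog's bone and THE 4x4))), "\n";
```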
    
    

Assignment 4 -- Due 17 Feb 04

1. Reading
Chapter 4 of Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto

2. Program
Write a program that will process a file. Your program should:
 a. Identify documents (document boundaries)
 b. Identify metadata like title and date
 c. Identify the text body of the document

Your output does not need to be elaborate. This program will become
part of your indexer, whose output will be the index files.
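The three steps above can be sketched as follows. The markup is purely an assumption for this example: each document is taken to sit inside <DOC>...</DOC> with <TITLE>, <DATE> and <TEXT> elements, and SplitDocuments is a hypothetical name. The file you are given may mark boundaries differently, so treat this only as a pattern to adapt.

```perl
use strict;
use warnings;

# Assumed (hypothetical) markup: <DOC>...</DOC> around each document,
# with <TITLE>, <DATE> and <TEXT> elements inside it.
sub SplitDocuments {
    my ($data) = @_;
    my @docs;
    while ($data =~ m{<DOC>(.*?)</DOC>}gs) {
        my $doc = $1;    # copy before the matches below overwrite $1
        my ($title) = $doc =~ m{<TITLE>\s*(.*?)\s*</TITLE>}s;
        my ($date)  = $doc =~ m{<DATE>\s*(.*?)\s*</DATE>}s;
        my ($body)  = $doc =~ m{<TEXT>\s*(.*?)\s*</TEXT>}s;
        push @docs, { title => $title, date => $date, body => $body };
    }
    return @docs;
}

my $sample = <<'END';
<DOC>
<TITLE>First story</TITLE>
<DATE>19 Jan 04</DATE>
<TEXT>
Some text here.
</TEXT>
</DOC>
<DOC>
<TITLE>Second story</TITLE>
<TEXT>More text.</TEXT>
</DOC>
END

for my $doc (SplitDocuments($sample)) {
    my $date = defined $doc->{date} ? $doc->{date} : 'no date';
    print "$doc->{title} ($date): $doc->{body}\n";
}
```

A missing element comes back as undef, so the caller decides how to report documents without a date or title.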
    

Assignment 5 -- Due 22 Feb 04

1. Reading
Chapter 8 of Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto

2. Program
In assignments 2 and 3, you wrote a subroutine to find words in a text.
In assignment 4, you wrote a program to handle multiple documents. For
this assignment, you should merge the two. Your program should accept a
list of files, each of which may contain multiple documents. It should
loop through the documents and extract relevant metadata (at least the
title) and identify the body of the text. For each document, your program
should print out the title, the total number of words (tokens) in the
text and the number of distinct words (types) in the text.
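The difference between tokens and types can be seen in a small sketch. CountTokensAndTypes is a hypothetical name, and the tokenizer inside it is a crude stand-in assumed only for this example. In your program the words would come from your own TokenizeWords.

```perl
use strict;
use warnings;

# Count tokens (all word occurrences) and types (distinct words) in a text.
# Assumed stand-in tokenizer: lowercased runs of letters and digits.
sub CountTokensAndTypes {
    my ($text) = @_;
    my @tokens = map { lc $_ } $text =~ /[A-Za-z0-9]+/g;
    my %seen;
    $seen{$_}++ for @tokens;
    return (scalar @tokens, scalar keys %seen);
}

my ($tokens, $types) = CountTokensAndTypes("The cat and the hat and the bat");
print "tokens=$tokens types=$types\n";
```

Here the sentence has 8 tokens but only 5 types, because "the" occurs three times and "and" twice.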
    
    

Assignment 6 -- Due 23 March 04

Program: Enhance your program from Assignment 5 to build your indexer.
You should either use the three-file index structure discussed in
class or submit a description of the structure that you are using.
You will need to keep track of things like the byte position of the
documents to support retrieval later.
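To show what keeping track of byte positions buys you, the sketch below records where each document starts with tell and later jumps back with seek. It makes simplifying assumptions: one document per line, an in-memory filehandle instead of a real file, and an in-memory hash instead of index files. It does not reproduce the three-file structure from class, and FetchDocument is a hypothetical name.

```perl
use strict;
use warnings;

# Assumptions for this sketch only: one document per line, read from an
# in-memory filehandle so that the example is self-contained.
my $collection = "red fish blue fish\nold fish\nnew blue car\n";
open my $fh, '<', \$collection or die "cannot open collection: $!";

my @offset;      # document number -> byte position where it starts
my %postings;    # word -> { document number -> occurrence count }
my $docno = 0;

my $pos = tell $fh;                 # position before the first read
while (defined(my $line = <$fh>)) {
    $offset[$docno] = $pos;
    $postings{lc($_)}{$docno}++ for $line =~ /[A-Za-z0-9]+/g;
    $docno++;
    $pos = tell $fh;                # start of the next document
}

# Retrieval later: seek straight to a document using its stored offset.
sub FetchDocument {
    my ($n) = @_;
    seek $fh, $offset[$n], 0 or die "seek failed: $!";
    my $line = <$fh>;
    chomp $line;
    return $line;
}

print "fish: docs ", join(",", sort { $a <=> $b } keys %{ $postings{fish} }), "\n";
print "doc 2 starts at byte $offset[2]: ", FetchDocument(2), "\n";
```

Storing the offset means retrieval never has to rescan the collection from the beginning.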
    
    

Assignment 7 -- Due 30 March 04

Write a preliminary version of your search engine.
Use the index files that you generated last time to
find files that contain words from a user query.
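A minimal sketch of the lookup step is shown below. The hash %index is toy data standing in for your index files, the Search subroutine is a hypothetical name, and ranking by the number of matched query words is only one assumed choice. Requiring every query word to match would be an equally reasonable first version.

```perl
use strict;
use warnings;

# Toy in-memory index standing in for the index files (hypothetical data):
# word -> list of documents that contain it.
my %index = (
    fish => [ 'doc1.txt', 'doc2.txt' ],
    blue => [ 'doc1.txt', 'doc3.txt' ],
    car  => [ 'doc3.txt' ],
);

# Return the documents containing at least one query word, best match
# first (most distinct query words matched; ties broken by name).
sub Search {
    my ($query) = @_;
    my (%seen_word, %hits);
    for my $word (map { lc $_ } $query =~ /[A-Za-z0-9]+/g) {
        next if $seen_word{$word}++;            # count each query word once
        $hits{$_}++ for @{ $index{$word} || [] };
    }
    return sort { $hits{$b} <=> $hits{$a} || $a cmp $b } keys %hits;
}

print join("\n", Search("Blue fish")), "\n";
```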