Distribution of Words - Zipfian Distribution

This page presents some data that categorizes query words by the number of documents in which they occur. Words that appear in only one document will be very selective. If the goal of the user is to answer a question, this could be very helpful in eliminating extraneous information. However, if the answer to the question is not in this document, or if the goal is for more general knowledge discovery, such words may be too restrictive. At the other end of the spectrum, a word that appears in every document cannot be much help in selecting documents.
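The quantity being counted here is often called the document frequency of a word. As a minimal sketch (the function name and the tiny sample corpus are made up for illustration and are not the code or data used for this page), it could be computed like this:

```python
from collections import Counter

def document_frequency(docs):
    """Map each word to the number of documents that contain it.

    `docs` is an iterable of token lists, one list per document.
    """
    df = Counter()
    for tokens in docs:
        # set() counts a word once per document, no matter how
        # many times it occurs inside that document.
        df.update(set(tokens))
    return df

# Tiny made-up corpus (not the Reuters data).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "the"],
    ["a", "bird"],
]
df = document_frequency(docs)
print(df["the"], df["a"])  # "the" is in 2 documents, "a" in 1
```

A word whose count equals the number of documents cannot separate any document from any other, while a word with a count of 1 picks out exactly one document.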

The "words" used were consecutive strings of nonblank characters. Internal punctuation was kept (thus two-fisted is a word), but external punctuation was removed (so $1million became 1million). All upper case letters were translated to lower case. Each word was required to contain at least one alphabetic character (thus, 7up is a word, but 1998 is not).
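The rules above can be sketched in a few lines of Python. This is only an approximation of the procedure described: it assumes that "punctuation" means the ASCII characters in string.punctuation, and the function name tokenize is hypothetical.

```python
import string

def tokenize(text):
    """Split text into words using the rules described above."""
    words = []
    for chunk in text.split():                  # strings of nonblank characters
        # Remove punctuation from both ends only; internal punctuation stays.
        word = chunk.strip(string.punctuation).lower()
        # Keep the word only if it has at least one alphabetic character.
        if any(ch.isalpha() for ch in word):
            words.append(word)
    return words

print(tokenize("Two-fisted $1million 7up 1998."))
# prints ['two-fisted', '1million', '7up']
```

Note that a string such as "figure...(it" survives as a single word under these rules, because all of its punctuation is internal.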

This data came from a corpus of 825 articles from the Reuters News Service that appeared in December of 1997. The articles included a total of 388416 words of which 20488 were distinct.

Results

There were 10 words that appeared in every document. They were

  a                reuters
  date             source
  december         text
  est              the
  online           title

Two of these words, "a" and "the", really occurred in all articles. The other words came from the headers on the documents. This points out the importance of not indexing headers as ordinary text. It is easy to see that this problem extends further. The articles had timestamps on them. Thus, we find that "am" appeared in 383 articles and "pm" appeared in 480. The counts for the names of the days of the week were similarly distorted.

At the opposite end of the spectrum, about 41% of all distinct words appeared in only one article. These words are quite specific (at least for this corpus). Click HERE to see a small sample of these words. Some of these words (e.g. "chunk") could be expected to appear more often in a larger corpus. Others (e.g. "figure...(it") seem unlikely to recur.
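The 41% figure can be checked from two numbers given on this page: the 20488 distinct words in the corpus and the 8499 words that appear in exactly 1 document (quoted in the graph discussion further down). A quick check, with variable names chosen just for this example:

```python
distinct_words = 20488     # distinct words in the corpus
words_in_one_doc = 8499    # words that appear in exactly 1 document

fraction = words_in_one_doc / distinct_words
print(f"{fraction:.1%}")   # prints 41.5%
```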

Now let's get an overall impression of how the words were distributed. Here is a graph showing the full range of the data. At this scale it is only slightly informative.
[Graph: Words vs Docs]
Notice that a huge number of words appear in a small number of documents. As described in class, the number of words falls off rapidly as the number of documents increases.
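The curve in the graph is a "count of counts": for each number of documents n, how many distinct words appear in exactly n documents. Assuming you already have a mapping from each word to its document count, a sketch of that step might look like this (the function name and the sample numbers are invented for illustration):

```python
from collections import Counter

def words_per_doc_count(df):
    """Return {n: number of distinct words that occur in exactly n documents}."""
    return Counter(df.values())

# Hypothetical document counts, for illustration only.
df = {"the": 3, "a": 3, "reuters": 3, "chunk": 1, "two-fisted": 1, "pm": 2}

curve = words_per_doc_count(df)
for n in sorted(curve):
    print(n, curve[n])   # number of documents, number of words
```

Plotting the second column against the first gives a graph of the same kind as the one above.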

In order to get a better idea about what is going on, let's look at some restricted parts of the curve, starting with the very beginning of the data.
[Graph: Words vs Docs]

As you can see, the graph drops quickly, from 8499 words that appear in 1 document to only 121 words that appear in 15 documents. That is why everything on the right of the first graph seems to be zero: the number of words there is very small compared to 8499.

Further out to the right, the graph continues down fairly quickly. However, the numbers have gotten so small that irregularities in the data have become rather apparent. When the number of words was 200, a change of 1 or 2 was not visible in the graph. Now these changes stick out.

Even further to the right, all entries are near zero. Specifically, of the document counts over 150, 543 have 0 words, 105 have 1 word, 20 have 2 words, and 6 have 3 words.
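A tally like this one can be produced from the count-of-counts curve by looking at every document count in a range and recording how many words fall on it, including the ones with none. The sketch below uses a hypothetical helper, tail_tally, and a toy curve. It also adds up the four figures quoted above. They sum to 674, which is what you would expect if the tally runs from 151 through 824 documents and leaves out the 825-document entry that holds the 10 words listed earlier. That reading is an assumption, not something stated on this page.

```python
from collections import Counter

def tail_tally(curve, start, stop):
    """Count how many document counts in [start, stop] have 0, 1, 2, ... words.

    `curve` maps a document count to the number of words with that count;
    document counts missing from `curve` are treated as having 0 words.
    """
    return Counter(curve.get(n, 0) for n in range(start, stop + 1))

# Toy curve for illustration only (not the Reuters data).
toy_curve = {151: 1, 153: 2}
toy = tail_tally(toy_curve, 151, 154)
print(toy[0], toy[1], toy[2])   # prints 2 1 1

# Figures quoted in the text for document counts over 150.
reported = {0: 543, 1: 105, 2: 20, 3: 6}
total = sum(reported.values())
print(total)                    # prints 674
```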