The "words" used were consecutive strings of nonblank characters. Internal punctuation was kept (thus two-fisted is a word), but external punctuation was removed (so $1million became 1million). All upper case letters were translated to lower case. Each word was required to contain at least one alphabetic character (thus 7up is a word, but 1998 is not).
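The rules above can be sketched in a few lines of Python. This is only an illustration, not the program actually used for the study: the function name `tokenize` is made up here, and "punctuation" is assumed to mean the ASCII characters in `string.punctuation`.

```python
import string

def tokenize(text):
    """Split text into 'words' using the rules described above (a sketch)."""
    words = []
    for token in text.split():                   # consecutive nonblank characters
        # Assumption: "external punctuation" = leading/trailing ASCII punctuation.
        token = token.strip(string.punctuation)  # internal punctuation is kept
        token = token.lower()                    # upper case -> lower case
        if any(ch.isalpha() for ch in token):    # need at least one letter
            words.append(token)
    return words

print(tokenize("Two-fisted $1million 7UP 1998"))
```

With this sketch, "Two-fisted" and "7UP" survive (lowercased), "$1million" loses its dollar sign, and "1998" is dropped because it has no letters.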
This data came from a corpus of 825 articles from the Reuters News Service that appeared in December of 1997. The articles included a total of 388416 words, of which 20488 were distinct.
The following words were counted as appearing in every article:

a reuters date source december text est the online title

Two of these words, "a" and "the", really did occur in all articles. The other words came from the headers on the documents. This shows the importance of not indexing headers as ordinary text. The problem extends further: the articles carried timestamps, so "am" appeared in 383 articles and "pm" appeared in 480. The counts for the names of the days of the week were similarly distorted.
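To see how a header word ends up "in every article", here is a minimal sketch of counting the number of articles each word appears in. The function name `document_frequency` and the toy articles are invented for illustration; they are not taken from the Reuters data.

```python
from collections import Counter

def document_frequency(articles):
    """Map each word to the number of articles that contain it at least once."""
    df = Counter()
    for words in articles:
        df.update(set(words))   # set() so a word counts once per article
    return df

# Hypothetical toy articles, already split into words.
# Each one starts with the same header word, "source".
articles = [
    ["source", "oil", "prices", "rose"],
    ["source", "the", "bank", "cut", "rates"],
    ["source", "the", "oil", "market"],
]
df = document_frequency(articles)
in_all = sorted(w for w, n in df.items() if n == len(articles))
print(in_all)  # ['source']
```

Stripping the header before counting would leave no word in all three toy articles, which is the point made above about not indexing headers as ordinary text.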
At the opposite end of the spectrum, about 41% of all words appeared in only one article. These words are quite specific (at least for this corpus). Click HERE to see a small sample of these words. Some of these words (e.g. "chunk") could be expected to appear more often in a larger corpus. Others (e.g. "figure...(it") seem unlikely to recur.
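A figure like the 41% above is the fraction of distinct words whose article count is exactly one. A small sketch, assuming the per-word article counts are already in a dictionary (the name `singleton_fraction` and the sample counts are made up):

```python
def singleton_fraction(df):
    """Fraction of distinct words that occur in exactly one article."""
    if not df:
        return 0.0
    singletons = sum(1 for n in df.values() if n == 1)
    return singletons / len(df)

# Hypothetical counts: word -> number of articles containing it.
sample_df = {"chunk": 1, "oil": 3, "the": 4, "bank": 1}
print(singleton_fraction(sample_df))  # 0.5
```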
Now let's get an overall impression of how the words were distributed.
Here is a graph showing the full range of the data. At this scale
it is not very informative.
Notice that a huge number of words appear in a small number of documents.
As described in class, the number of words falls off rapidly as the number
of documents increases.
Even further to the right, all entries are near zero. Specifically, for document counts above 150, 543 counts have 0 words, 105 have 1 word, 20 have 2 words, and 6 have 3 words.
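The graphs and the tail summary can both be derived from the per-word article counts. The sketch below is one way to do it, under the assumption that those counts are held in a dictionary; the function names and the toy data are hypothetical.

```python
from collections import Counter

def words_per_document_count(df, n_articles):
    """For each k in 1..n_articles, count the words appearing in exactly k articles."""
    hist = Counter(df.values())
    return [hist.get(k, 0) for k in range(1, n_articles + 1)]

def tail_summary(counts, threshold):
    """Tally how many document counts above `threshold` have 0, 1, 2, ... words."""
    # counts[k - 1] holds the value for k articles, so the slice starts at k = threshold + 1.
    return Counter(counts[threshold:])

# Hypothetical toy data: 5 articles, word -> number of articles containing it.
toy_df = {"a": 5, "the": 5, "oil": 2, "bank": 1, "chunk": 1}
counts = words_per_document_count(toy_df, 5)
print(counts)                   # [2, 1, 0, 0, 2]
print(tail_summary(counts, 2))  # Counter({0: 2, 2: 1})
```

In the toy data, the document counts above 2 are 3, 4 and 5; two of them have no words and one has two words, which is the same kind of summary as the one given for counts above 150.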