Assignment #5: The Collections API, Files, String Tokenizing, and String Processing

This assignment requires you to use several components of the Collections API to implement a spell-checker. The amount of code that you will write is not large.

Specifications

The name of the dictionary file will be dict.txt, unless a command line switch of -d file.txt is provided, in which case file.txt becomes the dictionary file. The command line will also contain the name of the file(s) to spell check. Any word that is not in the dictionary is considered to be misspelled. Output, in sorted order, each misspelled word and the line number(s) on which it occurs. If a word is misspelled more than once, it is listed once, but with several line numbers.
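
One possible way to handle the command line is sketched below. The class name SpellArgs and its fields are hypothetical, and the sketch assumes that -d may appear anywhere among the arguments and that every other argument is a file to check; the specification does not settle either point.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the assignment): holds the parsed command line.
class SpellArgs {
    String dictionaryFile = "dict.txt";                 // default named in the specification
    final List<String> inputFiles = new ArrayList<>();  // files to spell check, in order given

    static SpellArgs parse(String[] args) {
        SpellArgs result = new SpellArgs();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-d")) {
                // Assumption: a -d with no file name after it is treated as an error.
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("-d requires a file name");
                }
                result.dictionaryFile = args[++i];      // consume the file name too
            } else {
                result.inputFiles.add(args[i]);
            }
        }
        return result;
    }
}
```

With the arguments -d file.txt a.txt b.txt, this sketch would select file.txt as the dictionary and check a.txt and b.txt; with a.txt alone, the dictionary stays dict.txt.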

What's A Word?

For the purposes of this assignment, you will determine words as follows: The input is considered to be a sequence of tokens separated by whitespace. Any token that ends with a single period, question mark, comma, semicolon, or colon should have the punctuation removed. After doing this, any token that contains letters only is considered a word. Convert this word to lower case.

Example: For the following line
This is a test, one-half of four is 2.
The tokens are:
This
is
a
test,
one-half
of
four
is
2.
Among these, the words are:
this
is
a
test
of
four
is
The token "one-half" fails the rule of consisting entirely of letters, as does "2." (it still contains a digit once its period is removed). The token "This" is converted to lower case, and "test," has the punctuation at the end stripped.
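
The rules above can be sketched in Java as follows. The class and method names (WordRules, toWord, wordsOf) are hypothetical. The sketch reads "a single period, question mark, ..." as "strip at most one trailing punctuation character", which is an interpretation rather than something the rules spell out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper (not part of the assignment): turns tokens into words.
class WordRules {
    private static final String TRAILING = ".?,;:";

    // Returns the lower-cased word, or "" if the token is not a word.
    static String toWord(String token) {
        // Assumption: at most one trailing punctuation character is removed.
        if (!token.isEmpty() && TRAILING.indexOf(token.charAt(token.length() - 1)) >= 0) {
            token = token.substring(0, token.length() - 1);
        }
        if (token.isEmpty()) {
            return "";
        }
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isLetter(token.charAt(i))) {
                return "";                       // hyphens, digits, apostrophes, etc.
            }
        }
        return token.toLowerCase();
    }

    // Splits a line on whitespace and keeps only the tokens that are words.
    static List<String> wordsOf(String line) {
        List<String> words = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line);  // default delimiters are whitespace
        while (st.hasMoreTokens()) {
            String word = toWord(st.nextToken());
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }
}
```

Calling WordRules.wordsOf("This is a test, one-half of four is 2.") yields the list [this, is, a, test, of, four, is], matching the example above.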

These are the rules, even if I've missed a few cases (like apostrophes, etc.). The Character, String and StringBuffer classes have plenty of routines to help you out.

The Dictionary

The dictionary contains one word per line. A large dictionary (~800Kbytes) is available here. This dictionary was obtained from the Internet and may have inappropriate words. I apologize in advance if this is the case. I'll also provide a real data file containing a chapter from a textbook here.

The Algorithm

Read the dictionary file and store its contents in a Set. (You can decide if you want a TreeSet or HashSet when you do a new, but you should write everything in terms of Set to allow a quick change.) Then read the data input file, one line at a time. Break the line into tokens using a StringTokenizer object, and then write some functions to convert the tokens to words (or an empty string if it is not a word). Once you have a word, check to see if it is in the Set that stores the dictionary. If it is not, you will need to add it to a Map that stores the misspelled words and the line numbers on which they occur. (This implies that you know the current line number.) Once everything is read, you need to step through the map and print its contents in an orderly way.
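
A minimal sketch of this algorithm follows; it is one possible shape, not a required design. All names (SpellSketch, loadDictionary, check, format) are hypothetical. The sketch makes several assumptions the text above does not fix: dictionary entries are trimmed and lower-cased, a TreeMap is used so the words come out sorted, line numbers are kept in a TreeSet so a line is listed once even if the word repeats on it, and the output layout "word: n n ..." is invented for illustration. It takes Reader arguments so it can be tried on strings; a real program would wrap a FileReader for each file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch (not part of the assignment) of the steps described above.
class SpellSketch {
    // Reads one word per line into a Set; only this line names the concrete class.
    static Set<String> loadDictionary(Reader in) throws IOException {
        Set<String> dict = new HashSet<>();
        BufferedReader br = new BufferedReader(in);
        String line;
        while ((line = br.readLine()) != null) {
            String entry = line.trim().toLowerCase();   // assumption: normalize entries
            if (!entry.isEmpty()) {
                dict.add(entry);
            }
        }
        return dict;
    }

    // Token-to-word rule: strip one trailing . ? , ; or :, require letters only.
    static String toWord(String token) {
        if (!token.isEmpty() && ".?,;:".indexOf(token.charAt(token.length() - 1)) >= 0) {
            token = token.substring(0, token.length() - 1);
        }
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isLetter(token.charAt(i))) {
                return "";
            }
        }
        return token.toLowerCase();                     // "" stays "" if nothing is left
    }

    // Maps each misspelled word to the line numbers (starting at 1) where it occurs.
    static Map<String, Set<Integer>> check(Reader in, Set<String> dict) throws IOException {
        Map<String, Set<Integer>> misspelled = new TreeMap<>();  // sorted by word
        BufferedReader br = new BufferedReader(in);
        String line;
        int lineNumber = 0;
        while ((line = br.readLine()) != null) {
            lineNumber++;
            StringTokenizer st = new StringTokenizer(line);
            while (st.hasMoreTokens()) {
                String word = toWord(st.nextToken());
                if (!word.isEmpty() && !dict.contains(word)) {
                    misspelled.computeIfAbsent(word, k -> new TreeSet<>()).add(lineNumber);
                }
            }
        }
        return misspelled;
    }

    // Steps through the map in key order; the layout is an illustrative choice.
    static String format(Map<String, Set<Integer>> misspelled) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Set<Integer>> e : misspelled.entrySet()) {
            sb.append(e.getKey()).append(':');
            for (int n : e.getValue()) {
                sb.append(' ').append(n);
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

Because the code outside loadDictionary mentions only Set and Map, switching the dictionary from HashSet to TreeSet is a one-line change. How line numbers should be reported when several files are checked is left open here; this sketch handles one input at a time.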

What to Submit

Submit the usual stuff, and include the output from running with the provided dictionary and test file. Also, provide a smaller test case to verify that your -d option works and that you can handle several test files on the command line.