Assignment #5: The Collections API and Files and String Tokenizing and String Processing
This assignment requires you to use several components of the Collections API
to implement a spell-checker. The amount of code that you will write is
not large.
Specifications
The name of the dictionary file will be dict.txt, unless a command line
switch of -d file.txt is provided, in which case file.txt
becomes the dictionary file.
The command line will also contain the name of the file(s) to spell check.
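One possible way to handle the command line is sketched below. The class name SpellArgs and its fields are made up for illustration, and accepting -d anywhere on the command line (rather than only at the front) is an assumption, since the specification does not say.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for the parsed command line; not part of the assignment.
class SpellArgs {
    final String dictionaryFile;
    final List<String> inputFiles;

    SpellArgs(String dictionaryFile, List<String> inputFiles) {
        this.dictionaryFile = dictionaryFile;
        this.inputFiles = inputFiles;
    }

    static SpellArgs parse(String[] args) {
        String dict = "dict.txt";               // default dictionary name
        List<String> files = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-d")) {
                // Assumption: -d must be followed by a file name.
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("-d requires a file name");
                }
                dict = args[++i];               // consume the switch's argument
            } else {
                files.add(args[i]);             // everything else is a file to check
            }
        }
        return new SpellArgs(dict, files);
    }
}
```

With this sketch, the arguments "-d words.txt a.txt b.txt" would select words.txt as the dictionary and check a.txt and b.txt; with no -d switch, dict.txt is used.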
Any word that is not in the dictionary is considered to be misspelled. Output,
in sorted order, each misspelled word and the line number(s) on which it occurs. If
a word is misspelled more than once, it is listed once, but with several
line numbers.
What's A Word?
For the purposes of this assignment, you will determine words as follows:
The input is considered to be a sequence of tokens separated by
whitespace. Any token that ends with a single period, question mark, comma,
semicolon, or colon should have the punctuation removed. After doing this,
any token that contains letters only is considered a word. Convert this
word to lower case.
Example: For the following line
This is a test, one-half of four is 2.
The tokens are:
This
is
a
test,
one-half
of
four
is
2.
Among these, the words are:
this
is
a
test
of
four
is
one-half fails the rule of consisting entirely of letters,
as does 2 (even after its period is removed). This is converted to lower
case, and test has the punctuation at the end stripped.
These are the rules, even if I've missed a few cases (like apostrophes,
etc.).
The Character, String and StringBuffer classes
have plenty of routines to help you out.
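The word rules above could be coded roughly as follows. This is only a sketch: the names WordRules and toWord are made up, and it reads "ends with a single period..." as "strip at most one trailing punctuation character", so a token such as test.. is rejected rather than cleaned up.

```java
// Hypothetical helper class; the names are illustrative only.
class WordRules {
    // Returns the lower-case word for a token, or "" if the token is not a word.
    static String toWord(String token) {
        // Remove one trailing period, question mark, comma, semicolon, or colon.
        if (!token.isEmpty()
                && ".?,;:".indexOf(token.charAt(token.length() - 1)) >= 0) {
            token = token.substring(0, token.length() - 1);
        }
        if (token.isEmpty()) {
            return "";
        }
        // Whatever remains must consist of letters only.
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isLetter(token.charAt(i))) {
                return "";
            }
        }
        return token.toLowerCase();
    }
}
```

Applied to the example line, toWord maps This to this and test, to test, and returns an empty string for one-half and 2.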
The Dictionary
The dictionary contains one word per line. A large dictionary (~800Kbytes)
is available here.
This dictionary was obtained from the Internet and may have inappropriate
words. I apologize in advance if this is the case.
I'll also provide a real data file containing a chapter from a textbook
here.
The Algorithm
Read the dictionary file and store its contents in a Set.
(You can decide whether you want a TreeSet or a HashSet when you call new,
but you should write everything else in terms of Set to allow a quick change.)
Then read the data input file, one line at a time. Break the line into tokens
using a StringTokenizer object, and then write some functions to
convert the tokens to words (or an empty string if it is not a word). Once
you have a word, check to see if it is in the Set that
stores the dictionary. If it is not, you will need to add it to a Map
that stores the misspelled words and the line numbers on which they
occur. (This implies that you know the current line number.) Once everything
is read, you need to step through the map and print its contents in an
orderly way.
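The steps above might be sketched as follows for a single input source. Everything named here (SpellCheckSketch, readDictionary, findMisspelled, printReport, and the compact toWord helper) is hypothetical, and several choices are assumptions: a TreeMap is used so that the words come out sorted, a TreeSet of line numbers is used so that a line is reported only once per word, and the output format is invented.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.TreeSet;

class SpellCheckSketch {
    // Hypothetical helper: applies the word rules; returns "" for a non-word.
    static String toWord(String token) {
        if (!token.isEmpty()
                && ".?,;:".indexOf(token.charAt(token.length() - 1)) >= 0) {
            token = token.substring(0, token.length() - 1);
        }
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isLetter(token.charAt(i))) {
                return "";
            }
        }
        return token.toLowerCase();
    }

    // Reads one word per line into a Set; only this method names the concrete class.
    static Set<String> readDictionary(Reader source) throws IOException {
        Set<String> dictionary = new TreeSet<>();   // or: new HashSet<>()
        BufferedReader in = new BufferedReader(source);
        String line;
        while ((line = in.readLine()) != null) {
            String word = line.trim();
            if (!word.isEmpty()) {
                dictionary.add(word);
            }
        }
        return dictionary;
    }

    // Maps each misspelled word to the line numbers on which it occurs.
    static Map<String, Set<Integer>> findMisspelled(Set<String> dictionary,
                                                    Reader source) throws IOException {
        Map<String, Set<Integer>> misspelled = new TreeMap<>();  // sorted by word
        BufferedReader in = new BufferedReader(source);
        String line;
        int lineNumber = 0;
        while ((line = in.readLine()) != null) {
            lineNumber++;
            StringTokenizer tokens = new StringTokenizer(line);  // splits on whitespace
            while (tokens.hasMoreTokens()) {
                String word = toWord(tokens.nextToken());
                if (!word.isEmpty() && !dictionary.contains(word)) {
                    misspelled.computeIfAbsent(word, k -> new TreeSet<>())
                              .add(lineNumber);
                }
            }
        }
        return misspelled;
    }

    // Assumed output format: the word, a colon, then its line numbers.
    static void printReport(Map<String, Set<Integer>> misspelled) {
        for (Map.Entry<String, Set<Integer>> entry : misspelled.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

Because the dictionary is only ever referred to as a Set, switching between TreeSet and HashSet means changing the single new expression in readDictionary. How line numbers should be reported when several files are checked is not specified, so the sketch leaves that decision to the caller.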
What to Submit
Submit the usual stuff, and include the output from running with
the provided dictionary and test file.
Also, provide a smaller test case to verify that your -d option works
and that you can handle several test files on the command line.