Assignment #3: C Arrays, String, Files

Write a C program that accepts two files (on the command line) and outputs the longest common substring of the two files. In other words, the longest substring that is present in both files.

Your program should collapse all consecutive white-space into a single space. An example of correct output:
FILE1 contains: I love programming all day and all night.
FILE2 contains: If I have to write one more program, I will change majors.
OUTPUT: The longest common substring is 'e program'

Your program should be reasonably efficient, so it should be able to find the longest common substring in two books: "Moby Dick" and "War and Peace" which I will provide as data files. Note that these data files will be about 1,000,000 characters for Moby Dick and 3,000,000 characters for War and Peace. If your algorithm works, you will find that phrases such as

'could not sleep for a long time'
'covered his face with his hands'
'his whole attention was absorbed'
'it was impossible to prevent it'

are in both books. But they are not the longest such phrase.

Efficient Algorithm

You will need to use a suffix array. There are complicated algorithms to construct the suffix array in linear time, but also simpler algorithms that will work well enough for this input. You can use the simpler construction algorithm.

Begin, by forming a single string representing the input file, followed by some special character or characters that are not in the input, followed, by the second file. As an example, let us use two much smaller inputs:

INPUT #1: fashion
INPUT #2: crashes

Example for above:

String: fashion#crashes

Then compute all the possible suffixes of the resulting string

s
es
hes
shes
ashes
rashes
crashes
#crashes
n#crashes
on#crashes
ion#crashes
hion#crashes
shion#crashes
ashion#crashes
fashion#crashes

This is very easy to do in C: just store pointers into the middle of the string. DO NOT try to make copies; that will use quadratic space. Then sort the suffixes.

#crashes
ashes
ashion#crashes
crashes
es
fashion#crashes
hes
hion#crashes
ion#crashes
n#crashes
on#crashes
rashes
s
shes
shion#crashes

Sorting is also pretty easy to do; use qsort.

Now look at each adjacent pair of elements, in which one contains a # and one does not (you don't have to search for the # in your code; just check if the starting point of the string is before the # (pointer math again!!!)). For each such adjacent pair, compute the length of the common prefix; the longest such common prefix (in this case the prefix ash in ashes and ashion#crashes is your answer.