Assignment #5: Kevin Bacon Game

For a description of the Kevin Bacon game, follow this occasionally unreliable link or look at Exercise 14.23 in the text. Try the game a few times and see if you can find someone with a Bacon Number higher than 3. In this assignment, you will find all persons with a Bacon Number of 8 or higher. One of these persons is a former President of the United States.

Strategy

This is basically a shortest path problem. After that is done, find the large Bacon numbers by scanning the bacon numbers and print out the high-numbered actors and their chains. To print out a high-numbered actor, you should use recursion. Specifically, if some actor x has a Bacon number of b, then you know that they must have been in a movie with someone, call them actor y with a Bacon number of b-1. To print out x's chain, you would print out y's chain (recursively) and then the movie that x and y had in common.

The Input Files

There are two data files; both have identical formats. These files are: actors file and actresses file. These files are both compressed in .gz format, and were obtained from the Internet Movie Database. Combined, they are 52 Mbytes (compressed!) and were last updated October 17, 2002. These files are available in the AUL in the folder \\cougar\cop3530, and also /home/aul1/cop3530 on the Unix boxes. YOU MAY NOT COPY THEM INTO ANY OF THE LOCAL FIU MACHINES BECAUSE YOU WILL EXCEED YOUR DISK QUOTA.

These datafiles contain approximately 571,000 actors/actresses in a total of 192,000 movies, with 2,144,000 roles. These files also list TV roles, but you must not include TV roles in your analysis.

Before you run on the large data sets, use the small (uncompressed) file sample.list to debug the basic algorithms. In this data file, there are six actors, named a, b, c, d, e, and f, who have been in movies such as X, Y, and Z.

Input File Hints

Since it is not my input file, I cannot answer questions about it. Here are some observations that I used in my program, that should suffice. You can read the input file line by line by wrapping a FileInputStream inside a BufferedInputStream inside a GZIPInputStream inside an InputStreamReader inside a BufferedReader. You may not uncompress the file outside of your program.

There are over 200 lines of preamble that can be skipped. This varies from file to file. However, you can figure it out by skipping all lines until the first occurrence of a line that begins with "Name", and then skipping one more.
There are many postamble lines, too, starting with a line that has at least nine dashes (i.e. ---------).
A name is listed once; all roles are together; the name starts in the first column.
A movie title follows the optional name and a few tab stops ('\t'). There are some messed up entries that have spaces in addition to tab stops.
The year should be part of the movie title.
Movies made in 2003 or later should be skipped.
A TV movie, indicated by (TV) following the year, is to be skipped.
Archive material, indicated by (archive footage), is to be skipped. (Otherwise JFK is a movie star).
Cameo appearances, indicated by [Cameo appearance], should be skipped.
A TV series, indicated by a leading " in the title is to be skipped.
A video-only movie, indicated by (V) following the year is allowed.
Blank lines separate actors/actresses, and should be skipped.

Strategy

In order to compute your answers, you will need to store the data that you read. The main data structures are a Map in which each key is an Actor and each value is the corresponding list of movies that the actor has been in, and then a second Map, in which key is a movie and each value is the list of Actors in the movie (i.e. the cast). A movie is represented simply as a String that includes the year in which it was made, but an Actor includes both the name of the actor, and a data field that you can use to store computed information later on. Thus, ideally, you would like to define a class that looks somewhat like this (with routines to compute Bacon Numbers not listed):

public class Database
{
private static final class Actor
{
     String name;
     int    data; // Bacon number ,
                   // determined by computeBaconNumbers

public String toString( )
{ return name; }

public int hashCode( )
{ return name.hashCode( ); }

     public boolean equals( Object other )
       { return (other instanceof Actor) &&
                ( (Actor) other ).name.equals( name ); }
}

// Open fileName; update the maps
public void loadFile( String fileName ) throws IOException
{ ... }

public int getActorCount( )
{ ... }

public int getMovieCount( )
{ ... }

private Map actorsToMovies = new HashMap( );
private Map moviesToActors = new HashMap( );
}

Memory Details

The description above is pretty much what you have to do, except that you must take extra steps to avoid running out of memory.

First, you will need to increase the maximum size of the Java Virtual Machine from the default of 64Meg to 224Meg. You may not increase it any higher than that. If you are running the java interpreter from the command line, the magic option is -Xmx224m. If you are using an IDE, you will have to consult their documentation --- don't ask me.

Second, you will quickly run out of memory, because if you find two movies that are the same, but are on different input lines, the default setup will create two separate (yet equal) String objects and place them in the value lists of two different actors. Since there are 2.1 million roles, but only 192,000 movies, this means that you will have ten times as many String objects as you really need. What you need to do is to make sure that each movie title is represented by a single String object, and that the maps simply store references to that single String object. There are two basic alternatives:

The String class has a method call intern. If you invoke it, the return value on equal Strings will always reference the same internal String object.
You can keep a HashMap in which each key is a movie title, and each value is the same as the key. When you need a movie title, you use the value in the HashMap.

Option two is superior (performancewise) to option #1 and it is required that you use it to avoid memory problems.

When you maintain the list of movies for each actor, you will want to use an ArrayList. It takes little effort to ensure that the capacity in the ArrayList is not more than is needed, and you should do so, to avoid wasting space (since there are 571,000 such array lists).

When you construct the HashMaps, you can issue a parameter (the load factor). The higher the load factor, the less space you use (at the expense of a small increase in time). You should play around with that too; the default is 0.75; you will probably want a load factor of 2.00 or 3.00.

You must be very careful to avoid doing any more work in the inner loops than you need to, including creating excessive objects. IF YOU CREATE EXCESSIVE OBJECTS, YOUR PROGRAM MAY SLOW TO A CRAWL BECAUSE ALL ITS TIME WILL BE SPENT IN THE GARBAGE COLLECTOR OR RUN OUT OF MEMORY COMPLETELY.

What to Submit

Submit complete source code and the actors/actresses with Bacon Numbers of 8 or higher. Include the complete paths for each of the actors/actresses (with shared movie titles). Also indicate how long your algorithm takes in the AUL. This means how long it takes to load, and also how long it takes to run the shortest path computation (not including the output of the answer), by inserting calls to System.currentTimeMillis at appropriate points in your code, and tell me how many actors and movies there are. Also, insert this code (at the end of your program) that tells me how large the VM is:

        Runtime rt = Runtime.getRuntime( );        
        int vmsize = (int) rt.totalMemory( );
        
        System.out.println( "Virtual machine size: " + vmsize );

Don't forget to write in the processor speed of the AUL computer you are using. If it's somewhat fast, or provably space-efficient, you can get extra credit. You cannot receive credit for a working program if your program fails to produce a large set of actors and actresses with Bacon Numbers of 8 or higher. Note: the data you will work on and the data the Oracle uses (and the IMDB data) are all slightly out of sync. So you might not be able to exactly reproduce the results, although I was able to get a complete set of Bacon Numbers 8 and higher when I ran my program on October 23, using the Oct 17 files.

Due Date

This program is due on Thursday November 14.

Additional Notes

The exact sizes of the data files are 36381624 and 18263563 bytes, respectively. If you download and the files are larger, then you messed up the download. Note that if you are using Windows XP, these files might be uncompressed during downloading. If so, you can use this program to recompress the file. If the files are smaller, then the download probably got interrupted before it finished and you will need to retry. Here is a gzipped sample.list for you to test that aspect of your program.

Running on COTTON in the AUL, which is a 450 MHz Pentium III, and accessing the files over the network as "\\\\couger\\cop3530\\actors.list.gz", (and similarly for actresses.list.gz), and with no other processes running besides notepad, the data files loaded in 180 seconds. You should be able to get faster results on a faster machine. Some of the AUL machines are 1.6 GHz. You should indicate which AUL machine (with processor speed) ran your submission.

Entries with years such as (1996/I) are optional.

When writing and reading make sure you are using BufferedInputStream and BufferedOutputStream, as appropriate.

Sketch of the basic shortest path computation:

    
      // typically invoked with Kevin Bacon as parameter
      // This is the basic algorithm; there's stuff you
      // should do to avoid looking at the same cast twice.
    public void computeBaconNumbers( String centralActor )
    {
        for each actor, act,
            set act's Bacon number to INFINITY;
        set centralActor's Bacon Number to 0;

        enqueue( centralActor );
        
        while ( !queue.isEmpty( ) )
        {
            Actor current = (Actor) queue.removeFirst( );
            List currentMovies = (List) actorsToMovies.get( current );
            
            for each movie, thisMovie, in currentMovies
            {         
                List sharedActors =  (List) moviesToActors.get( thisMovie );
                
                for each Actor, otherActor, in sharedActors
                {               
                    if( otherActor's Bacon number is INFINITY )
                    {
                        set otherActor's Bacon number to current's Bacon number + 1;
                        enqueue( otherActor );
                    }
                }
            }
        }    
    }