Assignment #2: Movie Star Pairs

Your program should repeatedly prompt for the name of a movie star. For each movie star it will list the actor or actress who has been in the most movies with the movie star. If there is a tie, all persons in the tie are listed. For each actor in the listing shared movies are to be listed. For example, if the input actor is "Hanks, Tom", the output would look something like:
Reiner, Tracy (5 shared roles): 
	Apollo 13 (1995)
	Big (1988)
	League of Their Own, A (1992)
	Nothing in Common (1986)
	That Thing You Do! (1996)
For "Roberts, Julia" there is a tie, so output would look something like
D'Onofrio, Vincent (3 shared roles): 
	Dying Young (1991)
	Mystic Pizza (1988)
	Player, The (1992)

Roberts, Lisa (I) (3 shared roles): 
	I Love Trouble (1994)
	Runaway Bride (1999)
	Something to Talk About (1995)

The Input File

There are two data files; both have identical formats. These files are: actors file and actresses file. These files are both compressed in .gz format, and were obtained from the Internet Movie Database. Combined, they are 45.5 Mbytes (compressed!) and were last updated Jan 9, 2002. These files are available in the AUL in the folder \\cougar\cop3530, and also /home/aul1/cop3530 on the Unix boxes. YOU MAY NOT COPY THEM INTO ANY OF THE LOCAL FIU MACHINES BECAUSE YOU WILL EXCEED YOUR DISK QUOTA.

These datafiles contain approximately 590,000 actors/actresses in a total of 171,000 movies, with 1,900,000 roles. These files also list TV roles, but you must not include TV roles in your analysis.

Before you run on the large data sets, use the small (uncompressed) file sample.list to debug the basic algorithms. In this data file, there are six actors, named a, b, c, d, e, and f, who have been in movies such as X, Y, and Z.

Input File Hints

Since it is not my input file, I cannot answer questions about it. Here are some observations that I used in my program, that should suffice. You can read the input file line by line by wrapping a FileInputStream inside a BufferedInputStream inside a GZIPInputStream inside an InputStreamReader inside a BufferedReader. You may not uncompress the file outside of your program.
  1. There are over 200 lines of preamble that can be skipped. This varies from file to file. However, you can figure it out by skipping all lines until the first occurrence of a line that begins with "Name", and then skipping one more.
  2. There are many postamble lines, too, starting with a line that has at least nine dashes (i.e. ---------).
  3. A name is listed once; all roles are together; the name starts in the first column.
  4. A movie title follows the optional name and a few tab stops ('\t'). There are some messed up entries that have spaces in addition to tab stops.
  5. The year should be part of the movie title.
  6. Movies made in 2002 or later should be skipped.
  7. A TV movie, indicated by (TV) following the year, is to be skipped.
  8. Archive material, indicated by (archive footage), is to be skipped. (Otherwise JFK is a movie star).
  9. A TV series, indicated by a leading " in the title is to be skipped.
  10. A video-only movie, indicated by (V) following the year is allowed.
  11. Blank lines separate actors/actresses, and should be skipped.

Strategy

In order to compute your answers, you will need to store the data that you read. The main data structures are a Map in which each key is an Actor and each value is the corresponding list of movies that the actor has been in, and then a second Map, in which key is a movie and each value is the list of Actors in the movie (i.e. the cast). A movie is represented simply as a String that includes the year in which it was made, but an Actor includes both the name of the actor, and a data field that you can use to store computed information later on. Thus, ideally, you would like to define a class that looks somewhat like this:

public class Database
{
  private static final class Actor
  {
     String name;
     int    data;  // number of shared movies,
                   // determined by computeSharedMovies

     public String toString( )
       { return name; }

     public int hashCode( )
       { return name.hashCode( ); }

     public boolean equals( Object other )
       { return (other instanceof Actor) &&
                ( (Actor) other ).name.equals( name ); }
  }

    // Open fileName; update the maps
  public void loadFile( String fileName ) throws IOException
    { ... }

    // This can be called after the maps are built.
    // The method will place in each Actor's data field
    // an int representing the number of movies shared with actor parameter.
  public void computeSharedMovies( String actor )
    { ... }

    // This can be called after computeSharedMovies.
    // It scans through all actors, finding the largest
    // data field. In case of ties, all of the tied Actors
    // are returned in the List.
  public List mostSharedMovies( )
    { ... }

  public int getActorCount( )
    { ... }

  public int getMovieCount( )
    { ... }

  private Map actorsToMovies = new HashMap( );
  private Map moviesToActors = new HashMap( );
}
 

Memory Details

The description above is pretty much what you have to do, except that you must take extra steps to avoid running out of memory.

First, you will need to increase the maximum size of the Java Virtual Machine from the default of 64Meg to 192Meg. You may not increase it any higher than that. If you are running the java interpreter from the command line, the magic option is -Xmx192m. If you are using an IDE, you will have to consult their documentation --- don't ask me.

Second, you will quickly run out of memory, because if you find two movies that are the same, but are on different input lines, the default setup will create two separate (yet equal) String objects and place them in the value lists of two different actors. Since there are 1.9 million roles, but only 171,000 movies, this means that you will have ten times as many String objects as you really need. What you need to do is to make sure that each movie title is represented by a single String object, and that the maps simply store references to that single String object. There are two basic alternatives:

  1. The String class has a method call intern. If you invoke it, the return value on equal Strings will always reference the same internal String object.
  2. You can keep a HashMap in which each key is a movie title, and each value is the same as the key. When you need a movie title, you use the value in the HashMap.
One of these two ideas may be superior (performancewise) to the other, but both will avoid memory problems.

When you maintain the list of movies for each actor, you will want to use an ArrayList. It takes little effort to ensure that the capacity in the ArrayList is not more than is needed, and you should do so, to avoid wasting space (since there are 590,000 such array lists).

When you construct the HashMaps, you can issue a parameter (the load factor). The higher the load factor, the less space you use (at the expense of a small increase in time). You should play around with that too; the default is 0.75.

You must be very careful to avoid doing any more work in the inner loops than you need to, including creating excessive objects.

What To Submit

Submit complete source code and the result of running your code on the actors that I will list on the class page shortly. Also indicate how long your algorithm takes in the AUL by inserting calls to System.currentTimeMillis at the start and end of you code, and tell me how many actors and movies there are.

Due Date

This program is due on Tuesday, February 5.