Assignment #2: Movie Star Pairs
Your program should repeatedly prompt for the name of a movie star. For
each movie star it will list the actor or actress who has been in the most
movies with the movie star. If there is a tie, all persons in the tie are
listed. For each actor in the listing shared movies are to be listed.
For example, if the input actor is "Hanks, Tom", the output would
look something like:
Reiner, Tracy (5 shared roles):
Apollo 13 (1995)
Big (1988)
League of Their Own, A (1992)
Nothing in Common (1986)
That Thing You Do! (1996)
For "Roberts, Julia" there is a tie, so output would look
something like
D'Onofrio, Vincent (3 shared roles):
Dying Young (1991)
Mystic Pizza (1988)
Player, The (1992)
Roberts, Lisa (I) (3 shared roles):
I Love Trouble (1994)
Runaway Bride (1999)
Something to Talk About (1995)
The Input File
There are two data files; both have identical formats. These files are:
actors
file and actresses
file. These files are both compressed in .gz format, and were
obtained from the Internet
Movie Database. Combined, they are 45.5 Mbytes (compressed!) and were
last updated Jan 9, 2002. These files are available in the AUL in the folder
\\cougar\cop3530, and also /home/aul1/cop3530 on the
Unix boxes.
YOU MAY NOT COPY THEM INTO ANY OF THE LOCAL FIU MACHINES
BECAUSE YOU WILL EXCEED YOUR DISK QUOTA.
These datafiles contain approximately 590,000 actors/actresses in a
total of 171,000 movies, with 1,900,000 roles. These files also list TV
roles, but you must not include TV roles in your analysis.
Before you run on the large data sets, use the small (uncompressed)
file
sample.list
to debug the basic algorithms. In this data file, there are six actors,
named
a,
b,
c,
d,
e, and
f,
who have been in movies such as
X,
Y, and
Z.
Input File Hints
Since it is not my input file, I cannot answer questions about it. Here
are some observations that I used in my program, that should suffice. You
can read the input file line by line by wrapping a FileInputStream
inside a BufferedInputStream inside a GZIPInputStream
inside an InputStreamReader inside a BufferedReader.
You may not uncompress the file outside of your program.
-
There are over 200 lines of preamble that can be skipped. This varies from
file to file. However, you can figure it out by skipping all lines until
the first occurrence of a line that begins with "Name", and then
skipping one more.
-
There are many postamble lines, too, starting with a line that has at least
nine dashes (i.e. ---------).
-
A name is listed once; all roles are together; the name starts in the first
column.
-
A movie title follows the optional name and a few tab stops ('\t').
There are some messed up entries that have spaces in addition to tab stops.
-
The year should be part of the movie title.
-
Movies made in 2002 or later should be skipped.
-
A TV movie, indicated by (TV) following the year, is to be skipped.
-
Archive material, indicated by (archive footage), is to be skipped.
(Otherwise JFK is a movie star).
-
A TV series, indicated by a leading " in the title is to be skipped.
-
A video-only movie, indicated by (V) following the year is allowed.
-
Blank lines separate actors/actresses, and should be skipped.
Strategy
In order to compute your answers, you will need to store the data that
you read. The main data structures are a Map in which each key
is an Actor and each value is the corresponding list of movies
that the actor has been in, and then a second Map, in which key
is a movie and each value is the list of Actors in the movie (i.e.
the cast). A movie is represented simply as a String that includes
the year in which it was made, but an Actor includes both the
name of the actor, and a data field that you can use to store
computed information later on. Thus, ideally, you would like to define
a class that looks somewhat like this:
public class Database
{
private static final class Actor
{
String name;
int data; // number
of shared movies,
// determined by computeSharedMovies
public String toString( )
{ return name; }
public int hashCode( )
{ return name.hashCode( );
}
public boolean equals( Object other )
{ return (other instanceof
Actor) &&
( (Actor) other ).name.equals( name ); }
}
// Open fileName; update the maps
public void loadFile( String fileName ) throws IOException
{ ... }
// This can be called after the maps are built.
// The method will place in each Actor's data
field
// an int representing the number of movies
shared with actor parameter.
public void computeSharedMovies( String actor )
{ ... }
// This can be called after computeSharedMovies.
// It scans through all actors, finding the
largest
// data field. In case of ties, all of the tied
Actors
// are returned in the List.
public List mostSharedMovies( )
{ ... }
public int getActorCount( )
{ ... }
public int getMovieCount( )
{ ... }
private Map actorsToMovies = new HashMap( );
private Map moviesToActors = new HashMap( );
}
Memory Details
The description above is pretty much what you have to do, except that you
must take extra steps to avoid running out of memory.
First, you will need to increase the maximum size of the Java Virtual
Machine from the default of 64Meg to 192Meg. You may not increase it any
higher than that. If you are running the java interpreter from the command
line, the magic option is -Xmx192m. If you are using an IDE, you
will have to consult their documentation --- don't ask me.
Second, you will quickly run out of memory, because if you find two
movies that are the same, but are on different input lines, the default
setup will create two separate (yet equal) String objects
and place them in the value lists of two different actors. Since there
are 1.9 million roles, but only 171,000 movies, this means that you will
have ten times as many String objects as you really need. What
you need to do is to make sure that each movie title is represented by
a single String object, and that the maps simply store references
to that single String object. There are two basic alternatives:
-
The String class has a method call intern. If you invoke
it, the return value on equal Strings will always reference the
same internal String object.
-
You can keep a HashMap in which each key is a movie title, and
each value is the same as the key. When you need a movie title, you use
the value in the HashMap.
One of these two ideas may be superior (performancewise) to the other,
but both will avoid memory problems.
When you maintain the list of movies for each actor, you will want to
use an ArrayList. It takes little effort to ensure that the capacity
in the ArrayList is not more than is needed, and you should do
so, to avoid wasting space (since there are 590,000 such array lists).
When you construct the HashMaps, you can issue a parameter
(the load factor). The higher the load factor, the less space you use (at
the expense of a small increase in time). You should play around with that
too; the default is 0.75.
You must be very careful to avoid doing any more work in the inner loops
than you need to, including creating excessive objects.
What To Submit
Submit complete source code and the result of running your code on the
actors that I will list on the class page shortly. Also indicate how long
your algorithm takes
in the AUL by inserting calls to System.currentTimeMillis
at the start and end of you code, and tell me how many actors and movies
there are.
Due Date
This program is due on Tuesday, February 5.