Kevin Bacon Details (Tuesday's Day Lecture)
Here is some of what I discussed on Tuesday. It looks like I will not be
able to cover this in detail in the night class.
LAST UPDATE: Oct 10, 11:02 PM (What about (????) for years)
How Much Space is Used In The Basic Algorithm On The Web Page
Let's ignore the local variables in various routines and concentrate on
the two main structures. We have
map<string,vector<string> > actorsToMovies;
// movies for each actor
map<string,vector<string> > moviesToActors;
// actors for each movie
We need space to store each key in the map, and then each value. Let
us assume that there is no additional space requirement for the map. (This
is an incorrect assumption. For instance, if this was a list, there would
be extra space needed for the next and previous nodes.) Combined, the
two maps store roughly 566,000 keys (422,000 actors, 144,000 movies). The
values are the 566,000 vectors. The vectors themselves will take space,
but let's only count the total amount of space to represent the strings
that are in those vectors. In the first map, every movie that an actor
is in adds one to the total sizes of all the vectors. This is another way
of saying that the total size of all the vectors in the actorsToMovies
value is 1,550,000, or the number of roles. The same is true for the moviesToActors
vector. So there are 3,100,000 strings in all the vectors. Unfortunately,
it turns out that in a vector, as we discussed a few weeks ago, excess
space (for the additional capacity) is also used. This excess is, on average,
33%. So there are an additional 1,000,000 empty strings.
All totalled, we have 566,000 strings as keys, 3,100,000 strings as
values, and 1,000,000 empty strings. So the question is: how much space
does a string take? It turns out that each string object requires:
-
a char * varable that points to memory where the characters are stored
(4 bytes)
-
the real length of the string (4 bytes)
-
the capacity of the buffer that holds the characters (4 bytes)
-
an "allocator" function object (4 bytes)
Thus, the following code fragment produces 16 (try it out):
string x = "";
cout << sizeof( x ) << endl;
In addition, we need memory for the string contents themselves! There,
it is not so gloomy. Standard library implementations use reference
sharing, so we only need extra (character array) space for the first
copy of a string; additional copies are shared, as long as the string class
can detect this. As an example:
string x = "hello";
string y = x;
string z = x; // or y
cout << sizeof( x ) << endl;
// 16, plus there is an extra 5 bytes allocated somewhere
cout << sizeof( y ) << endl;
// 16, but it doesn't need the extra 5 bytes; it is shared
cout << sizeof( z ) << endl;
// 16, but it doesn't need the extra 5 bytes; it is shared
Since the 566,000 strings have to be stored somewhere, no matter, what
we do, the memory we are talking about for the maps really turns out to
be all of the strings above times 16 bytes each. So 4,666,000 * 16 bytes
is 75 Meg. This is not good news, especially since we are conveniently
ignoring excess space that the map needs, not to mention space for the
baconNumber table and the links table! The vast majority
of this memory is in the value component of the map.
Saving Some Space
To save space, we will give each actor a number (sequentially starting
at either 0 or 1), and then also give each movie a number (also sequentially,
starting at either 0 or 1). By doing so, we change the maps:
map<string,vector<int> > actorsToMovies;
// movies for each actor
map<string,vector<int> > moviesToActors;
// actors for each movie
Now the 4,100,000 strings that were values become 4,100,000 ints. Since
ints are 4 bytes instead of 16, we save 12 bytes per entry, or about 48
Meg total.
But we now need to know four things:
-
Given an actor, what's its internal number?
-
Given a movie, what's its internal number?
-
Given an internal number, what actor is that?
-
Given an internal number, what movie is that?
So we need four tables. The first two are map<string,int>,
which we can call actorToNum and movieToNum. Of course,
this adds space: 566,000 strings and 566,000 ints, for a total of 11.3
Meg (not counting node excesses). The next two could also be maps, but
since the internal numbers are assigned sequentially, we can use a vector
(and push_back). We will call those numToActor to numToMovie.
This is 566,000 more strings, or another 9.0 Meg. (Note, we can avoid the
33% excess slack by using reserve on these vectors before we start, and
providing the vector with an estimate of how large we think the capacity
should be). So we've just added 20 Meg, but since we saved 48 Meg, we are
still 28 Meg ahead of the game.
Starting the internal numbers at 0 seems convenient, but for movies,
you will need to decide if the movie already has an internal number or
not. If it has an internal number you can just use it. If not, you have
to give it an internal number, update the movieToNum map, and
then do a push_back in the numToMovie vector. The easiest
way to decide if a movie, mov, already has an internal number is to check
the value of movieToNum[mov]. But if it is not there, we know
that the map's operator[] will add it to the map, with a value
of 0, and return the reference to the zero value. So it might be better
to start internal numbers at 1, because then you know that if a value of
0 comes back from the map, you have a new entry.
In order to avoid too many strings, you should be careful to make sure
that whenever you add a string to a vecotr/map/etc whose value is already
known, that you use a copy of an existing string, so that extra character
arrays are not created. I think it would be difficult for you to have that
problem, because the main place where you will have strings is in the four
tables above, and additions to tables are done in tandem, but you should
double-check your code anyway.
(I forgot to mention this part)
For further memory savings, change your
map<string,vector<int> >
that you are using for actorsToMovies and moviesToActors
to a vector<vector<int> >, using
the same logic I just described
(i.e. that maps with keys that are sequential ints can be replaced
with vectors).
This directly saves 16*566,000 bytes (roughly 9Meg),
since there are no keys, plus the
excess cost of nodes in a map, which is probably another 9Meg.
When you declare this vector, it starts out as a vector of vectors,
but it is a vector of length 0.
To make it bigger, when you see a new actor, you'll need something like
actorsToMovies.push_back( vector<int>( ) );
which makes the vector of vectors one larger, and when you have a new role
actorsToMovies[ internalActorNum ].push_back( internalMovieNum );
Time Issues
You need to be extremely careful about making unneeded copies. This may
mean using reference variables (as illustrated in the pseduocode for computeNumbers).
If you copy vectors, maps, or anything needlessly, you may completely trash
your algorithms, by adding factors of 10000s to your running time. Additionally,
you will probably want to change the baconNumber and link
tables from maps that have string as their keys to simple vectors that
use the internal actor number as the key. This will save significant additional
memory, and will also make computeNumber signficantly faster,
because it will replace expensive O(log N) map searches with
O( 1 ) vector accesses.
In your deepest code, make sure you are not recomputing the
same information when it is possible to save a result in a temporary
variable or reference.
And finally, DON'T FORGET THE OPTIMIZER.
FAQS
Isn't computeNumber quadratic? There are two loops!
The answer is no. If you are very careful with the marking, then the innermost
of the two loops is only executed roughly ROLES times, rather than
ACTORS * MOVIES. So it should be fast. Note however, that
in the pseduocode, the inner most loop involves accessing a map, and each
map operation is logarithmic, so the running time is a little higher than
it has to be. If you do the assignment correctly, computeNumber will execute
extremely quickly. The main effort is reading the file.
Can I use the STL queue instead of list?
It is not likely to matter. If you use queue, you are responsible
for understanding how to code it. See page 249; the valid container that
you can use to instantiate are list and deque.
Can You Give An Example Of a High Bacon Number?
No; that would ruin the fun! But here is the distribution I am getting.
0 1
1 1489
2 107073
3 244723
4 54070
5 3120
6 384
7 72
8 9
At least one of my 8s is listed by the Oracle as a 6 because the Oracle
has an old database that counts Tara (I) and Tara (II) as the same person.
I am using a rigid format for years: only four digit years in parentheses count.
You will get slightly different answers if entries containing
substrings such as (1967/I) are counted.
What About Years With ????
Although you do not have to handle unknown years,
you will get about 10 extra 7s if you do.
An unknown year is represented as (????).
Can I Use Borland's Compiler
An older version of the compiler
starts vectors with capacity of 256. Given that there are at least 1,000,000
vectors, it is safe to assume you would be in trouble unless
you were able to convince the library to start with smaller capacities.
However, I do not know if the newer compiler's have similar behavior.
Any Chance of An Extension?
Not likely. Program 4 asks you to further speed up this program, so if
this doesn't work, you'll be in trouble for the next program too. In that
case, I would like this program to be due prior to the drop date, so those
in trouble do not have to stay in the course and get a poor grade. Additionally,
there are still two weeks left to do the assignment, and many who have
already started are close to getting answers.