Kevin Bacon Details (Tuesday's Day Lecture)

Here is some of what I discussed on Tuesday. It looks like I will not be able to cover this in detail in the night class.
LAST UPDATE: Oct 10, 11:02 PM (What about (????) for years)

How Much Space is Used In The Basic Algorithm On The Web Page

Let's ignore the local variables in various routines and concentrate on the two main structures. We have
    map<string,vector<string> > actorsToMovies;     // movies for each actor
    map<string,vector<string> > moviesToActors;     // actors for each movie

We need space to store each key in the map, and then each value. Let us assume that there is no additional space requirement for the map. (This is an incorrect assumption. For instance, if this was a list, there would be extra space needed for the next and previous nodes.) Combined, the two maps store roughly 566,000 keys (422,000 actors, 144,000 movies). The values are the 566,000 vectors. The vectors themselves will take space, but let's only count the total amount of space to represent the strings that are in those vectors. In the first map, every movie that an actor is in adds one to the total sizes of all the vectors. This is another way of saying that the total size of all the vectors in the actorsToMovies value is 1,550,000, or the number of roles. The same is true for the moviesToActors vector. So there are 3,100,000 strings in all the vectors. Unfortunately, it turns out that in a vector, as we discussed a few weeks ago, excess space (for the additional capacity) is also used. This excess is, on average, 33%. So there are an additional 1,000,000 empty strings.

All totalled, we have 566,000 strings as keys, 3,100,000 strings as values, and 1,000,000 empty strings. So the question is: how much space does a string take? It turns out that each string object requires:

Thus, the following code fragment produces 16 (try it out):
    string x = "";
    cout << sizeof( x ) << endl;
In addition, we need memory for the string contents themselves! There, it is not so gloomy. Standard library implementations use reference sharing, so we only need extra (character array) space for the first copy of a string; additional copies are shared, as long as the string class can detect this. As an example:
    string x = "hello";
    string y = x;
    string z = x;    // or y
    cout << sizeof( x ) << endl;   // 16, plus there is an extra 5 bytes allocated somewhere
    cout << sizeof( y ) << endl;   // 16, but it doesn't need the extra 5 bytes; it is shared
    cout << sizeof( z ) << endl;   // 16, but it doesn't need the extra 5 bytes; it is shared
Since the 566,000 strings have to be stored somewhere, no matter, what we do, the memory we are talking about for the maps really turns out to be all of the strings above times 16 bytes each. So 4,666,000 * 16 bytes is 75 Meg. This is not good news, especially since we are conveniently ignoring excess space that the map needs, not to mention space for the baconNumber table and the links table! The vast majority of this memory is in the value component of the map.

Saving Some Space

To save space, we will give each actor a number (sequentially starting at either 0 or 1), and then also give each movie a number (also sequentially, starting at either 0 or 1). By doing so, we change the maps:
    map<string,vector<int> > actorsToMovies;     // movies for each actor
    map<string,vector<int> > moviesToActors;     // actors for each movie
Now the 4,100,000 strings that were values become 4,100,000 ints. Since ints are 4 bytes instead of 16, we save 12 bytes per entry, or about 48 Meg total.

But we now need to know four things:

  1. Given an actor, what's its internal number?
  2. Given a movie, what's its internal number?
  3. Given an internal number, what actor is that?
  4. Given an internal number, what movie is that?
So we need four tables. The first two are map<string,int>, which we can call actorToNum and movieToNum. Of course, this adds space: 566,000 strings and 566,000 ints, for a total of 11.3 Meg (not counting node excesses). The next two could also be maps, but since the internal numbers are assigned sequentially, we can use a vector (and push_back). We will call those numToActor to numToMovie. This is 566,000 more strings, or another 9.0 Meg. (Note, we can avoid the 33% excess slack by using reserve on these vectors before we start, and providing the vector with an estimate of how large we think the capacity should be). So we've just added 20 Meg, but since we saved 48 Meg, we are still 28 Meg ahead of the game.

Starting the internal numbers at 0 seems convenient, but for movies, you will need to decide if the movie already has an internal number or not. If it has an internal number you can just use it. If not, you have to give it an internal number, update the movieToNum map, and then do a push_back in the numToMovie vector. The easiest way to decide if a movie, mov, already has an internal number is to check the value of movieToNum[mov]. But if it is not there, we know that the map's operator[] will add it to the map, with a value of 0, and return the reference to the zero value. So it might be better to start internal numbers at 1, because then you know that if a value of 0 comes back from the map, you have a new entry.

In order to avoid too many strings, you should be careful to make sure that whenever you add a string to a vecotr/map/etc whose value is already known, that you use a copy of an existing string, so that extra character arrays are not created. I think it would be difficult for you to have that problem, because the main place where you will have strings is in the four tables above, and additions to tables are done in tandem, but you should double-check your code anyway.

(I forgot to mention this part) For further memory savings, change your map<string,vector<int> > that you are using for actorsToMovies and moviesToActors to a vector<vector<int> >, using the same logic I just described (i.e. that maps with keys that are sequential ints can be replaced with vectors). This directly saves 16*566,000 bytes (roughly 9Meg), since there are no keys, plus the excess cost of nodes in a map, which is probably another 9Meg. When you declare this vector, it starts out as a vector of vectors, but it is a vector of length 0. To make it bigger, when you see a new actor, you'll need something like

    actorsToMovies.push_back( vector<int>( ) );
which makes the vector of vectors one larger, and when you have a new role
    actorsToMovies[ internalActorNum ].push_back( internalMovieNum );

Time Issues

You need to be extremely careful about making unneeded copies. This may mean using reference variables (as illustrated in the pseduocode for computeNumbers). If you copy vectors, maps, or anything needlessly, you may completely trash your algorithms, by adding factors of 10000s to your running time. Additionally, you will probably want to change the baconNumber and link tables from maps that have string as their keys to simple vectors that use the internal actor number as the key. This will save significant additional memory, and will also make computeNumber signficantly faster, because it will replace expensive O(log N) map searches with O( 1 ) vector accesses. In your deepest code, make sure you are not recomputing the same information when it is possible to save a result in a temporary variable or reference. And finally, DON'T FORGET THE OPTIMIZER.

FAQS

Isn't computeNumber quadratic? There are two loops!

The answer is no. If you are very careful with the marking, then the innermost of the two loops is only executed roughly ROLES times, rather than ACTORS * MOVIES. So it should be fast. Note however, that in the pseduocode, the inner most loop involves accessing a map, and each map operation is logarithmic, so the running time is a little higher than it has to be. If you do the assignment correctly, computeNumber will execute extremely quickly. The main effort is reading the file.

Can I use the STL queue instead of list?

It is not likely to matter. If you use queue, you are responsible for understanding how to code it. See page 249; the valid container that you can use to instantiate are list and deque.

Can You Give An Example Of a High Bacon Number?

No; that would ruin the fun! But here is the distribution I am getting.
0        1
1     1489
2   107073
3   244723
4    54070
5     3120
6      384
7       72
8        9
At least one of my 8s is listed by the Oracle as a 6 because the Oracle has an old database that counts Tara (I) and Tara (II) as the same person. I am using a rigid format for years: only four digit years in parentheses count. You will get slightly different answers if entries containing substrings such as (1967/I) are counted.

What About Years With ????

Although you do not have to handle unknown years, you will get about 10 extra 7s if you do. An unknown year is represented as (????).

Can I Use Borland's Compiler

An older version of the compiler starts vectors with capacity of 256. Given that there are at least 1,000,000 vectors, it is safe to assume you would be in trouble unless you were able to convince the library to start with smaller capacities. However, I do not know if the newer compiler's have similar behavior.

Any Chance of An Extension?

Not likely. Program 4 asks you to further speed up this program, so if this doesn't work, you'll be in trouble for the next program too. In that case, I would like this program to be due prior to the drop date, so those in trouble do not have to stay in the course and get a poor grade. Additionally, there are still two weeks left to do the assignment, and many who have already started are close to getting answers.