Data Science of Visiting Famous Movie Locations in San Francisco

Using the Google Places API and the IMDb API, we selected the movie locations in the Golden City that every movie fan should visit while in town, and optimized the sightseeing route by solving the travelling salesman problem.

By Juraj Kapasny, Knoyd.

Let data science show you the optimal route through famous San Francisco movie locations

In this blog post we take a look at movie locations in San Francisco. Using the Google Places API and the IMDb API, we selected the places in the Golden City that every movie fan should visit while they are in town.

San Francisco movie sites

The original dataset was downloaded from the SF OpenData site, which provides many datasets about San Francisco. Apart from the movie locations already mentioned, it offers, for example, information on all exhibitions hosted by the San Francisco airport, the Mobile Food Facility Permits, the Aircraft Noise Complaint Data, and the Air Traffic Passenger Statistics.

Our base dataset included the following columns:

  • Title (name of the movie)
  • Release Year
  • Locations (identification of the location)
  • Fun Fact (if available)
  • Production Company
  • Distributor
  • Director
  • Writer
  • Actor 1 (the protagonist of the movie)
  • Actor 2 (the secondary protagonist of the movie, if available)
  • Actor 3 (the tertiary protagonist of the movie, if available)

A location feature, which uniquely identifies the place, was included. However, the longitude and latitude were missing, so we could not plot the locations on a map right away. We found the geographic coordinates for all the places using the Google Places API and plotted them on a map using the Python library gmplot.
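
The lookup step can be sketched roughly as follows. This is a minimal illustration, not the code used for the post: it assumes the Places API Text Search endpoint (the post does not say which endpoint was used), the helper names `build_places_url` and `extract_lat_lng` are hypothetical, and the sample payload is hand-written in the shape the API returns rather than real output.

```python
import json
from urllib.parse import urlencode

# Assumption: the Places API "Text Search" endpoint; the post does not name one.
PLACES_URL = "https://maps.googleapis.com/maps/api/place/textsearch/json"


def build_places_url(location_name, api_key, city="San Francisco, CA"):
    """Build a Text Search request URL for one location string from the dataset."""
    query = f"{location_name}, {city}"
    return f"{PLACES_URL}?{urlencode({'query': query, 'key': api_key})}"


def extract_lat_lng(response_text):
    """Return (lat, lng) of the first result, or None if nothing was found."""
    payload = json.loads(response_text)
    results = payload.get("results", [])
    if payload.get("status") != "OK" or not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Hand-written sample payload with illustrative coordinates (not real API output).
sample = json.dumps({
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 37.7793, "lng": -122.4193}}}],
})

print(build_places_url("City Hall", "YOUR_API_KEY"))
print(extract_lat_lng(sample))
```

Locations for which the lookup returns None would need to be checked by hand. The resulting latitude and longitude lists are the kind of input gmplot expects for drawing markers on a map.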

Next, we focused only on the places where the more famous movies were shot. To determine these movies, we used the IMDb API, through which we downloaded all information about the movies, including the average rating and the overall number of votes. The top movies shot in San Francisco with respect to the average rating were:

Movie              Rating   No. of votes
Forrest Gump       8.8      1,234,615
Sense8             8.4      63,164
All About Eve      8.3      82,126
Looking            8.3      10,696
I Remember Mama    8.3      3,857

On the other hand, the movies with the most votes on IMDb were:

Movie                                 Rating   No. of votes
Forrest Gump                          8.8      1,234,615
Indiana Jones and the Last Crusade    8.3      509,609
Dawn of the Planet of the Apes        7.6      313,938
Ant-Man                               7.4      301,246
Godzilla                              6.5      299,385

Using the combination of these ratings and votes, we selected the top 7 movies: Forrest Gump, Indiana Jones and the Last Crusade, Dawn of the Planet of the Apes, Ant-Man, The Game, Godzilla, and The Graduate. These movies are associated with 36 movie locations across San Francisco.
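
The post does not spell out how ratings and votes were combined, so the following is only one plausible sketch: keep titles above a vote threshold and sort the rest by rating. The threshold `MIN_VOTES` and the function names are hypothetical, and the data contains only the titles listed in the two tables above (The Game and The Graduate are missing because their figures are not given).

```python
# (title, average rating, number of votes), copied from the two tables above.
movies = [
    ("Forrest Gump", 8.8, 1_234_615),
    ("Sense8", 8.4, 63_164),
    ("All About Eve", 8.3, 82_126),
    ("Looking", 8.3, 10_696),
    ("I Remember Mama", 8.3, 3_857),
    ("Indiana Jones and the Last Crusade", 8.3, 509_609),
    ("Dawn of the Planet of the Apes", 7.6, 313_938),
    ("Ant-Man", 7.4, 301_246),
    ("Godzilla", 6.5, 299_385),
]


def top_by_votes(rows, n):
    """Return the n titles with the most votes, most voted first."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:n]


def popular_and_well_rated(rows, min_votes):
    """Keep titles with at least min_votes votes, best rated first."""
    kept = [row for row in rows if row[2] >= min_votes]
    return sorted(kept, key=lambda row: (row[1], row[2]), reverse=True)


MIN_VOTES = 100_000  # hypothetical threshold, not taken from the post

for title, rating, votes in popular_and_well_rated(movies, MIN_VOTES):
    print(f"{title}: {rating} ({votes:,} votes)")
```

A threshold like this filters out highly rated titles with few votes, which is the effect a combined criterion is meant to have.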

To run a graph analysis on these locations, we needed a way to define the edges of the graph. For this, we used a Google API that computes driving, cycling, or walking distances between any two geographical locations. These values were used as edge weights between each pair of locations (the nodes, or vertices). We built the graph from the places in the top 7 movies, using the cycling and driving distances between them. Furthermore, each location was connected only to its 2 closest places. The logic behind this is to avoid paths with long travel times.
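
The "2 closest places" rule can be sketched in plain Python as below. The place labels and travel values are made up for illustration, and `nearest_neighbour_graph` is a hypothetical helper, not the code behind the post.

```python
def nearest_neighbour_graph(travel, k=2):
    """Link every node to its k closest nodes in an undirected weighted graph.

    travel: dict mapping node -> {other_node: distance or travel time}.
    Returns a dict mapping node -> {neighbour: weight}.
    """
    graph = {node: {} for node in travel}
    for node, others in travel.items():
        # Sort the candidate neighbours by weight and keep the k smallest.
        closest = sorted((w, other) for other, w in others.items() if other != node)[:k]
        for weight, other in closest:
            graph[node][other] = weight
            graph.setdefault(other, {})[node] = weight  # keep the graph undirected
    return graph


# Toy values (hypothetical) between four placeholder locations.
travel = {
    "A": {"B": 4, "C": 9, "D": 15},
    "B": {"A": 4, "C": 6, "D": 12},
    "C": {"A": 9, "B": 6, "D": 5},
    "D": {"A": 15, "B": 12, "C": 5},
}

graph = nearest_neighbour_graph(travel, k=2)
for node, neighbours in graph.items():
    print(node, neighbours)
```

Because edges are stored in both directions, a location can end up with more than 2 edges when other locations count it among their closest places.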

Below you can see the visualization and simple analysis of the graph. Edges of this graph were created from the cycling distances between locations.

Visualization of the graph

As the next step in analyzing the movie locations, we looked into the betweenness of each location. The betweenness of a node is the number of shortest paths between all pairs of vertices (in our case the locations) that pass through that node. A location with high betweenness has a large influence on the transfer of people through the network of the top movie locations, under the assumption that people always look for the shortest path. We compared the locations with the highest betweenness using edges based on driving as well as cycling distances, and came up with the following results:

Places with the highest betweenness when cycling:

  1. Bank of America Building (555 California Street)
  2. 301 Howard Street
  3. Embarcadero & Washington
  4. Mission & Beal
  5. Bay Bridge

Places with the highest betweenness when driving:

  1. Bank of America Building (555 California Street)
  2. Washington Street & Waverly Place (Chinatown)
  3. City Club (155 Sansome Street)
  4. Embarcadero & Washington
  5. Bay Bridge

We can see that two places differ when using driving distances: 301 Howard Street and Mission & Beal were replaced with Washington Street & Waverly Place (Chinatown) and City Club (155 Sansome Street), respectively. This suggests that movie fans are more likely to pass through San Francisco's Chinatown if they travel around by car.
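
For readers who want to reproduce a ranking like this, betweenness on a weighted graph can be computed with Brandes' algorithm. The sketch below is a generic standard-library version, not the code used for the post. It uses the common definition in which a node is credited with the fraction of shortest paths passing through it when several shortest paths tie, and it assumes positive edge weights and string node labels.

```python
import heapq


def betweenness(graph):
    """Betweenness for an undirected graph given as node -> {neighbour: positive weight}."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        dist = {source: 0}               # best known distance from source
        sigma = {v: 0 for v in graph}    # number of shortest paths from source
        preds = {v: [] for v in graph}   # predecessors on shortest paths
        sigma[source] = 1
        done, order = set(), []
        heap = [(0, source)]
        while heap:                      # Dijkstra with lazy deletion
            d, v = heapq.heappop(heap)
            if v in done:
                continue
            done.add(v)
            order.append(v)
            for w, weight in graph[v].items():
                nd = d + weight
                if w not in dist or nd < dist[w]:
                    dist[w], sigma[w], preds[w] = nd, sigma[v], [v]
                    heapq.heappush(heap, (nd, w))
                elif nd == dist[w] and w not in done:
                    sigma[w] += sigma[v]     # another equally short path
                    preds[w].append(v)
        # Walk back from the farthest nodes and accumulate dependencies.
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted from both ends, so halve the totals.
    return {v: s / 2 for v, s in score.items()}


# Toy chain of three placeholder locations: everything passes through "B".
chain = {"A": {"B": 1}, "B": {"A": 1, "C": 1}, "C": {"B": 1}}
print(betweenness(chain))
```

Running the same function on the cycling graph and on the driving graph, then sorting the scores, would give two rankings of the kind compared above.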

Finally, we looked into the Traveling Salesman Problem (TSP) and applied it to our dataset. The TSP is the optimization problem of finding the shortest possible route that visits each of a given set of places. Using a random start and an iterative algorithm, we came up with a single route that any movie fan should take if they want to visit all of the interesting places from famous movies.
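
The post does not name the iterative algorithm, so the sketch below substitutes one common choice, 2-opt: start from a random order and keep reversing segments of the route as long as that shortens it. It treats the route as an open path with distinct start and end points. The coordinates are made up, and 2-opt is a heuristic, so the result is a locally shortest route rather than a proven optimum.

```python
import math
import random


def route_length(route, dist):
    """Total length of an open route (no return to the starting point)."""
    return sum(dist[a][b] for a, b in zip(route, route[1:]))


def two_opt(places, dist, seed=0):
    """Random initial order, then segment reversals until none shortens the route."""
    rng = random.Random(seed)
    route = list(places)
    rng.shuffle(route)
    best = route_length(route, dist)
    improved = True
    while improved:
        improved = False
        for i in range(len(route) - 1):
            for j in range(i + 1, len(route)):
                # Reverse the segment route[i..j] and keep it if the route gets shorter.
                candidate = route[:i] + route[i:j + 1][::-1] + route[j + 1:]
                length = route_length(candidate, dist)
                if length < best - 1e-12:
                    route, best = candidate, length
                    improved = True
    return route, best


# Toy coordinates for five placeholder locations (not real data).
coords = {"P1": (0, 0), "P2": (3, 0), "P3": (3, 4), "P4": (0, 4), "P5": (1, 2)}
dist = {a: {b: math.dist(coords[a], coords[b]) for b in coords} for a in coords}

route, length = two_opt(list(coords), dist)
print(route, round(length, 2))
```

With real data, `dist` would hold the travel values between the 36 locations, and running the search from several seeds and keeping the shortest result is a cheap way to reduce the risk of a poor local optimum.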

You can see a visualization of the optimal route below. Google supports only 10 locations per route, so we created four layers. Within each layer the labels run from A to J, and the J of one layer coincides with the A of the next. The beginning of each layer is marked with a number for better readability.

Optimal route

This is the optimal route itinerary, starting at Harrison Street (The Embarcadero) and ending at Mission & Beal:

  • 0(A) - Harrison Street - The Embarcadero (The Game)
  • 1(B) - Mission & Fremont St (Godzilla)
  • 2(C) - 301 Howard Street (The Game)
  • 3(D) - Bay Bridge (The Graduate)
  • 4(E) - Administration Building - Treasure Island (Indiana Jones and the Last Crusade)
  • 5(F) - California & Powell (Dawn of the Planet of the Apes)
  • 6(G) - Pier 1 (Godzilla)
  • 7(H) - Broadway between Powell and Davis (Ant-Man)
  • 8(I) - California & Davis St (Godzilla)
  • 9(J-A) - City Hall (Dawn of the Planet of the Apes)
  • 10(B) - Potrero & San Bruno (Godzilla)
  • 11(C) - Alioto Park (Dawn of the Planet of the Apes)
  • 12(D) - Market between Stuart and Van Ness (Ant-Man)
  • 13(E) - Eddy & Taylor St. (Godzilla)
  • 14(F) - 420 Jones St. at Ellis St. (Ant-Man)
  • 15(G) - Post & Jones St. (Godzilla)
  • 16(H) - Presidio - Golden Gate National Recreation Area (The Game)
  • 17(I) - Conzelman Rd at McCollough Rd and down Conzelm... (Ant-Man)
  • 18(J-A) - Mason & California Streets - Nob Hill (The Game)
  • 19(B) - Broadway & Columbus (Godzilla)
  • 20(C) - Sacramento & Front St. (Godzilla)
  • 21(D) - Pier 7 - The Embarcadero (Godzilla)
  • 22(E) - Embarcadero & Washington (Godzilla)
  • 23(F) - Bush & Kearny (Godzilla)
  • 24(G) - California St from Mason to Kearny (Dawn of the Planet of the Apes)
  • 25(H) - Kearney & Pine St. (Godzilla)
  • 26(I) - Stockton & Clay St (Godzilla)
  • 27(J-A) - University Club (Dawn of the Planet of the Apes)
  • 28(B) - Pine between Kearney and Davis (Ant-Man)
  • 29(C) - Washington Street & Waverly Place - Chinatown (The Game)
  • 30(D) - Columbus between Bay and Washington (Ant-Man)
  • 31(E) - Bank of America Building - 555 California Street (The Game)
  • 32(F) - City Club - 155 Sansome Street (The Game)
  • 33(G) - Grant between Bush and Broadway (Ant-Man)
  • 34(H) - Pine St. & Davis St (Godzilla)
  • 35(I) - Mission & Beal (Godzilla)

And on the more detailed map of San Francisco city center:

Optimal route
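
The layering used for the maps (at most 10 stops per Google route, with the last stop of one layer reused as the first stop of the next) can be sketched as follows. The function name `split_into_layers` is hypothetical, and the stops are represented by their index in the itinerary.

```python
from string import ascii_uppercase


def split_into_layers(stops, max_per_route=10):
    """Split an ordered route into chunks that overlap by exactly one stop."""
    if max_per_route < 2:
        raise ValueError("each layer needs at least two stops")
    layers = []
    start = 0
    while start < len(stops) - 1:
        layers.append(stops[start:start + max_per_route])
        start += max_per_route - 1  # the last stop of this layer opens the next one
    return layers


stops = list(range(36))  # itinerary positions 0..35
layers = split_into_layers(stops)

for number, layer in enumerate(layers, start=1):
    labels = [f"{stop}({ascii_uppercase[i]})" for i, stop in enumerate(layer)]
    print(f"Layer {number}: {' '.join(labels)}")
```

For 36 stops this yields four layers covering positions 0 to 9, 9 to 18, 18 to 27, and 27 to 35, which matches the A to J labels in the itinerary above.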

If you are interested, you can explore the other datasets from the City by the Bay yourself. We are definitely going to do that.

Bio: Juraj Kapasny is a co-founder and data scientist at Knoyd, a data mining enthusiast, and a former data scientist at Teradata (Vienna, Austria). He has worked on many customer-specific projects across industries such as Telco, Finance, and Automotive, helping customers gain additional insights and value from their data.