Python, Selenium & Google for Geocoding Automation: Free and Paid
This tutorial will take you through two options that have automated the geocoding process for the user using Python, Selenium and Google Geocoding API.
Geocoding is the beginning of Spatial Analysis. Whether geocoding a place or residence, an accurate geocoder is the first step to make sure your analysis is on the right track. This tutorial will take you through two options that have automated the geocoding process for the user using Python, Selenium and Google Geocoding API. Automation is something every Data Scientist needs to learn. But automating workflows is unique to every person and not everyone will have the same workflow. Automation is something to strive for when we believe that repeating the same process is too time consuming and your time is better spent doing something else while your automated process will take care of everything.
Free Geocoding with Python and Selenium
The first tutorial was written to automate my geocoding process when a paid service was not an option. From my personal observation, my script has given me a minimum 70% accuracy given the following standards are met:
- The address must not contain the name of a place.
- Minimal spelling errors
- The country is also part of the address
This script assumes you are:
- Using xlxs file extension (Microsoft Excel Workbook)
- The whole address is in one column.
- Using Google Chrome
Note: When using selenium, make sure that you download and keep the Google Chrome Diriver File in the same directory as your script and that the version is compatible with your own Google Chrome Browser.
The final output will be an excel file with the geocoded address and coordinates in separate columns.
from selenium import webdriver from selenium.webdriver.common.keys import Keys from selenium.webdriver.support.ui import Select from selenium.common.exceptions import ElementClickInterceptedException import pandas as pd import time from openpyxl import load_workbook from random import randint wb_name = "xxx.xlsx" #file name wb = load_workbook(wb_name, data_only = True) ws = wb['sheet_name'] address_list =[] link_col = xx #column number coord_prospects = pd.DataFrame() for row in ws.iter_rows(min_row = x , max_row = x, min_col = link_col, max_col=link_col): if str(row[0].value) != "None": address_list.append(row[0].value) driver = webdriver.Chrome(options=options) #driver.minimize_window() #this is optional if the opening google chrome window gets annoying driver.get('https://www.mapdevelopers.com/geocode_tool.php') add = driver.find_element_by_class_name('form-control') for t,a in enumerate(address_list): print ("Geocoding...",t+1,"/",len(address_list),str(round(t/len(address_list)*100,2)),"%"," : ", a) add.clear() add.send_keys(a) try: search1 = driver.find_element_by_xpath('//*[@id="search-form"]/div[1]/span[2]').click() time.sleep(3) search2 = driver.find_element_by_xpath('//*[@id="search-form"]/div[1]/span[2]').click() time.sleep(3) except ElementClickInterceptedException: time.sleep(2) search = driver.find_element_by_xpath('//*[@id="search-form"]/div[1]/span[2]').click() lat=driver.find_element_by_id('display_lat') lng=driver.find_element_by_id('display_lng') street=driver.find_element_by_id('display_address') city=driver.find_element_by_id('display_city') postcode=driver.find_element_by_id('display_zip') state=driver.find_element_by_id('display_state') county=driver.find_element_by_id('display_county') country=driver.find_element_by_id('display_country') latlng = pd.DataFrame({'Latitude':pd.Series(lat.text), 'Longitude':pd.Series(lng.text), 'Street':pd.Series(street.text), 'City':pd.Series(city.text), 'Postcode':pd.Series(postcode.text), 'State':pd.Series(state.text), 'County':pd.Series(county.text), 'Country':pd.Series(country.text)}) coord_prospects = coord_prospects.append(latlng, ignore_index=True) print(coord_prospects.tail(1)) print(" ") coord_prospects.to_excel('xxxx.xlsx') #name of output excel file driver.close()
Once you have updated wb_name
, link_col
and the name of the output excel file, you can begin running the script.
The above script once run will open a google chrome browser and go to https://www.mapdevelopers.com/geocode_tool.php where you will see the your geocoding taking place. If not for automation, the usual scenario will be keeping a excel and google screen open and a repetitive copy and paste motion. A person with thousands of rows will just be wasting time.
There are many ways to automate and improve on geocoding and this is a method I personally developed for when I do not feel like paying for an API and I am not time constrained. This brings me to some of the drawbacks of using a free geocoding method versus an API:
- It is very slow in comparison but more efficient and fast than manually geocoding
- Accuracy is rarely and almost never beaten
Google Geocoding with Python
Google used to allow a certain amount (about 2,500 API calls) for its geocoding per day for free. This is an annoying limitation but one that allowed for highly accurate geocoding even if your address had some spelling mistakes and even better if you had a place name. However recently google now charges for every API call.
If not mistaken, currently the damage is $5 for every 1,000 API calls and $4 for every API call above 100,000. To put it into perspective, for the first 100,000, you will be spending $0.005 for every API call. Anything above 500,000 may need a call to the their sales team. See here for their Geocoding API billing. Every new sign up makes you eligible for $200 credit.
In order to start using Google Geocoder API, you have to obtain an API key: Google Developers API Key
Once you obtain your API Key for their geocoding service, you can start using their service. The address used can include place names as well and this should even improve the results most of the time. If it can be found on google maps when you manually check, it can be used with google's geocoder to extract the coordinates.
import pandas as pd import googlemaps api = "xxx" #API Key df = pd.read_excel("xxx.xlsx") #file name geocoder = googlemaps.Client(key=api) df['Latitude'] = None df['Longitude'] = None df['Google Address'] = None for i in range(len(df)): print("Geocoding..."+" "+str(i)+"/"+str(len(df)) + " " + str(round(i/len(df)*100,2))+"%") result = geocoder.geocode(df.loc[i,"xxx"]) # xxx is the name of the column with the full address try: lat = result[0]["geometry"]["location"]["lat"] lng = result[0]["geometry"]["location"]["lng"] address = result[0]["formatted_address"] df.loc[i,"Latitude"] = lat df.loc[i,"Longitude"] = lng df.loc[i,"Google Address"] = address except: lat=None lng=None location=None print(df) df.to_excel("xxx.xlsx") #name of output file
The only parts you need to add are api
, df
, name of the column with the full address and the name of the output file at the end.
The above script will return an excel file with three new columns: 'Latitude', 'Longitude', 'Google Address'. It will be much faster than any custom script built. Google is undoubtedly the most preferred geocoding API as it is reliable for a client side project. In a business setting, if a client wishes to have the coordinates of their customers residence mapped, visualized and analyzed, explaining the use of google to geocode will be simple and understandable to them.
This method uses much less lines of code and is easier to maintain. With no restrictions and a relatively cheap price per API call, it can be feasible to a certain extent.
Related:
- Visualising Geospatial data with Python using Folium
- How to build a data science project from scratch
- OpenStreetMap Data to ML Training Labels for Object Detection