KDnuggets Home » News » 2017 » Dec » News, Features » Web Scraping for Data Science with Python ( 17:n47 )

Web Scraping for Data Science with Python


We take a quick look at how web scraping can be useful in the context of data science projects, e.g. to construct a social graph of S&P 500 companies, using Python and Gephi.



By Seppe vanden Broucke and Bart Baesens. Sponsored Post.

For those who are not familiar with programming or the deeper workings of the web, web scraping often looks like a black art: the ability to write a program that sets off on its own to explore the Internet and collect data is seen as a magical and exciting ability to possess. In Web Scraping for Data Science with Python, we set out to provide a concise though thorough and modern guide to web scraping, using Python as our programming language. In addition, the book is written with a data science audience in mind.

We're data scientists ourselves, and have very often found web scraping to be a powerful tool to have in one's arsenal: many data science projects start with obtaining an appropriate data set, so why not utilize the treasure trove of information the web provides? The book employs a "code first" approach to get you up to speed quickly without too much boilerplate text, shows how to handle the web of today (including JavaScript, cookies, and common web scraping mitigation techniques), and includes a thorough managerial and legal discussion regarding web scraping. We also provide lots of pointers for further reading and learning, and include fourteen real-life, fully worked-out examples. For more details, click here.


In this article, we take a quick look at how web scraping can be useful in the context of data science projects. Web "scraping" (also called "web harvesting", "web data extraction", or even "web data mining") can be defined as "the construction of an agent to download, parse, and organize data from the web in an automated manner". In other words: instead of a human end-user clicking away in their web browser and copy-pasting interesting parts into, say, a spreadsheet, web scraping offloads this task to a computer program, which can execute it much faster, and more correctly, than a human can.

The automated gathering of data from the Internet is probably as old as the Internet itself, and the term "scraping" has been around for much longer than the web. Before "web scraping" became popularized as a term, a practice known as "screen scraping" was already well-established as a way to extract data from a visual representation, which in the early days of computing (think 1960s-80s) often boiled down to simple, text-based "terminals". Just as today, people at that time were already interested in "scraping" large amounts of text from such terminals and storing this data for later use.

When surfing the web in a normal browser, you've probably encountered multiple sites where you considered the possibility of gathering, storing, and analyzing the data present on the site's pages. Especially for data scientists, whose "raw material" is data, the web exposes a lot of interesting opportunities, and this is where web scraping comes in handy: if you can view some data in your web browser, you can access and retrieve it through a program, and if you can access it through a program, the data can be stored, cleaned, and used in any way you like. No matter your field of interest, there's almost always a use case to improve or enrich your practice based on data. "Data is the new oil", as the common saying goes, and the web has a lot of it.
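As a quick illustration of that last point, here is a minimal sketch of pulling structured data out of a page's markup with BeautifulSoup, the parsing library used throughout this article. The HTML fragment below is made up for the example; in practice it would come from a downloaded page:

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a downloaded page
html = '''
<table class="dataTable">
  <tr><td><a href="/finance/stocks/overview/MMM.N">3M Co</a></td></tr>
  <tr><td><a href="/finance/stocks/overview/ZTS.N">Zoetis Inc</a></td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
# Pull out each link's text and target, ready to be stored or cleaned
rows = [(a.get_text(), a.get('href')) for a in soup.find_all('a')]
print(rows)
# → [('3M Co', '/finance/stocks/overview/MMM.N'),
#    ('Zoetis Inc', '/finance/stocks/overview/ZTS.N')]
```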

In this article, our goal is to construct a social graph of S&P 500 companies and their interconnectedness through their board members. We'll start from the S&P 500 page at Reuters to obtain a list of symbols:

from bs4 import BeautifulSoup
import requests
import re

session = requests.Session()

sp500 = 'https://www.reuters.com/finance/markets/index/.SPX'

page = 1
regex = re.compile(r'/finance/stocks/overview/.*')
symbols = []

while True:
  print('Scraping page:', page)
  params = {'sortBy': '', 'sortDir': '', 'pn': page}
  html = session.get(sp500, params=params).text
  soup = BeautifulSoup(html, "html.parser")
  pagenav = soup.find(class_='pageNavigation')
  if not pagenav:
    break
  companies = pagenav.find_next('table', class_='dataTable')
  for link in companies.find_all('a', href=regex):
    symbols.append(link.get('href').split('/')[-1])
  page += 1

print(symbols)


Once we have obtained a list of symbols, we can visit the board member pages for each of them (e.g. https://www.reuters.com/finance/stocks/company-officers/MMM.N), extract the table of board members, and store it as a pandas data frame:

from bs4 import BeautifulSoup
import requests
import pandas as pd

session = requests.Session()

officers = 'https://www.reuters.com/finance/stocks/company-officers/{symbol}'

symbols = ['MMM.N', [...], 'ZTS.N']
dfs = []

for symbol in symbols:
  print('Scraping symbol:', symbol)
  html = session.get(officers.format(symbol=symbol)).text
  soup = BeautifulSoup(html, "html.parser")
  officer_table = soup.find('table', class_='dataTable')
  df = pd.read_html(str(officer_table), header=0)[0]
  df.insert(0, 'symbol', symbol)
  dfs.append(df)

df = pd.concat(dfs)
df.to_pickle('data.pkl')


This sort of information can lead to a lot of interesting use cases, especially in the realm of graph and social network analytics. For instance, we can use our collected information to export a graph and visualize it using Gephi, a popular graph viz tool:

import pandas as pd
import networkx as nx
from networkx.readwrite.gexf import write_gexf

df = pd.read_pickle('data.pkl')

G = nx.Graph()

for row in df.itertuples():
  G.add_node(row.symbol, type='company')
  G.add_node(row.Name, type='officer')
  G.add_edge(row.symbol, row.Name)

write_gexf(G, 'graph.gexf')
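Before handing the file off to Gephi, networkx itself can already answer simple questions on such a graph, e.g. which officers bridge multiple companies. A sketch on a hypothetical toy graph (the symbols and officer names below are made up):

```python
import networkx as nx

# Toy bipartite graph of companies and board officers (hypothetical data)
G = nx.Graph()
G.add_node('AAA.N', type='company')
G.add_node('BBB.N', type='company')
G.add_node('CCC.N', type='company')
G.add_node('Officer A', type='officer')
G.add_node('Officer B', type='officer')
G.add_edge('AAA.N', 'Officer A')
G.add_edge('BBB.N', 'Officer A')  # Officer A sits on two boards
G.add_edge('CCC.N', 'Officer B')

# Officers linked to more than one company act as bridges in the social graph
bridges = [n for n, d in G.nodes(data=True)
           if d['type'] == 'officer' and G.degree(n) > 1]
print(bridges)  # → ['Officer A']
```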


The output file can be opened in Gephi, filtered, and modified. The following figure shows a snapshot of the egonets of order 3 for Apple, Google, and Amazon, showing that these are indeed connected:

[Figure: S&P 500 network around Amazon, Apple, and Google]
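Egonets like the ones in the figure can also be extracted programmatically with networkx's ego_graph function, which keeps every node within a given number of hops of a focal node. A sketch on a toy graph (symbols and officer names hypothetical):

```python
import networkx as nx

# Toy company-officer graph (hypothetical data)
G = nx.Graph()
G.add_edges_from([
    ('AAA.N', 'Officer A'), ('Officer A', 'BBB.N'),
    ('BBB.N', 'Officer B'), ('Officer B', 'CCC.N'),
    ('CCC.N', 'Officer C'), ('Officer C', 'DDD.N'),
])

# Order-3 egonet around AAA.N: everything within three hops.
# CCC.N lies four hops away, so it (and anything beyond) is excluded.
ego = nx.ego_graph(G, 'AAA.N', radius=3)
print(sorted(ego.nodes()))  # → ['AAA.N', 'BBB.N', 'Officer A', 'Officer B']
```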

Further reading

Our book, "Web Scraping for Data Science with Python", will be released soon and is geared towards data scientists who want to adopt web scraping techniques in their workflow. Stay tuned for more information at www.webscrapingfordatascience.com/.

Graphs can be used in a variety of ways in predictive setups as well. For more reading on this topic, we refer to:
  • www.dataminingapps.com/dma_research/fraud-analytics/
  • Node2Vec is a powerful featurization technique converting nodes in a graph to feature vectors: https://snap.stanford.edu/node2vec/
  • Personalized pagerank is very often used as a featurization approach in the context of e.g. churn and fraud analytics: https://www.r-bloggers.com/from-random-walks-to-personalized-pagerank/
  • Van Vlasselaer, V., Akoglu, L., Eliassi-Rad, T., Snoeck, M., Baesens, B. (2015). Guilt-by-constellation: fraud detection by suspicious clique memberships. Proceedings of the 48th Annual Hawaii International Conference on System Sciences (HICSS-48). Kauai (Hawaii), 5-8 January 2015
  • Van Vlasselaer, V., Akoglu, L., Eliassi-Rad, T., Snoeck, M., Baesens, B. (2014). Finding cliques in large fraudulent networks: theory and insights. Conference of the International Federation of Operational Research Societies (IFORS 2014). Barcelona (Spain), 13-18 July 2014.
  • Van Vlasselaer, V., Akoglu, L., Eliassi-Rad, T., Snoeck, M., Baesens, B. (2014). Gotch'all! Advanced network analysis for detecting groups of fraud. PAW (Predictive Analytics World). London (UK), 29-30 October 2014
  • Van Vlasselaer, V., Van Dromme, D., Baesens, B. (2013). Social network analysis for detecting spider constructions in social security fraud: new insights and challenges. European Conference on Operational Research. Rome (Italy), 1-4 July 2013
  • Van Vlasselaer, V., Meskens, J., Van Dromme, D., Baesens, B. (2013). Using social network knowledge for detecting spider constructions in social security fraud. Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Network Analysis and Mining. ASONAM. Niagara Falls (Canada), 25-28 August 2013 (pp. 813-820). 445 Hoes Lane, PO Box 1331, Piscataway, NJ 08855-1331, USA: IEEE Computer Society


Bio: Seppe vanden Broucke is an assistant professor at the Faculty of Economics and Business, KU Leuven, Belgium. His research interests include business data mining and analytics, machine learning, process management, and process mining. His work has been published in well-known international journals and presented at top conferences. Seppe's teaching includes Advanced Analytics, Big Data and Information Management courses. He also frequently teaches for industry and business audiences.

Bart Baesens is a professor of Big Data and Analytics at KU Leuven (Belgium) and a lecturer at the University of Southampton (United Kingdom). He has done extensive research on Big Data & Analytics, Credit Risk Modeling, Fraud Detection and Marketing Analytics. He has written 8 books and more than 200 scientific papers, some of which have been published in well-known international journals and presented at top international conferences. He has received various best paper and best speaker awards. His research is summarized at www.dataminingapps.com.