ParseHub gives Data Scientists a better, faster way to collect data
ParseHub enables data professionals to easily collect, structure, combine and manipulate data, and speed up the Data Science process.
By Angelina Fomina (ParseHub).
A hypothesis is now lingering in your mind, and you can’t wait to dig into further investigation. With a keen sense of observation and an excitement for problem solving, you strive to uncover patterns that only you can make sense of. But taming the chaos of data collection and preparation before you can get to the polished findings is like digging a tunnel without a headlamp.
Let’s face it. Getting data, especially the right data, is always a challenge. Ideally you want to make art and science collide to create beautiful meaning, instead of working as a janitor in a data asylum. And a world where all of your data appears perfectly structured and groomed, right at your fingertips, can exist.
Developing software that automates the monotonous task of gathering data is the next step that will give data scientists more freedom to focus on the creative aspects of their job. “Data scientists...spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.” - Steve Lohr, for The New York Times.
The scraping applications currently available are just the tip of the iceberg of technologies that are converting “blah” data parsing irritations into “aha” discoveries. These technologies will make data collection even faster and more accurate. The speed at which teams and companies can digest the inflow of data has a direct influence on their competitive edge.
In the recruiting industry, for example, the data available through social network comments can uncover insights about employee vs. employer relationships and predict the future of hiring trends. Subsequently, companies can create policies that decrease employee turnover and prepare to source the best talent in advance.
Today, we will show you one way to combine population data and employer reviews to draw observations about the top rated and most popular companies in the largest American cities. With ParseHub, you will be able to collect, structure, combine and manipulate data (using RegEx as well), so your results are friendly enough to analyze right away. You will be able to pull all of the largest cities in the USA from Wikipedia and have ParseHub automatically enter each city one-by-one into the search form of a website that lists top rated companies.
At the end of the day it is less about the data and more about what you do with it. Let’s speed through the data aggregation process, now, so you can go on to explore the results.
Download the ParseHub browser extension, go to en.wikipedia.org/wiki/List_of_United_States_cities_by_population and start a new project.
Scroll down the page, so you can select all of the cities. Hover over the first city in the table, hold down the Ctrl key (Cmd on Mac) and press the number 1 key on your keyboard twice. When you see <td> selected, click on it. Now, hold Shift and hover over the second city. Hold down the Ctrl or Cmd key at the same time as the Shift key and click on the new <td> selection. Now all of the table cells in the city column will be selected for you.
Use the list tool to create a new empty Excel row or empty JSON object for each selected city. Rename the list “cities”. Use the extract tool to add the text for each city into a separate empty row/JSON object. Rename the extraction “city”.
Use RegEx to clean up the text. Let’s get rid of the numbered links beside each city by selecting “Use regex” and entering (\w+(\s\w+)*) into the text box, located under the “Extract city” node.
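To see what this pattern keeps and what it drops, here is a minimal Python sketch. It assumes the extraction keeps the first match of the pattern, which is an assumption about ParseHub's behavior rather than something confirmed here. The input strings, including the bracketed markers, are made-up examples, and `clean_city` is a hypothetical helper name.

```python
import re

# The pattern from the step above: one word, optionally followed by
# more words, each separated by a single whitespace character.
CITY_PATTERN = re.compile(r"(\w+(\s\w+)*)")


def clean_city(raw_text):
    """Return the first match of the pattern, or the raw text if nothing matches.

    Assumption: the scraper keeps the first match. This helper is
    hypothetical and only illustrates the regular expression itself.
    """
    match = CITY_PATTERN.search(raw_text)
    return match.group(1) if match else raw_text


# Made-up inputs that imitate a city name followed by a numbered link.
print(clean_city("New York[6]"))    # New York
print(clean_city("Los Angeles"))    # Los Angeles
print(clean_city("St. Louis[10]"))  # St
```

The last line shows a limitation: because `\w` does not match a period, a name such as “St. Louis” would be cut off at “St”. If your results contain names with punctuation, the pattern may need adjusting.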
Now, add the state to the results as well. Use the relative select tool. Click on the first city and click on the corresponding state in the next column. Use the extract tool to add the state text and rename the extraction “state”.
Join the city and the state into one phrase. This will help ParseHub enter a more accurate search on the reviews website. Click on the “List cities” node. Use the extract tool and rename the extraction “location”. Instead of $e.text, enter city + "," + " " + state into the text box. In the results you should see “New York, New York”.
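The same concatenation can be sketched in Python to check what the expression produces. The `rows` list and the `build_location` helper are hypothetical stand-ins for the “cities” list. Only the expression itself comes from the step above.

```python
def build_location(city, state):
    """Mirror the expression city + "," + " " + state from the step above."""
    return city + "," + " " + state


# Hypothetical rows standing in for the "cities" list built so far.
rows = [
    {"city": "New York", "state": "New York"},
    {"city": "Los Angeles", "state": "California"},
]

for row in rows:
    row["location"] = build_location(row["city"], row["state"])
    print(row["location"])
```

Each row now carries a “location” value such as “New York, New York”, which is the phrase that gets typed into the search form later on.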
Navigate to glassdoor.com to get the companies for each city. Use the navigate tool. From the dropdown select “Going to this link” and enter "http://www.glassdoor.com/Reviews/index.htm" into the text box. In the second text box type in “search” and click “Create new template”.
Make new instructions to handle the search form on the new template. Select the location search box with the select tool active. Use the input tool and enter “location” into the “input value” text box. Select “expression” from the dropdown. ParseHub will now enter all of the cities from Wikipedia into the search box one by one.
Now select the search button using the select tool. Before telling ParseHub to click on the button and go to the results, use the browser tool and enter “New York” into the search box like you would regularly. Now use the navigate tool, enter “companies” into the text box and click “Create new template”.
Important: Click on the “Options” dropdown next to the “search” node. Select “No duplicates”. By default, ParseHub does not go to the same page twice, and this step disables that behavior.
Make new instructions to get the top 10 companies for each city on the new template. Make sure the select tool is active. Click on the first company, hold Shift, click on the second company. All of the companies will be selected for you.
Use the list tool to put each company into a separate Excel row or JSON object and rename the list “companies”. Use the extract tool to get the text for each company and rename the extraction “name”.
Extract the rating for each company. Use the relative select tool, click on the first company and click on its corresponding rating. All of the ratings for all of the companies should be selected for you. Use the extract tool and rename the extraction “rating”.
Click on the “Select page” node. Use the select tool to click on the “rating” tag at the top of the page to display and extract companies with the highest rating. Use the navigate tool to open the top rated companies and select “companies” from the dropdown, so ParseHub applies the instructions we already created to the top rated companies. Do this because top rated and most popular companies have the exact same page structure - the only difference is the information.
Click “Get Data”, “Run Once”, “Save” and “Run on Servers”. Wait a few minutes and your data should be available for all of the 200+ cities in the USA.
Beside the “actions” text, click to download your data in JSON or CSV format. You can also interact with the data from this project using your API key and project API token.
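Once the JSON is downloaded, it can be flattened into one row per company before analysis. The sketch below is only illustrative. The nesting it assumes (a “cities” list whose items hold a “companies” list) is a guess based on the names used in this walkthrough, so check it against your actual file. The company names and ratings in the sample are invented.

```python
import json

# Invented sample that imitates the assumed shape of the downloaded file.
SAMPLE_JSON = """
{
  "cities": [
    {
      "city": "New York",
      "state": "New York",
      "location": "New York, New York",
      "companies": [
        {"name": "Example Co", "rating": "4.5"},
        {"name": "Sample Corp", "rating": "3.9"}
      ]
    }
  ]
}
"""


def flatten_results(results):
    """Turn the nested results into one flat row per (city, company) pair."""
    rows = []
    for city in results.get("cities", []):
        for company in city.get("companies", []):
            try:
                rating = float(company.get("rating"))
            except (TypeError, ValueError):
                rating = None  # missing or non-numeric rating
            rows.append({
                "location": city.get("location"),
                "name": company.get("name"),
                "rating": rating,
            })
    return rows


rows = flatten_results(json.loads(SAMPLE_JSON))
for row in rows:
    print(row)
```

Ratings are converted to numbers here because scraped text usually arrives as strings. Rows with a rating that cannot be parsed keep `None`, so they can be filtered out later.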
From a quick analysis of the data in this example, we can figure out the 10 companies that appeared in at least 1 out of every 10 of the largest cities and have a rating between 4 and 5 stars.
The most popular and highest rated companies are:
- AutoClaims Direct - 5.0
- Day Translations - 5.0
- SmashFly Technologies - 5.0
- The Trade Desk - 5.0
- UDig - 5.0
- NetView - 4.8
- Hotelplanner.com - 4.8
- Insight Global - 4.5
- Scalability Experts - 4.4
- US Air Force - 4.1
The top 10 most popular companies across all cities, including Walmart, Target, Wells Fargo, Bank of America, AT&T, had an average rating of 3.9.
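A filter of this kind can be sketched in a few lines of Python. This is not the exact analysis behind the numbers above. It assumes flat rows with “location”, “name” and “rating” fields, reads “1 out of every 10 cities” as a share of at least 0.1 of all cities in the data, and averages a company's ratings across cities. The sample rows and the function name are invented for illustration.

```python
from collections import defaultdict


def popular_and_top_rated(rows, min_city_share=0.1, min_rating=4.0, max_rating=5.0):
    """Return (name, average rating, city count) for companies that appear in
    enough cities and whose average rating falls inside the given range."""
    cities_by_company = defaultdict(set)
    ratings_by_company = defaultdict(list)
    all_cities = set()

    for row in rows:
        all_cities.add(row["location"])
        cities_by_company[row["name"]].add(row["location"])
        if row["rating"] is not None:
            ratings_by_company[row["name"]].append(row["rating"])

    selected = []
    for name, cities in cities_by_company.items():
        ratings = ratings_by_company.get(name)
        if not ratings:
            continue  # no usable rating for this company
        average = sum(ratings) / len(ratings)
        share = len(cities) / len(all_cities)
        if share >= min_city_share and min_rating <= average <= max_rating:
            selected.append((name, round(average, 1), len(cities)))

    # Highest rating first, then alphabetical for a stable order.
    selected.sort(key=lambda item: (-item[1], item[0]))
    return selected


# Invented rows; real input would come from the downloaded results.
sample_rows = [
    {"location": "City A", "name": "Example Co", "rating": 4.5},
    {"location": "City A", "name": "Sample Corp", "rating": 3.9},
    {"location": "City B", "name": "Example Co", "rating": 4.7},
    {"location": "City B", "name": "Other Inc", "rating": 5.0},
]

for name, rating, city_count in popular_and_top_rated(sample_rows):
    print(name, rating, city_count)
```

Raising `min_city_share` narrows the list to companies that show up in more cities, which is one way to separate a widely present employer from a highly rated local one.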
This is just one example of how you can eliminate the painful data collection process and get right to the good stuff. Starting with data scraping, powerful applications will soon pop up and re-invent every process in the big data world, becoming essential tools in the data scientist's toolbox. The time you spend anxiously waiting for your data, or fidgeting to make it perfectly presentable, will be replaced with finding the true meaning behind all of the information.
Bio: Angelina Fomina, @aafomina, is a co-founder of ParseHub, and can also be found outdoors, painting & philosophizing about life.