KDnuggets Home » News » 2016 » Mar » Tutorials, Overviews » 3 Viable Ways to Extract Data from the Open Web ( 16:n10 )

3 Viable Ways to Extract Data from the Open Web


We look at 3 main ways to handle data extraction from the open web, along with some tips on when each one makes the most sense as a solution.



By Ran Geva, Webhose.io.

Whether you want to manage your company’s reputation, monitor the online chatter surrounding your brand, perform research or just keep a finger on the pulse of a certain product or industry, there’s probably plenty of relevant raw web data out there.

Web crawling allows you to extract the content you need from any number of sources. Fortunately, there are several tools and tactics for obtaining this data. Let’s take a look at three of the main ways to handle data extraction from the open web, along with some tips on when each one makes the most sense as a solution.

In this article, we discuss three options for extracting data from the web:

  1. Build your own crawler
  2. Use scraping tools
  3. Use pre-packaged data

1. Build Your Own Crawler

The first way to tackle data extraction is to DIY. To do this, you’ll need to code the crawler yourself (there are some useful open-source products that can help get you started), you’ll need a host to run the crawler 24/7, and you’ll need scalable, flexible server infrastructure for storing and accessing the content. And that’s all before you’ve cleaned, structured or analyzed your extracted data.

The main advantage of this approach is that the crawler is customized exactly to your needs, giving you total control over the whole process. However, this method can be quite taxing on resources, as you’ll need to keep constant watch over the system, updating how it works and what it scrapes as your needs evolve.

On the other hand, for a one-off extraction project, building your own crawler may make sense.
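To make the moving parts concrete, here is a minimal sketch of the core crawl loop in Python. It runs against a tiny in-memory stand-in for the web (the `SITE` dict and its URLs are purely illustrative); a real crawler would fetch each URL over HTTP, respect robots.txt, throttle requests, and persist results somewhere durable.

```python
from html.parser import HTMLParser
from collections import deque

# A tiny in-memory "web" used to illustrate the crawl loop; in a real
# crawler you would fetch each URL over HTTP (e.g. with urllib.request).
SITE = {
    "http://example.com/": '<a href="http://example.com/a">A</a>',
    "http://example.com/a": '<a href="http://example.com/">home</a> <p>leaf</p>',
}

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag seen in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(seed, fetch):
    """Breadth-first crawl from `seed`, deduplicating visited URLs."""
    seen, queue, pages = set(), deque([seed]), {}
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        queue.extend(parser.links)
    return pages

pages = crawl("http://example.com/", SITE.get)
print(sorted(pages))  # both pages are discovered, each fetched only once
```

Even this toy version shows where the ongoing costs come from: link extraction, deduplication and the fetch layer all need maintenance as your target sites and needs evolve.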

2. Set a Scraping Tool to Pull What You Want

This method does not call for the same level of coding skill – all you need to do is use an app’s GUI to select the data you need from each site. Using tools like Import.io, Diffbot, Kapow and Mozenda, you can train a prebuilt scraper to recognize patterns. Once set loose, the scraper pulls what you configured it to and delivers the content as an Excel or CSV file.

The advantage of using a tool like this is that it extracts only what you want and structures the data according to your settings. The disadvantage is that setup takes a lot of work, as you have to manually list sources and map out fields. Maintenance also demands ongoing effort, because sites change frequently and can break your field mappings.
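Because these tools typically hand back a CSV (or Excel) export, the downstream step is usually just loading that file. A minimal Python sketch, with hypothetical column names (`title`, `price`) standing in for whatever fields you mapped in the tool:

```python
import csv
import io

# Hypothetical CSV export from a point-and-click scraping tool; the
# column names ("title", "price") are assumptions for illustration.
export = io.StringIO(
    "title,price\n"
    "Widget A,9.99\n"
    "Widget B,19.50\n"
)

# DictReader maps each row to the header fields you defined in the tool.
rows = list(csv.DictReader(export))
prices = [float(r["price"]) for r in rows]
print(len(rows), round(sum(prices), 2))
```

In practice you would pass the downloaded file's path to `open()` instead of the in-memory `StringIO` used here to keep the sketch self-contained.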

This solution, then, is best when you’re dealing with finite sets of sources and required fields – not so much for monitoring or ongoing research.

3. Pre-Packaged Data from Webhose.io

The Webhose.io platform provides a different method of accessing crawled web data. With this data-as-a-service (DaaS) solution, setup and maintenance are close to nil, because the service essentially pre-crawls and structures the web for you.

All you need to do is filter the data using a simple API, so that you get only the data relevant to you. As of last month, you can even access historical web data using the archive.

This means that virtually all the costs associated with running your own crawling operation are eliminated. With Webhose.io, there are no scraper bots to code, no site lists to manage and no fields to parse – you can simply plunder the troves of structured, cleaned and crawled content to get what you need.
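As an illustration of what the "filter via a simple API" workflow looks like, the sketch below composes a filtered query URL. The endpoint and the parameter names (`token`, `q`, `language`, `size`) are assumptions for illustration, not Webhose.io’s actual API – consult the provider’s documentation for the real ones. The HTTP request itself is omitted so the sketch stays offline.

```python
from urllib.parse import urlencode

# Hypothetical DaaS endpoint; swap in the provider's real one.
BASE = "https://api.example-daas.com/search"

def build_query(token, query, language="english", size=10):
    """Compose the request URL; the actual HTTP GET (e.g. via
    urllib.request.urlopen) is omitted to keep the sketch offline."""
    params = {"token": token, "q": query, "language": language, "size": size}
    return BASE + "?" + urlencode(params)

# "YOUR_API_TOKEN" is a placeholder; the query syntax is illustrative.
url = build_query("YOUR_API_TOKEN", 'acme AND "product recall"')
print(url)
```

The point is the division of labor: your code only expresses the filter, while crawling, cleaning and structuring happen on the provider’s side.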

The Right Tool for the Right Job

We have yet to see a single solution that’s perfect for all web crawling needs. However, between these three methods, you should be able to find a viable one for the needs of your project.