
DIY Crawlers vs. Crawlers as Service


Crawling structured data from the web has been made easier with the choice between crawlers as a service, like webhose.io, and do-it-yourself, like import.io.



By Ran Geva, Dec 2014.

So you need to obtain structured data from the World Wide Web. You could develop a proprietary crawling technology, modify an existing one, or look for a service that already provides a solution. I suggest the last option.

Data crawling services fall into two main categories: DIY (Do It Yourself) and CAS (Crawlers As Service) solutions. In this post I will look at one solution from each category.

Crawlers As Service - Webhose.io

The crawlers behind Webhose.io are already crawling hundreds of thousands of public sources, downloading millions of posts a day. The output is unified and structured: each post carries the data extracted for it, such as the actual post text, author, date and time, title, link, section details and more. You can consume the structured data either via an API or a firehose. You can also request the addition of a new source and get it added to the crawling cycle at any time.

Pros:
  • Get access to millions of live posts either via a Boolean query or via a firehose
  • The data is structured and ready to be inserted into a DB
  • Very simple to consume (JSON/XML/RSS)

Cons:
  • Trial based: you get only 1,000 free requests; after that you have to pay
  • Currently supports sites with either articles or discussion threads, and not tabular data
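
To illustrate what "structured and ready to be inserted into a DB" can look like in practice, here is a minimal Python sketch. It makes several assumptions: the sample payload, the "posts" key and the field names (url, title, author, published, text) are hypothetical, modeled on the fields listed above rather than taken from the actual Webhose.io response format, and SQLite stands in for whatever database you use. No request to the service is made; the sketch only parses a JSON string and stores it.

```python
import json
import sqlite3

# Hypothetical sample payload. The "posts" key and the field names are
# assumptions based on the fields described in this post, not the
# documented Webhose.io response schema.
SAMPLE_RESPONSE = """
{
  "posts": [
    {"title": "First post", "author": "alice",
     "published": "2014-12-01T10:00:00Z",
     "url": "http://example.com/a", "text": "Body of the first post."},
    {"title": "Second post", "author": null,
     "published": "2014-12-02T11:30:00Z",
     "url": "http://example.com/b", "text": "Body of the second post."}
  ]
}
"""

# Column order used for both the table and the inserted tuples.
FIELDS = ("url", "title", "author", "published", "text")


def store_posts(conn, payload):
    """Parse a JSON payload and insert its posts; return how many rows were added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "url TEXT PRIMARY KEY, title TEXT, author TEXT, "
        "published TEXT, text TEXT)"
    )
    posts = json.loads(payload).get("posts", [])
    # Missing fields become None, which SQLite stores as NULL.
    rows = [tuple(post.get(field) for field in FIELDS) for post in posts]
    before = conn.total_changes
    # INSERT OR IGNORE skips posts whose url is already in the table.
    conn.executemany("INSERT OR IGNORE INTO posts VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn.total_changes - before


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    added = store_posts(connection, SAMPLE_RESPONSE)
    print("posts added:", added)
```

Using the link as the primary key means that feeding the same batch twice does not create duplicates, which is convenient when polling an API repeatedly. A real integration would replace SAMPLE_RESPONSE with the body returned by the service and adjust the field names to its actual schema.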

Do It Yourself - Import.io

Import.io lets you define and scrape data from individual websites into a structured format. It is well suited to gathering, aggregating and analysing data from websites without the need for coding skills. The tool allows people to create an API using its point-and-click interface. Just download the import.io browser, and it will guide you by asking a few questions about the data you're trying to gather. Import.io is completely free for the time being and is available on Windows, OS X, and Linux.

Pros:
  • Free!
  • User friendly and very easy to use
  • You are your own man, you control what you get

Cons:
  • Doesn’t work well with AJAX-based sites, or sites that require manual actions to show text
  • Isn’t built for heavy lifting where you want to crawl hundreds of sites, with thousands of pages created daily



In short, both the DIY and CAS approaches are valid; you just need to choose the right tool for the job.

Ran Geva is responsible for the technological and business development of Buzzilla LTD (the company behind Webhose.io), as well as research and development of future products. Ran, one of the founders of Buzzilla LTD, has 20 years of experience in technology development and customer/server-related systems.


