DIY Crawlers vs. Crawlers as Service

Crawling structured data from the web has become easier now that you can choose between crawlers as a service and do-it-yourself tools.

By Ran Geva, Dec 2014.

So you need to obtain structured data from the World Wide Web. You could develop a proprietary crawling technology, modify an existing one, or look for a service that already provides a solution. I suggest the latter.

Data crawling services fall into two main categories: DIY (Do It Yourself) and CAS (Crawlers As Service) solutions. In this post I will look at one solution from each category.

Crawlers As Service -

The crawlers behind the service are already crawling hundreds of thousands of public sources, downloading millions of posts a day. The output is unified and structured: each post contains all the extracted metadata, such as the actual post text, author, date and time, title, link, section details and more. You can consume the structured data either via an API or via a firehose. You can also request the addition of a new source and have it added to the crawling cycle at any time.

Pros:

  • Get access to millions of live posts either via a Boolean query or via a firehose
  • The data is structured and ready to be inserted into a DB
  • Very simple to consume (JSON/XML/RSS)

Cons:

  • Trial based: you get only 1,000 free requests; after that you have to pay
  • Currently supports sites with either articles or discussion threads, and not tabular data
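
To make the "structured and ready to be inserted into a DB" point concrete, here is a minimal Python sketch of consuming such output. The payload below is a made-up sample, and the field names (`title`, `author`, `published`, `url`, `text`) are assumptions based on the metadata listed above, not the schema of any real service.

```python
import json
import sqlite3

# Hypothetical sample of a crawler-as-a-service response. The field names are
# assumptions for illustration only, not a real API schema.
SAMPLE_RESPONSE = """
{"posts": [
  {"title": "First post", "author": "alice", "published": "2014-12-01T10:00:00",
   "url": "http://example.com/1", "text": "Crawlers are useful."},
  {"title": "Second post", "author": "bob", "published": "2014-12-02T11:30:00",
   "url": "http://example.com/2", "text": "Structured data saves time."}
]}
"""


def store_posts(conn, raw_json):
    """Parse a JSON payload and insert each post as one row; return the post count."""
    posts = json.loads(raw_json)["posts"]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "url TEXT PRIMARY KEY, title TEXT, author TEXT, published TEXT, text TEXT)"
    )
    # Named placeholders map directly onto the keys of each post dictionary.
    conn.executemany(
        "INSERT OR REPLACE INTO posts (url, title, author, published, text) "
        "VALUES (:url, :title, :author, :published, :text)",
        posts,
    )
    conn.commit()
    return len(posts)


conn = sqlite3.connect(":memory:")
stored = store_posts(conn, SAMPLE_RESPONSE)
authors = [row[0] for row in conn.execute("SELECT author FROM posts ORDER BY published")]
print(stored, authors)
```

Because every post arrives with the same fields, the insert needs no per-site parsing logic. Using the URL as the primary key is one possible way to avoid storing the same post twice if it is delivered again.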

Do It Yourself -

The DIY tool lets you define and scrape data from individual websites into a structured format. It is well suited to gathering, aggregating and analysing data from websites without the need for coding skills. You create an API through its point-and-click interface: just download the browser, and from there it will guide you and ask you a few questions about the data you're trying to gather. It is completely free for the time being and is available on Windows, OS X, and Linux.

Pros:

  • Free!
  • User friendly and very easy to use
  • You are your own man: you control what you get

Cons:

  • Doesn’t work well with AJAX-based sites, or sites that require manual actions to show text
  • Isn’t built for heavy lifting where you want to crawl hundreds of sites, with thousands of pages created daily
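
For a sense of what "defining and scraping data from an individual website" amounts to under the hood, here is a rough Python sketch using only the standard library. It is not how any particular DIY tool works; the sample markup and the class names `post`, `title` and `author` are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page markup; the class names "post", "title" and "author"
# are assumptions, not taken from any real site.
SAMPLE_HTML = """
<html><body>
  <div class="post"><h2 class="title">Hello crawlers</h2><span class="author">alice</span></div>
  <div class="post"><h2 class="title">DIY scraping</h2><span class="author">bob</span></div>
</body></html>
"""


class PostExtractor(HTMLParser):
    """Collect the text of elements whose class is 'title' or 'author'."""

    FIELDS = {"title", "author"}

    def __init__(self):
        super().__init__()
        self.records = []    # one dict per post
        self._field = None   # field whose text is expected next, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "post":
            # A new post container starts a new record.
            self.records.append({})
        elif css_class in self.FIELDS and self.records:
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that directly follows a recognised field tag.
        if self._field:
            self.records[-1][self._field] = data.strip()
            self._field = None


extractor = PostExtractor()
extractor.feed(SAMPLE_HTML)
records = extractor.records
print(records)
```

A parser like this only sees the HTML it is given, so content that a page loads later through JavaScript never reaches it, which is the same limitation as the AJAX point above. The extraction rules are also tied to one site's markup, which is why this approach becomes hard to maintain across hundreds of sites.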

In short, both the DIY and CAS approaches are valid; you just need to choose the right tool for the job.

Ran Geva is responsible for the technological and business development of Buzzilla LTD, as well as research and development of future products. Ran, one of the founders of Buzzilla LTD, has 20 years of experience in technology development and customer/server-related systems.