Web Content Mining, Screen Scraping





commercial
  • AMI Enterprise Intelligence searches, collects, stores and analyses data from the web.
  • Automation Anywhere, intelligent automation software to automate business & IT processes, including web data extraction and screen scraping.
  • Bixolabs, an elastic web mining platform built with Bixo, Cascading & Hadoop for Amazon's cloud (EC2).
  • Crawlera, a smart IP rotator that works around bot countermeasures and lets you crawl more complex sites such as Google.
  • Darcy Ripper, a powerful pure Java multi-platform web crawler with great workload and speed capabilities, with a separate easy-to-use GUI for downloading web resources. Free download.
  • Diggernaut, lets you turn website content into datasets; no programming skills required.
  • Ficstar, customized web extraction, automated data management, and business intelligence.
  • FMiner, a visual web scraping software with a diagram designer.
  • Helium Scraper, a powerful Web Page Scraper / Web Data Extractor that can be set up to extract from the web virtually anything you can point your mouse at.
  • Import.io, an easy and visual way to download and import web data. Free version.
  • iWebScraping, Web Scraping, Data Extraction, Data Mining Services. Scrape data from YellowPages, Directory, Amazon, eBay, Business Listing, Google Maps.
  • Metafy Anthracite Web Mining Software, visually construct spiders and scrapers without scripts (requires MacOS X 10.4 or newer).
  • Mozenda, More-Zenful-Data, web content mining.
  • MyDataProvider builds web scraping services for ecommerce & business.
  • PDFonline (BCL) Data Extraction Software, extract data from your documents.
  • ProxyCrawl reduces time spent developing scrapers and crawlers. Crawling API protects web scrapers against site ban, IP leak, browser crash, CAPTCHA, and proxy failure. The first 1000 requests are free.
  • Scraping-Bot.io: a great API for efficient web scraping from any listing (retail, real estate, ranking, etc.) without getting blocked. Easy to integrate or use directly on the dashboard, with free calls every month.
  • Scrapy Cloud allows Scrapy/Portia users to crawl ~3 billion pages/month and offers a free plan.
  • Screen Scraper, allows users to scrape structured and unstructured data from websites and format it (free download).
  • Simple Scraper: Web scraping made simple — extract data from any website in seconds and download instantly, scrape in the cloud, or create an API.
  • TheWebMiner, for extracting structured data and custom web scraping services in the cloud.
  • Visual Web Ripper, a powerful visual tool used for automated web scraping, web harvesting and content extraction from the web.
  • Web Data Extraction Services provides robust, cutting-edge solutions and services for data extraction from websites.
  • WebGet.io, a visual web scraping service that is easy to use, with free and low-cost options and the ability to log in to secure sites, plus clicking, looping, change monitoring, image scraping, and more.
  • Webhose.io, easily get instant access to large scale structured data from online Discussions, News, Blogs and more.
  • WebQL, for creating turnkey web extraction applications, such as price collector, patent information aggregator, etc.
  • XML Miner, a system and class library for mining data and text expressed in XML, extracting knowledge and re-using that knowledge in products and applications in the form of fuzzy logic expert system rules.


free and open source

  • Bixo, an open source web mining toolkit that runs as a series of Cascading pipes on top of Hadoop.
  • DEiXTo, a powerful tool for creating "extraction rules" (wrappers) that describe what pieces of data to scrape from a web page; consists of GUI and a stand-alone extraction rule executor.
  • Frontera, a crawl frontier manager that dispatches crawling to multiple spiders in parallel (announcement).
  • GNU Wget, command line tool for retrieving files using HTTP, HTTPS and FTP.
  • Iepy, open-source Information Extraction: get data from your documents or content (iepy on GitHub).
  • Octoparse, a tool to easily extract any unstructured web data into structured data, and save to Excel, HTML, Text, or directly into a database.
  • Pattern, a web mining module for Python; bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider), text analysis (rule-based shallow parser, WordNet interface, tf-idf, ...) and data visualization (graph networks).
  • Portia, the Open Source Visual Web Scraper.
  • Python Web Scraping overview and examples
  • ScraperWiki, a collaborative platform for web-scraping and screen-scraping code and views.
  • Scrapy, a fast high-level screen scraping and web crawling framework in Python.
  • Trapit, system for personalizing content based on keywords, URLs and reading habits.
  • Website Downloader, a completely free way to download a copy of any website and get the contents as a zip.
  • WebSundew, a powerful web scraping and web data extraction tool that extracts data from web pages with high productivity and speed.
