Guides: Text Mining & Analysis @ Pitt: Web Scraping

Web scraping is the process of identifying and extracting specific pieces of text from a website, such as author names or article titles, using a programming language or a web-based tool. Before using a web scraper, researchers must consider any copyright restrictions that apply to the website. In some cases, websites allow text mining only through their own proprietary APIs.
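
As a rough illustration of what "identifying and extracting" means in practice, the sketch below pulls article titles and author names out of a small HTML snippet using only Python's standard library. The sample markup and the class names "title" and "author" are invented for this example; a real site will use its own structure, which has to be inspected first.

```python
from html.parser import HTMLParser

# Hypothetical sample page; the class names "title" and "author" are assumptions.
SAMPLE_HTML = """
<html><body>
  <div class="article">
    <h2 class="title">Mining Historical Newspapers</h2>
    <span class="author">A. Researcher</span>
  </div>
  <div class="article">
    <h2 class="title">Topic Models for Court Records</h2>
    <span class="author">B. Scholar</span>
  </div>
</body></html>
"""


class ArticleParser(HTMLParser):
    """Collect the text of elements whose class is 'title' or 'author'."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.authors = []
        self._target = None  # list that should receive the next text node

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if "title" in classes:
            self._target = self.titles
        elif "author" in classes:
            self._target = self.authors

    def handle_data(self, data):
        text = data.strip()
        if self._target is not None and text:
            self._target.append(text)
            self._target = None

    def handle_endtag(self, tag):
        # Stop collecting once the current element closes.
        self._target = None


def extract_articles(html):
    """Return a list of (title, author) pairs found in the HTML string."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return list(zip(parser.titles, parser.authors))


articles = extract_articles(SAMPLE_HTML)
for title, author in articles:
    print(f"{title} / {author}")
```

The same pattern (find the elements that hold the text of interest, then collect their contents) underlies most of the tools listed in this guide, whether they are point-and-click or programmatic.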

Tools

Out-of-the-Box*

Octoparse
For scraping simple and dynamic websites using HTML, CSS, AJAX technologies, JavaScript, redirects, cookies, search bars, etc.
Parsehub
For scraping simple and dynamic websites using HTML, CSS, AJAX technologies, JavaScript, redirects, cookies, search bars, etc.
NCapture
For scraping web content to import into NVivo
Outwit (Hub, Images, Docs, Email Sourcer)
For scraping simple and dynamic websites (using HTML, CSS, AJAX technologies, JavaScript, redirects, cookies, CAPTCHA tests, etc.), news media, social media sites, images, documents, and email
Portia
For scraping simple and dynamic websites using HTML, CSS, AJAX technologies, JavaScript, redirects, cookies, etc.
Import.io
For scraping simple and dynamic websites using HTML, CSS, AJAX technologies, JavaScript, redirects, cookies, etc.
Webhose.io
For scraping simple and dynamic websites using HTML, CSS, AJAX technologies, JavaScript, redirects, cookies, etc.
Spinn3r
For social media, blogs, news sites, videos, RSS feeds, and live web content
Data Scraper
For simple HTML and CSS websites
Web scraper
For scraping simple and dynamic websites using HTML, CSS, AJAX technologies, JavaScript, redirects, cookies, etc.
Scraper
For scraping tables on simple HTML websites
wget
For retrieving files using HTTP, HTTPS, FTP, and FTPS, the most widely used Internet protocols; a non-interactive command-line tool, so it can easily be called from scripts.
OpenRefine
For fetching, exploring, cleaning, transforming, reconciling and matching data

* For a detailed comparison of these tools, see the Non-Coding Web Scraping Tools chart.

Programmatic

Python

Scrapy
For extracting the data you need from websites
Beautiful Soup
For parsing and extracting data from HTML and XML documents
Pattern
For web scraping (Google, Wikipedia, Twitter, Facebook, generic RSS, etc.), web crawling, HTML DOM parsing, part-of-speech tagging, n-gram search, sentiment analysis, vector space modeling, clustering, classification (KNN, SVM, Perceptron), graph centrality and visualization
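
To give a sense of what parsing with Beautiful Soup looks like, here is a minimal sketch that extracts titles, links, and authors from a hypothetical search-results list. It assumes the third-party beautifulsoup4 package is installed; the sample HTML, its class names, and the parse_results helper are invented for illustration and are not part of the library.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical markup; real pages will differ and must be inspected first.
SAMPLE_HTML = """
<ul id="results">
  <li class="article">
    <a href="/articles/1">Mining Historical Newspapers</a>
    <span class="author">A. Researcher</span>
  </li>
  <li class="article">
    <a href="/articles/2">Topic Models for Court Records</a>
    <span class="author">B. Scholar</span>
  </li>
</ul>
"""


def parse_results(html):
    """Return one dict per <li class="article"> element in the HTML string."""
    # "html.parser" is Python's built-in parser, so no extra parser is needed.
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for item in soup.select("li.article"):  # CSS selector lookup
        link = item.find("a")
        author = item.find("span", class_="author")
        records.append({
            "title": link.get_text(strip=True),
            "url": link["href"],
            "author": author.get_text(strip=True),
        })
    return records


records = parse_results(SAMPLE_HTML)
for record in records:
    print(record["title"], record["url"], record["author"])
```

For a live site, the HTML string would come from a downloaded page rather than a constant, subject to the copyright and API considerations noted at the top of this guide.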

R

rvest
For scraping (or harvesting) data from web pages
selectr
For translating CSS selectors to XPath expressions
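
The idea of translating CSS selectors into XPath can be sketched in a few lines. The example below is written in Python, to match the other examples in this guide, and is a hand-rolled toy rather than selectr itself: the css_to_xpath function is hypothetical and handles only tag names, #id, a single .class (matched exactly), and the descendant and child combinators. Real translators cover far more of the CSS grammar, including elements with several classes.

```python
import re
import xml.etree.ElementTree as ET


def css_to_xpath(selector):
    """Translate a very small CSS subset into an XPath expression.

    Supported: tag, #id, .class (exact match on the whole class attribute),
    descendant combinator (space), and child combinator (>).
    """
    tokens = re.findall(r">|[^\s>]+", selector)
    xpath = "."
    axis = "//"  # a selector starts by searching all descendants
    for token in tokens:
        if token == ">":
            axis = "/"  # the next step must be a direct child
            continue
        match = re.fullmatch(r"([\w-]*)(?:#([\w-]+))?(?:\.([\w-]+))?", token)
        if match is None:
            raise ValueError(f"unsupported selector part: {token!r}")
        tag, id_, cls = match.groups()
        step = tag or "*"
        if id_:
            step += f"[@id='{id_}']"
        if cls:
            step += f"[@class='{cls}']"
        xpath += axis + step
        axis = "//"  # default back to the descendant combinator
    return xpath


# Hypothetical, well-formed sample document.
doc = ET.fromstring(
    "<body>"
    "<div class='article'><h2>First</h2><p><h2>Nested</h2></p></div>"
    "<h2>Outside</h2>"
    "</body>"
)

child_xpath = css_to_xpath("div.article > h2")
print(child_xpath)
print([h2.text for h2 in doc.findall(child_xpath)])
```

Here the CSS selector "div.article > h2" becomes an XPath expression that selects only the heading directly inside the article block, which is the kind of mapping selectr performs for R packages that query documents with XPath.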

University of Pittsburgh Library System