Text Mining & Analysis @ Pitt

An introduction to text mining/analysis and resources for finding text data, preparing text data for analysis, methods and tools for analyzing text data, and further readings regarding text mining and its various methods.

Web Scraping

Web scraping is the process of identifying and extracting specific kinds of text from a website, such as author names or article titles, using a programming language or a web-based tool. When using web scrapers, researchers must consider a website's possible copyright restrictions; in some cases, a website allows text mining only through its proprietary API.
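
As a minimal illustration of the programmatic approach, the sketch below fetches a page and pulls out article titles with the Python packages requests and Beautiful Soup (Beautiful Soup is listed under Programmatic tools below). The URL and the "article-title" class name are hypothetical placeholders, not a real site's markup.

    # Minimal web-scraping sketch: fetch a page and extract article titles.
    # Before scraping a real site, check its terms of use and robots.txt.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get("https://example.com/articles")   # placeholder URL
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Collect the text of every <h2 class="article-title"> element on the page
    titles = [h2.get_text(strip=True)
              for h2 in soup.find_all("h2", class_="article-title")]

    for title in titles:
        print(title)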

Tools

Out-of-the-Box*
  • Octoparse
    For scraping simple and dynamic websites using HTML, CSS, AJAX technologies, JavaScript, redirects, cookies, search bars, etc.

  • Parsehub
    For scraping simple and dynamic websites using HTML, CSS, AJAX technologies, JavaScript, redirects, cookies, search bars, etc.

  • NCapture
    For scraping web content to import into NVivo

  • OutWit (Hub, Images, Docs, Email Sourcer)
    For scraping simple and dynamic websites (using HTML, CSS, AJAX technologies, JavaScript, redirects, cookies, CAPTCHA tests, etc.), news media, social media sites, images, documents, and email

  • Portia
    For scraping simple and dynamic websites using HTML, CSS, AJAX technologies, JavaScript, redirects, cookies, etc.

  • Import.io
    For scraping simple and dynamic websites using HTML, CSS, AJAX technologies, JavaScript, redirects, cookies, etc.

  • Webhose.io
    For scraping simple and dynamic websites using HTML, CSS, AJAX technologies, JavaScript, redirects, cookies, etc.

  • Spinn3r
    For social media, blogs, news sites, videos, RSS feeds, and live web content

  • Data Scraper
    For simple HTML and CSS websites

  • Web Scraper
    For scraping simple and dynamic websites using HTML, CSS, AJAX technologies, JavaScript, redirects, cookies, etc.

  • Scraper
    For scraping tables on simple HTML websites

  • wget
    For retrieving files using HTTP, HTTPS, FTP, and FTPS, the most widely used Internet protocols; a non-interactive command-line tool, so it can easily be called from scripts (see the sketch after this list)

  • OpenRefine
    For fetching, exploring, cleaning, transforming, reconciling and matching data

* For a detailed comparison of these tools, see the Non-Coding Web Scraping Tools chart.  
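
As a small illustration of wget's scriptability, the sketch below drives it from Python with the standard-library subprocess module. It assumes wget is installed and on the PATH; the URL and the downloads directory are placeholders.

    # Sketch: calling wget non-interactively from a Python script.
    import subprocess

    result = subprocess.run(
        [
            "wget",
            "--quiet",                        # suppress progress output
            "--directory-prefix=downloads",   # save files under ./downloads
            "https://example.com/corpus/page1.html",   # placeholder URL
        ],
        check=False,   # inspect the exit status ourselves instead of raising
    )

    if result.returncode != 0:
        print(f"wget exited with status {result.returncode}")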

Programmatic

Python

  • Scrapy
    For extracting the data you need from websites (see the spider sketch after this list)

  • Beautiful Soup
    For parsing and extracting data from HTML and XML documents

  • Pattern
    For web scraping (Google, Wikipedia, Twitter, Facebook, generic RSS, etc.), web crawling, HTML DOM parsing, part-of-speech tagging, n-gram search, sentiment analysis, vector space modeling, clustering, classification (KNN, SVM, Perceptron), and graph centrality and visualization
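
A rough sketch of how a Scrapy spider is typically structured is shown below; the URL and CSS selectors are hypothetical placeholders for a site that lists article titles and paginates with a "next" link.

    # Minimal Scrapy spider sketch: crawl a listing page and yield article titles.
    # Run with, e.g.: scrapy runspider titles_spider.py -o titles.json
    import scrapy

    class TitlesSpider(scrapy.Spider):
        name = "titles"
        start_urls = ["https://example.com/articles"]   # placeholder URL

        def parse(self, response):
            # Extract the text of each hypothetical <h2 class="article-title"> element
            for title in response.css("h2.article-title::text").getall():
                yield {"title": title.strip()}

            # Follow the "next page" link, if the site provides one
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)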

R

  • rvest
    For scraping (or harvesting) data from web pages

  • selectr
    For translating CSS selectors to XPath expressions (illustrated in the sketch below)
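
selectr implements in R the same CSS-to-XPath translation that Python's cssselect package provides; to keep the examples on this page in one language, the sketch below uses cssselect to show roughly what such a translation produces. The selectors are arbitrary examples.

    # Illustration of CSS-selector-to-XPath translation, using Python's cssselect
    # package as a stand-in for what selectr does in R.
    from cssselect import GenericTranslator

    translator = GenericTranslator()

    # Arbitrary example selectors; each is printed alongside its XPath equivalent
    for css in ["h2.article-title", "div#content > p", "a[href]"]:
        print(css, "->", translator.css_to_xpath(css))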