Web Scraping @ Pitt

An introduction to web scraping, including information for getting started, best practices, and listings of out-of-the-box and programmatic web scraping tools.

What is "Web Scraping"?

Web scraping (also called web harvesting or web data extraction) refers to the systematic extraction of data or content (either manual or automated) from a website for later analysis, retrieval, and preservation. This method of collecting data can be useful when you need to acquire a significant amount of data from a website and no API (Application Programming Interface) is available to facilitate the process.

The web is filled with data that may be structured according to HTML or XHTML markup tags, organized in tables, dynamically populated from databases when a page is rendered, stored in files, or altogether unstructured. Web scraping mostly works with data in markup tags, which tell browsers how to display content to site visitors. Web scraping tools can interpret these tags and use them to locate and collect the data they contain.
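
For example, here is a minimal sketch, using only Python's standard-library html.parser, of how a scraper can walk markup tags and pull out the data they contain. The HTML snippet and the choice of the <a> tag are assumptions made for illustration.

    from html.parser import HTMLParser

    # A tiny scraper that collects the text inside every <a> tag.
    class LinkTextParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.in_link = False
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.in_link = True

        def handle_endtag(self, tag):
            if tag == "a":
                self.in_link = False

        def handle_data(self, data):
            if self.in_link:
                self.links.append(data.strip())

    # Hypothetical markup standing in for a fetched page.
    html = '<ul><li><a href="/a">First article</a></li><li><a href="/b">Second article</a></li></ul>'
    parser = LinkTextParser()
    parser.feed(html)
    print(parser.links)  # ['First article', 'Second article']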

Web scraping tools can also enable you to structure the data into formats that suit your research aims as you collect it. For example, you can transform your scraped data into tabular/spreadsheet formats (e.g., CSV, Excel), database formats (e.g., SQLite, DB, MDB), and data serialization formats (e.g., JSON, XML, YAML).
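
For instance, once records have been collected, Python's standard csv and json modules can write them out in tabular and serialized formats. The records below are invented for illustration.

    import csv
    import json

    # Hypothetical records scraped from a site.
    records = [
        {"title": "First article", "url": "/a"},
        {"title": "Second article", "url": "/b"},
    ]

    # Tabular format: write the records to a CSV file.
    with open("scraped.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "url"])
        writer.writeheader()
        writer.writerows(records)

    # Serialization format: write the same records as JSON.
    with open("scraped.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)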

The Ethics of Web Scraping

Because web scraping involves the collection of data produced by and about others, it’s important to consider all the potential ethical and legal implications of your project. Before you begin, make sure you understand any relevant copyright, privacy, and security issues. One concrete step you can take is to check a site's robots.txt file, which states which parts of the site its owner allows automated tools to visit.
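
As a minimal sketch, assuming a placeholder site and user-agent string, Python's standard urllib.robotparser module can perform this check. Passing a robots.txt check is necessary but not sufficient: terms of service, copyright, and privacy considerations still apply.

    from urllib import robotparser

    # Placeholder site and user agent for illustration.
    SITE = "https://www.example.com"
    USER_AGENT = "my-research-bot"

    # Fetch and parse the site's robots.txt file.
    rp = robotparser.RobotFileParser()
    rp.set_url(SITE + "/robots.txt")
    rp.read()

    # Ask whether this user agent may fetch a given page.
    page = SITE + "/articles/index.html"
    if rp.can_fetch(USER_AGENT, page):
        print("robots.txt allows fetching", page)
    else:
        print("robots.txt disallows fetching", page)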

Web Scraping Tools

After considering the ethics surrounding your web scraping project, an important next step is to select a tool that fits your research needs. Web scrapers range from out-of-the-box tools, like manual browser extensions and desktop applications, to programmatic tools that automate the process but require coding or command-line skills. The features and capabilities of web scraping tools vary widely, so you will need to determine which tool strikes the right balance between 1) how well it meets your project needs and 2) the time and effort required to learn it. Some tools have subscription fees, but many are free and open source, or at least offer free versions.

This guide provides listings of web scraping tools organized into two categories: Out-of-the-Box Tools and Programmatic Tools.

Out-of-the-Box Tools

  • Desktop Applications: These tools, which you download to your computer, often provide familiar interface features and easy-to-learn workflows. They are often quite powerful, as they are usually designed for enterprise contexts, and they sometimes come with data storage or subscription fees.
  • Web Browser Extensions: These tools allow you to install an extension or plugin in your Chrome or Firefox browser. They often require more manual work, since you will have to visit each webpage that holds the desired data and select what you want to collect.
  • Web-based Applications: Web applications provide many of the same benefits as desktop applications, but instead of downloading the software, you need only an Internet connection and a web browser to use the app online. These tools are also less likely to have subscription fees, and the data storage limit will depend on the space available on your computer.

Programmatic Tools

  • Application Programming Interfaces (APIs): Technically, any web scraping tool is a kind of Application Programming Interface (API) in that it enables you to interact with data stored on a server. Many data sources (like Google, Amazon, Facebook, and Twitter) offer their own APIs that can help you gather data. Whenever possible, you should use the API provided by the owner of the content you want to collect. This will not only save you time and effort, but also help ensure that your data collection follows the requirements of the data owner; a brief example of querying an API appears after this list.

    Because there are a great many APIs, this guide does not list specific ones. Good places to learn what is available include public API directories and the developer documentation of the sites you want to collect from.
  • Command-line Tools: Like desktop and web applications, these tools can be quite powerful and robust. However, they have no graphical user interface (GUI) and must be run from a text-based command-line interface (CLI). CLIs come pre-installed on most computers: Command Prompt and PowerShell on Windows, Terminal on macOS, and Bash or another Unix shell on Linux. There are several other CLIs that you can download as well. To run a command-line scraping tool (curl and wget are well-known examples), you download it to your computer and enter commands into the CLI. Documentation for the commands is provided by the program's creators, and you can usually find examples and tutorials online.
  • Programming Languages: For large-scale, complex scraping projects, the best option may be a software package or library within a programming language, because you can write scripts that do exactly what you need, from start to finish. These tools may require more up-front learning if you're not already familiar with coding in the language, but they usually come with helpful documentation from their creators, and you can find many examples and tutorials to help you get started. A short sketch of this approach follows the API example below.
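
To illustrate the API route, here is a minimal sketch that queries GitHub's public REST API using only Python's standard library. The repository is an arbitrary example, and a real project should also respect the API's rate limits and terms of use.

    import json
    from urllib.request import Request, urlopen

    # Request repository metadata from GitHub's public REST API.
    # GitHub asks API clients to identify themselves with a User-Agent header.
    url = "https://api.github.com/repos/python/cpython"
    req = Request(url, headers={"User-Agent": "my-research-bot"})

    with urlopen(req) as response:
        repo = json.load(response)

    # The API returns structured JSON, so no HTML parsing is needed.
    print(repo["full_name"], "-", repo["description"])
    print("Stars:", repo["stargazers_count"])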
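
To illustrate the programming-language route end to end, the sketch below fetches a page and extracts its title, again using only Python's standard library; in practice, dedicated packages such as Requests and Beautiful Soup make this work much easier. The target URL is a placeholder.

    from html.parser import HTMLParser
    from urllib.request import urlopen

    # Collect the contents of the page's <title> tag.
    class TitleParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.in_title = False
            self.title = ""

        def handle_starttag(self, tag, attrs):
            if tag == "title":
                self.in_title = True

        def handle_endtag(self, tag):
            if tag == "title":
                self.in_title = False

        def handle_data(self, data):
            if self.in_title:
                self.title += data

    # Placeholder target page.
    with urlopen("https://www.example.com/") as response:
        html = response.read().decode("utf-8")

    parser = TitleParser()
    parser.feed(html)
    print("Page title:", parser.title)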