"Web Scraping @ Pitt" by University of Pittsburgh Library System is licensed for reuse under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Web scraping (also called web harvesting or web data extraction) is the systematic extraction of data or content from a website, whether manual or automated, for later analysis, retrieval, or preservation. This method of collecting data can be useful when you need to acquire a significant amount of data from a website and no API (Application Programming Interface) is available to facilitate the process.
The web is filled with data that may be structured according to HTML or XHTML markup tags, organized in tables, dynamically populated from databases when a page is rendered, stored in files, or altogether unstructured. Web scraping mostly works with data enclosed in markup tags, which tell browsers how to display that data to site visitors. Web scraping tools can interpret these tags and follow your instructions on how to collect the data they contain.
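To make the idea of "interpreting tags" concrete, here is a minimal sketch in Python that uses only the standard library's `html.parser` module. The sample HTML, the `TableScraper` class, and the `scrape_table` function are hypothetical names invented for this illustration; real pages are usually messier, and dedicated scraping tools handle many more cases.

```python
from html.parser import HTMLParser

# Hypothetical page fragment used only for this illustration.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>First Report</td><td>2019</td></tr>
  <tr><td>Second Report</td><td>2021</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collects the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while we are inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table rows found in an HTML string."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape_table(SAMPLE_HTML)
print(rows)
```

The parser reacts to opening and closing tags and keeps only the text that appears inside table cells, which is roughly what a scraping tool does when you tell it which tags hold the data you want.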
Web scraping tools can also enable you to structure the data, as you collect it, into formats that suit your research aims. For example, you can transform your scraped data into tabular/spreadsheet formats (e.g., CSV, Excel), database formats (e.g., SQLite, DB, MDB), and data serialization formats (e.g., JSON, XML, YAML).
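As a rough illustration of this step, the sketch below takes a small set of already-scraped records and writes them out as CSV, JSON, and an SQLite table using only Python's standard library. The records, the `reports` table name, and the helper function names are hypothetical; the example keeps everything in memory so it can run without creating files.

```python
import csv
import io
import json
import sqlite3

# Hypothetical scraped records used only for this illustration.
records = [
    {"title": "First Report", "year": 2019},
    {"title": "Second Report", "year": 2021},
]


def to_csv(records):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "year"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialize the records as a JSON array of objects."""
    return json.dumps(records, indent=2)


def to_sqlite(records):
    """Load the records into an in-memory SQLite table and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reports (title TEXT, year INTEGER)")
    conn.executemany(
        "INSERT INTO reports (title, year) VALUES (:title, :year)", records
    )
    conn.commit()
    return conn


csv_text = to_csv(records)
json_text = to_json(records)
conn = to_sqlite(records)

print(csv_text)
print(json_text)
print(conn.execute("SELECT title FROM reports WHERE year > 2020").fetchall())
```

To save the results to disk instead, you could write `csv_text` and `json_text` to files and pass a file path rather than `":memory:"` to `sqlite3.connect`.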
Because web scraping involves the collection of data produced by and about others, it’s important to consider all the potential ethical and legal implications of your project. Prior to your project, you should make sure you understand any relevant copyright, privacy, and security issues. Here are some steps you might take:
After considering the ethical and legal issues surrounding your web scraping project, an important next step is to select a tool that fits your research needs. Web scrapers range from out-of-the-box tools, like manual browser extensions and desktop applications, to programmatic tools that automate the process but require coding or command-line skills. The features and capabilities of web scraping tools vary widely, so you will need to determine which tool strikes the right balance between 1) how well it meets your project needs and 2) the time and effort required to learn how to use it. Some tools have subscription fees, but many are free and open source, or at least have free versions.
This guide provides listings of web scraping tools organized into two categories, Out-of-the-Box Tools and Programmatic Tools.
APIs for Scholarly Resources. Nicole Hennig. CEU Library. Central European University.
Resources and Tools for Computational Research. Jennie Murack. MIT Libraries. Massachusetts Institute of Technology.
Resources for Text and Data Mining: APIs and WebScraping. Erica Bruchko. Emory Libraries. Emory University.
Text Mining & Analysis @ Pitt: Sources of Text Data – Social Media. Tyrica Terry Kapral. University of Pittsburgh Library System.