Web Scraping @ Pitt

An introduction to web scraping, including information for getting started, best practices, and listings of out-of-the-box and programmatic web scraping tools.

Best Practices

Check if the website supports direct downloads or an API

Some websites offer their data in a downloadable format or provide access to it through an Application Programming Interface (API). These methods of gathering data are authorized by the data owner, can save you significant time and effort, and may remove the need for a separate web scraping tool altogether.
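
If a site does expose an API, a few lines of code are often all you need. The sketch below uses the third-party requests library against a hypothetical endpoint; the URL, query parameters, and response fields are placeholders, not a real service.

    import requests

    # Hypothetical endpoint and parameters -- replace them with whatever
    # the site's API documentation actually specifies.
    API_URL = "https://api.example.com/v1/records"

    response = requests.get(API_URL, params={"q": "web scraping", "per_page": 50}, timeout=30)
    response.raise_for_status()        # fail loudly if the request was rejected

    data = response.json()             # most APIs return structured JSON
    for record in data.get("results", []):
        print(record)

Because the data arrives already structured (typically JSON or CSV), there is no HTML to parse at all.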

Respect Robots.txt

Robots.txt is a text file that web administrators create to tell search engine robots and other web crawlers how they may crawl and index pages on their website. You can usually find it at the root of the domain (for example, https://example.com/robots.txt). Check this file before you even plan how you will extract data from a website, because it determines how your crawler should interact with the site: which paths are allowed or disallowed, and sometimes a crawl delay, meaning crawlers should wait a specified interval between requests. If a site owner has asked not to be crawled, it's best not to, in order to avoid potential legal repercussions.

For example, take a look at the robots.txt files of a few large websites (usually found at https://<domain>/robots.txt) to see what these directives look like in practice.
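
As an illustration, the sketch below uses Python's built-in urllib.robotparser to interpret a made-up robots.txt (the rules, domain, and user-agent name are all hypothetical) and decide whether a path may be crawled and how long to wait between requests:

    from urllib import robotparser

    # A made-up robots.txt, kept as a list of lines so the example is self-contained.
    SAMPLE_ROBOTS_TXT = [
        "User-agent: *",
        "Crawl-delay: 10",
        "Disallow: /admin/",
        "Disallow: /search",
        "",
        "User-agent: BadBot",
        "Disallow: /",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(SAMPLE_ROBOTS_TXT)

    # May a generic, well-behaved crawler fetch these paths?
    print(rp.can_fetch("MyResearchBot", "https://example.com/articles/123"))  # True
    print(rp.can_fetch("MyResearchBot", "https://example.com/admin/users"))   # False

    # How long should it wait between requests?
    print(rp.crawl_delay("MyResearchBot"))  # 10 (seconds)

    # For a live site, point the parser at the real file instead:
    # rp.set_url("https://example.com/robots.txt"); rp.read()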

Don't overload the website server

Every time you make a request to a website, its server has to spend resources producing a response, so keep the volume of your requests low and space them out over time. Hitting the server at a constant, rapid rate creates heavy traffic, which can slow performance or even cause the site to fail while serving other requests. That degrades the experience of human users, whom the site cares about far more than crawlers. If a website specifies a request interval for crawlers, follow it; doing so also helps keep your scraper from being blocked by the target website.

Here are a few ways to minimize the load you put on a server:

  • Scrape during off-peak hours, when fewer users are visiting the site and server traffic is lower. You can estimate these hours from the time zones of the regions where most of the site's traffic originates.
  • Limit the number of parallel / concurrent requests to the target website.
  • Spread the requests across multiple IP addresses.
  • Add delays between successive requests, following the interval specified in robots.txt or a conservative default such as 10 seconds (see the sketch after this list).
  • Cache your web scraper's HTTP requests and responses to avoid making unnecessary requests.
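
Here is a minimal sketch of the delay approach from the list above, using the third-party requests library; the URLs and the 10-second delay are placeholders (take the delay from robots.txt when one is given):

    import time
    import requests

    URLS = [                      # hypothetical pages to fetch
        "https://example.com/page/1",
        "https://example.com/page/2",
        "https://example.com/page/3",
    ]

    CRAWL_DELAY = 10              # seconds between requests

    for url in URLS:
        response = requests.get(url, timeout=30)
        print(url, response.status_code)
        time.sleep(CRAWL_DELAY)   # pause so the server is not hit at a constant, rapid rate

For caching, the third-party requests-cache package can store responses transparently, so re-running the scraper does not re-request pages it has already fetched.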

Use Canonical URLs

Sometimes websites have multiple URLs that point to the same webpage, which could lead to scraping duplicate content. This wastes time and resources, and duplicate data is usually not what you want. If possible, your web scrapers should use canonical URLs. These are HTML link elements with the rel="canonical" attribute (or canonical tag), found in the head element of the webpage, that specify to search engines the preferred version of the webpage/URL.
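
As a sketch, a scraper might read the canonical link element before storing a page and use it as the deduplication key. The example below uses the third-party BeautifulSoup (beautifulsoup4) library on a made-up HTML fragment:

    from bs4 import BeautifulSoup   # third-party package: beautifulsoup4

    # Made-up HTML standing in for a page fetched from a tracking-parameter URL.
    requested_url = "https://example.com/articles/web-scraping?utm_source=newsletter"
    html = """
    <html>
      <head>
        <link rel="canonical" href="https://example.com/articles/web-scraping">
      </head>
      <body><p>Article text...</p></body>
    </html>
    """

    soup = BeautifulSoup(html, "html.parser")
    canonical = soup.find("link", attrs={"rel": "canonical"})

    # Prefer the canonical URL; fall back to the URL that was actually requested.
    page_key = canonical["href"] if canonical and canonical.get("href") else requested_url
    print(page_key)   # use this as the key when checking for duplicates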

Continuously parse and verify scraped data

As you extract data, continuously parse it and regularly verify that it is correct. Data parsing is the process of transforming scraped data into a structured format for analysis, which makes it much easier to review and to spot problems. The last thing you want is to collect hundreds or thousands of pages' worth of messy, useless data, so don't leave this step to the end of the web scraping process. Many web scraping tools include, or can be paired with, data parsing features that structure the data automatically using predefined rules.
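
One lightweight way to do this is to validate every record as soon as it is parsed rather than after the crawl finishes. The field names and rules below are hypothetical placeholders for whatever your project actually collects:

    # Hypothetical records produced by the parsing step (e.g., scraped product pages).
    records = [
        {"title": "Widget A", "price": "19.99", "url": "https://example.com/a"},
        {"title": "", "price": "N/A", "url": "https://example.com/b"},
    ]

    def validate(record):
        """Return a list of problems found in one parsed record."""
        problems = []
        if not record.get("title"):
            problems.append("missing title")
        try:
            float(record.get("price", ""))
        except ValueError:
            problems.append(f"price is not numeric: {record.get('price')!r}")
        return problems

    for record in records:
        issues = validate(record)
        if issues:
            # Surface problems immediately, before thousands more pages are collected.
            print(f"{record['url']}: {', '.join(issues)}")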

Beware of Honeypot Traps

Honeypot traps, or honeypot links, are links placed on websites to detect web scrapers. They are present in the HTML but hidden from human visitors, usually through CSS (for example, display: none, visibility: hidden, or styling that blends the link into the background). Because a human browsing the site would never click them, the server can infer that any visitor requesting a honeypot link is an automated scraper and start blocking the requesting IP address or serving it misleading data.
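
A scraper can screen out the most common honeypots by skipping links whose inline styles hide them. The sketch below uses the third-party BeautifulSoup library on a made-up fragment; it only catches inline styles, not rules buried in external stylesheets, so treat it as a partial defense:

    from bs4 import BeautifulSoup   # third-party package: beautifulsoup4

    html = """
    <a href="/products">Products</a>
    <a href="/trap-1" style="display: none">secret</a>
    <a href="/trap-2" style="visibility: hidden">secret</a>
    """

    def looks_hidden(tag):
        """Heuristic: the link's inline style hides it from human visitors."""
        style = (tag.get("style") or "").replace(" ", "").lower()
        return "display:none" in style or "visibility:hidden" in style

    soup = BeautifulSoup(html, "html.parser")
    visible_links = [a["href"] for a in soup.find_all("a", href=True) if not looks_hidden(a)]
    print(visible_links)   # ['/products']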

There are other mechanisms by which websites can detect and thwart web scrapers. See the Further Reading section below for additional tips on best practices.

Use the scraped data responsibly

It's important to be aware of the potential ethical and legal issues around the data you are scraping. Check the target website's Terms of Service before scraping and make sure your scrapers comply with those terms as well as with applicable law, including copyright, privacy, and data protection rules (e.g., the GDPR) and doctrines such as trespass to chattels.

When in doubt, ask

If you're not sure whether or how you are allowed to scrape a website, contact the website owner and ask. If you are unsure about the legality of scraping a site, consult the website owner, an expert advisor, the Human Research Protection Office (HRPO; formerly the IRB office), and/or a lawyer.

Further Reading