Some websites offer their data in a downloadable format or allow data access through Application Programming Interfaces (APIs). These methods of gathering data are authorized by the data owner and can save you significant time and effort; you may not even need a separate web scraping tool.
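If a site does offer an API, a plain HTTP client is often all you need. The sketch below uses Python's requests library against a hypothetical endpoint and API key (both placeholders); the real URL, authentication scheme, and pagination parameters will come from the provider's documentation.

```python
import requests

# Hypothetical endpoint and API key -- replace with values from the
# data provider's API documentation.
API_URL = "https://api.example.com/v1/records"
API_KEY = "your-api-key-here"

response = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"page": 1, "per_page": 100},  # many APIs paginate results
    timeout=30,
)
response.raise_for_status()  # fail loudly on HTTP errors
records = response.json()    # most APIs return JSON
print(f"Fetched {len(records)} records")
```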
Robots.txt is a text file that web administrators create to tell search engine robots and other web scrapers how to crawl and index pages on their website. You can usually find it at the root of the website (for example, https://www.example.com/robots.txt). Check this file before you even plan how to extract data, because it determines how your crawler should interact with the site. One important directive is the crawl delay, which specifies how long a crawler must wait between requests to the site. If a site asks not to be crawled at all, it is best to respect that request and avoid potential legal repercussions.
For example, check out these robots.txt files:
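You can also check a site's rules programmatically. Python's standard-library urllib.robotparser reads a robots.txt file and reports whether a given user agent may fetch a given URL, and what crawl delay, if any, is requested. A minimal sketch, using placeholder site and user-agent values:

```python
from urllib.robotparser import RobotFileParser

# Placeholder site and user-agent string -- substitute your own.
SITE = "https://www.example.com"
USER_AGENT = "my-research-bot"

parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # fetch and parse the robots.txt file

url = f"{SITE}/some/page.html"
if parser.can_fetch(USER_AGENT, url):
    print("Allowed to fetch:", url)
    # crawl_delay() returns the requested delay in seconds, or None
    print("Requested crawl delay:", parser.crawl_delay(USER_AGENT))
else:
    print("robots.txt disallows fetching:", url)
```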
Every time you make a request to a website, the server has to spend resources responding to you. Keep the volume and frequency of your requests low, and space requests out rather than sending them in a constant stream. Hitting a server at a high, steady rate creates extra traffic that can slow it down, or even bring it down, while it tries to serve other requests. That degrades the experience of human users, whose needs matter more than your crawler's. If a website specifies a crawl frequency or delay, follow it; doing so also makes it less likely that the site will block you.
Here are a few ways to minimize the load you put on a server:
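Common approaches include respecting any stated crawl delay, pausing between requests, reusing connections, and identifying your crawler so the site can contact you. The sketch below illustrates the last three under some assumptions: the URL list, the one-second delay, and the contact address are all placeholders.

```python
import time
import requests

# Placeholder URLs and delay -- adjust to the target site's stated limits.
urls = [
    "https://www.example.com/page1",
    "https://www.example.com/page2",
]
DELAY_SECONDS = 1.0

# A Session reuses the underlying TCP connection, reducing server overhead,
# and lets you identify your crawler with a descriptive User-Agent.
with requests.Session() as session:
    session.headers.update(
        {"User-Agent": "my-research-bot (contact@example.edu)"}
    )
    for url in urls:
        response = session.get(url, timeout=30)
        print(url, response.status_code)
        time.sleep(DELAY_SECONDS)  # pause instead of hammering the server
```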
Sometimes websites have multiple URLs that point to the same webpage, which could lead to scraping duplicate content. This wastes time and resources, and duplicate data is usually not what you want. If possible, your web scrapers should use canonical URLs. These are HTML link elements with the rel="canonical" attribute (the canonical tag), found in the head element of the webpage, that specify to search engines the preferred version of the webpage/URL.
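If you parse pages with a library such as BeautifulSoup (an assumption here, not a requirement of any particular tool), you can read the canonical link out of the page's head and deduplicate on it. A minimal sketch:

```python
from bs4 import BeautifulSoup

# A toy page with a canonical link element; in practice this would be
# the HTML you just downloaded.
html = """
<html>
  <head><link rel="canonical" href="https://www.example.com/article"></head>
  <body>...</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
canonical = soup.find("link", rel="canonical")
if canonical and canonical.get("href"):
    print("Canonical URL:", canonical["href"])  # deduplicate pages on this value
else:
    print("No canonical link declared; fall back to the request URL")
```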
As you extract data, parse it continuously and verify regularly that it is correct. Data parsing is the process of transforming scraped data into a structured format for analysis, which makes it much easier to review and to spot problems. The last thing you want is to collect hundreds or thousands of pages' worth of messy, useless data, so don't leave this step to the end of the web scraping process. Many web scraping tools include or support data parsing features that structure the data automatically using predefined rules.
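As a sketch of what "parse as you go" can look like, the snippet below (assuming BeautifulSoup and a made-up page structure) turns each scraped item into a dictionary, flags incomplete records immediately, and writes the structured results to a CSV you can inspect while the crawl continues.

```python
import csv
from bs4 import BeautifulSoup

# Made-up HTML structure for illustration; real selectors depend on the site.
html = """
<div class="item"><span class="title">Widget</span><span class="price">$9.99</span></div>
<div class="item"><span class="title">Gadget</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
records = []
for item in soup.select("div.item"):
    title = item.select_one("span.title")
    price = item.select_one("span.price")
    record = {
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    }
    if None in record.values():
        print("Warning: incomplete record:", record)  # verify data as you go
    records.append(record)

# Write structured data out so problems surface early, not after the crawl.
with open("scraped_items.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(records)
```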
Honeypot traps, or honeypot links, are links placed on websites to detect web scrapers. They are invisible to human visitors, typically hidden with CSS (for example, display: none or visibility: hidden, or styled to blend into the page background), so only automated scrapers will find and follow them. If a honeypot link is accessed, the server can conclude that the visitor is not a human and start blocking the requesting IP address or serving it misleading data.
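One partial defence, sketched below with BeautifulSoup (an assumption, as before), is to skip links whose inline style hides them before following them. Honeypots hidden through external stylesheets or other tricks will not be caught this way, so treat this as a first filter rather than a guarantee.

```python
from bs4 import BeautifulSoup

# Toy HTML: the second link is styled so humans never see it.
html = """
<a href="/products">Products</a>
<a href="/trap" style="display: none">Secret</a>
"""

HIDDEN_MARKERS = (
    "display:none", "display: none",
    "visibility:hidden", "visibility: hidden",
)

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a", href=True):
    style = (link.get("style") or "").lower()
    if any(marker in style for marker in HIDDEN_MARKERS):
        continue  # likely a honeypot; do not follow
    print("Safe to follow:", link["href"])
```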
There are other mechanisms by which web scrapers can be detected and thwarted. See the Further Reading list below for additional tips and best practices.
It's important to be aware of the possible ethical and legal issues around the data you are scraping. Check the target website's Terms of Service before scraping, and make sure your scrapers comply with those terms and with any other applicable rules, such as copyright and privacy law (for example, the GDPR) and doctrines like trespass to chattels.
If you're not sure whether or how you are allowed to scrape a website, contact the website owner and ask. If you are unsure about the legality of scraping a site, consult the website owner, an expert advisor, the Human Research Protection Office (HRPO; formerly the IRB), and/or a lawyer.
7 Web Scraping Best Practices You Must Be Aware of in ’23. Gulbahar Karatas. AIMultiple. March 6, 2023. https://research.aimultiple.com/web-scraping-best-practices/.
13 Web Scraping Best Practices and Tips. Tony Paul. Datahut (blog). June 24, 2021. https://www.blog.datahut.co/post/web-scraping-best-practices-tips.
Ethics & Legality of Webscraping – Introduction to Webscraping. Library Carpentry. The Carpentries. https://ucsbcarpentry.github.io/2022-05-12-ucsb-webscraping/06-Ethics-Legality-Webscraping/index.html.
Being a Good Scraper of the Web. Social Science Computing Cooperative. Introduction to Web Scraping with R. University of Wisconsin–Madison. October 2021. https://sscc.wisc.edu/sscc/pubs/webscraping-r/index.html#being-a-good-scraper-of-the-web.
Web Scraping: Introduction, Best Practices & Caveats. Velotio Technologies. Velotio Perspectives (blog). August 24, 2020. https://medium.com/velotio-perspectives/web-scraping-introduction-best-practices-caveats-9cbf4acc8d0f.
Web Scraping: 10 Best Practices And Tips. Emad Bin Abid. API Layer (blog). August 21, 2022. https://blog.apilayer.com/web-scraping-10-best-practices-and-tips/.