Whitepaper, 2024-09-26

Scraping the Web for use in Text and Data Mining

Dorian Cougias, Vicki McEwen, Steven Piliero, Helmut Neher, Imad Ibrahim, Austin Mack

Defines the layered, strictest-rule-wins assessment for whether web content may be scraped for text and data mining (TDM). The layers are license, public domain, jurisdictional exception, and Creative Commons, followed by organizational restrictions (robots.txt, terms of service, copyright, captcha/clickwrap, metadata, API). Includes the 50-state plus 5-territory public-domain appendix.
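The layered assessment can be illustrated with a minimal Python sketch. It is not taken from the whitepaper: the `Verdict` scale, the `assess` function, the layer key names, and the choice to deny when no layer reports anything are all assumptions made for illustration. Only the list of layers and the strictest-rule-wins idea come from the summary above.

```python
from enum import IntEnum
from typing import Mapping, Optional


class Verdict(IntEnum):
    """Hypothetical scale: a higher value is a stricter outcome."""
    ALLOW = 0
    CONDITIONAL = 1
    DENY = 2


# Layers in the order the summary lists them (key names are assumptions).
LAYERS = (
    "license",
    "public_domain",
    "jurisdictional_exception",
    "creative_commons",
    # Organizational restrictions:
    "robots_txt",
    "terms_of_service",
    "copyright",
    "captcha_clickwrap",
    "metadata",
    "api",
)


def assess(findings: Mapping[str, Optional[Verdict]]) -> Verdict:
    """Return the strictest verdict among the layers that reported one.

    A layer that is missing or maps to None is treated as "not applicable".
    Assumption: if no layer reports anything, the sketch denies by default.
    """
    applicable = [findings[name] for name in LAYERS if findings.get(name) is not None]
    if not applicable:
        return Verdict.DENY
    return max(applicable)


if __name__ == "__main__":
    # A Creative Commons grant is overridden by a stricter robots.txt finding.
    example = {"creative_commons": Verdict.ALLOW, "robots_txt": Verdict.DENY}
    print(assess(example).name)  # prints "DENY"
```

Modelling verdicts as an ordered enum keeps the strictest-rule-wins step down to a single `max` call; a real assessment would also need to record which layer produced the deciding verdict and why.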

Citation: Cougias, D., McEwen, V., Piliero, S., Neher, H., Ibrahim, I., & Mack, A. (2024). Scraping the Web for use in Text and Data Mining. ResearchGate.

Backs these rules

robots-semantics, technical-barriers, terms-of-service, api-access, jurisdiction-domains, cc-licenses, rights-assembly