1. Ignoring Website Policies (robots.txt / TOS)
Always check the site's robots.txt before scraping (e.g., https://example.com/robots.txt); see the sketch after this list
Respect crawl delays and disallowed paths
Use public or scrape-friendly websites
Add disclaimers if you share the data
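One way to check a URL against robots.txt before requesting it is Python's built-in urllib.robotparser; the site, path, and user-agent string below are placeholders:

    from urllib.robotparser import RobotFileParser

    # Placeholder site and user agent -- swap in the site you actually plan to scrape
    ROBOTS_URL = "https://example.com/robots.txt"
    USER_AGENT = "MyScraperBot"

    rp = RobotFileParser()
    rp.set_url(ROBOTS_URL)
    rp.read()  # fetch and parse robots.txt

    # can_fetch() tells you whether a path is disallowed for your user agent
    if rp.can_fetch(USER_AGENT, "https://example.com/products/page1"):
        print("Allowed to fetch this path")
    else:
        print("Disallowed -- skip it")

    # crawl_delay() returns the Crawl-delay directive for this agent, or None
    print("Crawl delay:", rp.crawl_delay(USER_AGENT))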
2. Scraping Too Fast = Getting Blocked
Add randomized delays (time.sleep(random.uniform(1, 3)))
Use rotating proxies or services like ScraperAPI, BrightData
Use headers that mimic a real browser (User-Agent, Referer); see the sketch below
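A minimal sketch of polite request pacing with the requests library; the header values and URLs are illustrative, and a rotating proxy would plug in through the proxies= argument:

    import random
    import time

    import requests

    # Browser-like headers; the values here are only examples
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Referer": "https://www.google.com/",
    }

    urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholder URLs

    for url in urls:
        response = requests.get(url, headers=HEADERS, timeout=10)
        print(url, response.status_code)
        # Randomized delay so the request timing doesn't look machine-generated
        time.sleep(random.uniform(1, 3))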
3. Not Handling JavaScript-Rendered Content
Use Selenium, Playwright, or Puppeteer for dynamic content
Check browser DevTools → Network → XHR for hidden APIs the page calls; a Playwright sketch follows below
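Here is a minimal Playwright sketch for a JavaScript-rendered page; the URL and CSS selector are placeholders. If DevTools shows an XHR endpoint returning JSON, calling that endpoint directly with requests is usually simpler and faster:

    from playwright.sync_api import sync_playwright

    URL = "https://example.com/listings"  # placeholder: a JavaScript-rendered page
    SELECTOR = ".listing-card"            # placeholder: element that appears after JS runs

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(URL)
        page.wait_for_selector(SELECTOR)  # wait until the dynamic content is in the DOM
        titles = page.locator(SELECTOR).all_inner_texts()
        browser.close()

    print(titles)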
4. Scraping Dirty or Irrelevant Data
Use strip(), regex, and data validation checks
Plan what fields you need before scraping
Use Pandas or Excel to clean and format after scraping (see the sketch below)
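A short cleaning pass combining strip(), a regex validity check, and pandas; the field names and rows are made-up examples:

    import re

    import pandas as pd

    # Example scraped rows; the fields and values are illustrative
    rows = [
        {"name": "  Widget A ", "price": "$19.99"},
        {"name": "Widget B\n", "price": "N/A"},
    ]

    PRICE_RE = re.compile(r"^\$?\d+(\.\d{2})?$")

    cleaned = []
    for row in rows:
        name = row["name"].strip()        # drop stray whitespace and newlines
        price = row["price"].strip()
        if not PRICE_RE.match(price):     # validate the field before keeping the row
            continue
        cleaned.append({"name": name, "price": float(price.lstrip("$"))})

    df = pd.DataFrame(cleaned)
    df = df.drop_duplicates()             # remove duplicate rows
    df.to_csv("products_clean.csv", index=False)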
5. No Error Handling or Logging
Wrap each request in try/except so one bad page doesn't stop the whole run
Save progress in batches (e.g., every 50 rows to a CSV)
Log failed URLs and errors for later re-scraping, as in the sketch below
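A sketch of try/except around each request, logging failures, and flushing results to CSV every 50 rows; the URLs and file names are placeholders:

    import csv
    import logging
    import random
    import time

    import requests

    logging.basicConfig(filename="scrape.log", level=logging.INFO)

    urls = [f"https://example.com/item/{i}" for i in range(200)]  # placeholder URLs
    batch, failed = [], []

    with open("results.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "status"])
        for url in urls:
            try:
                resp = requests.get(url, timeout=10)
                resp.raise_for_status()
                batch.append([url, resp.status_code])
            except requests.RequestException as exc:
                logging.error("Failed %s: %s", url, exc)  # keep a record for re-scraping
                failed.append(url)
            if len(batch) >= 50:                          # save progress in batches
                writer.writerows(batch)
                f.flush()
                batch.clear()
            time.sleep(random.uniform(1, 3))
        writer.writerows(batch)                           # write the final partial batch

    print("Failed URLs:", failed)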