Web scraping […] is used for extracting data from websites
it’s been a month since I started working on mega-scraper.
mega-scraper is meant to make scraping a website better
it builds on the popular Puppeteer API to drive an instance of Chromium, a web browser.
the scraping queue is based on Redis and can be monitored using bull-dashboard.
I built it because I felt the need for a better way to do scraping.
how to make scraping more reliable and less detectable by anti-scraping shields?
I think the way to go is to simulate a real user using a real browser.
it also comes in handy when debugging: you can inspect updated CSS selectors, or understand how to avoid unexpected modals and solve captcha pages.
you could even simulate a legit user session by having a pool of legit cookies.
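a minimal sketch of what such a cookie pool could look like, assuming sessions are handed out round-robin. the `Cookie`, `Session` and `CookiePool` names are hypothetical and not taken from mega-scraper; the sample cookie values are placeholders.

```typescript
// Hypothetical shape of a stored session: a label plus the cookies to replay.
interface Cookie { name: string; value: string; domain: string }
interface Session { id: string; cookies: Cookie[] }

// Round-robin pool: each call to next() hands out the following session,
// wrapping around to the first one after the last.
class CookiePool {
  private sessions: Session[]
  private index = 0

  constructor(sessions: Session[]) {
    if (sessions.length === 0) throw new Error('cookie pool needs at least one session')
    this.sessions = sessions
  }

  next(): Session {
    const session = this.sessions[this.index]
    this.index = (this.index + 1) % this.sessions.length
    return session
  }
}

// Placeholder sessions; real ones would be exported from genuine browser sessions.
const pool = new CookiePool([
  { id: 'session-a', cookies: [{ name: 'sid', value: 'aaa', domain: 'example.com' }] },
  { id: 'session-b', cookies: [{ name: 'sid', value: 'bbb', domain: 'example.com' }] }
])

// With Puppeteer, the cookies could then be applied before navigating, e.g.:
//   await page.setCookie(...pool.next().cookies)
```

round-robin keeps the load on each session even; picking sessions at random would be an equally reasonable choice.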
the possibilities widen if you browse a website as much like a real user on a product page as possible, with eased step timeouts, random scrolling of the page, etc.
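eased timeouts and random scrolling can be sketched as a small planning function: it splits a scroll into steps along an ease-in-out curve and attaches a random pause to each one. the function names, the easing curve and the default values are assumptions for illustration, not mega-scraper's implementation.

```typescript
// Ease-in-out curve: maps t in [0, 1] to [0, 1], slow at both ends, faster in the middle.
function easeInOut(t: number): number {
  return t < 0.5 ? 2 * t * t : 1 - Math.pow(-2 * t + 2, 2) / 2
}

// Random pause between minMs and maxMs.
// `random` is injectable so the output can be made deterministic in tests.
function randomDelay(minMs: number, maxMs: number, random: () => number = Math.random): number {
  return Math.round(minMs + random() * (maxMs - minMs))
}

interface ScrollStep { y: number; pauseMs: number }

// Plan a scroll of `distance` pixels as `steps` eased positions,
// each followed by a random pause.
function planHumanScroll(
  distance: number,
  steps: number,
  minMs: number = 120,
  maxMs: number = 600,
  random: () => number = Math.random
): ScrollStep[] {
  const plan: ScrollStep[] = []
  for (let i = 1; i <= steps; i++) {
    plan.push({
      y: Math.round(distance * easeInOut(i / steps)),
      pauseMs: randomDelay(minMs, maxMs, random)
    })
  }
  return plan
}

// With Puppeteer, the plan could be replayed roughly like this:
//   for (const step of planHumanScroll(2000, 8)) {
//     await page.evaluate(y => window.scrollTo(0, y), step.y)
//     await new Promise(resolve => setTimeout(resolve, step.pauseMs))
//   }
```

keeping the plan separate from the browser calls makes the timing logic easy to test and tweak without launching Chromium.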
why not even log in to a given page with a real customer account, to scrape its content almost undetectably?
blocking trackers by default.
being able to proxy each request can also help with speed, since multiple services handle your requests.
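blocking trackers could work along these lines with Puppeteer's request interception: check each request's hostname against a blocklist and abort the matches. the `TRACKER_HOSTS` list, `isTracker` and `handleRequest` are hypothetical names for this sketch, and the blocklist is a tiny sample, not the one mega-scraper ships.

```typescript
// Tiny sample blocklist (an assumption for illustration); real lists are much longer.
const TRACKER_HOSTS: string[] = [
  'google-analytics.com',
  'googletagmanager.com',
  'doubleclick.net'
]

// True when the URL's hostname is a blocked host or one of its subdomains.
function isTracker(url: string, blocklist: string[] = TRACKER_HOSTS): boolean {
  let hostname: string
  try {
    hostname = new URL(url).hostname
  } catch {
    return false // not a parseable absolute URL: let it through
  }
  return blocklist.some(host => hostname === host || hostname.endsWith('.' + host))
}

// Minimal structural type covering the request methods used below
// (Puppeteer's HTTPRequest provides url(), abort() and continue()).
interface RequestLike {
  url(): string
  abort(): Promise<void>
  continue(): Promise<void>
}

// Abort tracker requests, let everything else continue.
function handleRequest(request: RequestLike): Promise<void> {
  return isTracker(request.url()) ? request.abort() : request.continue()
}

// Wiring it into a Puppeteer page:
//   await page.setRequestInterception(true)
//   page.on('request', handleRequest)
```

matching on the hostname rather than the full URL avoids false positives when a tracker domain merely appears in a query string.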
it’s all about experiments
mega-scraper itself needs lots of improvements and new creative ways to avoid (or even solve) captchas, improve networking, handle generic pagination, automate data extraction and much more.
open-source and npm package
both are available as npm packages 📦
if you find ways to improve web scraping, let me know by opening a pull request on GitHub at github.com/christian-fei/mega-scraper.