Website scraping: what websites scrape well and which do not?

I know there are a couple of similar topics, but I haven't found helpful suggestions in them.

I’m having mixed results with adding web pages to the knowledge base. For one site it worked and found 100 pages. I just tried to add a Wikipedia page and it errored out with a "process failed" notice.

I am looking for more detailed instructions on how to use this important capability. For example: 1) What determines whether a site, like Wikipedia, can be scraped? 2) Is there anything I can do prior to starting the process to make it more successful?

Any tips or guides would be helpful. Thanks


Pickaxe’s scraper works well for static HTML pages but struggles with sites that load content dynamically using JavaScript—like Wikipedia. When a page loads this way, the scraper only sees an empty shell because the actual content isn’t in the initial HTML.

How to Tell If a Site Can Be Scraped?

Quick test:

  • Right-click the page → Click “View Page Source.”
  • If you see the full content there, Pickaxe can scrape it.
  • If it’s mostly JavaScript or missing content, the scraper won’t work well.
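The same quick test can be done programmatically. This is a rough Python sketch (my own heuristic, not part of Pickaxe): strip out the scripts and tags from the raw HTML and see how much visible text is left. A JavaScript-rendered "empty shell" has almost none.

```python
import re

def looks_javascript_rendered(html: str) -> bool:
    """Heuristic check on raw HTML (no JavaScript executed).

    Removes <script>/<style> blocks and all tags, then measures the
    remaining visible text. Very little text left suggests the real
    content is loaded by JavaScript and a simple scraper won't see it.
    """
    # Drop script and style blocks entirely, including their contents
    stripped = re.sub(r"<(script|style)\b.*?</\1>", "", html,
                      flags=re.DOTALL | re.IGNORECASE)
    # Drop the remaining tags, keeping only text between them
    text = re.sub(r"<[^>]+>", " ", stripped)
    visible = re.sub(r"\s+", " ", text).strip()
    # 200 characters is an arbitrary threshold for "empty shell"
    return len(visible) < 200
```

Feed it the page source you'd fetch with `urllib` or `requests`; if it returns `True`, the scraper will likely only see the shell.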

How to Improve Your Success?

  1. Try a print-friendly version—Wikipedia and some other sites have ?printable=yes links that work better.
  2. Disable JavaScript in your browser—If the page still loads properly, it’s scrapable.
  3. Stick to simpler sites—If a page needs a browser to “render” content, Pickaxe won’t be able to grab it.
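For tip 1, a small sketch of building the print-friendly URL mentioned above (the `printable=yes` parameter works on Wikipedia; other sites may use a different mechanism or none at all):

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def printable_url(url: str) -> str:
    """Append printable=yes to a URL's query string, preserving
    any existing parameters."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query["printable"] = "yes"
    return urlunparse(parts._replace(query=urlencode(query)))
```

You can then paste the resulting URL into the knowledge base instead of the regular page URL.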

That said, Pickaxe is great for most basic scraping needs—but for more advanced cases, you’ll need a universal scraping solution. I can handle that if needed—just DM me!


@Ned.Malki Thank you. That helped me better understand how scraping works and why it sometimes doesn't. Downloading a PDF of the Wikipedia page instead worked great.
