Website scraping: what websites scrape well and which do not?

I know there are a couple of similar topics, but I haven't found helpful suggestions in them.

I’m having mixed results with adding web pages to the knowledge base. For one site it worked and found 100 pages. I just tried to add a Wikipedia page and it errored out with a "process failed" notice.

I am looking for more detailed instructions on how to use this important capability. For example: 1) What determines whether a site, like Wikipedia, can be scraped? 2) Is there anything I can do prior to starting the process to make it more successful?

Any tips or guides would be helpful. Thanks


Pickaxe’s scraper works well for static HTML pages but struggles with sites that load content dynamically using JavaScript—like Wikipedia. When a page loads this way, the scraper only sees an empty shell because the actual content isn’t in the initial HTML.

How to Tell If a Site Can Be Scraped?

Quick test:

  • Right-click the page → Click “View Page Source.”
  • If you see the full content there, Pickaxe can scrape it.
  • If it’s mostly JavaScript or missing content, the scraper won’t work well.
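The same quick test can be done programmatically. This is a rough Python sketch (my own heuristic, not part of Pickaxe): strip out the scripts and tags from the raw HTML and see how much visible text is left. A JavaScript-rendered "empty shell" has almost none.

```python
import re

def looks_javascript_rendered(html: str) -> bool:
    """Heuristic check on raw HTML (no JavaScript executed).

    Removes <script>/<style> blocks and all tags, then measures the
    remaining visible text. Very little text left suggests the real
    content is loaded by JavaScript and a simple scraper won't see it.
    """
    # Drop script and style blocks entirely, including their contents
    stripped = re.sub(r"<(script|style)\b.*?</\1>", "", html,
                      flags=re.DOTALL | re.IGNORECASE)
    # Drop the remaining tags, keeping only text between them
    text = re.sub(r"<[^>]+>", " ", stripped)
    visible = re.sub(r"\s+", " ", text).strip()
    # 200 characters is an arbitrary threshold for "empty shell"
    return len(visible) < 200
```

Feed it the page source you'd fetch with `urllib` or `requests`; if it returns `True`, the scraper will likely only see the shell.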

How to Improve Your Success?

  1. Try a print-friendly version—Wikipedia and some other sites have ?printable=yes links that work better.
  2. Disable JavaScript in your browser—If the page still loads properly, it’s scrapable.
  3. Stick to simpler sites—If a page needs a browser to “render” content, Pickaxe won’t be able to grab it.
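For tip 1, a small sketch of building the print-friendly URL mentioned above (the `printable=yes` parameter works on Wikipedia; other sites may use a different mechanism or none at all):

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def printable_url(url: str) -> str:
    """Append printable=yes to a URL's query string, preserving
    any existing parameters."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query["printable"] = "yes"
    return urlunparse(parts._replace(query=urlencode(query)))
```

You can then paste the resulting URL into the knowledge base instead of the regular page URL.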

That said, Pickaxe is great for most basic scraping needs—but for more advanced cases, you’ll need a universal scraping solution. I can handle that if needed—just DM me!


@Ned.Malki Thank you. That helped me better understand how scraping works and why it sometimes doesn't. Downloading a PDF of the Wikipedia page instead worked great.
