I know there are a couple of other similar topics, but I haven't found helpful suggestions in them.
I’m having mixed results with adding web pages to the knowledge base. For one site it worked and found 100 pages, but when I tried to add a Wikipedia page it errored out with a “processing failed” notice.
I am looking for more detailed instructions on how to use this important capability, such as: 1) What determines whether a site like Wikipedia can be scraped? 2) Is there anything I can do prior to starting the process to make it more successful?
Any tips or guides would be helpful. Thanks!
Pickaxe’s scraper works well for static HTML pages but struggles with sites that load content dynamically using JavaScript—like Wikipedia. When a page loads this way, the scraper only sees an empty shell because the actual content isn’t in the initial HTML.
How to Tell If a Site Can Be Scraped?
Quick test:
- Right-click the page → Click “View Page Source.”
- If you see the full content there, Pickaxe can scrape it.
- If it’s mostly JavaScript or missing content, the scraper won’t work well.
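If you’d rather script that check than open View Page Source by hand, here’s a minimal sketch in Python (assuming the requests library is installed; the URL and the phrase to look for are only placeholders you’d swap for your own):

```python
import requests

# Fetch the raw HTML exactly as a simple scraper would see it --
# no JavaScript runs here, so this matches what "View Page Source" shows.
url = "https://example.com/some-page"        # placeholder URL
phrase = "text you can see on the page"      # placeholder phrase to look for

response = requests.get(url, timeout=15)
response.raise_for_status()
html = response.text

if phrase in html:
    print("Content is in the raw HTML -- a static scraper should handle it.")
else:
    print("Content is missing from the raw HTML -- it is likely rendered by JavaScript.")
```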
How to Improve Your Success?
- Try a print-friendly version—Wikipedia and some other sites have `?printable=yes` links that work better (see the URL example after this list).
- Disable JavaScript in your browser—If the page still loads properly, it’s scrapable.
- Stick to simpler sites—If a page needs a browser to “render” content, Pickaxe won’t be able to grab it.
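For the print-friendly tip above: on MediaWiki sites like Wikipedia, the printable view can usually be requested through the index.php entry point. A rough sketch of building that URL (the article title is just an example, and exact support for the parameter can vary by site and software version):

```python
from urllib.parse import urlencode

# Build a print-friendly URL for a MediaWiki page (e.g. Wikipedia).
# Note: support for printable=yes varies by site and MediaWiki version,
# so treat this as a starting point rather than a guarantee.
title = "Web_scraping"                       # example article title
params = urlencode({"title": title, "printable": "yes"})
printable_url = f"https://en.wikipedia.org/w/index.php?{params}"

print(printable_url)
# -> https://en.wikipedia.org/w/index.php?title=Web_scraping&printable=yes
```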
That said, Pickaxe is great for most basic scraping needs—but for more advanced cases, you’ll need a universal scraping solution. I can handle that if needed—just DM me!
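If you want to experiment with a JavaScript-heavy page yourself, one common approach (just a sketch, not necessarily what any particular tool uses under the hood) is a headless browser such as Playwright, which runs the page’s JavaScript and then hands you the rendered HTML. This assumes the Playwright Python package and its Chromium browser are installed (`pip install playwright`, then `playwright install chromium`), and the URL is a placeholder:

```python
from playwright.sync_api import sync_playwright

# Render a JavaScript-heavy page in a headless browser, then grab the
# final HTML the way a visitor's browser would see it.
url = "https://example.com/js-heavy-page"    # placeholder URL

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(url, wait_until="networkidle")  # wait for network activity to settle
    rendered_html = page.content()            # HTML after JavaScript has run
    browser.close()

print(len(rendered_html), "characters of rendered HTML")
```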
@Ned.Malki Thank you. It was very helpful for understanding how scraping works and why it sometimes doesn’t. I was able to download a PDF of the Wikipedia page instead, and that worked great.