Hi,
First of all, I’m very new to Pickaxe.
I’m trying to add my website to my first Pickaxe, just like in this video https://www.youtube.com/watch?v=0RSsUpaOwbE, but the only result is 22 chunks added to the KB, all taken from the home page.
The site has many pages: about 50 product categories and more than 10,000 product pages.
Needless to say, knowledge of all the products and pages is a crucial part of the chatbot, as it is built to help customers and (also) suggest categories and products.
For obvious reasons, I can’t manually add all the pages to the KB.
The website is made with Magento and has 3 different store views for three different languages. All languages should be included in the KB.
Thanks
The tool will try to index your website and grab all the pages. To do this, it uses the sitemap file, a file almost all websites have that indexes their pages for search engines so that, for example, Google can understand the site.
If your site isn’t working, it’s probably because it is missing this file. Your options are either to add it, which would also improve your site’s overall SEO, or to add the pages manually.
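For reference, this is roughly what a crawler does with that file. A minimal sketch in Python, assuming the sitemap lives at the standard /sitemap.xml location (Pickaxe’s actual crawler may work differently):

```python
import xml.etree.ElementTree as ET

import requests

# Namespace used by standard sitemap.xml files
SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(site_root: str) -> list[str]:
    """Fetch /sitemap.xml and return every <loc> URL it lists."""
    resp = requests.get(f"{site_root.rstrip('/')}/sitemap.xml", timeout=30)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    return [loc.text for loc in root.iter(f"{SM}loc")]

# e.g. print(len(sitemap_urls("https://example.com")), "URLs found")
```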
You mention there are too many to add manually, but adding them manually gives you the ability to pick and choose content more granularly, which might be better in the end.
Hope this helps!
Hello and happy new year!
The website has a sitemap page, of course; sorry if I don’t disclose the URL publicly here.
The sitemap lists all the category, blog, and CMS pages. It doesn’t list the products because, as mentioned, there are too many and it would be counterproductive to list them all there.
Nevertheless, it’s easy to reach every product starting from the home page via the main menu.
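To give an idea of what I mean, a crawler that follows links instead of reading the sitemap could reach everything with something like this (a rough sketch using requests and BeautifulSoup, not what Pickaxe actually does):

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str, max_pages: int = 200) -> set[str]:
    """Breadth-first crawl that follows internal links from the home page."""
    domain = urlparse(start_url).netloc
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=15).text
        except requests.RequestException:
            continue
        # Follow every same-domain link found on the page
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain:
                queue.append(link)
    return seen
```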
I’m thinking of one thing though… the whole website sits behind Cloudflare, where I have activated “Bot Fight Mode”, which is supposed to prevent scraping by bots. I don’t want to turn off this feature (the content is original and is mine), yet I must let Pickaxe through. Is there an IP or user agent that I can whitelist?
Another question: is there a limit to the number of scraped pages?
Before Pickaxe, I also tested another automated AI chatbot that, although much less sophisticated, was able to suggest individual products within a chat just from the home page URL.
I can give you more information and examples in a DM if required.
Thanks again
We use ScrapingBee to power our web scraping. They’re a large, very powerful service. That being said, it isn’t always able to scrape every website. You can try to whitelist the addresses ScrapingBee uses, but I don’t think there is a fixed IP address, because I believe they use proxies anyway. Maybe you can find a user agent name in the ScrapingBee documentation. If you do, let us know! It’d be useful.
You can certainly reduce your Cloudflare security. It’s worth a try. An alternative to reducing your Cloudflare security is to simply download the pages you’re interested in as PDFs and upload them to Pickaxe.
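If you go the PDF route, you can batch the download with a short script. A rough sketch, assuming the pdfkit package and the wkhtmltopdf binary are installed, with placeholder URLs:

```python
import pdfkit  # pip install pdfkit; also requires the wkhtmltopdf binary

# Placeholder list -- swap in the category pages you actually care about
urls = [
    "https://example.com/category-one",
    "https://example.com/category-two",
]

for i, url in enumerate(urls, start=1):
    # Renders each page and saves it as a local PDF, ready for upload
    pdfkit.from_url(url, f"page_{i:03d}.pdf")
```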
Ok, after researching a bit, I understand that there is really no way to prevent ScrapingBee from scraping a website, as it can use very smart techniques to fool Cloudflare’s defenses.
So I assume the bot can actually scrape the whole site starting from the home page (which contains the link to the sitemap, the full menu, and all the usual things).
With this scenario, the problem with Pickaxe is even bigger: I verified that it ignores very simple information that has been uploaded to Learn (as PDFs) and even picked up into chunks.
I’m also observing that, while proposing products from the website, it suggests page URLs that don’t actually exist, taking the visitor to a 404 page (!)
I don’t want to be rude, but I don’t think it’s worth wasting more time on this tool.
Where are you getting this information about ScrapingBee? I will look into it more.
Hi,
The main purpose of scraping bots is to scrape as much information as possible while avoiding the various blocks put up by Cloudflare and similar services.
They use standard user agents, they change IPs from time to time, and so on.
I’ve researched a bit and found several pages that explain the various techniques used to fool Cloudflare (e.g. https://scrapfly.io/blog/how-to-bypass-cloudflare-anti-scraping/).
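Just to illustrate the user-agent part, here is a toy sketch of what such scrapers do (real ones also rotate proxy IPs, which plain requests can’t do on its own):

```python
import random

import requests

# A few ordinary browser user agents -- scrapers blend in by cycling these
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]

def fetch(url: str) -> str:
    """Fetch a page while presenting a randomly chosen browser user agent."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=15).text
```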
I could also turn OFF the special AI bot protection on Cloudflare, but I don’t think it would make any difference.
But the thing I really don’t understand is why Pickaxe is literally making up URLs that don’t exist when proposing products to visitors. This is really bad, don’t you think?
Thanks
Franco
Hi @tomcat101
Have you tried connecting custom webhook actions to either Zapier or Make with a web-scraping module integration? There are several advanced web-scraping tools out there that integrate seamlessly with Make and can be useful in many cases.
Keep in mind that scraping is not always straightforward and depends on how the target website is structured.
Give Make webhooks a try and integrate your scraper into the workflow; you can get good results after a few tests.
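For example, pushing scraped page text into a Make scenario can be as simple as a POST to the scenario’s webhook. A minimal sketch; the webhook URL below is a placeholder you get when you create the custom webhook in Make:

```python
import requests

# Placeholder -- Make shows you this URL when you create a custom webhook
WEBHOOK_URL = "https://hook.eu1.make.com/your-webhook-id"

def push_page(url: str, text: str) -> None:
    """Send one scraped page to the Make scenario as JSON."""
    resp = requests.post(WEBHOOK_URL, json={"url": url, "content": text}, timeout=15)
    resp.raise_for_status()
```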
If you require complex and advanced integrations, send me a DM.
~Ned
It should never make up a URL. If it gives a bad URL, it has probably found it in the sitemap, which often contains strange legacy URLs.
But as I mentioned, we don’t actually do any web scraping ourselves; it’s all handled by ScrapingBee. @nathaniel can explain in more detail, as I don’t touch that part of the product much.
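In the meantime, one quick way to test whether the bad URLs really come from your sitemap is to check every listed URL for dead links. A minimal sketch, assuming you already have the URL list (e.g. parsed from sitemap.xml):

```python
import requests

def dead_links(urls: list[str]) -> list[str]:
    """Return the URLs that do not resolve to a 2xx/3xx response."""
    bad = []
    for url in urls:
        try:
            resp = requests.head(url, allow_redirects=True, timeout=10)
            if resp.status_code >= 400:
                bad.append(url)
        except requests.RequestException:
            bad.append(url)
    return bad
```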
Well, this means that ScrapingBee is anything but reliable then.
The platform we are on is Magento and, believe me, the URLs in the sitemap are perfectly fine; they are constantly monitored in Search Console.
The bot also adds “.html” as a suffix to all product URLs. We have NEVER used such a suffix in our URLs.
I think you should fix the scraping part.
Thanks