We had some instability with our database provider last night, from roughly 2am to 10am PST. I have been up most of the night attempting to make things right, and as of about 10am PST today most of the issues should be resolved. @Pristine can you let me know if you or your team are still encountering issues?
@nathaniel … I’m still definitely seeing major issues here. None of my test queries are returning responses that refer to the knowledge base files. The Message Insights and Explore screens show a list of chunks supposedly used, but none of the chunks are actually relevant and they’re all from a single file that contains no info relevant to the queries.
@nathaniel … Update as of 9:45 pm ET Friday: All bot responses are still coming back reflecting no knowledge base information, despite the fact that clear info does exist there. Message Insights screen says knowledge base wasn’t used. Explore screen shows “0 relevant knowledge chunks”. This bot has the relevance cutoff set at 60. The explore screen displays at least 20 chunks, but the highest score is typically in the 50s and none of the chunks on that screen actually contain the requested relevant info. Knowledge base use appears to be entirely broken. @stephenasuncion
I’ve duplicated the bot, removed the entire knowledge base, added a different knowledge base, and tried using no knowledge base at all; nothing fixes it. The bot is somehow using an old knowledge base that was deleted weeks ago, providing information and answers that should be impossible for it to provide since they are not in its current knowledge base, including links to 404 error pages.
Thanks for the update, @stephenasuncion. The Pickaxe chunk size used to be 200-250 words, and the explanation was that tests showed that was a good size. Now, the average chunk size in files I’ve re-uploaded since the new chunking process is just 30 words – three lines in the Chunk Explorer, and some have as few as 1 or 2 lines. Can you please share the rationale behind this change?
I’m working with documents to which markdown has been added to define a hierarchy of header sections, and I’m doing some experimentation using CSV files to see whether longer chunks whose boundaries are defined partly by the heading-defined sections result in a higher level of accuracy.
No worries! It turns out there was an issue with token counting. We’ve increased the max token limit, so you should notice an increase in chunk size.
I recommend splitting sections using double newlines (\n\n). As I mentioned earlier, we start by breaking the raw content into paragraphs based on double newlines. If a chunk still exceeds the max token limit, we continue splitting it further: first by lines, then by spaces, and finally character by character if necessary.
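For anyone who wants to preview how their own files might break up under this kind of scheme, here is a rough Python sketch of the paragraph, then line, then space, then character fallback described above. It is not Pickaxe's actual code. The `count_tokens` heuristic (about 4 characters per token), the `split_text` name, and the greedy packing of small pieces back together up to the limit are all assumptions made for illustration.

```python
import math

# Coarsest to finest: paragraphs, lines, words, single characters.
SEPARATORS = ["\n\n", "\n", " ", ""]


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer (assumption: ~4 characters per token)."""
    return math.ceil(len(text) / 4)


def split_text(text: str, max_tokens: int, level: int = 0) -> list[str]:
    """Split text into chunks of at most max_tokens, trying coarse separators first."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    if count_tokens(text) <= max_tokens:
        return [text] if text.strip() else []

    sep = SEPARATORS[level]
    parts = text.split(sep) if sep else list(text)

    chunks: list[str] = []
    current = ""
    for part in parts:
        candidate = current + sep + part if current else part
        if count_tokens(candidate) <= max_tokens:
            current = candidate  # assumption: pack small pieces together
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if count_tokens(part) <= max_tokens:
            current = part
        else:
            # This piece alone is too large: fall back to the next finer separator.
            chunks.extend(split_text(part, max_tokens, level + 1))
    if current.strip():
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "First paragraph of a section.\n\nSecond paragraph.\nIt has two lines."
    for chunk in split_text(sample, max_tokens=8):
        print(repr(chunk))
```

With a generous limit the text comes back as a single chunk. Lowering `max_tokens` shows where the finer separators start to kick in, which is a quick way to see why blank lines between sections matter.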
Thanks, @stephenasuncion! With the generous help of ChatGPT, I’ve just developed a Python script that chunks my documents with close attention to topic boundaries, and saves the output as a CSV file. I manually applied hierarchical topic headings to all my documents using markdown’s #-##### heading symbols. The Python script treats the content between the # headings as segments, and then combines (or in some cases, splits) the segments to intelligently allocate content to the chunks. It’s set up for a default minimum length of 80 words and a default normal maximum of 250 (with a resulting average of around 200 words), but it will permit a chunk to be as large as 400 words if that helps keep related information together. The script supports passing parameters to alter those three values if desired. I’m willing to share the script if anyone has documents with markdown headings and would like to try this approach. I plan to do some testing to compare this topic-based chunking with the built-in length-based chunking, and will pass along the results when available.
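To give a sense of what heading-based chunking can look like, here is a minimal sketch. It is not the script described above, which has not been posted here, only an illustration built on the three defaults mentioned (80, 250 and 400 words). The function names, the greedy combining rule, and the single `chunk` CSV column are all hypothetical. Check what CSV layout your knowledge base upload expects before relying on it.

```python
import csv
import re

# ATX-style markdown heading: one or more '#' followed by whitespace.
HEADING = re.compile(r"^#{1,6}\s")


def word_count(text: str) -> int:
    return len(text.split())


def split_into_segments(markdown: str) -> list[str]:
    """One segment per heading: the heading line plus the text beneath it."""
    segments, current = [], []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            segments.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        segments.append("\n".join(current).strip())
    return [s for s in segments if s]


def split_long(segment: str, hard_max: int) -> list[str]:
    """Cut an oversized segment into word windows (line breaks are not kept)."""
    words = segment.split()
    return [" ".join(words[i:i + hard_max]) for i in range(0, len(words), hard_max)]


def chunk_segments(segments, min_words=80, max_words=250, hard_max=400):
    """Greedily combine segments; the last chunk may still fall below min_words."""
    chunks, current = [], ""
    for segment in segments:
        pieces = [segment] if word_count(segment) <= hard_max else split_long(segment, hard_max)
        for piece in pieces:
            combined = f"{current}\n\n{piece}" if current else piece
            total = word_count(combined)
            fits = total <= max_words
            # Let a too-short chunk grow past max_words, but never past hard_max.
            rescue = word_count(current) < min_words and total <= hard_max
            if not current or fits or rescue:
                current = combined
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


def write_csv(chunks, path):
    """Write one chunk per row under a hypothetical 'chunk' column."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["chunk"])
        writer.writerows([c] for c in chunks)
```

Typical use would be `write_csv(chunk_segments(split_into_segments(text)), "chunks.csv")`, where `text` is the contents of a markdown file. Because chunk boundaries only ever fall between heading sections (unless a single section exceeds the hard maximum), each chunk stays on topic, which is the property worth comparing against plain length-based chunking.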
Hello @sensei
The issues I was experiencing with KB retrieval appear to have all been resolved with the programming changes made by the Pickaxe team. What specific issues are you seeing?
Hi @Gene
Thank you for your reply.
Although the file (a plain-text e-book, very well structured with ## main headings and # subheadings) was very neatly chunked and the relevance score was 70%, the bot could not find the very obvious answers to simple questions. I am talking in the past tense because since that day I have not encountered similar problems on my other pickaxes. When I next visit this pickaxe (a bot for chatting about an e-book), I will try again.
Best,