Experiments with topic-based chunking

Disclaimer: I’m definitely not an expert in RAG chunking or knowledge bases, but I’ve worked with words and communications most of my life, and the topic fascinates me. I thought I’d share some recent experiences from my attempts to optimize KB content for the best possible bot responses. Please share your own thoughts and experiences.

As the Pickaxe team was working in recent days to fix some Knowledge Base issues and implement a new chunking method, I experimented with some new KB approaches of my own. My goal was to improve the accuracy and completeness of responses in an app I’ve developed that enables users to “chat” with a 16th-century Anabaptist leader named Menno Simons. I need the responses to be as true as possible to the life and theology of this remarkable man.

First, I added topical headings and sub-headings, using the Markdown heading symbols (# through #####), to approximately 15 documents totaling a little over 3 MB. They include Simons’ complete writings as well as several biographies and articles about this Protestant Reformation leader.

Second, I developed a script to convert the markdown files into CSV files containing topic-aware chunks ready for uploading into my Pickaxe KB. My aim was to respect the topic and paragraph divisions as much as possible and avoid creating chunks containing orphaned or out-of-context info the AI might overlook or misinterpret.
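To make that first step concrete, here’s a minimal sketch of how a script can split Markdown into heading-tagged segments before any chunking happens. This isn’t my actual script, just an illustration; `parse_segments` and `HEADING_RE` are names I’m using here for the example.

```python
import re

# Recognize the # through ##### heading symbols used to annotate the documents.
HEADING_RE = re.compile(r"^(#{1,5})\s+(.+)")

def parse_segments(md_text):
    """Return (current_heading, paragraph) pairs.

    Paragraphs are blocks separated by blank lines; each block is tagged
    with the most recent heading seen above it, so the chunker can later
    respect topic boundaries.
    """
    heading = ""
    segments = []
    for block in re.split(r"\n\s*\n", md_text):
        block = block.strip()
        if not block:
            continue
        m = HEADING_RE.match(block)
        if m:
            heading = m.group(2).strip()
            rest = block[m.end():].strip()  # body text glued to the heading line
            if rest:
                segments.append((heading, rest))
        else:
            segments.append((heading, block))
    return segments
```

Each (heading, paragraph) pair then gets handed to the chunking logic described below.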

I initially developed the process using narrative instructions in ChatGPT, iterating a lot to get it right, and then decided I’d go one step further and ask ChatGPT to convert my instructions into a Python script I could use to chunk a whole batch of files at once. (I have done coding and web development in the past, but had not previously worked with Python. ChatGPT did nearly all the work and even tweaked the code remarkably well as I identified areas where refinement was needed.)

Whenever possible, the script begins each chunk with a heading or sub-heading and varies chunk size, from 80 to 250 words, to limit the number of topical segments whose content must be split across multiple chunks. It expands the maximum chunk size to as much as 400 words if needed to avoid breaking a topical segment or paragraph. When content does have to continue into another chunk, the script automatically adds a “continuation heading” at the start of that next chunk, repeating the last heading or sub-heading from the previous chunk followed by “ – cont’d.” This is intended to help the AI understand the context of content that had to be separated from its preceding associated text.

The script has three parameters, making it easy to alter the minimum chunk length, normal maximum chunk length, and the extended chunk length permitted if needed to accommodate large paragraphs.
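For anyone curious what that logic can look like, here’s a simplified sketch built on the `parse_segments` output above. It’s an illustration of the rules described in the last two paragraphs, not my actual script, and the one-chunk-per-row CSV layout is my assumption about the upload format.

```python
import csv

# The three parameters described above (all word counts).
MIN_WORDS = 80        # minimum chunk length
MAX_WORDS = 250       # normal maximum chunk length
EXTENDED_MAX = 400    # extended limit, used only to keep a paragraph intact

def build_chunks(segments):
    """Assemble (heading, paragraph) segments into topic-aware chunks."""
    chunks, parts, words = [], [], 0
    open_heading = None  # topic under which the current/previous chunk opened

    def close_chunk():
        nonlocal parts, words
        if parts:
            chunks.append("\n\n".join(parts))
        parts, words = [], 0

    for heading, para in segments:
        n = len(para.split())
        # Start a fresh chunk at a topic boundary once the minimum is met.
        if heading != open_heading and words >= MIN_WORDS:
            close_chunk()
        # Also close if this paragraph won't fit; the extended limit lets an
        # undersized chunk absorb a large paragraph whole rather than split it.
        if parts and words + n > MAX_WORDS and (words >= MIN_WORDS or words + n > EXTENDED_MAX):
            close_chunk()
        if heading:
            if not parts:
                # Open with the heading, flagged as a continuation when this
                # topic already started in the previous chunk.
                parts.append(heading + (" – cont’d" if heading == open_heading else ""))
            elif heading != open_heading:
                parts.append(heading)  # small topic merged into the current chunk
        open_heading = heading
        parts.append(para)
        words += n
    close_chunk()
    return chunks

def write_csv(chunks, out_path):
    # One chunk per row, single column (my assumption about the KB format).
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([chunk] for chunk in chunks)
```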

So today I compared the responses of my “Chat with Menno” app using two kinds of files: TXT files containing markdown, chunked automatically on upload by Pickaxe’s new, improved chunking process, and CSV files containing the ready-made chunks my custom script created from the heading symbols in those same markdown TXT files.

I thought up 10 questions for the app, some focused on specific facts and others requiring synthesis of a lot of information, and recorded the app’s responses with each type of file. I used a “Relevance Cutoff” of 60 and an “Amount” setting of 2000.

First, I’m impressed with the new Pickaxe chunking process, which I understand splits mostly on paragraph boundaries denoted by a double newline (\n\n), plus length constraints. It appears to produce chunks averaging slightly longer than the former process did (and a bit shorter than the ones generated by my custom script). Interestingly, the new Pickaxe algorithm seems to at least partially respect topic boundaries: many of its chunks begin at a section heading in my markdown files, and some end at the conclusion of that section. That’s great. The app’s responses to my 10 questions when using knowledge files chunked by Pickaxe were generally good, and I think a clear improvement over responses arising from chunks created by the former process.
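For contrast, here’s roughly how I picture that kind of paragraph-plus-length splitting. This is only my mental model, not Pickaxe’s actual code, and the 1,200-character cap is an arbitrary stand-in:

```python
def naive_paragraph_chunks(text, max_chars=1200):
    """Greedy paragraph packing: split on blank lines, then fill each chunk
    up to a length cap. Illustration only, not Pickaxe's implementation."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks
```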

I’m biased, of course, but my assessment is that the files I chunked with my script gave somewhat better responses to 5 of the 10 questions in this initial, unscientific assessment. Several of these were questions asking for a list of reasons that had been addressed in a lengthy section of one of the files. The responses generated from my CSV files tended to be more direct and specific, with less generic information.

I haven’t directly compared responses generated from files with markdown headings and those without, but my sense is that adding topic headings prior to chunking was worth the considerable effort. Maybe the next step is to create a pickaxe or a Python script that parses a document and adds descriptive topic headings, ready for chunking that uses those headings to carve up the document in a manner best suited for AI use.
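As a thought experiment, that heading-generating step might look something like the sketch below, using the OpenAI Python library. The function name, model choice, and prompt wording are all hypothetical choices on my part.

```python
from openai import OpenAI  # assumes the openai package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def suggest_heading(block_text, model="gpt-4o-mini"):
    """Ask a model for a short descriptive heading for one block of text.
    Hypothetical sketch: the model and prompt are illustrative only."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Write a concise topical heading (under ten words) "
                        "for the passage the user provides. Reply with the "
                        "heading only."},
            {"role": "user", "content": block_text},
        ],
    )
    return resp.choices[0].message.content.strip()

# A document could then be walked block by block, writing
# "## " + suggest_heading(block) above each block before chunking.
```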


We’re actively experimenting with better parsing/chunking methods! Thank you for the deep dive.
