Different types of Chunking in RAG Systems

What is Chunking?

Chunking is the process of breaking large documents down into smaller, manageable pieces. Each piece is known as a chunk. This helps a RAG system efficiently retrieve relevant information from a vast amount of data and use it when generating responses.

Why is Chunking required?

Memory Limitation:
- Large documents can exceed memory capacity.
- Chunking breaks down the data into manageable chunks.
Processing Efficiency:
- Smaller chunks are faster to process.
- Reduces computational cost.
Improved Retrieval Accuracy:
- Focuses retrieval on the relevant sections.
- Enhances context-specific responses.
Simplified Information Management:
- Easier to navigate and search.
- Facilitates quick access to specific data.
Scalability:
- Allows the system to handle larger datasets.
- Makes the system more robust and scalable.

Fixed-Length Chunking

As the name suggests, we create chunks of a fixed size from an existing document.
It is a method of splitting text into chunks of a specific size, optionally keeping an overlap between adjacent chunks to preserve continuity.
The CharacterTextSplitter in LangChain supports this by measuring chunk size in characters: by default it first splits the text on a separator and then merges the pieces up to the configured chunk size.

Steps

Input Document: a long document or text that needs to be divided into smaller chunks.
Define Chunk Size: determine the fixed size of each chunk (e.g., 100 words, 500 characters, etc.).
Chunking Process
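- The text is split into consecutive segments of the defined size. If an overlap is configured, each chunk starts by repeating the last few units of the previous chunk; otherwise the segments do not overlap.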
- This can be done at the word level, token level, or character level.
- In fixed-length chunking, chunks are created strictly based on the fixed size, regardless of sentence boundaries.
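
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration at the character level, not LangChain's implementation: the function name fixed_length_chunks and its parameters chunk_size and overlap are hypothetical names chosen for this example. It cuts strictly by size, so a chunk may start or end in the middle of a word or sentence.

```python
def fixed_length_chunks(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into character chunks of at most chunk_size characters.

    Hypothetical helper for illustration. Adjacent chunks share
    `overlap` characters; the last chunk may be shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the text,
        # so we do not emit a trailing chunk made only of overlap.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "Chunking splits a long document into smaller pieces."
    for piece in fixed_length_chunks(sample, chunk_size=20, overlap=5):
        print(repr(piece))
```

The same idea applies at the word or token level: split the text into a list of words or tokens first, slice that list with the same start and step logic, and join each slice back into a string.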