Document Chunking

Document chunking splits a document into smaller chunks along its structure, such as paragraphs and sections. It follows the document's natural boundaries rather than cutting at fixed character counts, which helps preserve semantic meaning and context when you process large documents.
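To make the contrast with fixed-size splitting concrete, here is a minimal sketch of structure-aware chunking. It is not the library's implementation: the function names `chunk_by_paragraphs` and `chunk_fixed` are hypothetical, and it assumes paragraphs are separated by blank lines and that sizes are counted in characters.

```python
import re


def chunk_by_paragraphs(text: str, chunk_size: int = 5000) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most chunk_size characters.

    Assumption: a paragraph longer than chunk_size is kept whole as its own
    (oversized) chunk rather than being cut in the middle.
    """
    # Treat one or more blank lines as a paragraph boundary.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        # +2 accounts for the "\n\n" separator used when joining paragraphs.
        added = len(para) + (2 if current else 0)
        if current and current_len + added > chunk_size:
            # The next paragraph does not fit: close the current chunk.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added = len(para)
        current.append(para)
        current_len += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def chunk_fixed(text: str, chunk_size: int) -> list[str]:
    """Baseline for comparison: cut every chunk_size characters, ignoring structure."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "Intro paragraph.\n\nSecond paragraph here.\n\nThird."
structured = chunk_by_paragraphs(sample, chunk_size=30)
fixed = chunk_fixed(sample, chunk_size=20)
```

In this sketch every structured chunk ends on a paragraph boundary, while the fixed-size baseline can cut in the middle of a sentence.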

Usage

from bitca.agent import Agent
from bitca.document.chunking.document import DocumentChunking
from bitca.knowledge.pdf import PDFUrlKnowledgeBase
from bitca.vectordb.pgvector import PgVector

# Connection string for the PostgreSQL database that backs PgVector
db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"

# Build a knowledge base from a PDF URL, chunked along document structure
knowledge_base = PDFUrlKnowledgeBase(
    urls=["https://bitca-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf"],
    vector_db=PgVector(table_name="recipes_document_chunking", db_url=db_url),
    chunking_strategy=DocumentChunking(),
)
knowledge_base.load(recreate=False)  # Comment out after first run

# Let the agent search the knowledge base when answering
agent = Agent(
    knowledge_base=knowledge_base,
    search_knowledge=True,
)

agent.print_response("How to make Thai curry?", markdown=True)

Params

Parameter    Type   Default   Description
chunk_size   int    5000      The maximum size of each chunk.
overlap      int    0         The number of characters to overlap between chunks.
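As an illustration of what a character overlap between chunks can mean, the sketch below prefixes each chunk with the tail of the one before it. This is one common interpretation, not necessarily how the library applies `overlap`; the helper name `add_overlap` is hypothetical.

```python
def add_overlap(chunks: list[str], overlap: int = 0) -> list[str]:
    """Prefix each chunk after the first with the last `overlap` characters
    of the previous original chunk.

    Assumption: in this sketch the overlap is added on top of the chunk,
    so overlapped chunks can be longer than the original chunk size.
    """
    if overlap <= 0:
        # The default of 0 leaves the chunks unchanged.
        return list(chunks)
    result = chunks[:1]
    for prev, cur in zip(chunks, chunks[1:]):
        # Take the tail from the original previous chunk, not the overlapped one.
        result.append(prev[-overlap:] + cur)
    return result


overlapped = add_overlap(["abcdef", "ghij", "kl"], overlap=2)
```

With a non-zero overlap, text near a chunk boundary appears in both neighbouring chunks, which can help retrieval when a relevant passage straddles the boundary.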


