Skip to content

fix(kb): chunking config persistence#3877

Merged
icecrasher321 merged 5 commits intostagingfrom
fix/chunking-config-persist
Apr 1, 2026
Merged

fix(kb): chunking config persistence#3877
icecrasher321 merged 5 commits intostagingfrom
fix/chunking-config-persist

Conversation

@icecrasher321
Copy link
Copy Markdown
Collaborator

Summary

Respect KB chunking configs set correctly.

Type of Change

  • Bug fix

Testing

Tested manually

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel
Copy link
Copy Markdown

vercel bot commented Apr 1, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Apr 1, 2026 2:15am

Request Review

@cursor
Copy link
Copy Markdown

cursor bot commented Apr 1, 2026

PR Summary

Medium Risk
Changes how document chunking parameters are sourced by removing per-request chunk sizing and always using the knowledge base’s stored chunkingConfig, which can affect how all uploaded/retried documents are parsed and embedded. Risk is moderate because it touches the document ingestion pipeline and request validation but doesn’t change auth or data access controls.

Overview
Document uploads now rely on the knowledge base’s persisted chunking configuration rather than client-provided chunk settings. The API schemas for bulk upload and upsert drop chunkSize/minCharactersPerChunk/chunkOverlap validation, making processingOptions optional and limited to recipe/lang.

Background processing is updated accordingly: processDocumentAsync no longer applies chunking overrides and always passes the KB’s stored chunking values to the document processor, and various upload/retry call sites now enqueue processing with {} options. UI upload flows are adjusted to stop sending chunking options, and the edit KB modal now displays the current chunking config.

Written by Cursor Bugbot for commit 003b32d. Configure here.

@icecrasher321
Copy link
Copy Markdown
Collaborator Author

bugbot run

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 1, 2026

Greptile Summary

This PR fixes a bug where knowledge base chunking configuration was not being respected during document processing — the frontend was passing client-side fallback values (e.g. chunkSize: chunkingConfig?.maxSize || 1024) that would override the actual KB-stored config, and the server was preferring per-request processingOptions over the database chunkingConfig. The fix removes chunk fields from all frontend upload calls, makes them optional in API schemas, and changes processDocumentAsync to always source chunking parameters exclusively from the KB's stored chunkingConfig in the database.

Key changes:

  • processDocumentAsync (service) now reads kbConfig.maxSize/overlap/minSize directly, dropping the processingOptions.chunkSize ?? kbConfig.maxSize fallback pattern
  • All bulk/upsert API schemas make chunkSize, chunkOverlap, minCharactersPerChunk, recipe, and lang optional (server-side fallback via DB)
  • add-documents-modal upload calls stripped of client-side chunk overrides
  • EditKnowledgeBaseModal gains a read-only display of the current chunking config, and the create-base-modal inputs get proper type="number" constraints
  • The v1 public API route removes its manual chunkingConfig ?? defaults fallback and passes {} to the queue

Findings:

  • The three chunk fields (chunkSize, chunkOverlap, minCharactersPerChunk) still present in processDocumentAsync's parameter type and accepted by the upsert API schema are now silently ignored — the type and schema should be trimmed or documented to avoid misleading API consumers
  • The kbConfig cast at line 459 of service.ts has no null-safety fallback; the previous v1 route's ?? { maxSize: 1024, minSize: 1, overlap: 200 } guard was removed, leaving legacy rows (created before the notNull constraint) unprotected
  • recipe and lang are still forwarded in payloads but are never consumed by processDocument, making them dead API surface

Confidence Score: 5/5

Safe to merge — the fix is correct and well-scoped; all remaining findings are P2 style/cleanup suggestions.

The core bug (client-side fallbacks overriding KB chunking config) is correctly fixed across all upload paths. No data integrity or correctness regressions were introduced. The three P2 findings (dead parameter fields, missing null fallback for an edge case already guarded by a DB notNull constraint, and unused recipe/lang fields) are cleanup opportunities that do not block correct behavior.

apps/sim/lib/knowledge/documents/service.ts — dead processingOptions chunk fields and missing null-safety fallback for kbConfig.

Important Files Changed

Filename Overview
apps/sim/lib/knowledge/documents/service.ts Core fix: processDocumentAsync now always reads chunking params from the KB's stored chunkingConfig instead of request-supplied processingOptions; however, the processingOptions parameter still declares chunk fields that are now dead code, and there is no null-safety fallback if chunkingConfig is absent.
apps/sim/app/api/knowledge/[id]/documents/route.ts All processingOptions chunking fields made optional in BulkCreateDocumentsSchema; comment corrected (chunkOverlap unit changed from characters to tokens).
apps/sim/app/api/knowledge/[id]/documents/upsert/route.ts All chunking fields in UpsertDocumentSchema made optional; validated values are still forwarded to processDocumentsWithQueue but are now silently ignored by processDocumentAsync.
apps/sim/app/api/v1/knowledge/[id]/documents/route.ts Removed manual chunkingConfig fallback and explicit processing options; now passes empty {} to processDocumentsWithQueue, letting processDocumentAsync pull config from DB directly.
apps/sim/app/workspace/[workspaceId]/knowledge/hooks/use-knowledge-upload.ts Chunk values removed from upload payload; only recipe and lang remain in processingOptions, using null-safe conditional spread for chunk fields when provided.
apps/sim/app/workspace/[workspaceId]/knowledge/[id]/components/add-documents-modal/add-documents-modal.tsx Removed client-side chunking overrides (chunkSize, minCharactersPerChunk, chunkOverlap) from both the retry and initial upload calls; only recipe: 'default' remains.
apps/sim/app/workspace/[workspaceId]/knowledge/components/edit-knowledge-base-modal/edit-knowledge-base-modal.tsx New read-only display of chunking config (max size, min size, overlap) added to the edit modal when chunkingConfig prop is provided.
apps/sim/app/workspace/[workspaceId]/knowledge/components/create-base-modal/create-base-modal.tsx Numeric input fields for chunk size, min chunk size, and overlap gain proper type="number" with min/max/step attributes; custom name attributes removed in favour of id-based registration.
apps/sim/app/workspace/[workspaceId]/knowledge/components/base-card/base-card.tsx Optional chunkingConfig prop added to BaseCardProps and forwarded to EditKnowledgeBaseModal.
apps/sim/app/workspace/[workspaceId]/knowledge/knowledge.tsx chunkingConfig from activeKnowledgeBase now passed to EditKnowledgeBaseModal so the modal can display current chunking settings.

Sequence Diagram

sequenceDiagram
    participant UI as AddDocumentsModal
    participant Hook as useKnowledgeUpload
    participant API as POST /api/knowledge/[id]/documents
    participant Queue as processDocumentsWithQueue
    participant Worker as processDocumentAsync
    participant DB as Database (KB chunkingConfig)

    UI->>Hook: uploadFiles(files, kbId, { recipe: 'default' })
    Note over UI,Hook: chunkSize / chunkOverlap / minCharactersPerChunk<br/>no longer sent (bug fix)
    Hook->>API: POST { documents, processingOptions: { recipe, lang }, bulk: true }
    API->>Queue: processDocumentsWithQueue(docs, kbId, processingOptions, reqId)
    Queue->>Worker: dispatchDocumentProcessingJob(payload)
    Worker->>DB: SELECT chunkingConfig FROM knowledgeBase WHERE id = kbId
    DB-->>Worker: { maxSize, minSize, overlap }
    Note over Worker: Always uses kbConfig values (bug fix)<br/>processingOptions chunk fields ignored
    Worker->>Worker: processDocument(url, name, mime, kbConfig.maxSize, kbConfig.overlap, kbConfig.minSize, ...)
Loading

Comments Outside Diff (2)

  1. apps/sim/lib/knowledge/documents/service.ts, line 419-425 (link)

    P2 Dead processingOptions chunking params after refactor

    After this PR, processDocumentAsync always reads chunking values from the database (kbConfig.maxSize, kbConfig.overlap, kbConfig.minSize) and completely ignores processingOptions.chunkSize, processingOptions.chunkOverlap, and processingOptions.minCharactersPerChunk. The parameter shape now implies these fields are usable, but they're silently discarded on every code path.

    This creates a misleading contract: callers (including the upsert API route that still passes validatedData.processingOptions.chunkSize etc.) can supply these values, pass Zod validation, and get no error — but the values are never applied. Consider either removing the dead fields from the type:

    processingOptions: {
      recipe?: string
      lang?: string
    }

    …or, if per-document overrides should remain possible in the future, restore the ?? kbConfig.xxx fallback pattern and document the intentional removal.

  2. apps/sim/app/workspace/[workspaceId]/knowledge/hooks/use-knowledge-upload.ts, line 1021-1022 (link)

    P2 recipe and lang are sent but never consumed

    recipe and lang are included in every upload payload and validated by the server-side Zod schema, but processDocumentAsync never passes them to processDocument(...) — the processor function signature doesn't accept them. This was true before this PR as well, but now that the chunking overrides are gone the only surviving fields in processingOptions are these two unused ones.

    If recipe/lang are not planned for near-term use, removing them from the payload and schemas would reduce noise and prevent confusion for API consumers.

Reviews (1): Last reviewed commit: "remove dead code" | Re-trigger Greptile

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

@icecrasher321 icecrasher321 merged commit b95a049 into staging Apr 1, 2026
6 checks passed
@icecrasher321 icecrasher321 deleted the fix/chunking-config-persist branch April 1, 2026 02:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant