Files & Retrieval is a processing chain, not a single switch. Upload limits, extraction, OCR, full-context injection, vectorization, semantic recall, RAG assembly, and context compaction all affect final answer quality. Administrators should know which setting controls which part of the chain and what happens when a dependency is unavailable.
Processing Chain
After a user uploads a file, the system checks upload limits, stores and identifies the file, extracts text or runs OCR, then chooses full-context injection or retrieval. If extraction, indexing, or retrieval is unavailable, the system falls back to limited full text, image/attachment context, or the user question only.
From an administrator's perspective, the chain has six stages:
- Check file count, size, type, and user quota.
- Store the file and identify type.
- Extract text, parse documents, or run OCR.
- Decide between full context and vector retrieval.
- Recall relevant content for the conversation.
- Compress context when needed while preserving key information.
Each administrator setting maps to one control point in this chain.
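The path a file takes through the chain can be sketched as a single decision function. This is a minimal illustration, not the product's implementation. `FileState` and `route_file` are hypothetical names. How failures map to fallbacks is also an assumption drawn from the description above: no readable text leads to attachment context or the question only, and unavailable retrieval leads to limited full text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileState:
    # Hypothetical snapshot of one uploaded file as it moves through the chain.
    passed_upload_check: bool
    extracted_text: Optional[str]  # None when extraction and OCR produced nothing
    fits_full_context: bool        # result of the full-context limit checks
    index_ready: bool              # vector index exists and is not stale or failed
    is_image: bool = False


def route_file(state: FileState) -> str:
    """Return the context path for one file, following the fallback order (assumed)."""
    if not state.passed_upload_check:
        return "rejected"               # upload limits stop the file entirely
    if state.extracted_text is None:
        # No readable text: images can still go in as attachments;
        # otherwise only the user question reaches the model.
        return "attachment_context" if state.is_image else "question_only"
    if state.fits_full_context:
        return "full_context"           # small file: inject the whole text
    if state.index_ready:
        return "rag"                    # large file: retrieve relevant chunks
    return "limited_full_text"          # retrieval unavailable: truncated text
```

Each branch corresponds to one control point, which is why an answer-quality problem can usually be traced to the first stage that did not behave as expected.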
Configuration Overview
Open Files & Retrieval in the admin console. Configure the groups in chain order: upload limits, extraction and OCR, full context, vector retrieval, semantic enhancement, RAG, and context compression.
| Group | Controls | If Missing |
|---|---|---|
| Upload limits | Whether files can enter the system | Upload may fail or storage may grow uncontrolled. |
| Extraction | Whether files become readable text | Users can preview files, but the model may not answer from text. |
| OCR | Whether images and scanned PDFs become text | Image text and scanned PDFs may not be searchable. |
| Full context | Whether small files enter context directly | Small files may not be injected in full, or large files may crowd out other context. |
| Vector retrieval | Whether large or multiple files can be retrieved by chunk | RAG is unavailable; the system falls back to full text or the user question only. |
| Semantic enhancement | Whether historical conversation chunks are recalled | Long conversations lose earlier semantic context. |
| RAG | How evidence chunks are filtered and injected | Too little recall misses facts; too much adds noise. |
| Context compression | How long context is preserved | Long sessions may exceed budget or lose early facts. |
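Some groups depend on others: RAG and semantic enhancement need Embedding, and semantic context recall needs message embedding. The sketch below flags switches that are on while their prerequisite is off. The `RetrievalSettings` field names are hypothetical stand-ins for the console labels, not a real configuration schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RetrievalSettings:
    # Hypothetical names for "Enable Embedding", "Message embedding",
    # "Semantic context recall", and "RAG switch".
    embedding_enabled: bool = False
    message_embedding: bool = False
    semantic_recall: bool = False
    rag_enabled: bool = False


def dependency_problems(s: RetrievalSettings) -> List[str]:
    """List switches that are enabled while the setting they depend on is off."""
    problems = []
    if s.rag_enabled and not s.embedding_enabled:
        problems.append("RAG requires Embedding")
    if s.message_embedding and not s.embedding_enabled:
        problems.append("Message embedding requires Embedding")
    if s.semantic_recall and not s.message_embedding:
        problems.append("Semantic context recall requires message embedding")
    return problems
```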
Upload Limits
| Setting | Function | Guidance |
|---|---|---|
| Attachments per message | Limits files attached to one message. | Prevent one message from triggering too much processing. |
| Default size limit | Default single-file size limit, shown in MB. | Set for common documents, then override images/documents separately. |
| User storage quota | Total storage per user; 0 means unlimited. | Recommended for team or public deployments. |
| MIME allowlist | Allows only selected file types. | Open only business-needed types and test extraction. |
| Image size limit | Overrides image attachment size. | Tighten when OCR or vision cost is high. |
| Document size limit | Overrides document attachment size. | Large documents should usually use RAG. |
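One way the upload checks could combine is sketched below. Per-type overrides take precedence over the default size limit, and a quota of 0 means unlimited, as the table states. These details are assumptions: the field names, treating every non-image as a "document", an empty allowlist allowing all types, and MB meaning 2**20 bytes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

MB = 1024 * 1024  # assumption: the console's "MB" means 2**20 bytes


@dataclass
class UploadLimits:
    # Hypothetical field names mirroring the Upload Limits settings.
    max_attachments: int = 5
    default_size_mb: float = 20
    image_size_mb: Optional[float] = None     # None: fall back to the default limit
    document_size_mb: Optional[float] = None  # None: fall back to the default limit
    user_quota_mb: float = 0                  # 0 means unlimited
    mime_allowlist: Set[str] = field(default_factory=set)  # empty: allow every type (assumption)


def size_cap_mb(limits: UploadLimits, mime: str) -> float:
    """Pick the per-file cap: an image/document override wins over the default."""
    if mime.startswith("image/"):
        override = limits.image_size_mb
    else:
        override = limits.document_size_mb  # assumption: every non-image is a "document"
    return override if override is not None else limits.default_size_mb


def check_upload(limits: UploadLimits, files: List[Tuple[str, int]], used_mb: float) -> List[str]:
    """files holds (mime_type, size_in_bytes); returns rejection reasons, empty if accepted."""
    errors = []
    if len(files) > limits.max_attachments:
        errors.append(f"too many attachments ({len(files)} > {limits.max_attachments})")
    total_mb = used_mb
    for mime, size in files:
        if limits.mime_allowlist and mime not in limits.mime_allowlist:
            errors.append(f"type not allowed: {mime}")
            continue
        cap = size_cap_mb(limits, mime)
        if size > cap * MB:
            errors.append(f"{mime} file exceeds {cap} MB")
        total_mb += size / MB
    if limits.user_quota_mb and total_mb > limits.user_quota_mb:
        errors.append("user storage quota exceeded")
    return errors
```

Collecting every reason, instead of stopping at the first, makes it easier to tell users why an upload was rejected.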
Extraction and OCR
Extraction decides whether files become readable text. Different services fit different files:
| Service Type | Best For | Notes |
|---|---|---|
| General document extraction | PDF, Word, sheets, slides, text | Extracts text from ordinary documents. |
| OCR | Images and scanned PDFs | Recognizes image text and scanned pages. |
| Complex document parsing | Layout-heavy PDFs, reports, papers | Preserves structure, paragraphs, and layout better. |
| LLM OCR | Difficult images or documents | Higher cost; limit it to the files that need it. |
| Setting | Function | Guidance |
|---|---|---|
| Extraction engine | Built-in, Tika, Docling, or MinerU. | Built-in is lightweight; Tika covers common documents; Docling/MinerU fit complex PDFs. |
| Tika/Docling/MinerU address, key, timeout | Connects document extraction services. | Required when selecting the corresponding engine. |
| MinerU source | Cloud or self-hosted service. | Match your deployment and credentials. |
| Image OCR | Sends image attachments through OCR. | Enable for screenshots, receipts, and image text. |
| PDF OCR fallback | Runs OCR when PDF text extraction fails or is poor. | Enable for scanned PDFs. |
| OCR engine | Rapid OCR, Tesseract, Paddle, Tencent, Aliyun, or LLM OCR. | Local engines lower cost; cloud engines improve stability; LLM OCR fits complex layouts. |
| OCR service settings | Address, key, timeout, region, endpoint, model, prompt as required. | Required fields depend on the selected OCR engine. |
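The two OCR switches interact with extraction roughly as sketched below. Images go to OCR when image OCR is on. PDFs go to OCR only when PDF OCR fallback is on and the extracted text looks poor. The "poor" test shown here (too few characters per page) is a hypothetical heuristic, not the product's actual rule, and the names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OcrSettings:
    image_ocr: bool = False
    pdf_ocr_fallback: bool = False
    # Hypothetical heuristic: below this many extracted characters per page,
    # the PDF text layer is treated as poor (likely a scanned document).
    min_chars_per_page: int = 50


def needs_ocr(settings: OcrSettings, mime: str, extracted_text: Optional[str], page_count: int = 1) -> bool:
    """Decide whether a file should be sent to the configured OCR engine."""
    if mime.startswith("image/"):
        return settings.image_ocr
    if mime == "application/pdf" and settings.pdf_ocr_fallback:
        chars = len(extracted_text.strip()) if extracted_text else 0
        return chars < settings.min_chars_per_page * max(page_count, 1)
    return False  # other document types rely on the extraction engine alone
```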
Full Context
Full-context injection puts file text directly into the model context. It gives the model the complete text and works well for small files, contracts, short code, precise review, and paragraph-by-paragraph analysis. The risk is context pressure: very large files can crowd out the user question, history, or other evidence.
Recommendation: use full context for small files and vector retrieval for large files. Do not force a long PDF into full context when the user asks about only one part of it.
| Setting | Function | Guidance |
|---|---|---|
| Full-context limits | Enables size, token, and PDF page thresholds. | Recommended to prevent large files from crowding context. |
| Full-text size limit | Maximum text bytes for full injection; 0 means unlimited. | Set for small documents; large files should use RAG. |
| Full-context token limit | Token budget for full injection; 0 means unlimited. | Leave room for user question, history, and evidence. |
| Full-context PDF page limit | Maximum PDF pages for full injection; 0 means unlimited. | Multi-page PDFs usually fit RAG better. |
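The three full-context thresholds can be read as independent gates, each disabled by 0. The sketch below assumes a file qualifies only if it passes every active gate. The 4-characters-per-token estimate is a rough stand-in, and a real deployment should count with the model's tokenizer. The function names are hypothetical.

```python
import math
from typing import Optional


def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token for English text.
    return math.ceil(len(text) / 4)


def fits_full_context(text: str, max_bytes: int = 0, max_tokens: int = 0,
                      pdf_pages: Optional[int] = None, max_pdf_pages: int = 0,
                      limits_enabled: bool = True) -> bool:
    """Apply the size, token, and PDF page thresholds; 0 means unlimited for each."""
    if not limits_enabled:
        return True
    if max_bytes and len(text.encode("utf-8")) > max_bytes:
        return False  # size limit counts bytes, so non-ASCII text uses more of it
    if max_tokens and estimate_tokens(text) > max_tokens:
        return False
    if max_pdf_pages and pdf_pages is not None and pdf_pages > max_pdf_pages:
        return False
    return True
```

A file that fails any gate would then move on to retrieval, or to a fallback if retrieval is unavailable.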
Vector Retrieval
Vector retrieval chunks files, creates embeddings, and builds indexes. It is the core path for knowledge bases, multiple files, large documents, and long-term material management. If the embedding model or vector dimensions change, do not mix old and new vectors; rebuild the index.
| Setting | Function | Guidance |
|---|---|---|
| Enable Embedding | Enables file vectorization and retrieval. | Required before RAG or semantic enhancement. |
| Embedding address and key | Connects the embedding service. | Must match model and dimensions. |
| Embedding model | Model used for vectorization. | Changing it marks old vectors stale. |
| Embedding timeout | Wait time for one embedding operation. | Increase for remote or large batches. |
| Vector dimensions | Dimension used for writes and search. | Must match model output; changes require reindex. |
| Normalize vectors | Normalizes vectors before storage/search. | Keep consistent across indexing and search. |
| Auto vectorization | Triggers async embedding after upload. | Recommended for retrieval readiness. |
| Batch size | Text count per embedding batch. | Balance throughput and timeout risk. |
| Chunk size | Token-estimated chunk size. | Too small breaks context; too large hurts recall precision. |
| Chunk overlap | Overlap between adjacent chunks. | Helps avoid cutting important facts. |
| Top K | Number of final chunks returned for injection. | Start small, then increase as needed. |
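Chunk size and chunk overlap work as a sliding window. Each chunk holds up to `chunk_size` tokens and repeats the last `overlap` tokens of the previous chunk, so a fact near a boundary appears whole in at least one chunk. In this sketch a pre-split word list stands in for the product's token estimate (an assumption), and `chunk_tokens` is a hypothetical name.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks
```

For example, ten words with `chunk_size=4` and `overlap=1` yield three chunks, and each chunk starts with the last word of the previous one. A larger overlap protects more boundary context but produces more chunks to embed.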
Index Status
| Status | Meaning | Action |
|---|---|---|
| Ready | Index works with current config. | Retrieval is usable. |
| Stale | Model or dimension changed. | Rebuild index. |
| Pending | Vectorization is queued or running. | Wait before testing retrieval. |
| Failed | Vectorization failed. | Check extracted text, embedding service, and chunk settings. |
| Empty | No index exists yet. | Configure Embedding and upload sample files. |
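The status table can be derived from metadata recorded when the index was built, compared against the current Embedding configuration. The sketch assumes the index stores its model name, dimensions, and job state. `IndexRecord` and its fields are hypothetical, not a real schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexRecord:
    # Hypothetical metadata stored alongside an index when it was built.
    model: Optional[str] = None      # None: no index has been built yet
    dimensions: Optional[int] = None
    job_state: str = "done"          # "queued", "running", "done", or "failed"


def index_status(record: IndexRecord, current_model: str, current_dims: int) -> str:
    if record.job_state in ("queued", "running"):
        return "Pending"
    if record.job_state == "failed":
        return "Failed"
    if record.model is None:
        return "Empty"
    if record.model != current_model or record.dimensions != current_dims:
        return "Stale"  # vectors from another model or dimension must not be mixed
    return "Ready"
```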
Semantic Enhancement and RAG
Semantic enhancement recalls related chunks from earlier conversation, while RAG retrieves evidence from files. Message embedding must be enabled before semantic context recall can work. For RAG, too few chunks miss facts; too many add noise and consume context. Start tuning from a small, high-quality recall set.
| Setting | Function | Guidance |
|---|---|---|
| Message embedding | Embeds conversation messages after each turn. | Required before semantic context recall. |
| Semantic context recall | Recalls historical related chunks. | Useful for long sessions and ongoing projects. |
| RAG switch | Allows files to enter retrieval-augmented context. | Requires Embedding. |
| Similarity threshold | Minimum similarity for injected chunks. | Too high may return empty; too low adds noise. |
| Injection token budget | Limits evidence token usage. | Reserve room for prompt, history, and current question. |
| Fetch multiplier | Fetches more candidates before filtering. | Useful when documents are noisy. |
| Ready wait time | Waits for vectorization before sending. | Helps immediately after upload; too high slows replies. |
| Query history turns | Adds recent user turns to retrieval query. | Improves multi-turn follow-ups. |
| Retrieval cache TTL | Caches retrieval results. | Helps repeated questions; avoid long TTL for changing files. |
| Hybrid retrieval | Combines BM25 and vector retrieval. | Good for exact terms, numbers, and identifiers. |
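The RAG filters can be applied in sequence. First over-fetch `top_k * fetch_multiplier` candidates by similarity. Then drop anything under the similarity threshold. Finally keep the best chunks until Top K or the token budget is reached. This is a minimal in-memory sketch using cosine similarity. It does not cover hybrid BM25 retrieval or caching, and `Chunk` and `select_evidence` are hypothetical names. In a real vector store, the multiplier matters because it widens the candidate pool before filtering.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Chunk:
    text: str
    vector: Sequence[float]
    tokens: int


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_evidence(query_vec: Sequence[float], chunks: List[Chunk], top_k: int = 4,
                    fetch_multiplier: int = 3, threshold: float = 0.5,
                    token_budget: int = 1500) -> List[Chunk]:
    # 1. Over-fetch candidates (top_k * fetch_multiplier), best similarity first.
    scored = sorted(((cosine(query_vec, c.vector), c) for c in chunks),
                    key=lambda pair: pair[0], reverse=True)
    candidates = scored[: top_k * fetch_multiplier]
    # 2. Drop weak matches; an empty result means nothing is injected.
    candidates = [(s, c) for s, c in candidates if s >= threshold]
    # 3. Keep the best chunks until top_k or the token budget is reached.
    selected, used = [], 0
    for _, c in candidates:
        if len(selected) == top_k:
            break
        if used + c.tokens > token_budget:
            continue  # skip a chunk that does not fit; a smaller one still might
        selected.append(c)
        used += c.tokens
    return selected
```

Raising `threshold` makes empty results more likely, which matches the table's warning. Lowering it lets more loosely related chunks compete for the token budget.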
Context Compression
Context compression helps long conversations and large evidence sets keep important information. The compression model should be stable, inexpensive, and good at summarization. Async compression reduces blocking, but summaries may update slightly later. Test that facts, numbers, conclusions, and open tasks survive compression.
| Setting | Function | Guidance |
|---|---|---|
| Token budget truncation | Trims history based on model context window. | Recommended. |
| Compression turn threshold | Triggers compression by conversation turns. | Lower for long-running conversations. |
| Compression token threshold | Triggers compression by token size. | Keep below model context limit. |
| Recent turns preserved | Keeps recent raw turns before summarizing older content. | Preserve current task details. |
| Highlights per role | Number of summary items per role. | Balance detail and budget. |
| Snippet character limit | Max characters per compressed snippet. | Prevent one long message from dominating summary. |
| Evidence retention days | Retains RAG, history, and tool evidence; 0 means no auto-expiry. | Adjust for audit and storage needs. |
| LLM compression | Uses a model for semantic summaries. | Higher quality with extra cost. |
| Async compression | Runs compression in the background. | Reduces reply blocking, with slight delay. |
| Compression model | Model used for summaries. | Use a stable, inexpensive summarization model. |
| Failure threshold | Falls back to template summaries after repeated failures. | Protects the main chat flow. |
| Summary prompts | Custom full/light summary instructions. | Preserve facts, numbers, constraints, and open tasks. |
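Several compression settings combine as sketched below. Either threshold can trigger compression, and the most recent turns stay verbatim while older turns are summarized. Snippets are capped by character count. After repeated LLM failures, the summarizer switches to template summaries. Several details are assumptions: 0 disabling a trigger, each list item being one turn, and "repeated failures" meaning consecutive failures reset by a success. All names are hypothetical.

```python
from typing import List, Tuple


def should_compress(turns: int, tokens: int, turn_threshold: int, token_threshold: int) -> bool:
    """Either threshold can trigger compression; 0 disables that trigger (assumption)."""
    by_turns = turn_threshold > 0 and turns >= turn_threshold
    by_tokens = token_threshold > 0 and tokens >= token_threshold
    return by_turns or by_tokens


def split_for_summary(turns: List[str], recent_turns: int) -> Tuple[List[str], List[str]]:
    """Keep the last `recent_turns` turns verbatim; older ones go to the summarizer."""
    if recent_turns <= 0:
        return turns, []
    return turns[:-recent_turns], turns[-recent_turns:]


def clip_snippet(text: str, limit: int) -> str:
    """Cap one snippet so a single long message cannot dominate the summary."""
    return text[:limit]


class CompressionGuard:
    """Switch to template summaries after repeated LLM compression failures."""

    def __init__(self, failure_threshold: int):
        self.failure_threshold = failure_threshold
        self.failures = 0  # consecutive failures (assumption)

    def mode(self) -> str:
        return "template" if self.failures >= self.failure_threshold else "llm"

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1
```

Because the guard only changes which summarizer runs, a failing compression model degrades summary quality without blocking the main chat flow.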
Fallback Rules
| Condition | Behavior | Check |
|---|---|---|
| Upload check fails | File is rejected. | Count, size, MIME, quota. |
| Extraction unavailable | File remains previewable but has no searchable text. | Engine, address, timeout. |
| OCR unavailable | Image or scanned text cannot become searchable text. | OCR switches and engine settings. |
| Full context exceeds limits | The system uses retrieval instead, or falls back if retrieval is unavailable. | Size, token, PDF page limits. |
| Embedding missing | RAG and semantic recall are unavailable. | Embedding address, model, dimensions. |
| Index stale or failed | Retrieval is incomplete or unreliable. | Reindex, extracted text, chunk settings. |
| RAG returns empty | No chunks are injected, rather than irrelevant ones. | Similarity threshold, top K, hybrid retrieval. |
| Context too long | Truncation or compression runs. | Token budget and compression settings. |
Recommended Setup Path
- Configure upload count, size, MIME, image/document limits, and user quota.
- Configure extraction and test common file types.
- Enable image OCR and PDF OCR fallback only when needed.
- Enable full-context limits for small documents.
- Configure Embedding model, dimensions, chunks, and auto vectorization.
- Upload sample files and confirm that the index status is Ready.
- Enable RAG and tune similarity, top K, token budget, and hybrid retrieval.
- For long sessions, enable message embedding, semantic recall, and context compression.
Practical Tips
Do not enable everything at once. Stabilize upload, preview, and extraction first, then add OCR, vectorization, and RAG. After changing the embedding model, vector dimensions, chunking, or extraction engine, test with sample files. When users say the model cannot answer from a file, check upload limits, extraction result, index status, RAG settings, and context budget in that order.
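The troubleshooting order in the tip can be expressed as an ordered list of checks that stops at the first failure. The `diag` keys (`upload_ok`, `extracted_chars`, `index_status`, `chunks_injected`, `prompt_tokens`, `context_window`) are hypothetical diagnostic fields, not values the console is known to expose. Also, the index check only applies to files that took the retrieval path.

```python
from typing import Callable, List, Optional, Tuple

# Each check returns None when it passes, or a short description of the problem.
Check = Callable[[dict], Optional[str]]


def first_failure(diag: dict, checks: List[Tuple[str, Check]]) -> Optional[str]:
    """Run checks in order and report the first stage that fails."""
    for stage, check in checks:
        problem = check(diag)
        if problem:
            return f"{stage}: {problem}"
    return None


# Order follows the tip: upload limits, extraction, index status, RAG, context budget.
CHECKS: List[Tuple[str, Check]] = [
    ("upload limits", lambda d: None if d.get("upload_ok") else "file was rejected"),
    ("extraction", lambda d: None if d.get("extracted_chars", 0) > 0 else "no extracted text"),
    ("index status", lambda d: None if d.get("index_status") == "Ready"
        else f"index is {d.get('index_status')}"),
    ("RAG settings", lambda d: None if d.get("chunks_injected", 0) > 0
        else "no chunks passed the filters"),
    ("context budget", lambda d: None if d.get("prompt_tokens", 0) <= d.get("context_window", 0)
        else "prompt exceeds the context window"),
]
```

Stopping at the first failure reflects the chain's dependencies. For example, an empty extraction result also explains a failed index and empty retrieval, so later checks add nothing until it is fixed.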