Knowledge ingestion — drop a folder, get RAG

The Knowledge Ingestion pipeline classifies and embeds your DOCX, PDF, PPTX, and MD files (or the docs in a public GitHub repo) into the same RAG corpus that powers feasibility-report citations.

The 125-document RAG that backs the platform is not a closed catalogue. The Knowledge Ingestion pipeline at /admin/ingest lets you drop your own documents in and have them participate in feasibility-report citations within a few minutes.

Supported sources

Drag-and-drop files: .docx, .pdf, .pptx, .md, .txt, .html, .geojson. Up to 50 files at once, max 50 MB each.
GitHub repo URL: paste a public repo (e.g. https://github.com/owner/name) and the pipeline clones it, walks the README + docs/ + .md files, and ingests them as a single classified bundle.
URLs: any public web page (readability-style HTML extraction; pages behind authentication are not scraped).
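The file limits above can be expressed as a small pre-upload check. This is a minimal sketch, not the platform's actual validation code: the names validate_batch, ALLOWED_EXTENSIONS, MAX_FILES_PER_BATCH, and MAX_FILE_BYTES are illustrative, and it assumes "50 MB" means 50 MiB.

```python
from pathlib import PurePosixPath

# Values taken from the list above; all names here are hypothetical.
ALLOWED_EXTENSIONS = {".docx", ".pdf", ".pptx", ".md", ".txt", ".html", ".geojson"}
MAX_FILES_PER_BATCH = 50
MAX_FILE_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" is read as 50 MiB


def validate_batch(files: list[tuple[str, int]]) -> list[str]:
    """Return a list of problems; an empty list means the batch is acceptable.

    `files` is a list of (filename, size_in_bytes) pairs.
    """
    problems = []
    if len(files) > MAX_FILES_PER_BATCH:
        problems.append(f"too many files: {len(files)} > {MAX_FILES_PER_BATCH}")
    for name, size in files:
        # Compare extensions case-insensitively so "plan.PDF" is accepted.
        ext = PurePosixPath(name).suffix.lower()
        if ext not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported extension {ext or '(none)'}")
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: {size} bytes exceeds the per-file limit")
    return problems
```

Returning every problem at once, instead of raising on the first one, lets a drag-and-drop UI report all rejected files in a single pass.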

The pipeline, step by step

Upload. The file lands in the resiland-uploads MinIO bucket. A document_ingestion_jobs row is created with status=pending.
Parse. A background worker pulls the file and extracts text using format-specific parsers (python-docx for DOCX, pdfminer for PDF, python-pptx for PPTX, markdown for MD).
Classify. Opus 4.7 reads the first ~2K tokens and emits a structured tag bundle:
- kind: framework / guideline / report / glossary / dataset / other.
- relevance: score 0-100 against the RESILAND CA+ scope.
- tags: 3–8 free-form descriptors (ndvi, uzbekistan, species-fit, etc.).
- language: detected.
- summary: 200-word abstract.
Chunk. Semantic chunking, ~400 tokens per chunk, 50-token overlap. Markdown headers preserved as chunk metadata.
Embed. Voyage-3 (1024-dim) per chunk, batched 64 at a time. Stored in the document_chunks table with a pgvector vector(1024) column.
Index. A pgvector HNSW index over the embedding column, plus a trigram index over the source filename for fuzzy lookup.
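The tag bundle emitted by the Classify step could be modelled and validated along these lines. This is a sketch under assumptions: the five field names and the allowed values come from the list above, but the JSON wire format, the TagBundle class, and the parse_tag_bundle helper are hypothetical.

```python
import json
from dataclasses import dataclass

# Allowed values for `kind`, as listed in the Classify step.
KINDS = {"framework", "guideline", "report", "glossary", "dataset", "other"}


@dataclass(frozen=True)
class TagBundle:
    kind: str
    relevance: int          # 0-100 score against the project scope
    tags: tuple[str, ...]   # 3 to 8 free-form descriptors
    language: str
    summary: str

    def __post_init__(self):
        # Reject bundles that fall outside the documented ranges.
        if self.kind not in KINDS:
            raise ValueError(f"unknown kind: {self.kind!r}")
        if not 0 <= self.relevance <= 100:
            raise ValueError(f"relevance out of range: {self.relevance}")
        if not 3 <= len(self.tags) <= 8:
            raise ValueError(f"expected 3 to 8 tags, got {len(self.tags)}")


def parse_tag_bundle(raw_json: str) -> TagBundle:
    """Parse a classifier response, assuming it is a flat JSON object."""
    data = json.loads(raw_json)
    return TagBundle(
        kind=data["kind"],
        relevance=int(data["relevance"]),
        tags=tuple(data["tags"]),
        language=data["language"],
        summary=data["summary"],
    )
```

Validating at parse time means a malformed classifier response fails the job early, before any chunks are embedded.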

End-to-end, ingestion takes roughly 30 seconds per MB on a single-CPU VPS. A 100-page PDF is typically ingested in ~3 minutes.
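The Chunk and Embed steps above can be approximated with a fixed-size sliding window and a simple batcher. Note that this is a simplification: the pipeline uses semantic chunking, whereas this sketch only illustrates the ~400-token size, the 50-token overlap, and the batches of 64. The function names are illustrative, and tokens are assumed to be already produced by some tokenizer.

```python
from typing import Iterator, Sequence

CHUNK_TOKENS = 400      # target chunk size from the text above
OVERLAP_TOKENS = 50     # tokens shared between consecutive chunks
EMBED_BATCH_SIZE = 64   # chunks sent to the embedding model per request


def sliding_chunks(
    tokens: Sequence[str],
    size: int = CHUNK_TOKENS,
    overlap: int = OVERLAP_TOKENS,
) -> list[list[str]]:
    """Split tokens into windows of `size`, each overlapping the previous by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
        start += stride
    return chunks


def batched(items: Sequence, batch_size: int = EMBED_BATCH_SIZE) -> Iterator[list]:
    """Yield consecutive batches of at most `batch_size` items."""
    for i in range(0, len(items), batch_size):
        yield list(items[i:i + batch_size])
```

With these defaults, each chunk starts 350 tokens after the previous one, so the last 50 tokens of one chunk reappear as the first 50 of the next.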

What you can do once it is in

Cite from new sources. The next feasibility report you draft can pull chunks from your uploaded docs and cite them, with the same numeric markers and the same click-to-source UX.
Search. The chat companion (P6) runs hybrid retrieval (semantic + keyword) over your full corpus.
Export. Each ingested doc gets a stable URL: share it, link it from a report, or anchor citations to it.
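The text above does not say how the semantic and keyword result lists are combined. One common option is reciprocal rank fusion (RRF), sketched below purely as an illustration of hybrid retrieval; it is an assumption, not a description of the chat companion's actual ranking. The function name and the constant k=60 are illustrative choices.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one fused ranking.

    Each chunk scores 1 / (k + rank) per list it appears in; scores are summed.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


semantic_hits = ["a", "b", "c"]   # e.g. nearest neighbours from the vector index
keyword_hits = ["b", "d", "a"]    # e.g. matches from a keyword search
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

Chunks that appear in both lists ("a" and "b" here) rise above chunks found by only one retriever, which is the behaviour hybrid retrieval is usually after.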

Visibility & multi-tenancy

Documents are owned by the uploader and default to org-private. To make a document available across all orgs (system-default), an admin sets is_system_default=true. The original 125-doc baseline is system-default; your uploads are private to your org unless you flip the flag.
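The visibility rule above reduces to a one-line predicate. In this sketch only is_system_default comes from the text; the Document class, the doc_id and owner_org_id fields, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    owner_org_id: str                 # hypothetical: the uploader's organisation
    is_system_default: bool = False   # flag named in the text above


def is_visible_to(doc: Document, org_id: str) -> bool:
    """A document is visible if it is system-default or owned by the caller's org."""
    return doc.is_system_default or doc.owner_org_id == org_id


def visible_documents(docs: list[Document], org_id: str) -> list[Document]:
    """Filter a corpus down to what a given organisation may retrieve and cite."""
    return [d for d in docs if is_visible_to(d, org_id)]
```

Applying the same predicate at retrieval time, not only in the UI, would keep org-private chunks out of another organisation's citations.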

Costs

Voyage-3 embeddings: free tier covers 200M tokens/month, well beyond typical use. A 100-page PDF comes to ~30K tokens.
Opus classification: ~$0.05 per document (one Opus call, cached prompt).
Storage: the file in MinIO + chunks + embeddings in Postgres. ~3 KB per chunk in pgvector. A 100-page PDF takes ~600 KB total.
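As a rough back-of-the-envelope check, the figures above can be combined into a small estimator. This is a sketch under assumptions: it models chunking as a fixed 400-token window with 50-token overlap (semantic chunking will give somewhat different counts), takes the ~3 KB per chunk figure at face value, and uses illustrative function names.

```python
import math

CHUNK_TOKENS = 400
OVERLAP_TOKENS = 50
BYTES_PER_CHUNK = 3 * 1024                   # "~3 KB per chunk" from the text
CLASSIFY_COST_USD = 0.05                     # one classification call per document
FREE_TIER_TOKENS_PER_MONTH = 200_000_000


def estimate_chunks(total_tokens: int) -> int:
    """Number of fixed-size overlapping windows needed to cover the document."""
    if total_tokens <= 0:
        return 0
    if total_tokens <= CHUNK_TOKENS:
        return 1
    stride = CHUNK_TOKENS - OVERLAP_TOKENS
    return 1 + math.ceil((total_tokens - CHUNK_TOKENS) / stride)


def estimate_chunk_storage_bytes(total_tokens: int) -> int:
    """Approximate Postgres footprint of the chunk rows, excluding the original file."""
    return estimate_chunks(total_tokens) * BYTES_PER_CHUNK


def docs_within_free_tier(tokens_per_doc: int) -> int:
    """How many documents of this size fit in one month of the free embedding tier."""
    return FREE_TIER_TOKENS_PER_MONTH // tokens_per_doc
```

For a 30K-token document this gives 86 chunks and about 258 KB of chunk storage, before counting the original file in MinIO, and over 6,000 such documents would fit within the monthly free tier.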

Limits

50 MB per file (configurable via UPLOAD_MAX_*_MB env vars).
10 ingestion jobs per hour per user, to protect the Voyage free tier from abuse.
Languages: English, Turkish, Russian, and Uzbek (Latin and Cyrillic) are tested. Other languages embed fine, but classification quality drops.
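The per-user job limit above could be enforced with a sliding-window counter like the one below. This is an in-memory, single-process sketch with hypothetical names (IngestionRateLimiter, try_acquire); a deployment with several workers would need shared state, and the platform's actual mechanism is not described here.

```python
import time
from collections import defaultdict, deque
from typing import Optional

MAX_JOBS_PER_HOUR = 10
WINDOW_SECONDS = 3600


class IngestionRateLimiter:
    """Allow at most `max_jobs` ingestion jobs per user in any rolling window."""

    def __init__(self, max_jobs: int = MAX_JOBS_PER_HOUR, window: float = WINDOW_SECONDS):
        self.max_jobs = max_jobs
        self.window = window
        self._starts = defaultdict(deque)  # user_id -> start times of recent jobs

    def try_acquire(self, user_id: str, now: Optional[float] = None) -> bool:
        """Record a new job and return True, or return False if the user is over the limit."""
        now = time.monotonic() if now is None else now
        starts = self._starts[user_id]
        # Drop job timestamps that have aged out of the window.
        while starts and now - starts[0] >= self.window:
            starts.popleft()
        if len(starts) >= self.max_jobs:
            return False
        starts.append(now)
        return True
```

Passing `now` explicitly makes the limiter easy to test without waiting an hour; in normal use it can be omitted.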

Last updated 26/04/2026