Knowledge ingestion — drop a folder, get RAG
The Knowledge Ingestion pipeline classifies and embeds your DOCX, PDF, PPTX, MD files (or a public GitHub repo URL) into the same RAG that powers feasibility-report citations.
The 125-document RAG that backs the platform is not a closed catalogue. The Knowledge Ingestion pipeline at /admin/ingest lets you drop your own documents in and have them participate in feasibility-report citations within a few minutes.
Supported sources
- Drag-drop files: .docx, .pdf, .pptx, .md, .txt, .html, .geojson. Up to 50 files at once, max 50 MB each.
- GitHub repo URL: paste a public repo (e.g. https://github.com/owner/name) and the pipeline clones it, walks the README + docs/ + .md files, and ingests them as a single classified bundle.
- URLs: any public web page (HTML readability extraction, no scraping anything behind auth).
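The drag-drop limits above can be checked before anything is uploaded. The following is a minimal sketch, not the platform's actual validator: the function name, the constant names, and the choice of binary megabytes are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Limits taken from the list above; the names are illustrative.
ALLOWED_EXTENSIONS = {".docx", ".pdf", ".pptx", ".md", ".txt", ".html", ".geojson"}
MAX_FILES_PER_BATCH = 50
MAX_FILE_BYTES = 50 * 1024 * 1024  # 50 MB, assuming binary megabytes


def validate_batch(files: list[tuple[str, int]]) -> list[str]:
    """Return human-readable problems for a batch of (filename, size_in_bytes) pairs."""
    problems = []
    if len(files) > MAX_FILES_PER_BATCH:
        problems.append(f"too many files: {len(files)} > {MAX_FILES_PER_BATCH}")
    for name, size in files:
        # Compare extensions case-insensitively so "REPORT.PDF" is accepted.
        ext = PurePosixPath(name).suffix.lower()
        if ext not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported extension {ext or '(none)'}")
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: {size} bytes exceeds the 50 MB limit")
    return problems
```

An empty list means the batch is acceptable; otherwise each entry names the offending file, which makes it easy to surface all problems at once instead of failing on the first.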
The pipeline, step by step
- Upload. The file lands in the resiland-uploads MinIO bucket. A document_ingestion_jobs row is created with status=pending.
- Parse. Background worker pulls the file, extracts text using format-specific parsers (python-docx for DOCX, pdfminer for PDF, python-pptx for PPTX, markdown for MD).
- Classify. Opus 4.7 reads the first ~2K tokens and emits a structured tag bundle:
  - kind: framework / guideline / report / glossary / dataset / other.
  - relevance: score 0-100 against the RESILAND CA+ scope.
  - tags: 3–8 free-form descriptors (ndvi, uzbekistan, species-fit, etc.).
  - language: detected.
  - summary: 200-word abstract.
- Chunk. Semantic chunking, ~400 tokens per chunk, 50-token overlap. Markdown headers preserved as chunk metadata.
- Embed. Voyage-3 (1024-dim) per chunk, batched 64 at a time. Stored in document_chunks table with a pgvector vector(1024) column.
- Index. A pgvector HNSW index over the embedding column, plus a trigram index over the source filename for fuzzy lookup.
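To make the Chunk and Embed numbers concrete, here is a rough sketch of overlapping windows and batching. It deliberately swaps in fixed-size sliding windows for the semantic chunking the pipeline uses, and it assumes the text has already been tokenized into a list; the function names are hypothetical.

```python
from typing import Iterator

CHUNK_TOKENS = 400   # target chunk size from the Chunk step
OVERLAP_TOKENS = 50  # tokens shared by neighbouring chunks
EMBED_BATCH = 64     # chunks per embedding request


def chunk_tokens(tokens: list[str], size: int = CHUNK_TOKENS,
                 overlap: int = OVERLAP_TOKENS) -> list[list[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def batched(items: list, batch_size: int = EMBED_BATCH) -> Iterator[list]:
    """Yield consecutive slices of at most `batch_size` items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

Each batch yielded by `batched(chunk_tokens(tokens))` would then be sent to the embedding model in one request. A real semantic chunker would move the window boundaries to sentence or heading breaks rather than cutting at a fixed token count.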
End-to-end ingestion takes about 30 seconds per MB on a single-CPU VPS. A 100-page PDF is ingested in ~3 minutes.
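As a back-of-the-envelope check on those figures, the helper below applies the quoted rate linearly. It is only an estimate under the assumption that time scales with file size; the function name is illustrative.

```python
SECONDS_PER_MB = 30  # throughput quoted for a single-CPU VPS


def estimate_ingest_seconds(size_mb: float) -> float:
    """Linear estimate of end-to-end ingestion time for a file of `size_mb` megabytes."""
    return size_mb * SECONDS_PER_MB
```

At this rate a 6 MB file takes 180 seconds, so the ~3 minute figure corresponds to a 100-page PDF of roughly 6 MB.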
What you can do once it is in
- Cite from new sources. The next feasibility report you draft can pull chunks from your uploaded docs and cite them — same numeric markers, same click-to-source UX.
- Search. The chat companion (P6) runs hybrid retrieval (semantic + keyword) over your full corpus.
- Export. Each ingested doc gets a stable URL — share it, link it from a report, anchor citations to it.
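The text does not say how the semantic and keyword results are merged for hybrid retrieval. One common option is reciprocal rank fusion, sketched below purely as an illustration; the function name and the constant k=60 are assumptions, not details of the platform.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk ids; ids ranked high in several lists come first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank), so top ranks count the most.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties fall back to the id for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))
```

Given a semantic ranking `["c1", "c2", "c3"]` and a keyword ranking `["c3", "c1", "c4"]`, the chunks that appear in both lists (`c1` and `c3`) move ahead of those found by only one retriever.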
Visibility & multi-tenant
Documents are owned by the uploader and default to org-private. To make a document available across all orgs (system-default), an admin sets is_system_default=true. The original 125-doc baseline is system-default; your uploads are private to your org unless you flip the flag.
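The visibility rule can be stated as a single predicate. This is a sketch under assumptions: only is_system_default comes from the text, while the Document class, the org_id field, and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    org_id: str                      # owning org (hypothetical field name)
    is_system_default: bool = False  # flag named in the text above


def visible_to(doc: Document, viewer_org_id: str) -> bool:
    """A document is visible if it is system-default or owned by the viewer's org."""
    return doc.is_system_default or doc.org_id == viewer_org_id
```

In a real deployment the same condition would more likely live in the retrieval query's filter, so private chunks from other orgs never reach the ranking step.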
Costs
- Voyage-3 embeddings: the free tier covers 200M tokens/month, well past typical use. A 100-page PDF comes to ~30K tokens to embed.
- Opus classification: ~$0.05 per document (one Opus call, cached prompt).
- Storage: the file in MinIO + chunks + embeddings in Postgres. ~3 KB per chunk in pgvector. A 100-page PDF takes ~600 KB total.
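To see how far the quoted figures stretch, the arithmetic can be written out. The sketch below uses only the numbers from the list above; the function and constant names are illustrative, and real token counts will vary by document.

```python
FREE_TIER_TOKENS_PER_MONTH = 200_000_000  # Voyage-3 free tier quoted above
TOKENS_PER_100_PAGE_PDF = 30_000          # quoted size of a 100-page PDF
CLASSIFY_COST_CENTS = 5                   # ~$0.05 per document, kept in cents


def pdfs_within_free_tier() -> int:
    """How many 100-page PDFs fit in one month of free embedding tokens."""
    return FREE_TIER_TOKENS_PER_MONTH // TOKENS_PER_100_PAGE_PDF


def classification_cost_usd(documents: int) -> float:
    """Approximate Opus classification spend for a number of documents."""
    return documents * CLASSIFY_COST_CENTS / 100
```

On these numbers the free embedding tier covers about 6,666 such PDFs a month, while classifying 100 documents costs roughly $5, so classification rather than embedding is the cost to watch.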
Limits
- 50 MB per file (configurable via UPLOAD_MAX_*_MB env vars).
- 10 ingestion jobs per hour per user, to keep the Voyage tier from being abused.
- Languages: English, Turkish, Russian, Uzbek (Latin and Cyrillic) tested. Other languages embed fine but classification quality drops.
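The per-user job limit above could be enforced with a sliding window. The sketch below is an in-memory illustration only; whether the platform uses a sliding or a fixed hourly window is not stated, and the class and method names are hypothetical.

```python
import time
from collections import defaultdict, deque

MAX_JOBS = 10          # jobs allowed per window, per user
WINDOW_SECONDS = 3600  # one hour


class JobRateLimiter:
    """In-memory sliding-window limiter keyed by user id."""

    def __init__(self, max_jobs=MAX_JOBS, window=WINDOW_SECONDS):
        self.max_jobs = max_jobs
        self.window = window
        self._history = defaultdict(deque)  # user id -> timestamps of accepted jobs

    def allow(self, user_id, now=None):
        """Record and accept a job if the user is under the limit, else reject it."""
        now = time.monotonic() if now is None else now
        stamps = self._history[user_id]
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()  # forget jobs that have left the window
        if len(stamps) >= self.max_jobs:
            return False
        stamps.append(now)
        return True
```

A multi-worker deployment would need shared state (for example a database table or a cache) instead of a per-process dictionary, since each worker here would count jobs separately.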
Last updated 26/04/2026