I wanted to internalise how Retrieval-Augmented Generation actually works, not just glue together a tutorial. So I built a POC: a chatbot that answers natural-language questions about Canadian federal tax legislation — and only that legislation — with citations back to the section the answer came from. The corpus: the Canada Revenue Agency Act (~150 KB, 112 sections) and the Income Tax Act (~13 MB, ~3,700 pages). Both are distributed by the Department of Justice as XML in a structured legal schema (LIMS).
The goal was learning. Every architectural decision was made through that lens: surface metrics liberally, automate the testing, document the tradeoffs, iterate visibly through a real eval loop. What follows is a write-up of the parts that surprised me — the parts I’ll carry into the next RAG project.
The setup
Stack:
- Laravel 13 + AWS SDK PHP — Laravel because it’s my primary toolchain.
- AWS Bedrock for both Titan Embeddings v2 and Claude Sonnet 4.5 — IAM was already set up.
- Postgres + pgvector as the vector store, with HNSW indexing.
- Blade + Alpine.js + CDN-loaded marked + DOMPurify for the chat UI. No build step. No SPA.
Each piece was a deliberate choice over a flashier alternative.
DIY RAG, not Bedrock Knowledge Bases. Knowledge Bases collapses ingest, embed, retrieve, and generate behind one managed call — but the OpenSearch Serverless backend has a ~$200/month idle floor. Wrong shape for a POC. Also, the whole point was to see the chunker, the retriever, the prompt. A managed pipeline hides the parts I wanted to learn from.
pgvector, not Pinecone or Weaviate. Postgres was already there for sessions and cache. Adding the vector extension is one CREATE EXTENSION. At my scale (~21K chunks), HNSW retrieval is sub-50ms; pgvector handles millions of vectors comfortably. Cross-document filtering becomes a WHERE document_id IN (...) clause rather than a vendor-specific API. Zero idle cost.
Two Bedrock calls per question, not one. Each query: embed → vector search → chat. Could have collapsed retrieval and generation, but keeping them separate gives per-stage timing, observability, and tunability. A bad embed shows as poor retrieval. A good embed plus a bad answer shows in the model’s stop reason or refusal flag. The whole “journey modal” diagnostic UI I built later is only possible because each stage is independently observable.
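To make the per-stage observability concrete, here is a minimal Python sketch of the embed → vector search → chat shape with wall-clock timing around each stage. It is not the project's code (that is PHP/Laravel); the function names and the stand-in stages are hypothetical, so it runs without Bedrock or Postgres.

```python
import time
from typing import Callable, Dict, List, Tuple


def answer_with_timings(
    question: str,
    embed: Callable[[str], List[float]],
    search: Callable[[List[float]], List[str]],
    generate: Callable[[str, List[str]], str],
) -> Tuple[str, Dict[str, float]]:
    """Run embed -> search -> generate, recording wall-clock ms per stage."""
    timings: Dict[str, float] = {}

    start = time.perf_counter()
    vector = embed(question)
    timings["embed_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    chunks = search(vector)
    timings["search_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    answer = generate(question, chunks)
    timings["generate_ms"] = (time.perf_counter() - start) * 1000

    return answer, timings


# Hypothetical stand-ins for the real Bedrock and pgvector calls.
def fake_embed(text: str) -> List[float]:
    return [float(len(text)), 1.0]


def fake_search(vector: List[float]) -> List[str]:
    return ["s. 16(1)"]


def fake_generate(question: str, chunks: List[str]) -> str:
    return f"Answer citing {', '.join(chunks)}"


answer, timings = answer_with_timings(
    "What qualifications must directors have?", fake_embed, fake_search, fake_generate
)
print(answer, timings)
```

Because each stage is a separate call, a slow or poor stage shows up in its own entry rather than inside one opaque total.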
Structure-aware chunking, not fixed-size. Legal text has meaning at element boundaries. A naïve “split every 800 characters” chunker fragments definitions mid-sentence and loses the citation context (s. 4(2)) that lets the model cite back. The chunker walks the LIMS schema (<Statute> → <Body> → <Section> → <Subsection> → <Paragraph>), threads citation paths through nesting (s. 4(2)(a)(i)), and prepends each chunk’s heading hierarchy and marginal note before embedding. That last detail — embedding with the citation context, not just the prose — is what makes citation-style queries like “what does s. 4(2) say?” match semantically.
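A rough Python sketch of that walk, assuming a heavily simplified XML fragment. The element names beyond those listed above (Label, Text, MarginalNote, Heading, TitleText, Subparagraph) and the sample content are assumptions for illustration; the real chunker is PHP and handles far more of the schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical, simplified fragment; real LIMS files carry more elements.
SAMPLE = """
<Statute>
  <Body>
    <Heading><TitleText>Example heading</TitleText></Heading>
    <Section>
      <Label>4</Label>
      <MarginalNote>Example marginal note</MarginalNote>
      <Subsection>
        <Label>(1)</Label>
        <Text>First subsection text.</Text>
      </Subsection>
      <Subsection>
        <Label>(2)</Label>
        <Text>Second subsection text:</Text>
        <Paragraph>
          <Label>(a)</Label>
          <Text>first paragraph text;</Text>
        </Paragraph>
      </Subsection>
    </Section>
  </Body>
</Statute>
"""

NESTED = ("Subsection", "Paragraph", "Subparagraph")


def walk(node: ET.Element, citation: str, context: str, chunks: List[Dict[str, str]]) -> None:
    """Emit one chunk per element with text, threading the citation path down."""
    label = (node.findtext("Label") or "").strip()
    # Section labels are bare ("4"); nested labels already carry parentheses ("(2)").
    citation = f"s. {label}" if node.tag == "Section" else citation + label
    text = (node.findtext("Text") or "").strip()
    if text:
        # Prepend heading, marginal note, and citation so they are embedded with the prose.
        chunks.append({"citation": citation, "text": f"{context} | {citation}\n{text}"})
    for child in node:
        if child.tag in NESTED:
            walk(child, citation, context, chunks)


def chunk_statute(xml_text: str) -> List[Dict[str, str]]:
    root = ET.fromstring(xml_text)
    chunks: List[Dict[str, str]] = []
    heading = ""
    for element in root.find("Body"):
        if element.tag == "Heading":
            heading = (element.findtext("TitleText") or "").strip()
        elif element.tag == "Section":
            note = (element.findtext("MarginalNote") or "").strip()
            context = " > ".join(part for part in (heading, note) if part)
            walk(element, "", context, chunks)
    return chunks


chunks = chunk_statute(SAMPLE)
for chunk in chunks:
    print(chunk["citation"])
```

Each chunk carries its own citation string, so a query that mentions "s. 4(2)" has matching text to land on.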
The eval harness, before any tuning
The single best decision of the project was building the eval harness first.
Three artisan commands:
- eval:baseline runs each seed question through the system and saves the system’s canonical answer + retrieved citations as the baseline.
- eval:variants has Claude generate 8 paraphrased variants per seed (vary formality, length, word order, include a typo, include an indirect phrasing).
- eval:run asks the live RAG each variant, then grades against the baseline using both citation overlap (Jaccard on chunk IDs) and an LLM-as-judge that returns PASS / PARTIAL / FAIL with a one-line reason.
10 seeds × 8 variants = 80 graded queries per run. The full suite costs about $0.10–$0.15 in Bedrock spend and takes ~5 minutes. A --seed filter cuts a single-seed iteration to ~30 seconds and ~$0.05.
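The citation-overlap half of the grading is plain Jaccard similarity on sets of chunk IDs. A minimal Python sketch follows; the chunk IDs are made up, and treating two empty sets as a perfect match is my assumption rather than something the harness is known to do.

```python
from typing import Iterable


def jaccard(baseline_ids: Iterable[int], candidate_ids: Iterable[int]) -> float:
    """Size of the intersection divided by size of the union of the two ID sets."""
    a, b = set(baseline_ids), set(candidate_ids)
    if not a and not b:
        return 1.0  # assumption: nothing retrieved on either side counts as a match
    return len(a & b) / len(a | b)


# Hypothetical chunk IDs: 3 shared out of 5 distinct IDs overall.
overlap = jaccard([101, 102, 103, 104], [102, 103, 104, 205])
print(overlap)  # 0.6
```

Jaccard only says whether the same chunks came back; it says nothing about whether the answer text is equivalent, which is why the LLM judge runs alongside it.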
The choice that matters is the grading: baseline-as-ground-truth, not hand-curated expected answers. I considered hand-writing “expected facts” per seed (substrings the answer must contain). I chose baseline instead because:
- I don’t have to know the right answer in advance. For 21K chunks of legal text, hand-curating ground truth for 10 seeds is hours of work. Letting the system generate a canonical answer and then approving it is faster, and it forces me to read what the system produces.
- It anchors against the system itself. A regression test compares “what did this do before” to “what does it do after a change.” That’s the right axis when the goal is don’t break what works.
- Baseline incompleteness becomes a separate finding. If the baseline is wrong, the eval surfaces it eventually — variants either pass against a wrong baseline (so you re-baseline) or fail (so you investigate).
This is good enough for a POC. For regulated production you’d add hand-rolled fact assertions for critical seeds, a grounding verifier model, and human evaluation. But none of those are blockers to learning.
The “regression” that wasn’t
I started with the small corpus (CRA Act only): 94% pass rate, 75 PASS / 2 PARTIAL / 3 FAIL. The system worked.
Then I ingested the Income Tax Act — 73× larger. The eval immediately dropped to 79%. That looked like a 15-point regression.
Reading the per-seed breakdown told a different story:
| | Before | After |
|---|---|---|
| PASS | 75 | 63 |
| PARTIAL | 2 | 15 |
| FAIL | 3 | 2 |
Real failures decreased by one. The PARTIAL count exploded.
Two compounding effects, both pedagogically valuable:
The bar moved. With more documents to choose from, baseline answers became more elaborate — they pulled in more citations and broader scope. The judge measures equivalence to the baseline, not absolute correctness. A stricter, broader baseline is mechanically harder for a paraphrased variant to match.
Cross-document blur. When someone asks “What qualifications must directors have?”, the retriever pulls s. 16(1) from the CRA Act (the actual qualifications rule) AND s. 227.1(3) and s. 55(3.2) from the Income Tax Act (about director liability for unremitted source deductions). Semantically related, conceptually distinct. The model dutifully synthesizes both into a hybrid baseline. Variants that retrieve only the qualifications chunk look “incomplete” to the judge.
The cross-document blur is a real architectural property, not just an eval artifact. Pure vector search can’t tell “directors who must be qualified” apart from “directors who are liable for tax debts.” Both contain the word “directors”; both are about responsibilities.
The lesson: don’t optimize for the headline pass-rate; optimize for the per-seed shape. The aggregate number was misleading. The breakdown made the failure mode obvious.
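A small Python sketch of the difference between the two views. The seed names and grades below are invented purely to show the shape; they are not results from this project.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Result = Tuple[str, str]  # (seed name, grade)


def pass_rate(results: List[Result]) -> float:
    """Headline number: share of graded variants marked PASS."""
    return sum(1 for _, grade in results if grade == "PASS") / len(results)


def per_seed_breakdown(results: List[Result]) -> Dict[str, Dict[str, int]]:
    """Grade counts per seed, which is where a concentrated failure shows up."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for seed, grade in results:
        table[seed][grade] += 1
    return {seed: dict(counts) for seed, counts in table.items()}


# Hypothetical data: one healthy seed, one seed that mostly grades PARTIAL.
results = [("seed-a", "PASS")] * 8 + [("seed-b", "PASS")] * 2 + [("seed-b", "PARTIAL")] * 6

print(pass_rate(results))
print(per_seed_breakdown(results))
```

The headline here is a single mediocre number, while the breakdown shows that every non-PASS grade belongs to one seed.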
The optimisations that actually moved the needle
Two surprises hidden in this list.
HNSW non-determinism. While debugging eval inconsistencies, I noticed that running the same query twice could return different top-K orderings. pgvector’s HNSW is approximate nearest-neighbour search, controlled by hnsw.ef_search (default: 40). At 21K chunks, that candidate set was small enough that consecutive runs of the same query could return slightly different rankings. Several days of eval interpretation had been polluted by this run-to-run variance — I’d been blaming the prompt and the model.
Setting hnsw.ef_search = 100 per query gave identical results across 5 repeated runs. Cost: ~10–30 ms extra per query at this scale. Negligible.
The lesson: when eval results swing between runs without a real change, suspect the index, not the prompt or model. Approximate search is a tunable accuracy/latency dial. On small corpora the default is fine; on larger ones you have to raise it.
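A cheap way to test for this before blaming the prompt is to run the same retrieval several times and compare the orderings. The Python sketch below does that against stand-in retrievers. The SQL strings use pgvector's `SET LOCAL hnsw.ef_search` setting and `<=>` cosine-distance operator, but the table name, column names, and placeholder style are assumptions, and nothing here touches a database.

```python
import itertools
from typing import Callable, Sequence

# Statements a retriever might issue inside one transaction.
# "chunks", "id", "embedding", and the %s placeholder are hypothetical.
SEARCH_SQL = (
    "SET LOCAL hnsw.ef_search = 100;",
    "SELECT id FROM chunks ORDER BY embedding <=> %s::vector LIMIT 8;",
)


def is_stable(retrieve: Callable[[str], Sequence[int]], query: str, runs: int = 5) -> bool:
    """True if every repeated run returns the same IDs in the same order."""
    first = list(retrieve(query))
    return all(list(retrieve(query)) == first for _ in range(runs - 1))


# Stand-in retrievers: one consistent, one that alternates between two orderings.
def stable_retrieve(query: str) -> Sequence[int]:
    return [1, 2, 3]


_orderings = itertools.cycle([[1, 2, 3], [1, 3, 2]])


def flaky_retrieve(query: str) -> Sequence[int]:
    return next(_orderings)


print(is_stable(stable_retrieve, "example query"))
print(is_stable(flaky_retrieve, "example query"))
```

Run against the real retriever on a handful of seeds, a check like this separates index variance from prompt or model variance in a few seconds.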
Hybrid retrieval did not help. I built vector + Postgres FTS via Reciprocal Rank Fusion expecting +5pp. The eval said:
- Vector only: 89% (best measured).
- Vector + FTS, equal weight: 82%.
- Vector + FTS, vector 2× weight: 81%.
- Trigram channel evaluated separately, dropped — too noisy on long natural-language queries.
I kept the infrastructure as opt-in (RAG_RETRIEVAL_MODE=hybrid) for experimentation. Default stayed vector.
The lesson: architectural additions need to prove themselves on the actual eval, not on hand-picked failing cases. Adding channels can dilute precision. The hybrid case looked good when I cherry-picked the few queries vector-only got wrong, but those wins came at the cost of regressions on queries vector-only handled cleanly.
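For reference, weighted Reciprocal Rank Fusion fits in a few lines. This Python sketch uses the conventional constant k = 60 and made-up chunk IDs; the channel names, the weights argument, and the ID tie-break are my assumptions, not the project's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def reciprocal_rank_fusion(
    rankings: Dict[str, Sequence[int]],
    weights: Optional[Dict[str, float]] = None,
    k: int = 60,
) -> List[int]:
    """Fuse ranked ID lists: each channel adds weight / (k + rank) to an ID's score."""
    weights = weights or {}
    scores: Dict[int, float] = defaultdict(float)
    for channel, ranked_ids in rankings.items():
        weight = weights.get(channel, 1.0)
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] += weight / (k + rank)
    # Highest score first; ties broken by ID so the output is reproducible.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical channel outputs: only chunk 2 appears in both.
rankings = {"vector": [1, 2, 3], "fts": [4, 2, 5]}

equal = reciprocal_rank_fusion(rankings)
vector_heavy = reciprocal_rank_fusion(rankings, weights={"vector": 2.0})
print(equal)         # [2, 1, 4, 3, 5]
print(vector_heavy)  # [2, 1, 3, 4, 5]
```

With equal weights, the full-text channel's top hit (chunk 4) ties with the vector channel's top hit and lands ahead of the vector channel's third result. That is the dilution mechanism in miniature: a second channel can displace chunks the first one ranked correctly.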
A 5-word seed edit beat 300 lines of hybrid retrieval. Two seeds were chronically failing because their wording activated cross-document blur. The original director-qualifications seed — “What qualifications must directors have?” — semantically activated three concepts: qualifications, fiduciary duty, and director tax liability. Changing it to “What experience and capacity must a person have to be appointed as a director?” — language closer to the actual statutory phrasing — took the seed from 0/8 to 8/8.
Same trick on previous-name: “What was the agency known as before being renamed?” drifted into Tax Act amalgamation territory in paraphrases. “What was the Canada Revenue Agency previously called before it was renamed?” — anchored on the explicit “Canada Revenue Agency” — survived paraphrasing into “CRA” or “the Agency” without losing the discriminator. Took the seed from 4/8 to 8/8, including a typo variant that still preserved “Canada Revenue.”
The general principle: discriminating keywords in seeds survive paraphrasing; generic keywords drift. If your eval seed uses a word that could plausibly appear in any document in your corpus, expect cross-document blur in the variants.
Cost and latency
Per query at the final settings (top_k=8, ~3K input / ~600 output tokens, Sonnet 4.5):
| Component | Cost |
|---|---|
| Embed query (~12 input tokens, Titan v2) | ~$0.0000002 |
| Vector search (pgvector HNSW) | $0 (local) |
| Claude generation | ~$0.018 |
| Total per question | ~$0.018 |
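The table's figures can be reproduced with simple token arithmetic. The per-million-token prices in this Python sketch are assumptions chosen to be consistent with the numbers above; check current Bedrock pricing before relying on them.

```python
from typing import Dict

# Assumed USD prices per million tokens (not taken from the article).
SONNET_INPUT_PER_M = 3.00
SONNET_OUTPUT_PER_M = 15.00
TITAN_EMBED_PER_M = 0.02


def query_cost(embed_tokens: int, input_tokens: int, output_tokens: int) -> Dict[str, float]:
    """Cost of one question: one embedding call plus one generation call."""
    embed = embed_tokens * TITAN_EMBED_PER_M / 1_000_000
    generation = (
        input_tokens * SONNET_INPUT_PER_M + output_tokens * SONNET_OUTPUT_PER_M
    ) / 1_000_000
    return {"embed": embed, "generation": generation, "total": embed + generation}


cost = query_cost(embed_tokens=12, input_tokens=3_000, output_tokens=600)
print(cost)
```

Under these assumed prices, input and output each contribute about $0.009 to the generation call, and the embedding cost is a rounding error.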
For comparison: stuffing the whole Income Tax Act into every prompt would cost ~$4.50 per question. At ~$0.018 per question, RAG buys a ~250× cost reduction at this corpus size. That gap is the headline business case for RAG.
Total Bedrock spend across the entire POC — ingest of both documents, ~12 eval runs, manual testing across the build:
Under $10.
That low number is the meta-lesson on its own: if you build the diagnostic surfaces and automate the testing, you can iterate freely without watching a meter.
Latency, end-to-end:
| Stage | Time |
|---|---|
| Browser → Laravel | <5 ms |
| Embed (Bedrock Titan, cross-region) | 150–250 ms |
| Vector search (HNSW, 21K chunks) | 30–50 ms |
| Claude generation (~600 tokens) | 1.5–3 s |
| Total | ~2–3 s |
The model call dominates. Shortening generation time is the only real lever for latency.
Final state
- 91% eval pass rate, 0 FAILs, 7 PARTIALs.
- Pass-rate trajectory: 94% → 79% → 89% → 91%.
- The 7 PARTIALs are coverage variance: the model produces correct content with different breadth than the baseline. None are wrong answers.
If I were taking this to production, the next things I’d add are: hand-rolled fact assertions for critical seeds (catches the case where baseline and candidate are both wrong); a grounding verifier model that scores whether every claim in the answer is supported by the retrieved chunks; document scoping enforced server-side rather than as a UI nicety; per-document RBAC. None of those are blockers to learning RAG. All of them are blockers to shipping it.
What I’d take to the next project
In rough order of value:
- Surface metrics liberally and visibly. Token counts, distances, confidence labels, refusal flags, per-chunk rank, channel pills. Every diagnostic surface saved me from a debugging round.
- Build the eval before tuning anything. Without it, you’re chasing single-question failures and missing systemic patterns.
- Per-seed breakdown beats headline pass-rate. Always. Aggregate numbers hide regression signal.
- Don’t add architectural complexity until eval proves it helps. Hybrid retrieval is the case study — built it, measured it, kept the simpler default.
- Suspect the index, not the prompt, when results swing between runs. HNSW non-determinism cost me hours of confused interpretation before I found the cause.
- Discriminating keywords in seeds survive paraphrasing. Generic keywords drift across documents. Applies to user-facing UX too — a multi-doc system needs strong query disambiguation.
- Document the why, not just the what. The issues log, the architecture doc, and the project retrospective are the highest-leverage artifacts for future engineering — including future-me in three months.
The whole experience reinforced something I keep relearning: AI-augmented work is fastest when you build the loop first — surfaces, eval, iteration script — and treat the model itself as the cheapest, most replaceable piece. The model’s a function call. The loop is the engineering.