Methodology
Built from a chat, line by line.
RAGTAG (Retrieval Augmented Graph Tax Answer Generator) takes a tax question, retrieves the relevant Finnish statutes, court rulings, and Vero guidance, and answers with citations. Every decision below is tied either to a paper or to a numbered point from Taxxa’s team chat on 23.05. We will present the demo live.
Chat · 23.05
- Pricing · #01
“€60 per user per month. Queries have to be cheap.”
- Corpus · #02
“Connect Finlex, Vero, case law. EU-lex out of scope.”
- Authority · #03
“Case laws refer to Finlex. Vero is just an interpreter. Case laws can overwrite Vero.”
- Retrieval · #04
“Can't bring 1,000 chunks per question. 25M-page DB.”
- Timeline · #05
“Active now, not then, not later.”
- Stack · #06
“Run RAG locally first. DeepSeek is good and cheap.”
- Extraction · #07
“Reference-extraction by regex / NLP is a good idea. We aren't doing it.”
Part I
What we tried. And which chat point killed it.
Bitemporal graph on Neo4j: every edge with its own validity window.
Schema from SAT-Graph RAG (JURIX 2025). The whole ontology in a graph database.
- SAT-Graph RAG · arXiv:2505.00039
- LRMoo · IFLA
Two days into the ontology, still zero answers. We kept the bitemporal idea and rebuilt it in SQLite as version_chain (see RAGTAG #08).
BGE-M3 hybrid retrieval: dense + sparse + ColBERT, fused at k=60.
BAAI's multilingual model with three signal heads. RRF for fusion, HyDE for English to Finnish query expansion.
- BGE-M3 · BAAI
- ColBERT · SIGIR 2020
- RRF · SIGIR 2009
- HyDE · Gao et al. 2022
Voyage voyage-3-large already ranked the right chunk in the top 30 on every eval question. Adding a second stack failed the cost test.
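For context, the fusion step of this abandoned design can be sketched in a few lines. This is a minimal sketch, assuming "k=60" is the smoothing constant of reciprocal rank fusion; the function name and the document ids are hypothetical and not taken from the repository.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical ranked lists from the three BGE-M3 signal heads.
dense = ["tvl-124", "avl-102", "kho-2025-46"]
sparse = ["avl-102", "tvl-124", "vero-ohje"]
colbert = ["avl-102", "kho-2025-46", "tvl-124"]

fused = rrf_fuse([dense, sparse, colbert])
```

The fusion itself is cheap; the cost objection above is about running and hosting a second retrieval stack, not about this arithmetic.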
Courtroom-style debate: prosecutor, defense, judge.
Three LLMs arguing per conflict, pattern from AgenticSimLaw.
- AgenticSimLaw · arXiv:2601.21936
- Multi-Agent Debate · NeurIPS 2023
Chat #03 handed us the rule directly. One integer compare resolves every conflict in our eval set; seven LLM turns cost real money.
63,660-node constellation: WebGL force layout of the whole corpus.
Attribution view inspired by Anthropic Circuit Tracer.
- @cosmos.gl/graph
- Anthropic Circuit Tracing · 2025
Judges need one answer's reasoning, not the corpus shape. The reasoning panel animates the 5 to 10 nodes that actually mattered.
Live SPARQL fallback: CRAG escalation into Semantic Finlex.
Self-RAG reflection tokens to enforce citation coverage at draft time.
- CRAG · arXiv:2401.15884
- Self-RAG · ICLR 2024
- data.finlex.fi/sparql
Cold SPARQL hit 9 seconds on the public endpoint. We can surface 'unsure' offline via AmendmentCaveat instead (RAGTAG #08).
EU-lex contradictions: primacy of EU law over national act.
Ingest EUR-Lex directives, model a `transposes` edge, surface EU vs national conflicts.
- EUR-Lex · hierarchy of norms
- Lex superior · UN OLA
Chat #02 declared this out of scope. Adding it later is not an architectural rewrite: the `transposes` edge type is already in the EdgeType enum, the authority-rank lattice extends to an EU tier with one number, and the strategy router treats it as another `cross_source` route. New corpus, not new infrastructure.
Part II
RAGTAG. Ten pieces. Each cites file paths or papers.
RAGTAG #01
Three deterministic extraction passes: structural, anchor, regex.
Edges are emitted by three rule-based passes over HTML and the document tree: structural (parent_of from the heading hierarchy), anchor (cross-references inside an <a href> attribute), and regex (text citations like '§ 102 AVL', 'KHO 2025:46'). No model in the batch graph build.
- src/extraction/structural_edges.py
- src/extraction/anchor_edges.py
- src/extraction/citations_regex.py
- src/extraction/definition_edges.py
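The regex pass can be illustrated with a minimal sketch. It covers only the two citation shapes quoted above ('§ 102 AVL' and 'KHO 2025:46'); the pattern names, the output tuples, and the sample sentence are hypothetical, and the real patterns in citations_regex.py are presumably broader.

```python
import re

# Hypothetical patterns for the two citation shapes named in the text.
STATUTE_RE = re.compile(r"§\s*(?P<section>\d+)\s+(?P<act>[A-ZÅÄÖ]{2,})\b")
CASE_RE = re.compile(r"\bKHO\s+(?P<year>\d{4}):(?P<number>\d+)\b")

def extract_citations(text):
    """Return (kind, normalized_citation) pairs found in plain text."""
    found = []
    for m in STATUTE_RE.finditer(text):
        found.append(("statute", f"{m['act']} § {m['section']}"))
    for m in CASE_RE.finditer(text):
        found.append(("case", f"KHO {m['year']}:{m['number']}"))
    return found

sample = "Ks. § 102 AVL ja ratkaisu KHO 2025:46."
citations = extract_citations(sample)
```

Because the pass is pure pattern matching, the same input always yields the same edges, which is what keeps the batch graph build model-free.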
RAGTAG #02
Two SQLite tables plus a local LanceDB: nodes, edges, chunks. Joined by section_id.
1,967,776 nodes and 2,180,769 typed edges live in two SQLite tables (nodes, edges). 402,088 embedded chunks live in LanceDB on the filesystem (no remote service). Everything joins on section_id with an O(1) lookup.
- scripts/load_graph.py (nodes, edges CREATE TABLE)
- findings/04a_index_sanity.md (402,088 chunks)
- findings/04b_load_report.md (1.97M nodes, 2.18M edges)
- src/indexing/vector_store.py (LanceDB)
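The join on section_id can be sketched with an in-memory SQLite database. Only the table names (nodes, edges) and the section_id key come from the text; every other column, the index, and the sample rows are assumptions for illustration, not the schema in scripts/load_graph.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (
    section_id     TEXT PRIMARY KEY,   -- shared key with the vector store
    source         TEXT NOT NULL,      -- hypothetical column
    authority_rank INTEGER NOT NULL    -- hypothetical column
);
CREATE TABLE edges (
    src_id    TEXT NOT NULL REFERENCES nodes(section_id),
    dst_id    TEXT NOT NULL REFERENCES nodes(section_id),
    edge_type TEXT NOT NULL
);
CREATE INDEX idx_edges_src ON edges(src_id, edge_type);
""")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [("finlex:avl-102", "finlex", 100), ("vero:ohje-1", "vero", 60)],
)
conn.execute(
    "INSERT INTO edges VALUES (?, ?, ?)",
    ("vero:ohje-1", "finlex:avl-102", "interprets"),
)

# A chunk hit from the vector store carries a section_id; use it to
# step into the graph and read the neighbours' metadata.
rows = conn.execute(
    """SELECT e.edge_type, n.section_id, n.authority_rank
       FROM edges e JOIN nodes n ON n.section_id = e.dst_id
       WHERE e.src_id = ?""",
    ("vero:ohje-1",),
).fetchall()
```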
RAGTAG #03
Section-anchored chunking: 800 to 1,500 tokens, 2,000 hard max, never split mid-citation.
The chunk unit is the SECTION (§). Children are greedily packed under their § head and never split across sentence, item, or citation boundaries. The result: every chunk carries its own legal anchor.
- pipeline/chunks.py
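The greedy packing can be sketched as follows. This is a simplified illustration, not the code in pipeline/chunks.py: tokens are approximated by whitespace-separated words, units are treated as already-indivisible strings, and the 800-token floor and the 2,000 hard max are not modelled. The function name and parameters are hypothetical.

```python
def pack_section(head, units, target_max=1500, count=lambda s: len(s.split())):
    """Greedily pack indivisible units into chunks, each prefixed by its § head."""
    chunks, current, size = [], [], count(head)
    for unit in units:
        n = count(unit)
        # Close the current chunk rather than split a unit across chunks.
        if current and size + n > target_max:
            chunks.append(" ".join([head] + current))
            current, size = [], count(head)
        current.append(unit)
        size += n
    if current:
        chunks.append(" ".join([head] + current))
    return chunks

head = "AVL 102 §"
units = [" ".join([w] * 700) for w in ("a", "b", "c")]  # three 700-word units
chunks = pack_section(head, units)
```

Repeating the head in every chunk is what lets each chunk carry its own legal anchor, at the price of a few duplicated tokens.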
RAGTAG #04
Multilingual embeddings via Voyage: voyage-3-large, 1,024-dim, asymmetric query / document.
Hosted but cheap. Asymmetric (input_type='query' vs 'document') to avoid the quality cliff Voyage warns about. Same embedding space carries Finnish, Swedish, and English.
- src/indexing/voyage_client.py (MODEL = voyage-3-large)
- voyageai.com
RAGTAG #05
Strategy router, six presets: case_law, recency, definition, cross_source, multi_hop, default.
A keyword and regex classifier on the question text picks one ExpansionStrategy. Each preset sets seed depth, edge types, BFS direction, max hops, and per-edge degree caps. Default falls back to vector-only retrieval.
- src/retrieval/strategy.py
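A keyword and regex router of this kind can be sketched as below. The six preset names and the ExpansionStrategy name come from the text; the fields, the preset values, and every pattern are hypothetical placeholders, not the rules in src/retrieval/strategy.py.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class ExpansionStrategy:
    name: str
    max_hops: int       # hypothetical field
    edge_types: tuple   # hypothetical field

# Preset values are illustrative only.
PRESETS = {
    "case_law": ExpansionStrategy("case_law", 1, ("cites", "interprets")),
    "recency": ExpansionStrategy("recency", 1, ("amends",)),
    "definition": ExpansionStrategy("definition", 1, ("defines",)),
    "cross_source": ExpansionStrategy("cross_source", 1, ("interprets",)),
    "multi_hop": ExpansionStrategy("multi_hop", 2, ("cites", "parent_of")),
    "default": ExpansionStrategy("default", 0, ()),  # vector-only retrieval
}

# First matching rule wins; order encodes priority.
RULES = [
    ("case_law", re.compile(r"\b(KHO|KVL)\b|ruling|precedent", re.I)),
    ("recency", re.compile(r"\b(latest|currently|amended)\b", re.I)),
    ("definition", re.compile(r"\b(define|definition|what is meant by)\b", re.I)),
    ("cross_source", re.compile(r"\b(conflict|contradict|overrule)", re.I)),
    ("multi_hop", re.compile(r"\b(refers? to|chain of)\b", re.I)),
]

def route(question):
    """Pick one preset from the question text alone; no model call."""
    for name, pattern in RULES:
        if pattern.search(question):
            return PRESETS[name]
    return PRESETS["default"]
```

For example, `route("What does KHO 2025:46 say about VAT deduction?")` selects the case_law preset, while a question with no trigger words falls through to default.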
RAGTAG #06
Bounded BFS with hub-skip: interprets_in > 30, cites_out > 15, parent_of_in > 50.
Default max_hops = 1. Hub nodes (widely cited statutes) are not expanded through. Final candidate set is truncated to fit a 25k-token context.
- src/retrieval/graph_expand.py
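The hub-skip rule can be sketched over a plain edge list. The three degree caps are the ones stated above; the function names, the (src, dst, edge_type) tuple layout, and the choice to walk edges in both directions are simplifying assumptions, and the linear scans stand in for indexed lookups.

```python
from collections import deque

# Caps from the text: (edge_type, direction) -> maximum degree before a node counts as a hub.
HUB_CAPS = {("interprets", "in"): 30, ("cites", "out"): 15, ("parent_of", "in"): 50}

def degree(edges, node, edge_type, direction):
    idx = 0 if direction == "out" else 1  # src for out-degree, dst for in-degree
    return sum(1 for e in edges if e[2] == edge_type and e[idx] == node)

def is_hub(edges, node):
    return any(degree(edges, node, t, d) > cap for (t, d), cap in HUB_CAPS.items())

def expand(edges, seeds, max_hops=1):
    """Bounded BFS: hubs stay in the result but are never expanded through."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, hops = frontier.popleft()
        if hops >= max_hops or is_hub(edges, node):
            continue
        for src, dst, _edge_type in edges:
            if src == node:
                nxt = dst
            elif dst == node:
                nxt = src
            else:
                continue
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, hops + 1))
    return seen

# A statute interpreted by 31 guidance nodes is a hub under the interprets_in cap.
edges = [(f"vero-{i}", "avl-102", "interprets") for i in range(31)]
edges.append(("vero-0", "kho-1", "cites"))
reached = expand(edges, ["vero-0"], max_hops=2)
```

Even with two hops allowed, the walk stops at the statute instead of fanning out to the other thirty guidance nodes that interpret it.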
RAGTAG #07
Two reranking paths, one cross-encoder: v2 uses bge-reranker-v2-m3, v1 uses metadata signals.
v2 runs BAAI/bge-reranker-v2-m3 (a multilingual cross-encoder) over 30 to 40 candidates and combines 0.6 cross-encoder + 0.3 cosine + 0.1 metadata. v1 (the default path in the API sidecar today) uses a metadata reranker: authority_rank, recency, term overlap.
- src/retrieval/cross_encoder_rerank.py (bge-reranker-v2-m3)
- src/retrieval/rerank.py (metadata reranker)
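The v2 blend is a weighted sum, sketched here with the 0.6 / 0.3 / 0.1 weights from the text. The sketch assumes all three signals are already scaled to [0, 1]; the function name, dictionary keys, and candidate values are hypothetical.

```python
def blended_score(cross_encoder, cosine, metadata, weights=(0.6, 0.3, 0.1)):
    """Weighted sum of three relevance signals, each assumed to lie in [0, 1]."""
    w_ce, w_cos, w_meta = weights
    return w_ce * cross_encoder + w_cos * cosine + w_meta * metadata

# Hypothetical candidates: "a" wins on the cross-encoder, "b" on cosine and metadata.
candidates = [
    {"id": "b", "ce": 0.4, "cos": 0.9, "meta": 1.0},
    {"id": "a", "ce": 0.9, "cos": 0.5, "meta": 0.2},
]
ranked = sorted(
    candidates,
    key=lambda c: blended_score(c["ce"], c["cos"], c["meta"]),
    reverse=True,
)
```

With these weights the cross-encoder dominates: candidate "a" scores 0.71 against 0.61 for "b", despite losing on the other two signals.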
RAGTAG #08
Temporal correctness: version_chain, as_of, AmendmentCaveat. All deterministic.
Every SECTION carries a chronological version_chain of muutetaan (amended), kumotaan (repealed), and lisätään (added) steps. GraphStore.text_at(section_id, as_of) plays it forward. Every cited chunk on a stale ancestor emits an AmendmentCaveat (suspect, stale, repealed). A separate check_temporal_mismatches function compares the drafted answer against the section's chain via difflib; no LLM in this check.
- src/indexing/graph_store.py (text_at)
- src/models.py (VersionStep, AmendmentCaveat)
- src/agents/verifier.py (check_temporal_mismatches)
- src/retrieval/pipeline_v2.py (wires both)
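Playing a version chain forward can be sketched as below. VersionStep and text_at are names from the text, but the fields, the free-function signature (the source describes a GraphStore method keyed by section_id), and the sample chain are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class VersionStep:
    effective: date   # hypothetical field: date the step enters into force
    op: str           # "lisätään" (added), "muutetaan" (amended), "kumotaan" (repealed)
    text: str = ""    # section wording after the step; empty for a repeal

def text_at(chain, as_of):
    """Replay steps in date order up to as_of; None means not in force."""
    current = None
    for step in sorted(chain, key=lambda s: s.effective):
        if step.effective > as_of:
            break
        current = None if step.op == "kumotaan" else step.text
    return current

chain = [
    VersionStep(date(2020, 1, 1), "lisätään", "v1 wording"),
    VersionStep(date(2023, 1, 1), "muutetaan", "v2 wording"),
    VersionStep(date(2025, 1, 1), "kumotaan"),
]
```

Under these assumptions, `text_at(chain, date(2024, 6, 1))` yields the amended wording and any date from 2025 onward yields None, which is the signal a caveat such as "repealed" would hang on.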
RAGTAG #09
Authority is one integer: Finlex 100, Treaty 90, KHO 80, Vero 60.
Ranks are assigned at ingestion from source / source_subcorpus and stored on every node. Conflict surfacing compares the integer; the team's lattice (Finlex over Vero, KHO can overwrite Vero) drops out of this directly.
- src/extraction/authority.py
- findings/03_authority_ranks.md
- Lex superior · UN OLA
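The integer compare can be sketched directly from the four ranks above. The dictionary keys and the function name are hypothetical; the real assignment happens at ingestion in src/extraction/authority.py.

```python
# Ranks from the text; the lowercase source keys are illustrative.
AUTHORITY_RANK = {"finlex": 100, "treaty": 90, "kho": 80, "vero": 60}

def prevailing(source_a, source_b):
    """Return the source that prevails in a conflict, or None when ranks tie."""
    rank_a, rank_b = AUTHORITY_RANK[source_a], AUTHORITY_RANK[source_b]
    if rank_a == rank_b:
        return None
    return source_a if rank_a > rank_b else source_b
```

Both rules from the team chat fall out of the same comparison: `prevailing("vero", "finlex")` gives "finlex" and `prevailing("kho", "vero")` gives "kho".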
RAGTAG #10
Generation via DeepSeek-V4-Flash hosted on Featherless. Query-rewrite is cached.
The drafter is deepseek-ai/DeepSeek-V4-Flash served via Featherless. Per-question query rewrites are cached in process, which lowers cost on repeated framings. Per-query answers are not cached today; a localStorage history in the Next.js UI lets the user recall past questions but does not skip the call.
- src/retrieval/generate.py (MODEL = deepseek-ai/DeepSeek-V4-Flash)
- src/retrieval/query_rewrite.py (in-process cache)
- Featherless · featherless.ai
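An in-process rewrite cache can be sketched with functools.lru_cache. This is one possible shape, not the code in src/retrieval/query_rewrite.py: the normalization step, the cache size, and the stand-in for the hosted model call are all assumptions.

```python
from functools import lru_cache

llm_calls = []  # records how often the (stubbed) model is actually invoked

def _rewrite_with_llm(question):
    # Hypothetical stand-in for the hosted model call.
    llm_calls.append(question)
    return f"rewritten: {question}"

@lru_cache(maxsize=1024)
def _cached_rewrite(normalized_question):
    return _rewrite_with_llm(normalized_question)

def rewrite_query(question):
    """Normalize whitespace and case so repeated framings share one cache entry."""
    normalized = " ".join(question.split()).casefold()
    return _cached_rewrite(normalized)

first = rewrite_query("What is  ALV?")
second = rewrite_query("what is alv?")
```

The cache lives only as long as the process, so a restart pays for each rewrite once more; that matches the "in process" wording above but not a persistent store.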
Receipts
Single cite, TVL § 124. No graph hop. The baseline case.
Three cites: KHO 2025:46, KVL:004/2024, Vero ohje. Two graph hops via cites and interprets.
Rank-100 Finlex statute outranks rank-60 Vero kannanotto. Verifier picks Finlex.
€0.005 local · €0.04 hosted · €1 brief cap. The cost meter UI is a char-count heuristic, not API billing.
From the corpus
two things we found
Mojibake recovered through the graph
About 1.7% of chunks were double-encoded: the HTML sniffer mis-detected UTF-8 as Latin-1 and produced 'päätös → pรครคtรถs'-style chunks in LanceDB. We caught it by tracing RAG hits back to source files, fixed the parse layer to force UTF-8, and re-embedded the affected slice. The graph spine made the recovery surgical, not corpus-wide.
source · scripts/reingest_corrupted_chunks.py · pipeline/html_utils.py
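The repair for this class of corruption can be sketched as a byte-level round trip. The sketch follows the Latin-1 mis-detection described above and undoes a single round of it; the function name is hypothetical, and the actual fix in the repository was made in the parse layer followed by re-embedding, not by patching strings afterwards.

```python
def repair_mojibake(text):
    """Undo UTF-8 bytes mis-decoded as Latin-1; return the input unchanged otherwise."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or not valid UTF-8 afterwards: leave it alone.
        return text

# Reproduce the corruption, then reverse it.
broken = "päätös".encode("utf-8").decode("latin-1")
fixed = repair_mojibake(broken)
```

Text that was decoded correctly in the first place fails the round trip and passes through untouched, so a helper like this can be run over a mixed slice without damaging clean chunks.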
Not every tax question is in the law
Eval question N49 asks about the account-number range commonly used for trade receivables and payables (myyntisaamiset / ostovelat) in the Finnish chart of accounts. Our system returned the correct legal answer (no universally binding range exists), which did not match the question-bank reference. The reference traces to KILA practice and platform-specific defaults, not Finlex. Honest UX would surface that the law is silent here and the convention lives elsewhere.
source · eval/questions.json · question N49
References
- SAT-Graph RAG · de Martim, JURIX 2025 · arXiv:2505.00039
- TG-RAG · Han et al., 2025 · arXiv:2510.13590
- LRMoo v1.1.1 · IFLA, 2026 · cidoc-crm.org/LRMoo
- Semantic Finlex · SeCo Aalto + MoJ · data.finlex.fi/sparql
- Self-RAG · Asai et al., 2024 · ICLR 2024
- CRAG · Yan et al., 2024 · arXiv:2401.15884
- HyDE · Gao et al., 2022 · Precise Zero-Shot Dense Retrieval
- AgenticSimLaw · Jan 2026 · arXiv:2601.21936
- Multi-Agent Debate · Du et al., NeurIPS 2023 · arXiv:2305.14325
- BGE-M3 · Chen et al., BAAI · bge-model.com
- bge-reranker-v2-m3 · BAAI · bge-model.com
- ColBERT · Khattab + Zaharia · SIGIR 2020
- RRF · Cormack et al. · SIGIR 2009
- Voyage voyage-3-large · Voyage AI · voyageai.com
- Anthropic Circuit Tracing · Anthropic, 2025 · transformer-circuits.pub
- UN OLA · lex superior · UN Office of Legal Affairs · a_cn4_l682.pdf
Timeline · phone-friendly
The same story as a two-minute scroll on your phone. ragtag-timeline.vercel.app