Methodology

Built from a chat, line by line.

RAGTAG (Retrieval Augmented Graph Tax Answer Generator) takes a tax question, retrieves the relevant Finnish statutes, court rulings and Vero guidance, and answers with citations. Every decision below is tagged either to a paper or to a numbered point from Taxxa’s team chat on 23.05. We will present the demo live.

Chat · 23.05

Pricing #01
“€60 per user per month. Queries have to be cheap.”
Corpus #02
“Connect Finlex, Vero, case law. EU-lex out of scope.”
Authority #03
“Case laws refer to Finlex. Vero is just an interpreter. Case laws can overwrite Vero.”
Retrieval #04
“Can't bring 1,000 chunks per question. 25M-page DB.”
Timeline #05
“Active now, not then, not later.”
Stack #06
“Run RAG locally first. DeepSeek is good and cheap.”
Extraction #07
“Reference-extraction by regex / NLP is a good idea. We aren't doing it.”

Part I

What we tried, and which chat point killed it.

Bitemporal graph on Neo4j: every edge with its own validity window.

Schema from SAT-Graph RAG (JURIX 2025). The whole ontology in a graph database.

SAT-Graph RAG · arXiv:2505.00039
LRMoo · IFLA

Killed by #01

Two days into the ontology, still zero answers. We kept the bitemporal idea and rebuilt it in SQLite as version_chain (see RAGTAG #08).

BGE-M3 hybrid retrieval: dense + sparse + ColBERT, fused at k=60.

BAAI's multilingual model with three signal heads. RRF for fusion, HyDE for English-to-Finnish query expansion.

BGE-M3 · BAAI
ColBERT · SIGIR 2020
RRF · SIGIR 2009
HyDE · Gao et al. 2022

Killed by #01

Voyage voyage-3-large already ranked the right chunk in the top 30 on every eval question. Adding a second stack failed the cost test.
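
For reference, the fusion step named above can be sketched in a few lines. This is generic reciprocal rank fusion with the constant k=60; the function name and the chunk ids are illustrative and not taken from the repo.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Two hypothetical rankings, e.g. one from a dense head and one from a sparse head.
dense = ["tvl-124", "avl-102", "kho-2025-46"]
sparse = ["avl-102", "tvl-124", "vero-ohje"]
fused = rrf_fuse([dense, sparse])
```

Each extra ranking is another full retrieval pass per query, which is the cost the chat point rules out.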

Courtroom-style debate: prosecutor, defense, judge.

Three LLMs arguing per conflict, pattern from AgenticSimLaw.

AgenticSimLaw · arXiv:2601.21936
Multi-Agent Debate · NeurIPS 2023

Killed by #03, #01

Chat #03 handed us the rule directly. One integer compare resolves every conflict in our eval set; seven LLM turns cost real money.

63,660-node constellation: WebGL force layout of the whole corpus.

Attribution view inspired by Anthropic Circuit Tracer.

@cosmos.gl/graph
Anthropic Circuit Tracing · 2025

Killed by #04

Judges need one answer's reasoning, not the corpus shape. The reasoning panel animates the 5 to 10 nodes that actually mattered.

Live SPARQL fallback: CRAG escalation into Semantic Finlex.

Self-RAG reflection tokens to enforce citation coverage at draft time.

CRAG · arXiv:2401.15884
Self-RAG · ICLR 2024
data.finlex.fi/sparql

Killed by #01

A cold SPARQL query took 9 seconds on the public endpoint. We can surface 'unsure' offline via AmendmentCaveat instead (RAGTAG #08).

EUR-Lex contradictions: primacy of EU law over national acts.

Ingest EUR-Lex directives, model a transposes edge, surface EU vs national conflicts.

EUR-Lex · hierarchy of norms
Lex superior · UN OLA

Killed by #02

Chat #02 declared this out of scope. Adding it later is not an architectural rewrite: the `transposes` edge type is already in the EdgeType enum, the authority-rank lattice extends to an EU tier with one number, and the strategy router treats it as another `cross_source` route. New corpus, not new infrastructure.

Part II

RAGTAG. Ten pieces. Each cites file paths or papers.

RAGTAG #01

Three deterministic extraction passes: structural, anchor, regex.

Edges are emitted by three rule-based passes over HTML and the document tree: structural (parent_of from the heading hierarchy), anchor (cross-references inside an <a href> attribute), and regex (text citations like '§ 102 AVL', 'KHO 2025:46'). No model in the batch graph build.

src/extraction/structural_edges.py
src/extraction/anchor_edges.py
src/extraction/citations_regex.py
src/extraction/definition_edges.py

Answers #07
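
The regex pass can be sketched as below. The two patterns are illustrative guesses built from the citation shapes quoted above ('§ 102 AVL', 'KHO 2025:46'); the function name, the tuple layout, and the `cites` label on the emitted edges are assumptions, not the contents of src/extraction/citations_regex.py.

```python
import re

# Hypothetical patterns modelled on the two citation shapes named in the text.
STATUTE_RE = re.compile(r"§\s*(\d+[a-z]?)\s+([A-ZÅÄÖ]{2,})\b")
CASE_RE = re.compile(r"\bKHO\s+(\d{4}):(\d+)\b")


def extract_citation_edges(section_id: str, text: str) -> list[tuple[str, str, str]]:
    """Return (source, edge_type, target) triples for each citation found in text."""
    edges = []
    for m in STATUTE_RE.finditer(text):
        # Normalise to "ACT § number" so the same target always gets the same key.
        edges.append((section_id, "cites", f"{m.group(2)} § {m.group(1)}"))
    for m in CASE_RE.finditer(text):
        edges.append((section_id, "cites", f"KHO {m.group(1)}:{m.group(2)}"))
    return edges


sample = "Ks. § 102 AVL sekä KHO 2025:46."
edges = extract_citation_edges("vero-ohje-1", sample)
```

Because the pass is pure pattern matching, rerunning it over the same HTML yields the same edges.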

RAGTAG #02

Two SQLite tables plus a local LanceDB: nodes, edges, chunks. Joined by section_id.

1,967,776 nodes and 2,180,769 typed edges live in two SQLite tables (nodes, edges). 402,088 embedded chunks live in LanceDB on the filesystem (no remote service). Everything joins on section_id with an O(1) lookup.

scripts/load_graph.py (nodes, edges CREATE TABLE)
findings/04a_index_sanity.md (402,088 chunks)
findings/04b_load_report.md (1.97M nodes, 2.18M edges)
src/indexing/vector_store.py (LanceDB)

Answers #01
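
A minimal sketch of the join, using an in-memory SQLite database. The column names other than section_id, the index, and the sample rows are assumptions for illustration; the real schema lives in scripts/load_graph.py. The `hit` dict stands in for a row returned by the vector store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (section_id TEXT PRIMARY KEY, title TEXT, authority_rank INTEGER);
    CREATE TABLE edges (src TEXT, dst TEXT, edge_type TEXT);
    CREATE INDEX idx_edges_src ON edges (src);
    """
)
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [("tvl-124", "TVL § 124", 100), ("vero-1", "Vero ohje", 60)],
)
conn.execute("INSERT INTO edges VALUES (?, ?, ?)", ("vero-1", "tvl-124", "interprets"))

# A retrieved chunk; only its section_id is needed to reach the graph tables.
hit = {"section_id": "vero-1", "text": "chunk text goes here"}


def neighbours(db: sqlite3.Connection, section_id: str) -> list[tuple]:
    """Follow outgoing edges from one section and return the target nodes."""
    return db.execute(
        """
        SELECT e.edge_type, n.section_id, n.authority_rank
        FROM edges AS e JOIN nodes AS n ON n.section_id = e.dst
        WHERE e.src = ?
        ORDER BY n.section_id
        """,
        (section_id,),
    ).fetchall()


rows = neighbours(conn, hit["section_id"])
```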

RAGTAG #03

Section-anchored chunking: 800 to 1,500 tokens, 2,000 hard max, never split mid-citation.

The chunk unit is the SECTION (§). Children are greedily packed under their § head and never split mid-sentence, mid-item, or mid-citation. The result: every chunk carries its own legal anchor.

pipeline/chunks.py

Answers #04
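
The greedy packing can be sketched as follows. This is an illustration only: it assumes the section has already been split into atomic units (sentences, items, citations), counts tokens by whitespace as a stand-in for the real tokenizer, repeats the § head on every chunk, and leaves out the 2,000-token hard cap. None of the names come from pipeline/chunks.py.

```python
from typing import Callable


def pack_section(
    head: str,
    units: list[str],
    target_max: int = 1500,
    count: Callable[[str], int] = lambda s: len(s.split()),
) -> list[str]:
    """Greedily pack atomic units under a section head without splitting any unit."""
    chunks: list[list[str]] = []
    current, size = [head], count(head)
    for unit in units:
        n = count(unit)
        # Close the chunk when the next unit would overflow and it already has content.
        if size + n > target_max and len(current) > 1:
            chunks.append(current)
            current, size = [head], count(head)
        current.append(unit)
        size += n
    if len(current) > 1:
        chunks.append(current)
    return ["\n".join(chunk) for chunk in chunks]


# Tiny budget so the behaviour is visible: each unit ends up in its own chunk.
chunks = pack_section("§ 124", ["a b c", "d e f", "g h i"], target_max=6)
```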

RAGTAG #04

Multilingual embeddings via Voyage voyage-3-large: 1,024-dim, asymmetric query / document.

Hosted but cheap. Asymmetric (input_type='query' vs 'document') to avoid the quality cliff Voyage warns about. Same embedding space carries Finnish, Swedish, and English.

src/indexing/voyage_client.py (MODEL = voyage-3-large)
voyageai.com

Answers #02

RAGTAG #05

Strategy router, six presets: case_law, recency, definition, cross_source, multi_hop, default.

A keyword and regex classifier on the question text picks one ExpansionStrategy. Each preset sets seed depth, edge types, BFS direction, max hops, and per-edge degree caps. Default falls back to vector-only retrieval.

src/retrieval/strategy.py

Answers #04
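
A keyword and regex router of this kind might look like the sketch below. The six preset names come from the text; the trigger patterns, the reduced field set on ExpansionStrategy, and every preset value are hypothetical placeholders rather than what src/retrieval/strategy.py contains.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpansionStrategy:
    name: str
    edge_types: tuple[str, ...]  # empty tuple means no graph expansion
    max_hops: int


# Placeholder values; only the preset names are taken from the text.
PRESETS = {
    "case_law": ExpansionStrategy("case_law", ("cites", "interprets"), 2),
    "recency": ExpansionStrategy("recency", ("cites",), 1),
    "definition": ExpansionStrategy("definition", ("parent_of",), 1),
    "cross_source": ExpansionStrategy("cross_source", ("interprets",), 1),
    "multi_hop": ExpansionStrategy("multi_hop", ("cites", "interprets"), 2),
    "default": ExpansionStrategy("default", (), 0),  # vector-only
}

# First matching rule wins, so the order encodes priority.
RULES = [
    ("case_law", re.compile(r"\b(KHO|KVL)\b|case law|ruling", re.I)),
    ("recency", re.compile(r"\b(latest|current|as of|amended)\b", re.I)),
    ("definition", re.compile(r"\b(what is|define|definition)\b", re.I)),
    ("cross_source", re.compile(r"\b(conflict|contradict|versus|vs)\b", re.I)),
    ("multi_hop", re.compile(r"\b(in turn|which refers)\b", re.I)),
]


def route(question: str) -> ExpansionStrategy:
    """Pick one preset from the question text; fall back to vector-only."""
    for name, pattern in RULES:
        if pattern.search(question):
            return PRESETS[name]
    return PRESETS["default"]


picked = route("Does KHO 2025:46 apply to meal vouchers?")
```

Routing this way costs no model call.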

RAGTAG #06

Bounded BFS with hub-skip: interprets_in > 30, cites_out > 15, parent_of_in > 50.

Default max_hops = 1. Hub nodes (widely cited statutes) are not expanded through. Final candidate set is truncated to fit a 25k-token context.

src/retrieval/graph_expand.py

Answers #04
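
The expansion rule can be illustrated with a small breadth-first search. The three degree caps are the ones quoted above; the adjacency and degree dictionaries, the function names, and the choice to keep a hub as a candidate while refusing to expand it are assumptions of this sketch, and the 25k-token truncation is not modelled.

```python
from collections import deque

HUB_CAPS = {"interprets_in": 30, "cites_out": 15, "parent_of_in": 50}


def is_hub(degrees: dict[str, int]) -> bool:
    """A node is a hub if any per-edge degree exceeds its cap."""
    return any(degrees.get(key, 0) > cap for key, cap in HUB_CAPS.items())


def bounded_bfs(
    seeds: list[str],
    adjacency: dict[str, list[str]],
    degrees: dict[str, dict[str, int]],
    max_hops: int = 1,
) -> list[str]:
    """Expand from seeds up to max_hops, never expanding through a hub node."""
    order = list(dict.fromkeys(seeds))  # de-duplicate while keeping seed order
    seen = set(order)
    queue = deque((seed, 0) for seed in order)
    while queue:
        node, depth = queue.popleft()
        if depth >= max_hops or is_hub(degrees.get(node, {})):
            continue
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append((neighbour, depth + 1))
    return order


adjacency = {"a": ["hub", "b"], "hub": ["x", "y"], "b": ["c"]}
degrees = {"hub": {"cites_out": 16}}
one_hop = bounded_bfs(["a"], adjacency, degrees)
two_hops = bounded_bfs(["a"], adjacency, degrees, max_hops=2)
```

In the two-hop run the hub itself is returned, but its neighbours x and y never enter the candidate set.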

RAGTAG #07

Two reranking paths, one cross-encoder: v2 uses bge-reranker-v2-m3, v1 uses metadata signals.

v2 runs BAAI/bge-reranker-v2-m3 (a multilingual cross-encoder) over 30 to 40 candidates and combines 0.6 cross-encoder + 0.3 cosine + 0.1 metadata. v1 (the default path in the API sidecar today) uses a metadata reranker: authority_rank, recency, term overlap.

src/retrieval/cross_encoder_rerank.py (bge-reranker-v2-m3)
src/retrieval/rerank.py (metadata reranker)

Answers #04
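
The v2 blend is a weighted sum, sketched here with the weights from the text. It assumes all three signals are already normalised to the range 0 to 1, which the text does not state; the Candidate class and the sample scores are made up for illustration.

```python
from dataclasses import dataclass

W_CROSS_ENCODER, W_COSINE, W_METADATA = 0.6, 0.3, 0.1


@dataclass(frozen=True)
class Candidate:
    chunk_id: str
    cross_encoder: float  # assumed normalised to [0, 1]
    cosine: float
    metadata: float


def blended_score(c: Candidate) -> float:
    """0.6 cross-encoder + 0.3 cosine + 0.1 metadata."""
    return W_CROSS_ENCODER * c.cross_encoder + W_COSINE * c.cosine + W_METADATA * c.metadata


def rerank(candidates: list[Candidate], top_k: int = 10) -> list[Candidate]:
    """Sort by blended score, best first, and keep the top_k."""
    return sorted(candidates, key=blended_score, reverse=True)[:top_k]


pool = [
    Candidate("a", cross_encoder=0.9, cosine=0.2, metadata=0.0),  # 0.54 + 0.06 + 0.00
    Candidate("b", cross_encoder=0.5, cosine=0.9, metadata=1.0),  # 0.30 + 0.27 + 0.10
]
ranked = rerank(pool)
```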

RAGTAG #08

Temporal correctness: version_chain, as_of, AmendmentCaveat. All deterministic.

Every SECTION carries a chronological version_chain of muutetaan, kumotaan, lisätään steps. GraphStore.text_at(section_id, as_of) plays it forward. Every cited chunk on a stale ancestor emits an AmendmentCaveat (suspect, stale, repealed). A separate check_temporal_mismatches function compares the drafted answer against the section's chain via difflib; no LLM in this check.

src/indexing/graph_store.py (text_at)
src/models.py (VersionStep, AmendmentCaveat)
src/agents/verifier.py (check_temporal_mismatches)
src/retrieval/pipeline_v2.py (wires both)

Answers #05
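
Playing a version chain forward to a date can be sketched as below. The names VersionStep and text_at appear in the text, but the fields, the standalone signature (the real method is GraphStore.text_at(section_id, as_of)), and the rule that a kumotaan step leaves no text in force are assumptions of this illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class VersionStep:
    effective: date
    op: str  # "lisätään" (added), "muutetaan" (amended) or "kumotaan" (repealed)
    text: Optional[str] = None


def text_at(chain: list[VersionStep], as_of: date) -> Optional[str]:
    """Return the section text in force on as_of, or None if none is in force."""
    current: Optional[str] = None
    for step in sorted(chain, key=lambda s: s.effective):
        if step.effective > as_of:
            break  # later steps have not happened yet
        current = None if step.op == "kumotaan" else step.text
    return current


chain = [
    VersionStep(date(2010, 1, 1), "lisätään", "v1"),
    VersionStep(date(2020, 1, 1), "muutetaan", "v2"),
    VersionStep(date(2024, 1, 1), "kumotaan"),
]
in_2015 = text_at(chain, date(2015, 6, 1))
```

The same chain answers "active now, not then, not later" for any as_of without a model call.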

RAGTAG #09

Authority is one integer: Finlex 100, Treaty 90, KHO 80, Vero 60.

Ranks are assigned at ingestion from source / source_subcorpus and stored on every node. Conflict surfacing compares the integer; the team's lattice (Finlex over Vero, KHO can overwrite Vero) follows directly from that comparison.

src/extraction/authority.py
findings/03_authority_ranks.md
Lex superior · UN OLA

Answers #03
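
The conflict rule reduces to comparing two integers, as in this sketch. The four rank values are from the text; the lowercase source keys, the Claim class, and the decision to return None on a tie are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

AUTHORITY_RANK = {"finlex": 100, "treaty": 90, "kho": 80, "vero": 60}


@dataclass(frozen=True)
class Claim:
    source: str  # hypothetical key into AUTHORITY_RANK
    text: str

    @property
    def rank(self) -> int:
        return AUTHORITY_RANK[self.source]


def prevailing(a: Claim, b: Claim) -> Optional[Claim]:
    """Return the claim from the higher-ranked source, or None when ranks tie."""
    if a.rank == b.rank:
        return None
    return a if a.rank > b.rank else b


statute = Claim("finlex", "statute text")
guidance = Claim("vero", "guidance text")
ruling = Claim("kho", "ruling text")
winner = prevailing(statute, guidance)
```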

RAGTAG #10

Generation via DeepSeek-V4-Flash hosted on Featherless. Query rewrites are cached.

The drafter is deepseek-ai/DeepSeek-V4-Flash served via Featherless. Per-question query rewrites are cached in process, which lowers cost on repeated framings. Per-query answers are not cached today; a localStorage history in the Next.js UI lets the user recall past questions but does not skip the call.

src/retrieval/generate.py (MODEL = deepseek-ai/DeepSeek-V4-Flash)
src/retrieval/query_rewrite.py (in-process cache)
Featherless · featherless.ai

Answers #06, #01

Receipts

Capital income > €30k

Single cite, TVL § 124. No graph hop. The baseline case.

Q12

Meal voucher VAT

Three cites: KHO 2025:46, KVL:004/2024, Vero ohje. Two graph hops via cites and interprets.

Q41

Avainhenkilö 48 vs 84 months

Rank-100 Finlex statute outranks rank-60 Vero kannanotto. Verifier picks Finlex.

Cost

Local · hosted · brief cap

€0.005 local · €0.04 hosted · €1 brief cap. The cost meter UI is a char-count heuristic, not API billing.

From the corpus

Two things we found

Mojibake recovered through the graph

About 1.7% of chunks were double-encoded: the HTML sniffer mis-detected UTF-8 as Latin-1 and produced 'päätös → pรครคtรถs'-style chunks in LanceDB. We caught it by tracing RAG hits back to source files, fixed the parse layer to force UTF-8, and re-embedded the affected slice. The graph spine made the recovery surgical, not corpus-wide.

source · scripts/reingest_corrupted_chunks.py · pipeline/html_utils.py
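
A check for this kind of damage can be sketched with a codec round trip. It assumes exactly one round of UTF-8 bytes decoded under the wrong codec, with Latin-1 as the default as described above; a different mis-detected code page would need a different `wrong_codec`. The function is illustrative and is not the logic in scripts/reingest_corrupted_chunks.py, which re-parses the source and re-embeds.

```python
def repair_mojibake(text: str, wrong_codec: str = "latin-1") -> str:
    """Undo one round of UTF-8 read as wrong_codec; return text unchanged otherwise."""
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Clean text does not survive the round trip, so it is left alone.
        return text


def looks_corrupted(text: str, wrong_codec: str = "latin-1") -> bool:
    """Flag a chunk when the round trip succeeds and changes the text."""
    return repair_mojibake(text, wrong_codec) != text


clean = "päätös"
garbled = clean.encode("utf-8").decode("latin-1")  # simulate the bad decode
restored = repair_mojibake(garbled)
```

Pure ASCII text round-trips unchanged, so it is never flagged.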

Not every tax question is in the law

Eval question N49 asks about the account-number range commonly used for trade receivables and payables (myyntisaamiset / ostovelat) in the Finnish chart of accounts. Our system returned the correct legal answer (no universally binding range exists), which did not match the question-bank reference. The reference traces to KILA practice and platform-specific defaults, not Finlex. Honest UX would surface that the law is silent here and the convention lives elsewhere.

source · eval/questions.json · question N49

References

SAT-Graph RAG · de Martim, JURIX 2025 · arXiv:2505.00039
TG-RAG · Han et al., 2025 · arXiv:2510.13590
LRMoo v1.1.1 · IFLA, 2026 · cidoc-crm.org/LRMoo
Semantic Finlex · SeCo Aalto + MoJ · data.finlex.fi/sparql
Self-RAG · Asai et al., 2024 · ICLR 2024
CRAG · Yan et al., 2024 · arXiv:2401.15884
HyDE · Gao et al., 2022 · Precise Zero-Shot Dense Retrieval
AgenticSimLaw · Jan 2026 · arXiv:2601.21936
Multi-Agent Debate · Du et al., NeurIPS 2023 · arXiv:2305.14325
BGE-M3 · Chen et al., BAAI · bge-model.com
bge-reranker-v2-m3 · BAAI · bge-model.com
ColBERT · Khattab + Zaharia · SIGIR 2020
RRF · Cormack et al. · SIGIR 2009
Voyage voyage-3-large · Voyage AI · voyageai.com
Anthropic Circuit Tracing · Anthropic, 2025 · transformer-circuits.pub
UN OLA · lex superior · UN Office of Legal Affairs · a_cn4_l682.pdf

Timeline · phone-friendly

The same story as a two-minute scroll on your phone. ragtag-timeline.vercel.app