About RAGTAG
Retrieval Augmented Graph Tax Answer Generator.
RAGTAG answers Finnish tax-law questions by walking a typed legal graph instead of a flat vector index. Each statute, section, ruling, and Vero ohje becomes a node with bitemporal validity, and amendments and interpretations become typed edges. Retrieval uses both the graph topology and a Voyage embedding store, then grounds the answer in a DeepSeek call that cites the section ids it relied on.
01 · Data
The corpus
Nodes - by type
- SUBSECTION | 1,184,915 | 60.2% | momentit - atomic operative paragraphs
- SECTION | 344,156 | 17.5% | § inside a law/asetus
- ITEM | 320,453 | 16.3% | list items inside subsections
- LAW | 54,678 | 2.8% | root acts (laki / asetus / SK)
- CHAPTER | 27,611 | 1.4% | luku inside a law
- AMENDMENT_BLOCK | 14,145 | 0.7% | muutoslaki bodies
- DEFINITION | 12,835 | 0.7% | domain terms · linked via `defines`
- CASE | 7,040 | 0.4% | KHO / KKO precedents
- GUIDE | 1,826 | 0.1% | Vero ohje / kannanotto / päätös
- TREATY | 117 | <0.1% | tax treaties
Edges - by type
- parent_of | 1,904,115 | 85.3% | structural containment (law → section → subsection)
- defines | 234,570 | 10.5% | statute or section defines a domain term
- amends | 26,880 | 1.2% | amendment LAW → target LAW
- amends_section | 24,036 | 1.1% | muutetaan / kumotaan / lisätään directives
- applies | 18,259 | 0.8% | KHO/KVL case applies a statutory provision
- interprets | 16,613 | 0.7% | Vero/KHO interpretation of a section
- cites | 7,164 | 0.3% | textual cross-reference ("see §117")
- repeals | 161 | <0.1% | amendment action repealing a section
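The node and edge types above can be pictured as a small data model. The following is a minimal sketch only: the field names (`valid_from`, `recorded_from`, and so on), the id strings, and the `in_force` helper are assumptions made for illustration, not RAGTAG's actual schema. It shows one common reading of "bitemporal validity": one interval for when a provision is legally in force, and a second for when that version was recorded in the corpus.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Node:
    node_id: str                        # hypothetical id scheme
    node_type: str                      # e.g. "SECTION", "CASE", "GUIDE"
    valid_from: date                    # start of legal validity
    valid_to: Optional[date]            # None = still in force
    recorded_from: date                 # when this version entered the corpus
    recorded_to: Optional[date] = None  # None = current record

    def in_force(self, on: date, known_at: date) -> bool:
        """True if the node was legally valid on `on`, as recorded at `known_at`."""
        legally_valid = self.valid_from <= on and (
            self.valid_to is None or on < self.valid_to
        )
        recorded = self.recorded_from <= known_at and (
            self.recorded_to is None or known_at < self.recorded_to
        )
        return legally_valid and recorded


@dataclass(frozen=True)
class Edge:
    src: str        # node_id of the source node
    dst: str        # node_id of the target node
    edge_type: str  # e.g. "parent_of", "cites", "amends_section"


# Two versions of one (made-up) section, linked by an amendment edge.
old = Node("sec-117-v1", "SECTION", date(2010, 1, 1), date(2020, 1, 1), date(2010, 1, 1))
new = Node("sec-117-v2", "SECTION", date(2020, 1, 1), None, date(2019, 12, 1))
amendment = Edge("amendment-2019", "sec-117-v1", "amends_section")
```

With two intervals per node, a question about tax year 2015 can resolve to the older version of a section even though a newer one exists.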
02 · Architecture
The pipeline
Two halves. The left half runs once per corpus refresh and produces the graph + vector stores. The right half runs once per question and produces a grounded, cited answer.
To render the diagram, run `npm install mermaid` or paste the source into mermaid.live.
flowchart LR
%% ---------- Ingest ----------
subgraph Ingest [Ingest - build the graph]
direction TB
F[Finlex laki / asetus<br/>+ Säädöskokoelma] --> P{Parsers}
V[Vero · ohje · kannanotto · päätös] --> P
K[KHO / KVL rulings] --> P
T[Tax treaties] --> P
P --> N[Typed nodes<br/>LAW · SECTION · CHAPTER<br/>CASE · GUIDE · TREATY]
P --> E[Typed edges<br/>parent_of · cites · amends<br/>interprets · applies · defines]
N --> GS[(Graph store<br/>SQLite)]
E --> GS
N --> C[Chunker]
C --> EM[voyage-3-large<br/>1024-dim]
EM --> VS[(Vector store<br/>LanceDB)]
end
%% ---------- Retrieve ----------
subgraph Retrieve [Retrieve - answer one question]
direction TB
Q[Question] --> QR[Query rewrite<br/>FI / EN synonyms]
QR --> H[Hybrid search<br/>dense + BM25 RRF]
VSr[(Vector store)] --> H
H --> RR[Rerank<br/>cosine + authority<br/>+ recency + term]
GSr[(Graph store)] --> RR
RR --> AS[Assemble context<br/>render typed cross-refs<br/>between cited sections]
AS --> GEN[LLM · grounded answer<br/>with cited section ids]
GEN --> CF[Confidence grader<br/>high · medium · low]
AS --> ORB[Provenance orbit<br/>sub-graph for the UI]
GEN --> OUT([Answer<br/>+ citations + cost])
CF --> OUT
ORB --> OUT
end
%% Cross-links so the two halves read as one system.
GS -.-> GSr
VS -.-> VSr
classDef store fill:#fff7eb,stroke:#944921,stroke-width:1.2px;
classDef io fill:#fafafa,stroke:#1a1c1b,stroke-width:1.2px;
class GS,VS,GSr,VSr store;
class OUT io;
Finlex HTML, Vero ohje pages, KHO rulings, and treaties are parsed into typed nodes anchored to their source URL. Cross-references and amendment clauses become typed edges (cites / amends / interprets / applies / repeals / defines). Sections are also tokenised and embedded with voyage-3-large so retrieval has a hybrid lexical+semantic surface.
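As a rough illustration of the ingest half described above, the sketch below stores typed nodes and edges in SQLite using Python's standard `sqlite3` module. The table and column names, the node ids, and the `example.org` URLs are all hypothetical; the real schema is not documented on this page.

```python
import sqlite3


def build_store(conn: sqlite3.Connection) -> None:
    """Create a minimal node/edge schema (illustrative, not the real one)."""
    conn.executescript(
        """
        CREATE TABLE nodes (
            id         TEXT PRIMARY KEY,
            type       TEXT NOT NULL,
            source_url TEXT NOT NULL   -- every node is anchored to its source
        );
        CREATE TABLE edges (
            src  TEXT NOT NULL REFERENCES nodes(id),
            dst  TEXT NOT NULL REFERENCES nodes(id),
            type TEXT NOT NULL
        );
        CREATE INDEX edges_by_dst ON edges (dst, type);
        """
    )


def incoming(conn: sqlite3.Connection, node_id: str, edge_type: str) -> list:
    """Return ids of nodes that point at `node_id` with the given edge type."""
    rows = conn.execute(
        "SELECT src FROM edges WHERE dst = ? AND type = ? ORDER BY src",
        (node_id, edge_type),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
build_store(conn)
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [
        ("sec:1", "SECTION", "https://example.org/law/1#s1"),
        ("guide:1", "GUIDE", "https://example.org/guide/1"),
        ("case:1", "CASE", "https://example.org/case/1"),
    ],
)
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [("guide:1", "sec:1", "interprets"), ("case:1", "sec:1", "applies")],
)
interpreters = incoming(conn, "sec:1", "interprets")
```

Indexing edges by target and type keeps the "what interprets or amends this section?" lookup cheap, which is the query the retrieval half needs most often.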
The question is rewritten to surface Finnish equivalents, then a hybrid dense+BM25 search returns seed sections. Rerank blends cosine with authority rank, recency, term overlap, and a graded penalty pulled from the section's amendment history in the graph.
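The fusion and rerank steps above can be sketched as follows. This is an illustration under stated assumptions: it uses standard reciprocal rank fusion with the conventional constant `k = 60`, and a simple linear blend for the rerank. The weights, the `Candidate` fields, and the way the amendment penalty is subtracted are hypothetical, since the page does not give the actual formula.

```python
from dataclasses import dataclass


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


@dataclass
class Candidate:
    node_id: str
    cosine: float             # dense similarity to the query
    authority: float          # e.g. statute > ruling > guide, scaled to 0..1
    recency: float            # newer material scores higher, 0..1
    term_overlap: float       # lexical overlap with the query, 0..1
    amendment_penalty: float  # graded penalty from the amendment history


# Hypothetical weights; the real blend is not documented here.
WEIGHTS = {"cosine": 0.5, "authority": 0.2, "recency": 0.15, "term_overlap": 0.15}


def rerank_score(c: Candidate) -> float:
    blended = (
        WEIGHTS["cosine"] * c.cosine
        + WEIGHTS["authority"] * c.authority
        + WEIGHTS["recency"] * c.recency
        + WEIGHTS["term_overlap"] * c.term_overlap
    )
    return blended - c.amendment_penalty


dense_hits = ["s1", "s2", "s3"]
bm25_hits = ["s3", "s1", "s4"]
seeds = rrf_fuse([dense_hits, bm25_hits])

candidates = [
    Candidate("b", 0.85, 0.5, 0.5, 0.5, amendment_penalty=0.3),
    Candidate("a", 0.80, 0.5, 0.5, 0.5, amendment_penalty=0.0),
]
reranked = [c.node_id for c in sorted(candidates, key=rerank_score, reverse=True)]
```

In this toy run the section with the slightly higher cosine score still ranks second, because its amendment history costs it more than the similarity gap gains.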
The top-N reranked chunks are rendered into a single prompt where every cross-reference between cited sections is surfaced verbatim - `Source 1 cites → Source 3`, `Source 5 interprets → Source 2`. The LLM treats the sources as a small graph, not a bag.
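A minimal sketch of how such cross-reference lines might be rendered, assuming the graph edges are available as `(src, edge_type, dst)` triples and the cited sources are numbered in prompt order. The function name and the input shapes are hypothetical.

```python
def render_cross_refs(source_ids, edges):
    """Render edges between cited sources as 'Source i <type> → Source j' lines.

    source_ids: node ids in prompt order (Source 1, Source 2, ...).
    edges: iterable of (src_id, edge_type, dst_id) triples from the graph.
    """
    index = {node_id: i for i, node_id in enumerate(source_ids, start=1)}
    lines = []
    for src, edge_type, dst in edges:
        # Only surface edges whose two endpoints are both in the prompt.
        if src in index and dst in index:
            lines.append(f"Source {index[src]} {edge_type} → Source {index[dst]}")
    return lines


cited = ["sec:a", "sec:b", "guide:c"]
graph_edges = [
    ("sec:a", "cites", "guide:c"),
    ("guide:c", "interprets", "sec:b"),
    ("sec:a", "cites", "sec:not-in-prompt"),
]
cross_refs = render_cross_refs(cited, graph_edges)
```

Dropping edges that lead outside the prompt keeps the rendered graph closed over the sources the model can actually see.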
DeepSeek V4 Pro writes the answer with inline [Source N] citations, which the sidecar rewrites to clickable [cite:node:…] tokens. A second short LLM call grades confidence (high / medium / low); the UI surfaces an Ask-Specialist CTA when the grade is low.
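The citation rewrite can be sketched with a regular expression. This is illustrative only: it assumes the elided part of the `[cite:node:…]` token is the node id of the corresponding source, which this page does not confirm, and `rewrite_citations` and `needs_specialist` are hypothetical names.

```python
import re

SOURCE_TAG = re.compile(r"\[Source (\d+)\]")


def rewrite_citations(answer: str, source_ids: list) -> str:
    """Replace [Source N] with a cite token for the N-th source (1-based)."""

    def replace(match):
        n = int(match.group(1))
        if 1 <= n <= len(source_ids):
            # Assumed token shape: the node id follows "cite:node:".
            return f"[cite:node:{source_ids[n - 1]}]"
        return match.group(0)  # leave out-of-range references untouched

    return SOURCE_TAG.sub(replace, answer)


def needs_specialist(grade: str) -> bool:
    """The UI shows the Ask-Specialist CTA only for a low confidence grade."""
    return grade == "low"


rewritten = rewrite_citations(
    "The expense is deductible [Source 2]. See also [Source 9].",
    ["node-1", "node-2"],
)
```

Leaving out-of-range references as they are makes a hallucinated source number visible in the output instead of silently linking it to the wrong node.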
See also
The Methodology page describes the richer LRMoo / SAT-Graph-style architecture we explored during the hackathon but did not ship. The system documented on this page is what actually backs /ask.