About RAGTAG

Retrieval-Augmented Graph Tax Answer Generator.

RAGTAG answers Finnish tax-law questions by walking a typed legal graph instead of a flat vector index. Each statute, section, ruling, and Vero ohje becomes a node with bitemporal validity, and amendments and interpretations become typed edges. Retrieval uses both the graph topology and a Voyage embedding store; the answer is then grounded in a DeepSeek call that cites the section ids it relied on.

01 · Data

The corpus

1,967,776
typed nodes
10 node kinds, from LAW root down to ITEM leaves
2,231,798
typed edges
8 relation types, all bitemporal (t_valid / t_invalid)
402,088
vector chunks
1024-dim dense embeddings via Voyage AI · voyage-3-large

Nodes - by type

  • SUBSECTION: 1,184,915 (60.2%), momentit - atomic operative paragraphs
  • SECTION: 344,156 (17.5%), § inside a law/asetus
  • ITEM: 320,453 (16.3%), list items inside subsections
  • LAW: 54,678 (2.8%), root acts (laki / asetus / SK)
  • CHAPTER: 27,611 (1.4%), luku inside a law
  • AMENDMENT_BLOCK: 14,145 (0.7%), muutoslaki bodies
  • DEFINITION: 12,835 (0.7%), domain terms · linked via `defines`
  • CASE: 7,040 (0.4%), KHO / KKO precedents
  • GUIDE: 1,826 (0.1%), Vero ohje / kannanotto / päätös
  • TREATY: 117 (<0.1%), tax treaties

Edges - by type

  • parent_of: 1,904,115 (85.3%), structural containment (law → section → subsection)
  • defines: 234,570 (10.5%), statute or section defines a domain term
  • amends: 26,880 (1.2%), amendment LAW → target LAW
  • amends_section: 24,036 (1.1%), muutetaan / kumotaan / lisätään directives
  • applies: 18,259 (0.8%), KHO/KVL case applies a statutory provision
  • interprets: 16,613 (0.7%), Vero/KHO interpretation of a section
  • cites: 7,164 (0.3%), textual cross-reference ("see §117")
  • repeals: 161 (<0.1%), amendment action repealing a section

Embeddings
Voyage AI · voyage-3-large
1024-dim · cosine similarity · multilingual

Answer LLM
DeepSeek V4 Pro · via Featherless
grounded generation · cite-token output · confidence grader
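
To make the bitemporal edges concrete, here is a minimal sketch of an "as of" lookup against a SQLite graph store. Only the column names t_valid / t_invalid and the relation names come from this page; the table layout, the node ids, and the half-open interval convention are assumptions made for illustration, not RAGTAG's actual schema.

```python
import sqlite3

# Hypothetical schema: everything except t_valid / t_invalid and the
# relation names is an assumption made for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (id TEXT PRIMARY KEY, kind TEXT NOT NULL);
CREATE TABLE edges (
    src TEXT NOT NULL REFERENCES nodes(id),
    dst TEXT NOT NULL REFERENCES nodes(id),
    rel TEXT NOT NULL,       -- e.g. parent_of, cites, amends_section
    t_valid TEXT NOT NULL,   -- ISO date from which the relation holds
    t_invalid TEXT           -- ISO date it stops holding; NULL = still valid
);
""")
conn.executemany("INSERT INTO nodes VALUES (?, ?)", [
    ("law:A", "LAW"), ("sec:A:1", "SECTION"), ("sec:A:2", "SECTION"),
])
conn.executemany("INSERT INTO edges VALUES (?, ?, ?, ?, ?)", [
    ("law:A", "sec:A:1", "parent_of", "2010-01-01", "2020-01-01"),  # ended in 2020
    ("law:A", "sec:A:2", "parent_of", "2010-01-01", None),
])


def edges_as_of(conn, src, rel, as_of):
    """Return dst ids of edges valid on the given ISO date.

    Validity is treated as the half-open interval [t_valid, t_invalid).
    ISO dates compare correctly as plain strings.
    """
    rows = conn.execute(
        """SELECT dst FROM edges
           WHERE src = ? AND rel = ?
             AND t_valid <= ?
             AND (t_invalid IS NULL OR t_invalid > ?)
           ORDER BY dst""",
        (src, rel, as_of, as_of),
    )
    return [row[0] for row in rows]


sections_2015 = edges_as_of(conn, "law:A", "parent_of", "2015-06-01")
sections_2024 = edges_as_of(conn, "law:A", "parent_of", "2024-06-01")
```

With this layout, a question about tax year 2015 would see both sections, while a question about 2024 would see only the one still in force.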

02 · Architecture

The pipeline

Two halves. The left half runs once per corpus refresh and produces the graph + vector stores. The right half runs once per question and produces a grounded, cited answer.

flowchart LR
    %% ---------- Ingest ----------
    subgraph Ingest [Ingest - build the graph]
        direction TB
        F[Finlex laki / asetus<br/>+ Säädöskokoelma] --> P{Parsers}
        V[Vero · ohje · kannanotto · päätös] --> P
        K[KHO / KVL rulings] --> P
        T[Tax treaties] --> P
        P --> N[Typed nodes<br/>LAW · SECTION · CHAPTER<br/>CASE · GUIDE · TREATY]
        P --> E[Typed edges<br/>parent_of · cites · amends<br/>interprets · applies · defines]
        N --> GS[(Graph store<br/>SQLite)]
        E --> GS
        N --> C[Chunker]
        C --> EM[voyage-3-large<br/>1024-dim]
        EM --> VS[(Vector store<br/>LanceDB)]
    end

    %% ---------- Retrieve ----------
    subgraph Retrieve [Retrieve - answer one question]
        direction TB
        Q[Question] --> QR[Query rewrite<br/>FI / EN synonyms]
        QR --> H[Hybrid search<br/>dense + BM25 RRF]
        VSr[(Vector store)] --> H
        H --> RR[Rerank<br/>cosine + authority<br/>+ recency + term]
        GSr[(Graph store)] --> RR
        RR --> AS[Assemble context<br/>render typed cross-refs<br/>between cited sections]
        AS --> GEN[LLM · grounded answer<br/>with cited section ids]
        GEN --> CF[Confidence grader<br/>high · medium · low]
        AS --> ORB[Provenance orbit<br/>sub-graph for the UI]
        GEN --> OUT([Answer<br/>+ citations + cost])
        CF --> OUT
        ORB --> OUT
    end

    %% Cross-links so the two halves read as one system.
    GS -.-> GSr
    VS -.-> VSr

    classDef store fill:#fff7eb,stroke:#944921,stroke-width:1.2px;
    classDef io fill:#fafafa,stroke:#1a1c1b,stroke-width:1.2px;
    class GS,VS,GSr,VSr store;
    class OUT io;

Ingest · Parse, type, embed

Finlex HTML, Vero ohje pages, KHO rulings, and treaties are parsed into typed nodes anchored to their source URL. Cross-references and amendment clauses become typed edges (cites / amends / interprets / applies / repeals / defines). Sections are also tokenised and embedded with voyage-3-large so retrieval has a hybrid lexical+semantic surface.
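
As an illustration of the edge-extraction step, the sketch below turns in-text section references into `cites` edges. The regular expression, the `Edge` type, and the `law:sec:number` id scheme are all hypothetical; the real parsers are not shown on this page and will need to handle many more reference forms.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str  # id of the section whose text contains the reference
    dst: str  # id of the referenced section
    rel: str  # relation type, here always "cites"


# Matches "§ 117", "§117" and the Finnish order "117 §". Illustrative only:
# it ignores chapter references, ranges, and references to other laws.
SECTION_REF = re.compile(r"§\s*(\d+[a-z]?)|(\d+[a-z]?)\s*§")


def extract_cites(src_id: str, law_id: str, text: str) -> list[Edge]:
    """Turn section references inside one law's text into `cites` edges."""
    edges: list[Edge] = []
    seen: set[str] = set()
    for match in SECTION_REF.finditer(text):
        number = match.group(1) or match.group(2)
        dst = f"{law_id}:sec:{number}"  # hypothetical id scheme
        if dst != src_id and dst not in seen:  # skip self-references and repeats
            seen.add(dst)
            edges.append(Edge(src_id, dst, "cites"))
    return edges


edges = extract_cites("tvl:sec:10", "tvl", "Ks. 117 § ja § 118a, sekä 10 §.")
```

Here the reference to 10 § is dropped because it points back at the citing section itself.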

Retrieve · Search, walk, rerank

The question is rewritten to surface Finnish equivalents, then a hybrid dense+BM25 search returns seed sections. Rerank blends cosine with authority rank, recency, term overlap, and a graded penalty pulled from the section's amendment history in the graph.
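
The fusion and rerank steps can be sketched roughly as follows. Reciprocal rank fusion (RRF) is the standard formula; the constant k = 60, the signal weights, and the way the amendment penalty is subtracted are assumptions for illustration, since this page names the signals but not how they are combined.

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical weights: the signals are named on this page, the numbers are not.
WEIGHTS = {"cosine": 0.6, "authority": 0.2, "recency": 0.1, "term": 0.1}


def rerank_score(signals, amendment_penalty):
    """Weighted sum of signals in [0, 1], minus a graded penalty in [0, 1]."""
    blended = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    return blended - amendment_penalty


dense = ["sec:a", "sec:b", "sec:c"]  # order from the vector search
bm25 = ["sec:b", "sec:d", "sec:a"]   # order from the lexical search
seeds = rrf([dense, bm25])

candidates = {
    "sec:a": ({"cosine": 0.82, "authority": 1.0, "recency": 0.4, "term": 0.5}, 0.0),
    "sec:b": ({"cosine": 0.88, "authority": 0.5, "recency": 0.9, "term": 0.6}, 0.3),
}
reranked = sorted(candidates, key=lambda d: -rerank_score(*candidates[d]))
```

In this toy example sec:b wins the fusion step because both searches rank it highly, but drops below sec:a after reranking because of its amendment penalty.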

Assemble · Context with typed edges

The top-N reranked chunks are rendered into a single prompt where every cross-reference between cited sections is surfaced verbatim - `Source 1 cites → Source 3`, `Source 5 interprets → Source 2`. The LLM treats the sources as a small graph, not a bag.
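
A minimal sketch of that rendering step, assuming sources arrive as (node id, text) pairs in rerank order and edges as (src, rel, dst) triples from the graph store. The function name, the input shapes, and the "Cross-references:" header are hypothetical; only the `Source N rel → Source M` line format comes from this page.

```python
def assemble_context(sources, edges):
    """Render numbered sources plus the typed edges that connect them.

    sources: list of (node_id, text) in rerank order.
    edges:   list of (src_id, rel, dst_id) taken from the graph store.
    """
    number = {node_id: i for i, (node_id, _) in enumerate(sources, start=1)}
    lines = [f"[Source {number[node_id]}] {text}" for node_id, text in sources]
    links = [
        f"Source {number[src]} {rel} → Source {number[dst]}"
        for src, rel, dst in edges
        if src in number and dst in number  # keep only edges between included sources
    ]
    if links:
        lines.append("Cross-references:")
        lines.extend(links)
    return "\n".join(lines)


prompt_context = assemble_context(
    [
        ("sec:1", "Text of the first section."),
        ("guide:7", "Text of a Vero guide."),
        ("sec:3", "Text of another section."),
    ],
    [
        ("sec:1", "cites", "sec:3"),
        ("guide:7", "interprets", "sec:1"),
        ("sec:1", "cites", "sec:99"),  # dropped: sec:99 is not among the sources
    ],
)
```

Edges that point outside the selected sources are filtered out, so the model only sees relations it can follow within the prompt.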

Generate · Cite-token answer + grader

DeepSeek V4 Pro writes the answer with inline [Source N] cites, which the sidecar rewrites to clickable [cite:node:…] tokens. A second, short LLM call grades confidence (high / medium / low); when the grade is low, the UI surfaces an Ask-Specialist CTA.
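
The cite rewrite and the low-confidence check could look roughly like this. This page elides what follows `cite:node:`; the sketch assumes it is the id of the cited node, which may not match the real token format. The function names are hypothetical, and leaving out-of-range source numbers untouched is a design choice of the sketch.

```python
import re

SOURCE_TAG = re.compile(r"\[Source (\d+)\]")


def rewrite_cites(answer, source_ids):
    """Replace [Source N] with [cite:node:<id>] (assumed token shape).

    source_ids lists node ids in the order they were numbered in the prompt.
    Numbers with no matching source are left as-is rather than guessed.
    """
    def replace(match):
        n = int(match.group(1))
        if 1 <= n <= len(source_ids):
            return f"[cite:node:{source_ids[n - 1]}]"
        return match.group(0)

    return SOURCE_TAG.sub(replace, answer)


def needs_specialist_cta(confidence):
    """True when the grader's verdict should trigger the Ask-Specialist CTA."""
    return confidence == "low"


rewritten = rewrite_cites(
    "Deductible [Source 2], see also [Source 9].",
    ["sec:1", "sec:3"],
)
```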

See also

The Methodology page describes the richer LRMoo / SAT-Graph-style architecture we explored during the hackathon but did not ship. The system documented on this page is what actually backs /ask.