Technology · Corpus & pipeline

The official record, made data.

§ Ingestion

Everything rests on a clean, complete dataset. Opinions, statutes, and the legislative record are pulled from each jurisdiction's official source — guamcourts.gov, the Guam Code Annotated, the Legislature's Public Law archive — stripped out of their PDFs, and rebuilt into a single normalized structure: numbered paragraphs, parsed citations, structured annotations, and the metadata that links a document to everything around it. The pipeline is polite and resumable — it identifies itself, throttles, retries with backoff, and skips what it already has — so a run can be stopped and picked up without re-downloading the corpus.
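The four behaviors named above (identify, throttle, retry with backoff, skip what is already on disk) can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: `fetch_politely`, the `USER_AGENT` string, and the injected `download` callable are all hypothetical names, and the delay values are placeholders.

```python
import time
from pathlib import Path
from typing import Callable

# Hypothetical identifying string; a real crawler would name itself and a contact.
USER_AGENT = "example-corpus-bot/0.1 (contact: ops@example.org)"


def fetch_politely(
    url: str,
    dest: Path,
    download: Callable[[str, dict], bytes],
    *,
    delay: float = 1.0,
    max_retries: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Fetch url into dest unless it is already there. Returns 'skipped' or 'fetched'."""
    # Resumable: a non-empty file from an earlier run is never downloaded again.
    if dest.exists() and dest.stat().st_size > 0:
        return "skipped"

    last_error = None
    for attempt in range(max_retries):
        try:
            # Identify: every request carries the same User-Agent header.
            data = download(url, {"User-Agent": USER_AGENT})
        except OSError as exc:
            last_error = exc
            # Retry with exponential backoff: delay, 2*delay, 4*delay, ...
            sleep(delay * 2 ** attempt)
            continue

        # Write to a temporary name first so an interrupted run never leaves
        # a half-written file that would later be mistaken for a complete one.
        tmp = dest.with_suffix(dest.suffix + ".part")
        tmp.write_bytes(data)
        tmp.replace(dest)

        # Throttle: pause after each successful request.
        sleep(delay)
        return "fetched"

    raise RuntimeError(f"gave up on {url} after {max_retries} attempts") from last_error
```

Passing `download` and `sleep` in as parameters is a design choice for the sketch: it keeps the retry and skip logic testable without touching the network or waiting on a real clock.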

01 Scrape: official sources
02 Parse: PDF → text
03 Normalize: one schema
04 Chunk: passages
05 Embed: vectors
06 Verify: coverage test

§ Three strata

For Guam, three layers of the law are loaded and searchable today — not just the opinions, but the statutes they apply and the legislative record that enacted those statutes. Most legal databases stop at the first. The value here is having all three, normalized into the same structure, so they can be linked.

792 Supreme Court opinions · 1996–present

17,898 Code sections · All 22 titles · ~3M words

3,222 Public Laws · 22nd–38th Legislatures

Opinions: Every Supreme Court of Guam decision, reconstructed into numbered paragraphs with stable anchors
The Code: The full Guam Code Annotated — every title, chapter, and section, with its source and annotation notes
Legislation: Public Laws and introduced bills — sponsors, the bill each law came from, and the code sections it touched
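To make "the same structure" concrete, here is one way a shared record for all three strata might look. The class and field names (`Document`, `Paragraph`, `kind`, `anchor`) are assumptions chosen for illustration, and the identifiers in the usage note are placeholders, not real corpus entries.

```python
from dataclasses import dataclass, field


@dataclass
class Paragraph:
    number: int  # paragraph number within the document
    text: str


@dataclass
class Document:
    doc_id: str  # hypothetical stable identifier
    kind: str    # "opinion", "code_section", or "public_law"
    title: str
    paragraphs: list[Paragraph] = field(default_factory=list)
    citations: list[str] = field(default_factory=list)      # doc_ids this document cites
    annotations: dict[str, str] = field(default_factory=dict)  # e.g. source notes

    def anchor(self, number: int) -> str:
        """Return a stable anchor for one numbered paragraph."""
        if not any(p.number == number for p in self.paragraphs):
            raise KeyError(number)
        return f"{self.doc_id}#p{number}"
```

With a placeholder record such as `Document("opinion:example-1", "opinion", "Example v. Example", [Paragraph(1, "...")])`, calling `anchor(1)` yields `opinion:example-1#p1`. Because opinions, code sections, and Public Laws share one shape, a citation in any of them can point at a `doc_id` in any other.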

§ Coverage is a test, not a claim

Completeness is not assumed; it is verified. The Guam Code Annotated is reconciled against each title's own table of contents — an independent source of truth published by the compiler — so a chapter the scraper silently skipped or a section lost in parsing fails the build rather than passing unnoticed. The check catches regex skips, chapters listed but never fetched, sections that loaded but were never embedded, and stale rows in the database. It runs green or the corpus is not shipped.
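At its core the reconciliation is a set comparison between what the table of contents lists and what actually made it through each stage. The sketch below assumes section identifiers have already been extracted from both sides; `check_coverage`, `CoverageReport`, and `require_green` are hypothetical names, not the build's real interface.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CoverageReport:
    missing: set[str]     # listed in the table of contents but never loaded
    unembedded: set[str]  # loaded but without an embedding
    stale: set[str]       # in the database but no longer in the table of contents

    @property
    def ok(self) -> bool:
        return not (self.missing or self.unembedded or self.stale)


def check_coverage(
    toc_ids: Iterable[str],
    loaded_ids: Iterable[str],
    embedded_ids: Iterable[str],
) -> CoverageReport:
    """Compare the independent table of contents against the loaded corpus."""
    toc, loaded, embedded = set(toc_ids), set(loaded_ids), set(embedded_ids)
    return CoverageReport(
        missing=toc - loaded,
        unembedded=(toc & loaded) - embedded,
        stale=loaded - toc,
    )


def require_green(report: CoverageReport) -> None:
    """Fail the build (non-zero exit) unless every category is empty."""
    if not report.ok:
        raise SystemExit(
            f"coverage failed: {len(report.missing)} missing, "
            f"{len(report.unembedded)} unembedded, {len(report.stale)} stale"
        )
```

The point of returning three separate sets rather than a single boolean is diagnostic: a skipped chapter, a section that was never embedded, and a stale row each call for a different fix.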

Where a document genuinely cannot be read — older Public Laws that survive only as scanned images with no text layer — the gap is recorded as an honest unknown, never papered over with a guess. A sponsor we cannot extract is left blank, not filled with the nearest plausible name. The number we publish is the number we can stand behind.
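In code, an honest unknown usually means an explicit empty value rather than a fallback. The sketch below is illustrative only: the `Introduced by:` header pattern is a hypothetical format, not a claim about how Public Laws are actually laid out, and `extract_sponsor` is an invented name.

```python
import re
from typing import Optional

# Hypothetical header format, used only for illustration.
SPONSOR_RE = re.compile(r"Introduced by:\s*(.+)")


def extract_sponsor(text: Optional[str]) -> Optional[str]:
    """Return the sponsor named in the text, or None when it cannot be read."""
    # A scanned image with no text layer arrives as empty or missing text.
    if not text or not text.strip():
        return None
    match = SPONSOR_RE.search(text)
    # No match means unknown. There is deliberately no nearest-name fallback.
    return match.group(1).strip() if match else None
```

Downstream counts can then be computed over the values that are not `None`, so the published number reflects only what was actually extracted.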

§ The lifecycle graph

Underneath the search is a graph that the documents alone do not give you. We model a jurisdiction's law as linked nodes — opinions, code sections, the Public Laws that enacted them, and the people behind each: the senators who sponsored a bill, the governor who signed it, the justices who later construed it. The edges are the value. A single code section can be traced from the bill that introduced it, through the Public Law that enacted or amended it, to every opinion that has interpreted it.
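A small adjacency-list sketch shows how such a trace might work. The relation names (`enacted`, `amended`, `came_from`, `interprets`, `sponsored`) and the node identifiers in the usage note are assumptions made for the example, not the production schema.

```python
from collections import defaultdict


class LifecycleGraph:
    """Directed, labeled edges between documents and actors."""

    def __init__(self) -> None:
        self._out: dict[str, list[tuple[str, str]]] = defaultdict(list)
        self._in: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        # Store each edge in both directions so it can be walked either way.
        self._out[src].append((relation, dst))
        self._in[dst].append((relation, src))

    def history(self, section: str) -> dict[str, list[str]]:
        """Trace a code section back to its bills and forward to its opinions."""
        incoming = self._in.get(section, [])
        laws = [src for rel, src in incoming if rel in ("enacted", "amended")]
        bills = [
            dst
            for law in laws
            for rel, dst in self._out.get(law, [])
            if rel == "came_from"
        ]
        opinions = [src for rel, src in incoming if rel == "interprets"]
        return {"bills": bills, "public_laws": laws, "opinions": opinions}
```

For example, after `add_edge("law:PL-X", "enacted", "section:S-1")`, `add_edge("law:PL-X", "came_from", "bill:B-1")`, and `add_edge("opinion:O-1", "interprets", "section:S-1")`, calling `history("section:S-1")` returns the bill, the law, and the opinion in one lookup. Actors attach the same way, for instance `add_edge("person:senator-a", "sponsored", "bill:B-1")`.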

One section, its whole history — and every actor who touched it, as a node.

The moat is not the documents — those are public — but the completeness and the links between them, which no one else has assembled for these jurisdictions. The actor profiles built on this graph are held to the same standard as the rest of the corpus: they remain in preparation, behind verification, until the historical names are reconciled to an authoritative roster, because a confident wrong attribution is worse than an honest gap.

Next: how that corpus is searched, and the rule synthesized from it. Search & synthesis →