Technology · Corpus & pipeline
§ Ingestion
Everything rests on a clean, complete dataset. Opinions, statutes, and the legislative record are pulled from each jurisdiction's official source — guamcourts.gov, the Guam Code Annotated, the Legislature's Public Law archive — stripped out of their PDFs, and rebuilt into a single normalized structure: numbered paragraphs, parsed citations, structured annotations, and the metadata that links a document to everything around it. The pipeline is polite and resumable — it identifies itself, throttles, retries with backoff, and skips what it already has — so a run can be stopped and picked up without re-downloading the corpus.
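A minimal sketch of what "polite and resumable" can look like in practice, not the production crawler. The names `fetch_all`, `http_get`, and `USER_AGENT`, the one-second delay, and the `.part` temporary-file convention are all assumptions made for illustration.

```python
import time
from pathlib import Path
from typing import Callable, Dict, List
from urllib.request import Request, urlopen

# Hypothetical identifier: a polite crawler names itself and gives a contact.
USER_AGENT = "example-corpus-bot/0.1 (contact: ops@example.org)"


def http_get(url: str) -> bytes:
    """Fetch one URL, identifying the client through the User-Agent header."""
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req, timeout=30) as resp:
        return resp.read()


def fetch_all(
    urls: Dict[str, str],
    dest: Path,
    get: Callable[[str], bytes] = http_get,
    delay: float = 1.0,
    max_tries: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Download each named URL into dest, skipping files already present.

    Returns the names actually downloaded during this run.
    """
    dest.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for name, url in urls.items():
        target = dest / name
        if target.exists():  # resumable: skip what we already have
            continue
        for attempt in range(max_tries):
            try:
                data = get(url)
                break
            except OSError:  # urllib's URLError and HTTPError derive from OSError
                if attempt == max_tries - 1:
                    raise
                sleep(delay * 2 ** attempt)  # exponential backoff
        # Write to a temporary name, then rename, so an interrupted run
        # never leaves a partial file that looks complete.
        tmp = target.with_name(target.name + ".part")
        tmp.write_bytes(data)
        tmp.replace(target)
        downloaded.append(name)
        sleep(delay)  # throttle between requests
    return downloaded
```

Because the skip test is simply "does the file exist", stopping the process and calling `fetch_all` again with the same arguments resumes where the previous run left off.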
§ Three strata
For Guam, three layers of the law are loaded and searchable today — not just the opinions, but the statutes they apply and the legislative record that enacted those statutes. Most legal databases stop at the first. The value here is having all three, normalized into the same structure, so they can be linked.
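One way to picture "the same structure" is a single document type shared by all three strata. The sketch below is an assumption about shape only; the class and field names (`Document`, `Stratum`, `Citation`, `target_id`) are hypothetical and do not describe the actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Set


class Stratum(Enum):
    OPINION = "opinion"
    STATUTE = "statute"
    PUBLIC_LAW = "public_law"


@dataclass
class Citation:
    raw: str                  # citation text as it appears in the paragraph
    target_id: Optional[str]  # id of the cited document, None if unresolved


@dataclass
class Paragraph:
    number: int
    text: str
    citations: List[Citation] = field(default_factory=list)


@dataclass
class Document:
    doc_id: str
    stratum: Stratum
    title: str
    paragraphs: List[Paragraph] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)

    def cited_ids(self) -> Set[str]:
        """Ids of every resolved document this one cites."""
        return {
            c.target_id
            for p in self.paragraphs
            for c in p.citations
            if c.target_id is not None
        }
```

With one type for opinions, statutes, and Public Laws, linking reduces to matching a citation's `target_id` against another document's `doc_id`, whatever stratum each belongs to.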
§ Coverage is a test, not a claim
Completeness is not assumed; it is verified. The Guam Code Annotated is reconciled against each title's own table of contents, an independent source of truth published by the Compiler of Laws, so a chapter the scraper silently skipped or a section lost in parsing fails the build rather than passing unnoticed. The check catches regex skips, chapters listed but never fetched, sections that loaded but were never embedded, and stale rows in the database. It runs green or the corpus is not shipped.
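The reconciliation amounts to a handful of set differences between the table of contents and each pipeline stage. The following is a simplified sketch: it assumes every stage can be reduced to a set of section identifiers at a single granularity, and the names `reconcile`, `CoverageReport`, and `require_complete` are illustrative, not the build's own.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CoverageReport:
    never_fetched: frozenset    # listed in the table of contents, never downloaded
    lost_in_parsing: frozenset  # downloaded, but no parsed section came out
    not_embedded: frozenset     # parsed, but missing from the search index
    stale: frozenset            # stored, but absent from the table of contents

    @property
    def ok(self) -> bool:
        return not (
            self.never_fetched
            or self.lost_in_parsing
            or self.not_embedded
            or self.stale
        )


def reconcile(
    toc: Iterable[str],
    fetched: Iterable[str],
    parsed: Iterable[str],
    embedded: Iterable[str],
) -> CoverageReport:
    """Compare each pipeline stage against the table of contents."""
    toc, fetched, parsed, embedded = map(frozenset, (toc, fetched, parsed, embedded))
    return CoverageReport(
        never_fetched=toc - fetched,
        lost_in_parsing=(toc & fetched) - parsed,
        not_embedded=(toc & parsed) - embedded,
        stale=(parsed | embedded) - toc,
    )


def require_complete(report: CoverageReport) -> None:
    """Fail the build with a non-zero exit if any gap remains."""
    if not report.ok:
        raise SystemExit(f"coverage check failed: {report}")
```

Each category is kept separate rather than merged into one count, so a failing build says which stage dropped a section as well as the fact that something is missing.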
Where a document genuinely cannot be read — older Public Laws that survive only as scanned images with no text layer — the gap is recorded as an honest unknown, never papered over with a guess. A sponsor we cannot extract is left blank, not filled with the nearest plausible name. The number we publish is the number we can stand behind.
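In data terms, an honest unknown is a missing value that carries its reason with it. A small sketch of the idea follows; the `Introduced by:` line layout, the regular expression, and every name here are hypothetical, chosen only to show the shape.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Extraction(Enum):
    EXTRACTED = "extracted"
    NO_TEXT_LAYER = "no_text_layer"  # scanned image, nothing to read
    NOT_FOUND = "not_found"          # text exists, but no sponsor line matched


@dataclass(frozen=True)
class SponsorField:
    value: Optional[str]
    status: Extraction


# Hypothetical layout: a line of the form "Introduced by: <name>".
SPONSOR_RE = re.compile(
    r"^Introduced by:[ \t]*(?P<name>\S.*?)[ \t]*$", re.MULTILINE
)


def extract_sponsor(page_text: Optional[str]) -> SponsorField:
    """Return the sponsor if one is legible, otherwise a typed blank."""
    if page_text is None or not page_text.strip():
        return SponsorField(None, Extraction.NO_TEXT_LAYER)
    match = SPONSOR_RE.search(page_text)
    if match is None:
        # No fallback to the "nearest plausible name": leave the field blank.
        return SponsorField(None, Extraction.NOT_FOUND)
    return SponsorField(match.group("name"), Extraction.EXTRACTED)
```

Published coverage figures can then count only the `EXTRACTED` rows, and the remaining statuses explain every blank.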
§ The lifecycle graph
Underneath the search is a graph that the documents alone do not give you. We model a jurisdiction's law as linked nodes — opinions, code sections, the Public Laws that enacted them, and the people behind each: the senators who sponsored a bill, the governor who signed it, the justices who later construed it. The edges are the value. A single code section can be traced from the bill that introduced it, through the Public Law that enacted or amended it, to every opinion that has interpreted it.
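The trace described above can be sketched as a walk over typed edges. This is an illustrative in-memory model only: the relation names (`sponsored`, `became`, `enacts`, `amends`, `interprets`) and the node identifiers are assumptions, not the stored schema.

```python
from collections import defaultdict
from typing import Dict, List


class LawGraph:
    """Directed graph with labelled edges, indexed in both directions."""

    def __init__(self) -> None:
        self._out = defaultdict(set)  # (source, relation) -> targets
        self._in = defaultdict(set)   # (target, relation) -> sources

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        self._out[(src, relation)].add(dst)
        self._in[(dst, relation)].add(src)

    def targets(self, src: str, relation: str) -> List[str]:
        return sorted(self._out.get((src, relation), ()))

    def sources(self, dst: str, relation: str) -> List[str]:
        return sorted(self._in.get((dst, relation), ()))


def trace_section(graph: LawGraph, section: str) -> Dict[str, List[str]]:
    """Follow one code section back to its bills and forward to its opinions."""
    laws = sorted(
        set(graph.sources(section, "enacts")) | set(graph.sources(section, "amends"))
    )
    bills = sorted({b for law in laws for b in graph.sources(law, "became")})
    sponsors = sorted({s for b in bills for s in graph.sources(b, "sponsored")})
    return {
        "public_laws": laws,
        "bills": bills,
        "sponsors": sponsors,
        "opinions": graph.sources(section, "interprets"),
    }


# Usage with placeholder identifiers:
g = LawGraph()
g.add_edge("person:sponsor-a", "sponsored", "bill:1")
g.add_edge("bill:1", "became", "pl:1")
g.add_edge("pl:1", "enacts", "sec:1")
g.add_edge("opinion:1", "interprets", "sec:1")
lifecycle = trace_section(g, "sec:1")
```

Indexing every edge in both directions is what lets a section be followed backwards to the bill that introduced it as cheaply as forwards to the opinions that construe it.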
The moat is not the documents — those are public — but the completeness and the links between them, which no one else has assembled for these jurisdictions. The actor profiles built on this graph are held to the same standard as the rest of the corpus: they remain in preparation, behind verification, until the historical names are reconciled to an authoritative roster, because a confident wrong attribution is worse than an honest gap.
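A rough sketch of that verification gate, under the assumption that reconciliation means an exact match after light normalization and nothing fuzzier. The function names and the normalization rules are illustrative, not the actual procedure.

```python
import unicodedata
from typing import Dict, Iterable, Optional


def normalize(name: str) -> str:
    """Fold case, accents, punctuation, and spacing so spellings compare."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_marks)
    return " ".join(kept.casefold().split())


def reconcile_names(
    extracted: Iterable[str], roster: Iterable[str]
) -> Dict[str, Optional[str]]:
    """Map each extracted name to one roster entry, or None when unsure."""
    index: Dict[str, set] = {}
    for canonical in roster:
        index.setdefault(normalize(canonical), set()).add(canonical)
    result: Dict[str, Optional[str]] = {}
    for raw in extracted:
        matches = index.get(normalize(raw), set())
        # Zero or several candidates both count as unverified: no guessing.
        result[raw] = next(iter(matches)) if len(matches) == 1 else None
    return result


def ready_to_publish(reconciled: Dict[str, Optional[str]]) -> bool:
    """A profile leaves preparation only when every name is reconciled."""
    return all(value is not None for value in reconciled.values())
```

An abbreviated or ambiguous name stays `None`, which holds the profile back instead of attaching it to the closest-looking roster entry.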