PatentChecker docs for evaluation, verification, and delivery risk

Use this lane for buyer evaluation, evidence verification, self-hosting, adapters, coverage scope, and the delivery-system surfaces behind vector and sequence IP review.

This document freezes the proven PatentChecker ingestion v0.1 contract before any provider expansion or deeper persistence work.

Implemented Scope

USPTO ODP bulk ingestion is the only implemented provider path in ingestion v0.1.
EPO OPS, KIPRISPlus, and CNIPA remain registered as typechecked inert stubs. They must not fetch, parse, ingest, or mutate canonical tables in v0.1.
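
One way to read "typechecked inert stubs" is a provider object that satisfies the shared interface at compile time but refuses all work at runtime. The sketch below is illustrative only: `IngestionProvider`, `ProviderNotImplementedError`, `inertStub`, and the provider ID strings are hypothetical names, not the PatentChecker API.

```typescript
// Hypothetical provider contract; all names here are illustrative assumptions.
type ProviderId = "uspto-odp" | "epo-ops" | "kiprisplus" | "cnipa";

interface IngestionProvider {
  readonly id: ProviderId;
  readonly implemented: boolean;
  fetchArtifacts(): string[];
}

class ProviderNotImplementedError extends Error {
  constructor(id: ProviderId) {
    super(`Provider ${id} is an inert stub in ingestion v0.1`);
    this.name = "ProviderNotImplementedError";
  }
}

// An inert stub typechecks against the interface but throws before doing
// any fetch, parse, ingest, or canonical mutation.
function inertStub(id: ProviderId): IngestionProvider {
  return {
    id,
    implemented: false,
    fetchArtifacts(): string[] {
      throw new ProviderNotImplementedError(id);
    },
  };
}

const inertProviders: IngestionProvider[] = [
  inertStub("epo-ops"),
  inertStub("kiprisplus"),
  inertStub("cnipa"),
];
```

Throwing on every entry point keeps the stubs registered and typechecked while making any accidental call fail loudly instead of silently doing nothing.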

Core Contract

A patent record is not eligible for redistribution, evidence packet generation, or customer export unless its source artifact, checksum, provider, retrieval timestamp, parser version, and license status are present and valid.
Canonical writes are provenance-gated. Provider workers do not write canonical patent tables directly; they write through the ingestion unit-of-work boundary with a ProvenanceValidatedContext.
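
The gate described above can be sketched with a branded type: a canonical writer accepts only a value that the validator produced. This is a minimal sketch under stated assumptions. The field names, the `redistributable` license value, and the helper names `validateProvenance` and `upsertCanonicalPatent` are hypothetical; only the name `ProvenanceValidatedContext` and the list of required provenance fields come from the contract text.

```typescript
// Hypothetical input shape; field names are assumptions based on the contract text.
interface ProvenanceInput {
  sourceArtifactId?: string;
  checksumSha256?: string;
  provider?: string;
  retrievedAt?: string; // ISO 8601 timestamp (assumed format)
  parserVersion?: string;
  licenseStatus?: string;
}

// Branded type: the only intended way to obtain one is validateProvenance.
type ProvenanceValidatedContext = Required<ProvenanceInput> & {
  readonly __brand: "ProvenanceValidatedContext";
};

const SHA256_HEX = /^[0-9a-f]{64}$/;
const ALLOWED_LICENSES = new Set(["redistributable"]); // assumed value

// Returns null (fails closed) if any required field is missing or invalid.
function validateProvenance(input: ProvenanceInput): ProvenanceValidatedContext | null {
  const { sourceArtifactId, checksumSha256, provider, retrievedAt, parserVersion, licenseStatus } = input;
  if (!sourceArtifactId || !provider || !parserVersion) return null;
  if (!checksumSha256 || !SHA256_HEX.test(checksumSha256)) return null;
  if (!retrievedAt || Number.isNaN(Date.parse(retrievedAt))) return null;
  if (!licenseStatus || !ALLOWED_LICENSES.has(licenseStatus)) return null;
  return {
    sourceArtifactId,
    checksumSha256,
    provider,
    retrievedAt,
    parserVersion,
    licenseStatus,
    __brand: "ProvenanceValidatedContext",
  };
}

// A canonical writer that cannot be called with unvalidated provenance.
function upsertCanonicalPatent(ctx: ProvenanceValidatedContext, patentId: string): string {
  return `${patentId} <- ${ctx.provider}:${ctx.checksumSha256.slice(0, 8)}`;
}
```

Because the writer's parameter type is the branded context, passing a raw `ProvenanceInput` is a compile-time error rather than a runtime surprise.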
Raw artifacts are immutable and content-addressed by SHA-256. API or bulk fetch results must be persisted as raw artifacts before parsing, normalization, canonical upsert, checkpoint advancement, retry resolution, or evidence use.
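
A content-addressed store keyed by SHA-256 might look like the following in-memory sketch. `RawArtifactStore` and `persistThenParse` are hypothetical names, and a real store would persist to disk or a database; the point is only that the key is derived from the bytes and that parsing reads from the stored copy.

```typescript
import { createHash } from "node:crypto";

// Hypothetical in-memory stand-in for an immutable raw artifact store.
class RawArtifactStore {
  private readonly blobs = new Map<string, Uint8Array>();

  // The key is the SHA-256 hex digest of the content, so identical bytes
  // always map to the same key and existing entries are never overwritten.
  put(bytes: Uint8Array): string {
    const key = createHash("sha256").update(bytes).digest("hex");
    if (!this.blobs.has(key)) {
      this.blobs.set(key, Uint8Array.from(bytes)); // defensive copy
    }
    return key;
  }

  get(key: string): Uint8Array {
    const stored = this.blobs.get(key);
    if (!stored) throw new Error(`unknown artifact ${key}`);
    return Uint8Array.from(stored); // callers cannot mutate the stored bytes
  }
}

// Persist first, then parse from the stored bytes, never from the fetch buffer.
function persistThenParse<T>(
  store: RawArtifactStore,
  fetched: Uint8Array,
  parse: (bytes: Uint8Array) => T,
): { key: string; parsed: T } {
  const key = store.put(fetched);
  return { key, parsed: parse(store.get(key)) };
}
```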

Runtime Defaults

| Setting | Default | Contract |
| --- | --- | --- |
| INGESTION_SQLITE_DURABILITY_MODE | strict | Production default. Invalid values fail closed. |
| INGESTION_MAX_CONCURRENT_ARTIFACTS | 2 | Bounded USPTO artifact fetch/preparation concurrency. |
| INGESTION_CANONICAL_LANES | 1 | Conservative default. Canonical processing stays single-lane unless explicitly overridden. |
| INGESTION_CHUNK_SIZE | 1000 | Streaming parse/normalize chunk size. |
| INGESTION_USPTO_ARTIFACTS_PER_SECOND | 1 | External USPTO artifact fetch/preparation rate limit. |

The benchmark SQLite durability mode is unsafe and intended only for profiling. It is blocked when NODE_ENV=production.
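
The defaults and guards above could be enforced by a loader along these lines. This is a sketch, not the shipped config module: `loadIngestionConfig`, the `IngestionConfig` shape, and the choice to require positive integers for every numeric setting are assumptions. The three mode names are taken from the benchmark receipts in this document.

```typescript
type DurabilityMode = "strict" | "balanced" | "benchmark";

interface IngestionConfig {
  durabilityMode: DurabilityMode;
  maxConcurrentArtifacts: number;
  canonicalLanes: number;
  chunkSize: number;
  usptoArtifactsPerSecond: number;
}

type Env = Record<string, string | undefined>;

// Unset means "use the default"; anything else must be a positive integer (assumed rule).
function positiveInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined) return fallback;
  if (!/^[1-9][0-9]*$/.test(raw)) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return Number(raw);
}

function durabilityMode(env: Env): DurabilityMode {
  const raw = env["INGESTION_SQLITE_DURABILITY_MODE"] ?? "strict";
  if (raw !== "strict" && raw !== "balanced" && raw !== "benchmark") {
    // Fail closed: never fall back silently on an unrecognized value.
    throw new Error(`invalid INGESTION_SQLITE_DURABILITY_MODE "${raw}"`);
  }
  if (raw === "benchmark" && env["NODE_ENV"] === "production") {
    throw new Error("benchmark durability mode is blocked in production");
  }
  return raw;
}

function loadIngestionConfig(env: Env): IngestionConfig {
  return {
    durabilityMode: durabilityMode(env),
    maxConcurrentArtifacts: positiveInt(env, "INGESTION_MAX_CONCURRENT_ARTIFACTS", 2),
    canonicalLanes: positiveInt(env, "INGESTION_CANONICAL_LANES", 1),
    chunkSize: positiveInt(env, "INGESTION_CHUNK_SIZE", 1000),
    usptoArtifactsPerSecond: positiveInt(env, "INGESTION_USPTO_ARTIFACTS_PER_SECOND", 1),
  };
}
```

Taking the environment as a parameter (for example `loadIngestionConfig(process.env)`) keeps the loader testable without mutating global state.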

Determinism And Replay

Replay determinism is part of the v0.1 contract. USPTO bulk ingestion supports deterministic replay from stored raw artifacts and streaming replay hashing for large artifacts.
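
Streaming replay hashing can be illustrated with an incremental SHA-256 that never holds the whole artifact in memory. The source does not specify what bytes the replay hash covers, so the sketch below simply hashes a byte stream; `replayHash` and `chunked` are hypothetical helper names.

```typescript
import { createHash } from "node:crypto";

// Feed chunks into one running hash; the digest depends only on the
// concatenated bytes, not on where the chunk boundaries fall.
function replayHash(chunks: Iterable<Uint8Array>): string {
  const hash = createHash("sha256");
  for (const chunk of chunks) {
    hash.update(chunk);
  }
  return hash.digest("hex");
}

// Split a buffer into fixed-size views to simulate streaming reads.
function* chunked(bytes: Uint8Array, size: number): Generator<Uint8Array> {
  for (let offset = 0; offset < bytes.length; offset += size) {
    yield bytes.subarray(offset, offset + size);
  }
}
```

Because the digest is independent of chunk boundaries, two replays of the same stored artifact with different chunk sizes can be compared hash to hash.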
Canonical identity remains deterministic across repeated runs:
  • canonical_patent_id is derived from normalized country code, document number, and kind code.
  • Unknown family identifiers are scoped as family:unknown:{canonical_patent_id} to avoid merging unrelated patents.
  • canonical_family_id is deterministic for repeated parses of the same artifact.
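
The identity rules above can be sketched as two pure functions. The `CC-NUMBER-KIND` layout, the normalization steps, and the `family:{id}` form for known families are assumptions made for illustration; only the `family:unknown:{canonical_patent_id}` scoping is taken from the contract.

```typescript
// Assumed normalization: trim, uppercase, and strip punctuation from the number.
function canonicalPatentId(country: string, docNumber: string, kind: string): string {
  const cc = country.trim().toUpperCase();
  const num = docNumber.replace(/[^0-9A-Za-z]/g, "").toUpperCase();
  const kc = kind.trim().toUpperCase();
  if (!cc || !num || !kc) {
    throw new Error("country code, document number, and kind code are required");
  }
  return `${cc}-${num}-${kc}`; // assumed layout
}

// Unknown families are scoped to the patent itself so that two patents with
// no family identifier never collapse into one shared "unknown" family.
function canonicalFamilyId(patentId: string, familyId?: string): string {
  const known = familyId?.trim();
  return known ? `family:${known}` : `family:unknown:${patentId}`;
}
```

Both functions depend only on their inputs, so repeated parses of the same artifact yield the same identifiers.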

Retry And Checkpoint Semantics

provider_checkpoint is the ingestion completion ledger. A checkpoint means canonical ingestion completed successfully for that provider, dataset, and checkpoint key.
failed_artifact is a retry work queue, not a completion ledger.
On canonical failure:
  • the raw artifact remains stored
  • the checkpoint is marked failed
  • a failed artifact row is enqueued
  • no partial canonical commit is allowed for that artifact
On retry success:
  • stored raw bytes are reused
  • the failed artifact is resolved
  • the checkpoint advances to succeeded
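
The failure and retry rules above can be modeled with a small in-memory ledger. This is a sketch under stated assumptions: `IngestionLedger` and its method names are hypothetical, plain maps stand in for the `provider_checkpoint` and `failed_artifact` tables, and a real implementation would wrap the canonical commit in a database transaction.

```typescript
type CheckpointStatus = "succeeded" | "failed";

class IngestionLedger {
  readonly rawArtifacts = new Map<string, string>(); // artifact key -> raw payload
  readonly checkpoints = new Map<string, CheckpointStatus>(); // completion ledger
  readonly failedArtifacts = new Map<string, { resolved: boolean; error: string }>(); // retry queue
  readonly canonical: string[] = [];

  // First attempt: the raw payload is stored before any parsing happens.
  ingest(checkpointKey: string, artifactKey: string, raw: string, parse: (raw: string) => string[]): void {
    this.rawArtifacts.set(artifactKey, raw);
    this.runCanonical(checkpointKey, artifactKey, parse);
  }

  // Retry: reuses the stored raw payload and never refetches.
  retry(checkpointKey: string, artifactKey: string, parse: (raw: string) => string[]): void {
    this.runCanonical(checkpointKey, artifactKey, parse);
  }

  private runCanonical(checkpointKey: string, artifactKey: string, parse: (raw: string) => string[]): void {
    const raw = this.rawArtifacts.get(artifactKey);
    if (raw === undefined) throw new Error(`no stored raw artifact for ${artifactKey}`);
    try {
      const rows = parse(raw); // may throw; nothing has been committed yet
      this.canonical.push(...rows); // all rows for the artifact commit together
      this.checkpoints.set(checkpointKey, "succeeded");
      const failed = this.failedArtifacts.get(artifactKey);
      if (failed) failed.resolved = true;
    } catch (err) {
      this.checkpoints.set(checkpointKey, "failed");
      this.failedArtifacts.set(artifactKey, { resolved: false, error: String(err) });
    }
  }
}
```

Collecting all rows before the commit is what rules out a partial canonical write: if the parser throws midway, the canonical list is untouched.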

Operational Drill Guidance

Manual operational drills should run on /dev/shm or another known-stable local filesystem on hosts where repo artifact paths or /tmp show journal stalls. On this host, prior D-state (uninterruptible sleep) failures were host I/O symptoms, not ingestion proof failures.
Successful smoke drill evidence on /dev/shm recorded:
  • 20 raw artifacts
  • 20000 patent documents
  • 4000 patent families
  • 20 succeeded checkpoints
  • 0 failed artifacts
  • max_concurrent_artifacts=2
  • uspto_artifacts_per_second=1

Benchmark Receipts

Durability readiness, 100k records:
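| mode | records/sec | total sqlite ms | replay stable | documents | families |
| --- | --- | --- | --- | --- | --- |
| strict | 7161.45 | 9895.27 | yes | 100000 | 20000 |
| balanced | 5997.73 | 13222.89 | yes | 100000 | 20000 |
| benchmark | 11658.46 | 429.46 | yes | 100000 | 20000 |
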
Decision: keep strict as default. balanced was slower than strict on the measured USPTO workload. benchmark is fastest but unsafe.
USPTO artifact preparation concurrency, 20 artifacts x 5000 records:
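| max_concurrent_artifacts | records/sec | peak RSS | documents | families | failed artifacts |
| --- | --- | --- | --- | --- | --- |
| 1 | 29.15 | 243.48 MiB | 100000 | 20000 | 0 |
| 2 | 45.27 | 272.26 MiB | 100000 | 20000 | 0 |
| 4 | 34.45 | 269.71 MiB | 100000 | 20000 | 0 |
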
Decision: default INGESTION_MAX_CONCURRENT_ARTIFACTS=2.
USPTO canonical lane benchmark on /dev/shm, 20 artifacts x 5000 records:
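| canonical_lanes | records/sec | peak RSS | replay hash vs lane 1 | documents | families | failed artifacts |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 5160.91 | 282.95 MiB | match | 100000 | 20000 | 0 |
| 2 | 5164.61 | 280.12 MiB | match | 100000 | 20000 | 0 |
| 4 | 5157.41 | 320.89 MiB | match | 100000 | 20000 | 0 |
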
Decision: keep INGESTION_CANONICAL_LANES=1. Throughput is flat across 1, 2, and 4 lanes, and running 4 lanes raises peak RSS (320.89 MiB versus 282.95 MiB with 1 lane).

Validation

Release checkpoint validation:
```bash
npm run build
npm test -- --group ingestion
```
The ingestion group covers provenance gates, provider import boundaries, non-USPTO inert stubs, USPTO bulk ingestion, replay determinism, retry behavior, durability mode guards, runtime config validation, operational drill behavior, and benchmark CLI surfaces.

Explicit Non-Goals

Ingestion v0.1 does not include:
  • EPO ingestion
  • KIPRISPlus ingestion
  • CNIPA ingestion
  • customer export
  • search index writes
PatentChecker ingestion v0.1 | Omnis documentation | Omnis Genomics