Docs · PatentChecker


PatentChecker ingestion v0.1


This document freezes the proven PatentChecker ingestion v0.1 contract before any provider expansion or deeper persistence work.

Implemented Scope

USPTO ODP bulk ingestion is the only implemented provider path in ingestion v0.1.

EPO OPS, KIPRISPlus, and CNIPA remain registered as typechecked inert stubs. They must not fetch, parse, ingest, or mutate canonical tables in v0.1.

Core Contract

A patent record is not eligible for redistribution, evidence packet generation, or customer export unless its source artifact, checksum, provider, retrieval timestamp, parser version, and license status are present and valid.
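
The eligibility rule above can be sketched as a single predicate. This is a minimal illustration, not the shipped check: the field names, the `LicenseStatus` values, and the specific validity tests are assumptions, since the contract only lists which attributes must be present and valid.

```typescript
// Hypothetical shape: the contract names these six attributes but not their
// field names or types.
type LicenseStatus = "cleared" | "restricted" | "unknown";

interface Provenance {
  sourceArtifactId?: string;
  checksumSha256?: string;
  provider?: string;
  retrievedAt?: string; // ISO 8601 timestamp (assumed format)
  parserVersion?: string;
  licenseStatus?: LicenseStatus;
}

// Returns every reason a record is ineligible. An empty list means the record
// passes the provenance gate.
function provenanceProblems(p: Provenance): string[] {
  const problems: string[] = [];
  if (!p.sourceArtifactId) problems.push("missing source artifact");
  if (!p.checksumSha256 || !/^[0-9a-f]{64}$/.test(p.checksumSha256)) {
    problems.push("missing or malformed SHA-256 checksum");
  }
  if (!p.provider) problems.push("missing provider");
  if (!p.retrievedAt || Number.isNaN(Date.parse(p.retrievedAt))) {
    problems.push("missing or invalid retrieval timestamp");
  }
  if (!p.parserVersion) problems.push("missing parser version");
  // Assumption: only an explicitly cleared license counts as valid.
  if (p.licenseStatus !== "cleared") problems.push("license not cleared");
  return problems;
}

function isEligible(p: Provenance): boolean {
  return provenanceProblems(p).length === 0;
}
```

Returning the list of problems rather than a bare boolean makes a rejected record explainable in logs or review tooling.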

Canonical writes are provenance-gated. Provider workers do not write canonical patent tables directly; they write through the ingestion unit-of-work boundary with a ProvenanceValidatedContext.
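
One way to picture the provenance gate is a unit of work that cannot be constructed without a validated context. The class name `ProvenanceValidatedContext` comes from the contract; its fields, the `validate` factory, and the `IngestionUnitOfWork` class below are hypothetical and stand in for whatever the real boundary looks like.

```typescript
// Hypothetical fields and factory. The private constructor means `validate`
// is the only way to construct an instance with `new`.
class ProvenanceValidatedContext {
  private constructor(
    readonly artifactSha256: string,
    readonly provider: string,
  ) {}

  static validate(input: { artifactSha256?: string; provider?: string }): ProvenanceValidatedContext {
    if (!input.artifactSha256 || !/^[0-9a-f]{64}$/.test(input.artifactSha256)) {
      throw new Error("invalid artifact checksum");
    }
    if (!input.provider) throw new Error("missing provider");
    return new ProvenanceValidatedContext(input.artifactSha256, input.provider);
  }
}

// Hypothetical unit of work: provider workers queue upserts here and never
// touch the canonical table (modelled as a Map) directly.
class IngestionUnitOfWork {
  private readonly pending: string[] = [];

  constructor(
    private readonly ctx: ProvenanceValidatedContext,
    private readonly canonicalTable: Map<string, string>,
  ) {}

  upsert(canonicalPatentId: string): void {
    this.pending.push(canonicalPatentId);
  }

  // Applies all queued upserts, tagging each row with the source artifact.
  commit(): number {
    for (const id of this.pending) {
      this.canonicalTable.set(id, this.ctx.artifactSha256);
    }
    const written = this.pending.length;
    this.pending.length = 0;
    return written;
  }
}
```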

Raw artifacts are immutable and content-addressed by SHA-256. API or bulk fetch results must be persisted as raw artifacts before parsing, normalization, canonical upsert, checkpoint advancement, retry resolution, or evidence use.
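
A minimal sketch of content addressing with persist-before-parse ordering, assuming Node.js and an in-memory map in place of the real artifact store. `RawArtifactStore` and `persistThenParse` are illustrative names, not part of the documented API.

```typescript
import { createHash } from "node:crypto";

// In-memory stand-in for the raw artifact store (hypothetical; the real
// persistence layer is not described here).
class RawArtifactStore {
  private readonly blobs = new Map<string, Buffer>();

  // Stores bytes under their SHA-256 digest and returns that digest.
  // Re-storing identical bytes is a no-op, so existing entries never change.
  put(bytes: Buffer): string {
    const digest = createHash("sha256").update(bytes).digest("hex");
    if (!this.blobs.has(digest)) {
      this.blobs.set(digest, Buffer.from(bytes)); // defensive copy
    }
    return digest;
  }

  // Returns a copy so callers cannot mutate the stored artifact.
  get(digest: string): Buffer | undefined {
    const stored = this.blobs.get(digest);
    return stored ? Buffer.from(stored) : undefined;
  }

  get size(): number {
    return this.blobs.size;
  }
}

// Persist first, then parse from the stored bytes rather than the fetch result.
function persistThenParse<T>(
  store: RawArtifactStore,
  fetched: Buffer,
  parse: (raw: Buffer) => T,
): { artifactId: string; parsed: T } {
  const artifactId = store.put(fetched);
  const raw = store.get(artifactId);
  if (!raw) throw new Error(`artifact ${artifactId} was not persisted`);
  return { artifactId, parsed: parse(raw) };
}
```

Parsing from the stored copy means a later replay reads exactly the bytes the first run saw.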

Runtime Defaults

| Setting | Default | Contract |
| --- | --- | --- |
| `INGESTION_SQLITE_DURABILITY_MODE` | `strict` | Production default. Invalid values fail closed. |
| `INGESTION_MAX_CONCURRENT_ARTIFACTS` | `2` | Bounded USPTO artifact fetch/preparation concurrency. |
| `INGESTION_CANONICAL_LANES` | `1` | Conservative default. Canonical processing stays single-lane unless explicitly overridden. |
| `INGESTION_CHUNK_SIZE` | `1000` | Streaming parse/normalize chunk size. |
| `INGESTION_USPTO_ARTIFACTS_PER_SECOND` | `1` | External USPTO artifact fetch/preparation rate limit. |

The `benchmark` SQLite durability mode is unsafe and intended only for profiling. It is blocked when `NODE_ENV=production`.
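
The defaults and guards above could be loaded along these lines. This is a sketch under assumptions: the `loadIngestionConfig` function, the `IngestionConfig` shape, and the rule that every numeric setting must be a positive integer are illustrative. Only the variable names, defaults, mode names, and the production block come from this document.

```typescript
const DURABILITY_MODES = ["strict", "balanced", "benchmark"] as const;
type DurabilityMode = (typeof DURABILITY_MODES)[number];

interface IngestionConfig {
  durabilityMode: DurabilityMode;
  maxConcurrentArtifacts: number;
  canonicalLanes: number;
  chunkSize: number;
  usptoArtifactsPerSecond: number;
}

type Env = Record<string, string | undefined>;

function isDurabilityMode(value: string): value is DurabilityMode {
  return (DURABILITY_MODES as readonly string[]).includes(value);
}

// Assumption: numeric settings must be positive integers; anything else throws.
function positiveInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined) return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 1) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}

function loadIngestionConfig(env: Env): IngestionConfig {
  const mode = env["INGESTION_SQLITE_DURABILITY_MODE"] ?? "strict";
  if (!isDurabilityMode(mode)) {
    // Fail closed: never fall back silently to a weaker or default mode.
    throw new Error(`invalid INGESTION_SQLITE_DURABILITY_MODE "${mode}"`);
  }
  if (mode === "benchmark" && env["NODE_ENV"] === "production") {
    throw new Error("benchmark durability mode is blocked in production");
  }
  return {
    durabilityMode: mode,
    maxConcurrentArtifacts: positiveInt(env, "INGESTION_MAX_CONCURRENT_ARTIFACTS", 2),
    canonicalLanes: positiveInt(env, "INGESTION_CANONICAL_LANES", 1),
    chunkSize: positiveInt(env, "INGESTION_CHUNK_SIZE", 1000),
    usptoArtifactsPerSecond: positiveInt(env, "INGESTION_USPTO_ARTIFACTS_PER_SECOND", 1),
  };
}
```

In a Node.js process this would typically be called once at startup as `loadIngestionConfig(process.env)`.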

Determinism And Replay

Replay determinism is part of the v0.1 contract. USPTO bulk ingestion supports deterministic replay from stored raw artifacts and streaming replay hashing for large artifacts.
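
Streaming replay hashing can be illustrated by feeding records into an incremental hash one chunk at a time, so a large artifact never needs to be held in memory to compare runs. The sketch assumes the replay hash is SHA-256 over a fixed serialization of normalized records; the document does not specify what is actually hashed, and `NormalizedRecord`, `replayHash`, and `inChunks` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// Hypothetical normalized record: just the two canonical identifiers.
interface NormalizedRecord {
  canonicalPatentId: string;
  canonicalFamilyId: string;
}

// Hashes records incrementally. Fixed field order and explicit delimiters keep
// the byte stream identical across runs and independent of chunk boundaries.
function replayHash(chunks: Iterable<NormalizedRecord[]>): string {
  const hash = createHash("sha256");
  for (const chunk of chunks) {
    for (const record of chunk) {
      hash.update(`${record.canonicalPatentId}\t${record.canonicalFamilyId}\n`);
    }
  }
  return hash.digest("hex");
}

// Splits an array into chunks of at most `size` items.
function* inChunks<T>(items: T[], size: number): Generator<T[]> {
  for (let i = 0; i < items.length; i += size) {
    yield items.slice(i, i + size);
  }
}
```

Because the digest depends only on record content and order, the same records hashed with different chunk sizes produce the same value.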

Canonical identity remains deterministic across repeated runs:

- `canonical_patent_id` is derived from normalized country code, document number, and kind code.
- Unknown family identifiers are scoped as `family:unknown:{canonical_patent_id}` to avoid merging unrelated patents.
- `canonical_family_id` is deterministic for repeated parses of the same artifact.
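
The identity rules above can be sketched as two pure functions. Only the `family:unknown:{canonical_patent_id}` scoping comes from the contract. The normalization steps, the `-` separator in the patent ID, and the `family:` prefix for known families are assumptions chosen for illustration.

```typescript
interface PatentKey {
  countryCode: string;
  documentNumber: string;
  kindCode: string;
}

// Assumed normalization: trim, uppercase, and strip punctuation from the
// document number. The real rules may differ.
function canonicalPatentId(key: PatentKey): string {
  const country = key.countryCode.trim().toUpperCase();
  const docNumber = key.documentNumber.replace(/[^0-9A-Za-z]/g, "").toUpperCase();
  const kind = key.kindCode.trim().toUpperCase();
  if (!country || !docNumber || !kind) {
    throw new Error("country code, document number, and kind code are all required");
  }
  return `${country}-${docNumber}-${kind}`; // separator is an assumption
}

// A missing family identifier is scoped to the patent itself, so two patents
// with unknown families can never collapse into one family.
function canonicalFamilyId(patentId: string, providerFamilyId?: string): string {
  const family = providerFamilyId?.trim();
  return family ? `family:${family}` : `family:unknown:${patentId}`;
}
```

Both functions depend only on their inputs, which is what makes repeated parses of the same artifact yield the same identifiers.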

Retry And Checkpoint Semantics

`provider_checkpoint` is the ingestion completion ledger. A checkpoint means canonical ingestion completed successfully for that provider, dataset, and checkpoint key.

`failed_artifact` is a retry work queue, not a completion ledger.

On canonical failure:

- the raw artifact remains stored
- the checkpoint is marked failed
- a failed artifact row is enqueued
- no partial canonical commit is allowed for that artifact

On retry success:

- stored raw bytes are reused
- the failed artifact is resolved
- the checkpoint advances to succeeded
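
The failure and retry transitions above can be modelled with in-memory maps. This is a sketch only: `IngestionLedger` and its fields are hypothetical stand-ins for the `provider_checkpoint` and `failed_artifact` tables, and a single key replaces the provider, dataset, and checkpoint key triple.

```typescript
type CheckpointStatus = "failed" | "succeeded";
type Parser = (raw: Uint8Array) => string[];

class IngestionLedger {
  readonly rawArtifacts = new Map<string, Uint8Array>();
  readonly checkpoints = new Map<string, CheckpointStatus>();
  readonly failedArtifacts = new Map<string, { resolved: boolean }>();
  readonly canonical = new Map<string, string[]>();

  // Raw bytes are stored once and never overwritten.
  storeRaw(key: string, bytes: Uint8Array): void {
    if (!this.rawArtifacts.has(key)) this.rawArtifacts.set(key, bytes.slice());
  }

  // Runs canonical ingestion from the stored bytes. Used for both the first
  // attempt and any retry.
  ingest(key: string, parse: Parser): boolean {
    const raw = this.rawArtifacts.get(key);
    if (!raw) throw new Error(`no stored raw artifact for ${key}`);

    let records: string[];
    try {
      // Parse everything before touching canonical state, so a failure
      // cannot leave a partial commit behind.
      records = parse(raw);
    } catch {
      this.checkpoints.set(key, "failed");
      this.failedArtifacts.set(key, { resolved: false });
      return false;
    }

    this.canonical.set(key, records);
    this.checkpoints.set(key, "succeeded");
    const failed = this.failedArtifacts.get(key);
    if (failed) failed.resolved = true;
    return true;
  }
}
```

A retry is just a second `ingest` call for the same key: it reads the bytes already in `rawArtifacts` instead of fetching again.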

Operational Drill Guidance

Run manual operational drills on `/dev/shm` or another known-stable local filesystem when the host shows journal stalls on repo artifact paths or `/tmp`. On this host, earlier D-state (uninterruptible sleep) failures were symptoms of host I/O, not failures of the ingestion proof.

Successful smoke drill evidence on /dev/shm recorded:

- 20 raw artifacts
- 20000 patent documents
- 4000 patent families
- 20 succeeded checkpoints
- 0 failed artifacts
- `max_concurrent_artifacts=2`
- `uspto_artifacts_per_second=1`

Benchmark Receipts

Durability readiness, 100k records:

| mode | records/sec | total sqlite ms | replay stable | documents | families |
| --- | --- | --- | --- | --- | --- |
| `strict` | `7161.45` | `9895.27` | yes | `100000` | `20000` |
| `balanced` | `5997.73` | `13222.89` | yes | `100000` | `20000` |
| `benchmark` | `11658.4` | `6429.46` | yes | `100000` | `20000` |

Decision: keep `strict` as the default. `balanced` was slower than `strict` on the measured USPTO workload. `benchmark` is the fastest mode but unsafe.

USPTO artifact preparation concurrency, 20 artifacts x 5000 records:

| `max_concurrent_artifacts` | records/sec | peak RSS | documents | families | failed artifacts |
| --- | --- | --- | --- | --- | --- |
| `1` | `29.15` | `243.48 MiB` | `100000` | `20000` | `0` |
| `2` | `45.27` | `272.26 MiB` | `100000` | `20000` | `0` |
| `4` | `34.45` | `269.71 MiB` | `100000` | `20000` | `0` |

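Decision: default `INGESTION_MAX_CONCURRENT_ARTIFACTS=2`. Two concurrent artifacts gave the highest measured throughput (45.27 records/sec, against 29.15 at 1 and 34.45 at 4).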

USPTO canonical lane benchmark on /dev/shm, 20 artifacts x 5000 records:

| `canonical_lanes` | records/sec | peak RSS | replay hash vs lane 1 | documents | families | failed artifacts |
| --- | --- | --- | --- | --- | --- | --- |
| `1` | `5160.91` | `282.95 MiB` | match | `100000` | `20000` | `0` |
| `2` | `5164.61` | `280.12 MiB` | match | `100000` | `20000` | `0` |
| `4` | `5157.41` | `320.89 MiB` | match | `100000` | `20000` | `0` |

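Decision: keep `INGESTION_CANONICAL_LANES=1`. Throughput is flat across 1, 2, and 4 lanes (about 5160 records/sec in each case), and 4 lanes raises peak RSS from 282.95 MiB to 320.89 MiB.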

Validation

Release checkpoint validation:

```bash
npm run build
npm test -- --group ingestion
```

The ingestion group covers provenance gates, provider import boundaries, non-USPTO inert stubs, USPTO bulk ingestion, replay determinism, retry behavior, durability mode guards, runtime config validation, operational drill behavior, and benchmark CLI surfaces.

Explicit Non-Goals

Ingestion v0.1 does not include:

- EPO ingestion
- KIPRISPlus ingestion
- CNIPA ingestion
- customer export
- search index writes