Docs · PatentChecker
PatentChecker docs for evaluation, verification, and delivery risk
Use this lane for buyer evaluation, evidence verification, self-hosting, adapters, coverage scope, and the delivery-system surfaces behind vector and sequence IP review.
PatentChecker ingestion v0.1
This document freezes the proven PatentChecker ingestion v0.1 contract before any provider expansion or deeper persistence work.
Implemented Scope
USPTO ODP bulk ingestion is the only implemented provider path in ingestion v0.1.
EPO OPS, KIPRISPlus, and CNIPA remain registered as typechecked inert stubs. They must not fetch, parse, ingest, or mutate canonical tables in v0.1.
Core Contract
A patent record is not eligible for redistribution, evidence packet generation, or customer export unless its source artifact, checksum, provider, retrieval timestamp, parser version, and license status are present and valid.
Canonical writes are provenance-gated. Provider workers do not write canonical patent tables directly; they write through the ingestion unit-of-work boundary with a `ProvenanceValidatedContext`.
Raw artifacts are immutable and content-addressed by SHA-256. API or bulk fetch results must be persisted as raw artifacts before parsing, normalization, canonical upsert, checkpoint advancement, retry resolution, or evidence use.
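The eligibility rule and the content-addressing rule above can be illustrated with a minimal sketch. The `Provenance` field names, the `provenanceGaps` helper, and the idea that only a `"cleared"` license status counts as valid are assumptions made for illustration; they are not the actual PatentChecker schema or API.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape: field names are illustrative, not the real schema.
interface Provenance {
  sourceArtifactId?: string;
  checksumSha256?: string;
  provider?: string;
  retrievedAt?: string; // ISO 8601 timestamp
  parserVersion?: string;
  licenseStatus?: "cleared" | "restricted" | "unknown";
}

// Content address of a raw artifact: the hex SHA-256 of its exact bytes.
function contentAddress(rawBytes: Uint8Array): string {
  return createHash("sha256").update(rawBytes).digest("hex");
}

// Returns the names of missing or invalid provenance fields.
// An empty list means the record passes this sketch of the gate.
function provenanceGaps(p: Provenance, rawBytes: Uint8Array): string[] {
  const gaps: string[] = [];
  if (!p.sourceArtifactId) gaps.push("sourceArtifactId");
  if (!p.provider) gaps.push("provider");
  if (!p.parserVersion) gaps.push("parserVersion");
  if (!p.retrievedAt || isNaN(Date.parse(p.retrievedAt))) gaps.push("retrievedAt");
  // The stored checksum must match the bytes it claims to describe.
  if (p.checksumSha256 !== contentAddress(rawBytes)) gaps.push("checksumSha256");
  // Assumption: only an explicitly cleared license is treated as valid.
  if (p.licenseStatus !== "cleared") gaps.push("licenseStatus");
  return gaps;
}
```

Returning the list of gaps rather than a boolean keeps the gate auditable: a rejected record can report exactly which provenance field blocked it.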
Runtime Defaults
| Setting | Default | Contract |
|---|---|---|
| INGESTION_SQLITE_DURABILITY_MODE | strict | Production default. Invalid values fail closed. |
| INGESTION_MAX_CONCURRENT_ARTIFACTS | 2 | Bounded USPTO artifact fetch/preparation concurrency. |
| INGESTION_CANONICAL_LANES | 1 | Conservative default. Canonical processing stays single-lane unless explicitly overridden. |
| INGESTION_CHUNK_SIZE | 1000 | Streaming parse/normalize chunk size. |
| INGESTION_USPTO_ARTIFACTS_PER_SECOND | 1 | External USPTO artifact fetch/preparation rate limit. |
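As a rough sketch, the defaults in this table and the fail-closed rule could be applied by a loader like the one below. The function and type names are hypothetical, and treating every numeric setting as a positive integer is an assumption; the mode names `strict`, `balanced`, and `benchmark` are the ones used elsewhere in this document, and the production block on `benchmark` is the one this section describes.

```typescript
const DURABILITY_MODES = ["strict", "balanced", "benchmark"] as const;
type DurabilityMode = (typeof DURABILITY_MODES)[number];

// Hypothetical config shape for illustration only.
interface IngestionConfig {
  durabilityMode: DurabilityMode;
  maxConcurrentArtifacts: number;
  canonicalLanes: number;
  chunkSize: number;
  usptoArtifactsPerSecond: number;
}

type Env = Record<string, string | undefined>;

function isDurabilityMode(value: string): value is DurabilityMode {
  return (DURABILITY_MODES as readonly string[]).indexOf(value) !== -1;
}

// Assumption: numeric settings must be positive integers; anything else fails closed.
function positiveInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  const n = Number(raw);
  if (!isFinite(n) || Math.floor(n) !== n || n < 1) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return n;
}

function loadIngestionConfig(env: Env): IngestionConfig {
  const mode = env["INGESTION_SQLITE_DURABILITY_MODE"] ?? "strict";
  // Fail closed: an unrecognized mode is an error, never a silent fallback.
  if (!isDurabilityMode(mode)) {
    throw new Error(`invalid INGESTION_SQLITE_DURABILITY_MODE "${mode}"`);
  }
  if (mode === "benchmark" && env["NODE_ENV"] === "production") {
    throw new Error("benchmark durability mode is blocked in production");
  }
  return {
    durabilityMode: mode,
    maxConcurrentArtifacts: positiveInt(env, "INGESTION_MAX_CONCURRENT_ARTIFACTS", 2),
    canonicalLanes: positiveInt(env, "INGESTION_CANONICAL_LANES", 1),
    chunkSize: positiveInt(env, "INGESTION_CHUNK_SIZE", 1000),
    usptoArtifactsPerSecond: positiveInt(env, "INGESTION_USPTO_ARTIFACTS_PER_SECOND", 1),
  };
}
```

In a Node.js process this would typically be called once at startup as `loadIngestionConfig(process.env)`, so that a bad value stops the worker before any artifact is fetched.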
`benchmark` sqlite durability mode is unsafe and intended only for profiling. It is blocked when `NODE_ENV=production`.
Determinism And Replay
Replay determinism is part of the v0.1 contract. USPTO bulk ingestion supports deterministic replay from stored raw artifacts and streaming replay hashing for large artifacts.
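One way to picture streaming replay hashing is an incremental digest that is fed one chunk at a time and still matches the digest of the whole artifact. The sketch below assumes SHA-256 as the replay hash and uses a hypothetical function name; the source does not state which algorithm the replay hash actually uses.

```typescript
import { createHash } from "node:crypto";

// Hash an artifact chunk by chunk so a large payload never has to be held in
// memory as a single buffer. Assumption: SHA-256 is the replay hash.
function streamingReplayHash(chunks: readonly Uint8Array[]): string {
  const hash = createHash("sha256");
  for (const chunk of chunks) {
    hash.update(chunk);
  }
  return hash.digest("hex");
}

// The digest depends only on the concatenated bytes, not on chunk boundaries.
const whole = streamingReplayHash([Buffer.from("abc")]);
const split = streamingReplayHash([Buffer.from("a"), Buffer.from("bc")]);
console.log(whole === split);
```

Because chunk boundaries do not affect the result, a replay can read the stored raw artifact with any buffer size and still compare its hash against the one recorded during the original run.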
Canonical identity remains deterministic across repeated runs:
- `canonical_patent_id` is derived from normalized country code, document number, and kind code.
- Unknown family identifiers are scoped as `family:unknown:{canonical_patent_id}` to avoid merging unrelated patents.
- `canonical_family_id` is deterministic for repeated parses of the same artifact.
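The identity rules above might look like the following sketch. Only the `family:unknown:{canonical_patent_id}` scoping comes from this document; the `CC-NUMBER-KIND` layout, the normalization steps, the `family:{id}` form for known families, and both function names are hypothetical choices made for illustration.

```typescript
// Assumption: the "CC-NUMBER-KIND" layout and these normalization rules are
// illustrative; the real derivation is not specified in this document.
function canonicalPatentId(country: string, docNumber: string, kind: string): string {
  const cc = country.trim().toUpperCase();
  // Drop separators such as commas, slashes, and spaces from the number.
  const num = docNumber.replace(/[^0-9A-Za-z]/g, "").toUpperCase();
  const kc = kind.trim().toUpperCase();
  return `${cc}-${num}-${kc}`;
}

// Unknown families are scoped to the patent itself, so two patents with no
// family information can never collapse into one shared "unknown" family.
function canonicalFamilyId(patentId: string, familyId?: string): string {
  const known = familyId?.trim();
  return known ? `family:${known}` : `family:unknown:${patentId}`;
}

const id = canonicalPatentId(" us ", "10,123,456", "b2");
console.log(id); // US-10123456-B2
console.log(canonicalFamilyId(id)); // family:unknown:US-10123456-B2
```

Both functions are pure: the same inputs always yield the same identifiers, which is the property repeated parses of one artifact rely on.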
Retry And Checkpoint Semantics
`provider_checkpoint` is the ingestion completion ledger. A succeeded checkpoint means canonical ingestion completed successfully for that provider, dataset, and checkpoint key.
`failed_artifact` is a retry work queue, not a completion ledger.
On canonical failure:
- the raw artifact remains stored
- the checkpoint is marked failed
- a failed artifact row is enqueued
- no partial canonical commit is allowed for that artifact
On retry success:
- stored raw bytes are reused
- the failed artifact is resolved
- the checkpoint advances to succeeded
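The failure and retry rules above can be modeled with a small in-memory sketch. The `IngestionLedger` class and its members are hypothetical stand-ins for the `provider_checkpoint` and `failed_artifact` tables; the real implementation persists these in SQLite behind the unit-of-work boundary.

```typescript
type CheckpointStatus = "failed" | "succeeded";

// Hypothetical in-memory model of the raw store, checkpoint ledger, and retry queue.
class IngestionLedger {
  readonly rawArtifacts: Record<string, Uint8Array> = {};
  readonly checkpoints: Record<string, CheckpointStatus> = {};
  readonly failedArtifacts: Record<string, { resolved: boolean }> = {};
  canonicalRows: string[] = [];

  // Raw bytes are stored before any canonical work is attempted.
  storeRaw(key: string, bytes: Uint8Array): void {
    if (!(key in this.rawArtifacts)) this.rawArtifacts[key] = bytes;
  }

  // Runs canonical ingestion from stored bytes. Used for first attempts and retries.
  ingest(key: string, parse: (bytes: Uint8Array) => string[]): boolean {
    if (!(key in this.rawArtifacts)) throw new Error(`no raw artifact stored for ${key}`);
    try {
      // Parse everything first; rows are committed only if parsing fully succeeds.
      const rows = parse(this.rawArtifacts[key]);
      this.canonicalRows = this.canonicalRows.concat(rows);
      this.checkpoints[key] = "succeeded";
      if (key in this.failedArtifacts) this.failedArtifacts[key].resolved = true;
      return true;
    } catch {
      // No partial commit: canonicalRows is untouched, the raw artifact stays stored.
      this.checkpoints[key] = "failed";
      this.failedArtifacts[key] = { resolved: false };
      return false;
    }
  }
}
```

A retry is simply a second `ingest` call for the same key: it reads the bytes already in `rawArtifacts`, so no new fetch is needed, and a success flips both the failed artifact row and the checkpoint.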
Operational Drill Guidance
Manual operational drills should run on `/dev/shm` or another known-stable local filesystem on hosts where repo artifact paths or `/tmp` show journal stalls. On this host, prior D state failures were host I/O symptoms, not ingestion proof failures.
Successful smoke drill evidence on `/dev/shm` recorded:
- 20 raw artifacts
- 20000 patent documents
- 4000 patent families
- 20 succeeded checkpoints
- 0 failed artifacts
- `max_concurrent_artifacts=2`
- `uspto_artifacts_per_second=1`
Benchmark Receipts
Durability readiness, 100k records:
| mode | records/sec | total sqlite ms | replay stable | documents | families |
|---|---|---|---|---|---|
| strict | 7161.45 | 9895.27 | yes | 100000 | 20000 |
| balanced | 5997.73 | 13222.89 | yes | 100000 | 20000 |
| benchmark | 11658.4 | 6429.46 | yes | 100000 | 20000 |
Decision: keep `strict` as default. `balanced` was slower than `strict` on the measured USPTO workload. `benchmark` is fastest but unsafe.
USPTO artifact preparation concurrency, 20 artifacts x 5000 records:
| max_concurrent_artifacts | records/sec | peak RSS | documents | families | failed artifacts |
|---|---|---|---|---|---|
| 1 | 29.15 | 243.48 MiB | 100000 | 20000 | 0 |
| 2 | 45.27 | 272.26 MiB | 100000 | 20000 | 0 |
| 4 | 34.45 | 269.71 MiB | 100000 | 20000 | 0 |
Decision: default `INGESTION_MAX_CONCURRENT_ARTIFACTS=2`.
USPTO canonical lane benchmark on `/dev/shm`, 20 artifacts x 5000 records:
| canonical_lanes | records/sec | peak RSS | replay hash vs lane 1 | documents | families | failed artifacts |
|---|---|---|---|---|---|---|
| 1 | 5160.91 | 282.95 MiB | match | 100000 | 20000 | 0 |
| 2 | 5164.61 | 280.12 MiB | match | 100000 | 20000 | 0 |
| 4 | 5157.41 | 320.89 MiB | match | 100000 | 20000 | 0 |
Decision: keep `INGESTION_CANONICAL_LANES=1`. Throughput is flat across lane counts, and 4 lanes increase peak RSS.
Validation
Release checkpoint validation:
```bash
npm run build
npm test -- --group ingestion
```
The ingestion group covers provenance gates, provider import boundaries, non-USPTO inert stubs, USPTO bulk ingestion, replay determinism, retry behavior, durability mode guards, runtime config validation, operational drill behavior, and benchmark CLI surfaces.
Explicit Non-Goals
Ingestion v0.1 does not include EPO ingestion.
Ingestion v0.1 does not include KIPRISPlus ingestion.
Ingestion v0.1 does not include CNIPA ingestion.
Ingestion v0.1 does not include customer export.
Ingestion v0.1 does not include search index writes.