Runs, Pipeline, and Health
Queueing, fetch/render monitoring, live run visibility, replay/rematerialization, and system health live in one operational loop.
One route map for collection, scrape operations, normalization, pipeline replay, health, versions, and vocabulary ownership. Legacy extract/trainer/errors/admin deep links now open the in-console Normalization workspace.
One deterministic control surface for crawl operations, semantic mapping, preview, versioning, replay, and audit-safe change management.
Queueing, fetch/render monitoring, live run visibility, replay/rematerialization, and system health live in one operational loop.
Pattern detection, candidate review, schema rules, preview, validate, publish, rollback, and provenance stay deterministic and inspectable.
Immutable versions, signed audit verification, trace-aware workflows, and remediation controls keep the lane accountable.
Use this rail when onboarding or repairing a site: sample deterministic snapshots first, inspect pattern reasons, verify candidate evidence, map to Tag Studio-owned vocabulary, preview, review normalized profile output, publish, replay, and monitor.
Mapping decisions should link back to Tag Studio instead of becoming hidden scraper-only vocabulary.
Canonical runtime/operator surface for scrape launches, live monitoring, per-site health, diagnostics, and queue/replay handoff. Snapshot browsing now hands off to Snapshot Center.
No queue URLs yet. Add one or import them below.
Paste/import is only a convenience bridge into the structured queue URL rows above.
Structured queue URLs are the primary owner path here. Paste/import remains available below as a secondary bridge.
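The paste/import bridge pattern recurs across these cards; the idea is that pasted text only feeds the structured rows, which remain the owner path. A minimal sketch, assuming rows are plain strings and the bridge drops blanks and duplicates without reordering existing rows:

```python
from typing import List

def import_bridge(raw_text: str, structured_rows: List[str]) -> List[str]:
    """Fold pasted text into the structured rows, which stay the owner path.
    Blank lines and duplicates are dropped; existing rows keep their order."""
    rows = list(structured_rows)
    for line in raw_text.splitlines():
        candidate = line.strip()
        if candidate and candidate not in rows:
            rows.append(candidate)
    return rows
```

The bridge never becomes the source of truth: its only output is an updated structured-row list.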
No dead-letter payload fields yet. Add one or import JSON below.
Paste/import remains a convenience bridge into the structured dead-letter payload rows above.
Structured dead-letter payload rows are the primary owner path here. Paste/import remains available below as a secondary bridge.
Lease and batch-process controls stay site-scoped so worker follow-through remains RBAC-safe.
Use the toolbar Trace ID to correlate run, queue, replay, audit, and health activity across the control center.
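Trace-ID correlation can be pictured as grouping events from every subsystem under one identifier. A hypothetical sketch (the event shape and subsystem names are assumptions, not a real API):

```python
from typing import Iterable

def correlate_by_trace(events: Iterable[dict], trace_id: str) -> dict:
    """Group matching events by subsystem so one Trace ID ties run, queue,
    replay, audit, and health activity together."""
    grouped = {}
    for event in events:
        if event.get("trace_id") == trace_id:
            grouped.setdefault(event.get("subsystem", "unknown"), []).append(event)
    return grouped

# Illustrative event stream; field names are assumed for this sketch.
events = [
    {"trace_id": "t-1", "subsystem": "run", "msg": "started"},
    {"trace_id": "t-1", "subsystem": "queue", "msg": "enqueued"},
    {"trace_id": "t-2", "subsystem": "audit", "msg": "signed"},
]
by_trace = correlate_by_trace(events, "t-1")
```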
Pattern discovery and candidate evidence now live under Normalization; replay, audit, and remediation controls live under Pipeline.

No requested sites yet. Add one or import them below.
Paste/import is only a convenience bridge into the structured site rows above.
Structured site rows are the primary owner path here. Paste/import remains available below as a secondary bridge.
No payload rows yet. Add one or import JSON below.
Paste/import is only a convenience bridge into the structured payload rows above.
Structured payload rows are the primary owner path here. Paste/import remains available below as a secondary bridge.
Inline payloads only. No server-side file paths or uploads. Preview shows deterministic site grouping and skipped payload counts; queue mode hands resolved URLs into the deterministic crawl queue; ingest mode writes snapshots and listings directly.
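The deterministic site grouping and skipped-payload counting described above can be sketched as follows. The row shape (`{"site": ..., "url": ...}`) is an illustrative assumption; sorting sites and URLs keeps the preview stable across reruns:

```python
def preview_payloads(rows):
    """Deterministically group inline payload rows by site; count skipped rows
    (rows missing a site or a URL)."""
    grouped, skipped = {}, 0
    for row in rows:
        site, url = row.get("site"), row.get("url")
        if not site or not url:
            skipped += 1
            continue
        grouped.setdefault(site, []).append(url)
    # Sorted output makes the preview deterministic regardless of input order.
    return {s: sorted(urls) for s, urls in sorted(grouped.items())}, skipped
```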
Waiting for scraper log output...
| Site | Runs | Avg Results | Error Rate | Last Run | Status | Detail |
|---|---|---|---|---|---|---|
| Load runtime monitoring to populate site health. | ||||||
| Started | Duration | Sites | Mode | Results | Status | Detail |
|---|---|---|---|---|---|---|
| Load runtime monitoring to populate recent runs. | ||||||
If no site telemetry is available yet, this panel explains whether the gap comes from no completed run, runtime restart, or missing site attribution in stored history.
Load runs to populate recent activity, status mix, and trace-aware summary details. Raw JSON remains available underneath for diagnostics.
Site records, rollout settings, rate-limit defaults, and secret references for scraper workers.
Load a site from the summary table or enter a Site ID to edit an existing record.
Upsert credential refs here, then confirm them through canonical readback.
Load refs for a site to manage row-level secret refs here.
Load sites or secret refs to inspect field-driven inventory summaries here. Raw JSON remains available underneath for diagnostics.
Load sites or policy payloads to inspect Compliance / Crawl Policy controls here. Challenge detection, access resilience, and network routing remain human-approved policy workflows.
Deterministic pattern discovery, sample-page browsing, and candidate-evidence drill-in for site-specific extraction workflows.
No discovery URLs yet. Add one or import them below.
Paste/import is only a convenience bridge into the structured discovery URL rows above.
Structured discovery URLs are the primary owner path here. Paste/import remains available below as a secondary bridge.
No candidate page IDs yet. Add one or import them below.
Paste/import is only a convenience bridge into the structured page-ID rows above. Candidate loads use the first valid structured page ID.
Structured page IDs are the primary owner path here. Candidate loads use the first valid row.
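"First valid structured page ID" can be made concrete with a small sketch. Here "valid" is assumed to mean non-empty after trimming; real validation may be stricter:

```python
def first_valid_page_id(rows):
    """Candidate loads use the first valid structured page-ID row."""
    for row in rows:
        page_id = row.strip()
        if page_id:
            return page_id
    return None  # no valid row: nothing to load
```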
Pattern discovery output should expose suggested pattern drafts plus the reasons an operator needs to trust or reject them: structured-data hints, repeated-region signatures, content-density signals, route/page-class clues, and forbidden selector candidates.
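One hypothetical shape for a pattern draft plus its trust reasons, with a simple review heuristic (the two-reason threshold is an illustrative assumption, not a product rule):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PatternDraft:
    """A suggested pattern and the evidence an operator uses to judge it."""
    selector: str
    reasons: List[str] = field(default_factory=list)  # e.g. "repeated-region signature"
    forbidden: bool = False  # matched a forbidden-selector candidate

def operator_can_trust(draft, min_reasons=2):
    """Never trust a forbidden selector; otherwise require at least
    min_reasons independent signals before accepting the draft."""
    return not draft.forbidden and len(draft.reasons) >= min_reasons
```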
Load patterns or candidates to inspect field-driven extract summaries here. Raw JSON remains available underneath for diagnostics.
Inspect extracted fields with page-level evidence before you bind rules or publish a deterministic mapping version. This uses the first valid structured page-ID row from the Sample page browser card.
Before binding, confirm each candidate has inspectable evidence and a clear owner path.
Bulk/manual rule authoring, transforms, and deterministic binding thresholds in one mapping workbench.
Add structured bindings here. Leave the list empty only when you intentionally want a threshold-only publish.
Structured bindings are the primary owner path here. Paste/import remains available below as a secondary bridge.
Paste/import is only a convenience bridge into the structured binding list above. Publishing uses the structured rows, not the raw textarea.
Add transform rows here. Publishing uses the structured list first and keeps comma import as a secondary bridge.
Structured transform rows are the primary owner path here. Paste/import remains available below as a secondary bridge.
Paste/import is only a convenience bridge into the structured transform rows above. Publishing uses the row list, not the raw input.
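Assembling a publish request from the structured rows (never the raw textarea) might look like this sketch. Field names are illustrative assumptions; the key invariant is that an empty binding list is only legal as an intentional threshold-only publish:

```python
def build_publish_payload(bindings, transforms, threshold=None):
    """Build a publish request from structured binding and transform rows."""
    if not bindings and threshold is None:
        raise ValueError("empty bindings require an explicit threshold-only publish")
    return {
        "bindings": list(bindings),
        "transforms": list(transforms),
        "threshold": threshold,
    }
```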
Publish bulk or manual rules to inspect field-driven mapping summary details here. Raw JSON remains available underneath for diagnostics.
Preview mapped output, validate deterministic rules, and keep live fetch explicitly opt-in after snapshot-first review.
No sample page IDs yet. Add one or import them below.
Paste/import is only a convenience bridge into the structured page-ID rows above.
No sample URLs yet. Add one or import them below.
Paste/import is only a convenience bridge into the structured URL rows above.
Structured preview sources are the primary owner path here. Paste/import remains available below as a secondary bridge.
Queueing now lives under Runs so preview remains a semantic verification surface instead of a second crawl launcher.
Publish only after snapshot-first preview has a usable fill rate, validation warnings are explained or resolved, dead letters are triaged, and replay freshness shows the new mapping can rematerialize recent evidence.
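The publish checklist above can be expressed as a gate. A minimal sketch; the 0.8 minimum fill rate is an illustrative assumption, not a product default:

```python
def ready_to_publish(fill_rate, unresolved_warnings, open_dead_letters,
                     replay_is_fresh, min_fill_rate=0.8):
    """Mirror the checklist: usable fill rate, warnings explained or resolved,
    dead letters triaged, and replay freshness confirmed."""
    return (fill_rate >= min_fill_rate
            and unresolved_warnings == 0
            and open_dead_letters == 0
            and replay_is_fresh)
```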
Run preview or validation to inspect field-driven preview counts, trace/run identifiers, and validation totals here. Raw JSON remains available underneath for diagnostics.
Run preview to load field-level evidence when the adapter payload provides it. If no evidence ledger appears, the adapter has not supplied field provenance yet.
Run preview or validation to inspect field-level confidence dimensions here. Raw JSON remains available underneath for diagnostics.
Work the canonical normalized profile lane here: inventory recent profiles, materialize from a listing, inspect effective values, publish structured overrides, and rollback by version without mutating raw scrape facts.
This inventory stays on the scraper-mapping owner path. Listings remains a projection surface and does not write normalized profile state directly.
Leave field keys empty to materialize all currently supported fields.
Materialization seeds or refreshes the normalized profile from the listing's canonical data without changing the raw listing payload. Paste/import is only a convenience bridge into the structured field-key list above.
Rollback is append-only. The selected target version becomes the next active override version instead of rewriting history.
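Append-only rollback means the target version's overrides are copied into a new active version while history stays untouched. A sketch under an assumed version-record shape:

```python
def rollback_to(versions, target_version):
    """Append-only rollback: the target's overrides become a new active
    version; earlier versions are never rewritten."""
    target = next(v for v in versions if v["version"] == target_version)
    new_version = {
        "version": versions[-1]["version"] + 1,
        "overrides": dict(target["overrides"]),  # copy, never alias history
        "rollback_of": target_version,
    }
    return versions + [new_version]  # the original list is not mutated
```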
Load or materialize a profile to inspect effective values, override state, source listing context, and audit-safe metadata.
Load a profile to edit override rows.
Structured scalar rows are the primary owner path here. Load a profile to begin.
Structured scalar rows are the primary owner path here. Complex nested values remain available through the advanced JSON editor below and are preserved unless you replace the same key in the structured editor.
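The preservation rule for nested values can be shown with a shallow merge: structured scalar rows win only on the keys they set, while JSON-authored nested values survive untouched. A minimal sketch:

```python
def apply_scalar_rows(existing_overrides, scalar_rows):
    """Merge structured scalar rows over the existing override document.
    Nested values from the advanced JSON editor are preserved unless the
    structured editor replaces the same key."""
    merged = dict(existing_overrides)  # shallow copy; source dict untouched
    merged.update(scalar_rows)
    return merged
```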
Overrides are the user-authored layer. They never rewrite raw snapshots, source rows, or raw scrape facts.
Load profiles to see recent normalized records for the selected site/pattern scope.
Load a profile to inspect append-only override versions, active state, authored-by metadata, and rollback targets.
Load a profile to compare effective values with their current source layer and override ownership metadata.
Immutable publish history, compare summaries, and rollback controls. Replay has its own dedicated tab.
Load or compare versions to inspect field-driven publish history and replay-safe diff context here. Raw JSON remains available underneath for diagnostics.
Replay, queue handoff, replay jobs, and downstream remediation stay correlated through the toolbar Trace ID.
No replay listing IDs yet. Add one or import them below.
Paste/import is only a convenience bridge into the structured replay listing-ID rows above.
No replay page IDs yet. Add one or import them below.
Paste/import is only a convenience bridge into the structured replay page-ID rows above.
Structured replay scope IDs are the primary owner path here. Paste/import remains available below as a secondary bridge.
Run launches, live log, snapshots, diagnostics, and site-health telemetry stay under Runs. Use Pipeline for replay/rematerialization, queue handoff, and downstream job visibility. Pre-scan policy is dynamic and follows Scan (Exclude - Always Filters) from the current runtime/Tag Studio filter contract.
Samples -> Patterns -> Candidates -> Map -> Preview -> Normalize -> Publish -> Replay -> Monitor. Use this order when onboarding or repairing a site, then confirm replay jobs and health telemetry before closing.
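The rail order above can be encoded directly, which also makes "what comes next" mechanical during onboarding or repair:

```python
ONBOARDING_RAIL = [
    "samples", "patterns", "candidates", "map", "preview",
    "normalize", "publish", "replay", "monitor",
]

def next_stage(current):
    """Return the stage after `current`, or None once monitoring closes
    the loop."""
    index = ONBOARDING_RAIL.index(current)
    return ONBOARDING_RAIL[index + 1] if index + 1 < len(ONBOARDING_RAIL) else None
```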
Pipeline operators often need quick access to audit, review, and dead-letter queues while replaying recent work.
Signed audit history remains canonical evidence, but it now sits inside Pipeline so replay/rematerialization and remediation stay together.
Load the review queue to resolve items from live row actions instead of copying IDs by hand.
Load dead letters to resolve live rows from this card instead of copying IDs by hand.
Resolve review and dead-letter items here, then confirm their canonical readback state before treating them as final.
Load replay, audit, review, or dead-letter payloads to inspect field-driven pipeline summaries here. Raw JSON remains available underneath for diagnostics.
Run replay or load replay jobs to inspect field-driven job summaries here. Raw JSON remains available underneath for diagnostics.
Run replay or load replay jobs to inspect trace artifacts here. Raw JSON remains available underneath for diagnostics.
Site health, provenance lookup, validation/dedup truth, trace correlation, and listings-level runtime visibility.
Waiting for runtime supervisor state…
Maintenance: none
Waiting for watcher data…
Monitoring filters use the toolbar Trace ID. Provenance lookup now returns provenance, validation, dedup, and latest materialization audit context for one listing.
Load site health, listings, provenance, or trace data to inspect field-driven monitoring summaries here. Raw JSON remains available underneath for diagnostics.
Load provenance or trace correlation to inspect data lineage here. Raw JSON remains available underneath for diagnostics.
Load health or SLO metrics to inspect reliability target and error-budget burn here. Raw JSON remains available underneath for diagnostics.
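Error-budget burn is conventionally the observed error rate divided by the budget the SLO allows; a burn rate above 1.0 spends budget faster than the window permits. A minimal sketch (field semantics assumed, since the payload shape is adapter-provided):

```python
def error_budget_burn(slo_target, observed_error_rate):
    """Burn rate = observed error rate / allowed error budget (1 - SLO)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("SLO target must be strictly below 1.0")
    return observed_error_rate / budget
```

For example, a 99% availability target with a 2% observed error rate burns budget at twice the sustainable rate.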
Load health or trace data to inspect payload-provided drift and anomaly signals here. Raw JSON remains available underneath for diagnostics.