Data Quality

Validation Pipeline

Readings pass through multiple validation stages:

| Stage | What it catches | Action |
| --- | --- | --- |
| Pydantic model (storage broker) | Missing required fields, wrong types | Reject with 422 |
| Geocoding check (storage broker) | Missing geo_country / geo_subdivision | Reject (counter incremented) |
| Deduplication (ingester + ClickHouse) | Duplicate readings from mesh flooding or multi-path delivery | Skip silently |
| Content-based ID (ClickHouse ReplacingMergeTree) | Same reading ingested by multiple stations | Last-write-wins dedup at query time with FINAL |
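The content-based ID stage can be illustrated with a minimal sketch. The field names (device_id, timestamp, reading_type, value, station) and the choice of SHA-256 are assumptions made for illustration, not the actual WeSense schema or hashing scheme. The idea is that the same reading arriving via two stations hashes to the same ID, so only one row survives deduplication.

```python
import hashlib
import json

# Hypothetical identity fields; the real schema may use different ones.
IDENTITY_FIELDS = ("device_id", "timestamp", "reading_type", "value")


def content_id(reading: dict) -> str:
    """Derive a deterministic ID from the fields that identify a reading."""
    identity = {key: reading[key] for key in IDENTITY_FIELDS}
    # Sorted keys and fixed separators give a stable byte representation.
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedup_last_write_wins(rows: list[dict]) -> list[dict]:
    """Keep the last row seen per content ID, mimicking last-write-wins dedup."""
    latest: dict[str, dict] = {}
    for row in rows:
        latest[content_id(row)] = row  # later rows overwrite earlier ones
    return list(latest.values())
```

Two rows that differ only in which station relayed them collapse to one; a row with a different value keeps its own ID.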

Known Data Quality Challenges

Sensor drift: Sensors degrade over time. A PM2.5 sensor may read 20% high after a year of outdoor exposure. WeSense stores calibration_status per reading (from the sensor firmware) and sensor_model to enable post-hoc correction by researchers, but does not currently apply corrections in the pipeline.

Stuck sensors: A sensor reporting the same value indefinitely (e.g., 0°C for a week) is likely malfunctioning. This is not currently detected automatically.
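As a rough illustration of what a stuck-sensor check could look like (this is not part of the current pipeline), the sketch below flags a sensor whose most recent values have stayed within a tolerance for at least a minimum duration. The function name, the 24-hour default, and the tolerance parameter are all assumptions.

```python
from datetime import datetime, timedelta


def is_stuck(samples, min_duration=timedelta(hours=24), tolerance=0.0):
    """Return True if the latest values have not changed for `min_duration`.

    samples: list of (datetime, float) tuples sorted by time, oldest first.
    """
    if not samples:
        return False
    last_time, last_value = samples[-1]
    run_start = last_time
    # Walk backwards until a value differs from the latest one.
    for ts, value in reversed(samples):
        if abs(value - last_value) > tolerance:
            break
        run_start = ts
    return last_time - run_start >= min_duration
```

A real check would likely need per-measurement tolerances, since some quantities (indoor temperature, for example) can legitimately stay flat for long periods.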

GPS glitches: A sensor may report coordinates in the ocean or on the wrong continent. The geocoder will still assign a country/subdivision, but the result will be wrong. This is not currently detected automatically.
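One possible heuristic for fixed sensors, sketched here under stated assumptions and not implemented in the pipeline, is to compare each reported position with the previous one and flag implausibly large jumps. The 50 km threshold and the function names are hypothetical; the distance uses the standard haversine formula with a mean Earth radius of 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_position_jump(prev, curr, max_km=50.0):
    """Flag a reading whose position moved more than `max_km` from the previous one.

    prev, curr: (latitude, longitude) tuples in degrees.
    """
    return haversine_km(*prev, *curr) > max_km
```

This only catches sudden jumps. A sensor that has always reported the wrong location would need a different check, such as comparing against a registered site location.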

Bad actors: A malicious ingester could sign and submit fabricated readings. The Ed25519 trust model means every reading is traceable to its signing ingester. Revoking an ingester's key in the trust list lets consumers exclude all of its readings retroactively, including those in already-archived Parquet files (the trust snapshot records revocation status).
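On the consumer side, excluding a revoked ingester could look roughly like the sketch below. The shape of the trust snapshot (a mapping from ingester ID to an entry with a revoked boolean) and the ingester_id field on each reading are assumptions for illustration; the actual trust list format may differ, and signature verification is not shown.

```python
def filter_trusted(readings, trust_snapshot):
    """Drop readings signed by ingesters marked as revoked.

    readings: iterable of dicts, each with a hypothetical "ingester_id" key.
    trust_snapshot: dict mapping ingester ID -> {"revoked": bool, ...} (assumed shape).
    """
    revoked = {
        ingester_id
        for ingester_id, entry in trust_snapshot.items()
        if entry.get("revoked", False)
    }
    return [r for r in readings if r["ingester_id"] not in revoked]
```

Because the filter runs at read time, it applies equally to fresh data and to archived files, which is what makes retroactive exclusion possible. A stricter consumer might also drop readings from ingesters that are absent from the snapshot altogether.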

Data Quality Flags

The ClickHouse schema includes a data_quality_flag column (LowCardinality(String), default 'unvalidated'). This supports future automated quality assessment without modifying readings: the flag is metadata about the reading, not part of the reading itself.

All WeSense data is free and open, forever.