Data Quality
Validation Pipeline
Readings pass through multiple validation stages:
| Stage | What it Catches | Action |
|---|---|---|
| Pydantic model (storage broker) | Missing required fields, wrong types | Reject with 422 |
| Geocoding check (storage broker) | Missing geo_country / geo_subdivision | Reject (counter incremented) |
| Deduplication (ingester + ClickHouse) | Duplicate readings from mesh flooding or multi-path delivery | Skip silently |
| Content-based ID (ClickHouse ReplacingMergeTree) | Same reading ingested by multiple stations | Last-write-wins dedup at query time with FINAL |
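
To illustrate the content-based ID stage, here is a minimal sketch of how a deterministic ID could make duplicates collapse. It is not the WeSense implementation: the field names in `ID_FIELDS`, the `station` key, and the choice of SHA-256 are all assumptions made for the example, and the in-memory `dedupe` helper only mimics the last-write-wins behaviour that ClickHouse provides through ReplacingMergeTree and `FINAL`.

```python
import hashlib
import json

# Hypothetical field set: which fields actually feed the ID is an assumption.
ID_FIELDS = ("device_id", "timestamp", "reading_type", "value")


def content_id(reading: dict) -> str:
    """Derive a deterministic ID from the reading's content only."""
    canonical = json.dumps(
        {key: reading[key] for key in ID_FIELDS},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe(readings: list) -> list:
    """Keep the last-seen copy per content ID (last-write-wins)."""
    latest = {}
    for reading in readings:
        latest[content_id(reading)] = reading
    return list(latest.values())
```

Because the ID ignores which station delivered the reading, two copies of the same reading arriving via different paths map to the same key, and only the most recently seen copy survives.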
Known Data Quality Challenges
Sensor drift: Sensors degrade over time. A PM2.5 sensor may read 20% high after a year of outdoor exposure. WeSense stores calibration_status per reading (from the sensor firmware) and sensor_model to enable post-hoc correction by researchers, but does not currently apply corrections in the pipeline.
Stuck sensors: A sensor reporting the same value indefinitely (e.g., 0°C for a week) is likely malfunctioning. Stuck sensors are not currently detected automatically.
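
One way such a check could work, shown purely as a sketch since the pipeline does not do this today, is to measure how long the most recent run of identical values has lasted. The function name `is_stuck`, the 24-hour threshold, and the `tolerance` parameter are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta


def is_stuck(samples, min_duration=timedelta(hours=24), tolerance=0.0):
    """Return True if the trailing run of unchanged values is long enough.

    samples: list of (datetime, float) tuples sorted by time, oldest first.
    A value counts as unchanged if it is within `tolerance` of the latest one.
    """
    if not samples:
        return False
    last_time, last_value = samples[-1]
    run_start = last_time
    # Walk backwards until the value differs from the most recent one.
    for timestamp, value in reversed(samples):
        if abs(value - last_value) > tolerance:
            break
        run_start = timestamp
    return last_time - run_start >= min_duration
```

A real threshold would need to depend on the reading type: an indoor temperature can legitimately sit at one value for hours, whereas an outdoor PM2.5 reading rarely does.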
GPS glitches: A sensor may report coordinates in the ocean or on the wrong continent. The geocoder will still assign a country/subdivision, but the result will be wrong. GPS glitches are not currently detected automatically.
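
For a fixed-position sensor, one possible detection approach is to compare each reported position against the sensor's median position and flag anything implausibly far away. The sketch below is an assumption about how this might be done, not existing pipeline behaviour; `location_outliers` and the 50 km threshold are hypothetical. It uses the standard haversine great-circle formula.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def location_outliers(points, max_km=50.0):
    """Return indices of points farther than max_km from the median position.

    points: non-empty list of (lat, lon) tuples reported by one fixed sensor.
    """
    lat0 = median(p[0] for p in points)
    lon0 = median(p[1] for p in points)
    return [
        i for i, (lat, lon) in enumerate(points)
        if haversine_km(lat0, lon0, lat, lon) > max_km
    ]
```

Taking the median of raw longitudes breaks down for sensors near the antimeridian (±180°), and the approach does not apply to mobile sensors, so this is a starting point rather than a complete check.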
Bad actors: A malicious ingester could sign and submit fabricated readings. The Ed25519 trust model means every reading is traceable to its signing ingester. Revoking an ingester's key in the trust list lets consumers exclude all of its readings retroactively, including those in already-archived Parquet files (the trust snapshot records revocation status).
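
On the consumer side, retroactive exclusion could look roughly like the sketch below. The shape of the trust list (a mapping from ingester ID to a record with a `revoked` flag) and the `ingester_id` field on each reading are assumptions for illustration; the actual trust snapshot format is not described here, and the sketch does not perform Ed25519 signature verification.

```python
def filter_trusted(readings, trust_list):
    """Drop readings signed by revoked or unknown ingesters.

    readings: iterable of dicts with a hypothetical "ingester_id" key.
    trust_list: hypothetical mapping of ingester ID -> {"revoked": bool}.
    """
    trusted = []
    for reading in readings:
        entry = trust_list.get(reading["ingester_id"])
        # Unknown ingesters are treated the same as revoked ones.
        if entry is not None and not entry["revoked"]:
            trusted.append(reading)
    return trusted
```

Since the filter runs at read time, the same archived data yields different results as the trust list changes, without rewriting any stored readings.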
Data Quality Flags
The ClickHouse schema includes a data_quality_flag column (LowCardinality(String), default 'unvalidated'). This supports future automated quality assessment without modifying readings — the flag is metadata about the reading, not part of the reading itself.
