Quality of Data (QoD) Mechanism v1 - Full Description
QoD is the mechanism we employ to differentiate between expected and unexpected behaviour of the data recorded by a weather station. To accomplish this, we utilise a series of techniques and processes that examine various aspects of data quality.
The current version of QoD includes control checks that identify instances where a sensor measures values beyond the limits specified by the manufacturer, detect unusual jumps, identify constant values, or determine whether there is insufficient data to compute an average.
The processes included in the QoD mechanism v1 are: a. the Out-of-Bounds Check (OBC) and b. the Self Quality Check (SQC).
QoD v1 is able to process: a. high temporal resolution timeseries (HTRTs), i.e., data recorded every 20 seconds or less (e.g., WG1000 (M5)), and b. low temporal resolution timeseries (LTRTs), i.e., data recorded every 1 minute or more (e.g., WS2000 (Helium)).
The final output of the mechanism is an hourly percentage of valid/available data and the corresponding text annotations per meteorological variable (temperature, relative humidity, wind speed, wind direction, atmospheric pressure, illuminance and precipitation). The output is expected to be available on a daily basis.
Before moving forward to the description of the QoD mechanism, we should note that the input data is standardised in time. Weather stations may transmit data at varying, non-fixed time intervals. This implies that even if a station's specification indicates a recording frequency of 16 seconds, the actual interval can fluctuate, ranging from a few seconds to several minutes. To ensure that the quality control operates on a time-normalised dataset, a new timeframe is established with a fixed interval, determined by the theoretical recording frequency (fr) of the station model (e.g., 16’’ for the WG1000 - M5 model). Subsequently, data points falling within the nearest fr/2-second (or 16’’/2 = 8’’) range around each fixed timestamp are assigned to that fixed timestamp.
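As a rough illustration of this snapping step, consider the following minimal sketch (Python; the function and variable names are ours and not part of the QoD codebase), assuming a theoretical recording frequency of 16 seconds:

```python
from datetime import timedelta

def normalise_timeframe(observations, freq_s=16):
    """Snap irregularly timed observations onto a fixed time grid.

    observations: list of (datetime, value) tuples, sorted by time.
    freq_s: theoretical recording frequency of the station model (e.g. 16'').
    Each fixed timestamp takes the observation that falls within
    +/- freq_s/2 seconds of it; slots with no nearby observation stay None.
    """
    if not observations:
        return []
    start, end = observations[0][0], observations[-1][0]
    n_slots = int((end - start).total_seconds() // freq_s) + 1
    half = freq_s / 2

    normalised = []
    for i in range(n_slots):
        slot_time = start + timedelta(seconds=i * freq_s)
        # candidate observations within the +/- fr/2 window around the slot
        candidates = [
            (abs((t - slot_time).total_seconds()), v)
            for t, v in observations
            if abs((t - slot_time).total_seconds()) <= half
        ]
        normalised.append((slot_time, min(candidates)[1] if candidates else None))
    return normalised
```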
The QoD v1 mechanism is described step-by-step in the next sections.
To ensure the proper functioning of certain SQC processes, an ignoring time period is established (Table 1). In practice, all time slots within the ignoring period that lack data are filled with the most recent valid value. Subsequently, this newly filled time series is exclusively employed for scrutinising constant data and anomalous jumps at the raw scale level (see section 4).
Table 1 - Ignoring period for the available weather station models.
| Weather Station Model | Ignoring Period [seconds] |
| --- | --- |
| WG1000 (M5) | 60 |
| WS2000 (Helium) | 180 |
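In practice, the gap filling described above could be sketched as follows (Python; again a hedged illustration with names of our own choosing, using the Table 1 ignoring period for the WG1000 (M5)):

```python
def fill_ignoring_period(grid, freq_s=16, ignoring_period_s=60):
    """Forward-fill empty slots with the most recent valid value, but only
    while the gap is still within the station's ignoring period (Table 1).

    grid: list of (timestamp, value-or-None) from the normalised timeframe.
    Returns a new list; slots beyond the ignoring period remain None.
    """
    filled, last_value, gap_s = [], None, 0
    for ts, value in grid:
        if value is not None:
            last_value, gap_s = value, 0
            filled.append((ts, value))
        else:
            gap_s += freq_s
            # fill only while the accumulated gap fits the ignoring period
            filled.append((ts, last_value if gap_s <= ignoring_period_s else None))
    return filled
```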
The OBC is a simple process aimed at identifying values for each parameter that exceed the limits set by the manufacturer. It proves highly valuable in flagging sensors that might be experiencing technical malfunctions. Although OBC is a straightforward check for temperature, relative humidity, wind speed/direction, atmospheric pressure and illuminance, precipitation requires a slightly different OBC process.
The OBC for precipitation applies to the variable “accumulated precipitation”. As the OBC for precipitation looks at the accumulated precipitation difference between two consecutive timesteps, it should be applied over the filled time series produced in section 1. This allows for a discrimination of faulty precipitation rates, even under a lack of data for a predefined short period of time. The upper cut-off limit used for this process is time dependent, as its unit is mm/sec. This limit is scaled according to the recording frequency of the station model (e.g., from 0.254 mm/sec to 4.064 mm per 16 sec) and, eventually, the OBC is applied to the difference of the “accumulated precipitation”. In case of a data gap that has been filled (according to the process in section 1), the precipitation cut-off threshold is scaled to the corresponding elapsed time (e.g., for a data temporal resolution of 16’’, if the time slots at 16’’, 32’’ and 48’’ have no data and have been filled with the value of the latest valid measurement at 00’’, and the next available measurement is at 64’’, the cut-off limit gradually increases for each filled timestep until the next valid datum: 4.064, 8.128, 12.192 and 16.256 mm).
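A hedged sketch of this time-dependent cut-off (Python; names ours; `was_filled` marks slots filled by the process of section 1, and the numbers reproduce the 16’’ example above):

```python
RATE_LIMIT_MM_PER_S = 0.254  # manufacturer's upper limit (Tables 2-3)

def precipitation_obc(filled_series, freq_s=16):
    """Flag accumulated-precipitation differences exceeding the cut-off.

    filled_series: list of (timestamp, accumulated_mm, was_filled) tuples
    on the normalised time grid, after the gap filling of section 1.
    Returns one boolean per slot (True = out of bounds).
    """
    flags = [False]            # the first datum has no difference to check
    elapsed_s = freq_s         # time since the last valid measurement
    for i in range(1, len(filled_series)):
        _, acc_mm, was_filled = filled_series[i]
        prev_acc_mm = filled_series[i - 1][1]
        if was_filled:
            # filled slots repeat the last valid value (difference of 0),
            # while the cut-off for the next real datum keeps growing
            elapsed_s += freq_s
            flags.append(False)
            continue
        cutoff_mm = RATE_LIMIT_MM_PER_S * elapsed_s   # e.g. 4.064 mm per 16''
        flags.append((acc_mm - prev_acc_mm) > cutoff_mm)
        elapsed_s = freq_s
    return flags
```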
Currently, there are two sets of outdoor sensors, each accompanied by its own respective specifications outlined in Tables 2 and 3.
Table 2 - Manufacturer’s limits for WG1000 (M5) sensors.
| Parameter | Lower Limit | Upper Limit |
| --- | --- | --- |
| Temperature [°C] | -40 | 60 |
| Relative Humidity [%] | 10 | 99 |
| Wind Speed [m/s] | 0 | 50 |
| Wind Direction [°] | 0 | 359 |
| Atmospheric Pressure [mb] | 300 | 1100 |
| Illuminance [Lux] | 0 | 400000 |
| Precipitation [mm/sec] | 0 | 0.254 |
Table 3 - Manufacturer’s limits for WS2000 (Helium) sensors.
| Parameter | Lower Limit | Upper Limit |
| --- | --- | --- |
| Temperature [°C] | -40 | 80 |
| Relative Humidity [%] | 1 | 99 |
| Wind Speed [m/s] | 0 | 50 |
| Wind Direction [°] | 0 | 359 |
| Atmospheric Pressure [mb] | 540 | 1100 |
| Illuminance [Lux] | 0 | 200000 |
| Precipitation [mm/sec] | 0 | 0.254 |
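A minimal sketch of the basic OBC (Python; the limits are hard-coded from Table 2 for the WG1000 (M5) and the names are ours):

```python
# Manufacturer's limits per parameter for the WG1000 (M5), from Table 2
M5_LIMITS = {
    "temperature": (-40.0, 60.0),       # °C
    "relative_humidity": (10.0, 99.0),  # %
    "wind_speed": (0.0, 50.0),          # m/s
    "wind_direction": (0.0, 359.0),     # °
    "pressure": (300.0, 1100.0),        # mb
    "illuminance": (0.0, 400_000.0),    # Lux
}

def obc_valid(parameter, value):
    """Return True if the value lies within the manufacturer's limits."""
    low, high = M5_LIMITS[parameter]
    return low <= value <= high

# e.g. obc_valid("temperature", 72.5) -> False, so the datum would be
# annotated as "Out of sensor's specs value"
```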
SQC is a process that aims to detect anomalous behaviour of weather observations, caused by faulty sensors or improper deployment, using only the data of the weather station itself. The thresholds, as well as the assumptions that lead to the final result, are based on the WMO and European Commission general guidelines. However, our further analysis of data from the WeatherXM network has resulted in adjusted and less strict thresholds for some SQC checks.
As weather stations transmit data at various time intervals, ranging from a few seconds to several minutes, SQC should be capable of annotating data regardless of the temporal resolution of a time series. However, it is important to highlight that temporal resolutions exceeding 1’ may not be suitable for achieving rigorous control checks, as recommended by the WMO.
SQC consists of a series of controls per parameter (temperature, relative humidity, wind speed, wind direction, atmospheric pressure and illuminance) at different time scales, which vary with the temporal resolution of the series (HTRTs vs LTRTs). Note that precipitation is excluded from SQC, as it requires a careful and more complex quality control.
An HTRT passes through the control checks of all three time scale levels (Table 4).
For LTRTs, practically only the raw (now called the inter-minute scale level) and the hourly scale levels apply, as minute averaging is not possible (Table 5).
Table 4 - All the control processes that extend across the three time scale levels for the case of a weather station that records data every 20 seconds or less (e.g., WG1000 (M5)).
Control Process of HTRTs (<20 sec):

| Time Scale | Control Check Types |
| --- | --- |
| Raw scale | Constant data detection; Unavailable data detection; Suspicious jump detection |
| Minute scale | Unavailable data detection; Suspicious jump detection |
| Hourly scale | Counting of invalid hourly time slots |
Table 5 - All the control processes that extend across the two time scale levels for the case of a weather station that records data every 1 minute or more (e.g., WS2000 (Helium)).
Control Process of LTRTs (>1 min):

| Time Scale | Control Check Types |
| --- | --- |
| Inter-minute scale | Constant data detection; Unavailable data detection; Suspicious jump detection |
| Hourly scale | Counting of invalid hourly time slots |
There are three basic processes to check the Constancy, Availability and Fluctuation of the data: a. constant data detection, b. unavailable data detection and c. suspicious jump detection.
The constant data detection process aims to identify whether a sensor is recording unchanging values due to technical disruptions or improper deployment. This process is applied to raw data to ensure that even insignificant alterations over a span of a few seconds are taken into account. According to WMO recommendations, the parameters of temperature, relative humidity, wind speed, wind direction and atmospheric pressure should not remain constant for more than 1 hour. However, our current approach is more flexible, employing different thresholds according to our data analysis (see the report).
While constant data typically raises suspicion, certain cases warrant exclusion. For instance, during foggy or fairly wet conditions, there is often a prolonged period of constant relative humidity/temperature/wind spanning hours, and thus such cases are excluded from this process (Table 6 - Constancy Duration Threshold). Additionally, given that a significant number of weather stations in various networks, including those in the WeatherXM network, lack a heating system, we omit constant 0 m/s wind speed and/or constant wind direction readings, provided the median temperature for the predefined period is <0°C, although these cases are annotated for reference. Note, though, that constant but non-zero wind speed under freezing temperatures is considered faulty (due to a faulty sensor). Finally, we establish a maximum threshold for constant illuminance when it is not equal to 0 Lux. This is because illuminance may remain at 0 Lux for extended hours during the night, but it should not remain constant at a specific non-zero value for an extended period during the day.
In addition, a second set of constancy checks has been established mainly to identify faulty sensors (Table 6 - Constancy Duration Max Threshold). Consequently, any time series under investigation is marked as erroneous when the data remain constant for 1440 minutes, without any additional specific conditions. It is worth noting that relative humidity has been excluded from this criterion due to its potential to remain constant for longer durations (as explained in the report). Similarly, illuminance and pressure are also excluded, since the initial set of criteria (with a 120-minute threshold) adequately addresses constancy checks for these parameters.
Table 6 - Duration thresholds for the constant data detection process.
| Variable | Constancy Duration Threshold (minutes) | Constancy Duration Max Threshold (minutes) | Control Checks (annotated as faulty) | Exclusions |
| --- | --- | --- | --- | --- |
| Temperature | 240 | 1440 | Constant values when RHmedian < 95% | Constant values when RHmedian ≥ 95% |
| Relative Humidity | 360 | - | Constant values when RHmedian < 95% | Constant values when RHmedian ≥ 95% |
| Wind Speed | 360 | 1440 | Constant 0 m/s wind speed when Tmedian > 0°C; constant 0 m/s wind speed when RHmedian < 85%; constant wind speed ≠ 0 m/s | Constant values when Tmedian < 0°C or RHmedian ≥ 85% |
| Wind Direction | 360 | 1440 | Constant wind direction when Tmedian > 0°C; constant wind direction when RHmedian < 85% | Constant values when Tmedian ≤ 0°C or RHmedian ≥ 85% |
| Atmospheric Pressure | 120 | - | Constant pressure regardless of the pressure value | - |
| Illuminance | 120 | - | Constant illuminance only when it is ≠ 0 Lux | Constant illuminance when it is 0 Lux |
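As a concrete illustration of the two-tier logic, here is a hedged sketch for temperature alone (Python; thresholds from Table 6, names ours; a full implementation would generalise across the variables and their exclusions):

```python
from statistics import median

def constant_temperature_check(temps, rhs, freq_s=16):
    """Two-tier constancy check for temperature (Table 6).

    temps, rhs: equally long lists of raw temperature/RH values on the
    normalised time grid (one slot per freq_s seconds), oldest first.
    Returns 'faulty', 'excluded' or 'ok' for the window under test.
    """
    window_min = len(temps) * freq_s / 60
    if len(set(temps)) > 1:
        return "ok"                    # the values changed within the window
    if window_min >= 1440:
        return "faulty"                # max threshold: no exclusions apply
    if window_min >= 240:
        # constancy threshold reached; exclude foggy/saturated conditions
        return "excluded" if median(rhs) >= 95 else "faulty"
    return "ok"
```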
Having already normalised the timeframe, we then check for no-data time slots. We count the no-data time slots at both the raw and minute scale levels. Note that at the minute scale level, a no-data time slot may arise due to a lack of raw data within a certain minute or due to invalid data (e.g., see section 3c).
Suspicious jump detection is employed to detect abrupt and unusual changes in a parameter occurring within a short period of time. Such changes are often attributed to technical disruptions, although in fewer instances, they may result from improper deployment. This process is applied across both the raw and minute scale levels for HTRTs (Table 7), while it operates solely at the inter-minute (raw) scale for LTRTs (Table 8). The LTRT thresholds are applied proportionally up to a pre-defined upper limit. For instance, if the temperature jump threshold for a 1-min interval is 3°C, it will be 9°C for a 3-min interval, but will never exceed the upper limit of 15°C. The upper limits are those recommended by the European Commission for consecutive 1-hour averaged values. We highlight the lack of an RH upper limit, prompting us to make an arbitrary yet logical decision to set the value at 80%. In case a faulty jump is detected, an additional check is conducted to label the subsequent value as faulty if it equals the previously identified faulty value.
To overcome the challenge of distinguishing the faulty value between two consecutive ones, we require a minimum of 67% of the data within the past 10-minute window to calculate the median for HTRTs (60-min window for LTRTs). Subsequently, we compare both consecutive values with the calculated median, eventually categorising the one with the largest deviation from the median as faulty. The same process applies at both the raw (or inter-minute) and minute scale levels. Note that if the median cannot be calculated, no decision can be taken and the datum is annotated as not available.
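The tie-break between the two consecutive suspicious values might be sketched as follows (Python; names ours; `expected_count` is the number of slots the window should contain):

```python
from statistics import median

def resolve_jump(prev_value, curr_value, window_values, expected_count,
                 min_coverage=0.67):
    """Decide which of two consecutive values around a suspicious jump is
    faulty, by comparing both against the median of the recent window
    (10 minutes for HTRTs, 60 minutes for LTRTs).

    window_values: observations within the window, None for missing slots.
    Returns 'prev', 'curr' or 'not_available'.
    """
    valid = [v for v in window_values if v is not None]
    if len(valid) / expected_count < min_coverage:
        return "not_available"   # the median cannot be calculated reliably
    m = median(valid)
    # the value deviating more from the median is categorised as faulty
    return "prev" if abs(prev_value - m) > abs(curr_value - m) else "curr"
```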
Table 7 - Jump thresholds for the raw and minute scale levels for HTRTs. There are no jump thresholds for wind direction observations.
| | Temperature | RH | Wind Speed | Wind Direction | Pressure | Illuminance |
| --- | --- | --- | --- | --- | --- | --- |
| Raw scale level jump threshold | ≤2°C | ≤5% | ≤20m/s | - | ≤0.3mb | ≤97600Lux |
| Minute scale level jump threshold | ≤3°C | ≤10% | ≤10m/s | - | ≤0.5mb | ≤97600Lux |
Table 8 - Jump thresholds for the inter-minute scale level and the corresponding upper limits for LTRTs. There are no jump thresholds for wind direction observations.
| | Temperature | RH | Wind Speed | Wind Direction | Pressure | Illuminance |
| --- | --- | --- | --- | --- | --- | --- |
| Inter-minute scale level jump threshold | ≤3°C | ≤10% | ≤10m/s | - | ≤0.5mb | ≤97600Lux |
| Upper Limit | 15°C | 80% | 15m/s | - | 15mb | 146400Lux |
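The proportional scaling of the LTRT jump thresholds up to the upper limit could be expressed as simply as the following sketch (Python; the temperature numbers reproduce the example in the text):

```python
def ltrt_jump_threshold(base_per_min, upper_limit, interval_min):
    """Scale a 1-minute jump threshold linearly with the actual interval
    between two consecutive LTRT observations, capped at the European
    Commission upper limit (Table 8)."""
    return min(base_per_min * interval_min, upper_limit)

# Temperature: 3°C per minute, capped at 15°C
assert ltrt_jump_threshold(3, 15, 1) == 3     # 1-min interval
assert ltrt_jump_threshold(3, 15, 3) == 9     # 3-min interval
assert ltrt_jump_threshold(3, 15, 10) == 15   # capped at the upper limit
```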
After detecting time slots that either contain invalid data (following the control checks described in sections 2 and 3a-c) or are devoid of data, a minimum availability threshold is established (Table 9). This ensures the presence of sufficient data for the eventual calculation of average values for each parameter over both one-minute and one-hour intervals. Notably, all parameters are averaged within one-minute and one-hour time slots, except for wind speed and direction, which undergo vector averaging across two-minute and one-hour intervals, respectively (Table 9). The thresholds employed are aligned with the WMO recommendations for minute-scale data, and we have extended them to encompass hourly-scale data as well. As an example, we require the presence of at least 67% of raw temperature data within a minute/hour to consider that minute/hour valid. As there are no WMO recommendations for the availability of precipitation data, we opt for a minimum availability of 30% on the raw scale, but an increased threshold of 85% on the minute scale. The low raw-scale threshold reflects the fact that a weather station measures accumulated precipitation, so at least one measurement per minute can be adequate. The increased 85% threshold on the minute scale ensures a sufficient rain-rate calculation spanning minutes and that even a short precipitation episode will be observed.
Table 9 - Thresholds per parameter of data availability for both HTRTs and LTRTs and the corresponding averaging period in minute scale level for HTRTs.
| | Temperature | RH | Wind Speed | Wind Direction | Pressure | Illuminance | Precipitation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Availability Threshold | ≥67% | ≥67% | ≥75% | ≥75% | ≥67% | ≥67% | ≥30%* |
| Averaging Period | 1-min | 1-min | 2-min | 2-min | 1-min | 1-min | 1-min** |
* The precipitation availability threshold applies only on raw data. On a minute scale, the threshold is ≥85%.
** Precipitation is accumulated within the averaging period
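A hedged sketch of this availability gate before averaging (Python; thresholds from Table 9, names ours; note that wind speed/direction would be vector-averaged rather than arithmetically averaged as here):

```python
# Minimum data availability per parameter (Table 9); precipitation uses
# 30% on the raw scale and 85% on the minute scale
AVAILABILITY = {
    "temperature": 0.67, "relative_humidity": 0.67,
    "wind_speed": 0.75, "wind_direction": 0.75,
    "pressure": 0.67, "illuminance": 0.67,
    "precipitation_raw": 0.30, "precipitation_minute": 0.85,
}

def average_if_available(slot_values, parameter):
    """Average a time slot only if enough valid data is present.

    slot_values: one entry per expected sub-slot, None where data is
    missing or was annotated as invalid. Returns the average, or None
    (annotated as "Average value is suspicious or unavailable").
    """
    valid = [v for v in slot_values if v is not None]
    if not slot_values or len(valid) / len(slot_values) < AVAILABILITY[parameter]:
        return None
    return sum(valid) / len(valid)
```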
At both the raw and (inter-)minute scale levels, various text annotations are assigned to every single value for each of the 7 investigated meteorological variables (Tables 10-11). At the hourly scale level, all the unique text annotations produced at the previous levels are gathered to provide information about all the detected faults within the hourly slot.
Table 10 - Explanation of text annotations applying at the different levels of the QoD mechanism for HTRTs.
| Scale Level | Text Annotation | Description |
| --- | --- | --- |
| Raw | Out of sensor's specs value | The investigated value is outside the predefined manufacturer’s boundaries (OBC). |
| Raw | Not available data | There is no data in the investigated time slot. |
| Raw | Not enough data to take a decision | Less than 67% of the data is available within the last 10 minutes, so the median (used in SQC processes) cannot be calculated. |
| Raw | Abnormal changes due to problematic sensor or bad deployment | There are suspicious jumps between two consecutive observations. |
| Raw | Constant data for at least x minutes | The data are constant over the last x minutes. |
| Raw | Frozen wind gauge for at least x minutes | Wind direction is constant over the last x minutes and/or the wind speed is constantly 0 m/s. |
| Minute | Average value is suspicious or unavailable | There is not enough data to produce the minute average, due to a lack of data or to invalid data, which are not considered for averaging. |
| Minute | Abnormal changes due to problematic sensor or bad deployment | There are suspicious jumps between two consecutive averaged values. |
Table 11 - Explanation of text annotations applying at the different levels of the QoD mechanism for LTRTs.
| Scale Level | Text Annotation | Description |
| --- | --- | --- |
| Inter-minute | Out of sensor's specs value | The investigated value is outside the predefined manufacturer’s boundaries (OBC). |
| Inter-minute | Not available data | There is no data in the investigated time slot. |
| Inter-minute | Not enough data to take a decision | Less than 67% of the data is available within the last 60 minutes, so the median (used in SQC processes) cannot be calculated. |
| Inter-minute | Abnormal changes due to problematic sensor or bad deployment | There are suspicious jumps between two consecutive observations. |
| Inter-minute | Constant data for at least x minutes | The data are constant over the last x minutes. |
| Inter-minute | Frozen wind gauge for at least x minutes | Wind direction is constant over the last x minutes and/or the wind speed is constantly 0 m/s. |
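Finally, gathering the unique annotations of the lower scale levels into hourly slots, as described above, might look like this hedged sketch (Python; names ours):

```python
from collections import defaultdict

def hourly_annotations(annotated_slots):
    """Collect the unique text annotations produced at the raw/inter-minute
    and minute scale levels into one set per hourly slot.

    annotated_slots: iterable of (timestamp, annotation-or-None) pairs.
    Returns {hour_start: sorted list of unique annotations}.
    """
    per_hour = defaultdict(set)
    for ts, annotation in annotated_slots:
        if annotation is not None:
            hour = ts.replace(minute=0, second=0, microsecond=0)
            per_hour[hour].add(annotation)
    return {hour: sorted(notes) for hour, notes in per_hour.items()}
```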