Data Evaluation
Before using data for forecasting or model training, it’s important to understand its quality. The evaluation scores data along three dimensions: whether the data is complete and regularly sampled (integrity), whether it has learnable patterns (forecastability), and whether series are related (correlation). Higher scores indicate better quality.
Three evaluation dimensions
Integrity
This score tells you whether data points are missing, duplicated, or irregularly timestamped.
Sensor outages, network jitter, and duplicate reports can introduce gaps, duplicates, or misalignment in time-series data. The integrity score reflects how severe these issues are.
| Score | Meaning | Suggestion |
|---|---|---|
| 80–100 | Continuous data with regular timestamps | Safe to use directly |
| 40–80 | Some missing points or anomalies | Clean the data first |
| 0–40 | Many missing points or severe anomalies | Investigate data sources carefully |
If the integrity score is low, forecasting and analytics may suffer: models can learn incorrect patterns or become biased because of gaps. Consider fixing data issues before proceeding.
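To see which of these issues a series actually has, a quick count of duplicates, out-of-order points, and gaps can help. The sketch below is only an illustration: `timestamp_issues` is a hypothetical helper, it assumes the series should contain exactly one point per fixed interval, and it is not the formula behind the integrity score.

```python
from datetime import datetime, timedelta


def timestamp_issues(timestamps, interval):
    """Count duplicate, out-of-order, and missing points.

    Hypothetical helper: assumes one point is expected every `interval`.
    """
    # Duplicates: repeated timestamps beyond the first occurrence.
    duplicates = len(timestamps) - len(set(timestamps))

    # Out of order: a point that arrives earlier than its predecessor.
    out_of_order = sum(
        1 for prev, cur in zip(timestamps, timestamps[1:]) if cur < prev
    )

    # Missing: expected points that never appear between sorted neighbours.
    unique_sorted = sorted(set(timestamps))
    missing = 0
    for prev, cur in zip(unique_sorted, unique_sorted[1:]):
        steps = round((cur - prev) / interval)
        missing += max(steps - 1, 0)

    return {
        "duplicates": duplicates,
        "out_of_order": out_of_order,
        "missing": missing,
    }


start = datetime(2024, 1, 1)
hour = timedelta(hours=1)
# Hourly readings: hour 1 is reported twice, hour 3 never arrives,
# and hour 4 arrives after hour 5.
stamps = [start + hour * k for k in (0, 1, 1, 2, 5, 4)]
report = timestamp_issues(stamps, hour)
print(report)
```

Counting each issue separately, rather than collapsing them into one number, makes it easier to choose a cleaning step: deduplicate, re-sort, or fill gaps.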
Forecastability
This score tells you whether the series has learnable patterns.
Some series are naturally regular (e.g., hourly load with daily cycles), while others behave closer to random fluctuations. The forecastability score reflects how strong the pattern is.
| Score | Meaning | Suggestion |
|---|---|---|
| 50–100 | Strong patterns; easier to forecast | Good candidate for modeling |
| 30–50 | Some patterns with noticeable volatility | Try modeling; treat results as a reference only |
| 0–30 | Weak patterns, close to random noise | Forecasting may be poor; review data or strategy |
If the forecastability score is low, it does not necessarily mean the data is wrong. The series may simply be highly volatile and hard to predict. Use domain knowledge to decide whether to add covariates or adjust expectations.
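One rough way to build intuition for "pattern strength" is the autocorrelation at a suspected seasonal lag. The sketch below compares a synthetic hourly series with a daily cycle against pure noise. It is an assumption-laden proxy, not the method used to compute the forecastability score, and the `autocorrelation` helper and sample data are made up for illustration.

```python
import math
import random


def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at the given lag."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values)
    if variance == 0:
        return 0.0  # A constant series has no pattern to measure.
    covariance = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


hours = 24 * 14  # Two weeks of hourly points.

# Synthetic series with a clean daily cycle.
periodic = [math.sin(2 * math.pi * t / 24) for t in range(hours)]

# Synthetic series of independent Gaussian noise (seeded for repeatability).
rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(hours)]

periodic_acf = autocorrelation(periodic, 24)
noise_acf = autocorrelation(noise, 24)
print(f"daily-cycle series, lag 24: {periodic_acf:.2f}")
print(f"noise series, lag 24: {noise_acf:.2f}")
```

A value near 1 at lag 24 suggests the series repeats day to day, while a value near 0 suggests that history at that lag says little about the future.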
Correlation
This score tells you whether multiple series are related.
When you collect multiple signals (e.g., temperature, humidity, pressure), correlation helps you see which series move together and which are independent. This is useful for multivariate forecasting and feature selection.
If two series are highly correlated (close to 100), they likely carry similar information, so keeping one may be enough to reduce redundancy. If the target series has low correlation with the others, those series may contribute little as covariates and can likely be removed.
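The redundancy idea above can be illustrated with a plain Pearson correlation coefficient. This is a minimal sketch with made-up sample values. The `pearson` helper is hypothetical, and how the evaluation maps correlation onto its score scale is not specified here, so the raw coefficient (between -1 and 1) is shown as is.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # Undefined for a constant series; treated as unrelated.
    return covariance / (spread_x * spread_y)


# Illustrative values only.
temperature_c = [20.0, 21.5, 23.0, 24.5, 26.0]
temperature_f = [c * 9 / 5 + 32 for c in temperature_c]  # Same signal, new unit.
pressure = [1013.0, 1009.0, 1015.0, 1008.0, 1012.0]

redundant = pearson(temperature_c, temperature_f)
weak = pearson(temperature_c, pressure)
print(f"temperature C vs F: {redundant:.2f}")
print(f"temperature vs pressure: {weak:.2f}")
```

Here the two temperature series are the same signal in different units, so one of them adds nothing, while the pressure series moves largely independently of temperature in this sample.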
Quick Reference
| Metric | Key Question | Low Score Indicates |
|---|---|---|
| Integrity | Are timestamps continuous? Any duplicates or gaps? | Missing data, duplicates, or time-ordering issues |
| Forecastability | Does the series have cyclic or trend patterns? | Near-random series, hard to extrapolate from history |
| Correlation | Are multiple series linearly related? | Series are independent; covariate value is limited |