Full cross-validation and generating learning curves for time-series models
Standard cross-validation cannot be applied directly to time series data because the data are sequential, which does not lend itself well to splitting the data into statistically useful training and validation sets. However, a new approach called Reconstructive Cross-validation may pave the way toward performing this type of important analysis for predictive models with temporal datasets.
By Mehmet Suzen, Theoretical Physicist | Research Scientist.
Time series analysis is needed in almost any quantitative field and in real-life systems that collect data over time, i.e., temporal datasets. Building predictive models on temporal datasets for the future evolution of the systems under consideration is usually called forecasting. The validation of such models deviates from the standard holdout method used in supervised learning, which relies on random disjoint splits into train, test, and validation sets. This stems from the fact that time series are ordered, and the order induces statistical properties, such as correlations between nearby points, that should be retained. For this reason, applying cross-validation directly to time-series model building is not possible, and validation is restricted to out-of-sample (OOS) validation, which uses the end-tail of a temporal set as a single test set. Recent work proposed an approach that overcomes this known limitation and achieves full cross-validation for time series. The approach also opens up the possibility of producing learning curves for time-series models, which is usually not possible for similar reasons.
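As a point of reference, the conventional OOS validation described above can be sketched in a few lines. This is a minimal illustration, not code from the paper: the helper name `oos_split` and the `test_fraction` parameter are hypothetical.

```python
import numpy as np

def oos_split(series, test_fraction=0.2):
    """Split an ordered series into a training head and an out-of-sample tail.

    The order is never shuffled: the last points form the single test set.
    """
    series = np.asarray(series)
    n_test = max(1, int(round(len(series) * test_fraction)))
    return series[:-n_test], series[-n_test:]

# Toy ordered series standing in for a real temporal dataset.
y = np.arange(10)
train, test = oos_split(y, test_fraction=0.2)
```

With this scheme there is only one train/test pair, which is exactly the limitation that rCV aims to lift.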
Reconstructive Cross-validation (rCV): Design principles of a meta-algorithm
rCV was proposed recently in the paper titled Generalised learning of time-series: Ornstein-Uhlenbeck processes. The design of rCV for time series follows these principles:
Figure 1: rCV meta-algorithm for time series cross-validation and learning curves.
- Logically close to standard cross-validation: Arbitrary test-set size and number of folds.
- Preserve correlations and data order.
- Avoid the absurdity of predicting the past from future data.
- Applicable in a generic fashion regardless of the learning algorithm.
- Applicable to multi-dimensional time series.
- Evaluation metric agnostic.
Idea of introducing missing data: Temporal cross-validation and learning curves
The key idea of rCV is to create cross-validation sets by generating missing-data sets K times, as in K-fold cross-validation, with a given missing ratio, i.e., the fraction of data points removed at random. Each fold has a disjoint set of missing data points. Using an imputation method, we fill in the K disjoint missing-data sets and generate K different training datasets. This gives us K different models, and we can measure the generalised performance of the modelling approach by testing the primary model's predictions on the out-of-sample (OOS) test set. To avoid confusion about what "a model" means here: what we are trying to evaluate is the hypothesis, i.e., the modelling approach. By changing the missing ratio and repeating the cross-validation, the exercise yields a set of missing ratios and their corresponding rCV errors, and plotting one against the other gives nothing but a learning curve from a supervised learning perspective. Note that the imputation and prediction models are different models. The primary model we are trying to build is the prediction model used for producing OOS predictions. The procedure is summarised in Figure 1.
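The procedure can be sketched roughly as follows. This is a minimal illustration under several assumptions that do not come from the paper: linear interpolation stands in for the imputation model, a least-squares AR(1) fit stands in for the primary prediction model, the fold errors are averaged into a single rCV error, and all function names are hypothetical.

```python
import numpy as np

def disjoint_missing_sets(n_train, n_folds, missing_ratio, rng):
    """Return n_folds disjoint index arrays, each with missing_ratio * n_train points."""
    n_missing = int(round(missing_ratio * n_train))
    if n_missing < 1 or n_missing * n_folds > n_train - 2:
        raise ValueError("missing_ratio and n_folds are incompatible with n_train")
    # Endpoints are never removed, so interpolation does not have to extrapolate.
    candidates = rng.permutation(np.arange(1, n_train - 1))
    return [np.sort(candidates[k * n_missing:(k + 1) * n_missing])
            for k in range(n_folds)]

def impute_linear(t, y, missing_idx):
    """Imputation model (assumed): linear interpolation over the remaining points."""
    keep = np.ones(len(y), dtype=bool)
    keep[missing_idx] = False
    y_filled = y.copy()
    y_filled[missing_idx] = np.interp(t[missing_idx], t[keep], y[keep])
    return y_filled

def fit_ar1(y):
    """Prediction model (assumed): AR(1) coefficient by least squares."""
    return float(np.dot(y[1:], y[:-1]) / np.dot(y[:-1], y[:-1]))

def forecast_ar1(phi, last_value, horizon):
    """Multi-step AR(1) forecast starting from the last training value."""
    return last_value * phi ** np.arange(1, horizon + 1)

def rcv_error(t, y, n_test, n_folds, missing_ratio, seed=0):
    """Mean OOS RMSE over K models, each trained on one reconstructed series."""
    rng = np.random.default_rng(seed)
    t_train, y_train, y_test = t[:-n_test], y[:-n_test], y[-n_test:]
    folds = disjoint_missing_sets(len(y_train), n_folds, missing_ratio, rng)
    errors = []
    for missing_idx in folds:
        y_rec = impute_linear(t_train, y_train, missing_idx)
        phi = fit_ar1(y_rec)
        y_pred = forecast_ar1(phi, y_rec[-1], n_test)
        errors.append(np.sqrt(np.mean((y_pred - y_test) ** 2)))
    return float(np.mean(errors))

# Toy autoregressive series used only to exercise the sketch.
rng = np.random.default_rng(42)
n = 300
t = np.arange(n, dtype=float)
y = np.zeros(n)
for i in range(1, n):
    y[i] = 0.9 * y[i - 1] + rng.normal(scale=0.5)

# Learning curve: one rCV error per missing ratio.
ratios = [0.05, 0.1, 0.2, 0.3]
learning_curve = [(r, rcv_error(t, y, n_test=30, n_folds=3, missing_ratio=r))
                  for r in ratios]
```

Because the imputation and prediction steps are separate functions, either one could be swapped for another method without touching the fold logic, which reflects the algorithm-agnostic design principle listed earlier.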
Figure 2: Synthetic data and reconstructions.
Showcase with Gaussian process models on Ornstein-Uhlenbeck processes
To demonstrate the utility of rCV, the paper uses synthetic data generated by the Ornstein-Uhlenbeck process, i.e., a Gaussian process with a certain parameter setting. Figure 2 shows the synthetic data together with example locations of the generated missing data and their reconstruction errors. Figure 3 shows learning curves for different missing-data ratios.
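For readers who want to generate similar synthetic data, an Ornstein-Uhlenbeck path can be simulated with the standard exact discretisation of the process. This is a minimal sketch: the function name `simulate_ou` and the parameter values (`theta`, `mu`, `sigma`, `dt`) are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def simulate_ou(n_steps, dt=0.1, theta=1.0, mu=0.0, sigma=0.5, x0=0.0, seed=0):
    """Simulate an Ornstein-Uhlenbeck path on a regular time grid.

    Uses the exact transition of the process over a step dt:
    x_next = mu + (x - mu) * exp(-theta * dt) + noise_std * N(0, 1).
    """
    rng = np.random.default_rng(seed)
    decay = np.exp(-theta * dt)
    noise_std = sigma * np.sqrt((1.0 - decay ** 2) / (2.0 * theta))
    x = np.empty(n_steps)
    x[0] = x0
    for i in range(1, n_steps):
        x[i] = mu + (x[i - 1] - mu) * decay + noise_std * rng.normal()
    return x

# Hypothetical parameter choices, for illustration only.
path = simulate_ou(n_steps=500, dt=0.1, theta=1.0, mu=0.0, sigma=0.5, seed=1)
```

A path generated this way could serve as the input series for a missing-data and reconstruction experiment of the kind shown in Figure 2.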
Figure 3: Learning curves for the Gaussian Process model generated by rCV.
Conclusion
rCV provides a logically consistent way of practicing cross-validation on time series. It is usually not possible to produce learning curves on the same time window for a time-series model, but using rCV with different ratios of missing data achieves this as well. rCV paves the way for generalised learning on time series.
Further Reading
Apart from the paper Generalised learning of time-series: Ornstein-Uhlenbeck processes, the results can be reproduced with the Python prototype implementation here.
Original. Reposted with permission.