Continuous predictions
Example
Our example uses a simulated test dataset containing an observed and a predicted variable. With the aa_predobs function, you can estimate a suite of error statistics.
pred_errors <- aa_predobs(test$y_obs, test$y_pred, df = TRUE)  # return the statistics as a data frame
pred_errors
#> stat
#> bias 0.011944825
#> varratio 0.850698750
#> mse 0.032482187
#> rmse 0.180228152
#> rrmse 35.869406693
#> mlp 0.029772398
#> mla 0.002709789
#> rmlp 0.172546799
#> rmla 0.052055633
#> plp 0.916576155
#> pla 0.083423845
#> sma_intercept 0.086962203
#> sma_slope 0.850698750
#> ols_intercept 0.151910035
#> ols_slope 0.721438134
#> r_squared 0.719194903
The model had a prediction error (RMSE) of 0.18, which corresponds to 35.9% of the mean observation (rRMSE). The bias of 0.01 is negligible. The bias estimates the mean systematic error; however, systematic errors need not be constant across the range of observations. For example, regression models tend to over-predict at low values and under-predict at high values (the regression effect). To evaluate such conditional bias, it makes sense to take a look at the scatterplot. We can also fit a regression line between the predictions and observations to quantify the bias.
In the literature, you will find three methods for fitting such a regression line to the predictions: 1) ordinary least squares regression (OLS) with the observations on the x-axis and the predictions on the y-axis, 2) OLS with the predictions on the x-axis and the observations on the y-axis, and 3) standardized (= reduced) major axis regression (SMA). OLS fits a line by minimizing the residuals in the y-direction only, whereas SMA minimizes the residuals in both the x- and y-directions. Consequently, OLS assumes that the x-variable is measured without (or with negligible) error. While this assumption may be reasonable for certain applications, measurement errors are often not negligible in remote sensing studies. Note that measurement errors here may include actual instrument errors as well as geo-location uncertainties that arise when linking reference data (obtained in the field or from high-resolution data) with satellite observations.
Many studies put the observations on the x-axis (Figure, left). With OLS, the estimated slope is 0.721, which suggests a pronounced overestimation at low values and underestimation at high values. However, method (1) ignores errors in the observations. A number of studies have therefore suggested reversing the axes (Figure, right). Putting the predictions on the x-axis reduces the regression effect (low and high values tend towards the mean), which can be seen in the right figure, where the OLS line is close to the 1:1 line. However, method (2) assumes that the predictions are obtained without error, which seems difficult to justify. In comparison, SMA yields symmetric slope estimates, so the fitted line is the same regardless of the choice of axes. The SMA slope estimate of 0.851 lies between the slopes of the other two methods. Please see Correndo et al. (2021a) for a more detailed description and discussion of the topic.
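The lines discussed above can be reproduced with base R. The following is a minimal sketch that assumes the simulated test data frame from the example and computes the SMA slope directly as the (signed) ratio of standard deviations instead of using a dedicated SMA package.

obs  <- test$y_obs
pred <- test$y_pred

# Method (1): OLS with the observations on the x-axis
ols <- lm(pred ~ obs)

# SMA: slope is the sign of the correlation times the ratio of standard deviations
b_sma <- sign(cor(obs, pred)) * sd(pred) / sd(obs)
a_sma <- mean(pred) - b_sma * mean(obs)

plot(obs, pred, xlab = "Observed", ylab = "Predicted")
abline(0, 1, lty = 2)               # 1:1 line
abline(ols, col = "red")            # OLS line, method (1)
abline(a_sma, b_sma, col = "blue")  # SMA line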
Error decomposition
Following Correndo et al. (2021a), we partition the prediction error, specifically the MSE, into a systematic component, the mean lack of accuracy (MLA), and a random (non-systematic) component, the mean lack of precision (MLP). In our example dataset, the proportion of the random component is PLP = 0.92 and the proportion of the systematic component is PLA = 0.08.
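To make the decomposition concrete, the additive identity MSE = MLP + MLA and the derived proportions can be checked from the returned statistics. This sketch assumes that, with df = TRUE, aa_predobs returns a data frame whose row names are the statistic names and whose single column is called stat, as the printed output above suggests.

# Assumed layout: statistic names as row names, values in a column named "stat"
mlp <- pred_errors["mlp", "stat"]
mla <- pred_errors["mla", "stat"]
mse <- pred_errors["mse", "stat"]

all.equal(mlp + mla, mse)  # the two components add up to the MSE
mlp / mse                  # PLP: proportion of random (precision-related) error
mla / mse                  # PLA: proportion of systematic (accuracy-related) error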
Equations
In the equations below, $O_i$ and $P_i$ are the observed and predicted values, $\bar{O}$ and $\bar{P}$ their means, $S_O$ and $S_P$ their standard deviations, $r$ the Pearson correlation between observations and predictions, $n$ the number of observation pairs, and $b_{\mathrm{SMA}}$ and $b_{\mathrm{OLS}}$ the slopes of the SMA and OLS fits of the predictions on the observations.

Statistic | Description | Equation |
---|---|---|
bias | Bias | $\bar{P} - \bar{O}$ |
varratio | Variance ratio | $S_P / S_O$ |
MSE | Mean square error | $\frac{1}{n}\sum_{i=1}^{n}(P_i - O_i)^2$ |
RMSE | Root mean square error | $\sqrt{\mathrm{MSE}}$ |
rRMSE | Relative RMSE (% of $\bar{O}$) | $100 \cdot \mathrm{RMSE} / \bar{O}$ |
MLP | Mean lack of precision | $2\,S_O S_P (1 - r)$ |
MLA | Mean lack of accuracy | $(\bar{P} - \bar{O})^2 + (S_P - S_O)^2$ |
RMLP | Root mean lack of precision | $\sqrt{\mathrm{MLP}}$ |
RMLA | Root mean lack of accuracy | $\sqrt{\mathrm{MLA}}$ |
PLP | Proportion lack of precision | $\mathrm{MLP} / \mathrm{MSE}$ |
PLA | Proportion lack of accuracy | $\mathrm{MLA} / \mathrm{MSE}$ |
sma_intercept | Intercept of standardized major axis regression (SMA) | $\bar{P} - b_{\mathrm{SMA}}\,\bar{O}$ |
sma_slope | Slope of standardized major axis regression (SMA) | $b_{\mathrm{SMA}} = S_P / S_O$ |
ols_intercept | Intercept of ordinary least squares regression (OLS) | $\bar{P} - b_{\mathrm{OLS}}\,\bar{O}$ |
ols_slope | Slope of ordinary least squares regression (OLS) | $b_{\mathrm{OLS}} = r \cdot S_P / S_O$ |
r_squared | Coefficient of determination between predictions and observations | $r^2$ |
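For a worked example, the core statistics can also be computed directly from the equations above. The sketch below re-creates the obs and pred vectors from the example; it uses standard deviations with the 1/n denominator so that the decomposition MSE = MLA + MLP holds exactly (which denominator aa_predobs uses internally is not documented here).

obs  <- test$y_obs
pred <- test$y_pred

r   <- cor(obs, pred)
s_o <- sqrt(mean((obs  - mean(obs))^2))   # 1/n standard deviations so that
s_p <- sqrt(mean((pred - mean(pred))^2))  # MSE = MLA + MLP is exact

bias  <- mean(pred) - mean(obs)
mse   <- mean((pred - obs)^2)
rmse  <- sqrt(mse)
rrmse <- 100 * rmse / mean(obs)

mla <- bias^2 + (s_p - s_o)^2   # systematic component (lack of accuracy)
mlp <- 2 * s_o * s_p * (1 - r)  # random component (lack of precision)

b_sma <- sign(r) * s_p / s_o              # SMA slope
a_sma <- mean(pred) - b_sma * mean(obs)   # SMA intercept
b_ols <- r * s_p / s_o                    # OLS slope (predictions on the y-axis)
a_ols <- mean(pred) - b_ols * mean(obs)   # OLS intercept

c(bias = bias, rmse = rmse, rrmse = rrmse, pla = mla / mse, plp = mlp / mse)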
References
Correndo, A.A., Hefley, T.J., Holzworth, D.P., & Ciampitti, I.A., 2021a. Revisiting linear regression to test agreement in continuous predicted-observed datasets. Agricultural Systems, 192. https://doi.org/10.1016/j.agsy.2021.103194
Correndo, A.A., Hefley, T.J., Holzworth, D.P., & Ciampitti, I.A., 2021b. R-Code Tutorial: Revisiting linear regression to test agreement in continuous predicted-observed datasets. Harvard Dataverse, V3. https://doi.org/10.7910/DVN/EJS4M0
Kuhn, M., & Johnson, K., 2013. Applied predictive modeling. New York: Springer. https://link.springer.com/book/10.1007/978-1-4614-6849-3
Pauwels, V.R.N., Guyot, A., & Walker, J.P., 2019. Evaluating model results in scatter plots: A critique. Ecological Modelling, 411. https://users.monash.edu.au/~jpwalker/papers/em19.pdf
Piñeiro, G., Perelman, S., Guerschman, J.P., & Paruelo, J.M., 2008. How to evaluate models: Observed vs. predicted or predicted vs. observed? Ecological Modelling, 216, 316-322. https://doi.org/10.1016/j.ecolmodel.2008.05.006