# How to Estimate Models with Measurement Error for our COVID-19 Indices

Learn how to use errors-in-variables models to take into account coding and data accuracy uncertainty.

This tutorial gives an overview of the COVID-19 policy indices just released by the CoronaNet project (of which I am a part) and the Oxford Government Response Tracker. See the associated paper here. This post shows you how to download the estimated indices and also how to incorporate measurement error into analyses using the `eivtools` and `brms` R packages.

As shown in our working paper, these six indices (social distancing policies, business restrictions, school restrictions, health resources, health monitoring and mask policies) provide statistical estimates that take into account coding and data limitations. Each index has a value between 0 and 100 along with an uncertainty interval covering the most likely values for the true level of policy activity for that index type. We also provide the posterior standard deviation of each estimate, which is analogous to the standard error in frequentist statistics. These standard deviations make it straightforward to incorporate measurement error into analyses with existing packages. It is the aim of this blog post to show how to do that.

If you want to see the full code underlying this post as an Rmarkdown file, just hop over to the site’s Github repo. The Rmarkdown file is in the `content/post` folder and the Google mobility data is in the `content/post/data` folder.

## Load and Merge Data

First, the outcome variable we will look at is Google’s cell phone mobility data for retail establishments and restaurants. This is a key metric for COVID-19 spread because these kinds of areas can be hotspots for indoor transmission of the virus.

```
# packages used throughout this post
library(tidyverse)  # dplyr/tidyr/readr pipelines
library(lubridate)  # ymd()
library(kableExtra) # kable_styling()
library(eivtools)   # eivreg()
library(brms)       # Bayesian measurement models

mobility <- readRDS("data/gmob.rds") %>%
  filter(is.na(sub_region_1), is.na(sub_region_2), is.na(metro_area)) %>%
  select(country, date_policy="date",
         retail="retail_and_recreation_percent_change_from_baseline")
```

I next load the CoronaNet-OxCGRT indices from this direct download link as a CSV:

`indices <- read_csv("https://drive.google.com/uc?export=download&id=1dMCTVPrf-tJyhv_uxr0yAQO-Elx0QOCG")`

I also want to point out that we have the full list of indicators (i.e., individual policies as variables) available via a direct download link as well:

`indicators <- read_csv("https://drive.google.com/uc?id=1lorcowHNnF0Vl6pxBjMdjTC4yPhHBLJI&export=download")`

I’ll display the data to get a sense of what it looks like. First, these are the raw indicators. There are a lot of them–more than 160 variables:

```
indicators %>%
  filter(date_policy > ymd("2020-04-01"),
         date_policy < ymd("2020-04-07"),
         country == "Afghanistan") %>%
  select(country, date_policy, social_distance, mask_everywhere, primary_school) %>%
  knitr::kable() %>%
  kable_styling()
```

country | date_policy | social_distance | mask_everywhere | primary_school |
---|---|---|---|---|
Afghanistan | 2020-04-02 | 0 | 0.1624358 | 4.516191 |
Afghanistan | 2020-04-03 | 0 | 0.1624358 | 4.516191 |
Afghanistan | 2020-04-04 | 0 | 0.1624358 | 4.565445 |
Afghanistan | 2020-04-05 | 0 | 0.1624358 | 4.565445 |
Afghanistan | 2020-04-06 | 0 | 0.1624358 | 4.565445 |

The data is organized in a country-day record format, with the different policy indicators in columns. Fractional values indicate the proportion of the population with a given policy in place (i.e., a policy enacted at the provincial or municipal level). The high value for `primary_school` reflects the fact that this variable naturally has 3 levels, from totally open to totally closed, and when provincial and municipal policies are added it can exceed 3. For more information about the underlying indicators, you can see our codebook.

We can now take a glance at the estimated indices that collapse these indicators into single constructs:

```
indices %>%
  filter(date_policy == ymd("2020-04-01"),
         country == "Afghanistan") %>%
  knitr::kable() %>%
  kable_styling()
```

country | modtype | date_policy | med_est | high_est | low_est | sd_est |
---|---|---|---|---|---|---|
Afghanistan | Business Restrictions | 2020-04-01 | 62.53219 | 67.09647 | 57.94249 | 2.8344479 |
Afghanistan | Health Monitoring | 2020-04-01 | 41.40538 | 48.99068 | 34.40337 | 4.4375788 |
Afghanistan | Health Resources | 2020-04-01 | 46.97723 | 52.57886 | 41.45534 | 3.4747765 |
Afghanistan | Mask Policies | 2020-04-01 | 40.92886 | 48.72028 | 32.89501 | 4.8434060 |
Afghanistan | School Restrictions | 2020-04-01 | 77.33891 | 78.97166 | 75.70822 | 0.9911008 |
Afghanistan | Social Distancing | 2020-04-01 | 61.88964 | 68.51834 | 55.19300 | 3.9687823 |

This dataset is organized somewhat differently: there is one row per country-day per index type. The `modtype` column records which index is in each row, and the remaining columns give estimates of that index for that day. `med_est` is the posterior median, i.e., the most likely estimate. The lower and upper bounds of the 5% to 95% posterior uncertainty interval are in the `low_est` and `high_est` columns. Finally, the posterior standard deviation (the analogue of a standard error) is in the `sd_est` column.
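Since we will later treat this posterior as approximately Normal, a quick sanity check (shown in Python, with the Social Distancing values copied from the indices table above) is that the reported interval should sit near `med_est ± 1.645 * sd_est`, the 5% to 95% range of a Normal distribution:

```python
# Afghanistan's Social Distancing index on 2020-04-01 (copied from the table above)
med_est, sd_est = 61.88964, 3.9687823
low_est, high_est = 55.19300, 68.51834

# 5%-95% interval under an exactly Normal posterior: med_est +/- 1.645 * sd_est
z = 1.645
approx_low = med_est - z * sd_est
approx_high = med_est + z * sd_est

# Both bounds land within ~0.2 of the reported interval
print(round(approx_low, 2), round(approx_high, 2))  # 55.36 68.42 (reported: 55.19 68.52)
```

If the posterior were strongly skewed, the two bounds would diverge from this approximation by very different amounts; here they do not, which supports the Normal assumption used below.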

If you just want “the” value of the index, then `med_est` is the best bet. However, it’s best not to ignore the other columns. A simple way to take measurement error into account is to fit three separate models: one with `med_est`, another with `low_est`, and another with `high_est`. If there are important differences across these models, then you know that measurement error in the indices could be affecting your results.

However, there are better ways to incorporate this uncertainty into models, which I will show next. We will use the retail and recreation outcome from the Google mobility data as the response we want to understand, and each of our indices as predictors. We will use the `sd_est` column to incorporate Normally-distributed measurement error into our models; i.e., we will assume that measurement error is centered around `med_est` with a standard deviation of `sd_est`. This is not an exact representation of the uncertainty, as the posterior isn’t a perfect bell curve, but it is a good approximation because the posterior is usually at least unimodal if not symmetric.
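To see why modeling this error is worth the trouble, recall that classical measurement error in a predictor attenuates regression slopes toward zero. A minimal simulation sketch (in Python, with made-up numbers unrelated to the indices) shows both the bias and the correction that errors-in-variables estimators apply:

```python
import random

random.seed(42)
n = 20000
true_beta = -1.0   # hypothetical "true" effect
err_sd = 5.0       # known standard deviation of the measurement error

# True predictor and outcome
x = [random.gauss(50, 10) for _ in range(n)]
y = [true_beta * xi + random.gauss(0, 5) for xi in x]

# We only observe a noisy version of the predictor
x_obs = [xi + random.gauss(0, err_sd) for xi in x]

mx = sum(x_obs) / n
my = sum(y) / n
cov_xy = sum((a - mx) * (b - my) for a, b in zip(x_obs, y)) / n
var_x = sum((a - mx) ** 2 for a in x_obs) / n

# Naive OLS slope is attenuated by var(x) / (var(x) + err_sd^2) = 100/125 = 0.8
naive = cov_xy / var_x
# Errors-in-variables correction: remove the known error variance first
corrected = cov_xy / (var_x - err_sd ** 2)

print(round(naive, 2), round(corrected, 2))  # roughly -0.8 and -1.0
```

The naive slope is pulled about 20% toward zero, exactly the ratio of the true predictor variance to the observed (noisy) variance; knowing the error variance lets us undo the attenuation.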

First I use the `eivtools` package to fit what is called an errors-in-variables model. These models have been around for some time in economics and other fields, and they are relatively easy to estimate. To do so, I first split the index data into two datasets, one with the posterior median estimates and one with the standard deviations, each with one column per index. I then join these together to make one dataset with both quantities per index:

```
index_med <- select(indices, country, date_policy, med_est, modtype) %>%
  spread(key="modtype", value="med_est")
index_sds <- select(indices, country, date_policy, sd_est, modtype) %>%
  spread(key="modtype", value="sd_est")

# adjust names so that we can distinguish the two sets of columns after joining
names(index_med)[3:8] <- paste0(names(index_med)[3:8], "_med") %>%
  str_replace(" ", "_")
names(index_sds)[3:8] <- paste0(names(index_sds)[3:8], "_sd") %>%
  str_replace(" ", "_")

# merge back together
index_combined <- left_join(index_med, index_sds, by=c("country","date_policy"))
```

Finally we will merge our predictors into the Google mobility data:

```
# we only have index values through Jan 16th, 2021
mobility <- left_join(mobility, index_combined, by=c("country","date_policy")) %>%
  filter(!is.na(Social_Distancing_sd))
```

To make the modeling a bit easier, we will only use a subset of the countries in the full data.

```
mobility <- filter(mobility, country %in% c("United States", "Brazil", "China",
                                            "United Kingdom", "France",
                                            "South Africa", "Nigeria",
                                            "Indonesia", "Australia",
                                            "Thailand"))
```

## Conventional Errors-in-Variables

Now we can use the `eivreg` function to estimate a linear regression model for our outcome, `retail`. We need to simplify our uncertainty measures because conventional errors-in-variables models can only handle one standard deviation/standard error per variable. As such, we will pass in the average posterior variance for each index as our estimate of measurement error, `Sigma`:

```
# first create Sigma, the covariance matrix of the measurement errors.
# it only has diagonal values as we don't have covariance estimates across indices.
# we square the standard deviations first as Sigma is a covariance matrix
# (variance = SD^2)
Sigma <- diag(c(mean(mobility$Business_Restrictions_sd^2, na.rm=T),
                mean(mobility$Health_Monitoring_sd^2, na.rm=T),
                mean(mobility$Health_Resources_sd^2, na.rm=T),
                mean(mobility$Mask_Policies_sd^2, na.rm=T),
                mean(mobility$School_Restrictions_sd^2, na.rm=T),
                mean(mobility$Social_Distancing_sd^2, na.rm=T)))

# add dimnames to identify each variable
index_names <- c("Business_Restrictions_med",
                 "Health_Monitoring_med",
                 "Health_Resources_med",
                 "Mask_Policies_med",
                 "School_Restrictions_med",
                 "Social_Distancing_med")
dimnames(Sigma) <- list(index_names, index_names)

retail_eiv <- eivreg(retail ~ Business_Restrictions_med +
                       Health_Monitoring_med +
                       Health_Resources_med +
                       Mask_Policies_med +
                       School_Restrictions_med +
                       Social_Distancing_med,
                     data=mobility, Sigma_error=Sigma)
```

We can then see what our estimated associations are given that we incorporated average standard errors for each index:

`summary(retail_eiv)`

```
##
## Call:
## eivreg(formula = retail ~ Business_Restrictions_med + Health_Monitoring_med +
## Health_Resources_med + Mask_Policies_med + School_Restrictions_med +
## Social_Distancing_med, data = mobility, Sigma_error = Sigma)
##
## Error Covariance Matrix
## Business_Restrictions_med Health_Monitoring_med
## Business_Restrictions_med 3.658 0.000
## Health_Monitoring_med 0.000 9.721
## Health_Resources_med 0.000 0.000
## Mask_Policies_med 0.000 0.000
## School_Restrictions_med 0.000 0.000
## Social_Distancing_med 0.000 0.000
## Health_Resources_med Mask_Policies_med
## Business_Restrictions_med 0.000 0.00
## Health_Monitoring_med 0.000 0.00
## Health_Resources_med 6.375 0.00
## Mask_Policies_med 0.000 15.79
## School_Restrictions_med 0.000 0.00
## Social_Distancing_med 0.000 0.00
## School_Restrictions_med Social_Distancing_med
## Business_Restrictions_med 0.000 0.000
## Health_Monitoring_med 0.000 0.000
## Health_Resources_med 0.000 0.000
## Mask_Policies_med 0.000 0.000
## School_Restrictions_med 1.755 0.000
## Social_Distancing_med 0.000 8.896
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.157 -11.276 2.005 12.105 57.204
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.45477 1.74730 7.128 1.30e-12 ***
## Business_Restrictions_med 0.27001 0.04336 6.228 5.48e-10 ***
## Health_Monitoring_med 0.02532 0.02871 0.882 0.3779
## Health_Resources_med 0.32352 0.06208 5.212 2.01e-07 ***
## Mask_Policies_med -0.04313 0.02231 -1.933 0.0533 .
## School_Restrictions_med 0.17267 0.01967 8.779 < 2e-16 ***
## Social_Distancing_med -1.42736 0.04879 -29.254 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Number of observations used: 2680
## Latent residual standard deviation: 16.14
## Latent R-squared: 0.4038, (df-adjusted: 0.4022)
##
## EIV-Adjusted vs Unadjusted Coefficients:
## Adjusted Unadjusted
## (Intercept) 12.45477 10.67932
## Business_Restrictions_med 0.27001 0.26442
## Health_Monitoring_med 0.02532 0.05415
## Health_Resources_med 0.32352 0.21373
## Mask_Policies_med -0.04313 -0.05209
## School_Restrictions_med 0.17267 0.14216
## Social_Distancing_med -1.42736 -1.25473
```

We see that the social distancing index has the strongest association with reduced retail mobility, followed by the mask policy index. The other indices are associated with increased retail mobility. It is important to note that even with the inclusion of measurement error, the effects are still very precisely estimated due to the amount of data we have.
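The adjusted-vs-unadjusted table at the bottom of the output shows what the correction is doing. In the single-predictor case the fix simply inflates a coefficient away from zero, but with several error-prone predictors the adjustment can move a coefficient in either direction, as a quick calculation (in Python, with two coefficients copied from the `eivreg` summary above) shows:

```python
# Adjusted vs. unadjusted coefficients, copied from the eivreg summary above
adjusted = {"Social_Distancing_med": -1.42736, "Health_Monitoring_med": 0.02532}
unadjusted = {"Social_Distancing_med": -1.25473, "Health_Monitoring_med": 0.05415}

# ratio > 1 means the adjustment pushed the coefficient away from zero
ratios = {k: round(adjusted[k] / unadjusted[k], 2) for k in adjusted}
print(ratios)
# social distancing grows ~14% away from zero (classic de-attenuation),
# while health monitoring shrinks by more than half
```

This is expected: once correlated predictors are measured with error, correcting one coefficient can reallocate explanatory weight among the others.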

## Bayesian Measurement Models

Next we will do the same exercise, except we will use Bayesian estimation with the R package `brms`. The big advantage here is that we do not need to simplify to average posterior standard deviations for our estimate of uncertainty; rather, we can include a separate standard deviation/standard error for each observation of each variable. It’s also frankly easier to use this package, as we don’t need to take the extra step of making a `Sigma` matrix.

Instead, we can use the handy `me()` function as part of the `brms` formula syntax to identify each variable with measurement error:

```
# read from disk as this can take a while to run
if(run_model) {
  # turn off correlations among latent variables to match what we gave eivreg
  brms_formula <- bf(retail ~
                       me(Business_Restrictions_med, Business_Restrictions_sd) +
                       me(Health_Monitoring_med, Health_Monitoring_sd) +
                       me(Health_Resources_med, Health_Resources_sd) +
                       me(Mask_Policies_med, Mask_Policies_sd) +
                       me(School_Restrictions_med, School_Restrictions_sd) +
                       me(Social_Distancing_med, Social_Distancing_sd),
                     center=TRUE) +
    set_mecor(FALSE)
  brms_est <- brm(brms_formula,
                  data=mobility,
                  silent=0,
                  chains=1, save_pars=save_pars(),
                  iter=500,
                  warmup=250,
                  backend='rstan')
  saveRDS(brms_est, "data/brms_est.rds")
} else {
  brms_est <- readRDS("data/brms_est.rds")
}
```

We can then look at the estimated associations:

`summary(brms_est)`

```
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: retail ~ me(Business_Restrictions_med, Business_Restrictions_sd) + me(Health_Monitoring_med, Health_Monitoring_sd) + me(Health_Resources_med, Health_Resources_sd) + me(Mask_Policies_med, Mask_Policies_sd) + me(School_Restrictions_med, School_Restrictions_sd) + me(Social_Distancing_med, Social_Distancing_sd)
## Data: mobility (Number of observations: 2680)
## Samples: 1 chains, each with iter = 500; warmup = 250; thin = 1;
## total post-warmup samples = 250
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI
## Intercept 13.89 1.95 10.20
## meBusiness_Restrictions_medBusiness_Restrictions_sd 0.28 0.04 0.21
## meHealth_Monitoring_medHealth_Monitoring_sd 0.06 0.02 0.02
## meHealth_Resources_medHealth_Resources_sd 0.20 0.05 0.11
## meMask_Policies_medMask_Policies_sd -0.06 0.02 -0.09
## meSchool_Restrictions_medSchool_Restrictions_sd 0.14 0.02 0.10
## meSocial_Distancing_medSocial_Distancing_sd -1.31 0.04 -1.38
## u-95% CI Rhat Bulk_ESS
## Intercept 17.55 1.00 218
## meBusiness_Restrictions_medBusiness_Restrictions_sd 0.35 1.00 136
## meHealth_Monitoring_medHealth_Monitoring_sd 0.11 1.00 119
## meHealth_Resources_medHealth_Resources_sd 0.28 1.00 119
## meMask_Policies_medMask_Policies_sd -0.03 1.01 249
## meSchool_Restrictions_medSchool_Restrictions_sd 0.18 1.06 288
## meSocial_Distancing_medSocial_Distancing_sd -1.23 1.00 213
## Tail_ESS
## Intercept 187
## meBusiness_Restrictions_medBusiness_Restrictions_sd 128
## meHealth_Monitoring_medHealth_Monitoring_sd 193
## meHealth_Resources_medHealth_Resources_sd 196
## meMask_Policies_medMask_Policies_sd 206
## meSchool_Restrictions_medSchool_Restrictions_sd 147
## meSocial_Distancing_medSocial_Distancing_sd 162
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 16.22 0.24 15.79 16.68 1.00 573 197
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

The associations are quite similar in magnitude and sign to those we estimated with `eivreg`. With `eivreg` we estimated that a one-unit increase in the social distancing index is associated with a 1.43-point decrease in retail mobility, while with `brms` we estimated a 1.31-point decrease.

As we can see, for this simple model the two are largely equivalent, and `eivreg` is much faster: it took almost no time to estimate the model, while `brms` took 20 minutes. However, `brms` incorporates much more of the measurement uncertainty, and it also allows for a much wider array of modeling options, such as GLMs (probit/logit/ordered logit), multilevel/hierarchical models, and multiple imputation. `eivreg` is limited to continuous (Gaussian) outcomes and does not allow for interactions with variables that have measurement uncertainty. In any case, both of these packages will incorporate measurement uncertainty in their estimates.

## Comparison with No Measurement Error

Finally, we can examine what measurement error does to our coefficients’ uncertainty by comparing the results to a standard OLS `lm` regression that ignores it:

```
lm_est <- lm(retail ~ Business_Restrictions_med +
               Health_Monitoring_med +
               Health_Resources_med +
               Mask_Policies_med +
               School_Restrictions_med +
               Social_Distancing_med,
             data=mobility)
summary(lm_est)
```

```
##
## Call:
## lm(formula = retail ~ Business_Restrictions_med + Health_Monitoring_med +
## Health_Resources_med + Mask_Policies_med + School_Restrictions_med +
## Social_Distancing_med, data = mobility)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.785 -10.103 2.617 11.961 55.443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.67932 1.97234 5.415 6.69e-08 ***
## Business_Restrictions_med 0.26442 0.03575 7.397 1.85e-13 ***
## Health_Monitoring_med 0.05415 0.02351 2.304 0.02132 *
## Health_Resources_med 0.21373 0.04215 5.071 4.23e-07 ***
## Mask_Policies_med -0.05209 0.01797 -2.899 0.00377 **
## School_Restrictions_med 0.14216 0.01881 7.557 5.64e-14 ***
## Social_Distancing_med -1.25473 0.03588 -34.971 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.66 on 2673 degrees of freedom
## Multiple R-squared: 0.3654, Adjusted R-squared: 0.364
## F-statistic: 256.6 on 6 and 2673 DF, p-value: < 2.2e-16
```

We can see that the coefficients are similar, though not identical, to those from `brms` and `eivreg`, while the standard errors are noticeably smaller. `eivreg` estimated 19.44% more uncertainty than the naive `lm` estimator, and `brms` 8.33% more. As such, this kind of uncertainty is important to take into account, and it becomes even more critical when less data is available or when we are examining non-linear effects such as interactions.
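As one concrete example, we can compute the inflation for a single coefficient from the standard errors printed in the two summaries; for the social distancing index the inflation is larger than the overall figures quoted above:

```python
# Standard errors for Social_Distancing_med, copied from the two summaries above
se_eiv = 0.04879  # from eivreg
se_lm = 0.03588   # from lm, which ignores measurement error

inflation_pct = round((se_eiv / se_lm - 1) * 100, 1)
print(inflation_pct)  # percent increase in this coefficient's standard error: 36.0
```

Ignoring measurement error would thus overstate our precision about the social distancing association by roughly a third.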