# How to Estimate Models with Measurement Error for our COVID-19 Indices

Learn how to use errors-in-variables models to take into account coding and data accuracy uncertainty.

This tutorial gives an overview of the COVID-19 policy indices just released by the CoronaNet project (of which I am a part) and the Oxford Government Response Tracker. See the associated paper here. This post shows you how to download the estimated indices and also how to incorporate measurement error into analyses using the `eivtools` and `brms` R packages.

As shown in our working paper, these six indices (social distancing policies, business restrictions, school restrictions, health resources, health monitoring and mask policies) provide statistical estimates that take into account coding and data limitations. Each index has a value between 0 and 100 along with an uncertainty interval covering the most likely values for the true level of policy activity for that index type. We also provide the posterior standard deviation of each estimate, which is analogous to the standard error in frequentist statistics. These standard deviations make it straightforward to incorporate measurement error into analyses with existing packages. It is the aim of this blog post to show how to do that.

If you want to see the full code underlying this post as an Rmarkdown file, just hop over to the site’s Github repo. The Rmarkdown file is in the `content/post` folder and the Google mobility data is in the `content/post/data` folder.

## Load and Merge Data

First, the outcome variable we will look at is Google’s cell phone mobility data for retail establishments and restaurants. This is a key metric for COVID-19 spread because these kinds of areas can be hotspots for indoor transmission of the virus.

```
# packages used throughout this post
library(tidyverse)  # dplyr/tidyr/readr pipelines
library(lubridate)  # ymd()
library(kableExtra) # kable_styling()
library(eivtools)   # eivreg()
library(brms)       # Bayesian measurement models

mobility <- readRDS("data/gmob.rds") %>%
  filter(is.na(sub_region_1), is.na(sub_region_2), is.na(metro_area)) %>%
  select(country, date_policy="date",
         retail="retail_and_recreation_percent_change_from_baseline")
```

I next load the CoronaNet-OxCGRT indices from this direct download link as a CSV:

`indices <- read_csv("https://drive.google.com/uc?export=download&id=1dMCTVPrf-tJyhv_uxr0yAQO-Elx0QOCG")`

I also want to point out that we have the full list of indicators (i.e., individual policies as variables) available via a direct download link as well:

`indicators <- read_csv("https://drive.google.com/uc?id=1lorcowHNnF0Vl6pxBjMdjTC4yPhHBLJI&export=download")`

I’ll display the data to get a sense of what it looks like. First, these are the raw indicators. There are a lot of them–more than 160 variables:

```
indicators %>%
  filter(date_policy > ymd("2020-04-01"),
         date_policy < ymd("2020-04-07"),
         country == "Afghanistan") %>%
  select(country, date_policy, social_distance, mask_everywhere, primary_school) %>%
  knitr::kable() %>%
  kable_styling()
```

country | date_policy | social_distance | mask_everywhere | primary_school |
---|---|---|---|---|
Afghanistan | 2020-04-02 | 0 | 0.1624358 | 4.516191 |
Afghanistan | 2020-04-03 | 0 | 0.1624358 | 4.516191 |
Afghanistan | 2020-04-04 | 0 | 0.1624358 | 4.565445 |
Afghanistan | 2020-04-05 | 0 | 0.1624358 | 4.565445 |
Afghanistan | 2020-04-06 | 0 | 0.1624358 | 4.565445 |

The data is organized in a country-day record format, with the different policy indicators in columns. Fractional values indicate the proportion of the population with a given policy in place (i.e., a policy enacted at the provincial or municipal level). The high value for `primary_school` reflects the fact that this variable naturally has 3 levels, from totally open to totally closed, and when provincial and municipal policies are added it can exceed 3. For more information about the underlying indicators, you can see our codebook.

We can now take a glance at the estimated indices that collapse these indicators into single constructs:

```
indices %>%
  filter(date_policy == ymd("2020-04-01"),
         country == "Afghanistan") %>%
  knitr::kable() %>%
  kable_styling()
```

country | modtype | date_policy | med_est | high_est | low_est | sd_est |
---|---|---|---|---|---|---|
Afghanistan | Business Restrictions | 2020-04-01 | 62.53219 | 67.09647 | 57.94249 | 2.8344479 |
Afghanistan | Health Monitoring | 2020-04-01 | 41.40538 | 48.99068 | 34.40337 | 4.4375788 |
Afghanistan | Health Resources | 2020-04-01 | 46.97723 | 52.57886 | 41.45534 | 3.4747765 |
Afghanistan | Mask Policies | 2020-04-01 | 40.92886 | 48.72028 | 32.89501 | 4.8434060 |
Afghanistan | School Restrictions | 2020-04-01 | 77.33891 | 78.97166 | 75.70822 | 0.9911008 |
Afghanistan | Social Distancing | 2020-04-01 | 61.88964 | 68.51834 | 55.19300 | 3.9687823 |

This dataset is organized somewhat differently: there is one row per country-day per index type. The `modtype` column records which index is in each row, and the remaining columns give estimates of that index for that day. `med_est` is the posterior median, i.e., the most likely estimate. The lower and upper bounds of the 5% to 95% posterior uncertainty interval are in the `low_est` and `high_est` columns. Finally, the posterior standard deviation (the analogue of a standard error) is in the `sd_est` column.
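Since we will later treat this posterior as approximately Normal, a quick sanity check (shown in Python, with the Social Distancing values copied from the indices table above) is that the reported interval should sit near `med_est ± 1.645 * sd_est`, the 5% to 95% range of a Normal distribution:

```python
# Afghanistan's Social Distancing index on 2020-04-01 (copied from the table above)
med_est, sd_est = 61.88964, 3.9687823
low_est, high_est = 55.19300, 68.51834

# 5%-95% interval under an exactly Normal posterior: med_est +/- 1.645 * sd_est
z = 1.645
approx_low = med_est - z * sd_est
approx_high = med_est + z * sd_est

# Both bounds land within ~0.2 of the reported interval
print(round(approx_low, 2), round(approx_high, 2))  # 55.36 68.42 (reported: 55.19 68.52)
```

If the posterior were strongly skewed, the two bounds would diverge from this approximation by very different amounts; here they do not, which supports the Normal assumption used below.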

If you just want “the” value of the index, then `med_est` is the best bet. However, it’s best not to ignore the other columns. A simple way to take measurement error into account is to fit three separate models: one with `med_est`, another with `low_est`, and another with `high_est`. If there are important differences across these models, then you know that measurement error in the indices could be affecting your results.

However, there are better ways to incorporate this uncertainty into models, which I will show next. We will use the retail and recreation outcome from the Google mobility data as the response we want to understand, and each of our indices as predictors. We will use the `sd_est` column to incorporate Normally-distributed measurement error into our models; i.e., we will assume that measurement error is centered around `med_est` with a standard deviation of `sd_est`. This is not an exact representation of the uncertainty, as the posterior isn’t a perfect bell curve, but it is a good approximation because the posterior is usually at least unimodal if not symmetric.
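To see why modeling this error is worth the trouble, recall that classical measurement error in a predictor attenuates regression slopes toward zero. A minimal simulation sketch (in Python, with made-up numbers unrelated to the indices) shows both the bias and the correction that errors-in-variables estimators apply:

```python
import random

random.seed(42)
n = 20000
true_beta = -1.0   # hypothetical "true" effect
err_sd = 5.0       # known standard deviation of the measurement error

# True predictor and outcome
x = [random.gauss(50, 10) for _ in range(n)]
y = [true_beta * xi + random.gauss(0, 5) for xi in x]

# We only observe a noisy version of the predictor
x_obs = [xi + random.gauss(0, err_sd) for xi in x]

mx = sum(x_obs) / n
my = sum(y) / n
cov_xy = sum((a - mx) * (b - my) for a, b in zip(x_obs, y)) / n
var_x = sum((a - mx) ** 2 for a in x_obs) / n

# Naive OLS slope is attenuated by var(x) / (var(x) + err_sd^2) = 100/125 = 0.8
naive = cov_xy / var_x
# Errors-in-variables correction: remove the known error variance first
corrected = cov_xy / (var_x - err_sd ** 2)

print(round(naive, 2), round(corrected, 2))  # roughly -0.8 and -1.0
```

The naive slope is pulled about 20% toward zero, exactly the ratio of the true predictor variance to the observed (noisy) variance; knowing the error variance lets us undo the attenuation.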

First I use the `eivtools` package to fit what is called an errors-in-variables model. These models have been around for some time in economics and other fields, and they are relatively easy to estimate. To do so, I first split the index data into two datasets, one with the posterior median estimates and one with the standard deviations, each with one column per index. I then join these together to make one dataset with both quantities per index:

```
index_med <- select(indices, country, date_policy, med_est, modtype) %>%
  spread(key="modtype", value="med_est")
index_sds <- select(indices, country, date_policy, sd_est, modtype) %>%
  spread(key="modtype", value="sd_est")

# adjust names so that we can distinguish the two sets of columns after joining
names(index_med)[3:8] <- paste0(names(index_med)[3:8], "_med") %>%
  str_replace(" ", "_")
names(index_sds)[3:8] <- paste0(names(index_sds)[3:8], "_sd") %>%
  str_replace(" ", "_")

# merge back together
index_combined <- left_join(index_med, index_sds, by=c("country","date_policy"))
```

Finally we will merge our predictors into the Google mobility data:

```
# we only have index values through Jan 16th, 2021
mobility <- left_join(mobility, index_combined, by=c("country","date_policy")) %>%
  filter(!is.na(Social_Distancing_sd))
```

To make the modeling a bit easier, we will only use a subset of the countries in the full data.

```
mobility <- filter(mobility, country %in% c("United States", "Brazil", "China",
                                            "United Kingdom", "France",
                                            "South Africa", "Nigeria",
                                            "Indonesia", "Australia",
                                            "Thailand"))
```

## Conventional Errors-in-Variables

Now we can use the `eivreg` function to estimate a linear regression model for our outcome, `retail`. We need to simplify our uncertainty measures because conventional errors-in-variables models can only handle one standard deviation/standard error per variable. As such, we will pass in the average posterior variance for each index as our estimate of measurement error, `Sigma`:

```
# first create Sigma, the covariance matrix of the measurement errors.
# it only has diagonal values as we don't have covariance estimates across indices.
# we square the standard deviations first as Sigma is a covariance matrix
# (variance = SD^2)
Sigma <- diag(c(mean(mobility$Business_Restrictions_sd^2, na.rm=T),
                mean(mobility$Health_Monitoring_sd^2, na.rm=T),
                mean(mobility$Health_Resources_sd^2, na.rm=T),
                mean(mobility$Mask_Policies_sd^2, na.rm=T),
                mean(mobility$School_Restrictions_sd^2, na.rm=T),
                mean(mobility$Social_Distancing_sd^2, na.rm=T)))

# add dimnames to identify each variable
index_names <- c("Business_Restrictions_med",
                 "Health_Monitoring_med",
                 "Health_Resources_med",
                 "Mask_Policies_med",
                 "School_Restrictions_med",
                 "Social_Distancing_med")
dimnames(Sigma) <- list(index_names, index_names)

retail_eiv <- eivreg(retail ~ Business_Restrictions_med +
                       Health_Monitoring_med +
                       Health_Resources_med +
                       Mask_Policies_med +
                       School_Restrictions_med +
                       Social_Distancing_med,
                     data=mobility, Sigma_error=Sigma)
```

We can then see what our estimated associations are given that we incorporated average standard errors for each index:

`summary(retail_eiv)`

```
##
## Call:
## eivreg(formula = retail ~ Business_Restrictions_med + Health_Monitoring_med +
## Health_Resources_med + Mask_Policies_med + School_Restrictions_med +
## Social_Distancing_med, data = mobility, Sigma_error = Sigma)
##
## Error Covariance Matrix
## Business_Restrictions_med Health_Monitoring_med
## Business_Restrictions_med 3.658 0.000
## Health_Monitoring_med 0.000 9.721
## Health_Resources_med 0.000 0.000
## Mask_Policies_med 0.000 0.000
## School_Restrictions_med 0.000 0.000
## Social_Distancing_med 0.000 0.000
## Health_Resources_med Mask_Policies_med
## Business_Restrictions_med 0.000 0.00
## Health_Monitoring_med 0.000 0.00
## Health_Resources_med 6.375 0.00
## Mask_Policies_med 0.000 15.79
## School_Restrictions_med 0.000 0.00
## Social_Distancing_med 0.000 0.00
## School_Restrictions_med Social_Distancing_med
## Business_Restrictions_med 0.000 0.000
## Health_Monitoring_med 0.000 0.000
## Health_Resources_med 0.000 0.000
## Mask_Policies_med 0.000 0.000
## School_Restrictions_med 1.755 0.000
## Social_Distancing_med 0.000 8.896
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.157 -11.276 2.005 12.105 57.204
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.45477 1.74730 7.128 1.30e-12 ***
## Business_Restrictions_med 0.27001 0.04336 6.228 5.48e-10 ***
## Health_Monitoring_med 0.02532 0.02871 0.882 0.3779
## Health_Resources_med 0.32352 0.06208 5.212 2.01e-07 ***
## Mask_Policies_med -0.04313 0.02231 -1.933 0.0533 .
## School_Restrictions_med 0.17267 0.01967 8.779 < 2e-16 ***
## Social_Distancing_med -1.42736 0.04879 -29.254 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Number of observations used: 2680
## Latent residual standard deviation: 16.14
## Latent R-squared: 0.4038, (df-adjusted: 0.4022)
##
## EIV-Adjusted vs Unadjusted Coefficients:
## Adjusted Unadjusted
## (Intercept) 12.45477 10.67932
## Business_Restrictions_med 0.27001 0.26442
## Health_Monitoring_med 0.02532 0.05415
## Health_Resources_med 0.32352 0.21373
## Mask_Policies_med -0.04313 -0.05209
## School_Restrictions_med 0.17267 0.14216
## Social_Distancing_med -1.42736 -1.25473
```

We see that the social distancing index has the strongest association with reduced retail mobility, followed by the mask policy index. The other indices are associated with increased retail mobility. It is important to note that even with the inclusion of measurement error, the effects are still very precisely estimated due to the amount of data we have.
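The adjusted-vs-unadjusted table at the bottom of the output shows what the correction is doing. In the single-predictor case the fix simply inflates a coefficient away from zero, but with several error-prone predictors the adjustment can move a coefficient in either direction, as a quick calculation (in Python, with two coefficients copied from the `eivreg` summary above) shows:

```python
# Adjusted vs. unadjusted coefficients, copied from the eivreg summary above
adjusted = {"Social_Distancing_med": -1.42736, "Health_Monitoring_med": 0.02532}
unadjusted = {"Social_Distancing_med": -1.25473, "Health_Monitoring_med": 0.05415}

# ratio > 1 means the adjustment pushed the coefficient away from zero
ratios = {k: round(adjusted[k] / unadjusted[k], 2) for k in adjusted}
print(ratios)
# social distancing grows ~14% away from zero (classic de-attenuation),
# while health monitoring shrinks by more than half
```

This is expected: once correlated predictors are measured with error, correcting one coefficient can reallocate explanatory weight among the others.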

## Bayesian Measurement Models

Next we will do the same exercise, except we will use Bayesian estimation with the R package `brms`. The big advantage here is that we do not need to simplify to average posterior standard deviations for our estimate of uncertainty; rather, we can include a separate standard deviation/standard error for each observation of each variable. It’s also frankly easier to use this package, as we don’t need to take the extra step of making a `Sigma` matrix.

Instead, we can use the handy `me()` function as part of the `brms` formula syntax to identify each variable with measurement error:

```
# read from disk as this can take a while to run
if(run_model) {
  # turn off correlations among latent variables to match what we gave eivreg
  brms_formula <- bf(retail ~
                       me(Business_Restrictions_med, Business_Restrictions_sd) +
                       me(Health_Monitoring_med, Health_Monitoring_sd) +
                       me(Health_Resources_med, Health_Resources_sd) +
                       me(Mask_Policies_med, Mask_Policies_sd) +
                       me(School_Restrictions_med, School_Restrictions_sd) +
                       me(Social_Distancing_med, Social_Distancing_sd),
                     center=TRUE) +
    set_mecor(FALSE)
  brms_est <- brm(brms_formula,
                  data=mobility,
                  silent=0,
                  chains=1, save_pars=save_pars(),
                  iter=500,
                  warmup=250,
                  backend='rstan')
  saveRDS(brms_est, "data/brms_est.rds")
} else {
  brms_est <- readRDS("data/brms_est.rds")
}
```

We can then look at the estimated associations:

`summary(brms_est)`

```
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: retail ~ me(Business_Restrictions_med, Business_Restrictions_sd) + me(Health_Monitoring_med, Health_Monitoring_sd) + me(Health_Resources_med, Health_Resources_sd) + me(Mask_Policies_med, Mask_Policies_sd) + me(School_Restrictions_med, School_Restrictions_sd) + me(Social_Distancing_med, Social_Distancing_sd)
## Data: mobility (Number of observations: 2680)
## Samples: 1 chains, each with iter = 500; warmup = 250; thin = 1;
## total post-warmup samples = 250
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI
## Intercept 13.89 1.95 10.20
## meBusiness_Restrictions_medBusiness_Restrictions_sd 0.28 0.04 0.21
## meHealth_Monitoring_medHealth_Monitoring_sd 0.06 0.02 0.02
## meHealth_Resources_medHealth_Resources_sd 0.20 0.05 0.11
## meMask_Policies_medMask_Policies_sd -0.06 0.02 -0.09
## meSchool_Restrictions_medSchool_Restrictions_sd 0.14 0.02 0.10
## meSocial_Distancing_medSocial_Distancing_sd -1.31 0.04 -1.38
## u-95% CI Rhat Bulk_ESS
## Intercept 17.55 1.00 218
## meBusiness_Restrictions_medBusiness_Restrictions_sd 0.35 1.00 136
## meHealth_Monitoring_medHealth_Monitoring_sd 0.11 1.00 119
## meHealth_Resources_medHealth_Resources_sd 0.28 1.00 119
## meMask_Policies_medMask_Policies_sd -0.03 1.01 249
## meSchool_Restrictions_medSchool_Restrictions_sd 0.18 1.06 288
## meSocial_Distancing_medSocial_Distancing_sd -1.23 1.00 213
## Tail_ESS
## Intercept 187
## meBusiness_Restrictions_medBusiness_Restrictions_sd 128
## meHealth_Monitoring_medHealth_Monitoring_sd 193
## meHealth_Resources_medHealth_Resources_sd 196
## meMask_Policies_medMask_Policies_sd 206
## meSchool_Restrictions_medSchool_Restrictions_sd 147
## meSocial_Distancing_medSocial_Distancing_sd 162
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 16.22 0.24 15.79 16.68 1.00 573 197
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

The associations are quite similar in magnitude and sign to those we estimated with `eivreg`. With `eivreg` we estimated that a one-unit increase in the social distancing index is associated with a 1.43-point decrease in retail mobility, while with `brms` we estimated a 1.31-point decrease.

As we can see, for this simple model the two are largely equivalent, and `eivreg` is much faster: it took almost no time to estimate the model, while `brms` took 20 minutes. However, `brms` incorporates much more of the measurement uncertainty, and it also allows for a much wider array of modeling options, such as GLMs (probit/logit/ordered logit), multilevel/hierarchical models, and multiple imputation. `eivreg` is limited to continuous (Gaussian) outcomes and does not allow for interactions with variables that have measurement uncertainty. In any case, both of these packages will incorporate measurement uncertainty in their estimates.

## Comparison with No Measurement Error

Finally, we can examine what measurement error does to our coefficients’ uncertainty by comparing the results to a standard OLS `lm` regression that ignores it:

```
lm_est <- lm(retail ~ Business_Restrictions_med +
               Health_Monitoring_med +
               Health_Resources_med +
               Mask_Policies_med +
               School_Restrictions_med +
               Social_Distancing_med,
             data=mobility)
summary(lm_est)
```

```
##
## Call:
## lm(formula = retail ~ Business_Restrictions_med + Health_Monitoring_med +
## Health_Resources_med + Mask_Policies_med + School_Restrictions_med +
## Social_Distancing_med, data = mobility)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.785 -10.103 2.617 11.961 55.443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.67932 1.97234 5.415 6.69e-08 ***
## Business_Restrictions_med 0.26442 0.03575 7.397 1.85e-13 ***
## Health_Monitoring_med 0.05415 0.02351 2.304 0.02132 *
## Health_Resources_med 0.21373 0.04215 5.071 4.23e-07 ***
## Mask_Policies_med -0.05209 0.01797 -2.899 0.00377 **
## School_Restrictions_med 0.14216 0.01881 7.557 5.64e-14 ***
## Social_Distancing_med -1.25473 0.03588 -34.971 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.66 on 2673 degrees of freedom
## Multiple R-squared: 0.3654, Adjusted R-squared: 0.364
## F-statistic: 256.6 on 6 and 2673 DF, p-value: < 2.2e-16
```

We can see that the coefficients are similar, though not identical, to those from `brms` and `eivreg`, while the standard errors are noticeably smaller. `eivreg` estimated 19.44% more uncertainty than the naive `lm` estimator, and `brms` 8.33% more. As such, this kind of uncertainty is important to take into account, and it becomes even more critical when less data is available or when we are examining non-linear effects such as interactions.
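As one concrete example, we can compute the inflation for a single coefficient from the standard errors printed in the two summaries; for the social distancing index the inflation is larger than the overall figures quoted above:

```python
# Standard errors for Social_Distancing_med, copied from the two summaries above
se_eiv = 0.04879  # from eivreg
se_lm = 0.03588   # from lm, which ignores measurement error

inflation_pct = round((se_eiv / se_lm - 1) * 100, 1)
print(inflation_pct)  # percent increase in this coefficient's standard error: 36.0
```

Ignoring measurement error would thus overstate our precision about the social distancing association by roughly a third.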