Introduction

In this exercise, our goal is to fit a spatially-varying coefficient model to assess if there is any spatial variation in a ten-year (2010-2019) occurrence trend of the Eastern Wood-pewee across the eastern US. We will fit a model that explicitly accommodates observational biases that occur when collecting data on many wildlife species (i.e., imperfect detection, or failing to observe a species at a location when it is truly present). The data we use come from the North American Breeding Bird Survey, in which trained volunteers count the number of birds, identified to species, at approximately 3000 locations (called “routes”) across North America each year since the 1960s. Here we use presence/absence data from \(n = 500\) sites across the eastern US that are within the known distribution of the Eastern Wood-pewee, which are surveyed varying amounts over a ten year period from 2010-2019. At each site during each year it is sampled, we use data from \(K = 5\) replicate surveys. We assume the true presence/absence status of the bird at a given location does not change over the 5 replicate surveys, and that any variation in the data across the replicate surveys is related to observational error. In the wildlife ecology literature, this type of zero-inflated mixture model is referred to as an “occupancy model”.

We will fit the model using the function svcTPGOcc() in the spOccupancy package, where the svc stands for spatially-varying coefficients, the T indicates we’re fitting a model that has a temporal dimension to it, PG indicates we’re using Pólya-Gamma data augmentation (Polson, Scott, and Windle 2013), and Occ stands for occupancy model. Below we first load the packages we will use in this example.

set.seed(111)
library(spOccupancy)
library(ggplot2)
library(sf)
Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
library(stars)
Loading required package: abind

Exploratory data analysis

# Load the data and take a look at its structure
load("eastern-wood-pewee-data.rda")
str(data.EAWP)
List of 4
 $ y       : num [1:500, 1:10, 1:5] 0 1 1 0 0 NA 1 NA 0 1 ...
 $ occ.covs:List of 1
  ..$ years: int [1:500, 1:10] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
 $ det.covs:List of 2
  ..$ day  : num [1:500, 1:10] 158 151 181 151 159 NA 164 NA 167 157 ...
  ..$ years: int [1:500, 1:10] 2010 2010 2010 2010 2010 NA 2010 NA 2010 2010 ...
 $ coords  : num [1:500, 1:2] 5.04 11.54 467.96 1792.23 1153.17 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:500] "1745" "136" "945" "1538" ...
  .. ..$ : chr [1:2] "X" "Y"

The data list is comprised of four components:

Next, we generate a couple of maps that show the location of the data points and the proportion of times the species was detected at each location over the time period.

my.crs <- "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=km +no_defs"
coords.sf <- st_as_sf(as.data.frame(data.EAWP$coords),
                      coords = c("X", "Y"),
                      crs = my.crs)
usa <- st_as_sf(maps::map("state", fill = TRUE, plot = FALSE))
usa.no.states <- st_as_sf(maps::map("usa", fill = TRUE, plot = FALSE))
# Restrict to east of the 100th meridian
usa.bbox <- st_bbox(usa)
usa.bbox[1] <- -100
usa.bbox <- as.vector(usa.bbox)
sf_use_s2(FALSE)
east.us <- st_crop(st_make_valid(usa), xmin = usa.bbox[1], ymin = usa.bbox[2],
                   xmax = usa.bbox[3], ymax = usa.bbox[4])
Warning: attribute variables are assumed to be spatially constant throughout all
geometries
east.us <- east.us %>%
  st_transform(st_crs(coords.sf))
coords.sf$raw.prob <- apply(data.EAWP$y, 1, mean, na.rm = TRUE)
ggplot(coords.sf) +
  geom_sf(aes(col = raw.prob), size = 2) +
  scale_color_viridis_c('', option = 'plasma', limits = c(0, 1)) +
  geom_sf(data = east.us, fill = NA, color=alpha("black", 0.75)) +
  theme_bw(base_size = 14)

Below, we calculate the average proportion of times the species was observed at an average site each year

y.prop.by.year <- apply(data.EAWP$y, 2, mean, na.rm = TRUE)
plot(2010:2019, y.prop.by.year, pch = 19, xlab = 'Year', ylab = 'Average raw occurrence probability')

Model fitting

We next fit the model with the svcTPGOcc function in spOccupancy. The arguments are very similar to what we saw with the spatial factor model in the previous exercise. We first specify the priors below

dist.bbs <- dist(data.EAWP$coords)
low.dist <- quantile(dist.bbs, 0.10)
max.dist <- max(dist.bbs)
priors <- list(sigma.sq.ig = list(2, 1), # IG prior for spatial variances
               # Informative prior for spatial decay parameters
               phi.unif = list(3 / max.dist, 3 / low.dist),
               # Normal prior for occurrence coefficients
               beta.normal = list(mean = 0, var = 2.72), 
               # Normal prior for observational (detection) coefficients
               alpha.normal = list(mean = 0, var = 2.72))

Here we will again use an exponential spatial covariance model, so notice that our prior for the spatial decay parameters \(\phi\) restricts the effective spatial range to fall between the maximum intersite distance and the 10% quantile of the intersite distances. We use our usual inverse-Gamma prior for the spatial variances and Gaussian priors for the regression coefficients.

Next, we set the initial values and initial tuning parameters for \(\phi\) (we again use an Adaptive Metropolis update for these parameters as in the spatial factor model).

tuning.list <- list(phi = c(0.6, 0.5))
inits <- list(phi = c(3 / 500, 3 / 400), sigma.sq = c(5, 0.5), 
              beta = c(1.5, 0), alpha = 0)

Finally, we are about set to run the model. The function takes two right-sided formulas: one for the process (occurrence) model, and one for the observation (detection) model. Here we solely include a spatially-varying trend in the occurrence portion of the model, while we include a linear and quadratic effect of survey day in the detection model and a separate intercept for each of the ten years. Notice the svc.cols argument is used to specify the columns of the design design matrix that we wish to allow to vary over space.

# MCMC criteria
n.batch <- 300
batch.length <- 25
n.burn <- 5000
n.thin <- 5
n.chains <- 1

out <- svcTPGOcc(occ.formula = ~ scale(years), 
                 det.formula = ~ scale(day) + I(scale(day)^2) + 
                                 factor(years),
                 svc.cols = c(1, 2),
                 data = data.EAWP,
                 inits = inits,
                 priors = priors,
                 n.batch = n.batch,
                 batch.length = batch.length,
                 accept.rate = 0.43,
                 cov.model = "exponential",
                 tuning = tuning.list,
                 n.omp.threads = 1,
                 verbose = TRUE,
                 NNGP = TRUE,
                 n.neighbors = 5,
                 n.report = 30,
                 n.burn = n.burn,
                 n.thin = n.thin,
                 n.chains = n.chains) 
----------------------------------------
    Preparing the data
----------------------------------------
z is not specified in initial values.
Setting initial values based on observed data
----------------------------------------
    Building the neighbor list
----------------------------------------
----------------------------------------
Building the neighbors of neighbors list
----------------------------------------
----------------------------------------
    Model description
----------------------------------------
Spatial NNGP Trend Occupancy Model with Polya-Gamma latent
variable fit with 500 sites and 10 years.

Samples per chain: 7500 (300 batches of length 25)
Burn-in: 5000 
Thinning Rate: 5 
Number of Chains: 1 
Total Posterior Samples: 500 

Number of spatially-varying coefficients: 2 
Using the exponential spatial correlation model.

Using 5 nearest neighbors.

Source compiled with OpenMP support and model fit using 1 thread(s).

Adaptive Metropolis with target acceptance rate: 43.0
----------------------------------------
    Chain 1
----------------------------------------
Sampling ... 
Batch: 30 of 300, 10.00%
    Coefficient Parameter   Acceptance  Tuning
    1       phi     28.0        0.60603
    2       phi     48.0        0.65498
-------------------------------------------------
Batch: 60 of 300, 20.00%
    Coefficient Parameter   Acceptance  Tuning
    1       phi     32.0        0.57074
    2       phi     80.0        0.84947
-------------------------------------------------
Batch: 90 of 300, 30.00%
    Coefficient Parameter   Acceptance  Tuning
    1       phi     76.0        0.61827
    2       phi     56.0        1.07988
-------------------------------------------------
Batch: 120 of 300, 40.00%
    Coefficient Parameter   Acceptance  Tuning
    1       phi     40.0        0.63076
    2       phi     32.0        1.19346
-------------------------------------------------
Batch: 150 of 300, 50.00%
    Coefficient Parameter   Acceptance  Tuning
    1       phi     44.0        0.77042
    2       phi     36.0        1.31897
-------------------------------------------------
Batch: 180 of 300, 60.00%
    Coefficient Parameter   Acceptance  Tuning
    1       phi     32.0        0.71118
    2       phi     72.0        1.34562
-------------------------------------------------
Batch: 210 of 300, 70.00%
    Coefficient Parameter   Acceptance  Tuning
    1       phi     36.0        0.71118
    2       phi     28.0        1.29285
-------------------------------------------------
Batch: 240 of 300, 80.00%
    Coefficient Parameter   Acceptance  Tuning
    1       phi     56.0        0.75516
    2       phi     24.0        1.24216
-------------------------------------------------
Batch: 270 of 300, 90.00%
    Coefficient Parameter   Acceptance  Tuning
    1       phi     20.0        0.65650
    2       phi     44.0        1.37280
-------------------------------------------------
Batch: 300 of 300, 100.00%
summary(out)

Call:
svcTPGOcc(occ.formula = ~scale(years), det.formula = ~scale(day) + 
    I(scale(day)^2) + factor(years), data = data.EAWP, inits = inits, 
    priors = priors, tuning = tuning.list, svc.cols = c(1, 2), 
    cov.model = "exponential", NNGP = TRUE, n.neighbors = 5, 
    n.batch = n.batch, batch.length = batch.length, accept.rate = 0.43, 
    n.omp.threads = 1, verbose = TRUE, n.report = 30, n.burn = n.burn, 
    n.thin = n.thin, n.chains = n.chains)

Samples per Chain: 7500
Burn-in: 5000
Thinning Rate: 5
Number of Chains: 1
Total Posterior Samples: 500
Run Time (min): 3.8871

Occurrence (logit scale): 
                Mean     SD    2.5%     50%  97.5% Rhat ESS
(Intercept)   1.1923 0.5036  0.2543  1.2213 2.1294   NA   9
scale(years) -0.1258 0.1192 -0.3725 -0.1293 0.0976   NA  95

Detection (logit scale): 
                     Mean     SD    2.5%     50%   97.5% Rhat ESS
(Intercept)        0.0782 0.0567 -0.0410  0.0795  0.1901   NA 500
scale(day)        -0.0641 0.0182 -0.1008 -0.0648 -0.0289   NA 500
I(scale(day)^2)   -0.0395 0.0165 -0.0725 -0.0392 -0.0101   NA 500
factor(years)2011  0.1266 0.0775 -0.0244  0.1228  0.2717   NA 590
factor(years)2012  0.0941 0.0760 -0.0660  0.0923  0.2413   NA 500
factor(years)2013 -0.0387 0.0791 -0.1904 -0.0434  0.1157   NA 500
factor(years)2014  0.0786 0.0747 -0.0689  0.0756  0.2253   NA 500
factor(years)2015  0.0882 0.0797 -0.0739  0.0924  0.2538   NA 572
factor(years)2016  0.0980 0.0738 -0.0558  0.0971  0.2390   NA 459
factor(years)2017  0.1670 0.0723  0.0269  0.1669  0.3076   NA 500
factor(years)2018  0.2841 0.0778  0.1269  0.2819  0.4328   NA 500
factor(years)2019  0.2214 0.0797  0.0639  0.2212  0.3669   NA 323

Spatial Covariance: 
                         Mean     SD    2.5%     50%   97.5% Rhat ESS
sigma.sq-(Intercept)  19.5369 4.3518 12.8051 18.6281 29.4338   NA  45
sigma.sq-scale(years)  0.4636 0.1519  0.2346  0.4434  0.8301   NA  34
phi-(Intercept)        0.0056 0.0010  0.0036  0.0056  0.0075   NA  63
phi-scale(years)       0.0066 0.0012  0.0039  0.0069  0.0080   NA  27
plot(out$theta.samples, density = FALSE)

We can use the getSVCSamples() function to extract the MCMC samples for the complete spatially-varying coefficient for each coefficient.

svc.samples <- getSVCSamples(out)
str(svc.samples)
List of 2
 $ (Intercept) : 'mcmc' num [1:500, 1:500] -2.43 -4.13 -2.76 -1.37 -1.94 ...
  ..- attr(*, "mcpar")= num [1:3] 1 500 1
 $ scale(years): 'mcmc' num [1:500, 1:500] -0.726 0.142 1.315 -0.35 -0.18 ...
  ..- attr(*, "mcpar")= num [1:3] 1 500 1
# The first 9 spatially-varying trend estimates
plot(svc.samples[[2]][, 1:9], density = FALSE)

Prediction

We will next generate a map of the spatially-varying trend across a 30 x 30 km grid in the eastern US.

load('pred-coordinates.rda')
# Design "matrix" should be a 3-D array with dimensions corresponding to site, year, and parameter
X.0 <- array(NA, dim = c(nrow(coords.0), ncol(out$y), dim(out$X)[3]))
# Intercept
X.0[, , 1] <- 1
# Year
unique.years <- unique(c(data.EAWP$occ.covs$years))
for (t in 1:ncol(out$y)) {
  X.0[, t, 2] <- unique.years[t]
}
X.0[, , 2] <- (X.0[, , 2] - mean(c(data.EAWP$occ.covs$years))) / sd(c(data.EAWP$occ.covs$years))
# The t.cols argument specifies that we are predicting for all ten years.
out.pred <- predict(out, X.0, coords.0, t.cols = 1:10)
----------------------------------------
    Prediction description
----------------------------------------
Spatial NNGP Multi-season Occupancy model with Polya-Gamma latent
variable fit with 500 observations and 10 years.

Number of fixed covariates 2 (including intercept if specified).

Using the exponential spatial correlation model.

Using 5 nearest neighbors.

Number of MCMC samples 500.

Predicting at 4474 non-sampled locations.


Source compiled with OpenMP support and model fit using 1 threads.
-------------------------------------------------
        Predicting
-------------------------------------------------
Location: 100 of 4474, 2.24%
Location: 200 of 4474, 4.47%
Location: 300 of 4474, 6.71%
Location: 400 of 4474, 8.94%
Location: 500 of 4474, 11.18%
Location: 600 of 4474, 13.41%
Location: 700 of 4474, 15.65%
Location: 800 of 4474, 17.88%
Location: 900 of 4474, 20.12%
Location: 1000 of 4474, 22.35%
Location: 1100 of 4474, 24.59%
Location: 1200 of 4474, 26.82%
Location: 1300 of 4474, 29.06%
Location: 1400 of 4474, 31.29%
Location: 1500 of 4474, 33.53%
Location: 1600 of 4474, 35.76%
Location: 1700 of 4474, 38.00%
Location: 1800 of 4474, 40.23%
Location: 1900 of 4474, 42.47%
Location: 2000 of 4474, 44.70%
Location: 2100 of 4474, 46.94%
Location: 2200 of 4474, 49.17%
Location: 2300 of 4474, 51.41%
Location: 2400 of 4474, 53.64%
Location: 2500 of 4474, 55.88%
Location: 2600 of 4474, 58.11%
Location: 2700 of 4474, 60.35%
Location: 2800 of 4474, 62.58%
Location: 2900 of 4474, 64.82%
Location: 3000 of 4474, 67.05%
Location: 3100 of 4474, 69.29%
Location: 3200 of 4474, 71.52%
Location: 3300 of 4474, 73.76%
Location: 3400 of 4474, 75.99%
Location: 3500 of 4474, 78.23%
Location: 3600 of 4474, 80.46%
Location: 3700 of 4474, 82.70%
Location: 3800 of 4474, 84.94%
Location: 3900 of 4474, 87.17%
Location: 4000 of 4474, 89.41%
Location: 4100 of 4474, 91.64%
Location: 4200 of 4474, 93.88%
Location: 4300 of 4474, 96.11%
Location: 4400 of 4474, 98.35%
Location: 4474 of 4474, 100.00%
Generating latent occupancy state
str(out.pred)
List of 6
 $ z.0.samples  : num [1:500, 1:4474, 1:10] 1 1 1 1 1 1 1 1 1 1 ...
 $ w.0.samples  : num [1:500, 1:2, 1:4474] 7.79 5.9 5.4 1.95 5.39 ...
 $ psi.0.samples: num [1:500, 1:4474, 1:10] 1 0.999 1 0.992 1 ...
 $ run.time     : 'proc_time' Named num [1:5] 16.79 5.75 12.81 0 0
  ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
 $ call         : language predict.svcTPGOcc(object = out, X.0 = X.0, coords.0 = coords.0, t.cols = 1:10)
 $ object.class : chr "svcTPGOcc"
 - attr(*, "class")= chr "predict.svcTPGOcc"
# Get the full spatially-varying coefficients
svc.pred.samples <- getSVCSamples(out, out.pred)
str(svc.pred.samples)
List of 2
 $ (Intercept) : 'mcmc' num [1:500, 1:4474] 9.65 7.63 7.33 3.7 7.21 ...
  ..- attr(*, "mcpar")= num [1:3] 1 500 1
 $ scale(years): 'mcmc' num [1:500, 1:4474] -0.4769 0.0936 -0.3207 -0.6856 -0.5987 ...
  ..- attr(*, "mcpar")= num [1:3] 1 500 1

Finally, we use the stars and ggplot2 packages to generate a map of the estimated spatially-varying trend across the 30 x 30 m grid. We generate a map of uncertainty by plotting the probability the trend at any given location is greater than 0, and then discretizing this into five bins that illustrate the amount of support for a positive (close to 1) or negative (close to 0) trend.

plot.df <- data.frame(trend = apply(svc.pred.samples[[2]], 2, mean),
                      trend.prob.pos = apply(svc.pred.samples[[2]], 2, function(a) mean(a > 0)),
                      x = coords.0[, 1],
                      y = coords.0[, 2])
pred.stars <- st_as_stars(plot.df, dims = c('x', 'y'))
coords.sf <- st_as_sf(as.data.frame(data.EAWP$coords),
                      coords = c("X", "Y"),
                      crs = my.crs)
usa <- st_as_sf(maps::map("state", fill = TRUE, plot = FALSE))
usa.no.states <- st_as_sf(maps::map("usa", fill = TRUE, plot = FALSE))
# Restrict to east of the 100th meridian
usa.bbox <- st_bbox(usa)
usa.bbox[1] <- -100
usa.bbox <- as.vector(usa.bbox)
sf_use_s2(FALSE)
# Full data
east.us <- st_crop(st_make_valid(usa), xmin = usa.bbox[1], ymin = usa.bbox[2],
                   xmax = usa.bbox[3], ymax = usa.bbox[4])
east.us <- east.us %>%
  st_transform(st_crs(coords.sf))

# Map of the mean effect
ggplot() +
  geom_stars(data = pred.stars, aes(x = x, y = y, fill = trend), interpolate = FALSE) +
  geom_sf(data = east.us, alpha = 0, col = 'grey') +
  scale_fill_gradient2(midpoint = 0, low = '#B2182B', mid = 'white', high = '#2166AC',
                       na.value = NA) +
  theme_bw(base_size = 14) +
  labs(x = "Longitude", y = "Latitude", fill = "", title = 'Eastern Wood-Pewee Trend (2010-2019)') +
  theme(legend.position = c(0.87, 0.25),
        legend.background = element_rect(fill = NA),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.title = element_text(size = 14),
        plot.title = element_text(size = 18),
        legend.text = element_text(size = 12))

ggplot() +
  geom_stars(data = pred.stars, aes(x = x, y = y, fill = trend.prob.pos), interpolate = FALSE) +
  geom_sf(data = east.us, alpha = 0, col = 'grey') +
  scale_fill_steps2(midpoint = 0.5, low = '#B2182B', mid = 'white', high = '#2166AC',
                   na.value = NA, limits = c(0, 1), n.breaks = 6) +
  theme_bw(base_size = 14) +
  labs(x = "Longitude", y = "Latitude", fill = "", title = 'P(trend > 0)') +
  theme(legend.position = c(0.87, 0.25),
        legend.background = element_rect(fill = NA),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.title = element_text(size = 14),
        plot.title = element_text(size = 18),
        legend.text = element_text(size = 12))

References

Polson, Nicholas G, James G Scott, and Jesse Windle. 2013. “Bayesian Inference for Logistic Models Using Pólya–Gamma Latent Variables.” Journal of the American Statistical Association 108 (504): 1339–49.