In this exercise, our goal is to fit a spatially-varying coefficient model to assess if there is any spatial variation in a ten-year (2010-2019) occurrence trend of the Eastern Wood-pewee across the eastern US. We will fit a model that explicitly accommodates observational biases that occur when collecting data on many wildlife species (i.e., imperfect detection, or failing to observe a species at a location when it is truly present). The data we use come from the North American Breeding Bird Survey, in which trained volunteers count the number of birds, identified to species, at approximately 3000 locations (called “routes”) across North America each year since the 1960s. Here we use presence/absence data from \(n = 500\) sites across the eastern US that are within the known distribution of the Eastern Wood-pewee, which are surveyed varying amounts over a ten year period from 2010-2019. At each site during each year it is sampled, we use data from \(K = 5\) replicate surveys. We assume the true presence/absence status of the bird at a given location does not change over the 5 replicate surveys, and that any variation in the data across the replicate surveys is related to observational error. In the wildlife ecology literature, this type of zero-inflated mixture model is referred to as an “occupancy model”.
We will fit the model using the function svcTPGOcc()
in the spOccupancy
package, where the svc
stands for spatially-varying coefficients, the T
indicates we’re fitting a model that has a temporal dimension to it, PG
indicates we’re using Pólya-Gamma data augmentation (Polson, Scott, and Windle 2013), and Occ
stands for occupancy model. Below we first load the packages we will use in this example.
set.seed(111)
library(spOccupancy)
library(ggplot2)
library(sf)
Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
library(stars)
Loading required package: abind
# Load the data and take a look at its structure
load("eastern-wood-pewee-data.rda")
str(data.EAWP)
List of 4
$ y : num [1:500, 1:10, 1:5] 0 1 1 0 0 NA 1 NA 0 1 ...
$ occ.covs:List of 1
..$ years: int [1:500, 1:10] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
$ det.covs:List of 2
..$ day : num [1:500, 1:10] 158 151 181 151 159 NA 164 NA 167 157 ...
..$ years: int [1:500, 1:10] 2010 2010 2010 2010 2010 NA 2010 NA 2010 2010 ...
$ coords : num [1:500, 1:2] 5.04 11.54 467.96 1792.23 1153.17 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:500] "1745" "136" "945" "1538" ...
.. ..$ : chr [1:2] "X" "Y"
The data list is comprised of four components:
y
: a three-dimensional array with dimensions corresponding to sites, years, and replicate, with each value containing the detection (1) or nondetection (0) of the species during the given survey.occ.covs
: a list comprised of the covariates that we will include in the process (i.e., occurrence) sub-modeldet.covs
: a list comprised of the covariates that we will include in the observation sub-model.coords
: a \(n \times 2\) matrix of the spatial coordinatesNext, we generate a couple of maps that show the location of the data points and the proportion of times the species was detected at each location over the time period.
my.crs <- "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=km +no_defs"
coords.sf <- st_as_sf(as.data.frame(data.EAWP$coords),
coords = c("X", "Y"),
crs = my.crs)
usa <- st_as_sf(maps::map("state", fill = TRUE, plot = FALSE))
usa.no.states <- st_as_sf(maps::map("usa", fill = TRUE, plot = FALSE))
# Restrict to east of the 100th meridian
usa.bbox <- st_bbox(usa)
usa.bbox[1] <- -100
usa.bbox <- as.vector(usa.bbox)
sf_use_s2(FALSE)
east.us <- st_crop(st_make_valid(usa), xmin = usa.bbox[1], ymin = usa.bbox[2],
xmax = usa.bbox[3], ymax = usa.bbox[4])
Warning: attribute variables are assumed to be spatially constant throughout all
geometries
east.us <- east.us %>%
st_transform(st_crs(coords.sf))
coords.sf$raw.prob <- apply(data.EAWP$y, 1, mean, na.rm = TRUE)
ggplot(coords.sf) +
geom_sf(aes(col = raw.prob), size = 2) +
scale_color_viridis_c('', option = 'plasma', limits = c(0, 1)) +
geom_sf(data = east.us, fill = NA, color=alpha("black", 0.75)) +
theme_bw(base_size = 14)
Below, we calculate the average proportion of times the species was observed at an average site each year
y.prop.by.year <- apply(data.EAWP$y, 2, mean, na.rm = TRUE)
plot(2010:2019, y.prop.by.year, pch = 19, xlab = 'Year', ylab = 'Average raw occurrence probability')
We next fit the model with the svcTPGOcc
function in spOccupancy
. The arguments are very similar to what we saw with the spatial factor model in the previous exercise. We first specify the priors below
dist.bbs <- dist(data.EAWP$coords)
low.dist <- quantile(dist.bbs, 0.10)
max.dist <- max(dist.bbs)
priors <- list(sigma.sq.ig = list(2, 1), # IG prior for spatial variances
# Informative prior for spatial decay parameters
phi.unif = list(3 / max.dist, 3 / low.dist),
# Normal prior for occurrence coefficients
beta.normal = list(mean = 0, var = 2.72),
# Normal prior for observational (detection) coefficients
alpha.normal = list(mean = 0, var = 2.72))
Here we will again use an exponential spatial covariance model, so notice that our prior for the spatial decay parameters \(\phi\) restricts the effective spatial range to fall between the maximum intersite distance and the 10% quantile of the intersite distances. We use our usual inverse-Gamma prior for the spatial variances and Gaussian priors for the regression coefficients.
Next, we set the initial values and initial tuning parameters for \(\phi\) (we again use an Adaptive Metropolis update for these parameters as in the spatial factor model).
tuning.list <- list(phi = c(0.6, 0.5))
inits <- list(phi = c(3 / 500, 3 / 400), sigma.sq = c(5, 0.5),
beta = c(1.5, 0), alpha = 0)
Finally, we are about set to run the model. The function takes two right-sided formulas: one for the process (occurrence) model, and one for the observation (detection) model. Here we solely include a spatially-varying trend in the occurrence portion of the model, while we include a linear and quadratic effect of survey day in the detection model and a separate intercept for each of the ten years. Notice the svc.cols
argument is used to specify the columns of the design design matrix that we wish to allow to vary over space.
# MCMC criteria
n.batch <- 300
batch.length <- 25
n.burn <- 5000
n.thin <- 5
n.chains <- 1
out <- svcTPGOcc(occ.formula = ~ scale(years),
det.formula = ~ scale(day) + I(scale(day)^2) +
factor(years),
svc.cols = c(1, 2),
data = data.EAWP,
inits = inits,
priors = priors,
n.batch = n.batch,
batch.length = batch.length,
accept.rate = 0.43,
cov.model = "exponential",
tuning = tuning.list,
n.omp.threads = 1,
verbose = TRUE,
NNGP = TRUE,
n.neighbors = 5,
n.report = 30,
n.burn = n.burn,
n.thin = n.thin,
n.chains = n.chains)
----------------------------------------
Preparing the data
----------------------------------------
z is not specified in initial values.
Setting initial values based on observed data
----------------------------------------
Building the neighbor list
----------------------------------------
----------------------------------------
Building the neighbors of neighbors list
----------------------------------------
----------------------------------------
Model description
----------------------------------------
Spatial NNGP Trend Occupancy Model with Polya-Gamma latent
variable fit with 500 sites and 10 years.
Samples per chain: 7500 (300 batches of length 25)
Burn-in: 5000
Thinning Rate: 5
Number of Chains: 1
Total Posterior Samples: 500
Number of spatially-varying coefficients: 2
Using the exponential spatial correlation model.
Using 5 nearest neighbors.
Source compiled with OpenMP support and model fit using 1 thread(s).
Adaptive Metropolis with target acceptance rate: 43.0
----------------------------------------
Chain 1
----------------------------------------
Sampling ...
Batch: 30 of 300, 10.00%
Coefficient Parameter Acceptance Tuning
1 phi 28.0 0.60603
2 phi 48.0 0.65498
-------------------------------------------------
Batch: 60 of 300, 20.00%
Coefficient Parameter Acceptance Tuning
1 phi 32.0 0.57074
2 phi 80.0 0.84947
-------------------------------------------------
Batch: 90 of 300, 30.00%
Coefficient Parameter Acceptance Tuning
1 phi 76.0 0.61827
2 phi 56.0 1.07988
-------------------------------------------------
Batch: 120 of 300, 40.00%
Coefficient Parameter Acceptance Tuning
1 phi 40.0 0.63076
2 phi 32.0 1.19346
-------------------------------------------------
Batch: 150 of 300, 50.00%
Coefficient Parameter Acceptance Tuning
1 phi 44.0 0.77042
2 phi 36.0 1.31897
-------------------------------------------------
Batch: 180 of 300, 60.00%
Coefficient Parameter Acceptance Tuning
1 phi 32.0 0.71118
2 phi 72.0 1.34562
-------------------------------------------------
Batch: 210 of 300, 70.00%
Coefficient Parameter Acceptance Tuning
1 phi 36.0 0.71118
2 phi 28.0 1.29285
-------------------------------------------------
Batch: 240 of 300, 80.00%
Coefficient Parameter Acceptance Tuning
1 phi 56.0 0.75516
2 phi 24.0 1.24216
-------------------------------------------------
Batch: 270 of 300, 90.00%
Coefficient Parameter Acceptance Tuning
1 phi 20.0 0.65650
2 phi 44.0 1.37280
-------------------------------------------------
Batch: 300 of 300, 100.00%
summary(out)
Call:
svcTPGOcc(occ.formula = ~scale(years), det.formula = ~scale(day) +
I(scale(day)^2) + factor(years), data = data.EAWP, inits = inits,
priors = priors, tuning = tuning.list, svc.cols = c(1, 2),
cov.model = "exponential", NNGP = TRUE, n.neighbors = 5,
n.batch = n.batch, batch.length = batch.length, accept.rate = 0.43,
n.omp.threads = 1, verbose = TRUE, n.report = 30, n.burn = n.burn,
n.thin = n.thin, n.chains = n.chains)
Samples per Chain: 7500
Burn-in: 5000
Thinning Rate: 5
Number of Chains: 1
Total Posterior Samples: 500
Run Time (min): 3.8871
Occurrence (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) 1.1923 0.5036 0.2543 1.2213 2.1294 NA 9
scale(years) -0.1258 0.1192 -0.3725 -0.1293 0.0976 NA 95
Detection (logit scale):
Mean SD 2.5% 50% 97.5% Rhat ESS
(Intercept) 0.0782 0.0567 -0.0410 0.0795 0.1901 NA 500
scale(day) -0.0641 0.0182 -0.1008 -0.0648 -0.0289 NA 500
I(scale(day)^2) -0.0395 0.0165 -0.0725 -0.0392 -0.0101 NA 500
factor(years)2011 0.1266 0.0775 -0.0244 0.1228 0.2717 NA 590
factor(years)2012 0.0941 0.0760 -0.0660 0.0923 0.2413 NA 500
factor(years)2013 -0.0387 0.0791 -0.1904 -0.0434 0.1157 NA 500
factor(years)2014 0.0786 0.0747 -0.0689 0.0756 0.2253 NA 500
factor(years)2015 0.0882 0.0797 -0.0739 0.0924 0.2538 NA 572
factor(years)2016 0.0980 0.0738 -0.0558 0.0971 0.2390 NA 459
factor(years)2017 0.1670 0.0723 0.0269 0.1669 0.3076 NA 500
factor(years)2018 0.2841 0.0778 0.1269 0.2819 0.4328 NA 500
factor(years)2019 0.2214 0.0797 0.0639 0.2212 0.3669 NA 323
Spatial Covariance:
Mean SD 2.5% 50% 97.5% Rhat ESS
sigma.sq-(Intercept) 19.5369 4.3518 12.8051 18.6281 29.4338 NA 45
sigma.sq-scale(years) 0.4636 0.1519 0.2346 0.4434 0.8301 NA 34
phi-(Intercept) 0.0056 0.0010 0.0036 0.0056 0.0075 NA 63
phi-scale(years) 0.0066 0.0012 0.0039 0.0069 0.0080 NA 27
plot(out$theta.samples, density = FALSE)
We can use the getSVCSamples()
function to extract the MCMC samples for the complete spatially-varying coefficient for each coefficient.
svc.samples <- getSVCSamples(out)
str(svc.samples)
List of 2
$ (Intercept) : 'mcmc' num [1:500, 1:500] -2.43 -4.13 -2.76 -1.37 -1.94 ...
..- attr(*, "mcpar")= num [1:3] 1 500 1
$ scale(years): 'mcmc' num [1:500, 1:500] -0.726 0.142 1.315 -0.35 -0.18 ...
..- attr(*, "mcpar")= num [1:3] 1 500 1
# The first 9 spatially-varying trend estimates
plot(svc.samples[[2]][, 1:9], density = FALSE)
We will next generate a map of the spatially-varying trend across a 30 x 30 km grid in the eastern US.
load('pred-coordinates.rda')
# Design "matrix" should be a 3-D array with dimensions corresponding to site, year, and parameter
X.0 <- array(NA, dim = c(nrow(coords.0), ncol(out$y), dim(out$X)[3]))
# Intercept
X.0[, , 1] <- 1
# Year
unique.years <- unique(c(data.EAWP$occ.covs$years))
for (t in 1:ncol(out$y)) {
X.0[, t, 2] <- unique.years[t]
}
X.0[, , 2] <- (X.0[, , 2] - mean(c(data.EAWP$occ.covs$years))) / sd(c(data.EAWP$occ.covs$years))
# The t.cols argument specifies that we are predicting for all ten years.
out.pred <- predict(out, X.0, coords.0, t.cols = 1:10)
----------------------------------------
Prediction description
----------------------------------------
Spatial NNGP Multi-season Occupancy model with Polya-Gamma latent
variable fit with 500 observations and 10 years.
Number of fixed covariates 2 (including intercept if specified).
Using the exponential spatial correlation model.
Using 5 nearest neighbors.
Number of MCMC samples 500.
Predicting at 4474 non-sampled locations.
Source compiled with OpenMP support and model fit using 1 threads.
-------------------------------------------------
Predicting
-------------------------------------------------
Location: 100 of 4474, 2.24%
Location: 200 of 4474, 4.47%
Location: 300 of 4474, 6.71%
Location: 400 of 4474, 8.94%
Location: 500 of 4474, 11.18%
Location: 600 of 4474, 13.41%
Location: 700 of 4474, 15.65%
Location: 800 of 4474, 17.88%
Location: 900 of 4474, 20.12%
Location: 1000 of 4474, 22.35%
Location: 1100 of 4474, 24.59%
Location: 1200 of 4474, 26.82%
Location: 1300 of 4474, 29.06%
Location: 1400 of 4474, 31.29%
Location: 1500 of 4474, 33.53%
Location: 1600 of 4474, 35.76%
Location: 1700 of 4474, 38.00%
Location: 1800 of 4474, 40.23%
Location: 1900 of 4474, 42.47%
Location: 2000 of 4474, 44.70%
Location: 2100 of 4474, 46.94%
Location: 2200 of 4474, 49.17%
Location: 2300 of 4474, 51.41%
Location: 2400 of 4474, 53.64%
Location: 2500 of 4474, 55.88%
Location: 2600 of 4474, 58.11%
Location: 2700 of 4474, 60.35%
Location: 2800 of 4474, 62.58%
Location: 2900 of 4474, 64.82%
Location: 3000 of 4474, 67.05%
Location: 3100 of 4474, 69.29%
Location: 3200 of 4474, 71.52%
Location: 3300 of 4474, 73.76%
Location: 3400 of 4474, 75.99%
Location: 3500 of 4474, 78.23%
Location: 3600 of 4474, 80.46%
Location: 3700 of 4474, 82.70%
Location: 3800 of 4474, 84.94%
Location: 3900 of 4474, 87.17%
Location: 4000 of 4474, 89.41%
Location: 4100 of 4474, 91.64%
Location: 4200 of 4474, 93.88%
Location: 4300 of 4474, 96.11%
Location: 4400 of 4474, 98.35%
Location: 4474 of 4474, 100.00%
Generating latent occupancy state
str(out.pred)
List of 6
$ z.0.samples : num [1:500, 1:4474, 1:10] 1 1 1 1 1 1 1 1 1 1 ...
$ w.0.samples : num [1:500, 1:2, 1:4474] 7.79 5.9 5.4 1.95 5.39 ...
$ psi.0.samples: num [1:500, 1:4474, 1:10] 1 0.999 1 0.992 1 ...
$ run.time : 'proc_time' Named num [1:5] 16.79 5.75 12.81 0 0
..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
$ call : language predict.svcTPGOcc(object = out, X.0 = X.0, coords.0 = coords.0, t.cols = 1:10)
$ object.class : chr "svcTPGOcc"
- attr(*, "class")= chr "predict.svcTPGOcc"
# Get the full spatially-varying coefficients
svc.pred.samples <- getSVCSamples(out, out.pred)
str(svc.pred.samples)
List of 2
$ (Intercept) : 'mcmc' num [1:500, 1:4474] 9.65 7.63 7.33 3.7 7.21 ...
..- attr(*, "mcpar")= num [1:3] 1 500 1
$ scale(years): 'mcmc' num [1:500, 1:4474] -0.4769 0.0936 -0.3207 -0.6856 -0.5987 ...
..- attr(*, "mcpar")= num [1:3] 1 500 1
Finally, we use the stars
and ggplot2
packages to generate a map of the estimated spatially-varying trend across the 30 x 30 m grid. We generate a map of uncertainty by plotting the probability the trend at any given location is greater than 0, and then discretizing this into five bins that illustrate the amount of support for a positive (close to 1) or negative (close to 0) trend.
plot.df <- data.frame(trend = apply(svc.pred.samples[[2]], 2, mean),
trend.prob.pos = apply(svc.pred.samples[[2]], 2, function(a) mean(a > 0)),
x = coords.0[, 1],
y = coords.0[, 2])
pred.stars <- st_as_stars(plot.df, dims = c('x', 'y'))
coords.sf <- st_as_sf(as.data.frame(data.EAWP$coords),
coords = c("X", "Y"),
crs = my.crs)
usa <- st_as_sf(maps::map("state", fill = TRUE, plot = FALSE))
usa.no.states <- st_as_sf(maps::map("usa", fill = TRUE, plot = FALSE))
# Restrict to east of the 100th meridian
usa.bbox <- st_bbox(usa)
usa.bbox[1] <- -100
usa.bbox <- as.vector(usa.bbox)
sf_use_s2(FALSE)
# Full data
east.us <- st_crop(st_make_valid(usa), xmin = usa.bbox[1], ymin = usa.bbox[2],
xmax = usa.bbox[3], ymax = usa.bbox[4])
east.us <- east.us %>%
st_transform(st_crs(coords.sf))
# Map of the mean effect
ggplot() +
geom_stars(data = pred.stars, aes(x = x, y = y, fill = trend), interpolate = FALSE) +
geom_sf(data = east.us, alpha = 0, col = 'grey') +
scale_fill_gradient2(midpoint = 0, low = '#B2182B', mid = 'white', high = '#2166AC',
na.value = NA) +
theme_bw(base_size = 14) +
labs(x = "Longitude", y = "Latitude", fill = "", title = 'Eastern Wood-Pewee Trend (2010-2019)') +
theme(legend.position = c(0.87, 0.25),
legend.background = element_rect(fill = NA),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.title = element_text(size = 14),
plot.title = element_text(size = 18),
legend.text = element_text(size = 12))
ggplot() +
geom_stars(data = pred.stars, aes(x = x, y = y, fill = trend.prob.pos), interpolate = FALSE) +
geom_sf(data = east.us, alpha = 0, col = 'grey') +
scale_fill_steps2(midpoint = 0.5, low = '#B2182B', mid = 'white', high = '#2166AC',
na.value = NA, limits = c(0, 1), n.breaks = 6) +
theme_bw(base_size = 14) +
labs(x = "Longitude", y = "Latitude", fill = "", title = 'P(trend > 0)') +
theme(legend.position = c(0.87, 0.25),
legend.background = element_rect(fill = NA),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.title = element_text(size = 14),
plot.title = element_text(size = 18),
legend.text = element_text(size = 12))
Polson, Nicholas G, James G Scott, and Jesse Windle. 2013. “Bayesian Inference for Logistic Models Using Pólya–Gamma Latent Variables.” Journal of the American Statistical Association 108 (504): 1339–49.