TraCESahulMisc: Workflow Vignette

Introduction

This vignette provides a worked example using TraCESahulMisc to load downscaled TraCE–Sahul climate data, summarise it in time, calculate BIOCLIM variables, match palaeo-observations, generate background points, and fit a quick Random Forest model to examine habitat suitability through time.

Installing and loading packages

For the vignette we need four packages.

library(TraCESahulMisc)
library(terra)
#> terra 1.8.87
library(randomForest)
#> randomForest 4.7-1.2
#> Type rfNews() to see new features/changes/bug fixes.
library(pbapply)

# only for plotting
library(rnaturalearth)
conts <- vect(rnaturalearth::ne_coastline(scale = 50))
conts <- crop(conts, ext(125, 135, -13, -8))

Downloading example data

dl_dir <- "~/Documents" # download path for example data
base_dir <- normalizePath(paste0(dl_dir,"/TraCE-Sahul")) # base dir to example data
# download_trace_data(dl_dir) # Run once only

Importing TraCE–Sahul data

This small function will read in the downloaded data to an object called sahul which will then have names for each of the variables.

vars <- c("pr", "tasmax", "tasmin")

# function to read in the data for each variable
load_tracesahul <- function(base, var) {
  d <- file.path(base, var)
  pat_dec <- sprintf(
    "TraCE-Sahul_decadal_22k_1500CE_%s_0[1-6]\\.nc$", var)
  f_dec <- list.files(d, pattern = pat_dec, full.names = TRUE)
  if (length(f_dec) == 0L) {
    stop("No valid decadal chunk files (01–06) found for ", var)
  }
  idx <- sub(
    sprintf(".*TraCE-Sahul_decadal_22k_1500CE_%s_(\\d{2})\\.nc$", var),
    "\\1", basename(f_dec))
  idx_int <- as.integer(idx)
  if (!all(idx_int %in% 1:6)) {
    stop("Invalid decadal chunk numbering found for ", var,". Only 01–06 are allowed.")
  }
  if (length(idx_int) != 6L) {
    stop("Expected six decadal chunk files (01–06) for ", var, ", but found ", length(idx_int), ".")
  }
  f_dec <- f_dec[order(idx_int)]
  f_1500 <- file.path(d, sprintf("TraCE-Sahul_annual_1500_1990_%s.nc", var))
  if (!file.exists(f_1500)) {
    stop("1500_1990 file not found for ", var)
  }
  import_TraCESahul(files = c(f_dec, f_1500), aoi = NULL)
}
sahul <- lapply(vars, function(v) load_tracesahul(base_dir, v))
names(sahul) <- vars
sahul
#> $pr
#> class       : SpatRaster 
#> size        : 100, 200, 31740  (nrow, ncol, nlyr)
#> resolution  : 0.05, 0.05  (x, y)
#> extent      : 125, 135, -13, -8  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#> sources     : TraCE-Sahul_decadal_22k_1500CE_pr_01.nc  (4812 layers) 
#>               TraCE-Sahul_decadal_22k_1500CE_pr_02.nc  (4812 layers) 
#>               TraCE-Sahul_decadal_22k_1500CE_pr_03.nc  (4812 layers) 
#>               ... and 4 more sources
#> varnames    : pr (Precipitation) 
#>               pr (Precipitation) 
#>               pr (Precipitation) 
#>               ...
#> names       :     pr_1,     pr_2,     pr_3,     pr_4,     pr_5,     pr_6,      ... 
#> unit        : mm/month 
#> depth       : 1 to 12 (calendar month [calendar month]: 12 steps) 
#> time (raw)  : -20045 to 1989 (2645 steps) 
#> 
#> $tasmax
#> class       : SpatRaster 
#> size        : 100, 200, 31740  (nrow, ncol, nlyr)
#> resolution  : 0.05, 0.05  (x, y)
#> extent      : 125, 135, -13, -8  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#> sources     : TraCE-Sahul_decadal_22k_1500CE_tasmax_01.nc  (4812 layers) 
#>               TraCE-Sahul_decadal_22k_1500CE_tasmax_02.nc  (4812 layers) 
#>               TraCE-Sahul_decadal_22k_1500CE_tasmax_03.nc  (4812 layers) 
#>               ... and 4 more sources
#> varnames    : tasmax (Daily Maximum Near-Surface Air Temperatures) 
#>               tasmax (Daily Maximum Near-Surface Air Temperatures) 
#>               tasmax (Daily Maximum Near-Surface Air Temperatures) 
#>               ...
#> names       : tasmax_1, tasmax_2, tasmax_3, tasmax_4, tasmax_5, tasmax_6,   ... 
#> unit        : deg_C 
#> depth       : 1 to 12 (calendar month [calendar month]: 12 steps) 
#> time (raw)  : -20045 to 1989 (2645 steps) 
#> 
#> $tasmin
#> class       : SpatRaster 
#> size        : 100, 200, 31740  (nrow, ncol, nlyr)
#> resolution  : 0.05, 0.05  (x, y)
#> extent      : 125, 135, -13, -8  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#> sources     : TraCE-Sahul_decadal_22k_1500CE_tasmin_01.nc  (4812 layers) 
#>               TraCE-Sahul_decadal_22k_1500CE_tasmin_02.nc  (4812 layers) 
#>               TraCE-Sahul_decadal_22k_1500CE_tasmin_03.nc  (4812 layers) 
#>               ... and 4 more sources
#> varnames    : tasmin (Daily Minimum Near-Surface Air Temperatures) 
#>               tasmin (Daily Minimum Near-Surface Air Temperatures) 
#>               tasmin (Daily Minimum Near-Surface Air Temperatures) 
#>               ...
#> names       : tasmin_1, tasmin_2, tasmin_3, tasmin_4, tasmin_5, tasmin_6,   ... 
#> unit        : deg_C 
#> depth       : 1 to 12 (calendar month [calendar month]: 12 steps) 
#> time (raw)  : -20045 to 1989 (2645 steps)

Temporal summarisation

Here we can use the summarise_TraCESahul to aggregate the data in time.

The function is able to summarise an imported TraCESahul SpatRaster to annual, monthly, or seasonal climatologies.

Monthly and seasonal summaries can be taken across the full record or grouped into right-aligned windows of years. When window = 10, the pre-1500 data are already decadal climatologies and so are treated accordingly when deriving annual, monthly, or seasonal outputs. The windows are guaranteed not to mix data from before and after 1500 CE as there is a switch from decadal monthly climatologies to annual monthly data from 1500 CE onwards.

Seasonal summaries follow austral seasons. December is treated as belonging to the following year/timestep to ensure correct DJF alignment, including when time steps are spaced irregularly or when windowed summaries are requested.

Below, the imported data is aggregated to monthly averages using a 30-year right aligned window.

This step is not necessary if you want to use the data “as-is”.

pr_monthly_win <- summarise_TraCESahul(sahul$pr, type = "monthly",
                                       sumfun = "mean", window = 30)
tasmax_monthly_win <- summarise_TraCESahul(sahul$tasmax, type = "monthly",
                                           sumfun = "mean", window = 30)
tasmin_monthly_win <- summarise_TraCESahul(sahul$tasmin, type = "monthly",
                                           sumfun = "mean", window = 30)

lyr_names <- rep(month.abb, nlyr(pr_monthly_win)/12)
names(pr_monthly_win) <- paste0(lyr_names, "_pr")
names(tasmax_monthly_win) <- paste0(lyr_names, "_tasmax")
names(tasmin_monthly_win) <- paste0(lyr_names, "_tasmin")

BIOCLIM calculation

The bioclim_TraCESahul function calculates bioclimatic variables (BIO1–BIO19) from monthly TraCESahul climate data using fastbioclim::derive_bioclim.

The function requires monthly maximum temperature, minimum temperature, and precipitation rasters, performs optional precipitation scaling, processes each year independently, and returns either a list of annual bioclim stacks or collated multi-year rasters for each requested bioclimatic variable.

Ideally, the data should be the output from summarise_TraCESahul as there are a number of checks performed on the data, which are guaranteed to pass if the data has been aggregated with summarise_TraCESahul.

Precipitation data can be converted from a flux (i.e. kg m-2 s-1) to a rate (mm/day) if the units are set as such. The default units for the TraCE-Sahul precipitation data are mm/month, so this step should not be necessary unless you have changed the data. See the pr_scale and new_units arguments for further details.

Data is automatically saved into a output directory that can be specified. If not specified a tempdir() is created and the user is notified of the location of the directory so files can be retrieved. The output directory is automatically created if it does not yet exist.

Optionally, the data can be collated so that each item in the list that is returned is one variable. The alternative collate = FALSE results in each item in the list being all the bioclim variables for a given year.

bioclims_dir <- "~/Desktop/TraCE-Sahul_bioclim"

bioclims <- bioclim_TraCESahul(
  tasmax = tasmax_monthly_win,
  tasmin = tasmin_monthly_win,
  pr = pr_monthly_win,
  outdir = bioclims_dir,
  # default is to use bioclims 1:19
  bioclims = c(1, 4, 5, 6, 7, 12, 13, 14, 17, 18), 
  collate = TRUE)
bioclims
#> class       : SpatRasterDataset 
#> subdatasets : 10 
#> dimensions  : 100, 200 (nrow, ncol)
#> nlyr        : 734, 734, 734, 734, 734, 734, 734, 734, 734, 734 
#> resolution  : 0.05, 0.05  (x, y)
#> extent      : 125, 135, -13, -8  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#> source(s)   : memory 
#> names       : bio01, bio04, bio05, bio06, bio07, bio12, ...

A quick plot shows the first item in the list is all the BIO01 (mean annual temperature) results.

Generate a mask

Here we generate the mask required by the fossil matching function.

# make a mask using bio01
mask_ras <- bioclims$bio01
mask_ras <- ifel(is.na(mask_ras), 0, 1)

Fossil matching

The example fossil dataset data(ex_foss) contains spatial point locations and associated radiocarbon age estimates. The raw object is stored as a SpatVector and needs to be converted into a data table that can be passed to pair_obs.

The terra::unwrap call ensures that all attributes are pulled into the R session, including the age and confidence interval fields.

To prepare the data, the spatial coordinates are extracted from the SpatVector, unique integer IDs are assigned, and the original date estimate and its upper and lower confidence bounds are retained. The resulting data table contains one row per observation, with explicit latitude, longitude, and temporal uncertainty.

data(ex_foss)
ex_foss <- terra::unwrap(ex_foss)

ex_foss <- data.table::data.table(
  ID = 1:nrow(ex_foss),
  Lat = crds(ex_foss)[,2],
  Lon = crds(ex_foss)[,1],
  Age = ex_foss$date_estimate,
  AgeMin = ex_foss$ci_upper,
  AgeMax = ex_foss$ci_lower
)
#>       ID     Lat     Lon    Age AgeMin AgeMax
#>    <int>   <num>   <num>  <num>  <num>  <num>
#> 1:     1 -11.575 132.725 -20900 -20859 -20941
#> 2:     2 -12.575 134.275 -20473 -20441 -20505
#> 3:     3 -12.225 133.875 -18041 -17984 -18097
#> 4:     4 -12.525 130.675 -17595 -17551 -17639
#> 5:     5 -12.725 130.475 -17259 -17233 -17285
#> 6:     6 -12.075 133.375 -16878 -16847 -16908

The expected input structure for pair_obs is a data.table containing:

  1. Lat and Lon in geographic coordinates (decimal degrees)

  2. Age, the central age estimate

  3. AgeMin and AgeMax, representing the youngest and oldest bounds of the age range

  4. any optional additional columns retained for downstream use

pair_obs uses the temporal range (AgeMin–AgeMax) to determine which climate layers should be included during matching.

Environmental matching with pair_obs

pair_obs extracts climate values from time-indexed raster stacks and associates them with each fossil record, accounting for temporal uncertainty, spatial neighbourhoods, and optional distance-based filtering. The function is designed specifically for datasets like TraCE–Sahul, where each raster layer corresponds to a single year (or summary period) and a long continuous time series is available.

The matching workflow includes the following steps:

The function returns a data table containing the original fossil metadata, the matched climate values, the temporal layer selected for each match, and diagnostic information about mask status, neighbourhood selection, and any records that could not be matched (e.g. ages older than the available climate layers).

In the following example, climate values are drawn from BIOCLIM rasters (ras_list = bioclims). A 30-year right-aligned temporal window is used (window = 30), eight-cell neighbourhoods are searched (neigh = 8), and values are averaged (summ_stat = "mean") when multiple neighbouring cells are available. Records more than 50 km from the raster grid are excluded (dist_cut = 50).

The final output contains matched BIOCLIM variables for all fossils whose ages fall within the time span of the dataset.

foss_matched_bioclim <- pair_obs(
  data = ex_foss,
  ras_list = bioclims,
  mask_layer = mask_ras,
  ras_time = terra::time(bioclims$bio01),
  window = 30, # 30-year window
  neigh = 8, # 8-neighbours (rook)
  prec = 2, # 2 decimal places
  summ_stat = "mean",
  dist_cut = 50, # km
  # parallel is disabled and will be set to 1 if any other value is given
  cores = 1L, 
  buff_width = NULL
)
#> Lon/Lat points will be snapped to nearest points on rasters. Cutoff distance = 50 km.
#> Extracting data from rasters sequentially
#> Warning in pair_obs(data = ex_foss, ras_list = bioclims, mask_layer = mask_ras, : 
#> There were 2 samples removed (no temporal match, > cutoff distance, or extraction error).
#> Warning in pair_obs(data = ex_foss, ras_list = bioclims, mask_layer = mask_ras,
#> : Samples removed: 1, 2
foss_matched_bioclim
#>         ID     Lon     Lat   Year bio01  bio04 bio05 bio06 bio07   bio12  bio13
#>      <int>   <num>   <num>  <num> <num>  <num> <num> <num> <num>   <num>  <num>
#>   1:     3 133.875 -12.225 -18095 24.49 180.11 34.47 14.86 19.61 1363.54 370.81
#>   2:     3 133.875 -12.225 -18065 24.51 181.98 34.90 14.90 20.01 1368.16 399.87
#>   3:     3 133.875 -12.225 -18035 24.41 186.44 34.86 14.77 20.09 1382.48 388.12
#>   4:     3 133.875 -12.225 -18005 24.53 166.22 33.70 15.11 18.59 1422.02 388.89
#>   5:     3 133.875 -12.225 -17975 24.53 165.02 33.32 14.90 18.42 1396.50 390.69
#>  ---                                                                           
#> 303:    73 130.675 -12.825    205 27.39 130.10 33.47 20.38 13.09 1579.40 403.04
#> 304:    74 131.675 -12.425    415 27.50 132.18 33.74 21.44 12.29 1514.76 368.52
#> 305:    74 131.675 -12.425    445 27.29 136.44 33.26 21.21 12.04 1526.56 354.88
#> 306:    74 131.675 -12.425    475 27.32 142.11 33.80 21.19 12.61 1523.48 365.83
#> 307:    75 132.325 -11.475   1869 26.81 137.70 30.31 22.38  7.94 1349.16 326.38
#>      bio14 bio17  bio18 LandSea
#>      <num> <num>  <num>   <num>
#>   1:  2.97 10.14 376.92      NA
#>   2:  2.97  9.11 354.28      NA
#>   3:  2.97  9.76 383.52      NA
#>   4:  2.97 10.14 407.75      NA
#>   5:  2.97  9.88 401.89      NA
#>  ---                           
#> 303:  2.97 13.26 591.24      NA
#> 304:  5.73 22.45 523.44      NA
#> 305:  5.60 23.88 537.89      NA
#> 306:  4.87 21.20 527.14      NA
#> 307:  3.28 11.65 249.89      NA

Species distribution modelling

Following data aggregation and record matching, we can create a simple species distribution model to examine how habitat suitability (predicted occurence) has changed through time. This is purely a proof of concept and not suggestive of how the data should be used for creating an SDM.

Here we use downsampled random forests after extracting some background values through time.

The down-sampling approach for random forests ensures that each tree in the Random Forest is trained using an equal number of presence and background samples. This is controlled through the sampsize argument, which accepts a named vector specifying how many observations to draw from each class during bootstrap sampling. Although each tree is fitted with balanced class sizes, the final forest still incorporates the full set of background records because different subsets are selected for each tree across the ensemble.

Background point generation

Here we generate 12,600 background points (50 samples x 252 timesteps)

# which timesteps from the TraCE-Sahul data are included in our fossil matched data
tidx <- which(terra::time(bioclims$bio01) %in% foss_matched_bioclim$Year)

# extract bg points.
## seed is set for reproducability
{set.seed(84564)
bg_points <- pblapply(tidx, function(t) {
  nsamps <- 50
  preds <- rast(lapply(seq_along(bioclims), function(i,...) bioclims[[i]][[t]]))
  names(preds) <- sapply(strsplit(names(preds), "_"), "[", 1)
  data.table::setDT(terra::spatSample(preds, nsamps, na.rm = TRUE))
})}

# collate the sample and round
bg_points <- data.table::rbindlist(bg_points)
cols <- names(bg_points)
bg_points[,(cols) := round(.SD, 2), .SDcols = cols]

A quick looks shows what we generated for the first 5 variables

#>    bio01  bio04 bio05 bio06 bio07
#>    <num>  <num> <num> <num> <num>
#> 1: 23.76  65.19 27.49 21.19  6.29
#> 2: 24.39 104.86 27.65 21.86  5.79
#> 3: 24.76  78.55 26.82 23.19  3.63
#> 4: 24.76  77.05 26.65 23.19  3.46
#> 5: 24.96 115.25 28.15 21.86  6.29
#> 6: 24.84 134.78 30.32 19.03 11.30

Random Forest modelling

Here we generate the training data and build the random forest.

Columns not needed are excluded, and the presence and absence are set as factors so we can do classification modelling.

training <- data.table::rbindlist(list(
  data.table::copy(foss_matched_bioclim)[, -c("ID", "Lon", "Lat", "Year", "LandSea")][, Occ := "1"],
  data.table::copy(bg_points)[, Occ := "0"]
))

training[, Occ := factor(Occ, levels = c("0","1"))]

prNum <- as.numeric(table(training$Occ)["1"])
spsize <- c("0" = prNum, "1" = prNum)

rf_dws <- randomForest(
  Occ ~ .,
  mtry = 5,
  data = training,
  ntree = 1000,
  importance = FALSE,
  sampsize = spsize,
  replace = TRUE)

We can see that the random forest model was moderately accurate using the misclassification matrix

#> 
#> Call:
#>  randomForest(formula = Occ ~ ., data = training, mtry = 5, ntree = 1000,      importance = FALSE, sampsize = spsize, replace = TRUE) 
#>                Type of random forest: classification
#>                      Number of trees: 1000
#> No. of variables tried at each split: 5
#> 
#>         OOB estimate of  error rate: 13.08%
#> Confusion matrix:
#>       0    1 class.error
#> 0 10971 1629   0.1292857
#> 1    59  248   0.1921824

Raster prediction example

Finally, we can predict from the RF model to generate layers of prediction probability (or habitat suitability).

Here we will extract the rasters and predict at a number of timesteps so we can see how probability of occurrence changes through time.

# time steps to extract at
ts <- c(1, 65, 422, 507, 730)

# HS predictions
preds_stack <- rast(lapply(ts, function(t) {
  r <- rast(lapply(seq_along(bioclims), function(i) {
    bioclims[[i]][[t]]
  }))
  # Give the correct names to the layers in each stack
  names(r) <- sapply(strsplit(names(r), "_"), "[", 1)
  # predict prob of occurence
  pr <- predict(r, rf_dws, type = "prob", index = 2)
  names(pr) <- terra::time(bioclims$bio01)[t]
  return(pr)
}))

preds_stack
#> class       : SpatRaster 
#> size        : 100, 200, 5  (nrow, ncol, nlyr)
#> resolution  : 0.05, 0.05  (x, y)
#> extent      : 125, 135, -13, -8  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#> source(s)   : memory
#> names       : -20015, -18095, -7385, -4835,  1869 
#> min values  :  0.000,   0.00, 0.002, 0.022, 0.038 
#> max values  :  0.849,   0.88, 0.845, 0.908, 0.737

Now we can get the true suitability and add it to our predicted suitability for comparison.

# get the true_suit raster
data(true_suit)
true_suit <- unwrap(true_suit)
names(true_suit) <- "true suitability"

# add the true_suit to the preds_stack
preds_stack <- c(preds_stack, true_suit)
preds_stack
#> class       : SpatRaster 
#> size        : 100, 200, 6  (nrow, ncol, nlyr)
#> resolution  : 0.05, 0.05  (x, y)
#> extent      : 125, 135, -13, -8  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#> source(s)   : memory
#> names       : -20015, -18095, -7385, -4835,  1869, true ~ility 
#> min values  :  0.000,   0.00, 0.002, 0.022, 0.038,           0 
#> max values  :  0.849,   0.88, 0.845, 0.908, 0.737,           1

Now we can plot the raster stack of predictions and compare it to the true habitat suitability that was used to generate the virtual species occurrences in the ex_foss records.

We can see that there are changes in habitat suitability through time. It’s also obvious that the Random Forest SDM is not great at reproducing the true suitability of the virtual species, however this is to be expected as the virtual species used a different modelling approach and variables!