This vignette provides a worked example using
TraCESahulMisc to load downscaled TraCE–Sahul climate data,
summarise it in time, calculate BIOCLIM variables, match
palaeo-observations, generate background points, and fit a quick Random
Forest model to examine habitat suitability through time.
For the vignette we need four packages.
library(TraCESahulMisc)
library(terra)
#> terra 1.8.87
library(randomForest)
#> randomForest 4.7-1.2
#> Type rfNews() to see new features/changes/bug fixes.
library(pbapply)
# only for plotting
library(rnaturalearth)
conts <- vect(rnaturalearth::ne_coastline(scale = 50))
conts <- crop(conts, ext(125, 135, -13, -8))This small function will read in the downloaded data to an object
called sahul which will then have names for each of the
variables.
vars <- c("pr", "tasmax", "tasmin")
# function to read in the data for each variable
load_tracesahul <- function(base, var) {
d <- file.path(base, var)
pat_dec <- sprintf(
"TraCE-Sahul_decadal_22k_1500CE_%s_0[1-6]\\.nc$", var)
f_dec <- list.files(d, pattern = pat_dec, full.names = TRUE)
if (length(f_dec) == 0L) {
stop("No valid decadal chunk files (01–06) found for ", var)
}
idx <- sub(
sprintf(".*TraCE-Sahul_decadal_22k_1500CE_%s_(\\d{2})\\.nc$", var),
"\\1", basename(f_dec))
idx_int <- as.integer(idx)
if (!all(idx_int %in% 1:6)) {
stop("Invalid decadal chunk numbering found for ", var,". Only 01–06 are allowed.")
}
if (length(idx_int) != 6L) {
stop("Expected six decadal chunk files (01–06) for ", var, ", but found ", length(idx_int), ".")
}
f_dec <- f_dec[order(idx_int)]
f_1500 <- file.path(d, sprintf("TraCE-Sahul_annual_1500_1990_%s.nc", var))
if (!file.exists(f_1500)) {
stop("1500_1990 file not found for ", var)
}
import_TraCESahul(files = c(f_dec, f_1500), aoi = NULL)
}
sahul <- lapply(vars, function(v) load_tracesahul(base_dir, v))
names(sahul) <- vars
sahul
#> $pr
#> class : SpatRaster
#> size : 100, 200, 31740 (nrow, ncol, nlyr)
#> resolution : 0.05, 0.05 (x, y)
#> extent : 125, 135, -13, -8 (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326)
#> sources : TraCE-Sahul_decadal_22k_1500CE_pr_01.nc (4812 layers)
#> TraCE-Sahul_decadal_22k_1500CE_pr_02.nc (4812 layers)
#> TraCE-Sahul_decadal_22k_1500CE_pr_03.nc (4812 layers)
#> ... and 4 more sources
#> varnames : pr (Precipitation)
#> pr (Precipitation)
#> pr (Precipitation)
#> ...
#> names : pr_1, pr_2, pr_3, pr_4, pr_5, pr_6, ...
#> unit : mm/month
#> depth : 1 to 12 (calendar month [calendar month]: 12 steps)
#> time (raw) : -20045 to 1989 (2645 steps)
#>
#> $tasmax
#> class : SpatRaster
#> size : 100, 200, 31740 (nrow, ncol, nlyr)
#> resolution : 0.05, 0.05 (x, y)
#> extent : 125, 135, -13, -8 (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326)
#> sources : TraCE-Sahul_decadal_22k_1500CE_tasmax_01.nc (4812 layers)
#> TraCE-Sahul_decadal_22k_1500CE_tasmax_02.nc (4812 layers)
#> TraCE-Sahul_decadal_22k_1500CE_tasmax_03.nc (4812 layers)
#> ... and 4 more sources
#> varnames : tasmax (Daily Maximum Near-Surface Air Temperatures)
#> tasmax (Daily Maximum Near-Surface Air Temperatures)
#> tasmax (Daily Maximum Near-Surface Air Temperatures)
#> ...
#> names : tasmax_1, tasmax_2, tasmax_3, tasmax_4, tasmax_5, tasmax_6, ...
#> unit : deg_C
#> depth : 1 to 12 (calendar month [calendar month]: 12 steps)
#> time (raw) : -20045 to 1989 (2645 steps)
#>
#> $tasmin
#> class : SpatRaster
#> size : 100, 200, 31740 (nrow, ncol, nlyr)
#> resolution : 0.05, 0.05 (x, y)
#> extent : 125, 135, -13, -8 (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326)
#> sources : TraCE-Sahul_decadal_22k_1500CE_tasmin_01.nc (4812 layers)
#> TraCE-Sahul_decadal_22k_1500CE_tasmin_02.nc (4812 layers)
#> TraCE-Sahul_decadal_22k_1500CE_tasmin_03.nc (4812 layers)
#> ... and 4 more sources
#> varnames : tasmin (Daily Minimum Near-Surface Air Temperatures)
#> tasmin (Daily Minimum Near-Surface Air Temperatures)
#> tasmin (Daily Minimum Near-Surface Air Temperatures)
#> ...
#> names : tasmin_1, tasmin_2, tasmin_3, tasmin_4, tasmin_5, tasmin_6, ...
#> unit : deg_C
#> depth : 1 to 12 (calendar month [calendar month]: 12 steps)
#> time (raw) : -20045 to 1989 (2645 steps)Here we can use the summarise_TraCESahul to aggregate
the data in time.
The function is able to summarise an imported TraCESahul SpatRaster to annual, monthly, or seasonal climatologies.
Monthly and seasonal summaries can be taken across the full record or
grouped into right-aligned windows of years. When
window = 10, the pre-1500 data are already decadal
climatologies and so are treated accordingly when deriving annual,
monthly, or seasonal outputs. The windows are guaranteed not to mix data
from before and after 1500 CE as there is a switch from decadal monthly
climatologies to annual monthly data from 1500 CE onwards.
Seasonal summaries follow austral seasons. December is treated as belonging to the following year/timestep to ensure correct DJF alignment, including when time steps are spaced irregularly or when windowed summaries are requested.
Below, the imported data is aggregated to monthly averages using a 30-year right aligned window.
This step is not necessary if you want to use the data “as-is”.
pr_monthly_win <- summarise_TraCESahul(sahul$pr, type = "monthly",
sumfun = "mean", window = 30)
tasmax_monthly_win <- summarise_TraCESahul(sahul$tasmax, type = "monthly",
sumfun = "mean", window = 30)
tasmin_monthly_win <- summarise_TraCESahul(sahul$tasmin, type = "monthly",
sumfun = "mean", window = 30)
lyr_names <- rep(month.abb, nlyr(pr_monthly_win)/12)
names(pr_monthly_win) <- paste0(lyr_names, "_pr")
names(tasmax_monthly_win) <- paste0(lyr_names, "_tasmax")
names(tasmin_monthly_win) <- paste0(lyr_names, "_tasmin")The bioclim_TraCESahul function calculates bioclimatic
variables (BIO1–BIO19) from monthly TraCESahul climate data using
fastbioclim::derive_bioclim.
The function requires monthly maximum temperature, minimum temperature, and precipitation rasters, performs optional precipitation scaling, processes each year independently, and returns either a list of annual bioclim stacks or collated multi-year rasters for each requested bioclimatic variable.
Ideally, the data should be the output from
summarise_TraCESahul as there are a number of checks
performed on the data, which are guaranteed to pass if the data has been
aggregated with summarise_TraCESahul.
Precipitation data can be converted from a flux (i.e. kg m-2 s-1) to
a rate (mm/day) if the units are set as such. The default units for the
TraCE-Sahul precipitation data are mm/month, so this step should not be
necessary unless you have changed the data. See the
pr_scale and new_units arguments for further
details.
Data is automatically saved into a output directory that can be
specified. If not specified a tempdir() is created and the
user is notified of the location of the directory so files can be
retrieved. The output directory is automatically created if it does not
yet exist.
Optionally, the data can be collated so that each item in the list
that is returned is one variable. The alternative
collate = FALSE results in each item in the list being all
the bioclim variables for a given year.
bioclims_dir <- "~/Desktop/TraCE-Sahul_bioclim"
bioclims <- bioclim_TraCESahul(
tasmax = tasmax_monthly_win,
tasmin = tasmin_monthly_win,
pr = pr_monthly_win,
outdir = bioclims_dir,
# default is to use bioclims 1:19
bioclims = c(1, 4, 5, 6, 7, 12, 13, 14, 17, 18),
collate = TRUE)
bioclims#> class : SpatRasterDataset
#> subdatasets : 10
#> dimensions : 100, 200 (nrow, ncol)
#> nlyr : 734, 734, 734, 734, 734, 734, 734, 734, 734, 734
#> resolution : 0.05, 0.05 (x, y)
#> extent : 125, 135, -13, -8 (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326)
#> source(s) : memory
#> names : bio01, bio04, bio05, bio06, bio07, bio12, ...
A quick plot shows the first item in the list is all the BIO01 (mean annual temperature) results.
Here we generate the mask required by the fossil matching function.
The example fossil dataset data(ex_foss) contains
spatial point locations and associated radiocarbon age estimates. The
raw object is stored as a SpatVector and needs to be converted into a
data table that can be passed to pair_obs.
The terra::unwrap call ensures that all attributes are
pulled into the R session, including the age and confidence interval
fields.
To prepare the data, the spatial coordinates are extracted from the SpatVector, unique integer IDs are assigned, and the original date estimate and its upper and lower confidence bounds are retained. The resulting data table contains one row per observation, with explicit latitude, longitude, and temporal uncertainty.
data(ex_foss)
ex_foss <- terra::unwrap(ex_foss)
ex_foss <- data.table::data.table(
ID = 1:nrow(ex_foss),
Lat = crds(ex_foss)[,2],
Lon = crds(ex_foss)[,1],
Age = ex_foss$date_estimate,
AgeMin = ex_foss$ci_upper,
AgeMax = ex_foss$ci_lower
)#> ID Lat Lon Age AgeMin AgeMax
#> <int> <num> <num> <num> <num> <num>
#> 1: 1 -11.575 132.725 -20900 -20859 -20941
#> 2: 2 -12.575 134.275 -20473 -20441 -20505
#> 3: 3 -12.225 133.875 -18041 -17984 -18097
#> 4: 4 -12.525 130.675 -17595 -17551 -17639
#> 5: 5 -12.725 130.475 -17259 -17233 -17285
#> 6: 6 -12.075 133.375 -16878 -16847 -16908
The expected input structure for pair_obs is a
data.table containing:
Lat and Lon in geographic coordinates (decimal degrees)
Age, the central age estimate
AgeMin and AgeMax, representing the youngest and oldest bounds of the age range
any optional additional columns retained for downstream use
pair_obs uses the temporal range (AgeMin–AgeMax) to
determine which climate layers should be included during matching.
pair_obspair_obs extracts climate values from time-indexed
raster stacks and associates them with each fossil record, accounting
for temporal uncertainty, spatial neighbourhoods, and optional
distance-based filtering. The function is designed specifically for
datasets like TraCE–Sahul, where each raster layer corresponds to a
single year (or summary period) and a long continuous time series is
available.
The matching workflow includes the following steps:
Identification of candidate climate layers based on fossil ages and the specified temporal window (window). For example, a 30-year window uses the 30 most recent layers preceding the most-recent age estimate (i.e. right aligned).
Spatial filtering using a mask (mask_layer), which ensures that only land pixels or valid climate surfaces are used. Records occurring on land/sea can be filtered post-matching depending on what is required.
Extraction of nearby cell values using the requested
neighbourhood size (neigh). Values may be taken from the
closest cell, or averaged across all neighbouring cells if a summary
statistic is supplied.
Optional distance-based exclusion using dist_cut,
which removes records where the spatial offset between the fossil
location and the raster grid exceeds a specified threshold.
Precision control through the prec argument, which rounds output values.
Parallelisation using cores for improved performance on large datasets. All rasters are wrapped and unwrapped internally to ensure compatibility with parallel workers. This is currently disabled.
The function returns a data table containing the original fossil metadata, the matched climate values, the temporal layer selected for each match, and diagnostic information about mask status, neighbourhood selection, and any records that could not be matched (e.g. ages older than the available climate layers).
In the following example, climate values are drawn from BIOCLIM
rasters (ras_list = bioclims). A 30-year right-aligned
temporal window is used (window = 30), eight-cell
neighbourhoods are searched (neigh = 8), and values are
averaged (summ_stat = "mean") when multiple neighbouring
cells are available. Records more than 50 km from the raster grid are
excluded (dist_cut = 50).
The final output contains matched BIOCLIM variables for all fossils whose ages fall within the time span of the dataset.
foss_matched_bioclim <- pair_obs(
data = ex_foss,
ras_list = bioclims,
mask_layer = mask_ras,
ras_time = terra::time(bioclims$bio01),
window = 30, # 30-year window
neigh = 8, # 8-neighbours (rook)
prec = 2, # 2 decimal places
summ_stat = "mean",
dist_cut = 50, # km
# parallel is disabled and will be set to 1 if any other value is given
cores = 1L,
buff_width = NULL
)
#> Lon/Lat points will be snapped to nearest points on rasters. Cutoff distance = 50 km.
#> Extracting data from rasters sequentially
#> Warning in pair_obs(data = ex_foss, ras_list = bioclims, mask_layer = mask_ras, :
#> There were 2 samples removed (no temporal match, > cutoff distance, or extraction error).
#> Warning in pair_obs(data = ex_foss, ras_list = bioclims, mask_layer = mask_ras,
#> : Samples removed: 1, 2
foss_matched_bioclim
#> ID Lon Lat Year bio01 bio04 bio05 bio06 bio07 bio12 bio13
#> <int> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 3 133.875 -12.225 -18095 24.49 180.11 34.47 14.86 19.61 1363.54 370.81
#> 2: 3 133.875 -12.225 -18065 24.51 181.98 34.90 14.90 20.01 1368.16 399.87
#> 3: 3 133.875 -12.225 -18035 24.41 186.44 34.86 14.77 20.09 1382.48 388.12
#> 4: 3 133.875 -12.225 -18005 24.53 166.22 33.70 15.11 18.59 1422.02 388.89
#> 5: 3 133.875 -12.225 -17975 24.53 165.02 33.32 14.90 18.42 1396.50 390.69
#> ---
#> 303: 73 130.675 -12.825 205 27.39 130.10 33.47 20.38 13.09 1579.40 403.04
#> 304: 74 131.675 -12.425 415 27.50 132.18 33.74 21.44 12.29 1514.76 368.52
#> 305: 74 131.675 -12.425 445 27.29 136.44 33.26 21.21 12.04 1526.56 354.88
#> 306: 74 131.675 -12.425 475 27.32 142.11 33.80 21.19 12.61 1523.48 365.83
#> 307: 75 132.325 -11.475 1869 26.81 137.70 30.31 22.38 7.94 1349.16 326.38
#> bio14 bio17 bio18 LandSea
#> <num> <num> <num> <num>
#> 1: 2.97 10.14 376.92 NA
#> 2: 2.97 9.11 354.28 NA
#> 3: 2.97 9.76 383.52 NA
#> 4: 2.97 10.14 407.75 NA
#> 5: 2.97 9.88 401.89 NA
#> ---
#> 303: 2.97 13.26 591.24 NA
#> 304: 5.73 22.45 523.44 NA
#> 305: 5.60 23.88 537.89 NA
#> 306: 4.87 21.20 527.14 NA
#> 307: 3.28 11.65 249.89 NAFollowing data aggregation and record matching, we can create a simple species distribution model to examine how habitat suitability (predicted occurence) has changed through time. This is purely a proof of concept and not suggestive of how the data should be used for creating an SDM.
Here we use downsampled random forests after extracting some background values through time.
The down-sampling approach for random forests ensures that each tree
in the Random Forest is trained using an equal number of presence and
background samples. This is controlled through the sampsize
argument, which accepts a named vector specifying how many observations
to draw from each class during bootstrap sampling. Although each tree is
fitted with balanced class sizes, the final forest still incorporates
the full set of background records because different subsets are
selected for each tree across the ensemble.
Here we generate 12,600 background points (50 samples x 252 timesteps)
# which timesteps from the TraCE-Sahul data are included in our fossil matched data
tidx <- which(terra::time(bioclims$bio01) %in% foss_matched_bioclim$Year)
# extract bg points.
## seed is set for reproducability
{set.seed(84564)
bg_points <- pblapply(tidx, function(t) {
nsamps <- 50
preds <- rast(lapply(seq_along(bioclims), function(i,...) bioclims[[i]][[t]]))
names(preds) <- sapply(strsplit(names(preds), "_"), "[", 1)
data.table::setDT(terra::spatSample(preds, nsamps, na.rm = TRUE))
})}
# collate the sample and round
bg_points <- data.table::rbindlist(bg_points)
cols <- names(bg_points)
bg_points[,(cols) := round(.SD, 2), .SDcols = cols]A quick looks shows what we generated for the first 5 variables
#> bio01 bio04 bio05 bio06 bio07
#> <num> <num> <num> <num> <num>
#> 1: 23.76 65.19 27.49 21.19 6.29
#> 2: 24.39 104.86 27.65 21.86 5.79
#> 3: 24.76 78.55 26.82 23.19 3.63
#> 4: 24.76 77.05 26.65 23.19 3.46
#> 5: 24.96 115.25 28.15 21.86 6.29
#> 6: 24.84 134.78 30.32 19.03 11.30
Here we generate the training data and build the random forest.
Columns not needed are excluded, and the presence and absence are set as factors so we can do classification modelling.
training <- data.table::rbindlist(list(
data.table::copy(foss_matched_bioclim)[, -c("ID", "Lon", "Lat", "Year", "LandSea")][, Occ := "1"],
data.table::copy(bg_points)[, Occ := "0"]
))
training[, Occ := factor(Occ, levels = c("0","1"))]
prNum <- as.numeric(table(training$Occ)["1"])
spsize <- c("0" = prNum, "1" = prNum)
rf_dws <- randomForest(
Occ ~ .,
mtry = 5,
data = training,
ntree = 1000,
importance = FALSE,
sampsize = spsize,
replace = TRUE)We can see that the random forest model was moderately accurate using the misclassification matrix
#>
#> Call:
#> randomForest(formula = Occ ~ ., data = training, mtry = 5, ntree = 1000, importance = FALSE, sampsize = spsize, replace = TRUE)
#> Type of random forest: classification
#> Number of trees: 1000
#> No. of variables tried at each split: 5
#>
#> OOB estimate of error rate: 13.08%
#> Confusion matrix:
#> 0 1 class.error
#> 0 10971 1629 0.1292857
#> 1 59 248 0.1921824
Finally, we can predict from the RF model to generate layers of prediction probability (or habitat suitability).
Here we will extract the rasters and predict at a number of timesteps so we can see how probability of occurrence changes through time.
# time steps to extract at
ts <- c(1, 65, 422, 507, 730)
# HS predictions
preds_stack <- rast(lapply(ts, function(t) {
r <- rast(lapply(seq_along(bioclims), function(i) {
bioclims[[i]][[t]]
}))
# Give the correct names to the layers in each stack
names(r) <- sapply(strsplit(names(r), "_"), "[", 1)
# predict prob of occurence
pr <- predict(r, rf_dws, type = "prob", index = 2)
names(pr) <- terra::time(bioclims$bio01)[t]
return(pr)
}))
preds_stack
#> class : SpatRaster
#> size : 100, 200, 5 (nrow, ncol, nlyr)
#> resolution : 0.05, 0.05 (x, y)
#> extent : 125, 135, -13, -8 (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326)
#> source(s) : memory
#> names : -20015, -18095, -7385, -4835, 1869
#> min values : 0.000, 0.00, 0.002, 0.022, 0.038
#> max values : 0.849, 0.88, 0.845, 0.908, 0.737Now we can get the true suitability and add it to our predicted suitability for comparison.
# get the true_suit raster
data(true_suit)
true_suit <- unwrap(true_suit)
names(true_suit) <- "true suitability"
# add the true_suit to the preds_stack
preds_stack <- c(preds_stack, true_suit)
preds_stack
#> class : SpatRaster
#> size : 100, 200, 6 (nrow, ncol, nlyr)
#> resolution : 0.05, 0.05 (x, y)
#> extent : 125, 135, -13, -8 (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326)
#> source(s) : memory
#> names : -20015, -18095, -7385, -4835, 1869, true ~ility
#> min values : 0.000, 0.00, 0.002, 0.022, 0.038, 0
#> max values : 0.849, 0.88, 0.845, 0.908, 0.737, 1Now we can plot the raster stack of predictions and compare it to the
true habitat suitability that was used to generate the virtual
species occurrences in the ex_foss records.
We can see that there are changes in habitat suitability through time. It’s also obvious that the Random Forest SDM is not great at reproducing the true suitability of the virtual species, however this is to be expected as the virtual species used a different modelling approach and variables!