Main Package Functions
This package contains the functions necessary to compute Kernel Linear Regression (KLR) on mean kernel embeddings, functions for preparing site and background data, and a function for simulate archaeological site location data.
KLR_funs.R for Fitting and Predicting on Tabular Data
build_K
- Function takes list of training data, scalar value forsigma
hyperparameter, and a distance method to compute a mean embedding similarity kernel. This kernel is a pair-wise (N x N) matrix of the mean similarity between the attributes describing each site location and background group. Optional inouts areprogress
for a progress bar anddist_metric
for the distance computation. By default, the distance metric iseuclidean
and should likely stay as such unless you have explored other distances and know why/how you want to use them.KLR
- Function takes the similarity kernel matrixK
, a vector of presence/absence coded as1
or0
, and a scalar value for thelambda
regularizing hyperparameter; optionally values for maximum iterations and threshold. This function performs Kernel Logistic Regression (KLR) via Iterative Re-weighted Least Squares (IRLS). The objective is to approximate a set of parameters that minimize the negative likelihood of the parameters given the data and response. This function returns a list ofpred
, the estimated response (probability of site-presence) for the training data, andalphas
, the approximated parameters from the IRLS algorithm.KLR_predict
- Function takes a list of the training data, a list of the testing data, a vector of the approximatedalphas
parameters, a scalar value for thesigma
kernel hyperparameter, and a distance method. This function predicts the probability of site presence for new observations based on the training data andalphas
parameters. This is accomplished by building thek*k
kernel matrix as the similarity between the training test data then computing the inverse logit ofk*k %*% alphas
. The output is the predicted probability of site presence for each training data example.
raster_predict_funs.R for Predicting on Raster Data
rescale_sim_raster
- Function that rescales simulated rasters fromNLMR::nlm_gaussianfield
[@NLMR] or whatever you want to use, to the mean and standard deviation of the simulated data used to fit theklr
model. You will have to add themean
andsd
arguments manually based on what you put into theget_sim_data
function. The example in the code above inputs the defaultmean
andsd
values from the defualts of theget_sim_data
function. Returned is a raster scaled too your simualted training data.sim_trend
- Function is used to take createn
number of simulated site locations ofsize
cell dimensions on arows
bycols
raster. The latter two arguments should match the size of your simulated rasters. The function randomly locates the sites and then creates a distance gradient (trend) from the site locations outward. The trend is a value1
at the sites and reduces to0
at the maximum combined distance from all sites. The output of this function is alist
of amatrix
of simulated site x/y coordinates (centers) and araster
of the trend surface. The point of the trend is to then combine it with the simulated rasters (as down in the code above) such that the raster depicting site-likely conditions is multiplied by the trend to output a raster retaining site-likely conditions near simulated site locations. Conversely, the site-unlikely simulated raster is multiplied by the inverse of the trend to result in a raster retaining site-unlikely characteristics away from the site locations. When those two rasters are added you get a simulated environment that is more preferable to site locations near site locations. It is a bit involved for something that had nothing to do with the actual KLRfome model, but it is needed to produce actual correlated environments for model testing.scale_prediction_rasters
- Function scales your predictorrater stack
based on theparams
list created in the model fitting process. This script simply loops over the rasters in the stack and centers and scales based on mean and sd of the training data used to fit theklr
model. The function outputs araster stack
.KLR_raster_predict
- Function predicts the probability of site-presence based on araster stack
of center/scaled predictor rasters, a focal neighborhood size in cells asngb
, and theparams
list of model fit parameters. Finally, the function also needs to the the number of columns ascols
and rows asrows
of the study areas raster stack. The rest of the arguments default to predicting the entire raster stack in one pass and only on a single core (not in parallel). The rest of the argument control whether the study area is split (split = TRUE
) into a grid of blocks. Theppside
positive integer controls the number of blocks along each axis of the study area. If you wish to compute the prediction in parallel, you will need to split it into blocks so that each block can be sent to a different processor core. The final set of optional arguments control how the prediction is returned, either as aoutput = "list"
oroutput = "save"
for returning a list of rasters of saving each out as a GeoTiff, and then arguments for the GeoTiff location withsave_loc
and whether to overwrite existing GeoTiffs withoverwrite
. The function contains a bit of logic to try and assist the user in which arguments go with what. Perhaps future versions will streamline this a bit.
format_site_data.R for Data Formatting
-
format_site_data
- Function takes adata.frame
of presence/background observations. The columnpresence
indicated site presence of background as1
or0
respectively. TheSITENO
column indicates either a site identifier orbackground
. The remaining columns should be measurements of environmental or cultural variables for modeling similarity. It is expected that each site will have multiple rows indicating multiple measurements of environmental variables from within the site boundary or land form. The function centers and scales the predictor variables, sub-samples the data into training and test sets, and re-formats the data as both tabular format and list format for both training and test data sets. Importantly, the function makes sure that no single site is in both the training and test set. No test sites are used to train the model. Also returned is the mean and standard deviation of the data so that new data can be center and scaled to the same parameters.
get_sim_data.R for Simulating Site Location Data
-
get_sim_data
- Function takes a mean and SD for two simulated predictor variables for each of sites and background. With archaeological site data being a protected data set in many settings (including this project), it cannot be freely shared. However, it is difficult to implement this model without properly format data. This function simulates archaeological site data and background environmental data for testing or tuning of this model. The concept of the function is to simulate two predictor variables for both sites and background. The inputs to the function (defaults provided) control how similar or different the site data is from the background. The model can be tested on simulated site data that is very similar to the background or very different; the model behaves accordingly. The output of the function is formatted in the same way as theformat_site_data
function.
metrics.R for Calculating Performance Metrics
CM_quads
- Function takes a data.frame of predicted probabilities (must be calledpred
) and actual observed presence/absence (must be calledpresence
and contain1 = presence
and0 = absence
). Optional input is a threshold value or values in the form of a single scalar (default = 0.5) or a numeric vector to loop over. The function calculates the True Positive, True Negative, False Positive, and False Negative quantities at each threshold. These quantities are typically shown in a 2x2 confusion matrix, hence the name Confusion Matrix (CM) quadrants. The output is a data.frame with columns for theTP
,FP
,TN
, andFN
at each desiredThreshold
. This is useful for figuring out how the model performs at a variety of thresholds. In the example code above I also compute theAUC
statistic that gives an indication of model performance across all thresholds.metrics
- Function provides a large quantity of metrics based on the quadrants of the Confusion Matrix derived from a threshold of the predicted probability values. The input to this function is theTP
,TN
,FP
, andFN
quantities of a thresholded model as derived from theCM_quads
function. The output of this model is a list of51
different metrics such asSensitivity
,Youden's J
,Positive Predictive Gain
, and many more. Some are commonplace and others are practically unknown. I put a few note in the function as to where they come from. The function returns all51
as a list.
plot.R for Plotting Similarity Kernel
-
K_corrplot
- Function is a pretty simple wrapper aroundcorrplot::corrplot()
with hierarchical clustering. Inputs for this function are the similarity matrixK
and thetrain_data
with the option of an integer for the number of clusters to partition the data; default = 4. Output is a plot or if assigned to an object, it is an object of classmatrix
.