Title: | A Unified Framework for Random Forest Prediction Error Estimation |
---|---|
Description: | Estimates the conditional error distributions of random forest predictions and common parameters of those distributions, including conditional misclassification rates, conditional mean squared prediction errors, conditional biases, and conditional quantiles, by out-of-bag weighting of out-of-bag prediction errors as proposed by Lu and Hardin (2021). This package is compatible with several existing packages that implement random forests in R. |
Authors: | Benjamin Lu and Johanna Hardin |
Maintainer: | Benjamin Lu <[email protected]> |
License: | GPL-3 |
Version: | 1.1.0 |
Built: | 2024-11-08 04:12:12 UTC |
Source: | https://github.com/benjilu/foresterror |
Computes each training observation's out-of-bag prediction error using the random forest and, for each tree for which the training observation is out of bag, finds the terminal node of the tree in which the training observation falls.
findOOBErrors(forest, X.train, Y.train = NULL, n.cores = 1)
forest |
The random forest object being used for prediction. |
X.train |
A |
Y.train |
A vector of the responses of the observations that were used
to train |
n.cores |
Number of cores to use (for parallel computation in |
This function accepts classification or regression random forests built using the randomForest, ranger, randomForestSRC, and quantregForest packages. When training the random forest using randomForest, ranger, or quantregForest, keep.inbag must be set to TRUE. When training the random forest using randomForestSRC, membership must be set to TRUE.
A data.table with the following three columns:
tree |
The ID of the tree of the random forest |
terminal_node |
The ID of the terminal node of the tree |
node_errs |
A vector of the out-of-bag prediction errors that fall within the terminal node of the tree |
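The shape of this return value can be illustrated with a small base R sketch. It uses made-up records and a plain data.frame with a list column rather than a data.table, and the names oob, key, and toy_train_nodes are hypothetical. It is not the package's implementation, only a picture of how out-of-bag errors group by tree and terminal node.

```r
# Toy records: one row per (training observation, tree) pair in which the
# observation is out of bag. All values are invented for illustration.
oob <- data.frame(
  obs           = c(1, 2, 3, 1, 3, 4),
  tree          = c(1, 1, 1, 2, 2, 2),
  terminal_node = c(5, 5, 7, 3, 3, 3),
  oob_error     = c(-1.2, 0.4, 2.0, -1.2, 2.0, 0.1)
)

# Group the out-of-bag errors that share a tree and a terminal node.
key <- interaction(oob$tree, oob$terminal_node, drop = TRUE)
node_errs <- split(oob$oob_error, key)

# One row per (tree, terminal_node) pair, with the errors in a list column.
first <- !duplicated(key)
toy_train_nodes <- data.frame(
  tree          = oob$tree[first],
  terminal_node = oob$terminal_node[first]
)
toy_train_nodes$node_errs <- unname(node_errs[as.character(key[first])])

print(toy_train_nodes)
```

Each row then holds every out-of-bag error that landed in that node of that tree, which mirrors the three columns described above.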
Benjamin Lu <[email protected]>; Johanna Hardin <[email protected]>
# attach the package that provides findOOBErrors and quantForestError
library(forestError)

# load data
data(airquality)

# remove observations with missing predictor variable values
airquality <- airquality[complete.cases(airquality), ]

# get number of observations and the response column index
n <- nrow(airquality)
response.col <- 1

# split data into training and test sets
train.ind <- sample(1:n, n * 0.9, replace = FALSE)
Xtrain <- airquality[train.ind, -response.col]
Ytrain <- airquality[train.ind, response.col]
Xtest <- airquality[-train.ind, -response.col]

# fit random forest to the training data
rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5,
                                 ntree = 500, keep.inbag = TRUE)

# compute out-of-bag prediction errors and locate each
# training observation in the trees for which it is out
# of bag
train_nodes <- findOOBErrors(rf, Xtrain)

# estimate conditional mean squared prediction errors,
# biases, prediction intervals, and error distribution
# functions for the test observations. provide
# train_nodes to avoid recomputing that step.
output <- quantForestError(rf, Xtrain, Xtest,
                           train_nodes = train_nodes)
Returns probabilities from the estimated conditional cumulative distribution function of the prediction error associated with each test observation.
perror(q, xs)
q |
A vector of quantiles. |
xs |
A vector of the indices of the test observations for which the
conditional error CDFs are desired. Defaults to all test observations
given in the call of |
This function is only defined as output of the quantForestError function. It is not exported as a standalone function. See the example.

If either q or xs has length one, then a vector is returned with the desired probabilities. If both have length greater than one, then a data.frame of probabilities is returned, with rows corresponding to the inputted xs and columns corresponding to the inputted q.
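As a conceptual illustration of a conditional error CDF built from weighted out-of-bag errors, the base R sketch below evaluates a weighted empirical CDF for a single test observation. The errors, the weights, and the helper toy_perror are all hypothetical; the package's own weighting and its handling of ties may differ.

```r
# Hypothetical out-of-bag errors and their weights for one test observation.
errs    <- c(-6, -3, 0, 2, 5, 9)
weights <- c(0.10, 0.20, 0.25, 0.20, 0.15, 0.10)  # weights sum to 1

# Weighted empirical CDF: total weight of the errors at or below each q.
toy_perror <- function(q, errs, weights) {
  vapply(q, function(qk) sum(weights[errs <= qk]), numeric(1))
}

# Estimated probabilities that the error is at most -4 and at most 7.
probs <- toy_perror(c(-4, 7), errs, weights)
print(probs)
```

With one test observation and two values of q, the result is a plain numeric vector, matching the length-one case described above.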
Benjamin Lu <[email protected]>; Johanna Hardin <[email protected]>
# attach the package that provides quantForestError
library(forestError)

# load data
data(airquality)

# remove observations with missing predictor variable values
airquality <- airquality[complete.cases(airquality), ]

# get number of observations and the response column index
n <- nrow(airquality)
response.col <- 1

# split data into training and test sets
train.ind <- sample(1:n, n * 0.9, replace = FALSE)
Xtrain <- airquality[train.ind, -response.col]
Ytrain <- airquality[train.ind, response.col]
Xtest <- airquality[-train.ind, -response.col]
Ytest <- airquality[-train.ind, response.col]

# fit random forest to the training data
rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5,
                                 ntree = 500, keep.inbag = TRUE)

# estimate conditional error distribution functions
output <- quantForestError(rf, Xtrain, Xtest,
                           what = c("p.error", "q.error"))

# get the probability that the error associated with each test
# prediction is less than -4 and the probability that the error
# associated with each test prediction is less than 7
output$perror(c(-4, 7))

# same as above but only for the first three test observations
output$perror(c(-4, 7), 1:3)
Returns quantiles of the estimated conditional error distribution associated with each test prediction.
qerror(p, xs)
p |
A vector of probabilities. |
xs |
A vector of the indices of the test observations for which the
conditional error quantiles are desired. Defaults to all test observations
given in the call of |
This function is only defined as output of the quantForestError function. It is not exported as a standalone function. See the example.

If either p or xs has length one, then a vector is returned with the desired quantiles. If both have length greater than one, then a data.frame of quantiles is returned, with rows corresponding to the inputted xs and columns corresponding to the inputted p.
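A quantile of a weighted error distribution can be sketched in base R as the smallest error whose cumulative weight reaches the requested probability. The data and the helper toy_qerror below are hypothetical, and the inversion rule is an assumption made for illustration; the package may interpolate or break ties differently.

```r
# Hypothetical out-of-bag errors and their weights for one test observation.
errs    <- c(-6, -3, 0, 2, 5, 9)
weights <- c(0.10, 0.20, 0.25, 0.20, 0.15, 0.10)  # weights sum to 1

# Weighted quantile: smallest error whose cumulative weight is at least p.
toy_qerror <- function(p, errs, weights) {
  ord <- order(errs)
  sorted_errs <- errs[ord]
  cum_weight <- cumsum(weights[ord])
  # small tolerance guards against floating-point error in the running sum
  vapply(p, function(pk) sorted_errs[which(cum_weight >= pk - 1e-12)[1]],
         numeric(1))
}

# 0.25 and 0.8 quantiles of the toy error distribution.
quants <- toy_qerror(c(0.25, 0.8), errs, weights)
print(quants)
```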
Benjamin Lu <[email protected]>; Johanna Hardin <[email protected]>
# attach the package that provides quantForestError
library(forestError)

# load data
data(airquality)

# remove observations with missing predictor variable values
airquality <- airquality[complete.cases(airquality), ]

# get number of observations and the response column index
n <- nrow(airquality)
response.col <- 1

# split data into training and test sets
train.ind <- sample(1:n, n * 0.9, replace = FALSE)
Xtrain <- airquality[train.ind, -response.col]
Ytrain <- airquality[train.ind, response.col]
Xtest <- airquality[-train.ind, -response.col]
Ytest <- airquality[-train.ind, response.col]

# fit random forest to the training data
rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5,
                                 ntree = 500, keep.inbag = TRUE)

# estimate conditional error distribution functions
output <- quantForestError(rf, Xtrain, Xtest,
                           what = c("p.error", "q.error"))

# get the 0.25 and 0.8 quantiles of the error distribution
# associated with each test prediction
output$qerror(c(0.25, 0.8))

# same as above but only for the first three test observations
output$qerror(c(0.25, 0.8), 1:3)
Estimates the conditional misclassification rates, conditional mean squared prediction errors, conditional biases, conditional prediction intervals, and conditional error distributions of random forest predictions.
quantForestError(
  forest,
  X.train,
  X.test,
  Y.train = NULL,
  what = if (grepl("class", c(forest$type, forest$family, forest$treetype), TRUE))
    "mcr" else c("mspe", "bias", "interval", "p.error", "q.error"),
  alpha = 0.05,
  train_nodes = NULL,
  return_train_nodes = FALSE,
  n.cores = 1
)
forest |
The random forest object being used for prediction. |
X.train |
A |
X.test |
A |
Y.train |
A vector of the responses of the observations that were used
to train |
what |
A vector of characters indicating what estimates are desired.
Possible options are conditional mean squared prediction errors ( |
alpha |
A vector of type-I error rates desired for the conditional prediction
intervals; required if |
train_nodes |
A |
return_train_nodes |
A boolean indicating whether to return the
|
n.cores |
Number of cores to use (for parallel computation in |
This function accepts classification or regression random forests built using the randomForest, ranger, randomForestSRC, and quantregForest packages. When training the random forest using randomForest, ranger, or quantregForest, keep.inbag must be set to TRUE. When training the random forest using randomForestSRC, membership must be set to TRUE.

The predictions computed by ranger can be parallelized by setting the value of n.cores to be greater than 1.
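A ranger-based call might look like the sketch below. It assumes the ranger and forestError packages are installed (the block is skipped otherwise), passes Y.train explicitly as a precaution, and picks n.cores = 2 arbitrarily. The names aq, train, and test are introduced here for illustration only.

```r
# Runs only if both packages are available; this is an assumption of the sketch.
if (requireNamespace("ranger", quietly = TRUE) &&
    requireNamespace("forestError", quietly = TRUE)) {

  data(airquality)
  aq <- airquality[complete.cases(airquality), ]

  # split into training and test rows (Ozone, column 1, is the response)
  set.seed(1)
  train.ind <- sample(seq_len(nrow(aq)), floor(0.9 * nrow(aq)))
  train <- aq[train.ind, ]
  test  <- aq[-train.ind, ]

  # keep.inbag = TRUE is required so out-of-bag status can be recovered
  rf <- ranger::ranger(Ozone ~ ., data = train, num.trees = 500,
                       keep.inbag = TRUE)

  # request only MSPE and intervals; n.cores > 1 asks for parallel prediction
  out <- forestError::quantForestError(
    rf,
    X.train = train[, -1],
    X.test  = test[, -1],
    Y.train = train$Ozone,
    what    = c("mspe", "interval"),
    n.cores = 2
  )
  print(head(out))
}
```

Because neither "p.error" nor "q.error" is requested and return_train_nodes is left at FALSE, the result here should be a single data.frame rather than a list.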
The random forest predictions are always returned as a data.frame. Additional columns are included in the data.frame depending on the user's selections in the argument what. In particular, including "mspe" in what adds a column with the conditional mean squared prediction error of each test prediction; including "bias" in what adds a column with the conditional bias of each test prediction; including "interval" in what adds columns with the lower and upper bounds of conditional prediction intervals for each test prediction; and including "mcr" in what adds a column with the conditional misclassification rate of each test prediction. The conditional misclassification rate can be estimated only for classification random forests, while the other parameters can be estimated only for regression random forests.
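To make the regression quantities more concrete, the base R sketch below computes a weighted mean squared error and a prediction interval for one test observation from hypothetical weighted out-of-bag errors. It assumes that errors are measured as observed minus predicted and that the interval is the prediction shifted by the alpha/2 and 1 - alpha/2 error quantiles. Both are assumptions of this sketch, as are the helper toy_quantile and all values; the package's conventions may differ.

```r
# Hypothetical prediction, out-of-bag errors, and weights for one test point.
pred    <- 40
errs    <- c(-6, -3, 0, 2, 5, 9)   # assumed convention: observed - predicted
weights <- c(0.10, 0.20, 0.25, 0.20, 0.15, 0.10)  # weights sum to 1

# Conditional MSPE sketch: weighted average of the squared errors.
toy_mspe <- sum(weights * errs^2)

# Weighted quantile: smallest error whose cumulative weight is at least p.
toy_quantile <- function(p, errs, weights) {
  ord <- order(errs)
  cum_weight <- cumsum(weights[ord])
  errs[ord][which(cum_weight >= p - 1e-12)[1]]
}

# Interval sketch for alpha = 0.2: shift the prediction by error quantiles.
alpha <- 0.2
toy_lower <- pred + toy_quantile(alpha / 2, errs, weights)
toy_upper <- pred + toy_quantile(1 - alpha / 2, errs, weights)

print(c(mspe = toy_mspe, lower = toy_lower, upper = toy_upper))
```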
If "p.error" or "q.error" is included in what, or if return_train_nodes is set to TRUE, then a list will be returned as output. The first element of the list, named "estimates", is the data.frame described in the above paragraph. The other elements of the list are the estimated cumulative distribution functions (perror) of the conditional error distributions, the estimated quantile functions (qerror) of the conditional error distributions, and/or a data.table indicating what out-of-bag prediction errors each terminal node of each tree in the random forest contains.
A data.frame with one or more of the following columns, as described in the details section:
pred |
The random forest predictions of the test observations |
mspe |
The estimated conditional mean squared prediction errors of the random forest predictions |
bias |
The estimated conditional biases of the random forest predictions |
lower_alpha |
The estimated lower bounds of the conditional alpha-level prediction intervals for the test observations |
upper_alpha |
The estimated upper bounds of the conditional alpha-level prediction intervals for the test observations |
mcr |
The estimated conditional misclassification rate of the random forest predictions |
In addition, one or both of the following functions, as described in the details section:
perror |
The estimated cumulative distribution functions of the conditional error distributions associated with the test predictions |
qerror |
The estimated quantile functions of the conditional error distributions associated with the test predictions |
In addition, if return_train_nodes is TRUE, then a data.table called train_nodes is returned, indicating which out-of-bag prediction errors each terminal node of each tree in forest contains.
Benjamin Lu <[email protected]>; Johanna Hardin <[email protected]>
# attach the package that provides quantForestError
library(forestError)

# load data
data(airquality)

# remove observations with missing predictor variable values
airquality <- airquality[complete.cases(airquality), ]

# get number of observations and the response column index
n <- nrow(airquality)
response.col <- 1

# split data into training and test sets
train.ind <- sample(c("A", "B", "C"), n,
                    replace = TRUE, prob = c(0.8, 0.1, 0.1))
Xtrain <- airquality[train.ind == "A", -response.col]
Ytrain <- airquality[train.ind == "A", response.col]
Xtest1 <- airquality[train.ind == "B", -response.col]
Xtest2 <- airquality[train.ind == "C", -response.col]

# fit regression random forest to the training data
rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5,
                                 ntree = 500, keep.inbag = TRUE)

# estimate conditional mean squared prediction errors,
# biases, prediction intervals, and error distribution
# functions for the observations in Xtest1. return
# train_nodes to avoid recomputation in the next
# line of code.
output1 <- quantForestError(rf, Xtrain, Xtest1,
                            return_train_nodes = TRUE)

# estimate just the conditional mean squared prediction errors
# and prediction intervals for the observations in Xtest2.
# avoid recomputation by providing train_nodes from the
# previous line of code.
output2 <- quantForestError(rf, Xtrain, Xtest2,
                            what = c("mspe", "interval"),
                            train_nodes = output1$train_nodes)

# for illustrative purposes, convert response to categorical
Ytrain <- as.factor(Ytrain > 31.5)

# fit classification random forest to the training data
rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 3,
                                 ntree = 500, keep.inbag = TRUE)

# estimate conditional misclassification rate of the
# predictions of Xtest1
output <- quantForestError(rf, Xtrain, Xtest1)