Title: | Sequential Imputation with Bayesian Trees Mixed-Effects Models for Longitudinal Data |
---|---|
Description: | Implements a sequential imputation framework using Bayesian Mixed-Effects Trees ('SBMTrees') for handling missing data in longitudinal studies. The package supports a variety of models, including non-linear relationships and non-normal random effects and residuals, leveraging Dirichlet Process priors for increased flexibility. Key features include handling Missing at Random (MAR) longitudinal data, imputation of both covariates and outcomes, and generating posterior predictive samples for further analysis. The methodology is designed for applications in epidemiology, biostatistics, and other fields requiring robust handling of missing data in longitudinal settings. |
Authors: | Jungang Zou [aut, cre], Liangyuan Hu [aut], Robert McCulloch [ctb], Rodney Sparapani [ctb], Charles Spanbauer [ctb] |
Maintainer: | Jungang Zou <[email protected]> |
License: | GPL-2 |
Version: | 1.2 |
Built: | 2024-12-11 01:26:26 UTC |
Source: | https://github.com/zjg540066169/sbmtrees |
The SBMTrees package implements a Bayesian non-parametric framework for imputing missing covariates and outcomes in longitudinal data under the Missing at Random (MAR) assumption. Its core model, the Bayesian Trees Mixed-Effects Model (BMTrees), extends Mixed-Effects BART by employing centralized Dirichlet Process (CDP) Normal Mixture priors. This allows handling non-normal random effects and errors, addressing model misspecification, and capturing complex relationships.
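To make the modeling target concrete, the outcome for subject i at occasion j can be written schematically (a sketch of the general mixed-effects BART form described above, not necessarily the package's exact parameterization) as

$$y_{ij} = f(x_{ij}) + z_{ij}^{\top} b_i + \epsilon_{ij},$$

where $f$ is a BART sum-of-trees function of the covariates, $b_i$ are subject-level random effects, and $\epsilon_{ij}$ are residual errors; BMTrees places CDP normal mixture priors on both $b_i$ and $\epsilon_{ij}$ instead of the usual normality assumptions.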
SBMTrees offers tools for predicting and imputing missing values in longitudinal data using Bayesian Trees Mixed-Effects Models. The package supports various semiparametric variants, including BMTrees_R and BMTrees_RE, and integrates mixedBART as a baseline model. Key functionalities include:
- BMTrees_prediction: Predicts longitudinal outcomes based on mixed-effects models.
- sequential_imputation: Imputes missing covariates and outcomes sequentially in longitudinal datasets.
The package supports flexibility in specifying priors for random effects and errors, making it suitable for diverse longitudinal data settings. Core computations leverage efficient Gibbs samplers implemented in C++.
This package modifies and extends C++ code originally derived from the BART3 package, developed by Rodney Sparapani, which is licensed under the GNU General Public License version 2 (GPL-2).
The modified code is redistributed in accordance with the GPL-2 license. For more details on the modifications, see the package's documentation.
This package and all associated documentation are licensed under the GNU General Public License version 2 (GPL-2). See the LICENSE file for the full text of the license.
Jungang Zou <[email protected]>
BART3 package: https://github.com/rsparapa/bnptools/tree/master, originally developed by Rodney Sparapani.
See also: BMTrees_prediction, sequential_imputation
```r
## Example of predicting longitudinal outcomes
## To make it faster to compile and check, we only run 30 iterations for
## burn-in and 40 for posterior sampling phases.
## Please increase to 3000 and 4000 iterations, respectively, when running the models.
data <- simulation_prediction(n_subject = 100, seed = 1234, nonlinear = TRUE,
                              nonrandeff = TRUE, nonresidual = TRUE)
X_train <- data$X_train
Y_train <- data$Y_train
Z_train <- data$Z_train
subject_id_train <- data$subject_id_train
X_test <- data$X_test
Z_test <- data$Z_test
subject_id_test <- data$subject_id_test

model <- BMTrees_prediction(X_train, Y_train, Z_train, subject_id_train,
                            X_test, Z_test, subject_id_test, model = "BMTrees",
                            nburn = 30L, npost = 40L, seed = 1234)
model$post_predictive_y_test

## Sequential imputation of missing covariates and outcomes
data2 <- simulation_imputation(n_subject = 100, seed = 1234, nonrandeff = TRUE,
                               nonresidual = TRUE, alligned = FALSE)
X_mis <- data2$X_mis            # covariates with missing values
Y_mis <- data2$Y_mis            # outcomes with missing values
Z <- data2$Z                    # complete random predictors
subject_id <- data2$subject_id  # subject IDs

model2 <- sequential_imputation(X_mis, Y_mis, Z, subject_id, rep(0, 9), FALSE,
                                model = "BMTrees", nburn = 30L, npost = 40L,
                                skip = 2L, verbose = TRUE, seed = 1234)
model2$imputed_data
model2$imputed_data[, , 10]     # imputed values of the outcome (last slice)
```
Imputes missing values in a matrix by applying Last Observation Carried Forward (LOCF) followed by Next Observation Carried Backward (NOCB) for each subject.
```r
apply_locf_nocb(X, subject_id)
```
Argument | Description
---|---
X | A matrix where rows represent observations and columns represent variables.
subject_id | A vector of subject IDs corresponding to the rows of X.
A matrix with missing values imputed using LOCF and NOCB.
```r
X <- matrix(c(NA, 2, NA, 4, 5, NA, 7, 8, NA, NA), nrow = 5, byrow = TRUE)
subject_id <- c(1, 1, 1, 2, 2)
apply_locf_nocb(X, subject_id)
```
Provides predictions for outcomes in longitudinal data using Bayesian Trees Mixed-Effects Models (BMTrees) and its semiparametric variants. The function predicts values for test data while accounting for random effects, complex relationships, and potential model misspecification.
```r
BMTrees_prediction(
  X_train, Y_train, Z_train, subject_id_train, X_test, Z_test, subject_id_test,
  model = c("BMTrees", "BMTrees_R", "BMTrees_RE", "mixedBART"), binary = FALSE,
  nburn = 3000L, npost = 4000L, skip = 1L, verbose = TRUE, seed = NULL,
  tol = 1e-20, resample = 5, ntrees = 200, pi_CDP = 0.99
)
```
Argument | Description
---|---
X_train | A matrix of covariates in the training set.
Y_train | A numeric or logical vector of outcomes in the training set.
Z_train | A matrix of random predictors in the training set.
subject_id_train | A character vector of subject IDs in the training set.
X_test | A matrix of covariates in the testing set.
Z_test | A matrix of random predictors in the testing set.
subject_id_test | A character vector of subject IDs in the testing set.
model | A character string specifying the predictive model. Options are "BMTrees", "BMTrees_R", "BMTrees_RE", and "mixedBART". Default: "BMTrees".
binary | Logical. Indicates whether the outcome is binary (TRUE) or continuous (FALSE). Default: FALSE.
nburn | An integer specifying the number of burn-in iterations for the Gibbs sampler. Default: 3000L.
npost | An integer specifying the number of posterior samples to collect. Default: 4000L.
skip | An integer indicating the thinning interval for MCMC samples. Default: 1L.
verbose | Logical. If TRUE, displays progress and MCMC information. Default: TRUE.
seed | An optional integer for setting the random seed to ensure reproducibility. Default: NULL.
tol | A numeric tolerance value to prevent numerical overflow and underflow in the model. Default: 1e-20.
resample | An integer specifying the number of resampling steps for the CDP prior. Default: 5.
ntrees | An integer specifying the number of trees in BART. Default: 200.
pi_CDP | A value between 0 and 1 for calculating the empirical prior in the CDP prior. Default: 0.99.
A list containing posterior samples and predictions:
- Posterior samples of the fixed effects from BART on the training data.
- Posterior samples of the covariance matrices of the random effects.
- Posterior samples of the lambda parameter of the CDP normal mixture on the random errors.
- Posterior samples of the lambda parameter of the CDP normal mixture on the random effects.
- Posterior samples of the coefficients of the random effects.
- Posterior samples of the random effects for the training data.
- Posterior samples of the error (residual) standard deviation.
- Posterior expectations of the training outcomes, equal to fixed effects + random effects.
- Posterior expectations of the testing outcomes, equal to fixed effects + random effects.
- Posterior predictive distributions for the training outcomes, equal to fixed effects + random effects + predictive residual.
- Posterior predictive distributions for the testing outcomes, equal to fixed effects + random effects + predictive residual.
- Posterior samples of the location parameters of the CDP normal mixture on the random errors.
- Posterior samples of the location parameters of the CDP normal mixture on the random effects.
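As a minimal sketch of how this output might be summarized (assuming `model` is the list returned by BMTrees_prediction() and that the first dimension of each stored component indexes posterior draws):

```r
## A minimal sketch: summarize posterior output from a fitted BMTrees_prediction object.
## Assumes rows of post_predictive_y_test index posterior draws and columns index test rows.
str(model$post_sigma)                                    # inspect how the error-scale draws are stored

y_test_hat <- colMeans(model$post_predictive_y_test)     # posterior-mean prediction per test observation
head(y_test_hat)
```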
This function utilizes modified C++ code originally derived from the BART3 package (Bayesian Additive Regression Trees). The original package was developed by Rodney Sparapani and is licensed under GPL-2. Modifications were made by Jungang Zou, 2024.
For more information about the original BART3 package, see: https://github.com/rsparapa/bnptools/tree/master/BART3
```r
data <- simulation_prediction(n_subject = 100, seed = 1234, nonlinear = TRUE,
                              nonrandeff = TRUE, nonresidual = TRUE)
# To make it faster to compile and check, we only run 30 iterations for burn-in
# and 40 for posterior sampling phases.
# Please increase to 3000 and 4000 iterations, respectively, when running the model.
model <- BMTrees_prediction(data$X_train, data$Y_train, data$Z_train,
                            data$subject_id_train, data$X_test, data$Z_test,
                            data$subject_id_test, model = "BMTrees", binary = FALSE,
                            nburn = 30L, npost = 40L, skip = 1L, verbose = TRUE,
                            seed = 1234)
model$post_predictive_y_test
model$post_sigma
```
Implements sequential imputation for missing covariates and outcomes in longitudinal data. The function uses a Bayesian non-parametric framework with mixed-effects models to handle both normal and non-normal random effects and errors. It sequentially imputes missing values by constructing univariate models in a fixed order, ensuring simplicity and consistency with a valid joint distribution.
```r
sequential_imputation(
  X, Y, Z = NULL, subject_id, type, binary_outcome = FALSE,
  model = c("BMTrees", "BMTrees_R", "BMTrees_RE", "mixedBART"),
  nburn = 0L, npost = 3L, skip = 1L, verbose = TRUE, seed = NULL,
  tol = 1e-20, resample = 5, ntrees = 200, reordering = TRUE, pi_CDP = 0.99
)
```
Argument | Description
---|---
X | A matrix of covariates with missing values.
Y | A vector of outcomes (numeric or logical) with missing values.
Z | A matrix of complete random predictors. Default: NULL.
subject_id | A vector of subject IDs corresponding to the rows of X and Y.
type | A logical vector indicating whether each covariate in X is binary (TRUE) or continuous (FALSE).
binary_outcome | A logical value indicating whether the outcome Y is binary (TRUE) or continuous (FALSE). Default: FALSE.
model | A character string specifying the imputation model. Options are "BMTrees", "BMTrees_R", "BMTrees_RE", and "mixedBART". Default: "BMTrees".
nburn | An integer specifying the number of burn-in iterations. Default: 0L.
npost | An integer specifying the number of sampling iterations. Default: 3L.
skip | An integer specifying the interval for keeping samples in the sampling phase. Default: 1L.
verbose | A logical value indicating whether to display progress and MCMC information. Default: TRUE.
seed | A random seed for reproducibility. Default: NULL.
tol | A small numerical tolerance to prevent numerical overflow or underflow in the model. Default: 1e-20.
resample | An integer specifying the number of resampling steps for the CDP prior. Default: 5.
ntrees | An integer specifying the number of trees in BART. Default: 200.
reordering | A logical value indicating whether to apply a reordering strategy for sorting covariates. Default: TRUE.
pi_CDP | A value between 0 and 1 for calculating the empirical prior in the CDP prior. Default: 0.99.
The function builds on the Bayesian Trees Mixed-Effects Model (BMTrees), which extends Mixed-Effects BART by using centralized Dirichlet Process (CDP) Normal Mixture priors. This framework handles non-normal random effects and errors, addresses model misspecification, and captures complex relationships. The function employs a Metropolis-Hastings MCMC method to sequentially impute missing values.
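Schematically, imputing the variables in a fixed order corresponds to factorizing the joint distribution of the incomplete variables given the complete predictors (a sketch of the sequential-regression idea, with $X_1, \dots, X_p$ denoting the covariate columns of X):

$$p(X_1, \dots, X_p, Y \mid Z) = p(X_1 \mid Z)\, p(X_2 \mid X_1, Z) \cdots p(X_p \mid X_1, \dots, X_{p-1}, Z)\, p(Y \mid X_1, \dots, X_p, Z),$$

with each conditional on the right-hand side modeled by one of the mixed-effects tree models listed above.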
A three-dimensional array of imputed data with dimensions (npost / skip, N, p + 1), where:
- N is the number of observations.
- p is the number of covariates in X.

The array includes the imputed covariates and outcomes; each slice imputed_data[m, , ] is one completed N x (p + 1) dataset.
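As a hedged illustration of how the imputed array might feed a downstream analysis (assuming `model` is the list returned by sequential_imputation() and that the third dimension stores the covariates in their original order followed by the outcome):

```r
## A minimal sketch, not the package's prescribed workflow.
imp <- model$imputed_data            # array of dimension (npost / skip, N, p + 1)
M   <- dim(imp)[1]                   # number of completed datasets

# Extract the m-th completed dataset as a data frame (hypothetical column names).
get_completed <- function(m) {
  d <- as.data.frame(imp[m, , ])
  names(d) <- c(paste0("X", seq_len(ncol(d) - 1)), "Y")
  d
}

# For illustration only: fit the same simple model to each completed dataset and average
# the coefficients. A real longitudinal analysis would include subject-level random effects
# and combine estimates with Rubin's rules (within- plus between-imputation variance).
fits <- lapply(seq_len(M), function(m) lm(Y ~ ., data = get_completed(m)))
rowMeans(sapply(fits, coef))
```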
This function utilizes modified C++ code originally derived from the BART3 package (Bayesian Additive Regression Trees). The original package was developed by Rodney Sparapani and is licensed under GPL-2. Modifications were made by Jungang Zou, 2024.
For more information about the original BART3 package, see: https://github.com/rsparapa/bnptools/tree/master/BART3
```r
data <- simulation_imputation(n_subject = 100, seed = 1234, nonrandeff = TRUE,
                              nonresidual = TRUE, alligned = FALSE)
# To make it faster to compile and check, we only run 30 iterations for burn-in
# and 40 for posterior sampling phases.
# Please increase to 3000 and 4000 iterations, respectively, when running the model.
model <- sequential_imputation(data$X_mis, data$Y_mis, data$Z, data$subject_id,
                               rep(0, 9), binary_outcome = FALSE, model = "BMTrees",
                               nburn = 30L, npost = 40L, skip = 2L, verbose = TRUE,
                               seed = 1234)
model$imputed_data
```
Generates a dataset with longitudinal data containing missing covariates and outcomes. The function allows customization of random effects, residuals, and the alignment of covariates to simulate data under different conditions.
```r
simulation_imputation(
  n_subject = 800, seed = NULL, nonrandeff = FALSE, nonresidual = FALSE,
  alligned = FALSE
)
```
Argument | Description
---|---
n_subject | Number of subjects in the dataset. Each subject has multiple observations. Default: 800.
seed | Random seed for reproducibility. Default: NULL.
nonrandeff | Logical value indicating whether the random effects are non-normal. Default: FALSE.
nonresidual | Logical value indicating whether the residuals are non-normal. Default: FALSE.
alligned | Logical value indicating whether the covariates should be aligned (TRUE) or not (FALSE). Default: FALSE.
This function creates longitudinal data for multiple subjects, each observed across 6 time points. Either normal or non-normal random effects and residuals can be specified. Missing values are introduced under the MAR assumption. The alignment of covariates can be customized to test different imputation scenarios.
A list containing:
- X_mis: Matrix of covariates with missing values.
- Y_mis: Vector of outcomes with missing values.
- Z: Matrix of complete random predictors.
- subject_id: Vector of subject IDs.
- time: Time points for each observation.
- X_O: Matrix of original complete covariates (for evaluation).
- Y_O: Vector of original complete outcomes (for evaluation).
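Because the pre-deletion data are returned alongside the incomplete versions, imputation accuracy can be checked directly. A minimal sketch, assuming `data` and `model` are the objects created in the sequential_imputation example above and that the outcome occupies the last slice of the imputed array (position p + 1 = 10 for this 9-covariate simulation):

```r
## A minimal sketch of an accuracy check against the held-back truth.
miss_idx <- which(is.na(data$Y_mis))                   # positions where the outcome was deleted

# Average the imputed outcome over completed datasets (outcome assumed in slice 10).
y_imp_mean <- colMeans(model$imputed_data[, miss_idx, 10])

# RMSE of the posterior-mean imputations against the original complete outcomes.
sqrt(mean((y_imp_mean - data$Y_O[miss_idx])^2))
```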
```r
simulated_data <- simulation_imputation(
  n_subject = 800, seed = 123, nonrandeff = TRUE, nonresidual = TRUE,
  alligned = FALSE
)
```
Generates a fixed population longitudinal dataset, with random seeds to generate different training and testing sets. The function supports customization of linear/nonlinear associations, normal/non-normal random effects, and random errors. It splits the data into training and testing sets, with the testing set comprising approximately 40% of the data.
```r
simulation_prediction(
  n_subject = 800, seed = NULL, nonlinear = FALSE, nonrandeff = FALSE,
  nonresidual = FALSE
)
```
Argument | Description
---|---
n_subject | Number of subjects in the dataset. Each subject has multiple observations across 6 follow-up time points. Default: 800.
seed | Random seed for reproducibility. Ensures different training-testing splits. Default: NULL.
nonlinear | Logical value indicating whether the outcome model includes nonlinear associations. Default: FALSE.
nonrandeff | Logical value indicating whether the random effects are non-normal. Default: FALSE.
nonresidual | Logical value indicating whether the residuals are non-normal. Default: FALSE.
The function creates a dataset with individuals observed at 6 follow-up time points. It allows users to specify whether the associations are linear or nonlinear and whether random effects and residuals follow normal or non-normal distributions. Approximately 40% of the data is randomly chosen to form the testing set, while the remaining 60% constitutes the training set.
A list containing:
- Y_test_true: True values of the outcomes in the testing set.
- X_train: Matrix of covariates in the training set.
- Y_train: Vector of outcomes in the training set.
- Z_train: Matrix of random predictors in the training set.
- subject_id_train: Vector of subject IDs in the training set.
- time_train: Vector of time points in the training set.
- X_test: Matrix of covariates in the testing set.
- Y_test: Vector of outcomes in the testing set.
- Z_test: Matrix of random predictors in the testing set.
- subject_id_test: Vector of subject IDs in the testing set.
- time_test: Vector of time points in the testing set.
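Since Y_test_true retains the held-back truth, predictions can be scored directly. A minimal sketch, assuming `data` comes from simulation_prediction() and `model` is a BMTrees_prediction fit on the corresponding training/testing split (as in the BMTrees_prediction example earlier):

```r
## A minimal sketch of an out-of-sample check against the held-back truth.
## Assumes rows of post_predictive_y_test index posterior draws and columns index test rows.
y_hat <- colMeans(model$post_predictive_y_test)      # posterior-mean prediction per test row
sqrt(mean((y_hat - data$Y_test_true)^2))             # predictive RMSE

# Empirical coverage of 95% posterior predictive intervals.
ci <- apply(model$post_predictive_y_test, 2, quantile, probs = c(0.025, 0.975))
mean(data$Y_test_true >= ci[1, ] & data$Y_test_true <= ci[2, ])
```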
```r
# Generate data with nonlinear associations and non-normal random effects and residuals
data <- simulation_prediction(
  n_subject = 800, seed = 123, nonlinear = TRUE, nonrandeff = TRUE,
  nonresidual = TRUE
)

# Access training and testing data
X_train <- data$X_train
Y_train <- data$Y_train
Z_train <- data$Z_train
subject_id_train <- data$subject_id_train

X_test <- data$X_test
Y_test <- data$Y_test
Z_test <- data$Z_test
subject_id_test <- data$subject_id_test
Y_test_true <- data$Y_test_true
```