Title: | Sequential Imputation with Bayesian Trees Mixed-Effects Models for Longitudinal Data |
---|---|
Description: | Implements a sequential imputation framework using Bayesian Mixed-Effects Trees ('SBMTrees') for handling missing data in longitudinal studies. The package supports a variety of models, including non-linear relationships and non-normal random effects and residuals, leveraging Dirichlet Process priors for increased flexibility. Key features include handling Missing at Random (MAR) longitudinal data, imputation of both covariates and outcomes, and generating posterior predictive samples for further analysis. The methodology is designed for applications in epidemiology, biostatistics, and other fields requiring robust handling of missing data in longitudinal settings. |
Authors: | Jungang Zou [aut, cre], Liangyuan Hu [aut], Robert McCulloch [ctb], Rodney Sparapani [ctb], Charles Spanbauer [ctb] |
Maintainer: | Jungang Zou <[email protected]> |
License: | GPL-2 |
Version: | 1.2 |
Built: | 2024-12-11 01:26:26 UTC |
Source: | https://github.com/zjg540066169/sbmtrees |
The SBMTrees package implements a Bayesian non-parametric framework for imputing missing covariates and outcomes in longitudinal data under the Missing at Random (MAR) assumption. Its core model, the Bayesian Trees Mixed-Effects Model (BMTrees), extends Mixed-Effects BART by employing centralized Dirichlet Process (CDP) Normal Mixture priors. This allows handling non-normal random effects and errors, addressing model misspecification, and capturing complex relationships.
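To make the modeling target concrete, the outcome for subject i at occasion j can be written schematically (a sketch of the general mixed-effects BART form described above, not necessarily the package's exact parameterization) as

$$y_{ij} = f(x_{ij}) + z_{ij}^{\top} b_i + \epsilon_{ij},$$

where $f$ is a BART sum-of-trees function of the covariates, $b_i$ are subject-level random effects, and $\epsilon_{ij}$ are residual errors; BMTrees places CDP normal mixture priors on both $b_i$ and $\epsilon_{ij}$ instead of the usual normality assumptions.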
SBMTrees offers tools for predicting and imputing missing values in longitudinal data using Bayesian Trees Mixed-Effects Models. The package supports various semiparametric variants, including BMTrees_R and BMTrees_RE, and integrates mixedBART as a baseline model. Key functionalities include:
- BMTrees_prediction: Predicts longitudinal outcomes based on mixed-effects models.
- sequential_imputation: Imputes missing covariates and outcomes sequentially in longitudinal datasets.
The package supports flexibility in specifying priors for random effects and errors, making it suitable for diverse longitudinal data settings. Core computations leverage efficient Gibbs samplers implemented in C++.
This package modifies and extends C++ code originally derived from the BART3 package, developed by Rodney Sparapani, which is licensed under the GNU General Public License version 2 (GPL-2).
The modified code is redistributed in accordance with the GPL-2 license. For more details on the modifications, see the package's documentation.
This package and all associated documentation are licensed under the GNU General Public License version 2 (GPL-2). See the LICENSE file for the full text of the license.
Jungang Zou <[email protected]>
BART3 package: https://github.com/rsparapa/bnptools/tree/master, originally developed by Rodney Sparapani.
See also: BMTrees_prediction, sequential_imputation
```r
## Example of predicting longitudinal outcomes
## To make it faster to compile and check, we only run 30 iterations for
## burn-in and 40 for posterior sampling phases.
## Please increase to 3000 and 4000 iterations, respectively, when running the models.
data <- simulation_prediction(n_subject = 100, seed = 1234, nonlinear = TRUE,
                              nonrandeff = TRUE, nonresidual = TRUE)
X_train <- data$X_train
Y_train <- data$Y_train
Z_train <- data$Z_train
subject_id_train <- data$subject_id_train
X_test <- data$X_test
Z_test <- data$Z_test
subject_id_test <- data$subject_id_test

model <- BMTrees_prediction(X_train, Y_train, Z_train, subject_id_train,
                            X_test, Z_test, subject_id_test, model = "BMTrees",
                            nburn = 30L, npost = 40L, seed = 1234)
model$post_predictive_y_test

## Sequential imputation of missing covariates and outcomes
data2 <- simulation_imputation(n_subject = 100, seed = 1234, nonrandeff = TRUE,
                               nonresidual = TRUE, alligned = FALSE)
X_mis <- data2$X_mis            # covariates with missing values
Y_mis <- data2$Y_mis            # outcomes with missing values
Z <- data2$Z                    # complete random predictors
subject_id <- data2$subject_id  # subject IDs

model2 <- sequential_imputation(X_mis, Y_mis, Z, subject_id, rep(0, 9), FALSE,
                                model = "BMTrees", nburn = 30L, npost = 40L,
                                skip = 2L, verbose = TRUE, seed = 1234)
model2$imputed_data
model2$imputed_data[, , 10]     # imputed values of the outcome (last slice)
```
Imputes missing values in a matrix by applying Last Observation Carried Forward (LOCF) followed by Next Observation Carried Backward (NOCB) for each subject.
```r
apply_locf_nocb(X, subject_id)
```
Argument | Description
---|---
X | A matrix where rows represent observations and columns represent variables.
subject_id | A vector of subject IDs corresponding to the rows of X.
A matrix with missing values imputed using LOCF and NOCB.
```r
X <- matrix(c(NA, 2, NA, 4, 5, NA, 7, 8, NA, NA), nrow = 5, byrow = TRUE)
subject_id <- c(1, 1, 1, 2, 2)
apply_locf_nocb(X, subject_id)
```
Provides predictions for outcomes in longitudinal data using Bayesian Trees Mixed-Effects Models (BMTrees) and its semiparametric variants. The function predicts values for test data while accounting for random effects, complex relationships, and potential model misspecification.
```r
BMTrees_prediction(
  X_train, Y_train, Z_train, subject_id_train, X_test, Z_test, subject_id_test,
  model = c("BMTrees", "BMTrees_R", "BMTrees_RE", "mixedBART"), binary = FALSE,
  nburn = 3000L, npost = 4000L, skip = 1L, verbose = TRUE, seed = NULL,
  tol = 1e-20, resample = 5, ntrees = 200, pi_CDP = 0.99
)
```
Argument | Description
---|---
X_train | A matrix of covariates in the training set.
Y_train | A numeric or logical vector of outcomes in the training set.
Z_train | A matrix of random predictors in the training set.
subject_id_train | A character vector of subject IDs in the training set.
X_test | A matrix of covariates in the testing set.
Z_test | A matrix of random predictors in the testing set.
subject_id_test | A character vector of subject IDs in the testing set.
model | A character string specifying the predictive model. Options are "BMTrees", "BMTrees_R", "BMTrees_RE", and "mixedBART". Default: "BMTrees".
binary | Logical. Indicates whether the outcome is binary (TRUE) or continuous (FALSE). Default: FALSE.
nburn | An integer specifying the number of burn-in iterations for the Gibbs sampler. Default: 3000L.
npost | An integer specifying the number of posterior samples to collect. Default: 4000L.
skip | An integer indicating the thinning interval for MCMC samples. Default: 1L.
verbose | Logical. If TRUE, displays progress and MCMC information. Default: TRUE.
seed | An optional integer for setting the random seed to ensure reproducibility. Default: NULL.
tol | A numeric tolerance value to prevent numerical overflow and underflow in the model. Default: 1e-20.
resample | An integer specifying the number of resampling steps for the CDP prior. Default: 5.
ntrees | An integer specifying the number of trees in BART. Default: 200.
pi_CDP | A value between 0 and 1 for calculating the empirical prior in the CDP prior. Default: 0.99.
A list containing posterior samples and predictions:
- Posterior samples of the fixed effects from BART on the training data.
- Posterior samples of the covariance matrices of the random effects.
- Posterior samples of the lambda parameter of the CDP normal mixture on the random errors.
- Posterior samples of the lambda parameter of the CDP normal mixture on the random effects.
- Posterior samples of the coefficients of the random effects.
- Posterior samples of the random effects for the training data.
- Posterior samples of the error (residual) standard deviation.
- Posterior expectations of the training outcomes, equal to fixed effects + random effects.
- Posterior expectations of the testing outcomes, equal to fixed effects + random effects.
- Posterior predictive distributions for the training outcomes, equal to fixed effects + random effects + predictive residual.
- Posterior predictive distributions for the testing outcomes, equal to fixed effects + random effects + predictive residual.
- Posterior samples of the location parameters of the CDP normal mixture on the random errors.
- Posterior samples of the location parameters of the CDP normal mixture on the random effects.
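As a minimal sketch of how this output might be summarized (assuming `model` is the list returned by BMTrees_prediction() and that the first dimension of each stored component indexes posterior draws):

```r
## A minimal sketch: summarize posterior output from a fitted BMTrees_prediction object.
## Assumes rows of post_predictive_y_test index posterior draws and columns index test rows.
str(model$post_sigma)                                    # inspect how the error-scale draws are stored

y_test_hat <- colMeans(model$post_predictive_y_test)     # posterior-mean prediction per test observation
head(y_test_hat)
```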
This function utilizes modified C++ code originally derived from the BART3 package (Bayesian Additive Regression Trees). The original package was developed by Rodney Sparapani and is licensed under GPL-2. Modifications were made by Jungang Zou, 2024.
For more information about the original BART3 package, see: https://github.com/rsparapa/bnptools/tree/master/BART3
```r
data <- simulation_prediction(n_subject = 100, seed = 1234, nonlinear = TRUE,
                              nonrandeff = TRUE, nonresidual = TRUE)
# To make it faster to compile and check, we only run 30 iterations for burn-in
# and 40 for posterior sampling phases.
# Please increase to 3000 and 4000 iterations, respectively, when running the model.
model <- BMTrees_prediction(data$X_train, data$Y_train, data$Z_train,
                            data$subject_id_train, data$X_test, data$Z_test,
                            data$subject_id_test, model = "BMTrees", binary = FALSE,
                            nburn = 30L, npost = 40L, skip = 1L, verbose = TRUE,
                            seed = 1234)
model$post_predictive_y_test
model$post_sigma
```
Implements sequential imputation for missing covariates and outcomes in longitudinal data. The function uses a Bayesian non-parametric framework with mixed-effects models to handle both normal and non-normal random effects and errors. It sequentially imputes missing values by constructing univariate models in a fixed order, ensuring simplicity and consistency with a valid joint distribution.
```r
sequential_imputation(
  X, Y, Z = NULL, subject_id, type, binary_outcome = FALSE,
  model = c("BMTrees", "BMTrees_R", "BMTrees_RE", "mixedBART"),
  nburn = 0L, npost = 3L, skip = 1L, verbose = TRUE, seed = NULL,
  tol = 1e-20, resample = 5, ntrees = 200, reordering = TRUE, pi_CDP = 0.99
)
```
Argument | Description
---|---
X | A matrix of covariates with missing values.
Y | A vector of outcomes (numeric or logical) with missing values.
Z | A matrix of complete random predictors. Default: NULL.
subject_id | A vector of subject IDs corresponding to the rows of X and Y.
type | A logical vector indicating whether each covariate in X is binary (TRUE) or continuous (FALSE).
binary_outcome | A logical value indicating whether the outcome Y is binary (TRUE) or continuous (FALSE). Default: FALSE.
model | A character string specifying the imputation model. Options are "BMTrees", "BMTrees_R", "BMTrees_RE", and "mixedBART". Default: "BMTrees".
nburn | An integer specifying the number of burn-in iterations. Default: 0L.
npost | An integer specifying the number of sampling iterations. Default: 3L.
skip | An integer specifying the interval for keeping samples in the sampling phase. Default: 1L.
verbose | A logical value indicating whether to display progress and MCMC information. Default: TRUE.
seed | A random seed for reproducibility. Default: NULL.
tol | A small numerical tolerance to prevent numerical overflow or underflow in the model. Default: 1e-20.
resample | An integer specifying the number of resampling steps for the CDP prior. Default: 5.
ntrees | An integer specifying the number of trees in BART. Default: 200.
reordering | A logical value indicating whether to apply a reordering strategy for sorting covariates. Default: TRUE.
pi_CDP | A value between 0 and 1 for calculating the empirical prior in the CDP prior. Default: 0.99.
The function builds on the Bayesian Trees Mixed-Effects Model (BMTrees), which extends Mixed-Effects BART by using centralized Dirichlet Process (CDP) Normal Mixture priors. This framework handles non-normal random effects and errors, addresses model misspecification, and captures complex relationships. The function employs a Metropolis-Hastings MCMC method to sequentially impute missing values.
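Schematically, imputing the variables in a fixed order corresponds to factorizing the joint distribution of the incomplete variables given the complete predictors (a sketch of the sequential-regression idea, with $X_1, \dots, X_p$ denoting the covariate columns of X):

$$p(X_1, \dots, X_p, Y \mid Z) = p(X_1 \mid Z)\, p(X_2 \mid X_1, Z) \cdots p(X_p \mid X_1, \dots, X_{p-1}, Z)\, p(Y \mid X_1, \dots, X_p, Z),$$

with each conditional on the right-hand side modeled by one of the mixed-effects tree models listed above.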
A three-dimensional array of imputed data with dimensions (npost / skip, N, p + 1), where:
- N is the number of observations.
- p is the number of covariates in X.

The array includes the imputed covariates and outcomes; each slice imputed_data[m, , ] is one completed N x (p + 1) dataset.
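As a hedged illustration of how the imputed array might feed a downstream analysis (assuming `model` is the list returned by sequential_imputation() and that the third dimension stores the covariates in their original order followed by the outcome):

```r
## A minimal sketch, not the package's prescribed workflow.
imp <- model$imputed_data            # array of dimension (npost / skip, N, p + 1)
M   <- dim(imp)[1]                   # number of completed datasets

# Extract the m-th completed dataset as a data frame (hypothetical column names).
get_completed <- function(m) {
  d <- as.data.frame(imp[m, , ])
  names(d) <- c(paste0("X", seq_len(ncol(d) - 1)), "Y")
  d
}

# For illustration only: fit the same simple model to each completed dataset and average
# the coefficients. A real longitudinal analysis would include subject-level random effects
# and combine estimates with Rubin's rules (within- plus between-imputation variance).
fits <- lapply(seq_len(M), function(m) lm(Y ~ ., data = get_completed(m)))
rowMeans(sapply(fits, coef))
```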
This function utilizes modified C++ code originally derived from the BART3 package (Bayesian Additive Regression Trees). The original package was developed by Rodney Sparapani and is licensed under GPL-2. Modifications were made by Jungang Zou, 2024.
For more information about the original BART3 package, see: https://github.com/rsparapa/bnptools/tree/master/BART3
```r
data <- simulation_imputation(n_subject = 100, seed = 1234, nonrandeff = TRUE,
                              nonresidual = TRUE, alligned = FALSE)
# To make it faster to compile and check, we only run 30 iterations for burn-in
# and 40 for posterior sampling phases.
# Please increase to 3000 and 4000 iterations, respectively, when running the model.
model <- sequential_imputation(data$X_mis, data$Y_mis, data$Z, data$subject_id,
                               rep(0, 9), binary_outcome = FALSE, model = "BMTrees",
                               nburn = 30L, npost = 40L, skip = 2L, verbose = TRUE,
                               seed = 1234)
model$imputed_data
```
Generates a dataset with longitudinal data containing missing covariates and outcomes. The function allows customization of random effects, residuals, and the alignment of covariates to simulate data under different conditions.
```r
simulation_imputation(
  n_subject = 800, seed = NULL, nonrandeff = FALSE, nonresidual = FALSE,
  alligned = FALSE
)
```
Argument | Description
---|---
n_subject | Number of subjects in the dataset. Each subject has multiple observations. Default: 800.
seed | Random seed for reproducibility. Default: NULL.
nonrandeff | Logical value indicating whether the random effects are non-normal. Default: FALSE.
nonresidual | Logical value indicating whether the residuals are non-normal. Default: FALSE.
alligned | Logical value indicating whether the covariates should be aligned (TRUE) or not (FALSE). Default: FALSE.
This function creates longitudinal data for multiple subjects, each observed across 6 time points. Either normal or non-normal random effects and residuals can be specified. Missing values are introduced under the MAR assumption. The alignment of covariates can be customized to test different imputation scenarios.
A list containing:
- X_mis: Matrix of covariates with missing values.
- Y_mis: Vector of outcomes with missing values.
- Z: Matrix of complete random predictors.
- subject_id: Vector of subject IDs.
- time: Time points for each observation.
- X_O: Matrix of original complete covariates (for evaluation).
- Y_O: Vector of original complete outcomes (for evaluation).
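Because the pre-deletion data are returned alongside the incomplete versions, imputation accuracy can be checked directly. A minimal sketch, assuming `data` and `model` are the objects created in the sequential_imputation example above and that the outcome occupies the last slice of the imputed array (position p + 1 = 10 for this 9-covariate simulation):

```r
## A minimal sketch of an accuracy check against the held-back truth.
miss_idx <- which(is.na(data$Y_mis))                   # positions where the outcome was deleted

# Average the imputed outcome over completed datasets (outcome assumed in slice 10).
y_imp_mean <- colMeans(model$imputed_data[, miss_idx, 10])

# RMSE of the posterior-mean imputations against the original complete outcomes.
sqrt(mean((y_imp_mean - data$Y_O[miss_idx])^2))
```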
```r
simulated_data <- simulation_imputation(
  n_subject = 800, seed = 123, nonrandeff = TRUE, nonresidual = TRUE,
  alligned = FALSE
)
```
Generates a fixed population longitudinal dataset, with random seeds to generate different training and testing sets. The function supports customization of linear/nonlinear associations, normal/non-normal random effects, and random errors. It splits the data into training and testing sets, with the testing set comprising approximately 40% of the data.
```r
simulation_prediction(
  n_subject = 800, seed = NULL, nonlinear = FALSE, nonrandeff = FALSE,
  nonresidual = FALSE
)
```
Argument | Description
---|---
n_subject | Number of subjects in the dataset. Each subject has multiple observations across 6 follow-up time points. Default: 800.
seed | Random seed for reproducibility. Ensures different training-testing splits. Default: NULL.
nonlinear | Logical value indicating whether the outcome model includes nonlinear associations. Default: FALSE.
nonrandeff | Logical value indicating whether the random effects are non-normal. Default: FALSE.
nonresidual | Logical value indicating whether the residuals are non-normal. Default: FALSE.
The function creates a dataset with individuals observed at 6 follow-up time points. It allows users to specify whether the associations are linear or nonlinear and whether random effects and residuals follow normal or non-normal distributions. Approximately 40% of the data is randomly chosen to form the testing set, while the remaining 60% constitutes the training set.
A list containing:
- Y_test_true: True values of the outcomes in the testing set.
- X_train: Matrix of covariates in the training set.
- Y_train: Vector of outcomes in the training set.
- Z_train: Matrix of random predictors in the training set.
- subject_id_train: Vector of subject IDs in the training set.
- time_train: Vector of time points in the training set.
- X_test: Matrix of covariates in the testing set.
- Y_test: Vector of outcomes in the testing set.
- Z_test: Matrix of random predictors in the testing set.
- subject_id_test: Vector of subject IDs in the testing set.
- time_test: Vector of time points in the testing set.
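Since Y_test_true retains the held-back truth, predictions can be scored directly. A minimal sketch, assuming `data` comes from simulation_prediction() and `model` is a BMTrees_prediction fit on the corresponding training/testing split (as in the BMTrees_prediction example earlier):

```r
## A minimal sketch of an out-of-sample check against the held-back truth.
## Assumes rows of post_predictive_y_test index posterior draws and columns index test rows.
y_hat <- colMeans(model$post_predictive_y_test)      # posterior-mean prediction per test row
sqrt(mean((y_hat - data$Y_test_true)^2))             # predictive RMSE

# Empirical coverage of 95% posterior predictive intervals.
ci <- apply(model$post_predictive_y_test, 2, quantile, probs = c(0.025, 0.975))
mean(data$Y_test_true >= ci[1, ] & data$Y_test_true <= ci[2, ])
```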
```r
# Generate data with nonlinear associations and non-normal random effects and residuals
data <- simulation_prediction(
  n_subject = 800, seed = 123, nonlinear = TRUE, nonrandeff = TRUE,
  nonresidual = TRUE
)

# Access training and testing data
X_train <- data$X_train
Y_train <- data$Y_train
Z_train <- data$Z_train
subject_id_train <- data$subject_id_train

X_test <- data$X_test
Y_test <- data$Y_test
Z_test <- data$Z_test
subject_id_test <- data$subject_id_test
Y_test_true <- data$Y_test_true
```