The R package SBMTrees (Sequential Imputation with Bayesian Trees Mixed-Effects Models) provides a powerful Bayesian non-parametric framework for imputing missing covariates and outcomes in longitudinal data under the Missing at Random (MAR) assumption. The package leverages centralized Dirichlet Process (CDP) Normal Mixture priors to model non-normal random effects and errors, offering robust handling of model misspecification and capturing complex relationships in longitudinal data.
This vignette introduces the key functionalities of the package, including:

- BMTrees_prediction, for predicting longitudinal outcomes.
- sequential_imputation, for imputing missing covariates and outcomes.

The BMTrees_prediction function predicts longitudinal outcomes based on Bayesian trees mixed-effects models. Below is an example of how to generate data, split it into training and testing datasets, and run predictions.
# Simulate data
data <- simulation_prediction(
  n_subject = 100,
  seed = 1234,
  nonlinear = TRUE,
  nonrandeff = TRUE,
  nonresidual = TRUE
)
#> Warning: No records are assigned to patterns 16, 18, 19, 40, 41. These patterns
#> will not be generated. Consider reducing the number of patterns or increasing
#> the dataset size.
# Extract training and testing datasets
X_train <- data$X_train
Y_train <- data$Y_train
Z_train <- data$Z_train
subject_id_train <- data$subject_id_train
X_test <- data$X_test
Z_test <- data$Z_test
subject_id_test <- data$subject_id_test
Y_test_true <- data$Y_test_true
We then run the prediction model BMTrees, with 3 burn-in iterations and 4 posterior samples. In practice, the numbers of burn-in and posterior iterations should be increased to around 3000 and 4000, respectively; we use small numbers here only to keep the example fast.
# Fit the predictive model
model <- BMTrees_prediction(
  X_train, Y_train, Z_train, subject_id_train,
  X_test, Z_test, subject_id_test,
  model = "BMTrees",
  binary = FALSE,
  nburn = 3L,
  npost = 4L,
  skip = 1L,
  verbose = FALSE,
  seed = 1234
)
# Posterior expectation for the testing dataset
posterior_predictions <- model$post_predictive_y_test
head(colMeans(posterior_predictions))
#> [1] 22.52958 -55.43787 -93.74764 -109.30361 -52.16865 -37.98970
To evaluate the model’s predictive performance, we compute the Mean Absolute Error (MAE) and the Mean Squared Error (MSE). We also calculate the 95% posterior predictive intervals to check coverage, and visualize the results with a scatterplot of true versus predicted values.
point_predictions <- colMeans(posterior_predictions)
# Compute MAE
mae <- mean(abs(point_predictions - Y_test_true))
cat("Mean Absolute Error (MAE):", mae, "\n")
#> Mean Absolute Error (MAE): 13.44481
# Compute MSE
mse <- mean((point_predictions - Y_test_true)^2)
cat("Mean Squared Error (MSE):", mse, "\n")
#> Mean Squared Error (MSE): 272.1943
# Compute 95% credible intervals
lower_bounds <- apply(posterior_predictions, 2, quantile, probs = 0.025)
upper_bounds <- apply(posterior_predictions, 2, quantile, probs = 0.975)
# Check if true values fall within the intervals
coverage <- mean(Y_test_true >= lower_bounds & Y_test_true <= upper_bounds)
cat("95% Posterior Predictive Interval Coverage:", coverage * 100, "%\n")
#> 95% Posterior Predictive Interval Coverage: 11.30435 %
plot(Y_test_true, point_predictions,
     xlab = "True Values",
     ylab = "Predicted Values",
     main = "True vs Predicted Values")
abline(0, 1, col = "red")  # Add a 45-degree reference line
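The coverage above is far below the nominal 95%, which is expected with only 4 posterior draws: the 2.5% and 97.5% quantiles are then interpolated from the most extreme draws, so the resulting intervals are very unstable. A minimal self-contained illustration of this effect, using simulated standard-normal draws rather than the model's output:

```r
set.seed(1)
few  <- rnorm(4)     # analogous to npost = 4 posterior draws
many <- rnorm(4000)  # analogous to npost = 4000 posterior draws

# With 4 draws, the 95% interval is interpolated from the minimum and
# maximum, so its width varies wildly from run to run; with 4000 draws
# it approaches the true width of about 3.92 for a standard normal.
quantile(few,  probs = c(0.025, 0.975))
quantile(many, probs = c(0.025, 0.975))
```

This is one reason the vignette recommends increasing nburn and npost substantially for real analyses.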
The sequential_imputation function imputes missing covariates and outcomes in longitudinal data. Below is an example of how to generate longitudinal data with MAR missingness and run the imputation.
# Simulate data with missing values
data <- simulation_imputation(
  n_subject = 100,
  seed = 1234,
  nonrandeff = TRUE,
  nonresidual = TRUE,
  alligned = FALSE
)
# Extract components of the dataset
X_mis <- data$X_mis # Covariates with missing values
Y_mis <- data$Y_mis # Outcomes with missing values
Z <- data$Z # Random predictors
subject_id <- data$subject_id
We then run the sequential imputation with BMTrees, using 3 burn-in iterations and 40 posterior iterations, keeping one posterior sample for every 2 iterations so that 20 multiply-imputed datasets are generated. In practice, the numbers of burn-in and posterior iterations should be increased to around 3000 and 4000, respectively; we use small numbers here only to keep the example fast.
# Run the sequential imputation
imputed_model <- sequential_imputation(
  X_mis, Y_mis, Z, subject_id,
  type = rep(0, ncol(X_mis)),
  binary_outcome = FALSE,
  model = "BMTrees",
  nburn = 3L,
  npost = 40L,
  skip = 2L,
  verbose = FALSE,
  seed = 1234
)
#> reordering: new covariates order is X1 X2 X3 X4 X5 X6 X8 X7 X9
#> Start to initialize imputed missing data by LOCF and NOCB.
#> Completed.
#> Start to impute using Longitudinal Sequential Imputation with:
#> BMTrees
#> Outcome variable has missing values
#> Start initializing models
#> Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
#> Model failed to converge with max|grad| = 0.00201562 (tol = 0.002, component 1)
#>
#> Complete initialization
#>
#> Finish imputation with 20 imputed sets
# Extract imputed data
imputed_data <- imputed_model$imputed_data
dim(imputed_data) # Dimensions: imputed sets x observations x variables
#> [1] 20 600 10
To evaluate the imputation performance, we apply Rubin’s rules to pool the estimates from linear mixed-effects models fitted to the multiply-imputed datasets.
# Create a list of imputed datasets in the structure expected by mitml
MI_data <- list()
for (i in 1:dim(imputed_data)[1]) {
  MI_data[[i]] <- cbind(as.data.frame(imputed_data[i, , ]), Z, subject_id)
  colnames(MI_data[[i]]) <- c(colnames(X_mis), "Y", "Z1", "Z2", "Z3", "subject_id")
}
MI_data <- as.mitml.list(MI_data)  # Convert to a mitml list of imputed datasets
# Fit the model on each imputed dataset
lmm_results <- with(MI_data, lmer(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + (0 + Z1 + Z2 + Z3 | subject_id)))
#> Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
#> Model failed to converge with max|grad| = 0.00462371 (tol = 0.002, component 1)
#> Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
#> Model failed to converge with max|grad| = 0.0128379 (tol = 0.002, component 1)
#> boundary (singular) fit: see help('isSingular')
#> boundary (singular) fit: see help('isSingular')
#> Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
#> Model failed to converge with max|grad| = 0.00226588 (tol = 0.002, component 1)
#> Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
#> Model failed to converge with max|grad| = 0.00450694 (tol = 0.002, component 1)
#> Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
#> Model failed to converge with max|grad| = 0.00280681 (tol = 0.002, component 1)
#> Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
#> Model failed to converge with max|grad| = 0.00369226 (tol = 0.002, component 1)
#> boundary (singular) fit: see help('isSingular')
# Pool fixed effects using Rubin's Rules
pooled_results <- testEstimates(lmm_results)
# Print pooled results
print(pooled_results)
#>
#> Call:
#>
#> testEstimates(model = lmm_results)
#>
#> Final parameter estimates and inferences obtained from 20 imputed data sets.
#>
#> Estimate Std.Error t.value df P(>|t|) RIV FMI
#> (Intercept) -6.894 1.154 -5.974 205.395 0.000 0.437 0.311
#> X1 0.157 0.419 0.375 267.095 0.708 0.364 0.272
#> X2 -0.523 0.509 -1.029 63.513 0.307 1.207 0.561
#> X3 1.264 0.388 3.255 326.418 0.001 0.318 0.246
#> X4 2.359 0.325 7.266 122.087 0.000 0.652 0.404
#> X5 -0.476 0.484 -0.984 37.423 0.331 2.479 0.727
#> X6 1.284 0.342 3.751 119.376 0.000 0.664 0.409
#> X7 -0.727 0.121 -6.010 195.363 0.000 0.453 0.319
#> X8 0.500 0.106 4.733 41.193 0.000 2.117 0.694
#> X9 -0.665 0.082 -8.136 32.937 0.000 3.158 0.773
#>
#> Unadjusted hypothesis test as appropriate in larger samples.
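The pooled estimates come from Rubin’s rules, which testEstimates applies internally. As a sketch of the underlying arithmetic, on made-up toy numbers rather than the results above: the pooled estimate is the mean of the per-imputation estimates, and its total variance combines the within- and between-imputation variances.

```r
# Rubin's rules on toy numbers: m = 3 imputations of one coefficient
est <- c(1.10, 0.95, 1.05)   # per-imputation point estimates
se  <- c(0.20, 0.22, 0.21)   # per-imputation standard errors
m   <- length(est)

qbar  <- mean(est)                 # pooled point estimate
ubar  <- mean(se^2)                # within-imputation variance
b     <- var(est)                  # between-imputation variance
total <- ubar + (1 + 1/m) * b      # total variance of the pooled estimate
riv   <- (1 + 1/m) * b / ubar      # relative increase in variance (RIV)

c(estimate = qbar, std.error = sqrt(total), RIV = riv)
```

The RIV column in the pooled output above is this same quantity: large values (as for X5, X8, X9) indicate that missing data contribute substantially to the uncertainty in those coefficients.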
The SBMTrees package provides flexible tools for handling missing values and making predictions in longitudinal data. By leveraging Bayesian non-parametric methods, it effectively addresses challenges associated with model misspecification, non-normal random effects, and non-normal errors.
For further details, please refer to the package documentation and the paper: Nonparametric Bayesian Additive Regression Trees for Predicting and Handling Missing Covariates and Outcomes in Longitudinal Data.
This vignette is part of the SBMTrees R package and is distributed under the terms of the GNU General Public License (GPL-2).