This guide explains an exemplary usage of the semforest function within the semtree package – step by step. We will create a latent growth curve model in OpenMx, load some data, and infer a SEM Forest.
Step 1 – Load packages
Step 2 – Load dataset
Download the example file: example1.txt (1526). Load the dataset with the following command (add a local path to the file, if necessary):
dataset <- data.frame(read.csv("example1.txt",sep="\t", header=T))
STEP 3 – Create a model
Let’s create a linear latent growth curve model with a latent intercept and a latent slope, five occasions of measurement, and independent residual errors with equal variance. The following Figure shows a graphical representation of the model that we will define:
The following code is an OpenMx model specification of the above model:
manifests <- names(dataset)[1:5] growthCurveModel <- mxModel("Linear Growth Curve Model Path Specification", type="RAM", manifestVars=manifests, latentVars=c("intercept","slope"), mxData(dataset, type="raw"), # residual variances mxPath( from=manifests, arrows=2, free=TRUE, values = c(.1, .1, .1, .1, .1), labels=c("residual","residual","residual","residual","residual") ), # latent variances and covariance mxPath( from=c("intercept","slope"), arrows=2, connect="unique.pairs", free=TRUE, values=c(2, 0, 1), labels=c("vari", "cov", "vars") ), # intercept loadings mxPath( from="intercept", to=manifests, arrows=1, free=FALSE, values=c(1, 1, 1, 1, 1) ), # slope loadings mxPath( from="slope", to=manifests, arrows=1, free=FALSE, values=c(0, 1, 2, 3, 4) ), # manifest means mxPath( from="one", to=manifests, arrows=1, free=FALSE, values=c(0, 0, 0, 0, 0) ), # latent means mxPath( from="one", to=c("intercept", "slope"), arrows=1, free=TRUE, values=c(1, 1), labels=c("meani", "means") ) ) # close model
We can estimate the model fit from the full dataset by fitting the model in OpenMx:
run <- mxRun(growthCurveModel) summary(run)
Step 4 – Create control objects and run semforest
A forest has some default settings that you may want to change. The default settings for the tree include: 5 default trees, 2 variables compared per node, and subsampling. The number of variables per node randomly selects n numbers of variables from the complete set of unmodeled variables at each node. The commands to create default control objects are:
control <- semforest.control() control$semtree.control <- semtree.control()
The default semtree control object is also needed to define the hyper parameters from the semtree analysis (see the semtree guide for more information about hyper parameters). Changing the default settings in the semforest control object is done as shown with the following commands:
control$num.trees <- 50 # number of trees to fit control$sampling <- "subsample" # sampling process control$mtry <- 2 # number of variables compared per node print(control)
Running a simple semforest with sequential tree estimation is done with the command:
forest.subsample <- semforest(growthCurveModel, dataset, control)
An additional option is changing the sampling method for the semforest calculations. This is done by changing the semforest control sampling option from „subsample“ to „bootstrap“.
control$sampling <- "bootstrap" forest.bootstrap <- semforest(growthCurveModel, dataset, control)
All columns in the dataset with names not corresponding to manifest variables in the model are considered covariates and, thus, are included in the list of split candidates. In this example, the columns „X1“ to „X5“ correspond to manifest variables. The remaining columns „age“, „training“, and „noise“ are therefore split candidates. Please be patient. A large number of models are fitted during the inference of a forest.
Step 5 – Variable importance plots
This code will compile the node information to determine the strength of data subsetting for each tree.
set.seed(436) vim.subsample <- varimp(forest.subsample)
Plotting of the variable importance values is done by passing the completed variable importance objects to the plot function. Bargraphs show the average improvement in model fit for each variable when they are used to subset the complete node data.
The graphical output should look similar to the following figure:
Step 6 – Case proximity plots
Clustering of the cases by terminal node proximity is calculated in the proximity function. The rationale for this computation is to identify clusters of of cases that tend to stick together in the final models of the trees. Clustered cases and variable importance provide ways to visualize how models differ for previously unmodeled case characteristics.
prox.subsample <- proximity(forest.subsample)
The plotting functions work the same way as they did for the variable importance plots. Just use the plot function with the case proximity object.
The variable importance plots for this dataset points to three clusters of cases based on the proximity analysis. These plots are shown below:
We can then see how the case proximity plots change when the unmodeled variable information is added to the plot. In this case age group and training both provide some information about subsetting the data, improving model fit. Age group is shown as black for young (=0) and red for old (=1). Training is displayed with circles untrained (=0) and triangles trained (=1).
plot(prox.subsample, col=(1+dataset$agegroup), pch=(1+dataset$training))
The plot should look like:
Pruning a forest
Once a full forest analysis has completed, the maximum depth of the trees may be adjusted. This pruning procedure may be used to analyze differential variable importance and case proximity when the depth of the trees is varied. The following code prunes a semforest object:
pforest.subsample <- prune.semforest(forest.subsample, max.depth=1)
Parallel computing with semforest
If computer hardware supports it, parallel computation of individual trees will safe time. The
snowfall parallel package is used to create clusters for computations. In the initialization of the cluster, the number of cpus is selected.
cl <- makeCluster(2) # cluster of 2 CPUs created, in parallel
A cluster can be stopped when not needed. (Note – if you want to restart the cluster, you must use the sfClusterEval functions to load the packages for use with the new cluster.)
A SEM Forest can be grown in parallel by specifying the cluster to be used:
forest.parallel <- semforest(growthCurveModel, dataset, control, cluster=cl) vim.parallel <- varimp(forest.parallel, cluster=cl)