Skip to main content

Fun with the proto package: building an MCMC sampler for Bayesian regression

The proto package is my latest favourite R goodie. It brings prototype-based programming to the R language - a style of programming that lets you do many of the things you can do with classes, but with a lot less up-front work. Louis Kates and Thomas Petzoldt provide an excellent introduction to using proto in the package vignette.

As a learning exercise I concocted the example below involving Bayesian logistic regression. It was inspired by an article on Matt Shotwell's blog about using R environments rather than lists to store the state of a Markov Chain Monte Carlo sampler. Here I use proto to create a parent class-like object (or trait in proto-ese) to contain the regression functions and create child objects to hold both data and results for individual analyses.

First here's an example session...

# Make up some data with a continuous predictor and binary response
nrec <- 500
x <- rnorm(nrec)
y <- rbinom(nrec, 1, plogis(2 - 4*x))

# Predictor matrix with a col of 1s for intercept
pred <- matrix(c(rep(1, nrec), x), ncol=2)
colnames(pred) <- c("intercept", "X")

# Load the proto package
library(proto)

# Use the Logistic parent object to create a child object which will
# hold the data and run the regression (the $ operator references
# functions and data within a proto object)
lr <- Logistic$new(pred, y)
lr$run(5000, 1000)

# lr now contains both data and results
str(lr)

proto object
$ cov : num [1:2, 1:2] 0.05 -0.0667 -0.0667 0.1621
..- attr(*, "dimnames")=List of 2
$ prior.cov: num [1:2, 1:2] 100 0 0 100
$ prior.mu : num [1:2] 0 0
$ beta : num [1:5000, 1:2] 2.09 2.09 2.09 2.21 2.21 ...
..- attr(*, "dimnames")=List of 2
$ adapt : num 1000
$ y : num [1:500] 0 1 1 1 1 1 1 1 1 1 ...
$ x : num [1:500, 1:2] 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "dimnames")=List of 2
parent: proto object

# Use the Logistic summary function to tabulate and plot results
lr$summary()

From 5000 samples after 1000 iterations burning in
intercept X
Min. :1.420 Min. :-5.296
1st Qu.:1.840 1st Qu.:-3.915
Median :2.000 Median :-3.668
Mean :1.994 Mean :-3.693
3rd Qu.:2.128 3rd Qu.:-3.455
Max. :2.744 Max. :-2.437



And here's the code for the Logistic trait...

Logistic <- proto()

Logistic$new <- function(., x, y) {
# Creates a child object to hold data and results
#
# x - a design matrix (ie. predictors
# y - a binary reponse vector

proto(., x=x, y=y)
}

Logistic$run <- function(., niter, adapt=1000) {
# Perform the regression by running the MCMC
# sampler
#
# niter - number of iterations to sample
# adapt - number of prior iterations to run
# for the 'burning in' period

require(mvtnorm)

# Set up variables used by the sampler
.$adapt <- adapt
total.iter <- niter + adapt
.$beta <- matrix(0, nrow=total.iter, ncol=ncol(.$x))
.$prior.mu <- rep(0, ncol(.$x))
.$prior.cov <- diag(100, ncol(.$x))
.$cov <- diag(ncol(.$x))

# Run the sampler
b <- rep(0, ncol(.$x))
for (i in 1:total.iter) {
b <- .$update(i, b)
.$beta[i,] <- b
}

# Trim the results matrix to remove the burn-in
# period
if (.$adapt > 0) {
.$beta <- .$beta[(.$adapt + 1):total.iter,]
}
}

Logistic$update <- function(., it, beta.old) {
# Perform a single iteration of the MCMC sampler using
# Metropolis-Hastings algorithm.
# Adapted from code by Brian Neelon published at:
# http://www.duke.edu/~neelo003/r/
#
# it - iteration number
# beta.old - vector of coefficient values from
# the previous iteration

# Update the coefficient covariance if we are far
# enough through the sampling
if (.$adapt > 0 & it > 2 * .$adapt) {
.$cov <- cov(.$beta[(it - .$adapt):(it - 1),])
}

# generate proposed new coefficient values
beta.new <- c(beta.old + rmvnorm(1, sigma=.$cov))

# calculate prior and current probabilities and log-likelihood
if (it == 1) {
.$..log.prior.old <- dmvnorm(beta.old, .$prior.mu, .$prior.cov, log=TRUE)
.$..probs.old <- plogis(.$x %*% beta.old)
.$..LL.old <- sum(log(ifelse(.$y, .$..probs.old, 1 - .$..probs.old)))
}
log.prior.new <- dmvnorm(beta.new, .$prior.mu, .$prior.cov, log=TRUE)
probs.new <- plogis(.$x %*% beta.new)
LL.new <- sum(log(ifelse(.$y, probs.new, 1-probs.new)))

# Metropolis-Hastings acceptance ratio (log scale)
ratio <- LL.new + log.prior.new - .$..LL.old - .$..log.prior.old

if (log(runif(1)) < ratio) {
.$..log.prior.old <- log.prior.new
.$..probs.old <- probs.new
.$..LL.old <- LL.new
return(beta.new)
} else {
return(beta.old)
}
}

Logistic$summary <- function(., show.plot=TRUE) {
# Summarize the results

cat("From", nrow(.$beta), "samples after", .$adapt, "iterations burning in\n")
base::print(base::summary(.$beta))

if (show.plot) {
par(mfrow=c(1, ncol(.$beta)))
for (i in 1:ncol(.$beta)) {
plot(density(.$beta[,i]), main=colnames(.$beta)[i])
}
}
}


Now that's probably not the greatest design in the world, but it only took me a few minutes to put it together and it's incredibly easy to modify or extend. Try it !

Thanks to Brian Neelon for making his MCMC logistic regression code available (http://www.duke.edu/~neelo003/r/).

Comments

Popular posts from this blog

Circle packing with R

To visualize the results of a simulation model of woodland trees within R, I needed an algorithm that could arrange a large number of circles within a rectangle such that no two circles overlapped by more than a specified amount. A colleague had approached this problem earlier by sorting the circles in order of descending size, then randomly dropping each one into the rectangle repeatedly until it landed in a position with acceptable overlap. I suspected a faster and more robust algorithm could be constructed using some kind of "jiggling the circles" approach. Luckily for me, I discovered that Sean McCullough had written a really nice example of circles packing into a cluster using the Processing language. Sean's program is based on an iterative pair-repulsion algorithm in which overlapping circles move away from each other. Based on this, and modifying the algorithm a little, I came up with an R function to produce constrained random layouts of a given set of circles. ...

Circle packing in R (again)

Back in 2010 I posted some R code for circle packing . Now, just five years later, I've ported the code to Rcpp and created a little package which you can find at GitHub . The main function is circleLayout which takes a set of overlapping circles and tries to find a non-overlapping arrangement for them. Here's an example: And here's the code: # Create some random circles, positioned within the central portion # of a bounding square, with smaller circles being more common than # larger ones. ncircles

Homogeneity analysis of hierarchical classifications

I've spent more years than I care to remember analysing vegetation survey data (typically species abundances in plots) using a variety of software including my own algorithms coded in FORTRAN and C++. A recent query on the r-help list, about how to determine the number of groups to define in a hierarchical classification produced with the hclust function, prompted me to unearth one of these algorithms, homogeneity analysis 1 , which can help to visualize how different levels of grouping partition the variability in a distance matrix. This algorithm is extremely simple. The classification is progressively divided into groups, with all groups being defined at the same dendrogram height. At each level of grouping, the average of within-group pairwise distances is calculated. Homogeneity is then defined as: H = 1 - Dav within-group - Dav total where Dav total is the average pairwise distance in the dataset as a whole. For data were there exist well-defined clusters of values, a...