Skip to main content


Showing posts from August, 2010

Fun with the proto package: building an MCMC sampler for Bayesian regression

The proto package is my latest favourite R goodie. It brings prototype-based programming to the R language - a style of programming that lets you do many of the things you can do with classes, but with a lot less up-front work. Louis Kates and Thomas Petzoldt provide an excellent introduction to using proto in the package vignette.

As a learning exercise I concocted the example below involving Bayesian logistic regression. It was inspired by an article on Matt Shotwell's blog about using R environments rather than lists to store the state of a Markov Chain Monte Carlo sampler. Here I use proto to create a parent class-like object (or trait in proto-ese) to contain the regression functions and create child objects to hold both data and results for individual analyses.

First here's an example session...

# Make up some data with a continuous predictor and binary response
nrec <- 500
x <- rnorm(nrec)
y <- rbinom(nrec, 1, plogis(2 - 4*x))

# Predictor matrix with a col of 1s for i…

Homogeneity analysis of hierarchical classifications

I've spent more years than I care to remember analysing vegetation survey data (typically species abundances in plots) using a variety of software including my own algorithms coded in FORTRAN and C++. A recent query on the r-help list, about how to determine the number of groups to define in a hierarchical classification produced with the hclust function, prompted me to unearth one of these algorithms, homogeneity analysis1, which can help to visualize how different levels of grouping partition the variability in a distance matrix.

This algorithm is extremely simple. The classification is progressively divided into groups, with all groups being defined at the same dendrogram height. At each level of grouping, the average of within-group pairwise distances is calculated. Homogeneity is then defined as:
H = 1 - Davwithin-group - Davtotal

where Davtotal is the average pairwise distance in the dataset as a whole.

For data were there exist well-defined clusters of values, a plot of h…

Meeting in the middle; or fudging model II regression with nls

My colleague Karen needed an equation to predict trunk diameter given tree height, which she hoped to base on measurements of trees in semi-arid Australian woodlands. This is the dark art of allometry and a quick google found a large number of formulae that have been used in different studies of tree dimensions. No problem: I started to play with a few of them and eventually settled on this one:
dbh = exp( b0 + b1 / (b2 + h) )
where: dbh is trunk diameter at breast height; h is tree height.

Karen also needed to do reverse predictions, ie. predict a tree's height given its trunk diameter. Again no problem, the inverse equation is simply:
height = b1 / (log( dbh ) - b0) - b2

But then, the pièce de résistance: the forward predictions had to agree with the reverse predictions, ie. if plugging height h into the forward equation gave trunk diameter d, then plugging d into the reverse equation should get you back to h. Karen pointed out that this seemed to be a Model II regression problem bec…