Sunday, September 30, 2012

Fitting an ellipse to point data

Some time ago I wrote an R function to fit an ellipse to point data, using an algorithm developed by Radim Halíř and Jan Flusser¹ in Matlab, and posted it to the r-help list. The implementation was a bit hacky, returning odd results for some data.

A couple of days ago, an email arrived from John Minter asking for a pointer to the original code. I replied with a link and mentioned that I'd be interested to know if John made any improvements to the code. About ten minutes later, John emailed again with a much improved version! Not only is it more reliable, it is also more efficient. So with many thanks to John, here is the improved code:


fit.ellipse <- function (x, y = NULL) {
  # from:
  # http://r.789695.n4.nabble.com/Fitting-a-half-ellipse-curve-tp2719037p2720560.html
  #
  # Least squares fitting of an ellipse to point data
  # using the algorithm described in: 
  #   Radim Halir & Jan Flusser. 1998. 
  #   Numerically stable direct least squares fitting of ellipses. 
  #   Proceedings of the 6th International Conference in Central Europe 
  #   on Computer Graphics and Visualization. WSCG '98, p. 125-132 
  #
  # Adapted from the original Matlab code by Michael Bedward (2010)
  # michael.bedward@gmail.com
  #
  # Subsequently improved by John Minter (2012)
  # 
  # Arguments: 
  # x, y - x and y coordinates of the data points.
  #        If a single arg is provided it is assumed to be a
  #        two column matrix.
  #
  # Returns a list with the following elements: 
  #
  # coef - coefficients of the ellipse as described by the general 
  #        quadratic:  ax^2 + bxy + cy^2 + dx + ey + f = 0 
  #
  # center - center x and y
  #
  # major - major semi-axis length
  #
  # minor - minor semi-axis length
  #
  # angle - angle of rotation of the ellipse, in radians
  #
  EPS <- 1.0e-8 
  dat <- xy.coords(x, y) 
  
  D1 <- cbind(dat$x * dat$x, dat$x * dat$y, dat$y * dat$y) 
  D2 <- cbind(dat$x, dat$y, 1) 
  S1 <- t(D1) %*% D1 
  S2 <- t(D1) %*% D2 
  S3 <- t(D2) %*% D2 
  T <- -solve(S3) %*% t(S2) 
  M <- S1 + S2 %*% T 
  M <- rbind(M[3,] / 2, -M[2,], M[1,] / 2) 
  evec <- eigen(M)$vec 
  cond <- 4 * evec[1,] * evec[3,] - evec[2,]^2 
  a1 <- evec[, which(cond > 0)] 
  f <- c(a1, T %*% a1) 
  names(f) <- letters[1:6] 
  
  # calculate the center and lengths of the semi-axes 
  #
  # see http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2288654/
  # J. R. Minter
  # for the center, linear algebra to the rescue
  # center is the solution to the pair of equations
  # 2ax +  by + d = 0
  # bx  + 2cy + e = 0
  # or
  # | 2a   b |   |x|   |-d|
  # |  b  2c | * |y| = |-e|
  # or
  # A x = b
  # or
  # x = Ainv b
  # or
  # x = solve(A) %*% b
  A <- matrix(c(2*f[1], f[2], f[2], 2*f[3]), nrow=2, ncol=2, byrow=TRUE)
  b <- matrix(c(-f[4], -f[5]), nrow=2, ncol=1, byrow=TRUE)
  soln <- solve(A) %*% b

  b2 <- f[2]^2 / 4
  
  center <- c(soln[1], soln[2]) 
  names(center) <- c("x", "y") 
  
  num  <- 2 * (f[1] * f[5]^2 / 4 + f[3] * f[4]^2 / 4 + f[6] * b2 - f[2]*f[4]*f[5]/4 - f[1]*f[3]*f[6]) 
  den1 <- (b2 - f[1]*f[3]) 
  den2 <- sqrt((f[1] - f[3])^2 + 4*b2) 
  den3 <- f[1] + f[3] 
  
  semi.axes <- sqrt(c( num / (den1 * (den2 - den3)),  num / (den1 * (-den2 - den3)) )) 
  
  # calculate the angle of rotation 
  term <- (f[1] - f[3]) / f[2] 
  angle <- atan(1 / term) / 2 
  
  list(coef=f, center = center, major = max(semi.axes), minor = min(semi.axes), angle = unname(angle)) 
}

Next here is a utility function which takes a fitted ellipse and returns a matrix of vertices for plotting:

get.ellipse <- function( fit, n=360 ) 
{
  # Calculate points on an ellipse described by 
  # the fit argument as returned by fit.ellipse 
  # 
  # n is the number of points to render 
  
  tt <- seq(0, 2*pi, length=n) 
  sa <- sin(fit$angle) 
  ca <- cos(fit$angle) 
  ct <- cos(tt) 
  st <- sin(tt) 
  
  x <- fit$center[1] + fit$maj * ct * ca - fit$min * st * sa 
  y <- fit$center[2] + fit$maj * ct * sa + fit$min * st * ca 
  
  cbind(x=x, y=y) 
}

And finally, some demo code from John:

create.test.ellipse <- function(Rx=300,         # X-radius
                                Ry=200,         # Y-radius
                                Cx=250,         # X-center
                                Cy=150,         # Y-center
                                Rotation=0.4,   # Radians
                                NoiseLevel=0.5) # Gaussian Noise level
{
  set.seed(42)
  t <- seq(0, 100, by=1)
  x <- Rx * cos(t)
  y <- Ry * sin(t)
  nx <- x*cos(Rotation)-y*sin(Rotation) + Cx
  nx <- nx + rnorm(length(t))*NoiseLevel 
  ny <- x*sin(Rotation)+y*cos(Rotation) + Cy
  ny  <- ny + rnorm(length(t))*NoiseLevel
  cbind(x=nx, y=ny)
}

X <- create.test.ellipse()
efit <- fit.ellipse(X)
e <- get.ellipse(efit)
plot(X) 
lines(e, col="red") 

print(efit)
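
As a quick check on the fit, we can mark the estimated centre on the plot and print the estimates. This is a small extra snippet, not part of John's demo, using only the elements returned by fit.ellipse:

# Mark the fitted centre and report the estimated parameters
# (efit is the fitted ellipse from the demo above)
points(efit$center["x"], efit$center["y"], pch=3, col="blue")
cat(sprintf("centre = (%.1f, %.1f)  semi-axes = %.1f, %.1f  angle = %.3f rad\n",
            efit$center["x"], efit$center["y"],
            efit$major, efit$minor, efit$angle))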

¹ Halíř R., Flusser J.: Numerically stable direct least squares fitting of ellipses. In: Proceedings of the 6th International Conference in Central Europe on Computer Graphics and Visualization, WSCG '98. Plzeň, CZ, 1998, pp. 125-132.

Monday, August 6, 2012

IDE errors with Maven projects after upgrading to NetBeans 7.2

A recent upgrade of NetBeans from version 7.1 to 7.2 seemed to cause problems with Maven multi-module projects. Normally, a newly created class in one Maven project module will be instantly visible to other modules open in the IDE, at least for editing purposes. But after the upgrade I started to see "cannot find symbol" errors in the IDE for such new classes. Despite this, a full build of the project would succeed.

Googling for an answer drew a blank. The usual kludge fixes such as deleting the NetBeans cache (for NetBeans 7.2 on OSX this seems to have moved from ~/.netbeans to ~/Library/Application Support/Netbeans) and re-indexing my local repo didn't make any difference. Despair loomed.

Happily, I stumbled across the cause of the problem after only a moderate amount of pulling of hair and gnashing of teeth. My Maven settings.xml file included this section...


    <profiles>
        <profile>
            <id>netbeans-ide</id>
            <activation>
                <activeByDefault>true</activeByDefault>
            </activation>
            <properties>
                <netbeans.installation>/Applications/NetBeans/NetBeans 7.1.app/Contents/Resources/NetBeans</netbeans.installation>
            </properties>
        </profile>
    </profiles>

After editing the path to point to 7.2.app, the IDE errors disappeared.

Tuesday, April 10, 2012

Build an application plus a separate library uber-jar using Maven

I've been working on a small Java application with a colleague to simulate animal movements and look at the efficiency of different survey methods. It uses the GeoTools library to support map projections and shapefile output. GeoTools is great but comes at a cost in terms of size: the jar for our little application alone is less than 50kb but bundling it with GeoTools and its dependencies blows that out to 20Mb.

The application code has been changing on a daily basis as we explore ideas, add features and fix bugs. Working with my colleague at a distance, over a fairly feeble internet connection, I wanted to package the static libraries and the volatile application into separate jars so that he only needed to download the former once (another option would have been for my colleague to set up a local Maven repository but for various reasons this was impractical).

A slight complication with bundling GeoTools modules into a single jar (aka uber-jar) is that individual modules make extensive use of META-INF/services files which need to be merged properly. The Maven Shade plugin does this nicely and I'd previously used it when creating single, executable jars for GeoTools projects. For the current case I realised all I needed to do was set up a multi-module project with one module for the application and another, inside a profile, for the combined libraries jar.

Following is the top level POM. All of the application dependencies are declared here to be inherited by the child application and libraries uber-jar POMs. Note that we define a profile "libs" to enclose the module for the libraries jar:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

    <modelVersion>4.0.0</modelVersion>
    <groupId>org.cafeanimal</groupId>
    <artifactId>trapsimulation</artifactId>
    <packaging>pom</packaging>

    <version>0.1-SNAPSHOT</version>
    <name>Fauna trapping simulation</name>

    <properties>
        <geotools.version>8.0-M3</geotools.version>
    </properties>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.0.2</version>
                <configuration>
                    <source>1.6</source>
                    <target>1.6</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-resources-plugin</artifactId>
                <version>2.2</version>
                <configuration>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency>
            <groupId>org.geotools</groupId>
            <artifactId>gt-main</artifactId>
            <version>${geotools.version}</version>
        </dependency>
        <dependency>
            <groupId>org.geotools</groupId>
            <artifactId>gt-referencing</artifactId>
            <version>${geotools.version}</version>
        </dependency>
        <dependency>
            <groupId>org.geotools</groupId>
            <artifactId>gt-epsg-hsql</artifactId>
            <version>${geotools.version}</version>
        </dependency>

        <!-- Lots of other dependencies omitted here for brevity -->

    </dependencies>

    <modules>
        <module>app</module>
    </modules>

    <!-- specify -Dlibs to build the lib module with its uber-jar of dependencies -->
    <profiles>
        <profile>
            <id>libs</id>
            <activation>
                <property>
                    <name>libs</name>
                </property>
            </activation>

            <modules>
                <module>lib</module>
            </modules>

        </profile>
    </profiles>

</project>


The Maven module for the application resides in a sub-directory (app) of the parent project directory. Since all of the dependencies were declared in the parent POM there is very little required for the application itself:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>org.cafeanimal</groupId>
        <artifactId>trapsimulation</artifactId>
        <version>0.1-SNAPSHOT</version>
    </parent>

    <artifactId>app</artifactId>
    <name>Simulation app</name>
    <packaging>jar</packaging>

</project>


The Maven module for the libraries uber-jar is in another sub-directory (lib) of the parent project directory. It references the Maven Shade plugin which will merge the GeoTools META-INF/services files and combine the individual jars for the many dependencies into a single jar:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

    <modelVersion>4.0.0</modelVersion>

    <parent>
        <artifactId>trapsimulation</artifactId>
        <groupId>org.cafeanimal</groupId>
        <version>0.1-SNAPSHOT</version>
    </parent>

    <artifactId>lib</artifactId>
    <packaging>jar</packaging>
    <name>Library uber-jar</name>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>1.5</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <!-- This transformer combines the META-INF/services files from the individual GeoTools modules -->
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>


Now to build both the application and the libraries jar from the command line, cd to the parent project directory and enter:
mvn clean install -Dlibs

To build the application alone, simply omit the -Dlibs property.

Monday, December 6, 2010

Smooth as...

For a current project I needed a polygon smoothing algorithm in Java that:

  1. generated a curve passing through all polygon vertices

  2. had some way of controlling tightness of fit to the polygon

  3. fell within the domain of my very limited mathematical grasp

Luckily for me, Maxim Shemanarev had not only worked out a Bezier curve algorithm that satisfied all of the above, but he also published a beautifully clear explanation of how it works¹. Maxim also provided some illustrative C code, but my Java implementation below is an independent effort (ie. all mistakes are my fault). It uses geometry objects from the JTS library.

Below is an example of the algorithm in action. The original polygon is in black. The smoothed polygon composed of fitted cubic Bezier segments is in blue. The red points and lines indicate the positions of the Bezier control points calculated with Maxim's algorithm. There is a pair of control points for each vertex of the input polygon.

The tightness of fit of the smoothed boundary to the original boundary is controlled by a single parameter which varies between 0 and 1. In the figure above the tightness was set to 0. Below is the same polygon smoothed with tightness set to 0.6.

Notice that the higher tightness parameter has shifted each pair of control points closer to their respective polygon vertices.

Finally, here's the Java code. Share and enjoy...
/*
* Copyright 2010 Michael Bedward
*
* This file is part of jai-tools.
*
* jai-tools is free software: you can redistribute it and/or modify
* it under the terms of the GNU Lesser General Public License as
* published by the Free Software Foundation, either version 3 of the
* License, or (at your option) any later version.
*
* jai-tools is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU Lesser General Public License for more details.
*
* You should have received a copy of the GNU Lesser General Public
* License along with jai-tools. If not, see <http://www.gnu.org/licenses/>.
*/

package mxb.interpolation;

import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.GeometryFactory;
import com.vividsolutions.jts.geom.LinearRing;
import com.vividsolutions.jts.geom.Polygon;


/**
* Polygon smoothing by interpolation with cubic Bezier curves.
* This code implements an algorithm which was devised by Maxim Shemanarev
* and described by him at:
* <a href="http://www.antigrain.com/research/bezier_interpolation/index.html">
* http://www.antigrain.com/research/bezier_interpolation/index.html</a>.
*
* Note: the code here is <b>not</b> that written by Maxim to accompany his
* algorithm description. Rather, it is an original implementation and any
* errors are my fault.
*/
public class PolygonSmoother {
    private GeometryFactory geomFactory;

    /**
     * Calculates a pair of Bezier control points for each vertex in an
     * array of {@code Coordinates}.
     *
     * @param coords vertex coordinates
     * @param N number of coordinates in {@code coords} to use
     * @param alpha tightness of fit
     *
     * @return array of {@code Coordinates} for positions of control points
     */
    private Coordinate[][] getControlPoints(Coordinate[] coords, int N, double alpha) {
        if (alpha < 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be a value between 0 and 1 inclusive");
        }

        Coordinate[][] ctrl = new Coordinate[N][2];

        Coordinate[] v = new Coordinate[3];

        Coordinate[] mid = new Coordinate[2];
        mid[0] = new Coordinate();
        mid[1] = new Coordinate();

        Coordinate anchor = new Coordinate();
        double[] vdist = new double[2];
        double mdist;

        v[1] = coords[N - 1];
        v[2] = coords[0];
        mid[1].x = (v[1].x + v[2].x) / 2.0;
        mid[1].y = (v[1].y + v[2].y) / 2.0;
        vdist[1] = v[1].distance(v[2]);

        for (int i = 0; i < N; i++) {
            v[0] = v[1];
            v[1] = v[2];
            v[2] = coords[(i + 1) % N];

            mid[0].x = mid[1].x;
            mid[0].y = mid[1].y;
            mid[1].x = (v[1].x + v[2].x) / 2.0;
            mid[1].y = (v[1].y + v[2].y) / 2.0;

            vdist[0] = vdist[1];
            vdist[1] = v[1].distance(v[2]);

            double p = vdist[0] / (vdist[0] + vdist[1]);
            anchor.x = mid[0].x + p * (mid[1].x - mid[0].x);
            anchor.y = mid[0].y + p * (mid[1].y - mid[0].y);

            double xdelta = anchor.x - v[1].x;
            double ydelta = anchor.y - v[1].y;

            ctrl[i][0] = new Coordinate(
                    alpha*(v[1].x - mid[0].x + xdelta) + mid[0].x - xdelta,
                    alpha*(v[1].y - mid[0].y + ydelta) + mid[0].y - ydelta);

            ctrl[i][1] = new Coordinate(
                    alpha*(v[1].x - mid[1].x + xdelta) + mid[1].x - xdelta,
                    alpha*(v[1].y - mid[1].y + ydelta) + mid[1].y - ydelta);
        }

        return ctrl;
    }


    /**
     * Calculates vertices along a cubic Bezier curve given start point, end point
     * and two control points.
     *
     * @param start start position
     * @param end end position
     * @param ctrl1 first control point
     * @param ctrl2 second control point
     * @param nv number of vertices including the start and end points
     *
     * @return vertices along the Bezier curve
     */
    private Coordinate[] cubicBezier(final Coordinate start, final Coordinate end,
            final Coordinate ctrl1, final Coordinate ctrl2, final int nv) {

        final Coordinate[] curve = new Coordinate[nv];

        final Coordinate[] buf = new Coordinate[3];
        for (int i = 0; i < buf.length; i++) {
            buf[i] = new Coordinate();
        }

        curve[0] = new Coordinate(start);
        curve[nv - 1] = new Coordinate(end);

        for (int i = 1; i < nv-1; i++) {
            double t = (double) i / (nv - 1);
            double tc = 1.0 - t;

            double t0 = tc*tc*tc;
            double t1 = 3.0*tc*tc*t;
            double t2 = 3.0*tc*t*t;
            double t3 = t*t*t;
            double tsum = t0 + t1 + t2 + t3;

            Coordinate c = new Coordinate();

            c.x = t0*start.x + t1*ctrl1.x + t2*ctrl2.x + t3*end.x;
            c.x /= tsum;
            c.y = t0*start.y + t1*ctrl1.y + t2*ctrl2.y + t3*end.y;
            c.y /= tsum;

            curve[i] = c;
        }

        return curve;
    }

    /**
     * Creates a new {@code Polygon} whose exterior shell is a smoothed
     * version of the input {@code Polygon}.
     * <p>
     * Note: this method presently ignores holes.
     *
     * @param p the input {@code Polygon}
     *
     * @param alpha a value between 0 and 1 (inclusive) specifying the tightness
     * of fit of the smoothed boundary (0 is loose)
     *
     * @param pointsPerSegment the number of vertices to use to represent each
     * Bezier curve derived from an edge of the input {@code Polygon}
     *
     * @return the smoothed {@code Polygon}
     */
    public Polygon smooth(Polygon p, double alpha, int pointsPerSegment) {

        if (geomFactory == null) {
            geomFactory = new GeometryFactory();
        }

        Coordinate[] coords = p.getExteriorRing().getCoordinates();
        int Nvertices = coords.length - 1; // first coord == last coord

        Coordinate[][] controlPoints = getControlPoints(coords, Nvertices, alpha);

        Coordinate[] smoothCoords = new Coordinate[Nvertices * pointsPerSegment];
        for (int i = 0, k = 0; i < Nvertices; i++) {
            int next = (i + 1) % Nvertices;

            Coordinate[] segment = cubicBezier(
                    coords[i], coords[next],
                    controlPoints[i][1], controlPoints[next][0],
                    pointsPerSegment);

            for (int j = 0; j < pointsPerSegment; j++, k++) {
                smoothCoords[k] = segment[j];
            }
        }

        LinearRing shell = geomFactory.createLinearRing(smoothCoords);
        return geomFactory.createPolygon(shell, null);
    }
}


¹ Maxim Shemanarev's algorithm: http://www.antigrain.com/research/bezier_interpolation/index.html

Wednesday, September 1, 2010

Monte Carlo testing of classification groups

This is another article on the theme of defining groups in a hierarchical classification. A previous article described homogeneity analysis, which can be used to visualize how well any number of groups, all defined at the same level, accounts for the variability in the dataset as measured by within-group pairwise distances. Here we will look at testing whether splitting a particular group into two daughter groups is supported by the data.

Suppose we have a group G containing N objects which we are considering splitting into two groups G1 and G2 consisting of N1 and N2 objects respectively. A Monte Carlo approach to testing whether this is worth doing is as follows:

1. Calculate the mean within-group pairwise distance that would result from defining G1 and G2.

2. Randomly allocate objects to two groups of sizes N1 and N2 and calculate the resulting mean within-group pairwise distance.

3. Repeat step 2 a large number of times (e.g. 999) to create a vector of Monte Carlo means. Add the actual mean from step 1 to this vector.

4. Calculate an empirical probability from the position of the actual mean in the sorted vector from step 3.

If the groups G1 and G2 are meaningful, the probability from step 4 should be low, ie. the actual mean should be lower than most or all of the Monte Carlo means.

Below is the R code for this algorithm. It takes as input a distance matrix (as a dist object), a vector of group IDs for all objects, the IDs of the two groups being tested, and the number of Monte Carlo means to sample.



testGroups <- function(d, grp, g1=1, g2=2, nsamples=999) {
  # Performs a Monte-Carlo test of the informativeness of two groups of
  # objects by comparing the actual mean of within-group pairwise distances
  # to a vector of means derived by randomly allocating objects to the
  # two groups. If the two groups are informative, the actual mean should
  # be less than most or all of the Monte-Carlo means.
  #
  # Arguments:
  # d - distance matrix (a dist object)
  # grp - vector of group ids for all objects in the distance matrix
  #       (ids of objects outside the two groups being considered can
  #       be set to arbitrary values)
  # g1 - id of first group to be tested
  # g2 - id of second group to be tested
  # nsamples - number of random groupings to generate (if possible)
  #
  # The function attempts to generate unique allocations using the following rules:
  #
  # 1. If nsamples is greater than choose(N1+N2, N1) it is reduced accordingly
  #    and a warning message issued. All unique allocations are then
  #    generated systematically.
  #
  # 2. If nsamples is greater than choose(N1+N2, N1) / 20, allocations are
  #    generated systematically to guarantee uniqueness.
  #
  # 3. If nsamples is less than choose(N1+N2, N1) / 20, allocations are
  #    generated randomly with no check for uniqueness to reduce running
  #    time.
  #
  # The function invisibly returns a list with the following elements:
  # actual.mean - the mean within-group pairwise distance for actual groupings
  # prob - probability of observing a mean less than or equal to the actual
  #        mean, calculated by comparing the actual value to a vector of
  #        means from random allocations plus the actual mean
  # mc.means - mean within-group pairwise distance for each random allocation
  #
  # Michael Bedward, August 2010

  # Constants
  SYSTEMATIC <- 1
  RANDOM <- 2

  Nobj <- length(grp)
  if (length(d) != Nobj*(Nobj-1)/2) {
    stop("Number of objects differs between dist object and group vector")
  }

  N1 <- sum(grp == g1)
  N2 <- sum(grp == g2)
  if (N1 == 1 & N2 == 1) {
    stop("Can't test split between two singleton groups")
  }

  objects <- which(grp == g1 | grp == g2)

  Nc <- choose(N1+N2, N1)

  if (nsamples >= Nc/20) {
    if (nsamples > Nc) {
      warning(paste("Requested number of samples (", nsamples,
                    ") exceeds possible rearrangements (", Nc, ")", sep=""))
      nsamples <- Nc
    }
    sample.method <- SYSTEMATIC
  } else {
    sample.method <- RANDOM
  }

  # Helper function: calculates total pairwise distance between
  # objects within a group. Returns total and number of
  # pairwise distances
  calc.grp.dist <- function(grp.obj) {
    if (length(grp.obj) == 1) {
      return( c(0, 0) )
    }

    pairwise <- combn(sort(grp.obj), 2)
    di <- Nobj * (pairwise[1,] - 1) - pairwise[1,]*(pairwise[1,]-1)/2 + pairwise[2,]-pairwise[1,]
    c(sum(d[di]), length(di))
  }

  # Helper function: calculates mean within-group distance.
  # Arg indices holds the indices of elements in the objects vector for
  # one of the groups.
  calc.mean <- function(indices) {
    g1dist <- calc.grp.dist(objects[indices])
    g2dist <- calc.grp.dist(objects[-indices])
    (g1dist[1] + g2dist[1]) / (g1dist[2] + g2dist[2])
  }

  samples <- numeric(nsamples+1)

  # actual within-group mean distance
  samples[1] <- calc.mean(which(grp[objects] == g1))

  # randomly allocate objects to groups and calculate means
  Nmin <- min(N1, N2)
  k <- 2
  if (sample.method == SYSTEMATIC) {
    combns <- combn(N1+N2, Nmin)[, sample(Nc, nsamples)]
  } else { # sample.method == RANDOM
    combns <- replicate(nsamples, sample(N1+N2, Nmin))
  }

  apply( combns, 2, function(x) {samples[k] <<- calc.mean(x); k <<- k+1} )
  prob <- ecdf(samples)(samples[1])
  print(paste("Prob =", prob, "from", nsamples, "samples plus actual value"))
  invisible(list(actual.mean=samples[1], prob=prob, mc.means=samples[-1]))
}
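
As an example of calling the function, here is a minimal sketch that tests the two-group level of a classification of the USArrests data, with cutree supplying the group ids:

# Does splitting the USArrests classification into its two top-level
# groups do better than random allocations of the same sizes?
d <- dist(USArrests)
hc <- hclust(d, method="average")
grp <- cutree(hc, k=2)        # group ids 1 and 2 for every object
res <- testGroups(d, grp, g1=1, g2=2, nsamples=999)
res$prob                      # a low value supports keeping the split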

Thursday, August 12, 2010

Fun with the proto package: building an MCMC sampler for Bayesian regression

The proto package is my latest favourite R goodie. It brings prototype-based programming to the R language - a style of programming that lets you do many of the things you can do with classes, but with a lot less up-front work. Louis Kates and Thomas Petzoldt provide an excellent introduction to using proto in the package vignette.

As a learning exercise I concocted the example below involving Bayesian logistic regression. It was inspired by an article on Matt Shotwell's blog about using R environments rather than lists to store the state of a Markov Chain Monte Carlo sampler. Here I use proto to create a parent class-like object (or trait in proto-ese) to contain the regression functions and create child objects to hold both data and results for individual analyses.

First here's an example session...

# Make up some data with a continuous predictor and binary response
nrec <- 500
x <- rnorm(nrec)
y <- rbinom(nrec, 1, plogis(2 - 4*x))

# Predictor matrix with a col of 1s for intercept
pred <- matrix(c(rep(1, nrec), x), ncol=2)
colnames(pred) <- c("intercept", "X")

# Load the proto package
library(proto)

# Use the Logistic parent object to create a child object which will
# hold the data and run the regression (the $ operator references
# functions and data within a proto object)
lr <- Logistic$new(pred, y)
lr$run(5000, 1000)

# lr now contains both data and results
str(lr)

proto object
$ cov : num [1:2, 1:2] 0.05 -0.0667 -0.0667 0.1621
..- attr(*, "dimnames")=List of 2
$ prior.cov: num [1:2, 1:2] 100 0 0 100
$ prior.mu : num [1:2] 0 0
$ beta : num [1:5000, 1:2] 2.09 2.09 2.09 2.21 2.21 ...
..- attr(*, "dimnames")=List of 2
$ adapt : num 1000
$ y : num [1:500] 0 1 1 1 1 1 1 1 1 1 ...
$ x : num [1:500, 1:2] 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "dimnames")=List of 2
parent: proto object

# Use the Logistic summary function to tabulate and plot results
lr$summary()

From 5000 samples after 1000 iterations burning in
intercept X
Min. :1.420 Min. :-5.296
1st Qu.:1.840 1st Qu.:-3.915
Median :2.000 Median :-3.668
Mean :1.994 Mean :-3.693
3rd Qu.:2.128 3rd Qu.:-3.455
Max. :2.744 Max. :-2.437
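
Because the raw samples are kept in the child object, further diagnostics are easy. For example, a trace plot of the slope coefficient (a quick extra check, not part of the session above):

# Trace plot of the second coefficient (the slope for X)
plot(lr$beta[, 2], type="l", xlab="iteration", ylab="X coefficient")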



And here's the code for the Logistic trait...

Logistic <- proto()

Logistic$new <- function(., x, y) {
  # Creates a child object to hold data and results
  #
  # x - a design matrix (ie. predictors)
  # y - a binary response vector

  proto(., x=x, y=y)
}

Logistic$run <- function(., niter, adapt=1000) {
  # Perform the regression by running the MCMC
  # sampler
  #
  # niter - number of iterations to sample
  # adapt - number of prior iterations to run
  #         for the 'burning in' period

  require(mvtnorm)

  # Set up variables used by the sampler
  .$adapt <- adapt
  total.iter <- niter + adapt
  .$beta <- matrix(0, nrow=total.iter, ncol=ncol(.$x))
  .$prior.mu <- rep(0, ncol(.$x))
  .$prior.cov <- diag(100, ncol(.$x))
  .$cov <- diag(ncol(.$x))

  # Run the sampler
  b <- rep(0, ncol(.$x))
  for (i in 1:total.iter) {
    b <- .$update(i, b)
    .$beta[i,] <- b
  }

  # Trim the results matrix to remove the burn-in
  # period
  if (.$adapt > 0) {
    .$beta <- .$beta[(.$adapt + 1):total.iter,]
  }
}

Logistic$update <- function(., it, beta.old) {
  # Perform a single iteration of the MCMC sampler using
  # the Metropolis-Hastings algorithm.
  # Adapted from code by Brian Neelon published at:
  # http://www.duke.edu/~neelo003/r/
  #
  # it - iteration number
  # beta.old - vector of coefficient values from
  #            the previous iteration

  # Update the coefficient covariance if we are far
  # enough through the sampling
  if (.$adapt > 0 & it > 2 * .$adapt) {
    .$cov <- cov(.$beta[(it - .$adapt):(it - 1),])
  }

  # generate proposed new coefficient values
  beta.new <- c(beta.old + rmvnorm(1, sigma=.$cov))

  # calculate prior and current probabilities and log-likelihood
  if (it == 1) {
    .$..log.prior.old <- dmvnorm(beta.old, .$prior.mu, .$prior.cov, log=TRUE)
    .$..probs.old <- plogis(.$x %*% beta.old)
    .$..LL.old <- sum(log(ifelse(.$y, .$..probs.old, 1 - .$..probs.old)))
  }
  log.prior.new <- dmvnorm(beta.new, .$prior.mu, .$prior.cov, log=TRUE)
  probs.new <- plogis(.$x %*% beta.new)
  LL.new <- sum(log(ifelse(.$y, probs.new, 1 - probs.new)))

  # Metropolis-Hastings acceptance ratio (log scale)
  ratio <- LL.new + log.prior.new - .$..LL.old - .$..log.prior.old

  if (log(runif(1)) < ratio) {
    .$..log.prior.old <- log.prior.new
    .$..probs.old <- probs.new
    .$..LL.old <- LL.new
    return(beta.new)
  } else {
    return(beta.old)
  }
}

Logistic$summary <- function(., show.plot=TRUE) {
  # Summarize the results

  cat("From", nrow(.$beta), "samples after", .$adapt, "iterations burning in\n")
  base::print(base::summary(.$beta))

  if (show.plot) {
    par(mfrow=c(1, ncol(.$beta)))
    for (i in 1:ncol(.$beta)) {
      plot(density(.$beta[,i]), main=colnames(.$beta)[i])
    }
  }
}


Now that's probably not the greatest design in the world, but it only took me a few minutes to put it together and it's incredibly easy to modify or extend. Try it!
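
For instance, adding another reporting method is just a matter of assigning a new function to the trait. This little sketch (not part of the code above) adds posterior quantiles, and existing child objects such as lr pick it up automatically through delegation:

# Posterior quantiles for each coefficient
Logistic$quantiles <- function(., probs=c(0.025, 0.5, 0.975)) {
  apply(.$beta, 2, quantile, probs=probs)
}

lr$quantiles()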

Thanks to Brian Neelon for making his MCMC logistic regression code available (http://www.duke.edu/~neelo003/r/).

Tuesday, August 10, 2010

Homogeneity analysis of hierarchical classifications

I've spent more years than I care to remember analysing vegetation survey data (typically species abundances in plots) using a variety of software including my own algorithms coded in FORTRAN and C++. A recent query on the r-help list, about how to determine the number of groups to define in a hierarchical classification produced with the hclust function, prompted me to unearth one of these algorithms, homogeneity analysis¹, which can help to visualize how different levels of grouping partition the variability in a distance matrix.

This algorithm is extremely simple. The classification is progressively divided into groups, with all groups being defined at the same dendrogram height. At each level of grouping, the average of within-group pairwise distances is calculated. Homogeneity is then defined as:

H = 1 - Dav(within-group) / Dav(total)

where Dav(within-group) is the average of within-group pairwise distances and Dav(total) is the average pairwise distance in the dataset as a whole.

For data where there exist well-defined clusters of values, a plot of homogeneity against number of groups will display an 'elbow' where the initial rapid increase in homogeneity turns to a more gradual increase. The example above shows a classification of the USArrests dataset and the corresponding homogeneity plot, which suggests defining 7 groups. It was generated as follows:

> d <- dist(USArrests)
> hc <- hclust(d, method="average")
> h <- hmg(d, hc)
> plot(h, type="s", main="USArrests homogeneity")
> abline(v=7, lty=2)
> plot(hc, labels=FALSE, main="USArrests dendrogram")
> rect.hclust(hc, k=7)
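
As a quick cross-check of the definition above (not part of the original analysis), homogeneity for a single level of grouping can also be computed directly from the dist object, with cutree defining the groups:

# Direct computation of homogeneity for k groups: pooled mean of
# within-group pairwise distances relative to the overall mean distance
hmg.direct <- function(d, hc, k) {
  grp <- cutree(hc, k)
  dmat <- as.matrix(d)
  within <- unlist(lapply(split(seq_along(grp), grp), function(idx) {
    as.numeric(as.dist(dmat[idx, idx, drop=FALSE]))
  }))
  1 - mean(within) / mean(d)
}

hmg.direct(d, hc, 7)   # should agree with the hmg() value at 7 groups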

Here is the code for the hmg function:

hmg <- function(d, hc) {
  # R version of homogeneity analysis as described in:
  # Bedward, Pressey and Keith. 1992. Homogeneity analysis: assessing the
  # utility of classifications and maps of natural resources.
  # Australian Journal of Ecology 17:133-139.
  #
  # Arguments:
  # d - either an object produced by dist() or a vector of
  #     pairwise dissimilarity values ordered in the manner of
  #     a dist result
  #
  # hc - classification produced by hclust()
  #
  # Value:
  # A two column matrix of number of groups and corresponding homogeneity value
  #
  # This code by Michael Bedward, 2010

  m <- hc$merge

  # check args for consistency
  Nobj <- nrow(m) + 1
  if (length(d) != Nobj * (Nobj-1) / 2) {
    dname <- deparse(substitute(d))
    hcname <- deparse(substitute(hc))
    stop(paste(dname, "and", hcname, "refer to different numbers of objects"))
  }

  distMean <- mean(d)
  grp <- matrix(c(1:Nobj, rep(0, Nobj)), ncol=2)

  # helper function - decodes values in the merge matrix
  # from hclust
  getGrp <- function( idx ) {
    ifelse( idx < 0, grp[-idx], getGrp( m[idx, 1] ) )
  }

  grpDistTotal <- numeric(Nobj)
  grpDistCount <- numeric(Nobj)
  hmg <- numeric(Nobj)
  hmg[1] <- 0
  hmg[Nobj] <- 1

  distTotal <- 0
  distCount <- 0
  for (step in 1:(Nobj-2)) {
    mrec <- m[step, ]
    grp1 <- getGrp(mrec[1])
    grp2 <- getGrp(mrec[2])

    distTotal <- distTotal - grpDistTotal[grp1] - grpDistTotal[grp2]
    distCount <- distCount - grpDistCount[grp1] - grpDistCount[grp2]

    grpDistTotal[grp1] <- grpDistTotal[grp1] + grpDistTotal[grp2]
    grpDistCount[grp1] <- grpDistCount[grp1] + grpDistCount[grp2]

    grp1Obj <- which(grp == grp1)
    grp2Obj <- which(grp == grp2)

    for (i in grp1Obj) {
      for (j in grp2Obj) {
        ii <- min(i,j)
        jj <- max(i,j)
        distIdx <- Nobj*(ii-1) - ii*(ii-1)/2 + jj-ii
        grpDistTotal[grp1] <- grpDistTotal[grp1] + d[distIdx]
        grpDistCount[grp1] <- grpDistCount[grp1] + 1
      }
    }

    # update group vector
    grp[grp == grp2] <- grp1

    distTotal <- distTotal + grpDistTotal[grp1]
    distCount <- distCount + grpDistCount[grp1]

    hmg[Nobj - step] <- 1 - distTotal / ( distCount * distMean )
  }

  out <- matrix(c(1:Nobj, hmg), ncol=2)
  colnames(out) <- c("ngroups", "homogeneity")
  out
}


Reference
¹ M. Bedward, D. A. Keith, R. L. Pressey. 1992. Homogeneity analysis: Assessing the utility of classifications and maps of natural resources. Australian Journal of Ecology 17(2): 133-139.