Skip to main content

Monte Carlo testing of classification groups

This is another article on the theme of defining groups in a hierarchical classification. A previous article described homogeneity analysis to visualize how any well any number of groups, defined at the same level accounts for the variability in the dataset, as measured by within-group pairwise distances. Here we will look at testing whether splitting a particular group into two daughter groups is supported by the data.

Suppose we have a group G containing N objects which we are considering splitting into two groups G1 and G2 consisting of N1 and N2 objects respectively. A Monte Carlo approach to testing whether this is worth doing is as follows:

1. Calculate the mean within-group pairwise distance that would result from defining G1 and G2.

2. Randomly allocate objects to two groups of sizes N1 and N2 and calculate the resulting mean within-group pairwise distance.

3. Repeat step 2 a large number of times (e.g. 999) to create a vector of Monte Carlo means. Add the actual mean from step 1 to this vector.

4. Calculate an empirical probability from the position of the actual mean in the sorted vector from step 3.

If the groups G1 and G2 are meaningful, the probability from step 4 should be low, ie. the actual mean should be lower than most or all of the Monte Carlo means.

Below is the R code for this algorithm. It takes as input a distance matrix; a vector of group IDs for all objects; the IDs of the two groups being tested and the number of Monte Carlo means to sample.



testGroups <- function (d, grp, g1=1, g2=2, nsamples=999)
{
# Performs a Monte-Carlo test of the informativeness of two groups of
# objects by comparing the actual mean of within-group pairwise distances
# to a vector of means derived by randomly allocating objects to the
# two groups. If the two groups are informative, the actual mean should
# be less than most or all of the Monte-Carlo means.
#
# Arguments:
# d - distance matrix
# grp - vector of group ids for all objects in the distance matrix
# (ids of objects outside the two groups being considered can
# be set to arbitrary values)
# g1 - id of first group to be tested
# g2 - id of second group to be tested
# nsamples - number of random groupings to generate (if possible)
#
# The function attempts to generate unique allocations using the following rules:
#
# 1. If nsamples is greater than choose(N1+N2, N1) it is reduced accordingly
# and a warning message issued. All unique allocations are then
# generated systematically.
#
# 2. If nsamples is greater than choose(N1+N2, N1) / 20 allocations are generated
# systematically to guarantee uniqueness.
#
# 3. If nsamples is less than choose(N1+N2, N1) / 20, allocations are
# generated randomly with no check for uniqueness to reduce running
# time.
#
# The function invisibly returns a list with the following elements:
# actual.mean - the mean within-group pairwise distance for actual groupings
# prob - probability of observing a mean less than or equal to the actual
# mean calculated by comparing the actual value to a vector of
# means from random allocations plus the actual mean
# mc.mean - mean within-group pairwise distance for each random allocation
#
# Michael Bedward, August 2010

# Constants
SYSTEMATIC <- 1
RANDOM <- 2

Nobj <- length(grp)
if (length(d) != Nobj*(Nobj-1)/2) {
stop("Number of objects differs between dist object and group vector")
}

N1 <- sum(grp == g1)
N2 <- sum(grp == g2)
if (N1 == 1 & N2 == 1) {
stop("Can't test split between two singleton groups")
}

objects <- which(grp == g1 | grp == g2)

Nc <- choose(N1+N2, N1)

if (nsamples >= Nc/20) {
if (nsamples > Nc) {
warning(paste("Requested number of samples (", nsamples, ") exceeds possible rearrangements (", Nc, ")", sep=""))
nsamples <- Nc
}
sample.method <- SYSTEMATIC
} else {
sample.method <- RANDOM
}

# Helper function: calculates total pairwise distance between
# objects within a group. Returns total and number of
# pairwise distances
calc.grp.dist <- function(grp.obj) {
if (length(grp.obj) == 1) {
return( c(0, 0) )
}

pairwise <- combn(sort(grp.obj), 2)
di <- Nobj * (pairwise[1,] - 1) - pairwise[1,]*(pairwise[1,]-1)/2 + pairwise[2,]-pairwise[1,]
c(sum(d[di]), length(di))
}

# Helper function: calculates mean within-group distance.
# Arg indices holds the indices of elements in the objects vector for
# one of the groups.
calc.mean <- function(indices) {
g1dist <- calc.grp.dist(objects[indices])
g2dist <- calc.grp.dist(objects[-indices])
(g1dist[1] + g2dist[1]) / (g1dist[2] + g2dist[2])
}

samples <- numeric(nsamples+1)

# actual within-group mean distance
samples[1] <- calc.mean(which(grp == g1))

# randomly allocate objects to groups and calculate means
Nmin <- min(N1, N2)
k <- 2
if (sample.method == SYSTEMATIC) {
combns <- combn(N1+N2, Nmin)[, sample(Nc, nsamples)]
} else { # sample.method == RANDOM
combns <- replicate(nsamples, sample(N1+N2, Nmin))
}

apply( combns, 2, function(x){samples[k] <<- calc.mean(x); k <<- k+1} )
prob <- ecdf(samples)(samples[1])
print(paste("Prob =", prob, "from", nsamples, "samples plus actual value"))
invisible(list(actual.mean=samples[1], prob=prob, mc.means=samples[-1]))
}

Comments

Popular posts from this blog

Fitting an ellipse to point data

Some time ago I wrote an R function to fit an ellipse to point data, using an algorithm developed by Radim Halíř and Jan Flusser1 in Matlab, and posted it to the r-help list. The implementation was a bit hacky, returning odd results for some data. A couple of days ago, an email arrived from John Minter asking for a pointer to the original code. I replied with a link and mentioned that I'd be interested to know if John made any improvements to the code. About ten minutes later, John emailed again with a much improved version ! Not only is it more reliable, but also more efficient. So with many thanks to John, here is the improved code: fit.ellipse <- function (x, y = NULL) { # from: # http://r.789695.n4.nabble.com/Fitting-a-half-ellipse-curve-tp2719037p2720560.html # # Least squares fitting of an ellipse to point data # using the algorithm described in: # Radim Halir & Jan Flusser. 1998. # Numerically stable direct least squares fitting of ellipses. …

Build an application plus a separate library uber-jar using Maven

I've been working on a small Java application with a colleague to simulate animal movements and look at the efficiency of different survey methods. It uses the GeoTools library to support map projections and shapefile output. GeoTools is great but comes at a cost in terms of size: the jar for our little application alone is less than 50kb but bundling it with GeoTools and its dependencies blows that out to 20Mb.

The application code has been changing on a daily basis as we explore ideas, add features and fix bugs. Working with my colleague at a distance, over a fairly feeble internet connection, I wanted to package the static libraries and the volatile application into separate jars so that he only needed to download the former once (another option would have been for my colleague to set up a local Maven repository but for various reasons this was impractical).

A slight complication with bundling GeoTools modules into a single jar (aka uber-jar) is that individual modules make ext…

packcircles version 0.2.0 released

Version 0.2.0 of the packcircles package has just been published on CRAN. This package provides functions to find non-overlapping arrangements of circles in bounded and unbounded areas. The package how has a new circleProgressiveLayout function. It uses an efficient deterministic algorithm to arrange circles by consecutively placing each one externally tangent to two previously placed circles while avoiding overlaps. It was adapted from a version written in C by Peter Menzel who lent his support to creating this R/Rcpp version. The underlying algorithm is described in the paper: Visualization of large hierarchical data by circle packing by Weixin Wang, Hui Wang, Guozhong Dai, and Hongan Wang. Published in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2006, pp. 517-520 (Available from ACM). Here is a small example of the new function, taken from the package vignette: library(packcircles) library(ggplot2) t <- theme_bw() + theme(panel.grid = el…