Skip to main content

Monte Carlo testing of classification groups

This is another article on the theme of defining groups in a hierarchical classification. A previous article described homogeneity analysis to visualize how any well any number of groups, defined at the same level accounts for the variability in the dataset, as measured by within-group pairwise distances. Here we will look at testing whether splitting a particular group into two daughter groups is supported by the data.

Suppose we have a group G containing N objects which we are considering splitting into two groups G1 and G2 consisting of N1 and N2 objects respectively. A Monte Carlo approach to testing whether this is worth doing is as follows:

1. Calculate the mean within-group pairwise distance that would result from defining G1 and G2.

2. Randomly allocate objects to two groups of sizes N1 and N2 and calculate the resulting mean within-group pairwise distance.

3. Repeat step 2 a large number of times (e.g. 999) to create a vector of Monte Carlo means. Add the actual mean from step 1 to this vector.

4. Calculate an empirical probability from the position of the actual mean in the sorted vector from step 3.

If the groups G1 and G2 are meaningful, the probability from step 4 should be low, ie. the actual mean should be lower than most or all of the Monte Carlo means.

Below is the R code for this algorithm. It takes as input a distance matrix; a vector of group IDs for all objects; the IDs of the two groups being tested and the number of Monte Carlo means to sample.



testGroups <- function (d, grp, g1=1, g2=2, nsamples=999)
{
# Performs a Monte-Carlo test of the informativeness of two groups of
# objects by comparing the actual mean of within-group pairwise distances
# to a vector of means derived by randomly allocating objects to the
# two groups. If the two groups are informative, the actual mean should
# be less than most or all of the Monte-Carlo means.
#
# Arguments:
# d - distance matrix
# grp - vector of group ids for all objects in the distance matrix
# (ids of objects outside the two groups being considered can
# be set to arbitrary values)
# g1 - id of first group to be tested
# g2 - id of second group to be tested
# nsamples - number of random groupings to generate (if possible)
#
# The function attempts to generate unique allocations using the following rules:
#
# 1. If nsamples is greater than choose(N1+N2, N1) it is reduced accordingly
# and a warning message issued. All unique allocations are then
# generated systematically.
#
# 2. If nsamples is greater than choose(N1+N2, N1) / 20 allocations are generated
# systematically to guarantee uniqueness.
#
# 3. If nsamples is less than choose(N1+N2, N1) / 20, allocations are
# generated randomly with no check for uniqueness to reduce running
# time.
#
# The function invisibly returns a list with the following elements:
# actual.mean - the mean within-group pairwise distance for actual groupings
# prob - probability of observing a mean less than or equal to the actual
# mean calculated by comparing the actual value to a vector of
# means from random allocations plus the actual mean
# mc.mean - mean within-group pairwise distance for each random allocation
#
# Michael Bedward, August 2010

# Constants
SYSTEMATIC <- 1
RANDOM <- 2

Nobj <- length(grp)
if (length(d) != Nobj*(Nobj-1)/2) {
stop("Number of objects differs between dist object and group vector")
}

N1 <- sum(grp == g1)
N2 <- sum(grp == g2)
if (N1 == 1 & N2 == 1) {
stop("Can't test split between two singleton groups")
}

objects <- which(grp == g1 | grp == g2)

Nc <- choose(N1+N2, N1)

if (nsamples >= Nc/20) {
if (nsamples > Nc) {
warning(paste("Requested number of samples (", nsamples, ") exceeds possible rearrangements (", Nc, ")", sep=""))
nsamples <- Nc
}
sample.method <- SYSTEMATIC
} else {
sample.method <- RANDOM
}

# Helper function: calculates total pairwise distance between
# objects within a group. Returns total and number of
# pairwise distances
calc.grp.dist <- function(grp.obj) {
if (length(grp.obj) == 1) {
return( c(0, 0) )
}

pairwise <- combn(sort(grp.obj), 2)
di <- Nobj * (pairwise[1,] - 1) - pairwise[1,]*(pairwise[1,]-1)/2 + pairwise[2,]-pairwise[1,]
c(sum(d[di]), length(di))
}

# Helper function: calculates mean within-group distance.
# Arg indices holds the indices of elements in the objects vector for
# one of the groups.
calc.mean <- function(indices) {
g1dist <- calc.grp.dist(objects[indices])
g2dist <- calc.grp.dist(objects[-indices])
(g1dist[1] + g2dist[1]) / (g1dist[2] + g2dist[2])
}

samples <- numeric(nsamples+1)

# actual within-group mean distance
samples[1] <- calc.mean(which(grp == g1))

# randomly allocate objects to groups and calculate means
Nmin <- min(N1, N2)
k <- 2
if (sample.method == SYSTEMATIC) {
combns <- combn(N1+N2, Nmin)[, sample(Nc, nsamples)]
} else { # sample.method == RANDOM
combns <- replicate(nsamples, sample(N1+N2, Nmin))
}

apply( combns, 2, function(x){samples[k] <<- calc.mean(x); k <<- k+1} )
prob <- ecdf(samples)(samples[1])
print(paste("Prob =", prob, "from", nsamples, "samples plus actual value"))
invisible(list(actual.mean=samples[1], prob=prob, mc.means=samples[-1]))
}

Comments

Popular posts from this blog

Circle packing in R (again)

Back in 2010 I posted some R code for circle packing . Now, just five years later, I've ported the code to Rcpp and created a little package which you can find at GitHub . The main function is circleLayout which takes a set of overlapping circles and tries to find a non-overlapping arrangement for them. Here's an example: And here's the code: # Create some random circles, positioned within the central portion # of a bounding square, with smaller circles being more common than # larger ones. ncircles

Fitting an ellipse to point data

Some time ago I wrote an R function to fit an ellipse to point data, using an algorithm developed by Radim Halíř and Jan Flusser 1 in Matlab, and posted it to the r-help list . The implementation was a bit hacky, returning odd results for some data. A couple of days ago, an email arrived from John Minter asking for a pointer to the original code. I replied with a link and mentioned that I'd be interested to know if John made any improvements to the code. About ten minutes later, John emailed again with a much improved version ! Not only is it more reliable, but also more efficient. So with many thanks to John, here is the improved code: fit.ellipse 0)] f Next here is a utility function which takes a fitted ellipse and returns a matrix of vertices for plotting: get.ellipse And finally, some demo code from John: create.test.ellipse 1 Halíř R., Flusser J.: Numerically stable direct least squares fitting of ellipses. In: Proceedings of the 6th International Confere...

Circle packing with R

To visualize the results of a simulation model of woodland trees within R, I needed an algorithm that could arrange a large number of circles within a rectangle such that no two circles overlapped by more than a specified amount. A colleague had approached this problem earlier by sorting the circles in order of descending size, then randomly dropping each one into the rectangle repeatedly until it landed in a position with acceptable overlap. I suspected a faster and more robust algorithm could be constructed using some kind of "jiggling the circles" approach. Luckily for me, I discovered that Sean McCullough had written a really nice example of circles packing into a cluster using the Processing language. Sean's program is based on an iterative pair-repulsion algorithm in which overlapping circles move away from each other. Based on this, and modifying the algorithm a little, I came up with an R function to produce constrained random layouts of a given set of circles. ...