Skip to main content

Monte Carlo testing of classification groups

This is another article on the theme of defining groups in a hierarchical classification. A previous article described homogeneity analysis to visualize how any well any number of groups, defined at the same level accounts for the variability in the dataset, as measured by within-group pairwise distances. Here we will look at testing whether splitting a particular group into two daughter groups is supported by the data.

Suppose we have a group G containing N objects which we are considering splitting into two groups G1 and G2 consisting of N1 and N2 objects respectively. A Monte Carlo approach to testing whether this is worth doing is as follows:

1. Calculate the mean within-group pairwise distance that would result from defining G1 and G2.

2. Randomly allocate objects to two groups of sizes N1 and N2 and calculate the resulting mean within-group pairwise distance.

3. Repeat step 2 a large number of times (e.g. 999) to create a vector of Monte Carlo means. Add the actual mean from step 1 to this vector.

4. Calculate an empirical probability from the position of the actual mean in the sorted vector from step 3.

If the groups G1 and G2 are meaningful, the probability from step 4 should be low, ie. the actual mean should be lower than most or all of the Monte Carlo means.

Below is the R code for this algorithm. It takes as input a distance matrix; a vector of group IDs for all objects; the IDs of the two groups being tested and the number of Monte Carlo means to sample.



testGroups <- function (d, grp, g1=1, g2=2, nsamples=999)
{
# Performs a Monte-Carlo test of the informativeness of two groups of
# objects by comparing the actual mean of within-group pairwise distances
# to a vector of means derived by randomly allocating objects to the
# two groups. If the two groups are informative, the actual mean should
# be less than most or all of the Monte-Carlo means.
#
# Arguments:
# d - distance matrix
# grp - vector of group ids for all objects in the distance matrix
# (ids of objects outside the two groups being considered can
# be set to arbitrary values)
# g1 - id of first group to be tested
# g2 - id of second group to be tested
# nsamples - number of random groupings to generate (if possible)
#
# The function attempts to generate unique allocations using the following rules:
#
# 1. If nsamples is greater than choose(N1+N2, N1) it is reduced accordingly
# and a warning message issued. All unique allocations are then
# generated systematically.
#
# 2. If nsamples is greater than choose(N1+N2, N1) / 20 allocations are generated
# systematically to guarantee uniqueness.
#
# 3. If nsamples is less than choose(N1+N2, N1) / 20, allocations are
# generated randomly with no check for uniqueness to reduce running
# time.
#
# The function invisibly returns a list with the following elements:
# actual.mean - the mean within-group pairwise distance for actual groupings
# prob - probability of observing a mean less than or equal to the actual
# mean calculated by comparing the actual value to a vector of
# means from random allocations plus the actual mean
# mc.mean - mean within-group pairwise distance for each random allocation
#
# Michael Bedward, August 2010

# Constants
SYSTEMATIC <- 1
RANDOM <- 2

Nobj <- length(grp)
if (length(d) != Nobj*(Nobj-1)/2) {
stop("Number of objects differs between dist object and group vector")
}

N1 <- sum(grp == g1)
N2 <- sum(grp == g2)
if (N1 == 1 & N2 == 1) {
stop("Can't test split between two singleton groups")
}

objects <- which(grp == g1 | grp == g2)

Nc <- choose(N1+N2, N1)

if (nsamples >= Nc/20) {
if (nsamples > Nc) {
warning(paste("Requested number of samples (", nsamples, ") exceeds possible rearrangements (", Nc, ")", sep=""))
nsamples <- Nc
}
sample.method <- SYSTEMATIC
} else {
sample.method <- RANDOM
}

# Helper function: calculates total pairwise distance between
# objects within a group. Returns total and number of
# pairwise distances
calc.grp.dist <- function(grp.obj) {
if (length(grp.obj) == 1) {
return( c(0, 0) )
}

pairwise <- combn(sort(grp.obj), 2)
di <- Nobj * (pairwise[1,] - 1) - pairwise[1,]*(pairwise[1,]-1)/2 + pairwise[2,]-pairwise[1,]
c(sum(d[di]), length(di))
}

# Helper function: calculates mean within-group distance.
# Arg indices holds the indices of elements in the objects vector for
# one of the groups.
calc.mean <- function(indices) {
g1dist <- calc.grp.dist(objects[indices])
g2dist <- calc.grp.dist(objects[-indices])
(g1dist[1] + g2dist[1]) / (g1dist[2] + g2dist[2])
}

samples <- numeric(nsamples+1)

# actual within-group mean distance
samples[1] <- calc.mean(which(grp == g1))

# randomly allocate objects to groups and calculate means
Nmin <- min(N1, N2)
k <- 2
if (sample.method == SYSTEMATIC) {
combns <- combn(N1+N2, Nmin)[, sample(Nc, nsamples)]
} else { # sample.method == RANDOM
combns <- replicate(nsamples, sample(N1+N2, Nmin))
}

apply( combns, 2, function(x){samples[k] <<- calc.mean(x); k <<- k+1} )
prob <- ecdf(samples)(samples[1])
print(paste("Prob =", prob, "from", nsamples, "samples plus actual value"))
invisible(list(actual.mean=samples[1], prob=prob, mc.means=samples[-1]))
}

Comments

Popular posts from this blog

Fitting an ellipse to point data

Some time ago I wrote an R function to fit an ellipse to point data, using an algorithm developed by Radim Halíř and Jan Flusser 1 in Matlab, and posted it to the r-help list . The implementation was a bit hacky, returning odd results for some data. A couple of days ago, an email arrived from John Minter asking for a pointer to the original code. I replied with a link and mentioned that I'd be interested to know if John made any improvements to the code. About ten minutes later, John emailed again with a much improved version ! Not only is it more reliable, but also more efficient. So with many thanks to John, here is the improved code: fit.ellipse 0)] f Next here is a utility function which takes a fitted ellipse and returns a matrix of vertices for plotting: get.ellipse And finally, some demo code from John: create.test.ellipse 1 Halíř R., Flusser J.: Numerically stable direct least squares fitting of ellipses. In: Proceedings of the 6th International Confere...

Circle packing in R (again)

Back in 2010 I posted some R code for circle packing . Now, just five years later, I've ported the code to Rcpp and created a little package which you can find at GitHub . The main function is circleLayout which takes a set of overlapping circles and tries to find a non-overlapping arrangement for them. Here's an example: And here's the code: # Create some random circles, positioned within the central portion # of a bounding square, with smaller circles being more common than # larger ones. ncircles

Graph-based circle packing

The previous two posts showed examples of a simple circle packing algorithm using the packcircles package (available from CRAN and GitHub ). The algorithm involved iterative pair-repulsion to jiggle the circles until (hopefully) a non-overlapping arrangement emerged. In this post we'll look an alternative approach. An algorithm to find an arrangement of circles satisfying a prior specification of circle sizes and tangencies was described by Collins and Stephenson in their 2003 paper in Computation Geometry Theory and Applications. A version of their algorithm was implemented in Python by David Eppstein as part of his PADS library (see CirclePack.py ). I've now ported David's version to R/Rcpp and included it in the packcircles package. In the figure below, the graph on the left represents the desired pattern of circle tangencies: e.g. circle 7 should touch all of, and only, circles 1, 2, 6 and 8. Circles 5, 7, 8 and 9 are internal , while the remaining circles are exter...