Wednesday, September 1, 2010

Monte Carlo testing of classification groups

This is another article on the theme of defining groups in a hierarchical classification. A previous article described homogeneity analysis to visualize how any well any number of groups, defined at the same level accounts for the variability in the dataset, as measured by within-group pairwise distances. Here we will look at testing whether splitting a particular group into two daughter groups is supported by the data.

Suppose we have a group G containing N objects which we are considering splitting into two groups G1 and G2 consisting of N1 and N2 objects respectively. A Monte Carlo approach to testing whether this is worth doing is as follows:

1. Calculate the mean within-group pairwise distance that would result from defining G1 and G2.

2. Randomly allocate objects to two groups of sizes N1 and N2 and calculate the resulting mean within-group pairwise distance.

3. Repeat step 2 a large number of times (e.g. 999) to create a vector of Monte Carlo means. Add the actual mean from step 1 to this vector.

4. Calculate an empirical probability from the position of the actual mean in the sorted vector from step 3.

If the groups G1 and G2 are meaningful, the probability from step 4 should be low, ie. the actual mean should be lower than most or all of the Monte Carlo means.

Below is the R code for this algorithm. It takes as input a distance matrix; a vector of group IDs for all objects; the IDs of the two groups being tested and the number of Monte Carlo means to sample.



testGroups <- function (d, grp, g1=1, g2=2, nsamples=999)
{
# Performs a Monte-Carlo test of the informativeness of two groups of
# objects by comparing the actual mean of within-group pairwise distances
# to a vector of means derived by randomly allocating objects to the
# two groups. If the two groups are informative, the actual mean should
# be less than most or all of the Monte-Carlo means.
#
# Arguments:
# d - distance matrix
# grp - vector of group ids for all objects in the distance matrix
# (ids of objects outside the two groups being considered can
# be set to arbitrary values)
# g1 - id of first group to be tested
# g2 - id of second group to be tested
# nsamples - number of random groupings to generate (if possible)
#
# The function attempts to generate unique allocations using the following rules:
#
# 1. If nsamples is greater than choose(N1+N2, N1) it is reduced accordingly
# and a warning message issued. All unique allocations are then
# generated systematically.
#
# 2. If nsamples is greater than choose(N1+N2, N1) / 20 allocations are generated
# systematically to guarantee uniqueness.
#
# 3. If nsamples is less than choose(N1+N2, N1) / 20, allocations are
# generated randomly with no check for uniqueness to reduce running
# time.
#
# The function invisibly returns a list with the following elements:
# actual.mean - the mean within-group pairwise distance for actual groupings
# prob - probability of observing a mean less than or equal to the actual
# mean calculated by comparing the actual value to a vector of
# means from random allocations plus the actual mean
# mc.mean - mean within-group pairwise distance for each random allocation
#
# Michael Bedward, August 2010

# Constants
SYSTEMATIC <- 1
RANDOM <- 2

Nobj <- length(grp)
if (length(d) != Nobj*(Nobj-1)/2) {
stop("Number of objects differs between dist object and group vector")
}

N1 <- sum(grp == g1)
N2 <- sum(grp == g2)
if (N1 == 1 & N2 == 1) {
stop("Can't test split between two singleton groups")
}

objects <- which(grp == g1 | grp == g2)

Nc <- choose(N1+N2, N1)

if (nsamples >= Nc/20) {
if (nsamples > Nc) {
warning(paste("Requested number of samples (", nsamples, ") exceeds possible rearrangements (", Nc, ")", sep=""))
nsamples <- Nc
}
sample.method <- SYSTEMATIC
} else {
sample.method <- RANDOM
}

# Helper function: calculates total pairwise distance between
# objects within a group. Returns total and number of
# pairwise distances
calc.grp.dist <- function(grp.obj) {
if (length(grp.obj) == 1) {
return( c(0, 0) )
}

pairwise <- combn(sort(grp.obj), 2)
di <- Nobj * (pairwise[1,] - 1) - pairwise[1,]*(pairwise[1,]-1)/2 + pairwise[2,]-pairwise[1,]
c(sum(d[di]), length(di))
}

# Helper function: calculates mean within-group distance.
# Arg indices holds the indices of elements in the objects vector for
# one of the groups.
calc.mean <- function(indices) {
g1dist <- calc.grp.dist(objects[indices])
g2dist <- calc.grp.dist(objects[-indices])
(g1dist[1] + g2dist[1]) / (g1dist[2] + g2dist[2])
}

samples <- numeric(nsamples+1)

# actual within-group mean distance
samples[1] <- calc.mean(which(grp == g1))

# randomly allocate objects to groups and calculate means
Nmin <- min(N1, N2)
k <- 2
if (sample.method == SYSTEMATIC) {
combns <- combn(N1+N2, Nmin)[, sample(Nc, nsamples)]
} else { # sample.method == RANDOM
combns <- replicate(nsamples, sample(N1+N2, Nmin))
}

apply( combns, 2, function(x){samples[k] <<- calc.mean(x); k <<- k+1} )
prob <- ecdf(samples)(samples[1])
print(paste("Prob =", prob, "from", nsamples, "samples plus actual value"))
invisible(list(actual.mean=samples[1], prob=prob, mc.means=samples[-1]))
}

No comments:

Post a Comment