<- 2
N <- 4
M <- M:1
P <- 1e4
reps <- replicate(reps, sample(1:M, size=N, replace=FALSE, prob=P))
v <- table(unlist(v))
tbl / reps tbl
1 2 3 4
0.7103 0.6084 0.4375 0.2438
May 12, 2016
Today I needed to better understand what the resulting distribution is, when sampling without replacement using the R sample
function, where each element exists exactly once and has a different weight of being drawn. Let
N <- 2
M <- 4
P <- M:1
reps <- 1e4
v <- replicate(reps, sample(1:M, size=N, replace=FALSE, prob=P))
tbl <- table(unlist(v))
tbl / reps
1 2 3 4
0.7103 0.6084 0.4375 0.2438
The distribution that models this setting is Wallenius’ noncentral multivariate hypergeometrical distribution (WNMHD). Conceptually, it is the same as the multivariate hypergeometrical distribution except for that the elements additionally have weights. The probability for a certain result can be calculated using the corresponding density function from the package BiasedUrn
. For example, drawing the elements
Repeating this process for every possible outcome for a given
# approach one: calculate all combinations
l <- replicate(M, 0:1, simplify = F) # all combinations
d <- t(as.matrix(expand.grid(l))) # generate all combinations
d <- d[ , colSums(d) == N] # columns with N elements drawn
p <- dMWNCHypergeo(x = d, m = rep(1, M), n=N, odds=P) # get density
mu <- as.vector(d %*% cbind(p)) # calculating mean
mu
[1] 0.7158730 0.6083333 0.4412698 0.2345238
Using this brute force approach quickly runs into trouble when meanMWNCHypergeo
. Settings the precision
argument to a small value will yield an exact solution.
[1] 0.7158730 0.6083333 0.4412698 0.2345238
Additionally, the function also allows to estimate the means with reduced precision for high values of
[1] 0.77546895 0.75805727 0.73929536 0.71907852 0.69729394 0.67382002
[7] 0.64852578 0.62127004 0.59190071 0.56025388 0.52615293 0.48940756
[13] 0.44981270 0.40714738 0.36117350 0.31163449 0.25825389 0.20073378
[19] 0.13875316 0.07196614
Using this knowledge, we can calculate the resulting distribution for a sampling strategy applied in my research. In my setting,
Calculating the weighted sum of estimated WNMHD means for each N gives.
# stage 2 sampling
M <- 4
P <- M:1
MU <- mapply(meanMWNCHypergeo, n=1:3,
MoreArgs = list(m = rep(1, M), odds = P, precision = 0))
MU <- t(MU)
P1 %*% MU
[,1] [,2] [,3] [,4]
[1,] 0.6884921 0.5967262 0.4603175 0.2544643
To see if the results are correct let’s try it again using a simulation.
# stage 1 sampling
reps <- 1e5
N1 <- sample(1:3, reps, replace=T, prob=P1)
v <- lapply(N1, function(x) {
sample(1:M, size=x, replace=FALSE, prob=P)
})
tbl <- table(unlist(v))
tbl / reps
1 2 3 4
0.68715 0.59547 0.45996 0.25432
The results are almost identical. As the WNMHD means are estimated and thus only approximately correct, we cannot expect identity of the results even for many replications.