```
<- 2
N <- 4
M <- M:1
P <- 1e4
reps <- replicate(reps, sample(1:M, size=N, replace=FALSE, prob=P))
v <- table(unlist(v))
tbl / reps tbl
```

```
1 2 3 4
0.7234 0.6025 0.4409 0.2332
```

psychometrics

statistics

EN

Published

May 12, 2016

Today I needed to better understand what the resulting distribution is, when sampling without replacement using the R `sample`

function, where each element exists exactly once and has a different weight of being drawn. Let \(M\) be the total number of elements (i.e. colors when using urn terminology) and \(N\) the number of elements sampled from \(M\), and \(P\) a vector of weights. Repeatedly sampling \(N=2\) out of \(M=4\) with \(P=[4,3,2,1]\) gives the mean occurence of each element in the sample.

```
N <- 2
M <- 4
P <- M:1
reps <- 1e4
v <- replicate(reps, sample(1:M, size=N, replace=FALSE, prob=P))
tbl <- table(unlist(v))
tbl / reps
```

```
1 2 3 4
0.7234 0.6025 0.4409 0.2332
```

The distribution that models this setting is Wallenius’ noncentral multivariate hypergeometrical distribution (WNMHD). Conceptually, it is the same as the multivariate hypergeometrical distribution except for that the elements additionally have weights. The probability for a certain result can be calculated using the corresponding density function from the package `BiasedUrn`

. For example, drawing the elements \({1,3}\) for the same parameters as above yields the probability

Repeating this process for every possible outcome for a given \(M\) and \(N\), we can calculate the distribution of means for the WNMHD.

```
# approach one: calculate all combinations
l <- replicate(M, 0:1, simplify = F) # all combinations
d <- t(as.matrix(expand.grid(l))) # generate all combinations
d <- d[ , colSums(d) == N] # columns with N elements drawn
p <- dMWNCHypergeo(x = d, m = rep(1, M), n=N, odds=P) # get density
mu <- as.vector(d %*% cbind(p)) # calculating mean
mu
```

`[1] 0.7158730 0.6083333 0.4412698 0.2345238`

Using this brute force approach quickly runs into trouble when \(M\) gets large. The calculation of all possible combinations will quickly fail due to memory limits. A better way to calculate the means is to use function `meanMWNCHypergeo`

. Settings the `precision`

argument to a small value will yield an exact solution.

`[1] 0.7158730 0.6083333 0.4412698 0.2345238`

Additionally, the function also allows to estimate the means with reduced precision for high values of \(M\) and \(N\). By this one can evade the trouble caused by the brute force approach above.

```
[1] 0.77546895 0.75805727 0.73929536 0.71907852 0.69729394 0.67382002
[7] 0.64852578 0.62127004 0.59190071 0.56025388 0.52615293 0.48940756
[13] 0.44981270 0.40714738 0.36117350 0.31163449 0.25825389 0.20073378
[19] 0.13875316 0.07196614
```

Using this knowledge, we can calculate the resulting distribution for a sampling strategy applied in my research. In my setting, \(N\) is not fixed but is also a random variable used as a hyper-parameter in the second sampling stage using WNMHD as described above. As all draws are indendent, the expected means of the compound distribution can simply be calculated as a weighted sum of the estimated WNMHD means for every possible realization of \(N\). Let the distribution for the random variable \(N\) be

Calculating the weighted sum of estimated WNMHD means for each N gives.

```
# stage 2 sampling
M <- 4
P <- M:1
MU <- mapply(meanMWNCHypergeo, n=1:3,
MoreArgs = list(m = rep(1, M), odds = P, precision = 0))
MU <- t(MU)
P1 %*% MU
```

```
[,1] [,2] [,3] [,4]
[1,] 0.6884921 0.5967262 0.4603175 0.2544643
```

To see if the results are correct let’s try it again using a simulation.

```
# stage 1 sampling
reps <- 1e5
N1 <- sample(1:3, reps, replace=T, prob=P1)
v <- lapply(N1, function(x) {
sample(1:M, size=x, replace=FALSE, prob=P)
})
tbl <- table(unlist(v))
tbl / reps
```

```
1 2 3 4
0.69105 0.59602 0.45630 0.25650
```

The results are almost identical. As the WNMHD means are estimated and thus only approximately correct, we cannot expect identity of the results even for many replications.