This tutorial shows how to use the MADPop package to test for genetic differences between two populations, of which the individuals of a same species contain a variable number of alleles.
## Loading required package: MADPop
## Loading required package: rstan
## Loading required package: StanHeaders
##
## rstan version 2.32.6 (Stan version 2.32.2)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
## change `threads_per_chain` option:
## rstan_options(threads_per_chain = 1)
Our data consists of N = 215 recordings of Major Histocompatibility Complex (MHC) genotypes of lake trout from K = 11 lakes in Ontario, Canada. For each of the fish, between 1-4 alleles in the MHC genotype are recorded. This is partially because duplicate genes are undetectable by current instrumentation, and possibly because the fish possess a variable number of alleles of a given MHC gene.
Our dataset fish215
is included with
MADPop. A random sample from it looks like this:
## Lake A1 A2 A3 A4
## 68 Michipicoten r.3 r.4
## 167 Kingscote r.1 r.8
## 129 Dickey v.3 r.4
## 162 Kingscote w.5
## 43 Crystal r.4 t.7
## 14 Hogan x.4 r.4
The first column is the lake name (or population ID) for each sample,
the remaining four columns are for potentially recorded allele codes
(A1
-A4
). Here the code to identify a unique
allele is a small letter followed by a number, but it could have been
the sequence of integers 1, 2, …, A, which for the
fish215
data is A = 57 unique alleles.
It is relatively straightforward to import a CSV file into the format above. An example of this is given along with our raw data in the extdata directory of the local copy of the MADPop package.
Suppose that we wish to compare two lakes, say Dickey and Simcoe. The
allele counts in these lakes are in the table below. It is a subset of
the full contingency table on all K = 11 lakes, which is produced by
the MADPop function UM.suff()
:
popId <- c("Dickey", "Simcoe") # lakes to compare
Xsuff <- UM.suff(fish215) # summary statistics for dataset
ctab <- Xsuff$tab[popId,] # contingency table
ctab <- ctab[,colSums(ctab) > 0] # remove alleles with no counts
#ctab
rbind(ctab, Total = colSums(ctab))
## 1.20 1.4.5 1.4.9.45 1.5 10 20 20.54 27.28.29 3 4 4.10 4.11 4.20.27
## Dickey 1 1 0 2 0 2 1 1 0 0 1 0 1
## Simcoe 0 0 1 1 2 0 0 0 1 1 2 1 0
## Total 1 1 1 3 2 2 1 1 1 1 3 1 1
## 4.20.31 4.20.45 4.27 4.33 4.5 4.53.55 4.56 4.7 4.7.10 5 5.20 5.30 5.32
## Dickey 1 1 1 0 1 1 1 0 0 0 1 1 1
## Simcoe 0 0 0 1 1 0 0 3 1 1 0 0 0
## Total 1 1 1 1 2 1 1 3 1 1 1 1 1
## 5.7 7 9.11
## Dickey 1 0 0
## Simcoe 1 2 1
## Total 2 2 1
The unique allele identifiers are encoded as integers between 1 and A and separated by dots. The
original allele names are stored in Xsuff$A
, such that the
genotype of the first column 1.20
is
## [1] 1 20
## A1 A20
## "r.1" "u.4"
There are C = 29 genotype combinations observed in these two lakes, corresponding to each column of the table.
In the two-population problem we have K = 2 lakes with N1 and N2 fish sampled from each. Let Yk = (Yk1, …YkC) denote the counts for each genotype observed in lake k, such that $\sum_{i=1}^C Y_{ki} = N_k$. The sampling model for these data is $$ \boldsymbol Y_k \stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \boldsymbol \rho_k), \quad k = 1,2, $$ where ρk = (ρk1, ..., ρkC) are the population proportions of each genotype, and $\sum_{i=1}^C \rho_{ki} = 1$.
Our objective is to test $$ \begin{split} H_0 &: \textrm{The two populations have the same genotype proportions} \\ & \phantom{: } \iff \boldsymbol \rho_1 = \boldsymbol \rho_2. \end{split} $$ The classical test statistics for assessing H0 are Pearson’s Chi-Square statistic 𝒳 and the Likelihood Ratio statistic Λ, $$ \mathcal X = \sum_{k=1}^2 \sum_{i=1}^C \frac{(N_k\hat\rho_i - Y_{ki})^2}{N_k\hat\rho_i}, \qquad \Lambda = 2 \sum_{k=1}^2 \sum_{i=1}^C Y_{ki} \log\left(\frac{Y_{ki}}{N_k\hat\rho_i}\right), \qquad \hat \rho_i = \frac{Y_{1i} + Y_{2i}}{N_1 + N_2}. $$ Under H0, the asymptotic distribution of either of these test statistics T = 𝒳 or Λ is χ(C − 1)2, such that the p-value pv = Pr(T > Tobs ∣ H0) for an observed value of Tobs can be estimated as follows:
# observed values of the test statistics
chi2.obs <- chi2.stat(ctab) # Pearson's chi^2
LRT.obs <- LRT.stat(ctab) # LR test statistic
T.obs <- c(chi2 = chi2.obs, LRT = LRT.obs)
# p-value with asymptotic calculation
C <- ncol(ctab)
pv.asy <- pchisq(q = T.obs, df = C-1, lower.tail = FALSE)
signif(pv.asy, 2)
## chi2 LRT
## 0.330 0.041
The Chi-Square and LR tests are asymptotically equivalent and so should give roughly the same p-values. The huge discrepancy observed here indicates that the sample sizes are too small for asymptotics to kick in. A more reliable p-value estimate can be obtained by the Bootstrap method, which in this case consists of simulating M contigency tables with $\boldsymbol Y_k^{(m)} \stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \hat{\boldsymbol \rho})$, m = 1, …, M, where $\hat{\boldsymbol \rho}$ is the estimate of the common probability vector ρ1 = ρ2 under H0. For each contingency table (Y1(m), Y2(m)), we calculate the test statistic T(m), and the bootstrapped p-value is defined as $$ \textrm{p}_\textrm{v}^{\textrm{boot}}= \frac 1 M \sum_{m=1}^M \delta(T^{(m)} \ge T_\textrm{obs}). $$ pvboot is calculated with MADPop as follows:
N1 <- sum(ctab[1,]) # size of first sample
N2 <- sum(ctab[2,]) # size of second sample
rho.hat <- colSums(ctab)/(N1+N2) # common probability vector
# bootstrap distribution of the test statistics
# set verbose = TRUE for progress output
system.time({
T.boot <- UM.eqtest(N1 = N1, N2 = N2, p0 = rho.hat, nreps = 1e4,
verbose = FALSE)
})
## user system elapsed
## 0.459 0.008 0.467
## chi2 LRT
## 0.0068 0.0060
Note that the bootstrap p-values for both tests are roughly the same and decisively reject H0, whereas the less reliable asymptotic p-values both failed to reject (at quite different significance levels).
Bootstrapping overcomes many deficiencies of the asymptotic p-value calculation. However,
bootstrapping has a tendency to reject H0 when sample sizes are
small. To see why this is, consider all columns of ctab
which have only one genotype count between the two lakes:
itab1 <- colSums(ctab) == 1 # single count genotypes
cbind(ctab[,itab1],
Other = rowSums(ctab[,!itab1]),
Total = rowSums(ctab))
## 1.20 1.4.5 1.4.9.45 20.54 27.28.29 3 4 4.11 4.20.27 4.20.31 4.20.45 4.27
## Dickey 1 1 0 1 1 0 0 0 1 1 1 1
## Simcoe 0 0 1 0 0 1 1 1 0 0 0 0
## 4.33 4.53.55 4.56 4.7.10 5 5.20 5.30 5.32 9.11 Other Total
## Dickey 0 1 1 0 0 1 1 1 0 7 20
## Simcoe 1 0 0 1 1 0 0 0 1 12 20
There are c1 = 21 such columns,
accounting for p̂1 = 0.525 of the common
genotype distribution under H0, as estimated from the
two-lake sample. For each of these columns, observing counts in one lake
but not the other provides evidence against H0. Moreover, under the
estimated common distribution $\hat
{\boldsymbol \rho}$, it is very unlikely to have counts in only
one of the lakes for each of these c1 = 21 genotypes.
Therefore, the data appear to provide very strong evidence against H0. However, it is not so
unlikely to have c1 = 21 one-count
genotypes if the true number of unique genotypes in these two lakes is
much larger than the observed value of C = 29. With C = 29 unique genotypes in only
N = N1 + N2 = 40
fish samples, it is quite plausible that a new sample of fish would
yield several genotypes which are not present in the original two-lake
sample ctab
.
One way to obtain information about the unobserved genotypes in lakes Dickey and Simcoe is to consider the genotypes in all K = 11 Ontario lakes for which we have collected data. A natural way to do this is through a hierarchical model: $$ \begin{split} \boldsymbol Y_k \mid \boldsymbol \rho_k & \stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \boldsymbol \rho_k), \quad k = 1,\ldots,K \\ \boldsymbol \rho_k & \stackrel{\textrm{iid}}{\sim}\textrm{Dirichlet}(\boldsymbol \alpha), \qquad \boldsymbol \alpha= (\alpha_1, \ldots, \alpha_C). \end{split} $$ That is, each population is allowed to have its own probability vector ρk. However, these probability vectors are drawn from a common Dirichlet distribution. The common distribution specifies that before any data is drawn, we have $$ E[\boldsymbol \rho] = \bar{\boldsymbol \alpha}, \qquad \textrm{var}(\rho_i) = \frac{(\bar \alpha_i)(1-\bar \alpha_i)}{\alpha_0 + 1}, $$ where $\alpha_0 = \sum_{i=1}^C \alpha_i$ and $\bar{\boldsymbol \alpha} = \boldsymbol \alpha/ \alpha_0$. Moreover, the posterior distribution of ρk given α and the data is also Dirichlet: $$ \boldsymbol \rho_k \mid \boldsymbol Y\stackrel{\textrm{ind}}{\sim}\textrm{Dirichlet}(\boldsymbol \alpha+ \boldsymbol Y_k). $$ In this sense, the posterior estimate of the proportion of genotype i in population k, ρki, can be non-zero even if Y * ki = 0. In practice, α is estimated from Y, the data from all K populations (details in the following section). The extent to which the genotypes observed in one lake affect inference in the other lakes is determined by α * 0 (the larger this value, the larger the effect).
The likelihood function for this model corresponds to a Dirichlet-Multinomial distribution, $$ \begin{split} \mathcal L(\boldsymbol \alpha\mid \boldsymbol Y) = \prod_{k=1}^K p(\boldsymbol Y_k \mid \boldsymbol \alpha) & = \prod_{k=1}^K \int p(\boldsymbol Y_k \mid \boldsymbol \rho_k) p(\boldsymbol \rho_k \mid \boldsymbol \alpha) \,\mathrm{d}\boldsymbol \rho_k \\ & = \prod_{k=1}^K \left[\frac{N_k!\cdot\Gamma(\alpha_0)}{\Gamma(N_k+\alpha_0)} \prod_{i=1}^C \frac{\Gamma(Y_{ki} + \alpha_i)}{Y_{ki}!\cdot\Gamma(\alpha_i)}\right], \end{split} $$ from which α can be estimated by maximum likelihood (Minka 2000). Alternatively we estimate α using a Bayesian approach, which is more readily extensible to more complex genotype models, for which ℒ(α ∣ Y) has no closed form (see Future Work).
First, we specify a prior distribution $$ \pi(\alpha_0, \bar{\boldsymbol \alpha}) = \frac{1}{(1+ \alpha_0)^2}, $$ which is a uniform distribution on the prior probability vector $\bar{\boldsymbol \alpha}$, and a uniform distribution on the variance-like parameter A = (1 + α0)−1, in the sense that the prior mean and variance of a given genotype population probability are E[ρkj] = α̃j, var(ρkj) = A ⋅ α̃j(1 − α̃j). Then, we sample from the posterior distribution p(α ∣ Y) ∝ ℒ(α ∣ Y)π(α) using a Markov chain Monte Carlo (MCMC) algorithm provided by the Stan programming language (Stan Development Team 2016), as detailed below.
Under the hierarchical model, the null hypothesis of equal genetic proportions between two populations, say k = 1, 2, is H0 : ρ1 = ρ2 = ρ12. Our Bayesian hypothesis-testing strategy is implemented in two steps:
In order to estimate α under H0, we apply the
Dirichlet-Multinomial distribution to K − 1 lakes, with the first two
being collapsed into a common count vector Y12 = Y1 + Y2
with sample size N12 = N1 + N2.
MCMC sampling is implemented via a Hybrid Monte Carlo (HMC) variant
provided by the R package rstan. In
MADPop, sampling from p(α ∣ Y, H0)
is accomplished with the function hUM.post()
, where “hUM”
stands for hierarchical Unconstrained Multinomial model. We
start by merging the two samples from equal distributions under H0:
## [1] "Dickey" "Simcoe"
eqId0 <- paste0(popId, collapse = ".") # merged lake name
popId0 <- as.character(fish215$Lake) # all lake names
popId0[popId0 %in% popId] <- eqId0
table(popId0, dnn = NULL) # merged lake counts
## Crystal Dickey.Simcoe Hogan Kingscote Macdonald
## 20 40 21 18 19
## Manitou Michipicoten Opeongo Seneca Slate
## 20 20 20 18 19
Next, we sample from p(α ∣ Y, H0) using rstan:
X0 <- cbind(Id = popId0, fish215[-1]) # allele data with merged lakes
nsamples <- 1e4
fit0 <- hUM.post(nsamples = nsamples, X = X0,
rhoId = eqId0, # output rho only for merged lakes
chains = 1, # next two arguments are passed to rstan
warmup = min(1e4, floor(nsamples/10)),
full.stan.out = FALSE)
##
## SAMPLING FOR MODEL 'DirichletMultinomial' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0.000173 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.73 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup)
## Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup)
## Chain 1: Iteration: 1001 / 10000 [ 10%] (Sampling)
## Chain 1: Iteration: 2000 / 10000 [ 20%] (Sampling)
## Chain 1: Iteration: 3000 / 10000 [ 30%] (Sampling)
## Chain 1: Iteration: 4000 / 10000 [ 40%] (Sampling)
## Chain 1: Iteration: 5000 / 10000 [ 50%] (Sampling)
## Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling)
## Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling)
## Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling)
## Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling)
## Chain 1: Iteration: 10000 / 10000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 1.732 seconds (Warm-up)
## Chain 1: 15.58 seconds (Sampling)
## Chain 1: 17.312 seconds (Total)
## Chain 1:
rstan typically outputs a number of messages and
warnings, many of which are harmless. The MADPop
package lists rstan as a dependency, such that its
sophisticated tuning mechanism is exposed. Optionally,
hUM.post()
returns the entire rstan output
(full.stan.out = TRUE
), which can be run through
rstan’s MCMC convergence diagnostics.
By setting the rhoId
argument of
hUM.post()
, the fit0
object contains samples
from p(α, ρ12 ∣ Y, H0).
Boxplots of the 50 highest posterior probability genotypes in the common
distribution ρ12 are displayed
below:
rho.post <- fit0$rho[,1,] # p(rho12 | Y)
# sort genotype counts by decreasing order
rho.count <- colSums(Xsuff$tab[popId,])
rho.ord <- order(colMeans(rho.post), decreasing = TRUE)
# plot
nbxp <- 50 # number of genotypes for which to make boxplots
clrs <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2",
"#D55E00", "#CC79A7")
rho.ct <- rho.count[rho.ord[1:nbxp]]
rho.lgd <- unique(sort(rho.ct))
rho.box <- rho.post[,rho.ord[1:nbxp]]
par(mar = c(5,4.5,.5,.5)+.1, cex.axis = .7)
boxplot(x = rho.box,
las = 2, col = clrs[rho.ct+1], pch = 16, cex = .2)
title(xlab = "Genotype", line = 4)
title(ylab = expression(p(bold(rho)[(12)]*" | "*bold(Y))))
legend("topright", legend = rev(rho.lgd), fill = rev(clrs[rho.lgd+1]),
title = "Counts", ncol = 2, cex = .8)
This is calculated analogously to the bootstrapped p-value, by generating M contingency tables with $\boldsymbol Y_k^{(m)} \stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \boldsymbol \rho_{12}^{(m)}$, m = 1, …, M. However, for each contingency table, the common probability vector ρ12(m) is different, corresponding to a random draw from p(ρ12 ∣ Y, H0). The test statistic T(m) is then calculated and we have $$ \textrm{p}_\textrm{v}^{\textrm{post}}= \frac 1 M \sum_{m=1}^M \delta(T^{(m)} \ge T_{\textrm{obs}}). $$ pvpost is calculated with MADPop as follows:
## user system elapsed
## 0.889 0.012 0.902
We can now compare the three types of p-values (asymptotic, bootstrapped, posterior) for each test statistic (𝒳 and Λ):
Asymptotic | Bootstrap | Posterior | |
---|---|---|---|
𝒳 | 33.0 | 0.68 | 13 |
Λ | 4.1 | 0.60 | 13 |
We see that pvpost is much more conservative than pvboot.
Here we used a completely unconstrained Multinomial model for the genotype counts, in the sense that the only restriction on ρk is that ρki ≥ 0 and $\sum_{i=1}^C \rho_{ki} = 1$. However, it is possible to impose genetic constraints such as Hardy-Weinberg equilibrium, preferential mating, etc., which effectively reduce the degrees of freedom of ρk below C − 1. In this case, the closed-form likelihood ℒ(α ∣ Y) is typically unavailable, but the Stan code can easily be modified to sample from p(α, ℛ ∣ Y), where ℛ = (ρ1, …, ρK). We hope to feature some of these extensions to the basic Dirichlet-Multinomial model in the next version of MADPop.