--- title: "Bayesian Testing of Equal Genotype Proportions between Multiple Populations" author: "Martin Lysy, Wookjung P. Kim, Terin Robinson" date: "`r Sys.Date()`" output: rmarkdown::html_vignette header-includes: - \usepackage{bm} link-citations: true references: - id: minka00 title: Estimating a Dirichlet Distribution author: - family: Minka given: T.P. container-title: Technical Report publisher: Massachusetts Institute of Technology URL: https://tminka.github.io/papers/dirichlet/minka-dirichlet.pdf type: article-journal issued: year: 2000 - id: rstan title: "Rstan: the R interface to Stan" author: - family: "Stan Development Team" URL: https://mc-stan.org/rstan/ type: webpage issued: year: 2016 vignette: > %\VignetteIndexEntry{Bayesian Testing of Equal Genotype Proportions between Multiple Populations} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- \newcommand{\Y}{\boldsymbol Y} \newcommand{\rr}{\boldsymbol \rho} \newcommand{\RR}{\boldsymbol{\mathcal R}} \renewcommand{\aa}{\boldsymbol \alpha} \newcommand{\LL}{\mathcal L} \newcommand{\ind}{\stackrel{\textrm{ind}}{\sim}} \newcommand{\iid}{\stackrel{\textrm{iid}}{\sim}} \newcommand{\var}{\textrm{var}} \newcommand{\pv}{\textrm{p}_\textrm{v}} \newcommand{\pvb}{\textrm{p}_\textrm{v}^{\textrm{boot}}} \newcommand{\pvp}{\textrm{p}_\textrm{v}^{\textrm{post}}} ```{r setup, echo = FALSE, include = FALSE} require(knitr) ## require(rmarkdown) set.seed(1) knitr::opts_chunk$set(cache = FALSE, autodep = FALSE) ## if(FALSE) { ## # compile ## #rmarkdown::render(file.path("vignettes", "MADPop-quicktut.Rmd")) ## rmarkdown::render("MADPop-quicktut.Rmd") ## } ## # view it by opening MADPop/vignettes/MADPop-tutorial.html ``` This tutorial shows how to use the **MADPop** package to test for genetic differences between two populations, of which the individuals of a same species contain a variable number of alleles. ```{r madpop} require(MADPop) ``` ## Pre-Processing ```{r preproc, echo = FALSE} nObs <- nrow(fish215) nPop <- nlevels(fish215$Lake) nAlleles <- length(table(c(as.matrix(fish215[,-1])))) - 1 ``` Our data consists of $N = `r nObs`$ recordings of Major Histocompatibility Complex (MHC) genotypes of lake trout from $K = `r nPop`$ lakes in Ontario, Canada. For each of the fish, between 1-4 alleles in the MHC genotype are recorded. This is partially because duplicate genes are undetectable by current instrumentation, and possibly because the fish possess a variable number of alleles of a given MHC gene. Our dataset `fish215` is included with **MADPop**. A random sample from it looks like this: ```{r fish215} head(fish215[sample(nObs),]) ``` The first column is the lake name (or population ID) for each sample, the remaining four columns are for potentially recorded allele codes (`A1`-`A4`). Here the code to identify a unique allele is a small letter followed by a number, but it could have been the sequence of integers $1, 2,\ldots, A$, which for the `fish215` data is $A = `r nAlleles`$ unique alleles. It is relatively straightforward to import a CSV file into the format above. An example of this is given along with our raw data in the **extdata** directory of the local copy of the **MADPop** package. ## Two-Population Comparisons ```{r setup2, ref.label = "table2", echo = FALSE, results = "hide"} ``` Suppose that we wish to compare two lakes, say `r popId[1]` and `r popId[2]`. The allele counts in these lakes are in the table below. It is a subset of the full contingency table on all $K = `r nPop`$ lakes, which is produced by the **MADPop** function `UM.suff()`: ```{r table2} popId <- c("Dickey", "Simcoe") # lakes to compare Xsuff <- UM.suff(fish215) # summary statistics for dataset ctab <- Xsuff$tab[popId,] # contingency table ctab <- ctab[,colSums(ctab) > 0] # remove alleles with no counts #ctab rbind(ctab, Total = colSums(ctab)) ``` The unique allele identifiers are encoded as integers between $1$ and $A$ and separated by dots. The original allele names are stored in `Xsuff$A`, such that the genotype of the first column ``r colnames(ctab)[1]`` is ```{r gtype} gtype <- colnames(ctab)[1] gtype <- as.numeric(strsplit(gtype, "[.]")[[1]]) gtype names(gtype) <- paste0("A", gtype) sapply(gtype, function(ii) Xsuff$A[ii]) ``` There are $C = `r ncol(ctab)`$ genotype combinations observed in these two lakes, corresponding to each column of the table. ### Multinomial Model In the two-population problem we have $K = 2$ lakes with $N_1$ and $N_2$ fish sampled from each. Let $\Y_k = (Y_{k1}, \ldots Y_{kC})$ denote the counts for each genotype observed in lake $k$, such that $\sum_{i=1}^C Y_{ki} = N_k$. The sampling model for these data is $$ \Y_k \ind \textrm{Multinomial}(N_k, \rr_k), \quad k = 1,2, $$ where $\rr_k = (\rho_{k1},...,\rho_{kC})$ are the population proportions of each genotype, and $\sum_{i=1}^C \rho_{ki} = 1$. ### Hypothesis Testing Our objective is to test $$ \begin{split} H_0 &: \textrm{The two populations have the same genotype proportions} \\ & \phantom{: } \iff \rr_1 = \rr_2. \end{split} $$ The classical test statistics for assessing $H_0$ are Pearson's Chi-Square statistic $\mathcal X$ and the Likelihood Ratio statistic $\Lambda$, $$ \mathcal X = \sum_{k=1}^2 \sum_{i=1}^C \frac{(N_k\hat\rho_i - Y_{ki})^2}{N_k\hat\rho_i}, \qquad \Lambda = 2 \sum_{k=1}^2 \sum_{i=1}^C Y_{ki} \log\left(\frac{Y_{ki}}{N_k\hat\rho_i}\right), \qquad \hat \rho_i = \frac{Y_{1i} + Y_{2i}}{N_1 + N_2}. $$ Under $H_0$, the asymptotic distribution of either of these test statistics $T = \mathcal X$ or $\Lambda$ is $\chi^2_{(C-1)}$, such that the $p$-value $$ \pv = \textrm{Pr}(T > T_\textrm{obs} \mid H_0) $$ for an observed value of $T_\textrm{obs}$ can be estimated as follows: ```{r pvasy} # observed values of the test statistics chi2.obs <- chi2.stat(ctab) # Pearson's chi^2 LRT.obs <- LRT.stat(ctab) # LR test statistic T.obs <- c(chi2 = chi2.obs, LRT = LRT.obs) # p-value with asymptotic calculation C <- ncol(ctab) pv.asy <- pchisq(q = T.obs, df = C-1, lower.tail = FALSE) signif(pv.asy, 2) ``` The Chi-Square and LR tests are asymptotically equivalent and so should give roughly the same $p$-values. The huge discrepancy observed here indicates that the sample sizes are too small for asymptotics to kick in. A more reliable $p$-value estimate can be obtained by the Bootstrap method, which in this case consists of simulating $M$ contigency tables with $\Y_k^{(m)} \ind \textrm{Multinomial}(N_k, \hat{\rr})$, $m = 1, \ldots, M$, where $\hat{\rr}$ is the estimate of the common probability vector $\rr_1 = \rr_2$ under $H_0$. For each contingency table $(\Y_1^{(m)}, \Y_2^{(m)})$, we calculate the test statistic $T^{(m)}$, and the bootstrapped p-value is defined as $$ \pvb = \frac 1 M \sum_{m=1}^M \delta(T^{(m)} \ge T_\textrm{obs}). $$ $\pvb$ is calculated with **MADPop** as follows: ```{r pvboot, cache = FALSE} N1 <- sum(ctab[1,]) # size of first sample N2 <- sum(ctab[2,]) # size of second sample rho.hat <- colSums(ctab)/(N1+N2) # common probability vector # bootstrap distribution of the test statistics # set verbose = TRUE for progress output system.time({ T.boot <- UM.eqtest(N1 = N1, N2 = N2, p0 = rho.hat, nreps = 1e4, verbose = FALSE) }) # bootstrap p-value pv.boot <- rowMeans(t(T.boot) >= T.obs) signif(pv.boot, 2) ``` Note that the bootstrap $p$-values for both tests are roughly the same and decisively reject $H_0$, whereas the less reliable asymptotic $p$-values both failed to reject (at quite different significance levels). ## Pairwise Comparisons between Multiple Populations Bootstrapping overcomes many deficiencies of the asymptotic $p$-value calculation. However, bootstrapping has a tendency to reject $H_0$ when sample sizes are small. To see why this is, consider all columns of `ctab` which have only one genotype count between the two lakes: ```{r table3} itab1 <- colSums(ctab) == 1 # single count genotypes cbind(ctab[,itab1], Other = rowSums(ctab[,!itab1]), Total = rowSums(ctab)) ``` ```{r setup4, echo = FALSE} c1 <- sum(itab1) # number of single-count columns n1 <- sum(ctab[,itab1]) # number of single counts ``` There are $c_1 = `r c1`$ such columns, accounting for $\hat p_1 = `r n1/sum(ctab)`$ of the common genotype distribution under $H_0$, as estimated from the two-lake sample. For each of these columns, observing counts in one lake but not the other provides evidence against $H_0$. Moreover, under the estimated common distribution $\hat {\rr}$, it is very unlikely to have counts in only one of the lakes for each of these $c_1 = `r c1`$ genotypes. Therefore, the data appear to provide very strong evidence against $H_0$. However, it is not so unlikely to have $c_1 = `r c1`$ one-count genotypes if the true number of unique genotypes in these two lakes is much larger than the observed value of $C = `r C`$. With $C = `r C`$ unique genotypes in only $N = N_1 + N_2 = `r N1 + N2`$ fish samples, it is quite plausible that a new sample of fish would yield several genotypes which are not present in the original two-lake sample `ctab`. One way to obtain information about the unobserved genotypes in lakes `r popId[1]` and `r popId[2]` is to consider the genotypes in *all* $K = `r nPop`$ Ontario lakes for which we have collected data. A natural way to do this is through a hierarchical model: $$ \begin{split} \Y_k \mid \rr_k & \ind \textrm{Multinomial}(N_k, \rr_k), \quad k = 1,\ldots,K \\ \rr_k & \iid \textrm{Dirichlet}(\aa), \qquad \aa = (\alpha_1, \ldots, \alpha_C). \end{split} $$ That is, each population is allowed to have its own probability vector $\rr_k$. However, these probability vectors are drawn from a common Dirichlet distribution. The common distribution specifies that before any data is drawn, we have $$ E[\rr] = \bar{\aa}, \qquad \var(\rho_i) = \frac{(\bar \alpha_i)(1-\bar \alpha_i)}{\alpha_0 + 1}, $$ where $\alpha_0 = \sum_{i=1}^C \alpha_i$ and $\bar{\aa} = \aa / \alpha_0$. Moreover, the posterior distribution of $\rr_k$ given $\aa$ and the data is also Dirichlet: $$ \rr_k \mid \Y \ind \textrm{Dirichlet}(\aa + \Y_k). $$ In this sense, the posterior estimate of the proportion of genotype $i$ in population $k$, $\rho_{ki}$, can be non-zero even if $Y*{ki} = 0$. In practice, $\aa$ is estimated from $\Y$, the data from all $K$ populations (details in the following section). The extent to which the genotypes observed in one lake affect inference in the other lakes is determined by $\alpha*0$ (the larger this value, the larger the effect). ### Parameter Estimation The likelihood function for this model corresponds to a Dirichlet-Multinomial distribution, $$ \begin{split} \mathcal L(\aa \mid \Y) = \prod_{k=1}^K p(\Y_k \mid \aa) & = \prod_{k=1}^K \int p(\Y_k \mid \rr_k) p(\rr_k \mid \aa) \,\mathrm{d}\rr_k \\ & = \prod_{k=1}^K \left[\frac{N_k!\cdot\Gamma(\alpha_0)}{\Gamma(N_k+\alpha_0)} \prod_{i=1}^C \frac{\Gamma(Y_{ki} + \alpha_i)}{Y_{ki}!\cdot\Gamma(\alpha_i)}\right], \end{split} $$ from which $\aa$ can be estimated by maximum likelihood [@minka00]. Alternatively we estimate $\aa$ using a Bayesian approach, which is more readily extensible to more complex genotype models, for which $\mathcal L(\aa \mid \Y)$ has no closed form (see [Future Work](#future-work)). First, we specify a prior distribution $$ \pi(\alpha_0, \bar{\aa}) = \frac{1}{(1+ \alpha_0)^2}, $$ which is a uniform distribution on the prior probability vector $\bar{\aa}$, and a uniform distribution on the variance-like parameter $A = (1+\alpha_0)^{-1}$, in the sense that the prior mean and variance of a given genotype population probability are $$ E[\rho_{kj}] = \tilde \alpha_j, \qquad \var(\rho_{kj}) = A \cdot \tilde \alpha_j(1-\tilde \alpha_j). $$ Then, we sample from the posterior distribution $p(\aa \mid \Y) \propto \LL(\aa \mid \Y) \pi(\aa)$ using a Markov chain Monte Carlo (MCMC) algorithm provided by the **Stan** programming language [@rstan], as detailed below. ## Bayesian Hypothesis Testing Under the hierarchical model, the null hypothesis of equal genetic proportions between two populations, say $k = 1,2$, is $H_0: \rr_1 = \rr_2 = \rr_{12}$. Our Bayesian hypothesis-testing strategy is implemented in two steps: 1. Sample from the posterior distribution $p(\aa \mid \Y, H_0)$. 2. Sample from the posterior distribution of either classical test statistic, $T = \mathcal X$ or $\Lambda$, thus obtaining a *posterior p-value* $$ \pvp = \textrm{Pr}(T > T_\textrm{obs} \mid \Y, H_0). $$ ### MCMC Sampling from the Posterior Distribution In order to estimate $\aa$ under $H_0$, we apply the Dirichlet-Multinomial distribution to $K-1$ lakes, with the first two being collapsed into a common count vector $\Y_{12} = \Y_1 + \Y_2$ with sample size $N_{12} = N_1 + N_2$. MCMC sampling is implemented via a Hybrid Monte Carlo (HMC) variant provided by the **R** package **rstan**. In **MADPop**, sampling from $p(\aa \mid \Y, H_0)$ is accomplished with the function `hUM.post()`, where "hUM" stands for *hierarchical Unconstrained Multinomial* model. We start by merging the two samples from equal distributions under $H_0$: ```{r popId0} popId # equal lakes under H0 eqId0 <- paste0(popId, collapse = ".") # merged lake name popId0 <- as.character(fish215$Lake) # all lake names popId0[popId0 %in% popId] <- eqId0 table(popId0, dnn = NULL) # merged lake counts ``` Next, we sample from $p(\aa \mid \Y, H_0)$ using **rstan**: ```{r stan, cache = FALSE} X0 <- cbind(Id = popId0, fish215[-1]) # allele data with merged lakes nsamples <- 1e4 fit0 <- hUM.post(nsamples = nsamples, X = X0, rhoId = eqId0, # output rho only for merged lakes chains = 1, # next two arguments are passed to rstan warmup = min(1e4, floor(nsamples/10)), full.stan.out = FALSE) ``` **rstan** typically outputs a number of messages and warnings, many of which are harmless. The **MADPop** package lists **rstan** as a dependency, such that its sophisticated tuning mechanism is exposed. Optionally, `hUM.post()` returns the entire **rstan** output (``full.stan.out = TRUE``), which can be run through **rstan**'s MCMC convergence diagnostics. ```{r setup3, ref.label = "bxp", echo = FALSE, fig.keep = "none"} ``` By setting the `rhoId` argument of `hUM.post()`, the `fit0` object contains samples from $p(\aa, \rr_{12} \mid \Y, H_0)$. Boxplots of the `r nbxp` highest posterior probability genotypes in the common distribution $\rr_{12}$ are displayed below: ```{r bxp, fig.width = 7, fig.height = 3.5} rho.post <- fit0$rho[,1,] # p(rho12 | Y) # sort genotype counts by decreasing order rho.count <- colSums(Xsuff$tab[popId,]) rho.ord <- order(colMeans(rho.post), decreasing = TRUE) # plot nbxp <- 50 # number of genotypes for which to make boxplots clrs <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7") rho.ct <- rho.count[rho.ord[1:nbxp]] rho.lgd <- unique(sort(rho.ct)) rho.box <- rho.post[,rho.ord[1:nbxp]] par(mar = c(5,4.5,.5,.5)+.1, cex.axis = .7) boxplot(x = rho.box, las = 2, col = clrs[rho.ct+1], pch = 16, cex = .2) title(xlab = "Genotype", line = 4) title(ylab = expression(p(bold(rho)[(12)]*" | "*bold(Y)))) legend("topright", legend = rev(rho.lgd), fill = rev(clrs[rho.lgd+1]), title = "Counts", ncol = 2, cex = .8) ``` ### Calculating the Posterior P-Value This is calculated analogously to the bootstrapped p-value, by generating $M$ contingency tables with $\Y_k^{(m)} \ind \textrm{Multinomial}(N_k, \rr_{12}^{(m)}$, $m = 1, \ldots, M$. However, for each contingency table, the common probability vector $\rr_{12}^{(m)}$ is different, corresponding to a random draw from $p(\rr_{12} \mid \Y, H_0)$. The test statistic $T^{(m)}$ is then calculated and we have $$ \pvp = \frac 1 M \sum_{m=1}^M \delta(T^{(m)} \ge T_{\textrm{obs}}). $$ $\pvp$ is calculated with **MADPop** as follows: ```{r pvpost, cache = FALSE} system.time({ T.post <- UM.eqtest(N1 = N1, N2 = N2, p0 = rho.post, nreps = 1e4, verbose = FALSE) }) # posterior p-value pv.post <- rowMeans(t(T.post) >= T.obs) ``` We can now compare the three types of p-values (asymptotic, bootstrapped, posterior) for each test statistic ($\mathcal X$ and $\Lambda$): ```{r pvtable, echo = FALSE} pv.tab <- cbind(asy = pv.asy, boot = pv.boot, post = pv.post) pv.tab <- signif(pv.tab*100, 2) colnames(pv.tab) <- c("Asymptotic", "Bootstrap", "Posterior") rownames(pv.tab) <- c("$\\mathcal X$", "$\\Lambda$") kable(pv.tab, digits = 2, caption = "p-value ($\\times100\\%$)") ``` We see that $\pvp$ is much more conservative than $\pvb$. ## Future Work Here we used a completely unconstrained Multinomial model for the genotype counts, in the sense that the only restriction on $\rr_k$ is that $\rho_{ki} \ge 0$ and $\sum_{i=1}^C \rho_{ki} = 1$. However, it is possible to impose genetic constraints such as Hardy-Weinberg equilibrium, preferential mating, etc., which effectively reduce the degrees of freedom of $\rr_k$ below $C-1$. In this case, the closed-form likelihood $\LL(\aa \mid \Y)$ is typically unavailable, but the **Stan** code can easily be modified to sample from $p(\aa, \RR \mid \Y)$, where $\RR = (\rr_1, \ldots, \rr_K)$. We hope to feature some of these extensions to the basic Dirichlet-Multinomial model in the next version of **MADPop**. ## References