Title: | MHC Allele-Based Differencing Between Populations |
---|---|
Description: | Tools for the analysis of population differences using the Major Histocompatibility Complex (MHC) genotypes of samples having a variable number of alleles (1-4) recorded for each individual. A hierarchical Dirichlet-Multinomial model on the genotype counts is used to pool small samples from multiple populations for pairwise tests of equality. Bayesian inference is implemented via the 'rstan' package. Bootstrapped and posterior p-values are provided for chi-squared and likelihood ratio tests of equal genotype probabilities. |
Authors: | Martin Lysy [cre, aut], Peter W.J. Kim [aut], Terin Robinson [ctb] |
Maintainer: | Martin Lysy <[email protected]> |
License: | GPL-3 |
Version: | 1.1.6 |
Built: | 2024-10-26 04:43:08 UTC |
Source: | https://github.com/mlysy/madpop |
Tools for the analysis of population differences using the Major Histocompatibility Complex (MHC) genotypes of samples having a variable number of alleles (1-4) recorded for each individual.
For a full tutorial see package vignette: vignette("MADPop-quicktut")
.
Maintainer: Martin Lysy [email protected]
Authors:
Peter W.J. Kim
Other contributors:
Terin Robinson [contributor]
Useful links:
# typical dataset head(fish215[sample(nrow(fish215)),]) table(fish215$Lake) # number of samples per lake # contingency table on two lakes iLakes <- c("Michipicoten", "Simcoe") Xsuff <- UM.suff(X = fish215[fish215$Lake %in% iLakes,]) ctab <- Xsuff$tab ctab # bootstrapped p-value calculation for chi^2 and LR tests p.MLE <- colSums(ctab)/sum(ctab) N1 <- sum(ctab[1,]) N2 <- sum(ctab[2,]) # bootstrapped test statistics (chi^2 and LRT) T.boot <- UM.eqtest(N1 = N1, N2 = N2, p0 = p.MLE, nreps = 1e3) # observed test statistics T.obs <- c(chi2 = chi2.stat(ctab), LRT = LRT.stat(ctab)) # p-values rowMeans(t(T.boot) > T.obs) # posterior sampler for hierarchical model # output posterior probability for each genotype in lake Simcoe rhoId <- "Simcoe" nsamples <- 500 hUM.fit <- hUM.post(nsamples = nsamples, X = fish215, rhoId = rhoId, chains = 1) # first 20 genotype posterior probabilities in lake Simcoe rho.post <- hUM.fit$rho[,1,] boxplot(rho.post[,1:20], las = 2, xlab = "Genotype", ylab = "Posterior Probability", pch = ".", col = "grey")
# typical dataset head(fish215[sample(nrow(fish215)),]) table(fish215$Lake) # number of samples per lake # contingency table on two lakes iLakes <- c("Michipicoten", "Simcoe") Xsuff <- UM.suff(X = fish215[fish215$Lake %in% iLakes,]) ctab <- Xsuff$tab ctab # bootstrapped p-value calculation for chi^2 and LR tests p.MLE <- colSums(ctab)/sum(ctab) N1 <- sum(ctab[1,]) N2 <- sum(ctab[2,]) # bootstrapped test statistics (chi^2 and LRT) T.boot <- UM.eqtest(N1 = N1, N2 = N2, p0 = p.MLE, nreps = 1e3) # observed test statistics T.obs <- c(chi2 = chi2.stat(ctab), LRT = LRT.stat(ctab)) # p-values rowMeans(t(T.boot) > T.obs) # posterior sampler for hierarchical model # output posterior probability for each genotype in lake Simcoe rhoId <- "Simcoe" nsamples <- 500 hUM.fit <- hUM.post(nsamples = nsamples, X = fish215, rhoId = rhoId, chains = 1) # first 20 genotype posterior probabilities in lake Simcoe rho.post <- hUM.fit$rho[,1,] boxplot(rho.post[,1:20], las = 2, xlab = "Genotype", ylab = "Posterior Probability", pch = ".", col = "grey")
Calculates the chi-squared test statistic for a two-way contingency table.
chi2.stat(tab)
chi2.stat(tab)
tab |
A |
Suppose that tab
consists of counts from populations (rows) in
categories. The chi-squared test statistic is computed as
where is the observed number of counts in the
th row and
th column of
tab
, and is the expected number of counts under
that the populations have indentical proportions in each category:
where is the total number of counts in
tab
.
The calculated value of the chi-squared statistic.
# simple contingency table ctab <- rbind(pop1 = c(5, 3, 0, 3), pop2 = c(4, 10, 2, 5)) colnames(ctab) <- LETTERS[1:4] ctab chi2.stat(ctab) # chi^2 test statistic
# simple contingency table ctab <- rbind(pop1 = c(5, 3, 0, 3), pop2 = c(4, 10, 2, 5)) colnames(ctab) <- LETTERS[1:4] ctab chi2.stat(ctab) # chi^2 test statistic
Observable genotypes (up to possibly duplicated alleles) of 215 lake trout (Salvelinus namaycush) collected from 11 lakes in Ontario, Canada.
A data.frame
with 215 rows and 5 columns. The first column is an (optional) vector of population identifiers. The next four columns contain the recorded genotype for each observation (row). Each row contains up to four distinct allele identifiers in any order. Missing alleles should be denoted by NA
, or "", but not both.
This data.frame
is how a typical spreadsheet of genotype data gets imported into R. Data must adhere to this format to be correctly processed by the functions in MADPop.
Kuntz, S. (2014). Population Differentiation of Ontario Lake trout (Salvelinus namaycush) using the Major Histocompatibility Complex class II gene (URL).
head(fish215)
head(fish215)
MCMC sampling from a Dirichlet-Multinomial model using stan
.
hUM.post(nsamples, X, popId, rhoId, full.stan.out = FALSE, ...)
hUM.post(nsamples, X, popId, rhoId, full.stan.out = FALSE, ...)
nsamples |
Number of posterior samples |
X |
4-column or 5-column matrix of observations in the correct format. See |
popId |
Optional vector of population identifiers. See |
rhoId |
Populations for which posterior samples of the genotype probability vector |
full.stan.out |
Logical. Whether or not to return the full |
... |
Further arguments to be passed to the |
The hierarchical Dirichlet-Multinomial model is given by
where and
. MCMC sampling is achieved with the rstan package, which is listed as a dependency for MADPop so as to expose rstan's sophisticated tuning mechanism and convergence diagnostics.
A list with elements
A
: The unique allele names.
G
: The 4-column matrix Package libcurl was not found in the pkg-config search path.of unique genotype combinations.
rho
: A matrix with ncol(rho) == nrow(G)
, where each row is a draw from the posterior distribution of inheritance probabilities.
sfit
: If full.stan.out = TRUE
, the fitted stan
object.
# fit hierarchical model to fish215 data # only output posterior samples for lake Simcoe rhoId <- "Simcoe" nsamples <- 500 hUM.fit <- hUM.post(nsamples = nsamples, X = fish215, rhoId = rhoId, chains = 1) # number of MCMC chains # plot first 20 posterior probabilities in lake Simcoe rho.post <- hUM.fit$rho[,1,] boxplot(rho.post[,1:20], las = 2, xlab = "Genotype", ylab = "Posterior Probability", pch = ".", col = "grey")
# fit hierarchical model to fish215 data # only output posterior samples for lake Simcoe rhoId <- "Simcoe" nsamples <- 500 hUM.fit <- hUM.post(nsamples = nsamples, X = fish215, rhoId = rhoId, chains = 1) # number of MCMC chains # plot first 20 posterior probabilities in lake Simcoe rho.post <- hUM.fit$rho[,1,] boxplot(rho.post[,1:20], las = 2, xlab = "Genotype", ylab = "Posterior Probability", pch = ".", col = "grey")
Calculate the likelihood ratio test statistic for general two-way contingency tables.
LRT.stat(tab)
LRT.stat(tab)
tab |
A |
Suppose that tab
consists of counts from populations (rows) in
categories. The likelihood ratio test statistic is computed as
where is the observed number of counts in the
th row and
th column of
tab
, is the unconstrained estimate of the proportion of category
in population
, and
is the estimate of this proportion under
that the populations have indentical proportions in each category. If any column has only zeros it is removed before calculating the LRT statistic.
The calculated value of the LRT statistic.
# simple contingency table ctab <- rbind(pop1 = c(5, 3, 0, 3), pop2 = c(4, 10, 2, 5)) colnames(ctab) <- LETTERS[1:4] ctab LRT.stat(ctab) # likelihood ratio statistic
# simple contingency table ctab <- rbind(pop1 = c(5, 3, 0, 3), pop2 = c(4, 10, 2, 5)) colnames(ctab) <- LETTERS[1:4] ctab LRT.stat(ctab) # likelihood ratio statistic
Generate multinomial samples from a common probability vector and calculate the Chi-square and Likelihood Ratio test statistics.
UM.eqtest(N1, N2, p0, nreps, verbose = TRUE)
UM.eqtest(N1, N2, p0, nreps, verbose = TRUE)
N1 |
Size of sample 1. |
N2 |
Size of sample 2. |
p0 |
Common probability vector from which to draw the multinomial samples. Can also be a matrix, in which case each simulation randomly draws with replacement from the rows of p0. |
nreps |
Number of replications of the simulation. |
verbose |
Logical. If |
The chi-squared and likelihood ratio test statistics are calculated from multinomial samples , where
where is the
th row of
p0
.
An nreps x 2
matrix with the simulated chi-squared and LR values.
# bootstrapped p-value calculation against equal genotype proportions # in lakes Michipicoten and Simcoe # contingency table popId <- c("Michipicoten", "Simcoe") ctab <- UM.suff(fish215[fish215$Lake %in% popId,])$tab ctab # MLE of probability vector p.MLE <- colSums(ctab)/sum(ctab) # sample sizes N1 <- sum(ctab[1,]) # Michipicoten N2 <- sum(ctab[2,]) # Simcoe # bootstrapped test statistics (chi^2 and LRT) T.boot <- UM.eqtest(N1 = N1, N2 = N2, p0 = p.MLE, nreps = 1e3) # observed test statistics T.obs <- c(chi2 = chi2.stat(ctab), LRT = LRT.stat(ctab)) # p-values rowMeans(t(T.boot) > T.obs)
# bootstrapped p-value calculation against equal genotype proportions # in lakes Michipicoten and Simcoe # contingency table popId <- c("Michipicoten", "Simcoe") ctab <- UM.suff(fish215[fish215$Lake %in% popId,])$tab ctab # MLE of probability vector p.MLE <- colSums(ctab)/sum(ctab) # sample sizes N1 <- sum(ctab[1,]) # Michipicoten N2 <- sum(ctab[2,]) # Simcoe # bootstrapped test statistics (chi^2 and LRT) T.boot <- UM.eqtest(N1 = N1, N2 = N2, p0 = p.MLE, nreps = 1e3) # observed test statistics T.obs <- c(chi2 = chi2.stat(ctab), LRT = LRT.stat(ctab)) # p-values rowMeans(t(T.boot) > T.obs)
Converts a matrix or data.frame of genotype data into the sufficient statistics required to fit a Dirichlet-Multinomial hierarchical model.
UM.suff(X, popId)
UM.suff(X, popId)
X |
Genotype adata. Either a |
popId |
grouping variable for |
A list with elements:
A
: Vector of unique alleles names. The allele numbers in the following quantities correspond to the indices of A
.
G
: 4-column matrix of unique genotype combinations. The presence of 0's indicates that less than four alleles were amplified indicating that a given genotype either has less than 4 distinct alleles or that some alleles are duplicated.
tab
: Observed data in a simplified numerical format. This is a contingency table with rows given by the unique elements of popId
and columns given by each row of G
.
# sufficient statistics in 3 lakes X <- fish215[fish215$Lake %in% c("Hogan", "Manitou", "Simcoe"),] suff <- UM.suff(X) suff$A # allele names suff$G # unique genotypes suff$tab # contingency table
# sufficient statistics in 3 lakes X <- fish215[fish215$Lake %in% c("Hogan", "Manitou", "Simcoe"),] suff <- UM.suff(X) suff$A # allele names suff$G # unique genotypes suff$tab # contingency table