Consider a statistical model p(Y ∣ θ, B, Σ) of the form

\begin{equation}
{\boldsymbol{Y}} \sim \textrm{MatNorm}({\boldsymbol{X}}_{\boldsymbol{\theta}} {\boldsymbol{B}}, {\boldsymbol{V}}_{\boldsymbol{\theta}}, {\boldsymbol{\Sigma}}),
(\#eq:lmn)
\end{equation}

where Yn × q is the response matrix, Xθ = Xn × p(θ) is the design matrix, which depends on parameters θ, Bp × q are the regression coefficients, and Vθ = Vn × n(θ) and Σq × q are the between-row and between-column variance matrices. The Matrix-Normal distribution is defined as

$$
{\boldsymbol{Z}}_{p \times q} \sim \textrm{MatNorm}({\boldsymbol{\Lambda}}_{p \times q}, {\boldsymbol{\Omega}}_{p \times p}, {\boldsymbol{\Sigma}}_{q \times q})
\qquad \iff \qquad
\textrm{vec}({\boldsymbol{Z}}) \sim \mathcal{N}(\textrm{vec}({\boldsymbol{\Lambda}}), {\boldsymbol{\Sigma}} \otimes {\boldsymbol{\Omega}}),
$$

where vec(Z) is the vector obtained by stacking the columns of Z, and Σ ⊗ Ω denotes the Kronecker product.
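To make the definition concrete, the following base-R sketch simulates from the Matrix-Normal via its vectorized form; rmatnorm() is a hypothetical helper written for illustration and is not part of the LMN package.

# illustration only: simulate Z ~ MatNorm(Lambda, Omega, Sigma) by drawing
# vec(Z) ~ N(vec(Lambda), Sigma %x% Omega) and unstacking the columns
rmatnorm <- function(Lambda, Omega, Sigma) {
  p <- nrow(Lambda)
  q <- ncol(Lambda)
  U <- chol(Sigma %x% Omega)             # %x% is the Kronecker product
  z <- drop(crossprod(U, rnorm(p * q)))  # z ~ N(0, Sigma %x% Omega)
  Lambda + matrix(z, p, q)               # matrix() refills column-wise, inverting vec()
}
Z <- rmatnorm(Lambda = matrix(0, 3, 2), Omega = diag(3), Sigma = diag(2))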
Model @ref(eq:lmn) is referred to as a Linear Model with Nuisance parameters (LMN), where (B, Σ) are the nuisance parameters and θ are the parameters of interest. The LMN package provides tools to efficiently conduct Frequentist or Bayesian inference on all parameters Θ = (θ, B, Σ) by estimating θ first and (B, Σ) subsequently, as illustrated in the examples below.
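The workflow shared by these examples is sketched below on toy data, using only the LMN functions that appear in this vignette (lmn_suff(), lmn_prof(), lmn_prior(), lmn_post(), lmn_marg()); the toy inputs and the choice p = 2, q = 1 are illustrative, not part of the analysis.

library(LMN)
set.seed(1)
# toy data: n x 1 response, n x 2 design, diagonal variance matrix V
n <- 50
Y <- matrix(rnorm(n))
X <- cbind(1, rnorm(n))
V <- rexp(n) # diagonal of V; in applications this typically depends on theta
# step 1: sufficient statistics for (B, Sigma) at a given theta
suff <- lmn_suff(Y = Y, X = X, V = V, Vtype = "diag")
# step 2a (Frequentist): profile loglikelihood, to be maximized over theta
lmn_prof(suff = suff)
# step 2b (Bayesian): conjugate posterior and marginal posterior of theta
prior <- lmn_prior(p = 2, q = 1)
post <- lmn_post(suff, prior)
lmn_marg(suff = suff, prior = prior, post = post)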
In Chan et al. (1992), a model for the interest rate Rt as a function of time is given by the stochastic differential equation (SDE)

\begin{equation}
\mathrm{d} R_t = -\gamma(R_t - \mu)\, \mathrm{d} t + \sigma R_t^\lambda\, \mathrm{d} B_t,
(\#eq:chan)
\end{equation}

where Bt is Brownian motion and the parameters are restricted to γ, μ, σ, λ > 0. Suppose the data R = (R0, …, RN) consists of equispaced observations $R_n = R_{n \cdot \Delta t}$ with interobservation time Δt. A commonly-used discrete-time approximation is given by

\begin{equation}
R_{n+1} \mid R_n \sim \mathcal{N}\big(R_n - \gamma(R_n - \mu) \Delta t,\, \sigma^2 R_n^{2\lambda} \Delta t\big).
(\#eq:euler)
\end{equation}

Sample data from the discrete-time approximation @ref(eq:euler) to the so-called generalized Cox-Ingersoll-Ross (gCIR) model @ref(eq:chan) is generated below, with Δt = 1/12 (one month), N = 240 (20 years), and true parameter values Θ = (γ, μ, σ, λ) = (0.07, 0.01, 0.6, 0.9).
# simulate data from the gCIR model
gcir_sim <- function(N, dt, Theta, x0) {
  # parameters
  gamma <- Theta[1]
  mu <- Theta[2]
  sigma <- Theta[3]
  lambda <- Theta[4]
  Rt <- rep(NA, N+1)
  Rt[1] <- x0
  for(ii in 1:N) {
    Rt[ii+1] <- rnorm(1, mean = Rt[ii] - gamma * (Rt[ii] - mu) * dt,
                      sd = sigma * Rt[ii]^lambda * sqrt(dt))
  }
  Rt
}
# true parameter values
Theta <- c(gamma = .07, mu = .01, sigma = .6, lambda = .9)
dt <- 1/12 # interobservation time (in years)
N <- 12 * 20 # number of observations (20 years)
Rt <- gcir_sim(N = N, dt = dt, Theta = Theta, x0 = Theta["mu"])
par(mar = c(4,4,.5,.5))
plot(x = 0:N*dt, y = 100*Rt, pch = 16, cex = .8,
     xlab = "Time (years)", ylab = "Interest Rate (%)")
The likelihood for Θ is of LMN form @ref(eq:lmn) with θ = λ, B2 × 1 = (γ, γμ), Σ1 × 1 = σ², and

$$
\begin{aligned}
{\boldsymbol{Y}}_{N\times 1} & = \begin{bmatrix} R_1 - R_0 \\ \vdots \\ R_N - R_{N-1} \end{bmatrix}, &
{\boldsymbol{X}}_{N \times 2} & = \begin{bmatrix} -R_0 \Delta t & \Delta t \\ \vdots & \vdots \\ -R_{N-1} \Delta t & \Delta t \end{bmatrix}, &
{\boldsymbol{V}}_{N \times N}(\lambda) & = \begin{bmatrix} R_0^{2\lambda} \Delta t & & 0 \\ & \ddots & \\ 0 & & R_{N-1}^{2\lambda} \Delta t \end{bmatrix}.
\end{aligned}
$$

Thus, we may proceed to maximum likelihood estimation via the profile likelihood as above, using the Vtype = "diag" argument to lmn_suff() to optimize calculations for diagonal V.
# precomputed values
Y <- matrix(diff(Rt))
X <- cbind(-Rt[1:N], 1) * dt
# since Rt^(2*lambda) is calculated as exp(2*lambda * log(Rt)),
# precompute 2*log(Rt) to speed up calculations
lR2 <- 2 * log(Rt[1:N])
# sufficient statistics for gCIR model
gcir_suff <- function(lambda) {
  lmn_suff(Y = Y, X = X,
           V = exp(lambda * lR2) * dt, Vtype = "diag")
}
# _negative_ profile likelihood for gCIR model
gcir_prof <- function(lambda) {
  if(lambda <= 0) return(Inf)
  -lmn_prof(suff = gcir_suff(lambda))
}
# MLE of Theta via profile likelihood
# profile likelihood for lambda
opt <- optimize(f = gcir_prof, interval = c(.001, 10))
lambda_mle <- opt$minimum
# conditional MLE for remaining parameters
suff <- gcir_suff(lambda_mle)
Theta_mle <- c(gamma = suff$Bhat[1,1],
               mu = suff$Bhat[2,1]/suff$Bhat[1,1],
               sigma = sqrt(suff$S[1,1]/suff$n),
               lambda = lambda_mle)
Theta_mle
## gamma mu sigma lambda
## -0.02757594 -0.16881757 0.49526361 0.85936149
However, we can see that the profile likelihood method produces negative estimates of γ and μ, i.e., outside of the parameter support. We could try to restrict or penalize the likelihood optimization problem to obtain an admissible MLE, but then the profile likelihood simplifications would no longer apply.
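To make the point concrete, here is a minimal sketch of one such restricted optimization: it maximizes the full LMN loglikelihood over all four parameters on the log scale to enforce positivity, at the cost of a four-dimensional optimization instead of the one-dimensional profile problem. It assumes LMN's full loglikelihood evaluator lmn_loglik(Beta, Sigma, suff); the starting values are arbitrary.

# constrained MLE sketch: optimize the full loglikelihood over log(Theta),
# so that gamma, mu, sigma, lambda > 0 automatically
gcir_nll <- function(logTheta) {
  Theta <- exp(logTheta)                           # back-transform to original scale
  Beta <- cbind(c(Theta[1], Theta[1] * Theta[2]))  # B = (gamma, gamma * mu)
  Sigma <- matrix(Theta[3]^2)                      # Sigma = sigma^2
  -lmn_loglik(Beta = Beta, Sigma = Sigma, suff = gcir_suff(Theta[4]))
}
opt_con <- optim(par = log(c(.1, .05, .5, 1)), fn = gcir_nll)
exp(opt_con$par) # constrained MLE of (gamma, mu, sigma, lambda)

Since the unconstrained MLE lies outside the support, the constrained optimum is pushed toward the γ = 0 boundary, and the one-dimensional profile simplification is lost.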
Instead, consider the following Bayesian approach. First, we note that the conjugate prior for (B, Σ) in the LMN model conditional on θ is the Matrix-Normal Inverse-Wishart (MNIW) distribution, $$ {\boldsymbol{B}}, {\boldsymbol{\Sigma}}\mid {\boldsymbol{\theta}}\sim \textrm{MNIW}({\boldsymbol{\Lambda}}_{\boldsymbol{\theta}}, {\boldsymbol{\Omega}}_{\boldsymbol{\theta}}, {\boldsymbol{\Psi}}_{\boldsymbol{\theta}}, \nu_{\boldsymbol{\theta}}) \qquad \iff \qquad \begin{aligned} {\boldsymbol{B}}\mid {\boldsymbol{\Sigma}}& \sim \textrm{MatNorm}({\boldsymbol{\Lambda}}_{\boldsymbol{\theta}}, {\boldsymbol{\Omega}}_{\boldsymbol{\theta}}^{-1}, {\boldsymbol{\Sigma}}) \\ {\boldsymbol{\Sigma}}& \sim \textrm{InvWish}({\boldsymbol{\Psi}}_{\boldsymbol{\theta}}, \nu_{\boldsymbol{\theta}}), \end{aligned} $$ where InvWish denotes the Inverse-Wishart distribution, and the hyperparameters Φθ = (Λθ, Ωθ, Ψθ, νθ) can depend on θ. Thus, for the prior distribution π(B, Σ, θ) given by

\begin{equation}
{\boldsymbol{B}}, {\boldsymbol{\Sigma}} \mid {\boldsymbol{\theta}} \sim \textrm{MNIW}({\boldsymbol{\Phi}}_{\boldsymbol{\theta}}), \qquad {\boldsymbol{\theta}} \sim \pi({\boldsymbol{\theta}}),
(\#eq:conjprior)
\end{equation}

we have the following analytical results:
- The conjugate posterior distribution p(B, Σ ∣ Y, θ) is also MNIW, with closed-form expressions for the hyperparameters $\hat{\boldsymbol{\Phi}}_{\boldsymbol{\theta}} = (\hat{\boldsymbol{\Lambda}}_{\boldsymbol{\theta}}, \hat{\boldsymbol{\Omega}}_{\boldsymbol{\theta}}, \hat{\boldsymbol{\Psi}}_{\boldsymbol{\theta}}, \hat\nu_{\boldsymbol{\theta}})$ calculated by lmn_post().
- The marginal posterior distribution p(θ ∣ Y) has a closed-form expression calculated by lmn_marg().

Both closed-form expressions are provided below.
Putting these results together, we can efficiently conduct Bayesian inference for LMN models by first sampling θ(1), …, θ(B) ∼ p(θ ∣ Y), then conditionally sampling $({\boldsymbol{B}}^{(b)}, {\boldsymbol{\Sigma}}^{(b)}) \stackrel{\textrm{ind}}{\sim} \textrm{MNIW}(\hat{\boldsymbol{\Phi}}_{{\boldsymbol{\theta}}^{(b)}})$ for b = 1, …, B. This is done in the R code below with the default prior π(Θ) ∝ |Σ|−(q + 1)/2, which is obtained from lmn_prior():

prior <- lmn_prior(p = 2, q = 1) # default prior: pi(B, Sigma) ~ |Sigma|^(-(q+1)/2)
prior
## $Lambda
## [,1]
## [1,] 0
## [2,] 0
##
## $Omega
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
##
## $Psi
## [,1]
## [1,] 0
##
## $nu
## [1] 0
First, we implement a grid-based sampler for λ ∼ p(λ ∣ R):
# log of marginal posterior p(lambda | R)
gcir_marg <- function(lambda) {
  suff <- gcir_suff(lambda)
  post <- lmn_post(suff, prior)
  lmn_marg(suff = suff, prior = prior, post = post)
}
# grid sampler for lambda ~ p(lambda | R)
# estimate the effective support of lambda by taking the posterior mode
# +/- 5 standard deviations of its quadratic (Laplace) approximation
lambda_mode <- optimize(f = gcir_marg,
                        interval = c(.01, 10),
                        maximum = TRUE)$maximum
lambda_quad <- -numDeriv::hessian(func = gcir_marg, x = lambda_mode)[1]
lambda_rng <- lambda_mode + c(-5, 5) * 1/sqrt(lambda_quad)
# plot posterior on this range
lambda_seq <- seq(lambda_rng[1], lambda_rng[2], len = 1000)
lambda_lpdf <- sapply(lambda_seq, gcir_marg) # log-pdf
# normalized pdf
lambda_pdf <- exp(lambda_lpdf - max(lambda_lpdf))
lambda_pdf <- lambda_pdf / sum(lambda_pdf) / (lambda_seq[2]-lambda_seq[1])
par(mar = c(2,4,2,.5))
plot(lambda_seq, lambda_pdf, type = "l",
     xlab = expression(lambda), ylab = "",
     main = expression(p(lambda*" | "*bold(R))))
The grid appears to have captured the effective support of p(λ ∣ R), so we may proceed to conditional sampling. To do this efficiently, we use the function mniw::rmniw() from the mniw package, which vectorizes simulations over different MNIW parameters $\hat{\boldsymbol{\Phi}}_{{\boldsymbol{\theta}}^{(1)}}, \ldots, \hat{\boldsymbol{\Phi}}_{{\boldsymbol{\theta}}^{(B)}}$.
npost <- 5e4 # number of posterior draws
# marginal sampling from p(lambda | R)
lambda_post <- sample(lambda_seq, size = npost, prob = lambda_pdf,
                      replace = TRUE)
# conditional sampling from p(B, Sigma | lambda, R)
BSig_post <- lapply(lambda_post, function(lambda) {
  lmn_post(gcir_suff(lambda), prior)
})
BSig_post <- list2mniw(BSig_post) # convert to vectorized mniw format
BSig_post <- mniw::rmniw(npost,
                         Lambda = BSig_post$Lambda,
                         Omega = BSig_post$Omega,
                         Psi = BSig_post$Psi,
                         nu = BSig_post$nu)
# convert to Theta = (gamma, mu, sigma, lambda)
Theta_post <- cbind(gamma = BSig_post$X[1,1,],
                    mu = BSig_post$X[2,1,]/BSig_post$X[1,1,],
                    sigma = sqrt(BSig_post$V[1,1,]),
                    lambda = lambda_post)
apply(Theta_post, 2, min)
## gamma mu sigma lambda
## -0.9897117 -717.6090519 0.2101862 0.6300930
We can see that the posterior sampling scheme above for p(Θ ∣ R) did not always produce positive values for γ and μ. However, we can correct this post-hoc by making use of the following fact:
Rejection Sampling. Suppose that for a given prior Θ ∼ π(Θ) and likelihood function p(R ∣ Θ), we obtain a sample Θ(1), …, Θ(B) from the posterior distribution p(Θ ∣ R). If we keep only the samples such that Θ(b) ∈ 𝒮, the result is a sample from the posterior distribution with likelihood p(R ∣ Θ) and constrained prior distribution Θ ∼ π(Θ ∣ Θ ∈ 𝒮). This is because the constrained posterior is proportional to p(R ∣ Θ) π(Θ) I{Θ ∈ 𝒮}, i.e., the original posterior restricted to 𝒮.
In other words, if we eliminate from Theta_post all rows for which γ < 0 or μ < 0, we are left with a sample from the posterior distribution with prior π(Θ) ∝ 1/σ² × I{γ, μ > 0}.
Posterior parameter distributions from the corresponding rejection sampler are displayed below.
# keep only draws for which gamma, mu > 0
ikeep <- pmin(Theta_post[,1], Theta_post[,2]) > 0
mean(ikeep) # a good number of draws get discarded
## [1] 0.45556
Theta_post <- Theta_post[ikeep,]
# convert mu to log scale for plotting purposes
Theta_post[,"mu"] <- log10(Theta_post[,"mu"])
# posterior distributions and true parameter values
Theta_names <- c("gamma", "log[10](mu)", "sigma", "lambda")
Theta_true <- Theta
Theta_true["mu"] <- log10(Theta_true["mu"])
par(mfrow = c(2,2), mar = c(2,2,3,.5)+.5)
for(ii in 1:ncol(Theta_post)) {
  hist(Theta_post[,ii], breaks = 40, freq = FALSE,
       xlab = "", ylab = "",
       main = parse(text = paste0("p(", Theta_names[ii], "*\" | \"*bold(R))")))
  abline(v = Theta_true[ii], col = "red", lwd = 2)
  if(ii == 1) {
    legend("topright", inset = .05,
           legend = c("Posterior Distribution", "True Parameter Value"),
           lwd = c(NA, 2), pch = c(22, NA), seg.len = 1.5,
           col = c("black", "red"), bg = c("white", NA), cex = .85)
  }
}
For the regression model @ref(eq:lmn) with conjugate prior @ref(eq:conjprior), the posterior distribution p(B, Σ, θ ∣ Y) is given by
$$ \begin{aligned} {\boldsymbol{\theta}}\mid {\boldsymbol{Y}}& \sim p({\boldsymbol{\theta}}\mid {\boldsymbol{Y}}) \propto \pi({\boldsymbol{\theta}}) \frac{\Xi({\boldsymbol{\Psi}}_{{\boldsymbol{\theta}}}, \nu_{{\boldsymbol{\theta}}})}{\Xi(\hat{\boldsymbol{\Psi}}_{{\boldsymbol{\theta}}},\hat\nu_{{\boldsymbol{\theta}}})} \left(\frac{|{\boldsymbol{\Omega}}_{{\boldsymbol{\theta}}}|(2\pi)^{-n}}{|\hat {\boldsymbol{\Omega}}_{{\boldsymbol{\theta}}}||{\boldsymbol{V}}_{{\boldsymbol{\theta}}}|}\right)^{q/2} \\ {\boldsymbol{B}}, {\boldsymbol{\Sigma}}\mid {\boldsymbol{\theta}}, {\boldsymbol{Y}}& \sim \textrm{MNIW}(\hat {\boldsymbol{\Lambda}}_{\boldsymbol{\theta}}, \hat{\boldsymbol{\Omega}}_{\boldsymbol{\theta}}, \hat {\boldsymbol{\Psi}}_{\boldsymbol{\theta}}, \hat \nu_{\boldsymbol{\theta}}), \end{aligned} $$
where
$$ \begin{aligned} \Xi({\boldsymbol{\Psi}},\nu) & = \frac{|{\boldsymbol{\Psi}}|^{\nu/2}}{\sqrt{2^{\nu q}} \Gamma_q(\tfrac \nu 2)}, & \Gamma_q(a) & = \pi^{q(q-1)/4} \prod_{j=1}^q \Gamma[a + \tfrac 1 2(1-j)], \\ \hat {\boldsymbol{\Omega}}_{{\boldsymbol{\theta}}} & = {\boldsymbol{\Omega}}_{{\boldsymbol{\theta}}} + {\boldsymbol{T}}_{{\boldsymbol{\theta}}}, & \hat {\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}} & = \hat {\boldsymbol{\Omega}}_{{\boldsymbol{\theta}}}^{-1}({\boldsymbol{T}}_{{\boldsymbol{\theta}}} \hat {\boldsymbol{B}}_{{\boldsymbol{\theta}}} + {\boldsymbol{\Omega}}_{{\boldsymbol{\theta}}} {\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}), \\ \hat \nu_{{\boldsymbol{\theta}}} & = \nu_{{\boldsymbol{\theta}}} + n, & \hat {\boldsymbol{\Psi}}_{{\boldsymbol{\theta}}} & = {\boldsymbol{\Psi}}_{{\boldsymbol{\theta}}} + {\boldsymbol{S}}_{{\boldsymbol{\theta}}} + \hat {\boldsymbol{B}}_{{\boldsymbol{\theta}}}' {\boldsymbol{T}}_{{\boldsymbol{\theta}}} \hat {\boldsymbol{B}}_{{\boldsymbol{\theta}}} + {\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}' {\boldsymbol{\Omega}}_{{\boldsymbol{\theta}}} {\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}} - \hat {\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}' \hat {\boldsymbol{\Omega}}_{{\boldsymbol{\theta}}} \hat {\boldsymbol{\Lambda}}_{{\boldsymbol{\theta}}}, \end{aligned} $$
and Tθ, Sθ, and $\hat{\boldsymbol{B}}_{\boldsymbol{\theta}}$ are outputs from lmn_suff() as defined above.
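These updates can be checked numerically against lmn_post(). The following is a minimal sketch, not part of the original analysis: it applies the formulas by hand at the arbitrary value λ = 1, using the T, S, and Bhat components returned by lmn_suff() and the flat default prior (all hyperparameters zero), and assumes lmn_post() returns its hyperparameters as elements Lambda, Omega, Psi, and nu (as the vectorized sampling code above suggests).

# hand-coded MNIW updates under the default prior (Lambda = 0, Omega = 0,
# Psi = 0, nu = 0), compared to lmn_post() at an arbitrary lambda
suff <- gcir_suff(1)           # sufficient statistics at lambda = 1
post <- lmn_post(suff, prior)  # hyperparameters computed by LMN
Omega_hat <- suff$T            # Omega + T reduces to T
Lambda_hat <- solve(Omega_hat, suff$T %*% suff$Bhat)
nu_hat <- suff$n               # nu + n reduces to n
Psi_hat <- suff$S + t(suff$Bhat) %*% suff$T %*% suff$Bhat -
  t(Lambda_hat) %*% Omega_hat %*% Lambda_hat
range(post$Lambda - Lambda_hat)  # expect values near 0
range(post$Psi - Psi_hat)        # expect values near 0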