LocalCop: Local likelihood inference for conditional copulas

Overview

This vignette shows how to use the LocalCop package to conduct local likelihood inference for the dependence parameters in conditional copula models, based on the works Acar, Craiu, and Yao (2011), Acar, Craiu, and Yao (2013) and Acar, Czado, and Lysy (2019).

LocalCop is primarily intended for users who want to apply conditional copula models to real-world data with an interest to infer covariate effects on the conditional dependence between two response variables. This is done semiparametrically using kernel smoothing within the likelihood framework as detailed below.

# load required packages
library(LocalCop)   # for conditional copula modeling
library(VineCopula) # for simulating copula data 
library(dplyr)      # for data manipulations
library(tidyr)      # and
library(ggplot2)    # plotting

Data generating model

For any bivariate response vector (Y₁, Y₂), the conditional joint distribution given a covariate X is given by where F_1 ∣ X(y₁ ∣ x) and F_2 ∣ X(y₂ ∣ x) are the conditional marginal distributions of Y₁ and Y₂ given X, and C_X(u, v ∣ x) is a conditional copula function. That is, for given X = x, the function C_X(u, v ∣ x) is a bivariate CDF with uniform margins.

If the true conditional marginal distrutions are known, we may compute the marginally uniform variables U = F_1|X(Y₁ ∣ X) and V = F_2|X(Y₂ ∣ X). In practice these distributions are unknown, such that Û = F̂_1|X(Y₁ ∣ X) and V̂ = F̂_1|X(Y₂ ∣ X) are pseudo-uniform variables estimated using parametric or nonparametric techniques – see Acar, Czado, and Lysy (2019) for an application using ARMA-GARCH models. Either way, the focus of LocalCop is on estimating the conditional copula function, which is given the semi-parametric model where 𝒞(u, v θ, ν) is one of the parametric copula families listed in Table @ref(tab:calib), the copula dependence parameter θ ∈ Θ is an arbitrary function of X, and ν ∈ Υ is an additional copula parameter present in some models. Since most parametric copula families have a restricted range Θ ⊊ ℝ, we describe the data generating model (DGM) in terms of the calibration function η(x), such that where g⁻¹ : ℝ → Θ an inverse-link function which ensures that the copula parameter has the correct range. The choice of g⁻¹(η) is not unique and depends on the copula family. Table @ref(tab:calib) displays the copula function 𝒞(u, v ∣ θ, ν) for each of the copula families provided by LocalCop, along with other relevant information including the canonical choice of the inverse link function g⁻¹(η). In Table @ref(tab:calib), Φ⁻¹(p) denotes the inverse CDF of the standard normal; t_ν⁻¹(p) denotes the inverse CDF of the Student-t with ν degrees of freedom; Φ_θ(z₁, z₂) denotes the CDF of a bivariate normal with mean (0, 0) and variance $\left[\begin{smallmatrix}1 & \theta \\ \theta & 1\end{smallmatrix}\right]$; and t_θ, ν(z₁, z₂) denotes the CDF of a bivariate Student-t with location (0, 0), scale $\left[\begin{smallmatrix}1 & \theta \\ \theta & 1\end{smallmatrix}\right]$, and degrees of freedom ν.

Copula families implemented in **LocalCop**.
Family	𝒞(u, v ∣ θ, ν)	θ ∈ Θ	ν ∈ Υ	g⁻¹(η)	τ(θ)
Gaussian	Φ_θ(Φ⁻¹(u), Φ⁻¹(v))	(−1, 1)		$\dfrac{e^{\eta} - e^{-\eta}}{e^{\eta} + e^{-\eta}}$	$\frac{2}{\pi}\arcsin(\theta)$
Student-t	t_θ, ν(t_ν⁻¹(u), t_ν⁻¹(v))	(−1, 1)	(0, ∞)	$\dfrac{e^{\eta} - e^{-\eta}}{e^{\eta} + e^{-\eta}}$	$\frac{2}{\pi}\arcsin(\theta)$
Clayton	$\displaystyle (u^{-\theta} + v^{-\theta} -1)^{-\frac{1}{\theta}}$	(0, ∞)		e^η	$\frac{\theta}{\theta + 2}$
Gumbel	$\displaystyle \exp\left[ - \{ (-\log u)^\theta + (-\log v)^\theta\}^{\frac{1}{\theta}} \right]$	[1, ∞)		e^η + 1	$1 - \frac{1}{\theta}$
Frank	$-\frac{1}{\theta}\log \left\{ 1 + \frac{(e^{-\theta u} - 1)(e^{-\theta v} - 1)}{e^{-\theta} - 1}\right\}$	(−∞, ∞) \ {0}		η	no closed form

The code below shows how to simulate data from model @ref(eq:copmod) using the Gumbel family, where the change in Kendall τ as a function of the covariate in displayed in Figure @ref(fig:dgm).

# simulate copula data given a covariate
set.seed(2024) 
family <- 4  # Gumbel Copula 
n <- 1000    # number of observations
x <- sort(runif(n))  # covariate values
eta_fun <- function(x) sin(6*pi*x) # calibration function

# simulate data
eta_true <- eta_fun(x)
par_true <- BiCopEta2Par(family = family, eta = eta_true) # copula parameter
udata <- VineCopula::BiCopSim(n, family = family, par = par_true$par)

# plot tau(x)
tibble(
  x = seq(0, 1, len = 100),
) %>%
  mutate(
    tau = BiCopEta2Tau(family, eta = eta_fun(x))
  ) %>%
  ggplot(aes(x = x, y = tau)) +
  geom_line() +
  ylim(c(0, 1)) + 
  xlab(expression(x)) + ylab(expression(tau(x))) + 
  theme_bw()

$Conditional Kendall $\tau$ for Gumbel copula under DGM.$

Conditional Kendall τ for Gumbel copula under DGM.

Local likelihood estimation of the dependence parameter

Local likelihood estimation uses Taylor expansions to approximate the dependence parameter function at an observed covariate value X = x near a fixed point X = x₀, i.e., $$ \eta(x)\approx \eta(x_0) + \eta^{(1)}(x_0) (x - x_0) + \ldots + \dfrac{\eta^{(p)}(x_0)}{p!} (x - x_0)^{p}. $$ One then estimates β_k = η^(k)(x₀)/k! for k = 0, …, p using a kernel-weighted local likelihood function where (u_i, v_i, x_i) is the data for observation i, x_i = (1, x_i − x₀, (x_i − x₀)², …, (x_i − x₀)^p), β = (β₀, β₁, …, β_p), and K_h(z) is a kernel function with bandwidth parameter h > 0. Having maximized ℓ(β) in @ref(eq:locallik), one estimates η(x₀) by η̂(x₀) = β̂₀. Usually, a linear fit with p = 1 suffices to obtain a good estimate in practice. The estimation procedure for a single value of X = x₀ for given copula family and bandwidth parameter is carried out by CondiCopLocFit().

x0 <- 0.1
band <- 0.1
degree <- 1
kernel <- KernEpa # Epanichov kernel (default value)
fit1 <- CondiCopLocFit(
  u1 = udata[,1], u2 = udata[,2], x = x,
  x0 = x0,
  family = family, 
  degree = degree,
  kernel = kernel,
  band = band
)
fit1
#> $x
#> [1] 0.1
#> 
#> $eta
#> [1] 0.839931
#> 
#> $nu
#> [1] 0

Repeating this procedure over a grid of covariate values produces an estimate η̂(x) of the underlying dependence function. This can also be accomplished with CondiCopLocFit() as shown below.

x0 <- seq(0, 1, by=0.02)
fitseq <- CondiCopLocFit(
  u1 = udata[,1], u2 = udata[,2], x = x, 
  x0 = x0,
  family = family, 
  degree = degree,
  kernel = kernel,
  band = band
)

# plot true vs fitted tau
legend_names <- c(expression(tau(x)),
                  expression(hat(tau)(x)))
tibble(
  x = x0,
  True = BiCopEta2Tau(family, eta = eta_fun(x0)),
  Fitted = BiCopEta2Tau(fitseq$eta, family= family)
) %>%
  pivot_longer(True:Fitted,
               names_to = "Type", values_to = "y") %>%
  mutate(
    Type = factor(Type, levels = c("True", "Fitted"))
  ) %>%
  ggplot(aes(x = x, y = y, group = Type)) +
  geom_line(aes(color = Type, linetype = Type)) + 
  geom_point(aes(shape = Type, color = Type)) +
  scale_color_manual(values = c("black", "red"), labels = legend_names) +
  scale_shape_manual(values = c(NA, 16), labels = legend_names) +
  scale_linetype_manual(values = c("solid", NA), labels = legend_names) +
  ylim(c(0, 1)) + 
  xlab(expression(x)) + ylab(expression("Kendall "*tau)) +
  theme_bw() + 
  theme(
    legend.position = "bottom",
    legend.title = element_blank()
  )

$Conditional Kendall $\tau$ estimates under the Gumbel copula using bandwidth $h=0.1$.$

Conditional Kendall τ estimates under the Gumbel copula using bandwidth h = 0.1.

Copula Selection and Model Tuning

Selection of the copula family and bandwidth parameter is done via leave-one-out cross-validation (LOO-CV) as described in Acar, Czado, and Lysy (2019). To reduce the computation time in model selection, we may calculate the LOO-CV for just a subset of the sample.

bandset <- c(0.1, 0.2, 0.4, 0.8, 1) # bandwidth set
famset <- c(1, 2, 3, 4, 5) # family set
n_loo <- 100  # number of leave-one-out observations in CV likelihood calculation

system.time({
  cvselect1 <- CondiCopSelect(
    u1 = udata[,1], u2 = udata[,2], x = x,
    family = famset, 
    degree = degree,
    kernel = kernel,
    band = bandset,
    xind = n_loo
  )
})

cv_res1 <- cvselect1$cv
cv_res1
#>    band family       cv
#> 1   0.1      1 33.72408
#> 2   0.2      1 30.22907
#> 3   0.4      1 26.91116
#> 4   0.8      1 27.22770
#> 5   1.0      1 27.54328
#> 6   0.1      2 34.95500
#> 7   0.2      2 31.06883
#> 8   0.4      2 28.20758
#> 9   0.8      2 28.92498
#> 10  1.0      2 29.21409
#> 11  0.1      3 23.26226
#> 12  0.2      3 20.51586
#> 13  0.4      3 17.96064
#> 14  0.8      3 18.34648
#> 15  1.0      3 18.45101
#> 16  0.1      4 38.68924
#> 17  0.2      4 34.64196
#> 18  0.4      4 30.75936
#> 19  0.8      4 31.43345
#> 20  1.0      4 31.77389
#> 21  0.1      5 29.72769
#> 22  0.2      5 25.59749
#> 23  0.4      5 23.23388
#> 24  0.8      5 24.02684
#> 25  1.0      5 24.43723

The leave-one-out cross-validation for the full sample is provided below for comparisons.

system.time({
  cvselect2 <- CondiCopSelect(
    u1 = udata[,1], u2 = udata[,2], x = x,
    family = famset, 
    degree = degree,
    kernel = kernel,
    band = bandset,
    xind = nrow(udata)
  )
})

cv_res2 <- cvselect2$cv
cv_res2
#>    band family       cv
#> 1   0.1      1 422.2925
#> 2   0.2      1 380.4366
#> 3   0.4      1 353.6886
#> 4   0.8      1 350.2815
#> 5   1.0      1 348.7304
#> 6   0.1      2 428.2676
#> 7   0.2      2 395.4231
#> 8   0.4      2 374.7095
#> 9   0.8      2 373.3327
#> 10  1.0      2 372.6614
#> 11  0.1      3 252.9172
#> 12  0.2      3 234.8385
#> 13  0.4      3 221.4233
#> 14  0.8      3 220.4843
#> 15  1.0      3 219.8683
#> 16  0.1      4 461.6522
#> 17  0.2      4 421.7474
#> 18  0.4      4 394.1287
#> 19  0.8      4 391.4590
#> 20  1.0      4 390.0860
#> 21  0.1      5 384.3447
#> 22  0.2      5 347.7702
#> 23  0.4      5 325.6584
#> 24  0.8      5 325.0720
#> 25  1.0      5 324.6259

Figure @ref(fig:cvplot) displays the LOO-CV metric (larger is better) for both the full sample (n = 1000) and a subset of the sample (n = 100). The rank of each combination of family and bandwidth is very similar in both settings. In particular, i both cases we select the Gumbel copula with bandwidth parameter h = 0.1. This suggests that subset-based model selection can be used to reduce the computation time.

fam_names <- c("Gaussian", "Student", "Clayton", "Gumbel", "Frank")
bind_rows(as_tibble(cvselect1$cv) %>%
          mutate(n = n_loo),
          as_tibble(cvselect2$cv) %>%
          mutate(n = nrow(udata))) %>%
  mutate(
    family = factor(family, levels = c(1,2,3,4,5),
                    labels = fam_names),
    Bandwidth = factor(band),
    n = factor(paste0("n = ", n))
  ) %>%
  ggplot(aes(x = family, y  = cv, fill = Bandwidth)) +
  geom_bar(stat="identity", position = "dodge") +
  facet_grid(. ~ n) +
  scale_fill_brewer(palette="Blues", direction=-1) +
  xlab("") + ylab("CV Likelihood") +
  theme_bw() + 
  theme(legend.position = "bottom",
        legend.direction = "horizontal")

Cross-validated likelihood for copula and bandwidth selection based on a subset and the full sample.

Sensitivity to the choice of copula family

Figure @ref(fig:localseq2) displays the true and estimated conditional Kendall τ for each of the copula families using the selected bandwidth parameter h = 0.1. The results suggest that misspecification of the copula family does not drastically affect the local likelihood estimates of τ(x), except for the Clayton copula which shows the most departure from the Gumbel copula used in the DGM.

x0 <- seq(0, 1, by=0.01)
tau_est <- sapply(1:5, function(fam_id) {
  fit <- CondiCopLocFit(
    u1 = udata[,1], u2 = udata[,2], x = x,
    x0 = x0,
    family = fam_id, 
    degree = degree,
    kernel = kernel,
    band = band)
  BiCopEta2Tau(fit$eta, family=fam_id)
})

colnames(tau_est) <- fam_names

as_tibble(tau_est) %>%
  mutate(
    x = x0,
    True = BiCopEta2Tau(family, eta = eta_fun(x0))
  ) %>%
  pivot_longer(!x, names_to = "Type", values_to = "tau") %>%
  mutate(
    Type = factor(Type, levels = c("True", fam_names))
  ) %>%
  ggplot(aes(x = x, y = tau, group = Type)) +
  geom_line(aes(col = Type, linewidth = Type)) +
  scale_color_manual(
    values = c("black", "red", "blue", "brown", "orange", "green4")
  ) + 
  scale_linewidth_manual(values = c(1, rep(.5, 5))) + 
  ylim(c(0, 1)) + 
  xlab(expression(x)) + ylab(expression(tau(x))) +
  theme_bw() + 
  theme(
    legend.position = "bottom",
    legend.title = element_blank(),
  )

$Conditional Kendall $\tau$ estimates under different copula families.$

Conditional Kendall τ estimates under different copula families.

References

Acar, Elif F., Radu V. Craiu, and Fang Yao. 2011. “Dependence Calibration in Conditional Copulas: A Nonparametric Approach.” Biometrics 67 (2): 445–53.

———. 2013. “Statistical Testing of Covariate Effects in Conditional Copula Models.” Electronic Journal of Statistics 7: 2822–50.

Acar, Elif F., Claudia Czado, and Martin Lysy. 2019. “Dynamic Vine Copula Models for Multivariate Time Series Data.” Econometrics and Statistics 12: 181–97.