Semiparametric Likelihood Estimation with Clayton-Oakes Model for Multivariate Current Status Data
Yeqian
Liu* and Hanyi Li
Department of
Mathematical Sciences, Middle Tennessee State University, Murfreesboro, USA
*Corresponding
author: Yeqian Liu, Assistant Professor, Department
of Mathematical Sciences, Middle Tennessee State University, Murfreesboro, USA
Received Date: 16
April, 2020; Accepted Date: 24 April, 2020; Published Date: 30 April, 2020
Citation:
Liu Y, Li H (2020) Semiparametric
Likelihood Estimation with Clayton-Oakes Model for Multivariate Current Status
Data. J Biostat Biom: JBSB-109. DOI:
10.29011/JBSB-109.100009
Abstract
In biomedical and public health studies, current
status data arises when each subject is observed only once and the failure time
is only known to be either smaller or larger than the observation time. This
paper discusses regression analysis of multivariate current status failure time
data which usually arise in epidemiological studies and animal carcinogenicity
experiments. We propose an EM-algorithm to estimate the regression coefficients
under Clayton-Oakes model for multivariate current status data. The
Clayton-Oakes model is a very promising model for analysis of multivariate
failure time data, it characterize the dependence of multiple failure time by
using a gamma frailty. The simulation study indicated that the method works
well for practical situations. An application from a tumorigenicity experiment
is provided.
Keywords: Multivariate current
status data; Clayton-Oaks model; Frailty model; EM algorithm; Maximum
likelihood estimation
Introduction
Current
status data arise when each subject in a survival study is observed only once
and the failure time of interest is known only to be smaller or larger than the
observation time. In other words, the failure time of interest is either left-
or right-censored instead of being observed exactly. Current status data occur
quite often in many applications, including tumorigenicity experiments and
epidemiological investigations of the natural history of a disease.
Current
status data is a special case of interval-censored data and is often referred
to as case 1 interval-censored data [1].
Interval-censored failure time data occur when the failure time of interest is
only observed to be bracketed by two adjacent examination times [2]. The interval-censored data reduce to current
status data if all intervals include either zero of infinity. In this paper, we
will deal with multivariate current status data, arising when there are
multiple failure times of interest in a survival study and only current status
data are available for each failure time.
We
will focus on the regression analysis of multivariate current status data. In
order to model the covariate effects on the failure times of interest as well
as accounting for the potential dependence of the failure times, we propose to
use the Clayton-Oakes (CO) model jointly and the proportional hazards model
marginally. The CO model is one of parametric copula models which express the
joint distribution of failure times as a function of the marginal distributions
and a dependence parameter [3]. The proportional
hazards model is one of the most commonly used semiparametric models in
survival analysis. In this paper, we will consider the estimation of the CO
model with marginal proportional hazards model based on multivariate current
status data.
Glidden DV, et al. (1999) [4], considered the estimation of the CO model based
on multivariate right-censored failure time data and they developed an
approximate EM algorithm for the determination of maximum likelihood estimates.
Unlike the right-censored data, the failure time of interest is never observed
exactly under current status data. Therefore, the well-developed tools used for
right-censored data cannot be readily generalized to fitting current status
data. For example, the partial likelihood approach and the elegant martingale
theory used in the estimation of proportional hazards model under
right-censored data do not carry over to the case of current status data [5]. Thus, the regression analysis of current status
data is not an easy task. Here we are trying to develop an EM algorithm for the
estimation of a semiparametric regression model with multivariate current
status data. This is even more challenging that the univariate case.
Sun T, et al. (2019) [3], pointed out that a reparametrization of the CO
model is equivalent to one of gamma frailty models discussed in Wang N, et al. (2015) [6]. Therefore, after a reparametrization, we are
essentially dealing with a gamma frailty model but the marginal hazard
functions would become much more complicated that proportional hazards
functions [6]. The CO model and the equivalent
gamma frailty model will be given in section 2. The frailty model approach has
been commonly used in the analysis of multivariate failure time data. This
approach assumes that there exists a common but unobserved latent variable,
called frailty that characterizes the dependence of the possibly correlated
failure times. Chen MH, et al. (2009)
[7], considered a frailty model approach for
regression analysis of multivariate current status data.
The rest of this paper
is organized as follows. In section 2, we present models, assumptions and the
likelihood function. In section 3, we develop a sieve maximum likelihood
approach for the estimation of the unknown parameters. The sieves are
constructed based on piecewise linear functions. By using the sieve method, we
avoid the estimation of infinite dimensional parameters. To determine the sieve
maximum likelihood estimate, we propose an EM algorithm by treating the
unobservable latent variable as missing values. In section 4, we conduct
simulation studies to examine the finite sample performance of our proposed
method. Section 5 applies the approach to NCTR’s tumorigenicity experiment.
Section 6 contains some discussion and concluding remarks.
Assumptions,
Models and the Likelihood Function
Suppose we are
interested in K failure times in a survival study, denoted by T1, …, Tk.
Let Zk be the covariates that may affect Tk, k=1, …, k,
respectively. We assume the proportional hazards model for each failure time,
i.e. the marginal hazard function for Tk
is given by
Where
λ0k(t) is the baseline hazard
function for Tk
and β is the vector of regression
parameters. Note that we assume a common vector of covariate effects. Also, we
assume that the joint distribution of the failure times follows the
Clayton-Oakes (CO) model, i.e., the joint survival function is given by
Where
the parameter α specifies the CO model and
. When α = 0, the
joint survival function becomes the product of the marginal survival functions
and failures are independent. Increasing values of
imply increasing positive association between
failures [8].
The CO model is equivalent to
the following gamma frailty model [4]. Suppose ξ is a gamma random variable
with mean 1 and unknown variance α.
Conditional on the frailty ξ, it
is assumed that the failure times are independent with hazard functions:
Suppose
that only current status data are available for each Tk, and the observed data for
a single subject are where ck denotes
the observation time for
th failure and τ denotes the length of
study. Without loss of generality, we will assume that the
are the
same, and the common observation time is denoted by c. Also, we assume that c is
independent of Tk given Zk
and the distribution of c does not depend on the unknown parameters
i.e., the censoring mechanism is
noninformative. Then, given ξ, the likelihood of
observing δk is
proportional to
This
likelihood involves infinite-dimensional parameters Λ0k(t) which is difficult to deal
with. We propose a sieve approach to estimate Λ0k(t), that is, to approximate it
by a piecewise linear function
Define and denote all
unknown parameters as θ = (β,α,γ), then the likelihood function
for a single subject is given by
Where f(ξ;α) is the density function of the gamma distribution with mean 1 and variance α.
From (2.2), it is easy to
show that the conditional density function of ξ
given Ω* has the form
We want to obtain the value of θ that maximizes L(θ;0), i.e. the maximum
likelihood estimate . Since there is no
closed form solution for the maximum likelihood estimate, we try to determine
using EM algorithm.In order to develop the EM
algorithm, we consider the latent variables
to be unobserved values and develop the E-step
and M-step iteratively for the EM-algorithm.
Then the conditional expectation of the
log-likelihood has the form
With θ set to be θ(m) obtained at the mth iteration. Since the expectation has no closed form, we employ some numerical integral algorithms to compute such expectation
for any function h(ξi) of ξi
In order to determine we can rewrite it as
Where the expectations on the right-hand side
are taken with respect to the gamma distribution with mean 1 and variance α(m), α(m)
denotes the estimate of α obtained at the mth iteration, and
This suggests that for sufficiently large L,
the expectation can be approximated by
Where {ξl, l = 1, … , L} are i.i.d. samples from the gamma distribution with mean 1 and variance
α(m).
Now, we will describe the M-step, which maximizes the conditional
expectation l(θ;0) with replacing all expectations involving functions h(ξi) by
their approximation given in (3.1) to determine the updated
estimate θ(m+1). We adopt the moment estimate for α in each iteration, that is, the estimate of α at the (m+1)th iteration is given
by
For the maximum likelihood estimator of
parameters β
and γ, we have
Where
Where {ξl, l = 1, … ,L} are i.i.d. samples from the gamma distribution
with mean 1 and variance .Then for estimation of β and γ, we can solve the equation
It is obvious that it would not be easy to solve all p + K * J equations
in (3.3) above simultaneously. For this, we suggest the following procedure for
the (m+1)th iteration.
To guarantee that α is positive at each iteration, we reparameterize the
frailty variance as in the procedure. With a little abuse of
notations, we still denote the parameter as α. The key advantage of the
procedureis that, by using sieve method, one can avoid solving high dimensional
or large number of equations and thus can easily apply well developed
optimization algorithms in the M-step, such as Newton-Raphson algorithm.
Moreover, the computation is more stable and efficient than that using
nonparametric estimation of the cumulative baseline hazards functions.
A Simulation Study
This section carries out extensive simulation studies to examine the performance of the proposed method. Our model specification mostly follows that used by Glidden DV, et al. (1999) [4], where a Clayton-Oakes (CO) with gamma frailty was studied for multivariate survival times subject to right censoring. In this study, we consider the situation where k = 2, so there exists two failure times T1 and T2. Let the covariate z be one-dimensional and generated from a binomial distribution Bin (1,0.5). We assume the cumulative baseline hazard function for T1 and T2 to be Λ1(t) = 0.2t and Λ2(t) = 0.05t, respectively. Then the conditional hazard functions for T1 and T2 given ξ are
respectively. Here, ξ is the gamma frailty which is assumed to have a
gamma distribution with mean 1 and variance α. The failure times T1
and T2 can be generated from the inverse function of the conditional
cumulative distribution function given the generated ξ. In the study, we took with J = 6, and the observation times cj’s
are generated from the discrete uniform distribution over {s1, s2,
… , sj}. The results given
below are based on 300 replications with L=1000 for the approximation and the
sample size n=200 or 400.
To obtain the estimates we consider four different cases. In the tables 1 and 2, we include the Averages
(AVE) and the Sample Standard Deviations (SSE). The true value of β was taken
as 0.5 or -0.5. And the frailty variance α was set to be 1.2 or 0.8. One can
see from these tables that the proposed estimate
seems to relatively unbiased.
As mentioned in [7], one could estimate β based on the observed current status data on T1 and T2 by assuming that β in the model is the same for T1 and T2. However, this marginal approach is less efficient than the estimation procedure developed here. We also found the EM-algorithm developed is sensitive to the initial value of γk. To obtain good convergence rate and increase convergence speed, it is important and necessary to have good initial values.
An Application
Now we apply the proposed method to an animal
tumorigenicity experiment conducted by the National Center for Toxicological
Research (NCTR). It is a 2-year rodent carcinogenicity study of chloroprene
consisting of F344/N rats and B6C3F1 rats with both sexes. There are 50 male
and 50 female rats in both control and high dose group. Rats in the high dose
group were exposed to chloroprene at the concentrations of 80 ppm, 6 h per day,
5 days per week for up to 2 years. The animal either died during the study or
was sacrificed at the end of the study. At the death or sacrifice, the presence
or occurrence of different types of tumors was determined through a
pathological examination. Due to the nature of the pathological examination
process, the tumor occurrence time was not exactly observed but instead known
only to be smaller or greater than the death or sacrifice time. Therefore, the
tumor occurrence time of lung and adrenal cancers is bivariate current status
data. Following Dunson DB, et al. (2002)
[9], the main objective of the study is to investigate the effect of
choloroprene on lung and adrenal cancer by comparing the tumor growth rates between
the control and high dose group.
Let T1 and T2 denote the occurrence times of
adrenal and lung tumors, we model the joint distribution of T1 and T2
using the proposed Clayton-Oaks model with Gamma frailty. Application of the
proposed EM-algorithm gave with an
estimated standard error of 0.2851. Therefore, the animals in the high dose
group had significantly higher occurrence rate of both adrenal and lung tumors
than those in the control group. To illustrate this occurrence rate difference,
Figure 1 shows the estimated survivals functions of the lung tumor occurrence
time for animals in the control and high dose groups.
To compare the performance of the proposed Clayton-Oaks model with
traditional marginal proportional hazards model. The Mean Square Error (MSE)
was calculated for both models. For the control group of the lung tumor, we
obtained MSE=0.8741 under Clayton-Oaks model and 1.1681 under the marginal proportional
hazards model. For the high dose group of the lung tumor, the corresponding
values are 0.8254 and 1.3517, respectively. These results suggest that the
proposed Clayton-Oaks model fits the data better than the marginal proportional
hazard model for the lung tumor. The results for the adrenal tumor are similar.
Discussion and Concluding Remarks
In this paper, we proposed an EM-algorithm to
estimate the regression coefficients under Clayton-Oakes model for multivariate
current status data. The simulation study and real data application indicated
that the method works well for practical situations. The Clayton-Oakes model is
a very promising model for analysis of multivariate failure time data, it
characterize the dependence of multiple failure time by using a gamma frailty.
This model is different from the commonly used marginal Cox model and normal
frailty Cox model. The analysis of CO model would benefit from methods which
can diagnose the appropriateness of the model.
Actually, the gamma frailty assumption is not essential, and the proposed procedure can be carried out for other choices of frailty distributions, for example, log-normal frailty, positive stable frailty, or inverse Gaussian frailty. It is noteworthy that the proposed frailty model does not provide the marginal model directly, but the marginal model can be derived by integrating out the frailty.
The EM-algorithm developed in this paper can also be extended to case II
interval censored data. However, the presented methodology has several
limitations, which lead to some directions for future research. The first
limitation is that one may have a hard time conducting model selection between
our proposed CO model and the log-normal frailty Cox model proposed by Chen MH, et al. (2009) [7]. Moreover, in
our model, we assume that for in each cluster i, the failure times {T1,…,Tk}
are independent. In practice, this may not be true. However, to consider the
dependence between k failure times, we have to introduce a multivariate gamma
frailty which will significantly increase the computational difficulty.
Figure 1: Estimated marginal survival functions for time
to lung tumor.
|
Parameters |
n |
J |
True |
AVE |
BIAS |
SSE |
MSE |
Case 1 |
α |
200 |
6 |
0.8 |
0.877 |
0.077 |
0.126 |
0.132 |
|
β |
|
|
0.5 |
0.473 |
-0.027 |
0.163 |
0.189 |
Case 2 |
α |
200 |
6 |
0.8 |
0.749 |
0.051 |
0.112 |
0.115 |
|
β |
|
|
-0.5 |
-0.554 |
-0.054 |
0.177 |
0.181 |
Case 3 |
α |
200 |
6 |
1.2 |
1.281 |
0.081 |
0.147 |
0.154 |
|
β |
|
|
0.5 |
0.435 |
0.065 |
0.181 |
0.185 |
Case 4 |
α |
200 |
6 |
1.2 |
1.131 |
-0.069 |
0.153 |
0.157 |
|
β |
|
|
-0.5 |
-0.578 |
-0.078 |
0.172 |
0.179 |
|
Parameters |
n |
J |
True |
AVE |
BIAS |
SSE |
MSE |
Case 1 |
α |
400 |
6 |
0.8 |
0.826 |
0.026 |
0.086 |
0.093 |
|
β |
|
|
0.5 |
0.485 |
-0.015 |
0.116 |
0.129 |
Case 2 |
α |
400 |
6 |
0.8 |
0.773 |
0.027 |
0.082 |
0.086 |
|
β |
|
|
-0.5 |
-0.514 |
-0.014 |
0.137 |
0.125 |
Case 3 |
α |
400 |
6 |
1.2 |
1.245 |
0.045 |
0.116 |
0.107 |
|
β |
|
|
0.5 |
0.471 |
0.029 |
0.135 |
0.128 |
Case 4 |
α |
400 |
6 |
1.2 |
1.161 |
-0.039 |
0.108 |
0.119 |
|
β |
|
|
-0.5 |
-0.593 |
-0.007 |
0.126 |
0.131 |
- Sun J (2006) The statistical
analysis of interval-censored data. Springer.
- Zeng D, Gao F, Lin D (2017) Maximum likelihood estimation for semiparametric regression models with multivariate interval-censored data. Biometrika 104: 505-525.
- Sun T, Ding Y (2019) Copula-based semiparametric regression method for bivariate data under general interval censoring. Biostatistics.
- Glidden DV, Self SG (1999) Semiparametric likelihood estimation in the Clayton-Oakes failure time model. Scand. J Statist 26: 363-372.
- Wen CC
and Chen YH (2013) A frailty model approach for regression analysis of
bivariate interval-censored survival data. Statistica Sinica 23: 383-408.
- Wang N, Wang L, McMahan CS (2015) Regression analysis of bivariate current status data under the gamma-frailty proportional hazards model using the EM algorithm. Computational Statistics & Data Analysis 83: 140-150.
- Chen MH, Tong X, Sun J (2009) A frailty model approach for regression analysis of multivariate current status data. Statistics in Medicine 2: 3424-3436.
- Nielsen GG, Gill RD, Andersen PK, Sorensen TIA (1992) A counting process approach to maximum likelihood estimation in frailty models. Scand J Statist 19: 25-43.
- Dunson DB, Dinse GE (2002) Bayesian models for multivariate current status data with informative censoring. Biometrics 58: 79-88.