BAYESIAN ANALYSIS OF RIGHT CENSORED SURVIVAL TIME DATA

We analyzed cancer data using a fully Bayesian inference approach based on Markov chain Monte Carlo (MCMC) simulation, which allows the estimation of very complex and realistic models. The results show that sex and age are significant risk factors for dying from the selected cancers. The risk of dying from these cancers increases progressively with the age of the patient. To allow for nonlinearity in the effect of the metrical covariate age, the semiparametric P-splines model is also found to be better than the model that categorizes age into age groups.


INTRODUCTION
Analysis of survival or failure times has gained considerable attention, particularly in medical applications, from which the conventional term 'survival analysis' arises [Hennerfeind (2006)]. Censoring is one phenomenon that makes survival analysis differ from other analyses: it is a situation of incompleteness in the observed survival data. The most common form of censoring in survival time data is right censoring, which occurs when the actual time at which a subject experiences the event of interest is not known. In this type of censoring, it is assumed for a given individual in the study that there is a time to event $X$ and a right censoring time $C$, where the $X$'s are assumed to be independently and identically distributed with density function $f(t)$ and survival function $S(t)$. The exact survival time $X$ of an individual is known if and only if $X$ is less than or equal to $C$. If $X$ is greater than $C$, then the individual is a survivor and the survival time is censored at $C$. Thus the observed time is $T = \min(X, C)$, and the data for such a design can be represented by pairs of random variables $(T, \delta)$, where $\delta$ indicates whether the survival time $T$ corresponds to an event ($\delta = 1$) or is right censored ($\delta = 0$).

An aspect of the analysis of survival time data that has gained popularity, especially in medical research, is assessing the relationship between survival time and biological, socio-economic and demographic characteristics that could affect the survival status of patients. One popular regression model formulation that is often used in survival analysis is the Cox (1972) proportional hazards model. The model utilizes the hazard function $\lambda(t)$, also known as the hazard rate or force of mortality, which is defined as the instantaneous rate of failure at time $t$: the probability of experiencing the event in the infinitesimally small interval $[t, t+\Delta t)$, given that no such event has been experienced prior to $t$, divided by the length of the interval. It is expressed as

$$\lambda(t) = \lim_{\Delta t \to 0} \frac{P(t \le X < t + \Delta t \mid X \ge t)}{\Delta t} = \frac{f(t)}{S(t)}. \qquad (1.1)$$
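The right-censoring mechanism described above can be illustrated with a minimal sketch. The function name `observe`, the exponential lifetimes and the fixed censoring time of 30 days are assumptions made only for illustration; they are not taken from the study data.

```python
import random


def observe(lifetime, censor_time):
    """Return (t, delta): the observed time min(lifetime, censor_time) and
    the event indicator (1 = event observed, 0 = right censored)."""
    t = min(lifetime, censor_time)
    delta = 1 if lifetime <= censor_time else 0
    return t, delta


# Hypothetical illustration: exponential lifetimes with mean 20 days,
# all subjects censored at a fixed time of 30 days.
rng = random.Random(0)
censor_time = 30.0
lifetimes = [rng.expovariate(1 / 20.0) for _ in range(5)]
data = [observe(x, censor_time) for x in lifetimes]
```

Each element of `data` is a pair (observed time, indicator), which is the form in which right-censored survival data enter the likelihood.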

Likelihood for Right Censored Data
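The likelihood for censored data is derived by considering the observed survival times $t_1, \dots, t_n$. If subject $i$ fails at $t_i$, its contribution to the likelihood is the density at that time,

$$L_i = f(t_i) = S(t_i)\,\lambda(t_i). \qquad (1.2)$$

If the subject is still alive at $t_i$, all we know under non-informative censoring is that the lifetime exceeds $t_i$, and thus the contribution of such a censored observation to the likelihood is

$$L_i = S(t_i). \qquad (1.3)$$

Let $\delta_i$ be a failure indicator which takes value 1 if subject $i$ fails at time $t_i$ and value 0 if subject $i$ is censored. Then we write the full likelihood as

$$L = \prod_{i=1}^{n} L_i = \prod_{i=1}^{n} \lambda(t_i)^{\delta_i}\, S(t_i). \qquad (1.4)$$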
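As a worked illustration of the full likelihood (1.4), the sketch below assumes a constant hazard (an exponential model), which is an assumption made here for simplicity and is not the model fitted in this paper. Under that assumption the survival function is exp(-rate * t), so the log-likelihood reduces to the sum of delta_i * log(rate) - rate * t_i. The data values are hypothetical.

```python
import math


def censored_loglik(rate, times, deltas):
    """Log of the full likelihood for a constant hazard `rate`:
    each event contributes log(rate), every subject contributes -rate * t."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return sum(d * math.log(rate) - rate * t for t, d in zip(times, deltas))


def exponential_mle(times, deltas):
    """Closed-form maximiser: number of events divided by total time at risk."""
    return sum(deltas) / sum(times)


# Hypothetical lengths of stay in days; delta = 0 marks right-censored records.
times = [5.0, 12.0, 30.0, 8.0, 30.0]
deltas = [1, 1, 0, 1, 0]
rate_hat = exponential_mle(times, deltas)
```

Censored subjects contribute only through their time at risk, which is why they enter the sum with no log(rate) term.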

COX PROPORTIONAL HAZARDS MODEL FORMULATION
Suppose that the data collected on $n$ subjects are denoted by $(t_i, \delta_i, Z_i)$, $i = 1, \dots, n$, where $t_i$ is the time to failure of the $i$th subject, $\delta_i$ is the censoring indicator such that $\delta_i = 1$ if the event of failure occurs to the subject at time $t_i$ and $\delta_i = 0$ if the time is right censored (i.e. we observe some value $c$ with the knowledge that $t_i > c$), and $Z_i$ is a $p$-dimensional vector of covariates. The Cox (1972) model assumes that the hazard function for the $i$th subject with covariate value $Z_i$ has the form

$$\lambda_i(t) = \lambda_0(t)\exp(Z_i'\gamma), \qquad (2.1)$$

where $\lambda_0(t)$ is an arbitrary baseline hazard function and $\gamma$ is a $p$-vector of unknown regression coefficients. Model (2.1) is semi-parametric because the dependence on the covariates is modelled explicitly but no specific probability distribution is assumed for the survival times. Thus, in the classical setting, $\gamma$ is estimated through the partial likelihood estimation procedure.
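The partial likelihood procedure mentioned above can be sketched as follows. This is a minimal illustration of the classical (non-Bayesian) partial log-likelihood, assuming no tied event times; the function name and argument layout are hypothetical, and it is not the estimation routine used for the analyses in this paper.

```python
import math


def cox_partial_loglik(gamma, times, deltas, covariates):
    """Cox partial log-likelihood, assuming no tied event times.

    covariates[i] is the covariate vector of subject i and gamma is the
    coefficient vector; the baseline hazard cancels and never appears.
    """
    # Linear predictor for every subject.
    eta = [sum(g * z for g, z in zip(gamma, z_i)) for z_i in covariates]
    loglik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, deltas)):
        if d_i == 1:
            # Risk set: subjects still under observation just before t_i.
            risk = sum(math.exp(eta[k]) for k, t_k in enumerate(times) if t_k >= t_i)
            loglik += eta[i] - math.log(risk)
    return loglik
```

With all coefficients equal to zero, each event contributes minus the log of the size of its risk set, which gives a quick sanity check on the implementation.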
Often, survival time data involve identifiable clusters of subjects defined by some unobserved characteristics. Subjects belonging to the same cluster are similar with respect to such characteristics, so that their survival times are correlated, whereas the survival times of subjects belonging to different clusters are independent. One appropriate way of analyzing such data is to use a random effect (frailty) model,

$$\lambda_i(t) = \lambda_0(t)\exp(Z_i'\gamma + b_{c_i}), \qquad (2.3)$$

where $c_i$ denotes the cluster of subject $i$ and $b_c$ is the random effect (frailty) shared by the subjects belonging to cluster $c$. Model (2.3) assumes that the effects of covariates are linear on the log hazard and are thus modelled parametrically as fixed effects. Often, in practical situations, the effects of continuous covariates are not linear and thus cannot be adequately modelled as fixed effects. Thus, extending Hennerfeind et al. (2005), the parametric predictor in (2.3) is replaced with a more flexible semiparametric structured additive predictor that incorporates this complexity within the same framework. The Cox-type hazard model (2.1) can then be written as

$$\lambda_i(t) = \exp(\eta_i(t)), \qquad \eta_i(t) = g_0(t) + \sum_j f_j(x_{ij}) + Z_i'\gamma + b_{c_i}, \qquad (2.4)$$

where $g_0(t) = \log \lambda_0(t)$ is the log-baseline hazard, $f_j$ is the nonlinear effect of a continuous covariate $x_j$, $\gamma$ is the vector of usual linear fixed effects, and $b_c$ is the cluster-specific random effect (frailty). The $b_c$ are usually assumed to be independent realizations from a normal or log-gamma distribution with known mean and unknown variance.

BAYESIAN INFERENCE
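Bayesian analysis requires the assignment of priors. For defining priors and developing posterior analysis, the predictor (2.4) needs to be rewritten in generic matrix notation. We express the vectors of function evaluations $g_0$, $f_j$ and $b$ each as the matrix product of an appropriately defined design matrix $Z_j$ and a vector of unknown parameters $\beta_j$, e.g. $f_j = Z_j\beta_j$, which leads to re-expressing (2.4) as

$$\eta = Z_0\beta_0 + \sum_j Z_j\beta_j + Z_b\beta_b + V\gamma, \qquad (3.1)$$

where $V$ denotes the design matrix of the fixed-effect covariates. We then assign priors as follows. For the fixed effect parameters $\gamma$ we assume diffuse priors, i.e. $p(\gamma) \propto \text{const}$.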
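The general form of the priors for $\beta_j$ can be cast into the form

$$p(\beta_j \mid \tau_j^2) \propto \exp\left(-\frac{1}{2\tau_j^2}\,\beta_j' K_j \beta_j\right),$$

where $\tau_j^2$ is a variance parameter and $K_j$ is a precision or penalty matrix of rank $\mathrm{rank}(K_j) = r_j$, which shrinks parameters towards zero or penalizes too abrupt jumps between neighbouring parameters.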
For the baseline and the non-linear effect of the continuous covariate, we assign Bayesian P-splines priors as in Lang and Brezger (2004), and the random effects are assumed to be i.i.d. Gaussian, i.e. $b_c \sim N(0, \tau_b^2)$.
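To make the role of the penalty matrix concrete, the following sketch builds the matrix implied by a first- or second-order random walk on the spline coefficients, the kind of smoothness prior used for Bayesian P-splines. It assumes the penalty is the cross-product of a difference matrix; the function names are hypothetical and the code is an illustration rather than the BayesX implementation.

```python
def difference_matrix(m, order):
    """Return the (m - order) x m matrix that takes order-th differences."""
    d = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]  # identity
    for _ in range(order):
        # Each pass replaces the rows by differences of consecutive rows.
        d = [[d[r + 1][c] - d[r][c] for c in range(m)] for r in range(len(d) - 1)]
    return d


def penalty_matrix(m, order=2):
    """Cross-product D'D of the difference matrix; its rank is m - order."""
    d = difference_matrix(m, order)
    return [[sum(row[i] * row[j] for row in d) for j in range(m)] for i in range(m)]


def penalty(beta, k):
    """Quadratic form beta' K beta that appears in the exponent of the prior."""
    m = len(beta)
    return sum(beta[i] * k[i][j] * beta[j] for i in range(m) for j in range(m))
```

A second-order penalty leaves linear coefficient sequences unpenalized, for example `penalty([1, 2, 3, 4, 5], penalty_matrix(5, 2))` is zero, while abrupt jumps between neighbouring coefficients are penalized. This is why the penalty matrix is rank deficient.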

APPLICATION: HOSPITAL ADMISSION OF CANCER PATIENTS
We consider data on cancer patients who were admitted to the University of Ilorin Teaching Hospital (UILTH) from 1999 to 2005. The record of each patient contains the following variables: length of stay in the hospital (recorded in days), sex, age, and outcome, which indicates whether the patient is dead or alive. We define survival time as the length of stay until death occurs, while patients whose records read "alive" are right censored because they had not died at the time of the study. The patients were grouped into nine cancer/tumor types or sites: carcinoma, leukaemia, lymphoma, melanoma, sarcoma, rectum, lung, liver and stomach. Prostate and breast cancers are not included because they are gender related and may introduce gender bias into the analysis.
Fitting the variable cancer type as a fixed effect requires that we construct eight dummy variables, and this results in eight parameter estimates to be compared to an arbitrarily chosen reference category. A more efficient alternative is to fit the cancer type as a random effect (frailty).
At the initial stage, we fitted sex and continuous age as fixed effects with diffuse priors. That is, we fitted a model in which sex and age enter the predictor linearly. Table 1 shows the posterior estimates, standard errors and the 95% credible intervals. The effects of sex and age, when fitted as fixed effects, are seen to be significant, as the credible intervals do not include zero. To gain more insight into gender differences, we fitted models for the combined data and then for male and female patients separately.

Since the assumption of a linear effect of metrical covariates such as age on the predictor is too restrictive, as discussed in Section 2, we consider two widely used alternative ways to allow for non-linearity in the effects of metrical covariates. In the first alternative, we categorize the covariate age by constructing a set of indicator variables, with one category being arbitrarily chosen as the reference, thereby producing one dummy variable, and one parameter to be estimated, for each remaining category. In the second alternative, which is more flexible and data driven, we incorporate age additively in the predictor through a smooth regression function $f_j(x_j)$ and model it nonparametrically using a P-splines prior as in Lang and Brezger (2004). In this paper, sex was coded 1 for male and 0 for female patients. The metrical age was coded into four categories: "less than 23 years" (reference group), "23-39 years", "40-55 years", and "greater than 55 years".

Our research interest thus includes investigating the effect of categorized age on the risk of dying from cancer for the patients combined and for males and females separately, and comparing the two approaches described above by considering a hierarchy of models, starting from a very simple model and progressively increasing model complexity. Model comparisons are based on the deviance information criterion (DIC) introduced by Spiegelhalter et al. (2002), which is a Bayesian analogue of the Akaike information criterion (AIC). The following models are fitted, noting that all models contain a baseline effect.
Model 1: (metrical age with random effect)
Model 4:
Model 5: (categorical age with random effect)
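The DIC used for model comparison can be computed from MCMC output as the posterior mean deviance plus the effective number of parameters, where the latter is the posterior mean deviance minus the deviance evaluated at the posterior mean of the parameters (Spiegelhalter et al., 2002). The sketch below is a minimal illustration; the function name and all numerical values are hypothetical and are not results from this study.

```python
def dic(deviance_samples, deviance_at_posterior_mean):
    """Return (DIC, p_D), where p_D is the mean sampled deviance minus the
    deviance at the posterior mean, and DIC is the mean deviance plus p_D."""
    d_bar = sum(deviance_samples) / len(deviance_samples)
    p_d = d_bar - deviance_at_posterior_mean
    return d_bar + p_d, p_d


# Hypothetical deviance samples for two competing models (made-up numbers).
dic_a, pd_a = dic([1012.0, 1008.0, 1010.0, 1014.0], 1005.0)
dic_b, pd_b = dic([1003.0, 1001.0, 1005.0, 1007.0], 996.0)
preferred = "A" if dic_a < dic_b else "B"
```

The model with the smaller DIC is preferred. Here the second model has the larger effective number of parameters but still attains the lower DIC because its fit is sufficiently better.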

RESULTS
Results of the analyses are presented in Table 2, showing the fixed effects of age for the combined, male and female patients, and in Table 3, showing the hierarchical models under categorized age and age fitted by P-splines. The results in Tables 2a, 2b and 2c are the posterior means, standard errors and quantiles of the fixed effects of categorized age for the combined, male and female patients, respectively. The risk of dying from cancer increases with age for the combined data and for each sex separately. For example, in the combined data, patients in the age group 23-39 years have a risk of exp(0.290) ≈ 1.34 times that of patients in the reference category (less than 23 years).
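Relative risks of this kind are obtained by exponentiating the posterior estimate of the corresponding log-hazard coefficient. A minimal sketch follows; the function names are hypothetical, and the same monotone transformation can be applied to the bounds of a credible interval.

```python
import math


def hazard_ratio(coef):
    """Relative risk versus the reference category for a log-hazard coefficient."""
    return math.exp(coef)


def hazard_ratio_interval(lower, upper):
    """Map credible-interval bounds from the log-hazard scale to the ratio scale."""
    return math.exp(lower), math.exp(upper)


# Posterior mean quoted in the text for age group 23-39 years (combined data).
hr = hazard_ratio(0.290)
```

A coefficient of zero corresponds to a hazard ratio of one, so a credible interval that excludes zero on the log-hazard scale excludes one on the ratio scale.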
The results are in the same direction for males and females, though the risks are higher for male patients than for their female counterparts. For example, whereas the risk for male patients in the age category 40-55 years is 1.70 times that of those in the reference category, it is 1.52 times for females. Table 3 shows that all the fitted models perform best for the male patients alone and worst for the combined data, as revealed by the DIC, which is lowest for the males and highest for the combined data. It is also observed that the P-splines models for age are better than the models with categorized age, as the DIC values are consistently smaller for the former than for the latter for the combined, male and female data. We also observe that the data contain a random effect (frailty), and that models that take this into account are better than those that ignore it.

CONCLUSION
In the analysis of data on hospital admission of the cancer patients under study, the results show significant differences among age groups with respect to the risk of dying from the selected cancers. The deviance information criterion (DIC) also reveals that, when we allow for non-linearity in the effect of the metrical covariate age, the nonparametric model using a P-splines prior as in Lang and Brezger (2004) is preferred over the model that categorizes age.

Software Package: All analyses in this paper were done using BayesX, a public domain software package for performing complex full and empirical Bayesian inference, available at http://www.stat.uni-muenchen.de/~lang/BayesX.

Limitation of the study: The major caveat to be considered when interpreting the results concerns patients' age, which is self reported. Self-reported age may often differ from a patient's true age. Despite this limitation, the strengths of the study remain significant.