Statistical models for longitudinal zero-inflated count data: application to seizure attacks

Background Chronic non-communicable diseases, such as epilepsy, are increasingly recognized as public health problems in developing and African countries. This study aimed to find determinants of the number of epileptic seizure attacks using different count data modeling techniques. Methods Four common fixed-effects Poisson family models were reviewed to analyze count data with a high proportion of zeros in a longitudinal outcome, i.e., the number of seizure attacks in epilepsy patients. This is because, in addition to the problem of extra zeros, the correlation between measurements on the same patient at different occasions needs to be taken into consideration. Results The investigation identified some important factors associated with epileptic seizure attacks. The number of seizure attacks increased with age, and male patients had more seizures than their female counterparts. In general, a patient's age, sex, monthly income, family history of epilepsy and service satisfaction were some of the significant factors associated with the frequency of seizure attacks (P value<0.05). Conclusion This study suggests that the zero-inflated negative binomial is the best model for predicting and describing the number of seizure attacks as well as identifying the potential risk factors. Addressing these risk factors may help contain the progression of seizure attacks.


Background
Chronic non-communicable diseases, such as epilepsy, are increasingly recognized as public health problems in developing countries. Even though there is a cheap and effective treatment, the majority of people with epilepsy remain untreated. Epilepsy is one of the most common and serious brain disorders, affecting at least 50 million people worldwide. People with epilepsy have a mortality rate 2-3 times higher than the general population 1 . Educational achievement, employment rate, and quality of life are all substantially lower for people with epilepsy 2 . In spite of recent global advancements in diagnosis and treatment, about eight million people with epilepsy in Africa are not treated with modern antiepileptic drugs; as a result, epilepsy has to be seen from a public health perspective 3,4 . Five controlled incidence studies have been conducted in African countries. These studies showed that the number of new cases of epilepsy detected among 100,000 people during one year was 83 in Burkina Faso, 64 in Ethiopia, 73 in Tanzania, 119 in Togo, and 156 in Uganda. These incidence rates are higher than those reported from the developed world, which usually range from 40 to 70 per 100,000 people. When interviewed, neurologists in the African region usually report that epilepsy is the second or third reason for consultation after headaches or peripheral neuropathies, and the second or third reason for hospitalization after strokes and/or spinal cord pathologies 5 . Additionally, the prevalence and incidence rates of epilepsy in rural and suburban areas are usually higher than in cities. This is because risk factors are more concentrated in the rural and suburban areas, where migrants from villages to cities reside in very poor conditions 6 . The prevalence of epilepsy is substantially increasing in Ethiopia, including the specific study area.
To the best of our knowledge, there is no published paper that has documented the determinant factors affecting the progression and change in the number of seizure attacks among epileptic patients. Therefore, this study addressed this objective by applying longitudinal data analysis, especially count data regression modelling, to identify the potential factors associated with epileptic seizure attacks. Some previous studies provide plausible analyses of related problems. For example, one study analyzed longitudinal data of epileptic seizure counts using a hidden Markov regression approach, and another 8 demonstrated the use of longitudinal count data modelling in childhood injuries. Over the past years, different authors have suggested several predictors of seizure episodes, such as age, gender, family history of epilepsy and seizure characteristics 9 . However, most of these scholars relied on chi-square and logistic regression analysis 10,11 . Epileptic seizures have a historical association with religion and spirit possession. Some scholars have stated that the development of epilepsy is related to socio-economic factors such as infection, injury, poor nutrition, low educational achievement and poor housing status 12,13 . About 36% of Africans live in cities. Most of these city dwellers live in the suburbs in poor conditions characterized by overcrowding, poor water supply and bad sanitation. Consequently, there is a high prevalence of communicable diseases such as malaria, meningitis, cysticercosis and tuberculosis, all of which are frequent causes of epilepsy.

Statistical analysis of count outcomes data
In medical research, count data are common. In addition, these event count data are often positively skewed with a high proportion of zeros. In such cases, traditional methods may not always be appropriate. Zero-adjusting models have been used to analyze count data with extra variability and excessive zeros. These models can be zero-inflated or hurdle models [14][15][16][17] . More recently, these modeling techniques have been extended to longitudinal and multilevel settings to accommodate correlated and incomplete count data 18,19 .
Class of models for analyzing count data: an overview. In the theoretical background of probability models, count data are usually assumed to follow a Poisson distribution. Several modifications (extensions) to the Poisson model have been proposed for different reasons: (1) repeated measures of the outcome variable, (2) the occurrence of over-dispersion (the variance of the data is larger than the mean), and (3) the occurrence of excess zeros. The first and the second reasons can be managed, respectively, by including random subject-specific effects 20 and by an over-dispersion model, such as the negative binomial for count data 21 , where the natural parameter is assumed to follow a gamma distribution. Moreover, an excessive number of zeros, the third reason, can be accounted for by zero-inflated models 14 . For these reasons, four modified Poisson models were introduced: the negative binomial (NB) model, the zero-inflated Poisson (ZIP) model, the zero-inflated negative binomial (ZINB) model, and the hurdle Poisson model 22 .

[Figure: relationships among the count models. Poisson extends to NB when over-dispersion exists and to ZIP when excess zeros exist; ZINB handles excess zeros together with over-dispersion.]
The aforementioned count models are fixed-effects models for independent (cross-sectional) count data. These models have been extended to longitudinal data. Longitudinal data are often complicated by at least three factors: the within-subject observations are usually not independent due to repeated measures on the same subjects (cluster); the between-subject variation may not be constant over time; and the data are often incomplete, as subjects may drop out at any follow-up time.
Currently, there are two main approaches for modeling longitudinal and other correlated data, including count data. These are generalized estimating equations (GEE) 23 and generalized linear mixed-effects models (GLMM) 24 .
The key differences between GEE and GLMM are the following. Firstly, GEE is a marginal or population-average model that centers on correct estimation of the regression parameters by treating the within-subject (cluster) correlation as a "nuisance" with its robust estimation feature. In contrast, GLMM is a subject/cluster-specific model in which the subject/cluster-specific effects and the variance-covariance matrix are essential parts of the model. Secondly, GEE has a more restrictive assumption about missing values (i.e., missing completely at random, MCAR) than that of the GLMM (i.e., missing at random, MAR) 25 . Because GLMM, or the generalized multilevel model, is more popular among social scientists and applied researchers, and because its assumption about the missing value mechanism is more realistic, this paper focused on the GLMM approach for modeling the number of seizure attacks using the aforementioned five modeling strategies.
The models for cross-sectional count data discussed earlier can be viewed as part of the generalized linear model (GLM) 26 . GLM extends ordinary least squares regression by allowing the outcome probability distribution to be any member of the exponential family of distributions, including those for count data. Generalized linear mixed-effects models (GLMM) extend GLM to model correlated (longitudinal) and incomplete data 27 . The GLMM approach accounts for the within-subject correlations due to clustering either by including random effects in the model or by directly modeling these correlations with an appropriate variance-covariance structure. The linear mixed-effects/multilevel/hierarchical model 28 is a special case of GLMM where the response is a continuous variable. A longitudinal count outcome, depending on its distribution, is analyzed with either the Poisson model, the negative binomial (NB) model, or one of the zero-adjusting models in the GLMM framework.
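As an illustration of how random effects enter a count model, a minimal random-intercept Poisson GLMM can be sketched as follows. The symbols $b_i$ (a subject-specific intercept) and $\sigma_b^2$ (its variance) are notation assumed here for illustration only, as is the normal distribution for $b_i$; the paper does not state the random-effects structure in this form.

```latex
\begin{aligned}
Y_{ij} \mid b_i &\sim \text{Poisson}(\lambda_{ij}), \\
\log(\lambda_{ij}) &= X_{ij}'\beta + b_i, \\
b_i &\sim N(0, \sigma_b^2).
\end{aligned}
```

Under this sketch, repeated counts from the same subject share the same $b_i$, which induces the within-subject correlation, while $\beta$ is interpreted conditional on the random effect.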
Under the longitudinal setting and assuming independence between subjects, let the random variable $Y_{ij}$ denote the longitudinal response (such as the number of seizure attacks) for subject $i = 1, 2, \dots, n$ at time $j = 1, 2, \dots$, and assume that it follows the Poisson distribution with mean $\lambda_{ij}$. The marginal mean is regressed on a set of covariates $X_i = (x_{i1}, x_{i2}, \dots, x_{ip})'$ using a log link:
$$\log(\lambda_{ij}) = X_i'\beta \qquad (1)$$
where $\beta$ is a vector of parameters associated with the covariate vector $X_i$. In its simplest form, the Poisson regression model assumes that the marginal mean and variance of the response are equal. This strong assumption is often not tenable for empirical data because of heterogeneity introduced when important covariates are omitted from the study, and it is relaxed by applying an over-dispersed model. A commonly used over-dispersed model is the negative binomial model. In the negative binomial model (equation (2)), $a^{-1}$ is called the over-dispersion parameter, which arises from unobserved heterogeneity, and $\lambda_{ij}$ is the mean number of seizure attacks. The negative binomial regression model can be obtained in the same way as equation (1) by using equation (2). The NB model may not be appropriate if the over-dispersion is due to excess zeros, because it underestimates the probability of zeros and consequently underestimates the variability present in the outcome. In such situations, alternative models such as zero-inflated or hurdle models, which account for over-dispersion due to excess zeros, are useful.
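For reference, one widely used parameterization of the negative binomial probability function (often called NB2) is sketched below. It assumes that $a > 0$ denotes the dispersion, so that $a^{-1}$ acts as the shape of the gamma mixing distribution; whether this matches the exact form used in the paper is an assumption.

```latex
P(Y_{ij} = y) =
\frac{\Gamma(y + a^{-1})}{\Gamma(a^{-1})\, y!}
\left( \frac{a^{-1}}{a^{-1} + \lambda_{ij}} \right)^{a^{-1}}
\left( \frac{\lambda_{ij}}{a^{-1} + \lambda_{ij}} \right)^{y},
\qquad y = 0, 1, 2, \dots
```

Under this form, $E(Y_{ij}) = \lambda_{ij}$ and $\operatorname{Var}(Y_{ij}) = \lambda_{ij} + a\lambda_{ij}^2$, so the variance exceeds the mean whenever $a > 0$, and the Poisson model is recovered in the limit $a \to 0$.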
The zero-inflated Poisson (ZIP) model 24 has been proposed to address this issue. The model assumes that counts are generated by two processes. The first process generates zero counts with probability $\pi_{ij}$, while the second generates counts from the Poisson distribution with parameter $\lambda_{ij}$ and is realized with probability $(1-\pi_{ij})$. The model therefore assumes that zero counts are generated from two sources, according to the probabilities of the two processes. Thus, the probability distribution function of the longitudinal ZIP model can be written as:
$$P(Y_{ij} = y) = \begin{cases} \pi_{ij} + (1-\pi_{ij})\, e^{-\lambda_{ij}}, & y = 0 \\ (1-\pi_{ij})\, \dfrac{e^{-\lambda_{ij}} \lambda_{ij}^{y}}{y!}, & y = 1, 2, \dots \end{cases} \qquad (3)$$
This model reduces to the Poisson model when $\pi_{ij} = 0$. Note that within this model, zero counts are typically described through logistic regression, whereas positive counts are described by a log-linear Poisson model. For the vectors of covariates $Z_{ij}$ and $X_{ij}$ with associated parameters $\alpha$ and $\beta$ respectively, the model specifications are expressed as follows:
$$\operatorname{logit}(\pi_{ij}) = Z_{ij}'\alpha, \qquad \log(\lambda_{ij}) = X_{ij}'\beta \qquad (4)$$
The ZIP model accounts for over-dispersion due to excess zeros, but not due to unobserved heterogeneity. Moreover, if the count process does not follow the Poisson model, then one may use the zero-inflated negative binomial (ZINB) model 29 , which treats the count process as negative binomial. In contrast to the ZIP model, the ZINB model is more appealing since an additional parameter captures variability due to over-dispersion. Thus, the probability distribution function of ZINB is given by:
$$P(Y_{ij} = y) = \begin{cases} \pi_{ij} + (1-\pi_{ij})\, f_{\mathrm{NB}}(0; \lambda_{ij}, a), & y = 0 \\ (1-\pi_{ij})\, f_{\mathrm{NB}}(y; \lambda_{ij}, a), & y = 1, 2, \dots \end{cases} \qquad (5)$$
where $f_{\mathrm{NB}}$ denotes the negative binomial probability function of equation (2). Here the parameters $\pi_{ij}$ and $\lambda_{ij}$ have the same meaning as in the ZIP model. The zero-inflated negative binomial regression model can be obtained in the same way as equation (4) by using equation (5).
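The two-process construction described above can be illustrated with a small numerical sketch. The function names and the example values of `pi`, `lam` and `a` are hypothetical, and the negative binomial part assumes the NB2 parameterization (variance equal to `lam + a * lam**2`), which may differ from the form used in the paper.

```python
import math


def poisson_pmf(y, lam):
    """Poisson probability of count y with mean lam > 0."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))


def nb_pmf(y, lam, a):
    """Negative binomial probability with mean lam and dispersion a > 0.

    Assumed NB2 parameterization: variance = lam + a * lam**2.
    """
    r = 1.0 / a  # shape of the gamma mixing distribution
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             + r * math.log(r / (r + lam))
             + y * math.log(lam / (r + lam)))
    return math.exp(log_p)


def zero_inflate(base_prob, y, pi):
    """Mix a point mass at zero (weight pi) with a base count probability."""
    if y == 0:
        return pi + (1.0 - pi) * base_prob
    return (1.0 - pi) * base_prob


def zip_pmf(y, pi, lam):
    """Zero-inflated Poisson probability of count y."""
    return zero_inflate(poisson_pmf(y, lam), y, pi)


def zinb_pmf(y, pi, lam, a):
    """Zero-inflated negative binomial probability of count y."""
    return zero_inflate(nb_pmf(y, lam, a), y, pi)


if __name__ == "__main__":
    pi, lam, a = 0.4, 2.0, 1.5  # hypothetical illustration values
    for y in range(4):
        print(y, poisson_pmf(y, lam), zip_pmf(y, pi, lam), zinb_pmf(y, pi, lam, a))
```

Setting `pi` to zero returns the plain Poisson or negative binomial probabilities, which mirrors the nesting of the simpler models inside the zero-inflated ones.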

Data collection
Doctors and BSc psychiatry nurses who worked in the psychiatric clinic at Felege Hiwot Referral Hospital were selected and trained to collect data. In the hospital, epileptic patients receive follow-up from psychiatry nurses and the epilepsy clinic every month. The hospital is located in Bahir Dar city, Ethiopia, 565 km from Addis Ababa, the capital of Ethiopia. A structured questionnaire was used to collect data on social, demographic, behavioral and economic factors. To ensure the quality of the data, the researchers were engaged in continuous supervision and monitoring.

Model variables and measurements
Demographic details of a total of n=53 patients were included in this analysis. The outcome was measured using different measurements from each patient's card, and the covariates were selected from the various aspects of the collected data. The covariates were classified as time-independent and time-varying.

Time-varying covariates
Time was measured as a continuous variable representing monthly follow-up visits to the hospital. The variable time starts with the value 0 for the first follow-up visit, 1 for the second visit, and so on.

The dependent variable
The number of seizure attacks per month.

Model evaluation and selection
The Poisson distribution is nested within the negative binomial (NB) when the dispersion parameter is zero, and it is nested within the zero-inflated Poisson (ZIP) when there is no zero-inflation. Moreover, NB is nested within ZINB when there is no zero-inflation, and ZIP is nested within ZINB when the dispersion parameter is zero. Thus, for nested models, the difference in -2LogLikelihood (-2LL) between each nested pair can be used to evaluate model fit. In addition, a more convenient way to compare model fit across all models is to use information criteria: the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) 30,31 . P-values less than 5% were considered significant. The STATA 12.0 package was used for all statistical analyses.
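The comparison logic described above can be sketched in a few lines. The -2LL values, parameter counts and number of observations below are hypothetical placeholders, not results from this study; the likelihood ratio helper assumes the two nested models differ by exactly one parameter (for example, the dispersion parameter), so the reference distribution is chi-square with one degree of freedom.

```python
import math


def aic(neg2ll, n_params):
    """Akaike Information Criterion from -2 log-likelihood."""
    return neg2ll + 2 * n_params


def bic(neg2ll, n_params, n_obs):
    """Bayesian Information Criterion from -2 log-likelihood."""
    return neg2ll + n_params * math.log(n_obs)


def lrt_pvalue_df1(neg2ll_reduced, neg2ll_full):
    """P-value of a likelihood ratio test with 1 degree of freedom.

    Uses the identity P(chi2_1 > x) = erfc(sqrt(x / 2)).
    """
    stat = max(neg2ll_reduced - neg2ll_full, 0.0)
    return math.erfc(math.sqrt(stat / 2.0))


if __name__ == "__main__":
    # Hypothetical fits: model name -> (-2LL, number of parameters)
    fits = {"Poisson": (950.0, 10), "NB": (760.0, 11)}
    n_obs = 300  # hypothetical number of patient-visit records
    for name, (neg2ll, k) in fits.items():
        print(name, aic(neg2ll, k), bic(neg2ll, k, n_obs))
    print("Poisson vs NB p-value:", lrt_pvalue_df1(fits["Poisson"][0], fits["NB"][0]))
```

Because a dispersion parameter of zero lies on the boundary of its parameter space, the chi-square p-value from this sketch tends to be conservative for the Poisson versus NB and ZIP versus ZINB comparisons.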

Descriptive and exploratory analyses
The outcome variable, the number of seizure attacks per follow-up period, ranged from 0 to 30. The average number of attacks in each of the follow-up periods ranged from 0.3 to 6.03 and was substantially smaller than the corresponding variance (0.46 to 66.59), which indicated over-dispersion in the data set. Half of the patients (50%) had at least 3 seizure attacks before starting the medication, and three-fourths (75%) of them had at most 2 seizures at the third visit (Table 1). Taking the data set as a whole, the variance (17.19) was 11.6 times higher than the mean (1.49), which is a clear indication of over-dispersion. Moreover, the distribution of the data was highly positively skewed (skewness=5.65), with a high spike on the left and a long tail on the right. Values ranged from 0 to 30, but about 94% of the counts were between 0 and 4. Overall, about 61.5% of the counts were zeros (Fig. 1), indicating excess zeros in the data.
To select the best-fitting model for the data (Table 2), we used formal evaluation methods, including the likelihood value statistic (−2LL), the Akaike Information Criterion (AIC), and the Bayesian Information Criterion (BIC), which measure how far the fitted model is from the observed data. Thus, a smaller value indicated a better fit. The dispersion parameters for NB and ZINB were 0.86 and 1.8 respectively, both highly significant (p-value<0.01), implying that NB is superior to Poisson and that ZINB is superior to ZIP. Comparing the four models together, the -2LL and the two information criteria (AIC and BIC) listed at the bottom of the table indicated that the Poisson model had the poorest fit and the zero-inflated negative binomial model had the best fit (-LL=. 296, AIC=702.1 and BIC=776.5). Since the zero-inflated negative binomial (ZINB) model had the best fit, all interpretations were based on this model.
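The over-dispersion and excess-zero arguments in the descriptive analysis can be checked from the reported summary figures alone. The sketch below uses only the overall mean (1.49), variance (17.19) and zero share (61.5%) quoted in the text; the function names are hypothetical, and because the inputs are rounded, the ratio it computes may differ slightly from the reported 11.6.

```python
import math


def dispersion_ratio(mean, variance):
    """Variance-to-mean ratio; equals 1 under a Poisson model."""
    return variance / mean


def poisson_zero_probability(mean):
    """Share of zeros a Poisson model with this mean would predict."""
    return math.exp(-mean)


# Summary figures reported for the pooled data set
overall_mean = 1.49
overall_variance = 17.19
observed_zero_share = 0.615

ratio = dispersion_ratio(overall_mean, overall_variance)
expected_zero_share = poisson_zero_probability(overall_mean)

if __name__ == "__main__":
    print("variance / mean:", round(ratio, 2))
    print("zeros expected under Poisson:", round(expected_zero_share, 3))
    print("zeros observed:", observed_zero_share)
```

A ratio far above 1 points to over-dispersion, and an observed zero share well above the Poisson-implied share points to excess zeros, which together motivate the zero-inflated negative binomial model.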
Contrary to expectations, marital status, educational background, employment status, place of residence and the other factors listed in the table were not associated with the frequency of seizure attacks among the epileptic patients. However, age, sex (being male) and having a family history of epilepsy were positively and significantly associated with the number of seizure attacks. Specifically, conditional on the random effects, a one-year increase in the patient's age multiplied the expected frequency of seizure attacks by exp(0.07)=1.07 (about a 7% increase), holding all other variables constant. The significant negative time trend (exp(-0.79)=0.45) showed that, with each additional follow-up visit, the number of seizure attacks decreased by roughly half. The significant random effects (1.8 with p-value<0.05) revealed considerable heterogeneity among individuals with respect to the frequency of seizure attacks (Table 2).
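Because the count part of the model uses a log link, a coefficient is read on the multiplicative scale by exponentiating it. The short sketch below applies this to the two coefficients quoted in the text (0.07 for age and -0.79 for time); the function names are hypothetical.

```python
import math


def rate_ratio(coefficient):
    """Multiplicative change in the expected count per one-unit increase."""
    return math.exp(coefficient)


def percent_change(coefficient):
    """Same effect expressed as a percentage change in the expected count."""
    return 100.0 * (math.exp(coefficient) - 1.0)


# Coefficients quoted in the text for the ZINB count model
reported = {"age (per year)": 0.07, "time (per follow-up visit)": -0.79}

if __name__ == "__main__":
    for name, coef in reported.items():
        print(name, round(rate_ratio(coef), 2), round(percent_change(coef), 1))
```

A rate ratio above 1 means the expected number of seizure attacks rises with the covariate, and a ratio below 1 means it falls, in both cases holding the other covariates and the random effect fixed.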

Discussion
In many research areas, outcome measures are often counts with a high proportion of zeros. For this type of data set, ordinary least squares (OLS) regression is usually not appropriate. Recently, zero-adjusting models, especially zero-inflated Poisson models, have become increasingly popular for analyzing data with count outcomes and excess zeros. Nowadays, most researchers use zero-adjusting Poisson models when there are extra zeros in the count outcome variable. Because about 61% of the values in our data are zeros, one of the listed zero-adjusting count models might be appropriate. The likelihood value statistic (−2LL) and the information criteria we used have been widely accepted for model comparison in generalized linear mixed models (GLMM), including the special zero-adjusting type of GLMM, and, for this study, the zero-inflated negative binomial model was the best. Accordingly, the main aim of this paper was modeling the number of seizure attacks (repeatedly counted) with many zeros, which is more complex than modeling cross-sectional data due to cross-time correlations and attrition.
The age of a patient was positively and significantly associated with the number of seizure attacks. This finding is similar to reports from Ethiopia 32 , China 33 and other countries 34,35 .
In our study, sex (being male) and family history of epilepsy were positively associated with the number of seizure attacks, which is in line with a study 36 that reported that being male is a risk factor for episodes of febrile seizures.
The present study revealed that lower socio-economic status of patients was associated with the number of seizure attacks, which is similar to studies conducted by different scholars, such as 37 , which suggested that low socio-economic status, indexed by low education or lack of home ownership, was a risk factor for epilepsy in adults.
Other scholars also pointed out that people with epilepsy in developing regions carry a heavy burden of stigma associated with poor social and economic status 38 .
Moreover, in patients having a previous history of seizures and co-infected with other disease(s), there is an increase in the progression of seizures compared to patients who are not co-infected.

Conclusion and recommendation
To our knowledge, this was the first study on statistical models for longitudinal zero-inflated count outcome data applied to the number of seizure attacks. In this study, different single-level count regression models were used. Among these, the zero-inflated negative binomial regression model fitted the data better than the other models. Therefore, it was selected as the best parsimonious model to predict the number of seizure attacks. Based on our study, there is strong evidence that age, sex, monthly income, family history of epilepsy, service satisfaction, the presence of other diseases and the time of follow-up are risk factors for the number of seizure attacks. As age increases, the seizure episodes also increase. As a result, patients having seizure episodes should visit health facilities as they age, and patients having seizure episodes with low income should be supported by families or the government in order to decrease the progression of the number of seizure attacks. Moreover, patients having seizures and co-infected with other diseases should be examined regularly, and the underlying causes of the disease(s) should be treated in order to decrease the progression of seizure attacks.

Strength and limitation
Collecting longitudinal data on epileptic seizure attacks is a difficult task; hence, we were restricted in recording the targeted sample. The small sample size is one major limitation of our study. As a result, the conclusions may not generalize to the whole of Ethiopia. Nevertheless, these findings could help policy-makers as a baseline guide for epileptic seizure attack prevention in Ethiopia.

Ethics approval and consent to participate
This study was conducted according to the principles expressed in the declaration of Felege Hiwot Referral Hospital and Bahir Dar University, Ethiopia. It was approved by the research ethics committee at Bahir Dar University, College of Science, and all subjects who agreed to participate in this study signed a consent form.