Binary logistic regression methods for modeling broncho-pneumonia status in infants from tertiary health institutions in north central Nigeria

Acute respiratory tract infections, predominantly bronchopneumonia, are among the leading causes of infant deaths in developing countries and around the world. This work models the effects of the significant risk factors on infants' bronchopneumonia status, fits some reduced models, and determines the best model with the minimum number of parameters. The data for this study consist of a random sample of 433 births to women seen in the obstetrics clinics of two sampled tertiary health institutions in north-central Nigeria: University Teaching Hospital (UTH), Abuja, and Federal Medical Center (FMC), Keffi, Nasarawa State. Binary logistic regression was used to identify and model the effects of the various risk factors, while a stepwise regression technique was used to fit some reduced logistic regression models. The best-fitting model with the minimum number of parameters was then identified using the likelihood ratio statistic. It was observed that baby's weight at birth, baby's weight four weeks after birth, and mother's occupation have significant effects on infants' bronchopneumonia status. Additionally, among the four fitted reduced models, model4 is the best predictor of infants' bronchopneumonia status, followed by model3 and then model2. Therefore, community services such as home visits for health education and vitamin A supplementation would be an advantage if provided for teenage pregnant women, as they would, in turn, reduce the incidence of low birth weight and thereby reduce bronchopneumonia infection among these children.
Keywords: Bronchopneumonia, Multiple Logistic Regression Model, Fitness, Likelihood ratio test

Acute respiratory tract infection (ARI), predominantly pneumonia, is a major cause of morbidity and mortality among young children in developing countries. ARI is an infection of any part of the respiratory tract or any related structures, including the paranasal sinuses, middle ear and pleural cavity (Bipin et al, 2011). The most common form of pneumonia in infants is bronchial pneumonia, also known as bronchopneumonia, an infection of the bronchial tubes of the lungs with such symptoms as high fever, productive cough, loss of appetite, weakness, wheezing and difficulty in breathing, among others (Danan, 2002). Bronchial pneumonia is one of the leading causes of infant death. This disease kills 1.8 million children under five years of age every year, more than any other illness, in every region of the world. In spite of its huge toll, relatively few global resources are dedicated to tackling this child killer (Global Action Plan for Prevention and Control of Pneumonia, 2009). Pneumonia causes 15% of all deaths in children under age 5 worldwide, 2% of which are new-born (Janelle and Rachel, 2017). Bronchial pneumonia affects infants more than adults because their respiratory immune system is still immature. The most common cause of bronchopneumonia is a bacterial lung infection, such as Streptococcus pneumoniae and Haemophilus influenzae type b (Hib). Viral and fungal lung infections can also cause pneumonia (Aaron, 2018). Thus there is every need to reduce infant morbidity and mortality from pneumonia by ensuring that every child is protected through a healthy environment and access to preventive and treatment measures. This can only be achieved by studying the major causes (risk factors) of infant mortality and morbidity from pneumonia, identifying the most important factors associated with these causes, and applying the findings to child health policy with the goal of reducing child morbidity and mortality.
In practice, situations involving categorical outcomes are quite common, and some studies have been carried out in the literature on the prevalence of pneumonia and other infectious diseases in children and adults. Such studies include Danbaba et al (2013), Beki (2012), Vitmalkumar et al (2011), Cornfield (2010), and Monir et al (2015), among others. Most of these studies fit only multivariate logistic regression or discriminant models to the collected data so as to determine the variables with statistically significant effects on the response. However, this work goes beyond that, as it also fits some reduced binary logistic models. In many research projects, after data are collected and a full model is fitted, some parameters appear insignificant. In such situations, a reduced model retaining only the significant terms is then adopted for use. Therefore, in this study, the effects of some risk factors on bronchopneumonia status in infants were modeled using binary logistic regression methods. The fitted model was then assessed for the contribution of the individual factors and, using a stepwise regression technique, some reduced logistic regression models were fitted based on the variables with significant effects. These reduced models were then compared for their goodness of fit, and the best-fitting model with the minimum number of parameters, that is, the one that best predicts bronchopneumonia status in infants, was identified using the likelihood ratio statistic. Regression methods have become an integral component of any data analysis concerned with describing the relationship between a response variable and one or more explanatory variables. In most medical and epidemiologic studies, the outcome measure is categorical, such as occurrence or nonoccurrence of a disease, mortality (dead or alive), etc., which may be coded as 1 or 0.
Such studies call for evaluation of the relative contribution of various factors to a single dichotomous or binary outcome variable, and interest is always centered on modeling the relationship between the probability of a success (which lies between 0 and 1) and the explanatory variables (or risk factors). This relationship is nonlinear, so modeling it by a linear method such as ordinary least squares (OLS) regression or a linear discriminant function is inappropriate. This is due to the strict statistical assumptions of these linear methods, such as the linearity, normality, and continuity assumptions for OLS regression and the multivariate normality with equal variances and covariances assumptions for discriminant analysis (Cabrera, 1994), which a binary outcome violates. Thus a nonlinear regression method, the most common of which is logistic regression, is the best approach in these kinds of studies (Anderson et al, 2003). This work therefore models the effects of the significant risk factors on infants' bronchopneumonia status using the logistic regression technique. The work also fits some reduced logistic regression models and determines the best model with the minimum number of parameters.

MATERIALS AND METHODS
Data Collection: This research was carried out in two tertiary health institutions: University Teaching Hospital (UTH), Abuja, and Federal Medical Center (FMC), Keffi, Nasarawa State. The data set contains information on 433 births to women seen in the obstetrics clinics of these medical centers. All of these births were low birth weight. Of this total sample of 433 low birth weight cases, one hundred and eighty (180) were collected from UTH, Abuja. Of these 180 cases, 80 (44.44%) were affected with pneumonia while the remaining 100 were not. The remaining two hundred and fifty-three (253) cases were collected from FMC, Keffi, in Nasarawa State, out of which 96 (37.9%) were affected with pneumonia. In all, 176 (40.6%) babies were affected with pneumonia.
The five variables identified in the code sheet in Table 1 were studied in this work. These variables have been recognized to be associated with low birth weight. Baby's weight was measured in grams, baby's sex was coded as 1 for male and 0 for female, mother's age was measured in years, and mother's occupation was coded as 0 for a housewife, 1 for a civil servant, and 2 for a business woman. The baby's bronchopneumonia status was coded as 0 for a baby without the disease and 1 for a baby with the disease. Data on each of these variables were carefully extracted directly from the individual client's medical folder.
The goal of this study was to determine whether these variables were risk factors in the clinic populations being served by each of the two medical centers.

Table 1: Code sheet for the study variables

| No. | Variable | Codes/Units | Abbreviation |
|---|---|---|---|
|  | Recorded birth weight at birth | Grams | BW1 |
| 3 | Recorded birth weight after 4 weeks | Grams | BW2 |
| 4 | Baby's Sex | 1 = male, 2 = female | S |
| 5 | Age of mother | Years | MA |
| 6 | Mother's Occupation | 0 = housewife, 1 = civil servant, 2 = business woman | MOC |
| 7 | Bronchopneumonia status | 0 = absent, 1 = present | Bpn |

Model Fitting: If Y denotes an infant's bronchopneumonia status, taking the value 1 if the infant is infected (a success) and 0 otherwise (a failure), then, for every sampled infant, the probability that the infant is infected (i.e., a success) is $\pi(x) = P(Y = 1 \mid x)$ and the corresponding probability that the infant is not infected (a failure) is $1 - \pi(x) = P(Y = 0 \mid x)$.
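Let the vector $x' = (x_1, x_2, \ldots, x_p)$ denote the set of the p predictor variables in Table 1, which may be categorical or continuous. The multiple logistic regression model, which relates the probability of an infant's bronchopneumonia status to the predictor variables x, is given by:

$$\pi(x_i) = \frac{\exp\left(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_{5(1)} x_{5(1)} + \beta_{5(2)} x_{5(2)}\right)}{1 + \exp\left(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_{5(1)} x_{5(1)} + \beta_{5(2)} x_{5(2)}\right)} \qquad (1)$$

where $\pi(x_i)$ is the predicted probability for the ith infant at $x_i$; $x_1$, $x_2$, $x_3$, $x_4$, $x_{5(1)}$, and $x_{5(2)}$ denote, respectively, baby's weight at birth, baby's weight 4 weeks after, baby's sex, mother's age, mother's occupation as a civil servant, and mother's occupation as a business woman.
$\beta_0$ denotes the estimated intercept and $\beta_h$, $h = 1, 2, \ldots, p$, denotes the estimated logistic regression coefficient for the hth predictor variable.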
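Since model (1) is nonlinear, the logit transformation on $\pi(x)$ yields the multiple linear logistic regression model:

$$g(x) = \ln\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_{5(1)} x_{5(1)} + \beta_{5(2)} x_{5(2)} \qquad (2)$$

where all the terms are as defined above. This model was fitted to the collected data and the parameters $\beta_0, \beta_1, \ldots, \beta_p$ were estimated via the maximum likelihood estimation (MLE) method with the aid of the statistical package SPSS (version 22). Equation (2) gives the natural log odds of an infant being infected with bronchopneumonia.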
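The relationship between the probability scale of model (1) and the log odds scale of model (2) can be sketched in a few lines of Python. This is only an illustration: the intercept, slopes, and predictor values below are hypothetical and are not estimates from this study, and the function names are our own.

```python
import math

def logistic(g):
    """Map a log odds value g to a probability in (0, 1), as in model (1)."""
    return 1.0 / (1.0 + math.exp(-g))

def logit(p):
    """Map a probability p in (0, 1) back to its log odds, as in model (2)."""
    return math.log(p / (1.0 - p))

def linear_predictor(beta0, betas, x):
    """Compute beta0 + sum over h of beta_h * x_h for one infant."""
    return beta0 + sum(b * xh for b, xh in zip(betas, x))

# Hypothetical coefficients and predictor values, for illustration only.
beta0 = 1.5
betas = [-0.8, 0.3]
x = [2.0, 1.0]

g = linear_predictor(beta0, betas, x)   # log odds of infection
p = logistic(g)                         # predicted probability of infection
print(f"log odds = {g:.4f}, probability = {p:.4f}")
```

Applying `logit` to the output of `logistic` recovers the linear predictor, which is why the coefficients are interpreted on the log odds scale.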
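Parameter estimation by the MLE method is through the likelihood function. For a single observation $(x_i, y_i)$, the contribution to the likelihood is $\pi(x_i)$ when $y_i = 1$ and $1 - \pi(x_i)$ when $y_i = 0$, where the quantity $\pi(x_i)$ denotes the value of $\pi(x)$ computed at $x_i$, as given in equation (1). Therefore, for the pair $(x_i, y_i)$, the contribution to the likelihood function can be expressed as (Hosmer et al, 2013)

$$\pi(x_i)^{y_i}\,[1 - \pi(x_i)]^{1 - y_i} \qquad (3)$$

Thus for n independent observations, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the likelihood function is

$$l(\beta) = \prod_{i=1}^{n} \pi(x_i)^{y_i}\,[1 - \pi(x_i)]^{1 - y_i} \qquad (4)$$

The log-likelihood function is:

$$L(\beta) = \ln[l(\beta)] = \sum_{i=1}^{n} \left\{ y_i \ln[\pi(x_i)] + (1 - y_i) \ln[1 - \pi(x_i)] \right\} \qquad (5)$$

Estimating the value of $\beta$, the vector of parameters that maximizes $L(\beta)$, requires differentiating (5) with respect to $\beta_0$ and $\beta_h$, $h = 1, 2, \ldots, p$. However, the resulting expressions are nonlinear in $\beta$ and thus require iterative methods for solution, which have been programmed into logistic regression software.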
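To make the iterative solution concrete, the following minimal sketch maximizes the log-likelihood for a model with an intercept and a single predictor using Newton-Raphson steps. It is an assumption-laden illustration, not the routine SPSS uses: the toy data are invented, and the function names are hypothetical.

```python
import math

def fit_simple_logistic(xs, ys, max_iter=25, tol=1e-10):
    """Fit logit(pi) = b0 + b1 * x by Newton-Raphson and return (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
        # Score vector: first derivatives of the log-likelihood.
        u0 = sum(y - p for y, p in zip(ys, ps))
        u1 = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
        # Information matrix: negative second derivatives.
        ws = [p * (1.0 - p) for p in ps]
        i00 = sum(ws)
        i01 = sum(x * w for x, w in zip(xs, ws))
        i11 = sum(x * x * w for x, w in zip(xs, ws))
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2x2 system (information) * delta = score.
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

def log_likelihood(xs, ys, b0, b1):
    """Log-likelihood of the one-predictor logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Invented toy data (not from the study); the outcomes are not perfectly separated.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 1, 1, 0, 1]

b0, b1 = fit_simple_logistic(xs, ys)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, logL = {log_likelihood(xs, ys, b0, b1):.4f}")
```

At convergence the score equations are (numerically) zero, which is the condition the differentiation of the log-likelihood is meant to satisfy. With more predictors the same idea applies, with the 2x2 solve replaced by a general linear solve.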
The fitted model was then checked for goodness of fit so as to know whether it accurately explains the data or whether it incorrectly classifies cases as often as it correctly classifies them. The fitted model was assessed using a test based on the deviance statistic (D), where D is the -2 log-likelihood statistic, with the log-likelihood function as given in equation (5). The deviance statistic is basically a measure of how much unexplained variation there is in the fitted logistic regression model: the higher the value, the less accurate the model (Hosmer et al, 2013). This statistic compares the difference in probability between the predicted outcome and the actual outcome for each case and sums these differences together to provide a measure of the total error in the model.
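As a worked illustration of the deviance, consider the constant-only (baseline) model, for which the maximum likelihood estimate of the common probability is simply the observed proportion of affected infants. The sketch below uses the counts given under Data Collection (176 affected out of 433); the function name is our own. The result is about 585.02, in line with the baseline deviance reported in the results.

```python
import math

def null_deviance(n_events, n_total):
    """Deviance (-2 log-likelihood) of the constant-only logistic model."""
    n_non_events = n_total - n_events
    p_hat = n_events / n_total  # MLE of the common probability of infection
    log_lik = (n_events * math.log(p_hat)
               + n_non_events * math.log(1.0 - p_hat))
    return -2.0 * log_lik

# Counts from the Data Collection paragraph: 176 affected infants out of 433.
d0 = null_deviance(176, 433)
print(f"baseline deviance = {d0:.2f}")
```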
Fitting Reduced Models: In many research projects, after data are collected and a full model is fitted, some parameters appear insignificant. In such situations, a reduced model retaining only the significant terms is then adopted for use. Part of our goal in this work is also to obtain the best-fitting model with the minimum number of terms or parameters. Therefore, the contribution of each variable to the fitted full model was assessed using the Wald statistic, which is the ratio of the maximum likelihood estimate of each slope parameter, $\hat{\beta}_h$, to an estimate of its standard error. The significant variables were then used to fit the reduced multiple linear logistic regression models given below to the data.
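The Wald statistic described above can be sketched as follows. The estimate and standard error are hypothetical values chosen for illustration, not figures from this study, and the two-sided p-value assumes the usual standard normal reference distribution.

```python
import math

def wald_test(estimate, std_error):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / std_error
    # Two-sided tail probability of a standard normal: P(|Z| > |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical slope estimate and standard error, for illustration only.
z, p = wald_test(0.5, 0.2)
print(f"z = {z:.3f}, z squared = {z * z:.3f}, p = {p:.4f}")
```

Some packages report the square of this ratio, which is referred to a chi-square distribution with one degree of freedom; the p-value is the same either way.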
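Each reduced model has the form of equation (2), restricted to the retained predictor variables. The reduced models were compared using the likelihood ratio statistic G, which can be expressed as

$$G = -2 \ln\left[\frac{\text{likelihood without the variable}}{\text{likelihood with the variable}}\right] \qquad (6)$$

Under the fitted model with k variables, the log-likelihood function (5) can be expressed as

$$L_k = \sum_{i=1}^{n} \left\{ y_i \ln[\hat{\pi}(x_i)] + (1 - y_i) \ln[1 - \hat{\pi}(x_i)] \right\} \qquad (7)$$

where $\hat{\pi}(x_i)$ are the fitted proportions.
Under the null (reduced) model the function can be written as

$$L_0 = \sum_{i=1}^{n} \left\{ y_i \ln[\hat{\pi}_0(x_i)] + (1 - y_i) \ln[1 - \hat{\pi}_0(x_i)] \right\} \qquad (8)$$

where $\hat{\pi}_0(x_i)$ are the fitted proportions under the reduced model. Using (7) and (8), equation (6) becomes

$$G = -2\,(L_0 - L_k) = 2 \sum_{i=1}^{n} \left\{ y_i \ln\left[\frac{\hat{\pi}(x_i)}{\hat{\pi}_0(x_i)}\right] + (1 - y_i) \ln\left[\frac{1 - \hat{\pi}(x_i)}{1 - \hat{\pi}_0(x_i)}\right] \right\}$$

Under the hypothesis that the coefficient(s) for the p excluded variable(s) are equal to zero, the statistic G has the chi-square distribution with p degrees of freedom, $G \sim \chi^2_p$. We expect an improvement in fit (i.e., a significant decrease in deviance) as we add more variables to the equation, depending on how significant the effects of the added variables are.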
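A minimal sketch of the likelihood ratio test follows. It computes G as the drop in deviance between a reduced model and the model that adds the variable(s), and converts it to a p-value. To stay within the Python standard library, the chi-square tail probability is implemented only for 1 or 2 degrees of freedom. The example value 126.428 is the reduction in the -2 log-likelihood attributed to birth weight in the results; treating it as a test with one degree of freedom is our assumption.

```python
import math

def likelihood_ratio_statistic(deviance_reduced, deviance_full):
    """G = D(reduced) - D(full), i.e. -2 * (L_0 - L_k)."""
    return deviance_reduced - deviance_full

def chi2_upper_tail(g, df):
    """Upper-tail chi-square probability; this sketch handles df = 1 or 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(g / 2.0))
    if df == 2:
        return math.exp(-g / 2.0)
    raise ValueError("only df = 1 or df = 2 are handled in this sketch")

# Reduction in -2 log-likelihood quoted for birth weight; df = 1 is assumed.
g = 126.428
p_value = chi2_upper_tail(g, df=1)
print(f"G = {g}, p = {p_value:.3g}")
```

A variable with two dummy terms, such as mother's occupation, would be tested with `df=2`.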

RESULTS AND DISCUSSION
The code sheet in Table 1 shows two categorical predictor variables for this study: mother's occupation with three categories and baby's gender with two categories. Table 2 reveals that the 'housewife' category of the mother's occupation and the 'male' category of the baby's sex were each used as the reference category in this work. Next, the baseline (or constant-only) model was fitted to the data, and the model is given in Table 3. From this table we observed that this baseline model is a significant predictor of the outcome (p < 0.001).
We then consider the accuracy of classifying the observations of the infants' bronchopneumonia status by this model, as given in Table 4. From this table we observed 100.0% correct classification of the unaffected (i.e., Bpn = 0) group and 0.0% correct classification of the affected (Bpn = 1) group, with a 59.4% overall percentage of correct classification. This indicates that the fitted baseline model's approach to prediction is only accurate 59.4% of the time. The multiple logistic regression model given in Equation (2) was then fitted to the data as given in Table 5. From Table 5, we observed that the logistic regression coefficients for the variables BWABirth and BWA4Weeks were significant but negative (-2.915 and -0.939, respectively), with p-values below 0.001. These coefficients indicate that for a one-unit increase in each of the BWABirth and BWA4Weeks scores, we expect a decrease of 2.915 and 0.939 units, respectively, in the log odds of bronchopneumonia infection in infants. In terms of the odds ratio (Exp(B)), the corresponding values (0.054 and 0.391) indicate that, holding other variables at a fixed value, there is a 94.6% and 60.9% decrease, respectively, in the odds of getting infected with bronchopneumonia for a one-unit increase in the BWABirth and BWA4Weeks scores. This result shows the significant impact of baby's weight, both at birth and after four weeks, in preventing bronchopneumonia infection in infants. The coefficient for mother's age (Mother_age) is also negative (-0.002) but not significant, as the p-value is greater than 0.05. The corresponding odds ratio of 0.998 indicates that, holding other variables at a fixed value, each additional year of mother's age lowers the odds of bronchopneumonia infection by only 0.2%.
This shows that mother's age does not have a significant impact in preventing bronchopneumonia infection in infants. The coefficient for baby's gender (Bgender(1)) was negative (-0.209) and not significant, as the p-value is greater than the accepted alpha value of 0.05. Since the female group is our reference category, this coefficient is the log of the ratio of the odds for the male group to the odds for the female group. The corresponding odds ratio (0.811) indicates that the odds of bronchopneumonia infection for male infants are 18.9% lower than for females, even after controlling for other variables. The coefficient for the mother's occupation as a civil servant (Mother_occup(1)) was positive (0.064) but not significant (p > 0.05), while that of the business woman (Mother_occup(2)) was positive (0.812) and significant (p < 0.01). Since the housewife group is the reference category, this coefficient is the log of the ratio of the odds for the business mother group to the odds for the housewife group. The corresponding odds ratio (2.253) indicates that, after controlling for other variables, the odds of bronchopneumonia infection for infants of mothers in business are about 125% higher than the odds for infants of mothers who are housewives.
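The conversion from a logistic regression coefficient B to its odds ratio Exp(B), and to a percentage change in the odds, can be checked with the short sketch below. It uses the coefficients quoted in the text above; the function names are our own. Because the quoted coefficients are rounded to three decimals, the recomputed odds ratios can differ from the reported ones in the last digit.

```python
import math

# Coefficients (B) as quoted in the text for the full model.
coefficients = {
    "BWABirth": -2.915,
    "BWA4Weeks": -0.939,
    "Mother_age": -0.002,
    "Bgender(1)": -0.209,
    "Mother_occup(1)": 0.064,
    "Mother_occup(2)": 0.812,
}

def odds_ratio(b):
    """Exp(B): multiplicative change in the odds per one-unit increase."""
    return math.exp(b)

def percent_change_in_odds(b):
    """Percentage change in the odds per one-unit increase."""
    return (math.exp(b) - 1.0) * 100.0

for name, b in coefficients.items():
    print(f"{name:16s} OR = {odds_ratio(b):.3f}  "
          f"change in odds = {percent_change_in_odds(b):+.1f}%")
```

Note that these percentages describe changes in the odds, not in the probability, of infection; the two are close only when the probability is small.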

Assessing the Significance of the Fitted Full Model:
The fitted model was then assessed using the likelihood ratio test statistic G given in equation (11) by testing the null hypothesis that the p "slope" coefficients included are equal to zero. That is, $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$. For the fitted baseline model with the estimated coefficient in Table 3, the value of the deviance statistic D is -2 log-likelihood = 585.023, while for the fitted full model with coefficients in ...
Fitting the Reduced Models: Table 7 presents the summary of the results of the four fitted reduced models. Each of the fitted models was assessed by testing the null hypothesis that the coefficient for each excluded variable is equal to zero using the likelihood ratio test (2.8), which compares each model to its parent one. ... as the chi-square ($\chi^2$) value) with p < 0.05. This indicates that model4 is better at predicting infants' bronchopneumonia status than the other three models. Addition of this variable has increased the model's explanatory power to about 39% of the total variation in the bronchopneumonia status data, though there is a slight reduction in the classification accuracy of the model. Furthermore, this variable had three groups: housewife, civil servant (mother_occp(1)), and business woman (mother_occp(2)). For this variable, housewife was considered as the reference category, and it was found that the odds of bronchopneumonia infection for infants of mothers who are civil servants and for infants of mothers who are business women are, respectively, 6% and 126% higher than for infants of mothers who are housewives. Baby's sex was entered into the model but its effect was insignificant and therefore it was excluded. These results revealed that, of all the factors considered in this study, baby's weight at birth (BWABirth) is the best determinant of bronchopneumonia status in infants, as it caused the highest reduction of 126.428 in the -2 Log Likelihood statistic.
That is, bronchopneumonia infection in a newborn baby significantly depends on the weight at birth, and a child with normal birth weight is less likely to be infected with bronchopneumonia. This was followed by the weight of the baby at 4 weeks after birth (BWA4Weeks), while occupation of the mother (mother_occp) has the least effect. Comparing the four fitted reduced models in terms of the -2 Log Likelihood statistic, model4 turns out to be the best predictor of infants' bronchopneumonia status, followed by model3 and then model2. This is also true when we look at each of the models' explanatory power.
Conclusion: This study has established the potential of the logistic regression technique in modeling bronchopneumonia status in infants. It has identified a baby's weight at birth and weight four weeks after birth as the best determinants of the baby's bronchopneumonia status. Each of these was observed to have a highly significant impact in preventing bronchopneumonia infection in infants. The study has also demonstrated the power of the likelihood ratio statistic in the fitting and identification of the reduced logistic regression models that best predict bronchopneumonia status in infants.