Using Logistic Regression and Cox Regression Models to Study the Most Important Prognostic Factors for Leukemia Patients

This study applies two advanced statistical methods to identify the most important factors affecting leukemia in Erbil city. Logistic regression and Cox regression were chosen as methods applicable to such studies. The results indicated that the regression coefficients of logistic regression and Cox regression differ somewhat, and that the two methods did not reach the same set of variables having an impact on the phenomenon. Moreover, the results indicated that surgery is the most important factor affecting the survival of leukemia patients in both methods. Validation was done by calculating two model selection criteria, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), for each model; the model with the smaller values is preferred. The data set of this study was obtained from Nanakali Hospital for the period from 1st January 2013 to 31st December 2018. The results were obtained using the statistical packages MATLAB and SPSS.


INTRODUCTION
Cancer is an abnormal proliferation of cells in the body: the cells become abnormal and multiply without control, outside the normal course of growth. Modern theories attribute this to a simple genetic error that the immune system fails to notice; over time, the error causes the cell to escape from control, which has been followed by the emergence of genetic medicine. Studying such specifications and characteristics must be based on scientific analysis that depends on efficient statistical methods with quantitative, scientific measurements and criteria. This paper aims to determine the important variables that influence leukemia by applying logistic regression and Cox regression models.

Background Information:
In this section, essential information related to this research is described. It consists of several sub-sections. The first sub-section covers the logistic regression model, Cox regression, the similarities and differences between the logistic regression model and Cox regression, and the measures of model selection; the second sub-section presents the results; and the last sub-section presents the conclusion.

Logistic Regression Model
Logistic regression analysis is one of the important statistical methods and can be used in many areas of life. When the response variable is binary or categorical, the relationship takes the form of the logistic distribution function, and the predictor variables can be quantitative, qualitative, mixed, ordinal or binary (Mawlood, 2000).
Logistic regression is important for several reasons: it is easy to use; it requires few assumptions; the logistic form results from a wide variety of basic assumptions about the explanatory variables; and it can be estimated regardless of the sampling method, whether prospective or retrospective.
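Let the response variable $Y$ be binary, assuming that $Y = 1$ if the event of interest occurs and $Y = 0$ otherwise. The goal is to model the probability
$$\pi(x) = P(Y = 1 \mid x_1, \dots, x_k) = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}.$$
This model is called the logistic regression model.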

Maximum Likelihood Estimation Method for Parameters
The parameters of the logistic regression model (the coefficients) are estimated by maximum likelihood. The MLEs are determined numerically by maximizing the log-likelihood. Taking the first derivatives of the log-likelihood function (2-31) and setting them equal to zero gives equations that are nonlinear in the parameters, so the solution must be found numerically. The Newton-Raphson iterative method is therefore used, and after a few iterations it produces appropriate estimates of the parameters.
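The Newton-Raphson iteration described above can be illustrated with a minimal sketch for a model with an intercept and a single covariate. The function name, the starting values, the stopping rule and the toy data are assumptions made for illustration only; they are not taken from the study, which used MATLAB and SPSS.

```python
import math

def fit_logistic_newton(x, y, max_iter=25, tol=1e-10):
    """Newton-Raphson fit of P(Y=1 | x) = 1 / (1 + exp(-(b0 + b1*x))).

    Hypothetical helper: assumes x takes at least two distinct values and
    the data are not perfectly separated, so the 2x2 system is solvable.
    """
    b0, b1 = 0.0, 0.0  # starting values (an assumption)
    for _ in range(max_iter):
        g0 = g1 = 0.0          # first derivatives of the log-likelihood
        h00 = h01 = h11 = 0.0  # information matrix (negative second derivatives)
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve information * delta = score (2x2 system by hand)
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Toy data (hypothetical): a binary covariate and a binary response.
x_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_toy = [0, 0, 0, 1, 0, 1, 1, 1]
b0_hat, b1_hat = fit_logistic_newton(x_toy, y_toy)
print(b0_hat, b1_hat)
```

With a single binary covariate the fitted probabilities equal the observed proportions in each group, which gives a simple way to check the estimates by hand.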

Evaluating the Performance of the models
In linear regression analysis, ordinary least squares is used to fit the model, and t tests, F tests and residuals are used to assess the coefficients and the model. The situation is different in logistic regression, where the approximate chi-square test, z tests and the likelihood ratio test are used.
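To test the hypothesis $H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$ against $H_1$: at least one $\beta_j \neq 0$, a chi-square test is used that is based on the difference between the estimated log-likelihoods corresponding to the two models. The test statistic is given by
$$G = -2(\ln L_R - \ln L_F),$$
where $L_R$ and $L_F$ are the maximized likelihoods of the reduced model (without the predictors) and the full model.
To test the hypothesis $H_0: \beta_j = 0$ against $H_1: \beta_j \neq 0$, the test statistic is
$$z = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}.$$
To test whether or not the model is adequate for the data:
H0: The model is adequate for the data
H1: The model is not adequate for the data
the approximate chi-square test (the Hosmer and Lemeshow goodness-of-fit test) is used.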
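As a hedged illustration of the tests in this section, the sketch below computes a Wald z statistic with its two-sided p-value and a likelihood ratio statistic with a p-value for one degree of freedom. The function names and the numerical inputs are hypothetical and are not results of this study; only the Python standard library is used.

```python
import math

def wald_test(coef, se):
    """Return (z, Wald chi-square, two-sided p-value) for H0: coefficient = 0."""
    z = coef / se
    # Two-sided normal tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, z * z, p

def likelihood_ratio_statistic(loglik_reduced, loglik_full):
    """G = -2 (ln L_reduced - ln L_full)."""
    return -2.0 * (loglik_reduced - loglik_full)

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical inputs, for illustration only.
z, wald, p_wald = wald_test(coef=0.5, se=0.2)
g = likelihood_ratio_statistic(loglik_reduced=-100.0, loglik_full=-97.0)
p_lr = chi2_sf_df1(g)  # valid only when the models differ by one parameter
print(z, wald, p_wald, g, p_lr)
```

The Wald chi-square is the square of z, so for a single coefficient both forms lead to the same p-value.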

Survival analysis
Survival analysis usually deals with the analysis of data on the times of events in an individual's life history. It models the time it takes for an event to occur; the typical event is death, from which the name 'survival' analysis is derived.
The most interesting survival modeling concerns the relationship between survival time and one or more variables, commonly called covariates. The Cox proportional hazards model (introduced by Cox, 1972) is widely applicable and is the method most frequently used in survival analysis. Let $T$ be a random variable representing the survival time until an event, with probability density function $f(t)$. The hazard function, denoted $h(t)$, gives the failure rate for the survival time. It is defined as the probability of failure during a very small time interval, per unit time, given that the individual has remained alive until the beginning of the interval, that is, until time $t$. It is also known as the instantaneous risk, the conditional failure rate, or the age-specific death rate. The hazard function gives the risk of failure per unit of time during follow-up and plays an important role in survival data analysis. In practice, when there are no censored observations, the hazard function is estimated as the proportion of patients dying in an interval, per unit time, given that they have survived to the beginning of the interval.
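The verbal definition of the hazard function can be written compactly as follows. This is the standard textbook form for a continuous survival time; the survival function $S(t)$ and the interval length $\Delta t$ are symbols introduced here for illustration and do not appear in the original text.

```latex
h(t) = \lim_{\Delta t \to 0}
       \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
     = \frac{f(t)}{S(t)},
\qquad S(t) = P(T > t).
```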

Cox regression model
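The Cox proportional hazards regression model is the most convenient way to build regression models for survival (time-to-event) data based on the values of given covariates. The hazard for subject $j$ is
$$h(t \mid x_j) = h_0(t)\exp(x_j\beta),$$
where $h_0(t)$ is the baseline hazard. The corresponding survival functions are related as follows:
$$S(t \mid x_j) = S_0(t)^{\exp(x_j\beta)}.$$
One subject's hazard is a multiplicative replica of another's; comparing subject $j$ to subject $m$, the model is stated as:
$$\frac{h(t \mid x_j)}{h(t \mid x_m)} = \frac{\exp(x_j\beta)}{\exp(x_m\beta)}.$$
A parametric regression model based on the exponential distribution (John, F. 2014):
$$\log h_i(t) = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}$$
or, equivalently:
$$h_i(t) = \exp(\alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}).$$
Here $\beta = (\beta_1, \beta_2, \dots, \beta_k)$, $x_i = (x_{i1}, x_{i2}, \dots, x_{ik})$, and $k$ is the total number of covariates.
$\exp(\beta)$ is the constant proportional effect of treatment.
As indicated by Hosmer and Lemeshow (1999), in Cox regression the measure that is analogous to $R^2$ in multiple regression is:
$$R_p^2 = 1 - \exp\left[\frac{2}{n}(L_0 - L_p)\right],$$
where:
$L_0$ is the log-likelihood of the model with no covariates.
$L_p$ is the log-likelihood of the model that includes the covariates.
$n$ is the number of observations (censored or not).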
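To make these quantities concrete, the following sketch evaluates a Cox partial log-likelihood for a single covariate and the R-squared analogue of Hosmer and Lemeshow. It is a simplified illustration under stated assumptions: one covariate, no tied event times, and toy data. The function names are hypothetical, and this is not the estimation routine used by SPSS.

```python
import math

def cox_partial_loglik(beta, times, events, x):
    """Partial log-likelihood for one covariate, assuming no tied event times.

    times:  follow-up times
    events: 1 if the event (death) was observed, 0 if censored
    x:      covariate values
    """
    ll = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i == 1:
            # Risk set: subjects still under observation at time t_i
            risk = sum(math.exp(beta * x[j])
                       for j, t_j in enumerate(times) if t_j >= t_i)
            ll += beta * x[i] - math.log(risk)
    return ll

def r2_p(loglik_null, loglik_model, n):
    """R^2 analogue: 1 - exp[(2/n) * (L0 - Lp)]."""
    return 1.0 - math.exp((2.0 / n) * (loglik_null - loglik_model))

# Toy data (hypothetical): three subjects, the last one censored.
times = [1.0, 2.0, 3.0]
events = [1, 1, 0]
x = [1.0, 0.0, 1.0]

l0 = cox_partial_loglik(0.0, times, events, x)  # model with no covariate effect
lp = cox_partial_loglik(0.5, times, events, x)  # evaluated at beta = 0.5, not a fitted value
print(l0, lp, r2_p(l0, lp, n=len(times)))
```

In practice $L_p$ would be the partial log-likelihood at the fitted coefficients; the value at beta = 0.5 only shows how the pieces fit together.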

The similarities between the two models
1. Nature of the model: The two models are considered multiple regression methods.
2. Independent variables: The nature and type of the independent variables are not restricted in either model; the variables may be quantitative, qualitative, or a mixture of the two.
Regarding the differences between the two models, it is agreed that:
- As the follow-up period increases, the logistic regression coefficients increase in magnitude, whereas the Cox regression coefficients remain unchanged.
- As the follow-up period increases, the estimated standard errors in the Cox regression model decrease compared with those in the logistic regression model.

Measures of the Model Selection
In this study, two measures were used to select the best model by comparing the accuracy and performance of the methods. Comparing models simply involves calculating the measures for each model; the model with the lowest value is chosen as the best model (Lee, T. & Wang, W. 2003).

2.4.1 Akaike's Information Criterion
The Akaike Information Criterion (AIC) compares the quality of a set of statistical models to each other. It is calculated as follows:
$$\mathrm{AIC} = -2\ln L + 2K \qquad (7)$$
where $K$ is the number of model parameters (the number of variables in the model plus the intercept) and $\ln L$ is the log-likelihood.
The log-likelihood is a measure of model fit; it is usually obtained from statistical output (Moore, 2016).

The Bayesian information criterion (BIC) is one of the most widely known and pervasively used tools in statistical model selection. BIC is computed for each of the models, and the model corresponding to the minimum value of BIC is selected.
$$\mathrm{BIC} = -2\ln L + k\ln N \qquad (8)$$
where $L$ is the value of the likelihood, $N$ is the number of recorded measurements, and $k$ is the number of estimated parameters.
Comparing models with the Bayesian information criterion simply involves calculating the BIC for each model; the model with the lowest BIC is chosen as the best model (Lee, T. & Wang, W. 2003).
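The selection rule can be sketched in a few lines. The code uses the standard definitions AIC = -2 ln L + 2K and BIC = -2 ln L + k ln N. The log-likelihood values and parameter counts below are hypothetical placeholders, not the values obtained in this study; only N = 1475 is taken from the data description.

```python
import math

def aic(loglik, k):
    """Akaike Information Criterion: -2 ln L + 2k."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian Information Criterion (standard form): -2 ln L + k ln n."""
    return -2.0 * loglik + k * math.log(n)

def best_model(scores):
    """Return the name of the model with the smallest criterion value."""
    return min(scores, key=scores.get)

n_cases = 1475  # number of cases reported in the study

# Hypothetical log-likelihoods and parameter counts, for illustration only.
models = {"model_A": (-520.0, 7), "model_B": (-515.0, 6)}

aic_scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
bic_scores = {name: bic(ll, k, n_cases) for name, (ll, k) in models.items()}
print(best_model(aic_scores), best_model(bic_scores))
```

Because ln N exceeds 2 whenever N is larger than about 7, BIC penalizes additional parameters more heavily than AIC for a data set of this size.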

Results and Discussions:
In this section, two regression models, binary logistic regression and Cox regression, were applied to the survival data. All the corresponding results are given, and a comparison between the two models is made. Two statistical measures (AIC and BIC) were used to select the best model for our data. The following programs were used to analyze the data: 1. MATLAB. 2. SPSS (V: 23).

Data Collection
The data set for this study was collected at Nanakali Hospital. It consists of 1475 cases collected over a 5-year period, from 1st January 2013 through 31st December 2017, covering all leukemia patients admitted to the hospital, with a follow-up period until 1st April 2018. Of those patients, 160 died during the study and 1315 survived or were censored. The survival time was measured in months and the data were right censored; survival time is defined as the period between the date of diagnosis of leukemia and the occurrence of the event of interest (death) or the end of the study.

Application of Logistic Regression Model
In applying binary logistic regression, we reached the following results and interpretations. The statistical hypotheses of the goodness-of-fit test for the logistic regression model are defined as:
H0: the model is adequate
H1: the model is not adequate

Table (2) Hosmer and Lemeshow Test for goodness of fit
Chi-square    df    Sig.
6.275         8     .617

If the significance level (p-value) of the Hosmer and Lemeshow goodness-of-fit statistic is 0.05 or less, we reject the null hypothesis that there is no difference between the observed and predicted values of the dependent variable; if it is greater, as we want, we fail to reject the null hypothesis that there is no difference, implying that the model's estimates fit the data at an acceptable level. Here, Table (2) shows that the chi-square test is not significant, which indicates that the model is a good fit; the value of the Hosmer and Lemeshow statistic is 6.275.

Vol. (4), Issue (2), Spring 2019, ISSN 2518-6566 (Online), ISSN 2518-6558 (Print)
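The significance level in Table (2) can be checked from the reported chi-square value and degrees of freedom. The sketch below uses the closed-form upper-tail probability of the chi-square distribution that holds for even degrees of freedom; the function name is hypothetical, and only the Python standard library is used.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable.

    Uses the closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!,
    which is valid for positive even degrees of freedom only.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total

# Values reported in Table (2): chi-square = 6.275, df = 8.
p_value = chi2_sf_even_df(6.275, 8)
print(round(p_value, 3))
```

The result is approximately 0.62, in line with the reported Sig. of .617 given that the chi-square value is rounded, and well above 0.05, so the null hypothesis of adequate fit is not rejected.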

Table (3) Variables in the Equation
In Table (3), the Wald statistic and the corresponding significance level test the significance of each of the independent variables in the model. The Wald statistic equals the square of the ratio of the logistic coefficient B to its standard error S.E. If the Wald statistic is significant (i.e., its significance level is less than 0.05), then the parameter is significant in the model.
The "Exp(B)" column is the label for the odds ratio of the row independent variable with the dependent variable (status). It is the predicted change in odds for a unit increase in the corresponding independent variable. Odds ratios less than 1.0 correspond to decreases and odds ratios greater than 1.0 correspond to increases in odds. Odds ratios close to 1.0 indicate that unit changes in that independent variable do not affect the dependent variable.
As with any regression, positive coefficients indicate a positive relationship with the dependent variable. The prediction equation with the significant factors, on the log-odds (logit) scale, can be written as: $\hat{y} = 1.947 + 0.083\,(\text{surgery})$.
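A short sketch shows how the prediction equation translates into an odds ratio and a predicted probability. It assumes that the equation is on the logit scale and that surgery is coded 0/1; neither the coding nor the category of status that is coded 1 is stated in the text, so the probabilities below refer to whichever outcome the status variable codes as 1.

```python
import math

INTERCEPT = 1.947  # constant from the prediction equation
B_SURGERY = 0.083  # coefficient of surgery from the prediction equation

def predicted_probability(surgery):
    """Inverse logit of the linear predictor; surgery assumed coded 0 or 1."""
    logit = INTERCEPT + B_SURGERY * surgery
    return 1.0 / (1.0 + math.exp(-logit))

# Exp(B): multiplicative change in the odds for a one-unit increase in surgery.
odds_ratio = math.exp(B_SURGERY)

print(odds_ratio, predicted_probability(0), predicted_probability(1))
```

Because the coefficient is positive, the odds ratio is slightly above 1 and the predicted probability is slightly higher when surgery equals 1 than when it equals 0.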

Application of Cox Regression Model
In applying Cox regression, we reached the following results and interpretations. The status variable identifies whether the event has occurred for a given case. If the event has not occurred, the case is said to be censored. The case processing summary in Table (4) shows that 160 cases are events and 1315 cases are censored; these are patients who have not died.
In the Omnibus Tests of Model Coefficients table, the chi-square change from the previous step is the difference between the -2 log-likelihood of the model at the previous step and that at the current step. If the step adds a variable, the inclusion makes sense if the significance of the change is less than 0.05. If the step removes a variable, the exclusion makes sense if the significance of the change is greater than 0.10. In the first and second steps, surgery and chemo are added to the model.

❖ The chemo factor is one of the factors affecting the risk of death in leukemia. Exp(0.082) = 1.086 means an increase in the risk of death for patients with the chemo factor, that is, those who received chemotherapy. The p-value of the chemo factor, 0.017, is less than α = 0.05, so the factor is statistically significant, and the 95% confidence interval for the hazard ratio is included.
❖ The remaining four prognostic factors (gender, age, radiology and hormone) are not statistically significant in the model; that is, they show no significant effect on the survival of the leukemia patients in this study.