Diabetes Prediction Using Machine Learning Algorithms

Esin Seçil YILMAZ
11 min read · Feb 3, 2022

Hello again. I wanted to share a new piece of work with you. In this article, I will try to predict, using machine learning methods, whether a person with certain characteristics has diabetes. I hope you enjoy reading.

In this study, I examined a dataset from the US National Institute of Diabetes and Digestive and Kidney Diseases, applied two machine learning methods to it, and tried to determine whether a person has diabetes.

Let’s examine the dataset together… 

The dataset comes from a diabetes study of Pima Indian women aged 21 and over living in Phoenix, Arizona, the fifth-largest city in the United States. It consists of 768 observations and 8 numerical independent variables. The target variable is the test result (the “Outcome” column): 1 indicates a positive diabetes test result and 0 a negative one. There are 500 healthy observations (65% of the dataset) and 268 observations with diabetes (35%).

The aim of the study is to review the literature, derive new variables, and develop a ‘Diabetes Prediction Model’ from the dataset. The target for this model is a success rate of at least 75%. I will try to raise it through data preprocessing, literature review, derivation of new variables, hyperparameter optimization, and handling of the class imbalance problem.
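To put the class balance in numbers, here is a minimal sketch that uses only the counts quoted above. It does not read the CSV, and the variable names are my own.

```python
# Class counts as stated in the article (not read from the CSV here).
n_healthy = 500
n_diabetic = 268
n_total = n_healthy + n_diabetic

share_healthy = n_healthy / n_total
share_diabetic = n_diabetic / n_total

print(f"total observations: {n_total}")
print(f"healthy:  {share_healthy:.1%}")
print(f"diabetic: {share_diabetic:.1%}")
```

One thing worth keeping in mind: a model that always predicts "healthy" would already be right for about 65% of the observations, so any accuracy figure should be read against that baseline.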

1. INTRODUCTION

Diabetes mellitus, popularly known as diabetes, is a metabolic disorder that usually occurs as a result of a combination of hereditary and environmental factors and results in excessively high blood glucose levels (hyperglycemia)[1]. Diabetes itself and the treatment methods used for it can lead to many complications. If the disease is not well controlled, acute complications such as hyperglycemia, ketoacidosis or nonketotic hyperosmolar coma may develop[2]. The main chronic complications, which develop over a long time, include circulatory system (cardiovascular) diseases (such as hypertension, heart failure and atherosclerosis), chronic kidney failure (nephropathy), retinal damage that can cause blindness (retinopathy), various types of nerve damage (peripheral neuropathy), and microvascular disorders that delay wound healing and cause impotence. Delayed wound healing, especially as a result of circulatory disorders in the feet, may result in amputation[3]. For all these reasons, the treatment of diabetes is very important. By taking certain parameters into account and making predictions before treatment, it can be determined whether a person has diabetes, and delayed treatment can thus be prevented. In this study, logistic regression and CART models were used to estimate whether a person with given characteristics has diabetes.

2. RELATED WORKS

According to the WHO, around 420 million people are living with diabetes[4]. Consequently, there are numerous research works that predict diabetes using machine learning methods. This section is dedicated to some of the research works we found in the literature. Kandhasamy P. et al. (2015) compared machine learning classifiers (J48 Decision Tree, K-Nearest Neighbors, Random Forest, and Support Vector Machines) and got the highest accuracy, 0.7382, with the J48 classifier[5]. Zou Q. et al. (2018) used various classification methods and reached an accuracy of 0.8084[6]. Mujumdar A. et al. (2019) proposed a pipeline model to improve classification accuracy; with the pipeline, the AdaBoost classifier was the best model, with 0.988 accuracy[7]. Anwar F. et al. (2020) used the PIMA Indian Diabetes Dataset to survey different machine learning approaches to diagnosing diabetes. The study found that a Deep Neural Network achieved the highest accuracy[8]. Soni M. et al. (2020) used K-Nearest Neighbor, Logistic Regression, Decision Tree, Support Vector Machine, Gradient Boosting and Random Forest. Their study showed that Random Forest achieved the highest accuracy[9]. Ghosh P. et al. (2021) conducted a comparative study of different machine learning tools such as Gradient Boosting, Support Vector Machine, AdaBoost and Random Forest. The best result was achieved with the Random Forest approach, with 0.9935 accuracy[10]. García-Ordás M. et al. (2021) applied various deep learning methods in their pipeline for data preprocessing and classification. An accuracy of 0.9231 was obtained with a Convolutional Neural Network classifier[11]. Khanam J. et al. (2021) applied seven machine learning algorithms and one neural network to the PIMA Indian Diabetes Dataset. The neural network provided 0.886 accuracy after trying various numbers of epochs[12]. Butt U. et al. (2021) compared various machine learning methods for predicting and classifying diabetes using the PIMA Indian Diabetes Dataset. MLP got the highest accuracy, 0.8608, among the classification algorithms, and LSTM got the highest accuracy, 0.8726[13]. Nahzat S. et al. (2021) used machine learning classification algorithms to find the method with the highest accuracy. The Random Forest technique gave the highest accuracy, 0.883[14].

3. ANALYSIS

3.1. Exploratory Data Analysis

First, we did a quick EDA on the dataset. Diabetes is coded as 1 and non-diabetes as 0. We examined the variables one by one and compared their values in the presence and absence of diabetes.
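A comparison of this kind can be sketched with a pandas groupby. The tiny frame below is hand-made for illustration; its values are not taken from the dataset, and only the Glucose and Outcome column names follow the article.

```python
import pandas as pd

# Tiny hand-made frame; values are illustrative, not taken from the dataset.
df = pd.DataFrame({
    "Glucose": [85, 90, 110, 150, 160, 170],
    "BMI": [22.0, 25.0, 28.0, 31.0, 35.0, 33.0],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Share of each class in the target variable.
class_share = df["Outcome"].value_counts(normalize=True)

# Mean of every variable, separately for Outcome = 0 and Outcome = 1.
means_by_outcome = df.groupby("Outcome").mean()

print(class_share)
print(means_by_outcome)
```

If the group means of a variable differ clearly between the two classes, as Glucose does in this toy frame, that variable is a natural candidate to carry signal in the models that follow.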

3.2. Data Preprocessing

In the Data Preprocessing section, we examined the outliers and missing values in the dataset. Standardization is important, especially in linear and distance-based models. Common choices for these models are the 0–1 (min-max) transform, the Standard Scaler, and the Robust Scaler. The values after the transformation are as follows.

Columns with Missing Values
Scaled Values
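To make the difference between the three transforms concrete, here is a small sketch using scikit-learn's MinMaxScaler, StandardScaler and RobustScaler on a toy column. The numbers are made up, with one deliberate outlier; which scaler was finally applied to the dataset is not shown here.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Toy single-feature column with one outlier (illustrative values, not from the dataset).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scalers = {
    "min-max": MinMaxScaler(),     # rescales to the [0, 1] range
    "standard": StandardScaler(),  # subtracts the mean, divides by the standard deviation
    "robust": RobustScaler(),      # subtracts the median, divides by the interquartile range
}

scaled = {name: scaler.fit_transform(x).ravel() for name, scaler in scalers.items()}

for name, values in scaled.items():
    print(f"{name:>8}: {np.round(values, 3)}")
```

With min-max scaling, the outlier squeezes the four ordinary values close to 0. The robust scaler relies on the median and the interquartile range, so the ordinary values keep a sensible spread, which is why it is often preferred when outliers are present.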

3.3. Model with Logistic Regression

We first apply Logistic Regression. The dataset is multivariate: there are several independent variables and a dependent variable that depends on them, and the dependent variable takes the two classes 0 and 1. In modeling, Outcome is our dependent variable, and all other variables are independent variables. We model the Outcome variable with Logistic Regression as follows.

First, we used the predict() function to obtain the predicted y values. When used with classification models, predict() converts the predicted probabilities internally into the class labels 1 and 0. The predict_proba() function, which we will use later, returns the probabilities behind these predictions.

After obtaining the predicted classes and the class probabilities, we evaluated the results with a Confusion Matrix. We defined a function called plot_confusion_matrix(y, y_pred) to plot it. The function takes two arguments: the actual y values and the predicted y values. With it, we obtained the following graph, with an accuracy score of 0.78.

Recall, precision, and f1-score each help us evaluate success. If we have to pick one, the f1-score is enough, since it is the harmonic mean of precision and recall. As you can see, the f1-score of class 1 is 0.65, so our model has average success: neither very successful nor very unsuccessful.

Output of Logistic Regression Predict Function
Confusion Matrix of Logistic Regression
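The logistic regression code itself is not reproduced in this article, so the following is only a sketch of the steps described above. It assumes synthetic data from make_classification as a stand-in for the diabetes CSV, so its scores will not match the 0.78 and 0.65 reported here. The name log_model and the max_iter setting are my own choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Synthetic stand-in for the dataset: 768 rows, 8 features, roughly a 65/35 class split.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], random_state=17)

log_model = LogisticRegression(max_iter=1000).fit(X, y)

y_pred = log_model.predict(X)              # class labels 0/1
y_prob = log_model.predict_proba(X)[:, 1]  # probability of class 1

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y, y_pred)
tn, fp, fn, tp = cm.ravel()

# f1 for class 1 is the harmonic mean of precision and recall.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print(classification_report(y, y_pred))
print(f"f1 (class 1), computed by hand: {f1:.3f}")
```

For a binary problem, predict() amounts to labelling an observation 1 when its predict_proba() value for class 1 is above 0.5, which is the "conversion" mentioned above.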

3.4. Model with CART

CART is one of the most important decision tree structures. The aim is to transform the complex structures in the data set into simple decision structures. Heterogeneous data sets are divided into homogeneous subgroups according to a specified target variable.

Classification Report

Here, a decision tree is built and the variables are evaluated. The variable at the top of the tree is the most important one: the variable that reduces the error the most is placed at the top. For regression trees, the aim is to minimize the sum of the squared differences between the predicted and actual y values in the branches; this value is called the RSS (Residual Sum of Squares). For classification trees, an impurity measure such as the Gini index plays the same role.

Residual Sum of Squares
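As a rough illustration of how a split is scored, the sketch below computes the RSS of a candidate split in plain Python and, for comparison, the Gini impurity that classification trees typically minimize. The glucose-like values and the helper names (rss, gini, split_rss) are hypothetical.

```python
def rss(values):
    """Residual sum of squares of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def split_rss(feature, target, threshold):
    """Total RSS of the two branches created by splitting at the threshold."""
    left = [t for f, t in zip(feature, target) if f <= threshold]
    right = [t for f, t in zip(feature, target) if f > threshold]
    return rss(left) + rss(right)


# Hypothetical glucose-like feature and 0/1 target.
feature = [85, 90, 110, 150, 160, 170]
target = [0, 0, 0, 1, 1, 1]

for threshold in (100, 120):
    print(threshold, split_rss(feature, target, threshold))

print("gini of a 50/50 node:", gini([0, 0, 1, 1]))
```

The tree-building procedure tries many thresholds on many variables and keeps the split with the lowest cost. In this toy example, the split at 120 separates the classes perfectly, so both branches are homogeneous and the cost drops to zero.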

Since our project is a classification problem, we will use tree structures for classification. Outliers and missing values do not matter much in tree structures, so we will proceed without treating them. In this machine learning method, we will follow these steps:

1. Modeling with CART
2. Model Verification with the Holdout Method
3. Hyperparameter Optimization
4. Refitting the Final Model to All Data

First we set up our CART model, then we calculated the y_pred and y_prob values. Finally, we reviewed the classification report.

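import pandas as pd
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("datasets/diabetes.csv")

# Outcome is the target; every other column is an independent variable
y = df["Outcome"]
X = df.drop(["Outcome"], axis=1)

# Fit CART on the whole dataset (no hold-out yet)
cart_model = DecisionTreeClassifier(random_state=17).fit(X, y)

y_pred = cart_model.predict(X)
y_prob = cart_model.predict_proba(X)[:, 1]

print(classification_report(y, y_pred))
roc_auc_score(y, y_prob)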

According to the classification report, our results look very good: precision, recall, f1-score and accuracy are all 1. By this result, we predicted every 0 and 1 class correctly with this method.

In real life, it is not possible to predict every observation correctly, that is, to get a score of 1 on every metric. The model scores perfectly here only because it is evaluated on the same data it was trained on. Therefore, at this stage, we perform model validation with the hold-out method: we divide the dataset into a train set and a test set, build the model on one part, and test it on the other.
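A minimal sketch of the hold-out idea follows, again on synthetic data from make_classification rather than the diabetes CSV. The 30 percent test share follows the split used in this article; the random_state values and the flip_y label noise are my own assumptions, added so that the toy problem is not trivially separable.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset, with some label noise (flip_y).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], flip_y=0.1, random_state=17)

# Hold out 30 percent of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=17)

# The tree only ever sees the training part.
cart_model = DecisionTreeClassifier(random_state=17).fit(X_train, y_train)

train_acc = accuracy_score(y_train, cart_model.predict(X_train))
test_acc = accuracy_score(y_test, cart_model.predict(X_test))

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

An unpruned tree keeps splitting until every training observation is classified correctly, so the train accuracy is perfect. The score on the held-out rows is the one that says something about new patients.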

Classification Report

Accordingly, we allocated 30 percent of the data for testing and 70 percent for building the model. We wrote the following code to examine the train error, and the output is as shown.

# train error
y_pred = cart_model.predict(X_train)
y_prob = cart_model.predict_proba(X_train)[:, 1]

print(classification_report(y_train, y_pred))
roc_auc_score(y_train, y_prob)

The result is again 1. When we examine the test error, we get the following output.

# test error
y_pred = cart_model.predict(X_test)
y_prob = cart_model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
roc_auc_score(y_test, y_prob)

The results turned out to be quite different. As you can see, when the model is evaluated on the data it was trained on, every score is 1, whereas on the held-out test data we get a more realistic result. The f1-score is 0.58 and the AUC score is 0.67, so the prediction performance of the model can be taken as 67 percent. When we examine the importance of the variables in the model, we obtain the following graph. Accordingly, the most important variable to consider when predicting diabetes is Glucose.
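The importance values behind such a graph can be read from the fitted tree's feature_importances_ attribute. The sketch below shows this on synthetic data; the feature names are placeholders, not the real column names, so it says nothing about Glucose itself.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Placeholder feature names for the synthetic stand-in data.
feature_names = [f"feature_{i}" for i in range(8)]

X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], random_state=17)
X = pd.DataFrame(X, columns=feature_names)

cart_model = DecisionTreeClassifier(random_state=17).fit(X, y)

# Impurity-based importances; they sum to 1 across all features.
importances = pd.Series(cart_model.feature_importances_,
                        index=feature_names).sort_values(ascending=False)

print(importances)
```

Calling importances.plot(kind="barh") would turn the same series into a bar chart, assuming matplotlib is installed.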

from sklearn.model_selection import GridSearchCV

cart_model = DecisionTreeClassifier(random_state=17)

# hyperparameter sets to search
cart_params = {"max_depth": range(1, 11), "min_samples_split": [2, 3, 4]}

# 5-fold cross-validated grid search on the training set
cart_cv = GridSearchCV(cart_model, cart_params, cv=5, n_jobs=-1, verbose=True)
cart_cv.fit(X_train, y_train)

cart_cv.best_params_

# fit a tree with the best parameters found
cart_tuned = DecisionTreeClassifier(**cart_cv.best_params_).fit(X_train, y_train)

# train error
y_pred = cart_tuned.predict(X_train)
y_prob = cart_tuned.predict_proba(X_train)[:, 1]

As a result of these operations, the score (AUC value) was 0.75. This is a better result than that of the untuned model. We found the best parameters for the data. Having tested and validated the model, in the final stage we refit the final model on all the data.

Classification Report
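The last two steps, checking the tuned model on the test set and refitting on all data, can be sketched as follows. The data is again a synthetic stand-in, the parameter grid is the one used in this article, and the name cart_final is my own.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset, with some label noise (flip_y).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], flip_y=0.1, random_state=17)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=17)

# Same grid as in the article.
cart_params = {"max_depth": range(1, 11), "min_samples_split": [2, 3, 4]}
cart_cv = GridSearchCV(DecisionTreeClassifier(random_state=17),
                       cart_params, cv=5).fit(X_train, y_train)

# Test error of the tuned model (best_estimator_ is refit on the train set).
best_model = cart_cv.best_estimator_
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
print(cart_cv.best_params_, f"test AUC: {test_auc:.3f}")

# Final model: same hyperparameters, fitted on all available data.
cart_final = DecisionTreeClassifier(**cart_cv.best_params_,
                                    random_state=17).fit(X, y)
```

The test AUC should be reported before the final refit. Once the model has seen every row, there is no untouched data left to evaluate it on.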

4. CONCLUSION

We used two methods for diabetes prediction: logistic regression and the tree-based CART (Classification and Regression Tree). With these two methods, we tried to predict whether a person with certain characteristics has diabetes. When we applied Logistic Regression to the dataset, we observed a success of 0.65 (the f1-score of the diabetic class). When we applied the CART method, we observed a success of 0.75 after hyperparameter optimization. Our goal was to predict diabetes with a success rate of 75% or more, and we reached this 75 percent with the CART method.

REFERENCES

[1] A. D. Association, “Diagnosis and classification of diabetes mellitus,” European Journal of Science and Technology, vol. 32 Suppl 1, 2009.
[2] K. A. Gosmanov AR, Gosmanova EO, “Hyperglycemic crises: Diabetic ketoacidosis and hyperglycemic hyperosmolar state,” 2021.
[3] K. Alexiadou and J. Doupis, “Management of diabetic foot ulcers,” PubMed, vol. 3, 2012.
[4] who.int, “Who diabetes,” Mar. 2021.
[5] S. B. J. Pradeep Kandhasamy, “Performance analysis of classifier models to predict diabetes mellitus,” Procedia Computer Science, vol. 47, pp. 45–51, 2015.
[6] Y. L. D. Y. Y. J. Quan Zou, Kaiyang Qu and H. Tang, “Predicting diabetes mellitus with machine learning techniques,” Frontiers in Genetics, vol. 9, 2018.
[7] A. Mujumdar and V. Vaidehi, “Diabetes prediction using machine learning algorithms,” Procedia Computer Science, vol. 165, pp. 292–299, 2019.
[8] F. Anwar, Qurat-Ul-Ain, M. Y. Ejaz, and A. Mosavi, “A comparative analysis on diagnosis of diabetes mellitus using different approaches – a survey,” Informatics in Medicine Unlocked, vol. 21, p. 100482, 2020.
[9] D. S. V. Mitushi Soni, “Diabetes prediction using machine learning techniques,” INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH TECHNOLOGY (IJERT), vol. 9, Sept. 2020.
[10] P. Ghosh, S. Azam, A. Karim, M. Hassan, K. Roy, and M. Jonkman, “A comparative study of different machine learning tools in detecting diabetes,” Procedia Computer Science, vol. 192, pp. 467–477, 2021.
[11] M. T. García-Ordás, C. Benavides, J. A. Benítez-Andrades, H. Alaiz-Moretón, and I. García-Rodríguez, “Diabetes detection using deep learning techniques with oversampling and feature augmentation,” Computer Methods and Programs in Biomedicine, vol. 202, p. 105968, 2021.
[12] J. J. Khanam and S. Y. Foo, “A comparison of machine learning algorithms for diabetes prediction,” ICT Express, vol. 7, no. 4, pp. 432–439, 2021.
[13] M. A. F. H. H. A. B. Umair Muneer Butt, Sukumar Letchmunan and H. H. R. Sherazi, “Machine learning based diabetes classification and prediction for healthcare applications,” Journal of Healthcare Engineering, 2021.
[14] M. Y. Shamriz Nahzat, “Diabetes prediction using machine learning classification algorithms,” European Journal of Science and Technology, vol. 24, pp. 35–59, 2021.
