Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Riza Atik
DOI Link: https://doi.org/10.22214/ijraset.2024.64988
About 30 million people in India suffer from diabetes. These patients can be provided with proper treatment if the signs of diabetes are identified early on. This study aims to assess the risk of diabetes among individuals based on parameters such as age, body mass index (BMI), blood glucose level, blood pressure etc. The possibility of an individual suffering from diabetes is predicted using three different machine learning models. The accuracy and F1 score for the predictions of the logistic regression, k-nearest neighbors and support vector machine models are calculated. These scores are then compared, and the support vector machine (SVM) model is found to be the most accurate among the three chosen models.
I. INTRODUCTION
Diabetes is a common disease in which either the pancreas doesn’t produce enough insulin or the body cannot effectively use the insulin it produces. Insulin is a hormone that regulates blood sugar. If not managed in a timely manner, diabetes can lead to serious health issues such as damage to the heart, nerves, tissues, eyes, kidneys etc.
There are three main types of diabetes: type 1, type 2 and gestational diabetes.
It is also found that people suffering from gestational diabetes have a higher chance of developing type 2 diabetes in the future.
A. Symptoms of Diabetes
The symptoms of diabetes may occur suddenly. In type 2 diabetes, however, the symptoms might start out mild and may take many years to be noticed. Following are some common symptoms of diabetes :
Diabetes can, over time, damage blood vessels in the heart, eyes, kidneys and nerves. People suffering from it have a higher risk of health problems including heart attack, stroke and kidney failure. It can also cause permanent vision loss by damaging blood vessels in the eyes. Many people with diabetes often develop problems with their feet from nerve damage and poor blood flow. This can cause foot ulcers and may lead to amputation.
B. Causes of diabetes
The primary cause of diabetes is genetics. It is brought on by at least two mutated genes in chromosome 6, which affects the body's reaction to diverse antigens. Additionally, viral infection may affect the likelihood of diabetes types 1 and 2 occurring in a particular person.
Research has indicated that infection with viruses like Coxsackievirus, rubella, hepatitis B virus, cytomegalovirus, and mumps raises the chance of getting diabetes.
II. RESEARCH PROBLEM
Diabetes is regarded as one of the deadliest chronic diseases that raises blood sugar. Many issues arise when diabetes is left untreated and undiagnosed. The laborious identification process ends with a patient visiting a diagnostic facility and seeing a doctor. However, the development of machine learning techniques resolves this significant issue. The goal of this research is to create a model that can accurately predict a patient's likelihood of having diabetes. We have chosen three supervised learning models, namely logistic regression, support vector machine (SVM) and k-nearest neighbors (KNN), to predict the occurrence of diabetes in a patient using factors such as age, glucose, blood pressure, skin thickness, BMI, insulin etc. The accuracy and F1 scores for all three models are calculated and then compared to find the model that is best at predicting the occurrence of diabetes.
III. PREDICTION USING SUPERVISED LEARNING MODELS
Supervised learning is a type of machine learning in which machines are trained using well-labelled training data. The supervised learning algorithm then predicts the output on the basis of the data it has been trained on. Labelled data implies that the input values are already tagged with the correct output values.
As data is fed into the model, it adjusts its weights until the model has been fitted appropriately, which occurs as part of a cross validation process. Supervised learning is extremely useful in solving a variety of real world problems at scale. It can be used to build highly accurate machine learning algorithms.
Supervised learning uses a training set to teach models to yield the desired output. This training dataset includes inputs and correct outputs, which allow the model to learn over time. The algorithm measures its accuracy through the loss function, adjusting until the error has been sufficiently minimized.
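The idea of adjusting a model until its loss is sufficiently small can be illustrated with a minimal sketch. The snippet below fits a single weight by gradient descent on a mean squared error loss. The toy data, the learning rate and the number of steps are illustrative assumptions and are not taken from this study.

```python
def mse_loss(w, xs, ys):
    """Mean squared error of the one-parameter model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w, xs, ys):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def fit_weight(xs, ys, learning_rate=0.05, steps=200):
    """Repeatedly move w against the gradient to reduce the loss."""
    w = 0.0
    for _ in range(steps):
        w -= learning_rate * mse_gradient(w, xs, ys)
    return w


# Toy labelled data (assumed for illustration): the correct output is y = 2 * x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = fit_weight(xs, ys)
print(round(w, 4), round(mse_loss(w, xs, ys), 6))
```

Each step moves the weight in the direction that lowers the loss, so the fitted weight approaches the value that generated the labels.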
Supervised learning problems can be further classified into regression and classification problems.
Nowadays, owing to the current environment and living habits, humans face various diseases. There is an urgent need to identify and predict such diseases at an early stage, before they become severe. It is difficult for doctors to identify diseases accurately by manual examination most of the time. Cutting-edge machine learning techniques can prove immensely helpful in this process, which can be reduced to a few steps: entering the patient's medical data and history, and identifying their risk for dangerous diseases such as diabetes.
In an effort to implement the prediction of diabetes using supervised machine learning models, three appropriate models are chosen :
A. Logistic Regression
Logistic regression is a supervised machine learning algorithm that is mainly used for classification tasks. It is used in cases where the goal is to predict the probability that an instance belongs to a particular class. It is a statistical algorithm that analyzes the relationship between two data factors.
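As a rough sketch of how a logistic regression model turns feature values into a class probability, the snippet below applies the sigmoid function to a weighted sum of the inputs. The weights, the bias, the 0.5 threshold and the feature values are hypothetical and hand-picked for illustration; they are not fitted to the dataset used in this study.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of the positive class under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_class(features, weights, bias, threshold=0.5):
    """Return 1 when the predicted probability reaches the threshold, else 0."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0


# Hypothetical coefficients for two standardized features (for example
# glucose and BMI). They are NOT fitted to the PIMA dataset.
weights = [1.2, 0.6]
bias = -0.3

patient = [1.5, 0.5]  # hypothetical standardized feature values
print(predict_probability(patient, weights, bias), predict_class(patient, weights, bias))
```

In a trained model the weights and bias would be learned from the labelled training data rather than chosen by hand.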
B. K- Nearest Neighbors
The KNN algorithm works by determining the ‘K’ nearest neighbors of a given data point based on a distance metric, such as Euclidean distance. The class or value of the data point is then determined by the majority vote of the K neighbors (for classification) or by their average (for regression). This approach allows the algorithm to adapt to different patterns and make predictions based on the local structure of the data.
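The neighbor search and majority vote described above can be sketched in a few lines of Python. The training points, labels and the choice of k = 3 are toy assumptions for illustration only.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training examples by their distance to the query point.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean_distance(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common label among the k neighbours wins the vote.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy two-feature training set (assumed values, not taken from the dataset).
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.5), (7.0, 6.0), (6.5, 7.0)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (2.0, 2.0), k=3))
print(knn_predict(train_points, train_labels, (6.0, 6.0), k=3))
```

Because distances are compared directly, features are usually brought to a comparable scale before KNN is applied.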
C. Support Vector Machine
Support vector machine (SVM) is a supervised machine learning algorithm that is used for linear and non-linear classification, regression and even outlier detection. It is best suited for classification problems. The main objective of the SVM algorithm is to find the optimal hyperplane in an N-dimensional feature space that separates the data points of different classes. The hyperplane is chosen so that the margin between the closest points of the different classes is as large as possible. The dimension of the hyperplane depends on the number of features.
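For the linear case, the role of the hyperplane can be illustrated with a short sketch. A point is classified by the sign of w . x + b, and the margin between the planes w . x + b = +1 and w . x + b = -1 has width 2 / ||w||. The hyperplane below is a hypothetical one chosen by hand; finding the optimal hyperplane from training data is not shown here.

```python
import math


def decision_value(point, weights, bias):
    """Signed score w . x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def svm_predict(point, weights, bias):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(point, weights, bias) >= 0 else -1


def margin_width(weights):
    """Width of the margin between the planes w . x + b = +1 and w . x + b = -1."""
    return 2.0 / math.sqrt(sum(w * w for w in weights))


# Hypothetical hyperplane x1 + x2 - 5 = 0 in a two-feature space (assumed, not fitted).
weights = [1.0, 1.0]
bias = -5.0

print(svm_predict((1.0, 2.0), weights, bias))  # point on the negative side
print(svm_predict((4.0, 4.0), weights, bias))  # point on the positive side
print(margin_width(weights))
```

A smaller norm of the weight vector gives a wider margin, which is why SVM training seeks the separating hyperplane with the smallest such norm.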
IV. LITERATURE REVIEW
Results from related research that analyzed various healthcare datasets and made predictions using a variety of methods and strategies are presented. Researchers have created and used a variety of prediction models utilizing different data mining techniques, machine learning algorithms, or even a mix of these techniques.
The research in paper [1] proposes a robust framework for diabetes prediction in which outlier rejection, filling of missing values, data standardization, feature selection and k-fold cross-validation were used to preprocess the data. The data was then used for training algorithms such as k-nearest neighbor, decision tree, random forest, AdaBoost, naive Bayes, XGBoost and multilayer perceptron. AUC is chosen as the performance metric, which is maximized using the grid search technique during hyperparameter tuning.
In research paper [2], machine learning classification and ensemble techniques are used on a dataset to predict diabetes. The models used are k-nearest neighbors, logistic regression, decision tree, support vector machine, gradient boost and random forest. The accuracy of each model is compared and it is concluded that random forest technique achieves a higher accuracy than the other models employed.
The paper by Ayan Mir et al. [3] concentrates on diabetes prediction. The diabetes database for PIMA Indians is used. On the Weka interface, the classification methods Naive Bayes, SVM, Random Forest, and Simple CART are employed. As a result, SVM offers greater accuracy than the competition. Aakansha Rathore et al. [4] used the diabetes dataset for PIMA Indians experimentally, and R Studio was used to assess the performance measurements. SVM and Decision Tree are the two machine learning techniques that were employed. In [5], the authors compared logistic regression, artificial neural network (ANN) and decision tree models for predicting diabetes or pre-diabetes. The participants of the study came from two communities in Guangzhou, China. 735 patients were confirmed to have diabetes or pre-diabetes and 752 did not suffer from either. The decision tree model had the best performance, followed by logistic regression, while ANN gave the lowest accuracy.
In [6], collection of disease symptoms was performed for preparing the dataset along with a person’s living habits, and related doctor consultations. This data was used to predict diabetes in patients. The performance of various algorithms such as naive bayes, decision tree and logistic regression was also compared.
V. METHODOLOGY
Supervised machine learning algorithms derive insights, patterns, and relationships from a labeled training dataset. It means that the dataset already contains a known value for the target variable for each record. It is called supervised learning because the process of an algorithm learning from the training dataset is like an instructor supervising the learning process. The correct answers are known, the algorithm iteratively makes predictions on the training data and the instructor corrects it. Learning ends when the algorithm achieves the desired level of performance and accuracy. In this study, three different supervised machine learning models are chosen namely logistic regression, support vector machine and k-nearest neighbor. The ‘PIMA Indians Diabetes Database’ dataset is first pre-processed and then split into ‘train’ and ‘test’. The ‘train’ part of the dataset is used to train the models. Once the training is completed, the performance of the models is evaluated using the ‘test’ part of the dataset. The accuracy and F1 score for each model is calculated and compared to find out which model is the best at predicting diabetes.
A. Implementation steps
The methodology involved collecting and pre-processing the data. The data is then split into a training set and a test set, after which the models are trained and evaluated. These steps are discussed in detail below and are also summarized in Fig. 1.
Fig. 1. Proposed Methodology
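The step of splitting the data into training and test sets can be sketched as follows. The split ratio is not stated in the text, so the 80/20 ratio and the random seed used here are assumptions, and the records are made-up stand-ins rather than real patient data.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows reproducibly and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in records: (feature_vector, outcome) pairs with made-up values.
records = [([float(i), float(i % 7)], i % 2) for i in range(768)]

train_rows, test_rows = train_test_split(records)
print(len(train_rows), len(test_rows))
```

Shuffling before splitting keeps the test set from depending on the order in which the records were stored, and fixing the seed makes the split repeatable.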
B. Brief description of the algorithms used
Fig. 2. Support vector machine algorithm implementation
C. Dataset Description
We have chosen the PIMA Indians Diabetes Database available on Kaggle [7]. The chosen dataset contains 9 columns representing various factors that could possibly contribute to diabetes, as shown in Fig. 3. The data was collected for 768 individuals. There are no null values present in the dataset. The outcome column represents whether the person suffers from diabetes (1 represents disease) or doesn’t (0 represents no disease).
Fig. 3. PIMA Indians diabetes database
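A quick structural check of the dataset (number of columns, empty cells, and the distribution of the outcome column) can be sketched with the Python standard library. The column names below are assumed to follow the public Kaggle CSV, and the three data rows are illustrative placeholders standing in for the full 768-row file.

```python
import csv
import io

# Column names are assumed to match the Kaggle CSV header. The three data
# rows are illustrative placeholders, not the complete dataset.
SAMPLE_CSV = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""


def summarize(csv_text):
    """Return the column count, the number of empty cells and the outcome counts."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    n_columns = len(rows[0])
    n_missing = sum(1 for row in rows for value in row.values() if value == "")
    outcome_counts = {"0": 0, "1": 0}
    for row in rows:
        outcome_counts[row["Outcome"]] += 1
    return n_columns, n_missing, outcome_counts


n_columns, n_missing, outcome_counts = summarize(SAMPLE_CSV)
print(n_columns, n_missing, outcome_counts)
```

The same function could be applied to the text of the downloaded file to confirm the column count and the absence of empty cells before training.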
D. Model performance evaluation
This is the last stage in the prediction model. Here, we use measures like classification accuracy and F1-score to evaluate the prediction results.
Accuracy: The classification accuracy is the ratio of correct predictions to the total number of input samples, usually expressed as a percentage.
F1 score: The F1 score is used to assess the correctness of a test. It is the harmonic mean of precision and recall and has a range of [0, 1]. It informs you of the robustness and precision of your classifier.
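Both measures can be computed directly from the true and predicted labels, as the following sketch shows. The label lists are small made-up examples and are unrelated to the results reported in this paper.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives and true negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    tp, _, _, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration: 1 = diabetic, 0 = non-diabetic.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

print(accuracy(y_true, y_pred), f1_score(y_true, y_pred))
```

Unlike accuracy, the F1 score ignores true negatives, so it is more sensitive to how well the positive (diabetic) class is detected when the classes are imbalanced.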
VI. RESULT
Three supervised machine learning models were trained on the ‘PIMA Indians Diabetes Dataset’. The models were then tested using the ‘test set’ and their F1 score and accuracy were measured for comparative analysis. Out of the three chosen models (logistic regression, k-nearest neighbors and support vector machine), the support vector machine (SVM) gives the highest accuracy of 77.272727% and the highest F1 score of 0.615385.
Table I presents the accuracy and F1 score performance metrics.
Table I. Performance Matrix
Model Name | Accuracy of Model (%) | F1 Score
Logistic Regression | 75.324675 | 0.577778
K-Nearest Neighbours | 72.727273 | 0.533333
Support Vector Machine | 77.272727 | 0.615385
Visualizing these scores helps to show clearly how the performance of the algorithms differs. Fig. 4 depicts the accuracy of each model and Fig. 5 represents the F1 score of each model.
Fig. 4. Accuracy score
Fig. 5. F1 Score
VII. CONCLUSION
The early identification of diabetes is one of the major real-world medical issues. This study makes methodical attempts to design a system that forecasts the development of diabetes. In this paper, three machine learning algorithms, namely logistic regression, k-nearest neighbors and support vector machine, were used to predict diabetes. These algorithms were applied to the ‘PIMA Indians Diabetes Dataset’. Experimental results determine the adequacy of the designed system, with an achieved accuracy of 77.27 percent using the support vector machine algorithm. In the future, various other machine learning models as well as deep learning models can be used for diabetes prediction. Ensemble techniques can be used to improve the accuracy of predictions. There is also scope for using similar approaches to predict other diseases such as cancer, heart disease etc. This kind of work can be extended and improved such that it can be used to automate the process of diabetes analysis.
REFERENCES
[1] Hasan, M. K., Alam, M. A., Das, D., Hossain, E., & Hasan, M. (2020). Diabetes prediction using ensembling of different machine learning classifiers. IEEE Access, 8, 76516-76531.
[2] Soni, M., & Varma, S. (2020). Diabetes prediction using machine learning techniques. International Journal of Engineering Research & Technology (IJERT), 9(09), 2278-0181.
[3] Mir, A., & Dhage, S. N. (2018, August). Diabetes disease prediction using machine learning on big data of healthcare. In 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA) (pp. 1-6). IEEE.
[4] Rathore, A., Chauhan, S., & Gujral, S. (2017). Detecting and Predicting Diabetes Using Supervised Learning: An Approach towards Better Healthcare for Women. International Journal of Advanced Research in Computer Science, 8(5).
[5] Meng, X. H., Huang, Y. X., Rao, D. P., Zhang, Q., & Liu, Q. (2013). Comparison of three data mining models for predicting diabetes or prediabetes by risk factors. The Kaohsiung Journal of Medical Sciences, 29(2), 93-99.
[6] Alanazi, R. (2022). Identification and prediction of chronic diseases using machine learning approach. Journal of Healthcare Engineering, 2022(1), 2826127.
[7] Kaggle dataset description: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
Copyright © 2024 Riza Atik. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET64988
Publish Date : 2024-11-04
ISSN : 2321-9653
Publisher Name : IJRASET