Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Devi Lal, Aswathy v s
DOI Link: https://doi.org/10.22214/ijraset.2023.51565
Certificate: View Certificate
We know that diabetes is one such disease that affects millions of people worldwide, so its early detection and accurate prediction can help in the timely intervention and management of the disease. The main objective of this project is to do a comparative analysis of the performances of the different machine learning algorithms like Random Forest, Support Vector Machine(SVM), K-Nearest Neighbors(KNN), and a Hybrid Random Forest in predicting diabetes and for this the various patient attributes like age, BMI, glucose level, blood pressure, etc are included in the dataset for training and testing the models and various evaluation metrics like accuracy, precision, recall, and F1-score of the different algorithms are used to do the comparative analysis. The Hybrid Random Forest algorithm is found to outperform the other considered algorithms with an accuracy of 90.4%. Feature selection is also involved in the study to identify the most important variables that would help in effective prediction of diabetes and the overall analysis demonstrates that glucose level, BMI, and age are the top variables that are considered to be very important for diabetes prediction. So in healthcare, identifying the risk of developing diabetes at an earlier stage and taking measures for its prevention and helping many patients can be a very beneficial findings of this study.
I. INTRODUCTION
We can say that today one of the leading cause of morbidity and mortality is a chronic metabolic disorder called as Diabetes that globally effects a massive amount of population and over the years its prevalence has been steadily increasing so it is important to prevent the development of complications like cardiovascular disease, kidney failure, and blindness by earlier detection and timely management of diabetes. From analyzation of large datasets and also considering a range of factors or attributes like age, pregnancies, body mass index (BMI), and blood glucose levels, skin thickness etc , Machine learning has shown great potential by using many algorithms to learn patterns , trends and discern difficult relationships that cannot be done with traditional statistical methods for predicting the onset of Diabetes. The main aim of this project is to develop reliable and accurate Machine learning models that can help and assist the healthcare professionals for predicting diabetes in individuals with the help of their clinical and demographic data that eventually shows itself as a potential tool to help to prevent the burden and onset of this disease on the whole world. Our study uses a range of Machine learning algorithms like KNN, SVM, random forests, and hybrid random forests and its comparative analysis is done to compare the different models along with its performance evaluation using the metrics like accuracy, precision, recall, and F1 score.
II. LITERATURE REVIEW
III. PROPOSED METHODOLOGY
We evaluated different classification and ensemble methods to achieve to identify more accurate model for diabetes prediction, which is the main objective of this study. A brief description of the experimental phase is presented below.
A. Dataset Description
The dataset for this project’s comparative analysis of different machine learning algorithms for diabetes prediction was taken from Kaggle. There are 768 observations and 9 attributes in the dataset. The dataset's attributes are as follows:
Table 1. Dataset description
S.NO |
ATTRIBUTES |
|
1 |
Pregnancy |
Number of times pregnant |
2 |
Glucose |
An oral glucose tolerance test measured plasma glucose concentration after 2 hours. |
3 |
Blood Pressure |
Diastolic blood pressure (mm Hg) |
4 |
Skin thickness |
Triceps skin fold thickness (mm) |
5 |
Insulin |
2-Hour serum insulin (mu U/ml) |
6 |
Diabetes Pedigree Function |
Diabetes pedigree function |
7 |
BMI |
Body mass index (weight in kilograms divided by height in meters) |
8 |
Age |
Age (years) |
9 |
Outcome |
Variable within a class (0 or 1) 268 out of 768 are 1, the rest are 0. |
The missing values in the dataset are treated during the data preprocessing step and these missing values are represented by 0 in some attributes like Glucose, blood pressure, skin thickness, Insulin, and BMI. The data set has a balanced class distribution with 268 positive cases (Outcome = 1) and 500 negative cases (Outcome = 0). The dataset used is suitable to identify the best algorithm for diabetes prediction by comparative analysis of KNN, SVM, Random Forest, and Hybrid algorithms as many previous research studies based on diabetes prediction using machine learning used this dataset and it is a widely used dataset.
B. Data Preprocessing
This is an important and critical step especially when it comes to health care related data, because any impurities or missing values can impact the overall effectiveness so to increase the quality and effectiveness of the data, to improve the overall data and to finally obtain the accurate results from the mining process for successful prediction using machine learning techniques we need to carry out data preprocessing step. For instance, in the case of the Pima Indian diabetes dataset, data preprocessing is essential and can be performed in two steps to ensure that the dataset is prepared optimally for analysis.
C. Apply Machine Learning
The different classification and ensemble methods are used to employ various Machine Learning techniques to predict diabetes after data preparation. This study is based on the Pima Indians diabetes dataset and our aim is to analyze the performance of these techniques, determine their accuracy, and identify the key features that are responsible for accurate prediction. The techniques used in our study are as follows:
These different Machine Learning algorithms used in this project for diabetes prediction like SVM, KNN, Random Forest, and Hybrid Random Forest are all powerful ones and has their respective strengths and weaknesses which can be evaluated by comparing their performances on a same dataset and this will help to gain the effectiveness and insights of each of these algorithms.
.
A. Data Collection
Data gathering is the initial stage in every machine learning project. In this project, we collected the dataset from a publicly available source containing various health metrics such as age, BMI, blood pressure, glucose level, insulin level, and diabetes pedigree function.
B. Data Preprocessing
After data collection, we preprocessed the data to make it suitable for machine learning analysis. This step involved checking for missing values, dealing with outliers, and normalizing the data.
C. Training and Testing Data
Now we need to train and test the data for that we need to divide the preprocessed dataset into training and testing sets and here we use ratio of 70:30.
D. Apply Machine Learning Technique
Then we need to apply the machine learning techniques to the training data and here we used mainly four different techniques which includes Random Forest, Support Vector Machine (SVM), k-Nearest Neighbors (KNN), and a Hybrid Random Forest.
E. Model Evaluation
Once the models were trained on the data, we evaluated their performance on the testing set. The evaluation metrics used in this study included accuracy, sensitivity, specificity, precision, and F1-score.
F. Result
The result of the overall analysis was that the Hybrid random forest algorithm technique outperformed while comparing to the other algorithms. Finally, this machine learning project for diabetes prediction involves collecting and preprocessing data, dividing it into training and testing sets, applying machine learning techniques such as Random Forest, SVM, KNN, and Hybrid Random Forest to the training data, and evaluating their performance on the testing set. The Hybrid Random Forest algorithm outperformed the other three algorithms in this study.
IV. MODEL BUILDING
We need to implement several machine learning algorithms for diabetes prediction and this phase of model building is very important and crucial.
The procedure of proposed methodology
V. EXPERIMENTAL RESULTS
For predicting diabetes, we used a dataset containing patients’ medical history, demographic information, lifestyle habits, diabetes diagnosis and we evaluated and compared the performances of machine algorithms like KNN, Random Forest, SVM, and Hybrid Random Forest. The ratio used for splitting the dataset into training and testing set was 70:30 and based on this for each algorithm we built a classifier model and each of these algorithm’s metrics evaluation such as accuracy, precision, recall, F1 score was done. The result of these different algorithms was as follows. Hybrid Random Forest with an accuracy of 90.00%, precision of 88.66%, recall of 90.53%, and F1 score of 89.58%
Outperformed the other algorithms.The KNN algorithm had an accuracy of 75.32%, precision of 60%, recall of 100%, and F1 score of 58.69%. The Random Forest algorithm (RF) had an accuracy of 81.16%, precision of 70.45%, recall of 100%, and F1 score of 68.13%. The SVM algorithm had an accuracy of 78.57%, precision of 70.58%, recall of 100%, and F1 score of 59.25%. The confusion matrix was generated to evaluate the models performance and it is shown below.
Confusion matrix-
Table 3. Confusion Matrix details.
|
Predicted: Negative |
Predicted: Positive |
Actual: Negative |
True Negative (TN) |
False Positive (FP) |
Actual: Positive |
False Negative (FN) |
True Positive (TP) |
Based on these results, we can conclude that the Hybrid Random Forest algorithm (HRF) is the best-performing algorithm for diabetes prediction. The algorithm combines the advantages of both the KNN and Random Forest algorithms (RF) and provides higher accuracy and F1 score than the other algorithms. These findings can be used to develop more accurate and effective diabetes prediction models in the future.
We can conclude that diabetes prediction can be effectively and accurately done by providing valuable insights through the comparative analysis of different machine learning algorithms like KNN, Random Forest Algorithm, SVM, and Hybrid Random Forest Algorithm. Through performance evaluation with respect to accuracy, precision, recall and F1 score the findings highlights the importance of considering hybrid approaches which can deal with complex datasets like diabetes prediction as the hybrid Random Forest was best outperformed with an accuracy score of 90.00%. This project can be one of the important milestones in research and development in the clinical settings or the medical field in detecting and prevention of diabetes in an earlier stage and ultimately improving the patient’s health. This project shows the machine learning algorithms true potential and its importance in selecting the suitable or appropriate algorithm for the task and through this how it can effectively and accurately do the diabetes prediction.
[1] American Diabetes Association. (2021). Classification and diagnosis of diabetes. Diabetes Care, 44(Supplement 1), S15-S33. [2] Centers for Disease Control and Prevention. (2021). National Diabetes Statistics Report, 2020. Atlanta, GA: U.S. Department of Health and Human Services. [3] Hulman, A., Simmons, R. K., & Brunner, E. J. (2018). Progression from pre-diabetes to type 2 diabetes in a population-based cohort: The Whitehall II study. Diabetic Medicine, 35(2), 226-234. [4] Kerner, W., & Brückel, J. (2014). Definition, classification and diagnosis of diabetes mellitus. Experimental and Clinical Endocrinology & Diabetes, 122(7), 384-386. [5] Li, L., Li, Y., Feng, Q., Zhang, M., & Zhang, Z. (2019). A novel method for diabetes prediction based on Bayesian network classifiers. PLOS ONE, 14(1), e0210293. [6] Lind, M., Polonsky, W., Hirsch, I. B., Heise, T., Bolinder, J., & Dahlqvist, S. (2021). Continuous glucose monitoring vs conventional therapy for glycemic control in adults with type 1 diabetes treated with multiple daily insulin injections: The GOLD randomized clinical trial. JAMA, 325(22), 2260-2270. [7] Ibrahim, H. M., Mabrouk, M. S., & Rehan, M. (2020). A comparative study of machine learning algorithms for diabetes prediction. Computers in Biology and Medicine, 124, 103932. [8] Ali, A., Ahmad, J., & Ahmad, S. (2019). Comparative study of machine learning algorithms for diabetes prediction. International Journal of Computer Applications, 181(40), 23-27. [9] Ravishankar, M., & Deepa, R. (2020). Comparative analysis of machine learning techniques for diabetes prediction. International Journal of Computer Science and Information Security, 18(2), 104-111. [10] Nair, A. V., & Kalathil, D. M. (2019). Comparative analysis of machine learning algorithms for diabetes prediction. International Journal of Computer Applications, 181(2), 19-22. [11] Roy, K., Saha, S., & Chakraborty, S. (2019). A comparative study of machine learning techniques for diabetes prediction. International Journal of Computer Applications, 182(8), 9-12. [12] Shrestha, S., Pant, S., & Ghimire, S. (2019). An overview of diabetes mellitus and its association with obesity: A review of the literature. International Journal of Medical Science and Public Health, 8(2), 35-43. [13] Tabák, Á. G., Herder, C., Rathmann, W., Brunner, E. J., & Kivimäki, M. (2012). Prediabetes: A high-risk state for diabetes development. The Lancet, 379(9833), 2279-2290. [14] Rathi, A., Kumar, N., & Sharma, A. (2020). Comparative analysis of machine learning algorithms for diabetes prediction. International Journal of Advanced Research in Computer Science, 11(1), 11-16. [15] Bhattacharya, S., & Das, S. (2018). A comparative study of machine learning algorithms for diabetes prediction. International Journal of Computer Applications, 180(26), 29-35.
Copyright © 2023 Devi Lal, Aswathy v s. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET51565
Publish Date : 2023-05-04
ISSN : 2321-9653
Publisher Name : IJRASET
DOI Link : Click Here