Stream Processing for Association Rule to Generate Student Dataset using Apriori Algorithm

Authors: Keerti Ghodke, Rashmi Patil, Indira Umarji, Dr Vidyagouri Hemadri, Dr U P Kulkarni

DOI Link: https://doi.org/10.22214/ijraset.2022.45884

Abstract

Analytical techniques have been used for many years to analyse and predict academic achievement from various perspectives. One of the most challenging problems for higher education is predicting students' paths through the education system. Many factors influence successful student outcome prediction in the early stages of a course. The Apriori algorithm discovers patterns in stored student data and collects student information based on those patterns. Colab and Python are used in this project to make predictions for each student based on the characteristics in the given dataset. The dataset includes information on each student. Real-world data that is received as it is being created is referred to as streaming data.

I. INTRODUCTION

Data mining is an important part of today's educational organizations, as well as one of the most important research areas, with the goal of extracting useful information from huge datasets. Educational data mining (EDM) is an important research field that can predict useful information from educational datasets in order to improve academic outcomes and to better understand and assess students' active learning. Data mining is the process of extracting information from massive amounts of data; to put it another way, it is the process of extracting knowledge from data. This process is also referred to as knowledge discovery in databases (KDD). We can make educated guesses based on the data provided.

We have used the Apriori algorithm and Apache Kafka.

Real-time processing of data streams: Take immediate action on knowledge and insight from live data using streaming platforms such as Kafka.

Make your data scientists self-sufficient: By connecting to the broker to discover data, train models, deploy them to producers, monitor them, and quickly retrain them on new data, data scientists become more self-sufficient throughout the development lifecycle.

Apriori Algorithm: Apriori is a popular method for mining frequent itemsets for Boolean association rules. Apriori employs a "bottom up" approach in which frequent subsets are extended one item at a time.

Kafka: LinkedIn created Apache Kafka, an open-source stream processing platform. It was later donated to the Apache Software Foundation and open-sourced in 2011.

II. LITERATURE SURVEY

Dr. Vikesh Kumar & Samrat Singh [1]. Data mining is a powerful tool for improving academic performance. Educational Data Mining is concerned with developing new methods for extracting information from educational data sets that can be used for decision making in the educational system.

M. Goyal and R. Vohra [2]. Data analysis is vital for decision support in any industry, including manufacturing and education. When data mining techniques such as clustering, decision trees, and association are applied to higher education processes, they can help to improve student performance, life cycle management, course selection, retention rate, and grant fund management.

Seema Purohit and Neelam Naik [4]. Quality higher education is required for the country's growth and development. One of the pillars of higher education is professional education. Data mining techniques seek to uncover hidden knowledge in existing educational data, forecast the future, and apply it to the benefit of higher education institutions and students.

K. Rajeswari, Suchita Borkar [7]. Educational data mining is an interesting area that has a major impact on predicting students' academic performance. The performance of students is evaluated in this paper using the association rule mining algorithm. Research has been done on evaluating student performance based on various attributes. Important rules are generated in our study to measure the correlation between various attributes, which will help in enhancing students' academic performance.

M. Tiwari, Randhir Singh, and Neeraj Vimal [8]. Educational institutions are important parts of our society, and they play an important role in the nation's growth and development. Predicting student performance in educational settings is also important. Personal, social, and psychological factors all affect a student's academic performance.

III. PROPOSED METHODOLOGY

The classical Apriori algorithm has a significant drawback: it scans the entire dataset multiple times, which costs a great deal of time and space. The change proposed in our paper is that we do not scan the whole database to add up the support for each attribute. This is accomplished by keeping track of the minimum support count and comparing it with the running support of each attribute. An attribute's support is counted only until it reaches the minimum support value; it is not necessary to know the attribute's support beyond that point. This is achieved by using a value called flag in the technique. When the value of flag changes, the loop is exited and the support value is recorded.
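The early-termination idea described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `reaches_min_support`, the toy transactions, and the attribute labels are invented for the example. Only the `flag` variable and the rule of stopping once the minimum support count is reached come from the description.

```python
def reaches_min_support(item, transactions, min_count):
    """Count occurrences of `item` only until `min_count` is reached.

    Returns (flag, count, scanned): flag is True once the item is known
    to be frequent; scanned is the number of transactions inspected.
    """
    flag = False
    count = 0
    scanned = 0
    for transaction in transactions:
        scanned += 1
        if item in transaction:
            count += 1
            if count >= min_count:
                flag = True   # minimum support reached
                break         # the remaining transactions are not scanned
    return flag, count, scanned


# Hypothetical toy data: each transaction is a set of attribute=value items.
transactions = [
    {"attendance=high", "grade=A"},
    {"attendance=high", "grade=B"},
    {"attendance=low", "grade=C"},
    {"attendance=high", "grade=A"},
    {"attendance=low", "grade=B"},
]

print(reaches_min_support("attendance=high", transactions, 2))  # (True, 2, 2)
print(reaches_min_support("grade=C", transactions, 2))          # (False, 1, 5)
```

In this sketch a frequent attribute is confirmed after two transactions, while an infrequent one still requires a full pass, so the saving depends on how early frequent attributes reach the threshold.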

IV. PROBLEM DEFINITION

Create a centralized student dataset. To improve the efficiency of level-wise generation of frequent itemsets from the student dataset, an important property, called the Apriori property, is used, which helps to reduce the search space. All subsets of a frequent itemset must also be frequent (Apriori property).
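One way the Apriori property reduces the search space is by pruning candidates before their support is ever counted. The sketch below is an assumption-laden illustration: the function name `generate_candidates` and the item labels A to D are hypothetical, and the join step is the simplest possible one rather than an optimized variant.

```python
from itertools import combinations


def generate_candidates(frequent_prev, k):
    """Build size-k candidates from frequent (k-1)-itemsets, then prune.

    A candidate survives only if every (k-1)-subset is itself frequent,
    which is the Apriori property.
    """
    frequent_prev = set(frequent_prev)
    # Join step: union pairs of frequent (k-1)-itemsets that yield k items.
    joined = {a | b for a in frequent_prev for b in frequent_prev
              if len(a | b) == k}
    # Prune step: drop candidates that have any infrequent (k-1)-subset.
    return {
        c for c in joined
        if all(frozenset(s) in frequent_prev for s in combinations(c, k - 1))
    }


# Hypothetical frequent 2-itemsets over made-up attribute labels.
L2 = {
    frozenset({"A", "B"}),
    frozenset({"A", "C"}),
    frozenset({"B", "C"}),
    frozenset({"B", "D"}),
}
C3 = generate_candidates(L2, 3)
print(C3)  # only {A, B, C} survives; {A, B, D} and {B, C, D} are pruned
```

Here {A, B, D} is discarded because {A, D} is not frequent, and {B, C, D} because {C, D} is not, so neither needs a database scan.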

V. DATASET DESCRIPTION

The following steps describe the design of the dataset.

We chose a dataset and its attributes: the dataset we created contains an analysis of each student from semester 1 to semester 8. It contains 105 instances and 35 attributes. The data file has to be in ‘CSV’ format.

Here is a sample of our dataset, which is in ‘CSV’ format
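To show how a CSV file of this kind might be turned into transactions for Apriori, the sketch below reads rows and converts each one into a set of "column=value" items. The rows and column names (`student_id`, `sem1_result`, `sem2_result`, `attendance`) are invented placeholders and are not taken from the authors' dataset; the helper name `rows_to_transactions` is also hypothetical.

```python
import csv
import io

# Invented placeholder rows; the column names are NOT from the authors' dataset.
SAMPLE_CSV = """student_id,sem1_result,sem2_result,attendance
1,pass,pass,high
2,pass,fail,low
3,fail,fail,low
"""


def rows_to_transactions(csv_text, id_column="student_id"):
    """Turn each CSV row into a set of 'column=value' items."""
    reader = csv.DictReader(io.StringIO(csv_text))
    transactions = []
    for row in reader:
        # Skip the identifier column and any empty or missing values.
        items = {f"{col}={val}" for col, val in row.items()
                 if col != id_column and val}
        transactions.append(items)
    return transactions


transactions = rows_to_transactions(SAMPLE_CSV)
print(len(transactions))        # 3
print(sorted(transactions[0]))  # ['attendance=high', 'sem1_result=pass', 'sem2_result=pass']
```

Encoding each attribute together with its value keeps items from different columns distinct, which matters when several columns share the same values such as "pass" and "fail".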

VI. EVALUATION METRICS

The Apriori principle states that all subsets of a frequent itemset must also be frequent.

Similarly, all supersets of an infrequent itemset must also be infrequent.

Support

Confidence

Lift

Conviction

Support: The support of a rule X => Y is calculated by dividing the number of transactions that satisfy the rule, N(X => Y), by the total number of transactions, N.

Support(X => Y) = N(X => Y) / N

Support is thus the frequency of transactions in which both the LHS and the RHS of the rule hold true. The higher the support, the stronger the evidence that both types of events occur together.

The support of an item x is simply the ratio of the number of transactions in which x appears to the total number of transactions.

Support(x) = N(x) / N = 0.66667
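The support definition can be checked with a few lines of Python. The transactions below are hypothetical and were chosen only so that the ratio happens to equal the value quoted above; the authors' actual counts are not given in the text.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


# Hypothetical transactions; item x appears in 4 of the 6.
transactions = [
    {"x", "y"}, {"x"}, {"y", "z"}, {"x", "z"}, {"x", "y", "z"}, {"z"},
]
print(round(support({"x"}, transactions), 5))       # 0.66667
print(round(support({"x", "y"}, transactions), 5))  # 0.33333
```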

Confidence: The confidence of a rule X => Y is calculated by dividing the number of transactions that satisfy the rule, N(X => Y), by the number of transactions that contain the rule's body, X.

Confidence(X => Y) = N(X => Y) / N(X)

Confidence is the probability that the RHS holds true given that the LHS holds true. A high confidence that the LHS event leads to the RHS event suggests causation or statistical dependence.
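A short sketch of the confidence calculation follows, using the counts N(X => Y) and N(X) directly. The helper names `count` and `confidence`, and the attendance and result labels, are assumptions made for illustration.

```python
def count(itemset, transactions):
    """Number of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def confidence(lhs, rhs, transactions):
    """N(X => Y) / N(X): how often the RHS holds when the LHS holds."""
    n_lhs = count(lhs, transactions)
    if n_lhs == 0:
        raise ValueError("LHS never occurs, so confidence is undefined")
    return count(set(lhs) | set(rhs), transactions) / n_lhs


# Hypothetical transactions with made-up attribute labels.
transactions = [
    {"attendance=high", "result=pass"},
    {"attendance=high", "result=pass"},
    {"attendance=high", "result=fail"},
    {"attendance=low", "result=fail"},
]
# Two of the three high-attendance records also show a pass.
print(confidence({"attendance=high"}, {"result=pass"}, transactions))
```

Note that confidence is not symmetric: swapping the LHS and RHS changes the denominator and, in general, the value.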

Lift: The lift of the rule X => Y is the deviation of the whole rule's support from the support expected under independence, given the supports of the LHS (X) and the RHS (Y).

Lift(X => Y) = Confidence(X => Y) / Support(Y) = Support(X => Y) / (Support(X) * Support(Y))

Lift is a measure of the impact that information from the LHS has on the chances of the RHS being true. It is a value that gives information about the increase in the probability of the "then" (consequent, RHS) part given the "if" (antecedent, LHS) part.

Lift is exactly one: No effect (LHS and RHS are independent). There is no relationship between the events.

Lift is greater than one: Positive effect (when the LHS holds true, the RHS is more likely to hold true). There is a positive dependence between the events.

Lift is less than one: Negative effect (when the LHS holds true, the RHS is less likely to hold true). There is a negative dependence between the events.
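The three cases for lift can be illustrated with a small sketch. The function names `lift` and `describe`, the tolerance used to test for "exactly one", and the transactions are all assumptions for the example; the sketch also assumes both sides of the rule occur at least once, since the ratio is otherwise undefined.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def lift(lhs, rhs, transactions):
    """Support(X => Y) / (Support(X) * Support(Y))."""
    joint = support(set(lhs) | set(rhs), transactions)
    return joint / (support(lhs, transactions) * support(rhs, transactions))


def describe(value, tol=1e-9):
    """Map a lift value onto the three cases: equal to, above, or below one."""
    if abs(value - 1.0) <= tol:
        return "independent"
    return "positive" if value > 1.0 else "negative"


# Hypothetical transactions with made-up attribute labels.
transactions = [
    {"attendance=high", "result=pass"},
    {"attendance=high", "result=pass"},
    {"attendance=high", "result=fail"},
    {"attendance=low", "result=fail"},
]
for rhs in ("result=pass", "result=fail"):
    value = lift({"attendance=high"}, {rhs}, transactions)
    print(rhs, round(value, 3), describe(value))
```

On this toy data, high attendance raises the chance of a pass (lift above one) and lowers the chance of a fail (lift below one).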

Leverage is the proportion of additional examples covered by both the antecedent and the consequent, beyond what would be expected if the antecedent and consequent were independent of each other.

lev(X, Y) = supp(X, Y) - supp(X) * supp(Y)

Conviction is a measure, related to leverage, of the departure from independence.

conv(X, Y) = supp(X) * (1 - supp(Y)) / (supp(X) - supp(X, Y))
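Leverage and conviction can be computed from the same three supports. In the sketch below the function names and the transactions are hypothetical, and returning infinity when the rule never fails is a convention assumed here for the case where the denominator supp(X) - supp(X, Y) is zero.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def leverage(lhs, rhs, transactions):
    """supp(X, Y) - supp(X) * supp(Y)."""
    joint = support(set(lhs) | set(rhs), transactions)
    return joint - support(lhs, transactions) * support(rhs, transactions)


def conviction(lhs, rhs, transactions):
    """supp(X) * (1 - supp(Y)) / (supp(X) - supp(X, Y))."""
    s_x = support(lhs, transactions)
    s_y = support(rhs, transactions)
    joint = support(set(lhs) | set(rhs), transactions)
    denominator = s_x - joint
    if denominator == 0:
        return float("inf")  # assumed convention: the rule never fails
    return s_x * (1 - s_y) / denominator


# Hypothetical transactions with made-up attribute labels.
transactions = [
    {"attendance=high", "result=pass"},
    {"attendance=high", "result=pass"},
    {"attendance=high", "result=fail"},
    {"attendance=low", "result=fail"},
]
x, y = {"attendance=high"}, {"result=pass"}
print(leverage(x, y, transactions))    # 0.125
print(conviction(x, y, transactions))  # 1.5
```

With supp(X) = 0.75, supp(Y) = 0.5 and supp(X, Y) = 0.5, leverage is 0.5 - 0.375 = 0.125 and conviction is 0.375 / 0.25 = 1.5.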

VII. SYSTEM DESIGN

In the first stage of the system design, we have the problem statement, which defines what the project sets out to do. Once it is defined, we collect the student dataset, which is used to predict and analyse the performance of each student. In the second stage, the data is preprocessed. Data preprocessing refers to any type of processing performed on the original data to prepare it for further data processing; filters that transform the data in various ways can be defined in the preprocessing step. In the third stage, we have data cleanup and data transformation options. Stream processing is a data management technique that involves ingesting a continuous data stream and rapidly analysing, filtering, transforming, or enhancing the data in real time. Classification and relationship describe how components and object types will be further defined by linking them to sources of information.
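The ingest, filter, and transform stages can be sketched without any messaging infrastructure. In the example below a plain Python generator stands in for the live data stream; in the system described here that role would be played by a Kafka consumer, which is not shown, and no Kafka API is used. The record fields and the function names `record_stream`, `preprocess`, and `process` are assumptions made for illustration.

```python
from collections import Counter


def record_stream():
    """Stand-in for a live source; yields raw records one at a time."""
    yield {"student_id": 1, "attendance": " High ", "result": "PASS"}
    yield {"student_id": 2, "attendance": "low", "result": ""}  # incomplete
    yield {"student_id": 3, "attendance": "high", "result": "pass"}


def preprocess(record):
    """Clean one record; return a set of items, or None to filter it out."""
    cleaned = {k: str(v).strip().lower() for k, v in record.items()
               if k != "student_id"}
    if any(v == "" for v in cleaned.values()):
        return None  # data cleanup: drop records with missing values
    return {f"{k}={v}" for k, v in cleaned.items()}


def process(stream):
    """Update item counts incrementally as each record arrives."""
    item_counts = Counter()
    seen = 0
    for record in stream:
        items = preprocess(record)
        if items is None:
            continue
        seen += 1
        item_counts.update(items)
    return seen, item_counts


seen, item_counts = process(record_stream())
print(seen, item_counts["attendance=high"])  # 2 2
```

Because the counts are updated record by record, item supports are available at any moment as `item_counts[item] / seen`, without waiting for the stream to end.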

Conclusion

In this paper, we use the Apriori algorithm to analyse a student database and make predictions from it, calculating confidence and support with L1 and L2 to perform the Apriori algorithm. We also introduce Kafka, which performs the stream processing. In the future, we plan to combine the Apriori algorithm with hash-based techniques, transaction reduction, partitioning, sampling, and dynamic itemset counting. The authors are also willing to collaborate on data from tests and examinations for each course in the future in order to determine what types of students succeed in what types of courses. This may identify the types of courses that are suited to each model of students who share similar characteristics. It can also generate a variety of multi-dimensional reports and reshape pedagogical practices and learning paths.

References

[1] Samrat Singh, Dr. Vikesh Kumar, "Performance Analysis of Engineering Students for Recruitment Using Classification Data Mining Techniques", IJCSET, Feb 2013.

[2] M. Goyal and R. Vohra, "Applications of Data Mining in Higher Education", IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 2, No 1, March 2012.

[3] Jason Brownlee, "How to Save Your Machine Learning Model and Make Predictions in Weka", August 3, 2016.

[4] Neelam Naik & Seema Purohit, "Prediction of Final Result and Placement of Students using Classification Algorithm", International Journal of Computer Applications (0975 – 8887), Volume 56, No. 12, October 2012.

[5] Tirumalasetty, Sudhir, A. Aruna, A. Padmini, D. Vijaya Sagaru, and A. Tejeswini. "An Enhanced Apriori with Interest of Patterns using cSupport and rSupport." International Journal of Computer Science and Mobile Computing 10, no. 7 (July 2021): 20–27. http://dx.doi.org/10.47760/ijcsmc.2021.v10i07.003.

[6] Cortez P. and Silva A. (2008). Using Data Mining to Predict Secondary School Student Performance. In EUROSIS, A. Brito and J. Teixeira (Eds.), pp. 5-12.

Copyright

Copyright © 2022 Keerti Ghodke, Rashmi Patil, Indira Umarji, Dr Vidyagouri Hemadri, Dr U P Kulkarni. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET45884

Publish Date : 2022-07-21

ISSN : 2321-9653

Publisher Name : IJRASET
