Implementation of Flight Fare Prediction System Using Machine Learning

Authors: Neel Bhosale, Hrutuja Handore, Pranav Gole, Priti Lakade, Gajanan Arsalwad

DOI Link: https://doi.org/10.22214/ijraset.2022.43230

Abstract

Flight ticket prices rise and fall frequently depending on factors such as the timing of the flight, the destination and the flight duration. In the proposed system, a predictive model is created by applying machine learning algorithms to collected historical flight data. Choosing the optimal time to purchase an airline ticket is challenging from the consumer’s perspective, principally because buyers have insufficient information for reasoning about future price movements. In this project we mainly aimed to uncover underlying trends in flight prices in India using historical data and to suggest the best time to buy a flight ticket. The project validates or contradicts common myths about the airline industry and compares various models for predicting the optimal time to buy a flight ticket and the amount that can be saved by doing so. Remarkably, price trends are highly sensitive to the route, month of departure, day of departure, time of departure, whether the day of departure is a holiday, and the airline carrier. Highly competitive routes, such as most business routes (tier 1 to tier 1 cities like Mumbai-Delhi), showed a non-decreasing trend in which prices increased as the days to departure decreased, whereas other routes (tier 1 to tier 2 cities like Delhi-Guwahati) had a specific time frame in which prices were at their minimum. Moreover, the data uncovered two basic categories of airline carriers operating in India, the economical group and the luxurious group; in most cases, the minimum-priced flight belonged to the economical group. The data also confirmed that there are certain periods of the day when prices are expected to be at their maximum. The scope of the project can be extended across various routes to make significant savings on flight ticket purchases in the Indian domestic airline market.

I. INTRODUCTION

The usual approach to buying a flight ticket is to purchase it many days before take-off so as to avoid the highest fares. Mostly, airlines do not follow this pattern. Airlines may reduce the price when they need to build the market, and may maximize the price when fewer tickets are available. The price may therefore depend on several factors. To predict prices, this project uses machine learning to model how flight ticket prices behave over time. Every airline has the right and freedom to change its ticket prices at any time. Travellers can save money by booking a ticket at the lowest price. People who travel by air frequently are aware of price fluctuations. Airlines use complex Revenue Management policies to execute distinct pricing strategies. As a result, the pricing system changes the fare depending on the time, season and festive days. The ultimate aim of the airlines is to earn profit, whereas the customer searches for the minimum rate. Customers usually try to buy the ticket well in advance of the departure date to avoid a hike in airfare as the date comes closer. In practice, however, this does not always work, and the customer may end up paying more than they should for the same seat.

II. MOTIVATION

The motivation is to help people who tend to pay more than necessary for flight tickets and those who are new to the ticket booking process. The project will also give us more exposure to machine learning techniques, helping us improve our existing skills.

III. AIM AND OBJECTIVE

The objectives of the project are given below:

To get an effective price for the customers.
To make the UI user-friendly.
To use various ML methods to learn more about the dataset and get accurate results.

The aims of the project are:

a. To gain complete knowledge of “Data Science and Machine Learning”.

b. To study and gain knowledge about different algorithms in Machine Learning.

c. To get an effective and accurate flight fare price.

d. To study the ups and downs of flight prices according to routes and on different days.

e. To create an effective, user-friendly UI design.

f. To find solutions for mitigating defects.

IV. LITERATURE SURVEY

1. K. Tziridis, T. Kalampokas, G. Papakostas and K. Diamantaras "Airfare price prediction using machine learning techniques" in European Signal Processing Conference (EUSIPCO), DOI: 10.23919/EUSIPCO.2017.8081365.

In study [1], "Airfare price prediction using machine learning techniques", the authors used a dataset of 1814 Aegean Airlines flights to train machine learning models. Different numbers of features were used to train the models to show how feature selection can change model accuracy. They used algorithms such as Multilayer Perceptron (MLP), Generalized Regression Neural Network, Extreme Learning Machine (ELM), Random Forest Regression Tree, Regression Tree, Bagging Regression Tree, Regression SVM (Polynomial and Linear) and Linear Regression (LR), and obtained different outputs for each algorithm. They trained various models while removing and adding different features from the dataset, following a typical data science life cycle. The best results came from the Bagging Regression Tree.

2. William Groves and Maria Gini "An agent for optimizing airline ticket purchasing" in proceedings of the 2013 international conference on autonomous agents and multi-agent systems.

In case study [2] by William Groves, an agent is introduced that is able to optimize purchase timing on behalf of customers. The partial least squares (PLS) regression technique is used to build the model. Initially they used several steps for feature selection and modelling, namely Feature Extraction, Lagged Feature Computation, Regression Model Construction and Optimal Model Selection. Their experiments were designed to estimate the real-world costs of using their prediction models. The lag scheme approach works well for many choices of machine learning algorithm, but PLS regression was found to work best for this domain. The improved performance can be attributed to a natural resistance to collinear and irrelevant variables.

3. J. Santos Dominguez-Menchero, Javier Rivera and Emilio Torres Manzanera "Optimal purchase timing in the airline market".

In this paper, the researchers studied the general pattern in airline pricing behaviour and a methodology for analysing different routes and/or carriers. Their purpose is to provide customers with the relevant information they need to decide the best time to purchase a ticket, striking a balance between the desire to save money and any time constraints the buyer may have. Their study shows how non-parametric isotonic regression techniques, as opposed to standard parametric techniques, are particularly useful. Most importantly, one can determine the margin of time by which consumers may delay their purchase without a significant price increase, specify the economic loss for each day the purchase is delayed, and detect when it is better to wait until the last day to make a purchase.

4. Supriya Rajankar, Neha Sakhrakar and Omprakash Rajankar “Flight fare prediction using machine learning algorithms” International Journal of Engineering Research and Technology (IJERT) June 2019.

The paper by Supriya Rajankar et al., a survey on flight fare prediction using machine learning algorithms, uses a small dataset of flights between Delhi and Bombay. Algorithms such as K-nearest neighbours (KNN), linear regression and support vector machine (SVM) are applied to obtain different outcomes and study them. The algorithms implemented for predicting flight ticket prices are: Support Vector Machine (SVM), Linear Regression, K-Nearest Neighbours, Decision Tree, Multilayer Perceptron, Gradient Boosting and Random Forest. These models were implemented using the Python library scikit-learn. Metrics such as R-squared, MAE and MSE are used to verify the performance of these models. The best results came from the Decision Tree algorithm.

5. Tianyi Wang, Samira Pouyanfar, Haiman Tian and Yudong Tao "A Framework for airline price prediction: A machine learning approach"

In this paper, Tianyi Wang, Samira Pouyanfar, Haiman Tian and Yudong Tao [5] proposed a framework in which two databases are combined with macroeconomic data, and machine learning algorithms such as support vector machine and XGBoost are used to model the average ticket price for each source and destination pair. The framework achieves a high prediction accuracy of 0.869 on the adjusted R-squared performance metric. They report their lowest error rate, 0.92, with the XGBoost algorithm.

6. T. Janssen "A linear quantile mixed regression model for prediction of airline ticket prices"

In this paper, the authors predicted the best time to purchase tickets. They used various machine learning algorithms such as linear regression, Decision Tree, Random Forest, K-Nearest Neighbour, Multilayer Perceptron (MLP), gradient boosting and support vector machine (SVM). For predictors, they used Naïve Bayes and a Stacked Prediction Model. In the research, the desired model is implemented using the Linear Quantile Mixed Regression methodology for the San Francisco–New York route, where daily airfares are provided by an online website. Two features, the number of days to departure and whether the departure falls on a weekend or a weekday, are considered to develop the model.

7. Wohlfarth, T. Clemencon, S. Roueff “A Data mining approach to travel price forecasting” 10th International Conference on Machine Learning, Honolulu 2011.

Research paper [7] by Wohlfarth, T. Clemencon and S. Roueff on flight fare prediction addresses the technique of yield management in the air travel industry. They used various data mining techniques. The goal of the paper is to consider the design of decision-making tools in the context of varying travel prices from the customer’s perspective. The machine learning techniques mentioned in the research include clustering.

8. Vinod Kimbhaune, Harshil Donga, Ashutosh Trivedi, Sonam Mahajan and Viraj Mahajan research paper on flight fare prediction system.

In research paper [8] on a flight fare prediction system by Vinod Kimbhaune, Harshil Donga, Ashutosh Trivedi, Sonam Mahajan and Viraj Mahajan, various machine learning algorithms, i.e., Random Forest, Decision Tree and Linear Regression, are applied to the dataset to determine the ideal purchase time for a flight ticket. Their project aims to develop an application that predicts flight prices for various flights using a machine learning model. The performance metrics used are MAE, MSE and RMSE. The outcome of their project was not fully accurate, but adding more real-time data should give more accurate results.

9. W. Groves and M. Gini, "An agent for optimizing airline ticket purchasing," 12th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2013), St. Paul, MN, May 06 - 10, 2013, pp. 1341-1342.

This is the extended version of research paper [2], which exploited Partial Least Squares Regression (PLSR) to build a model. The data were gathered from major travel booking sites from 22 February 2011 to 23 June 2011. Additional data were also gathered and used to check the comparative performance of the final model.

V. DIFFERENT APPROACHES

There are various approaches to implementing the project. Below are some approaches used by the authors in the literature survey:

A. Linear Regression

Regression is a method of modelling a target value based on independent predictors. Regression methods differ mainly in the number of independent variables and in the type of relationship between the independent and dependent variables.

Simple linear regression is a type of regression analysis in which there is one independent variable and the relationship between the dependent and independent variables is linear. The important concepts for understanding linear regression are the cost function and gradient descent.

y(pred) = b0 + b1 * x
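
As an illustration of how the cost function and gradient descent work together, the following minimal sketch fits b0 and b1 by repeatedly stepping against the gradient of the mean squared error. The function name `fit_line_gd`, the learning rate, the number of epochs and the toy data are assumptions made for this example and are not taken from the project.

```python
import numpy as np

def fit_line_gd(x, y, lr=0.05, epochs=2000):
    """Fit y_pred = b0 + b1 * x by gradient descent on the mean squared error."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        y_pred = b0 + b1 * x
        error = y_pred - y
        # Gradients of the cost J = (1/n) * sum(error ** 2)
        grad_b0 = (2.0 / n) * error.sum()
        grad_b1 = (2.0 / n) * (error * x).sum()
        b0 -= lr * grad_b0
        b1 -= lr * grad_b1
    return b0, b1

# Toy data generated from y = 3 + 2x (illustrative values, not from the paper)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 + 2.0 * x
b0, b1 = fit_line_gd(x, y)
print(round(b0, 4), round(b1, 4))
```

On this noise-free toy data the estimates approach the generating intercept and slope; with real fares the fit would only be approximate.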

B. Gradient Boosting

Gradient boosting builds an additive regression model by sequentially fitting a simple function to the current “pseudo” residuals by least squares at each iteration. The scikit-learn implementation uses a decision tree as the base estimator. The number of boosting stages is varied from 10 to 1000 in steps of 10. The loss function is an important parameter in gradient boosting; the available options include least squares regression, least absolute deviation and quantile regression.
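
The sequential fitting of pseudo-residuals can be sketched by hand with shallow scikit-learn regression trees. This is a simplified illustration for the least-squares loss only, not the scikit-learn `GradientBoostingRegressor` itself; the helper names `boost_fit` and `boost_predict`, the tree depth, the learning rate and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_fit(X, y, n_stages=50, learning_rate=0.1):
    """Least-squares gradient boosting with shallow regression trees."""
    base = float(np.mean(y))            # initial constant prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_stages):
        residual = y - pred             # pseudo-residuals for squared-error loss
        tree = DecisionTreeRegressor(max_depth=2, random_state=0)
        tree.fit(X, residual)
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def boost_predict(X, base, trees, learning_rate=0.1):
    """Sum the constant base prediction and the shrunken tree corrections."""
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) * 1000 + 5000       # synthetic "fares", not real data
base, trees = boost_fit(X, y)
mse_start = float(np.mean((y - base) ** 2))
mse_end = float(np.mean((y - boost_predict(X, base, trees)) ** 2))
print(mse_start > mse_end)
```

Each stage only corrects what the previous stages left unexplained, which is why the training error shrinks as stages are added.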

C. K- Nearest Neighbours

In KNN regression, the output is the average of the values of the k nearest neighbours. Like SVM, it is a non-parametric method. Several values of k are tried, and the one giving the best performance is selected.
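
The averaging step can be shown in a few lines of NumPy. This is a minimal sketch; the function name `knn_predict`, the two hypothetical features and the fare values are made up for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the mean target of the k training points closest to x_query."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    return float(y_train[nearest].mean())

# Hypothetical features: [duration in hours, number of stops]
X_train = np.array([[2.0, 0], [2.5, 0], [5.0, 1], [7.5, 1], [12.0, 2]])
y_train = np.array([4000.0, 4200.0, 6500.0, 7000.0, 11000.0])

# The two closest flights cost 4000 and 4200, so the prediction is their mean
pred = knn_predict(X_train, y_train, np.array([2.2, 0]), k=2)
print(pred)
```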

D. Multi-Layer Perceptron

It is a class of feedforward artificial neural network. It consists of an input layer, an output layer and one or more hidden layers. The hidden layers give the neural network its depth. The setup uses 1 hidden layer, with the number of neurons ranging from 100 to 2000 at different intervals depending on the required condition. Each neuron fires according to its activation function; the logistic sigmoid function is used as the activation function.
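
To make the layer structure concrete, the sketch below runs a forward pass through a network with one sigmoid hidden layer of 100 neurons and a linear output. The weights are random and untrained, and the shapes, names and input data are assumptions chosen only to show how the layers connect.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(X, W1, b1, W2, b2):
    """Forward pass: input -> one sigmoid hidden layer -> linear output."""
    hidden = sigmoid(X @ W1 + b1)
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
n_features, n_hidden = 4, 100                      # 100 hidden neurons, as in the smallest setup
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))
b2 = np.zeros(1)

X = rng.normal(size=(5, n_features))               # five hypothetical samples
out = mlp_forward(X, W1, b1, W2, b2)
print(out.shape)
```

Training would adjust `W1`, `b1`, `W2` and `b2` by backpropagation; only the prediction path is shown here.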

E. Support Vector Machine

Support Vector Machine is used here for regression analysis; it relies on a kernel function and is considered a non-parametric technique. The following kernels are used: Linear, Polynomial and Radial Basis Function. According to previous studies, Random Forest and gradient boosting give the maximum accuracy.
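
The three kernels named for support vector regression can be compared with scikit-learn's `SVR`, as in the sketch below. The synthetic linear data and the value of `C` are assumptions for illustration; the scores say nothing about how these kernels perform on real fare data.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(80, 1))
y = 2.0 * X[:, 0] + 1.0                 # synthetic linear target, not real data

scores = {}
for kernel in ("linear", "poly", "rbf"):
    model = SVR(kernel=kernel, C=10.0)
    model.fit(X, y)
    scores[kernel] = model.score(X, y)  # R-squared on the training data

print(scores)
```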

VI. METHODOLOGY AND TERMS USED

The following are the fields in our data set:

Size of Test Set: 10683 rows & 11 columns
Airline: The name of the airline.
Date of Journey: The date of the journey.
Source: The source from which the service begins.
Route: Route of the flight, start to end.
Destinations: The destination where the service ends.
Departure Time: The time when the journey starts from the source.
Arrival Time: Time of arrival at the destination.
Duration: Total duration of the flight.
Total Stops: Total stops between the source and destination.
Additional Info: Additional information about the flight
Price: The price of the ticket

The machine learning algorithms used for implementing the project are described below.

A. Random Forest

Random forest is a supervised learning algorithm. Its benefit is that it can be used for both classification and regression problems, which make up the majority of current machine learning systems. A random forest builds numerous decision trees and combines them to get a more accurate and stable prediction. Random Forest has nearly the same hyperparameters as a decision tree or a bagging classifier. Compared with other algorithms, it is very easy to measure the relative importance of each feature for the prediction. The common element in these procedures is that, for the kth tree, a random vector theta_k is generated, independent of the past random vectors theta_1, ..., theta_(k-1) but with the same distribution, and a tree is grown using the training set and theta_k, resulting in a classifier whose input is a vector x. For instance, in bagging the random vector is generated as the counts in N boxes, where N is the number of examples in the training set. In random split selection, theta consists of a number of independent random integers between 1 and K. The dimensionality and nature of theta depend on its use in tree construction. After a large number of trees are generated, they vote for the most popular class. These procedures are called random forests.
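
For regression, the trees' outputs are averaged rather than voted on. The sketch below uses scikit-learn's `RandomForestRegressor` on synthetic data to show that the forest prediction equals the mean of the individual tree predictions, and how feature importances are read off. The data, the number of trees and the variable names are assumptions for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
# Synthetic target driven by the first feature only (illustrative, not real fares)
y = 5000 + 3000 * X[:, 0] + rng.normal(scale=50, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

# The forest's regression output is the average over its individual trees
forest_pred = forest.predict(X[:5])
tree_mean = np.mean([t.predict(X[:5]) for t in forest.estimators_], axis=0)

importances = forest.feature_importances_   # one value per feature, summing to 1
print(importances)
```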

B. XGBoost

XGBoost is an implementation of gradient-boosted decision trees. In this algorithm, decision trees are created sequentially. Weights play an important role in XGBoost. Weights are assigned to all the independent variables, which are then fed into a decision tree that predicts results. The weights of variables predicted wrongly by the tree are increased, and these variables are then fed to the second decision tree. These individual classifiers/predictors are then ensembled to give a stronger and more precise model. It can work on regression, classification, ranking and user-defined prediction problems.

C. Performance Metrics

Performance metrics are statistical measures used to compare the accuracy of the machine learning models trained by different algorithms. The sklearn.metrics module will be used to implement the functions that measure the error of each model using regression metrics. The following metrics will be used to check the error of each model.

D. MAE (Mean Absolute Error)

Mean Absolute Error is the average of the absolute differences between the predicted and actual values.
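
The definition above, together with the MSE and RMSE values reported in the results, can be computed directly as in the following sketch. The helper name `regression_errors` and the sample fares are assumptions; in the project these values would come from the sklearn.metrics functions.

```python
import math

def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for two equal-length lists of numbers."""
    n = len(actual)
    diffs = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(d) for d in diffs) / n     # mean of absolute differences
    mse = sum(d * d for d in diffs) / n      # mean of squared differences
    return mae, mse, math.sqrt(mse)          # RMSE is the square root of MSE

# Made-up fares for illustration: the differences are 200, -500, 0 and -1000
actual = [4000.0, 6500.0, 7000.0, 11000.0]
predicted = [4200.0, 6000.0, 7000.0, 10000.0]
mae, mse, rmse = regression_errors(actual, predicted)
print(mae, mse, rmse)
```

Because MSE squares each difference, the single 1000-unit miss dominates it, while MAE weights all misses linearly.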

IX. IMPLEMENTATION

We followed the steps below in our project to reach our ultimate goal of predicting flight fares:

1. Importing Necessary Libraries

We import Python libraries such as pandas, matplotlib, seaborn and NumPy for reading and visualizing the dataset.

2. Reading our Dataset

We will read our dataset using pandas. As the dataset is in Excel format, we will use “pd.read_excel()”.

3. Dropping NAN Values

We will check whether there are any null values in our dataset; if there are, we will drop them using “dropna(inplace=True)”.
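
A minimal sketch of the null check and drop is given below. A small in-memory DataFrame stands in for the Excel file, which would otherwise be loaded with “pd.read_excel()”; the rows and values shown are invented for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in for the Excel data (made-up rows)
df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "SpiceJet"],
    "Route": ["BLR-DEL", None, "CCU-BLR"],
    "Total_Stops": ["non-stop", "1 stop", np.nan],
    "Price": [3897, 7662, 4107],
})

null_counts = df.isnull().sum()   # number of missing values per column
df.dropna(inplace=True)           # drop every row that has a missing value
print(null_counts)
print(len(df))
```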

4. Exploratory Data Analysis

We will pre-process our dataset. Since the model understands only numerical values, we will extract the day and month from the “Date of Journey” column. For this we will convert the column with “pd.to_datetime”; “dt.day” and “dt.month” will then extract the day and month respectively.

The same process will be applied to the “dep_time”, “Duration” and “arrival_time” columns to extract hours and minutes. After extracting the day, month, hours and minutes, we will drop the “Date of Journey”, “Duration”, “dep_time” and “arrival_time” columns from our dataset.
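
The extraction steps above might look like the following sketch. The sample rows, the date and time formats, the duration format (“2h 50m”) and the names of the new columns are assumptions; the “arrival_time” column would be handled in the same way as “dep_time”.

```python
import pandas as pd

# Made-up rows; the string formats are assumed for this example
df = pd.DataFrame({
    "Date of Journey": ["24/03/2019", "01/05/2019"],
    "dep_time": ["22:20", "05:50"],
    "Duration": ["2h 50m", "19h"],
})

journey = pd.to_datetime(df["Date of Journey"], format="%d/%m/%Y")
df["Journey_day"] = journey.dt.day
df["Journey_month"] = journey.dt.month

dep = pd.to_datetime(df["dep_time"], format="%H:%M")
df["Dep_hour"] = dep.dt.hour
df["Dep_min"] = dep.dt.minute

# "2h 50m" -> 2 and 50; a missing part (e.g. "19h") counts as 0
df["Duration_hours"] = df["Duration"].str.extract(r"(\d+)h")[0].fillna("0").astype(int)
df["Duration_mins"] = df["Duration"].str.extract(r"(\d+)m")[0].fillna("0").astype(int)

# The original text columns are no longer needed once the numbers are extracted
df = df.drop(columns=["Date of Journey", "dep_time", "Duration"])
print(df)
```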

5. Handling Categorical Data

As the model understands only numerical values, we will convert all the categorical data into numerical data using the “OneHotEncoding” method. We will create dummy variables using pandas to one-hot encode the “Airline”, “Source” and “Destination” columns.

We will drop the “AdditionalInfo” and “Route” columns, as the “Route” column carries the same information as the “Total_Stops” column and the “AdditionalInfo” column does not contain any useful additional information. The “Total_Stops” column is ordinal data, so we will apply “LabelEncoder” and label the stops as 0, 1, 2, 3, 4. As the number of stops increases, the value also increases.
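
The two encodings can be sketched as follows. Note that this example swaps “LabelEncoder” for an explicit dictionary mapping: scikit-learn's LabelEncoder assigns codes in sorted order of the labels, so a label such as “non-stop” would not receive 0 automatically. The sample rows and the exact label strings are assumptions.

```python
import pandas as pd

# Made-up rows for illustration
df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo"],
    "Source": ["Delhi", "Kolkata", "Mumbai"],
    "Destination": ["Cochin", "Delhi", "Hyderabad"],
    "Total_Stops": ["non-stop", "2 stops", "1 stop"],
})

# Nominal columns: one dummy column per category
nominal = pd.get_dummies(df[["Airline", "Source", "Destination"]])

# Ordinal column: more stops -> larger integer (assumed label strings)
stop_order = {"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}
df["Total_Stops"] = df["Total_Stops"].map(stop_order)

encoded = pd.concat([df[["Total_Stops"]], nominal], axis=1)
print(encoded.columns.tolist())
```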

6. Test Data: Performing EDA and Feature Engineering

For the test data, we will perform the same steps as in steps (2), (3), (4) and (5).

7. Feature Selection

In this process, we will find the features that contribute most to our target variable.

X = “Independent Feature”

Y = “Dependent Feature” i.e., “Price” column.

We will place all the independent features (everything except price) in the X variable and the price in the Y variable. For this, we will use the loc and iloc methods.

We then use “ExtraTreesRegressor” to find the more important features in the data. We fit the selection model on the X and Y features, and then print “feature_importances_” to see which features are important.

We find that “Total_Stops” is the most important feature.
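
A minimal sketch of this feature-importance step is shown below on synthetic data. The price is constructed so that the number of stops dominates, so the ranking here is true by construction and does not by itself confirm the finding on the real dataset; the column names other than “Total_Stops” and “Price” are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Total_Stops": rng.integers(0, 4, size=300),
    "Journey_day": rng.integers(1, 29, size=300),
    "Dep_hour": rng.integers(0, 24, size=300),
})
# Synthetic price driven mainly by the number of stops (an assumption for the demo)
data["Price"] = 3000 + 2500 * data["Total_Stops"] + rng.normal(scale=200, size=300)

X = data.drop(columns=["Price"])    # independent features
y = data["Price"]                   # dependent feature

selection = ExtraTreesRegressor(n_estimators=100, random_state=0)
selection.fit(X, y)

importances = pd.Series(selection.feature_importances_, index=X.columns)
top_feature = importances.idxmax()
print(importances.sort_values(ascending=False))
```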

8. Applying Machine Learning Algorithms

We have implemented this project using the Random Forest and XGBoost regressor algorithms. The test results are given below.

9. Pickling the File

We pickle the best model (Random Forest) so that it can be reused.
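
Saving and restoring a fitted model with Python's pickle module can be sketched as follows. The tiny training set and the file name `flight_rf.pkl` are hypothetical; a temporary directory is used so the example leaves no file in the working directory.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny made-up training set, just to have a fitted model to save
X = np.array([[0], [1], [2], [3]], dtype=float)
y = np.array([3000.0, 5000.0, 7000.0, 9000.0])
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "flight_rf.pkl")  # hypothetical file name
with open(path, "wb") as f:
    pickle.dump(model, f)            # serialize the fitted model to disk

with open(path, "rb") as f:
    restored = pickle.load(f)        # load it back without retraining

same = bool(np.array_equal(model.predict(X), restored.predict(X)))
print(same)
```

A pickled model should be loaded with the same scikit-learn version that saved it, and pickle files from untrusted sources should not be loaded.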

X. RESULTS

The following are the results on the train data and test data. As we can see, Random Forest performs better than the XGBoost algorithm, so we have chosen the Random Forest model for our project.

| ML Algorithm  | Train Data | Test Data |
|---------------|------------|-----------|
| XGBoost       | 0.7774     | 0.7752    |
| Random Forest | 0.9529     | 0.7970    |

Table 4. Model Accuracy Results

Below is the comparison of the MAE, MSE and RMSE:

| ML Algorithm                        | MAE     | MSE        | RMSE    |
|-------------------------------------|---------|------------|---------|
| Random Forest                       | 1180.05 | 4358748.36 | 2087.76 |
| HyperParameter Tuning Random Forest | 1278.10 | 4277590.50 | 2068.23 |
| XGBoost                             | 1180.05 | 4358748.36 | 2087.76 |

Table 5. Comparing MAE, MSE & RMSE Scores

As we can see, the scores of Random Forest before hyperparameter tuning and those of XGBoost are the same. The Random Forest scores change slightly after hyperparameter tuning.

Reference paper [1] used various machine learning techniques and obtained its best result with the Bagging Regression Tree method, at an accuracy of 87.42. A comparison between reference paper [1] and our Random Forest model is given below:

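|          | Reference paper [1] | This work (Random Forest) |
|----------|---------------------|---------------------------|
| Accuracy | 87.42               | 79.7                      |

Table 6. Comparison of Model Table 1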

Reference paper [15] used various machine learning techniques and obtained its best result with the Trend Based Model method, at an accuracy of 81.8. A comparison between the Random Forest model of reference paper [15] and ours is given below:

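|          | Reference paper [15] (Random Forest) | This work (Random Forest) |
|----------|--------------------------------------|---------------------------|
| Accuracy | 77.8                                 | 79.7                      |

Table 7. Comparison of Model Table 2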

Conclusion

Machine learning algorithms are applied to the dataset to predict the dynamic fare of flights. This gives predicted flight fares that help in getting a flight ticket at minimum cost. The R-squared values obtained from the algorithms give the accuracy of the model. In the future, if more data such as the current availability of seats could be accessed, the predicted results would be more accurate. Finally, we conclude that this methodology is not the preferred one for this project; more methods and more data can be added to obtain more accurate results.

References

[1] K. Tziridis, T. Kalampokas, G. Papakostas and K. Diamantaras, "Airfare price prediction using machine learning techniques", in European Signal Processing Conference (EUSIPCO), DOI: 10.23919/EUSIPCO.2017.8081365.
[2] William Groves and Maria Gini, "An agent for optimizing airline ticket purchasing", in Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems.
[3] J. Santos Dominguez-Menchero, Javier Rivera and Emilio Torres-Manzanera, "Optimal purchase timing in the airline market".
[4] Supriya Rajankar, Neha Sakhrakar and Omprakash Rajankar, "Flight fare prediction using machine learning algorithms", International Journal of Engineering Research and Technology (IJERT), June 2019.
[5] Tianyi Wang, Samira Pouyanfar, Haiman Tian and Yudong Tao, "A Framework for airline price prediction: A machine learning approach".
[6] T. Janssen, "A linear quantile mixed regression model for prediction of airline ticket prices".
[7] Wohlfarth, T. Clemencon, S. Roueff, "A Data mining approach to travel price forecasting", 10th International Conference on Machine Learning, Honolulu, 2011.
[8] Vinod Kimbhaune, Harshil Donga, Ashutosh Trivedi, Sonam Mahajan and Viraj Mahajan, research paper on flight fare prediction system.
[9] W. Groves and M. Gini, "An agent for optimizing airline ticket purchasing", 12th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2013), St. Paul, MN, May 06 - 10, 2013, pp. 1341-1342.
[10] Viet Hoang Vu, Quang Tran Minh and Phu H. Phung, "An Airfare Prediction Model for Developing Markets", IEEE paper, 2018.
[11] Dominguez-Menchero, J. Santos, Rivera, "Optimal purchase timing in airline markets", 2014.
[12] medium.com/analytics-vidhya/mae-mse-rmse-coefficient of determination-adjusted-r-squared-which-metric-is bettercd0326a5697e, article on performance metrics.
[13] www.keboola.com/blog/random-forest-regression, article on random forest.
[14] https://towardsdatascience.com/machine-learning-basics-decisiontreeregression-1d73ea003fda, article on decision tree regression.
[15] Achyut Joshi, Himanshu Sikaria, Tarun Devireddy and Dr. Vivek Vijay, Predicting Flight Prices in India.
[16] O. Etzioni, R. Tuchinda, C. A. Knoblock and A. Yates, To buy or not to buy: mining airfare data to minimize ticket purchase price.
[17] Manolis Papadakis, Predicting Airfare Prices.
[18] Groves and Gini, A Regression Model for Predicting Optimal Purchase Timing For Airline Tickets, 2011.
[19] Krishna Rama-Murthy, Modeling of United States Airline Fares – Using the Official Airline Guide (OAG) and Airline Origin and Destination Survey (DB1B), 2006.
[20] B. S. Everitt, The Cambridge Dictionary of Statistics, Cambridge University Press, Cambridge (3rd edition, 2006). ISBN 0-521-69027-7.
[21] Bishop, Pattern Recognition and Machine Learning, Springer, ISBN 0-387-31073-8.
[22] E. Bachis and C. A. Piga, Low-cost airlines and online price dispersion, International Journal of Industrial Organization, In Press, Corrected Proof, 2011.
[23] P. P. Belobaba, Airline yield management. An overview of seat inventory control, Transportation Science, 21(2):63, 1987.
[24] Y. Levin, J. McGill and M. Nediak, Dynamic pricing in the presence of strategic consumers and oligopolistic competition, Management Science, 55(1):32–46, 2009.
[25] B. Smith, J. Leimkuhler, R. Darrow and Samuels, "Yield management at American Airlines", Interfaces, vol. 22, pp. 8–31, 1992.
[26] T. Janssen, "A linear quantile mixed regression model for prediction of airline ticket prices", Bachelor Thesis, Radboud University, 2014.
[27] S. B. Kotsiantis, "Decision trees: a recent overview", Artificial Intelligence Review, vol. 39, no. 4, pp. 261-283, 2013.
[28] L. Breiman, "Random forests", Machine Learning, vol. 45, pp. 5-32, 2001.
[29] S. Haykin, Neural Networks – A Comprehensive Foundation, Prentice Hall, 2nd Edition, 1999.
[30] H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola and V. Vapnik, "Support vector regression machines", Advances in Neural Information Processing Systems, vol. 9, pp. 155-161, 1997.

Copyright

Copyright © 2022 Neel Bhosale, Hrutuja Handore, Pranav Gole, Priti Lakade, Gajanan Arsalwad. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Paper Id : IJRASET43230

Publish Date : 2022-05-25

ISSN : 2321-9653

Publisher Name : IJRASET
