Creating Search Engine Using Machine Learning Methods

Authors: Mr. D. Ranadeep Reddy, Dr. P. Sruthi, M. Ganesh, S. Priyanka, M. Shiva

DOI Link: https://doi.org/10.22214/ijraset.2024.59841

Abstract

The vast and ever-expanding amount of information available in the WWW has led to the widespread usage of search engines for data retrieval. It can be challenging to locate information that is actually relevant and helpful, even while ordinary search engines offer users an intuitive interface for entering queries and retrieving web page links as results. To rectify that problem, which paper presents a novel search engine that work ML techniques. The target is to come up users with most relevant web sites when they query the engine. The suggested search engine improves the relevancy and accuracy of search results by utilizing machine learning algorithms. This system seeks to grasp user intent, adjust to individual preferences, and deliver contextually relevant information by going beyond the bounds of traditional search engines. By using machine learning models, the search engine may learn dynamically and enhance the standard of its results over time by continuously refining its understanding. More sophisticated and adaptable search engines are being developed as an outcome of these combination of state-of-the-art machine learning libraries, tools for processing natural language, and effective indexing systems. The target of the research is too advance information retrieval systems by providing a more advanced and user-focused method of tackling the difficulties presented by the WWW\'s immense scope.

Introduction

I. INTRODUCTION

Today’s digital world, it is important to identify records accurately and quickly. Websites like Google have been gateways to the vast World Wide Web. But in this record sea of ??online data it is becoming increasingly difficulties to find valid and reliable results We have undertaken an ongoing mission using optimization tools techniques though meet the increasing demand for speed and accuracy of record retrieval. The WWW is essentially a network of separate servers and systems connected by various technologies and protocols. Each website is made up of numerous site pages that are created and sent over the server. Thus, a user must type a term so, they may get what they require. A group of terms taken from user-inputted searches is called a keyword. User-provided search input might be syntactically wrong. The real necessity for search engines now arises. A user-friendly interface is prepared with search engines for searching and displaying results. Web crawlers assist in gathering information about a website and the links that lead to it. The only tools we employ to collect data and information from the WWW and save it in our database are web crawlers. An indexer that organizes terms on a webpage and stores the list of terms that come next in a large database. Giving users relevant results for their keywords in response to their searches is Query Engine's main purpose. The query engine's Page Ranking algorithm rates the URL using a multiple of query engine techniques. Using a whole lot of strategies and algorithms to improve person search effects is a part of growing a system learning seek engine. To boom the relevancy of search outcomes based totally on person behaviour and preferences, one method is to appoint system mastering fashions. Collaborative filtering algorithms, as an instance, can look at trends in how users have interaction with search outcomes to indicate extra pertinent content material to users who're much like each different.

II. LITERATURE REVIEW

A. Review Paper on Various Search Engines (Google, Yahoo, Altavista, Ask and Bing)

Search engines are mostly utilized to retrieve matter on the network. Because this network massive reservoir of heterogeneous and unstructured material, search engines must distinguish between helpful and useless information. Crawling, page repository, indexing, querying, and ranking modules are all common elements of search engines. The intercommunication between these modules describes the search engine's functioning technique. The paper targets to evaluate 5 vital SERPS: Google, Yahoo, Altavista, Ask, and Bing. This assessment is provided in a tabular shape, basically focusing on a few competencies of those search engines like google. The features include a search operator, a web search, and a search for photos, all of which uses the algorithm. Page Rank is a numerical metric that calculates a web page's prominence connected with the number of backlinks.

B. Keyword Crawler with a Focus

The most of people using the network is increasing dramatically these days, which puts a great deal of strain on users to search for the websites are pertinent to their needs and of concern. Typically, users approach web sites by searching through a vast hierarchy of concepts that are available, or they use a query to explore web pages through search engines. that are available. In both cases, the outcomes are based on the search patterns, with just a small percentage of results being same to the query and the majority not being. Web crawlers are essential to search engines and are crucial when it enters into performance. This paper shows URL extraction using search criteria or keywords. It retrieves URLs from websites that have the searched term present in their content, deeming those pages as the only ones that should be downloaded; irrelevant webpages are not downloaded. When compared to standard web crawlers, it delivers more optimality and can improve search efficiency with greater precision.

C. Comparative Analysis of HITS, Weighted PageRank, and Simple PageRank Algorithms: Overview

The important areas of research are web mining. Essentially, it is the data mining tool used to uncover hidden matter on internet. We have access to an endless supply of knowledge on the internet, depending on our needs. Recently, search elements have become efficient for everybody who uses the network for web surfing or other online activities. However, as net usage increases, its substance is being stretched hurriedly. The unchecked growth of information material has brought with it the difficult problem of organizing it to reach the requirements of people. Therefore, we have methods to extract or filter information that is highly relevant to the user's query so that, it avoids the difficult situation. This research aims to prevent the difficulties in ranking relevant information for users by examining some of those algorithms and comparing them depend on different characteristics. Simple PageRank, which is stated by link structure and is primarily used by Google, has been conversed here with an appropriate example. Next, Weighted PageRank, which is also based on link structure but uses both forward and backward links to rank the pages, has been explained. Finally, HITS (Hypertext Induced Topic Search), which works on both content and link architecture of the web, has been examined.

D. Comparing Page Ranking Algorithms

Over the years, the most of websites has grown dramatically, and this has resulted in an increase in the volume of data available online. For website owners, it has become increasingly difficult to retrieve the necessary information from the internet to satisfy the expectations of online users. This study compares and examines the insights provided by different ranking systems.

III. EXISTING SYSTEMS

A. Information Retrieval

Information retrieval is the technique of obtaining and imparting applicable facts from a massive series of sources based totally on a user's. It includes a variety of methods and techniques aimed at identifying optimal information preferences to address specific information needs. It involves techniques such as indexing, querying, and relevance ranking to efficiently locate desired information. This enhances the general search, ensuring customers discover the records they need effectively.

B. Keyword based Search Engine

A keyword-based totally search engine is one sort of records retrieval device that permits customers to look for records the usage of particular terms or phrases is a keyword-primarily based seek engine. So that, it return relevant consequences, those search engines index webpages or files after which evaluate user-entered key phrases with the indexed content material. On the net, keyword-based totally search engines like google and yahoo are often used by humans to locate facts speedy and successfully. The search engine returns pages that include the keywords or phrases that the consumer enters once they publish a query. The repeated of key phrases inside the record, the content's relevancy to the question, and the web site or record's authority are few of the variables that the search engine takes under consideration when rating the results.

IV. LIMITATIONS OF EXISTING SYSTEMS

Keyword Ambiguity: Keyword-based totally search engines like google can also battle with ambiguous key phrases which have multiple meanings. Users won't always specific their data needs precisely, main to beside the point consequences.
Limited Context Understanding: Keyword-based SERPS lack the capability to recognize the context of the query absolutely. They might be not aware of the semantics or motive behind the user's question, this results in effects that are not completely applicable.
Over-reliance on Exact Matches: Keyword-based search engines like google typically depend upon actual keyword matches between the user query and indexed content. This can lead to neglected applicable statistics that won't contain the exact keywords however remains associated with the question topic.
Difficulty with Natural Language Queries: Users frequently formulate queries in herbal language, which won't constantly align with precise key phrases. Keyword-based totally search engines like google and yahoo may additionally warfare to interpret such queries appropriately.
Limited Support for Complex Queries: Keyword-based totally SERPS won't efficaciously deal with complicated queries with one standard or more than one situation. Users in search of nuanced records may locate it difficult to retrieve relevant outcomes.

V. SYSTEM REQUIREMENTS

A. Hardware Requirements

Minimum 4GB RAM
Hard Disk: 500GB
Processor: Intel Core i3(min)

B. Software Requirements

Operating system: Windows 10(min)
Coding Language: HTML, CSS, Python
Database: Mysql

C. Libraries

Pymysql
Sklearn
Numpy
Pandas
Nltk
Django

VI. METHODOLOGY

In this project architecture had three principal kinds of alternatives they are Admin, User, and Manager. Here the customers can seek a key-word based totally on that users get the required URLs. In this primarily depend on the user question URLs hyperlinks will come. It is completely net primarily based challenge that offers extraordinary types of gadget Learning and NLP techniques are used.

A. Web crawler

Web crawlers assist in accumulating facts approximately a website and the links related to it. We are solely utilising web crawlers for data collection purposes. We are only using web crawler for collecting data and information from WWW and store it to our database.

B. Indexer

Indexer which arranges each term on every web page and stores the following listing of phrases in a great repository.

C. Query Engine

It is mainly used to answer the person’s keyword and show the powerful final results for their key-word. In query engine.

Modules

a. Admin: Admin will provide authority to managers and customers. In order to facilitate set off the managers and activate the customers. The administrator has access to the information of all customers and bosses. The administrator can get accuracy outcomes for the SVM and XGBoost algorithms.

b. User: User information and undertaking descriptions for the entire experiment. After the person logs into the session, he will receive alternatives. He can seek the whatever unique URL or information. We will search the precise report and additionally, we can acquire rank of the record and the burden by using the TF-IDF approach.

c. Manager: Manager statistics and challenge descriptions for the entire test. Manager can upload the document into the database. We will upload the record with file kind and name.

Conclusion

In conclusion, machine learning tools such as XGBoost for search engineering offers great potential to increase the effectiveness of data acquisition in survey situations Through the power of algorithms that they are adept to learn from data and make predictions to implement on such systems that XGBoost can provide improved accuracy in identifying relevant documents and filtering noise , which is well known for its considerable efficiency and scalability, stands out as a viable technique to increase search engine performance, especially in dealing with massive data and complex queries. As ML generation continues to conform, future improvements in search engine development are in all likelihood to similarly enhance the abilities and accessibility of research information, ultimately advancing the frontiers of expertise across various domains. This paper generally tend to present companion empirical evaluation of XGBoost, a way based totally on gradient boosting that has established to be companion task thinker. Specifically, the overall performance of XGBoost in phrases of coaching pace and accuracy is compared with the overall performance of gradient boosting and random wooded area under a huge choice big choice responsibilities The results of this look at display that the foremost the most, in phrases of the variety of troubles with the handiest overall performance inside the troubles investigated, become gradient boosting. Withal, the differences with relation to XGBoost and to random woodland exploitation the default parameters do not seem to be don\'t appear to be terms of average ranks.

References

[1] Manika Dutta, K. L. Bansal, “A Review Paper on Various Search Engines (Google, Yahoo, Altavista, Ask and Bing)”, International Journal on Recent and Innovation Trends in Computing and Communication, 2016. [2] Gunjan H. Agre, Nikita V.Mahajan, “Keyword Focused Web Crawler”, International Conference on Electronic and Communication Systems, IEEE, 2015. [3] Michael Chau, Hsinchun Chen, “A machine learning approach to web page filtering using content and structure analysis”, Decision Support Systems 44 (2008) 482–494,scienceDirect,2008. [4] ijay Chauhan, Arunima Jaiswal, Junaid Khalid Khan, “Web Page Ranking Using Machine Learning Approach”, International Conference on Advanced Computing Communication Technologies, 2015. [5] Amanjot Kaur Sandhu, Tiewei s. Liu., “Wikipedia Search Engine: Interactive Information Retrieval Interface Design”, International Conference on Industrial and Information Systems, 2014. [6] Neha Sharma, Rashi Agarwal, Narendra Kohli,“Review of features and machine learning techniques for web searching”, International Conference on Advanced Computing Communication Technologies, 2016. [7] Sruthi, P., Premkumar, L.” Attribute-based storage supporting secure de-duplication of encrypted data in cloud”,International Journal of Recent Technology and Engineering, 2019, 7(6), pp. 418–421 [8] Shirisha, N., Bhaskar, T., Kiran, A., Alankruthi, K.” Restaurant Recommender System Based on Sentiment Analysis”,2023 International Conference on Computer Communication and Informatics, ICCCI 2023, 2023 [9] Bhanu, J.S., Bigul, S.D., Prakash, A.” Agricultural internet of things using machine learning”,AIP Conference ProceedingsThis link is disabled., 2021, 2358, 080010 [10] Y.Ambica, Dr N.Subhash Chandra MRI brain segmentation using correlation based on adaptively regularised kernel-based fuzzy C-means clustering Int. J. Advanced Intelligence Paradigms, Vol. 19, No. 2, 2021 ‘

Copyright

Copyright © 2024 Mr. D. Ranadeep Reddy, Dr. P. Sruthi, M. Ganesh, S. Priyanka, M. Shiva. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Download Paper

Paper Id : IJRASET59841

Publish Date : 2024-04-05

ISSN : 2321-9653

Publisher Name : IJRASET

DOI Link : Click Here