Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Srikanth Kulkarni, Ayush Buradkar, Pratiksha Ghadge, Srusti Khainar
DOI Link: https://doi.org/10.22214/ijraset.2023.53467
Web scraping is a powerful method for extracting useful data from websites, which makes it an essential tool in fields such as data analysis, market research, and competitive intelligence. The objective of this project is to provide a reliable and effective web scraping solution that automates the gathering of data from internet sources and offers valuable insights for decision-making. For subsequent analysis, the gathered data is saved in a structured format such as CSV, JSON, or a database. To improve the quality and utility of the retrieved data, the project also includes methods for cleaning and preparing the data. The data is then examined statistically, visually, and with machine learning algorithms to find patterns, trends, and insights that can support well-informed decisions.
I. INTRODUCTION
In the era of data science and engineering, collecting information from websites for analysis is routine. Learning how to scrape web pages can save time and money. Some organisations, such as Twitter, offer APIs that give access to their data in a structured form, but for most websites we must crawl through the pages ourselves to obtain information in an organised form. The fundamental idea behind web scraping is to retrieve information that is already present on a website and transform it into a format that can be used for analysis. Python is one of the most widely used programming languages for data science projects, and scraping the web is easier when BeautifulSoup is used with Python. This paper gives a detailed but fundamental explanation of how to use BeautifulSoup to scrape data in Python, which makes it simple and efficient for data researchers to collect and store information from web pages.
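As a rough illustration of retrieving information from a page and turning it into an analysable format, the sketch below parses a small HTML snippet with BeautifulSoup. The HTML content, the class names (`product`, `name`, `price`), and the function `extract_products` are assumptions made up for this example; a real scraper would download the page first and adapt the selectors to the target site.

```python
from bs4 import BeautifulSoup

# Hypothetical page content. A real scraper would download this
# (for example with urllib or requests) instead of hard-coding it.
SAMPLE_HTML = """
<html><body>
  <ul id="products">
    <li class="product"><span class="name">Laptop</span><span class="price">55000</span></li>
    <li class="product"><span class="name">Phone</span><span class="price">20000</span></li>
  </ul>
</body></html>
"""


def extract_products(html):
    """Return a list of {'name', 'price'} dicts found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Each <li class="product"> element holds one record.
    for item in soup.find_all("li", class_="product"):
        name = item.find("span", class_="name").get_text(strip=True)
        price = item.find("span", class_="price").get_text(strip=True)
        rows.append({"name": name, "price": int(price)})
    return rows


products = extract_products(SAMPLE_HTML)
```

The resulting list of dictionaries can be written to CSV or JSON, or inserted into a database, without further restructuring.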
II. MOTIVATION OF PROJECT
III. ALGORITHM
Let F1 and F2 be ordered forests with a distance metric cost function on nodes, and let v and w be the rightmost nodes of F1 and F2, respectively. The tree edit distance is given by a recursion whose underlying idea is as follows. We always compare the rightmost nodes of the forests, v and w. The comparison branches into three cases that must be examined: delete v, insert w, and relabel v to w. In the delete branch, v is now accounted for, so we remove it from its forest.
Similarly, w is removed from its forest in the insert branch. Relabeling causes us to branch twice, and the pair of relabeled nodes is added to the mapping. To comply with the mapping restrictions, nodes descending from v can then only map to nodes descending from w.
As a result, the forest to the left of v must be compared with the forest to the left of w. The algorithm uses dynamic programming, since the recursion shows that the tree edit distance can be determined by combining answers to subproblems. Because the result is computed bottom-up, each potential subproblem requires its own table entry. The nodes of each subproblem always have consecutive indices because the forests are indexed in postorder.
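The three-way branching described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes unit costs for delete and insert and a relabel cost of 0 for equal labels and 1 otherwise, represents trees as nested tuples, and uses top-down memoization through `functools.lru_cache` instead of the bottom-up table indexed by postorder numbers that the text describes. The names `tree` and `forest_distance` are hypothetical.

```python
from functools import lru_cache


def tree(label, *children):
    """Build a tree as (label, children); a forest is a tuple of trees."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Edit distance between ordered forests f1 and f2 (unit costs assumed)."""
    if not f1 and not f2:
        return 0
    if not f2:
        # Only deletions remain: remove the rightmost root v, keep its children.
        _, v_children = f1[-1]
        return forest_distance(f1[:-1] + v_children, ()) + 1
    if not f1:
        # Only insertions remain: remove the rightmost root w, keep its children.
        _, w_children = f2[-1]
        return forest_distance((), f2[:-1] + w_children) + 1

    (v_label, v_children), (w_label, w_children) = f1[-1], f2[-1]
    # Delete branch: v is removed from its forest.
    delete = forest_distance(f1[:-1] + v_children, f2) + 1
    # Insert branch: w is removed from its forest.
    insert = forest_distance(f1, f2[:-1] + w_children) + 1
    # Relabel branch: descendants of v map only to descendants of w,
    # and the forests to the left of v and w are compared separately.
    relabel = (
        forest_distance(v_children, w_children)
        + forest_distance(f1[:-1], f2[:-1])
        + (0 if v_label == w_label else 1)
    )
    return min(delete, insert, relabel)


# Two small page skeletons that differ by one wrapper element.
page_a = tree("html", tree("body", tree("div", tree("p")), tree("span")))
page_b = tree("html", tree("body", tree("p"), tree("span")))
distance = forest_distance((page_a,), (page_b,))
```

Nested tuples are hashable, which is what allows each pair of subforests to be cached as its own subproblem. For large pages the recursion depth and cache size grow quickly, so a table-based formulation would be the more practical choice.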
IV. LITERATURE SURVEY
V. GOALS AND MOTIVATION
A. Goal
The goal of web data scraping is to extract and gather specific information from websites automatically. It involves using automated scripts or tools to crawl through web pages, locate relevant data, and extract it into a structured format for further analysis or use.
B. Objective
The objectives of web data scraping can vary depending on the specific needs and requirements of the project, but here are some common objectives:
VI. SYSTEM ARCHITECTURE
The user interface lets users interact with the data scraper system. It can be a web-based dashboard, a desktop application, or a command-line interface. Through it, users can enter their scraping requirements, configure scraping rules, monitor scraping progress, and access the extracted data. The job management component handles the scheduling and execution of scraping tasks. It manages the queue of scraping jobs, assigns resources, and ensures efficient utilization of computing resources. Each scraping engine (node) is responsible for visiting the target websites, navigating through web pages, and extracting the desired data based on the configured scraping rules.
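The interplay between the job queue and the scraping engines could look roughly like the sketch below. It is only an illustration: `ScrapeJob`, `scrape`, and `run_jobs` are hypothetical names, the URLs are placeholders, and `scrape` is a stand-in that performs no network access.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class ScrapeJob:
    job_id: int
    url: str


def scrape(job):
    """Stand-in for a scraping engine; a real one would fetch and parse job.url."""
    return {"job_id": job.job_id, "url": job.url, "status": "done"}


def run_jobs(jobs, worker_count=2):
    """Queue all jobs, let worker threads drain the queue, and collect results."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this engine is finished
            record = scrape(job)
            with lock:
                results.append(record)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # Sort so the output order does not depend on thread scheduling.
    return sorted(results, key=lambda record: record["job_id"])


jobs = [ScrapeJob(i, f"https://example.com/page/{i}") for i in range(1, 4)]
results = run_jobs(jobs)
```

All jobs are queued before the workers start, so an empty queue reliably signals that no work is left. A production scheduler would also need retries, rate limiting, and per-site politeness rules.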
The data processing component takes the raw data, cleanses and validates it, performs any necessary transformations, and stores it in a structured format suitable for further analysis. Data processing may involve tasks such as removing duplicates, standardizing formats, and resolving inconsistencies. The processed data is typically stored in a database or data storage system for efficient retrieval and analysis. The system architecture also includes the infrastructure and resources needed to support the data scraper, such as servers, cloud-based computing resources, storage systems, and network components. Scalability and reliability are important considerations to ensure that the system can handle large-scale scraping tasks and accommodate increased user demand.
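A minimal sketch of the processing and storage step, using only the Python standard library, is shown below. The record fields (`name`, `price`), the price formats being normalized, the table name `products`, and the functions `clean_records` and `store_records` are assumptions chosen for illustration; the actual cleaning rules depend on the scraped sites.

```python
import sqlite3


def clean_records(raw_records):
    """Standardize formats, drop invalid rows, and remove duplicates."""
    seen = set()
    cleaned = []
    for record in raw_records:
        # Standardize the name: collapse whitespace and use title case.
        name = " ".join(str(record.get("name", "")).split()).title()
        # Standardize the price: strip thousands separators and a currency prefix.
        price_text = str(record.get("price", ""))
        price_text = price_text.replace(",", "").replace("Rs.", "").strip()
        try:
            price = float(price_text)
        except ValueError:
            continue  # validation: skip rows without a numeric price
        if not name:
            continue
        key = (name, price)
        if key in seen:
            continue  # duplicate of a record already kept
        seen.add(key)
        cleaned.append({"name": name, "price": price})
    return cleaned


def store_records(connection, records):
    """Persist cleaned records in a SQLite table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)"
    )
    connection.executemany(
        "INSERT INTO products (name, price) VALUES (:name, :price)", records
    )
    connection.commit()


raw = [
    {"name": "  laptop ", "price": "55,000"},
    {"name": "Laptop", "price": "55000"},      # duplicate after cleaning
    {"name": "phone", "price": "N/A"},         # invalid price
    {"name": "Tablet", "price": "Rs. 15,000"},
]
connection = sqlite3.connect(":memory:")
store_records(connection, clean_records(raw))
rows = connection.execute(
    "SELECT name, price FROM products ORDER BY name"
).fetchall()
```

An in-memory database keeps the example self-contained; passing a file path to `sqlite3.connect` would persist the table between runs.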
VII. OBJECTIVE
The objectives of web data scraping can vary depending on the specific needs and requirements of the project, but here are some common objectives:
VIII. AREA OF PROJECT
The area of a web data scraping project can vary depending on the specific context and application. Here are some common areas where web data scraping is applied.
The proactive nature of the concept enables the user to anticipate changes in a product's price and plan ahead to purchase it in the future. Because product prices fluctuate greatly, scraping data on an hourly basis gives customers reliable results. The suggested solution meets customer expectations when purchasing products by surfacing the best deal available on e-commerce websites. Automating the web scraping process produces more reliable data while reducing the need for manual labour. Future work could include creating a user interface, a web or mobile application, and web extensions to make the system simple to use; developing models and researching features for improved performance; and increasing the number of testing models to widen coverage and eliminate exceptions, defects, and errors.
Copyright © 2023 Srikanth Kulkarni, Ayush Buradkar, Pratiksha Ghadge, Srusti Khainar. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET53467
Publish Date : 2023-05-31
ISSN : 2321-9653
Publisher Name : IJRASET