Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Mr. R. Ramakrishnan, R. Sujithra, A. Joice Niranjalin
DOI Link: https://doi.org/10.22214/ijraset.2024.62103
Certificate: View Certificate
Large data collections with a more intricate and diversified structure are referred to as \"big data.\" These traits are typically associated with more challenges when it comes to storing, analyzing, and applying additional procedures or getting results. The technique of analyzing vast quantities of intricate data to uncover patterns or connections that are concealed is known as \"big data analytics.\" Nonetheless, there is a clear conflict between big data\'s extensive use and its security and privacy. This paper examines privacy needs in big data, distinguishes between privacy and security, and focuses on privacy and security issues in large data. This study addresses the applications of privacy in business by utilizing established techniques like HybrEx, k-anonymity, T-closeness, and L-diversity. Many privacy-preserving techniques have been created for protection of privacy at various phases of the big data life cycle (such as data generation, storage, and processing). This work aims to explain the issues facing current privacy preservation strategies in big data and to give a comprehensive evaluation of them. This study also introduces new approaches to privacy preservation in big data, such as quick anonymization of massive data streams, identity-based anonymization, differential privacy, privacy preserving big data publishing, and hiding a needle in a haystack. This study discusses big data privacy and security issues in healthcare. A comparative analysis of some contemporary large data privacy approaches is also conducted. A comparative analysis of the latest big data privacy approaches is also conducted.
I. INTRODUCTION
The emergence of Internet applications, social networks, and the Internet of Things has led to a substantial increase in data generation, commonly referred to as big data. Big data is characterized by its 5V properties: Volume, Velocity, Variety, Value, and Veracity, stemming from its generation from diverse sources, in various formats, and at high speeds. However, these characteristics also introduce new challenges throughout the lifecycle of big data, particularly concerning privacy and five key security aspects: confidentiality, efficiency, authenticity, availability, and integrity. Confidentiality stands as the cornerstone of big data security and privacy, necessitating protection against data leaks. Any breach in data security results in the loss of its value. Big data's value diminishes significantly if hackers tamper with or access confidential information. Efficiency plays a pivotal role in big data security and privacy due to the demand for high network bandwidth. Authenticity ensures the reliability of data sources, processors, and authorized data recipients, safeguarding against inaccurate analysis and maximizing the potential value of big data. The availability of big data is paramount, as its value diminishes if inaccessible when needed. Furthermore, integrity is vital for obtaining accurate and valuable data. Inaccurate or incomplete data compromises the integrity of analyses, especially when the missing data is sensitive and crucial. Today, big data finds extensive utilization across various sectors such as healthcare, government agencies, businesses, research, and other organizations for analysis purposes. However, utilizing this data often involves its publication, investigation, and other applications, raising concerns about privacy. Since big data contains individuals' specific information, directly releasing it for analysis poses significant risks to user privacy. Therefore, techniques for privacy-preserving big data mining are indispensable, aiming to mitigate identity and sensitive information disclosure within datasets.
II. LITERATURE REVIEW
Aggarwal, C. C. (2014). In this paper, Aggarwal provides a comprehensive exploration of big data privacy issues from a technological perspective. He examines various challenges and solutions related to preserving privacy in the era of big data analytics. Kitchin, R. (2014). Kitchin's work focuses on the broader implications of big data, including its epistemological and paradigmatic shifts.
He offers insights into how big data affects various aspects of society and knowledge production. Boyd, D., & Crawford, K. (2012). Boyd and Crawford present critical questions and provocations regarding big data, emphasizing its cultural, technological, and scholarly impacts. They raise important considerations for researchers and practitioners navigating the big data landscape. Mayer-Schönberger, V., & Cukier, K. (2013). The authors discuss the transformative potential of big data in their book, highlighting its implications for various domains such as healthcare, business, and governance. They explore the opportunities and challenges associated with harnessing big data. Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). This article discusses the pitfalls of big data analysis using the example of Google Flu Trends. It underscores the importance of critical scrutiny and validation in interpreting big data insights. Floridi, L. (2012). Floridi contributes to the discourse on big data by addressing its epistemological challenges. He reflects on the nature of knowledge production in the context of vast amounts of data and the implications for understanding the world. Dhar, V. (2013). Dhar's work delves into the domain of data science and prediction, offering insights into the methodologies and techniques used in analyzing big data. He discusses the role of predictive analytics in extracting value from large datasets. Chen, M., Mao, S., & Liu, Y. (2014). This survey paper provides an overview of big data technologies and applications. It covers various aspects such as data processing, analytics, and challenges in managing and deriving insights from big data. Zikopoulos, P., Eaton, C., DeRoos, D., Deutsch, T., & Lapis, G. (2012). The authors present a comprehensive guide to understanding big data technologies, with a focus on Hadoop and streaming data analytics. They offer practical insights for organizations leveraging big data platforms. Raghupathi, W., & Raghupathi, V. (2014). This paper explores the promise and potential of big data analytics in healthcare. The authors discuss applications such as predictive modeling, personalized medicine, and healthcare management using large healthcare datasets.
III. NAVIGATING THE PRIVACY AND SECURITY LANDSCAPE OF BIG DATA
Privacy and security concerns in big data arise due to the sheer volume, variety, and velocity of data being generated, collected, processed, and stored. Some key concerns include:
Overall, addressing privacy and security concerns in big data requires a holistic approach, combining technical solutions, robust governance frameworks, and adherence to regulatory requirements to ensure the responsible and ethical use of data.
IV. ENSURING PRIVACY COMPLIANCE
Innovative approaches for ensuring privacy compliance extend to all phases of the ETL (Extract, Transform, and Load) process:
V. DE-IDENTIFICATION
De-identification has long been employed as a method for preserving privacy in data mining, involving the anonymization of data through generalization and suppression techniques. To bolster traditional approaches, concepts like k-anonymity, l-diversity, and t-closeness have been introduced to thwart re-identification risks. While de-identification remains pivotal in privacy protection, its efficacy in the realm of big data analytics faces challenges, particularly in addressing flexibility and effectiveness concerns, as well as heightened risks of re-identification due to the vast external data available. Developing efficient algorithms is crucial to enhance de-identification's suitability for privacy-preserving big data analytics. Common terms in the privacy domain include identifier attributes (e.g., full name), quasi-identifier attributes (e.g., age, gender), sensitive attributes (e.g., medical conditions), insensitive attributes (e.g., general information), and equivalence classes (sets of records sharing identical quasi-identifier values).
A. K-Anonymity
K-Anonymity offers a significant level of privacy protection by thwarting re-identification attempts, thereby ensuring highly accurate and secure data analysis. This privacy model serves to deter linking attacks by anonymizing datasets. A dataset achieves K-Anonymity when an individual's data in the published dataset is indistinguishable from at least k-1 other individual’s data. Consequently, an attacker cannot differentiate a specific record from the k-1 other individuals' records solely based on their quasi-identifier attributes.
Two primary techniques are proposed to reinforce K-anonymity in private datasets:
Both techniques uphold the integrity of the data while bolstering K-Anonymity.
a. Suppression: This approach involves replacing certain attribute values with a placeholder symbol, such as '@'. Either all or selected values within a column can be substituted with '@'. In the anonymized Table 1, all entries in the 'Name' attribute and each value in the 'Postal code' attribute are replaced with '@'.
b. Generalization: In this method, specific attribute values are substituted with broader categories. For example, the value '19' in the 'Age' attribute might be replaced with (20,30), while '23' could be replaced with (20,30), and so forth.
Table 1: Healthcare Information Dataset
KEY ATTRIBUTE |
QUASI – IDENTITIERS ATTRIBUTE |
SENSITIVE ATTRIBUTE |
|||||
NAME |
AGE |
SEX |
POSTAL CODE |
DISEASE |
|||
Seetha |
35 |
Female |
23120 |
Chickenpox |
|||
Ram |
38 |
Male |
32103 |
COVID-19 |
|||
Varsha |
33 |
Female |
42182 |
Measles |
|||
Manasi |
30 |
Female |
13216 |
Meningitis |
|||
Siva |
40 |
Male |
21425 |
Viral hepatitis |
|||
Sam |
32 |
Male |
61321 |
COVID-19 |
|||
Aruvi |
37 |
Female |
21835 |
Measles |
|||
Naveen |
45 |
Male |
32167 |
Common cold |
|||
Nikitha |
42 |
Female |
42896 |
Chickenpox |
|||
TABLE 2: Publicly available LICENSE NUMBER Dataset
DRIVER’S LICENSE NUMBER |
NAME |
AGE |
SEX |
POTAL CODE |
DL-4839201437832 |
Seetha |
35 |
Female |
23120 |
DL-5647392745362 |
Ram |
38 |
Male |
32103 |
DL-7854325647834 |
Varsha |
33 |
Female |
42182 |
DL-5463728987654 |
Manasi |
30 |
Female |
13216 |
DL-3647589230123 |
Siva |
40 |
Male |
21425 |
DL-8364758320946 |
Sam |
32 |
Male |
61321 |
DL-2647364578964 |
Aruvi |
37 |
Female |
21835 |
DL-5467382987564 |
Naveen |
45 |
Male |
32167 |
DL-6473823456782 |
Nikitha |
42 |
Female |
42896 |
Upon comparing Table 1 and Table 2, the attacker could discern that Seetha is afflicted with Chickenpox. Despite the removal of key identifiers, an individual can still be identified by cross-referencing with publicly available data. This practice of combining released table data with publicly accessible information constitutes a Linking Attack. Thus, this privacy model serves to thwart such linking attacks. Consequently, attempting to identify an individual from a release provides only limited information, such as age, gender, and zip code. Table 3 below exemplifies a 2-anonymous table, where k=2, signifying that at least two tuples share identical values in the quasi-identifier attributes.
TABLE 3: 2-Anonymized Dataset
AGE RANGE |
SEX |
POSTAL CODE |
DISEASE |
(30 ,40) |
Female |
231@ |
Chickenpox |
(30 ,40) |
Male |
321@ |
COVID-19 |
(30 ,40) |
Female |
421@ |
Measles |
(30 ,40) |
Female |
132@ |
Meningitis |
(30 ,50) |
Male |
214@ |
Viral hepatitis |
(30 ,40) |
Male |
613@ |
COVID-19 |
(30,40) |
Female |
218@ |
Measles |
(30 ,50) |
Male |
321@ |
Common cold |
(30 ,50) |
Female |
428@ |
Chickenpox |
K-anonymous data, despite its protective measures, remains vulnerable to various attacks such as unsorted matching, temporal, and complementary release attacks. However, there's a silver lining: the introduction of a greedy O(k log k)-approximation algorithm offers an optimal solution for achieving k-anonymity through entry suppression. This algorithm aims to minimize the information withheld while ensuring that individuals remain anonymous within groups of size k. Nonetheless, achieving this privacy level is a complex task, as it involves rendering private records k-anonymous while withholding minimal information, an optimization problem proven to be NP-hard. Moreover, when attributes are suppressed instead of individual entries, the problem becomes even more challenging, also falling under the NP-hard category. Consequently, there's a shift towards adopting the L-diversity strategy for data anonymization.
B. L-Diversity
The l-diversity model (Distinct, Entropy, Recursive) extends the k-anonymity model by decreasing data granularity through generalization and suppression methods, ensuring that each record maps onto at least k different records. Addressing weaknesses in the k-anonymity model, l-diversity ensures that protecting identities to the level of k-individuals also safeguards corresponding sensitive values, particularly when sensitive values within a group are homogenous. The model prioritizes intra-group diversity for sensitive values in anonymization. However, its effectiveness depends on the range of sensitive attributes, when the sensitive attribute lacks diversity, fictitious data may need to be introduced to achieve L-diversity. While this enhances security, it can complicate analysis. Moreover, the l-diversity method is vulnerable to skewness and similarity attacks, limiting its ability to prevent attribute disclosure.
C. T-Closeness
T-closeness, an enhancement of l-diversity group-based anonymization, is utilized to safeguard privacy in datasets by reducing data granularity. This reduction, however, involves a trade-off, sacrificing some data management or mining algorithm efficacy to bolster privacy. The t-closeness model (Equal/Hierarchical distance) builds upon l-diversity by treating attribute values individually and considering the distribution of data values for each attribute. An equivalence class achieves t-closeness if the divergence of a sensitive attribute within this class from the attribute's distribution across the entire table is less than a specified threshold t. A table is deemed to have t-closeness when all equivalence classes exhibit t-closeness. The primary advantage of t-closeness lies in its ability to thwart attribute disclosure. However, as data size and diversity increase, the risk of re-identification also rises. The brute-force approach, which explores every potential table partition to find the optimal solution, requires significant computational time. Improvements have been made to this approach, reducing the computational complexity to a single exponential in n, albeit remaining far from polynomial time complexity.
VI. ENHANCING PRIVACY MODELS WITH MAPREDUCE FRAMEWORK
A proposed approach leverages the MapReduce framework for efficient data processing, particularly for handling large and distributed datasets. Data is partitioned into chunks processed by separate mappers, with outputs consolidated by reducers to produce final results.
MapReduce-based k-Anonymity: Our algorithm, akin to Mondrian, ensures insensitivity to data distribution across mappers. By splitting each equivalence class into (at most) q classes in each iteration, we enhance generality and reduce iterations.
MapReduce-based l-Diversity: Extending privacy models from k-anonymity to l-diversity involves integrating sensitive values into mapper outputs. Unlike k-anonymity, mappers in l-diversity receive both quasi-identifiers and sensitive attributes as input.
VII. ANONYMIZING BIG DATA STREAMS
Anonymizing big data streams poses unique challenges due to the need for real-time processing and the sheer volume of data. Existing k-anonymity approaches, suitable for static data, are not directly applicable to data streams due to their NP-hard nature and the impracticality of repeated scanning. The FADS algorithm, designed for data stream anonymization, faces limitations in handling large volumes and sequential processing.
To address these challenges, a new approach called FAST is proposed. FAST leverages parallelism to enhance the efficiency of the FADS algorithm and introduces a proactive heuristic to prevent publishing expired tuples. Experimental results demonstrate that FAST outperforms FADS and other existing algorithms, reducing information loss and cost metrics during the anonymization process.
In conclusion, the rapid growth and utilization of big data bring significant challenges in terms of privacy and security. While big data analytics offer immense potential for uncovering hidden patterns and insights, they also raise serious concerns regarding the protection of personal and sensitive information. This paper highlights the importance of privacy in big data and distinguishes it from security, emphasizing the need for comprehensive strategies to address both aspects effectively. Various privacy-preserving techniques, such as HybrEx, k-anonymity, T-closeness, and L-diversity, have been developed to safeguard privacy at different stages of the data lifecycle. However, despite the existence of these techniques, challenges persist in preserving privacy in big data environments. The paper identifies limitations and shortcomings of current privacy preservation strategies and introduces innovative approaches to address them. Through a comparative analysis of contemporary big data privacy approaches, the paper underscores the need for ongoing research and development to enhance privacy protection in the era of big data. By advancing privacy-preserving techniques and implementing robust privacy policies, we can harness the benefits of big data while safeguarding individual privacy rights and ensuring ethical data practices.
[1] Narayanan, A., & Shmatikov, V. (2008). Robust de-anonymization of large sparse datasets. In Security and Privacy, 2008. SP 2008. IEEE Symposium on (pp. 111-125). IEEE. [2] Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05), 557-570. [3] Li, N., Li, T., & Venkatasubramanian, S. (2007). t-closeness: Privacy beyond k-anonymity and l-diversity. In Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on (pp. 106-115). IEEE. [4] Vaidya, J., & Clifton, C. (2002). Privacy-preserving k-means clustering over vertically partitioned data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 206-215). [5] Aggarwal, C. C. (2005). On k-anonymity and the curse of dimensionality. In Proceedings of the 31st international conference on Very large data bases (pp. 901-909). [6] Fung, B. C. M., Wang, K., Chen, R., & Yu, P. S. (2010). Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys (CSUR), 42(4), 1-53. [7] Gehrke, J., & McSherry, F. (2009). The case for privacy-aware data systems. Communications of the ACM, 52(4), 103-108. [8] El Emam, K., Dankar, F. K., Issa, R., Jonker, E., Amyot, D., Cogo, E., ... & Fineberg, A. (2009). A globally optimal k-anonymity method for the de-identification of health data. Journal of the American Medical Informatics Association, 16(5), 670-682. [9] Xiao, X., & Tao, Y. (2006). Anatomy: Simple and effective privacy preservation. In Proceedings of the 32nd international conference on Very large data bases (pp. 139-150). [10] Oliveira, R., Santos, C., & Rodrigues, J. J. (2015). Differential privacy in big data: A state-of-the-art survey. IEEE Access, 3, 1671-1679. [11] Xu, J., Li, D., & Yu, S. (2014). A survey of privacy-preserving data publishing in the big data era. IEEE Transactions on Big Data, 1(1), 1-16.
Copyright © 2024 Mr. R. Ramakrishnan, R. Sujithra, A. Joice Niranjalin. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET62103
Publish Date : 2024-05-14
ISSN : 2321-9653
Publisher Name : IJRASET
DOI Link : Click Here