Big Data Technologies Overview Survey

The document provides a comprehensive overview of big data technologies, emphasizing the rapid evolution and challenges associated with managing large datasets. It discusses key concepts such as volume, velocity, variety, and the importance of tools like Hadoop and MapReduce for efficient data processing and storage. The paper also highlights various applications of big data across sectors, including healthcare, social media, and transportation, while addressing the need for innovative analytics methods and data security solutions.



A Comprehensive Overview of BIG DATA Technologies: A Survey

Conference Paper · May 2020


DOI: 10.1145/3404687.3404694



A Comprehensive Overview of BIG DATA Technologies –
A Survey
Muhammad Umair Raza
Southwest University of Science and Technology, Mianyang, P.R. 621010. China.
umair2007pak@[Link]

Zhao XuJian
Southwest University of Science and Technology, Mianyang, P.R. 621010. China.
jasonzhaoxj@[Link]

ABSTRACT
With the arrival of the new revolution, machines and communication media such as social media sites are causing the quantity of data to swell rapidly. Size is therefore the first, and often the only, facet that comes to mind when BIG DATA is mentioned. This article attempts a comprehensive view of big data technologies, because data has evolved so swiftly in industry that the academic press is still trying to catch up. The paper also offers a unified explanation of big data and of analytics methods. A distinguishing feature of the paper is its focus on analytics for unstructured data, which makes up more than 90% of big data. A great deal of work has been done to deal with complicated Big Data problems, and this paper analyzes contemporary Big Data technologies. It further reinforces the need to devise new tools for analytics. It offers not only a global overview of big data techniques but also an evaluation based on the big data Hadoop ecosystem. It also classifies and discusses the features, challenges, and usage of the main technologies.

CCS Concepts
• Information systems➝Information systems applications

Keyword
Big Data Technology; Apache Hadoop; HDFS, MapReduce

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@[Link].
ICBDC 2020, May 28–30, 2020, Chengdu, China
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7547-4/20/05…$15.00
DOI: [Link]

1. INTRODUCTION
This article covers the basic concepts of big data technologies. The unexpected growth of data has left many unprepared. The quantity of data is evolving fast, and both the public and private sectors are willing to embrace the concept. The shift of the big data discourse to more popular outlets shows that a clear understanding of the concept and its terminology still has to be established [4]. For instance, the primary question is: how did data come to be called BIG DATA? Thus, big data ideas and techniques need to be documented in the academic press. Nowadays, huge volumes of data are generated systematically from varying sources such as black-box data, social media, stock exchanges, etc. Before the big data revolution, organizations could neither keep their archives for long periods nor manage huge data sets efficiently. Traditional equipment had inadequate storage capacity. In the context of big data, scalability, flexibility and performance are required. Indeed, managing big data needs significant resources, innovative methods, and technologies. In addition, big data has to be cleaned, processed and secured, and access has to be provided to vast, evolving data sets [30].

As a result of diverse big data projects worldwide, different big data models and new technologies and frameworks have been developed to provide more storage, parallel processing and real-time analysis of data from varied sources. Meanwhile, the latest solutions for data security and privacy have evolved. Besides, thanks to sustained technological advancement, the cost of hardware for storage and processing is continuously falling.

Different software and hardware technologies have been built to study big data. They aim to deliver more reliable results for big data applications. However, choosing among the many techniques in such an environment can be time-consuming and laborious. There are many big data surveys, but most of them focus on the algorithms and methods used to store and process big data rather than on technologies.

In this paper we discuss big data technologies comprehensively. We classify and compare them in depth according to their storage, processing, challenges, and features. This view helps in understanding the links among the various big data technologies as well as their functionalities.

2. PAST WORK ON BIG DATA/ REVIEW OF LITERATURE
The term big data describes immense, growing data sets with heterogeneous formats: structured data (relational data, etc.), semi-structured data (XML data) and unstructured data (PDF, text, media logs, etc.). Big data has a complex nature that requires more powerful technologies than traditional databases. Consequently, the standard static business intelligence techniques are no longer efficient for big data applications.

2.1 Characteristics by Which Most Data Experts and Scientists Describe Big Data
Volume: A massive volume of digital data is continuously generated by millions of computers and billions of applications (smartphone data, barcode data, social media data, sensors, etc.). According to [46], an estimated 2.5 exabytes were produced every 24 hours in 2012. On the
other hand, this amount doubled in 2013. In 2013 the International Data Corporation (IDC) estimated that 4.4 zettabytes (ZB) of digital data were created, replicated and consumed. According to IDC's projection, the size of data will rise to nearly 40 zettabytes in 2020, an increase of 400 times.

Velocity: Data is generated rapidly from multiple sources and should be processed rapidly to obtain useful information. For instance, Walmart (an international discount retail chain) generates more than 2.5 petabytes of data every hour from its customers' transactions. YouTube and Facebook are also major creators of big data.

Variety: Big data is created in distinct formats (audio, video, comments, documents) by several distributed sources. Massive data sets consist of structured, unstructured and semi-structured data, which may be public or private, local or remote, shared or confidential, complete or incomplete, etc. [52]. In addition to the previously defined V's, some other dimensions have been proposed [31]; they are described below:

Veracity: IBM coined a fourth V, veracity, which represents the unreliability inherent in some sources of data. For example, customer sentiment in social media is uncertain because it involves human judgment, yet it holds valuable information. The need to deal with imprecise and uncertain data is therefore another facet of big data, which is addressed with tools and analytics developed for handling uncertain data.

Variability: In the context of big data, variability refers to a few different things. One is the number of inconsistencies in the data [15]. These must be found with anomaly detection techniques before any meaningful analysis can start.

Validity: In contrast to veracity, validity indicates how correct the data is for its intended use. According to Forbes, more than 60% of data scientists spend time cleaning data before analyzing it. The benefit of big data analytics is only as good as its underlying data, so good data governance practices are needed to guarantee data quality, shared definitions and metadata [26].

Vulnerability: Big data brings new security concerns. After all, a data breach involving big data is a big breach. Sadly, many big data breaches have occurred. In May 2016 a hacker called "Peace" posted data for sale on the dark web, which allegedly included approximately 167 million LinkedIn accounts and 360 million e-mails and passwords of Myspace users.

Volatility: To consider volatility, we need to take the volume, variety, and velocity of the data into account. Volatility refers to how long data should be stored. In other words, we need to determine at which point data is no longer relevant to the current analysis. Because of the volume, velocity, and variety of big data, it is essential to understand its volatility. For some sources the same data will always be available, but for others this will not be the case [39].

Visualization: Big data visualization tools face technical difficulties caused by the limitations of in-memory technology and by poor scalability, functionality and response time. When attempting to plot thousands of data points, we cannot use traditional graphs; we need different ways of representing the data, such as data clustering, tree maps, sunbursts, parallel coordinates, and circular network diagrams.

Value: Value is the most salient characteristic of big data technology. The potential future value of big data is enormous. Having access to big data is valuable, but implementing IT infrastructure systems to store huge volumes of data is very expensive [35].

2.2 Application of Big Data
Big data techniques are widely and extensively used. They serve purposes such as search engines, transportation & logistics, data storage, video & picture analysis, telecommunications, web & social media, medicine & healthcare, science & research, as well as social life. A few of the prominent applications are discussed below [55].

Transportation and logistics: Public transport operators use RFID and GPS to track buses and to explore interesting data that can improve their services. For example, data collected on the number of passengers on buses on different routes is used to optimize bus routes and the frequency of trips. Data mining also helps improve the travel business by forecasting demand on public and private networks [58]. For illustration, India has one of the busiest railway systems: nearly 250,000 seats are reserved every single day, and reservations can be made almost 60 days in advance. Making predictions from such data is difficult because it depends on factors such as festivals, weekends, etc. With machine learning techniques and big data technologies, it is feasible to mine both historical and new data and to apply advanced analytics to it.

Healthcare and Medicine: Big data technologies are helpful for storing medical records. Data can be captured from the multiple sensors and equipment attached to patients, and it is also generated by heterogeneous sources such as laboratory and clinical data, hospital operations and pharmaceutical data [50][77]. Medical data sets have numerous beneficial applications because healthcare data is well suited to big data processing and analytics. Recently, several areas of healthcare have been identified that can benefit directly from such processing [49][47].

Social Media Analysis: IBM introduced sentiment analysis to uncover hidden insights from millions of web sources. Companies use it to gain a better understanding and assessment of their customers. It captures consumer information from social media in order to predict customer behavior and support campaigns [66][18].

Science and Research: Science and research are now driven by technology, and big data adds new perspectives. CERN (the European Organization for Nuclear Research) launched the Large Hadron Collider (LHC), the largest and most powerful particle accelerator [32]. The experiment produces enormous amounts of data. The CERN data center has 65,000 processors analyzing 30 petabytes of data, and its computing power is distributed over thousands of computers in 150 data centers around the world.

Politics Analysis: Big data analytics helped Barack Obama win the US presidential election in 2012 [73]. His campaign included an analytics team of 100 members who sifted through many terabytes of data. For its analytical databases, the campaign used a combination built around HP Vertica, a massively parallel system.

3. BIG DATA-APACHE HADOOP AND MAPREDUCE (THE ARCHITECTURE OF BIG DATA TECHNOLOGY)
Hadoop (expanded by some as "Highly Archived Distributed Object-Oriented Programming", although the name is not an acronym) was developed in 2005 by Doug Cutting and Mike Cafarella. Doug Cutting chose the name because it was the name of his son's toy elephant. It is an open-source software framework that provides reliable, scalable, distributed computing to organizations [45]. It handles enormous amounts of many types of data from distinct sources, such as pictures, videos, audio, files, software, sensor records and communication data [43]. Hadoop's primary benefit is its ability to process big data sets quickly. Unlike traditional approaches, Hadoop does not copy the entire data set into memory to perform computations. As a result, even terabytes of data can be queried in a short time. A further strength of Hadoop is that it keeps working while ensuring tolerance of the faults normally found in distributed environments [52][67][76].

The capability of Hadoop rests on two major components: (i) the Hadoop Distributed File System (HDFS) and (ii) MapReduce (MR) [33]. Moreover, users can build modules on top of Hadoop according to their application requirements and objectives. These modules are known as the Hadoop ecosystem.

3.1 Hadoop Distributed File System (HDFS)
To store data, Hadoop relies on its file system, HDFS, and on a non-relational database called HBase. HDFS is an entirely file-oriented system that provides high-performance, efficient access to data while running on commodity hardware. It keeps numerous replicas so that data can be found easily and returned to the user swiftly [63]. One of the main reasons for creating these replicas is to keep the data available at all times, so that nothing stops if some node fails. Put simply, every block of data in Hadoop must be replicated [5][74]. There are five major components of HDFS: (i) Name node (ii) Data node (iii) Secondary Name node (iv) Job Tracker (v) Task Tracker.

3.1.1 Name node
The Name Node is considered the core of the HDFS file system, as it holds the metadata about the user's data. It does not store the physical data itself, but it keeps all the information needed to reassemble the split data during a read operation [33]. The availability of a Hadoop cluster depends heavily on the Name Node, because all the metadata is present only on the Name Node. On the Name Node server, each file and directory is represented as an inode holding attributes such as access time, modification time, file/directory permissions and block size. When performing a read operation, the HDFS client first contacts the Name Node to collect the appropriate inode information and then accesses the relevant Data Nodes to fetch the actual user data. The Name Node is therefore also called a single point of failure.

3.1.2 Data node
Data Nodes in Hadoop are primarily responsible for the creation, replication, and deletion of data files. Huge data files are first broken into small blocks, coordinated by the Name Node, and then stored on the selected Data Nodes. The Name Node tracks all the metadata about the partitioned blocks stored on the Data Nodes [72]. Once data has been saved successfully on a Data Node, it is replicated to more than one backup node known to the HDFS client. If the HDFS client fails to obtain a file block from the primary Data Node, either because that node is too busy serving other clients or because it is down, it contacts the corresponding backup Data Node to retrieve the data. It is important to remember that the Name Node does not communicate with a Data Node directly, but only through the heartbeats that the Data Node regularly sends to the Name Node.

3.1.3 Secondary name node
This node assists the master node. When the Name Node performs certain actions, it creates a checkpoint and saves it on the Secondary Name Node. If the master node dies or runs into a problem, it is restarted and contacts its Secondary Name Node to fetch the checkpoint and recover its prior state. Secondary Name Nodes thus provide a high degree of fault tolerance [29].

3.1.4 Job tracker
The Job Tracker talks to the Name Node to determine where the data is located. The Job Tracker schedules the individual map, reduce, and intermediate merge operations. It monitors whether these individual tasks succeed or fail, and it works to complete the whole job. If a task fails, the Job Tracker restarts it automatically, possibly on another node, up to a predefined limit of retries [15].

3.1.5 Task tracker
While the Job Tracker supervises the overall execution of a MapReduce job, the Task Trackers handle the execution of individual tasks on each slave node. Each slave node contains a single Task Tracker, but each Task Tracker can create several Java Virtual Machines (JVMs) to handle multiple map or reduce tasks in parallel. At short intervals, the Task Trackers also send heartbeat messages to the Job Tracker to reassure it that they are still alive [64].

Figure 1. Data storage in HDFS.
The client sends a request to the Name Node to store data, and the Name Node responds by granting permission to the client. A Data Node accepts the data from the client and acknowledges it. The Data Node stores the data, which is replicated on 2 other Data Nodes, and each Data Node sends a block report as well as a heartbeat to the Name Node at short intervals. The Name Node plays a vital role and is also a single point of failure. Its metadata holds all the information about storage.

3.2 HBase
HBase is a fully non-relational, open-source, distributed database built on Hadoop. It is intended exclusively for low-latency operation. HBase is a column-oriented key/value database [57]. It can support high table-update rates and scale horizontally across distributed clusters. Furthermore, it offers a flexible layout for large tables, much like the BigTable format [8]. Logically, data is stored in table format. The benefit of such tables is that
millions of rows and columns can be processed. HBase tables are known as Hstore.

HBase offers numerous features such as real-time queries, natural language search, linear and modular scalability, and automatic and configurable sharding of tables [28]. It is used by many data-driven sites, such as Facebook's messaging platform [2].

Figure 2. Architecture of HBase.

The HBase architecture has 3 main components: HMaster, Region Server and ZooKeeper.

HMaster: HMaster is the implementation of HBase's master server. It is the process that assigns regions to Region Servers and handles DDL operations (creating and deleting tables). It tracks all the Region Server instances present in the cluster. In a distributed system, the master runs multiple background threads. HMaster provides several features such as load balancing and failover.

Region Server: HBase tables are divided horizontally into regions by row key range. Regions are the basic building blocks of an HBase cluster; they are the unit of distribution and are made up of column families. A Region Server runs on an HDFS Data Node in the Hadoop cluster. A Region Server is responsible for several things, such as handling, managing, and executing read and write HBase operations on its set of regions. A region has a default size of 256 MB.

ZooKeeper: It acts like a coordinator for HBase. It provides services such as maintaining configuration information, naming, distributed synchronization, and notification of server failures. Clients communicate with Region Servers through ZooKeeper.

3.3 MapReduce (MR)
MapReduce has become ubiquitous for processing large-scale data. Its open-source Hadoop implementation is widely adopted by organizations ranging from two-person start-ups to Fortune 500 companies [1]. It lies at the core of an emerging stack for data analytics that is supported by industry heavyweights such as IBM, Microsoft, and Oracle. The advantages of MapReduce include the capacity to scale horizontally to high volumes of data on thousands of commodity servers, easy-to-understand programming semantics, and a high degree of fault tolerance [41]. It is the first crucial step toward the next generation of management and analysis tools for big data. MapReduce has compelling advantages for big data applications. In fact, it simplifies the processing of gigantic volumes of data through its effective and cost-efficient mechanism. It makes it possible to write programs that can be processed in parallel [52][59]. The MapReduce programming model relies on two functions to handle processing: the Map function and the Reduce function [34].

First, the input data is split into independent partitions that are represented as key/value pairs. Then the MapReduce framework sends the key/value pairs independently to mappers across the cluster, where they are handled by several parallel map tasks. Each mapper may produce multiple intermediate key/value pairs. At this stage, the framework is responsible for collecting and sorting all the intermediate key/value pairs. As a result, each key is associated with a list of related values.

Next, the Reduce function is applied to process the whole intermediate output. For every single key, the Reduce function aggregates the values according to a pre-defined program (e.g., filtering, summarizing, sorting, hashing, or taking the average or maximum). One or more key/value pairs are then generated [20].

Finally, MapReduce stores all the output key/value pairs in the output folder.

Figure 3. Workflow/Architecture of MapReduce.

3.4 YARN
YARN is more general than MapReduce. Compared to MapReduce, it provides more scalability and parallelism and improves the management of resources. It also provides operating-system-like features for big data analytics. The YARN resource manager has changed the Hadoop architecture. In general, YARN operates on top of HDFS. This position enables different applications to run in parallel [33]. It also allows batch as well as interactive processing to be handled in real time.

In contrast to MapReduce, YARN improves efficiency by splitting the Job Tracker's two primary functions into two separate daemons [42]: (1) the Resource-Manager (RM), which allocates and manages the cluster's resources, and (2) the Application-Master (AM), which schedules, coordinates and monitors the execution of its tasks with the Task Trackers [80].
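Whether tasks are scheduled by the classic Job Tracker or by YARN, the work they carry out is the map, sort and reduce flow of Section 3.3. The following is a minimal single-process sketch of that flow, using word counting as the pre-defined program. It is only an illustration: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are assumptions made for this example, and it does not use the Hadoop API or run anything in parallel.

```python
from collections import defaultdict


def map_phase(record):
    """Map: turn one input record (a line of text) into (key, value) pairs."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Collect and sort the intermediate pairs so each key has a list of values."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce: aggregate the list of values of one key (here, by summing)."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence (hypothetical driver function)."""
    # On a real cluster each record would be handled by a separate map task.
    intermediate = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Illustrative input only; in Hadoop the records would be read from HDFS.
counts = run_job(["big data needs big tools", "Hadoop stores big data"])
print(counts)
```

In a real cluster the map calls would run as parallel tasks on different nodes and the grouped keys would be partitioned across several reducers; the sketch keeps everything in one process so that the three phases are easy to follow.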
Figure 4. Workflow/Architecture of Hadoop YARN.

1. The client submits an application.
2. The resource manager allocates a container to start the application manager.
3. The application manager registers itself with the resource manager.
4. The application manager negotiates containers from the resource manager.
5. The application manager notifies the node manager that containers should be launched.
6. The application code is executed within the container.
7. The client contacts the resource manager to monitor the status of the application.
8. Upon completion of the processing, the application manager un-registers with the resource manager.

4. HADOOP ECO-SYSTEM
The Apache Software Foundation supports various other projects associated with Hadoop. Each project addresses a specific aspect of big data and supplements the services Hadoop provides. Together, the projects associated with Hadoop are called the Hadoop ecosystem [75]. They are described below.

Cassandra: It is a scalable database that offers high availability and supports multi-master operation to avoid single points of failure. MapReduce can retrieve data from Cassandra. It is a big data database that can run without HDFS. It draws on both Google BigTable and the Google File System [22][7].

Hive: It is a data warehouse infrastructure that offers data summarization, ad-hoc querying and HDFS-based analysis of huge datasets [54]. It also provides a way to impose structure on this data, together with HiveQL, a query language based on SQL. It further offers the flexibility to plug in custom mappers and reducers when the logic cannot be expressed efficiently in HiveQL [9][36].

Pig: Pig is a high-level programming language as well as a parallel execution framework. A program written in Pig can manage large datasets through substantial parallelism. Pig's basic infrastructure consists of a compiler that produces sequences of MapReduce programs with parallel implementations. Pig's language, Pig Latin, expresses these sequences, and users can also write their own functions for reading, writing and processing data [79].

Tez: It is a generalized data-flow programming framework built on top of Hadoop YARN. It offers a powerful and versatile engine for executing complex DAGs (directed acyclic graphs) of tasks for batch or interactive processing. It extends the power of MapReduce by expressing computations as a data-flow graph. Tez has been adopted by Hive, Pig, and other ecosystem members as a substitute for MapReduce jobs [12].

Chukwa: It is a data collection system for monitoring large distributed clusters. It is built on top of HDFS and MapReduce to offer large-scale logging and analytics. It includes a flexible and powerful toolkit for displaying, monitoring and analyzing the results of the collected data [24].

ZooKeeper: ZooKeeper provides coordination between distributed applications. Several Hadoop projects use ZooKeeper to coordinate the cluster and to provide highly available distributed services. It offers a centralized service for maintenance, distributed synchronization and group services [13].

Ambari: Ambari is a web-based tool that simplifies the management of Hadoop and of the Hadoop cluster. It provides a step-by-step wizard for installing services such as Hive, HBase, Pig and ZooKeeper on the Hadoop cluster. It provides central management for starting, stopping and resetting Hadoop services across the cluster. It also monitors the current status of the Hadoop cluster [6].

Avro: It is a data serialization system. It has rich data structures. It offers a compact, binary data format for storing persistent data and for remote procedure calls (RPC). Code generation is required neither for reading or writing data nor for using RPC protocols [78].

Mahout: Mahout is a machine learning, data mining and math library built on top of MapReduce. This project aims to offer scalable and fast machine learning and data mining algorithms [10].

Spark: It is a fast and general data processing engine. It provides an easier alternative to MapReduce and runs programs up to 100 times faster than MapReduce. It has an advanced directed acyclic graph (DAG) execution engine that supports fast in-memory computation and cyclic data flow. Spark runs on Hadoop and can access HDFS, HBase, and Cassandra [11].

Sqoop: A project intended to transfer bulk data efficiently between Hadoop and structured databases [69].

Oozie: Oozie is a workflow scheduler system for Apache Hadoop. Its workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie is integrated with the rest of the Hadoop stack and supports several different kinds of Hadoop jobs as well as system-specific jobs (e.g., Java programs and shell scripts) [51].

5. HADOOP DISTRIBUTION
Different IT providers and communities are enhancing the Hadoop infrastructure, tools, and structure. Sharing innovation through open-source modules is useful for big data technologies. The pitfall, however, is that users can end up with a Hadoop platform made of separate modules from different sources [52]. Because each module has its own level of maturity, the resulting platform risks incompatibilities between versions. Integrating different technologies on a single platform increases the same risk. Each module is usually tested on its own, but combinations of modules from multiple sources can hide risks that have not been fully investigated or tested.

To deal with these issues, many IT vendors, such as IBM, Cloudera, MapR, and Hortonworks, have developed their own modules and packaged them into distributions.
5.1 InfoSphere BigInsights-IBM
It is designed to simplify the use of Hadoop. It can meet enterprise requirements for storage, processing, advanced analysis, and visualization. The basic version of IBM InfoSphere BigInsights, which includes HDFS, HBase, MapReduce, Hive, Mahout, Oozie, Pig, ZooKeeper, etc., has already been released.

The Enterprise Edition provides some additional principal services: reliability features, performance capabilities, security management, and optimized fault tolerance. It supports sophisticated big data analytics with adaptive algorithms (e.g., for text processing). IBM also offers data access layers that can be attached to distinct data sources (such as DB2, Streams, DataStage, JDBC, etc.)[44]. The IBM distribution has some other benefits: first, streaming data can be stored directly in BigInsights clusters. Second, it supports real-time analysis of streaming data and facilitates visualization via dashboards and BigSheets in the cluster.

5.2 Cloudera
Cloudera is one of the most commonly used Hadoop distributions. It makes it possible to deploy and manage a Hadoop-based Enterprise Hub[56]. It offers numerous advantages, including centralized management tools, unified batch processing, interactive SQL, and role-based access control[16]. Impala is one of the principal Cloudera modules[62]. It is a Hadoop-compatible query language module[30]. Impala structures data in a column-based format. It enables interactive, real-time analysis of big data. Unlike Hive, Impala does not use the MapReduce framework. Instead, it uses its own in-memory processing engine for fast queries over massive amounts of data. Hence, Impala returns query results faster than Hive. Moreover, Impala can directly use data from existing HDFS and HBase sources.

Cloudera also has a versatile model that is quicker than Hive and supports both structured and unstructured data. For example, Cloudera is reported to be 10 times faster than Hive and MapReduce. Cloudera claims a performance gain of approximately 5 to 47 times over HiveQL (Hive Query Language) for queries with at least one join[57].

However, Cloudera has some disadvantages. For instance, it is not well suited to querying streaming data (e.g., video or continuous sensor data). In addition, all join operations are carried out in memory and are therefore restricted by the limited memory of the cluster's nodes[23].

5.3 MapR
MapR is a commercial Hadoop distribution designed for enterprises. The precision, efficiency, and ease of big data storage, processing, and analysis with machine learning algorithms have been improved. It offers a broad range of components and projects for the Hadoop ecosystem[52]. It does not use HDFS. Instead, it provides its own MapR File System (MapR-FS), which enables simple backups and improves performance. An advantage of MapR-FS is that it is NFS compatible. MapR is based on Hadoop's existing programming model.

5.4 Hortonworks Data Platform (HDP)
HDP is built on Hadoop for storing, querying, and processing data. It is a fast, scalable, and cost-effective solution. It offers multiple management, monitoring, and integration capabilities. Furthermore, HDP offers open-source management tools and supports connections to certain BI platforms[16].

6. CHALLENGES OF BIG DATA
Big data offers numerous appealing possibilities. However, practitioners and researchers face various difficulties in exploring big data sets[55]. Problems occur at various stages of data management, such as data collection, storage, search, etc.[14]. In distributed data-driven applications, there are security and privacy issues as well[60].

Heterogeneity and rawness: Big data analytics faces difficulties arising from the huge size of the data and from the presence of varied data in divergent forms. Complex, heterogeneous, mixed data has several models and numerous patterns with very different properties. Data may be both structured and unstructured. More than 80% of the data produced by organizations is unstructured. It is extremely dynamic and has no particular format. It may take many forms (e.g., images, PDF documents, medical records, video, audio, etc.). Transforming this data into a structured form is a vital challenge in big data mining. New technologies therefore have to be adopted to deal with this kind of data[37].

Scalability and complexity: Managing huge and rapidly growing data is a serious challenge. Traditional data management techniques cannot cope with increasing data volumes. The scale and complexity of the big data to be analyzed are also major obstacles to data analysis[48].

Big data storing and quality: Storing and analyzing the huge amounts of data that are pivotal to a corporation's operations requires an extensive and complex hardware infrastructure. Data storage devices are becoming more and more essential as data continues to grow, and many companies are looking for high storage capacity to cope with this issue[17]. For decision-making, the accuracy and timely availability of data are essential. Big data is most useful when an information management process is implemented to guarantee data accuracy and quality.

Big data cleaning: The steps of cleansing, aggregation, encoding, storage, and access are not new; they are already known from traditional databases. The challenge is to manage the processing and complex structure of big data in a distributed environment, given the combination of velocity, volume, and variety[38]. The reliability of the source and the nature of the data must be verified before resources are committed, in order to obtain reliable outcomes. The problem is cleaning such large data sets and choosing which data are accurate and useful.

7. SECURITY AND PRIVACY IN BIG DATA
Organizations need secure processes and regulations to protect their frameworks. For big data security and privacy issues, conventional techniques are considered ineffective[61]. However, new techniques may also host unidentified back doors and default credentials[60]. It is necessary to consider the confidentiality, integrity, and availability of data.

Security: The diversity of data sources, formats, and streams, as well as of infrastructures, may cause unique security vulnerabilities. The Cloud Security Alliance has broken down the security and privacy challenges of big data into distinct categories: infrastructure security, data protection, data management, and integrity and reactive security[65]. Infrastructure security comprises safe and secure distributed programming. Data protection concerns analytics, encryption, and granular access control in data centers. Data management involves secure data storage, processing, logging, auditing, and data provenance[81]. Furthermore, integrity and reactive security include validation, filtering, and real-time monitoring. Given these issues, user authorization and authentication mechanisms are crucial, and encryption and data masking must be implemented for data in both states (at rest and streaming).

Privacy: The development of systems has led to independent control of data collection[25]. Recently, the National Security Agency (NSA), under the pretext of protecting US citizens, has been wiretapping personal data from miscellaneous sources, such as the databases of large companies, the internet, and telecom companies. Privacy concerns about big data are ever increasing. They include uncovering new and confidential facts about people, aggregating their private details, adding value to institutions with data collected from unaware individuals, threatening uninformed people through predictive analysis of social media, and, finally, exchanging datasets between associated organizations[19]. In response to such complex matters, rules and regulations must set clear limits on unauthorized access, data sharing, illegal use, and duplication of personal information[60][68].

8. WHAT SHOULD HAPPEN IN THE FUTURE?
There are several important future challenges in the management of big data technologies that arise from the nature of the data, such as its complexity, diversity, and evolution. In the coming years, researchers will have to face several difficulties in various areas.

In medical science: Today, the healthcare system is on an unsustainable trajectory. Much of the cost in the current system is due to patients with chronic diseases. Therefore, preventive care and population health management should be a priority in the future[71]. Big data makes this easier to understand. Personalized medicine is being promoted as the future of the healthcare sector. Nowadays, medicines are produced for the masses, not for the individual. Looking forward, with the advent of big data applications, more customized medicines can be produced that use patient-specific data, such as genomics and proteomics, based on the profiles of similar patients and their responses to such approaches. As social media and mobile technologies are increasingly adopted, patients are becoming more and more aware of the alternatives available to them. In the future, we expect the development of new data sources and analytical technologies to change the way we practice medicine[53].

In social media: The term "social media" covers a wide range of online platforms for creating and exchanging user content. Social media is classified into types such as social networks (e.g., Facebook, LinkedIn, Twitter, Tumblr, Instagram, YouTube)[18] as well as some mobile apps. Research on social media analytics extends in a number of directions, including psychology, sociology, computer science, mathematics, physics, and economics. In social media specifically, we need to improve the prediction of future links between the existing nodes of the underlying network. Normally, social network structures are not static; they expand continuously. Therefore, it is a natural objective to understand and forecast the dynamics of the network[27].

In IoT: Because of the rapid growth of IoT-based applications in the cloud, the number of connected devices is increasing swiftly[55][40]. The number of connected devices is expected to reach 24 billion in 2020. These devices will be connected via the cloud for different kinds of applications. The integration of IoT and cloud computing creates a new paradigm, which has been termed the cloud of things (CoT). In CoT, the objects of the IoT are extended through the internet from sensors to all front-end objects. Furthermore, distributed sites are connected into a single whole, such as smart homes, smart factories, smart cities, and even a smart planet. A logical design of the smart city based on CoT has been provided[70]. By combining the cloud platform and IoT, CoT is needed to enhance the interaction and interoperability capabilities of smart applications. CoT will play an increasingly important role in diverse industries and research areas. CoT will face problems such as resource allocation that balances energy and efficiency, quality of service provisioning, data storage architecture, security, privacy, and unnecessary data communication[82][21][3].

9. CONCLUSION
The intention of this article is to delineate, evaluate, and review big data technologies. First, this article described what big data means and consolidated the divergent discourse on big data. We presented varied definitions of big data, which underline the fact that size is only one facet of big data. Other dimensions, such as velocity and variety, are equally important. The paper mainly focused on analytics for gaining viable and valuable insights from big data. Big data is applied in almost every area, ranging from the financial sector to the healthcare sector. Big data can be handled by implementing several techniques. However, there is still scope for further research because the problems of storage, processing, and management remain major issues across a broad range of categories. A vast amount of data is generated every minute, which may be structured, unstructured, or semi-structured and needs sufficient storage. Furthermore, the issues related to fast-growing data remain a concern, and the management issues related to big data are still under consideration for future studies.

10. ACKNOWLEDGEMENTS
This research was funded by the Humanities and Social Sciences Foundation of the Ministry of Education, grant number 17YJCZH260, and the CERNET Innovation Project, grant number NGII20180403. The authors would especially like to thank their loving families, who have always been supportive, as well as all friends and lab members.

11. REFERENCES
[1] 6th Symposium on Operating Systems Design and Implementation - Technical Paper: [Link]s/dean/dean_html/. Accessed: 2019-08-01.
[2] Aiyer, A. et al. 2012. Storage Infrastructure Behind Facebook Messages. IEEE Data Engineering. (2012), 1–10.
[3] Al-Fuqaha, A. et al. 2015. Internet of Things: A Survey on Enabling. IEEE Communications Surveys & Tutorials. 17, 4 (2015), 2347–2376.
DOI:[Link]
[4] Al-Sai, Z.A. et al. 2019. Big Data Impacts and Challenges: A Review. 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology, JEEIT 2019 - Proceedings. (2019), 150–155. DOI:[Link]
[5] Alam, A. and Ahmed, J. 2014. Hadoop Architecture and Its Issues. (2014). DOI:[Link]
[6] Ambari: [Link] Accessed: 2019-08-02.
[7] Apache Cassandra: [Link] Accessed: 2019-08-01.
[8] Apache HBase – Apache HBase™ Home: [Link] Accessed: 2019-07-31.
[9] Apache Hive™: [Link] Accessed: 2019-08-02.
[10] Apache Mahout: [Link] Accessed: 2019-08-02.
[11] Apache Spark™ - Unified Analytics Engine for Big Data: [Link] Accessed: 2019-08-02.
[12] Apache Tez – Welcome to Apache TEZ®: [Link] Accessed: 2019-08-02.
[13] Apache ZooKeeper: [Link] Accessed: 2019-08-02.
[14] Ardagna, C.A. et al. 2016. Big Data Analytics as-a-Service: Issues and challenges. (2016), 3638–3644.
[15] Arora, Y. Big Data Technologies: Brief Overview. 131, 9, 1–6.
[16] Azarmi, B. Scalable Big Data Architecture.
[17] Balachandran, M. 2017. Challenges and Benefits of Deploying Big Data Analytics in the Cloud for Business Intelligence. Procedia Computer Science. 112, (2017), 1112–1122. DOI:[Link]
[18] Barbier, G. Chapter 12 DATA MINING IN SOCIAL MEDIA. DOI:[Link]
[19] Bardi, M. et al. 1926. Big Data Security and Privacy: A Review. Journal of the Chemical Society (Resumed). 129, 2 (1926), 663–670. DOI:[Link]
[20] Braganza, A. et al. 2017. Resource management in big data initiatives: Processes and dynamic capabilities. Journal of Business Research. 70, (2017), 328–337. DOI:[Link]
[21] Cai, H. et al. 2017. IoT-Based Big Data Storage Systems in Cloud Computing: Perspectives and Challenges. 4, 1 (2017), 75–87.
[22] Chang, F. et al. 2006. Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!). OSDI. (2006), 205–218. DOI:[Link]
[23] Chauhan, A. 2013. Learning Cloudera Impala.
[24] Chukwa - Welcome to Apache Chukwa: [Link] Accessed: 2019-08-02.
[25] Conference, I.I. et al. 2015. Data Confidentiality Challenges in Big Data Applications. 8, (2015), 2886–2888.
[26] Dave, M. and Kamal, J. 2017. Identifying Big Data Dimensions and Structure. (2017), 163–168.
[27] Desai, P. V. 2018. A survey on big data applications and challenges. Proceedings of the International Conference on Inventive Communication and Computational Technologies, ICICCT 2018. Icicct (2018), 737–740. DOI:[Link]
[28] Dimiduk, N. and Khurana, A. HBase in Action.
[29] Dwivedi, K. 2014. Analytical Review on Hadoop Distributed File System. (2014), 174–181.
[30] Eldawy, A. and Mokbel, M.F. 2017. The era of Big Spatial Data. Proceedings of the VLDB Endowment. 10, 12 (2017), 1992–1995. DOI:[Link]
[31] Gandomi, A. and Haider, M. 2015. Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management. 35, 2 (2015), 137–144. DOI:[Link]
[32] Hep, T. et al. 2019. A Roadmap for HEP Software and Computing R & D for the 2020s. Springer International Publishing.
[33] Hurwitz, J. et al. 2013. Big Data for Dummies.
[34] Industry's Next Generation Data Platform for AI and Analytics | MapR: [Link] Accessed: 2019-08-01.
[35] Ishwarappa and J, A. 2015. A Brief Introduction on Big Data 5Vs Characteristics and Hadoop Technology. 48, Iccc (2015), 319–324. DOI:[Link]
[36] Ismail, A.S. et al. Querying DBpedia Using HIVE-QL. 102–108.
[37] Jaseena, K.U. and David, J.M. 2014. ISSUES, CHALLENGES, AND SOLUTIONS: BIG DATA MINING. (2014), 131–140.
[38] Khan, N. et al. 1990. Big Data: Survey, Technologies, Opportunities, and Challenges. Japanese Journal of Applied Physics. 29, 8 (1990), L1497–L1499. DOI:[Link]
[39] Khan, N. et al. 2018. The 10 Vs, Issues and Challenges of Big Data. March (2018), 52–56. DOI:[Link]
[40] Li, S. et al. 2018. US CR. (2018). DOI:[Link]
[41] Lin, J. 2013. MAPREDUCE IS GOOD ENOUGH? March (2013), 28–37. DOI:[Link]
[42] Machova, R. et al. 2016. Processing of Big Educational Data in the Cloud Using Apache Hadoop. (2016), 46–49.
[43] Manwal, M. Big Data and Hadoop - A Technological Survey.
[44] Martino, B. Di et al. 2014. Big data (lost) in the cloud. International Journal of Big Data Intelligence. 1, 1/2 (2014), 3. DOI:[Link]
[45] Mass, C. et al. 2013. Volume 3, Issue 12, December 2013. 3, 12 (2013), 14947.
[46] McAfee, A. and Brynjolfsson, E. 2012. Big Data: The Management Revolution. Harvard Business Review. October (2012), 1–9. Accessed: 2017-03-15.
[47] Mehta, N. and Pandit, A. 2018. Concurrence of big data analytics and healthcare: A systematic review. International Journal of Medical Informatics. 114, January (2018), 57–65. DOI:[Link]
[48] Mishra, S. 2015. Challenges in Big Data Application: A Review. 121, 19 (2015), 42–46.
[49] Mitra, A. et al. 2016. A Novel Big-Data Processing Framework for Healthcare Applications. (2016), 3548–3555.
[50] Nambiar, R. 2019. A look at challenges and opportunities of Big Data analytics in healthcare. (2019), 17–22.
[51] Oozie - Apache Oozie Workflow Scheduler for Hadoop: [Link] Accessed: 2019-08-03.
[52] Oussous, A. et al. 2018. Big Data technologies: A survey. Journal of King Saud University - Computer and Information Sciences. 30, 4 (2018), 431–448. DOI:[Link]
[53] Pashazadeh, A. and Navimipour, N.J. 2018. Big data handling mechanisms in the healthcare applications: A comprehensive and systematic literature review. Journal of Biomedical Informatics.
[54] Patel, D. et al. 2017. Analyzing Network Traffic Data Using Hive Queries. 3 (2017), 3–8.
[55] Philip Chen, C.L. and Zhang, C.Y. 2014. Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences. 275, (2014), 314–347. DOI:[Link]
[56] Pol, U. 2016. International Journal of Advanced Research in Big Data and Hadoop Technology Solutions with Cloudera Manager. September (2016).
[57] Prasad, B.R. and Agarwal, S. 2016. Comparative Study of Big Data Computing and Storage Tools: A Review. International Journal of Database Theory and Application. 9, 1 (2016), 45–66. DOI:[Link]
[58] Rajaraman, V. 2016. Big Data Analytics. August (2016), 2015–2016.
[59] Ravi, V.T. Comparing Map-Reduce and FREERIDE for Data-Intensive Applications.
[60] Raza, M.U. 2017. Big Data – Security and Privacy policy. 5, 6 (2017), 51–54.
[61] Rezaeijam, M. A Survey on Security of Hadoop.
[62] Sakr, S. Big Data 2.0 Processing Systems: A Survey.
[63] Shafer, J. et al. 2010. The Hadoop distributed filesystem: Balancing portability and performance. ISPASS 2010 - IEEE International Symposium on Performance Analysis of Systems and Software. March 2010 (2010), 122–133. DOI:[Link]
[64] Shao, Y. et al. 2018. Efficient jobs scheduling approach for big data applications. Computers & Industrial Engineering. 117, March 2017 (2018), 249–261. DOI:[Link]
[65] Sinanc, D. et al. 2015. A survey on security and privacy issues in big data. December (2015). DOI:[Link]
[66] Singh, S. et al. 2015. Big Data: Technologies, Trends and Applications. 6, 5 (2015), 4633–4639.
[67] Sogodekar, M. et al. 2016. Big data analytics: Hadoop and tools. IEEE Bombay Section Symposium 2016: Frontiers of Technology: Fuelling Prosperity of Planet and People, IBSS 2016. (2016). DOI:[Link]
[68] Somasekaram, P. 2016. Privacy-Preserving Big Data in an In-Memory Analytics Solution. Luleå University of Technology. (2016).
[69] Sqoop: [Link] Accessed: 2019-08-03.
[70] Sur, S. et al. Can High-Performance Interconnects Benefit Hadoop Distributed File System?
[71] Taguchi, Y.H. et al. 2014. Heuristic principal component analysis-based unsupervised feature extraction and its application to bioinformatics. Big Data Analytics in Bioinformatics and Healthcare. i, (2014), 138–162. DOI:[Link]
[72] Tech, M.R.D. 2014. Handling Big Data with Hadoop Toolkit. 978 (2014).
[73] The real story of how big data analytics helped Obama win | InfoWorld: [Link] [Link]. Accessed: 2019-07-30.
[74] To, Q.C. et al. 2018. A survey of state management in big data processing systems. VLDB Journal. 27, 6 (2018), 847–872. DOI:[Link]
[75] Uzunkaya, C. et al. 2015. Hadoop Ecosystem and Its Analysis on Tweets. Procedia - Social and Behavioral Sciences. 195, (2015), 1890–1897. DOI:[Link]
[76] Wang, H. et al. 2016. Towards felicitous decision making: An overview on challenges and trends of Big Data. Information Sciences. 367–368, (2016), 747–765. DOI:[Link]
[77] Wang, Y. et al. 2018. Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change. 126, (2018). DOI:[Link]
[78] Welcome to Apache Avro! [Link] Accessed: 2019-08-02.
[79] Welcome to Apache Pig! [Link] Accessed: 2019-08-02.
[80] White, T. Hadoop: The Definitive Guide.
[81] Zheng, Z. et al. 2015. Real-Time Big Data Processing Framework: Challenges and Solutions. 3190, 6 (2015), 3169–3190.
[82] Zhou, J. et al. 2013. CloudThings: a Common Architecture for Integrating the Internet of Things with Cloud Computing. (2013), 651–657.
