Data Analytics
Big Data refers to data sets so large and complex that traditional computing approaches
cannot store, process, or manage them effectively. It is a broad term for the massive volume
of data that businesses and governments generate in today's digital world. It is often
measured in terabytes or petabytes and originates from three key sources: transactional data,
machine data, and social data.
Big Data Analytics
The core of Big Data Analytics is the process of converting large amounts of unstructured raw
data, retrieved from different sources, into a data product that is useful to organizations.
Steps of Big Data Analytics
Big Data Analytics is a powerful tool that helps unlock the potential of large and complex
datasets. To get a better understanding, let's break it down into key steps −
Data Collection
This is the initial step, in which data is collected from different sources such as social
media, sensors, online channels, commercial transactions, and website logs. Collected data
might be structured (with a predefined organisation, as in database tables), semi-structured
(like log files), or unstructured (text documents, photos, and videos).
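The three shapes of collected data can be illustrated with a minimal sketch. The sample
inputs, the field names, and the collect helper below are hypothetical and exist only to show
the distinction; a real pipeline would read from files, APIs, or message queues.

```python
import csv
import io
import json

# Hypothetical sample inputs, one per data shape.
STRUCTURED_CSV = "order_id,amount\n1,19.99\n2,5.00\n"
SEMI_STRUCTURED_LOG = (
    '{"event": "login", "user": "a1"}\n'
    '{"event": "click", "user": "b2", "page": "/home"}\n'
)
UNSTRUCTURED_TEXT = "Great product, fast delivery!"


def collect():
    """Wrap every input in a record that remembers its source type."""
    records = []
    # Structured: fixed columns, so a CSV reader yields uniform rows.
    for row in csv.DictReader(io.StringIO(STRUCTURED_CSV)):
        records.append({"kind": "structured", "payload": row})
    # Semi-structured: each line is JSON, but the fields vary per line.
    for line in SEMI_STRUCTURED_LOG.splitlines():
        records.append({"kind": "semi-structured", "payload": json.loads(line)})
    # Unstructured: free text with no schema, kept as raw content.
    records.append({"kind": "unstructured", "payload": UNSTRUCTURED_TEXT})
    return records


records = collect()
for record in records:
    print(record["kind"], record["payload"])
```

Tagging each record with its kind is just one possible design; it lets later steps choose a
suitable parser or cleaning routine per source.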
Data Cleaning (Data Pre-processing)
The next step is to process the collected data by removing errors and making it suitable for
analysis. Raw data generally contains errors, missing values, inconsistencies, and noise.
Data cleaning entails identifying and correcting these problems to ensure that the data is
accurate and consistent. Pre-processing operations may also involve data transformation,
normalisation, and feature extraction to prepare the data for further analysis.
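As a rough illustration of cleaning and normalisation, the sketch below drops rows with a
missing or malformed value, tidies a text field, and rescales a numeric field to the range
0 to 1. The row layout and the helper names clean and min_max_normalise are assumptions made
for this example only.

```python
def clean(rows):
    """Drop rows with a missing or non-numeric age, and tidy the name field."""
    cleaned = []
    for row in rows:
        try:
            age = float(row["age"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or malformed value: discard the row
        cleaned.append({"name": row.get("name", "").strip().title(), "age": age})
    return cleaned


def min_max_normalise(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw input with typical problems.
raw = [
    {"name": " alice ", "age": "30"},
    {"name": "BOB", "age": None},
    {"name": "carol", "age": "n/a"},
    {"name": "dave", "age": "50"},
    {"name": "erin", "age": "40"},
]
rows = clean(raw)
scaled = min_max_normalise([r["age"] for r in rows])
print(rows)
print(scaled)
```

Discarding bad rows is the simplest policy; depending on the data, filling in missing values
may be a better choice.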
Data Analysis
This is a key phase of big data analytics. Different techniques and algorithms are used to
analyse data and derive useful insights. This can include descriptive analytics (summarising
data to better understand its characteristics), diagnostic analytics (identifying patterns and
relationships that explain why something happened), predictive analytics (predicting future
trends or outcomes), and prescriptive analytics (making recommendations or decisions based on
the analysis).
Data Visualization
In this step, data is presented in a visual form. Data visualisation techniques portray the
data using charts, graphs, interactive dashboards, and other graphical formats to make the
insights from data analysis clearer and more actionable.
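The idea behind a bar chart can be sketched without any plotting library: each value is
scaled against the largest one and drawn as a row of characters. The text_bar_chart helper and
the sales figures are hypothetical; real dashboards would normally use a dedicated charting
tool.

```python
def text_bar_chart(data, width=20):
    """Return one line per category, with a bar scaled to the largest value."""
    largest = max(data.values())
    label_width = max(len(label) for label in data)
    lines = []
    for label, value in data.items():
        bar = "#" * round(width * value / largest)
        lines.append(f"{label.ljust(label_width)} | {bar} {value}")
    return lines


sales = {"North": 120, "South": 60, "East": 30}  # hypothetical figures
chart = text_bar_chart(sales)
for line in chart:
    print(line)
```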
Interpretation and Decision Making
Once the analysis and visualisation are done and insights have been gained, stakeholders
review the findings to make informed decisions. These decisions include optimising corporate
operations, improving customer experiences, creating new products or services, and guiding
strategic planning.
Data Storage and Management
Once collected, the data must be stored in a way that enables easy retrieval and analysis.
Traditional databases may not be sufficient for handling large amounts of data, hence many
organisations use distributed storage systems such as Hadoop Distributed File System (HDFS)
or cloud-based storage solutions like Amazon S3.
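Distributed storage systems generally split large files into blocks and keep several copies
of each block on different machines. The sketch below is a toy model of that idea only: the
round-robin placement, the node names, and both helper functions are assumptions for
illustration and do not reproduce the placement policy of HDFS or any other real system.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


# Hypothetical 10-byte "file" and a four-node cluster.
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), ["node-a", "node-b", "node-c", "node-d"])
for block_id, holders in placement.items():
    print(block_id, len(blocks[block_id]), holders)
```

Because every block lives on several nodes, the loss of a single node in this model leaves
at least two copies of each block available.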
Continuous Learning and Improvement
Big data analytics is a continuous process of collecting, cleaning, and analyzing data to
uncover hidden insights. It helps businesses make better decisions and gain a competitive
edge.
Types of Big Data Analytics
Some common types of Big Data analytics are as follows −
Descriptive Analytics
Descriptive analytics answers a question like "What is happening in my business?" if the
dataset is business-related. It summarises past facts and aids in the creation of reports
such as a company's income, profit, and sales figures. It also aids the tabulation of social
media metrics. It works best with comprehensive, accurate, live data and effective
visualisation.
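A descriptive summary can be as simple as a handful of statistics over past figures. The
monthly sales numbers below are hypothetical; the sketch uses only Python's standard
statistics module.

```python
import statistics

monthly_sales = [1200, 1500, 900, 1500, 1800]  # hypothetical figures

# Summarise what has already happened; no prediction is involved.
summary = {
    "total": sum(monthly_sales),
    "mean": statistics.mean(monthly_sales),
    "median": statistics.median(monthly_sales),
    "min": min(monthly_sales),
    "max": max(monthly_sales),
}
for name, value in summary.items():
    print(name, value)
```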
Diagnostic Analytics
Diagnostic analytics determines root causes from data. It answers a question like "Why is it
happening?" Some common techniques are drill-down, data mining, and data discovery.
Organisations use diagnostic analytics because it provides in-depth insight into a
particular problem. Overall, it can drill down to the root causes and isolate confounding
information.
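Drill-down can be sketched as breaking an overall change into per-segment changes to see
where it comes from. The sales rows, period labels, and the change_by_region helper are
hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical sales rows: (period, region, revenue)
sales = [
    ("last_month", "North", 500),
    ("last_month", "South", 400),
    ("last_month", "East", 300),
    ("this_month", "North", 510),
    ("this_month", "South", 150),
    ("this_month", "East", 310),
]


def change_by_region(rows):
    """Drill down from the overall total to the change per region."""
    totals = defaultdict(lambda: defaultdict(int))
    for period, region, revenue in rows:
        totals[region][period] += revenue
    return {
        region: periods["this_month"] - periods["last_month"]
        for region, periods in totals.items()
    }


changes = change_by_region(sales)
worst_region = min(changes, key=changes.get)
print(changes)
print("largest drop:", worst_region)
```

Here the overall revenue fell, and the per-region view shows that the drop is concentrated in
one region while the others grew slightly.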
Predictive Analytics
This kind of analytics looks at data from the past and the present to predict what will happen
in the future. Hence, it answers a question like "What will happen in the future?" Data mining,
AI, and machine learning are all used in predictive analytics to build these forecasts. It can
identify things like market trends and customer trends.
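One of the simplest predictive techniques is fitting a straight line to past values and
extending it one step ahead. The sketch below uses ordinary least squares on hypothetical
monthly sales; the fit_line helper is written for this example, and real predictive models are
usually far richer.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


months = [1, 2, 3, 4, 5]
sales = [100, 120, 140, 160, 180]  # hypothetical, perfectly linear
slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # extrapolate to month 6
print(slope, intercept, forecast)
```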
Prescriptive Analytics
Prescriptive analytics gives the ability to frame a strategic decision; its results answer
the question "What do I need to do?" Prescriptive analytics works with both descriptive and
predictive analytics. Most of the time, it relies on AI and machine learning.
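A recommendation can be sketched as choosing, from a set of possible actions, the one with the
best expected outcome under a forecast. The demand probabilities, prices, and the
best_order_quantity helper below are hypothetical and stand in for the output of a predictive
model.

```python
def best_order_quantity(demand_forecast, unit_cost, unit_price, options):
    """Pick the stock level with the highest expected profit.

    demand_forecast maps a possible demand level to its probability.
    """
    def expected_profit(quantity):
        # Units sold cannot exceed either the stock or the demand.
        revenue = sum(
            prob * unit_price * min(quantity, demand)
            for demand, prob in demand_forecast.items()
        )
        return revenue - unit_cost * quantity

    return max(options, key=expected_profit)


forecast = {50: 0.25, 100: 0.5, 150: 0.25}  # hypothetical demand distribution
recommended = best_order_quantity(
    forecast, unit_cost=4, unit_price=10, options=[50, 100, 150]
)
print("recommended order quantity:", recommended)
```

The forecast plays the role of predictive analytics, and the choice among options is the
prescriptive step built on top of it.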
Tools and Technologies of Big Data Analytics
Some commonly used big data analytics tools are as follows −
Hadoop
A tool to store and analyze large amounts of data. Hadoop makes it possible to deal with big
data; it is one of the tools that made big data analytics possible.
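Hadoop's processing model, MapReduce, can be illustrated in plain Python with the classic word
count. This is a single-process sketch of the map, shuffle, and reduce phases only; it does
not use Hadoop's own API, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key."""
    return {key: sum(values) for key, values in grouped.items()}


documents = ["big data is big", "data is everywhere"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts)
```

In a real cluster the map and reduce calls would run in parallel on many machines, each
working on its own part of the data.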
MongoDB
A tool for managing unstructured data. It is a database specially designed to store, access,
and process large quantities of unstructured data.
Talend
Talend integrates with big data management tools like Hadoop, Spark, and NoSQL
databases allowing organisations to process and analyse enormous amounts of data
efficiently.
Cassandra
Cassandra is an open-source distributed NoSQL database management system that handles
massive amounts of data over several commodity servers, ensuring high availability and
scalability.
Spark
Apache Spark is a robust and versatile distributed computing framework that provides a
single platform for big data processing, analytics, and machine learning, making it popular in
industries such as e-commerce, finance, healthcare, and telecommunications.
Storm
It is an open-source real-time computation system. Apache Storm is a robust and versatile
stream processing framework that allows organisations to process and analyse real-time data
streams on a large scale, making it suitable for a wide range of use cases in industries such
as banking, telecommunications, e-commerce, and IoT.
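The difference between stream processing and batch processing is that each event is handled as
it arrives rather than after all the data has been collected. The generator-based sketch below
shows that idea in plain Python; it is not Storm's API, and the event names and both functions
are hypothetical.

```python
from collections import Counter


def event_source():
    """Stand-in for an unbounded stream; here it yields a few sample events."""
    for event in ["login", "click", "click", "purchase", "click"]:
        yield event


def running_counts(stream):
    """Process events one at a time, emitting the updated count after each."""
    counts = Counter()
    for event in stream:
        counts[event] += 1
        yield event, counts[event]


updates = list(running_counts(event_source()))
for event, count in updates:
    print(event, count)
```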
Kafka
It is a distributed streaming platform that also provides fault-tolerant storage. Apache Kafka
is a versatile and powerful event streaming platform that allows organisations to create
scalable, fault-tolerant, real-time data pipelines and streaming applications to efficiently
meet their data processing requirements.
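The core idea of an event log, in which producers append messages and each consumer reads at
its own pace by tracking an offset, can be sketched as a small in-memory class. TopicLog and
its methods are hypothetical and are not part of any Kafka client library; a real deployment
adds partitions, replication, and persistence.

```python
class TopicLog:
    """Toy in-memory stand-in for a single-partition topic."""

    def __init__(self):
        self._messages = []
        self._offsets = {}  # consumer name -> next offset to read

    def produce(self, message):
        """Append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def consume(self, consumer, max_messages=10):
        """Return unread messages for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch


log = TopicLog()
for reading in ["t=20.1", "t=20.4", "t=21.0"]:  # hypothetical sensor readings
    log.produce(reading)

first = log.consume("dashboard", max_messages=2)
second = log.consume("dashboard")
archive = log.consume("archiver")
print(first, second, archive)
```

Because messages stay in the log after being read, a second consumer can process the same
data independently of the first.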