Types of Analytics Explained: Big Data Insights

Uploaded by

Domakonda Neha

1. Explain in detail about types of analytics.

Data that is very large in size is called Big Data. Analytics on such data is commonly
grouped into the four types described below.


1. Descriptive analytics
• Descriptive analytics answers the question, “What happened?”
• This type of analytics is by far the most commonly used by customers, providing
reporting and analysis on past events.
• It helps companies understand things such as:
1. How much did we sell as a company?
2. What was our overall productivity?
• Descriptive analytics deals with past trend data: it finds out what has happened in
the past. The resulting summaries of historical data can then inform decisions about
the future.
• Example –
Let’s take the example of DMart. We can look at each product’s sales history to find out
which products sold the most or are in high demand, and based on that analysis we can
decide to stock those items in larger quantities for the coming year.
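
The DMart example can be sketched in a few lines of Python. This is only an illustration: the sales records and the function names `units_per_product` and `top_sellers` are invented here and do not come from any real system.

```python
from collections import defaultdict

# Hypothetical past sales records: (product, units_sold). Invented for illustration.
sales = [
    ("rice", 120),
    ("soap", 45),
    ("rice", 80),
    ("oil", 60),
    ("soap", 30),
]

def units_per_product(records):
    """Total the units sold for each product (the "what happened?" summary)."""
    totals = defaultdict(int)
    for product, units in records:
        totals[product] += units
    return dict(totals)

def top_sellers(records, n=2):
    """Return the n products with the highest total units sold."""
    totals = units_per_product(records)
    return sorted(totals, key=totals.get, reverse=True)[:n]

totals = units_per_product(sales)
best = top_sellers(sales)
print(totals)
print(best)
```

The summary only describes the past; deciding how much to stock next year is a separate step that uses it.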
2. Diagnostic analytics

Diagnostic analytics works hand in hand with descriptive analytics. While descriptive analytics
finds out what happened in the past, diagnostic analytics finds out why it happened, what
measures were taken at that time, or how frequently it has happened. It gives a detailed
explanation of a particular scenario by understanding behavior patterns.
Example –
Let’s take the example of DMart again. Suppose we want to find out why a particular product
is in high demand: is it because of its brand, or because of its quality? Diagnostic analytics
can help answer such questions.

3. Predictive analytics
• Predictive analytics determines what is likely to happen based on historical data using
machine learning.
• Predictive analytics helps companies address use cases such as:
1. Predicting maintenance issues and part breakdown in machines.
2. Determining credit risk and identifying potential fraud.
3. Predicting and avoiding customer churn by identifying signs of customer dissatisfaction.

Whatever information we have obtained from descriptive and diagnostic analytics can be used
to predict future data; predictive analytics finds out what is likely to happen in the
future. Predicting the future does not mean we have become fortune-tellers: by looking at
past trends and behavioral patterns, we forecast what might happen.
Example –
A good example is the recommender systems of Amazon and Netflix. You might have noticed
that whenever you buy a product from Amazon, the payment page shows a recommendation
saying that customers who purchased this product also purchased another one. That
recommendation is based on customers’ past purchase behavior: by analyzing it, an analyst
creates associations between products, and those associations drive the recommendations
shown when you buy a product.
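
The "customers who bought this also bought" idea can be sketched as simple co-purchase counting. This is a minimal illustration with invented orders and hypothetical function names; it is not how Amazon or Netflix actually build their recommenders.

```python
from collections import Counter
from itertools import combinations

# Hypothetical past orders; each order is the set of products bought together.
orders = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"laptop", "mouse"},
    {"phone", "charger"},
    {"laptop", "mouse", "case"},
]

def co_purchase_counts(past_orders):
    """Count how often each ordered pair of products appears in the same order."""
    counts = Counter()
    for order in past_orders:
        for a, b in combinations(sorted(order), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def also_bought(product, past_orders, n=2):
    """Return up to n products most often bought together with `product`."""
    counts = co_purchase_counts(past_orders)
    related = {b: c for (a, b), c in counts.items() if a == product}
    # Sort by count (descending), then by name so ties are deterministic.
    return sorted(related, key=lambda p: (-related[p], p))[:n]

print(also_bought("phone", orders))
```

The association between products is learned purely from past behavior, which is what makes this a prediction rather than a description.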

4. Prescriptive analytics
• Prescriptive analytics is true guided analytics: it prescribes, or guides you toward,
a specific action to take.
This is an advanced form of predictive analytics. When you predict something, you usually
end up with many options, and it can be unclear which option will actually work.
Prescriptive analytics helps find the best option. While predictive analytics forecasts
future data, prescriptive analytics helps make the forecasted outcome happen.
Example –
A good example is Google’s self-driving car: by looking at past trends and forecasted
data, it decides when to turn or when to slow down, much like a human driver.
• Prescriptive: The best course of action for a given situation.
• Predictive: What is likely to happen, predicted from past patterns.
• Diagnostic: Why it happened.
• Descriptive: What happened.

2. What is HBase? Explain its role in data processing and real-time analytics.
HBase
• HBase is an open-source, sorted-map data store built on Hadoop.
• It is column-oriented and horizontally scalable.
• It is based on Google's Bigtable.
• It has a set of tables which keep data in key-value format.
• HBase is well suited for sparse data sets, which are very common in big data
use cases.
• It is a part of the Hadoop ecosystem that provides random real-time
read/write access to data in the Hadoop Distributed File System (HDFS).
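
The data model described above (a sorted map of row keys, with sparse key-value cells per row) can be illustrated with a toy in-memory class. Everything here, including the class name `ToySortedTable` and the "family:qualifier" column strings, is a hypothetical sketch; it is not the HBase client API.

```python
import bisect

class ToySortedTable:
    """Toy in-memory model of a sorted, sparse, column-oriented key-value table.

    This only illustrates the data model; it is not the HBase API.
    """

    def __init__(self):
        self._row_keys = []   # row keys kept in sorted order
        self._rows = {}       # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Write one cell; rows store only the columns they actually have."""
        if row_key not in self._rows:
            bisect.insort(self._row_keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key, column):
        """Random read of a single cell; returns None if the cell is absent."""
        return self._rows.get(row_key, {}).get(column)

    def scan(self, start_key, stop_key):
        """Return (row_key, cells) pairs with start_key <= row_key < stop_key."""
        lo = bisect.bisect_left(self._row_keys, start_key)
        hi = bisect.bisect_left(self._row_keys, stop_key)
        return [(k, self._rows[k]) for k in self._row_keys[lo:hi]]

table = ToySortedTable()
table.put("user#002", "info:name", "Ravi")
table.put("user#001", "info:name", "Asha")
table.put("user#001", "info:email", "asha@example.com")
table.put("user#010", "stats:visits", 7)

print(table.get("user#001", "info:email"))
print([k for k, _ in table.scan("user#001", "user#010")])
```

Because rows are kept sorted by key, a range scan is a contiguous slice, and because each row stores only the cells it has, missing columns cost nothing.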
3. a) What are HDFS commands? Explain.
[Link]
b) Write a short note on HDFS high availability.
4. a) Explain about text analytics.
Text Analytics is a process of analyzing and understanding written or spoken
language. It employs computer algorithms and techniques to extract valuable
information, patterns, and insights from extensive textual data. In simpler terms, text
analytics empowers computers to understand and interpret human language.

How Does Text Analytics Work?

The text analytics process typically includes several key steps, such as language
identification, tokenization, sentence breaking, part-of-speech tagging, chunking, syntax
parsing, and sentence chaining.

Steps of Text Analytics Process

Language Identification
• Objective: Determine the language in which the text is written.
• How it works: Algorithms analyze patterns within the text to identify the
language. This is essential as different languages may have different rules
and structures.
Tokenization
• Objective: Divide the text into individual units, often words or sub-word
units (tokens).
• How it works: Tokenization breaks down the text into meaningful units,
making it easier to analyze and process.
Sentence Breaking
• Objective: Identify and separate individual sentences in the text.
• How it works: Algorithms analyze the text to determine where one sentence
ends and another begins. This is crucial for tasks that require understanding
the context of sentences.
Part of Speech Tagging
• Objective: Assign a grammatical category (part of speech) to each token in a
sentence.
• How it works: Machine learning models or rule-based systems analyze the
context and relationships between words to assign appropriate part-of-speech
tags (e.g., noun, verb, adjective) to each token.
Chunking
• Objective: Identify and group related words (tokens) together, often based on
the part-of-speech tags.
• How it works: Chunking helps in identifying phrases or meaningful chunks
within a sentence. This step is useful for extracting information about specific
entities or relationships between words.
Syntax Parsing
• Objective: Analyze the grammatical structure of sentences to understand
relationships between words.
• How it works: Syntax parsing involves creating a syntactic tree that
represents the grammatical structure of a sentence. This tree helps in
understanding the syntactic relationships and dependencies between words.
Sentence Chaining
• Objective: Connect and understand the relationships between multiple
sentences.
• How it works: Algorithms analyze the content and context of different
sentences to establish connections or dependencies between them. This step is
crucial for tasks that require a broader understanding of the text, such as
summarization or document-level sentiment analysis.
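
Two of the steps above, sentence breaking and tokenization, can be sketched with regular expressions from Python's standard library. This is a deliberately naive illustration (the function names are made up here); real text analytics tools handle abbreviations, quotes, and other languages far more carefully.

```python
import re

def split_sentences(text):
    """Naive sentence breaking: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Naive tokenization: words (with an optional apostrophe part) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

text = "Text analytics is useful. It helps computers read! Doesn't it?"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

for sentence, sentence_tokens in zip(sentences, tokens):
    print(sentence, sentence_tokens)
```

A rule this simple would wrongly split after an abbreviation such as "Dr.", which is one reason the later steps often rely on trained models rather than fixed patterns.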

b) Define big data. Explain evolution of big data and 4 Vs of big data.
5. What is HDFS? What are the components of HDFS
architecture? Explain.
The Hadoop Distributed File System (HDFS) was developed using a distributed
file system design.
HDFS Architecture
Given below is the architecture of a Hadoop File System.

HDFS follows the master-slave architecture and has the following elements.

Namenode

The namenode is commodity hardware that runs the GNU/Linux operating
system and the namenode software. The namenode software itself can run
on commodity hardware. The system hosting the namenode acts as the
master server and performs the following tasks −

• Manages the file system namespace.
• Regulates clients’ access to files.
• Executes file system operations such as renaming, closing, and
opening files and directories.

Datanode
The datanode is commodity hardware running the GNU/Linux operating
system and the datanode software. For every node (commodity
hardware/system) in a cluster, there will be a datanode. These nodes
manage the data storage of their system.

• Datanodes perform read-write operations on the file systems, as per
client request.
• They also perform operations such as block creation, deletion, and
replication according to the instructions of the namenode.

Block

Generally, user data is stored in the files of HDFS. A file is divided
into one or more segments, which are stored on individual datanodes.
These file segments are called blocks. In other words, the minimum
amount of data that HDFS can read or write is called a block. The
default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x and
later), and it can be changed as needed in the HDFS configuration.
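
The way a file maps onto blocks is simple arithmetic, sketched below. The function name `split_into_blocks` is hypothetical, and the 64 MB default simply mirrors the figure quoted above; the sketch assumes the last block holds only the leftover bytes.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the blocks a file of file_size bytes occupies."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes

blocks = split_into_blocks(200 * MB)
print(len(blocks), [size // MB for size in blocks])
```

For a 200 MB file this gives three full 64 MB blocks plus one 8 MB block; with a 128 MB block size the same file would need only two blocks.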

6. Discuss in detail about the MapReduce framework


A MapReduce job is mainly divided into two phases: the Map phase and the Reduce
phase.
1. Map: As the name suggests, its main use is to map the input data into key-
value pairs. The input to the map may itself be a key-value pair, where the
key can be an identifier such as an address and the value is the actual
content it holds. The Map() function is executed on each input pair and
generates intermediate key-value pairs, which serve as input for the Reducer
or Reduce() function.

2. Reduce: The intermediate key-value pairs that serve as input for the Reducer
are shuffled, sorted, and sent to the Reduce() function, which processes the
values grouped under each key to produce the final output.
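
The two phases can be illustrated with the classic word-count example, written here as plain Python functions running in a single process. The function names and input documents are invented for illustration; a real Hadoop job would distribute these steps across the cluster.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in the input value."""
    return [(word.lower(), 1) for word in text.split()]

def shuffle_and_sort(pairs):
    """Shuffle/sort: group intermediate values by key and order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

# Input key-value pairs: document id -> document text (hypothetical data).
documents = {1: "big data is big", 2: "data is everywhere"}

intermediate = [
    pair for doc_id, text in documents.items() for pair in map_phase(doc_id, text)
]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle_and_sort(intermediate))
print(word_counts)
```

Each map call looks only at its own document and each reduce call only at one key, which is what allows the framework to run many of them in parallel.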
How the Job Tracker and the Task Tracker deal with MapReduce:
1. Job Tracker: The Job Tracker manages all the resources and all the jobs
across the cluster. It also schedules each map task on a Task Tracker
running on the data node that holds the relevant data, since there can be
hundreds of data nodes available in the cluster.
2. Task Tracker: The Task Trackers can be considered the actual slaves
that work on the instructions given by the Job Tracker. A Task Tracker
is deployed on each node in the cluster and executes the Map and Reduce
tasks as instructed by the Job Tracker.
7. Describe the anatomy of file read and file write in HDFS.

Anatomy of File Read in HDFS

Let’s get an idea of how data flows between the client interacting with HDFS, the name
node, and the data nodes with the help of a diagram. Consider the figure:
Step 1: The client opens the file it wishes to read by calling open() on the FileSystem
object (which for HDFS is an instance of DistributedFileSystem).
Step 2: DistributedFileSystem (DFS) calls the name node, using remote procedure
calls (RPCs), to determine the locations of the first few blocks in the file. For each
block, the name node returns the addresses of the data nodes that have a copy of that
block. The DFS returns an FSDataInputStream to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the data node and
name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored
the data node addresses for the first few blocks in the file, then connects to the
first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read()
repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream closes the
connection to the data node and then finds the best data node for the next block. This
happens transparently to the client, which from its point of view is simply reading a
continuous stream. Blocks are read in order, with the DFSInputStream opening new
connections to data nodes as the client reads through the stream. It will also call the
name node to retrieve the data node locations for the next batch of blocks as needed.
Step 6: When the client has finished reading the file, it calls close() on the
FSDataInputStream.
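
The read path can be mimicked with a toy model in plain Python. Every name below (`block_locations`, `distance`, `storage`, `read_file`) is hypothetical and exists only to mirror the steps above; this is not the Hadoop client API.

```python
# Toy model: every name below is hypothetical and only mirrors the steps above.

# "Name node" metadata: for each block of the file, the data nodes holding a replica.
block_locations = {
    "blk_1": ["dn1", "dn3"],
    "blk_2": ["dn2", "dn3"],
}
# Assumed network distance from the client to each data node (smaller is closer).
distance = {"dn1": 2, "dn2": 1, "dn3": 4}
# "Data node" storage: the bytes each data node holds for each block replica.
storage = {
    ("dn1", "blk_1"): b"hello ",
    ("dn3", "blk_1"): b"hello ",
    ("dn2", "blk_2"): b"hdfs",
    ("dn3", "blk_2"): b"hdfs",
}

def read_file(block_ids):
    """Read blocks in order, each from the closest data node that holds a replica."""
    data = b""
    nodes_used = []
    for block_id in block_ids:
        replicas = block_locations[block_id]                # step 2: look up locations
        closest = min(replicas, key=distance.__getitem__)   # step 3: pick closest node
        data += storage[(closest, block_id)]                # step 4: stream the block
        nodes_used.append(closest)                          # step 5: on to the next block
    return data, nodes_used

content, nodes = read_file(["blk_1", "blk_2"])
print(content, nodes)
```

The caller sees one continuous byte string even though each block came from a different data node, which is the transparency described in step 5.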

Anatomy of File Write in HDFS


Next, we’ll check out how files are written to HDFS. Consider figure 1.2 to get a better
understanding of the concept.

Note: HDFS follows the Write once Read many times model. In HDFS we cannot edit
the files which are already stored in HDFS, but we can append data by reopening the
files.

Step 1: The client creates the file by calling create() on DistributedFileSystem (DFS).

Step 2: DFS makes an RPC call to the name node to create a new file in the file
system’s namespace, with no blocks associated with it. The name node performs
various checks to make sure the file doesn’t already exist and that the client has the
right permissions to create the file. If these checks pass, the name node makes a
record of the new file; otherwise, file creation fails and the client is thrown an
IOException. The DFS returns an FSDataOutputStream for the client to start writing
data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets,
which it writes to an internal queue called the data queue. The data queue is consumed
by the DataStreamer, which is responsible for asking the name node to allocate new
blocks by picking a list of suitable data nodes to store the replicas. The list of data
nodes forms a pipeline, and here we’ll assume the replication level is three, so there
are three nodes in the pipeline. The DataStreamer streams the packets to the first data
node in the pipeline, which stores each packet and forwards it to the second data node
in the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to the third
(and last) data node in the pipeline.

Step 5: The DFSOutputStream also maintains an internal queue of packets that are
waiting to be acknowledged by data nodes, called the “ack queue”. A packet is removed
from the ack queue only when it has been acknowledged by all the data nodes in the
pipeline.
Step 6: When the client has finished writing data, it calls close() on the stream. This
action flushes all the remaining packets to the data node pipeline and waits for
acknowledgments before contacting the name node to signal that the file is complete.
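
The data queue, pipeline, and ack queue can be imitated with a small single-process simulation. It is a rough sketch under strong simplifying assumptions: data nodes are plain dictionaries, packets are sent one at a time, and every node always acknowledges. The function name `write_file` is made up for this illustration.

```python
from collections import deque

def write_file(data, packet_size, pipeline):
    """Toy write path: packetize data, send it down a pipeline, and track acks.

    `pipeline` is a list of dicts standing in for data nodes (hypothetical).
    """
    # Step 3: split the data into packets and place them on the data queue.
    data_queue = deque(
        data[i:i + packet_size] for i in range(0, len(data), packet_size)
    )
    ack_queue = deque()  # step 5: packets sent but not yet acknowledged
    seq = 0
    while data_queue:
        packet = data_queue.popleft()
        ack_queue.append(seq)
        # Steps 3-4: each node stores the packet, then it goes to the next node.
        acks = 0
        for node in pipeline:
            node[seq] = packet
            acks += 1
        # Remove the packet from the ack queue once every node has acknowledged it.
        if acks == len(pipeline):
            ack_queue.remove(seq)
        seq += 1
    # Step 6: the write is complete only when nothing is left unacknowledged.
    return not data_queue and not ack_queue

nodes = [{}, {}, {}]  # replication level three, as assumed in the text
ok = write_file(b"abcdefghij", packet_size=4, pipeline=nodes)
print(ok, nodes[2])
```

In the real system the packets flow through the pipeline concurrently and failures must be handled; the simulation keeps only the bookkeeping of the two queues.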

8. Explain how reporting and analytics differ. Why are they important?
An organisation often requires both reporting and analysis to explore
new business insights from big data.
A very common mistake organisations make is to equate reporting with
analysis.
To avoid this, it is essential to know the difference between a report
and an analysis.
Reporting and analytics play a crucial role in the realm of big data analytics for several
reasons:

1. Decision Making: Reporting and analytics provide insights derived from large
volumes of data, enabling informed decision-making. By analyzing trends,
patterns, and anomalies, organizations can make data-driven decisions that are
aligned with their strategic objectives.
2. Performance Monitoring: Reporting tools allow organizations to monitor the
performance of various aspects of their operations in real-time. This helps in
identifying areas of improvement, optimizing processes, and maximizing
efficiency.
3. Identifying Trends and Patterns: Big data analytics help in uncovering hidden
trends and patterns within the data that might not be immediately apparent. By
analyzing these patterns, organizations can gain valuable insights into customer
behavior, market trends, and emerging opportunities.
4. Predictive Analytics: Reporting and analytics can be used for predictive
modeling, enabling organizations to forecast future trends and outcomes based
on historical data. This helps in proactive decision-making and strategic planning.
5. Customer Insights: Big data analytics can provide valuable insights into
customer behavior, preferences, and sentiments. By analyzing customer data,
organizations can personalize their marketing efforts, improve customer
satisfaction, and enhance customer retention.
6. Risk Management: Reporting and analytics help in identifying and mitigating
risks by analyzing historical data and identifying potential risk factors. This allows
organizations to take proactive measures to minimize risks and uncertainties.
7. Cost Optimization: By analyzing data related to resource utilization, operational
efficiency, and expenditure, organizations can identify opportunities for cost
optimization and resource allocation.

9. Explain all the phases in analysis process with necessary diagram.
10. a) Explain about types of data.
There are three types of Big Data: Structured, Semi-structured and Unstructured
data.

1. Structured Data: Any data in a fixed format is known as structured data. It can
only be accessed, stored, or processed in a particular format. This type of data is
stored in the form of tables with rows and columns. An Excel file or an SQL table is
an example of structured data.
• Data which is to the point, factual, and highly organized is referred to as
structured data.
• It is easy to search and analyze structured data.
• Structured data exists in a predefined format.
• A relational database consisting of tables with rows and columns is one of the best
examples of structured data.
• Structured data generally exists in tables such as Excel files and Google Docs
spreadsheets.
• SQL (Structured Query Language) is used for managing structured data.
2. Unstructured Data: Unstructured data does not have a fixed format; it is stored in
a format that is not known in advance. An example of unstructured data is a web page
with text, images, videos, etc.

• Unstructured data is the data that lacks any predefined model or format.
• It requires a lot of storage space, and it is hard to maintain security in it.
• It cannot be presented in a data model or schema.
• That's why managing, analyzing, or searching for unstructured data is hard.
• It resides in various different formats like text, images, audio and video files, etc.
• It is qualitative in nature and is sometimes stored in a non-relational (NoSQL)
database.

3. Semi-structured Data: Semi-structured data is a combination of structured and
unstructured forms of data. It does not use tables to show relations; instead it
contains tags or other markers to show hierarchy. JSON files, XML files, and CSV
(comma-separated values) files are examples of semi-structured data.
The e-mails we send or receive are also an example of semi-structured data.

o Semi-structured data is a type of data that is not purely structured, but
also not completely unstructured.
o It contains some level of organization or structure, but does not conform
to a rigid schema or data model
o Semi-structured data is typically characterized by the use of metadata or
tags that provide additional information about the data elements. For
example, an XML document
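
The role of tags and markers can be seen in a short Python sketch that parses a JSON snippet and an XML snippet with the standard library. The records and the order document are invented for illustration only.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical records: the second one has a field the first lacks, which a
# fixed table schema would not allow without adding an empty column.
json_text = (
    '[{"name": "Asha", "email": "asha@example.com"},'
    ' {"name": "Ravi", "phones": ["111", "222"]}]'
)
records = json.loads(json_text)

# All field names that appear anywhere; no single record has all of them.
all_fields = sorted({field for record in records for field in record})

# The same idea in XML: tags mark the hierarchy instead of rows and columns.
xml_text = "<order id='7'><item>rice</item><item>soap</item></order>"
order = ET.fromstring(xml_text)
items = [item.text for item in order.findall("item")]

print(all_fields)
print(order.get("id"), items)
```

Each record carries its own field names, so the structure travels with the data rather than living in a separate, rigid schema.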

b) Discuss about convergence of IT and analytics.
