Data Analytics
[Link] Kumar Pathak
Data Analysis
• Data analysis is defined as a process of cleaning,
transforming, and modeling data to discover useful
information for business decision making.
• Data analytics is the science of analyzing raw data in
order to make conclusions about that information.
• The purpose of Data Analysis is to extract useful
information from data and to make decisions based
on that analysis. For example, manufacturing
companies often record the runtime, downtime, and
work queue for various machines and then analyze the
data to better plan the workloads so the machines
operate closer to peak capacity.
Classification of Data:
Unstructured data :
a. Unstructured data is the unprocessed form of data.
b. Data that has no inherent structure, which may include text
documents, PDFs, images, and video.
c. This data is often stored in a repository of files.
Structured data :
a. Structured data is tabular data (rows and columns) which are very
well defined.
b. Data containing a defined data type, format, and structure, which
may include transaction data, traditional RDBMS, CSV files, and
simple spreadsheets.
Semi-structured data :
a. Textual data files with a defined pattern that enables parsing such
as XML data files.
b. A consistent format is defined however the structure is not very
strict.
c. Semi-structured data are often stored as files.
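The three categories above can be illustrated with a minimal Python sketch. The sample records (machine names, runtimes, the free-text note) are made up for illustration, and only the standard library is used.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: CSV text with a fixed set of columns (hypothetical sample data).
csv_text = "machine,runtime_hours\nM1,7.5\nM2,6.0\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: XML follows a defined pattern that enables parsing,
# but the structure is not strict (M2 has no <runtime> element).
xml_text = (
    "<machines>"
    "<machine id='M1'><runtime>7.5</runtime></machine>"
    "<machine id='M2'/>"
    "</machines>"
)
root = ET.fromstring(xml_text)
machine_ids = [m.get("id") for m in root.findall("machine")]

# Unstructured: free text with no inherent structure; only generic
# operations such as splitting into words apply directly.
note = "Machine M1 stopped twice today because of a jammed belt."
word_count = len(note.split())

print(rows[0]["machine"], machine_ids, word_count)
```

The CSV rows can be addressed by column name, the XML needs a parser that tolerates missing elements, and the note offers nothing to query until further processing is applied.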
Characteristics of Big Data
Big Data is characterized along four dimensions :
1. Volume :
a. Volume concerns the scale of data, i.e., the amount of data and the
rate at which it is growing.
b. The volume of data is growing rapidly, due to several applications of
business, social, web and scientific explorations.
2. Velocity :
a. Velocity is the speed at which data is generated, which demands
analysis of streaming data.
b. Velocity is driven by the growing speed of business intelligence
applications such as trading and telecom and banking transactions,
and by the growing number of internet connections and increased
usage of the internet.
3. Variety : It describes the different forms of data used for analysis, such as
structured, semi-structured and unstructured.
4. Veracity :
a. Veracity means the truthfulness or reliability of data. It is about how much we can trust
the data.
b. Many times, data is incomplete, wrong, or misleading. So, picking out the correct and
useful data becomes difficult.
c. To solve this, a lot of data cleaning, filtering, and analysis is needed so that only the
right data is used for making decisions.
Big Data Platform:
1. A big data platform is a type of IT solution that combines the features and
capabilities of several big data applications and utilities within a single solution.
2. It is an enterprise-class IT platform that enables an organization to
develop, deploy, operate and manage a big data infrastructure/
environment.
3. A big data platform generally consists of big data storage, servers, databases,
big data management, business intelligence and other big data management
utilities.
4. It also supports custom development, querying and integration with
other systems.
5. Big data platforms are also delivered through the cloud, where the provider
offers all-inclusive big data solutions and services.
Features of Big Data analytics platform :
1. A Big Data platform should be able to accommodate new platforms and
tools based on the business requirement.
2. It should support linear scale-out.
3. It should have capability for rapid deployment.
4. It should support a variety of data formats.
5. The platform should provide data analysis and reporting tools.
6. It should provide real-time data analysis software.
7. It should have tools for searching through large data
sets.
Data Analysis
1. Determine the data :
a. The first step is to determine the data requirements or how the
data is grouped.
b. Data may be separated by age, demographic, income, or gender.
c. Data values may be numerical or be divided by category.
2. Collection of data :
a. The second step in data analytics is the process of collecting the data.
b. This can be done through a variety of sources such as computers,
online sources, cameras, environmental sources, or through
personnel.
3. Organization of data :
a. The third step is to organize the data.
b. Once the data is collected, it must be organized so it can be analyzed.
c. Organization may take place on a spreadsheet or other form of
software that can take statistical data.
4. Cleaning of data :
a. In the fourth step, the data is cleaned up before analysis.
b. This means it is scrubbed and checked to ensure there is no
duplication or error, and that it is not incomplete.
c. This step helps correct any errors before the data goes on to a data analyst
to be analyzed.
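The cleaning step can be sketched in a few lines of Python. This is a minimal illustration, not a complete cleaning routine: the function name `clean_records`, the field names, and the sample records are assumptions introduced here.

```python
def clean_records(records, required_fields):
    """Drop exact duplicates and records with missing required fields."""
    seen = set()
    cleaned = []
    for record in records:
        # A record is incomplete if any required field is absent or empty.
        if any(record.get(field) in (None, "") for field in required_fields):
            continue
        # A sorted tuple of items serves as a hashable fingerprint
        # for detecting exact duplicates.
        key = tuple(sorted(record.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


# Hypothetical survey records grouped by age and income.
raw = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 1, "age": 34, "income": 52000},    # duplicate
    {"id": 2, "age": None, "income": 48000},  # incomplete
    {"id": 3, "age": 29, "income": 61000},
]
cleaned = clean_records(raw, required_fields=["id", "age", "income"])
print([r["id"] for r in cleaned])
```

Only records 1 and 3 survive: the repeated record is dropped once and the record with a missing age is excluded.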
Modern data analytic tools
1. Apache Hadoop :
a. Apache Hadoop is a big data analytics tool; it is a free, Java-based software
framework.
b. It enables effective storage of huge amounts of data in a storage place known as a
cluster.
c. It runs in parallel on a cluster and can process huge data sets across all
nodes in it.
d. Hadoop has a storage system known as the Hadoop Distributed File
System (HDFS), which helps to split the large volume
2. Datawrapper :
a. It is an online data visualization tool for making interactive charts.
b. It takes a data file in CSV, PDF or Excel format.
c. Datawrapper generates visualizations in the form of bar charts, line charts, maps,
etc. These can be embedded into any other website as well.
3. Tableau :
a. Tableau is another popular big data tool. It is simple and very intuitive to use.
b. It communicates the insights of the data through data visualization.
c. Through Tableau, an analyst can check a hypothesis and explore the data before
starting to work on it extensively.
5. RapidMiner :
a. RapidMiner operates using visual programming and is
capable of manipulating, analyzing and modeling data.
b. RapidMiner makes data science teams more productive
by providing an open-source platform for all their jobs, such as machine
learning, data preparation, and model deployment.
6. R-programming :
a. R is a free, open-source programming language and a
software environment for statistical computing and graphics.
b. It is used by data miners for developing statistical software and data
analysis.
c. It has become a highly popular tool for big data in recent years.
Application of Data Analytics:
1. Security : Data analytics applications or, more specifically, predictive
analysis, have also helped in reducing crime rates in certain areas.
2. Transportation :
a. Data analytics can be used to revolutionize transportation.
b. It can be used especially in areas where we need to transport a
large number of people to a specific area and require seamless
transportation.
3. Risk detection :
a. Many organizations were struggling under debt, and they wanted a
solution to the problem of fraud.
b. They already had enough customer data in their hands, and so
they applied data analytics.
4. Delivery :
a. Several top logistic companies are using data analysis to examine
collected data and improve their overall efficiency.
b. Using data analytics applications, the companies were able to find
the best shipping routes and delivery times, as well as the most cost-efficient
means of transport.
5. Fast internet allocation :
a. While it might seem that allocating fast internet in every area
makes a city ‘Smart’, in reality, it is more important to engage in
smart allocation. This smart allocation would mean understanding
how bandwidth is being used in specific areas and for the right
cause.
b. It is also important to shift the data allocation based on timing and
priority. It is assumed that financial and commercial areas require
the most bandwidth during weekdays, while residential areas
require it during the weekends. But the situation is much more
complex. Data analytics can solve it.
c. For example, using applications of data analysis, a community can
draw the attention of high-tech industries; in such cases, higher
bandwidth will be required in those areas.
6. Internet searching :
a. When we use Google, we are using one of the many data analytics applications
employed by the company.
b. Most search engines like Google, Bing, Yahoo, AOL etc., use data analytics. These
search engines use different algorithms to deliver the best result for a search query.
7. Digital advertisement :
a. Data analytics has revolutionized digital advertising.
b. Digital billboards in cities as well as banners on websites, that is, most
advertisement sources nowadays, use data analytics and data algorithms.
Data Analysis – Types
• There are several types of Data
Analysis techniques that exist based
on business and technology.
• However, the major Data Analysis
methods are:
– Text Analysis
– Statistical Analysis
– Diagnostic Analysis
– Predictive Analysis
– Prescriptive Analysis
Descriptive Analytics
• Descriptive analytics helps answer questions
about what happened. These techniques
summarize large datasets to describe outcomes
to stakeholders.
• By developing key performance indicators (KPIs),
these strategies can help track successes or
failures. Metrics such as return on investment
(ROI) are used in many industries.
• Specialized metrics are developed to track
performance in specific industries. This process
requires the collection of relevant data,
processing of the data, data analysis and data
visualization. This process provides essential
insight into past performance.
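As a small illustration of a descriptive metric, the sketch below computes ROI per campaign and an average across campaigns. It assumes the common definition ROI = (gain - cost) / cost; the campaign names and figures are hypothetical.

```python
from statistics import mean


def roi(gain, cost):
    """Return on investment as a fraction: (gain - cost) / cost."""
    return (gain - cost) / cost


# Hypothetical campaign figures: (name, cost, gain).
campaigns = [
    ("email", 1000.0, 1500.0),
    ("search", 2000.0, 2400.0),
    ("social", 500.0, 400.0),
]

# Summarize what happened: one ROI value per campaign, plus the average.
roi_by_campaign = {name: roi(gain, cost) for name, cost, gain in campaigns}
average_roi = mean(list(roi_by_campaign.values()))

for name, value in roi_by_campaign.items():
    print(f"{name}: {value:.0%}")
print(f"average: {average_roi:.1%}")
```

A negative value, as for the hypothetical social campaign, flags a loss that diagnostic analytics could then investigate.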
Diagnostic analytics
• Diagnostic analytics helps answer questions about
why things happened. These techniques supplement
more basic descriptive analytics.
• They take the findings from descriptive analytics
and dig deeper to find the cause. The
performance indicators are further investigated
to discover why they got better or worse. This
generally occurs in three steps:
– Identify anomalies in the data. These may be
unexpected changes in a metric or a particular
market.
– Data that is related to these anomalies is collected.
– Statistical techniques are used to find
Predictive analytics
• Predictive analytics helps answer questions
about what will happen in the future. These
techniques use historical data to identify
trends and determine if they are likely to
recur.
• Predictive analytical tools provide valuable
insight into what may happen in the future.
Their techniques include a variety of
statistical and machine learning techniques,
such as neural networks, decision trees,
and regression.
Prescriptive analytics
• Prescriptive analytics helps answer
questions about what should be done. By
using insights from predictive analytics,
data-driven decisions can be made.
• This allows businesses to make informed
decisions in the face of uncertainty.
Prescriptive analytics techniques rely on
machine learning strategies that can find
patterns in large datasets.
• By analyzing past decisions and events, the
likelihood of different outcomes can be
Methods
Cluster analysis
• The action of grouping a set of data
elements in a way that said elements are
more similar (in a particular sense) to each
other than to those in other groups,
hence the term ‘cluster.’
• Since there is no target variable when
clustering, the method is often used to find
hidden patterns in the data. The approach is
also used to provide additional context to a
trend or dataset.
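One common clustering technique is k-means, which the slides do not name; it is used here only as an example. The sketch below is a deliberately small one-dimensional version with caller-supplied starting centroids, and the spending values are hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Very small k-means for 1-D data with caller-supplied initial centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spend per customer; no target variable is given.
spend = [10, 12, 11, 95, 100, 102]
centroids, clusters = kmeans_1d(spend, centroids=[0, 50])
print(centroids, clusters)
```

No labels are supplied, yet the values separate into a low-spend group and a high-spend group, which is the kind of hidden pattern the method is meant to surface.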
Cohort analysis
• This type of data analysis method uses historical data
to examine and compare a determined segment of
users' behavior, which can then be grouped with others
with similar characteristics.
• By using this data analysis methodology, it's possible to
gain a wealth of insight into consumer needs or a firm
understanding of a broader target group.
• Cohort analysis can be really useful for analysis
in marketing, as it allows you to understand the
impact of your campaigns on specific groups of
customers.
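A cohort is often defined by signup month, with retention measured in later months. The sketch below follows that assumption; the event log, the `month_index` helper, and the `retention` function are hypothetical names introduced for illustration, and every user is assumed to be active in their signup month.

```python
from collections import defaultdict

# Hypothetical event log: (user_id, signup_month, active_month).
events = [
    ("u1", "2024-01", "2024-01"), ("u1", "2024-01", "2024-02"),
    ("u2", "2024-01", "2024-01"),
    ("u3", "2024-02", "2024-02"), ("u3", "2024-02", "2024-03"),
    ("u4", "2024-02", "2024-02"), ("u4", "2024-02", "2024-03"),
]


def month_index(month):
    """Convert 'YYYY-MM' to a running month count so months can be subtracted."""
    year, mon = month.split("-")
    return int(year) * 12 + int(mon)


# cohort (signup month) -> months since signup -> set of active users
active = defaultdict(lambda: defaultdict(set))
for user, signup, month in events:
    offset = month_index(month) - month_index(signup)
    active[signup][offset].add(user)


def retention(cohort, offset):
    """Share of the cohort's users still active `offset` months after signup."""
    cohort_size = len(active[cohort][0])
    return len(active[cohort][offset]) / cohort_size


print(retention("2024-01", 1), retention("2024-02", 1))
```

Comparing the two cohorts side by side is what makes the impact of a campaign run between their signup months visible.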
Regression analysis
• Regression analysis uses historical data
to understand how a dependent variable's
value is affected when one (simple linear
regression) or more independent variables
(multiple regression) change or stay the
same.
• By understanding each variable's
relationship and how they developed in the
past, you can anticipate possible outcomes
and make better business decisions in the
future.
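For the single-variable case, the fit can be written out directly with ordinary least squares. The sketch below is a minimal illustration; the function name `fit_line` and the ad-spend and sales figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common factor 1/n cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical historical data: ad spend (independent) and sales (dependent).
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)
forecast = intercept + slope * 5.0  # anticipated sales at a spend of 5.0
print(slope, intercept, forecast)
```

The slope states how much the dependent variable changes per unit change in the independent variable, and the fitted line is then used to anticipate an outcome outside the observed range.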
Neural networks
• The neural network forms the basis
for the intelligent algorithms of
machine learning.
• It is a form of data-driven analytics that
attempts, with minimal intervention, to
understand how the human brain would
process insights and predict values.
• Neural networks learn from each and every
data transaction, meaning that they evolve
and advance over time.
Factor analysis
• Factor analysis, also called “dimension
reduction,” is a type of data analysis used to
describe variability among observed,
correlated variables in terms of a potentially
lower number of unobserved variables
called factors.
• The aim here is to uncover independent
latent variables, an ideal analysis method for
streamlining specific data segments.
Data Mining
• A method of analysis that is the umbrella
term for engineering metrics and insights
for additional value, direction, and context.
• By using exploratory statistical evaluation,
data mining aims to identify dependencies,
relations, data patterns, and trends to
generate advanced knowledge.
• When considering how to analyze data,
adopting a data mining mindset is essential
to success - as such, it’s an area that is worth
exploring in greater detail.
Text analysis
• Text analysis, also known in the industry
as text mining, is the process of taking
large sets of textual data and arranging
it in a way that makes it easier to
manage.
• By working through this cleansing
process in stringent detail, you will be
able to extract the data that is truly
relevant to your business and use it to
develop actionable insights that will
propel you forward.
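A first step in arranging text so that it is easier to manage is often a term-frequency count. The sketch below assumes a tiny illustrative stop-word list and two made-up customer reviews; `term_frequencies` is a hypothetical helper name.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "is", "was", "and", "a", "but"}


def term_frequencies(documents):
    """Count how often each non-stop-word term appears across the documents."""
    counts = Counter()
    for doc in documents:
        # Lowercase the text and keep only runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", doc.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


reviews = [
    "The delivery was late and the box was damaged.",
    "Late delivery, but the support team was helpful.",
]
freq = term_frequencies(reviews)
print(freq.most_common(2))
```

Terms that recur across documents, here "late" and "delivery", point to the topics worth a closer look.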
Data Analysis Techniques
Data Analytics Life Cycle
Phases of data analytics life cycle:
Phase 1 : Discovery :
1. In Phase 1, the team learns the business domain, including relevant history such as
whether the organization or business unit has attempted similar projects in the past from
which they can learn.
2. The team assesses the resources available to support the project in terms of people,
technology, time, and data.
3. Important activities in this phase include framing the business problem as an analytics
challenge and formulating initial hypotheses (IHs) to test and begin learning the data.
Phase 2 : Data preparation :
1. Phase 2 requires the presence of an analytic sandbox, in which the team can work
with data and perform analytics for the duration of the project.
2. The team needs to execute extract, load, and transform (ELT) or extract, transform
and load (ETL) to get data into the sandbox. Data should be transformed in the ETL
process so the team can work with it and analyze it.
3. In this phase, the team also needs to familiarize itself with the data thoroughly and
take steps to condition the data.
Phase 3 : Model planning :
1. Phase 3 is model planning, where the team determines the methods, techniques, and
workflow it intends to follow for the subsequent model building phase.
2. The team explores the data to learn about the relationships between variables and
subsequently selects key variables and the most suitable models.
Phase 4 : Model building :
1. In phase 4, the team develops data sets for testing, training, and
production purposes.
2. In addition, in this phase the team builds and executes models based on
the work done in the model planning phase.
3. The team also considers whether its existing tools will be adequate for
running the models, or if it will need a more robust environment for
executing models and work flows.
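Developing separate data sets for training and testing can be sketched as a seeded random split. The function name `train_test_split`, the 25% test fraction, and the placeholder records are assumptions for illustration; a production data set would be prepared separately.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into training and test sets."""
    shuffled = list(records)
    # A dedicated, seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records; in practice these would be rows from the analytic sandbox.
records = list(range(20))
train, test = train_test_split(records)
print(len(train), len(test))
```

Keeping the test set disjoint from the training set lets the team evaluate a model on data it has not seen while being built.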
Phase 5 : Communicate results :
1. In phase 5, the team, in collaboration with major stakeholders,
determines if the results of the project are a success or a failure based
on the criteria developed in phase 1.
2. The team should identify key findings, quantify the business value, and
develop a narrative to summarize and convey findings to stakeholders.
Phase 6 : Operationalize :
1. In phase 6, the team delivers final reports, briefings, code, and technical
documents.