Data Analytics Overview for B.Tech CSBS
DATA ANALYTICS
Semester R23 Regulation
SRKR Engineering College
UNIT - I
Data Management: Design Data Architecture and manage the data for analysis, understand
various sources of Data like Sensors/Signals/GPS etc. Data Management, Data Quality(noise,
outliers, missing values, duplicate data) and Data Pre-processing & Processing.
UNIT - II
Data Analytics: Introduction to Analytics, Introduction to Tools and Environment, Application
of Modeling in Business, Databases & Types of Data and variables, Data Modeling Techniques,
Missing Imputations etc. Need for Business Modeling.
UNIT - III
Regression – Concepts, BLUE property assumptions, Least Square Estimation, Variable
Rationalization, and Model Building etc.
DATA ANALYTICS
BASIC TERMINOLOGIES
BIG DATA
Big data is a field that treats ways to analyze, systematically extract information from, or
otherwise deal with data sets that are too large or complex to be dealt with by traditional data-
processing application software.
• Volume
• Variety
• Velocity
• Veracity
The volume of data refers to the size of the data sets that need to be analyzed and processed,
which are now frequently larger than terabytes and petabytes. The sheer volume of the data
requires distinct and different processing technologies than traditional storage and processing
capabilities. In other words, this means that the data sets in Big Data are too large to process with
a regular laptop or desktop processor. An example of a high-volume data set would be all
credit card transactions on a day within Europe.
Velocity refers to the speed with which data is generated. High-velocity data is generated at
such a pace that it requires distinct (distributed) processing techniques. An example of data that
is generated with high velocity would be Twitter messages or Facebook posts.
Variety makes Big Data really big. Big Data comes from a great variety of sources and generally
falls into one of three types: structured, semi-structured and unstructured data. The variety in data
types frequently requires distinct processing capabilities and specialist algorithms. An example
of high-variety data sets would be the CCTV audio and video files that are generated at various
locations in a city.
Veracity refers to the quality of the data that is being analyzed. High veracity data has many
records that are valuable to analyze and that contribute in a meaningful way to the overall results.
Low veracity data, on the other hand, contains a high percentage of meaningless data.
The non-valuable data in these data sets is referred to as noise. An example of a high veracity data set
would be data from a medical experiment or trial.
Data that is high volume, high velocity and high variety must be processed with advanced tools
(analytics and algorithms) to reveal meaningful information. Because of these characteristics of
the data, the knowledge domain that deals with the storage, processing, and analysis of these data
sets has been labeled Big Data.
FORMS OF DATA
– STRUCTURED FORM
– UNSTRUCTURED FORM.
• Any form of data that does not have a predefined structure is represented as
an unstructured form of data, e.g. video, images, comments, posts and some websites
such as blogs and Wikipedia.
SOURCES OF DATA
DATA ANALYSIS
Data analysis is a process of inspecting, cleansing, transforming and modeling data with
the goal of discovering useful information, informing conclusions and supporting decision-
making.
DATA ANALYTICS
• Data analytics is the science of analyzing raw data in order to make conclusions about
that information. This information can then be used to optimize processes to increase
the overall efficiency of a business or system.
Types:
– Descriptive analytics: In descriptive statistics the result is always expressed as a probability
among ‘n’ options, where each option has an equal chance of occurring.
– Predictive analytics, e.g. healthcare, sports, weather, insurance, social media analysis.
This type of analytics uses past data to make predictions and decisions based on
certain algorithms. For example, a doctor questions the patient about past symptoms
in order to treat the illness through already existing procedures.
Prescriptive analytics works with predictive analytics, which uses data to determine near-
term outcomes. Prescriptive analytics makes use of machine learning to help businesses
decide a course of action based on a computer program's predictions.
Fig 0.1 Relation between Social Media, Data Analysis and Big Data
Social media data are used in a number of domains such as health and political trending and
forecasting, hobbies, e-business, cyber-crime, counter-terrorism, time-evolving opinion mining,
social network analysis, and human-machine interactions.
Finally, summarizing all the above concepts, processing for social media data can be
categorized into 3 parts as shown in figure 0.1. The first part consists of social media
websites, the second part consists of the data analysis part and the third part consists of the big data
management layer, which schedules the jobs across the cluster.
Prediction Analytics means we are trying to find conclusions about the future.
Analysis means we always analyze what has happened in the past.
MACHINE LEARNING
• Machine learning is an application of artificial intelligence (AI) that provides systems the
ability to automatically learn and improve from experience without being explicitly
programmed.
• Machine learning focuses on the development of computer programs that can access data
and use it to learn for themselves.
In general, data is passed to a machine learning tool to perform descriptive data analytics
through a set of algorithms built into it. Here both data analytics and data analysis are done by the tool
automatically. Hence we can say that data analysis is a sub-component of data analytics, and
data analytics is a sub-component of the machine learning tool. All these are described in figure 0.2.
The machine learning tool generates a model as its output. From this model, predictive
analytics and prescriptive analytics can be performed, because the model's output is fed back as data to
the machine learning tool. This cycle continues till we get an efficient output.
UNIT - I
1.1 DESIGN DATA ARCHITECTURE AND MANAGE THE DATA FOR ANALYSIS
Data architecture is composed of models, policies, rules or standards that govern which data
is collected, and how it is stored, arranged, integrated, and put to use in data systems and
in organizations. Data is usually one of several architecture domains that form the pillars of
an enterprise architecture or solution architecture.
Various constraints and influences will have an effect on data architecture design. These
include enterprise requirements, technology drivers, economics, business policies and data
processing needs.
• Enterprise requirements
These will generally include such elements as economical and effective system
expansion, acceptable performance levels (especially system access speed), transaction
reliability, and transparent data management. In addition, the conversion of raw data such as
transaction records and image files into more useful information forms through such features
as data warehouses is also a common organizational requirement, since this enables managerial
decision making and other organizational processes. One of the architecture techniques is the
split between managing transaction data and (master) reference data. Another one is splitting
data capture systems from data retrieval systems (as done in a data warehouse).
• Technology drivers
These are usually suggested by the completed data architecture and database architecture
designs. In addition, some technology drivers will derive from existing organizational
integration frameworks and standards, organizational economics, and existing site resources
(e.g. previously purchased software licensing).
• Economics
These are also important factors that must be considered during the data architecture phase.
It is possible that some solutions, while optimal in principle, may not be potential candidates
due to their cost. External factors such as the business cycle, interest rates, market conditions,
and legal considerations could all have an effect on decisions relevant to data architecture.
• Business policies
Business policies that also drive data architecture design include internal organizational policies,
rules of regulatory bodies, professional standards, and applicable governmental laws that can
vary by applicable agency. These policies and rules will help describe the manner in which
an enterprise wishes to process its data.
The logical view (user's view) of data analytics represents data in a format that is
meaningful to a user and to the programs that process those data. That is, the logical
view tells the user, in user terms, what is in the database. The logical level consists of data
requirements and process models, which are processed using data modelling techniques to
result in a logical data model.
The physical level is created when we translate the top-level design into physical tables in
the database. This model is created by the database architect, software architects, software
developers or database administrators. The input to this level comes from the logical level, and various
data modelling techniques are used here with input from software developers or database
administrators. These data modelling techniques are various formats of representation of data
such as the relational data model, network model, hierarchical model, object-oriented model and
entity-relationship model.
Implementation level contains details about modification and presentation of data through the
use of various data mining tools such as RStudio, WEKA and Orange. Each tool has its own
specific features and its own way of presenting the same data. These tools
are very helpful since they are user friendly and do not require much programming
knowledge from the user.
Observation Method:
We need to clearly differentiate our own observations from the observations provided to us by
other people. The range of data storage genres found in Archives and Collections is suitable
for documenting observations, e.g. audio, visual, textual and digital, including sub-genres
of note taking, audio recording and video recording.
There exist various observation practices, and our role as an observer may vary according
to the research approach. We make observations from either the outsider or insider point of view
in relation to the researched phenomenon and the observation technique can be structured or
unstructured. The degree of the outsider or insider points of view can be seen as a movable
point in a continuum between the extremes of outsider and insider. If you decide to take the
insider point of view, you will be a participant observer in situ and actively participate in the
observed situation or community. The activity of a Participant observer in situ is called field
work. This observation technique has traditionally belonged to the data collection methods of
ethnology and anthropology. If you decide to take the outsider point of view, you try to
distance yourself from your own cultural ties and observe the researched community as an
outsider observer. These details are seen in figure 1.2.
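Experimental Designs
There are a number of experimental designs that are used in carrying out an experiment.
However, market researchers have used 4 experimental designs most frequently. These are the
completely randomized design (CRD), the randomized block design, the Latin square design (LSD)
and factorial designs (FD).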
A completely randomized design (CRD) is one where the treatments are assigned
completely at random so that each experimental unit has the same chance of receiving any one
treatment. For the CRD, any difference among experimental units receiving the same treatment
is considered as experimental error. Hence, CRD is appropriate only for experiments with
homogeneous experimental units, such as laboratory experiments, where environmental effects
are relatively easy to control. For field experiments, where there is generally large variation
among experimental plots in such environmental factors as soil, the CRD is rarely used. CRD is
mainly used in agricultural field.
Step 1. Determine the total number of experimental plots (n) as the product of the number of
treatments (t) and the number of replications (r); that is, n = rt. For our example, n = 5 x 4 =
20. Here, one pot with a single plant in it may be called a plot. In case the number of replications
is not the same for all the treatments, the total number of experimental pots is to be obtained
as the sum of the replications for each treatment, i.e.,
$n = \sum_{i=1}^{t} r_i$, where $r_i$ is the number of replications of the i-th treatment.
Step 2. Assign a plot number to each experimental plot in any convenient manner; for example,
consecutively from 1 to n.
Step 3. Assign the treatments to the experimental plots randomly using a table of random
numbers.
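The three steps above can be sketched in Python. This is a minimal illustration, not the procedure from the text: it replaces the table of random numbers with Python's pseudo-random generator, and the function name, treatment labels (T1 to T4) and the choice of 4 treatments with 5 replications are assumptions made for the example.

```python
import random

def completely_randomized_design(treatments, replications, seed=None):
    """Randomly assign treatments to plots numbered 1..n, where n = r * t."""
    rng = random.Random(seed)
    # Step 1: one label per plot, so there are n = r * t plots in total.
    labels = [t for t in treatments for _ in range(replications)]
    # Step 3: shuffle so every plot has the same chance of any treatment
    # (a pseudo-random shuffle stands in for the table of random numbers).
    rng.shuffle(labels)
    # Step 2: plots are numbered consecutively from 1 to n.
    return {plot: label for plot, label in enumerate(labels, start=1)}

# Hypothetical layout: 4 treatments, 5 replications each, 20 plots.
layout = completely_randomized_design(["T1", "T2", "T3", "T4"], replications=5, seed=1)
for plot, treatment in layout.items():
    print(plot, treatment)
```

Fixing the seed only makes the layout reproducible; in a real experiment the randomization would normally be drawn once and recorded.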
Example 1: Assume that a farmer wishes to perform an experiment to determine which of his
3 fertilizers to use on 2800 trees. Assume that the farmer has a farm divided into 3 terraces, where
those 2800 trees can be divided in the below format
Solution
Scenario 1
First we divide the 2800 trees by random assignment into 3 almost equal parts
Random Assignment1: 933 trees
Random Assignment2: 933 trees
Random Assignment3: 934 trees
So, for example, to random assignment1 we can assign fertilizer1, to random assignment2 we can
assign fertilizer2, and to random assignment3 we can assign fertilizer3.
Scenario 2
Thus the farmer will be able to analyze and compare the performance of the various fertilizers on
different terraces.
Example 2:
A company wishes to test 4 different types of tyre. The tyres' lifetimes, as determined from
their treads, are given, where each tyre type has been tried on 6 similar automobiles assigned at
random to the tyres. Determine whether there is a significant difference between tyres at the 0.05
level.
Solution:
Null Hypothesis: There is no difference between the tyres in their lifetime.
We choose a convenient value close to the average of all values in the table and subtract it
from each observation, for example by choosing 35. This coding simplifies the arithmetic and
does not change the F-ratio.
Now, by using the ANOVA (one-way classification) table, we calculate the F-ratio.
F-Ratio:
The F ratio is the ratio of two mean square values. If the null hypothesis is true, you expect
F to have a value close to 1.0 most of the time. A large F ratio means that the variation among
group means is more than you'd expect to see by chance.
If the F-ratio is close to 1, the data are consistent with the null hypothesis. If the F-ratio is
greater than the critical value, we reject the null hypothesis.
In this scenario the F-ratio is greater than 1, which indicates some variation
between samples; whether it is significant is decided by comparing it with the critical value.
Level of significance = 0.05 (given in question)
Degrees of Freedom = (3, 20)
Critical value = 3.10 (from the 5 percent F table)
F-ratio < critical value, i.e. 2.376 < 3.10
Hence the null hypothesis cannot be rejected: at the 0.05 level there is no significant
difference in lifetime between the tyres.
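The F-ratio calculation for a one-way classification can be sketched as follows. The lifetimes below are hypothetical values invented for illustration (the table from the example is not reproduced here), and the function name is an assumption. The sketch also checks that subtracting a constant such as 35 from every observation leaves the F-ratio unchanged.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way classification."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical lifetimes: 4 tyre types, each tried on 6 automobiles.
tyres = [
    [33, 38, 36, 40, 31, 35],
    [32, 40, 42, 38, 30, 34],
    [31, 37, 35, 33, 34, 30],
    [29, 34, 32, 30, 33, 31],
]
f_ratio, df_b, df_w = one_way_anova_f(tyres)

# Coding the data (subtracting 35) does not change F.
coded = [[x - 35 for x in g] for g in tyres]
f_coded, _, _ = one_way_anova_f(coded)

critical = 3.10  # F(3, 20) at the 0.05 level, as quoted in the text
print(f"F = {f_ratio:.3f}, df = ({df_b}, {df_w}), reject null: {f_ratio > critical}")
```

The null hypothesis is rejected only when the computed F exceeds the critical value for the given degrees of freedom.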
In a randomized block design, the experimenter divides subjects into subgroups called
blocks, such that the variability within blocks is less than the variability between blocks.
Then, subjects within each block are randomly assigned to treatment conditions. Compared to
a completely randomized design, this design reduces variability within treatment conditions
and potential confounding, producing a better estimate of treatment effects.
The table below shows a randomized block design for a hypothetical medical experiment.
Gender    Treatment
          Placebo   Vaccine
Male      250       250
Female    250       250
Subjects are assigned to blocks, based on gender. Then, within each block, subjects are randomly
assigned to treatments (either a placebo or a cold vaccine). For this design, 250 men get the
placebo, 250 men get the vaccine, 250 women get the placebo, and 250 women get the vaccine.
It is known that men and women are physiologically different and react differently to medication.
This design ensures that each treatment condition has an equal proportion of men and women. As
a result, differences between treatment conditions cannot be attributed to gender. This randomized
block design removes gender as a potential source of variability and as a potential confounding
variable.
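The blocking-then-randomizing idea can be sketched as below. This is a minimal illustration under stated assumptions: the subject identifiers (M0, F0, ...) and the function name are hypothetical, and 500 men and 500 women are assumed so that the counts match the table.

```python
import random

def randomized_block_design(subjects_by_block, treatments, seed=None):
    """Within each block, shuffle the subjects and deal treatments out evenly."""
    rng = random.Random(seed)
    assignment = {}
    for block, subjects in subjects_by_block.items():
        shuffled = list(subjects)
        rng.shuffle(shuffled)  # randomization happens inside the block only
        for i, subject in enumerate(shuffled):
            assignment[subject] = (block, treatments[i % len(treatments)])
    return assignment

# Hypothetical subjects: 500 men and 500 women, blocked by gender.
blocks = {
    "Male": [f"M{i}" for i in range(500)],
    "Female": [f"F{i}" for i in range(500)],
}
plan = randomized_block_design(blocks, ["Placebo", "Vaccine"], seed=7)

cell_counts = {}
for cell in plan.values():
    cell_counts[cell] = cell_counts.get(cell, 0) + 1
print(cell_counts)
```

Because treatments are dealt out within each block, every treatment condition ends up with the same number of men and women.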
LSD - Latin Square Design - A Latin square is one of the experimental designs which has a
balanced two-way classification scheme, for example a 4 x 4 arrangement. In this scheme
each letter from A to D occurs only once in each row and also only once in each column. It may
be noted that the balanced arrangement will not get disturbed if any row is interchanged with
another.
A B C D
B C D A
C D A B
D A B C
The balanced arrangement achieved in a Latin Square is its main strength. In this design, the
comparisons among treatments will be free from both differences between rows and columns.
Thus the magnitude of error will be smaller than in any other design.
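The 4 x 4 arrangement shown above is a cyclic Latin square: each row is the previous row shifted by one position. A small sketch (function names are assumptions for illustration) builds it and checks the once-per-row, once-per-column property, including after two rows are interchanged.

```python
def cyclic_latin_square(symbols):
    """Build a Latin square by shifting the symbols one place per row."""
    n = len(symbols)
    return [[symbols[(row + col) % n] for col in range(n)] for row in range(n)]

def is_latin_square(square):
    """True if every symbol occurs exactly once in each row and each column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == symbols for c in range(n))
    return len(symbols) == n and rows_ok and cols_ok

square = cyclic_latin_square("ABCD")
for row in square:
    print(" ".join(row))

# Interchanging two rows keeps the arrangement balanced.
swapped = [square[1], square[0], square[2], square[3]]
print(is_latin_square(square), is_latin_square(swapped))
```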
FD - Factorial Designs - This design allows the experimenter to test two or more variables
simultaneously. It also measures interaction effects of the variables and analyzes the impacts
of each of the variables.
In a true experiment, randomization is essential so that the experimenter can infer cause and effect
without any bias.
Internal sources
If available, internal secondary data may be obtained with less time, effort and money than
the external secondary data. In addition, they may also be more pertinent to the situation at hand
since they are from within the organization. The internal sources include
Accounting resources- These give a great deal of information which can be used by the marketing
researcher. They give information about internal factors.
Sales Force Report- It gives information about the sale of a product. The information provided
is from outside the organization.
Internal Experts- These are people who are heading the various departments. They can give an
idea of how a particular thing is working.
Miscellaneous Reports- These are the information you get from operational reports. If
the data available within the organization are unsuitable or inadequate, the marketer should extend
the search to external secondary data sources.
Government Publications- Government sources provide an extremely rich pool of data for
the researchers. In addition, many of these data are available free of cost on internet websites.
There are a number of government agencies generating data. These are:
Registrar General of India- It is an office which generates demographic data. It includes details
of gender, age, occupation etc.
Central Statistical Organization- This organization publishes the national accounts statistics.
It contains estimates of national income for several years, growth rate, and rate of major economic
activities. The Annual Survey of Industries is also published by the CSO. It gives information about
the total number of workers employed, production units, material used and value added by
the manufacturer.
Director General of Commercial Intelligence- This office operates from Kolkata. It gives
information about foreign trade i.e. import and export. These figures are provided region-wise
and country-wise.
Ministry of Commerce and Industries- This ministry through the office of economic advisor
provides information on wholesale price index. These indices may be related to a number of
sectors like food, fuel, power, food grains etc. It also generates All India Consumer Price
Index numbers for industrial workers, urban non-manual employees and agricultural labourers.
Reserve Bank of India- This provides information on Banking Savings and investment. RBI also
prepares currency and finance reports.
Labour Bureau- It provides information on skilled, unskilled, white collared jobs etc.
National Sample Survey- This is done by the Ministry of Planning and it provides social,
economic, demographic, industrial and agricultural statistics.
State Statistical Abstract- This gives information on various types of activities related to the
state like - commercial activities, education, occupation etc.
The Bombay Stock Exchange (it publishes a directory containing financial accounts, key
profitability and other relevant matter)
Various Associations of Press Media.
Export Promotion Council.
Syndicate Services- These services are provided by certain organizations which collect and
tabulate the marketing information on a regular basis for a number of clients who are the
subscribers to these services. So the services are designed in such a way that the information suits
the subscriber. These services are useful in television viewing, movement of consumer goods etc.
These syndicate services provide information data from both households and institutions.
In collecting data from households they use three approaches:
Survey- They conduct surveys regarding lifestyle, sociographic and general topics.
Mail Diary Panel- It may be related to 2 fields - Purchase and Media.
Various syndicate services are Operations Research Group (ORG) and The Indian Marketing
Research Bureau (IMRB).
Importance of Syndicate Services
Syndicate services are becoming popular since the constraints of decision making are changing
and we need more of specific decision-making in the light of changing environment. Also
Syndicate services are able to provide information to the industries at a low unit cost.
Disadvantages of Syndicate Services
The information provided is not exclusive. A number of research agencies provide customized
services which suit the requirements of each individual organization.
International Organization- These include
The International Labour Organization (ILO)- It publishes data on the total and active population,
employment, unemployment, wages and consumer prices.
The Organization for Economic Co-operation and Development (OECD) - It publishes data on
foreign trade, industry, food, transport, and science and technology.
The International Monetary Fund (IMF) - It publishes reports on national and international
foreign exchange regulations.
Based on various features (cost, data, process, source time etc.) various sources of data
can be compared as per table 1.
Sensor data is the output of a device that detects and responds to some type of input from
the physical environment. The output may be used to provide information or input to another
system or to guide a process. Examples are as follows
• A photosensor detects the presence of visible light, infrared transmission (IR) and/or
ultraviolet (UV) energy.
• Lidar, a laser-based method of detection, range finding and mapping, typically uses a low-
power, eye-safe pulsing laser working in conjunction with a camera.
• A charge-coupled device (CCD) stores and displays the data for an image in such a way that
each pixel is converted into an electrical charge, the intensity of which is related to a color in
the color spectrum.
• Smart grid sensors can provide real-time data about grid conditions, detecting outages, faults
and load and triggering alarms.
• Wireless sensor networks combine specialized transducers with a communications
infrastructure for monitoring and recording conditions at diverse locations. Commonly
monitored parameters include temperature, humidity, pressure, wind direction and speed,
illumination intensity, vibration intensity, sound intensity, powerline voltage, chemical
concentrations, pollutant levels and vital body functions.
The simplest form of signal is a direct current (DC) that is switched on and off; this is the
principle by which the early telegraph worked. More complex signals consist of an alternating-
current (AC) or electromagnetic carrier that contains one or more data streams.
Data must be transformed into electromagnetic signals prior to transmission across a network.
Data and signals can be either analog or digital. A signal is periodic if it consists of a
continuously repeating pattern.
The Global Positioning System (GPS) is a space based navigation system that provides
location and time information in all weather conditions, anywhere on or near the Earth where
there is an unobstructed line of sight to four or more GPS satellites. The system provides critical
capabilities to military, civil, and commercial users around the world. The United States
government created the system, maintains it, and makes it freely accessible to anyone with a GPS
receiver.
Accuracy and Precision: This characteristic refers to the exactness of the data. It cannot
have any erroneous elements and must convey the correct message without being misleading.
This accuracy and precision have a component that relates to its intended use. Without
understanding how the data will be consumed, ensuring accuracy and precision could be
off-target or more costly than necessary. For example, accuracy in healthcare might be more
important than in another industry (which is to say, inaccurate data in healthcare could have more
serious consequences) and, therefore, justifiably worth higher levels of investment.
Legitimacy and Validity: Requirements governing data set the boundaries of this characteristic.
For example, on surveys, items such as gender, ethnicity, and nationality are typically
limited to a set of options and open answers are not permitted. Any answers other than these
would not be considered valid or legitimate based on the survey’s requirement. This is the case
for most data and must be carefully considered when determining its quality. The people in
each department in an organization understand what data is valid or not to them, so the
requirements must be leveraged when evaluating data quality.
Reliability and Consistency: Many systems in today’s environments use and/or collect the same
source data. Regardless of what source collected the data or where it resides, it cannot contradict
a value residing in a different source or collected by a different system. There must be a stable
and steady mechanism that collects and stores the data without contradiction or unwarranted
variance.
Timeliness and Relevance: There must be a valid reason to collect the data to justify the effort
required, which also means it has to be collected at the right moment in time. Data collected too
soon or too late could misrepresent a situation and drive inaccurate decisions.
Availability and Accessibility: This characteristic can be tricky at times due to legal and
regulatory constraints. Regardless of the challenge, though, individuals need the right level of
access to the data in order to perform their jobs. This presumes that the data exists and is
available for access to be granted.
Granularity and Uniqueness: The level of detail at which data is collected is important, because
confusion and inaccurate decisions can otherwise occur. Aggregated, summarized and
manipulated collections of data could offer a different meaning than the data implied at a
lower level. An appropriate level of granularity must be defined to provide sufficient uniqueness
and distinctive properties to become visible. This is a requirement for operations to function
effectively.
Noisy data is meaningless data. The term has often been used as a synonym for corrupt
data. However, its meaning has expanded to include any data that cannot be understood and
interpreted correctly by machines, such as unstructured text.
An outlier is an observation that lies an abnormal distance from other values in a random
sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus
process) to decide what will be considered abnormal.
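Since the definition leaves the threshold to the analyst, one widely used convention is the 1.5 x IQR rule, sketched below. The rule, the multiplier k = 1.5, the function name and the sample readings are assumptions chosen for illustration and are not prescribed by the text.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 100]
flagged = iqr_outliers(readings)
print(flagged)
```

A larger k makes the rule more tolerant, which is exactly the judgement the definition leaves to the analyst.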
In statistics, missing data, or missing values, occur when no data value is stored for the
variable in an observation. Missing data are a common occurrence and can have a significant
effect on the conclusions that can be drawn from the data. Missing values can be replaced by
following techniques:
Noisy data
• Examples: distortion of a person’s voice when talking on a poor phone line and “snow” on a
television screen
• We can talk about signal-to-noise ratio (SNR).
Left image: two sine waves with little or no noise (high SNR); right image: the same two waves
combined with noise (low SNR)
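The signal-to-noise ratio can be illustrated with a short sketch. It assumes the common decibel definition, SNR = 10 log10(signal power / noise power), with power taken as the mean of the squared samples; the sine frequency, noise levels and function names are arbitrary choices for the example.

```python
import math
import random

def power(samples):
    """Average power: mean of the squared samples."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels."""
    return 10 * math.log10(power(signal) / power(noise))

rng = random.Random(0)
n = 1000
signal = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
weak_noise = [rng.gauss(0, 0.05) for _ in range(n)]
strong_noise = [rng.gauss(0, 0.5) for _ in range(n)]

clean_snr = snr_db(signal, weak_noise)    # little noise: high SNR
noisy_snr = snr_db(signal, strong_noise)  # heavy noise: low SNR
print(f"{clean_snr:.1f} dB vs {noisy_snr:.1f} dB")
```

The more noise is mixed into the same signal, the lower the SNR becomes.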
Origins of noise
However, missing (null) values may have significance in themselves (e.g. a missing test in a
medical examination; a missing death date means the person is still alive).
Duplicate Data
Data set may include data objects that are duplicates, or almost duplicates of one another
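A minimal sketch of duplicate handling is given below. The records are hypothetical, and the normalisation used to catch "almost duplicates" (trimming spaces and lower-casing text) is only one simple assumption; real near-duplicate detection usually needs domain-specific matching rules.

```python
def drop_exact_duplicates(records):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

def normalise(record):
    """Trim and lower-case text fields so trivial variants compare equal."""
    return {k: v.strip().lower() if isinstance(v, str) else v for k, v in record.items()}

# Hypothetical data objects: one exact duplicate and one almost-duplicate.
people = [
    {"name": "Asha Rao", "city": "Pune"},
    {"name": "Asha Rao", "city": "Pune"},
    {"name": " asha rao ", "city": "PUNE"},
    {"name": "Ravi Kumar", "city": "Delhi"},
]
exact = drop_exact_duplicates(people)
near = drop_exact_duplicates([normalise(p) for p in people])
print(len(people), len(exact), len(near))
```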
• Data Cleaning: Data is cleansed through processes such as filling in missing values,
smoothing the noisy data, or resolving the inconsistencies in the data.
• Data Integration: Data with different representations are put together and conflicts
within the data are resolved.
• Data Transformation: Data is normalized, aggregated and generalized.
• Data Reduction: This step aims to present a reduced representation of the data in a
data warehouse.
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is
done. It involves handling of missing data, noisy data etc.
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is
divided into segments of equal size and then various methods are performed to
complete the task. Each segment is handled separately. One can replace all
data in a segment by its mean, or boundary values can be used to complete the
task.
2. Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. The outliers may go
undetected, or they will fall outside the clusters.
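The binning method described above can be sketched as follows. The sketch assumes equal-size (equal-frequency) segments of sorted data; the sample prices and function names are illustrative and not taken from the text.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the data, split it into equal-size bins, replace each value by its bin mean."""
    data = sorted(values)
    smoothed = []
    for start in range(0, len(data), bin_size):
        segment = data[start:start + bin_size]
        mean = sum(segment) / len(segment)
        smoothed.extend([mean] * len(segment))
    return smoothed

def smooth_by_bin_boundaries(values, bin_size):
    """Replace each value by the nearer of its bin's minimum and maximum."""
    data = sorted(values)
    smoothed = []
    for start in range(0, len(data), bin_size):
        segment = data[start:start + bin_size]
        low, high = segment[0], segment[-1]
        smoothed.extend(low if v - low <= high - v else high for v in segment)
    return smoothed

# Illustrative data, three bins of four values each.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
by_means = smooth_by_bin_means(prices, bin_size=4)
by_bounds = smooth_by_bin_boundaries(prices, bin_size=4)
print(by_means)
print(by_bounds)
```

Smoothing by means pulls every value to the centre of its segment, while smoothing by boundaries keeps the extremes of each segment intact.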
2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for mining
process. This involves following ways:
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to
1.0)
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help
the mining process.
3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or
conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from level to higher level in hierarchy. For Example-
The attribute “city” can be converted to “country”.
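Three of the transformations above can be sketched in a few lines. Min-max scaling is assumed as the normalization method (the text only gives the target ranges), and the age cut points, labels and city-to-country mapping are hypothetical values for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly into the range [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        return [new_min for _ in values]
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

def discretize(value, cut_points, labels):
    """Replace a raw numeric value by the label of the interval it falls in."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]

# Concept hierarchy: a hypothetical mapping from the lower level to the higher level.
CITY_TO_COUNTRY = {"Hyderabad": "India", "Chennai": "India", "Paris": "France"}

ages = [18, 25, 40, 60]
scaled = min_max_normalize(ages)             # range 0.0 to 1.0
signed = min_max_normalize(ages, -1.0, 1.0)  # range -1.0 to 1.0
age_levels = [discretize(a, [20, 50], ["young", "adult", "senior"]) for a in ages]
countries = [CITY_TO_COUNTRY[c] for c in ["Hyderabad", "Paris"]]
print(scaled, signed, age_levels, countries)
```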
3. Data Reduction:
Data mining is a technique that is used to handle huge amounts of data. While working
with a huge volume of data, analysis becomes harder. To deal with this, we
use data reduction techniques. Data reduction aims to increase the storage efficiency and reduce data
storage and analysis costs.
UNIT – II
INTRODUCTION TO ANALYTICS
2.1 Introduction to Analytics
As an enormous amount of data gets generated, the need to extract useful insights is a must
for a business enterprise. Data Analytics has a key role in improving your business. Here are
4 main factors which signify the need for Data Analytics:
• Gather Hidden Insights – Hidden insights from data are gathered and then analyzed with
respect to business requirements.
• Generate Reports – Reports are generated from the data and are passed on to the
respective teams and individuals, who take further actions to grow the
business.
• Perform Market Analysis – Market Analysis can be performed to understand the
strengths and the weaknesses of competitors.
• Improve Business Requirement – Analysis of data allows a business to improve its response to
customer requirements and experience.
Data Analytics refers to the techniques to analyze data to enhance productivity and business
gain. Data is extracted from various sources and is cleaned and categorized to analyze different
behavioral patterns. The techniques and the tools used vary according to the organization or
individual.
Data analysts translate numbers into plain English. A Data Analyst delivers value to their
companies by taking information about specific topics and then interpreting, analyzing, and
presenting findings in comprehensive reports. So, if you have the capability to collect data
from various sources, analyze the data, gather hidden insights and generate reports, then you can
become a Data Analyst. Refer to the image below:
In general, data analytics also relies on a degree of human knowledge, as shown in
figure 2.2: each type of analytics requires some human input for prediction.
Descriptive analytics requires the highest human input, while predictive analytics
requires less. In the case of prescriptive analytics, no human input is required since all
the data is predicted.
In general, data analytics brings together three main parts: subject knowledge, statistics, and a
person with the computer knowledge to work with a tool that gives insight into the business. The
most commonly used tools are R and Python, as shown in figure 2.3.
With the increasing demand for Data Analytics in the market, many tools have emerged with
various functionalities for this purpose. Either open-source or user-friendly, the top tools in the
data analytics market are as follows.
• R programming – This tool is the leading analytics tool used for statistics and data modeling.
R compiles and runs on various platforms such as UNIX, Windows, and Mac OS. It also
provides tools to automatically install all packages as per user-requirement.
• Python – Python is an open-source, object-oriented programming language which is easy to
read, write and maintain. It provides various machine learning and visualization libraries
such as Scikit-learn, TensorFlow, Matplotlib, Pandas, Keras etc. It can also work with
data sources such as SQL Server, a MongoDB database or JSON.
• Tableau Public – This is a free software that connects to any data source such as Excel,
corporate Data Warehouse etc. It then creates visualizations, maps, dashboards etc with real-
time updates on the web.
• QlikView – This tool offers in-memory data processing with the results delivered to the end-
users quickly. It also offers data association and data visualization with data being compressed
to almost 10% of its original size.
• SAS – A programming language and environment for data manipulation and analytics, this
tool is easily accessible and can analyze data from different sources.
• Microsoft Excel – This tool is one of the most widely used tools for data analytics. Mostly
used for clients’ internal data, it supports tasks that summarize the data, for example
with pivot tables.
• RapidMiner – A powerful, integrated platform that can integrate with many data source types
such as Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase etc. This tool is mostly used
for predictive analytics, such as data mining, text analytics, machine learning.
• KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics platform,
which allows you to analyze and model data. With the benefit of visual programming,
KNIME provides a platform for reporting and integration through its modular data pipeline
concept.
• OpenRefine – Formerly known as Google Refine, this data cleaning software will help you clean
up data for analysis. It is used for cleaning messy data, transforming data and
parsing data from websites.
• Apache Spark – One of the largest large-scale data processing engines, this tool can execute
applications in Hadoop clusters up to 100 times faster in memory and 10 times faster on disk.
This tool is also popular for data pipelines and machine learning model development.
Apart from the above-mentioned capabilities, a Data Analyst should also possess skills such
as Statistics, Data Cleaning, Exploratory Data Analysis, and Data Visualization. Also, if you have
knowledge of Machine Learning, then that would make you stand out from the crowd.
Data analytics is applied in business for a variety of purposes, which vary according to
business needs and are discussed below. Nowadays the majority of businesses rely on
prediction over large amounts of data.
Using big data as a fundamental factor in decision making requires new capabilities, and most
firms are still far from being able to access all of their data resources. Companies in various
sectors have acquired crucial insights from structured data collected from different enterprise
systems and analyzed by commercial database management systems. Eg:
1.) Companies use Facebook and Twitter to gauge the immediate impact of campaigns and to
examine consumer opinion about their products.
2.) Some companies, like Amazon, eBay, and Google, considered early leaders, examine the
factors that drive performance to determine what raises sales revenue and user
interactivity.
Hadoop is an open-source software platform that enables processing of large data sets in a
distributed computing environment. Work on the platform discusses big data concepts and the
rules for building, organizing and analyzing huge data sets in the business environment. It
proposes 3 architecture layers, points to graphical tools for exploring and representing
unstructured data, and describes how well-known companies could improve their business.
Eg: Google, Twitter and Facebook have shown interest in processing big data within a cloud
environment.
The Map() step: Each worker node applies the Map() function to the local data and writes the
output to a temporary storage space. The Map() code is run exactly once for each K1 key value,
generating output that is organized by key values K2. A master node arranges it so that for
redundant copies of input data only one is processed.
The Shuffle() step: The map output is sent to the reduce processors, which assign the K2 key
value that each processor should work on, and provide that processor with all of the
map-generated data associated with that key value, such that all data belonging to one key are
located on the same worker node.
The Reduce() step: Worker nodes process each group of output data (per key) in parallel,
executing the user-provided Reduce() code; each function is run exactly once for each K2 key
value produced by the map step.
Produce the final output: The MapReduce system collects all of the reduce outputs and sorts
them by K2 to produce the final outcome.
Fig.2.4 shows the classical “word count problem” using the MapReduce paradigm. As shown in
Fig.2.4, initially a process splits the data into a subset of chunks that will later be processed
by the mappers. Once the key/values are generated by the mappers, a shuffling process is used to
combine these key values (bringing the same keys together in the same worker node). Finally,
the reduce functions are used to count the words, generating a common output as the result of
the algorithm. As a result of the execution of the mappers/reducers, the output will be a
sorted list of word counts from the original text input.
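The word count flow described above can be imitated on a single machine in plain Python. The sketch below is only an illustration of the Map, Shuffle and Reduce steps, not Hadoop code; the function names and the three input chunks are assumptions made for the example.

```python
from collections import defaultdict


def map_step(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_step(mapped_chunks):
    """Shuffle: group all emitted values by key, as if routing each key to one node."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_step(key, values):
    """Reduce: run once per key and sum its values."""
    return key, sum(values)


def word_count(chunks):
    mapped = [map_step(chunk) for chunk in chunks]      # one mapper per chunk
    grouped = shuffle_step(mapped)
    reduced = [reduce_step(key, values) for key, values in grouped.items()]
    return sorted(reduced)                               # final output sorted by key


# Hypothetical input, already split into three chunks.
chunks = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
result = word_count(chunks)
print(result)
```

In a real cluster the mappers and reducers run on different worker nodes in parallel; here the list comprehensions simply stand in for that parallelism.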
IBM and Microsoft are prominent representatives. IBM offers many big data options that
enable users to store, manage, and analyze data from various resources; it has a strong
showing in business intelligence and healthcare. Microsoft has likewise done powerful work
in cloud computing activities and techniques. Another example is Facebook and Twitter, which
collect various data from users' profiles and use it to increase their revenue.
Big data analytics and business intelligence are closely related fields that have become widely
significant in both business and academia. Companies are continually trying to gain insight from
the three V's (variety, volume and velocity) to support decision making.
2.4 Databases
A database is an organized collection of structured information, or data, typically
stored electronically in a computer system. A database is usually controlled by
a database management system (DBMS).
Databases can be divided into various categories such as text databases,
desktop database programs, relational database management systems (RDBMS), and NoSQL and
object-oriented databases.
A text database is a system that maintains a (usually large) text collection and provides
fast and accurate access to it. Eg: text books, magazines, journals, manuals etc.
A relational database (RDB) is a collective set of multiple data sets organized by tables,
records and columns. RDBs establish a well-defined relationship between
database tables. Tables communicate and share information, which facilitates data searchability,
organization and reporting. Eg: SQL, Oracle, Db2, DBaaS etc.
NoSQL databases are non-tabular, and store data differently than relational tables.
NoSQL databases come in a variety of types based on their data model. The main types are
document, key-value, wide-column, and graph. Eg: JSON, MongoDB, CouchDB etc.
Object-oriented databases (OODB) are databases that represent data in the form
of objects and classes. In object-oriented terminology, an object is a real-world entity, and a class
is a collection of objects. Object-oriented databases follow the fundamental principles of
object-oriented programming (OOP), as found in languages such as C++, Java, C#, Smalltalk,
LISP etc.
In any database we work with data to perform some kind of analysis and
prediction. In a relational database management system we normally use rows to represent
records and columns to represent attributes.
In terms of big data, a column from an RDBMS is treated as an attribute or a variable.
A variable can be one of two types: categorical (qualitative) data, which is nominal or
ordinal, and quantitative data, which is discrete or continuous, as shown in figure 2.5.
In Nominal Data there is no natural ordering of the values of the attribute in the dataset. Eg:
color, gender, nouns (name, place, animal, thing). These categories cannot be given a predefined
order; for example, there is no specific way to arrange the gender of 50 students in a class. The
first student can be male or female, and similarly for all 50 students, so no ordering
can be valid.
In Ordinal Data there is a natural ordering of the values of the attribute in the dataset. Eg: size
(S, M, L, XL, XXL), rating (excellent, better, good, worst). In the above examples we can quantify
the data after ordering it, which gives valuable insights into the data.
A Discrete Attribute takes only a finite or countable number of numerical values (integers). Eg:
number of buttons, number of days for product delivery etc. These data can be represented at
every specific interval in case of time series data mining or even in ratio based entries.
A Continuous Attribute can take any real (fractional) value within a range. Eg: price,
discount, height, weight, length, temperature, speed etc. These data can be represented at
every specific interval in case of time series data mining or even in ratio based entries.
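The practical difference between these four attribute types shows up in which operations make sense on them. The short Python sketch below illustrates this with made-up values; the variable names and the sample data are assumptions for the example only.

```python
from collections import Counter

# Nominal: only counting and equality tests are meaningful, there is no order.
colors = ["red", "blue", "red", "green"]
color_counts = Counter(colors)

# Ordinal: values can be sorted, but only with an explicitly defined order.
SIZE_ORDER = ["S", "M", "L", "XL", "XXL"]
sizes = ["L", "S", "XL", "M"]
sorted_sizes = sorted(sizes, key=SIZE_ORDER.index)

# Discrete: integer counts, so arithmetic such as a mean is meaningful.
delivery_days = [2, 3, 5]
mean_days = sum(delivery_days) / len(delivery_days)

# Continuous: any real value within a range.
prices = [199.99, 49.50, 120.25]
max_price = max(prices)

print(color_counts, sorted_sizes, mean_days, max_price)
```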
Data modelling is the process by which data is given a structured format for storage
in a database. Data modelling is important because it enables organizations to make
data-driven decisions and meet varied business goals.
The entire process of data modelling is not as easy as it seems, though. You are required
to have a deeper understanding of the structure of an organization and then propose
a solution that aligns with its end-goals and helps it achieve the desired objectives.
Data modeling can be achieved in various ways. However, the basic concept of each of
them remains the same. Let’s have a look at the commonly used data modeling methods:
Hierarchical model
As the name indicates, this data model makes use of hierarchy to structure the data in
a tree-like format as shown in figure 2.6. However, retrieving and accessing data is difficult in
a hierarchical database. This is why it is rarely used now.
Relational model
In the relational model, data is represented in tables made up of rows (records) and columns
(attributes), as described in section 2.4.
Network model
The network model is inspired by the hierarchical model. However, unlike the
hierarchical model, this model makes it easier to convey complex relationships as each record
can be linked with multiple parent records as shown in figure 2.8. In this model data can be shared
easily and the computation becomes easier.
Object-oriented model
This database model consists of a collection of objects, each with its own features and
methods. This type of database model is also called the post-relational database model as shown
in figure 2.8.
Entity-relationship model
The entity-relationship diagram shows the relations between variables together with their
primary keys and foreign keys, as shown in figure 2.10. It also shows the multiple instances
of relations between tables.
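To make primary and foreign keys concrete, the sketch below builds two related tables with Python's built-in sqlite3 module. The table names, column names and rows are hypothetical and chosen only to show a one-to-many relation (one customer, many purchases).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# customer_id is the primary key of customer ...
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
# ... and a foreign key in purchase, which links each purchase to one customer.
conn.execute(
    """CREATE TABLE purchase (
           purchase_id INTEGER PRIMARY KEY,
           customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
           amount REAL NOT NULL)"""
)

conn.execute("INSERT INTO customer VALUES (1, 'Asha')")
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?, ?)", [(10, 1, 250.0), (11, 1, 99.5)]
)

# Follow the relation: join purchases to their customer through the key pair.
rows = conn.execute(
    """SELECT c.name, COUNT(*), SUM(p.amount)
       FROM customer c JOIN purchase p ON p.customer_id = c.customer_id
       GROUP BY c.customer_id"""
).fetchall()

# A purchase that points at a non-existent customer violates the foreign key.
try:
    conn.execute("INSERT INTO purchase VALUES (12, 99, 10.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(rows, orphan_rejected)
```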
Now that we have a basic understanding of data modeling, let’s see why it is important.
You will agree with us that the main goal behind data modeling is to equip your business and
contribute to its functioning. As a data modeler, you can achieve this objective only when you
know the needs of your enterprise correctly.
It is essential to make yourself familiar with the varied needs of your business so that you can
prioritize and discard the data depending on the situation.
Key takeaway: Have a clear understanding of your organization’s requirements and organize
your data properly.
Things will be simple initially, but they can become complex in no time. This is why it is highly
recommended to keep your data models small and simple, to begin with.
Once you are sure of your initial models in terms of accuracy, you can gradually introduce more
datasets. This helps you in two ways. First, you are able to spot any inconsistencies in the initial
stages. Second, you can eliminate them on the go.
Key takeaway: Keep your data models simple. The best data modeling practice here is to use
a tool which can start small and scale up as needed.
Organize your data based on facts, dimensions, filters, and order
You can find answers to most business questions by organizing your data in terms of four
elements – facts, dimensions, filters, and order.
Let’s understand this better with the help of an example. Let’s assume that you run four e-
commerce stores in four different locations of the world. It is the year-end, and you want to
analyze which e-commerce store made the most sales.
In such a scenario, you can organize your data over the last year. Facts will be the overall
sales data of last 1 year, the dimensions will be store location, the filter will be last 12 months,
and the order will be the top stores in decreasing order.
This way, you can organize all your data properly and position yourself to answer an array
of business intelligence questions without breaking a sweat.
Key takeaway: It is highly recommended to organize your data properly using individual tables
for facts and dimensions to enable quick analysis.
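The e-commerce example above can be sketched directly in Python. The sales records, store names and dates below are invented for illustration; the point is only to show where the fact, the dimension, the filter and the order each appear in the computation.

```python
from collections import defaultdict
from datetime import date

# Hypothetical fact records: each sale has an amount (fact) and a store (dimension).
sales = [
    {"store": "London", "date": date(2023, 3, 1), "amount": 1200.0},
    {"store": "London", "date": date(2023, 9, 15), "amount": 800.0},
    {"store": "Mumbai", "date": date(2023, 6, 10), "amount": 1500.0},
    {"store": "Tokyo", "date": date(2023, 11, 5), "amount": 900.0},
    {"store": "Tokyo", "date": date(2022, 12, 20), "amount": 5000.0},  # outside the filter
    {"store": "New York", "date": date(2023, 7, 4), "amount": 2500.0},
]

# Filter: keep only the last 12 months (fixed dates keep the example reproducible).
period_start, period_end = date(2023, 1, 1), date(2023, 12, 31)

# Fact aggregated by dimension: total sales per store location.
totals = defaultdict(float)
for row in sales:
    if period_start <= row["date"] <= period_end:
        totals[row["store"]] += row["amount"]

# Order: top stores first, in decreasing order of sales.
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```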
While you might be tempted to keep all the data with you, do not ever fall for this trap! Although
storage is not a problem in this digital age, you might end up taking a toll on your
machines’ performance.
More often than not, just a small yet useful amount of data is enough to answer all the
business-related questions. Spending heavily on hosting enormous amounts of data only leads to
performance issues, sooner or later.
Key takeaway: Have a clear opinion on how many datasets you want to keep. Maintaining more
than what is actually required wastes your data modeling effort and leads to performance issues.
Data modeling is a big project, especially when you are dealing with huge amounts of data. Thus,
you need to be cautious. Keep checking your data model before continuing to the next
step.
For example, if you need to choose a primary key to identify each record in the dataset properly,
make sure that you are picking the right attribute. Product ID could be one such attribute. Thus,
even if two records have matching counts, their product ID can help you distinguish each record.
Keep checking if you are on the right track. Are the product IDs the same too? In those cases,
you will need to look for another dataset to establish the relationship.
Key takeaway: It is the best practice to maintain one-to-one or one-to-many relationships.
The many-to-many relationship only introduces complexity in the system.
Key takeaway: Data models become outdated quicker than you expect. It is necessary that you
keep them updated from time to time.
The Wrap Up
Data modeling plays a crucial role in the growth of businesses, especially when you want your
organization to base its decisions on facts and figures. To achieve the varied business
intelligence insights and goals, it is recommended to model your data correctly and use
appropriate tools to ensure the simplicity of the system.
In statistics, imputation is the process of replacing missing data with substituted values. ...
Because missing data can create problems for analyzing data, imputation is seen as a way
to avoid pitfalls involved with list-wise deletion of cases that have missing values.
Advantages:
• Works well with numerical dataset.
• Very fast and reliable.
Disadvantage:
• Does not work with categorical attributes
• Does not correlate relation between columns
• Not very accurate.
• Does not account for any uncertainty in data
The k nearest neighbours (KNN) algorithm is used for simple classification. The algorithm
uses ‘feature similarity’ to predict the values of any new data points. This means that the new
point is assigned a value based on how closely it resembles the points in the training set. This can
be very useful for predicting missing values: find the k closest neighbours of the
observation with missing data and then impute the missing values based on the non-missing
values in the neighbourhood.
Advantage:
• This method is generally more accurate than mean, median and mode imputation
Disadvantage:
• Sensitive to outliers
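The two imputation approaches can be compared with a small pure-Python sketch. It assumes the first list of advantages and disadvantages above refers to mean imputation, and it assumes that only one column (target_col) contains missing values, marked as None. The function names and the sample rows are hypothetical.

```python
import math


def mean_impute(column):
    """Replace every missing value (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def _distance(a, b, skip):
    """Euclidean distance between two rows, ignoring the column being imputed."""
    return math.sqrt(
        sum((x - y) ** 2 for i, (x, y) in enumerate(zip(a, b)) if i != skip)
    )


def knn_impute(rows, target_col, k=2):
    """Fill missing values in target_col with the mean of the k nearest complete rows."""
    complete = [r for r in rows if r[target_col] is not None]
    filled_rows = []
    for row in rows:
        filled = list(row)
        if row[target_col] is None:
            nearest = sorted(complete, key=lambda other: _distance(row, other, target_col))[:k]
            filled[target_col] = sum(n[target_col] for n in nearest) / len(nearest)
        filled_rows.append(filled)
    return filled_rows


# Hypothetical data: column 0 is a feature, column 1 has one missing value.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, None], [10.0, 100.0]]

by_mean = mean_impute([r[1] for r in rows])
by_knn = knn_impute(rows, target_col=1, k=2)
print(by_mean[2], by_knn[2][1])
```

In this toy data the mean is pulled upward by the large value 100.0, while the KNN estimate uses only the two rows whose feature value is closest to 3.0.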
UNIT – III
BLUE Property Assumptions
• The Gauss Markov theorem tells us that if a certain set of assumptions are met, the ordinary
least squares estimate for regression coefficients gives you the Best Linear Unbiased
Estimate (BLUE) possible.
• Linearity:
o The model must be linear in the parameters we are estimating with the OLS
method.
• Random:
o Our data must have been randomly sampled from the population.
• Non-Collinearity:
o The regressors being calculated aren’t perfectly correlated with each other.
• Exogeneity:
o The regressors aren’t correlated with the error term.
• Homoscedasticity:
o No matter what the values of our regressors might be, the variance of the error is
constant.
• Checking how well our data matches these assumptions is an important part of estimating
regression coefficients.
• When you know where these conditions are violated, you may be able to plan ways to
change your experiment setup to help your situation fit the ideal Gauss Markov situation
more closely.
• In practice, the Gauss Markov assumptions are rarely all met perfectly, but they are still
useful as a benchmark, and because they show us what ‘ideal’ conditions would be.
• They also allow us to pinpoint problem areas that might cause our estimated regression
coefficients to be inaccurate or even unusable.
• In summary, a linear regression model estimated by ordinary least squares gives the best
linear unbiased estimate (BLUE) possible if four assumptions hold.
• The first of these assumptions can be read as “The expected value of the error term is zero.”
The second assumption is non-collinearity, the third is exogeneity, and the fourth is
homoscedasticity.
Regression Concepts
Regression
• Each xi corresponds to the set of attributes of the ith observation (known as explanatory
variables) and yi corresponds to the target (or response) variable.
• The explanatory attributes of a regression task can be either discrete or continuous.
Regression (Definition)
• Regression is the task of learning a target function f that maps each attribute set x into a
continuous-valued output y.
• To find a target function that can fit the input data with minimum error.
• The error function for a regression task can be expressed in terms of the sum of absolute
or squared error: $\sum_i |y_i - f(x_i)|$ or $\sum_i (y_i - f(x_i))^2$.
• Suppose we wish to fit the following linear model to the observed data: $f(x) = w_1 x + w_0$,
• where w0 and w1 are parameters of the model and are called the regression coefficients.
• A standard approach for doing this is to apply the method of least squares, which
attempts to find the parameters (w0,w1) that minimize the sum of the squared error
• These equations can be summarized by the following matrix equation, which is also
known as the normal equation:
• Since
• the normal equations can be solved to obtain the following estimates for the parameters.
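For reference, the least squares steps described above can be written out in conventional notation. This is a sketch of the standard derivation for a model with parameters w0 and w1 fitted to N observations (x_i, y_i); the symbol SSE for the sum of squared errors is an assumption of this sketch.

```latex
% Sum of squared errors for the linear model f(x) = w_1 x + w_0
\mathrm{SSE} = \sum_{i=1}^{N} \bigl[ y_i - w_1 x_i - w_0 \bigr]^2

% Setting the partial derivatives to zero gives two linear equations
\frac{\partial\,\mathrm{SSE}}{\partial w_0} = -2 \sum_{i=1}^{N} \bigl[ y_i - w_1 x_i - w_0 \bigr] = 0
\qquad
\frac{\partial\,\mathrm{SSE}}{\partial w_1} = -2 \sum_{i=1}^{N} \bigl[ y_i - w_1 x_i - w_0 \bigr] x_i = 0

% The same two equations in matrix form (the normal equation)
\begin{pmatrix} N & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix}
\begin{pmatrix} w_0 \\ w_1 \end{pmatrix}
=
\begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix}
```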
• Thus, the linear model that best fits the data in terms of minimizing the SSE is
• We can show that the general solution to the normal equations given in D.6 can be
expressed as follow:
• Thus, linear model that results in the minimum squared error is given by
• In summary, the least squares method is a systematic approach to fit a linear model to the
response variable y by minimizing the squared error between the true and estimated value
of y.
• Although the model is relatively simple, it seems to provide a reasonably accurate
approximation because a linear model is the first-order Taylor series approximation for any
function with continuous derivatives.
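The least squares fit summarized above can be tried out with a short Python sketch. It uses the standard closed-form solution of the normal equations for a single explanatory variable, w1 = Sxy / Sxx and w0 = mean(y) - w1 * mean(x), where Sxy and Sxx are sums of centred products. The function names and the sample data are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Return (w0, w1) minimizing the sum of squared errors for f(x) = w1 * x + w0."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    w1 = s_xy / s_xx            # slope
    w0 = y_mean - w1 * x_mean   # intercept
    return w0, w1


def sse(xs, ys, w0, w1):
    """Sum of squared errors between the observed ys and the fitted line."""
    return sum((y - (w1 * x + w0)) ** 2 for x, y in zip(xs, ys))


# Hypothetical observations that roughly follow a straight line.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

w0, w1 = fit_line(xs, ys)
print(w0, w1, sse(xs, ys, w0, w1))
```

Any other choice of w0 and w1 gives a larger sum of squared errors on the same data, which is an easy way to sanity-check the fit.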
Malla Reddy Institute of Technology and Science