UNIT - I
Introduction to Data Analytics
Data analytics is the process of examining raw data to uncover trends, patterns, and
insights that support informed decisions. It uses a range of methods and techniques to
transform data into valuable information, aiding decision-making in fields such as
business, healthcare, and engineering.
Sources of Data:
Data can originate from a wide variety of sources, and these sources are generally classified
into:
1. Primary Data: Data collected firsthand for a specific purpose. It is original and
typically gathered through methods like:
Surveys: Questionnaires or interviews with individuals.
Experiments: Controlled environments to test hypotheses.
Observations: Gathering data through human or machine observation in real-time.
2. Secondary Data: Data that has been collected by someone else for another purpose
but is reused. Examples include:
Government Publications: Census reports, public health statistics.
Online Databases: Financial records, research papers, or customer databases.
Web Data: Social media, website traffic logs.
Nature of Data:
The nature of data refers to its structure, content, and form, which can be categorised as:
1. Qualitative Data: Descriptive information that is not numerical, such as text, images,
and videos.
2. Quantitative Data: Numerical information that can be measured and expressed in
numbers (e.g., sales data, temperatures).
Classification of Data
Data can be classified into three broad categories based on its structure:
a. Structured Data:
Data that is organised in a fixed format or schema. It is typically found in relational
databases (e.g., SQL databases). Examples: Financial records in spreadsheets,
Transaction records in databases.
Characteristics:
1. Easy to search and analyse.
2. Organised in rows and columns.
3. Follows a specific format or schema (e.g., CSV, SQL tables).
b. Semi-Structured Data:
Data that does not have a strict formal structure but still follows some organisation, often
using tags or markers to separate elements.
Examples: XML, JSON files, Emails (metadata like sender/recipient, subject, etc.).
Characteristics:
1. Partially organised but flexible.
2. Can be stored in databases but also easily stored in file systems.
3. More complex to analyse than structured data but easier than unstructured data.
c. Unstructured Data:
Data that does not follow a specific structure or model. It’s often in its raw form and needs
significant processing to derive meaning.
Examples: Text documents (e.g., Word files), Multimedia files (e.g., images, videos, audio),
Social media posts, chat logs.
Characteristics:
1. Hard to organise and analyse.
2. Often large in volume.
3. Requires advanced techniques (e.g., Natural Language Processing, Computer
Vision) for meaningful analysis.
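The three categories can be illustrated with a minimal Python sketch that uses only the standard library. The sample records, field names, and variable names below are hypothetical and chosen purely for illustration; real datasets would be far larger and messier.

```python
import csv
import io
import json
from collections import Counter

# Structured: fixed columns, and every row follows the same schema.
csv_text = "order_id,amount\n1,250.0\n2,99.5\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total_amount = sum(float(row["amount"]) for row in rows)

# Semi-structured: keys act as tags, but records may differ in shape.
json_text = (
    '[{"from": "a@example.com", "subject": "Hi"},'
    ' {"from": "b@example.com", "cc": ["c@example.com"]}]'
)
emails = json.loads(json_text)
subjects = [email.get("subject", "(no subject)") for email in emails]

# Unstructured: free text with no schema; structure has to be derived.
post = "Great phone, great battery, poor camera"
word_counts = Counter(post.lower().replace(",", "").split())

print(total_amount)
print(subjects)
print(word_counts.most_common(1))
```

Note how the structured data can be summed directly, the semi-structured data needs a default for a missing key, and the unstructured text needs tokenising before anything can be counted.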
Characteristics of Data
1. Volume: Refers to the size of the dataset. With the rise of big data, the volume of data
available for analysis has grown exponentially.
2. Variety: Different types of data: structured, semi-structured, and unstructured. It includes
text, images, videos, and more.
3. Velocity: The speed at which data is generated and processed. This is especially crucial
for real-time data analytics (e.g., financial market data, sensor data).
4. Veracity: The accuracy, trustworthiness, and reliability of the data. High-quality data is
necessary for meaningful analysis.
5. Value: The usefulness of the data after processing. Data should be actionable and lead to
insights that add value.
Introduction to Big Data Platform
A Big Data platform refers to a comprehensive framework or architecture that handles the
storage, processing, and analysis of large volumes of data. These platforms are designed to
manage both structured and unstructured data from diverse sources and provide tools for
analysing this data at scale. The aim is to extract valuable insights and help organisations
make data-driven decisions.
Key Components of a Big Data Platform:
1. Data Storage: Utilises distributed storage systems like Hadoop HDFS or cloud storage
systems to store vast datasets.
2. Data Processing: Tools like Apache Spark, Hadoop MapReduce, or Flink enable
distributed computing to process large-scale datasets.
3. Data Management: Platforms include data lakes and warehouses (e.g., Amazon Redshift,
Google BigQuery) to organise and query data.
4. Data Analytics: The platform incorporates machine learning, data mining, and real-time
analytics tools (e.g., TensorFlow, PyTorch, Apache Kafka).
5. Visualisation: Tools such as Tableau, Power BI, or Kibana are used to visualise patterns
and trends from large datasets.
Need for Data Analytics
Data analytics has become essential in modern organisations due to several reasons:
1. Volume of Data: With the explosion of data from sensors, social media, and business
processes, manual analysis is impractical. Data analytics helps to process and analyse huge
volumes of data efficiently.
2. Informed Decision-Making: Analytics allows businesses to make decisions based on
data rather than intuition, leading to better outcomes and more efficient processes.
3. Competitive Advantage: Organisations that leverage analytics can gain insights into
customer behaviour, market trends, and operational efficiency, giving them a competitive
edge.
4. Personalization and Targeting: Analytics helps businesses tailor products, services, and
marketing strategies to specific customer segments, improving customer experience.
5. Optimization: Analytics is used for process optimization, from supply chain management
to customer service, reducing costs and improving productivity.
Evolution of Analytic Scalability
The scalability of analytics has evolved significantly over time to meet the demands of
growing data volumes and complexity:
1. Traditional Analytics:
Early analytics involved manual data processing using spreadsheets and small relational
databases (e.g., Excel, SQL-based systems). The focus was on small datasets, and processing was often slow.
2. Data Warehousing:
With the rise of data warehouses (e.g., Oracle, Teradata), businesses could store and query
larger datasets, enabling faster access to historical data. However, these systems were still
limited in their ability to handle unstructured or real-time data.
3. Hadoop and Distributed Computing:
The advent of Hadoop and MapReduce introduced a new era of distributed computing. It
allowed organisations to process vast amounts of unstructured data across clusters of
computers in parallel, improving scalability.
4. Real-Time and Stream Processing:
Technologies like Apache Spark, Apache Kafka, and Flink enabled real-time and stream
processing, allowing organisations to analyse data as it is generated.
5. Cloud-Based Solutions:
Cloud platforms (e.g., AWS, Google Cloud, Microsoft Azure) have further improved
scalability by offering virtually unlimited storage and processing power on-demand. This has
democratised analytics by making it accessible to businesses of all sizes.
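The MapReduce model mentioned under Hadoop above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop's actual API; the function names and sample documents are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]

def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

documents = ["big data big insights", "data drives decisions"]

# On a real cluster, each document (or file block) would be mapped on a different node.
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_totals = reduce_phase(shuffle_phase(mapped))
print(word_totals)
```

Because each call to map_phase depends only on its own document, the map step can run in parallel across machines, which is what makes the approach scale.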
Analytic Process
The data analytic process involves several steps:
1. Data Collection:
Data is gathered from various sources (e.g., transactional databases, social media, IoT
devices).
2. Data Cleaning:
Data is preprocessed to resolve inconsistencies, handle missing values, and reduce noise.
This ensures accuracy and reliability in analysis.
3. Data Transformation:
The cleaned data is transformed into formats suitable for analysis. This can include
normalisation, aggregation, or converting unstructured data into structured form.
4. Data Modeling:
Analytical models (e.g., regression, classification, clustering) are applied to the data to
uncover trends, make predictions, or classify information.
5. Data Analysis:
The model outputs are examined using statistical and machine learning techniques to derive
insights.
6. Interpretation and Reporting:
The results are interpreted and visualised using dashboards or reports, enabling
stakeholders to understand the findings and make informed decisions.
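The six steps can be traced end to end on a toy dataset. The sketch below assumes hypothetical (advertising spend, sales) records and uses a hand-written least-squares fit as the model; it is meant only to show how the steps connect, not to represent a production workflow.

```python
# 1. Collection: hypothetical (ad_spend, sales) records; None marks a missing value.
raw = [(1.0, 3.0), (2.0, 5.0), (None, 4.0), (3.0, 7.0), (4.0, 9.0)]

# 2. Cleaning: drop records with missing fields.
clean = [(x, y) for x, y in raw if x is not None and y is not None]

# 3. Transformation: reshape the records into a feature column and a target column.
xs = [x for x, _ in clean]
ys = [y for _, y in clean]

# 4. Modelling: ordinary least-squares fit of sales = slope * spend + intercept.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x

# 5. Analysis: use the fitted model to estimate sales at an unseen spend level.
predicted = slope * 5.0 + intercept

# 6. Reporting: present the finding in a form stakeholders can read.
print(f"sales = {slope:.2f} * spend + {intercept:.2f}; predicted sales at spend 5.0: {predicted:.2f}")
```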
Analytic Tools
Here are some of the most widely used tools in the field of analytics:
1. Big Data Processing:
Hadoop: A distributed file system and processing framework.
Apache Spark: A fast, in-memory processing engine for large-scale data analytics.
Kafka: A distributed event-streaming platform for real-time data pipelines and streaming analytics.
2. Data Storage:
HDFS (Hadoop Distributed File System): For distributed storage.
Amazon S3: Cloud-based storage for handling big data.
3. Data Analysis and Machine Learning:
Python (NumPy, Pandas, Scikit-learn): Python libraries for data manipulation, analysis,
and machine learning.
R: A statistical computing environment for data analysis and visualisation.
TensorFlow and PyTorch: Machine learning frameworks for building and training models.
4. Data Visualization:
Tableau: A powerful data visualisation tool used for creating dashboards and reports.
Power BI: Microsoft’s business analytics service to visualise data and share insights.
[Link]: A JavaScript library for creating custom data visualisations.
5. Data Warehousing:
Amazon Redshift: A fully managed data warehouse service in the cloud.
Google BigQuery: A serverless, highly scalable, and cost-effective cloud data warehouse.
Analysis vs Reporting
Analysis:
The goal of analysis is to uncover deeper insights, patterns, and relationships in the data. It
involves examining and interpreting data to support decision-making, solve problems, and
predict future trends.
Approach: Analysis is more complex and exploratory. It often uses advanced statistical,
machine learning, or AI techniques to uncover trends, anomalies, or causal relationships.
Examples: Identifying customer segments based on purchase behaviour, Predicting
equipment failure using historical sensor data.
Tools: Python (Pandas, Scikit-learn), R, Apache Spark, Jupyter Notebooks, and TensorFlow.
Reporting:
The goal of reporting is to present data in a clear, structured format, often summarising
metrics or KPIs. It typically involves the organisation and visualisation of data to provide a
snapshot of performance or status.
Approach: Reporting is more descriptive and less exploratory. It involves generating
predefined dashboards, tables, or visualisations to display historical data.
Examples:
Monthly sales performance reports.
Website traffic reports with data on visitors, bounce rate, and page views.
Tools: Tableau, Power BI, Google Data Studio, Excel.
Key Difference: Reporting focuses on what happened (descriptive), whereas analysis
focuses on why it happened or what might happen next (diagnostic and predictive).
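The distinction can be made concrete with a small sketch on the same dataset. The records and the pearson helper below are hypothetical: the reporting step only summarises sales per month, while the analysis step asks how strongly sales move with advertising spend. A high correlation is a starting point for investigation and does not by itself establish a cause.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical records: (month, ad_spend, sales).
records = [
    ("Jan", 10.0, 100.0),
    ("Jan", 20.0, 180.0),
    ("Feb", 30.0, 310.0),
    ("Feb", 40.0, 390.0),
]

# Reporting: summarise what happened (total sales per month).
monthly_sales = defaultdict(float)
for month, _, sales in records:
    monthly_sales[month] += sales

# Analysis: measure the relationship between two variables.
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

spend = [r[1] for r in records]
sales = [r[2] for r in records]
correlation = pearson(spend, sales)

print(dict(monthly_sales))
print(round(correlation, 3))
```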
Modern Data Analytics Tools
Several modern data analytics tools are designed for handling diverse data sources,
performing complex analyses, and visualising results. Here are some of the most prominent:
Apache Spark: A distributed computing system for large-scale data processing. It supports
real-time analytics, machine learning, and stream processing.
Tableau: A leading data visualisation tool that allows users to create interactive dashboards
and reports without coding.
Power BI: Microsoft’s data analytics service for business intelligence, enabling users to
connect, model, and visualise data from various sources.
Google BigQuery: A serverless and highly scalable data warehouse solution, ideal for
analysing massive datasets quickly.
Amazon Redshift: A fully managed data warehouse that handles big data workloads and
allows users to run complex queries on large datasets.
Python Libraries:
Pandas: A powerful library for data manipulation and analysis.
Scikit-learn: A machine learning library that provides algorithms for classification,
regression, clustering, etc.
TensorFlow/PyTorch: Libraries for deep learning and neural networks.
R: A programming language and software environment used for statistical analysis, data
visualisation, and machine learning.
Databricks: A unified data analytics platform that provides tools for data engineering, data
science, and machine learning on top of Apache Spark.
Applications of Data Analytics
Data analytics has applications across various industries, including:
Healthcare:
Predictive analytics for patient outcomes.
Fraud detection in health insurance claims.
Optimising hospital resource allocation.
Finance:
Risk management and fraud detection.
Algorithmic trading and investment strategies.
Customer credit scoring.
Retail:
Personalised marketing and recommendation systems.
Inventory management and demand forecasting.
Customer behaviour analysis.
Manufacturing:
Predictive maintenance for machinery.
Supply chain optimization.
Quality control and defect detection.
Telecommunications:
Customer churn prediction.
Network optimization.
Targeted marketing based on usage patterns.
E-commerce:
Price optimization and dynamic pricing.
Customer segmentation.
A/B testing for user experience improvements.
Data Analytics Lifecycle
The Data Analytics Lifecycle is a systematic process used to guide the execution of data
analytics projects. It ensures that projects are completed efficiently, with clear objectives and
methodologies. This section first explains why such a life cycle is needed and then outlines its key phases:
1. Need for a Data Analytics Lifecycle:
Standardised Process: Provides a structured framework for data-driven problem-solving.
Consistency: Ensures consistency across different analytics projects.
Efficiency: Helps reduce errors, optimise resources, and avoid unnecessary steps.
Clear Outcomes: Keeps teams aligned on goals and deliverables.
2. Key Phases of the Data Analytics Lifecycle:
a) Discovery:
Objective: Understanding the problem and identifying business objectives. Stakeholders
and data scientists collaborate to define the project goals.
Key Tasks:
Identify the problem.
Define success criteria.
Assess the available resources and technology.
b) Data Preparation:
Objective: Collecting and cleaning the data to ensure it's usable for analysis. This phase
involves handling missing values, formatting inconsistencies, and transforming data.
Key Tasks:
Data extraction from multiple sources.
Data cleaning and preprocessing.
Exploratory data analysis.
c) Model Planning:
Objective: Develop a plan for the analytical approach to be used. This includes selecting the
algorithms or techniques that will best address the problem.
Key Tasks:
Define modelling techniques (e.g., regression, classification).
Split the dataset for training and testing.
Plan for model evaluation.
d) Model Building:
Objective: Build and train machine learning models using the prepared data. In this phase,
data scientists experiment with different algorithms and fine-tune models.
Key Tasks:
Model training and optimization.
Use cross-validation or hyperparameter tuning.
Document the model-building process.
e) Model Evaluation:
Objective: Evaluate the model's performance on test data to ensure it meets the business
requirements and is reliable.
Key Tasks:
Measure model accuracy, precision, recall, or other relevant metrics.
Validate the model using unseen data.
f) Deployment:
Objective: Implement the model into the production environment and integrate it into
business processes.
Key Tasks:
Deploy the model on live systems.
Set up monitoring to track model performance over time.
Ensure scalability and maintainability.
g) Feedback & Iteration:
Objective: Gather feedback on the model’s real-world performance and make necessary
adjustments. Often, the life cycle is iterative as models need constant updates and
improvements.
Key Tasks:
Collect feedback from stakeholders.
Adjust the model as needed to optimise performance.
Key Roles for Successful Analytic Projects
Data Scientist: Responsible for analysing the data, building machine learning models, and
deriving insights. They have expertise in statistics, programming, and machine learning.
Data Engineer: Builds and maintains the infrastructure for data generation, storage, and
processing. They work on optimising the data pipeline to ensure the availability of clean data
for analysis.
Business Analyst: Bridges the gap between the technical team and stakeholders. They
understand the business requirements and ensure that the insights derived from the data
align with organisational goals.
Project Manager: Ensures the project is completed on time and within scope. They
coordinate between different teams and manage the lifecycle phases to ensure smooth
execution.
Data Architect: Designs the data infrastructure and ensures that the organisation’s data is
well-organised, accessible, and scalable for future needs.
Stakeholders: Provide business context and define the objectives of the analytic project.
Their feedback is crucial for determining success metrics and ensuring that the analysis
aligns with business needs.
Phases of the Data Analytics Lifecycle
The Data Analytics Lifecycle is a systematic process that ensures analytics projects are
executed effectively from start to finish. It includes several phases, each with specific tasks
aimed at achieving a successful data analytics outcome. Here is an in-depth look at the key
phases:
1. Discovery
Objective: The discovery phase is all about understanding the problem that needs solving,
identifying business objectives, and gathering the resources and information necessary to
proceed.
Key Activities:
Understanding Business Requirements: Stakeholders collaborate to define the scope and
goals of the project.
Formulating Business Hypotheses: Teams identify potential hypotheses and key
questions that the analysis should address.
Assessing Available Resources:
Understand the data sources available.
Review existing tools, technology, and expertise.
Identifying Success Metrics: Define how success will be measured, e.g., increased sales,
customer retention, cost reductions, etc.
Output: A clear project charter that outlines the business objectives, success metrics,
timeline, and key deliverables.
2. Data Preparation
Objective: In this phase, data is collected, cleaned, and organised to ensure it is suitable for
analysis. This is one of the most time-consuming steps in the analytics lifecycle.
Key Activities:
Data Collection:
Gather data from various sources, such as databases, APIs, or third-party systems.
Data Cleaning:
Handle missing or incomplete data (imputing missing values, removing duplicates).
Address data inconsistencies (standardising formats, correcting errors).
Data Transformation:
Convert unstructured data (e.g., text, images) into structured formats.
Normalise and scale numerical data to prepare it for modelling.
Exploratory Data Analysis (EDA):
Use summary statistics and visualisations to understand data patterns and distributions.
Detect outliers and anomalies in the data.
Output: A clean, structured dataset ready for model development, along with any data
transformations needed for analysis.
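The cleaning, transformation, and EDA activities above can be sketched with the Python standard library. The customer records are hypothetical, and the specific choices (median imputation, min-max scaling, the 1.5 * IQR rule for outliers) are common defaults assumed here for illustration rather than the only valid options.

```python
import statistics

# Hypothetical records: (customer_id, age, monthly_spend); None marks a missing age.
raw = [
    (1, 25, 120.0),
    (2, None, 80.0),
    (2, None, 80.0),   # exact duplicate of the previous record
    (3, 35, 100.0),
    (4, 45, 900.0),    # unusually high spend
    (5, 30, 110.0),
    (6, 40, 90.0),
]

# Cleaning: remove exact duplicates while keeping the original order.
deduped = list(dict.fromkeys(raw))

# Cleaning: impute missing ages with the median of the observed ages.
observed_ages = [age for _, age, _ in deduped if age is not None]
median_age = statistics.median(observed_ages)
imputed = [
    (cid, age if age is not None else median_age, spend)
    for cid, age, spend in deduped
]

# Transformation: min-max scale age into the range [0, 1].
ages = [age for _, age, _ in imputed]
lowest, highest = min(ages), max(ages)
scaled_ages = [(age - lowest) / (highest - lowest) for age in ages]

# EDA: flag spend outliers with the 1.5 * IQR rule.
spends = [spend for _, _, spend in imputed]
q1, _, q3 = statistics.quantiles(spends, n=4)
iqr = q3 - q1
outliers = [s for s in spends if s < q1 - 1.5 * iqr or s > q3 + 1.5 * iqr]

print(median_age, scaled_ages, outliers)
```

Whether a flagged value such as the 900.0 spend is an error or a genuine high-value customer is a judgement that needs domain knowledge; the code only surfaces it.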
3. Model Planning
Objective: This phase involves deciding on the analytical techniques and algorithms that will
be used for solving the problem. The focus is on designing the blueprint for the model.
Key Activities:
Selecting the Modeling Techniques:
Decide on the appropriate machine learning or statistical techniques based on the problem
type (e.g., regression, classification, clustering).
Splitting the Data:
Divide the data into training and testing sets to validate model performance.
Use cross-validation techniques to ensure robustness.
Creating a Data Pipeline:
Build workflows for how data will be fed into models and how results will be processed.
Feature Engineering:
Select and transform variables to improve model accuracy (e.g., creating new variables or
removing irrelevant ones).
Output: A plan that outlines the model structure, techniques, data split strategy, and the
criteria for evaluating the model's performance.
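The data split strategy can be sketched as follows. Both helpers are hypothetical, standard-library illustrations of a hold-out split and k-fold cross-validation; libraries such as Scikit-learn provide their own, more complete utilities for this.

```python
import random

def split_train_test(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n_items, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

data = list(range(10))
train, test = split_train_test(data, test_fraction=0.2)
folds = list(k_fold_indices(len(train), k=4))

print(len(train), len(test), len(folds))
```

The test set is held back until final evaluation, while the folds are drawn only from the training portion, so tuning decisions never see the test data.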
4. Model Building
Objective: The actual creation and training of models using the chosen algorithms and
techniques. In this phase, machine learning models are developed, tuned, and evaluated.
Key Activities:
Model Development:
Train machine learning or statistical models on the training dataset.
Experiment with different algorithms to find the best-performing one (e.g., decision trees,
neural networks).
Hyperparameter Tuning:
Adjust model parameters to improve performance (e.g., learning rate, number of trees in a
random forest).
Model Validation:
Evaluate the model on the test set to assess its performance.
Use performance metrics such as precision, recall, and F1 score for classification, or RMSE and R-squared for regression.
Iterative Improvement:
Based on performance, retrain the model or adjust features and parameters to optimize
results.
Output: A final model that meets the predefined success criteria, ready for testing and
deployment.
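The validation metrics named above can be computed directly from their definitions. The sketch below assumes binary labels (1 for the positive class, 0 for the negative class) and hypothetical predictions; the function names are illustrative.

```python
from math import sqrt

def classification_metrics(actual, predicted):
    """Return (precision, recall, f1) for binary labels where 1 is the positive class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def rmse(actual, predicted):
    """Root mean squared error for regression outputs."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical classifier output: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
precision, recall, f1 = classification_metrics(y_true, y_pred)

# Hypothetical regression output.
error = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])

print(precision, recall, f1, round(error, 3))
```

Precision answers "of the cases flagged positive, how many were right?", while recall answers "of the actual positives, how many were found?"; which one matters more depends on the business cost of each kind of error.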
5. Communicating Results
Objective: This phase focuses on interpreting the model’s results and communicating
insights to stakeholders in a clear, actionable format.
Key Activities:
Interpreting the Results:
Translate model outputs into business insights (e.g., identifying factors influencing customer
churn).
Highlight any trends, patterns, or significant findings from the analysis.
Data Visualization:
Use visual tools (e.g., dashboards, charts, graphs) to present findings in an understandable
way.
Tools like Tableau, Power BI, or Matplotlib/Seaborn (in Python) are commonly used.
Creating Reports and Dashboards:
Develop dashboards or reports tailored to the needs of business stakeholders, highlighting
actionable insights and key takeaways.
Stakeholder Communication:
Present the findings to stakeholders, explaining the implications of the analysis in business
terms.
Provide recommendations based on the insights gathered from the data.
Output: A report or presentation summarising the analytical findings, along with
recommendations for decision-makers.
6. Operationalization
Objective: This phase involves deploying the model into production and integrating it into the
business’s operations. The model is used for decision-making, and its performance is
monitored over time.
Key Activities:
Model Deployment:
Deploy the model in a production environment where it can be used in real-time
decision-making (e.g., integrating the model into a web application or a business process).
Automation:
Automate the data pipeline and model predictions to ensure continuous operation (e.g.,
predicting customer churn on a weekly basis).
Monitoring and Maintenance:
Continuously monitor the model’s performance in production to ensure it remains accurate
and relevant.
Use alerts to detect model degradation or when re-training is necessary (e.g., when data
patterns shift).
Scaling the Solution:
Ensure that the system can handle larger datasets or increasing numbers of requests as the
business grows.
Model Retraining:
Periodically retrain the model with new data to maintain accuracy and adapt to changes in
the data patterns.
Output: A fully operational model that is integrated into business processes, providing
ongoing insights or automating decision-making.
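The monitoring activity can be sketched as a rolling accuracy check that signals when retraining may be needed. The class name, window size, and threshold below are hypothetical; a production system would typically also track input data drift and route alerts through its own tooling.

```python
from collections import deque

class AccuracyMonitor:
    """Track accuracy over the most recent predictions and flag degradation."""

    def __init__(self, window_size=5, threshold=0.8):
        self.outcomes = deque(maxlen=window_size)  # oldest results drop off automatically
        self.threshold = threshold

    def record(self, predicted, actual):
        """Store whether one prediction matched the observed outcome."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded yet."""
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def needs_retraining(self):
        """Alert only once the window is full, to avoid noise from tiny samples."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.threshold

monitor = AccuracyMonitor(window_size=5, threshold=0.8)

# Five correct predictions: the model looks healthy.
for _ in range(5):
    monitor.record(predicted=1, actual=1)
healthy_alert = monitor.needs_retraining()

# Two wrong predictions push windowed accuracy to 3/5, below the threshold.
monitor.record(predicted=1, actual=0)
monitor.record(predicted=0, actual=1)
degraded_alert = monitor.needs_retraining()

print(healthy_alert, degraded_alert, monitor.accuracy())
```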