CLOUD COMPUTING AND DEVOPS
(IT23APC601)
III Year | II Semester | SVCE R23 Regulations
Comprehensive Study Material
Covering all 5 Units | 45 Periods
UNIT I: Basics of Cloud Computing
1. Introduction to Cloud Computing
The U.S. National Institute of Standards and Technology (NIST) defines cloud computing as a model for
enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing
resources — such as networks, servers, storage, applications, and services — that can be rapidly
provisioned and released with minimal management effort or service provider interaction.
2. Characteristics of Cloud Computing
Cloud computing is defined by six essential characteristics that distinguish it from traditional IT
infrastructure:
On-demand Self Service: Cloud resources can be provisioned on-demand by users without
requiring interaction with the cloud service provider. The provisioning process is fully automated.
Broad Network Access: Cloud resources are accessible over the network using standard
mechanisms that support heterogeneous client platforms including workstations, laptops, tablets,
and smartphones.
Resource Pooling: Computing and storage resources provided by cloud service providers are
pooled to serve multiple users via multi-tenancy. Multiple users are served by the same physical
hardware.
Rapid Elasticity: Cloud resources can be provisioned and released rapidly and elastically.
Resources scale up or down based on demand through horizontal scaling (scaling-out) or vertical
scaling (scaling-up).
Measured Service: Cloud resources are provided on a pay-per-use model. Usage is measured
and users are charged based on specific metrics.
Multi-tenancy: The cloud's multi-tenant approach allows multiple users to use the same shared
resources. In virtual multi-tenancy, computing and storage resources are shared among users; in
organic multi-tenancy, every component of the system architecture is shared among all tenants.
Additional benefits include: improved performance (resources scale with application workloads), reduced
costs (only required resources are provisioned dynamically), outsourced management (IT infrastructure
requirements delegated to cloud providers), and higher reliability (infrastructure is professionally
managed).
3. Cloud Service Models
NIST defines three primary cloud service models:
Software as a Service (SaaS): Provides complete software applications or user interfaces over a
network. The cloud provider manages underlying infrastructure, OS, and applications. Users access
via thin client interfaces (browsers). Examples: Salesforce CRM, Google Workspace, Microsoft
Office 365.
Platform as a Service (PaaS): Provides the capability to develop and deploy applications in the
cloud using development tools, APIs, software libraries, and services provided by the cloud
provider. The provider manages infrastructure; users manage applications. Examples: Google App
Engine, Windows Azure Web Sites.
Infrastructure as a Service (IaaS): Provides virtual computing, storage, and network resources
provisioned on demand. Users access resources as virtual machine instances and virtual storage.
Pay-per-use/pay-as-you-go billing. Examples: Amazon EC2, Google Compute Engine, Windows
Azure VMs.
4. Cloud Deployment Models
Public Cloud: Available for public use or a large industry group. Cloud resources are shared
among different users. Services provided by a third-party cloud provider.
Private Cloud: Operated for exclusive use of a single organization. Infrastructure can be
on-premise or off-premise, managed internally or by a third party. Best suited for security-critical
applications.
Community Cloud: Available for shared use of several organizations supporting a specific
community. Best suited for organizations that want to share applications, data, and cloud costs.
Hybrid Cloud: Combines multiple clouds (public and private) that remain unique but are bound
together to offer application and data portability. Organizations benefit from both secured private
hosting and cost savings in public clouds.
5. Cloud Services Examples
IaaS Examples:
• Amazon EC2: Provides virtual machine instances from small (1 virtual core, 1.7 GB memory) to
extra-large (4 virtual cores, 15 GB memory). Supports auto-scaling and elastic load balancing.
• Google Compute Engine (GCE): Provides instances from small (1 virtual core, 1.7 GB memory)
to high-memory types (8 virtual cores, 52 GB memory).
• Windows Azure Virtual Machines: Provides instances from small (1 virtual core, 1.75 GB
memory) to memory-intensive (8 virtual cores, 56 GB memory).
PaaS Examples:
• Google App Engine (GAE): Cloud-based web service for hosting web applications and storing
data. Provides automatic scaling and load balancing. Supports Java, Python, PHP, and Go.
SaaS Examples:
• Salesforce: Cloud-based CRM platform allowing sales representatives to manage customer
profiles, track opportunities, and optimize campaigns.
6. Cloud-Based Services and Applications
Cloud computing has transformed multiple industries:
• Healthcare: Secure patient data sharing between hospitals. Patients access personal health
records (PHR) from all care providers.
• Energy Systems: Thousands of sensors gather real-time maintenance data for condition
monitoring and failure prediction in smart grids and power plants.
• Transportation: Intelligent Transportation Systems (ITS) driven by data from multiple sources to
provide services like advanced route guidance and dynamic vehicle routing.
• Manufacturing: Industrial Control Systems (ICS) such as SCADA and DCS generate monitoring
data that when analyzed in the cloud improves plant safety and prevents catastrophic failures.
• Government: Cloud-based e-governance improves service delivery to citizens, businesses, and
government agencies.
• Education: Cloud-based online learning systems, online exams, progress tracking, and
collaboration platforms for students.
• Mobile Communication: Cloud-based virtualization of heterogeneous network devices for the
Radio Access Network (RAN) and Core Network (CN).
Cloud Concepts and Technologies
1. Virtualization
Virtualization refers to partitioning the resources of a physical system (computing, storage, network, and
memory) into multiple virtual resources. It is the key enabling technology of cloud computing that allows
pooling of resources.
Hypervisor (VMM): The virtualization layer. A Type-1 (native) hypervisor runs directly on host
hardware. A Type-2 (hosted) hypervisor runs on top of a conventional operating system.
Full Virtualization: The virtualization layer completely decouples the guest OS from the underlying
hardware. Guest OS requires no modification. Enabled by direct execution and binary translation.
Para-Virtualization: The guest OS kernel is modified to replace non-virtualizable instructions with
hyper-calls that communicate directly with the hypervisor, improving performance and efficiency.
Hardware Virtualization: Enabled by hardware features such as Intel VT-x and AMD-V. Privileged
and sensitive calls automatically trap to the hypervisor, eliminating the need for binary translation.
2. Load Balancing
Load balancing distributes workloads across multiple servers to meet application demands. Goals
include: maximum resource utilization, minimized response times, and maximized throughput.
Load Balancing Algorithms:
• Round Robin: Servers selected one-by-one in a circular fashion with no priority.
• Weighted Round Robin: Servers assigned weights; requests proportionally routed based on
static or dynamic ratios.
• Low Latency: Requests routed to the server with lowest latency.
• Least Connections: Requests routed to the server with fewest active connections.
• Priority: Requests routed to highest-priority server; lower-priority servers act as fallback.
• Overflow: Similar to Priority; when highest-priority server is overloaded, requests overflow to
lower-priority servers.
Persistence Approaches: Sticky Sessions, Session Database, Browser Cookies, URL Re-writing.
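The selection rules above can be illustrated with a minimal Python sketch of Round Robin, Weighted Round Robin, and Least Connections. The server names, weights, and connection counts are hypothetical, and the weighted variant assumes static integer weights only.

```python
import itertools

def round_robin(servers):
    """Yield servers one by one in a circular fashion, with no priority."""
    return itertools.cycle(servers)

def weighted_round_robin(weights):
    """Yield servers in proportion to static integer weights."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)

def least_connections(active):
    """Pick the server that currently has the fewest active connections."""
    return min(active, key=active.get)

rr = round_robin(["s1", "s2", "s3"])
first_four = [next(rr) for _ in range(4)]

wrr = weighted_round_robin({"big": 2, "small": 1})
first_six = [next(wrr) for _ in range(6)]

chosen = least_connections({"s1": 12, "s2": 3, "s3": 7})
```

With a weight of 2 against 1, the "big" server receives two of every three requests, which is the proportional routing the Weighted Round Robin entry describes.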
3. Scalability and Elasticity
Vertical Scaling (Scaling Up): Upgrading hardware resources: adding additional computing,
memory, storage, or network resources.
Horizontal Scaling (Scaling Out): Addition of more resources of the same type — adding more
server instances.
Capacity planning involves determining the right sizing of each tier of application deployment in terms of
resource count and capacity for computing, storage, memory, or network resources.
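As a rough illustration of sizing one tier, the sketch below estimates an instance count from peak load. The function name, the 20% headroom, and the request rates are assumptions chosen for the example, not figures from any provider.

```python
import math

def servers_needed(peak_rps, rps_per_server, headroom=0.2):
    """Instances a tier needs to serve peak load while keeping spare headroom."""
    usable_rps = rps_per_server * (1 - headroom)
    return math.ceil(peak_rps / usable_rps)

# Hypothetical application tier: 1200 requests/s at peak, 200 requests/s per server.
app_tier = servers_needed(peak_rps=1200, rps_per_server=200)
```

Horizontal scaling changes the result of this calculation by adding instances; vertical scaling changes the `rps_per_server` input instead.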
4. Deployment
Cloud application deployment is an iterative process involving:
• Deployment Design: Specifying the number of servers in each tier, computing/memory/storage
capacities, server interconnection, load balancing, and replication strategies.
• Performance Evaluation: Verifying application meets performance requirements by monitoring
workload parameters such as response time and throughput.
• Deployment Refinement: Applying vertical or horizontal scaling, alternative server
interconnections, or different load balancing and replication strategies.
5. Replication
Replication creates and maintains multiple copies of data in the cloud for disaster recovery:
Array-based Replication: Uses compatible storage arrays to copy data from local to remote
storage array at the disk sub-system level. Uses NAS or SAN infrastructure. Higher setup costs.
Network-based Replication: Uses an appliance that intercepts network packets and replicates
them to a secondary location. Supports heterogeneous environments.
Host-based Replication: Runs on standard servers using software to transfer data. Can be
block-based (requires dedicated volumes of the same size) or file-based (allows selective file/folder
replication). More affordable with cloud infrastructure.
6. Monitoring
Cloud monitoring services collect and analyze data on various metrics including CPU usage, memory
usage, disk I/O, network I/O, request latency, error rates, and availability. Monitoring is critical for tracking
the health of applications and services deployed in the cloud.
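A minimal sketch of how a monitoring service might reduce raw request samples to the metrics named above. The sample format (latency in milliseconds, HTTP status) and the rule that status 500 and above counts as an error are assumptions for illustration.

```python
import statistics

def summarize(requests):
    """Summarize (latency_ms, http_status) samples collected in one interval."""
    latencies = [latency for latency, _ in requests]
    errors = sum(1 for _, status in requests if status >= 500)
    return {
        "mean_latency_ms": statistics.mean(latencies),
        "max_latency_ms": max(latencies),
        "error_rate": errors / len(requests),
    }

# Hypothetical samples: three successful requests and one server error.
sample = [(120, 200), (80, 200), (300, 500), (100, 200)]
report = summarize(sample)
```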
7. Software Defined Networking (SDN)
SDN is a networking architecture that separates the control plane from the data plane and centralizes
the network controller. Conventional networks have coupled control and data planes with complex
proprietary devices.
Limitations of conventional networks: Complex network devices, management overhead (multiple
devices and interfaces), limited scalability.
SDN Architecture key elements:
• Centralized Network Controller: Decoupled control and data planes allow rapid network
configuration.
• Programmable Open APIs: Northbound interface for implementing network services such as
routing, QoS, access control.
• Standard Communication Interface (OpenFlow): Southbound interface that directly accesses
and manipulates the forwarding plane. OpenFlow switches use flow tables and a group table for
packet lookups and forwarding.
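The flow-table lookup can be pictured with the simplified sketch below. It is not the OpenFlow wire protocol: the match field names, the action strings, and the choice to send unmatched packets to the controller are assumptions made for the example.

```python
def lookup(flow_table, packet):
    """Return the action of the highest-priority flow entry matching the packet."""
    matches = [
        entry for entry in flow_table
        if all(packet.get(field) == value for field, value in entry["match"].items())
    ]
    if not matches:
        # Assumed table-miss behavior: hand the packet to the central controller.
        return "send_to_controller"
    return max(matches, key=lambda entry: entry["priority"])["action"]

# Hypothetical flow entries installed by the controller.
flow_table = [
    {"priority": 10, "match": {"dst_ip": "10.0.0.2"}, "action": "output:2"},
    {"priority": 20, "match": {"dst_ip": "10.0.0.2", "tcp_port": 22}, "action": "drop"},
]
```

Because the controller owns the table contents, changing network behavior means installing different entries rather than reconfiguring each device by hand.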
8. Network Function Virtualization (NFV)
NFV leverages virtualization to consolidate heterogeneous network devices onto industry-standard
high-volume servers, switches, and storage. NFV enables separation of network functions (implemented in
software) from underlying hardware.
NFV Architecture components:
VNF (Virtualized Network Function): Software implementation of a network function capable of
running over the NFV Infrastructure.
NFVI (NFV Infrastructure): Includes compute, network, and storage resources that are virtualized.
NFV Management and Orchestration: Handles all virtualization-specific management tasks,
orchestration, and lifecycle management of physical/software resources.
9. MapReduce
MapReduce is a parallel data processing model for processing and analyzing massive-scale data:
• Map Phase: Data is read from distributed file system, partitioned among computing nodes, and
sent as key-value pairs. Map tasks process records independently and store intermediate
results on local disk.
• Reduce Phase: When all Map tasks complete, intermediate data with the same key is
aggregated to produce final results.
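The two phases can be simulated on a single machine with a word count, the usual teaching example. This is only a sketch of the programming model in plain Python: there is no distribution, and the `shuffle` helper stands in for the grouping of intermediate data by key that a real framework performs between the phases.

```python
from collections import defaultdict

def map_phase(record):
    """Emit a (word, 1) key-value pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Group intermediate values by key, as happens between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Aggregate all values that share the same key."""
    return key, sum(values)

records = ["cloud computing", "cloud devops cloud"]
intermediate = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(intermediate).items())
```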
10. Identity and Access Management (IAM)
IAM for cloud describes authentication and authorization of users to provide secure access to cloud
resources. IAM enables:
• Centralized management of user identifiers and permissions
• Role-based access control (RBAC) to cloud resources and applications
• Creation of user groups where all users in a group have the same access permissions
• Management of security credentials and access keys
IAM technologies include: OAuth, RBAC, Digital Identities, Security Tokens, and Identity Providers.
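Group-based RBAC can be sketched as three lookup tables and one check. All user, group, role, and permission names below are hypothetical and do not correspond to any provider's IAM API.

```python
# Hypothetical policy data: roles grant permissions, groups hold roles, users join groups.
ROLE_PERMISSIONS = {
    "admin": {"vm:start", "vm:stop", "bucket:read", "bucket:write"},
    "analyst": {"bucket:read"},
}
GROUP_ROLES = {"ops-team": {"admin"}, "data-team": {"analyst"}}
USER_GROUPS = {"alice": {"ops-team"}, "bob": {"data-team"}}

def is_allowed(user, permission):
    """Return True if any role of any group the user belongs to grants the permission."""
    for group in USER_GROUPS.get(user, ()):
        for role in GROUP_ROLES.get(group, ()):
            if permission in ROLE_PERMISSIONS.get(role, ()):
                return True
    return False
```

Every user in a group inherits the same permissions, so access is changed by editing the group's roles rather than each user individually.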
11. Service Level Agreements (SLAs)
An SLA for cloud services formally defines the level of service as part of the service contract. SLAs
contain performance metrics and corresponding service level objectives covering availability, response
time, throughput, and support requirements.
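An availability objective translates directly into a downtime budget. The sketch below assumes a 30-day billing month and hypothetical function names; actual SLAs define their own measurement windows and exclusions.

```python
def allowed_downtime_minutes(availability_percent, days=30):
    """Downtime budget implied by an availability objective over a period."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)

def measured_availability(uptime_minutes, total_minutes):
    """Availability actually achieved, as a percentage."""
    return 100 * uptime_minutes / total_minutes

budget = allowed_downtime_minutes(99.9)            # about 43.2 minutes per 30 days
achieved = measured_availability(43170, 43200)     # 30 minutes of downtime
```

Here 30 minutes of downtime stays within the 99.9% objective, since the budget for the period is roughly 43 minutes.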
12. Billing Models
Elastic Pricing (Pay-as-you-use): Customers charged based on actual usage. Best suited for
short-duration usage where consumption cannot be predicted beforehand.
Fixed Pricing: Customers charged a fixed amount per month. Suited for longer durations with more
control over expenses.
Spot Pricing: Variable pricing driven by market demand. Prices increase with high demand and
decrease with lower demand.
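The choice between elastic and fixed pricing comes down to a break-even point. The rates in this sketch are invented for illustration and are not taken from any provider's price list.

```python
def elastic_cost(hours_used, hourly_rate):
    """Pay-as-you-use charge for the hours actually consumed."""
    return hours_used * hourly_rate

def break_even_hours(hourly_rate, fixed_monthly_fee):
    """Monthly usage at which elastic and fixed pricing cost the same."""
    return fixed_monthly_fee / hourly_rate

def cheaper_plan(hours_used, hourly_rate, fixed_monthly_fee):
    """Name the cheaper billing model for a given monthly usage."""
    if elastic_cost(hours_used, hourly_rate) <= fixed_monthly_fee:
        return "elastic"
    return "fixed"

# Hypothetical prices: 0.25 per instance-hour versus a flat 100 per month.
threshold = break_even_hours(hourly_rate=0.25, fixed_monthly_fee=100)
```

Below the threshold (short or unpredictable usage) elastic pricing is cheaper; an instance that runs all month is better served by the fixed plan.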
Cloud Services and Platforms
1. Compute Services
Compute services provide dynamically scalable compute capacity in the cloud. Virtual machines are
provisioned on-demand from standard or custom images. Features:
• Scalable: Auto-scaling policies triggered by CPU/memory thresholds
• Flexible: Wide range of instance types, operating systems, and zones
• Secure: Security groups, access control lists, network firewalls
• Cost Effective: On-demand, reserved, and spot instance billing options
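A threshold-triggered auto-scaling policy can be sketched as a single decision function. The thresholds, instance limits, and one-instance step size are assumptions; real compute services expose these as configurable policy settings.

```python
def scaling_decision(cpu_percent, instances, min_instances=2, max_instances=10,
                     scale_out_at=70, scale_in_at=30):
    """Return the new instance count for one evaluation of the scaling policy."""
    if cpu_percent > scale_out_at and instances < max_instances:
        return instances + 1   # scale out: add one instance
    if cpu_percent < scale_in_at and instances > min_instances:
        return instances - 1   # scale in: remove one instance
    return instances           # within thresholds, or already at a limit
```

The gap between the two thresholds keeps the group from oscillating when load hovers near a single cut-off value.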
2. Storage Services
Cloud storage services allow storage and retrieval of any amount of data from anywhere. Data is
organized in buckets or containers. Features: high capacity and scalability, multi-facility replication, ACL
policies, server-side encryption, and strong consistency for all upload and delete operations.
3. Database Services
Cloud database services allow setting up and operating relational or non-relational databases:
• Relational databases: MySQL, Oracle, SQL Server with automated backup and guaranteed
IOPS
• Non-relational (NoSQL) databases: Proprietary solutions offering eventual consistency
4. Application Services
Cloud application services include: Application Runtimes (Google App Engine supporting Java, Python,
PHP, Go), Queuing Services (Amazon SQS, Google Task Queue, Azure Queue), Email Services
(Amazon SES, Google Email Service), Notification Services (Amazon SNS, Google Cloud Messaging),
and Media Services (Amazon Elastic Transcoder).
5. Content Delivery Services
Content Delivery Networks (CDNs) are distributed systems of servers across multiple geographic
locations. CDNs serve static content (text, images, scripts) and streaming media with high availability
and low latency by routing requests to nearest edge locations.
6. Analytics Services
Cloud analytics services enable analysis of massive datasets using MapReduce programming models.
Examples: Amazon Elastic MapReduce (based on Hadoop), Google BigQuery (SQL-like querying of
massive datasets), Windows Azure HDInsight (Hadoop-as-a-service).
7. Open Source Private Cloud Software
CloudStack: Open source cloud infrastructure management supporting Zones, Pods, Clusters, Primary
Storage, and Secondary Storage. Eucalyptus: AWS-compatible private cloud with Node Controller,
Cluster Controller, Storage Controller, Cloud Controller, and Walrus components. OpenStack: Cloud OS
with services including nova-compute, nova-networking, Cinder (volumes), Swift (object storage),
Keystone (identity), Glance (image registry), and Horizon (dashboard).
UNIT II: Hadoop and Cloud Application Design
1. Apache Hadoop
Apache Hadoop is an open source framework for distributed batch processing of big data. Its
MapReduce component has been proposed as a parallel programming model suitable for the cloud.
Hadoop Ecosystem Components
Hadoop Common: Utilities and scripts for starting Hadoop, components and interfaces to access
supported file systems.
HDFS (Hadoop Distributed File System): Reliably stores very large files across machines in a
large cluster of commodity hardware. Stores each file as a sequence of blocks (same size except
last).
Hadoop MapReduce: Parallel programming model distributing large-scale computations as
operations on key-value pair datasets.
YARN: Framework for job scheduling and cluster resource management (Hadoop 2.0+).
HBase: Scalable, non-relational, distributed, column-oriented database for structured data storage
of large tables.
Zookeeper: High performance distributed coordination service for configuration information,
naming, synchronization, and group services.
Pig: Dataflow language and execution environment for large dataset analysis. Compiles to
MapReduce jobs.
Hive: Data warehouse infrastructure providing SQL-like HiveQL for easy data summarization and
ad-hoc querying of HDFS data.
Mahout: Scalable machine learning library implementing algorithms for clustering, classification,
and collaborative filtering using MapReduce.
Chukwa: Data collection system for monitoring large distributed systems built on HDFS and
MapReduce.
Cassandra: Scalable multi-master database with no single points of failure.
Oozie: Workflow scheduler system for managing Hadoop jobs.
Flume: Distributed service for collecting, aggregating, and moving large amounts of data from
applications to HDFS.
Sqoop: Tool for efficiently transferring bulk data between Hadoop and relational databases.
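The HDFS block layout described above (equal-sized blocks except the last) can be checked with a short calculation. The 128 MB block size is an assumption for the example; the actual value is a cluster configuration setting.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is stored as."""
    full_blocks, remainder = divmod(file_size, block_size)
    blocks = [block_size] * full_blocks
    if remainder:
        blocks.append(remainder)   # only the last block may be smaller
    return blocks

MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)
```

A 300 MB file therefore occupies two full blocks and one partial block, and each block can be placed and replicated on different DataNodes.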
2. Hadoop MapReduce Job Execution
Hadoop cluster components:
NameNode: Keeps directory tree of all files in the file system and tracks where file data is kept
across the cluster. Single Point of Failure for HDFS.
Secondary NameNode: Creates checkpoints of the namespace. Hosted on a separate machine.
Note: NOT a backup NameNode.
JobTracker: Service that distributes MapReduce tasks to specific nodes in the cluster, ideally
nodes that have the data.
TaskTracker: Node that accepts Map, Reduce, and Shuffle tasks from the JobTracker. Each has a
defined number of slots for tasks.
DataNode: Stores data in HDFS. Responds to requests from NameNode. Multiple DataNodes
ensure data is replicated.
MapReduce Job Execution Workflow
• Client application submits job to JobTracker, receives JobID
• JobTracker contacts NameNode to determine data location
• JobTracker locates TaskTracker nodes with available slots near the data
• TaskTrackers send heartbeat messages every few minutes to report availability
• JobTracker submits work to TaskTrackers using scheduling algorithms (default: FIFO)
• TaskTracker spawns separate JVM process for each task to isolate failures
• On completion, TaskTracker notifies JobTracker; JobTracker updates status
MapReduce 2.0 - YARN
YARN separates resource management from processing in Hadoop 2.0, functioning as an OS for Hadoop
supporting different processing engines (MapReduce, Apache Tez, Apache Storm).
Resource Manager (RM): Manages global assignment of compute resources. Contains Scheduler
(pluggable, enforces resource scheduling policy) and Applications Manager (AsM — manages
running Application Masters).
Application Master (AM): Per-application manager for the application life cycle. Negotiates
resources from RM and works with Node Managers.
Node Manager (NM): Per-machine manager for user processes.
Container: Bundle of resources (memory, CPU, network) allocated by RM for a component task.
3. Hadoop Schedulers
FIFO Scheduler: Default scheduler. Maintains a work queue; pulls jobs first-in-first-out (oldest job
first). No priority or size concept.
Fair Scheduler: Allocates resources evenly between multiple jobs. Assigns resources so each job
gets an equal share on average. Maintains pools with guaranteed capacity. Periodically computes,
for each job, the difference between the compute time it received and the time it would have
received under ideal scheduling.
Capacity Scheduler: Defines named queues each with configurable map/reduce slots and
guaranteed capacity. Within each queue, FIFO with priority is used. User limits ensure cluster is
shared equally.
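The contrast between FIFO and fair scheduling can be sketched in a few lines. This is a deliberately simplified model, not Hadoop's scheduler code: it ignores pools, queues, and priorities, and the job names and compute times are hypothetical.

```python
from collections import deque

def fifo_order(jobs):
    """FIFO: run jobs strictly in arrival order, oldest first."""
    queue = deque(jobs)
    order = []
    while queue:
        order.append(queue.popleft())
    return order

def fair_deficits(received, elapsed_slot_time):
    """Per job: ideal equal share of compute time minus the time actually received."""
    ideal_share = elapsed_slot_time / len(received)
    return {job: ideal_share - got for job, got in received.items()}

def next_job_fair(received, elapsed_slot_time):
    """Fair: give the next free slot to the job furthest below its ideal share."""
    deficits = fair_deficits(received, elapsed_slot_time)
    return max(deficits, key=deficits.get)

# Hypothetical compute time (in slot-seconds) received so far by three jobs.
received = {"j1": 60, "j2": 30, "j3": 10}
```

Under FIFO a long-running first job delays everything behind it, whereas the deficit calculation steers the next slot to the most underserved job.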
4. Hadoop Cluster Setup
Steps for setting up a Hadoop cluster:
• Install Java (Java 6 or later required)
• Download and unpack Hadoop setup tarball on all nodes
• Configure networking so all nodes can connect to each other
• Configure Hadoop using configuration files: [Link], [Link], [Link],
[Link], masters, slaves
• Format NameNode: bin/hadoop namenode -format
• Start HDFS daemons: bin/[Link]
• Start MapReduce daemons: bin/[Link]
Web UIs: NameNode at port 50070, JobTracker at port 50030.
Cloud Application Design
1. Design Considerations
Scalability: Ability to provision adequate resources to meet workload levels for millions of users.
Reliability and Availability: Reliability = probability system performs intended functions under
stated conditions for a specified time. Availability = probability system performs a specified function
at a prescribed time.
Security: Critical design consideration given the outsourced nature of cloud environments.
Maintenance and Upgradation: Design applications with low maintenance and upgradation costs
to achieve rapid time-to-market.
Performance: Applications should be designed with performance requirements in mind.
2. Reference Architecture for Cloud Applications
Three primary reference architectures:
• E-commerce/B2B/Banking Applications: Load Balancing Tier (minimum two load balancers in
separate availability zones) + Application Tier (auto-scaling, minimum two servers) + Database
Tier (master for writes, slaves for reads).
• Content Delivery Applications: Load balancers + relational and non-relational data stores + CDN
for media delivery and static content acceleration.
• Analytics Applications: Web tier + Application tier + Storage tier + Computing/Analytics tier
(Hadoop for big data) + Database tier.
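The database tier of the first architecture (master for writes, slaves for reads) implies a routing rule in the application. The sketch below assumes hypothetical host names and a naive rule that treats any statement starting with SELECT as a read.

```python
import itertools

class DatabaseTier:
    """Route writes to the master and spread reads across the read replicas."""

    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas)

    def route(self, sql):
        verb = sql.lstrip().split()[0].upper()
        if verb == "SELECT":
            return next(self._replicas)   # reads rotate over the replicas
        return self.master                # everything else goes to the master

# Hypothetical host names for one master and two read replicas.
tier = DatabaseTier("db-master", ["db-replica-1", "db-replica-2"])
```

Reads can be scaled horizontally by adding replicas, while the single master remains the only place where data is modified.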
3. Cloud Application Design Methodologies
Service Oriented Architecture (SOA)
SOA is an architectural approach for designing applications as loosely coupled services that can be
shared and reused. Services communicate via messages using SOAP protocol.
WSDL (Web Services Description Language) is an XML-based language for describing services,
containing: Service (function exposed), Endpoint (address), Binding (interface and transport protocol),
Interface (service operations), Operation (message decoding and actions), and Types (data description).
SOA Layers: Business Systems, Service Components, Composite Services, Orchestrated Business
Processes, Presentation Services, Enterprise Service Bus (ESB).
Cloud Component Model (CCM)
CCM provides a flexible way to create cloud applications in a rapid, convenient, and platform-independent
manner. Not tied to any specific programming language or cloud platform.
CCM Application Design Steps:
• Component Design: Identify building blocks based on function and cloud resource type. Each
component has inputs, actions, outputs, a functional interface, and a performance interface.
• Architecture Design: Define interactions between components using loose coupling,
asynchronous communication (via messaging queues), and stateless design.
• Deployment Design: Map components to specific cloud resources. Loosely coupled stateless
components can be deployed independently and migrated between clouds.
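Loose coupling through a messaging queue can be sketched with two components that never call each other directly. Python's in-process `queue.Queue` stands in for a cloud queuing service here, and the component and message names are hypothetical.

```python
import queue

orders = queue.Queue()   # stand-in for a cloud messaging queue

def order_component(order_id):
    """Producer: publish a message and return without waiting for the consumer."""
    orders.put({"order_id": order_id})

def billing_component():
    """Consumer: drain pending messages; keeps no state between calls."""
    processed = []
    while not orders.empty():
        message = orders.get()
        processed.append(f"invoice-{message['order_id']}")
    return processed

order_component(1)
order_component(2)
invoices = billing_component()
```

Because the only shared contract is the message format, either component could be redeployed or moved to another cloud without changing the other.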
Model View Controller (MVC)
Model: Manages data and application behavior. Processes events from controller. Responds to
requests for state information.
View: Prepares user interface, handles user requests, sends them to controller, and presents
information provided by model.
Controller: Glues model to view. Processes user requests, updates model when user manipulates
view, and updates view when model changes.
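The division of responsibilities can be seen in a minimal sketch. The class and method names are illustrative only and do not follow any particular web framework.

```python
class Model:
    """Manages the data and answers requests for state."""
    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)

class View:
    """Presents information provided by the model."""
    def render(self, items):
        return "\n".join(f"- {item}" for item in items)

class Controller:
    """Processes user requests, updates the model, then refreshes the view."""
    def __init__(self, model, view):
        self.model = model
        self.view = view

    def handle_add(self, item):
        self.model.add(item)
        return self.view.render(self.model.items)

page = Controller(Model(), View()).handle_add("deploy app")
```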
RESTful Web Services
REST (Representational State Transfer) is a set of architectural principles for designing web APIs that
focus on system resources. REST constraints:
• Client-Server: Separation of concerns between client and server
• Stateless: Each request contains all information necessary; no stored context on server
• Cacheable: Responses labeled as cacheable or non-cacheable
• Layered System: Components cannot see beyond the immediate layer
• Uniform Interface: Uniform method of communication between client and server
• Code on Demand: Servers can provide executable code for clients (optional)
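The stateless and uniform-interface constraints can be sketched as a pure function that maps an HTTP method and resource path to an operation. The `/books` resource, the status codes chosen, and the in-memory `store` are assumptions for illustration; no web server is involved.

```python
def handle(store, method, path, body=None):
    """Return (status_code, payload) for one request against the /books resource.

    Every call carries all the information it needs; nothing is remembered
    about the client between calls (the resource data lives in `store`).
    """
    parts = path.strip("/").split("/")
    if parts[0] != "books":
        return 404, None
    book_id = parts[1] if len(parts) > 1 else None
    if method == "GET" and book_id is None:
        return 200, list(store.values())
    if method == "GET":
        return (200, store[book_id]) if book_id in store else (404, None)
    if method == "PUT" and book_id is not None:
        store[book_id] = body
        return 200, body
    if method == "DELETE" and book_id is not None:
        if book_id not in store:
            return 404, None
        del store[book_id]
        return 204, None
    return 405, None
```

The same small set of methods applies to every resource, which is what lets clients, caches, and intermediate layers work without resource-specific knowledge.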
4. Data Storage Approaches
Relational (SQL) Databases
Based on the relational model by Edgar Codd (1970). Provide ACID guarantees:
• Atomicity: Each transaction is all-or-nothing; partial changes do not persist.
• Consistency: Each transaction brings database from one valid state to another, conforming to
defined schema and constraints.
• Isolation: Concurrent transactions result in the same state as if executed serially; incomplete
transaction results are not visible to others.
• Durability: Committed data remains persistent even after system outages.
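Atomicity and consistency can be observed with Python's built-in `sqlite3` module. The `accounts` table, its CHECK constraint, and the `transfer` function are hypothetical; the sketch relies on the connection's context manager, which commits on success and rolls back when an exception is raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commit if both updates succeed, roll back otherwise
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

# Overdrawing account 'a' violates the CHECK constraint on the second update.
succeeded = transfer(conn, "a", "b", 150)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The credit to account 'b' is applied first, yet it does not persist: the failed debit rolls the whole transaction back, leaving the database in its previous valid state.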
Pros:
• Well-defined consistent model
• Provide ACID guarantees
• Relational integrity through entity and referential constraints
• Well-suited for OLTP applications
Cons:
• Performance is the major constraint
• Limited support for complex data structures
• Complex knowledge of structure needed for ad hoc queries
• Most relational database systems are expensive
Non-Relational (NoSQL) Databases
Better horizontal scaling capability and improved performance for big data. Offer eventual consistency
rather than ACID guarantees. Categories:
• Key-value store: For unstructured data without a fixed schema. Supports native programming
language data types.
• Document store: Stores semi-structured data in documents (JSON, XML, BSON, YAML).
• Graph store: For data with graph structure (nodes and edges). Suitable for social networks and
transportation systems.
• Object store: Stores data in the form of objects defined in object-oriented programming
languages.
UNIT III: Introduction to DevOps and Business Foundations
1. DevOps Fundamentals
Overview
DevOps is a set of practices, principles, and a cultural approach that integrates software development
(Dev) and IT operations (Ops) to enable faster, reliable, and continuous delivery of software.
DevOps aims to shorten the software development life cycle (SDLC) while maintaining high quality by
encouraging collaboration, automation, and continuous feedback between development and operations
teams.
Key Characteristics of DevOps
• Continuous Integration and Continuous Delivery (CI/CD)
• Automation of build, test, and deployment
• Infrastructure as Code (IaC)
• Continuous monitoring and feedback
• Shared responsibility across teams
Objectives and Benefits
Objectives: Faster time-to-market, improved software quality, reduced deployment failures, increased
operational efficiency, and better customer satisfaction.
Benefits: Rapid and frequent software releases, early detection and resolution of defects, improved
collaboration and productivity, scalable and reliable systems, and business agility and innovation.
2. DevOps Origins
Before DevOps, software development followed traditional models like Waterfall, where development and
operations teams worked independently. The term DevOps emerged around 2008-2009, popularized
through online discussions and conferences.
Problems in Traditional IT Models:
• Development teams focused on speed and new features
• Operations teams focused on stability and reliability
• Manual deployments and testing
• Long release cycles and high risk during production releases
Influences on DevOps:
• Agile Development: Introduced iterative development, customer feedback, and short cycles.
Agile focused on development; DevOps extended these principles to deployment and
operations.
• Continuous Integration: Encouraged frequent code integration, early defect detection, and
automated builds.
• Lean Manufacturing: Contributed waste elimination, continuous improvement, and faster value
delivery principles.
• Cloud Computing and increased system complexity also drove DevOps adoption.
3. DevOps Roots
Agile Software Development: Iterative and incremental development, frequent customer
feedback, collaboration, and adaptability to change.
Lean Principles: Elimination of waste, continuous improvement (Kaizen), small batch sizes, and
faster delivery of value. Helps optimize the entire delivery pipeline.
Continuous Integration (CI): Frequent code integration, automated builds and tests, early defect
detection.
IT Service Management (ITSM): Incident management, change management, service reliability.
DevOps balances ITSM stability with Agile speed.
Systems Thinking: Viewing software delivery as an end-to-end system. Optimizing the whole
system, reducing bottlenecks.
Open Source Culture: Collaboration, transparency, shared ownership, and rapid innovation.
4. DevOps Practices
Continuous Integration (CI): Frequently integrating code into a shared repository. Automated
builds, automated testing, early error detection. Reduces integration problems and improves code
quality.
Continuous Delivery (CD): Software is always ready for release. Reduces manual intervention
and improves reliability.
Continuous Deployment: Software is automatically deployed to production after passing all tests.
Infrastructure as Code (IaC): Managing infrastructure using code. Provides consistency across
environments, version control, and faster provisioning.
Automated Testing: Unit testing, integration testing, and performance testing automated
throughout the pipeline.
Monitoring and Logging: Real-time monitoring, log analysis, and alerting for fast issue resolution
and improved reliability.
Continuous Feedback: Feedback collected from automated tests, monitoring tools, and end users
drives continuous improvement.
5. DevOps Culture
DevOps culture focuses on people and collaboration, not just tools:
• Collaboration and Shared Responsibility: Development and operations teams work together with
shared ownership of success and failures.
• Transparency and Trust: Open communication and clear visibility into processes and issues.
• Learning from Failure: Failures treated as learning opportunities. Blame-free culture with
continuous improvement.
• Automation Mindset: Preference for automated solutions to reduce human error and improve
efficiency.
• Customer-Centric Approach: Focus on delivering customer value with continuous user
feedback.
6. Adopting DevOps
Developing the DevOps Playbook
A DevOps Playbook is a documented guide that defines how DevOps practices are implemented within
an organization. It standardizes DevOps processes, reduces errors, enables faster onboarding, and
ensures scalability and repeatability.
Key Components of a DevOps Playbook:
• Development Workflow: Coding standards, version control practices
• CI/CD Pipeline: Build automation, automated testing, deployment strategies
• Toolchain Definition: CI/CD tools, configuration management tools, monitoring tools
• Infrastructure Management: IaC, cloud and container usage
• Security Integration (DevSecOps): Automated security testing, access control policies
• Monitoring and Incident Management: Logging standards, alerting, incident response
procedures
• Roles and Responsibilities: Clear ownership, cross-functional team responsibilities
Developing the DevOps Business Case
A DevOps Business Case justifies investment in DevOps by linking adoption to measurable business
benefits, helping secure management and stakeholder support.
Key Business Drivers:
• Faster time-to-market
• Improved software quality
• Reduced operational costs
• Increased customer satisfaction
• Higher employee productivity
Key Metrics for Business Case:
• Deployment frequency
• Lead time for changes
• Mean Time to Recovery (MTTR)
• Change failure rate
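The four metrics above can be computed from deployment and incident records. The record layout (dictionaries with `committed`, `deployed`, and `failed` fields, and incident start/resolve pairs) is an assumption for this sketch; real pipelines would pull these values from CI/CD and incident tooling.

```python
from datetime import datetime, timedelta

def devops_metrics(deployments, incidents, period_days):
    """Compute the four business-case metrics for one reporting period.

    deployments: dicts with 'committed' and 'deployed' datetimes and a 'failed' flag.
    incidents: (started, resolved) datetime pairs.
    """
    lead_times = [d["deployed"] - d["committed"] for d in deployments]
    recoveries = [resolved - started for started, resolved in incidents]
    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "mean_lead_time": sum(lead_times, timedelta()) / len(lead_times),
        "mttr": sum(recoveries, timedelta()) / len(recoveries),
        "change_failure_rate": sum(d["failed"] for d in deployments) / len(deployments),
    }

# Hypothetical two-day period with two deployments and one incident.
t0 = datetime(2024, 1, 1, 9, 0)
day = timedelta(days=1)
deployments = [
    {"committed": t0, "deployed": t0 + timedelta(hours=2), "failed": False},
    {"committed": t0 + day, "deployed": t0 + day + timedelta(hours=4), "failed": True},
]
incidents = [(t0 + day + timedelta(hours=4), t0 + day + timedelta(hours=5))]
metrics = devops_metrics(deployments, incidents, period_days=2)
```

Tracking these values before and after adoption gives the business case a measurable baseline rather than a qualitative claim.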
7. Business Model Canvas (BMC)
The Business Model Canvas is a strategic management tool that provides a visual representation of how
an organization creates, delivers, and captures value. It has nine building blocks.
1. Customer Segments
Customer Segments are the different groups of people or organizations that a business aims to serve.
Types:
• Mass Market: Large group with similar needs — e.g., consumer electronics
• Niche Market: Specialized group with specific needs — e.g., medical software for hospitals
• Segmented Market: Slightly different needs within a broader group — e.g., banking for students
vs professionals
• Diversified Market: Multiple unrelated customer segments — e.g., Amazon serving consumers
and businesses
• Multi-Sided Platform: Two or more interdependent groups — e.g., ride-sharing apps connecting
drivers and passengers
2. Value Propositions
Explains why customers choose one product/service over another. Types:
• Newness, Performance, Customization, Getting the Job Done, Design, Brand/Status
• Price, Cost Reduction, Risk Reduction, Accessibility, Convenience/Usability
3. Channels
How a company communicates with and delivers its value proposition to customer segments. Types:
Direct (company website, direct sales) vs Indirect (retailers, distributors). Channel Phases: Awareness,
Evaluation, Purchase, Delivery, After-Sales Support.
4. Customer Relationships
Types: Personal Assistance, Dedicated Personal Assistance, Self-Service (FAQs, tutorials), Automated
Services (chatbots, personalized recommendations), Communities (online forums), Co-Creation
(customer feedback on features).
5. Revenue Streams
Types: Asset Sale, Usage Fee, Subscription Fee, Lending/Renting/Leasing, Licensing,
Brokerage/Transaction Fee, Advertising.
6. Key Resources, Activities, Partnerships, and Cost Structure
Key Resources: Physical (data centers), Intellectual (patents, software), Human (engineers), and
Financial (funding). Essential for delivering value proposition.
Key Activities: Production (software development), Problem-Solving (consulting),
Platform/Network Management.
Key Partnerships: Strategic Alliances (non-competitors), Coopetition (competitors for mutual
benefit), Joint Ventures, Buyer-Supplier Relationships.
Cost Structure: Fixed Costs (rent, salaries), Variable Costs (cloud usage fees), Operational Costs
(utilities, maintenance), Economies of Scale.
UNIT IV: DevOps Strategies for Delivery and Innovation
1. Optimizing the Delivery Pipeline
DevOps as an Optimization Exercise
DevOps is best understood as a continuous optimization problem: how to move ideas to customers faster,
safer, and with less friction under real-world constraints. The object of optimization is the end-to-end
delivery system:
Idea → Code → Build → Test → Release → Operate → Learn
Key optimization goals for high-performing teams:
• Lead time: how fast changes reach production
• Deployment frequency: how often changes are released, a sign of how smoothly work flows
• Change failure rate: quality of changes
• Mean time to recovery (MTTR): resilience when things break
• Feedback latency: how fast learning happens
Using the Theory of Constraints, overall delivery speed is limited by the slowest step. Common
bottlenecks include: manual approvals and release coordination, slow/flaky integration tests, environment
provisioning, cross-team dependencies, and knowledge silos.
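As a rough illustration of the Theory of Constraints point, the sketch below gives each pipeline stage a hypothetical capacity (changes per day) and shows that end-to-end throughput equals the capacity of the slowest stage. The stage names and numbers are assumptions chosen for illustration, not measurements.

```python
# Hypothetical stage capacities in changes per day (illustrative numbers only).
stage_capacity = {
    "code": 40,
    "build": 60,
    "test": 12,      # slow, flaky integration tests
    "release": 20,   # manual approvals
    "operate": 50,
}

def find_bottleneck(capacities):
    """Return (stage, capacity) for the stage with the lowest capacity."""
    stage = min(capacities, key=capacities.get)
    return stage, capacities[stage]

def pipeline_throughput(capacities):
    """A serial pipeline can deliver no faster than its slowest stage."""
    return min(capacities.values())

bottleneck, limit = find_bottleneck(stage_capacity)
print(f"Bottleneck: {bottleneck} ({limit} changes/day)")

# Speeding up a non-constraint stage leaves throughput unchanged.
faster_build = dict(stage_capacity, build=120)
print(pipeline_throughput(faster_build))
```

Only raising the capacity of the constraint itself (here, the hypothetical "test" stage) changes the result, at which point the next-slowest stage becomes the new constraint.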
Flow Over Utilization
Traditional management optimizes for people being busy. DevOps optimizes for work flowing. High
utilization leads to large batch sizes, longer queues, increased risk, and slower feedback. Flow
optimization emphasizes:
• Small, frequent changes and trunk-based development
• Decoupled services and continuous integration
• Limiting Work in Progress (WIP)
• Reducing wait times, not filling calendars
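One way to see why limiting WIP shortens lead time is Little's Law from queueing theory: average lead time = average WIP / average throughput. These notes do not name the law; it is brought in here as a standard result, and the figures below are hypothetical.

```python
def average_lead_time(wip_items, throughput_per_day):
    """Little's Law: average time in system = average WIP / average throughput."""
    if throughput_per_day <= 0:
        raise ValueError("throughput must be positive")
    return wip_items / throughput_per_day

# Same team throughput (5 items/day), different amounts of work in progress.
for wip in (50, 20, 10):
    days = average_lead_time(wip, throughput_per_day=5)
    print(f"WIP={wip:>2} -> average lead time {days:.1f} days")
```

With throughput held constant, cutting WIP is the only lever left in the formula, which is why flow-oriented teams cap it explicitly.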
Automation as Risk Reduction
Automation is primarily about consistency and reliability, not just speed. High-impact automation areas:
builds and tests, infrastructure provisioning, deployments and rollbacks, security and compliance checks,
and incident detection.
Shifting Risk Left and Learning Right
• Shift Left: Early testing and validation, security and policy as code, fast developer feedback
• Shift Right: Observability and telemetry, feature flags and canaries, rapid rollback and recovery,
production-driven learning
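The feature flags and canaries listed under Shift Right can be sketched as a percentage rollout. The following is a minimal sketch, assuming users are identified by a string ID; the flag name "new-checkout" and the function name in_rollout are hypothetical.

```python
import hashlib

def in_rollout(flag_name, user_id, percent):
    """Deterministically place a user in or out of a percentage rollout.

    Hashing flag name + user id gives each user a stable bucket (0-99),
    so the same user keeps the same experience as the percentage grows.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

users = [f"user-{i}" for i in range(1000)]
canary = [u for u in users if in_rollout("new-checkout", u, 5)]
print(f"{len(canary)} of {len(users)} users see the new code path")

# Rolling back is a configuration change: set the percentage to 0.
assert not any(in_rollout("new-checkout", u, 0) for u in users)
```

Because the bucket is derived from a hash rather than a random draw, raising the percentage only adds users; nobody who already has the feature loses it mid-rollout.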
2. Core Themes of DevOps
Ten core themes define the DevOps philosophy:
• Systems Thinking: Optimize the system, not the parts. Treat software delivery as a complex
system.
• Flow of Value: Delays, handoffs, queues, and rework are the real enemies. Flow beats
utilization.
• Constraint-Driven Optimization: The slowest step defines the system. Identify and elevate
constraints.
• Fast Feedback Loops: Speed of learning matters more than speed of execution.
• Small Batches, Frequent Change: Reduce batch size to reduce risk.
• Automation for Reliability: Predictability enables speed. Consistency enables trust.
• Built-In Quality and Safety: Design safety into the system, do not inspect it in later.
• Resilience Over Perfection: Recover fast, do not aim for zero failure.
• Continuous Improvement: Improvement is continuous, not episodic.
• Optimization Enables Innovation: When change is cheap, innovation thrives.
3. The DevOps Plays
DevOps plays are reusable solutions to common delivery problems:
Expose the System: Map the value stream end-to-end. Visualize lead time, wait states, failure
rates. Make work and constraints visible.
Attack the Bottleneck: Identify the current constraint. Optimize only that constraint. System
throughput is defined by its weakest link.
Reduce Batch Size: Trunk-based development, smaller PRs, feature flags, and incremental
releases. Smaller changes move faster and fail safer.
Automate the Repetitive and Risky: CI pipelines, Infrastructure as Code, automated deployment
and rollback. Makes outcomes predictable and scalable.
Shift Quality and Security Left: Test early and often, security as code, fast local developer
feedback. Fixing problems early is cheaper.
Design for Safe Change: Feature flags, canaries, blue/green deployments, immutable
infrastructure, progressive delivery. Risk reduced through design, not approvals.
Optimize for Flow, Not Utilization: Limit WIP, decouple teams and services, reduce dependencies
and handoffs.
Close the Feedback Loop from Production: Strong observability (logs, metrics, traces), user
telemetry, blameless post-incident learning.
Treat the Pipeline as a Product: Platform teams with clear ownership, roadmaps for CI/CD,
measure developer experience (DX).
Make Improvement Continuous: Small, frequent improvements, regular retrospectives on the
system, metrics-driven experimentation.
4. Specializing Core Plays
The same core plays apply everywhere, but execution depends on the dominant constraint:
• Speed-Constrained: Aggressive batch size reduction, trunk-based development, CI
performance optimization, self-service environments.
• Risk-Constrained (regulated): Policy-as-code, evidence generation baked into pipelines,
progressive delivery with automatic rollback, strong audit trails.
• Capacity-Constrained: Limit WIP aggressively, kill low-value work, automate toil first, platform
teams to reduce cognitive load.
• Monolith-Centric: Modularization before microservices, component-level testing, release slicing.
• Microservices: Contract testing, service ownership and SLOs, platform-standardized pipelines.
• Enterprise: Value stream alignment, platform teams as internal product orgs, guardrails over
gates.
• Startups: Bias toward speed and learning, minimal process, maximal observability.
5. Driving Innovation with DevOps
Optimize to Innovate
Innovation is blocked not by lack of ideas but by high cost of change. DevOps drives innovation by
reducing the cost, risk, and time of change across the delivery pipeline. Key principles:
• Optimize for Learning, Not Just Speed: Short lead times, rapid feedback, clear telemetry
signals, and fast failure recovery.
• Make Change Cheap and Safe: Small batch sizes, feature flags, automated testing and
rollback, observability-first design.
• Remove Friction from the Creative Loop: Automate toil, create self-service platforms, reduce
coordination overhead.
• Shift from Projects to Experiments: Hypothesis-driven development, short experiment cycles,
killing bad ideas early.
• Reliability Enables Risk-Taking: Fast detection, low MTTR, confidence in rollback, psychological
safety for teams.
The Uber Syndrome
The Uber Syndrome describes what happens when organizations chase innovation theater — copying
the surface behaviors of hyper-scaling tech companies without optimizing the underlying delivery
systems. Common failure patterns:
• Scale Before Stability: Distributed architectures adopted before reliable single-system
deployment.
• Autonomy Without Enablement: Teams declared 'fully autonomous' without reliable pipelines,
observability, or self-service infrastructure.
• Speed Mandates on Fragile Systems: Faster releases demanded without investment in
automation, testing, or resilience.
• DevOps as Culture, Not System: Framed as 'collaborate more' without changing incentives,
tooling, or workflows.
The Antidote: Fix delivery bottlenecks and stabilize CI/CD first. Build capability before complexity. Copy
principles (small safe changes, fast feedback, designed-in safety), not implementations. Speed is a
reward for system reliability.
Role of Technology in Innovation
Technology is a force multiplier, not a strategy. Key contributions:
• Lowers the Cost of Change: Rapid provisioning (cloud, IaC), safe deployment (CI/CD, feature
flags), fast rollback, observability.
• Enables Fast Feedback Loops: Automated testing, telemetry, user analytics, real-time
monitoring.
• Abstracts Complexity: Infrastructure platforms, managed services, APIs, and SDKs free teams
for problem-solving.
• Supports Safe Failure: Automatic rollback, redundancy, chaos testing turn failure into learning.
• Enables Scaling Innovation: Platforms encode best practices, making innovation repeatable
across teams.
Limits: Technology cannot fix misaligned incentives, poor organizational design, fear-driven
leadership, or lack of ownership.
6. Strategic Plays for Innovation
1. Building a DevOps Platform
Platform as a Product: Named product owner, roadmap aligned to developer pain points, platform
KPIs (DX, lead time, reliability).
Build Paved Roads, Not Gates: Opinionated CI/CD templates, golden paths for common
workloads, secure-by-default configurations, self-service onboarding.
Encode Policy and Safety into the Platform: Policy-as-code, automated evidence collection,
embedded security checks, drift detection.
Optimize the Inner Loop First: Fast local builds and tests, pre-commit checks, consistent dev
environments, clear failure feedback.
Design for Safe Experimentation: Feature flags, canary and progressive delivery, automatic
rollback, observability by default.
Measure What Matters: DORA metrics, developer experience metrics, experiment velocity, MTTR
and reliability indicators.
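To make "Encode Policy and Safety into the Platform" concrete, here is a minimal policy-as-code sketch: each rule is a small function run against a deployment configuration before release. The configuration keys (labels, image, cpu_limit, memory_limit) and the three rules are hypothetical; real platforms typically use a dedicated policy engine rather than hand-written checks.

```python
# Each policy is a small function returning an error message or None.
def require_owner_label(cfg):
    if not cfg.get("labels", {}).get("owner"):
        return "missing 'owner' label"

def forbid_latest_tag(cfg):
    image = cfg.get("image", "")
    if image.endswith(":latest") or ":" not in image:
        return "image must be pinned to an explicit tag"

def require_resource_limits(cfg):
    if "cpu_limit" not in cfg or "memory_limit" not in cfg:
        return "cpu and memory limits are required"

POLICIES = [require_owner_label, forbid_latest_tag, require_resource_limits]

def evaluate(cfg):
    """Run every policy and collect violations (empty list means compliant)."""
    return [msg for policy in POLICIES if (msg := policy(cfg)) is not None]

good = {"labels": {"owner": "payments"}, "image": "shop/api:1.4.2",
        "cpu_limit": "500m", "memory_limit": "256Mi"}
bad = {"labels": {}, "image": "shop/api:latest"}

print(evaluate(good))
for violation in evaluate(bad):
    print("violation:", violation)
```

Running the checks in the pipeline also produces the evidence trail automatically: the list of violations (or its absence) can be stored with each build.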
2. Delivering Microservices Architectures
Microservices fail when organizations adopt them without optimizing the delivery system first. Key
principles:
• Earn the Right to Microservices: Deploy a monolith safely before distributing into services.
Automate builds, tests, releases. Establish clear ownership.
• Organize Around Services, Not Projects: One team owns a service end-to-end. You build it, you
run it.
• Design for Independent Deployability: Loose coupling, backward-compatible APIs, schema
versioning, contract testing.
• Standardize the Path, Not the Destination: Platform provides CI/CD templates, observability
defaults, security, and service scaffolding.
• Testing Strategies That Scale: Strong unit tests, contract tests between services, minimal end-
to-end tests.
• Operability Is Not Optional: Logs, metrics, traces by default. SLOs, error budgets, automated
alerts, and clear runbooks.
• Safe Release Practices: Canary releases, blue/green deployments, feature flags, automatic
rollback.
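The SLOs and error budgets mentioned under operability can be worked through with simple arithmetic. The sketch below assumes a request-based availability SLO; the 99.9% target, the traffic figures, and the "pause releases when the budget is spent" rule are illustrative assumptions, not something these notes prescribe.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return allowed failures, failures used, and the fraction of budget left."""
    allowed = (1 - slo_target) * total_requests
    remaining = 1 - failed_requests / allowed if allowed else 0.0
    return allowed, failed_requests, remaining

def can_release(remaining_fraction):
    """Assumed convention: risky releases pause once the budget is exhausted."""
    return remaining_fraction > 0

allowed, used, remaining = error_budget(0.999, 2_000_000, 1_500)
print(f"allowed failures: {allowed:.0f}, used: {used}, budget left: {remaining:.0%}")
print("release allowed:", can_release(remaining))
```

The budget turns "Resilience Over Perfection" into a number: a 99.9% target explicitly permits some failures, and the team spends that allowance on change.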
3. DevOps and the API Economy
The API economy treats APIs as products, business enablers, and innovation multipliers. DevOps makes
APIs reliable, discoverable, and scalable:
• APIs as Products: Quality, reliability, clear contracts, versioning, backward compatibility,
monitoring.
• Continuous Delivery Enables API Agility: Rapid safe deployments, automated contract testing,
canary releases.
• Observability is the API Compass: Request/response metrics, error rates, latency, consumer
adoption patterns.
• API Governance Without Slowing Innovation: Policy-as-code, automated contract checks, self-
service scaffolding.
• Platform Thinking for API Scale: Internal developer portals, reusable API templates, shared
CI/CD pipelines.
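Automated contract checks and backward compatibility can be illustrated with a small consumer-driven contract test. This is a sketch under stated assumptions: the order fields and the contract_violations helper are hypothetical, and production setups usually rely on a contract-testing framework or schema validation instead.

```python
# Hypothetical consumer contract: fields (and types) the consumer relies on.
ORDER_CONTRACT = {"id": int, "status": str, "total_cents": int}

def contract_violations(contract, response):
    """List the ways a provider response breaks the consumer's expectations."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return problems

# Adding a field is backward compatible; removing or retyping one is not.
v1 = {"id": 7, "status": "paid", "total_cents": 1999}
v2_additive = {**v1, "currency": "EUR"}
v2_breaking = {"id": "7", "status": "paid"}

print(contract_violations(ORDER_CONTRACT, v2_additive))
print(contract_violations(ORDER_CONTRACT, v2_breaking))
```

Run in the provider's pipeline, a check like this fails the build before a breaking change reaches consumers, which is how governance can be enforced without a manual review gate.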
4. Organizing for Innovation
Key organizational principles for sustainable innovation:
• Align Teams Around Outcomes, Not Projects: End-to-end ownership, measured on outcomes
not output.
• Long-Lived, Cross-Functional Teams: Developers, QA, Ops, and security embedded together.
• Limit Work in Progress: Focus effort, reduce context switching.
• Decentralized Decision-Making: Push decisions to lowest level with context and accountability.
Guardrails over central approvals.
• Platform Teams as Enablement Functions: Self-service pipelines, infrastructure, and common
services.
• Align Incentives to Learning and Experimentation: Reward validated learning, encourage
experimentation with clear metrics.
• Small Batches and Continuous Delivery: Feature flags, canaries, and progressive delivery for
incremental experimentation.
• Observability and Transparency: Visible work, constraints, and performance metrics enable
informed decisions.
UNIT V: Scaling and Leading DevOps in the Enterprise
Scaling DevOps
1. DevOps Center of Competency (CoC)
A DevOps Center of Competency (CoC) is a centralized team/structure that defines, standardizes, and
scales DevOps practices across multiple teams and projects. It acts as a governing body, knowledge
hub, and support system — an enabler, not a command-and-control unit.
Objectives of DevOps CoC:
• Standardize DevOps processes across teams
• Improve CI/CD pipeline efficiency
• Enhance collaboration between development, QA, and operations
• Reduce deployment failures and downtime
• Promote automation and Infrastructure as Code (IaC)
• Ensure security integration (DevSecOps)
Key Functions:
Governance and Standards: Define coding, testing, and deployment standards. Establish CI/CD
pipeline templates.
Toolchain Management: Select and manage DevOps tools (Jenkins, Git, Docker, Kubernetes).
Optimize licensing.
Automation and CI/CD: Design scalable pipelines. Implement automated testing and deployment.
Training and Enablement: Conduct workshops, provide documentation and playbooks, mentor
teams.
Monitoring and Feedback: Define KPIs (deployment frequency, lead time, MTTR). Enable
feedback loops.
Typical CoC Roles: DevOps Architect, Automation Engineer, Cloud Engineer, Security Specialist,
Release Manager.
2. Innovation Culture at Scale
Innovation culture at scale means embedding continuous experimentation, rapid delivery, and
collaborative problem-solving across all teams using DevOps principles. Core Principles:
• Continuous Experimentation: Frequent releases, feature flags, A/B testing. Fail fast, learn faster.
• Automation Everywhere: Automate build, test, deployment, and monitoring.
• Collaboration and Shared Ownership: Break silos, cross-functional squads, shared
accountability.
• Feedback-Driven Development: Real-time monitoring, logging, customer feedback.
• DevSecOps Integration: Security testing and compliance integrated early in the pipeline.
Key Enablers: CI/CD Pipelines, Cloud and IaC, Microservices Architecture, Observability and Monitoring.
Scaling Innovation in DevOps:
• Standardization with Flexibility: Reusable pipeline templates with governance.
• Platform Engineering: Internal developer platforms with self-service tools.
• DevOps Center of Competency (CoC): Define best practices, provide training.
• Knowledge Sharing: Documentation, communities of practice, internal hackathons.
3. Continuous Improvement Culture in DevOps
The practice of continuously analyzing workflows, identifying inefficiencies, implementing incremental
improvements, and learning from both failures and successes. Key principles:
• Iterative Development: Small, frequent changes with faster feedback.
• Feedback Loops: Continuous feedback from users, systems, and teams.
• Automation: Automate repetitive tasks to reduce human error.
• Collaboration: Shared responsibility across Dev, Ops, QA, and Security.
• Learning Culture: Treat failures as learning opportunities.
DORA Metrics for measuring improvement:
Metric | What It Measures
Deployment Frequency | How often releases happen
Lead Time for Changes | Time from code commit to production
Mean Time to Recovery (MTTR) | Time to recover from failures
Change Failure Rate | Percentage of failed changes
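The four metrics can be computed from a simple deployment log. The sketch below is one possible calculation, assuming each record carries a commit time, a deploy time, and a failure flag with a restore time; the record layout and the sample data are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deployment log: commit time, deploy time, and whether the
# deployment caused a failure (with the time service was restored).
deployments = [
    {"commit": datetime(2024, 5, 1, 9, 0), "deployed": datetime(2024, 5, 1, 11, 0), "failed": False},
    {"commit": datetime(2024, 5, 2, 10, 0), "deployed": datetime(2024, 5, 2, 14, 0), "failed": True,
     "restored": datetime(2024, 5, 2, 14, 30)},
    {"commit": datetime(2024, 5, 3, 8, 0), "deployed": datetime(2024, 5, 3, 9, 0), "failed": False},
    {"commit": datetime(2024, 5, 6, 13, 0), "deployed": datetime(2024, 5, 6, 16, 0), "failed": False},
]

def dora_metrics(deploys, period_days):
    """Compute the four DORA metrics over a reporting period."""
    lead_times = [d["deployed"] - d["commit"] for d in deploys]
    failures = [d for d in deploys if d["failed"]]
    recoveries = [d["restored"] - d["deployed"] for d in failures]
    return {
        "deployment_frequency_per_day": len(deploys) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_recovery": sum(recoveries, timedelta()) / len(recoveries) if recoveries else None,
    }

for name, value in dora_metrics(deployments, period_days=7).items():
    print(f"{name}: {value}")
```

The median is used for lead time here so that one unusually slow change does not dominate the figure; using the mean instead is an equally defensible choice.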
4. Team Models in DevOps
Model | Collaboration | Scalability | Speed | Complexity
Traditional (Separate Dev/Ops) | Low | Low | Slow | Low
DevOps Team (Bridge) | Medium | Medium | Medium | Medium
Cross-Functional (Preferred) | High | High | Fast | High
DevOps as a Service | Medium | High | Fast | Medium
Platform Engineering | High | Very High | Fast | High
SRE Model | High | High | Fast | High
Guidance: Small teams → Cross-functional model. Large enterprises → Platform + DevOps as a Service.
High reliability systems → SRE model.
5. Tool and Process Standardization
Using a consistent set of tools, workflows, and practices across all teams to ensure uniformity and
efficiency.
• Tool Standardization: Git (version control), Jenkins/GitHub Actions (CI/CD), Docker/Kubernetes
(containers). Avoid tool duplication.
• Process Standardization: Standard workflows (Code → Build → Test → Deploy). Reusable
CI/CD pipeline templates. Consistent coding and testing standards.
• Documentation: Playbooks, guidelines, knowledge sharing across teams.
Best Practices: Standardize core tools, allow flexibility at edges. Use reusable templates. Regularly
review and update standards.
6. Security Considerations in DevOps (DevSecOps)
DevSecOps integrates security into every stage of the DevOps lifecycle. Key practices:
• Shift Left Security: Integrate security early in development. Perform code analysis during coding
phase.
• Automated Security Testing: SAST (Static Application Security Testing), DAST (Dynamic
Application Security Testing), dependency scanning.
• Secure CI/CD Pipelines: Secure credentials management, protect pipelines from unauthorized
access.
• Infrastructure Security: Use IaC securely, regular patching and updates.
• Monitoring and Incident Response: Continuous monitoring for threats, quick response to
vulnerabilities.
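As a small illustration of automated security testing and credential hygiene, the sketch below scans text for a few hard-coded secret patterns, the kind of check a pipeline might run on every commit. The three patterns are illustrative assumptions; real scanners ship far larger, tuned rule sets.

```python
import re

# Illustrative patterns only; not a complete or authoritative rule set.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private key block": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "hard-coded password": re.compile(r"(?i)\bpassword\s*=\s*['\"][^'\"]+['\"]"),
}

def scan(text):
    """Return (line_number, finding) pairs for lines matching any pattern."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings

sample = '''db_host = "db.internal"
password = "hunter2"
aws_key = "AKIAIOSFODNN7EXAMPLE"
password = os.environ["DB_PASSWORD"]
'''

for line_number, finding in scan(sample):
    print(f"line {line_number}: possible {finding}")
```

The last line of the sample is not flagged: reading the credential from the environment at run time is the pattern the "secure credentials management" practice aims for.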
7. Outsourcing in DevOps
Types: Full Outsourcing (entire DevOps operations), Partial Outsourcing (specific tasks), Managed
Services (cloud providers for infrastructure).
Benefits: Expert skills access, cost reduction, faster implementation.
Risks: Security/data privacy concerns, vendor dependency, communication challenges, loss of control.
Best Practices: Define roles and responsibilities clearly, use SLAs, maintain internal knowledge base,
monitor vendor performance.
Leading Enterprise Adoption
1. DevOps as a Transformation Exercise
DevOps transformation is a complete shift in mindset and operations. Key aspects:
Cultural Transformation: Breaks barriers between teams. Encourages collaboration, trust, shared
responsibility, and fail-fast/learn-fast mindset.
Process Transformation: Replaces manual processes with automated workflows. Introduces
Continuous Integration (CI) and Continuous Delivery (CD).
Technology Transformation: Adoption of cloud computing, containerization (Docker, Kubernetes),
automation tools (Jenkins, GitHub Actions).
Organizational Transformation: Cross-functional teams instead of separate departments. Focus
shifts to end-to-end product delivery.
Transformation Lifecycle: Assessment → Planning → Implementation → Monitoring and Feedback →
Continuous Improvement.
2. Culture of Collaboration and Trust
Key characteristics:
• Open Communication: Transparent information sharing, regular stand-ups and reviews,
encouraged feedback.
• Shared Responsibility: Developers, testers, and operations work as one unit, accountable for
success and failure.
• Blameless Culture: Mistakes are treated as learning opportunities, not blame.
• Cross-Functional Teams: Reduces dependency on separate departments.
• Continuous Feedback: Feedback loops at every stage drive quick improvements.
How to Build: Use communication tools (Slack, Teams, Jira), promote transparency, encourage pair
programming, adopt blameless postmortems, and ensure leadership promotes trust.
3. Line-of-Business (LoB) Alignment
LoB Alignment ensures that IT/DevOps activities are closely aligned with business objectives so
technology directly supports business outcomes.
Importance:
• Improves business value — IT delivers solutions impacting revenue, cost, and customer
satisfaction
• Faster decision-making between business needs and IT execution
• Better resource utilization on high-priority business goals
• Enhanced customer experience with tailored solutions
Key Components: Shared Goals (common KPIs), Cross-functional Teams, Continuous Feedback from
business stakeholders, Business-driven Metrics.
4. Pilot Projects
Pilot projects are small, controlled initiatives to test and validate DevOps practices before organization-
wide implementation.
Characteristics of a Good Pilot: Small scope, low risk, cross-functional team, clear goals and metrics,
short duration.
Steps:
• Select manageable application (avoid highly complex or mission-critical systems)
• Define success metrics (deployment speed, bug reduction, system uptime)
• Implement DevOps practices (CI, CD, IaC, automated testing)
• Monitor and measure using dashboards and logging tools
• Evaluate results (before vs after DevOps adoption, document lessons learned)
• Scale and replicate successful practices to other projects
5. Metaphor: Rearing Unicorns on an Aircraft Carrier
This metaphor describes the challenge of fostering innovation (unicorns — requiring agility, creativity,
freedom to experiment) within large, complex, rigid organizations (aircraft carriers — highly structured,
process-driven, slow to change, risk-averse).
DevOps solutions: Small autonomous teams (startups within the organization), CI/CD pipelines for
faster/safer releases, microservices architecture for independent scalable components, automation and
cloud to reduce manual effort, and fail-fast culture that encourages learning from mistakes.
Kubernetes
1. Kubernetes Architecture
Kubernetes is an open-source container orchestration platform used to automate deployment, scaling,
and management of containerized applications. It follows a client-server architecture with Master (Control
Plane) and Worker nodes.
Control Plane Components
Kube-API Server: Front end of the control plane and entry point to the cluster. Receives requests
from clients such as the kubectl CLI, then validates and processes them. No request can bypass
the API Server.
Kube-Scheduler: Watches the API Server for newly created Pods that have no node assigned and
selects a suitable node for each, based on resource requirements and constraints.
Kube-Controller-Manager: Runs controllers handling various cluster aspects: Replication
Controller (ensures desired replicas are running) and Node Controller (marks nodes as ready/not
ready).
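etcd: Consistent key-value store that holds all cluster state and configuration. Acts as the cluster's
source of truth: the API Server reads and writes it, and the Scheduler and controllers learn about
available resources and state changes through the API Server.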
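The controllers described above all follow the same pattern: compare desired state with observed state and act on the difference. The following is a toy simulation of that reconciliation loop for replica counts, not Kubernetes code; the pod names and the reconcile function are hypothetical.

```python
import itertools

_pod_ids = itertools.count(1)

def reconcile(desired_replicas, running_pods):
    """One pass of a control loop: compare desired and observed state, then act."""
    pods = list(running_pods)
    while len(pods) < desired_replicas:          # too few: start pods
        pods.append(f"web-{next(_pod_ids)}")
    while len(pods) > desired_replicas:          # too many: stop pods
        pods.pop()
    return pods

pods = reconcile(3, [])
print("initial:", pods)

pods.remove(pods[0])         # simulate a pod crashing
pods = reconcile(3, pods)    # the controller replaces the missing pod
print("after crash:", pods)

pods = reconcile(2, pods)    # desired state changes: scale down
print("after scale-down:", pods)
```

The loop never needs to know why the counts differ. A crash, a manual deletion, and a changed replica setting are all handled by the same comparison.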
Worker Node Components
Container Runtime: Software responsible for running containers on a node (Docker, containerd,
CRI-O). Manages container lifecycle.
kubelet: Agent running on each node. Communicates with container runtime and control plane to
ensure containers within pods run as specified. Monitors pod state and reports to API server.
kube-proxy: Runs on each node, manages network communication for pods. Implements
Kubernetes Services via network rules (iptables or IPVS). Enables service discovery and load
balancing.
2. Pods and Workloads
A Pod represents a single instance of a running process in the cluster and can contain one or more
containers. Key properties:
• Shared Networking: Each Pod gets a unique IP address. Containers within a Pod share this IP
and communicate via localhost.
• Shared Storage: Containers in a Pod share storage volumes for seamless data exchange.
Pod Lifecycle Phases:
Pending: Pod accepted by Kubernetes, but one or more containers are not yet set up and ready to
run. Includes time spent waiting to be scheduled and downloading container images.
Running: Pod bound to node; all containers created; at least one container is running.
Succeeded: All containers terminated successfully (exit status 0). Terminal phase.
Failed: All containers terminated; at least one terminated in failure (non-zero status).
Unknown: Pod state could not be obtained due to network communication problems.
Workload Controllers:
• Jobs: For batch tasks that run once and complete (ephemeral).
• Deployments: For stateless, long-running applications such as web services.
• StatefulSets: For stateful applications that need stable identity and persistent storage, such as
databases.
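To show what a Deployment declares, the sketch below builds a minimal apps/v1 Deployment manifest as a Python dictionary and prints it as JSON. The application name, image tag, and port are hypothetical example values.

```python
import json

def deployment_manifest(name, image, replicas, port):
    """Build a minimal apps/v1 Deployment as a plain dictionary."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,                    # desired number of Pods
            "selector": {"matchLabels": labels},     # which Pods this Deployment owns
            "template": {                            # Pod template used for each replica
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {"name": name, "image": image, "ports": [{"containerPort": port}]}
                    ]
                },
            },
        },
    }

manifest = deployment_manifest("web", "nginx:1.27", replicas=3, port=80)
print(json.dumps(manifest, indent=2))
```

The selector labels must match the labels in the Pod template, which is why the sketch reuses one labels dictionary for both. The JSON output could be saved to a file and applied with kubectl apply -f, which accepts JSON as well as YAML.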
3. Services and Networking
Kubernetes Services provide stable network access to dynamic sets of pods, enabling service discovery,
load balancing, and network connectivity without application changes.
Types of Services:
ClusterIP (Default): Exposes service internally within the cluster using a virtual IP. Ideal for internal
communication between microservices. External access requires NodePort or LoadBalancer.
NodePort: Exposes the service on a static port on every node. Allows external traffic via
<NodeIP>:<NodePort>. Useful for testing or simple external access.
LoadBalancer: Exposes application externally using a cloud provider's load balancer.
Automatically distributes incoming traffic across healthy pods. Ideal for production applications
needing public access.
Key Insight: Pods are ephemeral and can be created or destroyed dynamically. Services provide
a stable DNS name and stable endpoint even as the set of running pods changes.
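The key insight can be simulated in a few lines: a Service selects Pods by label, so its endpoint list changes while the selector (and the name clients use) stays fixed. This is a simplified model with made-up IP addresses; the round-robin rotation is an assumption for readability and is not how every kube-proxy mode picks a backend.

```python
import itertools

def endpoints(selector, pods):
    """Return IPs of ready pods whose labels contain every selector pair."""
    return [
        p["ip"] for p in pods
        if p["ready"] and all(p["labels"].get(k) == v for k, v in selector.items())
    ]

pods = [
    {"ip": "10.0.1.4", "labels": {"app": "web"}, "ready": True},
    {"ip": "10.0.2.7", "labels": {"app": "web"}, "ready": True},
    {"ip": "10.0.3.9", "labels": {"app": "db"}, "ready": True},
]
selector = {"app": "web"}

# Spread four requests across the matching pods.
backends = itertools.cycle(endpoints(selector, pods))
print([next(backends) for _ in range(4)])

# A pod is replaced and gets a new IP; the selector stays the same
# while the endpoint list behind it changes.
pods[0] = {"ip": "10.0.4.2", "labels": {"app": "web"}, "ready": True}
print(endpoints(selector, pods))
```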
Summary Reference: Key Concepts
DORA Metrics Summary
Metric | What It Measures | High Performer Target
Deployment Frequency | How often releases happen | Multiple times per day
Lead Time for Changes | Code commit to production | Less than 1 hour
Mean Time to Recovery (MTTR) | Recovery from failures | Less than 1 hour
Change Failure Rate | Percentage of failed deployments | 0-15%
Cloud Service Models Comparison
Model | User Manages | Provider Manages | Example
SaaS | Nothing (use only) | Everything | Salesforce, Gmail
PaaS | Applications and data | Runtime, OS, infrastructure | Google App Engine
IaaS | OS, runtime, apps, data | Hardware, network, storage | Amazon EC2, Azure VMs
DevOps vs Traditional IT
Aspect | Traditional IT | DevOps
Team Structure | Siloed Dev and Ops teams | Integrated cross-functional teams
Release Frequency | Quarterly or monthly | Multiple times per day
Deployment Process | Manual, error-prone | Automated CI/CD pipelines
Testing | End-of-cycle, manual | Continuous, automated
Failure Handling | Blame culture, slow recovery | Blameless, fast MTTR
Infrastructure | Manual provisioning | Infrastructure as Code (IaC)
Feedback | Delayed, infrequent | Continuous, real-time
Kubernetes Key Components Summary
Component | Layer | Function
kube-apiserver | Control Plane | Gateway for all cluster requests
etcd | Control Plane | Key-value store for cluster state
kube-scheduler | Control Plane | Assigns pods to nodes
kube-controller-manager | Control Plane | Manages replication and node status
kubelet | Worker Node | Ensures containers run as specified
kube-proxy | Worker Node | Manages network rules for services
Container Runtime | Worker Node | Runs containers (Docker, containerd)
Extended Topics: Cloud Computing Deep Dive
Cloud Economics and Cost Management
Total Cost of Ownership (TCO) in Cloud
Organizations migrating to cloud must evaluate Total Cost of Ownership (TCO) which encompasses all
direct and indirect costs of using cloud services versus maintaining on-premises infrastructure.
Key Cost Components:
Capital Expenditure (CapEx): Traditional on-premises IT requires upfront investment in servers,
networking equipment, data center facilities, power and cooling systems, and software licenses.
These are high initial costs with long amortization periods.
Operational Expenditure (OpEx): Cloud computing shifts spending to operational expenses:
pay-per-use pricing, subscription fees, support costs, network egress fees, and managed service
costs. These are variable costs that track actual consumption rather than upfront capacity purchases.
Hidden Costs: Organizations must account for data transfer costs (egress fees), storage
transaction costs, support plan costs, reserved instance commitments, and the cost of cloud
management tools.
Cloud Pricing Strategies
Cloud providers offer multiple pricing models to suit different workload patterns:
• On-Demand Instances: Pay for compute capacity by the hour or second. No long-term
commitments. Suitable for applications with short-term, spiky, or unpredictable workloads.
• Reserved Instances: Significant discount compared to on-demand pricing (AWS advertises up to
72%). Require a 1-year or 3-year commitment. Suitable for steady-state or predictable workloads.
• Spot Instances: Use spare provider capacity (for example, Amazon EC2) at steep discounts (up to
90%). On AWS, instances can be interrupted with a 2-minute warning. Suitable for fault-tolerant,
flexible applications.
• Savings Plans: Flexible pricing model offering lower prices in exchange for a commitment to a
consistent amount of usage (measured in $/hour) for 1 or 3 years.
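The choice between on-demand and committed pricing comes down to utilization. The sketch below compares monthly cost under both models; the hourly rate and the 40% discount are hypothetical numbers, and the model ignores upfront payment options and partial-hour billing.

```python
HOURS_PER_MONTH = 730  # common approximation: 8,760 hours / 12

def monthly_cost_on_demand(hourly_rate, utilization):
    """On-demand is billed only for the hours the instance actually runs."""
    return hourly_rate * HOURS_PER_MONTH * utilization

def monthly_cost_reserved(hourly_rate, discount):
    """A reservation is paid for every hour, whether used or not."""
    return hourly_rate * (1 - discount) * HOURS_PER_MONTH

def break_even_utilization(discount):
    """Utilization above which the reservation is cheaper than on-demand."""
    return 1 - discount

rate, discount = 0.10, 0.40   # hypothetical: $0.10/hour, 40% reserved discount
for utilization in (0.30, 0.80, 1.00):
    od = monthly_cost_on_demand(rate, utilization)
    ri = monthly_cost_reserved(rate, discount)
    cheaper = "reserved" if ri < od else "on-demand"
    print(f"utilization {utilization:.0%}: on-demand ${od:.2f}, reserved ${ri:.2f} -> {cheaper}")

print(f"break-even utilization: {break_even_utilization(discount):.0%}")
```

Under these assumptions a workload that runs less than 60% of the time is cheaper on demand, which matches the guidance above: commit only for steady-state workloads.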
Cloud Cost Optimization Strategies
• Right-sizing: Continuously analyzing resource utilization and adjusting instance types to match
actual workload requirements.
• Auto-scaling: Automatically increasing or decreasing the number of instances based on
demand, eliminating over-provisioning.
• Reserved capacity planning: Using reserved instances or savings plans for predictable, steady-
state workloads to achieve significant savings.
• Spot instance utilization: Leveraging spot instances for batch processing, data analytics, CI/CD
pipelines, and other fault-tolerant workloads.
• Storage tiering: Moving infrequently accessed data to cheaper storage tiers (e.g., Amazon S3
Glacier for archival).
• Idle resource management: Identifying and terminating idle or underutilized resources.
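Auto-scaling decisions are often driven by a proportional rule: scale the instance count by the ratio of the observed metric to its target. The sketch below applies that rule with minimum and maximum bounds; the metric values and bounds are hypothetical. The Kubernetes Horizontal Pod Autoscaler documents the same basic formula, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue).

```python
import math

def desired_instances(current, observed_metric, target_metric, minimum=1, maximum=20):
    """Proportional scaling: grow or shrink so the per-instance metric nears the target."""
    wanted = math.ceil(current * observed_metric / target_metric)
    return max(minimum, min(maximum, wanted))

# Four instances averaging 90% CPU against a 60% target: scale out.
print(desired_instances(4, observed_metric=90, target_metric=60))

# Four instances averaging 20% CPU against a 60% target: scale in.
print(desired_instances(4, observed_metric=20, target_metric=60))
```

The bounds matter for cost control: the maximum caps spend during a traffic spike or a runaway metric, and the minimum keeps enough capacity to serve baseline load.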
Advanced Virtualization and Containerization
Container Technology
Containers are a form of operating system virtualization. A container consists of an entire runtime
environment: an application, plus all its dependencies, libraries, and other binaries, and configuration
files needed to run it, bundled into one package.
Docker: The most widely used container platform. Docker uses a client-server architecture. The
Docker daemon manages containers on the host. Docker images are built from Dockerfiles and
stored in registries like Docker Hub.
Container Registry: A centralized repository for storing and distributing container images.
Examples: Docker Hub, Amazon ECR, Google Container Registry, Azure Container Registry.
Container Networking: Containers can communicate with each other through various network
modes: bridge (default, isolated network on host), host (container shares host network), and overlay
(multi-host networking for container clusters).
Containers vs Virtual Machines
Feature | Virtual Machines | Containers
Isolation | Full OS isolation | Process-level isolation
Startup Time | Minutes | Seconds or milliseconds
Resource Usage | High (full OS overhead) | Low (shared OS kernel)
Image Size | Gigabytes | Megabytes
Portability | Less portable | Highly portable
Security | Strong isolation | Less isolation
Use Case | Long-running services, legacy apps | Microservices, CI/CD
Docker Architecture
Docker Engine: The core component that enables building and containerizing applications.
Includes Docker Daemon (dockerd), REST API, and CLI client.
Docker Image: A read-only template with instructions for creating a Docker container. Images are
built in layers — each instruction in a Dockerfile creates a new layer.
Docker Container: A runnable instance of an image. Containers are isolated from each other and
from the host. State can be saved as a new image.
Dockerfile: A text file containing all commands needed to assemble an image. Key instructions
include FROM (base image), RUN (execute commands), COPY (copy files), EXPOSE (declare
ports), and CMD (default execution command).
Docker Compose: A tool for defining and running multi-container Docker applications using a
YAML configuration file.
CI/CD Pipeline Deep Dive
Continuous Integration (CI) in Detail
CI is the practice of automatically integrating code changes from multiple contributors into a shared
repository several times a day. Each integration is verified by an automated build and automated tests.
Key CI Practices:
• Version Control for All: All production artifacts (code, configuration, scripts, database schemas)
must be version controlled.
• Automate the Build: The build process must be fully automated. A single command should
compile, link, run unit tests, and package the application.
• Make the Build Self-Testing: Automated tests should be run as part of the build process to verify
functionality.
• Keep the Build Fast: CI builds should complete within minutes (ideally in under 10 minutes) to
provide fast feedback.
• Test in a Clone of the Production Environment: Use identical environments for testing and
production.
• Fix Broken Builds Immediately: A broken build is the highest priority for the team. Everyone
stops new work until the build is fixed.
Continuous Delivery vs Continuous Deployment
Continuous Delivery: Every change is automatically tested and built, and the resulting artifact is
deployable to production. However, deployment to production requires manual approval. The
software is always in a deployable state.
Continuous Deployment: Every change that passes all automated tests is automatically deployed
to production without human intervention. Requires extremely high confidence in the automated test
suite.
CI/CD Pipeline Stages
A comprehensive CI/CD pipeline typically includes these stages:
• Source Control: Code commits trigger the pipeline. Branch strategies (Gitflow, trunk-based
development) define workflow.
• Build: Compile source code, resolve dependencies, generate build artifacts.
• Unit Tests: Fast, isolated tests verifying individual components or functions.
• Static Code Analysis: Automated code quality checks (SonarQube), security scanning (SAST),
and code style enforcement.
• Integration Tests: Tests verifying interactions between components and with external services.
• Artifact Storage: Built artifacts stored in artifact repositories (Nexus, Artifactory, ECR).
• Deployment to Staging: Automated deployment to a staging environment that mirrors
production.
• Acceptance Tests: End-to-end tests verifying business requirements. May include performance
testing.
• Security Scanning: DAST and vulnerability scanning in the staging environment.
• Deployment to Production: Automated (continuous deployment) or manually approved
(continuous delivery) release, using an appropriate deployment strategy.
• Post-Deployment Validation: Smoke tests and synthetic monitoring to verify deployment
success.
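The stage ordering above can be sketched as a fail-fast runner. This is an illustrative sketch only: the stage names and the run_pipeline helper are hypothetical, and real CI systems such as Jenkins or GitHub Actions express the same idea in their own configuration formats.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first stage that fails."""
    completed: List[str] = []
    for name, step in stages:
        if not step():
            return False, completed
        completed.append(name)
    return True, completed


# Hypothetical stages; the static analysis step simulates a failure.
stages = [
    ("build", lambda: True),
    ("unit-tests", lambda: True),
    ("static-analysis", lambda: False),
    ("deploy-staging", lambda: True),
]

ok, done = run_pipeline(stages)
print(ok, done)
```

Because the runner stops at the first failure, later stages such as deployment never run on a change that has already failed an earlier check.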
Deployment Strategies
Rolling Deployment: New version gradually replaces old version by updating instances one by
one or in small batches. Zero downtime if done correctly. Easy rollback by stopping the rollout.
Blue/Green Deployment: Maintain two identical production environments (Blue = current, Green =
new). Switch traffic from Blue to Green after testing. Instant rollback by switching back to Blue.
Requires double the infrastructure.
Canary Deployment: Release new version to a small subset of users (canary group) first. Monitor
for errors and performance issues. Gradually increase traffic to new version if stable. Automatic
rollback if error rates spike.
Feature Flags: Mechanism to enable or disable features in production without deploying new code.
Allows testing features in production with specific users. Decouples deployment from feature
release.
A/B Testing: Running two versions simultaneously to compare performance or user behavior.
Unlike a canary release, which checks stability, A/B testing compares outcomes.
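One common way to pick the canary group is to hash a stable user identifier into a bucket and compare it with the rollout percentage. The sketch below assumes that approach; the function names bucket and route are hypothetical, and production systems usually do this in a load balancer or service mesh rather than in application code.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent% of users to the canary version."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


# Raising the percentage widens the canary group without reshuffling users.
for percent in (0, 10, 50, 100):
    print(percent, route("user-42", percent))
```

Since the bucket depends only on the user ID, a user who is already in the canary group stays there as the percentage is increased.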
Infrastructure as Code (IaC)
IaC Principles and Tools
Infrastructure as Code is the practice of managing and provisioning computing infrastructure through
machine-readable definition files rather than physical hardware configuration or interactive configuration
tools.
Key Principles:
• Idempotency: Applying the same configuration multiple times yields the same result, regardless
of the starting state.
• Version Control: Infrastructure definitions stored in version control alongside application code.
• Immutability: Rather than modifying existing infrastructure, replace it entirely with a new version.
• Declarative Configuration: Specify the desired end state, not the steps to get there.
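Idempotency and declarative configuration can be illustrated with a small in-memory model. In this sketch the apply function and the resource dictionaries are hypothetical stand-ins for a real IaC tool and real cloud resources; the point is only that a second run against an already converged state makes no changes.

```python
from typing import Dict


def apply(desired: Dict[str, dict], actual: Dict[str, dict]) -> int:
    """Converge actual to desired in place and return the number of changes."""
    changes = 0
    for name, spec in desired.items():
        if actual.get(name) != spec:
            actual[name] = dict(spec)  # create or update the resource
            changes += 1
    for name in list(actual):
        if name not in desired:
            del actual[name]  # remove resources that are no longer declared
            changes += 1
    return changes


desired = {"web": {"size": "small", "count": 2}, "db": {"size": "large", "count": 1}}
actual = {"web": {"size": "small", "count": 1}, "cache": {"size": "small", "count": 1}}

first = apply(desired, actual)   # updates web, creates db, removes cache
second = apply(desired, actual)  # nothing left to change
print(first, second)
```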
Popular IaC Tools:
Terraform (HashiCorp): Cloud-agnostic IaC tool using HCL (HashiCorp Configuration Language).
Supports AWS, Azure, GCP, and hundreds of providers. Manages resource lifecycle with plan,
apply, and destroy commands. State management tracks actual vs desired state.
AWS CloudFormation: AWS-native IaC service using JSON or YAML templates. Stacks group
related resources. Supports nested stacks for modularity. Change sets allow previewing changes
before applying.
Ansible: Agentless configuration management tool using YAML playbooks. SSH-based for Linux,
WinRM for Windows. Idempotent operations. Also used for application deployment and
orchestration.
Pulumi: Modern IaC using general-purpose programming languages (Python, TypeScript, Go, C#).
Strong type safety and IDE support.
Configuration Management
Configuration management ensures systems are configured consistently and correctly, preventing
configuration drift (the gradual divergence of system configurations from a known baseline).
• Desired State Configuration (DSC): Declarative approach where you define the desired state of
the system.
• Inventory Management: Maintaining accurate records of all systems and their configurations.
• Configuration Drift Detection: Continuously monitoring systems to detect unauthorized or
unintended changes.
• Automated Remediation: Automatically correcting configuration drift to return systems to desired
state.
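Drift detection amounts to comparing a known baseline with what is currently observed on a system. The following is a minimal sketch under that assumption; detect_drift and the setting names are hypothetical, and a missing key is reported as None for simplicity.

```python
from typing import Any, Dict, Tuple


def detect_drift(baseline: Dict[str, Any], current: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {setting: (expected, found)} for every setting that differs."""
    drift = {}
    for key in sorted(set(baseline) | set(current)):
        expected, found = baseline.get(key), current.get(key)
        if expected != found:
            drift[key] = (expected, found)
    return drift


# Hypothetical settings for one server.
baseline = {"ssh_root_login": "no", "max_open_files": 65536, "ntp": "pool.example.org"}
current = {"ssh_root_login": "yes", "ntp": "pool.example.org", "debug": True}

report = detect_drift(baseline, current)
for setting, (expected, found) in report.items():
    print(f"{setting}: expected {expected!r}, found {found!r}")
```

Automated remediation would then reapply the baseline for each reported setting.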
Monitoring, Observability, and SRE
The Three Pillars of Observability
Metrics: Numeric measurements sampled over time. Examples: CPU utilization, request latency,
error rate, memory usage. Stored in time-series databases (Prometheus, InfluxDB). Visualized
using dashboards (Grafana).
Logs: Text records of discrete events in a system. Structured logs (JSON format) are easier to
parse and query. Centralized log management using ELK Stack (Elasticsearch, Logstash, Kibana)
or similar platforms.
Traces: Records of the path of a request through a distributed system. Distributed tracing shows
how requests propagate across microservices. Tools include Jaeger, Zipkin, and AWS X-Ray.
Essential for debugging microservices.
Service Level Indicators, Objectives, and Agreements
SLI (Service Level Indicator): A carefully defined quantitative measure of some aspect of the
service's behavior. Examples: request latency (percentage of requests faster than some threshold),
error rate (fraction of requests resulting in errors), system throughput.
SLO (Service Level Objective): A target value or range of values for an SLI. Example: 99.9% of
requests will complete in under 300ms. SLOs define the reliability target for a service.
SLA (Service Level Agreement): A contract between a service provider and a user that defines
the expected level of service, measured by SLIs, with consequences for failing to meet SLOs.
Error Budget: The acceptable amount of unreliability in a service over a rolling time period,
calculated as 1 - SLO. If the SLO is 99.9% availability, the error budget is 0.1% (about 8.76 hours
per year). Error budgets help balance reliability and innovation speed.
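The arithmetic behind an SLI and an error budget can be checked with a few lines of code. This sketch assumes the SLO is given as a fraction (0.999 for 99.9%) and a 365-day year; the function names are illustrative.

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of requests that completed faster than the threshold."""
    if not latencies_ms:
        raise ValueError("at least one request is required")
    fast = sum(1 for value in latencies_ms if value < threshold_ms)
    return fast / len(latencies_ms)


def allowed_downtime_hours(slo: float, period_hours: float = 365 * 24) -> float:
    """Error budget (1 - SLO) expressed as hours of downtime in the period."""
    return (1.0 - slo) * period_hours


print(latency_sli([100, 200, 250, 400], threshold_ms=300))  # 3 of 4 requests are fast
print(round(allowed_downtime_hours(0.999), 2))              # hours per year at 99.9%
print(round(allowed_downtime_hours(0.999, 30 * 24) * 60, 1))  # minutes per 30 days
```

For a 99.9% SLO this gives about 8.76 hours per year, or about 43.2 minutes in a 30-day window.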
Alerting Best Practices
• Alert on symptoms, not causes: Alert when users are experiencing problems, not when a disk is
90% full.
• Alert on SLO violations: Set alerts that trigger when service is consuming error budget too
quickly.
• Actionable alerts: Every alert should require a human response. Non-actionable alerts lead to
alert fatigue.
• Alert severity levels: Critical (paging, immediate action), Warning (investigate during business
hours), Informational (logged only).
• Runbooks: Each alert should link to a runbook with investigation steps and common remediation
actions.
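Alerting on how quickly the error budget is being consumed is often expressed as a burn rate: the observed error ratio divided by the error budget. The sketch below assumes that definition; the paging threshold of 10.0 is an arbitrary value chosen for illustration, not a recommended setting.

```python
def burn_rate(errors: int, requests: int, slo: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO)."""
    if requests == 0:
        return 0.0
    return (errors / requests) / (1.0 - slo)


def severity(rate: float, page_threshold: float = 10.0) -> str:
    """Map a burn rate to the severity levels listed above."""
    if rate >= page_threshold:
        return "critical"       # page someone now
    if rate > 1.0:
        return "warning"        # budget is being spent faster than planned
    return "informational"      # within budget


for errors in (5, 50, 200):
    rate = burn_rate(errors, 10_000, slo=0.999)
    print(errors, round(rate, 2), severity(rate))
```

A burn rate of 1 means the budget would be used up exactly at the end of the SLO window; higher values exhaust it sooner.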
Site Reliability Engineering (SRE)
SRE is a discipline that incorporates aspects of software engineering and applies them to infrastructure
and operations problems. Originated at Google in 2003.
Core SRE Principles:
• Embracing Risk: 100% reliability is often not the right target. SREs use error budgets to balance
reliability with feature velocity.
• Service Level Objectives: SLOs define the reliability targets that make both users and
developers happy.
• Eliminating Toil: Toil is operational work that is manual, repetitive, automatable, tactical, and
devoid of enduring value, and that scales linearly with service growth. SREs aim to keep toil below
50% of their time.
• Monitoring: SREs implement monitoring to understand system behavior and detect incidents
early.
• Automation: SREs automate operational tasks to reduce toil and improve consistency.
• Release Engineering: Ensuring software is built and deployed reliably and consistently.
• Simplicity: Preferring simple, reliable solutions over complex ones.
Security in DevOps (DevSecOps)
Shifting Security Left
Shifting security left means integrating security practices earlier in the software development lifecycle
(SDLC), rather than treating security as a final gate before release. This reduces the cost and time to fix
vulnerabilities.
Security Activities at Each Stage:
• Planning: Threat modeling, security requirements definition, architecture review
• Development: Secure coding guidelines, pre-commit hooks for secrets detection, IDE plugins for
vulnerability detection
• Build: SAST (Static Application Security Testing), dependency scanning, container image
scanning
• Test: DAST (Dynamic Application Security Testing), IAST (Interactive Application Security
Testing), API security testing
• Deploy: Infrastructure security scanning, compliance checks, secrets management
• Operate: Runtime security monitoring, intrusion detection, vulnerability management, incident
response
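A pre-commit secrets check, as mentioned under Development, can be approximated with a pattern scan over the text being committed. This is a simplified sketch: the two patterns and the scan helper are illustrative assumptions, and dedicated scanners use far larger rule sets plus entropy checks.

```python
import re
from typing import List, Tuple

# Illustrative rules only; real tools ship many more.
PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key-header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def scan(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, rule_name) for each line that matches a rule."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, rule))
    return findings


# The key below is a placeholder value, not a real credential.
sample = 'region = "us-east-1"\naccess_key = "AKIAIOSFODNN7EXAMPLE"\n'
print(scan(sample))
```

A hook would run such a scan on staged changes and reject the commit when the findings list is not empty.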
Security Testing Types
SAST (Static Application Security Testing): Analyzes source code, bytecode, or binary code for
security vulnerabilities without executing the program. Tools: SonarQube, Checkmarx, Veracode.
Finds issues like SQL injection, XSS, buffer overflows early in development.
DAST (Dynamic Application Security Testing): Tests running applications by simulating external
attacks. Black-box testing approach. Tools: OWASP ZAP, Burp Suite. Finds issues that only appear
at runtime.
IAST (Interactive Application Security Testing): Combines SAST and DAST approaches by
instrumenting the application during testing. More accurate with fewer false positives.
SCA (Software Composition Analysis): Identifies open source components in the application and
checks for known vulnerabilities. Tools: Snyk, WhiteSource, Dependabot.
Container Security Scanning: Analyzes container images for vulnerabilities in OS packages and
application libraries before deployment. Tools: Clair, Trivy, Anchore.
Secrets Management
Secrets management is the process of securely managing sensitive information like passwords, API
keys, certificates, and SSH keys used by applications and infrastructure.
• Never store secrets in code or version control: Use environment variables or dedicated secrets
management solutions.
• Secrets Management Tools: HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, Google
Secret Manager.
• Dynamic Secrets: Generate short-lived credentials on-demand rather than using static, long-
lived secrets.
• Secret Rotation: Automatically rotate secrets on a schedule or after any potential exposure.
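Reading a secret from the environment instead of hard-coding it can look like the sketch below. The variable name DEMO_API_TOKEN and the get_secret helper are hypothetical; a dedicated secrets manager would replace the environment lookup with a call to its own client.

```python
import os


def get_secret(name: str) -> str:
    """Fetch a secret from the environment and fail loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"required secret {name!r} is not set")
    return value


if __name__ == "__main__":
    try:
        token = get_secret("DEMO_API_TOKEN")
        # Log only non-sensitive facts about the secret, never the value.
        print("secret loaded, length", len(token))
    except RuntimeError as exc:
        print(exc)
```

Failing at startup when a secret is missing is usually preferable to falling back to a default value baked into the code.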
Cloud Migration Strategies
The 6 Rs of Cloud Migration
Organizations migrating to cloud have six primary strategies, known as the '6 Rs':
Rehost (Lift and Shift): Move applications to cloud without modification. Uses IaaS. Fastest
migration approach. Good for meeting tight migration deadlines. Does not optimize for cloud.
Replatform (Lift, Tinker, and Shift): Make a few cloud optimizations without changing core
architecture. Examples: Migrating database to a managed RDS service, or deploying applications to
a managed Kubernetes service.
Repurchase (Drop and Shop): Moving to a different product, typically SaaS. Example: Replacing
an on-premises CRM with Salesforce.
Refactor/Re-architect: Rearchitect applications to take full advantage of cloud-native features.
Most expensive but offers greatest long-term benefits. Example: Decomposing monolith into
microservices.
Retire: Identify applications that are no longer useful and decommission them. Reduces security
risk and operational costs.
Retain (Revisit): Applications that are not yet ready to migrate or have recently been updated.
Keep on-premises and revisit later.
Cloud-Native Design Principles
• Design for Failure: Assume components will fail. Build redundancy, health checks, and auto-
recovery into every application.
• Design for Scale: Build applications that can scale horizontally. Avoid any single points of
contention.
• Design for Loose Coupling: Applications should communicate through well-defined APIs.
Services should be independently deployable.
• Design for Security: Implement security at every layer. Follow the principle of least privilege.
Encrypt data at rest and in transit.
• Automate Everything: Every operational task should be automated, from provisioning to
deployment to scaling to recovery.
• Design for Immutability: Infrastructure and application containers should be immutable. Never
modify running instances; replace them.
• Design for Observability: Every application should emit the metrics, logs, and traces needed to
understand its behavior.
Extended Topics: DevOps Tools and Ecosystem
Version Control Systems
Git Fundamentals
Git is a distributed version control system designed to handle everything from small to very large projects
with speed and efficiency. Every Git clone is a full repository with complete history and full version tracking
capabilities.
Core Git Concepts:
Repository: A data structure that stores metadata for a set of files and directories. Contains the
complete history of all changes.
Commit: A snapshot of the repository at a specific point in time. Each commit has a unique SHA-1
hash identifier, author information, timestamp, and a pointer to parent commit(s).
Branch: A lightweight movable pointer to a commit. The default branch is typically called 'main' or
'master'. Branches enable parallel development.
Merge: Combining changes from one branch into another. Types: Fast-forward merge (linear
history), Three-way merge (creates merge commit).
Rebase: Reapplying commits on top of another base commit. Creates a cleaner, linear project
history. Should not be used on shared branches.
Pull Request / Merge Request: A mechanism for proposing changes to a repository. Enables code
review, automated checks, and discussion before merging.
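The SHA-1 identifiers mentioned above are content hashes: Git hashes a short header (object type and size) followed by the object's bytes. The sketch below reproduces that calculation for a blob, assuming the default SHA-1 object format; git_object_id is an illustrative helper, not part of Git.

```python
import hashlib


def git_object_id(kind: str, content: bytes) -> str:
    # Git hashes "<kind> <size>" plus a NUL byte, followed by the raw content.
    header = f"{kind} {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


# The ID of an empty file, comparable with the output of `git hash-object`.
print(git_object_id("blob", b""))
print(git_object_id("blob", b"hello world\n"))
```

Commits are hashed the same way with the type commit, which is why changing any ancestor changes every later commit ID.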
Git Branching Strategies
Gitflow: Uses feature branches, develop branch, release branches, and hotfix branches. Well-
structured for teams with scheduled release cycles. Can be complex.
Trunk-Based Development: All developers integrate their changes into the main branch (trunk) at
least once per day. Feature flags hide incomplete features. Enables continuous integration.
Preferred in DevOps environments.
Feature Branch Workflow: Each feature developed in a dedicated branch. Merged to main when
complete. Simpler than Gitflow. Works well with CI/CD when branches are short-lived.
Forking Workflow: Each contributor has a server-side fork of the repository. Used in open source
projects. Pull requests integrate changes from forks into the main repository.
Popular DevOps Tools
CI/CD Tools
Tool           | Type               | Key Features
Jenkins        | Open Source CI/CD  | Highly extensible with plugins, self-hosted, widely adopted
GitHub Actions | Cloud-native CI/CD | Integrated with GitHub, YAML-based workflows, marketplace actions
GitLab CI/CD   | Integrated CI/CD   | Built into GitLab, Auto DevOps, container registry included
CircleCI       | Cloud CI/CD        | Fast pipelines, orbs (reusable configs), strong parallelism
ArgoCD         | GitOps CD          | Kubernetes-native, declarative GitOps, continuous sync
Container Orchestration
Tool         | Use Case                  | Key Features
Kubernetes   | Container orchestration   | Auto-scaling, self-healing, service discovery, load balancing
Docker Swarm | Simple orchestration      | Easy setup, native Docker integration, less complex than K8s
Amazon ECS   | AWS container service     | Tight AWS integration, Fargate serverless option
Amazon EKS   | Managed Kubernetes on AWS | Managed control plane, integrates with AWS services
Monitoring and Observability Tools
Tool       | Category              | Purpose
Prometheus | Metrics Collection    | Time-series metrics, alerting rules, pull-based model
Grafana    | Visualization         | Dashboards for metrics from multiple data sources
ELK Stack  | Log Management        | Elasticsearch (storage), Logstash (collection), Kibana (visualization)
Jaeger     | Distributed Tracing   | End-to-end distributed tracing for microservices
PagerDuty  | Incident Management   | On-call management, alert routing, incident response
Datadog    | Full-stack Monitoring | APM, infrastructure monitoring, log management, tracing
Infrastructure and Configuration Tools
Tool      | Category            | Purpose
Terraform | IaC                 | Provision and manage cloud infrastructure declaratively
Ansible   | Configuration Mgmt  | Agentless automation, configuration management, deployment
Helm      | K8s Package Manager | Manage Kubernetes application deployments as charts
Vault     | Secrets Management  | Securely store and manage secrets, encryption as a service
Advanced Kubernetes Concepts
Kubernetes Networking
Kubernetes networking solves four distinct challenges: Container-to-Container communication (via
localhost within pods), Pod-to-Pod communication (every pod gets a unique IP), Pod-to-Service
communication (via stable Service ClusterIP), and External-to-Service communication (via Ingress or
LoadBalancer).
Ingress: An API object that manages external access to services in a cluster, typically HTTP.
Provides load balancing, SSL termination, and name-based virtual hosting. Requires an Ingress
Controller (e.g., nginx-ingress, AWS ALB Ingress).
Network Policies: Specifications of how groups of pods are allowed to communicate with each
other and with network endpoints. Used to implement micro-segmentation in Kubernetes.
DNS in Kubernetes: Kubernetes runs an internal DNS service (CoreDNS) that provides DNS
records for Services and Pods. Services are reachable at
<service-name>.<namespace>.svc.cluster.local (assuming the default cluster domain).
Kubernetes Storage
Persistent Volumes (PV): Storage resources provisioned by an administrator or dynamically via
Storage Classes. Independent lifecycle from any Pod that uses the PV.
Persistent Volume Claims (PVC): A request for storage by a user. Binds to an available PV
matching the requested storage size and access mode.
Storage Classes: Define different classes of storage (e.g., SSD, HDD, NFS). Enable dynamic
provisioning of PVs when a PVC is created.
ConfigMaps and Secrets: ConfigMaps store non-confidential configuration data as key-value
pairs. Secrets store sensitive data like passwords and API keys. Secret values are base64-encoded,
which is an encoding and not encryption.
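Base64 is a reversible encoding, so anyone who can read a Secret object can recover the value. The short sketch below makes that concrete; the helper names and the sample value are illustrative.

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """Encode a value the way it appears in a Secret's data field."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse the encoding; no key or password is needed."""
    return base64.b64decode(encoded).decode("utf-8")


encoded = encode_secret_value("s3cr3t-password")
print(encoded)
print(decode_secret_value(encoded))
```

This is why access to Secrets is normally restricted with RBAC, and why encryption at rest has to be configured separately.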
Kubernetes Scaling
Horizontal Pod Autoscaler (HPA): Automatically scales the number of Pods in a Deployment or
ReplicaSet based on observed CPU utilization or custom metrics. Queries metrics from the Metrics
Server.
Vertical Pod Autoscaler (VPA): Automatically adjusts CPU and memory resource requests and
limits for containers based on usage data.
Cluster Autoscaler: Automatically adjusts the number of nodes in the cluster. Adds nodes when
pods cannot be scheduled due to insufficient resources. Removes underutilized nodes.
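The core of the HPA calculation is a ratio: desired replicas = ceil(current replicas × current metric / target metric), clamped to the configured minimum and maximum. The sketch below implements only that ratio; it leaves out details of the real controller such as its tolerance band and stabilization window, and the parameter names are illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Scale the replica count in proportion to metric / target, then clamp."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(4, current_metric=90, target_metric=60))   # scale out
print(desired_replicas(4, current_metric=30, target_metric=60))   # scale in
print(desired_replicas(4, current_metric=200, target_metric=50))  # capped at max
```

With 4 Pods at 90% CPU against a 60% target, the ratio is 1.5, so the sketch asks for 6 Pods.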
Multi-Cloud and Hybrid Cloud Strategies
Multi-Cloud Architecture
Multi-cloud involves using services from two or more cloud providers (e.g., AWS + Azure + GCP).
Organizations adopt multi-cloud for:
• Avoiding vendor lock-in and maintaining negotiation leverage
• Using best-of-breed services from different providers
• Geographic requirements or regulatory compliance
• Disaster recovery and resilience across providers
• Cost optimization by choosing cheapest provider for each workload
Multi-cloud Challenges:
• Increased operational complexity with multiple management interfaces
• Data transfer costs between cloud providers
• Different security models and compliance requirements per provider
• Skill gaps — teams need expertise in multiple cloud platforms
• Service inconsistency — equivalent services behave differently across providers
Hybrid Cloud Architecture
Hybrid cloud connects on-premises infrastructure with public cloud services. Common use cases:
• Cloud bursting: Applications run on-premises but burst to cloud during peak demand
• Data sovereignty: Sensitive data stored on-premises while non-sensitive workloads in cloud
• Application modernization: Gradually migrating applications from on-premises to cloud
• Disaster recovery: Cloud used as DR target for on-premises systems
Agile and Lean Foundations for DevOps
Agile Manifesto Principles Relevant to DevOps
The Agile Manifesto (2001) established values and principles that form the foundation of modern software
development and directly influenced DevOps:
• Working software over comprehensive documentation: Deliver working software frequently.
• Customer collaboration over contract negotiation: Continuous customer involvement.
• Responding to change over following a plan: Embrace changing requirements.
• Individuals and interactions over processes and tools: People matter more than tools.
Scrum Framework
Scrum is an agile framework for developing, delivering, and sustaining complex products. Key elements:
Sprint: A time-boxed period (1-4 weeks) during which a potentially releasable product increment is
created. Sprints are the heartbeat of Scrum.
Product Backlog: An ordered list of everything that might be needed in the product. Single source
of requirements for any changes. The Product Owner is responsible for the backlog.
Sprint Planning: Event at the start of each sprint where the team selects items from the Product
Backlog and creates a Sprint Goal.
Daily Scrum: 15-minute daily synchronization event for the Development Team. Each member
answers: What did I do yesterday? What will I do today? Are there any impediments?
Sprint Review: Held at the end of the Sprint to inspect the Increment and adapt the Product
Backlog.
Sprint Retrospective: Opportunity for the Scrum Team to inspect itself and create a plan for
improvements. DevOps retrospectives often include pipeline and operational improvements.
Kanban in DevOps
Kanban is a visual method for managing work as it moves through a process. DevOps teams use Kanban
boards to visualize work, limit work in progress (WIP), and identify bottlenecks.
Kanban Principles:
• Visualize the workflow: Make all work and its status visible on the board.
• Limit Work in Progress (WIP): Capping WIP reduces multitasking and context switching, and
reveals bottlenecks.
• Manage flow: Monitor, measure, and report the flow of work. Optimize for smooth, fast flow.
• Make process policies explicit: Clearly defined rules for how work moves through the system.
• Implement feedback loops: Regular reviews and retrospectives to improve the process.
• Improve collaboratively, evolve experimentally: Use scientific approaches to identify and
implement improvements.
Value Stream Mapping
Value Stream Mapping (VSM) is a lean-management technique for analyzing the current state and
designing a future state for the series of events that take a product from its beginning through to the
customer.
In DevOps, VSM is used to:
• Identify waste (waiting, rework, manual handoffs) in the software delivery process
• Quantify lead time and process time for each step in the delivery pipeline
• Find opportunities for automation and improvement
• Create a shared understanding across teams of the entire value stream
Key VSM Metrics:
Lead Time: Total time from when work is requested to when it is delivered to the customer.
Process Time: Actual time spent working on a task (excludes waiting time).
Efficiency: (Process Time / Lead Time) × 100%. High efficiency means little waiting.
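The three metrics can be computed directly from per-step timings. In this sketch the steps and their hours are made-up figures, and lead time is taken as the sum of process time and waiting time across all steps.

```python
from typing import List, Tuple

# (step name, process hours, waiting hours); hypothetical figures.
Step = Tuple[str, float, float]


def flow_efficiency(steps: List[Step]) -> float:
    """Return process time as a percentage of lead time."""
    process_time = sum(process for _, process, _ in steps)
    lead_time = sum(process + wait for _, process, wait in steps)
    if lead_time == 0:
        raise ValueError("lead time must be greater than zero")
    return 100.0 * process_time / lead_time


steps = [
    ("development", 2.0, 6.0),
    ("code review", 1.0, 23.0),
    ("deployment", 1.0, 7.0),
]
print(flow_efficiency(steps))
```

Here 4 hours of work are spread over a 40-hour lead time, so efficiency is 10% and the review queue is the obvious place to look for waste.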
Practice Questions and Key Points
Unit I: Cloud Computing - Key Points
• NIST defines 5 essential characteristics: on-demand self-service, broad network access,
resource pooling, rapid elasticity, and measured service.
• Three service models: SaaS (users access applications), PaaS (users develop/deploy
applications), IaaS (users provision virtual infrastructure).
• Four deployment models: Public, Private, Community, Hybrid.
• Virtualization is the key enabling technology of cloud computing. Hypervisors: Type-1 (bare
metal) and Type-2 (hosted).
• Load balancing algorithms: Round Robin, Weighted Round Robin, Low Latency, Least
Connections, Priority, Overflow.
• Replication types: Array-based, Network-based, Host-based.
• SDN separates control plane from data plane. Key protocol: OpenFlow.
• NFV virtualizes network functions. Components: VNF, NFVI, NFV Management and
Orchestration.
• IAM enables RBAC, security credentials management, and access key management.
• Billing models: Elastic (pay-as-you-use), Fixed, Spot pricing.
• OpenStack components: nova-compute, nova-networking, Cinder, Swift, Keystone, Glance,
Horizon.
Unit II: Hadoop - Key Points
• Hadoop ecosystem: HDFS (storage), MapReduce (processing), YARN (resource management),
HBase, Hive, Pig, Zookeeper.
• HDFS roles: NameNode (metadata), Secondary NameNode (checkpoints), DataNode (data
storage).
• MapReduce roles: JobTracker (assigns tasks), TaskTracker (executes tasks).
• YARN components: Resource Manager (global scheduling), Application Master (per-
application), Node Manager (per-machine), Containers.
• Hadoop schedulers: FIFO (default), Fair Scheduler (Facebook), Capacity Scheduler (Yahoo).
• SOA communicates via SOAP protocol. Services described using WSDL.
• CCM design steps: Component Design, Architecture Design, Deployment Design.
• REST constraints: Client-Server, Stateless, Cacheable, Layered, Uniform Interface, Code on
Demand.
• Relational databases: ACID guarantees (Atomicity, Consistency, Isolation, Durability).
• NoSQL types: Key-value, Document, Graph, Object stores.
Unit III: DevOps Fundamentals - Key Points
• DevOps emerged around 2008-2009 influenced by Agile, Lean, CI, and ITSM.
• Core practices: CI, CD, IaC, Automated Testing, Monitoring, Continuous Feedback.
• DevOps culture: collaboration, shared responsibility, blameless culture, automation mindset.
• DevOps Playbook: documents how DevOps practices are implemented; includes workflow,
CI/CD, toolchain, IaC, security, monitoring.
• Business Case metrics: deployment frequency, lead time, MTTR, change failure rate.
• Business Model Canvas has 9 blocks: Customer Segments, Value Propositions, Channels,
Customer Relationships, Revenue Streams, Key Resources, Key Activities, Key Partnerships,
Cost Structure.
• Revenue stream types: Asset Sale, Usage Fee, Subscription, Licensing, Brokerage,
Advertising.
Unit IV: DevOps Strategies - Key Points
• DevOps optimizes the end-to-end value stream: Idea → Code → Build → Test → Release →
Operate → Learn.
• Theory of Constraints: system throughput limited by slowest step. Optimize the bottleneck.
• Flow over utilization: optimize for work flowing, not people being busy. Limit WIP.
• 10 Core Themes: Systems Thinking, Flow, Constraints, Fast Feedback, Small Batches,
Automation, Quality, Resilience, Continuous Improvement, Innovation.
• 10 DevOps Plays: Expose, Attack Bottleneck, Reduce Batch, Automate, Shift Left, Design Safe,
Optimize Flow, Close Feedback, Pipeline as Product, Continuous Improvement.
• Uber Syndrome: copying surface behaviors without optimizing the underlying delivery system.
• API Economy: APIs as products. DevOps enables rapid, safe API evolution.
• Microservices success requires: independent deployability, service ownership (you build it, you
run it), platform-level standardization, strong observability.
Unit V: Scaling DevOps - Key Points
• DevOps CoC: centralizes standards, toolchain management, automation, training, monitoring
across teams.
• DORA metrics are the standard for measuring DevOps performance: deployment frequency,
lead time, MTTR, change failure rate.
• Team models: Traditional, DevOps Bridge, Cross-functional (preferred), DevOps as a Service,
Platform Engineering, SRE.
• DevSecOps: SAST (source code analysis), DAST (runtime testing), SCA (open source
vulnerabilities).
• Kubernetes control plane: API Server, Scheduler, Controller Manager, etcd.
• Kubernetes worker node: kubelet, kube-proxy, container runtime.
• Pod lifecycle: Pending → Running → Succeeded/Failed/Unknown.
• Kubernetes Service types: ClusterIP (internal), NodePort (external via static port), LoadBalancer
(cloud load balancer).
• Metaphor 'Rearing Unicorns on Aircraft Carrier': innovation in large enterprises requires small
autonomous teams, CI/CD, microservices, automation, and a fail-fast culture.
Cloud Computing Case Studies and Emerging Trends
Netflix Cloud-Native Architecture
Netflix is one of the most cited examples of cloud-native architecture. Netflix migrated completely
from its own data centers to AWS between 2008 and 2016, becoming a reference architecture for
large-scale cloud applications.
Architecture Highlights
• Netflix uses hundreds of microservices — each responsible for a specific function such as
recommendations, search, encoding, and billing.
• Netflix uses a multi-region active-active architecture on AWS, running in multiple regions
simultaneously for global redundancy and low latency.
• The platform processes billions of events per day, including viewing patterns to drive
personalized recommendations.
• Netflix developed and open-sourced numerous tools: Eureka (service discovery), Hystrix (circuit
breaker), Zuul (API gateway), Chaos Monkey (resilience testing).
• Chaos Engineering: Deliberately introducing failures into production to test resilience. Chaos
Monkey randomly terminates instances in production.
Serverless Computing
Serverless computing is an execution model where the cloud provider dynamically manages resource
allocation. Pricing is based on actual consumption, not pre-purchased capacity.
Function as a Service (FaaS): Serverless execution for individual functions. Examples: AWS
Lambda, Azure Functions, Google Cloud Functions. Functions are triggered by events, execute for
milliseconds to minutes, and are billed per execution.
Benefits: No server management, automatic scaling (including to zero), pay per execution, faster
time to market.
Limitations: Cold starts (latency on first invocation), execution time limits, stateless by default,
vendor lock-in.
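A FaaS function is typically a single entry point that receives an event and returns a result. The sketch below follows the (event, context) handler signature used for Python functions on AWS Lambda; the event field name and the response shape are assumptions made for this example.

```python
import json


def handler(event, context):
    """Stateless entry point: everything it needs arrives in the event."""
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


# Local invocation with a hand-built event; the platform supplies these in production.
print(handler({"name": "Ada"}, None))
```

Because the function keeps no state between calls, the provider can run any number of copies in parallel or none at all.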
GitOps
GitOps is an operational framework applying DevOps best practices (version control, collaboration,
CI/CD) to infrastructure automation. The entire desired state of the system is described declaratively in
Git.
• Single source of truth: All desired system state stored in Git. Every change made through a Git
commit.
• Automated deployment: An agent (ArgoCD, Flux) continuously monitors the Git repository and
ensures cluster state matches desired state.
• Auditability: Git history provides a complete audit trail of every change.
• Rollback: Reverting to a previous system state is as simple as reverting a Git commit.
Platform Engineering
Platform Engineering designs and builds toolchains and workflows enabling self-service capabilities for
software engineering organizations. Platform teams build Internal Developer Platforms (IDPs).
• Self-service: Developers provision infrastructure and deploy applications without filing tickets.
• Golden paths: Standardized, opinionated paths for common tasks embedding security,
reliability, and compliance best practices.
• Developer portals: Centralized interfaces (e.g., Backstage by Spotify) providing service
catalogs, documentation, and self-service capabilities.
• Platform as a product: Platform teams treat developers as customers with roadmaps, user
research, and KPIs.
AIOps and MLOps
AIOps: Application of AI to IT operations. Uses machine learning to analyze logs, metrics, and
events to detect anomalies, predict failures, and automate remediation. Tools: Moogsoft, BigPanda,
Dynatrace.
MLOps: Extension of DevOps to machine learning. Addresses deployment and maintenance of ML
models in production. Includes data versioning, model training pipelines, model versioning, model
serving, and monitoring data drift.
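As a rough illustration of the anomaly detection mentioned under AIOps, the sketch below flags metric samples that lie far from the mean using a z-score. Commercial AIOps tools use much richer models; the `find_anomalies` name, the threshold of 3 standard deviations, and the sample latency values are assumptions made for this example.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=3.0):
    """Return indexes of samples more than `threshold` std devs from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All samples are identical, so nothing stands out.
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical response times in milliseconds; the last sample is a spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101,
                99, 100, 102, 98, 100, 101, 99, 100, 100, 500]

print(find_anomalies(latencies_ms))
```

The same check applied to a model's input features over time is one simple way to approach the data drift monitoring listed under MLOps.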
FinOps (Cloud Financial Operations)
FinOps brings financial accountability to variable cloud spending. Development, operations, and finance
teams collaborate to manage cloud costs.
• Inform phase: Full visibility and allocation of cloud costs. Resource tagging by team, project,
and environment.
• Optimize phase: Right-sizing, reserved instances, spot instances, eliminating idle resources.
• Operate phase: Continuous improvement through iterative cost optimization aligned with
business value.
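The Inform and Optimize phases can be illustrated with a small sketch that allocates cost by tag and lists idle resources. The record fields (`monthly_cost`, `avg_cpu_percent`, `tags`), the 5% CPU threshold, and the sample data are hypothetical; a real setup would read this information from the cloud provider's billing and monitoring exports.

```python
from collections import defaultdict


def allocate_costs(resources, tag="team"):
    """Inform phase: total monthly cost per tag value."""
    totals = defaultdict(float)
    for resource in resources:
        owner = resource.get("tags", {}).get(tag, "untagged")
        totals[owner] += resource["monthly_cost"]
    return dict(totals)


def find_idle(resources, cpu_threshold=5.0):
    """Optimize phase: IDs of resources whose average CPU is below the threshold."""
    return [r["id"] for r in resources if r["avg_cpu_percent"] < cpu_threshold]


inventory = [
    {"id": "vm-1", "monthly_cost": 120.0, "avg_cpu_percent": 55.0, "tags": {"team": "payments"}},
    {"id": "vm-2", "monthly_cost": 80.0, "avg_cpu_percent": 2.0, "tags": {"team": "search"}},
    {"id": "vm-3", "monthly_cost": 40.0, "avg_cpu_percent": 1.5, "tags": {"team": "payments"}},
    {"id": "vm-4", "monthly_cost": 10.0, "avg_cpu_percent": 30.0, "tags": {}},
]

print(allocate_costs(inventory))
print(find_idle(inventory))
```

Costs that land in the "untagged" bucket show why consistent resource tagging comes first: spending that cannot be attributed to a team cannot be optimized by that team.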
Cloud Security Best Practices
Shared Responsibility Model
The Shared Responsibility Model divides security responsibilities between the cloud provider and the
customer. The cloud provider is responsible for security OF the cloud (hardware, software, networking,
facilities). The customer is responsible for security IN the cloud (data, identity, applications, operating
system configuration). The exact split varies by service model: the customer manages more under IaaS
and less under PaaS and SaaS.
Zero Trust Security Model
Zero Trust is a security framework requiring all users, inside or outside the organization's network, to be
authenticated, authorized, and continuously validated before being granted access. Key principles:
• Never trust, always verify: No implicit trust granted to any user or device, regardless of network
location.
• Least privilege access: Users and systems granted only the minimum access required.
• Assume breach: Design security controls assuming the network is already compromised.
• Micro-segmentation: Divide networks into small zones to contain potential breaches.
Identity and Access Management Best Practices
• Multi-factor authentication (MFA): Require at least two forms of verification for all users,
especially privileged accounts.
• Principle of Least Privilege: Grant users only the permissions they need to perform their job
functions.
• Role-based access control (RBAC): Assign permissions to roles rather than individual users.
• Regular access reviews: Periodically review and revoke unnecessary permissions.
• Service accounts: Use dedicated service accounts for applications, with minimal required
permissions.
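The practices above can be combined in a small deny-by-default access check. This is a sketch under stated assumptions, not any provider's IAM API: the role names, permission strings, user names, and the rule that privileged permissions require MFA are all hypothetical.

```python
# Permissions are attached to roles, never directly to users (RBAC).
ROLE_PERMISSIONS = {
    "viewer": {"storage:read"},
    "developer": {"storage:read", "storage:write", "deploy:staging"},
    "admin": {"storage:read", "storage:write", "deploy:staging",
              "deploy:production", "iam:manage"},
}

# Users and service accounts receive only the roles they need (least privilege).
USER_ROLES = {
    "asha": {"developer"},
    "ravi": {"admin"},
    "report-service-account": {"viewer"},
}

# Assumed policy: these permissions also require a verified second factor.
PRIVILEGED = {"deploy:production", "iam:manage"}


def is_allowed(user, permission, mfa_verified=False):
    """Deny by default; allow only if a role grants it and MFA rules are met."""
    roles = USER_ROLES.get(user, set())
    granted = any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)
    if not granted:
        return False
    if permission in PRIVILEGED and not mfa_verified:
        return False
    return True


print(is_allowed("asha", "deploy:staging"))
print(is_allowed("ravi", "iam:manage"))
print(is_allowed("ravi", "iam:manage", mfa_verified=True))
```

A regular access review then amounts to inspecting `USER_ROLES` and removing role assignments that are no longer needed.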
Disaster Recovery and Business Continuity
Key DR Metrics
Recovery Time Objective (RTO): The maximum acceptable time to restore normal operations
after a disaster. A shorter RTO requires more investment in DR infrastructure.
Recovery Point Objective (RPO): The maximum acceptable amount of data loss measured in
time. An RPO of 1 hour means no more than 1 hour of data can be lost, so backups or replication
must run at least once an hour.
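A short worked example may help separate the two metrics. The sketch below checks a single hypothetical incident against RTO and RPO targets; the function names and timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup, failure_time):
    """Data written after the last backup is lost when the failure occurs."""
    return failure_time - last_backup


def meets_rpo(last_backup, failure_time, rpo):
    """RPO is met if the lost data covers no more time than the target."""
    return data_loss_window(last_backup, failure_time) <= rpo


def meets_rto(failure_time, restored_time, rto):
    """RTO is met if service was restored within the target duration."""
    return restored_time - failure_time <= rto


# Hypothetical incident timeline.
last_backup = datetime(2025, 1, 1, 11, 30)
failure = datetime(2025, 1, 1, 12, 0)
restored = datetime(2025, 1, 1, 14, 0)

print(data_loss_window(last_backup, failure))                  # 30 minutes of data lost
print(meets_rpo(last_backup, failure, timedelta(hours=1)))     # within a 1-hour RPO
print(meets_rto(failure, restored, timedelta(hours=1)))        # 2-hour outage misses a 1-hour RTO
```

In this timeline the backup schedule satisfies the RPO, but the two-hour restore does not satisfy a one-hour RTO, which is the kind of gap that motivates a faster and more expensive DR strategy.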
DR Strategies (Cost vs Recovery Speed)
Strategy                  Cost        RTO               Description
Backup and Restore        Low         Hours to days     Back up data to cloud; restore when needed
Pilot Light               Low-Medium  Minutes to hours  Core infrastructure always running; scale up on disaster
Warm Standby              Medium      Minutes           Scaled-down replica always running; scale up on disaster
Multi-site Active/Active  High        Near zero         Full production environment running in multiple locations
--- End of Study Material ---