DevOps & CI/CD
Understanding DevOps culture, continuous integration, delivery and modern development practices
Last updated: 8/15/2025
Master the practices, tools and culture that bridge development and operations, enabling teams to deliver software faster and more reliably.
What is DevOps?
The Core Philosophy
Breaking down silos between Dev and Ops
DevOps isn't just tools - it's a cultural shift that brings development and operations teams together to deliver value faster and more reliably.
Real-world analogy: Traditional IT is like a relay race where developers hand off code to operations. DevOps is like a football team where everyone works together towards the same goal, with shared responsibility for success.
The Three Ways of DevOps
1. Flow (Left to Right)
Dev → Test → Deploy → Monitor
Fast flow of work from development to production
2. Feedback (Right to Left)
Monitor → Alert → Fix → Improve
Fast feedback from production to development
3. Continuous Learning
Experiment → Learn → Share → Improve
Culture of experimentation and learning
CI/CD Pipeline
Continuous Integration (CI)
Merge code frequently, test automatically
CI is about integrating code changes frequently and detecting problems early.
# Example CI Pipeline (illustrative step names)
name: Continuous Integration
on: [push, pull_request]
jobs:
  test:
    steps:
      - Checkout code
      - Install dependencies
      - Run linting
      - Run unit tests
      - Run integration tests
      - Generate coverage report
      - Build application
      - Upload artifacts
Continuous Delivery (CD)
Always ready to deploy
Code is always in a deployable state, but deployment is a manual decision.
Code → Build → Test → Stage → [Manual Approval] → Production
Continuous Deployment
Automatic deployment to production
Every change that passes tests goes to production automatically.
Code → Build → Test → Stage → Production (automatic)
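The difference between continuous delivery and continuous deployment comes down to a single gate at the end of the same pipeline. A minimal Python sketch of that distinction (the `run_pipeline` helper and stage names are illustrative, not from any real CI tool):

```python
def run_pipeline(change, auto_deploy=False, approved=False):
    """Run a change through build/test/stage; ship it per the chosen model.

    `change` stands in for a commit or pull request; each stage is assumed
    to pass in this sketch.
    """
    completed = ["build", "test", "stage"]
    if auto_deploy:
        completed.append("production")   # continuous deployment: ships automatically
    elif approved:
        completed.append("production")   # continuous delivery: ships on manual approval
    return completed

# Continuous delivery: always deployable, but waiting on a human decision
assert run_pipeline("feat-1") == ["build", "test", "stage"]
# Continuous deployment: every passing change reaches production
assert run_pipeline("feat-1", auto_deploy=True)[-1] == "production"
```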
Version Control
Git Workflows
Git Flow
Structured branching model
master (production)
  ↓
develop (integration)
  ├── feature/user-auth
  ├── feature/payment
  └── feature/search

hotfix/critical-bug → master & develop
release/v2.0 → master & develop
GitHub Flow
Simplified workflow
main
  ├── feature-branch-1
  ├── feature-branch-2
  └── feature-branch-3
1. Create branch from main
2. Make changes
3. Open pull request
4. Review and test
5. Merge to main
6. Deploy
GitLab Flow
Environment branches
main
↓
pre-production
↓
production
Features merge to main
Main deploys to pre-production
Pre-production promotes to production
Branching Strategies
Feature Branches:
git checkout -b feature/new-feature
# Work on feature
git push origin feature/new-feature
# Create PR/MR
Release Branches:
git checkout -b release/2.0.0
# Final testing and bug fixes
git tag -a v2.0.0 -m "Version 2.0.0"
Hotfix Branches:
git checkout -b hotfix/critical-bug production
# Fix bug
git merge --no-ff hotfix/critical-bug
Build Automation
Build Tools
Make
Classic build automation
# Makefile (recipe lines must be indented with tabs)
.PHONY: build test deploy clean

build:
	npm install
	npm run build

test: build
	npm test

deploy: test
	./scripts/deploy.sh

clean:
	rm -rf dist node_modules
Gradle
Modern build automation
// build.gradle
task build {
    dependsOn 'compile', 'test'
}

task test {
    doLast {
        println 'Running tests...'
    }
}

task deploy(dependsOn: build) {
    doLast {
        println 'Deploying application...'
    }
}
Package Managers
Language-specific:
# JavaScript/Node.js
npm install
yarn add package
# Python
pip install package
poetry add package
# Ruby
gem install package
bundle install
# Java
mvn install
gradle build
# Go
go get package
Testing Strategies
Testing Pyramid
          /\
         /  \        E2E Tests (Few)
        /    \       - Full user flows
       /------\
      /        \     Integration Tests (Some)
     /          \    - Component interactions
    /------------\
   /              \  Unit Tests (Many)
  /                \ - Individual functions
 /------------------\
Types of Testing
Unit Testing
Test individual components
// Jest example
describe('Calculator', () => {
  test('adds 1 + 2 to equal 3', () => {
    expect(add(1, 2)).toBe(3);
  });

  test('multiplies 3 * 4 to equal 12', () => {
    expect(multiply(3, 4)).toBe(12);
  });
});
Integration Testing
Test component interactions
# pytest example
def test_api_endpoint():
    response = client.post('/api/users', json={
        'name': 'John Doe',
        'email': 'john@example.com'
    })
    assert response.status_code == 201
    assert response.json()['id'] is not None
End-to-End Testing
Test complete user journeys
// Cypress example
describe('User Registration', () => {
  it('allows user to sign up', () => {
    cy.visit('/signup');
    cy.get('#email').type('user@example.com');
    cy.get('#password').type('SecurePass123!');
    cy.get('#submit').click();
    cy.url().should('include', '/dashboard');
  });
});
Test Automation
# GitHub Actions test automation
name: Test Suite
on: [push, pull_request]
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: npm run test:unit
  integration-tests:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:13
    steps:
      - uses: actions/checkout@v2
      - run: npm run test:integration
  e2e-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: npm run test:e2e
Infrastructure as Code (IaC)
Configuration Management
Ansible
Agentless automation
---
- name: Configure web servers
  hosts: webservers
  become: yes
  tasks:
    - name: Install nginx
      apt:
        name: nginx
        state: present
    - name: Start nginx service
      service:
        name: nginx
        state: started
        enabled: yes
    - name: Deploy website
      copy:
        src: ./dist/
        dest: /var/www/html/
Puppet
Declarative configuration
class webserver {
  package { 'nginx':
    ensure => installed,
  }

  service { 'nginx':
    ensure  => running,
    enable  => true,
    require => Package['nginx'],
  }

  file { '/var/www/html/index.html':
    content => 'Hello World',
    require => Package['nginx'],
  }
}
Chef
Ruby-based configuration
# Recipe
package 'nginx' do
  action :install
end

service 'nginx' do
  action [:enable, :start]
end

template '/etc/nginx/nginx.conf' do
  source 'nginx.conf.erb'
  notifies :restart, 'service[nginx]'
end
Infrastructure Provisioning
Terraform
Multi-cloud provisioning
# main.tf
provider "aws" {
  region = "us-west-2"
}

resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"

  tags = {
    Name        = "WebServer"
    Environment = "Production"
  }
}

resource "aws_s3_bucket" "assets" {
  bucket = "my-app-assets"
  acl    = "public-read"
}
Pulumi
IaC with real programming languages
import * as aws from "@pulumi/aws";

const bucket = new aws.s3.Bucket("my-bucket", {
  website: {
    indexDocument: "index.html",
  },
});

const instance = new aws.ec2.Instance("web-server", {
  ami: "ami-0c55b159cbfafe1f0",
  instanceType: "t2.micro",
});

export const bucketName = bucket.id;
export const instanceIp = instance.publicIp;
Containerisation & Orchestration
Docker in DevOps
# Multi-stage build for production
FROM node:16 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
FROM node:16-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
EXPOSE 3000
CMD ["node", "dist/index.js"]
Kubernetes Deployments
# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-deployment
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: myapp:latest
          ports:
            - containerPort: 3000
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
Helm Charts
Kubernetes package manager
# values.yaml
replicaCount: 3

image:
  repository: myapp
  tag: "1.0.0"
  pullPolicy: IfNotPresent

service:
  type: LoadBalancer
  port: 80

ingress:
  enabled: true
  hosts:
    - host: myapp.example.com
      paths: ["/"]
Monitoring & Observability
The Three Pillars
Metrics
Numerical measurements over time
# Prometheus metrics
http_requests_total{method="GET", endpoint="/api/users"} 1234
http_request_duration_seconds{quantile="0.99"} 0.123
memory_usage_bytes 536870912
Logs
Discrete events
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "ERROR",
  "service": "payment-service",
  "message": "Payment processing failed",
  "user_id": "12345",
  "error": "Insufficient funds",
  "trace_id": "abc-123-def"
}
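An application can emit logs in exactly that shape with a custom formatter. A sketch using Python's standard logging module; the service name and the list of extra fields are illustrative choices, not a fixed schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "payment-service",   # illustrative static field
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record
        for key in ("user_id", "trace_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.error("Payment processing failed",
             extra={"user_id": "12345", "trace_id": "abc-123-def"})
```

One JSON object per line keeps the output trivially parseable by shippers like Filebeat or Fluentd.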
Traces
Request flow through services
User Request
└─> API Gateway (2ms)
    └─> Auth Service (5ms)
        └─> User Service (10ms)
            └─> Database (15ms)
                └─> Payment Service (50ms)
                    └─> External Payment API (45ms)
Total: 82ms
Monitoring Stack
Prometheus + Grafana
Metrics and visualisation
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'app'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
ELK Stack
Elasticsearch, Logstash, Kibana
# logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  grok {
    match => { "message" => "%{COMMONAPACHELOG}" }
  }
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
APM (Application Performance Monitoring)
Tools:
- New Relic
- Datadog
- AppDynamics
- Dynatrace
// Datadog APM example
const tracer = require('dd-trace').init();

app.get('/api/users/:id', (req, res) => {
  const span = tracer.startSpan('get_user');
  span.setTag('user.id', req.params.id);

  getUserFromDatabase(req.params.id)
    .then(user => {
      span.setTag('user.found', true);
      res.json(user);
    })
    .catch(error => {
      span.setTag('error', true);
      span.setTag('error.message', error.message);
      res.status(404).json({ error: 'User not found' });
    })
    .finally(() => {
      span.finish();
    });
});
Security in DevOps (DevSecOps)
Shift Left Security
Security early in the pipeline
# Security scanning in CI
name: Security Scan
on: [push]
jobs:
  sast:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run SAST scan
        run: |
          npm audit
          snyk test
  container-scan:
    runs-on: ubuntu-latest
    steps:
      - name: Scan Docker image
        run: trivy image myapp:latest
  secrets-scan:
    runs-on: ubuntu-latest
    steps:
      - name: Scan for secrets
        run: gitleaks detect --source .
Security Tools
SAST (Static Application Security Testing):
- SonarQube
- Checkmarx
- Fortify
DAST (Dynamic Application Security Testing):
- OWASP ZAP
- Burp Suite
- Acunetix
Dependency Scanning:
# Various tools
npm audit                    # Node.js
safety check                 # Python
bundle audit                 # Ruby
mvn dependency-check:check   # Java (OWASP dependency-check plugin)
Secrets Management
HashiCorp Vault
# Store secret
vault kv put secret/myapp/db password=secretpass
# Retrieve secret
vault kv get secret/myapp/db
# Dynamic secrets (credentials are generated on read)
vault read database/creds/my-role
AWS Secrets Manager
import boto3
client = boto3.client('secretsmanager')
# Get secret
response = client.get_secret_value(SecretId='prod/myapp/db')
secret = response['SecretString']
Deployment Strategies
Blue-Green Deployment
Two identical environments
Current State:
Blue (v1.0) ← Live Traffic
Green (v2.0) ← Deploy & Test
Switch:
Blue (v1.0) ← Standby
Green (v2.0) ← Live Traffic
Rollback if needed:
Blue (v1.0) ← Live Traffic
Green (v2.0) ← Fix issues
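The switch (and the rollback) can be modelled as flipping a single pointer between two environments. A toy Python sketch; the class and method names are made up for illustration:

```python
class BlueGreenRouter:
    """Sketch: live traffic points at one environment; switching is atomic."""
    def __init__(self):
        self.environments = {"blue": "v1.0", "green": "v1.0"}
        self.live = "blue"

    def deploy(self, version):
        """Deploy the new version to whichever environment is idle."""
        idle = "green" if self.live == "blue" else "blue"
        self.environments[idle] = version   # deploy & test here before cutover
        return idle

    def switch(self):
        """Cut over live traffic; calling it again is the rollback."""
        self.live = "green" if self.live == "blue" else "blue"

router = BlueGreenRouter()
router.deploy("v2.0")                           # green now runs v2.0, blue still live
router.switch()                                 # green takes live traffic
assert router.environments[router.live] == "v2.0"
router.switch()                                 # rollback: blue (v1.0) is live again
assert router.environments[router.live] == "v1.0"
```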
Canary Deployment
Gradual rollout
Stage 1: 5% traffic to v2.0
Stage 2: 25% traffic to v2.0
Stage 3: 50% traffic to v2.0
Stage 4: 100% traffic to v2.0
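Each canary stage is just a traffic split. A Python sketch of weighted routing (version labels are illustrative); a real rollout would also watch the canary's error rate before widening the split:

```python
import random

def route(rng, canary_fraction):
    """Send a fraction of requests to the canary build, the rest to stable."""
    return "v2.0-canary" if rng.random() < canary_fraction else "v1.0-stable"

rng = random.Random(0)  # seeded for reproducibility
stage1 = [route(rng, 0.05) for _ in range(10000)]
share = stage1.count("v2.0-canary") / len(stage1)
assert 0.03 < share < 0.07   # roughly 5% of traffic hits the canary
```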
Feature Flags
Deploy code without releasing features
if (featureFlag.isEnabled('new-checkout')) {
return <NewCheckoutFlow />;
} else {
return <OldCheckoutFlow />;
}
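The same idea as a minimal in-memory flag store in Python (the `FeatureFlags` class is hypothetical; production systems typically use a service such as LaunchDarkly or Unleash):

```python
class FeatureFlags:
    """Minimal in-memory flag store for illustration only."""
    def __init__(self, flags=None):
        self.flags = dict(flags or {})

    def is_enabled(self, name, default=False):
        return self.flags.get(name, default)

    def set(self, name, enabled):
        self.flags[name] = enabled

flags = FeatureFlags({"new-checkout": False})  # code is deployed, feature is dark

def checkout_flow():
    return "new" if flags.is_enabled("new-checkout") else "old"

assert checkout_flow() == "old"
flags.set("new-checkout", True)   # release the feature without redeploying
assert checkout_flow() == "new"
```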
Automation Tools
Jenkins
The automation server
// Jenkinsfile
pipeline {
    agent any

    stages {
        stage('Build') {
            steps {
                sh 'npm install'
                sh 'npm run build'
            }
        }
        stage('Test') {
            parallel {
                stage('Unit Tests') {
                    steps {
                        sh 'npm run test:unit'
                    }
                }
                stage('Integration Tests') {
                    steps {
                        sh 'npm run test:integration'
                    }
                }
            }
        }
        stage('Deploy') {
            when {
                branch 'main'
            }
            steps {
                sh './deploy.sh'
            }
        }
    }

    post {
        always {
            cleanWs()
        }
        success {
            slackSend(message: "Build successful!")
        }
        failure {
            slackSend(message: "Build failed!")
        }
    }
}
GitLab CI/CD
Integrated CI/CD
# .gitlab-ci.yml
stages:
  - build
  - test
  - deploy

variables:
  DOCKER_IMAGE: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA

build:
  stage: build
  script:
    - docker build -t $DOCKER_IMAGE .
    - docker push $DOCKER_IMAGE

test:
  stage: test
  script:
    - npm test
  coverage: '/Coverage: \d+\.\d+%/'

deploy:
  stage: deploy
  script:
    - kubectl set image deployment/app app=$DOCKER_IMAGE
  environment:
    name: production
    url: https://yourdomain.com
  only:
    - main
GitHub Actions
Native GitHub automation
name: CI/CD Pipeline
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:13
        env:
          POSTGRES_PASSWORD: postgres
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v3
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Run tests
        run: npm test
        env:
          DATABASE_URL: postgresql://postgres:postgres@localhost/test
      - name: Build application
        run: npm run build
      - name: Upload artifacts
        uses: actions/upload-artifact@v3
        with:
          name: dist
          path: dist/

  deploy:
    needs: build-and-test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - name: Download artifacts
        uses: actions/download-artifact@v3
        with:
          name: dist
      - name: Deploy to production
        run: |
          echo "Deploying to production..."
          # Deployment commands here
Site Reliability Engineering (SRE)
SLIs, SLOs and SLAs
Measuring reliability
SLI (Service Level Indicator):
Availability = (Successful Requests / Total Requests) × 100
Latency P99 = 99% of requests complete within X ms
SLO (Service Level Objective):
- 99.9% availability (43.2 minutes downtime/month)
- P99 latency < 200ms
- Error rate < 0.1%
SLA (Service Level Agreement):
If availability < 99.9%:
- 10% credit for 99.0-99.9%
- 25% credit for 95.0-99.0%
- 50% credit for < 95.0%
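Both SLIs above reduce to simple arithmetic over request data. A Python sketch using the nearest-rank percentile method (the sample numbers are made up):

```python
import math

def availability(successful, total):
    """SLI: percentage of requests that succeeded."""
    return 100.0 * successful / total

def p99(latencies_ms):
    """Nearest-rank 99th percentile: 99% of requests finish within this."""
    ordered = sorted(latencies_ms)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]

# 1000 requests: one failed, and the slowest 2% took 450 ms
assert availability(999, 1000) == 99.9
latencies = [120] * 980 + [450] * 20
assert p99(latencies) == 450   # the slow tail dominates the P99
```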
Error Budgets
Balancing reliability and velocity
Monthly Error Budget = (1 - SLO) × Total Time
99.9% SLO = 0.1% × 43,200 minutes = 43.2 minutes
If error budget consumed:
- Freeze feature releases
- Focus on reliability improvements
- Post-mortem for incidents
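The budget arithmetic and the release-freeze policy above fit in a few lines of Python (the `release_allowed` policy function is an illustrative simplification):

```python
def error_budget_minutes(slo_percent, total_minutes=30 * 24 * 60):
    """Monthly error budget = (1 - SLO) × total time, in minutes."""
    return (1 - slo_percent / 100.0) * total_minutes

def release_allowed(downtime_so_far_min, slo_percent=99.9):
    """Budget-based release policy: freeze feature releases once it's spent."""
    return downtime_so_far_min < error_budget_minutes(slo_percent)

budget = error_budget_minutes(99.9)
assert round(budget, 1) == 43.2        # 0.1% of a 30-day month
assert release_allowed(20.0) is True    # budget remains: keep shipping
assert release_allowed(50.0) is False   # budget spent: reliability work only
```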
Chaos Engineering
Breaking things on purpose
# Chaos Monkey configuration
chaos:
  enabled: true
  schedule: "0 9-17 * * 1-5"  # Weekdays 9-5
  experiments:
    - name: "Random Pod Failure"
      type: "pod-failure"
      probability: 0.1
    - name: "Network Latency"
      type: "network-delay"
      delay: "100ms"
      probability: 0.2
    - name: "CPU Stress"
      type: "resource-stress"
      cpu: 80
      duration: "5m"
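The same failure-injection idea, scaled down to a single process: a Python sketch of a decorator that fails calls with a given probability, so callers can be tested for resilience. The decorator is an illustration of the principle, not a real chaos tool:

```python
import random

def chaotic(probability, rng=random):
    """Decorator that randomly fails a call, like a pod being killed."""
    def wrap(fn):
        def inner(*args, **kwargs):
            if rng.random() < probability:
                raise RuntimeError("chaos: injected failure")
            return fn(*args, **kwargs)
        return inner
    return wrap

@chaotic(probability=0.1, rng=random.Random(42))
def handle_request():
    return "ok"

# With injection on, callers must tolerate failures (retry, fall back):
results = []
for _ in range(100):
    try:
        results.append(handle_request())
    except RuntimeError:
        results.append("failed")
```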
Documentation as Code
API Documentation
# OpenAPI/Swagger
openapi: 3.0.0
info:
  title: User API
  version: 1.0.0
paths:
  /users/{id}:
    get:
      summary: Get user by ID
      parameters:
        - name: id
          in: path
          required: true
          schema:
            type: integer
      responses:
        '200':
          description: User found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/User'
Architecture Decision Records (ADRs)
# ADR-001: Use PostgreSQL for primary database
## Status
Accepted
## Context
We need a reliable, scalable database for our application.
## Decision
We will use PostgreSQL as our primary database.
## Consequences
- Strong consistency guarantees
- Excellent JSON support
- Requires DBA knowledge for optimisation
- Need to plan for scaling (read replicas, partitioning)
DevOps Metrics
DORA Metrics
Key performance indicators
1. Deployment Frequency - how often code is deployed
2. Lead Time for Changes - time from commit to production
3. Mean Time to Recovery (MTTR) - time to restore service
4. Change Failure Rate - percentage of deployments causing failure
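All four DORA metrics fall out of a deployment log. A Python sketch over made-up log entries:

```python
from datetime import datetime, timedelta

# Illustrative log: (deployed_at, committed_at, failed, minutes_to_recover)
log = [
    (datetime(2024, 1, 1, 12), datetime(2024, 1, 1, 0),  False, 0),
    (datetime(2024, 1, 3, 12), datetime(2024, 1, 2, 12), True,  45),
    (datetime(2024, 1, 5, 12), datetime(2024, 1, 5, 0),  False, 0),
    (datetime(2024, 1, 8, 12), datetime(2024, 1, 7, 12), True,  15),
]

days = (log[-1][0] - log[0][0]).days
deployment_frequency = len(log) / days                          # deploys per day
lead_time = sum((d - c for d, c, _, _ in log), timedelta()) / len(log)
failures = [r for _, _, failed, r in log if failed]
change_failure_rate = len(failures) / len(log)
mttr_minutes = sum(failures) / len(failures)

assert lead_time == timedelta(hours=18)   # commit → production, on average
assert change_failure_rate == 0.5
assert mttr_minutes == 30
```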
Measuring Success
// Track deployment metrics
const deployment = {
  timestamp: new Date(),
  version: process.env.VERSION,
  duration: deploymentTime,
  status: 'success',
  rollback: false
};

metrics.record('deployment.frequency', 1);
metrics.record('deployment.duration', deploymentTime);
metrics.record('deployment.success_rate', deployment.status === 'success' ? 1 : 0);
Getting Started with DevOps
Personal DevOps Project
Step 1: Set up version control
git init
git remote add origin https://github.com/username/project
Step 2: Create CI pipeline
# Simple CI starter
name: CI
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: npm test
Step 3: Add monitoring
// Basic health check
app.get('/health', (req, res) => {
  res.json({
    status: 'healthy',
    timestamp: new Date(),
    uptime: process.uptime()
  });
});
Step 4: Automate deployment
#!/bin/bash
# Simple deployment script
npm test || exit 1
docker build -t myapp .
docker push myapp
kubectl set image deployment/app app=myapp:latest
Summary
DevOps transforms how teams build, deploy and operate software. It's not just about tools - it's about culture, collaboration and continuous improvement.
Key takeaways:
- Automate everything that can be automated
- Measure everything that matters
- Fail fast, recover faster
- Security is everyone's responsibility
- Documentation is part of the code