GROUP SIX
Monitoring and Maintaining Systems
GROUP MEMBERS
WAMBA JEBBI 124-062082-34759
KIKULUBE LATIFU 124-062082-33391
AKUSON FADIA 124-062082-35916
KIPLANANT TIMOTHY 124-062082-36569
FEISAL ABDULLAHI 124-062082-36881
WASWA SWAIB 124-062082-33455
MASABA ERIC 124-062082-36282
MASSA JONAH 124-062082-33980
Key areas:
System Performance monitoring tools
Event logs and alert systems
Troubleshooting system issues
Patch management and updates
1) System performance monitoring tools
These tools continuously measure system health and behavior to detect anomalies, capacity issues,
and performance regressions.
They provide historical trends for capacity planning and SLA/SLO tracking.
System performance monitoring involves observing and measuring how efficiently a computer
system or server is operating.
Key Performance Indicators (KPIs)
CPU Usage: Percentage of processor in use
Memory Usage (RAM): Amount of memory used vs available
Disk Performance: Disk space (free vs used), Disk read/write speed
Network Performance
Bandwidth usage
Packet loss and latency
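As a rough illustration of how two of these indicators can be read programmatically, here is a minimal Python sketch that uses only the standard library. The function names are hypothetical, and the load-average reading is only available on Unix-like systems. Memory and network figures are left out because the standard library has no portable way to read them.

```python
import os
import shutil


def disk_usage_percent(path="."):
    """Return used disk space on the volume holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def cpu_load_per_core():
    """Return the 1-minute load average per CPU core, or None if unsupported."""
    if not hasattr(os, "getloadavg"):
        return None  # for example on Windows
    try:
        one_minute, _, _ = os.getloadavg()
    except OSError:
        return None
    return one_minute / (os.cpu_count() or 1)


def snapshot(path="."):
    """Collect the indicators above into one dictionary."""
    return {
        "disk_used_percent": round(disk_usage_percent(path), 1),
        "cpu_load_per_core": cpu_load_per_core(),
    }


if __name__ == "__main__":
    print(snapshot())
```

A scheduled job could call snapshot() every minute and store the results, which gives the historical trend data mentioned earlier.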
Common Monitoring Tools
Task Manager (Windows)
Quick overview of CPU, RAM, disk, and processes
Performance Monitor (Perfmon)
Advanced tool with graphs and counters
Resource Monitor
Detailed real-time system resource usage
Third-party tools
Nagios (server monitoring)
Zabbix (network and system monitoring)
SolarWinds (enterprise monitoring)
Importance
Detects system bottlenecks
Helps prevent crashes and downtime
Improves system performance and planning
Best practices
Define SLOs (service level objectives) and translate them into measurable metrics/alerts.
Monitor meaningful metrics (avoid alert fatigue).
Use dashboards for quick triage (overview + detailed drilldowns).
Use rates and percentiles (p95/p99) for latency, not just averages.
Retain high-resolution recent data, downsample older data for long-term trends.
Protect monitoring system availability (redundant collectors, separate monitoring
cluster).
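To show why percentiles matter and what downsampling means in practice, here is a small Python sketch. The latency values are made up for illustration, and the function names percentile and downsample are hypothetical. The percentile uses the simple nearest-rank method; monitoring tools may use other methods.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def downsample(samples, bucket_size):
    """Replace each run of bucket_size samples with its average."""
    return [
        statistics.mean(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]


# Hypothetical latencies in milliseconds: 90 fast requests and 10 slow ones.
latencies_ms = [20] * 90 + [900] * 10

if __name__ == "__main__":
    print("mean:", statistics.mean(latencies_ms))
    print("p50:", percentile(latencies_ms, 50))
    print("p95:", percentile(latencies_ms, 95))
    print("p99:", percentile(latencies_ms, 99))
```

With this data the mean is 108 ms, the median is 20 ms, and p95 is 900 ms. The average hides the fact that one request in ten is very slow, which the percentiles expose.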
2) Event logs and alert systems
Event Logs
Event logs are records of activities and events occurring in the system.
What to log: System logs, application logs, audit/security logs, infrastructure events.
Types of Event Logs in Windows
Application Logs: Events from applications/software
System Logs: Operating system events
Security Logs: login attempts, access control, security events
Event Viewer
This is a tool used to view and analyze logs
It helps track errors, warnings, and information messages
Alert Systems
Alert systems notify administrators when issues occur.
Types of Alerts
Email notifications
SMS alerts
System pop-ups
Automated scripts/actions
Importance
Enables early detection of problems
Improves security monitoring
Supports auditing and accountability
• Alerting strategy
Two classes: alerts on symptoms (high CPU, errors) and alerts on underlying causes (disk full).
Define severity levels (critical, warning, info) and explicit escalation paths.
Use alert policies: avoid noisy alerts; include runbooks or playbook links in alerts.
Examples of alert rules:
▪ CPU > 90% for 5 minutes → warning/critical.
▪ Error rate > 1% sustained for 2 minutes → page.
▪ Disk usage > 85% → warning; > 95% → immediate page.
Include rate-of-change alerts (rapid growth) not just static thresholds.
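The example rules above can be sketched in Python as follows. This is only an illustration of the logic, not the rule syntax of any real alerting product. The function names and the (timestamp, value) sample format are assumptions.

```python
def disk_severity(used_percent):
    """Map disk usage to a severity using the 85% and 95% thresholds."""
    if used_percent > 95:
        return "page"
    if used_percent > 85:
        return "warning"
    return "ok"


def sustained_breach(samples, threshold, window_seconds):
    """Return True if every sample in the latest window exceeds the threshold.

    `samples` is a list of (timestamp_seconds, value) pairs in time order.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    # Require data that covers the whole window before alerting.
    if latest - samples[0][0] < window_seconds:
        return False
    recent = [value for ts, value in samples if ts >= latest - window_seconds]
    return all(value > threshold for value in recent)


def growth_per_hour(samples):
    """Rate of change between the first and last sample, in units per hour."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (v1 - v0) * 3600 / (t1 - t0)
```

For example, sustained_breach(cpu_samples, 90, 300) expresses "CPU > 90% for 5 minutes", and growth_per_hour on disk usage samples supports a rate-of-change alert before a static threshold is reached.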
• Integrations and workflows
Integrate with incident management (PagerDuty, OpsGenie), chatops (Slack/MS Teams),
ticketing (Jira).
Automated remediation for common issues (auto-scaling, service restarts, scripts).
Maintain playbooks/runbooks referenced by alerts.
• Retention and compliance
Set retention policies based on compliance and troubleshooting needs.
Secure logs (encryption, access controls); archive for audits.
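A retention policy can be expressed as a simple rule based on the age of each log. The sketch below is a hypothetical example: the 30-day and 365-day limits are placeholder values, and real limits depend on the compliance requirements that apply.

```python
from datetime import date


def retention_action(log_date, today, keep_days=30, archive_days=365):
    """Decide what to do with a log file based on its age in days.

    keep_days and archive_days are illustrative defaults, not recommendations.
    """
    age = (today - log_date).days
    if age <= keep_days:
        return "keep"      # recent logs stay online for troubleshooting
    if age <= archive_days:
        return "archive"   # older logs move to secured storage for audits
    return "delete"
```

For example, with these defaults a log dated 1 January 2024 checked on 1 March 2024 is 60 days old and would be archived.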
3) Troubleshooting system issues
Troubleshooting is the process of identifying and fixing problems in a system.
Common System Issues
Slow system performance
Application crashes
Network connectivity problems
Hardware failures
Troubleshooting Steps
Identify the problem
Gather information
Analyze possible causes
Test possible solutions
Implement the best solution
Monitor results
Tools for Troubleshooting
Command Prompt Tools
ping – checks network connectivity
ipconfig – displays IP configuration
tracert – traces route to a destination
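The same kind of connectivity check can be scripted. Note that ping uses ICMP, which normally needs special privileges from a script, so the sketch below checks a TCP port instead. The function name tcp_port_open is hypothetical.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A call such as tcp_port_open("192.0.2.10", 443) would test whether a web server answers on port 443. The address here is a documentation placeholder, not a real host.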
System Tools
Task Manager
Device Manager
Safe Mode
Event Viewer
Best Practices
Follow a systematic approach
Check logs for errors
Document issues and solutions
Always test before applying fixes
• Incident lifecycle
Detect → Triage → Contain → Diagnose → Remediate →
Recover → Root Cause Analysis (RCA) → Prevent.
First-response checklist (quick triage)
o Gather context: when did it start? Affected services/users? Recent
deploys/changes?
o Check dashboards and alerts (what metrics changed first).
o Check logs (tail, filter by correlation ID/time window).
o Check host health: top, free -m, df -h, iostat, vmstat, sar.
o Check application/service status: systemctl status, netstat/lsof for port conflicts,
process list.
o Check recent changes: deployments, config changes, patches, scaling events
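The log check in this checklist (filtering by correlation ID and time window) can be sketched in Python. The log line format used here, an ISO timestamp followed by a level and a correlation_id=... field, is an assumption made for the example. Real applications may log differently.

```python
from datetime import datetime


def filter_logs(lines, correlation_id, start, end):
    """Return lines inside [start, end] that carry the given correlation ID."""
    wanted = f"correlation_id={correlation_id}"
    matches = []
    for line in lines:
        timestamp_text, _, rest = line.partition(" ")
        try:
            timestamp = datetime.fromisoformat(timestamp_text)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        # Compare whole tokens so that "abc" does not match "abc123".
        if start <= timestamp <= end and wanted in rest.split():
            matches.append(line)
    return matches


# Hypothetical log lines for illustration only.
sample_lines = [
    "2024-05-01T10:00:00 INFO correlation_id=abc request received",
    "2024-05-01T10:00:02 ERROR correlation_id=abc database timeout",
    "2024-05-01T10:00:03 INFO correlation_id=abc123 request received",
    "2024-05-01T11:30:00 ERROR correlation_id=abc retry failed",
]
```

Filtering sample_lines for ID abc between 10:00:00 and 10:05:00 returns the first two lines, which show the full path of one request during the incident window.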
Diagnostic strategies
o Reproduce (if possible) in test environment.
o Isolate the layer: network, host, container, application, DB.
o Use correlation IDs/tracing (OpenTelemetry, Jaeger) to track request path.
o Check resource exhaustion (file descriptors, threads, DB connections).
o Validate configuration (env vars, secrets, connection strings).
o Consider recent deployments or config changes as likely cause.
Post-incident
o Conduct RCA: timeline, root cause, contributing factors, remediation, follow-ups.
o Capture learnings in runbooks; update monitoring and alerts to detect earlier.
o Perform blameless postmortems.
4) Patch management and updates
What is Patch Management?
This is the process of updating software to fix bugs, security vulnerabilities, and performance issues.
• Goals
Keep systems secure and stable by applying security and bugfix patches in a controlled way.
Minimize downtime and risk of regressions.
• Patch management lifecycle
o Inventory: track OS, packages, firmware, dependencies.
o Prioritize: critical security CVEs first, then functional fixes.
o Test: apply patches to staging/test environments and run smoke/regression
tests.
o Schedule: define maintenance windows; consider off-peak times.
o Deploy: automated rollouts (rolling updates, canary, blue-green) to minimize
impact.
o Verify: health checks and monitoring post-patch.
o Rollback: have rollback plans and backups (snapshot images, database backups).
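The Prioritize step of this lifecycle can be sketched as a sort. In the example below, the patch names, the dictionary fields, and the use of a CVSS score for ordering are all assumptions made for illustration.

```python
def prioritize(patches):
    """Order patches: security fixes first (highest score first), then the rest."""
    return sorted(patches, key=lambda p: (not p["security"], -p["cvss"]))


# Hypothetical pending patches; names and scores are made up.
pending = [
    {"name": "editor-bugfix", "security": False, "cvss": 0.0},
    {"name": "openssl-fix", "security": True, "cvss": 7.5},
    {"name": "kernel-fix", "security": True, "cvss": 9.8},
]

if __name__ == "__main__":
    for patch in prioritize(pending):
        print(patch["name"])
```

With this data the order is kernel-fix, openssl-fix, editor-bugfix, so the most severe security fix is tested and scheduled first.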
Tools Used
o Linux package managers: apt, yum/dnf, zypper; use unattended-upgrades
carefully.
o Configuration and orchestration: Ansible, Chef, Puppet, SaltStack for consistent
patching.
o Windows: Windows Server Update Services (WSUS), SCCM (ConfigMgr),
Windows Update for Business.
o Cloud and container strategies: AMI/VM image baking with patches, immutable
images, orchestration rollouts.
o Use CI/CD pipelines to build and test new images with updated packages.
Deployment strategies to reduce risk
o Canary: update small subset, monitor, then continue.
o Rolling: update a few nodes at a time with health checks.
o Blue-green: deploy new version into green environment, switch traffic when
healthy.
o Feature flags for application-level changes.
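The canary and rolling strategies can be combined in one small sketch: patch one node first, then continue in batches, and stop as soon as a health check fails. The functions apply_patch and is_healthy are placeholders that the caller must supply, and the batch sizes are example values.

```python
def plan_batches(nodes, canary_size=1, batch_size=2):
    """Split nodes into a small canary batch followed by equal-sized batches."""
    batches = []
    if nodes[:canary_size]:
        batches.append(nodes[:canary_size])
    rest = nodes[canary_size:]
    for i in range(0, len(rest), batch_size):
        batches.append(rest[i:i + batch_size])
    return batches


def rollout(nodes, apply_patch, is_healthy, canary_size=1, batch_size=2):
    """Patch batch by batch and halt at the first batch with an unhealthy node.

    Returns (healthy_patched_nodes, failed_nodes).
    """
    patched = []
    for batch in plan_batches(nodes, canary_size, batch_size):
        failed = []
        for node in batch:
            apply_patch(node)
            if is_healthy(node):
                patched.append(node)
            else:
                failed.append(node)
        if failed:
            return patched, failed  # stop here; later batches stay untouched
    return patched, []
```

If a node fails its health check, the remaining nodes are never patched, which limits the impact and leaves time to roll back the failed node.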
Security considerations
o Track CVEs and security advisories; subscribe to vendor feeds.
o Prioritize kernel and remote-exploit patches.
o Combine patch management with vulnerability scanning (Nessus, OpenVAS).
o Validate cryptographic libraries and dependencies.
Governance and compliance
o Maintain patch policies (timelines for applying critical/important patches).
o Keep audit trail of patching activity and approvals.
Importance
Protects against malware and cyber attacks
Improves system stability
Keeps software up to date
Best Practices
Schedule regular updates
Backup system before patching
Test patches before deployment
Keep records of updates