Group 6-1

The document outlines key practices for monitoring and maintaining systems, including the use of performance monitoring tools, event logs, alert systems, troubleshooting methods, and patch management. It emphasizes the importance of detecting system issues early, maintaining system performance, and ensuring security through regular updates. Best practices and tools for each area are discussed to enhance system reliability and efficiency.

Uploaded by

opiojoshuara

GROUP SIX

Monitoring and Maintaining Systems

GROUP MEMBERS

WAMBA JEBBI 124-062082-34759


KIKULUBE LATIFU 124-062082-33391
AKUSON FADIA 124-062082-35916
KIPLANANT TIMOTHY 124-062082-36569
FEISAL ABDULLAHI 124-062082-36881
WASWA SWAIB 124-062082-33455
MASABA ERIC 124-062082-36282
MASSA JONAH 124-062082-33980
Monitoring and Maintaining Systems
Key areas:
 System Performance monitoring tools
 Event logs and alert systems
 Troubleshooting system issues
 Patch management and updates
1) System performance monitoring tools
These tools continuously measure system health and behavior to detect anomalies, capacity issues, and performance regressions.

They provide historical trends for capacity planning and SLA/SLO tracking.

System performance monitoring involves observing and measuring how efficiently a computer
system or server is operating.

Key Performance Indicators (KPIs)

 CPU Usage: Percentage of processor in use

 Memory Usage (RAM): Amount of memory used vs available

 Disk Performance: Disk space (free vs used), Disk read/write speed

 Network Performance

 Bandwidth usage

 Packet loss and latency

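The KPIs above are all simple ratios over raw counters. The following is a minimal sketch, assuming Python's standard library only; the function names (usage_percent, packet_loss_percent, disk_snapshot) are illustrative and do not come from any monitoring tool.

```python
import shutil


def usage_percent(used: float, total: float) -> float:
    """Return used/total as a percentage, or 0.0 when total is not positive."""
    if total <= 0:
        return 0.0
    return 100.0 * used / total


def packet_loss_percent(sent: int, received: int) -> float:
    """Percentage of sent packets for which no reply was received."""
    if sent <= 0:
        return 0.0
    return 100.0 * (sent - received) / sent


def disk_snapshot(path: str = ".") -> dict:
    """Read disk space for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return {
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        "used_percent": round(usage_percent(usage.used, usage.total), 1),
    }


if __name__ == "__main__":
    print(disk_snapshot("."))
    print(packet_loss_percent(sent=100, received=97))
```

CPU and memory readings normally come from the operating system or a monitoring agent; once the raw numbers are available, the same ratio calculation applies.
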
Common Monitoring Tools

 Task Manager (Windows)

 Quick overview of CPU, RAM, disk, and processes

 Performance Monitor (Perfmon)

 Advanced tool with graphs and counters


 Resource Monitor

 Detailed real-time system resource usage

 Third-party tools

 Nagios (server monitoring)

 Zabbix (network and system monitoring)

 SolarWinds (enterprise monitoring)

Importance

 Detects system bottlenecks

 Helps prevent crashes and downtime

 Improves system performance and planning

Best practices

 Define SLOs (service level objectives) and translate them into measurable metrics/alerts.

 Monitor meaningful metrics (avoid alert fatigue).

 Use dashboards for quick triage (overview + detailed drilldowns).

 Use rate and percentiles (p95/p99) for latency, not just averages.

 Retain high-resolution recent data, downsample older data for long-term trends.

 Protect monitoring system availability (redundant collectors, separate monitoring cluster).
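
To illustrate why percentiles matter for latency, here is a small sketch using the nearest-rank method. The latency values are made up for the example, and real monitoring systems may use a different percentile estimator.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 95 fast requests and 5 very slow ones.
latencies_ms = [10] * 95 + [1000] * 5

summary = {
    "mean": mean(latencies_ms),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}

if __name__ == "__main__":
    print(summary)
```

With this data the mean is 59.5 ms, which describes no real request, while p95 is 10 ms and p99 is 1000 ms. The slow tail is visible only in the high percentile.
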

2) Event logs and alert systems


Event Logs

Event logs are records of activities and events occurring in the system.

What to log: System logs, application logs, audit/security logs, infrastructure events.

Types of Event Logs in Windows

 Application Logs: Events from applications/software

 System Logs: Operating system events


 Security Logs: login attempts, access control, security events

Event Viewer

This is a tool used to view and analyze logs.

It helps track errors, warnings, and information messages
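
Exported log text can also be summarized with a short script. The sketch below assumes a hypothetical line layout ("date time LEVEL source: message"); it is not the Event Viewer export format, and the sample entries are invented.

```python
from collections import Counter

# Hypothetical exported log lines: "<date> <time> <LEVEL> <source>: <message>"
SAMPLE_LOG = """\
2024-05-01 09:00:01 INFO Service Control Manager: service started
2024-05-01 09:02:17 WARNING Disk: volume C: is low on space
2024-05-01 09:05:42 ERROR Application: app.exe stopped responding
2024-05-01 09:06:03 ERROR Application: app.exe stopped responding
"""


def parse_line(line):
    """Split one line into timestamp, level, source and message."""
    date, time, level, rest = line.split(" ", 3)
    source, _, message = rest.partition(": ")
    return {
        "timestamp": f"{date} {time}",
        "level": level,
        "source": source,
        "message": message,
    }


def count_by_level(text):
    """Count how many events were recorded at each level."""
    events = [parse_line(line) for line in text.splitlines() if line.strip()]
    return Counter(event["level"] for event in events)


counts = count_by_level(SAMPLE_LOG)

if __name__ == "__main__":
    print(counts)
```
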

Alert Systems

Alert systems notify administrators when issues occur.

Types of Alerts

 Email notifications

 SMS alerts

 System pop-ups

 Automated scripts/actions

Importance

 Enables early detection of problems

 Improves security monitoring

 Supports auditing and accountability

• Alerting strategy

Use two classes of alerts: alerts on symptoms (high CPU, errors) and alerts on underlying causes (disk full).

Define severity levels (critical, warning, info) and explicit escalation paths.

Use alert policies: avoid noisy alerts; include runbooks or playbook links in alerts.

Examples of alert rules:

▪ CPU > 90% for 5 minutes → warning/critical.

▪ Error rate > 1% sustained for 2 minutes → page.

▪ Disk usage > 85% → warning; > 95% → immediate page.

Include rate-of-change alerts (rapid growth) not just static thresholds.
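
The example rules above can be expressed as small checks. This is a sketch under stated assumptions: readings arrive once per minute, so "for 5 minutes" means the last five samples, and the function names are illustrative rather than taken from any alerting product.

```python
def sustained_above(values, threshold, min_samples):
    """True when the most recent `min_samples` readings all exceed `threshold`."""
    if min_samples < 1:
        raise ValueError("min_samples must be at least 1")
    recent = values[-min_samples:]
    return len(recent) == min_samples and all(v > threshold for v in recent)


def disk_severity(used_percent):
    """Map disk usage to the severities from the rule: > 85% warning, > 95% page."""
    if used_percent > 95:
        return "page"
    if used_percent > 85:
        return "warning"
    return "ok"


def growth_per_hour(previous, current, seconds_elapsed):
    """Rate of change between two readings, scaled to one hour."""
    if seconds_elapsed <= 0:
        raise ValueError("seconds_elapsed must be positive")
    return (current - previous) * 3600 / seconds_elapsed


if __name__ == "__main__":
    cpu = [50, 91, 92, 95, 93, 96]  # one reading per minute (assumed)
    print(sustained_above(cpu, threshold=90, min_samples=5))
    print(disk_severity(90))
    print(growth_per_hour(70, 75, seconds_elapsed=1800))
```

A rate-of-change check such as growth_per_hour can catch a disk that is still below 85% but filling quickly, which a static threshold would miss.
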

• Integrations and workflows
Integrate with incident management (PagerDuty, OpsGenie), chatops (Slack/MS Teams), ticketing (Jira).

Automated remediation for common issues (auto-scaling, service restarts, scripts).

Maintain playbooks/runbooks referenced by alerts.
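
One way to picture the routing described above is a table from severity to destinations, with the runbook link carried in every message. Everything in this sketch is hypothetical: the channel names, the Alert fields, and the example URL are placeholders, and no vendor API is called.

```python
from dataclasses import dataclass

# Hypothetical routing table: which destinations receive each severity.
ROUTES = {
    "critical": ["pager", "chat", "ticket"],
    "warning": ["chat", "ticket"],
    "info": ["chat"],
}


@dataclass
class Alert:
    name: str
    severity: str
    runbook_url: str  # link to the runbook an on-call engineer should follow


def route(alert):
    """Return the destinations for an alert, rejecting unknown severities."""
    if alert.severity not in ROUTES:
        raise ValueError(f"unknown severity: {alert.severity}")
    return ROUTES[alert.severity]


def format_message(alert):
    """Build the notification text, always including the runbook link."""
    return f"[{alert.severity.upper()}] {alert.name} | runbook: {alert.runbook_url}"


if __name__ == "__main__":
    alert = Alert("Disk usage above 95%", "critical", "https://runbooks.example.com/disk-full")
    print(route(alert))
    print(format_message(alert))
```
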

• Retention and compliance

Set retention policies based on compliance and troubleshooting needs.

Secure logs (encryption, access controls); archive for audits.
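
A retention policy can be reduced to a lookup of how long each log type stays online before it is archived. The day counts below are invented for illustration; actual values depend on the compliance rules that apply.

```python
from datetime import date

# Hypothetical online-retention periods in days, per log type.
RETENTION_DAYS = {"application": 30, "system": 90, "security": 365}


def action_for(log_type, created, today):
    """Return "keep" while a log is within its retention period, else "archive"."""
    age_days = (today - created).days
    if age_days > RETENTION_DAYS[log_type]:
        return "archive"
    return "keep"


if __name__ == "__main__":
    today = date(2024, 3, 1)
    for log_type in RETENTION_DAYS:
        print(log_type, action_for(log_type, date(2024, 1, 1), today))
```
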

3) Troubleshooting system issues


Troubleshooting is the process of identifying and fixing problems in a system.

Common System Issues

 Slow system performance

 Application crashes

 Network connectivity problems

 Hardware failures

Troubleshooting Steps

 Identify the problem

 Gather information

 Analyze possible causes

 Test possible solutions

 Implement the best solution

 Monitor results
Tools for Troubleshooting
Command Prompt Tools

 ping – checks network connectivity

 ipconfig – displays IP configuration

 tracert – traces route to a destination

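A connectivity check can also be scripted. Note that ping uses ICMP, while the sketch below tests something different: whether a TCP connection to a given host and port can be opened. It uses Python's standard socket module; the host and port in the usage line are placeholders.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    print(can_connect("127.0.0.1", 80))
```

A False result narrows the problem to the network path, a firewall, or the service itself, so it is a starting point rather than a diagnosis.
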
System Tools

 Task Manager

 Device Manager

 Safe Mode

 Event Viewer

Best Practices

 Follow a systematic approach

 Check logs for errors

 Document issues and solutions

 Always test before applying fixes

• Incident lifecycle

Detect → Triage → Contain → Diagnose → Remediate → Recover → Root Cause Analysis (RCA) → Prevent.

First-response checklist (quick triage)

o Gather context: when did it start? Affected services/users? Recent deploys/changes?

o Check dashboards and alerts (what metrics changed first).

o Check logs (tail, filter by correlation ID/time window).

o Check host health: top, free -m, df -h, iostat, vmstat, sar.

o Check application/service status: systemctl status, netstat/lsof for port conflicts, process list.

o Check recent changes: deployments, config changes, patches, scaling events.
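
The log check in the list above (filter by correlation ID and time window) could look like the following sketch. It assumes a hypothetical line layout in which each line starts with an ISO timestamp and carries a "cid=<id>" token; real applications will log differently.

```python
from datetime import datetime


def filter_logs(lines, correlation_id, start, end):
    """Keep lines inside [start, end] whose tokens include cid=<correlation_id>."""
    wanted = f"cid={correlation_id}"
    matches = []
    for line in lines:
        stamp_text, _, rest = line.partition(" ")
        try:
            stamp = datetime.fromisoformat(stamp_text)
        except ValueError:
            continue  # skip lines that do not start with an ISO timestamp
        if start <= stamp <= end and wanted in rest.split():
            matches.append(line)
    return matches


# Invented sample lines for the example.
LOG_LINES = [
    "2024-05-01T09:04:59 cid=abc123 INFO request received",
    "2024-05-01T09:05:42 cid=abc123 ERROR payment service timed out",
    "2024-05-01T09:05:43 cid=zzz999 INFO request received",
    "2024-05-01T09:30:00 cid=abc123 INFO retry succeeded",
]

hits = filter_logs(
    LOG_LINES,
    "abc123",
    start=datetime(2024, 5, 1, 9, 0),
    end=datetime(2024, 5, 1, 9, 10),
)

if __name__ == "__main__":
    for line in hits:
        print(line)
```
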

Diagnostic strategies

o Reproduce (if possible) in test environment.

o Isolate the layer: network, host, container, application, DB.

o Use correlation IDs/tracing (OpenTelemetry, Jaeger) to track request path.

o Check resource exhaustion (file descriptors, threads, DB connections).

o Validate configuration (env vars, secrets, connection strings).

o Consider recent deployments or config changes as likely cause.
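
Configuration validation is easy to automate. As a rough sketch, the check below reports required settings that are missing or empty; the setting names and the connection string are hypothetical examples.

```python
import os


def missing_settings(required, env=None):
    """Return the required setting names that are absent or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # Hypothetical settings an application might need at startup.
    required = ["DB_URL", "API_KEY", "LOG_LEVEL"]
    example_env = {"DB_URL": "postgres://db.example.internal/app", "API_KEY": ""}
    print(missing_settings(required, example_env))
```

Running a check like this at service startup turns a vague runtime failure into an explicit message about which setting is missing.
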

Post-incident

o Conduct RCA: timeline, root cause, contributing factors, remediation, follow-ups.

o Capture learnings in runbooks; update monitoring and alerts to detect earlier.

o Perform blameless postmortems.

4) Patch management and updates

What is Patch Management?

This is the process of updating software to fix bugs, security vulnerabilities, and performance issues.

• Goals

Keep systems secure and stable by applying security and bugfix patches in a controlled way.

Minimize downtime and risk of regressions.

• Patch management lifecycle

o Inventory: track OS, packages, firmware, dependencies.

o Prioritize: critical security CVEs first, then functional fixes.

o Test: apply patches to staging/test environments and run smoke/regression tests.

o Schedule: define maintenance windows; consider off-peak times.

o Deploy: automated rollouts (rolling updates, canary, blue-green) to minimize impact.

o Verify: health checks and monitoring post-patch.

o Rollback: have rollback plans and backups (snapshot images, database backups).
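
The "Prioritize" step of the lifecycle above can be sketched as a sort: security patches first, then by severity. The Patch fields, the severity scale, and the pending list are assumptions made for the example.

```python
from dataclasses import dataclass

# Lower number means more urgent (assumed scale).
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Patch:
    package: str
    severity: str
    is_security: bool


def prioritize(patches):
    """Order patches: security fixes first, then by severity within each group."""
    return sorted(patches, key=lambda p: (not p.is_security, SEVERITY_ORDER[p.severity]))


if __name__ == "__main__":
    # Hypothetical pending patches.
    pending = [
        Patch("libfoo", "low", False),
        Patch("openssl", "critical", True),
        Patch("editor", "high", False),
        Patch("kernel", "high", True),
    ]
    for patch in prioritize(pending):
        print(patch.package, patch.severity, patch.is_security)
```
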

Tools Used

o Linux package managers: apt, yum/dnf, zypper; use unattended-upgrades carefully.

o Configuration and orchestration: Ansible, Chef, Puppet, SaltStack for consistent patching.

o Windows: Windows Server Update Services (WSUS), SCCM (ConfigMgr), Windows Update for Business.

o Cloud and container strategies: AMI/VM image baking with patches, immutable images, orchestration rollouts.

o Use CI/CD pipelines to build and test new images with updated packages.

Deployment strategies to reduce risk

o Canary: update small subset, monitor, then continue.

o Rolling: update a few nodes at a time with health checks.

o Blue-green: deploy new version into green environment, switch traffic when healthy.

o Feature flags for application-level changes.
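
A rolling update, as described above, patches a few nodes at a time and stops as soon as a health check fails. The sketch below is a simplified model: apply_patch and is_healthy are caller-supplied callables standing in for real deployment and health-check logic, and the node names are made up.

```python
def batches(nodes, size):
    """Yield consecutive groups of at most `size` nodes."""
    for start in range(0, len(nodes), size):
        yield nodes[start:start + size]


def rolling_update(nodes, batch_size, apply_patch, is_healthy):
    """Patch nodes batch by batch; halt before the next batch if any node is unhealthy."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    updated = []
    for batch in batches(nodes, batch_size):
        for node in batch:
            apply_patch(node)
            updated.append(node)
        unhealthy = [node for node in batch if not is_healthy(node)]
        if unhealthy:
            return {"status": "halted", "updated": updated, "unhealthy": unhealthy}
    return {"status": "complete", "updated": updated, "unhealthy": []}


if __name__ == "__main__":
    nodes = ["web1", "web2", "web3", "web4", "web5", "web6"]
    result = rolling_update(
        nodes,
        batch_size=2,
        apply_patch=lambda node: None,           # placeholder for the real patch step
        is_healthy=lambda node: node != "web3",  # pretend web3 fails its health check
    )
    print(result)
```

A canary rollout follows the same idea with a very small first batch, so that a bad patch affects as few nodes as possible before the rollout is halted.
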

Security considerations

o Track CVEs and security advisories; subscribe to vendor feeds.

o Prioritize kernel and remote-exploit patches.

o Combine patch management with vulnerability scanning (Nessus, OpenVAS).

o Validate cryptographic libraries and dependencies.

Governance and compliance

o Maintain patch policies (timelines for applying critical/important patches).

o Keep audit trail of patching activity and approvals.
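
A patch policy with timelines can be checked mechanically. In this sketch the deadlines (7, 30 and 90 days) are invented placeholders, not recommended values; an organization's own policy would supply the real numbers.

```python
from datetime import date, timedelta

# Hypothetical policy: days allowed between patch release and installation.
PATCH_DEADLINE_DAYS = {"critical": 7, "important": 30, "moderate": 90}


def due_date(released, severity):
    """Latest date by which a patch of this severity should be applied."""
    return released + timedelta(days=PATCH_DEADLINE_DAYS[severity])


def is_overdue(released, severity, today):
    """True when an unapplied patch has passed its policy deadline."""
    return today > due_date(released, severity)


if __name__ == "__main__":
    released = date(2024, 5, 1)
    print(due_date(released, "critical"))
    print(is_overdue(released, "critical", today=date(2024, 5, 10)))
```
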


Importance

 Protects against malware and cyber attacks

 Improves system stability

 Keeps software up to date

Best Practices

 Schedule regular updates

 Backup system before patching

 Test patches before deployment

 Keep records of updates

