Course Outline
Introduction to AIOps
- Defining AIOps and its significance.
- Contrasting traditional monitoring with AIOps-driven observability.
- Exploring AIOps architecture and key components.
Collecting and Normalizing Operational Data
- Types of observability data: metrics, logs, and traces.
- Ingesting data from diverse sources, including servers, containers, and cloud environments.
- Utilizing agents and exporters such as Prometheus, Beats, and Fluentd.
Data Correlation and Anomaly Detection
- Employing time series correlation and statistical methods.
- Applying ML models for anomaly detection.
- Detecting incidents across distributed systems.
Alerting and Noise Reduction
- Designing intelligent alert rules and thresholds.
- Implementing suppression, deduplication, and alert grouping.
- Integrating with platforms like Alertmanager, Slack, PagerDuty, or Opsgenie.
Root Cause Analysis and Visualization
- Using dashboards to visualize metrics and identify trends.
- Exploring events and timelines for Root Cause Analysis (RCA).
- Tracing issues across layers using distributed tracing tools.
Automation and Remediation
- Triggering automated scripts or workflows in response to incidents.
- Integrating with ITSM systems such as ServiceNow and Jira.
- Examining use cases: self-healing, scaling, and traffic rerouting.
Open Source and Commercial AIOps Platforms
- Surveying tools including Prometheus, Grafana, ELK, Moogsoft, and Dynatrace.
- Establishing evaluation criteria for selecting an AIOps platform.
- Running a demo and hands-on session with a selected stack.
Summary and Next Steps
Requirements
- A solid understanding of IT operations and system monitoring concepts.
- Prior experience with monitoring tools or dashboards.
- Familiarity with fundamental log and metric formats.
Audience
- Operations teams responsible for infrastructure and applications.
- Site Reliability Engineers (SREs).
- IT monitoring and observability teams.
Duration: 14 hours