Build an Anomaly Detection System for Real-Time Data Monitoring
Design a complete anomaly detection system with algorithm selection, threshold tuning, and false positive reduction.
The Prompt
You are a machine learning engineer specializing in anomaly detection and data quality monitoring. I need you to design and implement an anomaly detection system for the following use case.
**System Context:**
- **Data Source:** [DATA_SOURCE, e.g., server metrics, financial transactions, IoT sensor readings, application logs]
- **Key Metrics to Monitor:** [METRICS_LIST, e.g., CPU utilization, transaction amount, request latency, error rate]
- **Data Volume:** Approximately [VOLUME, e.g., 10,000 events per minute]
- **Latency Requirement:** Anomalies must be detected within [LATENCY, e.g., 5 minutes, real-time, 1 hour]
- **Historical Data Available:** [HISTORY, e.g., 6 months of labeled data, 2 years unlabeled]
- **Labeled Anomaly Examples:** [LABELS, e.g., 'yes: 500 labeled incidents', 'no: fully unsupervised']
- **Current Pain Point:** [PAIN_POINT, e.g., 'too many false alerts', 'missed critical incidents', 'no monitoring exists']
Please design the complete system:
1. **Anomaly Taxonomy:** Classify the types of anomalies relevant to [DATA_SOURCE]: point anomalies, contextual anomalies, and collective anomalies. Provide concrete examples of each for this domain.
2. **Algorithm Selection:** Recommend and compare at least 3 suitable algorithms based on the labeling situation:
- **Statistical:** Z-score, Grubbs' test, seasonal hybrid ESD (S-H-ESD)
- **Machine Learning:** Isolation Forest, One-Class SVM, Local Outlier Factor
- **Deep Learning:** Autoencoders, LSTM-based sequence anomaly detection
For each, explain computational complexity, interpretability, and suitability for [LATENCY] requirements.
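As a concrete reference point for the comparison above, here is a minimal unsupervised sketch using scikit-learn's Isolation Forest on synthetic two-dimensional data. The dataset, the injected outliers, and the `contamination` value are illustrative assumptions, not part of the template:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 200 "normal" points around the origin plus 3 injected extreme outliers.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -10.0]])
X = np.vstack([normal, outliers])

# contamination is the expected anomaly fraction; set it from your base rate.
model = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
labels = model.fit_predict(X)     # -1 = anomaly, 1 = normal
scores = -model.score_samples(X)  # negated so that higher = more anomalous

n_flagged = int((labels == -1).sum())
```

Isolation Forest scales roughly linearly with sample count, which is one reason it tends to suit moderate-latency streaming setups better than One-Class SVM.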
3. **Feature Engineering:** Define the features to extract from raw [DATA_SOURCE] data, including rolling statistics (mean, std, percentiles over multiple windows), rate-of-change features, time-based features, and cross-metric correlation features.
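The rolling-statistics, rate-of-change, and time-based features described above can be sketched with pandas. The metric series, window sizes, and column names here are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical single-metric series at 1-minute resolution.
rng = np.random.default_rng(1)
ts = pd.Series(
    np.sin(np.linspace(0, 20, 120)) + rng.normal(0, 0.1, 120),
    index=pd.date_range("2024-01-01", periods=120, freq="min"),
)

features = pd.DataFrame({
    "value": ts,
    "roll_mean_15m": ts.rolling("15min").mean(),
    "roll_std_15m": ts.rolling("15min").std(),
    "roll_p95_60m": ts.rolling("60min").quantile(0.95),
    "rate_of_change": ts.diff(),   # first-difference feature
    "hour_of_day": ts.index.hour,  # time-based feature
})
```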
4. **Threshold Tuning Strategy:** Describe how to set and dynamically adjust anomaly thresholds to balance precision vs. recall. Include a method for handling concept drift and evolving baselines.
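One way to realize the drift-aware thresholding above is a trailing rolling quantile, so the cutoff tracks an evolving baseline. The window length, quantile, and synthetic latency series below are illustrative choices, not recommendations:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Synthetic latency series whose baseline drifts upward over time.
baseline = np.linspace(100.0, 150.0, 500)
latency = pd.Series(baseline + rng.normal(0, 5, 500))
latency.iloc[400] = 400.0  # injected spike

# Threshold = 99th percentile of a trailing 120-sample window, shifted by
# one step so the current point never influences its own threshold.
# min_periods suppresses alerting until enough history has accumulated.
threshold = latency.rolling(window=120, min_periods=60).quantile(0.99).shift(1)
alerts = latency > threshold  # NaN threshold compares False: no early alerts
```

Because the window slides, a slow upward drift raises the threshold along with the baseline, while an abrupt spike still clears it.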
5. **Alert Severity Classification:** Design a 3-tier severity system (critical, warning, info) with specific criteria for each tier and recommended response actions.
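A 3-tier mapping of the kind requested above might look like the following sketch; the score and duration cutoffs are placeholders to be tuned per metric and team:

```python
def classify_severity(score: float, duration_min: float) -> str:
    """Map an anomaly score in [0, 1] and its duration to a severity tier.

    The cutoffs below are illustrative placeholders, not recommended values.
    """
    if score >= 0.9 and duration_min >= 5:
        return "critical"  # page on-call immediately
    if score >= 0.7:
        return "warning"   # ticket / next-business-day review
    return "info"          # log only; feeds baseline statistics
```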
6. **False Positive Reduction:** Propose at least 3 techniques to minimize false positives, such as correlation with other signals, minimum duration filters, suppression windows, and human feedback loops.
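The minimum duration filter mentioned above can be sketched as a small helper that suppresses anomaly runs shorter than a configurable length (`min_duration_filter` is a hypothetical name, not a library function):

```python
import numpy as np

def min_duration_filter(flags, min_len=3):
    """Keep only anomaly runs of at least min_len consecutive samples."""
    flags = np.asarray(flags, dtype=bool)
    out = np.zeros_like(flags)
    run_start = None
    for i, f in enumerate(flags):
        if f and run_start is None:
            run_start = i                      # a run begins
        elif not f and run_start is not None:
            if i - run_start >= min_len:
                out[run_start:i] = True        # run long enough: keep it
            run_start = None
    if run_start is not None and len(flags) - run_start >= min_len:
        out[run_start:] = True                 # run reaching the end
    return out

raw = [0, 1, 0, 1, 1, 1, 1, 0, 1, 1]
filtered = min_duration_filter(raw, min_len=3)
```

Here the isolated single-sample blip and the two-sample run at the end are suppressed; only the four-sample run survives.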
7. **Implementation Code:** Provide a Python implementation using [FRAMEWORK, e.g., scikit-learn, PyOD, PyCaret] that processes a sample dataset, trains the detector, and flags anomalies with confidence scores.
8. **Evaluation Framework:** Define how to measure detector performance using precision, recall, F1-score, and time-to-detect. Include a method for backtesting against historical incidents.
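The precision/recall/F1 part of the evaluation above can be computed directly with scikit-learn; the ground-truth and predicted labels below are a made-up backtest sample, and the time-to-detect calculation assumes per-window labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical backtest: per-window ground-truth incidents vs detector output.
y_true = [0, 0, 1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)

# Time-to-detect for the first incident: windows from incident start
# until the first alert at or after that start.
incident_start = y_true.index(1)
first_detect = next(i for i, p in enumerate(y_pred) if p == 1 and i >= incident_start)
time_to_detect = first_detect - incident_start
```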
9. **Operational Runbook:** Create a brief runbook for the on-call team: what to check when an alert fires, escalation paths, and how to provide feedback to improve the model.
Structure the output with clear sections and include code with inline comments.
Tips for Better Results
Clearly state whether you have labeled anomaly examples; this determines whether supervised, semi-supervised, or unsupervised methods are appropriate.
Describe your current false positive rate and tolerance level, as this is often the biggest practical challenge in anomaly detection systems.
Include information about expected seasonal patterns and known scheduled events (maintenance windows, batch jobs) to help design suppression rules.
Use Cases
MLOps engineers, SREs, data platform teams, and fraud analysts should use this when building or improving automated monitoring and alerting systems for critical business data.