Create an Outlier Detection and Handling Framework for Your Data

Build a complete outlier detection and handling framework using statistical, visual, and ML methods with Python code.

๐Ÿ“ The Prompt

You are a statistical analysis expert. I have a [DATASET_TYPE] dataset with [NUMBER_OF_ROWS] records containing the following numerical features: [LIST_OF_NUMERICAL_COLUMNS]. The data will be used for [ANALYSIS_PURPOSE] in the [INDUSTRY] domain. Design a thorough outlier detection and handling framework:

**Phase 1: Detection**

Apply and compare at least 4 outlier detection methods to my data:
- Statistical methods (Z-score, Modified Z-score, IQR)
- Visualization-based (box plots, scatter plots, distribution plots)
- Machine learning-based (Isolation Forest, DBSCAN, Local Outlier Factor)

For each method, provide:
- Python implementation code
- How to interpret the results
- Strengths and limitations for [DATASET_TYPE] data

**Phase 2: Classification**

Help me distinguish between:
- True anomalies (errors or fraud)
- Natural extreme values (legitimate but rare)
- Domain-specific acceptable outliers in [INDUSTRY]

Provide a decision tree or flowchart logic for classifying detected outliers.

**Phase 3: Handling**

For each outlier category, recommend an appropriate action:
- Removal, capping/winsorization, transformation (log, Box-Cox), imputation, or separate modeling
- Justify each recommendation and explain its impact on [ANALYSIS_PURPOSE]

**Phase 4: Documentation**

Create a summary report template that logs: total outliers found per method, classification decisions, actions taken, and before/after distribution statistics.

Provide all code in Python using pandas, scipy, and scikit-learn.
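To give a feel for what Phase 1 output might look like, here is a minimal sketch comparing three of the detection methods named above (Z-score, IQR, Isolation Forest) on synthetic data. The `amount` column and the injected extreme values are illustrative assumptions, not part of the prompt itself:

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic data: 500 roughly normal values plus a few injected extremes.
values = np.concatenate([rng.normal(100, 10, 500), [200.0, 250.0, -50.0]])
df = pd.DataFrame({"amount": values})

# 1. Z-score: |z| > 3 flags points far from the mean (assumes near-normal data).
z = np.abs(stats.zscore(df["amount"]))
df["z_outlier"] = z > 3

# 2. IQR: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; robust to skewed distributions.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df["iqr_outlier"] = (df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)

# 3. Isolation Forest: model-based; `contamination` is the expected outlier share.
iso = IsolationForest(contamination=0.01, random_state=42)
df["iso_outlier"] = iso.fit_predict(df[["amount"]]) == -1

print(df[["z_outlier", "iqr_outlier", "iso_outlier"]].sum())
```

Each method returns a boolean column, so downstream classification (Phase 2) can work from per-method flags rather than a single verdict.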

💡 Tips for Better Results

Never blindly remove outliers โ€” always investigate whether they are errors or legitimate rare events, as this distinction depends heavily on your domain. Use multiple detection methods and look for consensus across them to reduce false positives. Document every outlier decision for reproducibility and audit purposes.
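The "look for consensus" tip can be implemented as a simple voting rule over per-method flags. A sketch, assuming you already have one boolean column per detection method (the column names and toy values here are hypothetical):

```python
import pandas as pd

# Hypothetical per-method boolean flags produced by a detection step.
flags = pd.DataFrame({
    "z_outlier":   [True,  True,  False, False],
    "iqr_outlier": [True,  False, True,  False],
    "iso_outlier": [True,  False, False, False],
})

# Consensus rule: only rows flagged by at least 2 of 3 methods get investigated.
flags["n_methods"] = flags[["z_outlier", "iqr_outlier", "iso_outlier"]].sum(axis=1)
flags["consensus_outlier"] = flags["n_methods"] >= 2
print(flags)
```

Raising the threshold trades recall for precision; a stricter rule (all methods agree) suits costly manual review, while a looser one suits automated capping.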

🎯 Use Cases

Data scientists and analysts use this when preparing data for modeling, fraud detection, or quality assurance where extreme values could skew results.

🔗 Related Prompts

📊 Data & Analytics intermediate

Write Complex SQL Queries

Generate optimized SQL queries for complex analysis with CTEs, JOINs, and performance tips.

📊 Data & Analytics intermediate

Python Data Analysis Script

Generate a complete Python data analysis pipeline with cleaning, visualization, and insights.

📊 Data & Analytics intermediate

Create a Comprehensive Data Quality Checklist for Your Dataset

Generate a tailored data quality checklist with SQL validation queries, severity levels, and a scoring framework for any dataset.

📊 Data & Analytics intermediate

Build an RFM Customer Segmentation Model for Targeted Marketing

Create a complete RFM customer segmentation model with scoring logic, code implementation, and marketing strategies.

📊 Data & Analytics advanced

Design a Robust ETL Pipeline Architecture for Your Data Platform

Design a complete ETL pipeline architecture with extraction, transformation, loading strategies, error handling, and governance.

📊 Data & Analytics advanced

Analyze and Interpret A/B Test Results with Statistical Rigor

Get a complete A/B test analysis with statistical significance, power analysis, validity checks, and a clear ship decision.