Build a Robust Web Scraper with Error Handling and Data Export
Build a complete web scraper with pagination, error handling, rate limiting, data validation, and flexible export options.
The Prompt
You are an experienced software developer specializing in web scraping and data extraction. Build a complete, well-structured web scraper with the following specifications:
**Target & Data:**
- Target website or type of website: [TARGET_WEBSITE_OR_TYPE, e.g., e-commerce product listings, job board, news aggregator]
- Data fields to extract: [DATA_FIELDS, e.g., title, price, URL, date, description, rating]
- Programming language: [LANGUAGE, e.g., Python, Node.js, Go]
- Output format: [OUTPUT_FORMAT, e.g., CSV, JSON, SQLite database]
**Functional Requirements:**
1. **Page Navigation**: Handle pagination or infinite scroll to scrape across multiple pages (up to [MAX_PAGES] pages).
2. **Data Extraction**: Parse and extract the specified data fields cleanly. Handle missing or malformed fields gracefully with default values or null markers.
3. **Rate Limiting & Politeness**: Implement configurable delays between requests (default [DELAY_SECONDS] seconds). Respect robots.txt guidelines. Rotate User-Agent strings from a predefined list.
4. **Error Handling & Retries**: Implement retry logic with exponential backoff for failed requests (max [MAX_RETRIES] retries). Log all errors with timestamps and URLs.
5. **Data Validation & Cleaning**: Strip extra whitespace, normalize encoding, and validate data types before storing.
6. **Export**: Save results to the specified output format with proper encoding and structure.
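To make requirements 3 and 4 concrete, here is a minimal Python sketch of retry logic with exponential backoff. The function names (`backoff_delays`, `fetch_with_retry`) and the callable-`fetch` design are illustrative assumptions, not part of any particular library; a real scraper would wrap something like `requests.get` and add logging of the URL and timestamp on each failure.

```python
import time
import random


def backoff_delays(max_retries, base_delay=1.0, jitter=0.0):
    """Exponential backoff schedule: base_delay * 2**attempt per retry.

    `jitter` adds up to that many seconds of random noise so many
    clients do not retry in lockstep (0 keeps it deterministic).
    """
    return [base_delay * (2 ** attempt) + random.uniform(0, jitter)
            for attempt in range(max_retries)]


def fetch_with_retry(fetch, url, max_retries=3, base_delay=1.0):
    """Call fetch(url) until it succeeds, sleeping between attempts.

    `fetch` is any callable that raises on failure -- e.g. a thin
    wrapper around requests.get that calls raise_for_status().
    """
    last_error = None
    # One initial attempt plus max_retries retries; None marks "no more sleeps".
    for delay in backoff_delays(max_retries, base_delay) + [None]:
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc  # a real scraper would log exc, url, timestamp here
            if delay is None:
                break
            time.sleep(delay)
    raise last_error
```

Keeping the schedule in a separate pure function makes the backoff easy to unit-test without touching the network.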
**Code Quality Requirements:**
- Use clear project structure with separation of concerns (config, scraper logic, data models, export).
- Include comprehensive docstrings and inline comments.
- Add a configuration file or CLI arguments for customizable parameters (URL, delay, max pages, output path).
- Include a requirements/dependencies file.
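One way the CLI requirement above might look in Python, using the standard-library `argparse` module. The flag names (`--delay`, `--max-pages`, `--output`) and defaults are illustrative placeholders for the bracketed parameters in the prompt:

```python
import argparse


def build_parser():
    """CLI covering the customizable parameters the prompt asks for."""
    parser = argparse.ArgumentParser(description="Configurable web scraper")
    parser.add_argument("url", help="start URL to scrape")
    parser.add_argument("--delay", type=float, default=2.0,
                        help="seconds to wait between requests")
    parser.add_argument("--max-pages", type=int, default=10,
                        help="maximum number of pages to visit")
    parser.add_argument("--output", default="results.csv",
                        help="path of the exported data file")
    return parser


# Example invocation with explicit arguments instead of sys.argv:
args = build_parser().parse_args(["https://example.com", "--max-pages", "5"])
```

A config file (e.g. TOML or YAML) can layer under these flags, with CLI arguments taking precedence.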
Provide the complete source code, a sample output showing expected data structure, and brief usage instructions.
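As a rough sketch of the expected data-cleaning and export stage (requirements 5 and 6), the snippet below strips whitespace, marks empty fields as `None`, and writes UTF-8-friendly CSV via the standard-library `csv` module. The record shape and field names are hypothetical examples:

```python
import csv
import io


def clean_record(raw):
    """Strip whitespace and mark missing/blank fields with None."""
    return {k: (v.strip() if isinstance(v, str) and v.strip() else None)
            for k, v in raw.items()}


def export_csv(records, fieldnames, fh):
    """Write cleaned records to CSV with a header row.

    `restval` fills any field absent from a record with an empty cell.
    """
    writer = csv.DictWriter(fh, fieldnames=fieldnames, restval="")
    writer.writeheader()
    for rec in records:
        writer.writerow(clean_record(rec))


# Demo with an in-memory buffer; a real run would open(path, "w", encoding="utf-8").
buf = io.StringIO()
export_csv([{"title": "  Widget ", "price": "9.99"}], ["title", "price"], buf)
```

Swapping `export_csv` for a JSON or SQLite writer only requires changing this one module if export is kept behind a small interface, which is the separation-of-concerns point above.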
Tips for Better Results
Always check the target website's Terms of Service and robots.txt before scraping to ensure compliance.
Provide a real example URL or a detailed description of the HTML structure to help the AI generate more accurate selectors.
Test the scraper on a small number of pages first and inspect the output before running a full scrape.
Use Cases
Data analysts, researchers, and developers who need to collect structured data from websites for analysis, monitoring, or integration into other systems.