- Remove platform-specific configurations from base_config.py
- Create separate config files for each platform in their respective directories
- Update import statements in core files to use new platform-specific config modules
- Clean up unused and deprecated configuration options
- BilibiliClient:
  - Improve wbi_img_urls handling for better compatibility
  - Add error handling for missing or invalid 'is_end' and 'next' in comment cursor
- BilibiliCrawler:
  - Fix daily limit logic for keyword-based searches
  - Improve logging and break conditions for max notes count limits
  - Ensure proper tracking of total notes crawled for each keyword
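
The cursor handling described under BilibiliClient could look roughly like the sketch below. The function name `parse_comment_cursor`, the assumption that the fields sit under a `cursor` key, and the policy of stopping pagination on bad data are all illustrative and not taken from the actual client code.

```python
from typing import Any, Dict, Optional, Tuple


def parse_comment_cursor(payload: Dict[str, Any]) -> Tuple[bool, Optional[int]]:
    """Return (is_end, next_page) from a comment response, tolerating bad data.

    Hypothetical helper: the payload layout is an assumption for illustration.
    """
    cursor = payload.get("cursor")
    if not isinstance(cursor, dict):
        # No usable cursor: report the last page so the caller stops paging.
        return True, None

    is_end = cursor.get("is_end")
    next_page = cursor.get("next")

    # bool is a subclass of int, so reject it explicitly for 'next'.
    next_is_valid = isinstance(next_page, int) and not isinstance(next_page, bool)
    if not isinstance(is_end, bool) or not next_is_valid:
        # Missing or malformed fields: stop instead of raising KeyError/TypeError.
        return True, None
    return is_end, next_page
```

Treating a malformed cursor as the final page trades completeness for safety: the crawl of that video ends early instead of crashing or looping.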
This commit improves the stability and functionality of the Bilibili crawler.
- **Add Retry Logic:** Implement a retry mechanism with exponential backoff when fetching video comments. This makes the crawler more resilient to transient network issues or API errors.
- **Improve Error Handling:** Add a `try...except` block to handle potential `JSONDecodeError` in the Bilibili client, preventing crashes when the API returns an invalid response.
- **Ensure Clean Shutdown:** Refactor `main.py` to use a `try...finally` block, guaranteeing that the crawler and database connections are properly closed on exit, error, or `KeyboardInterrupt`.
- **Update Default Config:** Adjust default configuration values to increase concurrency, enable word cloud generation by default, and refine the Bilibili search mode for more practical usage.
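
A retry mechanism with exponential backoff, as described in the first item, can be sketched as follows. This is a minimal synchronous illustration, not the crawler's actual code: `retry_with_backoff`, its parameters, and the choice of retryable exceptions are assumptions.

```python
import json
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (
        ConnectionError,
        TimeoutError,
        json.JSONDecodeError,
    ),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying on the listed errors with doubling delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == max_attempts - 1:
                # Out of attempts: let the last error propagate to the caller.
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff schedule can be checked in tests without waiting. An async crawler would use an awaitable equivalent such as `asyncio.sleep`.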
Refactors the Bilibili keyword search functionality to provide more flexible crawling strategies and corrects a flaw in how crawl limits were applied.
Previously, the `ALL_DAY` boolean flag offered a rigid choice for time-based searching and contained a logical issue where `CRAWLER_MAX_NOTES_COUNT` was incorrectly applied on a per-day basis instead of as an overall total.
This commit introduces the `BILI_SEARCH_MODE` configuration option with three distinct modes:
- `normal`: The default search behavior without time constraints.
- `all_in_time_range`: Maximizes data collection within a specified date range, replicating the original intent of `ALL_DAY=True`.
- `daily_limit_in_time_range`: A new mode that strictly enforces both the daily `MAX_NOTES_PER_DAY` and the total `CRAWLER_MAX_NOTES_COUNT` limits across the entire date range.
This change resolves the limit logic bug and gives users more precise control over the crawling process.
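
The limit logic of `daily_limit_in_time_range` can be illustrated with the sketch below. It assumes a hypothetical `search_day` callback that returns results for one day. The function and parameter names are invented for illustration and do not mirror `media_platform/bilibili/core.py`.

```python
from datetime import date, timedelta
from typing import Callable, List


def crawl_daily_limited(
    start: date,
    end: date,
    max_per_day: int,
    max_total: int,
    search_day: Callable[[date, int], List[str]],
) -> List[str]:
    """Collect items day by day, capping both the per-day and overall totals."""
    collected: List[str] = []
    day = start
    while day <= end and len(collected) < max_total:
        # Request no more than what both limits still allow for this day.
        budget = min(max_per_day, max_total - len(collected))
        collected.extend(search_day(day, budget)[:budget])
        day += timedelta(days=1)
    return collected
```

The overall count is tracked across days instead of being reset each day, which is the distinction the `ALL_DAY` logic missed.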
Changes include:
- Modified `config/base_config.py` to replace `ALL_DAY` with `BILI_SEARCH_MODE`.
- Refactored `media_platform/bilibili/core.py` to implement the new search mode logic.
- Add prominent language selection section at the top of each README
- Include flag emojis and clear language indicators (🇨🇳 中文, 🇺🇸 English, 🇪🇸 Español)
- Format as horizontal table for easy scanning and navigation
- Show current language with arrow indicator (← Current/当前/Actual)
- Use relative links that work on both GitHub and local repositories
- Improve discoverability of multilingual documentation
- Keep navigation consistent across all three language versions
- Add README_en.md: Complete English translation of project documentation
- Add README_es.md: Complete Spanish translation of project documentation
- Maintain exact same structure, formatting, and technical accuracy as original
- Preserve all markdown formatting, links, code examples, and legal disclaimers
- Keep original Chinese README.md unchanged
- Support English- and Spanish-speaking developers while maintaining the project's educational focus