The PR adds API limit overrides and static proxy support, but the review found that the default proxy provider changed to an invalid static placeholder and the new API fields accepted unbounded values. This keeps the existing proxy default intact, makes static proxy explicit via config or CLI, validates API limit ranges, and adds focused regression coverage for both paths.
Constraint: PR branch must remain contributor-branch compatible and avoid adding dependencies
Rejected: Keep static as the default provider | breaks existing --enable_ip_proxy defaults with an invalid placeholder URL
Rejected: Accept arbitrary integer limits | lets API callers request negative or excessive crawl sizes
Confidence: high
Scope-risk: narrow
Directive: Do not change proxy provider defaults when adding new providers; new providers should be opt-in and covered by provider-specific tests
Tested: uv run pytest tests/test_api_limits.py tests/test_static_proxy_provider.py
Tested: uv run pytest tests
Tested: uv run pytest test/test_utils.py
Tested: uv run python -m compileall api cmd_arg config proxy tests
Tested: git diff --cached --check
Not-tested: Live crawler run against external platforms or real proxy vendor endpoints
- Add DISABLE_SSL_VERIFY = False to base_config.py (default: verification on)
- Add tools/httpx_util.py with make_async_client() factory that reads the config
- Replace all httpx.AsyncClient() call sites across all platforms (bilibili,
weibo, zhihu, xhs, douyin, kuaishou) and crawler_util with make_async_client()
- Extends SSL fix to previously missed platforms: xhs, douyin, kuaishou
Users running behind an intercepting proxy can set DISABLE_SSL_VERIFY = True
in config/base_config.py. All other users retain certificate verification.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update Chinese comments, variable descriptions, and metadata across
multiple configuration and core files to English. This improves
codebase accessibility for international developers. Additionally,
removed the sponsorship section from README files.
Features:
- Excel export with formatted multi-sheet workbooks (Contents, Comments, Creators)
- Professional styling: blue headers, auto-width columns, borders, text wrapping
- Smart export: empty sheets automatically removed
- Support for all platforms (xhs, dy, ks, bili, wb, tieba, zhihu)
Testing:
- Added pytest framework with asyncio support
- Unit tests for Excel store functionality
- Unit tests for store factory pattern
- Shared fixtures for test data
- Test coverage for edge cases
Documentation:
- Comprehensive Excel export guide (docs/excel_export_guide.md)
- Updated README.md and README_en.md with Excel examples
- Updated config comments to include excel option
Dependencies:
- Added openpyxl>=3.1.2 for Excel support
- Added pytest>=7.4.0 and pytest-asyncio>=0.21.0 for testing
This contribution adds immediate value for users who need data analysis
capabilities and establishes a testing foundation for future development.
- Remove platform-specific configurations from base_config.py
- Create separate config files for each platform in their respective directories
- Update import statements in core files to use new platform-specific config modules
- Clean up unused and deprecated configuration options
This commit introduces several improvements to enhance the stability and functionality of the Bilibili crawler.
- **Add Retry Logic:** Implement a retry mechanism with exponential backoff when fetching video comments. This makes the crawler more resilient to transient network issues or API errors.
- **Improve Error Handling:** Add a `try...except` block to handle potential `JSONDecodeError` in the Bilibili client, preventing crashes when the API returns an invalid response.
- **Ensure Clean Shutdown:** Refactor `main.py` to use a `try...finally` block, guaranteeing that the crawler and database connections are properly closed on exit, error, or `KeyboardInterrupt`.
- **Update Default Config:** Adjust default configuration values to increase concurrency, enable word cloud generation by default, and refine the Bilibili search mode for more practical usage.
Refactors the Bilibili keyword search functionality to provide more flexible crawling strategies and corrects a flaw in how crawl limits were applied.
Previously, the `ALL_DAY` boolean flag offered a rigid choice for time-based searching and contained a logical issue where `CRAWLER_MAX_NOTES_COUNT` was incorrectly applied on a per-day basis instead of as an overall total.
This commit introduces the `BILI_SEARCH_MODE` configuration option with three distinct modes:
- `normal`: The default search behavior without time constraints.
- `all_in_time_range`: Maximizes data collection within a specified date range, replicating the original intent of `ALL_DAY=True`.
- `daily_limit_in_time_range`: A new mode that strictly enforces both the daily `MAX_NOTES_PER_DAY` and the total `CRAWLER_MAX_NOTES_COUNT` limits across the entire date range.
This change resolves the limit logic bug and gives users more precise control over the crawling process.
Changes include:
- Modified `config/base_config.py` to replace `ALL_DAY` with `BILI_SEARCH_MODE`.
- Refactored `media_platform/bilibili/core.py` to implement the new search mode logic.