Commit Graph

33 Commits

Author SHA1 Message Date
程序员阿江(Relakkes)
8e93438fe5 Keep PR 900 overrides bounded and opt-in
The PR adds API limit overrides and static proxy support, but the review found that the default proxy provider changed to an invalid static placeholder and the new API fields accepted unbounded values. This keeps the existing proxy default intact, makes static proxy explicit via config or CLI, validates API limit ranges, and adds focused regression coverage for both paths.

Constraint: PR branch must remain contributor-branch compatible and avoid adding dependencies

Rejected: Keep static as the default provider | breaks existing --enable_ip_proxy defaults with an invalid placeholder URL

Rejected: Accept arbitrary integer limits | lets API callers request negative or excessive crawl sizes

Confidence: high

Scope-risk: narrow

Directive: Do not change proxy provider defaults when adding new providers; new providers should be opt-in and covered by provider-specific tests

Tested: uv run pytest tests/test_api_limits.py tests/test_static_proxy_provider.py

Tested: uv run pytest tests

Tested: uv run pytest test/test_utils.py

Tested: uv run python -m compileall api cmd_arg config proxy tests

Tested: git diff --cached --check

Not-tested: Live crawler run against external platforms or real proxy vendor endpoints
2026-05-29 21:27:52 +08:00
钟保罗
ec432eb63e feat: 启动任务接口添加帖子/视频数量与评论数量覆盖支持 2026-05-19 20:57:07 +08:00
程序员阿江(Relakkes)
f328ee35b5 fix: restore Tieba crawling after PC page rewrite
Tieba search, detail, comments, creator, and forum-list pages now rely on the current signed PC JSON APIs instead of brittle HTML selectors. The CLI also maps Tieba detail and creator arguments into the platform-specific config so command-line runs exercise the intended mode.

Constraint: Tieba PC pages no longer expose stable HTML structures for search, creator, and forum-list extraction
Constraint: Current PC APIs require browser cookies, tbs, and the web client signing convention
Rejected: Keep expanding HTML selectors | search and creator pages returned large documents with empty parsed results after the redesign
Confidence: high
Scope-risk: moderate
Directive: Do not replace these API paths with page HTML parsing without re-verifying the current Tieba network requests
Tested: uv run pytest tests/test_tieba_client_pagination.py tests/test_cmd_arg_tieba.py tests/test_tieba_extractor.py -q
Tested: uv run python -m py_compile cmd_arg/arg.py media_platform/tieba/help.py media_platform/tieba/client.py media_platform/tieba/core.py tests/test_cmd_arg_tieba.py tests/test_tieba_client_pagination.py tests/test_tieba_extractor.py
Tested: uv run main.py --platform tieba --type search --keywords 编程兼职 --get_comment false
Tested: uv run main.py --platform tieba --type detail --specified_id 9835114923 --get_comment true --max_comments_count_singlenotes 3
Tested: uv run main.py --platform tieba --type creator --creator_id https://tieba.baidu.com/home/main?id=tb.1.6ad0cd4a.7ZcjVYWa7UpHttCld2OppA --get_comment false
Not-tested: Second-level Tieba comment API migration; this path still uses the existing /p/comment HTML parser
Not-tested: Full pytest suite has one pre-existing unrelated XHS Excel store assertion failure
2026-04-30 18:20:46 +08:00
程序员阿江(Relakkes)
0282e626c9 feat: 新增 JSONL 存储格式支持,默认存储格式改为 jsonl
JSONL(JSON Lines)每行一个 JSON 对象,采用 append 模式写入,
无需读取已有数据,大数据量下性能远优于 JSON 格式。

- 新增 AsyncFileWriter.write_to_jsonl() 核心方法
- 7 个平台新增 JsonlStoreImplement 类并注册到工厂
- 配置默认值从 json 改为 jsonl,CLI/API 枚举同步更新
- db_session.py 守卫条件加入 jsonl,避免误触 ValueError
- 词云生成支持读取 JSONL 文件,优先 jsonl 回退 json
- 原有 json 选项完全保留,向后兼容
- 更新相关文档和测试
2026-03-03 23:31:07 +08:00
ouzhuowei
212276bc30 Revert "新增日志存储逻辑"
This reverts commit 30cf16af0c.

Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-10 15:03:40 +08:00
ouzhuowei
30cf16af0c 新增日志存储逻辑
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-06 12:33:35 +08:00
ouzhuowei
90280a261a 补充代理配置的arp
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-06 09:58:37 +08:00
ouzhuowei
7484156f02 新增数据保存路径,默认不指定则保存到data文件夹下
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-03 11:24:22 +08:00
wanzirong
df39d293de 修改--max_concurrency为--max_concurrency_num,保持命名一致 2026-01-30 11:15:06 +08:00
wanzirong
79048e265e feat: 添加并发爬虫数量控制参数
- 新增 --max_concurrency 命令行参数
- 用于控制并发爬虫数量
- 默认值为 1
2026-01-30 11:15:05 +08:00
wanzirong
90f72536ba refactor: 简化命令行参数命名
- 将 --max_comments_per_post 重命名为 --max_comments_count_singlenotes,与配置项名称保持一致
- 移除 --xhs_sort_type 参数(暂不需要)
- 保持代码简洁,减少不必要的功能
2026-01-21 16:30:07 +08:00
wanzirong
f7d27ab43a feat: 添加命令行参数支持
- 添加 --max_comments_per_post 参数用于控制每个帖子爬取的评论数量
- 添加 --xhs_sort_type 参数用于控制小红书排序方式
- 修复小红书 core.py 中 CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES 的导入方式
  从直接导入改为通过 config 模块访问,使命令行参数能正确生效
2026-01-21 16:23:47 +08:00
Doiiars
70a6ca55bb feat(database): add PostgreSQL support and fix Windows subprocess encoding 2026-01-09 00:41:59 +08:00
程序员阿江(Relakkes)
157ddfb21b i18n: translate all Chinese comments, docstrings, and logger messages to English
Comprehensive translation of Chinese text to English across the entire codebase:

- api/: FastAPI server documentation and logger messages
- cache/: Cache abstraction layer comments and docstrings
- database/: Database models and MongoDB store documentation
- media_platform/: All platform crawlers (Bilibili, Douyin, Kuaishou, Tieba, Weibo, Xiaohongshu, Zhihu)
- model/: Data model documentation
- proxy/: Proxy pool and provider documentation
- store/: Data storage layer comments
- tools/: Utility functions and browser automation
- test/: Test file documentation

Preserved: Chinese disclaimer header (lines 10-18) for legal compliance

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-26 23:27:19 +08:00
程序员阿江(Relakkes)
eb66e57f60 feat(cmd): add --headless, --specified_id, --creator_id CLI options
- Add --headless option to control headless mode for Playwright and CDP
- Add --specified_id option for detail mode video/post IDs (comma-separated)
- Add --creator_id option for creator mode IDs (comma-separated)
- Auto-configure platform-specific ID lists (XHS, Bilibili, Douyin, Weibo, Kuaishou)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-18 23:59:14 +08:00
程序员阿江(Relakkes)
6e858c1a00 feat: excel store with other platform 2025-11-28 15:12:36 +08:00
程序员阿江(Relakkes)
ff8c92daad chore: add copyright to every file 2025-11-18 12:24:02 +08:00
程序员阿江(Relakkes)
84f6f650f8 fix: typer args bugfix 2025-09-26 18:07:57 +08:00
程序员阿江-Relakkes
9d6cf065e9 fix(cli): support runtime without peps604 2025-09-26 17:38:50 +08:00
程序员阿江-Relakkes
95c740dee2 refine: harden typer cli defaults 2025-09-26 17:38:44 +08:00
程序员阿江-Relakkes
f97e0c18cd feat(cli): migrate argument parsing to typer 2025-09-26 17:21:47 +08:00
persist-1
d3bebd039e refactor(database): 调整数据库模块位置、调整初始化arg名称,并更新文档
- 将 db.py 数据库模块移入‘database/‘中,并修正所有调用代码
- 将初始化参数 `--init-db` 改为 `--init_db`
- 更新相关文档说明
2025-09-08 01:14:31 +08:00
persist-1
95a3dc8ce1 chore: 删除不必要的注释 2025-09-08 00:37:57 +08:00
persist-1
be306c6f54 refactor(database): 重构数据库存储实现,使用SQLAlchemy ORM替代原始SQL操作
- 删除旧的async_db.py和async_sqlite_db.py实现
- 新增SQLAlchemy ORM模型和数据库会话管理
- 统一各平台存储实现到_store_impl.py文件
- 添加数据库初始化功能支持
- 更新.gitignore和pyproject.toml依赖配置
- 优化文件存储路径和命名规范
2025-09-06 04:10:20 +08:00
persist-1
19df1734f1 chore: 增加--help参数中文显示支持及douyin_aweme表music_download_url字段\n\n- 为命令行参数增加中文显示支持,提升用户体验\n- 在douyin_aweme表中新增music_download_url字段用于存储视频音乐下载链接\n- 更新相关数据库表结构文件(tables.sql, sqlite_tables.sql)\n- 实现音乐下载URL提取逻辑并集成到数据存储流程 2025-07-24 22:39:53 +08:00
买定不离手
1673bd5c0c feat: 增强SQLite数据库配置和命令行参数支持
- 更新 cmd_arg/arg.py 文件,添加SQLite数据库选项的命令行参数解析支持
- 更新 config/base_config.py 文件,集成SQLite数据库的基础配置项和默认设置
- 更新 config/db_config.py 文件,扩展数据库配置以支持SQLite连接和参数管理
- 更新 pyproject.toml 文件,添加SQLite相关依赖包的版本管理和项目配置
2025-07-14 03:50:54 +08:00
Relakkes
9fe3e47b0f chore: 增加代码学习声明,严格禁止非法、禁止商业、不当用途 2024-10-20 00:43:25 +08:00
Relakkes
b7e57da0d2 feat: 知乎支持(关键词、评论) 2024-09-08 00:00:04 +08:00
Relakkes
a87094f2fd feat: 百度贴吧架子 & 登录done 2024-08-05 18:51:51 +08:00
Relakkes Yang
a0e5a29af8 fix: weibo bug 2024-06-17 00:25:48 +08:00
nelzomal
985ea93caf add few arg cmds 2024-06-12 10:53:03 +08:00
Relakkes
645ec729f6 fix: command args need default value 2024-06-11 23:55:42 +08:00
nelzomal
eace7d1750 improve base config reading command line arg logic 2024-06-09 18:51:36 +08:00