Commit Graph

770 Commits

Author SHA1 Message Date
程序员阿江-Relakkes
165776886f Merge pull request #900 from zanmeipaul/main
feat: add post/video count and comment count override support to the start-task API
2026-05-29 21:34:07 +08:00
程序员阿江(Relakkes)
8e93438fe5 Keep PR 900 overrides bounded and opt-in
The PR adds API limit overrides and static proxy support, but the review found that the default proxy provider changed to an invalid static placeholder and the new API fields accepted unbounded values. This keeps the existing proxy default intact, makes static proxy explicit via config or CLI, validates API limit ranges, and adds focused regression coverage for both paths.

Constraint: PR branch must remain contributor-branch compatible and avoid adding dependencies

Rejected: Keep static as the default provider | breaks existing --enable_ip_proxy defaults with an invalid placeholder URL

Rejected: Accept arbitrary integer limits | lets API callers request negative or excessive crawl sizes

Confidence: high

Scope-risk: narrow

Directive: Do not change proxy provider defaults when adding new providers; new providers should be opt-in and covered by provider-specific tests

Tested: uv run pytest tests/test_api_limits.py tests/test_static_proxy_provider.py

Tested: uv run pytest tests

Tested: uv run pytest test/test_utils.py

Tested: uv run python -m compileall api cmd_arg config proxy tests

Tested: git diff --cached --check

Not-tested: Live crawler run against external platforms or real proxy vendor endpoints
2026-05-29 21:27:52 +08:00
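The range validation described in commit 8e93438fe5 could be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the function name `validate_limit`, the field name in the usage note, and the bounds `MIN_LIMIT`/`MAX_LIMIT` are all hypothetical, since the commit message does not state the actual limits or model definitions.

```python
from typing import Optional

# Hypothetical bounds; the real limits are not stated in the commit message.
MIN_LIMIT = 1
MAX_LIMIT = 1000


def validate_limit(value: Optional[int], field: str) -> Optional[int]:
    """Return the override if it is within bounds, or None when not supplied.

    None means "no override": the caller falls back to the configured
    default, which keeps the override opt-in.
    """
    if value is None:
        return None
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"{field} must be an integer")
    if not MIN_LIMIT <= value <= MAX_LIMIT:
        raise ValueError(f"{field} must be between {MIN_LIMIT} and {MAX_LIMIT}")
    return value
```

With this shape, a request that omits the field leaves the existing configuration untouched, while a negative or excessive value is rejected before any crawl starts.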
程序员阿江-Relakkes
10091499f1 Merge pull request #901 from Jaryan-luck/main
fix: WebUI environment check fails silently due to asyncio Windows compatibility
2026-05-29 19:26:08 +08:00
程序员阿江(Relakkes)
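d280d22cb3 docs: update documentation (English)
Remove the proxy advertisement content that is no longer displayed in the Chinese and English READMEs, keeping the description of the project's own proxy pool feature.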

Confidence: high
Scope-risk: narrow
Tested: inspected README diff and verified targeted sponsor text removal
Not-tested: rendered README preview
2026-05-25 20:20:56 +08:00
🐟Jaryán🍋
5ad5a93e00 Update main.py
sys
2026-05-21 15:37:46 +08:00
🐟Jaryán🍋
5ddd969a8e Update main.py
Fix asyncio Windows compatibility
2026-05-20 22:54:27 +08:00
钟保罗
f997befce9 feat: add static proxy mode 2026-05-20 14:34:50 +08:00
钟保罗
5a362aebeb feat: add static proxy mode 2026-05-20 12:50:24 +08:00
程序员阿江(Relakkes)
9311d21f1f fix: dy creator #895 2026-05-19 21:46:38 +08:00
钟保罗
ec432eb63e feat: add post/video count and comment count override support to the start-task API 2026-05-19 20:57:07 +08:00
程序员阿江(Relakkes)
f328ee35b5 fix: restore Tieba crawling after PC page rewrite
Tieba search, detail, comments, creator, and forum-list pages now rely on the current signed PC JSON APIs instead of brittle HTML selectors. The CLI also maps Tieba detail and creator arguments into the platform-specific config so command-line runs exercise the intended mode.

Constraint: Tieba PC pages no longer expose stable HTML structures for search, creator, and forum-list extraction
Constraint: Current PC APIs require browser cookies, tbs, and the web client signing convention
Rejected: Keep expanding HTML selectors | search and creator pages returned large documents with empty parsed results after the redesign
Confidence: high
Scope-risk: moderate
Directive: Do not replace these API paths with page HTML parsing without re-verifying the current Tieba network requests
Tested: uv run pytest tests/test_tieba_client_pagination.py tests/test_cmd_arg_tieba.py tests/test_tieba_extractor.py -q
Tested: uv run python -m py_compile cmd_arg/arg.py media_platform/tieba/help.py media_platform/tieba/client.py media_platform/tieba/core.py tests/test_cmd_arg_tieba.py tests/test_tieba_client_pagination.py tests/test_tieba_extractor.py
Tested: uv run main.py --platform tieba --type search --keywords 编程兼职 --get_comment false
Tested: uv run main.py --platform tieba --type detail --specified_id 9835114923 --get_comment true --max_comments_count_singlenotes 3
Tested: uv run main.py --platform tieba --type creator --creator_id https://tieba.baidu.com/home/main?id=tb.1.6ad0cd4a.7ZcjVYWa7UpHttCld2OppA --get_comment false
Not-tested: Second-level Tieba comment API migration; this path still uses the existing /p/comment HTML parser
Not-tested: Full pytest suite has one pre-existing unrelated XHS Excel store assertion failure
2026-04-30 18:20:46 +08:00
程序员阿江(Relakkes)
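1572b64334 docs: display LegionProxy sponsor
Add LegionProxy to the existing sponsor section of the Chinese and English READMEs, using a compressed local banner so the README does not depend on hotlinked external images.

Constraint: The sponsor image comes from a locally provided Canva banner; repository size and README display size must be kept under control
Rejected: Commit the original PNG directly | at 2.6MB it is too large for a README asset
Confidence: high
Scope-risk: narrow
Tested: checked the README reference path and the compressed image dimensions
Not-tested: no screenshot verification of the rendered README page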
2026-04-25 22:00:43 +08:00
程序员阿江(Relakkes)
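0c5f281212 fix: avoid request failures caused by overly long cross-domain cookies when reusing a browser
Connecting to an existing Chrome passes the cookies of the entire browser context into the platform client.
Apart from xhs, most platforms still read the full cookie set directly, which makes request headers too long and amplifies cross-domain pollution.
This change restricts cookie reading on every platform to the platform's own domain and adds basic regression tests.

Constraint: Must keep reusing the platform login state from the user's real browser
Rejected: Fix xhs only | other platforms would still carry overly long cookies when connecting to an existing browser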
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: When adding platforms or changing the update_cookies and create client flows, read cookies by platform domain only
Tested: uv run pytest test/test_utils.py; python3 -m compileall tools/crawler_util.py media_platform/douyin/core.py media_platform/douyin/client.py media_platform/kuaishou/core.py media_platform/kuaishou/client.py media_platform/bilibili/core.py media_platform/bilibili/client.py media_platform/zhihu/core.py media_platform/zhihu/client.py media_platform/tieba/core.py media_platform/tieba/client.py media_platform/xhs/core.py media_platform/xhs/client.py media_platform/weibo/core.py media_platform/weibo/client.py test/test_utils.py
Not-tested: End-to-end crawling on each platform over a real CDP browser connection
2026-04-21 13:49:37 +08:00
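The domain-scoped cookie reading described in commit 0c5f281212 can be illustrated with a short sketch. The helper names `filter_cookies_by_domain` and `to_cookie_header` are hypothetical, and the cookie shape (dicts with `name`, `value`, and `domain` keys) is an assumption about what the browser context returns; the repository's actual helpers in tools/crawler_util.py are not shown in this log.

```python
from typing import Dict, List


def filter_cookies_by_domain(
    cookies: List[Dict[str, str]], platform_domain: str
) -> List[Dict[str, str]]:
    """Keep only cookies whose domain is the platform domain or a subdomain of it."""
    wanted = platform_domain.lstrip(".").lower()
    kept = []
    for cookie in cookies:
        domain = cookie.get("domain", "").lstrip(".").lower()
        # Match "douyin.com" and "www.douyin.com", but not "notdouyin.com".
        if domain == wanted or domain.endswith("." + wanted):
            kept.append(cookie)
    return kept


def to_cookie_header(cookies: List[Dict[str, str]]) -> str:
    """Join cookies into a single Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)
```

Filtering before building the header keeps cookies from unrelated sites in the shared browser profile out of platform requests, which is what bounds the header length.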
程序员阿江(Relakkes)
15a20a7983 docs: add a tip-the-author section to the README 2026-04-16 14:01:41 +08:00
程序员阿江(Relakkes)
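5294b6d9b7 feat: support connecting to the user's existing Chrome browser for crawling
Add the CDP_CONNECT_EXISTING config option, enabled by default, which uses Chrome's remote debugging feature
(chrome://inspect/#remote-debugging) to connect directly to the browser the user is already running,
reusing its real cookies, extensions, and browsing history to greatly reduce the risk of platform risk-control detection.

Main changes:
- Add the _connect_existing_browser method, which connects to the existing browser directly over ws://
- Support waiting for the user to confirm the connection dialog in the browser (60-second timeout)
- Do not close the user's browser process during cleanup
- Fix Xiaohongshu signature failures caused by too many cookies in a real browser
- Update the README, the CDP usage guide, and the FAQ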
2026-04-15 10:54:29 +08:00
程序员阿江-Relakkes
e5ec29d4ff Merge pull request #867 from wanteatfruit/feature/xhs-international-support
feat: add support for the international version of Xiaohongshu (rednote.com)
2026-04-10 17:50:03 +08:00
Junwen
2a52c15fb3 feat: add support for the international version of Xiaohongshu (rednote.com) 2026-04-08 23:12:53 +08:00
程序员阿江(Relakkes)
16e8965035 fix: add xhshow dependency 2026-04-07 21:20:44 +08:00
程序员阿江(Relakkes)
699a90f830 fix: xhs creator error 2026-04-07 12:54:39 +08:00
程序员阿江-Relakkes
21b3f90c7d Add GitHub Sponsors FUNDING.yml 2026-04-03 16:07:19 +08:00
程序员阿江(Relakkes)
e8b18683a0 update docs 2026-03-24 09:52:30 +08:00
程序员阿江-Relakkes
2b049d05a3 Merge pull request #847 from w21180239/fix/ssl-verify-proxy
fix: disable SSL verification for proxy/VPN environments
2026-03-19 00:24:47 +08:00
Wei Liu
2970488f40 docs: switch the newly added comments and docs to Chinese
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 12:44:37 +13:00
Wei Liu
dd327f068e fix: extend make_async_client to proxy provider and IP pool
Migrate remaining httpx.AsyncClient call sites in proxy/ package to
use make_async_client(), completing the DISABLE_SSL_VERIFY coverage
across all outbound HTTP requests in the project.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 12:39:35 +13:00
Wei Liu
125e02a4b9 fix: make SSL verification opt-in via config, extend fix to all platforms
- Add DISABLE_SSL_VERIFY = False to base_config.py (default: verification on)
- Add tools/httpx_util.py with make_async_client() factory that reads the config
- Replace all httpx.AsyncClient() call sites across all platforms (bilibili,
  weibo, zhihu, xhs, douyin, kuaishou) and crawler_util with make_async_client()
- Extends SSL fix to previously missed platforms: xhs, douyin, kuaishou

Users running behind an intercepting proxy can set DISABLE_SSL_VERIFY = True
in config/base_config.py. All other users retain certificate verification.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 12:31:49 +13:00
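The opt-in behavior described in commit 125e02a4b9 comes down to deriving the client's `verify` setting from one config flag in one place. The sketch below shows only that derivation; `build_client_kwargs` is a hypothetical helper, and the real signature of make_async_client() in tools/httpx_util.py is not shown in this log.

```python
from typing import Any, Dict


def build_client_kwargs(disable_ssl_verify: bool, **kwargs: Any) -> Dict[str, Any]:
    """Build HTTP client keyword arguments with verification driven by config.

    Certificate verification stays on unless the flag is explicitly True.
    The flag always wins over a caller-supplied "verify" so that no call
    site can silently diverge from the project-wide setting.
    """
    merged = dict(kwargs)
    merged["verify"] = not disable_ssl_verify
    return merged


# Assumed usage, with httpx installed and the flag read from config:
#   client = httpx.AsyncClient(**build_client_kwargs(config.DISABLE_SSL_VERIFY, timeout=10))
```

Routing every call site through a single factory is what lets a later commit extend the same setting to the proxy package without touching the flag's semantics.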
Wei Liu
eb45a6367f fix: disable SSL verification for proxy/VPN environments
Add verify=False to all httpx.AsyncClient calls across bilibili,
weibo, zhihu clients and crawler_util. Fixes SSL certificate
validation errors when running behind a corporate proxy or VPN.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 12:21:27 +13:00
程序员阿江(Relakkes)
6742cd598b docs: update README.md 2026-03-17 16:49:02 +08:00
程序员阿江(Relakkes)
71168a46f6 fix: correct the OpenClaw link to openclaw.ai 2026-03-10 02:45:38 +08:00
程序员阿江(Relakkes)
6f45b570a7 docs: add AI Agent Skill support to the Pro feature list 2026-03-10 02:42:31 +08:00
程序员阿江(Relakkes)
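0282e626c9 feat: add JSONL storage format support and change the default storage format to jsonl
JSONL (JSON Lines) stores one JSON object per line and is written in append mode,
so existing data never has to be read back; with large data volumes it performs far better than the JSON format.

- Add the core AsyncFileWriter.write_to_jsonl() method
- Add a JsonlStoreImplement class for each of the 7 platforms and register it with the factory
- Change the config default from json to jsonl and update the CLI/API enums to match
- Add jsonl to the guard condition in db_session.py to avoid triggering ValueError by mistake
- Word cloud generation can read JSONL files, preferring jsonl and falling back to json
- Keep the existing json option fully intact for backward compatibility
- Update related docs and tests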
2026-03-03 23:31:07 +08:00
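The append-mode property that commit 0282e626c9 relies on can be shown with a small synchronous sketch. The function names `append_jsonl` and `read_jsonl` are hypothetical; the repository's AsyncFileWriter.write_to_jsonl() is asynchronous and its implementation is not shown in this log.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def append_jsonl(path: Path, item: Dict[str, Any]) -> None:
    """Append one record as a single JSON line without reading existing data."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        f.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield records one line at a time, skipping blank lines."""
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Each write costs the same regardless of file size, whereas appending to a single JSON array typically means loading and rewriting the whole file.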
程序员阿江-Relakkes
4331b91fe1 Merge pull request #838 from jznrhnn/main
feat(msql_model): add comments to msql table fields
2026-03-01 01:59:46 +08:00
程序员阿江-Relakkes
19d974ccfc Merge pull request #839 from ravenling/fix-zhihu-comment
fix: fix pagination in zhihu comment crawling
2026-03-01 01:53:46 +08:00
ravenling
95c3293b97 fix: fix pagination in zhihu comment crawling 2026-02-28 15:57:55 +08:00
finley
c9bf3bce7d feat(msql_model): add comments to msql table fields 2026-02-28 10:42:37 +08:00
程序员阿江-Relakkes
13b6140f22 Merge pull request #831 from ouzhuowei/fix_redis_and_proxy
Handle the cases where redisKeys is missing and where the KuaiDaili proxy has no username or password
2026-02-13 21:18:26 +08:00
ouzhuowei
279c293147 Remove unnecessary comments
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-13 09:54:10 +08:00
ouzhuowei
db47d0e6f4 Handle the cases where redisKeys is missing and where the KuaiDaili proxy has no username or password
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-13 09:42:15 +08:00
程序员阿江(Relakkes)
d614ccf247 docs: translate comments and metadata to English
Update Chinese comments, variable descriptions, and metadata across
multiple configuration and core files to English. This improves
codebase accessibility for international developers. Additionally,
removed the sponsorship section from README files.
2026-02-12 05:30:11 +08:00
程序员阿江-Relakkes
257743b016 Merge pull request #828 from ouzhuowei/add_save_data_path
Add the arp for the proxy configuration
2026-02-12 04:47:25 +08:00
程序员阿江-Relakkes
dcaa11eeb9 Merge pull request #829 from ouzhuowei/update_sub_comment_error
Handle sub-comment fetch failures that aborted the entire flow
2026-02-12 04:46:34 +08:00
ouzhuowei
e54463ac78 Handle sub-comment fetch failures that aborted the entire flow
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-10 17:53:30 +08:00
ouzhuowei
212276bc30 Revert "Add log storage logic"
This reverts commit 30cf16af0c.

Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-10 15:03:40 +08:00
ouzhuowei
30cf16af0c Add log storage logic
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-06 12:33:35 +08:00
ouzhuowei
80e9c866a0 Merge branch 'add_save_data_path' into add_log_config
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-06 12:24:57 +08:00
ouzhuowei
90280a261a Add the arp for the proxy configuration
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-06 09:58:37 +08:00
程序员阿江-Relakkes
4ad065ce9a Merge pull request #825 from ouzhuowei/add_save_data_path
Add a data save path; when none is specified, data is saved to the data folder by default
2026-02-04 18:03:22 +08:00
ouzhuowei
2a0d1fd69f Adapt the media storage file paths for each platform
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-04 09:48:39 +08:00
程序员阿江(Relakkes)
c309871485 refactor(xhs): improve login state check logic 2026-02-03 20:49:46 +08:00
程序员阿江(Relakkes)
6625663bde feat: #823 2026-02-03 20:40:15 +08:00
程序员阿江(Relakkes)
fb42ab5b60 fix: #826 2026-02-03 20:35:33 +08:00