760 Commits

Author SHA1 Message Date
程序员阿江(Relakkes)
f328ee35b5 fix: restore Tieba crawling after PC page rewrite
Tieba search, detail, comments, creator, and forum-list pages now rely on the current signed PC JSON APIs instead of brittle HTML selectors. The CLI also maps Tieba detail and creator arguments into the platform-specific config so command-line runs exercise the intended mode.

Constraint: Tieba PC pages no longer expose stable HTML structures for search, creator, and forum-list extraction
Constraint: Current PC APIs require browser cookies, tbs, and the web client signing convention
Rejected: Keep expanding HTML selectors | search and creator pages returned large documents with empty parsed results after the redesign
Confidence: high
Scope-risk: moderate
Directive: Do not replace these API paths with page HTML parsing without re-verifying the current Tieba network requests
Tested: uv run pytest tests/test_tieba_client_pagination.py tests/test_cmd_arg_tieba.py tests/test_tieba_extractor.py -q
Tested: uv run python -m py_compile cmd_arg/arg.py media_platform/tieba/help.py media_platform/tieba/client.py media_platform/tieba/core.py tests/test_cmd_arg_tieba.py tests/test_tieba_client_pagination.py tests/test_tieba_extractor.py
Tested: uv run main.py --platform tieba --type search --keywords 编程兼职 --get_comment false
Tested: uv run main.py --platform tieba --type detail --specified_id 9835114923 --get_comment true --max_comments_count_singlenotes 3
Tested: uv run main.py --platform tieba --type creator --creator_id https://tieba.baidu.com/home/main?id=tb.1.6ad0cd4a.7ZcjVYWa7UpHttCld2OppA --get_comment false
Not-tested: Second-level Tieba comment API migration; this path still uses the existing /p/comment HTML parser
Not-tested: Full pytest suite has one pre-existing unrelated XHS Excel store assertion failure
2026-04-30 18:20:46 +08:00
程序员阿江(Relakkes)
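1572b64334 docs: show LegionProxy sponsor
Add LegionProxy to the existing sponsor section of the Chinese and English READMEs, using a compressed local banner so the README does not depend on hotlinked external images.

Constraint: The sponsor image comes from a locally provided Canva banner; repository size and README display dimensions must be kept under control
Rejected: Commit the original PNG directly | at 2.6MB it is too large to serve as a README asset
Confidence: high
Scope-risk: narrow
Tested: Checked the README reference paths and the compressed image dimensions
Not-tested: No rendered-screenshot verification of the README page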
2026-04-25 22:00:43 +08:00
程序员阿江(Relakkes)
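0c5f281212 fix: avoid request failures from overlong cross-domain cookies when reusing a browser
Connecting to an existing Chrome brings the cookies of the entire browser context into the platform client.
Except for xhs, most platforms still read the full cookie set directly, which makes request headers too long and amplifies cross-domain pollution.
This change narrows cookie reading on every platform to the platform's own domain and adds basic regression tests.

Constraint: Must keep reusing the platform login state from the user's real browser
Rejected: Fix only xhs | other platforms would still carry overlong cookies when connecting to an existing browser
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: When adding platforms or changing the update_cookies and create client flows later, read cookies only by platform domain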
Tested: uv run pytest test/test_utils.py; python3 -m compileall tools/crawler_util.py media_platform/douyin/core.py media_platform/douyin/client.py media_platform/kuaishou/core.py media_platform/kuaishou/client.py media_platform/bilibili/core.py media_platform/bilibili/client.py media_platform/zhihu/core.py media_platform/zhihu/client.py media_platform/tieba/core.py media_platform/tieba/client.py media_platform/xhs/core.py media_platform/xhs/client.py media_platform/weibo/core.py media_platform/weibo/client.py test/test_utils.py
Not-tested: End-to-end crawling on each platform over a real CDP browser connection
2026-04-21 13:49:37 +08:00
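A minimal sketch of the domain scoping described in 0c5f281212, not the project's actual code. It assumes cookies arrive as dictionaries with `name`, `value`, and `domain` keys, the shape browser automation tools commonly return; `filter_cookies_by_domain`, `to_cookie_header`, and the sample cookie values are hypothetical.

```python
from typing import Dict, List


def filter_cookies_by_domain(cookies: List[Dict[str, str]], platform_domain: str) -> List[Dict[str, str]]:
    """Keep only cookies whose domain is the platform domain or one of its subdomains."""
    target = platform_domain.lstrip(".").lower()
    kept = []
    for cookie in cookies:
        domain = cookie.get("domain", "").lstrip(".").lower()
        # Require an exact match or a dot-separated suffix, so "notbaidu.com"
        # is not mistaken for a subdomain of "baidu.com".
        if domain == target or domain.endswith("." + target):
            kept.append(cookie)
    return kept


def to_cookie_header(cookies: List[Dict[str, str]]) -> str:
    """Serialize cookies into the value of a Cookie request header."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


# Illustrative cookies from a shared browser context (made-up values).
browser_cookies = [
    {"name": "session", "value": "abc", "domain": ".baidu.com"},
    {"name": "token", "value": "1", "domain": "tieba.baidu.com"},
    {"name": "other", "value": "zzz", "domain": ".example.org"},
    {"name": "lookalike", "value": "x", "domain": "notbaidu.com"},
]

kept = filter_cookies_by_domain(browser_cookies, "baidu.com")
header = to_cookie_header(kept)
```

Filtering before building the header keeps unrelated sites' cookies out of platform requests, which is what shortens the request header.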
程序员阿江(Relakkes)
15a20a7983 docs: add a "support the author" section to the README 2026-04-16 14:01:41 +08:00
程序员阿江(Relakkes)
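5294b6d9b7 feat: support crawling through the user's existing Chrome browser
Add the CDP_CONNECT_EXISTING config option, enabled by default. Through Chrome's remote debugging feature
(chrome://inspect/#remote-debugging) it connects directly to the browser the user is already running,
reusing real cookies, extensions, and browsing history, which greatly reduces the risk of platform risk-control detection.

Main changes:
- Add the _connect_existing_browser method, which connects directly to an existing browser over ws://
- Support waiting for the user to confirm the connection dialog in the browser (60-second timeout)
- Do not close the user's browser process during cleanup
- Fix Xiaohongshu signing failures caused by too many cookies in a real browser
- Update the README, CDP usage guide, and FAQ docs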
2026-04-15 10:54:29 +08:00
程序员阿江-Relakkes
e5ec29d4ff Merge pull request #867 from wanteatfruit/feature/xhs-international-support
feat: add support for the international version of Xiaohongshu (rednote.com)
2026-04-10 17:50:03 +08:00
Junwen
2a52c15fb3 feat: add support for the international version of Xiaohongshu (rednote.com) 2026-04-08 23:12:53 +08:00
程序员阿江(Relakkes)
16e8965035 fix: add xhshow dependency 2026-04-07 21:20:44 +08:00
程序员阿江(Relakkes)
699a90f830 fix: xhs creator error 2026-04-07 12:54:39 +08:00
程序员阿江-Relakkes
21b3f90c7d Add GitHub Sponsors FUNDING.yml 2026-04-03 16:07:19 +08:00
程序员阿江(Relakkes)
e8b18683a0 update docs 2026-03-24 09:52:30 +08:00
程序员阿江-Relakkes
2b049d05a3 Merge pull request #847 from w21180239/fix/ssl-verify-proxy
fix: disable SSL verification for proxy/VPN environments
2026-03-19 00:24:47 +08:00
Wei Liu
2970488f40 docs: change the newly added comments and docs to Chinese
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 12:44:37 +13:00
Wei Liu
dd327f068e fix: extend make_async_client to proxy provider and IP pool
Migrate remaining httpx.AsyncClient call sites in proxy/ package to
use make_async_client(), completing the DISABLE_SSL_VERIFY coverage
across all outbound HTTP requests in the project.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 12:39:35 +13:00
Wei Liu
125e02a4b9 fix: make SSL verification opt-in via config, extend fix to all platforms
- Add DISABLE_SSL_VERIFY = False to base_config.py (default: verification on)
- Add tools/httpx_util.py with make_async_client() factory that reads the config
- Replace all httpx.AsyncClient() call sites across all platforms (bilibili,
  weibo, zhihu, xhs, douyin, kuaishou) and crawler_util with make_async_client()
- Extends SSL fix to previously missed platforms: xhs, douyin, kuaishou

Users running behind an intercepting proxy can set DISABLE_SSL_VERIFY = True
in config/base_config.py. All other users retain certificate verification.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 12:31:49 +13:00
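A rough sketch of the opt-in pattern described in 125e02a4b9. `DISABLE_SSL_VERIFY` and `make_async_client()` are named in the commit, but their bodies are not shown there, so everything below is an assumption. To stay runnable without httpx installed, the sketch only builds keyword arguments; `build_client_kwargs` is a hypothetical helper.

```python
from typing import Any, Dict

# Mirrors the config flag named in the commit; False keeps certificate verification on.
DISABLE_SSL_VERIFY = False


def build_client_kwargs(disable_ssl_verify: bool = DISABLE_SSL_VERIFY, **overrides: Any) -> Dict[str, Any]:
    """Return keyword arguments for an HTTP client, with `verify` derived from the flag."""
    kwargs: Dict[str, Any] = {"verify": not disable_ssl_verify}
    # Caller-supplied options (timeout and so on) pass through unchanged, but
    # `verify` stays under the control of the config flag alone.
    kwargs.update({k: v for k, v in overrides.items() if k != "verify"})
    return kwargs


default_kwargs = build_client_kwargs()
proxy_kwargs = build_client_kwargs(True, timeout=10)
```

With httpx available, a factory along these lines could return `httpx.AsyncClient(**build_client_kwargs(timeout=10))`, so every call site picks up the flag from one place.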
Wei Liu
eb45a6367f fix: disable SSL verification for proxy/VPN environments
Add verify=False to all httpx.AsyncClient calls across bilibili,
weibo, zhihu clients and crawler_util. Fixes SSL certificate
validation errors when running behind a corporate proxy or VPN.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 12:21:27 +13:00
程序员阿江(Relakkes)
6742cd598b docs: update README.md 2026-03-17 16:49:02 +08:00
程序员阿江(Relakkes)
71168a46f6 fix: correct the OpenClaw link to openclaw.ai 2026-03-10 02:45:38 +08:00
程序员阿江(Relakkes)
6f45b570a7 docs: add AI Agent Skill support to the Pro feature list 2026-03-10 02:42:31 +08:00
程序员阿江(Relakkes)
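0282e626c9 feat: add JSONL storage format support and make jsonl the default storage format
JSONL (JSON Lines) stores one JSON object per line and is written in append mode,
so existing data never has to be read; with large data volumes it performs far better than the JSON format.

- Add the core AsyncFileWriter.write_to_jsonl() method
- Add a JsonlStoreImplement class for each of the 7 platforms and register it with the factory
- Change the config default from json to jsonl and update the CLI/API enums to match
- Add jsonl to the guard condition in db_session.py to avoid triggering ValueError by mistake
- Word cloud generation can read JSONL files, preferring jsonl and falling back to json
- The original json option is fully preserved for backward compatibility
- Update related docs and tests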
2026-03-03 23:31:07 +08:00
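The append-only behaviour described in 0282e626c9 can be illustrated with a small synchronous sketch. This is not the project's `AsyncFileWriter.write_to_jsonl()`; `append_jsonl`, `read_jsonl`, the file name, and the record fields are hypothetical.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def append_jsonl(path: str, item: Dict[str, Any]) -> None:
    """Append one record as a single JSON line; existing content is never read."""
    with open(path, "a", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII text (for example Chinese) readable in the file.
        f.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read the file back line by line, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


demo_path = os.path.join(tempfile.mkdtemp(), "notes.jsonl")
append_jsonl(demo_path, {"note_id": "1", "title": "first"})
append_jsonl(demo_path, {"note_id": "2", "title": "second"})
records = read_jsonl(demo_path)
```

A plain JSON array, by contrast, has to be loaded, extended, and rewritten in full for every new record, which is the cost the commit message refers to.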
程序员阿江-Relakkes
4331b91fe1 Merge pull request #838 from jznrhnn/main
feat(msql_model): add comments to msql table columns
2026-03-01 01:59:46 +08:00
程序员阿江-Relakkes
19d974ccfc Merge pull request #839 from ravenling/fix-zhihu-comment
fix: fix pagination in zhihu comment crawling
2026-03-01 01:53:46 +08:00
ravenling
95c3293b97 fix: fix pagination in zhihu comment crawling 2026-02-28 15:57:55 +08:00
finley
c9bf3bce7d feat(msql_model): add comments to msql table columns 2026-02-28 10:42:37 +08:00
程序员阿江-Relakkes
13b6140f22 Merge pull request #831 from ouzhuowei/fix_redis_and_proxy
Handle the cases where redisKeys is missing and where Kuaidaili (快代理) has no account or password
2026-02-13 21:18:26 +08:00
ouzhuowei
279c293147 Remove unnecessary comments
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-13 09:54:10 +08:00
ouzhuowei
db47d0e6f4 Handle the cases where redisKeys is missing and where Kuaidaili (快代理) has no account or password
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-13 09:42:15 +08:00
程序员阿江(Relakkes)
d614ccf247 docs: translate comments and metadata to English
Update Chinese comments, variable descriptions, and metadata across
multiple configuration and core files to English. This improves
codebase accessibility for international developers. Additionally,
removed the sponsorship section from README files.
2026-02-12 05:30:11 +08:00
程序员阿江-Relakkes
257743b016 Merge pull request #828 from ouzhuowei/add_save_data_path
Add arp to the proxy configuration
2026-02-12 04:47:25 +08:00
程序员阿江-Relakkes
dcaa11eeb9 Merge pull request #829 from ouzhuowei/update_sub_comment_error
Keep a failed sub-comment fetch from aborting the whole flow
2026-02-12 04:46:34 +08:00
ouzhuowei
e54463ac78 Keep a failed sub-comment fetch from aborting the whole flow
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-10 17:53:30 +08:00
ouzhuowei
212276bc30 Revert "Add log storage logic"
This reverts commit 30cf16af0c.

Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-10 15:03:40 +08:00
ouzhuowei
30cf16af0c Add log storage logic
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-06 12:33:35 +08:00
ouzhuowei
80e9c866a0 Merge branch 'add_save_data_path' into add_log_config
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-06 12:24:57 +08:00
ouzhuowei
90280a261a Add arp to the proxy configuration
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-06 09:58:37 +08:00
程序员阿江-Relakkes
4ad065ce9a Merge pull request #825 from ouzhuowei/add_save_data_path
Add a data save path; when unspecified, data is saved to the data folder by default
2026-02-04 18:03:22 +08:00
ouzhuowei
2a0d1fd69f Adapt each platform's media storage file paths to the new save path
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-04 09:48:39 +08:00
程序员阿江(Relakkes)
c309871485 refactor(xhs): improve login state check logic 2026-02-03 20:49:46 +08:00
程序员阿江(Relakkes)
6625663bde feat: #823 2026-02-03 20:40:15 +08:00
程序员阿江(Relakkes)
fb42ab5b60 fix: #826 2026-02-03 20:35:33 +08:00
ouzhuowei
7484156f02 Add a data save path; when unspecified, data is saved to the data folder by default
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-03 11:24:22 +08:00
程序员阿江(Relakkes)
413b5d9034 docs: fix README heading levels, sync Pro section across languages
- Fix h3→h2 for standalone sections (交流群组, 赞助商展示, 成为赞助者, 其他) in README.md
- Remove WebUI standalone heading (kept as collapsible only)
- Remove WandouHTTP sponsor from EN/ES versions
- Expand Pro section (remove <details> collapse) in EN/ES to match CN
- Add Content Deconstruction Agent to Pro feature list in EN/ES

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 00:40:27 +08:00
程序员阿江(Relakkes)
dbbc2c7439 docs: update README.md 2026-02-02 20:25:51 +08:00
程序员阿江-Relakkes
51a7d94de8 Merge pull request #821 from wanzirong/feature/max-concurrency-param
feat: add the --max_concurrency_num parameter to control the number of concurrent crawlers
2026-01-31 00:31:15 +08:00
wanzirong
df39d293de Rename --max_concurrency to --max_concurrency_num for naming consistency 2026-01-30 11:15:06 +08:00
wanzirong
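79048e265e feat: add a parameter to control the number of concurrent crawlers
- Add the --max_concurrency command-line parameter
- Use it to control the number of concurrent crawlers
- Default value is 1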
2026-01-30 11:15:05 +08:00
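One common way to honour such a limit is an `asyncio.Semaphore`, sketched below under stated assumptions. The flag name `--max_concurrency_num` (the name after the later rename) and the default of 1 come from the log; the use of `argparse`, the `crawl_all` and `crawl_one` helpers, and the sleep standing in for a request are all hypothetical and not taken from the project.

```python
import argparse
import asyncio

parser = argparse.ArgumentParser()
parser.add_argument("--max_concurrency_num", type=int, default=1)


async def crawl_all(task_ids, max_concurrency_num):
    """Run one coroutine per task id, with at most max_concurrency_num running at once."""
    semaphore = asyncio.Semaphore(max_concurrency_num)
    running = 0
    peak = 0  # highest number of tasks observed inside the semaphore together

    async def crawl_one(task_id):
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for a real network request
            running -= 1
            return task_id

    results = await asyncio.gather(*(crawl_one(t) for t in task_ids))
    return results, peak


args = parser.parse_args(["--max_concurrency_num", "2"])
results, peak = asyncio.run(crawl_all(range(5), args.max_concurrency_num))
```

`asyncio.gather` returns results in submission order, so the limit changes throughput without changing the order of the output.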
程序员阿江-Relakkes
94553fd818 Merge pull request #817 from wanzirong/dev
feat: add command-line parameters to control the number of comments crawled
2026-01-21 16:49:13 +08:00
wanzirong
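90f72536ba refactor: simplify command-line parameter naming
- Rename --max_comments_per_post to --max_comments_count_singlenotes to match the config option name
- Remove the --xhs_sort_type parameter (not needed for now)
- Keep the code lean and drop unnecessary features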
2026-01-21 16:30:07 +08:00
wanzirong
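f7d27ab43a feat: add command-line parameter support
- Add the --max_comments_per_post parameter to control how many comments are crawled per post
- Add the --xhs_sort_type parameter to control the Xiaohongshu sort order
- Fix how Xiaohongshu core.py imports CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES:
  switch from a direct import to access through the config module so that command-line parameters take effect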
2026-01-21 16:23:47 +08:00
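The import fix in f7d27ab43a rests on a general Python behaviour: `from module import NAME` copies the current binding, so a later override on the module is not seen through the copied name. A self-contained sketch follows; the in-memory `demo_config` module and the values 10 and 3 are hypothetical stand-ins for the project's config package.

```python
import sys
import types

# Stand-in for a config module, built in memory so the sketch is self-contained.
demo_config = types.ModuleType("demo_config")
demo_config.CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES = 10
sys.modules["demo_config"] = demo_config

# Direct import: binds a separate name to the value as it is right now.
from demo_config import CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES

# A command-line override applied later rebinds only the module attribute.
demo_config.CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES = 3

stale = CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES  # still the value from import time
fresh = demo_config.CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES  # sees the override
```

Reading the attribute through the module at call time is what lets an override applied after import reach the crawler.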
程序员阿江(Relakkes)
be5b786a74 docs: update docs 2026-01-19 12:23:04 +08:00