46 Commits

Author SHA1 Message Date
程序员阿江-Relakkes
51a7d94de8 Merge pull request #821 from wanzirong/feature/max-concurrency-param
feat: add --max_concurrency_num parameter to control crawler concurrency
2026-01-31 00:31:15 +08:00
wanzirong
df39d293de Rename --max_concurrency to --max_concurrency_num for naming consistency 2026-01-30 11:15:06 +08:00
wanzirong
79048e265e feat: add a parameter to control crawler concurrency
- Add the --max_concurrency command-line argument
- Controls the number of concurrent crawlers
- Defaults to 1
2026-01-30 11:15:05 +08:00
程序员阿江-Relakkes
94553fd818 Merge pull request #817 from wanzirong/dev
feat: add a command-line argument to control the number of comments crawled
2026-01-21 16:49:13 +08:00
wanzirong
90f72536ba refactor: simplify command-line argument naming
- Rename --max_comments_per_post to --max_comments_count_singlenotes to match the config option name
- Remove the --xhs_sort_type argument (not needed for now)
- Keep the code lean and avoid unnecessary features
2026-01-21 16:30:07 +08:00
wanzirong
f7d27ab43a feat: add command-line argument support
- Add --max_comments_per_post to control how many comments are crawled per post
- Add --xhs_sort_type to control the Xiaohongshu sort order
- Fix how CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES is imported in Xiaohongshu's core.py:
  access it through the config module instead of importing it directly, so the command-line argument takes effect
2026-01-21 16:23:47 +08:00
程序员阿江(Relakkes)
be5b786a74 docs: update docs 2026-01-19 12:23:04 +08:00
程序员阿江-Relakkes
04fb716a44 Merge pull request #815 from 2470370075g-ux/fix-typo
Fix typos
2026-01-18 22:24:57 +08:00
WangXX
1f89713b90 Fix typos 2026-01-18 22:22:31 +08:00
程序员阿江-Relakkes
00a9e19139 Merge pull request #809 from orbisai0security/fix-cve-2023-50447-requirements.txt
[Security] Fix CRITICAL vulnerability: CVE-2023-50447
2026-01-13 14:40:23 +08:00
orbisai0security
8a2c349d67 fix: resolve critical vulnerability CVE-2023-50447
Automatically generated security fix
2026-01-12 15:10:10 +00:00
程序员阿江(Relakkes)
4de2a325a9 feat: ks comment api upgrade to v2 2026-01-09 21:09:39 +08:00
程序员阿江-Relakkes
2517e51ed4 Merge pull request #805 from MissMyDearBear/feature-bear
fix the login status error after scanning the QR code
2026-01-09 14:18:16 +08:00
Alen Bear
e3d7fa7bed Merge branch 'NanmiCoder:main' into feature-bear 2026-01-09 14:14:37 +08:00
bear
a59b385615 fix the login status error after scanning the QR code 2026-01-09 14:11:47 +08:00
程序员阿江-Relakkes
7c240747b6 Merge pull request #807 from DoiiarX/main
feat(database): add PostgreSQL support and fix Windows subprocess encoding
2026-01-09 10:53:57 +08:00
Doiiars
70a6ca55bb feat(database): add PostgreSQL support and fix Windows subprocess encoding 2026-01-09 00:41:59 +08:00
程序员阿江(Relakkes)
57b688fea4 feat: webui support light theme 2026-01-06 11:16:48 +08:00
程序员阿江(Relakkes)
ee4539c8fa chore: stop tracking .DS_Store 2026-01-06 11:11:49 +08:00
程序员阿江(Relakkes)
c895f53e22 fix: #803 2026-01-05 22:29:34 +08:00
程序员阿江(Relakkes)
99db95c499 fix: 'utf-8' codec can't decode error 2026-01-04 10:48:15 +08:00
程序员阿江-Relakkes
483c5ec8c6 Merge pull request #802 from Cae1anSou/fix/douyin-concurrent-comments
fix: fetch Douyin comments concurrently after each page instead of waiting for all pages
2026-01-03 22:38:26 +08:00
Caelan_Windows
c56b8c4c5d fix(douyin): fetch comments concurrently after each page instead of waiting for all pages
- Moved batch_get_note_comments call inside the pagination loop
- Comments are now fetched immediately after each page of videos is processed
- This allows real-time observation of comment crawling progress
- Improves data availability by not waiting for all video data to be collected first
2026-01-03 01:47:24 +08:00
程序员阿江(Relakkes)
a47c119303 docs: update 2025-12-30 17:10:13 +08:00
程序员阿江(Relakkes)
157ddfb21b i18n: translate all Chinese comments, docstrings, and logger messages to English
Comprehensive translation of Chinese text to English across the entire codebase:

- api/: FastAPI server documentation and logger messages
- cache/: Cache abstraction layer comments and docstrings
- database/: Database models and MongoDB store documentation
- media_platform/: All platform crawlers (Bilibili, Douyin, Kuaishou, Tieba, Weibo, Xiaohongshu, Zhihu)
- model/: Data model documentation
- proxy/: Proxy pool and provider documentation
- store/: Data storage layer comments
- tools/: Utility functions and browser automation
- test/: Test file documentation

Preserved: Chinese disclaimer header (lines 10-18) for legal compliance

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-26 23:27:19 +08:00
程序员阿江(Relakkes)
1544d13dd5 docs: update README.md 2025-12-26 22:41:32 +08:00
程序员阿江(Relakkes)
55d8c7783f feat: weibo full context support 2025-12-26 19:22:24 +08:00
程序员阿江(Relakkes)
ff1b681311 fix: weibo get note image fixed 2025-12-26 00:47:20 +08:00
程序员阿江(Relakkes)
11500ef57a fix: #799 2025-12-24 11:45:07 +08:00
程序员阿江(Relakkes)
b9663c6a6d fix: #798 2025-12-22 17:44:35 +08:00
程序员阿江(Relakkes)
1a38ae12bd docs: update README.md 2025-12-19 00:23:55 +08:00
程序员阿江(Relakkes)
4ceb94f9c8 docs: add WebUI support documentation 2025-12-19 00:15:53 +08:00
程序员阿江(Relakkes)
508675a251 feat(api): add WebUI API server with built frontend
- Add FastAPI server with WebSocket support for real-time logs
- Add crawler management API endpoints (start/stop/status)
- Add data browsing API endpoints (list files, preview, download)
- Include pre-built WebUI assets for serving frontend

API endpoints:
- POST /api/crawler/start - Start crawler task
- POST /api/crawler/stop - Stop crawler task
- GET /api/crawler/status - Get crawler status
- WS /api/ws/logs - Real-time log streaming
- GET /api/data/files - List data files
- GET /api/data/stats - Get data statistics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-19 00:02:08 +08:00
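
The endpoints listed above can be exercised with a few lines of Python. The sketch below is illustrative only: it assumes the API server is running locally on the default port 8080 and that the `httpx` package is installed; the request fields follow the CrawlerStartRequest schema added in api/schemas/crawler.py.

```python
import httpx

API = "http://localhost:8080/api"

# Start a search crawl on Xiaohongshu; field names mirror CrawlerStartRequest.
payload = {
    "platform": "xhs",
    "login_type": "qrcode",
    "crawler_type": "search",
    "keywords": "coffee",
    "save_option": "json",
}
print(httpx.post(f"{API}/crawler/start", json=payload, timeout=30).json())

# Poll the status endpoint and fetch the most recent log lines.
print(httpx.get(f"{API}/crawler/status", timeout=30).json())
print(httpx.get(f"{API}/crawler/logs", params={"limit": 20}, timeout=30).json())

# Stop the run when done.
print(httpx.post(f"{API}/crawler/stop", timeout=30).json())
```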
程序员阿江(Relakkes)
eb66e57f60 feat(cmd): add --headless, --specified_id, --creator_id CLI options
- Add --headless option to control headless mode for Playwright and CDP
- Add --specified_id option for detail mode video/post IDs (comma-separated)
- Add --creator_id option for creator mode IDs (comma-separated)
- Auto-configure platform-specific ID lists (XHS, Bilibili, Douyin, Weibo, Kuaishou)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-18 23:59:14 +08:00
程序员阿江(Relakkes)
a8930555ac style: increase aside width and ad image size
- Increase the right aside width from 256px to 300px
- Increase the ad image width from 200px to 280px

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-18 13:29:36 +08:00
程序员阿江(Relakkes)
fb66ef016d docs: add vitepress-plugin-mermaid for Mermaid diagram rendering
- Add the vitepress-plugin-mermaid and mermaid dependencies
- Update the VitePress configuration to support Mermaid diagram rendering
- Add a link to the project architecture documentation in the sidebar

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-18 13:25:21 +08:00
程序员阿江(Relakkes)
26c511e35f docs: add project architecture documentation with Mermaid diagrams
Add project architecture documentation, including:
- System architecture overview diagram
- Data flow diagram
- Crawler base-class hierarchy and lifecycle diagrams
- Storage layer architecture diagram
- Proxy, login, and cache system diagrams
- Module dependency diagram

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-18 13:16:32 +08:00
程序员阿江(Relakkes)
08fcf68b98 docs: update README.md 2025-12-17 12:12:29 +08:00
程序员阿江(Relakkes)
2426095123 docs: update README.md 2025-12-17 11:04:26 +08:00
程序员阿江(Relakkes)
3c75d4f1d0 docs: update docs style 2025-12-16 14:49:14 +08:00
程序员阿江(Relakkes)
332a07ce62 docs: update docs 2025-12-16 14:41:28 +08:00
程序员阿江(Relakkes)
8a0fd49b96 refactor: extract the application runner and improve shutdown cleanup
- Add tools/app_runner.py to unify signal handling, cancellation, and cleanup-timeout logic
- Slim main.py down to the business entry point and resource-cleanup implementation
- CDPBrowserManager no longer overrides existing SIGINT/SIGTERM handlers
2025-12-15 18:06:57 +08:00
程序员阿江(Relakkes)
9ade3b3eef chore: use playwright sign xhs and update dependency 2025-12-09 14:47:48 +08:00
程序员阿江-Relakkes
2600c48359 fix: xhs sub comment sign error
fix: params and path issues
2025-12-03 11:02:52 +08:00
MEI
ff9a1624f1 fix: params and path issues 2025-12-03 10:31:32 +08:00
程序员阿江-Relakkes
630d4c1614 Merge pull request #789 from NanmiCoder/feature/test_new_rule_01
docs: update data store
2025-11-28 22:35:45 +08:00
141 changed files with 6611 additions and 2110 deletions

BIN
.DS_Store vendored (binary file, not shown)

46
.env.example Normal file

@@ -0,0 +1,46 @@
# MySQL Configuration
MYSQL_DB_PWD=123456
MYSQL_DB_USER=root
MYSQL_DB_HOST=localhost
MYSQL_DB_PORT=3306
MYSQL_DB_NAME=media_crawler
# Redis Configuration
REDIS_DB_HOST=127.0.0.1
REDIS_DB_PWD=123456
REDIS_DB_PORT=6379
REDIS_DB_NUM=0
# MongoDB Configuration
MONGODB_HOST=localhost
MONGODB_PORT=27017
MONGODB_USER=
MONGODB_PWD=
MONGODB_DB_NAME=media_crawler
# PostgreSQL Configuration
POSTGRES_DB_PWD=123456
POSTGRES_DB_USER=postgres
POSTGRES_DB_HOST=localhost
POSTGRES_DB_PORT=5432
POSTGRES_DB_NAME=media_crawler
# Proxy Configuration (Wandou HTTP)
# your_wandou_http_app_key
WANDOU_APP_KEY=
# Proxy Configuration (Kuaidaili)
# your_kuaidaili_secret_id
KDL_SECERT_ID=
# your_kuaidaili_signature
KDL_SIGNATURE=
# your_kuaidaili_username
KDL_USER_NAME=
# your_kuaidaili_password
KDL_USER_PWD=
# Proxy Configuration (Jisu HTTP)
# Get JiSu HTTP IP extraction key value
jisu_key=
# Get JiSu HTTP IP extraction encryption signature
jisu_crypto=

2
.gitignore vendored

@@ -178,4 +178,4 @@ docs/.vitepress/cache
agent_zone
debug_tools
database/*.db
database/*.db


@@ -53,6 +53,7 @@
- **No JS reverse engineering required**: uses a browser context that preserves the login state and obtains signature parameters via JS expressions
- **Key advantage**: no need to reverse complex encryption algorithms, which greatly lowers the technical barrier
## ✨ Features
| Platform | Keyword search | Crawl by post ID | Second-level comments | Creator homepage | Login state cache | IP proxy pool | Comment word cloud |
| ------ | ---------- | -------------- | -------- | -------------- | ---------- | -------- | -------------- |
@@ -66,7 +67,8 @@
### 🚀 MediaCrawlerPro Major Release!
<details>
<summary>🚀 <strong>MediaCrawlerPro Major Release! Open source is not easy; subscriptions to support it are welcome</strong></summary>
> Focus on learning the architecture of a mature project, not just crawler techniques; the code design of the Pro version is equally worth studying in depth!
@@ -90,10 +92,12 @@
Click to see the [MediaCrawlerPro project homepage](https://github.com/MediaCrawlerPro) for more details
</details>
## 🚀 Quick Start
> 💡 **Open source is not easy. If this project helps you, please give it a ⭐ Star!**
> 💡 **If this project helps you, please give it a ⭐ Star!**
## 📋 Prerequisites
@@ -146,6 +150,37 @@ uv run main.py --platform xhs --lt qrcode --type detail
uv run main.py --help
```
## WebUI Support
<details>
<summary>🖥️ <strong>WebUI visual operation interface</strong></summary>
MediaCrawler provides a web-based visual interface, so you can use the crawler without the command line.
#### Start the WebUI service
```shell
# Start the API server (default port 8080)
uv run uvicorn api.main:app --port 8080 --reload
# Or start it as a module
uv run python -m api.main
```
Once it starts, open `http://localhost:8080` to access the WebUI.
#### WebUI features
- Visually configure crawler parameters (platform, login method, crawl type, etc.)
- View crawler status and logs in real time
- Preview and export data
#### Interface preview
<img src="docs/static/images/img_8.png" alt="WebUI interface preview">
</details>
<details>
<summary>🔗 <strong>Use Python's native venv for environment management (not recommended)</strong></summary>
@@ -209,11 +244,12 @@ MediaCrawler supports multiple data storage methods, including CSV, JSON, Excel, SQLite
📖 **For detailed usage instructions, see the [Data Storage Guide](docs/data_storage_guide.md)**
[🚀 MediaCrawlerPro Major Release 🚀! More features, better architecture!](https://github.com/MediaCrawlerPro)
[🚀 MediaCrawlerPro Major Release 🚀! More features, better architecture! Open source is not easy; subscriptions to support it are welcome!](https://github.com/MediaCrawlerPro)
### 💬 Discussion Groups
- **WeChat discussion group**: [Click to join](https://nanmicoder.github.io/MediaCrawler/%E5%BE%AE%E4%BF%A1%E4%BA%A4%E6%B5%81%E7%BE%A4.html)
- **Bilibili account**: [Follow me](https://space.bilibili.com/434377496), sharing AI and crawler technology knowledge
### 💰 Sponsors
@@ -226,23 +262,21 @@ MediaCrawler supports multiple data storage methods, including CSV, JSON, Excel, SQLite
---
<p align="center">
<a href="https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad">
<img style="border-radius:20px" width="500" alt="TikHub IO_Banner zh" src="docs/static/images/tikhub_banner_zh.png">
</a>
</p>
<a href="https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad">
<img width="500" src="docs/static/images/tikhub_banner_zh.png">
<br>
TikHub.io provides 900+ highly stable data APIs, covering 14+ major domestic and international platforms including TK, DY, XHS, Y2B, Ins, X, and more. It supports multi-dimensional public data APIs for users, content, products, comments, etc., along with 40M+ cleaned, structured datasets. Register with invitation code <code>cfzyejV9</code> and top up to receive an extra $2 in credit.
</a>
[TikHub](https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad) provides more than **700 endpoints** for retrieving and analyzing data from **14+ social media platforms**, including videos, users, comments, shops, products, and trends, with all data access and analysis in one place.
---
Free credit can be earned through daily check-ins. Use my referral link: [https://user.tikhub.io/users/signup?referral_code=cfzyejV9](https://user.tikhub.io/users/signup?referral_code=cfzyejV9&utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad) or the invitation code `cfzyejV9`; register and top up to receive **$2 in free credit**.
[TikHub](https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad) offers the following services:
- 🚀 Rich social media data APIs (TikTok, Douyin, XHS, YouTube, Instagram, etc.)
- 💎 Free credit via daily check-in
- ⚡ High success rate and high concurrency support
- 🌐 Official site: [https://tikhub.io/](https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad)
- 💻 GitHub: [https://github.com/TikHubIO/](https://github.com/TikHubIO/)
<a href="https://www.thordata.com/?ls=github&lk=mediacrawler">
<img width="500" src="docs/static/images/Thordata.png">
<br>
Thordata: a reliable and cost-effective proxy service provider, offering stable, efficient, and compliant global proxy IP services for enterprises and developers. Register now for a free 1GB residential proxy trial and 2,000 serp-api calls.
</a>
<br>
<a href="https://www.thordata.com/products/residential-proxies/?ls=github&lk=mediacrawler">【Residential Proxies】</a> | <a href="https://www.thordata.com/products/web-scraper/?ls=github&lk=mediacrawler">【serp-api】</a>
### 🤝 Become a Sponsor


@@ -148,6 +148,37 @@ uv run main.py --platform xhs --lt qrcode --type detail
uv run main.py --help
```
## WebUI Support
<details>
<summary>🖥️ <strong>WebUI Visual Operation Interface</strong></summary>
MediaCrawler provides a web-based visual operation interface, allowing you to easily use crawler features without command line.
#### Start WebUI Service
```shell
# Start API server (default port 8080)
uv run uvicorn api.main:app --port 8080 --reload
# Or start using module method
uv run python -m api.main
```
After successful startup, visit `http://localhost:8080` to open the WebUI interface.
#### WebUI Features
- Visualize crawler parameter configuration (platform, login method, crawling type, etc.)
- Real-time view of crawler running status and logs
- Data preview and export
#### Interface Preview
<img src="docs/static/images/img_8.png" alt="WebUI Interface Preview">
</details>
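
A minimal sketch for checking that the service came up, assuming the default port above and that the `httpx` package is available:

```python
import httpx

BASE = "http://localhost:8080"

# /api/health returns {"status": "ok"} once the FastAPI app is serving.
print(httpx.get(f"{BASE}/api/health", timeout=10).json())

# /api/env/check runs `uv run main.py --help` in a subprocess (up to ~30s),
# so give it a generous timeout.
print(httpx.get(f"{BASE}/api/env/check", timeout=60).json())
```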
<details>
<summary>🔗 <strong>Using Python native venv environment management (Not recommended)</strong></summary>
@@ -214,45 +245,37 @@ MediaCrawler supports multiple data storage methods, including CSV, JSON, Excel,
[🚀 MediaCrawlerPro Major Release 🚀! More features, better architectural design!](https://github.com/MediaCrawlerPro)
## 🤝 Community & Support
### 💬 Discussion Groups
- **WeChat Discussion Group**: [Click to join](https://nanmicoder.github.io/MediaCrawler/%E5%BE%AE%E4%BF%A1%E4%BA%A4%E6%B5%81%E7%BE%A4.html)
- **Bilibili Account**: [Follow me](https://space.bilibili.com/434377496), sharing AI and crawler technology knowledge
### 📚 Documentation & Tutorials
- **Online Documentation**: [MediaCrawler Complete Documentation](https://nanmicoder.github.io/MediaCrawler/)
- **Crawler Tutorial**: [CrawlerTutorial Free Tutorial](https://github.com/NanmiCoder/CrawlerTutorial)
# Other common questions can be viewed in the online documentation
>
> The online documentation includes usage methods, common questions, joining project discussion groups, etc.
> [MediaCrawler Online Documentation](https://nanmicoder.github.io/MediaCrawler/)
>
# Author's Knowledge Services
> If you want to quickly get started and learn the usage of this project, source code architectural design, learn programming technology, or want to understand the source code design of MediaCrawlerPro, you can check out my paid knowledge column.
[Author's Paid Knowledge Column Introduction](https://nanmicoder.github.io/MediaCrawler/%E7%9F%A5%E8%AF%86%E4%BB%98%E8%B4%B9%E4%BB%8B%E7%BB%8D.html)
---
## ⭐ Star Trend Chart
If this project helps you, please give a ⭐ Star to support and let more people see MediaCrawler!
[![Star History Chart](https://api.star-history.com/svg?repos=NanmiCoder/MediaCrawler&type=Date)](https://star-history.com/#NanmiCoder/MediaCrawler&Date)
### 💰 Sponsor Display
<a href="https://www.swiftproxy.net/?ref=nanmi">
<img src="docs/static/images/img_5.png">
<a href="https://h.wandouip.com">
<img src="docs/static/images/img_8.jpg">
<br>
**Swiftproxy** - 90M+ global high-quality pure residential IPs, register to get free 500MB test traffic, dynamic traffic never expires!
> Exclusive discount code: **GHB5** Get 10% off instantly!
WandouHTTP - Self-operated tens of millions IP resource pool, IP purity ≥99.8%, daily high-frequency IP updates, fast response, stable connection, supports multiple business scenarios, customizable on demand, register to get 10000 free IPs.
</a>
---
<a href="https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad">
<img width="500" src="docs/static/images/tikhub_banner_zh.png">
<br>
TikHub.io provides 900+ highly stable data interfaces, covering 14+ mainstream domestic and international platforms including TK, DY, XHS, Y2B, Ins, X, etc. Supports multi-dimensional public data APIs for users, content, products, comments, etc., with 40M+ cleaned structured datasets. Use invitation code <code>cfzyejV9</code> to register and recharge, and get an additional $2 bonus.
</a>
---
<a href="https://www.thordata.com/?ls=github&lk=mediacrawler">
<img width="500" src="docs/static/images/Thordata.png">
<br>
Thordata: Reliable and cost-effective proxy service provider. Provides stable, efficient and compliant global proxy IP services for enterprises and developers. Register now to get 1GB free residential proxy trial and 2000 serp-api calls.
</a>
<br>
<a href="https://www.thordata.com/products/residential-proxies/?ls=github&lk=mediacrawler">【Residential Proxies】</a> | <a href="https://www.thordata.com/products/web-scraper/?ls=github&lk=mediacrawler">【serp-api】</a>
### 🤝 Become a Sponsor
@@ -261,10 +284,24 @@ Become a sponsor and showcase your product here, getting massive exposure daily!
**Contact Information**:
- WeChat: `relakkes`
- Email: `relakkes@gmail.com`
---
### 📚 Other
- **FAQ**: [MediaCrawler Complete Documentation](https://nanmicoder.github.io/MediaCrawler/)
- **Crawler Beginner Tutorial**: [CrawlerTutorial Free Tutorial](https://github.com/NanmiCoder/CrawlerTutorial)
- **News Crawler Open Source Project**: [NewsCrawlerCollection](https://github.com/NanmiCoder/NewsCrawlerCollection)
## ⭐ Star Trend Chart
If this project helps you, please give a ⭐ Star to support and let more people see MediaCrawler!
[![Star History Chart](https://api.star-history.com/svg?repos=NanmiCoder/MediaCrawler&type=Date)](https://star-history.com/#NanmiCoder/MediaCrawler&Date)
## 📚 References
- **Xiaohongshu Signature Repository**: [Cloxl's xhs signature repository](https://github.com/Cloxl/xhshow)
- **Xiaohongshu Client**: [ReaJason's xhs repository](https://github.com/ReaJason/xhs)
- **SMS Forwarding**: [SmsForwarder reference repository](https://github.com/pppscn/SmsForwarder)
- **Intranet Penetration Tool**: [ngrok official documentation](https://ngrok.com/docs/)


@@ -149,6 +149,37 @@ uv run main.py --platform xhs --lt qrcode --type detail
uv run main.py --help
```
## WebUI Support
<details>
<summary>🖥️ <strong>WebUI Visual Operation Interface</strong></summary>
MediaCrawler provides a web-based visual operation interface, letting you use the crawler's features easily without the command line.
#### Start the WebUI Service
```shell
# Start the API server (default port 8080)
uv run uvicorn api.main:app --port 8080 --reload
# Or start it as a module
uv run python -m api.main
```
After it starts successfully, visit `http://localhost:8080` to open the WebUI.
#### WebUI Features
- Visual configuration of crawler parameters (platform, login method, crawl type, etc.)
- Real-time view of crawler status and logs
- Data preview and export
#### Interface Preview
<img src="docs/static/images/img_8.png" alt="WebUI Interface Preview">
</details>
<details>
<summary>🔗 <strong>Using Python's native venv for environment management (not recommended)</strong></summary>
@@ -207,76 +238,46 @@ python main.py --help
## 💾 Data Storage
Supports multiple data storage methods:
- **CSV files**: supports saving to CSV (under the `data/` directory)
- **JSON files**: supports saving to JSON (under the `data/` directory)
- **Database storage**
- Use the `--init_db` parameter to initialize the database (when using `--init_db`, no other optional arguments are needed)
- **SQLite database**: lightweight, serverless, suitable for personal use (recommended)
1. Initialization: `--init_db sqlite`
2. Data storage: `--save_data_option sqlite`
- **MySQL database**: supports saving to the MySQL relational database (the database must be created in advance)
1. Initialization: `--init_db mysql`
2. Data storage: `--save_data_option db` (the db parameter is kept for compatibility with historical upgrades)
MediaCrawler supports multiple data storage methods, including CSV, JSON, Excel, SQLite, and MySQL databases.
📖 **For detailed usage instructions, see: [Data Storage Guide](docs/data_storage_guide.md)**
### Usage Examples:
```shell
# Initialize the SQLite database (when using '--init_db', no other optional arguments are needed)
uv run main.py --init_db sqlite
# Use SQLite to store data (recommended for personal users)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
```
```shell
# Initialize the MySQL database
uv run main.py --init_db mysql
# Use MySQL to store data (the db parameter is kept for compatibility with historical upgrades)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
```
---
[🚀 MediaCrawlerPro Major Release 🚀! More features, better architectural design!](https://github.com/MediaCrawlerPro)
## 🤝 Community & Support
### 💬 Discussion Groups
- **WeChat Discussion Group**: [Click to join](https://nanmicoder.github.io/MediaCrawler/%E5%BE%AE%E4%BF%A1%E4%BA%A4%E6%B5%81%E7%BE%A4.html)
- **Bilibili Account**: [Follow me](https://space.bilibili.com/434377496), sharing AI and crawler technology knowledge
### 📚 Documentation & Tutorials
- **Online Documentation**: [MediaCrawler Complete Documentation](https://nanmicoder.github.io/MediaCrawler/)
- **Crawler Tutorial**: [CrawlerTutorial Free Tutorial](https://github.com/NanmiCoder/CrawlerTutorial)
# Other common questions can be found in the online documentation
>
> The online documentation covers usage, common questions, joining the project discussion groups, etc.
> [MediaCrawler Online Documentation](https://nanmicoder.github.io/MediaCrawler/)
>
# Author's Knowledge Services
> If you want to get started quickly, learn how to use this project, study its source code architecture, learn programming techniques, or understand the source code design of MediaCrawlerPro, check out my paid knowledge column.
[Introduction to the Author's Paid Knowledge Column](https://nanmicoder.github.io/MediaCrawler/%E7%9F%A5%E8%AF%86%E4%BB%98%E8%B4%B9%E4%BB%8B%E7%BB%8D.html)
---
## ⭐ Star History Chart
If this project helps you, please give it a ⭐ Star so that more people can see MediaCrawler!
[![Star History Chart](https://api.star-history.com/svg?repos=NanmiCoder/MediaCrawler&type=Date)](https://star-history.com/#NanmiCoder/MediaCrawler&Date)
### 💰 Sponsor Showcase
<a href="https://www.swiftproxy.net/?ref=nanmi">
<img src="docs/static/images/img_5.png">
<a href="https://h.wandouip.com">
<img src="docs/static/images/img_8.jpg">
<br>
**Swiftproxy** - 90M+ global high-quality pure residential IPs; register to get 500MB of free test traffic, and dynamic traffic never expires!
> Exclusive discount code: **GHB5** for an instant 10% off!
WandouHTTP - Self-operated pool of tens of millions of IPs, IP purity ≥99.8%, high-frequency daily IP updates, fast response, stable connections, supports multiple business scenarios, customizable on demand; register to get 10,000 free IPs.
</a>
---
<a href="https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad">
<img width="500" src="docs/static/images/tikhub_banner_zh.png">
<br>
TikHub.io provides 900+ highly stable data APIs, covering 14+ major domestic and international platforms including TK, DY, XHS, Y2B, Ins, X, etc. It supports multi-dimensional public data APIs for users, content, products, comments, and more, with 40M+ cleaned, structured datasets. Use invitation code <code>cfzyejV9</code> to register and top up, and get an additional $2 bonus.
</a>
---
<a href="https://www.thordata.com/?ls=github&lk=mediacrawler">
<img width="500" src="docs/static/images/Thordata.png">
<br>
Thordata: a reliable and cost-effective proxy service provider, offering stable, efficient, and compliant global proxy IP services for enterprises and developers. Register now to get a 1GB free residential proxy trial and 2,000 serp-api calls.
</a>
<br>
<a href="https://www.thordata.com/products/residential-proxies/?ls=github&lk=mediacrawler">【Proxies Residenciales】</a> | <a href="https://www.thordata.com/products/web-scraper/?ls=github&lk=mediacrawler">【serp-api】</a>
### 🤝 Conviértase en Patrocinador
¡Conviértase en patrocinador y muestre su producto aquí, obteniendo exposición masiva diariamente!
@@ -284,10 +285,24 @@ uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
**Contact Information**:
- WeChat: `relakkes`
- Email: `relakkes@gmail.com`
---
### 📚 Other
- **FAQ**: [MediaCrawler Complete Documentation](https://nanmicoder.github.io/MediaCrawler/)
- **Crawler Beginner Tutorial**: [CrawlerTutorial Free Tutorial](https://github.com/NanmiCoder/CrawlerTutorial)
- **News Crawler Open Source Project**: [NewsCrawlerCollection](https://github.com/NanmiCoder/NewsCrawlerCollection)
## ⭐ Star History Chart
If this project helps you, please give it a ⭐ Star so that more people can see MediaCrawler!
[![Star History Chart](https://api.star-history.com/svg?repos=NanmiCoder/MediaCrawler&type=Date)](https://star-history.com/#NanmiCoder/MediaCrawler&Date)
## 📚 References
- **Xiaohongshu Signature Repository**: [Cloxl's xhs signature repository](https://github.com/Cloxl/xhshow)
- **Xiaohongshu Client**: [ReaJason's xhs repository](https://github.com/ReaJason/xhs)
- **SMS Forwarding**: [SmsForwarder reference repository](https://github.com/pppscn/SmsForwarder)
- **Intranet Penetration Tool**: [ngrok official documentation](https://ngrok.com/docs/)

19
api/__init__.py Normal file

@@ -0,0 +1,19 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/__init__.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# WebUI API Module for MediaCrawler

186
api/main.py Normal file

@@ -0,0 +1,186 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/main.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
"""
MediaCrawler WebUI API Server
Start command: uvicorn api.main:app --port 8080 --reload
Or: python -m api.main
"""
import asyncio
import os
import subprocess
import uvicorn
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles
from fastapi.responses import FileResponse
from .routers import crawler_router, data_router, websocket_router
app = FastAPI(
title="MediaCrawler WebUI API",
description="API for controlling MediaCrawler from WebUI",
version="1.0.0"
)
# Get webui static files directory
WEBUI_DIR = os.path.join(os.path.dirname(__file__), "webui")
# CORS configuration - allow frontend dev server access
app.add_middleware(
CORSMiddleware,
allow_origins=[
"http://localhost:5173", # Vite dev server
"http://localhost:3000", # Backup port
"http://127.0.0.1:5173",
"http://127.0.0.1:3000",
],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# Register routers
app.include_router(crawler_router, prefix="/api")
app.include_router(data_router, prefix="/api")
app.include_router(websocket_router, prefix="/api")
@app.get("/")
async def serve_frontend():
"""Return frontend page"""
index_path = os.path.join(WEBUI_DIR, "index.html")
if os.path.exists(index_path):
return FileResponse(index_path)
return {
"message": "MediaCrawler WebUI API",
"version": "1.0.0",
"docs": "/docs",
"note": "WebUI not found, please build it first: cd webui && npm run build"
}
@app.get("/api/health")
async def health_check():
return {"status": "ok"}
@app.get("/api/env/check")
async def check_environment():
"""Check if MediaCrawler environment is configured correctly"""
try:
# Run uv run main.py --help command to check environment
process = await asyncio.create_subprocess_exec(
"uv", "run", "main.py", "--help",
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
cwd="." # Project root directory
)
stdout, stderr = await asyncio.wait_for(
process.communicate(),
timeout=30.0 # 30 seconds timeout
)
if process.returncode == 0:
return {
"success": True,
"message": "MediaCrawler environment configured correctly",
"output": stdout.decode("utf-8", errors="ignore")[:500] # Truncate to first 500 characters
}
else:
error_msg = stderr.decode("utf-8", errors="ignore") or stdout.decode("utf-8", errors="ignore")
return {
"success": False,
"message": "Environment check failed",
"error": error_msg[:500]
}
except asyncio.TimeoutError:
return {
"success": False,
"message": "Environment check timeout",
"error": "Command execution exceeded 30 seconds"
}
except FileNotFoundError:
return {
"success": False,
"message": "uv command not found",
"error": "Please ensure uv is installed and configured in system PATH"
}
except Exception as e:
return {
"success": False,
"message": "Environment check error",
"error": str(e)
}
@app.get("/api/config/platforms")
async def get_platforms():
"""Get list of supported platforms"""
return {
"platforms": [
{"value": "xhs", "label": "Xiaohongshu", "icon": "book-open"},
{"value": "dy", "label": "Douyin", "icon": "music"},
{"value": "ks", "label": "Kuaishou", "icon": "video"},
{"value": "bili", "label": "Bilibili", "icon": "tv"},
{"value": "wb", "label": "Weibo", "icon": "message-circle"},
{"value": "tieba", "label": "Baidu Tieba", "icon": "messages-square"},
{"value": "zhihu", "label": "Zhihu", "icon": "help-circle"},
]
}
@app.get("/api/config/options")
async def get_config_options():
"""Get all configuration options"""
return {
"login_types": [
{"value": "qrcode", "label": "QR Code Login"},
{"value": "cookie", "label": "Cookie Login"},
],
"crawler_types": [
{"value": "search", "label": "Search Mode"},
{"value": "detail", "label": "Detail Mode"},
{"value": "creator", "label": "Creator Mode"},
],
"save_options": [
{"value": "json", "label": "JSON File"},
{"value": "csv", "label": "CSV File"},
{"value": "excel", "label": "Excel File"},
{"value": "sqlite", "label": "SQLite Database"},
{"value": "db", "label": "MySQL Database"},
{"value": "mongodb", "label": "MongoDB Database"},
],
}
# Mount static resources - must be placed after all routes
if os.path.exists(WEBUI_DIR):
assets_dir = os.path.join(WEBUI_DIR, "assets")
if os.path.exists(assets_dir):
app.mount("/assets", StaticFiles(directory=assets_dir), name="assets")
# Mount logos directory
logos_dir = os.path.join(WEBUI_DIR, "logos")
if os.path.exists(logos_dir):
app.mount("/logos", StaticFiles(directory=logos_dir), name="logos")
# Mount other static files (e.g., vite.svg)
app.mount("/static", StaticFiles(directory=WEBUI_DIR), name="webui-static")
if __name__ == "__main__":
uvicorn.run(app, host="0.0.0.0", port=8080)

23
api/routers/__init__.py Normal file

@@ -0,0 +1,23 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/routers/__init__.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
from .crawler import router as crawler_router
from .data import router as data_router
from .websocket import router as websocket_router
__all__ = ["crawler_router", "data_router", "websocket_router"]

63
api/routers/crawler.py Normal file

@@ -0,0 +1,63 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/routers/crawler.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
from fastapi import APIRouter, HTTPException
from ..schemas import CrawlerStartRequest, CrawlerStatusResponse
from ..services import crawler_manager
router = APIRouter(prefix="/crawler", tags=["crawler"])
@router.post("/start")
async def start_crawler(request: CrawlerStartRequest):
"""Start crawler task"""
success = await crawler_manager.start(request)
if not success:
# Handle concurrent/duplicate requests: if process is already running, return 400 instead of 500
if crawler_manager.process and crawler_manager.process.poll() is None:
raise HTTPException(status_code=400, detail="Crawler is already running")
raise HTTPException(status_code=500, detail="Failed to start crawler")
return {"status": "ok", "message": "Crawler started successfully"}
@router.post("/stop")
async def stop_crawler():
"""Stop crawler task"""
success = await crawler_manager.stop()
if not success:
# Handle concurrent/duplicate requests: if process already exited/doesn't exist, return 400 instead of 500
if not crawler_manager.process or crawler_manager.process.poll() is not None:
raise HTTPException(status_code=400, detail="No crawler is running")
raise HTTPException(status_code=500, detail="Failed to stop crawler")
return {"status": "ok", "message": "Crawler stopped successfully"}
@router.get("/status", response_model=CrawlerStatusResponse)
async def get_crawler_status():
"""Get crawler status"""
return crawler_manager.get_status()
@router.get("/logs")
async def get_logs(limit: int = 100):
"""Get recent logs"""
logs = crawler_manager.logs[-limit:] if limit > 0 else crawler_manager.logs
return {"logs": [log.model_dump() for log in logs]}

230
api/routers/data.py Normal file

@@ -0,0 +1,230 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/routers/data.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
import os
import json
from pathlib import Path
from typing import Optional
from fastapi import APIRouter, HTTPException
from fastapi.responses import FileResponse
router = APIRouter(prefix="/data", tags=["data"])
# Data directory
DATA_DIR = Path(__file__).parent.parent.parent / "data"
def get_file_info(file_path: Path) -> dict:
"""Get file information"""
stat = file_path.stat()
record_count = None
# Try to get record count
try:
if file_path.suffix == ".json":
with open(file_path, "r", encoding="utf-8") as f:
data = json.load(f)
if isinstance(data, list):
record_count = len(data)
elif file_path.suffix == ".csv":
with open(file_path, "r", encoding="utf-8") as f:
record_count = sum(1 for _ in f) - 1 # Subtract header row
except Exception:
pass
return {
"name": file_path.name,
"path": str(file_path.relative_to(DATA_DIR)),
"size": stat.st_size,
"modified_at": stat.st_mtime,
"record_count": record_count,
"type": file_path.suffix[1:] if file_path.suffix else "unknown"
}
@router.get("/files")
async def list_data_files(platform: Optional[str] = None, file_type: Optional[str] = None):
"""Get data file list"""
if not DATA_DIR.exists():
return {"files": []}
files = []
supported_extensions = {".json", ".csv", ".xlsx", ".xls"}
for root, dirs, filenames in os.walk(DATA_DIR):
root_path = Path(root)
for filename in filenames:
file_path = root_path / filename
if file_path.suffix.lower() not in supported_extensions:
continue
# Platform filter
if platform:
rel_path = str(file_path.relative_to(DATA_DIR))
if platform.lower() not in rel_path.lower():
continue
# Type filter
if file_type and file_path.suffix[1:].lower() != file_type.lower():
continue
try:
files.append(get_file_info(file_path))
except Exception:
continue
# Sort by modification time (newest first)
files.sort(key=lambda x: x["modified_at"], reverse=True)
return {"files": files}
@router.get("/files/{file_path:path}")
async def get_file_content(file_path: str, preview: bool = True, limit: int = 100):
"""Get file content or preview"""
full_path = DATA_DIR / file_path
if not full_path.exists():
raise HTTPException(status_code=404, detail="File not found")
if not full_path.is_file():
raise HTTPException(status_code=400, detail="Not a file")
# Security check: ensure within DATA_DIR
try:
full_path.resolve().relative_to(DATA_DIR.resolve())
except ValueError:
raise HTTPException(status_code=403, detail="Access denied")
if preview:
# Return preview data
try:
if full_path.suffix == ".json":
with open(full_path, "r", encoding="utf-8") as f:
data = json.load(f)
if isinstance(data, list):
return {"data": data[:limit], "total": len(data)}
return {"data": data, "total": 1}
elif full_path.suffix == ".csv":
import csv
with open(full_path, "r", encoding="utf-8") as f:
reader = csv.DictReader(f)
rows = []
for i, row in enumerate(reader):
if i >= limit:
break
rows.append(row)
# Re-read to get total count
f.seek(0)
total = sum(1 for _ in f) - 1
return {"data": rows, "total": total}
elif full_path.suffix.lower() in (".xlsx", ".xls"):
import pandas as pd
# Read first limit rows
df = pd.read_excel(full_path, nrows=limit)
# Get total row count (only read first column to save memory)
df_count = pd.read_excel(full_path, usecols=[0])
total = len(df_count)
# Convert to list of dictionaries, handle NaN values
rows = df.where(pd.notnull(df), None).to_dict(orient='records')
return {
"data": rows,
"total": total,
"columns": list(df.columns)
}
else:
raise HTTPException(status_code=400, detail="Unsupported file type for preview")
except json.JSONDecodeError:
raise HTTPException(status_code=400, detail="Invalid JSON file")
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
else:
# Return file download
return FileResponse(
path=full_path,
filename=full_path.name,
media_type="application/octet-stream"
)
@router.get("/download/{file_path:path}")
async def download_file(file_path: str):
"""Download file"""
full_path = DATA_DIR / file_path
if not full_path.exists():
raise HTTPException(status_code=404, detail="File not found")
if not full_path.is_file():
raise HTTPException(status_code=400, detail="Not a file")
# Security check
try:
full_path.resolve().relative_to(DATA_DIR.resolve())
except ValueError:
raise HTTPException(status_code=403, detail="Access denied")
return FileResponse(
path=full_path,
filename=full_path.name,
media_type="application/octet-stream"
)
@router.get("/stats")
async def get_data_stats():
"""Get data statistics"""
if not DATA_DIR.exists():
return {"total_files": 0, "total_size": 0, "by_platform": {}, "by_type": {}}
stats = {
"total_files": 0,
"total_size": 0,
"by_platform": {},
"by_type": {}
}
supported_extensions = {".json", ".csv", ".xlsx", ".xls"}
for root, dirs, filenames in os.walk(DATA_DIR):
root_path = Path(root)
for filename in filenames:
file_path = root_path / filename
if file_path.suffix.lower() not in supported_extensions:
continue
try:
stat = file_path.stat()
stats["total_files"] += 1
stats["total_size"] += stat.st_size
# Statistics by type
file_type = file_path.suffix[1:].lower()
stats["by_type"][file_type] = stats["by_type"].get(file_type, 0) + 1
# Statistics by platform (inferred from path)
rel_path = str(file_path.relative_to(DATA_DIR))
for platform in ["xhs", "dy", "ks", "bili", "wb", "tieba", "zhihu"]:
if platform in rel_path.lower():
stats["by_platform"][platform] = stats["by_platform"].get(platform, 0) + 1
break
except Exception:
continue
return stats
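
A client-side sketch of these data endpoints, assuming the API server is running locally and the data/ directory already contains crawl output (paths and values are illustrative):

```python
import httpx

API = "http://localhost:8080/api"

# List JSON files produced by the xhs crawler, newest first.
files = httpx.get(f"{API}/data/files", params={"platform": "xhs", "file_type": "json"}, timeout=30).json()["files"]
for f in files[:5]:
    print(f["path"], f["size"], f["record_count"])

# Aggregate statistics grouped by platform and file type.
stats = httpx.get(f"{API}/data/stats", timeout=30).json()
print(stats["total_files"], stats["by_platform"], stats["by_type"])

# Preview the first 10 records of the newest file.
if files:
    preview = httpx.get(f"{API}/data/files/{files[0]['path']}", params={"limit": 10}, timeout=30).json()
    print(preview["total"], len(preview["data"]))
```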

151
api/routers/websocket.py Normal file

@@ -0,0 +1,151 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/routers/websocket.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
import asyncio
from typing import Set, Optional
from fastapi import APIRouter, WebSocket, WebSocketDisconnect
from ..services import crawler_manager
router = APIRouter(tags=["websocket"])
class ConnectionManager:
"""WebSocket connection manager"""
def __init__(self):
self.active_connections: Set[WebSocket] = set()
async def connect(self, websocket: WebSocket):
await websocket.accept()
self.active_connections.add(websocket)
def disconnect(self, websocket: WebSocket):
self.active_connections.discard(websocket)
async def broadcast(self, message: dict):
"""Broadcast message to all connections"""
if not self.active_connections:
return
disconnected = []
for connection in list(self.active_connections):
try:
await connection.send_json(message)
except Exception:
disconnected.append(connection)
# Clean up disconnected connections
for conn in disconnected:
self.disconnect(conn)
manager = ConnectionManager()
async def log_broadcaster():
"""Background task: read logs from queue and broadcast"""
queue = crawler_manager.get_log_queue()
while True:
try:
# Get log entry from queue
entry = await queue.get()
# Broadcast to all WebSocket connections
await manager.broadcast(entry.model_dump())
except asyncio.CancelledError:
break
except Exception as e:
print(f"Log broadcaster error: {e}")
await asyncio.sleep(0.1)
# Global broadcast task
_broadcaster_task: Optional[asyncio.Task] = None
def start_broadcaster():
"""Start broadcast task"""
global _broadcaster_task
if _broadcaster_task is None or _broadcaster_task.done():
_broadcaster_task = asyncio.create_task(log_broadcaster())
@router.websocket("/ws/logs")
async def websocket_logs(websocket: WebSocket):
"""WebSocket log stream"""
print("[WS] New connection attempt")
try:
# Ensure broadcast task is running
start_broadcaster()
await manager.connect(websocket)
print(f"[WS] Connected, active connections: {len(manager.active_connections)}")
# Send existing logs
for log in crawler_manager.logs:
try:
await websocket.send_json(log.model_dump())
except Exception as e:
print(f"[WS] Error sending existing log: {e}")
break
print(f"[WS] Sent {len(crawler_manager.logs)} existing logs, entering main loop")
while True:
# Keep connection alive, receive heartbeat or any message
try:
data = await asyncio.wait_for(
websocket.receive_text(),
timeout=30.0
)
if data == "ping":
await websocket.send_text("pong")
except asyncio.TimeoutError:
# Send ping to keep connection alive
try:
await websocket.send_text("ping")
except Exception as e:
print(f"[WS] Error sending ping: {e}")
break
except WebSocketDisconnect:
print("[WS] Client disconnected")
except Exception as e:
print(f"[WS] Error: {type(e).__name__}: {e}")
finally:
manager.disconnect(websocket)
print(f"[WS] Cleanup done, active connections: {len(manager.active_connections)}")
@router.websocket("/ws/status")
async def websocket_status(websocket: WebSocket):
"""WebSocket status stream"""
await websocket.accept()
try:
while True:
# Send status every second
status = crawler_manager.get_status()
await websocket.send_json(status)
await asyncio.sleep(1)
except WebSocketDisconnect:
pass
except Exception:
pass
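
A minimal log consumer for the /ws/logs stream, sketched with the third-party `websockets` package (an assumption; any WebSocket client works). It mirrors the ping/pong keep-alive handling above.

```python
import asyncio
import json
import websockets

async def tail_logs():
    async with websockets.connect("ws://localhost:8080/api/ws/logs") as ws:
        async for message in ws:
            if message == "ping":       # server keep-alive after 30s of silence
                await ws.send("pong")
                continue
            if message == "pong":
                continue
            entry = json.loads(message)  # a serialized LogEntry (id, timestamp, level, message)
            print(f"[{entry['timestamp']}] {entry['level']}: {entry['message']}")

asyncio.run(tail_logs())
```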

37
api/schemas/__init__.py Normal file

@@ -0,0 +1,37 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/schemas/__init__.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
from .crawler import (
PlatformEnum,
LoginTypeEnum,
CrawlerTypeEnum,
SaveDataOptionEnum,
CrawlerStartRequest,
CrawlerStatusResponse,
LogEntry,
)
__all__ = [
"PlatformEnum",
"LoginTypeEnum",
"CrawlerTypeEnum",
"SaveDataOptionEnum",
"CrawlerStartRequest",
"CrawlerStatusResponse",
"LogEntry",
]

98
api/schemas/crawler.py Normal file

@@ -0,0 +1,98 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/schemas/crawler.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
from enum import Enum
from typing import Optional, Literal
from pydantic import BaseModel
class PlatformEnum(str, Enum):
"""Supported media platforms"""
XHS = "xhs"
DOUYIN = "dy"
KUAISHOU = "ks"
BILIBILI = "bili"
WEIBO = "wb"
TIEBA = "tieba"
ZHIHU = "zhihu"
class LoginTypeEnum(str, Enum):
"""Login method"""
QRCODE = "qrcode"
PHONE = "phone"
COOKIE = "cookie"
class CrawlerTypeEnum(str, Enum):
"""Crawler type"""
SEARCH = "search"
DETAIL = "detail"
CREATOR = "creator"
class SaveDataOptionEnum(str, Enum):
"""Data save option"""
CSV = "csv"
DB = "db"
JSON = "json"
SQLITE = "sqlite"
MONGODB = "mongodb"
EXCEL = "excel"
class CrawlerStartRequest(BaseModel):
"""Crawler start request"""
platform: PlatformEnum
login_type: LoginTypeEnum = LoginTypeEnum.QRCODE
crawler_type: CrawlerTypeEnum = CrawlerTypeEnum.SEARCH
keywords: str = "" # Keywords for search mode
specified_ids: str = "" # Post/video ID list for detail mode, comma-separated
creator_ids: str = "" # Creator ID list for creator mode, comma-separated
start_page: int = 1
enable_comments: bool = True
enable_sub_comments: bool = False
save_option: SaveDataOptionEnum = SaveDataOptionEnum.JSON
cookies: str = ""
headless: bool = False
class CrawlerStatusResponse(BaseModel):
"""Crawler status response"""
status: Literal["idle", "running", "stopping", "error"]
platform: Optional[str] = None
crawler_type: Optional[str] = None
started_at: Optional[str] = None
error_message: Optional[str] = None
class LogEntry(BaseModel):
"""Log entry"""
id: int
timestamp: str
level: Literal["info", "warning", "error", "success", "debug"]
message: str
class DataFileInfo(BaseModel):
"""Data file information"""
name: str
path: str
size: int
modified_at: str
record_count: Optional[int] = None
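
For illustration, constructing a request from these models and dumping the JSON payload the WebUI would POST to /api/crawler/start (pydantic v2 is assumed, matching the model_dump usage elsewhere in this changeset):

```python
from api.schemas import CrawlerStartRequest, CrawlerTypeEnum, PlatformEnum

req = CrawlerStartRequest(
    platform=PlatformEnum.XHS,
    crawler_type=CrawlerTypeEnum.SEARCH,
    keywords="coffee",
)
# Enum fields serialize to their string values; unspecified fields fall back to defaults.
print(req.model_dump(mode="json"))
# {'platform': 'xhs', 'login_type': 'qrcode', 'crawler_type': 'search', 'keywords': 'coffee',
#  'specified_ids': '', 'creator_ids': '', 'start_page': 1, 'enable_comments': True,
#  'enable_sub_comments': False, 'save_option': 'json', 'cookies': '', 'headless': False}
```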

21
api/services/__init__.py Normal file

@@ -0,0 +1,21 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/services/__init__.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
from .crawler_manager import CrawlerManager, crawler_manager
__all__ = ["CrawlerManager", "crawler_manager"]


@@ -0,0 +1,282 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/services/crawler_manager.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
import asyncio
import subprocess
import signal
import os
from typing import Optional, List
from datetime import datetime
from pathlib import Path
from ..schemas import CrawlerStartRequest, LogEntry
class CrawlerManager:
"""Crawler process manager"""
def __init__(self):
self._lock = asyncio.Lock()
self.process: Optional[subprocess.Popen] = None
self.status = "idle"
self.started_at: Optional[datetime] = None
self.current_config: Optional[CrawlerStartRequest] = None
self._log_id = 0
self._logs: List[LogEntry] = []
self._read_task: Optional[asyncio.Task] = None
# Project root directory
self._project_root = Path(__file__).parent.parent.parent
# Log queue - for pushing to WebSocket
self._log_queue: Optional[asyncio.Queue] = None
@property
def logs(self) -> List[LogEntry]:
return self._logs
def get_log_queue(self) -> asyncio.Queue:
"""Get or create log queue"""
if self._log_queue is None:
self._log_queue = asyncio.Queue()
return self._log_queue
def _create_log_entry(self, message: str, level: str = "info") -> LogEntry:
"""Create log entry"""
self._log_id += 1
entry = LogEntry(
id=self._log_id,
timestamp=datetime.now().strftime("%H:%M:%S"),
level=level,
message=message
)
self._logs.append(entry)
# Keep last 500 logs
if len(self._logs) > 500:
self._logs = self._logs[-500:]
return entry
async def _push_log(self, entry: LogEntry):
"""Push log to queue"""
if self._log_queue is not None:
try:
self._log_queue.put_nowait(entry)
except asyncio.QueueFull:
pass
def _parse_log_level(self, line: str) -> str:
"""Parse log level"""
line_upper = line.upper()
if "ERROR" in line_upper or "FAILED" in line_upper:
return "error"
elif "WARNING" in line_upper or "WARN" in line_upper:
return "warning"
elif "SUCCESS" in line_upper or "完成" in line or "成功" in line:
return "success"
elif "DEBUG" in line_upper:
return "debug"
return "info"
async def start(self, config: CrawlerStartRequest) -> bool:
"""Start crawler process"""
async with self._lock:
if self.process and self.process.poll() is None:
return False
# Clear old logs
self._logs = []
self._log_id = 0
# Clear pending queue (don't replace object to avoid WebSocket broadcast coroutine holding old queue reference)
if self._log_queue is None:
self._log_queue = asyncio.Queue()
else:
try:
while True:
self._log_queue.get_nowait()
except asyncio.QueueEmpty:
pass
# Build command line arguments
cmd = self._build_command(config)
# Log start information
entry = self._create_log_entry(f"Starting crawler: {' '.join(cmd)}", "info")
await self._push_log(entry)
try:
# Start subprocess
self.process = subprocess.Popen(
cmd,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
text=True,
encoding='utf-8',
bufsize=1,
cwd=str(self._project_root),
env={**os.environ, "PYTHONUNBUFFERED": "1"}
)
self.status = "running"
self.started_at = datetime.now()
self.current_config = config
entry = self._create_log_entry(
f"Crawler started on platform: {config.platform.value}, type: {config.crawler_type.value}",
"success"
)
await self._push_log(entry)
# Start log reading task
self._read_task = asyncio.create_task(self._read_output())
return True
except Exception as e:
self.status = "error"
entry = self._create_log_entry(f"Failed to start crawler: {str(e)}", "error")
await self._push_log(entry)
return False
async def stop(self) -> bool:
"""Stop crawler process"""
async with self._lock:
if not self.process or self.process.poll() is not None:
return False
self.status = "stopping"
entry = self._create_log_entry("Sending SIGTERM to crawler process...", "warning")
await self._push_log(entry)
try:
self.process.send_signal(signal.SIGTERM)
# Wait for graceful exit (up to 15 seconds)
for _ in range(30):
if self.process.poll() is not None:
break
await asyncio.sleep(0.5)
# If still not exited, force kill
if self.process.poll() is None:
entry = self._create_log_entry("Process not responding, sending SIGKILL...", "warning")
await self._push_log(entry)
self.process.kill()
entry = self._create_log_entry("Crawler process terminated", "info")
await self._push_log(entry)
except Exception as e:
entry = self._create_log_entry(f"Error stopping crawler: {str(e)}", "error")
await self._push_log(entry)
self.status = "idle"
self.current_config = None
# Cancel log reading task
if self._read_task:
self._read_task.cancel()
self._read_task = None
return True
def get_status(self) -> dict:
"""Get current status"""
return {
"status": self.status,
"platform": self.current_config.platform.value if self.current_config else None,
"crawler_type": self.current_config.crawler_type.value if self.current_config else None,
"started_at": self.started_at.isoformat() if self.started_at else None,
"error_message": None
}
def _build_command(self, config: CrawlerStartRequest) -> list:
"""Build main.py command line arguments"""
cmd = ["uv", "run", "python", "main.py"]
cmd.extend(["--platform", config.platform.value])
cmd.extend(["--lt", config.login_type.value])
cmd.extend(["--type", config.crawler_type.value])
cmd.extend(["--save_data_option", config.save_option.value])
# Pass different arguments based on crawler type
if config.crawler_type.value == "search" and config.keywords:
cmd.extend(["--keywords", config.keywords])
elif config.crawler_type.value == "detail" and config.specified_ids:
cmd.extend(["--specified_id", config.specified_ids])
elif config.crawler_type.value == "creator" and config.creator_ids:
cmd.extend(["--creator_id", config.creator_ids])
if config.start_page != 1:
cmd.extend(["--start", str(config.start_page)])
cmd.extend(["--get_comment", "true" if config.enable_comments else "false"])
cmd.extend(["--get_sub_comment", "true" if config.enable_sub_comments else "false"])
if config.cookies:
cmd.extend(["--cookies", config.cookies])
cmd.extend(["--headless", "true" if config.headless else "false"])
return cmd
async def _read_output(self):
"""Asynchronously read process output"""
loop = asyncio.get_event_loop()
try:
while self.process and self.process.poll() is None:
# Read a line in thread pool
line = await loop.run_in_executor(
None, self.process.stdout.readline
)
if line:
line = line.strip()
if line:
level = self._parse_log_level(line)
entry = self._create_log_entry(line, level)
await self._push_log(entry)
# Read remaining output
if self.process and self.process.stdout:
remaining = await loop.run_in_executor(
None, self.process.stdout.read
)
if remaining:
for line in remaining.strip().split('\n'):
if line.strip():
level = self._parse_log_level(line)
entry = self._create_log_entry(line.strip(), level)
await self._push_log(entry)
# Process ended
if self.status == "running":
exit_code = self.process.returncode if self.process else -1
if exit_code == 0:
entry = self._create_log_entry("Crawler completed successfully", "success")
else:
entry = self._create_log_entry(f"Crawler exited with code: {exit_code}", "warning")
await self._push_log(entry)
self.status = "idle"
except asyncio.CancelledError:
pass
except Exception as e:
entry = self._create_log_entry(f"Error reading output: {str(e)}", "error")
await self._push_log(entry)
# Global singleton
crawler_manager = CrawlerManager()
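The module ends by exposing a process-wide singleton. Below is a minimal sketch of how an API route might consume it; the FastAPI wiring and route paths are assumptions, only `get_status()` and `stop()` are taken verbatim from the code above.

```python
# Illustrative only: route paths and wiring are assumptions, not project source.
# Assumes: from this module import crawler_manager
from fastapi import APIRouter

router = APIRouter(prefix="/api/crawler")

@router.get("/status")
async def crawler_status() -> dict:
    # get_status() is synchronous and returns a plain dict (see above)
    return crawler_manager.get_status()

@router.post("/stop")
async def crawler_stop() -> dict:
    # stop() sends SIGTERM, escalates to SIGKILL after ~15s, and returns a bool
    return {"stopped": await crawler_manager.stop()}
```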


17
api/webui/index.html Normal file
View File

@@ -0,0 +1,17 @@
<!doctype html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8" />
<link rel="icon" type="image/svg+xml" href="/vite.svg" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>MediaCrawler - Command Center</title>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&family=JetBrains+Mono:wght@400;500&display=swap" rel="stylesheet">
<script type="module" crossorigin src="/assets/index-DvClRayq.js"></script>
<link rel="stylesheet" crossorigin href="/assets/index-OiBmsgXF.css">
</head>
<body>
<div id="root"></div>
</body>
</html>

New binary images (the diff shows size placeholders only): api/webui/logos/douyin.png (25 KiB), api/webui/logos/github.png (7.8 KiB), api/webui/logos/my_logo.png (312 KiB), and two more images shown only by size (42 KiB, 6.2 KiB).

1
api/webui/vite.svg Normal file
View File

@@ -0,0 +1 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100"><circle cx="50" cy="50" r="40" fill="#de283b"/></svg>

After

Width:  |  Height:  |  Size: 116 B

View File

@@ -53,14 +53,14 @@ class AbstractCrawler(ABC):
async def launch_browser_with_cdp(self, playwright: Playwright, playwright_proxy: Optional[Dict], user_agent: Optional[str], headless: bool = True) -> BrowserContext:
"""
使用CDP模式启动浏览器可选实现
:param playwright: playwright实例
:param playwright_proxy: playwright代理配置
:param user_agent: 用户代理
:param headless: 无头模式
:return: 浏览器上下文
Launch browser using CDP mode (optional implementation)
:param playwright: playwright instance
:param playwright_proxy: playwright proxy configuration
:param user_agent: user agent
:param headless: headless mode
:return: browser context
"""
# 默认实现:回退到标准模式
# Default implementation: fallback to standard mode
return await self.launch_browser(playwright.chromium, playwright_proxy, user_agent, headless)

24
cache/abs_cache.py vendored
View File

@@ -20,9 +20,9 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Name : 程序员阿江-Relakkes
# @Name : Programmer AJiang-Relakkes
# @Time : 2024/6/2 11:06
# @Desc : 抽象类
# @Desc : Abstract class
from abc import ABC, abstractmethod
from typing import Any, List, Optional
@@ -33,9 +33,9 @@ class AbstractCache(ABC):
@abstractmethod
def get(self, key: str) -> Optional[Any]:
"""
从缓存中获取键的值。
这是一个抽象方法。子类必须实现这个方法。
:param key:
Get the value of a key from the cache.
This is an abstract method. Subclasses must implement this method.
:param key: The key
:return:
"""
raise NotImplementedError
@@ -43,11 +43,11 @@ class AbstractCache(ABC):
@abstractmethod
def set(self, key: str, value: Any, expire_time: int) -> None:
"""
将键的值设置到缓存中。
这是一个抽象方法。子类必须实现这个方法。
:param key:
:param value:
:param expire_time: 过期时间
Set the value of a key in the cache.
This is an abstract method. Subclasses must implement this method.
:param key: The key
:param value: The value
:param expire_time: Expiration time
:return:
"""
raise NotImplementedError
@@ -55,8 +55,8 @@ class AbstractCache(ABC):
@abstractmethod
def keys(self, pattern: str) -> List[str]:
"""
获取所有符合pattern的key
:param pattern: 匹配模式
Get all keys matching the pattern
:param pattern: Matching pattern
:return:
"""
raise NotImplementedError

View File

@@ -20,23 +20,23 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Name : 程序员阿江-Relakkes
# @Name : Programmer AJiang-Relakkes
# @Time : 2024/6/2 11:23
# @Desc :
class CacheFactory:
"""
缓存工厂类
Cache factory class
"""
@staticmethod
def create_cache(cache_type: str, *args, **kwargs):
"""
创建缓存对象
:param cache_type: 缓存类型
:param args: 参数
:param kwargs: 关键字参数
Create cache object
:param cache_type: Cache type
:param args: Arguments
:param kwargs: Keyword arguments
:return:
"""
if cache_type == 'memory':

32
cache/local_cache.py vendored
View File

@@ -20,9 +20,9 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Name : 程序员阿江-Relakkes
# @Name : Programmer AJiang-Relakkes
# @Time : 2024/6/2 11:05
# @Desc : 本地缓存
# @Desc : Local cache
import asyncio
import time
@@ -35,19 +35,19 @@ class ExpiringLocalCache(AbstractCache):
def __init__(self, cron_interval: int = 10):
"""
初始化本地缓存
:param cron_interval: 定时清楚cache的时间间隔
Initialize local cache
:param cron_interval: Time interval for scheduled cache cleanup
:return:
"""
self._cron_interval = cron_interval
self._cache_container: Dict[str, Tuple[Any, float]] = {}
self._cron_task: Optional[asyncio.Task] = None
# 开启定时清理任务
# Start scheduled cleanup task
self._schedule_clear()
def __del__(self):
"""
析构函数,清理定时任务
Destructor function, cleanup scheduled task
:return:
"""
if self._cron_task is not None:
@@ -55,7 +55,7 @@ class ExpiringLocalCache(AbstractCache):
def get(self, key: str) -> Optional[Any]:
"""
从缓存中获取键的值
Get the value of a key from the cache
:param key:
:return:
"""
@@ -63,7 +63,7 @@ class ExpiringLocalCache(AbstractCache):
if value is None:
return None
# 如果键已过期,则删除键并返回None
# If the key has expired, delete it and return None
if expire_time < time.time():
del self._cache_container[key]
return None
@@ -72,7 +72,7 @@ class ExpiringLocalCache(AbstractCache):
def set(self, key: str, value: Any, expire_time: int) -> None:
"""
将键的值设置到缓存中
Set the value of a key in the cache
:param key:
:param value:
:param expire_time:
@@ -82,14 +82,14 @@ class ExpiringLocalCache(AbstractCache):
def keys(self, pattern: str) -> List[str]:
"""
获取所有符合pattern的key
:param pattern: 匹配模式
Get all keys matching the pattern
:param pattern: Matching pattern
:return:
"""
if pattern == '*':
return list(self._cache_container.keys())
# 本地缓存通配符暂时将*替换为空
# For local cache wildcard, temporarily replace * with empty string
if '*' in pattern:
pattern = pattern.replace('*', '')
@@ -97,7 +97,7 @@ class ExpiringLocalCache(AbstractCache):
def _schedule_clear(self):
"""
开启定时清理任务,
Start scheduled cleanup task
:return:
"""
@@ -111,7 +111,7 @@ class ExpiringLocalCache(AbstractCache):
def _clear(self):
"""
根据过期时间清理缓存
Clean up cache based on expiration time
:return:
"""
for key, (value, expire_time) in self._cache_container.items():
@@ -120,7 +120,7 @@ class ExpiringLocalCache(AbstractCache):
async def _start_clear_cron(self):
"""
开启定时清理任务
Start scheduled cleanup task
:return:
"""
while True:
@@ -130,7 +130,7 @@ class ExpiringLocalCache(AbstractCache):
if __name__ == '__main__':
cache = ExpiringLocalCache(cron_interval=2)
cache.set('name', '程序员阿江-Relakkes', 3)
cache.set('name', 'Programmer AJiang-Relakkes', 3)
print(cache.get('key'))
print(cache.keys("*"))
time.sleep(4)

16
cache/redis_cache.py vendored
View File

@@ -20,9 +20,9 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Name : 程序员阿江-Relakkes
# @Name : Programmer AJiang-Relakkes
# @Time : 2024/5/29 22:57
# @Desc : RedisCache实现
# @Desc : RedisCache implementation
import pickle
import time
from typing import Any, List
@@ -36,13 +36,13 @@ from config import db_config
class RedisCache(AbstractCache):
def __init__(self) -> None:
# 连接redis, 返回redis客户端
# Connect to redis, return redis client
self._redis_client = self._connet_redis()
@staticmethod
def _connet_redis() -> Redis:
"""
连接redis, 返回redis客户端, 这里按需配置redis连接信息
Connect to redis, return redis client, configure redis connection information as needed
:return:
"""
return Redis(
@@ -54,7 +54,7 @@ class RedisCache(AbstractCache):
def get(self, key: str) -> Any:
"""
从缓存中获取键的值, 并且反序列化
Get the value of a key from the cache and deserialize it
:param key:
:return:
"""
@@ -65,7 +65,7 @@ class RedisCache(AbstractCache):
def set(self, key: str, value: Any, expire_time: int) -> None:
"""
将键的值设置到缓存中, 并且序列化
Set the value of a key in the cache and serialize it
:param key:
:param value:
:param expire_time:
@@ -75,7 +75,7 @@ class RedisCache(AbstractCache):
def keys(self, pattern: str) -> List[str]:
"""
获取所有符合pattern的key
Get all keys matching the pattern
"""
return [key.decode() for key in self._redis_client.keys(pattern)]
@@ -83,7 +83,7 @@ class RedisCache(AbstractCache):
if __name__ == '__main__':
redis_cache = RedisCache()
# basic usage
redis_cache.set("name", "程序员阿江-Relakkes", 1)
redis_cache.set("name", "Programmer AJiang-Relakkes", 1)
print(redis_cache.get("name")) # Relakkes
print(redis_cache.keys("*")) # ['name']
time.sleep(2)

View File

@@ -37,7 +37,7 @@ EnumT = TypeVar("EnumT", bound=Enum)
class PlatformEnum(str, Enum):
"""支持的媒体平台枚举"""
"""Supported media platform enumeration"""
XHS = "xhs"
DOUYIN = "dy"
@@ -49,7 +49,7 @@ class PlatformEnum(str, Enum):
class LoginTypeEnum(str, Enum):
"""登录方式枚举"""
"""Login type enumeration"""
QRCODE = "qrcode"
PHONE = "phone"
@@ -57,7 +57,7 @@ class LoginTypeEnum(str, Enum):
class CrawlerTypeEnum(str, Enum):
"""爬虫类型枚举"""
"""Crawler type enumeration"""
SEARCH = "search"
DETAIL = "detail"
@@ -65,7 +65,7 @@ class CrawlerTypeEnum(str, Enum):
class SaveDataOptionEnum(str, Enum):
"""数据保存方式枚举"""
"""Data save option enumeration"""
CSV = "csv"
DB = "db"
@@ -73,13 +73,15 @@ class SaveDataOptionEnum(str, Enum):
SQLITE = "sqlite"
MONGODB = "mongodb"
EXCEL = "excel"
POSTGRES = "postgres"
class InitDbOptionEnum(str, Enum):
"""数据库初始化选项"""
"""Database initialization option"""
SQLITE = "sqlite"
MYSQL = "mysql"
POSTGRES = "postgres"
def _to_bool(value: bool | str) -> bool:
@@ -102,7 +104,7 @@ def _coerce_enum(
return enum_cls(value)
except ValueError:
typer.secho(
f"⚠️ 配置值 '{value}' 不在 {enum_cls.__name__} 支持的范围内,已回退到默认值 '{default.value}'.",
f"⚠️ Config value '{value}' is not within the supported range of {enum_cls.__name__}, falling back to default value '{default.value}'.",
fg=typer.colors.YELLOW,
)
return default
@@ -133,7 +135,7 @@ def _inject_init_db_default(args: Sequence[str]) -> list[str]:
async def parse_cmd(argv: Optional[Sequence[str]] = None):
"""使用 Typer 解析命令行参数。"""
"""Parse command line arguments using Typer."""
app = typer.Typer(add_completion=False)
@@ -143,48 +145,48 @@ async def parse_cmd(argv: Optional[Sequence[str]] = None):
PlatformEnum,
typer.Option(
"--platform",
help="媒体平台选择 (xhs=小红书 | dy=抖音 | ks=快手 | bili=哔哩哔哩 | wb=微博 | tieba=百度贴吧 | zhihu=知乎)",
rich_help_panel="基础配置",
help="Media platform selection (xhs=XiaoHongShu | dy=Douyin | ks=Kuaishou | bili=Bilibili | wb=Weibo | tieba=Baidu Tieba | zhihu=Zhihu)",
rich_help_panel="Basic Configuration",
),
] = _coerce_enum(PlatformEnum, config.PLATFORM, PlatformEnum.XHS),
lt: Annotated[
LoginTypeEnum,
typer.Option(
"--lt",
help="登录方式 (qrcode=二维码 | phone=手机号 | cookie=Cookie)",
rich_help_panel="账号配置",
help="Login type (qrcode=QR Code | phone=Phone | cookie=Cookie)",
rich_help_panel="Account Configuration",
),
] = _coerce_enum(LoginTypeEnum, config.LOGIN_TYPE, LoginTypeEnum.QRCODE),
crawler_type: Annotated[
CrawlerTypeEnum,
typer.Option(
"--type",
help="爬取类型 (search=搜索 | detail=详情 | creator=创作者)",
rich_help_panel="基础配置",
help="Crawler type (search=Search | detail=Detail | creator=Creator)",
rich_help_panel="Basic Configuration",
),
] = _coerce_enum(CrawlerTypeEnum, config.CRAWLER_TYPE, CrawlerTypeEnum.SEARCH),
start: Annotated[
int,
typer.Option(
"--start",
help="起始页码",
rich_help_panel="基础配置",
help="Starting page number",
rich_help_panel="Basic Configuration",
),
] = config.START_PAGE,
keywords: Annotated[
str,
typer.Option(
"--keywords",
help="请输入关键词,多个关键词用逗号分隔",
rich_help_panel="基础配置",
help="Enter keywords, multiple keywords separated by commas",
rich_help_panel="Basic Configuration",
),
] = config.KEYWORDS,
get_comment: Annotated[
str,
typer.Option(
"--get_comment",
help="是否爬取一级评论,支持 yes/true/t/y/1 no/false/f/n/0",
rich_help_panel="评论配置",
help="Whether to crawl first-level comments, supports yes/true/t/y/1 or no/false/f/n/0",
rich_help_panel="Comment Configuration",
show_default=True,
),
] = str(config.ENABLE_GET_COMMENTS),
@@ -192,17 +194,26 @@ async def parse_cmd(argv: Optional[Sequence[str]] = None):
str,
typer.Option(
"--get_sub_comment",
help="是否爬取二级评论,支持 yes/true/t/y/1 no/false/f/n/0",
rich_help_panel="评论配置",
help="Whether to crawl second-level comments, supports yes/true/t/y/1 or no/false/f/n/0",
rich_help_panel="Comment Configuration",
show_default=True,
),
] = str(config.ENABLE_GET_SUB_COMMENTS),
headless: Annotated[
str,
typer.Option(
"--headless",
help="Whether to enable headless mode (applies to both Playwright and CDP), supports yes/true/t/y/1 or no/false/f/n/0",
rich_help_panel="Runtime Configuration",
show_default=True,
),
] = str(config.HEADLESS),
save_data_option: Annotated[
SaveDataOptionEnum,
typer.Option(
"--save_data_option",
help="数据保存方式 (csv=CSV文件 | db=MySQL数据库 | json=JSON文件 | sqlite=SQLite数据库 | mongodb=MongoDB数据库 | excel=Excel文件)",
rich_help_panel="存储配置",
help="Data save option (csv=CSV file | db=MySQL database | json=JSON file | sqlite=SQLite database | mongodb=MongoDB database | excel=Excel file | postgres=PostgreSQL database)",
rich_help_panel="Storage Configuration",
),
] = _coerce_enum(
SaveDataOptionEnum, config.SAVE_DATA_OPTION, SaveDataOptionEnum.JSON
@@ -211,25 +222,62 @@ async def parse_cmd(argv: Optional[Sequence[str]] = None):
Optional[InitDbOptionEnum],
typer.Option(
"--init_db",
help="初始化数据库表结构 (sqlite | mysql)",
rich_help_panel="存储配置",
help="Initialize database table structure (sqlite | mysql | postgres)",
rich_help_panel="Storage Configuration",
),
] = None,
cookies: Annotated[
str,
typer.Option(
"--cookies",
help="Cookie 登录方式使用的 Cookie 值",
rich_help_panel="账号配置",
help="Cookie value used for Cookie login method",
rich_help_panel="Account Configuration",
),
] = config.COOKIES,
specified_id: Annotated[
str,
typer.Option(
"--specified_id",
help="Post/video ID list in detail mode, multiple IDs separated by commas (supports full URL or ID)",
rich_help_panel="Basic Configuration",
),
] = "",
creator_id: Annotated[
str,
typer.Option(
"--creator_id",
help="Creator ID list in creator mode, multiple IDs separated by commas (supports full URL or ID)",
rich_help_panel="Basic Configuration",
),
] = "",
max_comments_count_singlenotes: Annotated[
int,
typer.Option(
"--max_comments_count_singlenotes",
help="Maximum number of first-level comments to crawl per post/video",
rich_help_panel="Comment Configuration",
),
] = config.CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES,
max_concurrency_num: Annotated[
int,
typer.Option(
"--max_concurrency_num",
help="Maximum number of concurrent crawlers",
rich_help_panel="Performance Configuration",
),
] = config.MAX_CONCURRENCY_NUM,
) -> SimpleNamespace:
"""MediaCrawler 命令行入口"""
enable_comment = _to_bool(get_comment)
enable_sub_comment = _to_bool(get_sub_comment)
enable_headless = _to_bool(headless)
init_db_value = init_db.value if init_db else None
# Parse specified_id and creator_id into lists
specified_id_list = [id.strip() for id in specified_id.split(",") if id.strip()] if specified_id else []
creator_id_list = [id.strip() for id in creator_id.split(",") if id.strip()] if creator_id else []
# override global config
config.PLATFORM = platform.value
config.LOGIN_TYPE = lt.value
@@ -238,8 +286,37 @@ async def parse_cmd(argv: Optional[Sequence[str]] = None):
config.KEYWORDS = keywords
config.ENABLE_GET_COMMENTS = enable_comment
config.ENABLE_GET_SUB_COMMENTS = enable_sub_comment
config.HEADLESS = enable_headless
config.CDP_HEADLESS = enable_headless
config.SAVE_DATA_OPTION = save_data_option.value
config.COOKIES = cookies
config.CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES = max_comments_count_singlenotes
config.MAX_CONCURRENCY_NUM = max_concurrency_num
# Set platform-specific ID lists for detail/creator mode
if specified_id_list:
if platform == PlatformEnum.XHS:
config.XHS_SPECIFIED_NOTE_URL_LIST = specified_id_list
elif platform == PlatformEnum.BILIBILI:
config.BILI_SPECIFIED_ID_LIST = specified_id_list
elif platform == PlatformEnum.DOUYIN:
config.DY_SPECIFIED_ID_LIST = specified_id_list
elif platform == PlatformEnum.WEIBO:
config.WEIBO_SPECIFIED_ID_LIST = specified_id_list
elif platform == PlatformEnum.KUAISHOU:
config.KS_SPECIFIED_ID_LIST = specified_id_list
if creator_id_list:
if platform == PlatformEnum.XHS:
config.XHS_CREATOR_ID_LIST = creator_id_list
elif platform == PlatformEnum.BILIBILI:
config.BILI_CREATOR_ID_LIST = creator_id_list
elif platform == PlatformEnum.DOUYIN:
config.DY_CREATOR_ID_LIST = creator_id_list
elif platform == PlatformEnum.WEIBO:
config.WEIBO_CREATOR_ID_LIST = creator_id_list
elif platform == PlatformEnum.KUAISHOU:
config.KS_CREATOR_ID_LIST = creator_id_list
return SimpleNamespace(
platform=config.PLATFORM,
@@ -249,9 +326,12 @@ async def parse_cmd(argv: Optional[Sequence[str]] = None):
keywords=config.KEYWORDS,
get_comment=config.ENABLE_GET_COMMENTS,
get_sub_comment=config.ENABLE_GET_SUB_COMMENTS,
headless=config.HEADLESS,
save_data_option=config.SAVE_DATA_OPTION,
init_db=init_db_value,
cookies=config.COOKIES,
specified_id=specified_id,
creator_id=creator_id,
)
command = typer.main.get_command(app)
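Hedged examples of the new options in combination; the IDs, cookies and keyword values below are placeholders, while the flags themselves are the ones defined above.

```shell
# Illustrative invocations of the options added above (IDs and cookies are placeholders)
uv run main.py --platform xhs --lt qrcode --type detail \
    --specified_id "note_id_1,note_id_2" --max_comments_count_singlenotes 50

uv run main.py --platform dy --lt cookie --cookies "<cookie string>" --type creator \
    --creator_id "creator_id_1" --max_concurrency_num 3 --headless true --save_data_option postgres
```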

View File

@@ -70,8 +70,8 @@ BROWSER_LAUNCH_TIMEOUT = 60
# 设置为False可以保持浏览器运行便于调试
AUTO_CLOSE_BROWSER = True
# 数据保存类型选项配置,支持种类型csv、db、json、sqlite、excel, 最好保存到DB有排重的功能。
SAVE_DATA_OPTION = "json" # csv or db or json or sqlite or excel
# 数据保存类型选项配置,支持种类型csv、db、json、sqlite、excel、postgres, 最好保存到DB有排重的功能。
SAVE_DATA_OPTION = "json" # csv or db or json or sqlite or excel or postgres
# 用户浏览器缓存的浏览器文件配置
USER_DATA_DIR = "%s_user_data_dir" # %s will be replaced by platform name

View File

@@ -37,7 +37,7 @@ mysql_db_config = {
# redis config
REDIS_DB_HOST = "127.0.0.1" # your redis host
REDIS_DB_HOST = os.getenv("REDIS_DB_HOST", "127.0.0.1") # your redis host
REDIS_DB_PWD = os.getenv("REDIS_DB_PWD", "123456") # your redis password
REDIS_DB_PORT = os.getenv("REDIS_DB_PORT", 6379) # your redis port
REDIS_DB_NUM = os.getenv("REDIS_DB_NUM", 0) # your redis db num
@@ -67,3 +67,18 @@ mongodb_config = {
"password": MONGODB_PWD,
"db_name": MONGODB_DB_NAME,
}
# postgres config
POSTGRES_DB_PWD = os.getenv("POSTGRES_DB_PWD", "123456")
POSTGRES_DB_USER = os.getenv("POSTGRES_DB_USER", "postgres")
POSTGRES_DB_HOST = os.getenv("POSTGRES_DB_HOST", "localhost")
POSTGRES_DB_PORT = os.getenv("POSTGRES_DB_PORT", 5432)
POSTGRES_DB_NAME = os.getenv("POSTGRES_DB_NAME", "media_crawler")
postgres_db_config = {
"user": POSTGRES_DB_USER,
"password": POSTGRES_DB_PWD,
"host": POSTGRES_DB_HOST,
"port": POSTGRES_DB_PORT,
"db_name": POSTGRES_DB_NAME,
}

View File

@@ -31,6 +31,10 @@ WEIBO_SPECIFIED_ID_LIST = [
# 指定微博用户ID列表
WEIBO_CREATOR_ID_LIST = [
"5533390220",
"5756404150",
# ........................
]
# 是否开启微博爬取全文的功能,默认开启
# 如果开启的话会增加被风控的概率,相当于一个关键词搜索请求会再遍历所有帖子的时候,再请求一次帖子详情
ENABLE_WEIBO_FULL_TEXT = True

View File

@@ -17,9 +17,9 @@
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# persist-1<persist1@126.com>
# 原因:将 db.py 改造为模块,移除直接执行入口,修复相对导入问题。
# 副作用:无
# 回滚策略:还原此文件。
# Reason: Refactored db.py into a module, removed direct execution entry point, fixed relative import issues.
# Side effects: None
# Rollback strategy: Restore this file.
import asyncio
import sys
from pathlib import Path

View File

@@ -22,7 +22,7 @@ from sqlalchemy.orm import sessionmaker
from contextlib import asynccontextmanager
from .models import Base
import config
from config.db_config import mysql_db_config, sqlite_db_config
from config.db_config import mysql_db_config, sqlite_db_config, postgres_db_config
# Keep a cache of engines
_engines = {}
@@ -36,6 +36,18 @@ async def create_database_if_not_exists(db_type: str):
async with engine.connect() as conn:
await conn.execute(text(f"CREATE DATABASE IF NOT EXISTS {mysql_db_config['db_name']}"))
await engine.dispose()
elif db_type == "postgres":
# Connect to the default 'postgres' database
server_url = f"postgresql+asyncpg://{postgres_db_config['user']}:{postgres_db_config['password']}@{postgres_db_config['host']}:{postgres_db_config['port']}/postgres"
print(f"[init_db] Connecting to Postgres: host={postgres_db_config['host']}, port={postgres_db_config['port']}, user={postgres_db_config['user']}, dbname=postgres")
# Isolation level AUTOCOMMIT is required for CREATE DATABASE
engine = create_async_engine(server_url, echo=False, isolation_level="AUTOCOMMIT")
async with engine.connect() as conn:
# Check if database exists
result = await conn.execute(text(f"SELECT 1 FROM pg_database WHERE datname = '{postgres_db_config['db_name']}'"))
if not result.scalar():
await conn.execute(text(f"CREATE DATABASE {postgres_db_config['db_name']}"))
await engine.dispose()
def get_async_engine(db_type: str = None):
@@ -52,6 +64,8 @@ def get_async_engine(db_type: str = None):
db_url = f"sqlite+aiosqlite:///{sqlite_db_config['db_path']}"
elif db_type == "mysql" or db_type == "db":
db_url = f"mysql+asyncmy://{mysql_db_config['user']}:{mysql_db_config['password']}@{mysql_db_config['host']}:{mysql_db_config['port']}/{mysql_db_config['db_name']}"
elif db_type == "postgres":
db_url = f"postgresql+asyncpg://{postgres_db_config['user']}:{postgres_db_config['password']}@{postgres_db_config['host']}:{postgres_db_config['port']}/{postgres_db_config['db_name']}"
else:
raise ValueError(f"Unsupported database type: {db_type}")

View File

@@ -406,9 +406,9 @@ class ZhihuContent(Base):
last_modify_ts = Column(BigInteger)
# persist-1<persist1@126.com>
# 原因:修复 ORM 模型定义错误,确保与数据库表结构一致。
# 副作用:无
# 回滚策略:还原此行
# Reason: Fixed ORM model definition error, ensuring consistency with database table structure.
# Side effects: None
# Rollback strategy: Restore this line
class ZhihuComment(Base):
__tablename__ = 'zhihu_comment'

View File

@@ -16,7 +16,7 @@
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
"""MongoDB存储基类:提供连接管理和通用存储方法"""
"""MongoDB storage base class: Provides connection management and common storage methods"""
import asyncio
from typing import Dict, List, Optional
from motor.motor_asyncio import AsyncIOMotorClient, AsyncIOMotorDatabase, AsyncIOMotorCollection
@@ -25,7 +25,7 @@ from tools import utils
class MongoDBConnection:
"""MongoDB连接管理(单例模式)"""
"""MongoDB connection management (singleton pattern)"""
_instance = None
_client: Optional[AsyncIOMotorClient] = None
_db: Optional[AsyncIOMotorDatabase] = None
@@ -37,7 +37,7 @@ class MongoDBConnection:
return cls._instance
async def get_client(self) -> AsyncIOMotorClient:
"""获取客户端"""
"""Get client"""
if self._client is None:
async with self._lock:
if self._client is None:
@@ -45,7 +45,7 @@ class MongoDBConnection:
return self._client
async def get_db(self) -> AsyncIOMotorDatabase:
"""获取数据库"""
"""Get database"""
if self._db is None:
async with self._lock:
if self._db is None:
@@ -53,7 +53,7 @@ class MongoDBConnection:
return self._db
async def _connect(self):
"""建立连接"""
"""Establish connection"""
try:
mongo_config = db_config.mongodb_config
host = mongo_config["host"]
@@ -62,14 +62,14 @@ class MongoDBConnection:
password = mongo_config["password"]
db_name = mongo_config["db_name"]
# 构建连接URL有认证/无认证)
# Build connection URL (with/without authentication)
if user and password:
connection_url = f"mongodb://{user}:{password}@{host}:{port}/"
else:
connection_url = f"mongodb://{host}:{port}/"
self._client = AsyncIOMotorClient(connection_url, serverSelectionTimeoutMS=5000)
await self._client.server_info() # 测试连接
await self._client.server_info() # Test connection
self._db = self._client[db_name]
utils.logger.info(f"[MongoDBConnection] Connected to {host}:{port}/{db_name}")
except Exception as e:
@@ -77,7 +77,7 @@ class MongoDBConnection:
raise
async def close(self):
"""关闭连接"""
"""Close connection"""
if self._client is not None:
self._client.close()
self._client = None
@@ -86,24 +86,24 @@ class MongoDBConnection:
class MongoDBStoreBase:
"""MongoDB存储基类提供通用的CRUD操作"""
"""MongoDB storage base class: Provides common CRUD operations"""
def __init__(self, collection_prefix: str):
"""初始化存储基类
"""Initialize storage base class
Args:
collection_prefix: 平台前缀(xhs/douyin/bilibili等)
collection_prefix: Platform prefix (xhs/douyin/bilibili, etc.)
"""
self.collection_prefix = collection_prefix
self._connection = MongoDBConnection()
async def get_collection(self, collection_suffix: str) -> AsyncIOMotorCollection:
"""获取集合:{prefix}_{suffix}"""
"""Get collection: {prefix}_{suffix}"""
db = await self._connection.get_db()
collection_name = f"{self.collection_prefix}_{collection_suffix}"
return db[collection_name]
async def save_or_update(self, collection_suffix: str, query: Dict, data: Dict) -> bool:
"""保存或更新数据(upsert"""
"""Save or update data (upsert)"""
try:
collection = await self.get_collection(collection_suffix)
await collection.update_one(query, {"$set": data}, upsert=True)
@@ -113,7 +113,7 @@ class MongoDBStoreBase:
return False
async def find_one(self, collection_suffix: str, query: Dict) -> Optional[Dict]:
"""查询单条数据"""
"""Query a single record"""
try:
collection = await self.get_collection(collection_suffix)
return await collection.find_one(query)
@@ -122,7 +122,7 @@ class MongoDBStoreBase:
return None
async def find_many(self, collection_suffix: str, query: Dict, limit: int = 0) -> List[Dict]:
"""查询多条数据limit=0表示不限制"""
"""Query multiple records (limit=0 means no limit)"""
try:
collection = await self.get_collection(collection_suffix)
cursor = collection.find(query)
@@ -134,7 +134,7 @@ class MongoDBStoreBase:
return []
async def create_index(self, collection_suffix: str, keys: List[tuple], unique: bool = False):
"""创建索引:keys=[("field", 1)]"""
"""Create index: keys=[("field", 1)]"""
try:
collection = await self.get_collection(collection_suffix)
await collection.create_index(keys, unique=unique)
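A short usage sketch of the base class, using only the methods shown in this diff; the "note" collection suffix and the field names are illustrative.

```python
# Usage sketch only: the "note" suffix and field names are illustrative.
import asyncio
from database.mongodb_store_base import MongoDBStoreBase

async def demo() -> None:
    store = MongoDBStoreBase(collection_prefix="xhs")
    await store.create_index("note", [("note_id", 1)], unique=True)
    await store.save_or_update("note", {"note_id": "123"}, {"note_id": "123", "title": "demo"})
    print(await store.find_one("note", {"note_id": "123"}))

asyncio.run(demo())
```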

View File

@@ -1,7 +1,8 @@
import {defineConfig} from 'vitepress'
import {withMermaid} from 'vitepress-plugin-mermaid'
// https://vitepress.dev/reference/site-config
export default defineConfig({
export default withMermaid(defineConfig({
title: "MediaCrawler自媒体爬虫",
description: "小红书爬虫,抖音爬虫, 快手爬虫, B站爬虫 微博爬虫,百度贴吧爬虫,知乎爬虫...。 ",
lastUpdated: true,
@@ -43,6 +44,7 @@ export default defineConfig({
text: 'MediaCrawler使用文档',
items: [
{text: '基本使用', link: '/'},
{text: '项目架构文档', link: '/项目架构文档'},
{text: '常见问题汇总', link: '/常见问题'},
{text: 'IP代理使用', link: '/代理使用'},
{text: '词云图使用', link: '/词云图使用配置'},
@@ -85,4 +87,4 @@ export default defineConfig({
{icon: 'github', link: 'https://github.com/NanmiCoder/MediaCrawler'}
]
}
})
}))

View File

@@ -11,9 +11,9 @@ const fetchAds = async () => {
return [
{
id: 1,
imageUrl: 'https://github.com/NanmiCoder/MediaCrawler/raw/main/docs/static/images/auto_test.png',
landingUrl: 'https://item.jd.com/10124939676219.html',
text: '给好朋友虫师新书站台推荐 - 基于Python的自动化测试框架设计'
imageUrl: 'https://github.com/NanmiCoder/MediaCrawler/raw/main/docs/static/images/MediaCrawlerPro.jpg',
landingUrl: 'https://github.com/MediaCrawlerPro',
text: '👏欢迎大家来订阅MediaCrawlerPro源代码'
}
]
}
@@ -63,7 +63,8 @@ onUnmounted(() => {
}
.ad-image {
max-width: 130px;
max-width: 100%;
width: 280px;
height: auto;
margin-bottom: 0.5rem;
}

View File

@@ -6,4 +6,5 @@
:root {
--vp-sidebar-width: 285px;
--vp-sidebar-bg-color: var(--vp-c-bg-alt);
}
--vp-aside-width: 300px;
}

View File

@@ -21,6 +21,9 @@ MediaCrawler 支持多种数据存储方式,您可以根据需求选择最适
- **MySQL 数据库**:支持关系型数据库 MySQL 中保存(需要提前创建数据库)
1. 初始化:`--init_db mysql`
2. 数据存储:`--save_data_option db`db 参数为兼容历史更新保留)
- **PostgreSQL 数据库**:支持高级关系型数据库 PostgreSQL 中保存(推荐生产环境使用)
1. 初始化:`--init_db postgres`
2. 数据存储:`--save_data_option postgres`
#### 使用示例
@@ -41,6 +44,13 @@ uv run main.py --init_db mysql
uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
```
```shell
# 初始化 PostgreSQL 数据库
uv run main.py --init_db postgres
# 使用 PostgreSQL 存储数据
uv run main.py --platform xhs --lt qrcode --type search --save_data_option postgres
```
```shell
# 使用 CSV 存储数据
uv run main.py --platform xhs --lt qrcode --type search --save_data_option csv

View File

@@ -1,5 +1,9 @@
# MediaCrawler使用方法
## 项目文档
- [项目架构文档](项目架构文档.md) - 系统架构、模块设计、数据流向(含 Mermaid 图表)
## 推荐:使用 uv 管理依赖
### 1. 前置依赖

Binary image changes (the diff shows size placeholders only): added docs/static/images/MediaCrawlerPro.jpg (158 KiB), QIWEI.png (22 KiB), Thordata.png (486 KiB) and img_8.png (944 KiB); six removed images are shown only by size (171, 170, 168, 161, 241 and 229 KiB).

View File

@@ -9,4 +9,4 @@
>
> 如果图片展示不出来或过期可以直接添加我的微信号relakkes并备注github会有拉群小助手自动拉你进群
![relakkes_wechat](static/images/relakkes_weichat.jpg)
![relakkes_wechat](static/images/QIWEI.png)

883
docs/项目架构文档.md Normal file
View File

@@ -0,0 +1,883 @@
# MediaCrawler 项目架构文档
## 1. 项目概述
### 1.1 项目简介
MediaCrawler 是一个多平台自媒体爬虫框架,采用 Python 异步编程实现,支持爬取主流社交媒体平台的内容、评论和创作者信息。
### 1.2 支持的平台
| 平台 | 代号 | 主要功能 |
|------|------|---------|
| 小红书 | `xhs` | 笔记搜索、详情、创作者 |
| 抖音 | `dy` | 视频搜索、详情、创作者 |
| 快手 | `ks` | 视频搜索、详情、创作者 |
| B站 | `bili` | 视频搜索、详情、UP主 |
| 微博 | `wb` | 微博搜索、详情、博主 |
| 百度贴吧 | `tieba` | 帖子搜索、详情 |
| 知乎 | `zhihu` | 问答搜索、详情、答主 |
### 1.3 核心功能特性
- **多平台支持**:统一的爬虫接口,支持 7 大主流平台
- **多种登录方式**二维码、手机号、Cookie 三种登录方式
- **多种存储方式**CSV、JSON、SQLite、MySQL、MongoDB、Excel
- **反爬虫对策**CDP 模式、代理 IP 池、请求签名
- **异步高并发**:基于 asyncio 的异步架构,高效并发爬取
- **词云生成**:自动生成评论词云图
---
## 2. 系统架构总览
### 2.1 高层架构图
```mermaid
flowchart TB
subgraph Entry["入口层"]
main["main.py<br/>程序入口"]
cmdarg["cmd_arg<br/>命令行参数"]
config["config<br/>配置管理"]
end
subgraph Core["核心爬虫层"]
factory["CrawlerFactory<br/>爬虫工厂"]
base["AbstractCrawler<br/>爬虫基类"]
subgraph Platforms["平台实现"]
xhs["XiaoHongShuCrawler"]
dy["DouYinCrawler"]
ks["KuaishouCrawler"]
bili["BilibiliCrawler"]
wb["WeiboCrawler"]
tieba["TieBaCrawler"]
zhihu["ZhihuCrawler"]
end
end
subgraph Client["API客户端层"]
absClient["AbstractApiClient<br/>客户端基类"]
xhsClient["XiaoHongShuClient"]
dyClient["DouYinClient"]
ksClient["KuaiShouClient"]
biliClient["BilibiliClient"]
wbClient["WeiboClient"]
tiebaClient["BaiduTieBaClient"]
zhihuClient["ZhiHuClient"]
end
subgraph Storage["数据存储层"]
storeFactory["StoreFactory<br/>存储工厂"]
csv["CSV存储"]
json["JSON存储"]
sqlite["SQLite存储"]
mysql["MySQL存储"]
mongodb["MongoDB存储"]
excel["Excel存储"]
end
subgraph Infra["基础设施层"]
browser["浏览器管理<br/>Playwright/CDP"]
proxy["代理IP池"]
cache["缓存系统"]
login["登录管理"]
end
main --> factory
cmdarg --> main
config --> main
factory --> base
base --> Platforms
Platforms --> Client
Client --> Storage
Client --> Infra
Storage --> storeFactory
storeFactory --> csv & json & sqlite & mysql & mongodb & excel
```
### 2.2 数据流向图
```mermaid
flowchart LR
subgraph Input["输入"]
keywords["关键词/ID"]
config["配置参数"]
end
subgraph Process["处理流程"]
browser["启动浏览器"]
login["登录认证"]
search["搜索/爬取"]
parse["数据解析"]
comment["获取评论"]
end
subgraph Output["输出"]
content["内容数据"]
comments["评论数据"]
creator["创作者数据"]
media["媒体文件"]
end
subgraph Storage["存储"]
file["文件存储<br/>CSV/JSON/Excel"]
db["数据库<br/>SQLite/MySQL"]
nosql["NoSQL<br/>MongoDB"]
end
keywords --> browser
config --> browser
browser --> login
login --> search
search --> parse
parse --> comment
parse --> content
comment --> comments
parse --> creator
parse --> media
content & comments & creator --> file & db & nosql
media --> file
```
---
## 3. 目录结构
```
MediaCrawler/
├── main.py # 程序入口
├── var.py # 全局上下文变量
├── pyproject.toml # 项目配置
├── base/ # 基础抽象类
│ └── base_crawler.py # 爬虫、登录、存储、客户端基类
├── config/ # 配置管理
│ ├── base_config.py # 核心配置
│ ├── db_config.py # 数据库配置
│ └── {platform}_config.py # 平台特定配置
├── media_platform/ # 平台爬虫实现
│ ├── xhs/ # 小红书
│ ├── douyin/ # 抖音
│ ├── kuaishou/ # 快手
│ ├── bilibili/ # B站
│ ├── weibo/ # 微博
│ ├── tieba/ # 百度贴吧
│ └── zhihu/ # 知乎
├── store/ # 数据存储
│ ├── excel_store_base.py # Excel存储基类
│ └── {platform}/ # 各平台存储实现
├── database/ # 数据库层
│ ├── models.py # ORM模型定义
│ ├── db_session.py # 数据库会话管理
│ └── mongodb_store_base.py # MongoDB基类
├── proxy/ # 代理管理
│ ├── proxy_ip_pool.py # IP池管理
│ ├── proxy_mixin.py # 代理刷新混入
│ └── providers/ # 代理提供商
├── cache/ # 缓存系统
│ ├── abs_cache.py # 缓存抽象类
│ ├── local_cache.py # 本地缓存
│ └── redis_cache.py # Redis缓存
├── tools/ # 工具模块
│ ├── app_runner.py # 应用运行管理
│ ├── browser_launcher.py # 浏览器启动
│ ├── cdp_browser.py # CDP浏览器管理
│ ├── crawler_util.py # 爬虫工具
│ └── async_file_writer.py # 异步文件写入
├── model/ # 数据模型
│ └── m_{platform}.py # Pydantic模型
├── libs/ # JS脚本库
│ └── stealth.min.js # 反检测脚本
└── cmd_arg/ # 命令行参数
└── arg.py # 参数定义
```
---
## 4. 核心模块详解
### 4.1 爬虫基类体系
```mermaid
classDiagram
class AbstractCrawler {
<<abstract>>
+start()* 启动爬虫
+search()* 搜索功能
+launch_browser() 启动浏览器
+launch_browser_with_cdp() CDP模式启动
}
class AbstractLogin {
<<abstract>>
+begin()* 开始登录
+login_by_qrcode()* 二维码登录
+login_by_mobile()* 手机号登录
+login_by_cookies()* Cookie登录
}
class AbstractStore {
<<abstract>>
+store_content()* 存储内容
+store_comment()* 存储评论
+store_creator()* 存储创作者
+store_image()* 存储图片
+store_video()* 存储视频
}
class AbstractApiClient {
<<abstract>>
+request()* HTTP请求
+update_cookies()* 更新Cookies
}
class ProxyRefreshMixin {
+init_proxy_pool() 初始化代理池
+_refresh_proxy_if_expired() 刷新过期代理
}
class XiaoHongShuCrawler {
+xhs_client: XiaoHongShuClient
+start()
+search()
+get_specified_notes()
+get_creators_and_notes()
}
class XiaoHongShuClient {
+playwright_page: Page
+cookie_dict: Dict
+request()
+pong() 检查登录状态
+get_note_by_keyword()
+get_note_by_id()
}
AbstractCrawler <|-- XiaoHongShuCrawler
AbstractApiClient <|-- XiaoHongShuClient
ProxyRefreshMixin <|-- XiaoHongShuClient
```
### 4.2 爬虫生命周期
```mermaid
sequenceDiagram
participant Main as main.py
participant Factory as CrawlerFactory
participant Crawler as XiaoHongShuCrawler
participant Browser as Playwright/CDP
participant Login as XiaoHongShuLogin
participant Client as XiaoHongShuClient
participant Store as StoreFactory
Main->>Factory: create_crawler("xhs")
Factory-->>Main: crawler实例
Main->>Crawler: start()
alt 启用IP代理
Crawler->>Crawler: create_ip_pool()
end
alt CDP模式
Crawler->>Browser: launch_browser_with_cdp()
else 标准模式
Crawler->>Browser: launch_browser()
end
Browser-->>Crawler: browser_context
Crawler->>Crawler: create_xhs_client()
Crawler->>Client: pong() 检查登录状态
alt 未登录
Crawler->>Login: begin()
Login->>Login: login_by_qrcode/mobile/cookie
Login-->>Crawler: 登录成功
end
alt search模式
Crawler->>Client: get_note_by_keyword()
Client-->>Crawler: 搜索结果
loop 获取详情
Crawler->>Client: get_note_by_id()
Client-->>Crawler: 笔记详情
end
else detail模式
Crawler->>Client: get_note_by_id()
else creator模式
Crawler->>Client: get_creator_info()
end
Crawler->>Store: store_content/comment/creator
Store-->>Crawler: 存储完成
Main->>Crawler: cleanup()
Crawler->>Browser: close()
```
### 4.3 平台爬虫实现结构
每个平台目录包含以下核心文件:
```
media_platform/{platform}/
├── __init__.py # 模块导出
├── core.py # 爬虫主实现类
├── client.py # API客户端
├── login.py # 登录实现
├── field.py # 字段/枚举定义
├── exception.py # 异常定义
├── help.py # 辅助函数
└── {特殊实现}.py # 平台特定逻辑
```
### 4.4 三种爬虫模式
| 模式 | 配置值 | 功能描述 | 适用场景 |
|------|--------|---------|---------|
| 搜索模式 | `search` | 根据关键词搜索内容 | 批量获取特定主题内容 |
| 详情模式 | `detail` | 获取指定ID的详情 | 精确获取已知内容 |
| 创作者模式 | `creator` | 获取创作者所有内容 | 追踪特定博主/UP主 |
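Illustrative commands, one per mode (values are placeholders), matching the CLI options defined in `cmd_arg/arg.py`:

```bash
# Illustrative, one command per mode (IDs are placeholders)
python main.py --platform xhs --type search --keywords "编程副业"
python main.py --platform xhs --type detail --specified_id "<note_url_or_id>"
python main.py --platform xhs --type creator --creator_id "<creator_url_or_id>"
```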
---
## 5. 数据存储层
### 5.1 存储架构图
```mermaid
classDiagram
class AbstractStore {
<<abstract>>
+store_content()*
+store_comment()*
+store_creator()*
}
class StoreFactory {
+STORES: Dict
+create_store() AbstractStore
}
class CsvStoreImplement {
+async_file_writer: AsyncFileWriter
+store_content()
+store_comment()
}
class JsonStoreImplement {
+async_file_writer: AsyncFileWriter
+store_content()
+store_comment()
}
class DbStoreImplement {
+session: AsyncSession
+store_content()
+store_comment()
}
class SqliteStoreImplement {
+session: AsyncSession
+store_content()
+store_comment()
}
class MongoStoreImplement {
+mongo_base: MongoDBStoreBase
+store_content()
+store_comment()
}
class ExcelStoreImplement {
+excel_base: ExcelStoreBase
+store_content()
+store_comment()
}
AbstractStore <|-- CsvStoreImplement
AbstractStore <|-- JsonStoreImplement
AbstractStore <|-- DbStoreImplement
AbstractStore <|-- SqliteStoreImplement
AbstractStore <|-- MongoStoreImplement
AbstractStore <|-- ExcelStoreImplement
StoreFactory --> AbstractStore
```
### 5.2 存储工厂模式
```python
# 以抖音为例
class DouyinStoreFactory:
STORES = {
"csv": DouyinCsvStoreImplement,
"db": DouyinDbStoreImplement,
"json": DouyinJsonStoreImplement,
"sqlite": DouyinSqliteStoreImplement,
"mongodb": DouyinMongoStoreImplement,
"excel": DouyinExcelStoreImplement,
}
@staticmethod
def create_store() -> AbstractStore:
store_class = DouyinStoreFactory.STORES.get(config.SAVE_DATA_OPTION)
return store_class()
```
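A minimal usage sketch: the `store.douyin` import path and the argument passed to `store_content` are assumptions; only the factory itself comes from the snippet above.

```python
# Illustrative only: import path and the store_content argument shape are assumptions.
import asyncio
from store.douyin import DouyinStoreFactory

async def save_demo_item() -> None:
    store = DouyinStoreFactory.create_store()  # concrete class picked via config.SAVE_DATA_OPTION
    await store.store_content({"aweme_id": "123", "title": "demo"})

asyncio.run(save_demo_item())
```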
### 5.3 存储方式对比
| 存储方式 | 配置值 | 优点 | 适用场景 |
|---------|--------|-----|---------|
| CSV | `csv` | 简单、通用 | 小规模数据、快速查看 |
| JSON | `json` | 结构完整、易解析 | API对接、数据交换 |
| SQLite | `sqlite` | 轻量、无需服务 | 本地开发、小型项目 |
| MySQL | `db` | 性能好、支持并发 | 生产环境、大规模数据 |
| MongoDB | `mongodb` | 灵活、易扩展 | 非结构化数据、快速迭代 |
| Excel | `excel` | 可视化、易分享 | 报告、数据分析 |
---
## 6. 基础设施层
### 6.1 代理系统架构
```mermaid
flowchart TB
subgraph Config["配置"]
enable["ENABLE_IP_PROXY"]
provider["IP_PROXY_PROVIDER"]
count["IP_PROXY_POOL_COUNT"]
end
subgraph Pool["代理池管理"]
pool["ProxyIpPool"]
load["load_proxies()"]
validate["_is_valid_proxy()"]
get["get_proxy()"]
refresh["get_or_refresh_proxy()"]
end
subgraph Providers["代理提供商"]
kuaidl["快代理<br/>KuaiDaiLiProxy"]
wandou["万代理<br/>WanDouHttpProxy"]
jishu["技术IP<br/>JiShuHttpProxy"]
end
subgraph Client["API客户端"]
mixin["ProxyRefreshMixin"]
request["request()"]
end
enable --> pool
provider --> Providers
count --> load
pool --> load
load --> validate
validate --> Providers
pool --> get
pool --> refresh
mixin --> refresh
mixin --> Client
request --> mixin
```
### 6.2 登录流程
```mermaid
flowchart TB
Start([开始登录]) --> CheckType{登录类型?}
CheckType -->|qrcode| QR[显示二维码]
QR --> WaitScan[等待扫描]
WaitScan --> CheckQR{扫描成功?}
CheckQR -->|是| SaveCookie[保存Cookie]
CheckQR -->|否| WaitScan
CheckType -->|phone| Phone[输入手机号]
Phone --> SendCode[发送验证码]
SendCode --> Slider{需要滑块?}
Slider -->|是| DoSlider[滑动验证]
DoSlider --> InputCode[输入验证码]
Slider -->|否| InputCode
InputCode --> Verify[验证登录]
Verify --> SaveCookie
CheckType -->|cookie| LoadCookie[加载已保存Cookie]
LoadCookie --> VerifyCookie{Cookie有效?}
VerifyCookie -->|是| SaveCookie
VerifyCookie -->|否| Fail[登录失败]
SaveCookie --> UpdateContext[更新浏览器上下文]
UpdateContext --> End([登录完成])
```
### 6.3 浏览器管理
```mermaid
flowchart LR
subgraph Mode["启动模式"]
standard["标准模式<br/>Playwright"]
cdp["CDP模式<br/>Chrome DevTools"]
end
subgraph Standard["标准模式流程"]
launch["chromium.launch()"]
context["new_context()"]
stealth["注入stealth.js"]
end
subgraph CDP["CDP模式流程"]
detect["检测浏览器路径"]
start["启动浏览器进程"]
connect["connect_over_cdp()"]
cdpContext["获取已有上下文"]
end
subgraph Features["特性"]
f1["用户数据持久化"]
f2["扩展和设置继承"]
f3["反检测能力增强"]
end
standard --> Standard
cdp --> CDP
CDP --> Features
```
### 6.4 缓存系统
```mermaid
classDiagram
class AbstractCache {
<<abstract>>
+get(key)* 获取缓存
+set(key, value, expire)* 设置缓存
+keys(pattern)* 获取所有键
}
class ExpiringLocalCache {
-_cache: Dict
-_expire_times: Dict
+get(key)
+set(key, value, expire_time)
+keys(pattern)
-_is_expired(key)
}
class RedisCache {
-_client: Redis
+get(key)
+set(key, value, expire_time)
+keys(pattern)
}
class CacheFactory {
+create_cache(type) AbstractCache
}
AbstractCache <|-- ExpiringLocalCache
AbstractCache <|-- RedisCache
CacheFactory --> AbstractCache
```
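A usage sketch of the cache layer, assuming a factory module under `cache/`; the `create_cache`, `get`, `set` and `keys` signatures come from the code shown earlier in this diff.

```python
# Usage sketch; the cache_factory import path is an assumption.
from cache.cache_factory import CacheFactory

cache = CacheFactory.create_cache(cache_type="memory")
cache.set("login_state", "ok", expire_time=60)  # expires after 60 seconds
print(cache.get("login_state"))                 # -> "ok" (None once expired)
print(cache.keys("*"))                          # all cached keys
```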
---
## 7. 数据模型
### 7.1 ORM模型关系
```mermaid
erDiagram
DouyinAweme {
int id PK
string aweme_id UK
string aweme_type
string title
string desc
int create_time
int liked_count
int collected_count
int comment_count
int share_count
string user_id FK
datetime add_ts
datetime last_modify_ts
}
DouyinAwemeComment {
int id PK
string comment_id UK
string aweme_id FK
string content
int create_time
int sub_comment_count
string user_id
datetime add_ts
datetime last_modify_ts
}
DyCreator {
int id PK
string user_id UK
string nickname
string avatar
string desc
int follower_count
int total_favorited
datetime add_ts
datetime last_modify_ts
}
DouyinAweme ||--o{ DouyinAwemeComment : "has"
DyCreator ||--o{ DouyinAweme : "creates"
```
### 7.2 各平台数据表
| 平台 | 内容表 | 评论表 | 创作者表 |
|------|--------|--------|---------|
| 抖音 | DouyinAweme | DouyinAwemeComment | DyCreator |
| 小红书 | XHSNote | XHSNoteComment | XHSCreator |
| 快手 | KuaishouVideo | KuaishouVideoComment | KsCreator |
| B站 | BilibiliVideo | BilibiliVideoComment | BilibiliUpInfo |
| 微博 | WeiboNote | WeiboNoteComment | WeiboCreator |
| 贴吧 | TiebaNote | TiebaNoteComment | - |
| 知乎 | ZhihuContent | ZhihuContentComment | ZhihuCreator |
---
## 8. 配置系统
### 8.1 核心配置项
```python
# config/base_config.py
# 平台选择
PLATFORM = "xhs" # xhs, dy, ks, bili, wb, tieba, zhihu
# 登录配置
LOGIN_TYPE = "qrcode" # qrcode, phone, cookie
SAVE_LOGIN_STATE = True
# 爬虫配置
CRAWLER_TYPE = "search" # search, detail, creator
KEYWORDS = "编程副业,编程兼职"
CRAWLER_MAX_NOTES_COUNT = 15
MAX_CONCURRENCY_NUM = 1
# 评论配置
ENABLE_GET_COMMENTS = True
ENABLE_GET_SUB_COMMENTS = False
CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES = 10
# 浏览器配置
HEADLESS = False
ENABLE_CDP_MODE = True
CDP_DEBUG_PORT = 9222
# 代理配置
ENABLE_IP_PROXY = False
IP_PROXY_PROVIDER = "kuaidaili"
IP_PROXY_POOL_COUNT = 2
# 存储配置
SAVE_DATA_OPTION = "json" # csv, db, json, sqlite, mongodb, excel
```
### 8.2 数据库配置
```python
# config/db_config.py
# MySQL
MYSQL_DB_HOST = "localhost"
MYSQL_DB_PORT = 3306
MYSQL_DB_NAME = "media_crawler"
# Redis
REDIS_DB_HOST = "127.0.0.1"
REDIS_DB_PORT = 6379
# MongoDB
MONGODB_HOST = "localhost"
MONGODB_PORT = 27017
# SQLite
SQLITE_DB_PATH = "database/sqlite_tables.db"
```
---
## 9. 工具模块
### 9.1 工具函数概览
| 模块 | 文件 | 主要功能 |
|------|------|---------|
| 应用运行器 | `app_runner.py` | 信号处理、优雅退出、清理管理 |
| 浏览器启动 | `browser_launcher.py` | 检测浏览器路径、启动浏览器进程 |
| CDP管理 | `cdp_browser.py` | CDP连接、浏览器上下文管理 |
| 爬虫工具 | `crawler_util.py` | 二维码识别、验证码处理、User-Agent |
| 文件写入 | `async_file_writer.py` | 异步CSV/JSON写入、词云生成 |
| 滑块验证 | `slider_util.py` | 滑动验证码破解 |
| 时间工具 | `time_util.py` | 时间戳转换、日期处理 |
### 9.2 应用运行管理
```mermaid
flowchart TB
Start([程序启动]) --> Run["run(app_main, app_cleanup)"]
Run --> Main["执行 app_main()"]
Main --> Running{运行中}
Running -->|正常完成| Cleanup1["执行 app_cleanup()"]
Running -->|SIGINT/SIGTERM| Signal["捕获信号"]
Signal --> First{第一次信号?}
First -->|是| Cleanup2["启动清理流程"]
First -->|否| Force["强制退出"]
Cleanup1 & Cleanup2 --> Cancel["取消其他任务"]
Cancel --> Wait["等待任务完成<br/>(超时15秒)"]
Wait --> End([程序退出])
Force --> End
```
---
## 10. 模块依赖关系
```mermaid
flowchart TB
subgraph Entry["入口层"]
main["main.py"]
config["config/"]
cmdarg["cmd_arg/"]
end
subgraph Core["核心层"]
base["base/base_crawler.py"]
platforms["media_platform/*/"]
end
subgraph Client["客户端层"]
client["*/client.py"]
login["*/login.py"]
end
subgraph Storage["存储层"]
store["store/"]
database["database/"]
end
subgraph Infra["基础设施"]
proxy["proxy/"]
cache["cache/"]
tools["tools/"]
end
subgraph External["外部依赖"]
playwright["Playwright"]
httpx["httpx"]
sqlalchemy["SQLAlchemy"]
motor["Motor/MongoDB"]
end
main --> config
main --> cmdarg
main --> Core
Core --> base
platforms --> base
platforms --> Client
client --> proxy
client --> httpx
login --> tools
platforms --> Storage
Storage --> sqlalchemy
Storage --> motor
client --> playwright
tools --> playwright
proxy --> cache
```
---
## 11. 扩展指南
### 11.1 添加新平台
1.`media_platform/` 下创建新目录
2. 实现以下核心文件:
- `core.py` - 继承 `AbstractCrawler`
- `client.py` - 继承 `AbstractApiClient``ProxyRefreshMixin`
- `login.py` - 继承 `AbstractLogin`
- `field.py` - 定义平台枚举
3.`store/` 下创建对应存储目录
4.`main.py``CrawlerFactory.CRAWLERS` 中注册
### 11.2 添加新存储方式
1.`store/` 下创建新的存储实现类
2. 继承 `AbstractStore` 基类
3. 实现 `store_content``store_comment``store_creator` 方法
4. 在各平台的 `StoreFactory.STORES` 中注册(最小骨架示例见下方)
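A minimal, illustrative store skeleton; the class name, parameter names and async signatures are assumptions, while the three methods come from the `AbstractStore` interface in section 5.1.

```python
# store/newplat/ — minimal, illustrative store skeleton (not project source);
# parameter names and async signatures are assumptions.
from base.base_crawler import AbstractStore

class NewPlatCsvStoreImplement(AbstractStore):
    async def store_content(self, content_item: dict) -> None:
        ...  # persist one content record

    async def store_comment(self, comment_item: dict) -> None:
        ...  # persist one comment record

    async def store_creator(self, creator: dict) -> None:
        ...  # persist one creator record

# Register it in the platform's factory, e.g.:
# NewPlatStoreFactory.STORES["csv"] = NewPlatCsvStoreImplement
```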
### 11.3 添加新代理提供商
1.`proxy/providers/` 下创建新的代理类
2. 继承 `BaseProxy` 基类
3. 实现 `get_proxy()` 方法
4. 在配置中注册
---
## 12. 快速参考
### 12.1 常用命令
```bash
# 启动爬虫
python main.py
# 指定平台
python main.py --platform xhs
# 指定登录方式
python main.py --lt qrcode
# 指定爬虫类型
python main.py --type search
```
### 12.2 关键文件路径
| 用途 | 文件路径 |
|------|---------|
| 程序入口 | `main.py` |
| 核心配置 | `config/base_config.py` |
| 数据库配置 | `config/db_config.py` |
| 爬虫基类 | `base/base_crawler.py` |
| ORM模型 | `database/models.py` |
| 代理池 | `proxy/proxy_ip_pool.py` |
| CDP浏览器 | `tools/cdp_browser.py` |
---
*文档生成时间: 2025-12-18*

148
main.py
View File

@@ -17,11 +17,20 @@
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
import sys
import io
# Force UTF-8 encoding for stdout/stderr to prevent encoding errors
# when outputting Chinese characters in non-UTF-8 terminals
if sys.stdout and hasattr(sys.stdout, 'buffer'):
if sys.stdout.encoding and sys.stdout.encoding.lower() != 'utf-8':
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8', errors='replace')
if sys.stderr and hasattr(sys.stderr, 'buffer'):
if sys.stderr.encoding and sys.stderr.encoding.lower() != 'utf-8':
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8', errors='replace')
import asyncio
import sys
import signal
from typing import Optional
from typing import Optional, Type
import cmd_arg
import config
@@ -39,7 +48,7 @@ from var import crawler_type_var
class CrawlerFactory:
CRAWLERS = {
CRAWLERS: dict[str, Type[AbstractCrawler]] = {
"xhs": XiaoHongShuCrawler,
"dy": DouYinCrawler,
"ks": KuaishouCrawler,
@@ -53,115 +62,96 @@ class CrawlerFactory:
def create_crawler(platform: str) -> AbstractCrawler:
crawler_class = CrawlerFactory.CRAWLERS.get(platform)
if not crawler_class:
raise ValueError(
"Invalid Media Platform Currently only supported xhs or dy or ks or bili ..."
)
supported = ", ".join(sorted(CrawlerFactory.CRAWLERS))
raise ValueError(f"Invalid media platform: {platform!r}. Supported: {supported}")
return crawler_class()
crawler: Optional[AbstractCrawler] = None
# persist-1<persist1@126.com>
# 原因:增加 --init_db 功能,用于数据库初始化。
# 副作用:无
# 回滚策略:还原此文件。
async def main():
# Init crawler
def _flush_excel_if_needed() -> None:
if config.SAVE_DATA_OPTION != "excel":
return
try:
from store.excel_store_base import ExcelStoreBase
ExcelStoreBase.flush_all()
print("[Main] Excel files saved successfully")
except Exception as e:
print(f"[Main] Error flushing Excel data: {e}")
async def _generate_wordcloud_if_needed() -> None:
if config.SAVE_DATA_OPTION != "json" or not config.ENABLE_GET_WORDCLOUD:
return
try:
file_writer = AsyncFileWriter(
platform=config.PLATFORM,
crawler_type=crawler_type_var.get(),
)
await file_writer.generate_wordcloud_from_comments()
except Exception as e:
print(f"[Main] Error generating wordcloud: {e}")
async def main() -> None:
global crawler
# parse cmd
args = await cmd_arg.parse_cmd()
# init db
if args.init_db:
await db.init_db(args.init_db)
print(f"Database {args.init_db} initialized successfully.")
return # Exit the main function cleanly
return
crawler = CrawlerFactory.create_crawler(platform=config.PLATFORM)
await crawler.start()
# Flush Excel data if using Excel export
if config.SAVE_DATA_OPTION == "excel":
try:
from store.excel_store_base import ExcelStoreBase
ExcelStoreBase.flush_all()
print("[Main] Excel files saved successfully")
except Exception as e:
print(f"[Main] Error flushing Excel data: {e}")
_flush_excel_if_needed()
# Generate wordcloud after crawling is complete
# Only for JSON save mode
if config.SAVE_DATA_OPTION == "json" and config.ENABLE_GET_WORDCLOUD:
try:
file_writer = AsyncFileWriter(
platform=config.PLATFORM,
crawler_type=crawler_type_var.get()
)
await file_writer.generate_wordcloud_from_comments()
except Exception as e:
print(f"Error generating wordcloud: {e}")
await _generate_wordcloud_if_needed()
async def async_cleanup():
"""异步清理函数用于处理CDP浏览器等异步资源"""
async def async_cleanup() -> None:
global crawler
if crawler:
# 检查并清理CDP浏览器
if hasattr(crawler, 'cdp_manager') and crawler.cdp_manager:
if getattr(crawler, "cdp_manager", None):
try:
await crawler.cdp_manager.cleanup(force=True) # 强制清理浏览器进程
await crawler.cdp_manager.cleanup(force=True)
except Exception as e:
# 只在非预期错误时打印
error_msg = str(e).lower()
if "closed" not in error_msg and "disconnected" not in error_msg:
print(f"[Main] 清理CDP浏览器时出错: {e}")
print(f"[Main] Error cleaning up CDP browser: {e}")
# 检查并清理标准浏览器上下文仅在非CDP模式下
elif hasattr(crawler, 'browser_context') and crawler.browser_context:
elif getattr(crawler, "browser_context", None):
try:
# 检查上下文是否仍然打开
if hasattr(crawler.browser_context, 'pages'):
await crawler.browser_context.close()
await crawler.browser_context.close()
except Exception as e:
# 只在非预期错误时打印
error_msg = str(e).lower()
if "closed" not in error_msg and "disconnected" not in error_msg:
print(f"[Main] 关闭浏览器上下文时出错: {e}")
print(f"[Main] Error closing browser context: {e}")
# 关闭数据库连接
if config.SAVE_DATA_OPTION in ["db", "sqlite"]:
if config.SAVE_DATA_OPTION in ("db", "sqlite"):
await db.close()
def cleanup():
"""同步清理函数"""
try:
# 创建新的事件循环来执行异步清理
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
loop.run_until_complete(async_cleanup())
loop.close()
except Exception as e:
print(f"[Main] 清理时出错: {e}")
def signal_handler(signum, _frame):
"""信号处理器处理Ctrl+C等中断信号"""
print(f"\n[Main] 收到中断信号 {signum},正在清理资源...")
cleanup()
sys.exit(0)
if __name__ == "__main__":
# 注册信号处理器
signal.signal(signal.SIGINT, signal_handler) # Ctrl+C
signal.signal(signal.SIGTERM, signal_handler) # 终止信号
from tools.app_runner import run
try:
asyncio.get_event_loop().run_until_complete(main())
except KeyboardInterrupt:
print("\n[Main] 收到键盘中断,正在清理资源...")
finally:
cleanup()
def _force_stop() -> None:
c = crawler
if not c:
return
cdp_manager = getattr(c, "cdp_manager", None)
launcher = getattr(cdp_manager, "launcher", None)
if not launcher:
return
try:
launcher.cleanup()
except Exception:
pass
run(main, async_cleanup, cleanup_timeout_seconds=15.0, on_first_interrupt=_force_stop)

View File

@@ -20,7 +20,7 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/2 18:44
# @Desc : bilibili 请求客户端
# @Desc : bilibili request client
import asyncio
import json
import random
@@ -47,7 +47,7 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
def __init__(
self,
timeout=60, # 若开启爬取媒体选项b 站的长视频需要更久的超时时间
timeout=60, # For media crawling, Bilibili long videos need a longer timeout
proxy=None,
*,
headers: Dict[str, str],
@@ -61,11 +61,11 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
self._host = "https://api.bilibili.com"
self.playwright_page = playwright_page
self.cookie_dict = cookie_dict
# 初始化代理池(来自 ProxyRefreshMixin
# Initialize proxy pool (from ProxyRefreshMixin)
self.init_proxy_pool(proxy_ip_pool)
async def request(self, method, url, **kwargs) -> Any:
# 每次请求前检测代理是否过期
# Check if proxy has expired before each request
await self._refresh_proxy_if_expired()
async with httpx.AsyncClient(proxy=self.proxy) as client:
@@ -82,8 +82,8 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
async def pre_request_data(self, req_data: Dict) -> Dict:
"""
发送请求进行请求参数签名
需要从 localStorage 拿 wbi_img_urls 这参数,值如下:
Send request to sign request parameters
Need to get wbi_img_urls parameter from localStorage, value as follows:
https://i0.hdslb.com/bfs/wbi/7cd084941338484aae1ad9425b84077c.png-https://i0.hdslb.com/bfs/wbi/4932caff0ff746eab6f01bf08b70ac45.png
:param req_data:
:return:
@@ -95,7 +95,7 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
async def get_wbi_keys(self) -> Tuple[str, str]:
"""
获取最新的 img_key sub_key
Get the latest img_key and sub_key
:return:
"""
local_storage = await self.playwright_page.evaluate("() => window.localStorage")
@@ -160,12 +160,12 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
) -> Dict:
"""
KuaiShou web search api
:param keyword: 搜索关键词
:param page: 分页参数具体第几页
:param page_size: 每一页参数的数量
:param order: 搜索结果排序,默认位综合排序
:param pubtime_begin_s: 发布时间开始时间戳
:param pubtime_end_s: 发布时间结束时间戳
:param keyword: Search keyword
:param page: Page number for pagination
:param page_size: Number of items per page
:param order: Sort order for search results, default is comprehensive sorting
:param pubtime_begin_s: Publish time start timestamp
:param pubtime_end_s: Publish time end timestamp
:return:
"""
uri = "/x/web-interface/wbi/search/type"
@@ -182,13 +182,13 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
async def get_video_info(self, aid: Union[int, None] = None, bvid: Union[str, None] = None) -> Dict:
"""
Bilibli web video detail api, aid bvid任选一个参数
:param aid: 稿件avid
:param bvid: 稿件bvid
Bilibili web video detail api, choose one parameter between aid and bvid
:param aid: Video aid
:param bvid: Video bvid
:return:
"""
if not aid and not bvid:
raise ValueError("请提供 aid bvid 中的至少一个参数")
raise ValueError("Please provide at least one parameter: aid or bvid")
uri = "/x/web-interface/view/detail"
params = dict()
@@ -201,12 +201,12 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
async def get_video_play_url(self, aid: int, cid: int) -> Dict:
"""
Bilibli web video play url api
:param aid: 稿件avid
:param aid: Video aid
:param cid: cid
:return:
"""
if not aid or not cid or aid <= 0 or cid <= 0:
raise ValueError("aid cid 必须存在")
raise ValueError("aid and cid must exist")
uri = "/x/player/wbi/playurl"
qn_value = getattr(config, "BILI_QN", 80)
params = {
@@ -233,7 +233,7 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
)
return None
except httpx.HTTPError as exc: # some wrong when call httpx.request method, such as connection error, client error, server error or response status code is not 2xx
utils.logger.error(f"[BilibiliClient.get_video_media] {exc.__class__.__name__} for {exc.request.url} - {exc}") # 保留原始异常类型名称,以便开发者调试
utils.logger.error(f"[BilibiliClient.get_video_media] {exc.__class__.__name__} for {exc.request.url} - {exc}") # Keep original exception type name for developer debugging
return None
async def get_video_comments(
@@ -243,9 +243,9 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
next: int = 0,
) -> Dict:
"""get video comments
:param video_id: 视频 ID
:param order_mode: 排序方式
:param next: 评论页选择
:param video_id: Video ID
:param order_mode: Sort order
:param next: Comment page selection
:return:
"""
uri = "/x/v2/reply/wbi/main"
@@ -266,7 +266,7 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
:param crawl_interval:
:param is_fetch_sub_comments:
:param callback:
max_count: 一次笔记爬取的最大评论数量
max_count: Maximum number of comments to crawl per note
:return:
"""
@@ -299,7 +299,7 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
comment_list: List[Dict] = comments_res.get("replies", [])
# 检查 is_end next 是否存在
# Check if is_end and next exist
if "is_end" not in cursor_info or "next" not in cursor_info:
utils.logger.warning(f"[BilibiliClient.get_video_all_comments] 'is_end' or 'next' not in cursor for video_id: {video_id}. Assuming end of comments.")
is_end = True
@@ -317,7 +317,7 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
{await self.get_video_all_level_two_comments(video_id, comment_id, CommentOrderType.DEFAULT, 10, crawl_interval, callback)}
if len(result) + len(comment_list) > max_count:
comment_list = comment_list[:max_count - len(result)]
if callback: # 如果有回调函数,就执行回调函数
if callback: # If there is a callback function, execute it
await callback(video_id, comment_list)
await asyncio.sleep(crawl_interval)
if not is_fetch_sub_comments:
@@ -336,10 +336,10 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
) -> Dict:
"""
get video all level two comments for a level one comment
:param video_id: 视频 ID
:param level_one_comment_id: 一级评论 ID
:param video_id: Video ID
:param level_one_comment_id: Level one comment ID
:param order_mode:
:param ps: 一页评论数
:param ps: Number of comments per page
:param crawl_interval:
:param callback:
:return:
@@ -349,7 +349,7 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
while True:
result = await self.get_video_level_two_comments(video_id, level_one_comment_id, pn, ps, order_mode)
comment_list: List[Dict] = result.get("replies", [])
if callback: # 如果有回调函数,就执行回调函数
if callback: # If there is a callback function, execute it
await callback(video_id, comment_list)
await asyncio.sleep(crawl_interval)
if (int(result["page"]["count"]) <= pn * ps):
@@ -366,9 +366,9 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
order_mode: CommentOrderType,
) -> Dict:
"""get video level two comments
:param video_id: 视频 ID
:param level_one_comment_id: 一级评论 ID
:param order_mode: 排序方式
:param video_id: Video ID
:param level_one_comment_id: Level one comment ID
:param order_mode: Sort order
:return:
"""
@@ -386,10 +386,10 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
async def get_creator_videos(self, creator_id: str, pn: int, ps: int = 30, order_mode: SearchOrderType = SearchOrderType.LAST_PUBLISH) -> Dict:
"""get all videos for a creator
:param creator_id: 创作者 ID
:param pn: 页数
:param ps: 一页视频数
:param order_mode: 排序方式
:param creator_id: Creator ID
:param pn: Page number
:param ps: Number of videos per page
:param order_mode: Sort order
:return:
"""
@@ -405,7 +405,7 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
async def get_creator_info(self, creator_id: int) -> Dict:
"""
get creator info
:param creator_id: 作者 ID
:param creator_id: Creator ID
"""
uri = "/x/space/wbi/acc/info"
post_data = {
@@ -421,9 +421,9 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
) -> Dict:
"""
get creator fans
:param creator_id: 创作者 ID
:param pn: 开始页数
:param ps: 每页数量
:param creator_id: Creator ID
:param pn: Start page number
:param ps: Number of items per page
:return:
"""
uri = "/x/relation/fans"
@@ -443,9 +443,9 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
) -> Dict:
"""
get creator followings
:param creator_id: 创作者 ID
:param pn: 开始页数
:param ps: 每页数量
:param creator_id: Creator ID
:param pn: Start page number
:param ps: Number of items per page
:return:
"""
uri = "/x/relation/followings"
@@ -460,8 +460,8 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
async def get_creator_dynamics(self, creator_id: int, offset: str = ""):
"""
get creator comments
:param creator_id: 创作者 ID
:param offset: 发送请求所需参数
:param creator_id: Creator ID
:param offset: Parameter required for sending request
:return:
"""
uri = "/x/polymer/web-dynamic/v1/feed/space"
@@ -485,9 +485,9 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
:param creator_info:
:param crawl_interval:
:param callback:
:param max_count: 一个up主爬取的最大粉丝数量
:param max_count: Maximum number of fans to crawl for a creator
:return: up主粉丝数列表
:return: List of creator fans
"""
creator_id = creator_info["id"]
result = []
@@ -499,7 +499,7 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
pn += 1
if len(result) + len(fans_list) > max_count:
fans_list = fans_list[:max_count - len(result)]
if callback: # 如果有回调函数,就执行回调函数
if callback: # If there is a callback function, execute it
await callback(creator_info, fans_list)
await asyncio.sleep(crawl_interval)
if not fans_list:
@@ -519,9 +519,9 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
:param creator_info:
:param crawl_interval:
:param callback:
:param max_count: 一个up主爬取的最大关注者数量
:param max_count: Maximum number of followings to crawl for a creator
:return: up主关注者列表
:return: List of creator followings
"""
creator_id = creator_info["id"]
result = []
@@ -533,7 +533,7 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
pn += 1
if len(result) + len(followings_list) > max_count:
followings_list = followings_list[:max_count - len(result)]
if callback: # 如果有回调函数,就执行回调函数
if callback: # If there is a callback function, execute it
await callback(creator_info, followings_list)
await asyncio.sleep(crawl_interval)
if not followings_list:
@@ -553,9 +553,9 @@ class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
:param creator_info:
:param crawl_interval:
:param callback:
:param max_count: 一个up主爬取的最大动态数量
:param max_count: Maximum number of dynamics to crawl for a creator
:return: up主关注者列表
:return: List of creator dynamics
"""
creator_id = creator_info["id"]
result = []


@@ -20,7 +20,7 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/2 18:44
# @Desc : B站爬虫
# @Desc : Bilibili Crawler
import asyncio
import os
@@ -64,7 +64,7 @@ class BilibiliCrawler(AbstractCrawler):
self.index_url = "https://www.bilibili.com"
self.user_agent = utils.get_user_agent()
self.cdp_manager = None
self.ip_proxy_pool = None # 代理IP池用于代理自动刷新
self.ip_proxy_pool = None # Proxy IP pool for automatic proxy refresh
async def start(self):
playwright_proxy_format, httpx_proxy_format = None, None
@@ -74,9 +74,9 @@ class BilibiliCrawler(AbstractCrawler):
playwright_proxy_format, httpx_proxy_format = utils.format_proxy_info(ip_proxy_info)
async with async_playwright() as playwright:
# 根据配置选择启动模式
# Choose launch mode based on configuration
if config.ENABLE_CDP_MODE:
utils.logger.info("[BilibiliCrawler] 使用CDP模式启动浏览器")
utils.logger.info("[BilibiliCrawler] Launching browser using CDP mode")
self.browser_context = await self.launch_browser_with_cdp(
playwright,
playwright_proxy_format,
@@ -84,7 +84,7 @@ class BilibiliCrawler(AbstractCrawler):
headless=config.CDP_HEADLESS,
)
else:
utils.logger.info("[BilibiliCrawler] 使用标准模式启动浏览器")
utils.logger.info("[BilibiliCrawler] Launching browser using standard mode")
# Launch a browser context.
chromium = playwright.chromium
self.browser_context = await self.launch_browser(chromium, None, self.user_agent, headless=config.HEADLESS)
@@ -149,31 +149,31 @@ class BilibiliCrawler(AbstractCrawler):
end: str = config.END_DAY,
) -> Tuple[str, str]:
"""
获取 bilibili 作品发布日期起始时间戳 pubtime_begin_s 与发布日期结束时间戳 pubtime_end_s
Get bilibili publish start timestamp pubtime_begin_s and publish end timestamp pubtime_end_s
---
:param start: 发布日期起始时间,YYYY-MM-DD
:param end: 发布日期结束时间,YYYY-MM-DD
:param start: Publish date start time, YYYY-MM-DD
:param end: Publish date end time, YYYY-MM-DD
Note
---
- 搜索的时间范围为 start end,包含 start end
- 若要搜索同一天的内容,为了包含 start 当天的搜索内容,则 pubtime_end_s 的值应该为 pubtime_begin_s 的值加上一天再减去一秒,即 start 当天的最后一秒
- 如仅搜索 2024-01-05 的内容,pubtime_begin_s = 1704384000pubtime_end_s = 1704470399
转换为可读的 datetime 对象:pubtime_begin_s = datetime.datetime(2024, 1, 5, 0, 0)pubtime_end_s = datetime.datetime(2024, 1, 5, 23, 59, 59)
- 若要搜索 start end 的内容,为了包含 end 当天的搜索内容,则 pubtime_end_s 的值应该为 pubtime_end_s 的值加上一天再减去一秒,即 end 当天的最后一秒
- 如搜索 2024-01-05 - 2024-01-06 的内容,pubtime_begin_s = 1704384000pubtime_end_s = 1704556799
转换为可读的 datetime 对象:pubtime_begin_s = datetime.datetime(2024, 1, 5, 0, 0)pubtime_end_s = datetime.datetime(2024, 1, 6, 23, 59, 59)
- Search time range is from start to end, including both start and end
- To search content from the same day, to include search content from that day, pubtime_end_s should be pubtime_begin_s plus one day minus one second, i.e., the last second of start day
- For example, searching only 2024-01-05 content, pubtime_begin_s = 1704384000, pubtime_end_s = 1704470399
Converted to readable datetime objects: pubtime_begin_s = datetime.datetime(2024, 1, 5, 0, 0), pubtime_end_s = datetime.datetime(2024, 1, 5, 23, 59, 59)
- To search content from start to end, to include search content from end day, pubtime_end_s should be pubtime_end_s plus one day minus one second, i.e., the last second of end day
- For example, searching 2024-01-05 - 2024-01-06 content, pubtime_begin_s = 1704384000, pubtime_end_s = 1704556799
Converted to readable datetime objects: pubtime_begin_s = datetime.datetime(2024, 1, 5, 0, 0), pubtime_end_s = datetime.datetime(2024, 1, 6, 23, 59, 59)
"""
# 转换 start end datetime 对象
# Convert start and end to datetime objects
start_day: datetime = datetime.strptime(start, "%Y-%m-%d")
end_day: datetime = datetime.strptime(end, "%Y-%m-%d")
if start_day > end_day:
raise ValueError("Wrong time range, please check your start and end argument, to ensure that the start cannot exceed end")
elif start_day == end_day: # 搜索同一天的内容
end_day = (start_day + timedelta(days=1) - timedelta(seconds=1)) # 则将 end_day 设置为 start_day + 1 day - 1 second
else: # 搜索 start end
end_day = (end_day + timedelta(days=1) - timedelta(seconds=1)) # 则将 end_day 设置为 end_day + 1 day - 1 second
# 将其重新转换为时间戳
elif start_day == end_day: # Searching content from the same day
end_day = (start_day + timedelta(days=1) - timedelta(seconds=1)) # Set end_day to start_day + 1 day - 1 second
else: # Searching from start to end
end_day = (end_day + timedelta(days=1) - timedelta(seconds=1)) # Set end_day to end_day + 1 day - 1 second
# Convert back to timestamps
return str(int(start_day.timestamp())), str(int(end_day.timestamp()))
async def search_by_keywords(self):
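A minimal standalone sketch of the publish-date range handling documented above (hypothetical helper name; the resulting timestamps depend on the local timezone, and the documented values 1704384000 / 1704470399 assume UTC+8):

from datetime import datetime, timedelta

def publish_time_range(start: str, end: str) -> tuple:
    # Sketch of the start/end handling described in the docstring above.
    start_day = datetime.strptime(start, "%Y-%m-%d")
    end_day = datetime.strptime(end, "%Y-%m-%d")
    if start_day > end_day:
        raise ValueError("start cannot exceed end")
    # Include the whole end day by moving to its last second; when start == end
    # this coincides with the equal-day branch in the hunk above.
    end_day = end_day + timedelta(days=1) - timedelta(seconds=1)
    return str(int(start_day.timestamp())), str(int(end_day.timestamp()))

# On a machine set to UTC+8, publish_time_range("2024-01-05", "2024-01-05")
# returns ("1704384000", "1704470399"), matching the documented example.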
@@ -203,8 +203,8 @@ class BilibiliCrawler(AbstractCrawler):
page=page,
page_size=bili_limit_count,
order=SearchOrderType.DEFAULT,
pubtime_begin_s=0, # 作品发布日期起始时间戳
pubtime_end_s=0, # 作品发布日期结束日期时间戳
pubtime_begin_s=0, # Publish date start timestamp
pubtime_end_s=0, # Publish date end timestamp
)
video_list: List[Dict] = videos_res.get("result")
@@ -508,7 +508,7 @@ class BilibiliCrawler(AbstractCrawler):
"height": 1080
},
user_agent=user_agent,
channel="chrome", # 使用系统的Chrome稳定版
channel="chrome", # Use system's stable Chrome version
)
return browser_context
else:
@@ -525,7 +525,7 @@ class BilibiliCrawler(AbstractCrawler):
headless: bool = True,
) -> BrowserContext:
"""
使用CDP模式启动浏览器
Launch browser using CDP mode
"""
try:
self.cdp_manager = CDPBrowserManager()
@@ -536,22 +536,22 @@ class BilibiliCrawler(AbstractCrawler):
headless=headless,
)
# 显示浏览器信息
# Display browser information
browser_info = await self.cdp_manager.get_browser_info()
utils.logger.info(f"[BilibiliCrawler] CDP浏览器信息: {browser_info}")
utils.logger.info(f"[BilibiliCrawler] CDP browser info: {browser_info}")
return browser_context
except Exception as e:
utils.logger.error(f"[BilibiliCrawler] CDP模式启动失败,回退到标准模式: {e}")
# 回退到标准模式
utils.logger.error(f"[BilibiliCrawler] CDP mode launch failed, fallback to standard mode: {e}")
# Fallback to standard mode
chromium = playwright.chromium
return await self.launch_browser(chromium, playwright_proxy, user_agent, headless)
async def close(self):
"""Close browser context"""
try:
# 如果使用CDP模式需要特殊处理
# If using CDP mode, special handling is required
if self.cdp_manager:
await self.cdp_manager.cleanup()
self.cdp_manager = None


@@ -27,28 +27,28 @@ from enum import Enum
class SearchOrderType(Enum):
# 综合排序
# Comprehensive sorting
DEFAULT = ""
# 最多点击
# Most clicks
MOST_CLICK = "click"
# 最新发布
# Latest published
LAST_PUBLISH = "pubdate"
# 最多弹幕
# Most danmu (comments)
MOST_DANMU = "dm"
# 最多收藏
# Most bookmarks
MOST_MARK = "stow"
class CommentOrderType(Enum):
# 仅按热度
# By popularity only
DEFAULT = 0
# 按热度+按时间
# By popularity + time
MIXED = 1
# 按时间
# By time
TIME = 2


@@ -21,8 +21,8 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/2 23:26
# @Desc : bilibili 请求参数签名
# 逆向实现参考:https://socialsisteryi.github.io/bilibili-API-collect/docs/misc/sign/wbi.html#wbi%E7%AD%BE%E5%90%8D%E7%AE%97%E6%B3%95
# @Desc : bilibili request parameter signing
# Reverse engineering implementation reference: https://socialsisteryi.github.io/bilibili-API-collect/docs/misc/sign/wbi.html#wbi%E7%AD%BE%E5%90%8D%E7%AE%97%E6%B3%95
import re
import urllib.parse
from hashlib import md5
@@ -45,7 +45,7 @@ class BilibiliSign:
def get_salt(self) -> str:
"""
获取加盐的 key
Get the salted key
:return:
"""
salt = ""
@@ -56,8 +56,8 @@ class BilibiliSign:
def sign(self, req_data: Dict) -> Dict:
"""
请求参数中加上当前时间戳对请求参数中的key进行字典序排序
再将请求参数进行 url 编码集合 salt 进行 md5 就可以生成w_rid参数了
Add current timestamp to request parameters, sort keys in dictionary order,
then URL encode the parameters and combine with salt to generate md5 for w_rid parameter
:param req_data:
:return:
"""
@@ -65,35 +65,35 @@ class BilibiliSign:
req_data.update({"wts": current_ts})
req_data = dict(sorted(req_data.items()))
req_data = {
# 过滤 value 中的 "!'()*" 字符
# Filter "!'()*" characters from values
k: ''.join(filter(lambda ch: ch not in "!'()*", str(v)))
for k, v
in req_data.items()
}
query = urllib.parse.urlencode(req_data)
salt = self.get_salt()
wbi_sign = md5((query + salt).encode()).hexdigest() # 计算 w_rid
wbi_sign = md5((query + salt).encode()).hexdigest() # Calculate w_rid
req_data['w_rid'] = wbi_sign
return req_data
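The sign() flow above in standalone form, as a sketch only: salt is assumed to be the mixed key that get_salt() derives from img_key/sub_key per the referenced wbi documentation.

import time
import urllib.parse
from hashlib import md5

def wbi_sign_params(req_data: dict, salt: str) -> dict:
    # Add the current timestamp, then sort keys in dictionary order.
    req_data = dict(req_data, wts=int(time.time()))
    req_data = dict(sorted(req_data.items()))
    # Filter "!'()*" characters from values, as in the hunk above.
    req_data = {
        k: "".join(ch for ch in str(v) if ch not in "!'()*")
        for k, v in req_data.items()
    }
    # URL-encode, append the salt, and take the md5 hex digest as w_rid.
    query = urllib.parse.urlencode(req_data)
    req_data["w_rid"] = md5((query + salt).encode()).hexdigest()
    return req_data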
def parse_video_info_from_url(url: str) -> VideoUrlInfo:
"""
从B站视频URL中解析出视频ID
Parse video ID from Bilibili video URL
Args:
url: B站视频链接
url: Bilibili video link
- https://www.bilibili.com/video/BV1dwuKzmE26/?spm_id_from=333.1387.homepage.video_card.click
- https://www.bilibili.com/video/BV1d54y1g7db
- BV1d54y1g7db (直接传入BV号)
- BV1d54y1g7db (directly pass BV number)
Returns:
VideoUrlInfo: 包含视频ID的对象
VideoUrlInfo: Object containing video ID
"""
# 如果传入的已经是BV号,直接返回
# If the input is already a BV number, return directly
if url.startswith("BV"):
return VideoUrlInfo(video_id=url)
# 使用正则表达式提取BV号
# 匹配 /video/BV... /video/av... 格式
# Use regex to extract BV number
# Match /video/BV... or /video/av... format
bv_pattern = r'/video/(BV[a-zA-Z0-9]+)'
match = re.search(bv_pattern, url)
@@ -101,26 +101,26 @@ def parse_video_info_from_url(url: str) -> VideoUrlInfo:
video_id = match.group(1)
return VideoUrlInfo(video_id=video_id)
raise ValueError(f"无法从URL中解析出视频ID: {url}")
raise ValueError(f"Unable to parse video ID from URL: {url}")
def parse_creator_info_from_url(url: str) -> CreatorUrlInfo:
"""
从B站创作者空间URL中解析出创作者ID
Parse creator ID from Bilibili creator space URL
Args:
url: B站创作者空间链接
url: Bilibili creator space link
- https://space.bilibili.com/434377496?spm_id_from=333.1007.0.0
- https://space.bilibili.com/20813884
- 434377496 (直接传入UID)
- 434377496 (directly pass UID)
Returns:
CreatorUrlInfo: 包含创作者ID的对象
CreatorUrlInfo: Object containing creator ID
"""
# 如果传入的已经是纯数字ID,直接返回
# If the input is already a numeric ID, return directly
if url.isdigit():
return CreatorUrlInfo(creator_id=url)
# 使用正则表达式提取UID
# 匹配 /space.bilibili.com/数字 格式
# Use regex to extract UID
# Match /space.bilibili.com/number format
uid_pattern = r'space\.bilibili\.com/(\d+)'
match = re.search(uid_pattern, url)
@@ -128,20 +128,20 @@ def parse_creator_info_from_url(url: str) -> CreatorUrlInfo:
creator_id = match.group(1)
return CreatorUrlInfo(creator_id=creator_id)
raise ValueError(f"无法从URL中解析出创作者ID: {url}")
raise ValueError(f"Unable to parse creator ID from URL: {url}")
if __name__ == '__main__':
# 测试视频URL解析
# Test video URL parsing
video_url1 = "https://www.bilibili.com/video/BV1dwuKzmE26/?spm_id_from=333.1387.homepage.video_card.click"
video_url2 = "BV1d54y1g7db"
print("视频URL解析测试:")
print("Video URL parsing test:")
print(f"URL1: {video_url1} -> {parse_video_info_from_url(video_url1)}")
print(f"URL2: {video_url2} -> {parse_video_info_from_url(video_url2)}")
# 测试创作者URL解析
# Test creator URL parsing
creator_url1 = "https://space.bilibili.com/434377496?spm_id_from=333.1007.0.0"
creator_url2 = "20813884"
print("\n创作者URL解析测试:")
print("\nCreator URL parsing test:")
print(f"URL1: {creator_url1} -> {parse_creator_info_from_url(creator_url1)}")
print(f"URL2: {creator_url2} -> {parse_creator_info_from_url(creator_url2)}")


@@ -21,7 +21,7 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/2 18:44
# @Desc : bilibli登录实现类
# @Desc : bilibili login implementation class
import asyncio
import functools


@@ -151,19 +151,24 @@ class DouYinCrawler(AbstractCrawler):
utils.logger.error(f"[DouYinCrawler.search] search douyin keyword: {keyword} failed账号也许被风控了。")
break
dy_search_id = posts_res.get("extra", {}).get("logid", "")
page_aweme_list = []
for post_item in posts_res.get("data"):
try:
aweme_info: Dict = (post_item.get("aweme_info") or post_item.get("aweme_mix_info", {}).get("mix_items")[0])
except TypeError:
continue
aweme_list.append(aweme_info.get("aweme_id", ""))
page_aweme_list.append(aweme_info.get("aweme_id", ""))
await douyin_store.update_douyin_aweme(aweme_item=aweme_info)
await self.get_aweme_media(aweme_item=aweme_info)
# Batch get note comments for the current page
await self.batch_get_note_comments(page_aweme_list)
# Sleep after each page navigation
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[DouYinCrawler.search] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after page {page-1}")
utils.logger.info(f"[DouYinCrawler.search] keyword:{keyword}, aweme_list:{aweme_list}")
await self.batch_get_note_comments(aweme_list)
async def get_specified_awemes(self):
"""Get the information and comments of the specified post from URLs or IDs"""


@@ -23,21 +23,21 @@ from enum import Enum
class SearchChannelType(Enum):
"""search channel type"""
GENERAL = "aweme_general" # 综合
VIDEO = "aweme_video_web" # 视频
USER = "aweme_user_web" # 用户
LIVE = "aweme_live" # 直播
GENERAL = "aweme_general" # General
VIDEO = "aweme_video_web" # Video
USER = "aweme_user_web" # User
LIVE = "aweme_live" # Live
class SearchSortType(Enum):
"""search sort type"""
GENERAL = 0 # 综合排序
MOST_LIKE = 1 # 最多点赞
LATEST = 2 # 最新发布
GENERAL = 0 # Comprehensive sorting
MOST_LIKE = 1 # Most likes
LATEST = 2 # Latest published
class PublishTimeType(Enum):
"""publish time type"""
UNLIMITED = 0 # 不限
ONE_DAY = 1 # 一天内
ONE_WEEK = 7 # 一周内
SIX_MONTH = 180 # 半年内
UNLIMITED = 0 # Unlimited
ONE_DAY = 1 # Within one day
ONE_WEEK = 7 # Within one week
SIX_MONTH = 180 # Within six months


@@ -22,7 +22,7 @@
# @Author : relakkes@gmail.com
# @Name : 程序员阿江-Relakkes
# @Time : 2024/6/10 02:24
# @Desc : 获取 a_bogus 参数, 学习交流使用,请勿用作商业用途,侵权联系作者删除
# @Desc : Get a_bogus parameter, for learning and communication only, do not use for commercial purposes, contact author to delete if infringement
import random
import re
@@ -38,7 +38,7 @@ douyin_sign_obj = execjs.compile(open('libs/douyin.js', encoding='utf-8-sig').re
def get_web_id():
"""
生成随机的webid
Generate random webid
Returns:
"""
@@ -60,13 +60,13 @@ def get_web_id():
async def get_a_bogus(url: str, params: str, post_data: dict, user_agent: str, page: Page = None):
"""
获取 a_bogus 参数, 目前不支持post请求类型的签名
Get a_bogus parameter, currently does not support POST request type signature
"""
return get_a_bogus_from_js(url, params, user_agent)
def get_a_bogus_from_js(url: str, params: str, user_agent: str):
"""
通过js获取 a_bogus 参数
Get a_bogus parameter through js
Args:
url:
params:
@@ -82,10 +82,10 @@ def get_a_bogus_from_js(url: str, params: str, user_agent: str):
async def get_a_bogus_from_playright(params: str, post_data: dict, user_agent: str, page: Page):
async def get_a_bogus_from_playwright(params: str, post_data: dict, user_agent: str, page: Page):
"""
通过playright获取 a_bogus 参数
playwright版本已失效
Get a_bogus parameter through playwright
playwright version is deprecated
Returns:
"""
@@ -100,73 +100,73 @@ async def get_a_bogus_from_playright(params: str, post_data: dict, user_agent: s
def parse_video_info_from_url(url: str) -> VideoUrlInfo:
"""
从抖音视频URL中解析出视频ID
支持以下格式:
1. 普通视频链接: https://www.douyin.com/video/7525082444551310602
2. 带modal_id参数的链接:
Parse video ID from Douyin video URL
Supports the following formats:
1. Normal video link: https://www.douyin.com/video/7525082444551310602
2. Link with modal_id parameter:
- https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?modal_id=7525082444551310602
- https://www.douyin.com/root/search/python?modal_id=7471165520058862848
3. 短链接: https://v.douyin.com/iF12345ABC/ (需要client解析)
4. ID: 7525082444551310602
3. Short link: https://v.douyin.com/iF12345ABC/ (requires client parsing)
4. Pure ID: 7525082444551310602
Args:
url: 抖音视频链接或ID
url: Douyin video link or ID
Returns:
VideoUrlInfo: 包含视频ID的对象
VideoUrlInfo: Object containing video ID
"""
# 如果是纯数字ID,直接返回
# If it's a pure numeric ID, return directly
if url.isdigit():
return VideoUrlInfo(aweme_id=url, url_type="normal")
# 检查是否是短链接 (v.douyin.com)
# Check if it's a short link (v.douyin.com)
if "v.douyin.com" in url or url.startswith("http") and len(url) < 50 and "video" not in url:
return VideoUrlInfo(aweme_id="", url_type="short") # 需要通过client解析
return VideoUrlInfo(aweme_id="", url_type="short") # Requires client parsing
# 尝试从URL参数中提取modal_id
# Try to extract modal_id from URL parameters
params = extract_url_params_to_dict(url)
modal_id = params.get("modal_id")
if modal_id:
return VideoUrlInfo(aweme_id=modal_id, url_type="modal")
# 从标准视频URL中提取ID: /video/数字
# Extract ID from standard video URL: /video/number
video_pattern = r'/video/(\d+)'
match = re.search(video_pattern, url)
if match:
aweme_id = match.group(1)
return VideoUrlInfo(aweme_id=aweme_id, url_type="normal")
raise ValueError(f"无法从URL中解析出视频ID: {url}")
raise ValueError(f"Unable to parse video ID from URL: {url}")
def parse_creator_info_from_url(url: str) -> CreatorUrlInfo:
"""
从抖音创作者主页URL中解析出创作者ID (sec_user_id)
支持以下格式:
1. 创作者主页: https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main
2. ID: MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE
Parse creator ID (sec_user_id) from Douyin creator homepage URL
Supports the following formats:
1. Creator homepage: https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main
2. Pure ID: MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE
Args:
url: 抖音创作者主页链接或sec_user_id
url: Douyin creator homepage link or sec_user_id
Returns:
CreatorUrlInfo: 包含创作者ID的对象
CreatorUrlInfo: Object containing creator ID
"""
# 如果是纯ID格式(通常以MS4wLjABAAAA开头),直接返回
# If it's a pure ID format (usually starts with MS4wLjABAAAA), return directly
if url.startswith("MS4wLjABAAAA") or (not url.startswith("http") and "douyin.com" not in url):
return CreatorUrlInfo(sec_user_id=url)
# 从创作者主页URL中提取sec_user_id: /user/xxx
# Extract sec_user_id from creator homepage URL: /user/xxx
user_pattern = r'/user/([^/?]+)'
match = re.search(user_pattern, url)
if match:
sec_user_id = match.group(1)
return CreatorUrlInfo(sec_user_id=sec_user_id)
raise ValueError(f"无法从URL中解析出创作者ID: {url}")
raise ValueError(f"Unable to parse creator ID from URL: {url}")
if __name__ == '__main__':
# 测试视频URL解析
print("=== 视频URL解析测试 ===")
# Test video URL parsing
print("=== Video URL Parsing Test ===")
test_urls = [
"https://www.douyin.com/video/7525082444551310602",
"https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main&modal_id=7525082444551310602",
@@ -177,13 +177,13 @@ if __name__ == '__main__':
try:
result = parse_video_info_from_url(url)
print(f"✓ URL: {url[:80]}...")
print(f" 结果: {result}\n")
print(f" Result: {result}\n")
except Exception as e:
print(f"✗ URL: {url}")
print(f" 错误: {e}\n")
print(f" Error: {e}\n")
# 测试创作者URL解析
print("=== 创作者URL解析测试 ===")
# Test creator URL parsing
print("=== Creator URL Parsing Test ===")
test_creator_urls = [
"https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main",
"MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE",
@@ -192,7 +192,7 @@ if __name__ == '__main__':
try:
result = parse_creator_info_from_url(url)
print(f"✓ URL: {url[:80]}...")
print(f" 结果: {result}\n")
print(f" Result: {result}\n")
except Exception as e:
print(f"✗ URL: {url}")
print(f" 错误: {e}\n")
print(f" Error: {e}\n")


@@ -53,7 +53,7 @@ class DouYinLogin(AbstractLogin):
async def begin(self):
"""
Start login douyin website
滑块中间页面的验证准确率不太OK... 如果没有特俗要求建议不开抖音登录或者使用cookies登录
The verification accuracy of the slider verification is not very good... If there are no special requirements, it is recommended not to use Douyin login, or use cookie login
"""
# popup login dialog
@@ -69,7 +69,7 @@ class DouYinLogin(AbstractLogin):
else:
raise ValueError("[DouYinLogin.begin] Invalid Login Type Currently only supported qrcode or phone or cookie ...")
# 如果页面重定向到滑动验证码页面,需要再次滑动滑块
# If the page redirects to the slider verification page, need to slide again
await asyncio.sleep(6)
current_page_title = await self.context_page.title()
if "验证码中间页" in current_page_title:
@@ -147,10 +147,10 @@ class DouYinLogin(AbstractLogin):
send_sms_code_btn = self.context_page.locator("xpath=//span[text() = '获取验证码']")
await send_sms_code_btn.click()
# 检查是否有滑动验证码
# Check if there is slider verification
await self.check_page_display_slider(move_step=10, slider_level="easy")
cache_client = CacheFactory.create_cache(config.CACHE_TYPE_MEMORY)
max_get_sms_code_time = 60 * 2 # 最长获取验证码的时间为2分钟
max_get_sms_code_time = 60 * 2 # Maximum time to get verification code is 2 minutes
while max_get_sms_code_time > 0:
utils.logger.info(f"[DouYinLogin.login_by_mobile] get douyin sms code from redis remaining time {max_get_sms_code_time}s ...")
await asyncio.sleep(1)
@@ -164,20 +164,20 @@ class DouYinLogin(AbstractLogin):
await sms_code_input_ele.fill(value=sms_code_value.decode())
await asyncio.sleep(0.5)
submit_btn_ele = self.context_page.locator("xpath=//button[@class='web-login-button']")
await submit_btn_ele.click() # 点击登录
# todo ... 应该还需要检查验证码的正确性有可能输入的验证码不正确
await submit_btn_ele.click() # Click login
# todo ... should also check the correctness of the verification code, it may be incorrect
break
async def check_page_display_slider(self, move_step: int = 10, slider_level: str = "easy"):
"""
检查页面是否出现滑动验证码
Check if slider verification appears on the page
:return:
"""
# 等待滑动验证码的出现
# Wait for slider verification to appear
back_selector = "#captcha-verify-image"
try:
await self.context_page.wait_for_selector(selector=back_selector, state="visible", timeout=30 * 1000)
except PlaywrightTimeoutError: # 没有滑动验证码,直接返回
except PlaywrightTimeoutError: # No slider verification, return directly
return
gap_selector = 'xpath=//*[@id="captcha_container"]/div/div[2]/img[2]'
@@ -191,16 +191,16 @@ class DouYinLogin(AbstractLogin):
await self.move_slider(back_selector, gap_selector, move_step, slider_level)
await asyncio.sleep(1)
# 如果滑块滑动慢了,或者验证失败了,会提示操作过慢,这里点一下刷新按钮
# If the slider is too slow or verification failed, it will prompt "操作过慢", click the refresh button here
page_content = await self.context_page.content()
if "操作过慢" in page_content or "提示重新操作" in page_content:
utils.logger.info("[DouYinLogin.check_page_display_slider] slider verify failed, retry ...")
await self.context_page.click(selector="//a[contains(@class, 'secsdk_captcha_refresh')]")
continue
# 滑动成功后,等待滑块消失
# After successful sliding, wait for the slider to disappear
await self.context_page.wait_for_selector(selector=back_selector, state="hidden", timeout=1000)
# 如果滑块消失了,说明验证成功了,跳出循环,如果没有消失,说明验证失败了,上面这一行代码会抛出异常被捕获后继续循环滑动验证码
# If the slider disappears, it means the verification is successful, break the loop. If not, it means the verification failed, the above line will throw an exception and be caught to continue the loop
utils.logger.info("[DouYinLogin.check_page_display_slider] slider verify success ...")
slider_verify_success = True
except Exception as e:
@@ -213,10 +213,10 @@ class DouYinLogin(AbstractLogin):
async def move_slider(self, back_selector: str, gap_selector: str, move_step: int = 10, slider_level="easy"):
"""
Move the slider to the right to complete the verification
:param back_selector: 滑动验证码背景图片的选择器
:param gap_selector: 滑动验证码的滑块选择器
:param move_step: 是控制单次移动速度的比例是1/10 默认是1 相当于 传入的这个距离不管多远0.1秒钟移动完 越大越慢
:param slider_level: 滑块难度 easy hard,分别对应手机验证码的滑块和验证码中间的滑块
:param back_selector: Selector for the slider verification background image
:param gap_selector: Selector for the slider verification slider
:param move_step: Controls the ratio of single movement speed, default is 1, meaning the distance moves in 0.1 seconds no matter how far, larger value means slower
:param slider_level: Slider difficulty easy hard, corresponding to the slider for mobile verification code and the slider in the middle of verification code
:return:
"""
@@ -234,31 +234,31 @@ class DouYinLogin(AbstractLogin):
)
gap_src = str(await gap_elements.get_property("src")) # type: ignore
# 识别滑块位置
# Identify slider position
slide_app = utils.Slide(gap=gap_src, bg=slide_back)
distance = slide_app.discern()
# 获取移动轨迹
# Get movement trajectory
tracks = utils.get_tracks(distance, slider_level)
new_1 = tracks[-1] - (sum(tracks) - distance)
tracks.pop()
tracks.append(new_1)
# 根据轨迹拖拽滑块到指定位置
# Drag slider to specified position according to trajectory
element = await self.context_page.query_selector(gap_selector)
bounding_box = await element.bounding_box() # type: ignore
await self.context_page.mouse.move(bounding_box["x"] + bounding_box["width"] / 2, # type: ignore
bounding_box["y"] + bounding_box["height"] / 2) # type: ignore
# 这里获取到x坐标中心点位置
# Get x coordinate center position
x = bounding_box["x"] + bounding_box["width"] / 2 # type: ignore
# 模拟滑动操作
# Simulate sliding operation
await element.hover() # type: ignore
await self.context_page.mouse.down()
for track in tracks:
# 循环鼠标按照轨迹移动
# steps 是控制单次移动速度的比例是1/10 默认是1 相当于 传入的这个距离不管多远0.1秒钟移动完 越大越慢
# Loop mouse movement according to trajectory
# steps controls the ratio of single movement speed, default is 1, meaning the distance moves in 0.1 seconds no matter how far, larger value means slower
await self.context_page.mouse.move(x + track, 0, steps=move_step)
x += track
await self.context_page.mouse.up()


@@ -54,14 +54,15 @@ class KuaiShouClient(AbstractApiClient, ProxyRefreshMixin):
self.timeout = timeout
self.headers = headers
self._host = "https://www.kuaishou.com/graphql"
self._rest_host = "https://www.kuaishou.com"
self.playwright_page = playwright_page
self.cookie_dict = cookie_dict
self.graphql = KuaiShouGraphQL()
# 初始化代理池(来自 ProxyRefreshMixin
# Initialize proxy pool (from ProxyRefreshMixin)
self.init_proxy_pool(proxy_ip_pool)
async def request(self, method, url, **kwargs) -> Any:
# 每次请求前检测代理是否过期
# Check if proxy is expired before each request
await self._refresh_proxy_if_expired()
async with httpx.AsyncClient(proxy=self.proxy) as client:
@@ -86,6 +87,29 @@ class KuaiShouClient(AbstractApiClient, ProxyRefreshMixin):
method="POST", url=f"{self._host}{uri}", data=json_str, headers=self.headers
)
async def request_rest_v2(self, uri: str, data: dict) -> Dict:
"""
Make REST API V2 request (for comment endpoints)
:param uri: API endpoint path
:param data: request body
:return: response data
"""
await self._refresh_proxy_if_expired()
json_str = json.dumps(data, separators=(",", ":"), ensure_ascii=False)
async with httpx.AsyncClient(proxy=self.proxy) as client:
response = await client.request(
method="POST",
url=f"{self._rest_host}{uri}",
data=json_str,
timeout=self.timeout,
headers=self.headers,
)
result: Dict = response.json()
if result.get("result") != 1:
raise DataFetchError(f"REST API V2 error: {result}")
return result
async def pong(self) -> bool:
"""get a note to check if login state is ok"""
utils.logger.info("[KuaiShouClient.pong] Begin pong kuaishou...")
@@ -149,36 +173,32 @@ class KuaiShouClient(AbstractApiClient, ProxyRefreshMixin):
return await self.post("", post_data)
async def get_video_comments(self, photo_id: str, pcursor: str = "") -> Dict:
"""get video comments
:param photo_id: photo id you want to fetch
:param pcursor: last you get pcursor, defaults to ""
:return:
"""Get video first-level comments using REST API V2
:param photo_id: video id you want to fetch
:param pcursor: pagination cursor, defaults to ""
:return: dict with rootCommentsV2, pcursorV2, commentCountV2
"""
post_data = {
"operationName": "commentListQuery",
"variables": {"photoId": photo_id, "pcursor": pcursor},
"query": self.graphql.get("comment_list"),
"photoId": photo_id,
"pcursor": pcursor,
}
return await self.post("", post_data)
return await self.request_rest_v2("/rest/v/photo/comment/list", post_data)
async def get_video_sub_comments(
self, photo_id: str, rootCommentId: str, pcursor: str = ""
self, photo_id: str, root_comment_id: int, pcursor: str = ""
) -> Dict:
"""get video sub comments
:param photo_id: photo id you want to fetch
:param pcursor: last you get pcursor, defaults to ""
:return:
"""Get video second-level comments using REST API V2
:param photo_id: video id you want to fetch
:param root_comment_id: parent comment id (must be int type)
:param pcursor: pagination cursor, defaults to ""
:return: dict with subCommentsV2, pcursorV2
"""
post_data = {
"operationName": "visionSubCommentList",
"variables": {
"photoId": photo_id,
"pcursor": pcursor,
"rootCommentId": rootCommentId,
},
"query": self.graphql.get("vision_sub_comment_list"),
"photoId": photo_id,
"pcursor": pcursor,
"rootCommentId": root_comment_id, # Must be int type for V2 API
}
return await self.post("", post_data)
return await self.request_rest_v2("/rest/v/photo/comment/sublist", post_data)
async def get_creator_profile(self, userId: str) -> Dict:
post_data = {
@@ -204,12 +224,12 @@ class KuaiShouClient(AbstractApiClient, ProxyRefreshMixin):
max_count: int = 10,
):
"""
get video all comments include sub comments
:param photo_id:
:param crawl_interval:
:param callback:
:param max_count:
:return:
Get video all comments including sub comments (V2 REST API)
:param photo_id: video id
:param crawl_interval: delay between requests (seconds)
:param callback: callback function for processing comments
:param max_count: max number of comments to fetch
:return: list of all comments
"""
result = []
@@ -217,12 +237,12 @@ class KuaiShouClient(AbstractApiClient, ProxyRefreshMixin):
while pcursor != "no_more" and len(result) < max_count:
comments_res = await self.get_video_comments(photo_id, pcursor)
vision_commen_list = comments_res.get("visionCommentList", {})
pcursor = vision_commen_list.get("pcursor", "")
comments = vision_commen_list.get("rootComments", [])
# V2 API returns data at top level, not nested in visionCommentList
pcursor = comments_res.get("pcursorV2", "no_more")
comments = comments_res.get("rootCommentsV2", [])
if len(result) + len(comments) > max_count:
comments = comments[: max_count - len(result)]
if callback: # 如果有回调函数,就执行回调函数
if callback: # If there is a callback function, execute the callback function
await callback(photo_id, comments)
result.extend(comments)
await asyncio.sleep(crawl_interval)
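A self-contained sketch of paging the V2 comment endpoint with httpx, mirroring the URI, payload keys, and response fields shown in the hunks above; real requests are assumed to need valid Kuaishou cookies and a JSON content type in the headers.

import asyncio
import json
import httpx

async def fetch_root_comments(photo_id: str, headers: dict, max_count: int = 100) -> list:
    url = "https://www.kuaishou.com/rest/v/photo/comment/list"
    comments, pcursor = [], ""
    async with httpx.AsyncClient() as client:
        while pcursor != "no_more" and len(comments) < max_count:
            # Body is sent as a compact JSON string, matching request_rest_v2 above.
            body = json.dumps({"photoId": photo_id, "pcursor": pcursor}, separators=(",", ":"))
            response = await client.post(url, data=body, headers=headers, timeout=10)
            data = response.json()
            if data.get("result") != 1:  # same success check as request_rest_v2
                break
            comments.extend(data.get("rootCommentsV2", []))
            pcursor = data.get("pcursorV2", "no_more")
            await asyncio.sleep(1)  # crawl interval between pages
    return comments[:max_count]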
@@ -240,14 +260,14 @@ class KuaiShouClient(AbstractApiClient, ProxyRefreshMixin):
callback: Optional[Callable] = None,
) -> List[Dict]:
"""
获取指定一级评论下的所有二级评论, 该方法会一直查找一级评论下的所有二级评论信息
Get all second-level comments under specified first-level comments (V2 REST API)
Args:
comments: 评论列表
photo_id: 视频id
crawl_interval: 爬取一次评论的延迟单位(秒)
callback: 一次评论爬取结束后
comments: Comment list
photo_id: Video ID
crawl_interval: Delay unit for crawling comments once (seconds)
callback: Callback after one comment crawl ends
Returns:
List of sub comments
"""
if not config.ENABLE_GET_SUB_COMMENTS:
utils.logger.info(
@@ -257,35 +277,36 @@ class KuaiShouClient(AbstractApiClient, ProxyRefreshMixin):
result = []
for comment in comments:
sub_comments = comment.get("subComments")
if sub_comments and callback:
await callback(photo_id, sub_comments)
sub_comment_pcursor = comment.get("subCommentsPcursor")
if sub_comment_pcursor == "no_more":
# V2 API uses hasSubComments (boolean) instead of subCommentsPcursor (string)
has_sub_comments = comment.get("hasSubComments", False)
if not has_sub_comments:
continue
# V2 API uses comment_id (int) instead of commentId (string)
root_comment_id = comment.get("comment_id")
if not root_comment_id:
continue
root_comment_id = comment.get("commentId")
sub_comment_pcursor = ""
while sub_comment_pcursor != "no_more":
comments_res = await self.get_video_sub_comments(
photo_id, root_comment_id, sub_comment_pcursor
)
vision_sub_comment_list = comments_res.get("visionSubCommentList", {})
sub_comment_pcursor = vision_sub_comment_list.get("pcursor", "no_more")
# V2 API returns data at top level
sub_comment_pcursor = comments_res.get("pcursorV2", "no_more")
sub_comments = comments_res.get("subCommentsV2", [])
comments = vision_sub_comment_list.get("subComments", {})
if callback:
await callback(photo_id, comments)
if callback and sub_comments:
await callback(photo_id, sub_comments)
await asyncio.sleep(crawl_interval)
result.extend(comments)
result.extend(sub_comments)
return result
async def get_creator_info(self, user_id: str) -> Dict:
"""
eg: https://www.kuaishou.com/profile/3x4jtnbfter525a
快手用户主页
Kuaishou user homepage
"""
visionProfile = await self.get_creator_profile(user_id)
@@ -298,11 +319,11 @@ class KuaiShouClient(AbstractApiClient, ProxyRefreshMixin):
callback: Optional[Callable] = None,
) -> List[Dict]:
"""
获取指定用户下的所有发过的帖子,该方法会一直查找一个用户下的所有帖子信息
Get all posts published by the specified user, this method will continue to find all post information under a user
Args:
user_id: 用户ID
crawl_interval: 爬取一次的延迟单位(秒)
callback: 一次分页爬取结束后的更新回调函数
user_id: User ID
crawl_interval: Delay unit for crawling once (seconds)
callback: Update callback function after one page crawl ends
Returns:
"""


@@ -58,7 +58,7 @@ class KuaishouCrawler(AbstractCrawler):
self.index_url = "https://www.kuaishou.com"
self.user_agent = utils.get_user_agent()
self.cdp_manager = None
self.ip_proxy_pool = None # 代理IP池用于代理自动刷新
self.ip_proxy_pool = None # Proxy IP pool, used for automatic proxy refresh
async def start(self):
playwright_proxy_format, httpx_proxy_format = None, None
@@ -72,9 +72,9 @@ class KuaishouCrawler(AbstractCrawler):
)
async with async_playwright() as playwright:
# 根据配置选择启动模式
# Select startup mode based on configuration
if config.ENABLE_CDP_MODE:
utils.logger.info("[KuaishouCrawler] 使用CDP模式启动浏览器")
utils.logger.info("[KuaishouCrawler] Launching browser using CDP mode")
self.browser_context = await self.launch_browser_with_cdp(
playwright,
playwright_proxy_format,
@@ -82,7 +82,7 @@ class KuaishouCrawler(AbstractCrawler):
headless=config.CDP_HEADLESS,
)
else:
utils.logger.info("[KuaishouCrawler] 使用标准模式启动浏览器")
utils.logger.info("[KuaishouCrawler] Launching browser using standard mode")
# Launch a browser context.
chromium = playwright.chromium
self.browser_context = await self.launch_browser(
@@ -318,7 +318,7 @@ class KuaishouCrawler(AbstractCrawler):
},
playwright_page=self.context_page,
cookie_dict=cookie_dict,
proxy_ip_pool=self.ip_proxy_pool, # 传递代理池用于自动刷新
proxy_ip_pool=self.ip_proxy_pool, # Pass proxy pool for automatic refresh
)
return ks_client_obj
@@ -344,7 +344,7 @@ class KuaishouCrawler(AbstractCrawler):
proxy=playwright_proxy, # type: ignore
viewport={"width": 1920, "height": 1080},
user_agent=user_agent,
channel="chrome", # 使用系统的Chrome稳定版
channel="chrome", # Use system's stable Chrome version
)
return browser_context
else:
@@ -362,7 +362,7 @@ class KuaishouCrawler(AbstractCrawler):
headless: bool = True,
) -> BrowserContext:
"""
使用CDP模式启动浏览器
Launch browser using CDP mode
"""
try:
self.cdp_manager = CDPBrowserManager()
@@ -373,17 +373,17 @@ class KuaishouCrawler(AbstractCrawler):
headless=headless,
)
# 显示浏览器信息
# Display browser information
browser_info = await self.cdp_manager.get_browser_info()
utils.logger.info(f"[KuaishouCrawler] CDP浏览器信息: {browser_info}")
utils.logger.info(f"[KuaishouCrawler] CDP browser info: {browser_info}")
return browser_context
except Exception as e:
utils.logger.error(
f"[KuaishouCrawler] CDP模式启动失败,回退到标准模式: {e}"
f"[KuaishouCrawler] CDP mode launch failed, fallback to standard mode: {e}"
)
# 回退到标准模式
# Fallback to standard mode
chromium = playwright.chromium
return await self.launch_browser(
chromium, playwright_proxy, user_agent, headless
@@ -438,7 +438,7 @@ class KuaishouCrawler(AbstractCrawler):
async def close(self):
"""Close browser context"""
# 如果使用CDP模式需要特殊处理
# If using CDP mode, need special handling
if self.cdp_manager:
await self.cdp_manager.cleanup()
self.cdp_manager = None


@@ -18,8 +18,8 @@
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# 快手的数据传输是基于GraphQL实现的
# 这个类负责获取一些GraphQLschema
# Kuaishou's data transmission is based on GraphQL
# This class is responsible for obtaining some GraphQL schemas
from typing import Dict


@@ -26,59 +26,59 @@ from model.m_kuaishou import VideoUrlInfo, CreatorUrlInfo
def parse_video_info_from_url(url: str) -> VideoUrlInfo:
"""
从快手视频URL中解析出视频ID
支持以下格式:
1. 完整视频URL: "https://www.kuaishou.com/short-video/3x3zxz4mjrsc8ke?authorId=3x84qugg4ch9zhs&streamSource=search"
2. 纯视频ID: "3x3zxz4mjrsc8ke"
Parse video ID from Kuaishou video URL
Supports the following formats:
1. Full video URL: "https://www.kuaishou.com/short-video/3x3zxz4mjrsc8ke?authorId=3x84qugg4ch9zhs&streamSource=search"
2. Pure video ID: "3x3zxz4mjrsc8ke"
Args:
url: 快手视频链接或视频ID
url: Kuaishou video link or video ID
Returns:
VideoUrlInfo: 包含视频ID的对象
VideoUrlInfo: Object containing video ID
"""
# 如果不包含http且不包含kuaishou.com认为是纯ID
# If it doesn't contain http and doesn't contain kuaishou.com, consider it as pure ID
if not url.startswith("http") and "kuaishou.com" not in url:
return VideoUrlInfo(video_id=url, url_type="normal")
# 从标准视频URL中提取ID: /short-video/视频ID
# Extract ID from standard video URL: /short-video/video_ID
video_pattern = r'/short-video/([a-zA-Z0-9_-]+)'
match = re.search(video_pattern, url)
if match:
video_id = match.group(1)
return VideoUrlInfo(video_id=video_id, url_type="normal")
raise ValueError(f"无法从URL中解析出视频ID: {url}")
raise ValueError(f"Unable to parse video ID from URL: {url}")
def parse_creator_info_from_url(url: str) -> CreatorUrlInfo:
"""
从快手创作者主页URL中解析出创作者ID
支持以下格式:
1. 创作者主页: "https://www.kuaishou.com/profile/3x84qugg4ch9zhs"
2. ID: "3x4sm73aye7jq7i"
Parse creator ID from Kuaishou creator homepage URL
Supports the following formats:
1. Creator homepage: "https://www.kuaishou.com/profile/3x84qugg4ch9zhs"
2. Pure ID: "3x4sm73aye7jq7i"
Args:
url: 快手创作者主页链接或user_id
url: Kuaishou creator homepage link or user_id
Returns:
CreatorUrlInfo: 包含创作者ID的对象
CreatorUrlInfo: Object containing creator ID
"""
# 如果不包含http且不包含kuaishou.com认为是纯ID
# If it doesn't contain http and doesn't contain kuaishou.com, consider it as pure ID
if not url.startswith("http") and "kuaishou.com" not in url:
return CreatorUrlInfo(user_id=url)
# 从创作者主页URL中提取user_id: /profile/xxx
# Extract user_id from creator homepage URL: /profile/xxx
user_pattern = r'/profile/([a-zA-Z0-9_-]+)'
match = re.search(user_pattern, url)
if match:
user_id = match.group(1)
return CreatorUrlInfo(user_id=user_id)
raise ValueError(f"无法从URL中解析出创作者ID: {url}")
raise ValueError(f"Unable to parse creator ID from URL: {url}")
if __name__ == '__main__':
# 测试视频URL解析
print("=== 视频URL解析测试 ===")
# Test video URL parsing
print("=== Video URL Parsing Test ===")
test_video_urls = [
"https://www.kuaishou.com/short-video/3x3zxz4mjrsc8ke?authorId=3x84qugg4ch9zhs&streamSource=search&area=searchxxnull&searchKey=python",
"3xf8enb8dbj6uig",
@@ -87,13 +87,13 @@ if __name__ == '__main__':
try:
result = parse_video_info_from_url(url)
print(f"✓ URL: {url[:80]}...")
print(f" 结果: {result}\n")
print(f" Result: {result}\n")
except Exception as e:
print(f"✗ URL: {url}")
print(f" 错误: {e}\n")
print(f" Error: {e}\n")
# 测试创作者URL解析
print("=== 创作者URL解析测试 ===")
# Test creator URL parsing
print("=== Creator URL Parsing Test ===")
test_creator_urls = [
"https://www.kuaishou.com/profile/3x84qugg4ch9zhs",
"3x4sm73aye7jq7i",
@@ -102,7 +102,7 @@ if __name__ == '__main__':
try:
result = parse_creator_info_from_url(url)
print(f"✓ URL: {url[:80]}...")
print(f" 结果: {result}\n")
print(f" Result: {result}\n")
except Exception as e:
print(f"✗ URL: {url}")
print(f" 错误: {e}\n")
print(f" Error: {e}\n")


@@ -48,7 +48,7 @@ class BaiduTieBaClient(AbstractApiClient):
):
self.ip_pool: Optional[ProxyIpPool] = ip_pool
self.timeout = timeout
# 使用传入的headers(包含真实浏览器UA)或默认headers
# Use provided headers (including real browser UA) or default headers
self.headers = headers or {
"User-Agent": utils.get_user_agent(),
"Cookie": "",
@@ -56,21 +56,21 @@ class BaiduTieBaClient(AbstractApiClient):
self._host = "https://tieba.baidu.com"
self._page_extractor = TieBaExtractor()
self.default_ip_proxy = default_ip_proxy
self.playwright_page = playwright_page # Playwright页面对象
self.playwright_page = playwright_page # Playwright page object
def _sync_request(self, method, url, proxy=None, **kwargs):
"""
同步的requests请求方法
Synchronous requests method
Args:
method: 请求方法
url: 请求的URL
proxy: 代理IP
**kwargs: 其他请求参数
method: Request method
url: Request URL
proxy: Proxy IP
**kwargs: Other request parameters
Returns:
response对象
Response object
"""
# 构造代理字典
# Construct proxy dictionary
proxies = None
if proxy:
proxies = {
@@ -78,7 +78,7 @@ class BaiduTieBaClient(AbstractApiClient):
"https": proxy,
}
# 发送请求
# Send request
response = requests.request(
method=method,
url=url,
@@ -91,7 +91,7 @@ class BaiduTieBaClient(AbstractApiClient):
async def _refresh_proxy_if_expired(self) -> None:
"""
检测代理是否过期,如果过期则自动刷新
Check if proxy is expired and automatically refresh if necessary
"""
if self.ip_pool is None:
return
@@ -101,7 +101,7 @@ class BaiduTieBaClient(AbstractApiClient):
"[BaiduTieBaClient._refresh_proxy_if_expired] Proxy expired, refreshing..."
)
new_proxy = await self.ip_pool.get_or_refresh_proxy()
# 更新代理URL
# Update proxy URL
_, self.default_ip_proxy = utils.format_proxy_info(new_proxy)
utils.logger.info(
f"[BaiduTieBaClient._refresh_proxy_if_expired] New proxy: {new_proxy.ip}:{new_proxy.port}"
@@ -110,23 +110,23 @@ class BaiduTieBaClient(AbstractApiClient):
@retry(stop=stop_after_attempt(3), wait=wait_fixed(1))
async def request(self, method, url, return_ori_content=False, proxy=None, **kwargs) -> Union[str, Any]:
"""
封装requests的公共请求方法对请求响应做一些处理
Common request method wrapper for requests, handles request responses
Args:
method: 请求方法
url: 请求的URL
return_ori_content: 是否返回原始内容
proxy: 代理IP
**kwargs: 其他请求参数,例如请求头、请求体等
method: Request method
url: Request URL
return_ori_content: Whether to return original content
proxy: Proxy IP
**kwargs: Other request parameters, such as headers, request body, etc.
Returns:
"""
# 每次请求前检测代理是否过期
# Check if proxy is expired before each request
await self._refresh_proxy_if_expired()
actual_proxy = proxy if proxy else self.default_ip_proxy
# 在线程池中执行同步的requests请求
# Execute synchronous requests in thread pool
response = await asyncio.to_thread(
self._sync_request,
method,
@@ -151,11 +151,11 @@ class BaiduTieBaClient(AbstractApiClient):
async def get(self, uri: str, params=None, return_ori_content=False, **kwargs) -> Any:
"""
GET请求,对请求头签名
GET request with header signing
Args:
uri: 请求路由
params: 请求参数
return_ori_content: 是否返回原始内容
uri: Request route
params: Request parameters
return_ori_content: Whether to return original content
Returns:
@@ -175,15 +175,15 @@ class BaiduTieBaClient(AbstractApiClient):
self.default_ip_proxy = proxy
return res
utils.logger.error(f"[BaiduTieBaClient.get] 达到了最大重试次数IP已经被Block请尝试更换新的IP代理: {e}")
raise Exception(f"[BaiduTieBaClient.get] 达到了最大重试次数IP已经被Block请尝试更换新的IP代理: {e}")
utils.logger.error(f"[BaiduTieBaClient.get] Reached maximum retry attempts, IP is blocked, please try a new IP proxy: {e}")
raise Exception(f"[BaiduTieBaClient.get] Reached maximum retry attempts, IP is blocked, please try a new IP proxy: {e}")
async def post(self, uri: str, data: dict, **kwargs) -> Dict:
"""
POST请求,对请求头签名
POST request with header signing
Args:
uri: 请求路由
data: 请求体参数
uri: Request route
data: Request body parameters
Returns:
@@ -193,13 +193,13 @@ class BaiduTieBaClient(AbstractApiClient):
async def pong(self, browser_context: BrowserContext = None) -> bool:
"""
用于检查登录态是否失效了
使用Cookie检测而非API调用,避免被检测
Check if login state is still valid
Uses Cookie detection instead of API calls to avoid detection
Args:
browser_context: 浏览器上下文对象
browser_context: Browser context object
Returns:
bool: True表示已登录,False表示未登录
bool: True if logged in, False if not logged in
"""
utils.logger.info("[BaiduTieBaClient.pong] Begin to check tieba login state by cookies...")
@@ -208,13 +208,13 @@ class BaiduTieBaClient(AbstractApiClient):
return False
try:
# 从浏览器获取cookies并检查关键登录cookie
# Get cookies from browser and check key login cookies
_, cookie_dict = utils.convert_cookies(await browser_context.cookies())
# 百度贴吧的登录标识: STOKEN PTOKEN
# Baidu Tieba login identifiers: STOKEN or PTOKEN
stoken = cookie_dict.get("STOKEN")
ptoken = cookie_dict.get("PTOKEN")
bduss = cookie_dict.get("BDUSS") # 百度通用登录cookie
bduss = cookie_dict.get("BDUSS") # Baidu universal login cookie
if stoken or ptoken or bduss:
utils.logger.info(f"[BaiduTieBaClient.pong] Login state verified by cookies (STOKEN: {bool(stoken)}, PTOKEN: {bool(ptoken)}, BDUSS: {bool(bduss)})")
@@ -229,9 +229,9 @@ class BaiduTieBaClient(AbstractApiClient):
async def update_cookies(self, browser_context: BrowserContext):
"""
API客户端提供的更新cookies方法一般情况下登录成功后会调用此方法
Update cookies method provided by API client, usually called after successful login
Args:
browser_context: 浏览器上下文对象
browser_context: Browser context object
Returns:
@@ -249,13 +249,13 @@ class BaiduTieBaClient(AbstractApiClient):
note_type: SearchNoteType = SearchNoteType.FIXED_THREAD,
) -> List[TiebaNote]:
"""
根据关键词搜索贴吧帖子 (使用Playwright访问页面,避免API检测)
Search Tieba posts by keyword (uses Playwright to access page, avoiding API detection)
Args:
keyword: 关键词
page: 分页第几页
page_size: 每页大小
sort: 结果排序方式
note_type: 帖子类型(主题贴|主题+回复混合模式)
keyword: Keyword
page: Page number
page_size: Page size
sort: Result sort method
note_type: Post type (main thread | main thread + reply mixed mode)
Returns:
"""
@@ -263,8 +263,8 @@ class BaiduTieBaClient(AbstractApiClient):
utils.logger.error("[BaiduTieBaClient.get_notes_by_keyword] playwright_page is None, cannot use browser mode")
raise Exception("playwright_page is required for browser-based search")
# 构造搜索URL
# 示例: https://tieba.baidu.com/f/search/res?ie=utf-8&qw=编程
# Construct search URL
# Example: https://tieba.baidu.com/f/search/res?ie=utf-8&qw=keyword
search_url = f"{self._host}/f/search/res"
params = {
"ie": "utf-8",
@@ -275,64 +275,64 @@ class BaiduTieBaClient(AbstractApiClient):
"only_thread": note_type.value,
}
# 拼接完整URL
# Concatenate full URL
full_url = f"{search_url}?{urlencode(params)}"
utils.logger.info(f"[BaiduTieBaClient.get_notes_by_keyword] 访问搜索页面: {full_url}")
utils.logger.info(f"[BaiduTieBaClient.get_notes_by_keyword] Accessing search page: {full_url}")
try:
# 使用Playwright访问搜索页面
# Use Playwright to access search page
await self.playwright_page.goto(full_url, wait_until="domcontentloaded")
# 等待页面加载,使用配置文件中的延时设置
# Wait for page loading, using delay setting from config file
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
# 获取页面HTML内容
# Get page HTML content
page_content = await self.playwright_page.content()
utils.logger.info(f"[BaiduTieBaClient.get_notes_by_keyword] 成功获取搜索页面HTML,长度: {len(page_content)}")
utils.logger.info(f"[BaiduTieBaClient.get_notes_by_keyword] Successfully retrieved search page HTML, length: {len(page_content)}")
# 提取搜索结果
# Extract search results
notes = self._page_extractor.extract_search_note_list(page_content)
utils.logger.info(f"[BaiduTieBaClient.get_notes_by_keyword] 提取到 {len(notes)} 条帖子")
utils.logger.info(f"[BaiduTieBaClient.get_notes_by_keyword] Extracted {len(notes)} posts")
return notes
except Exception as e:
utils.logger.error(f"[BaiduTieBaClient.get_notes_by_keyword] 搜索失败: {e}")
utils.logger.error(f"[BaiduTieBaClient.get_notes_by_keyword] Search failed: {e}")
raise
async def get_note_by_id(self, note_id: str) -> TiebaNote:
"""
根据帖子ID获取帖子详情 (使用Playwright访问页面,避免API检测)
Get post details by post ID (uses Playwright to access page, avoiding API detection)
Args:
note_id: 帖子ID
note_id: Post ID
Returns:
TiebaNote: 帖子详情对象
TiebaNote: Post detail object
"""
if not self.playwright_page:
utils.logger.error("[BaiduTieBaClient.get_note_by_id] playwright_page is None, cannot use browser mode")
raise Exception("playwright_page is required for browser-based note detail fetching")
# 构造帖子详情URL
# Construct post detail URL
note_url = f"{self._host}/p/{note_id}"
utils.logger.info(f"[BaiduTieBaClient.get_note_by_id] 访问帖子详情页面: {note_url}")
utils.logger.info(f"[BaiduTieBaClient.get_note_by_id] Accessing post detail page: {note_url}")
try:
# 使用Playwright访问帖子详情页面
# Use Playwright to access post detail page
await self.playwright_page.goto(note_url, wait_until="domcontentloaded")
# 等待页面加载,使用配置文件中的延时设置
# Wait for page loading, using delay setting from config file
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
# 获取页面HTML内容
# Get page HTML content
page_content = await self.playwright_page.content()
utils.logger.info(f"[BaiduTieBaClient.get_note_by_id] 成功获取帖子详情HTML,长度: {len(page_content)}")
utils.logger.info(f"[BaiduTieBaClient.get_note_by_id] Successfully retrieved post detail HTML, length: {len(page_content)}")
# 提取帖子详情
# Extract post details
note_detail = self._page_extractor.extract_note_detail(page_content)
return note_detail
except Exception as e:
utils.logger.error(f"[BaiduTieBaClient.get_note_by_id] 获取帖子详情失败: {e}")
utils.logger.error(f"[BaiduTieBaClient.get_note_by_id] Failed to get post details: {e}")
raise
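The browser-based fetch pattern used throughout this client, reduced to a self-contained sketch (the fixed sleep stands in for config.CRAWLER_MAX_SLEEP_SEC, and the extractor step is omitted):

import asyncio
from playwright.async_api import async_playwright

async def fetch_rendered_html(url: str) -> str:
    # Navigate, wait for the page to settle, then return the rendered HTML
    # so a parser such as TieBaExtractor can work on it.
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="domcontentloaded")
        await asyncio.sleep(2)  # stand-in for config.CRAWLER_MAX_SLEEP_SEC
        html = await page.content()
        await browser.close()
        return html

# asyncio.run(fetch_rendered_html("https://tieba.baidu.com/p/123456"))  # hypothetical post id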
async def get_note_all_comments(
@@ -343,14 +343,14 @@ class BaiduTieBaClient(AbstractApiClient):
max_count: int = 10,
) -> List[TiebaComment]:
"""
获取指定帖子下的所有一级评论 (使用Playwright访问页面,避免API检测)
Get all first-level comments for specified post (uses Playwright to access page, avoiding API detection)
Args:
note_detail: 帖子详情对象
crawl_interval: 爬取一次笔记的延迟单位(秒)
callback: 一次笔记爬取结束后的回调函数
max_count: 一次帖子爬取的最大评论数量
note_detail: Post detail object
crawl_interval: Crawl delay interval in seconds
callback: Callback function after one post crawl completes
max_count: Maximum number of comments to crawl per post
Returns:
List[TiebaComment]: 评论列表
List[TiebaComment]: Comment list
"""
if not self.playwright_page:
utils.logger.error("[BaiduTieBaClient.get_note_all_comments] playwright_page is None, cannot use browser mode")
@@ -360,30 +360,30 @@ class BaiduTieBaClient(AbstractApiClient):
current_page = 1
while note_detail.total_replay_page >= current_page and len(result) < max_count:
# 构造评论页URL
# Construct comment page URL
comment_url = f"{self._host}/p/{note_detail.note_id}?pn={current_page}"
utils.logger.info(f"[BaiduTieBaClient.get_note_all_comments] 访问评论页面: {comment_url}")
utils.logger.info(f"[BaiduTieBaClient.get_note_all_comments] Accessing comment page: {comment_url}")
try:
# 使用Playwright访问评论页面
# Use Playwright to access comment page
await self.playwright_page.goto(comment_url, wait_until="domcontentloaded")
# 等待页面加载,使用配置文件中的延时设置
# Wait for page loading, using delay setting from config file
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
# 获取页面HTML内容
# Get page HTML content
page_content = await self.playwright_page.content()
# 提取评论
# Extract comments
comments = self._page_extractor.extract_tieba_note_parment_comments(
page_content, note_id=note_detail.note_id
)
if not comments:
utils.logger.info(f"[BaiduTieBaClient.get_note_all_comments] {current_page}页没有评论,停止爬取")
utils.logger.info(f"[BaiduTieBaClient.get_note_all_comments] Page {current_page} has no comments, stopping crawl")
break
# 限制评论数量
# Limit comment count
if len(result) + len(comments) > max_count:
comments = comments[:max_count - len(result)]
@@ -392,7 +392,7 @@ class BaiduTieBaClient(AbstractApiClient):
result.extend(comments)
# 获取所有子评论
# Get all sub-comments
await self.get_comments_all_sub_comments(
comments, crawl_interval=crawl_interval, callback=callback
)
@@ -401,10 +401,10 @@ class BaiduTieBaClient(AbstractApiClient):
current_page += 1
except Exception as e:
utils.logger.error(f"[BaiduTieBaClient.get_note_all_comments] 获取第{current_page}页评论失败: {e}")
utils.logger.error(f"[BaiduTieBaClient.get_note_all_comments] Failed to get page {current_page} comments: {e}")
break
utils.logger.info(f"[BaiduTieBaClient.get_note_all_comments] 共获取 {len(result)} 条一级评论")
utils.logger.info(f"[BaiduTieBaClient.get_note_all_comments] Total retrieved {len(result)} first-level comments")
return result
async def get_comments_all_sub_comments(
@@ -414,14 +414,14 @@ class BaiduTieBaClient(AbstractApiClient):
callback: Optional[Callable] = None,
) -> List[TiebaComment]:
"""
获取指定评论下的所有子评论 (使用Playwright访问页面,避免API检测)
Get all sub-comments for specified comments (uses Playwright to access page, avoiding API detection)
Args:
comments: 评论列表
crawl_interval: 爬取一次笔记的延迟单位(秒)
callback: 一次笔记爬取结束后的回调函数
comments: Comment list
crawl_interval: Crawl delay interval in seconds
callback: Callback function after one post crawl completes
Returns:
List[TiebaComment]: 子评论列表
List[TiebaComment]: Sub-comment list
"""
if not config.ENABLE_GET_SUB_COMMENTS:
return []
@@ -440,7 +440,7 @@ class BaiduTieBaClient(AbstractApiClient):
max_sub_page_num = parment_comment.sub_comment_count // 10 + 1
while max_sub_page_num >= current_page:
# 构造子评论URL
# Construct sub-comment URL
sub_comment_url = (
f"{self._host}/p/comment?"
f"tid={parment_comment.note_id}&"
@@ -448,19 +448,19 @@ class BaiduTieBaClient(AbstractApiClient):
f"fid={parment_comment.tieba_id}&"
f"pn={current_page}"
)
utils.logger.info(f"[BaiduTieBaClient.get_comments_all_sub_comments] 访问子评论页面: {sub_comment_url}")
utils.logger.info(f"[BaiduTieBaClient.get_comments_all_sub_comments] Accessing sub-comment page: {sub_comment_url}")
try:
# 使用Playwright访问子评论页面
# Use Playwright to access sub-comment page
await self.playwright_page.goto(sub_comment_url, wait_until="domcontentloaded")
# 等待页面加载,使用配置文件中的延时设置
# Wait for page loading, using delay setting from config file
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
# 获取页面HTML内容
# Get page HTML content
page_content = await self.playwright_page.content()
# 提取子评论
# Extract sub-comments
sub_comments = self._page_extractor.extract_tieba_note_sub_comments(
page_content, parent_comment=parment_comment
)
@@ -468,7 +468,7 @@ class BaiduTieBaClient(AbstractApiClient):
if not sub_comments:
utils.logger.info(
f"[BaiduTieBaClient.get_comments_all_sub_comments] "
f"评论{parment_comment.comment_id}{current_page}页没有子评论,停止爬取"
f"Comment {parment_comment.comment_id} page {current_page} has no sub-comments, stopping crawl"
)
break
@@ -482,125 +482,125 @@ class BaiduTieBaClient(AbstractApiClient):
except Exception as e:
utils.logger.error(
f"[BaiduTieBaClient.get_comments_all_sub_comments] "
f"获取评论{parment_comment.comment_id}{current_page}页子评论失败: {e}"
f"Failed to get comment {parment_comment.comment_id} page {current_page} sub-comments: {e}"
)
break
utils.logger.info(f"[BaiduTieBaClient.get_comments_all_sub_comments] 共获取 {len(all_sub_comments)} 条子评论")
utils.logger.info(f"[BaiduTieBaClient.get_comments_all_sub_comments] Total retrieved {len(all_sub_comments)} sub-comments")
return all_sub_comments
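A minimal sketch (not part of the diff) of the sub-comment paging math above: with ten replies per page the page count comes from integer division, and an occasional extra page is harmless because the loop breaks as soon as a page yields no sub-comments.

sub_comment_count = 25
max_sub_page_num = sub_comment_count // 10 + 1   # -> 3 pages (10 + 10 + 5)
assert max_sub_page_num == 3
# For an exact multiple (e.g. 20) this also computes 3; the empty third page simply ends the loop.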
async def get_notes_by_tieba_name(self, tieba_name: str, page_num: int) -> List[TiebaNote]:
"""
根据贴吧名称获取帖子列表 (使用Playwright访问页面,避免API检测)
Get post list by Tieba name (uses Playwright to access page, avoiding API detection)
Args:
tieba_name: 贴吧名称
page_num: 分页页码
tieba_name: Tieba name
page_num: Page number
Returns:
List[TiebaNote]: 帖子列表
List[TiebaNote]: Post list
"""
if not self.playwright_page:
utils.logger.error("[BaiduTieBaClient.get_notes_by_tieba_name] playwright_page is None, cannot use browser mode")
raise Exception("playwright_page is required for browser-based tieba note fetching")
# 构造贴吧帖子列表URL
# Construct Tieba post list URL
tieba_url = f"{self._host}/f?kw={quote(tieba_name)}&pn={page_num}"
utils.logger.info(f"[BaiduTieBaClient.get_notes_by_tieba_name] 访问贴吧页面: {tieba_url}")
utils.logger.info(f"[BaiduTieBaClient.get_notes_by_tieba_name] Accessing Tieba page: {tieba_url}")
try:
# 使用Playwright访问贴吧页面
# Use Playwright to access Tieba page
await self.playwright_page.goto(tieba_url, wait_until="domcontentloaded")
# 等待页面加载,使用配置文件中的延时设置
# Wait for page loading, using delay setting from config file
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
# 获取页面HTML内容
# Get page HTML content
page_content = await self.playwright_page.content()
utils.logger.info(f"[BaiduTieBaClient.get_notes_by_tieba_name] 成功获取贴吧页面HTML,长度: {len(page_content)}")
utils.logger.info(f"[BaiduTieBaClient.get_notes_by_tieba_name] Successfully retrieved Tieba page HTML, length: {len(page_content)}")
# 提取帖子列表
# Extract post list
notes = self._page_extractor.extract_tieba_note_list(page_content)
utils.logger.info(f"[BaiduTieBaClient.get_notes_by_tieba_name] 提取到 {len(notes)} 条帖子")
utils.logger.info(f"[BaiduTieBaClient.get_notes_by_tieba_name] Extracted {len(notes)} posts")
return notes
except Exception as e:
utils.logger.error(f"[BaiduTieBaClient.get_notes_by_tieba_name] 获取贴吧帖子列表失败: {e}")
utils.logger.error(f"[BaiduTieBaClient.get_notes_by_tieba_name] Failed to get Tieba post list: {e}")
raise
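A minimal, runnable sketch (not part of the diff) showing how quote() percent-encodes the kw parameter in the Tieba list URL built above; the forum name here is hypothetical.

from urllib.parse import quote

tieba_name = "孙笑川"   # hypothetical forum name
url = f"https://tieba.baidu.com/f?kw={quote(tieba_name)}&pn=0"
print(url)  # -> https://tieba.baidu.com/f?kw=%E5%AD%99%E7%AC%91%E5%B7%9D&pn=0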
async def get_creator_info_by_url(self, creator_url: str) -> str:
"""
根据创作者URL获取创作者信息 (使用Playwright访问页面,避免API检测)
Get creator information by creator URL (uses Playwright to access page, avoiding API detection)
Args:
creator_url: 创作者主页URL
creator_url: Creator homepage URL
Returns:
str: 页面HTML内容
str: Page HTML content
"""
if not self.playwright_page:
utils.logger.error("[BaiduTieBaClient.get_creator_info_by_url] playwright_page is None, cannot use browser mode")
raise Exception("playwright_page is required for browser-based creator info fetching")
utils.logger.info(f"[BaiduTieBaClient.get_creator_info_by_url] 访问创作者主页: {creator_url}")
utils.logger.info(f"[BaiduTieBaClient.get_creator_info_by_url] Accessing creator homepage: {creator_url}")
try:
# 使用Playwright访问创作者主页
# Use Playwright to access creator homepage
await self.playwright_page.goto(creator_url, wait_until="domcontentloaded")
# 等待页面加载,使用配置文件中的延时设置
# Wait for page loading, using delay setting from config file
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
# 获取页面HTML内容
# Get page HTML content
page_content = await self.playwright_page.content()
utils.logger.info(f"[BaiduTieBaClient.get_creator_info_by_url] 成功获取创作者主页HTML,长度: {len(page_content)}")
utils.logger.info(f"[BaiduTieBaClient.get_creator_info_by_url] Successfully retrieved creator homepage HTML, length: {len(page_content)}")
return page_content
except Exception as e:
utils.logger.error(f"[BaiduTieBaClient.get_creator_info_by_url] 获取创作者主页失败: {e}")
utils.logger.error(f"[BaiduTieBaClient.get_creator_info_by_url] Failed to get creator homepage: {e}")
raise
async def get_notes_by_creator(self, user_name: str, page_number: int) -> Dict:
"""
根据创作者获取创作者的帖子 (使用Playwright访问页面,避免API检测)
Get creator's posts by creator (uses Playwright to access page, avoiding API detection)
Args:
user_name: 创作者用户名
page_number: 页码
user_name: Creator username
page_number: Page number
Returns:
Dict: 包含帖子数据的字典
Dict: Dictionary containing post data
"""
if not self.playwright_page:
utils.logger.error("[BaiduTieBaClient.get_notes_by_creator] playwright_page is None, cannot use browser mode")
raise Exception("playwright_page is required for browser-based creator notes fetching")
# 构造创作者帖子列表URL
# Construct creator post list URL
creator_url = f"{self._host}/home/get/getthread?un={quote(user_name)}&pn={page_number}&id=utf-8&_={utils.get_current_timestamp()}"
utils.logger.info(f"[BaiduTieBaClient.get_notes_by_creator] 访问创作者帖子列表: {creator_url}")
utils.logger.info(f"[BaiduTieBaClient.get_notes_by_creator] Accessing creator post list: {creator_url}")
try:
# 使用Playwright访问创作者帖子列表页面
# Use Playwright to access creator post list page
await self.playwright_page.goto(creator_url, wait_until="domcontentloaded")
# 等待页面加载,使用配置文件中的延时设置
# Wait for page loading, using delay setting from config file
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
# 获取页面内容(这个接口返回JSON)
# Get page content (this API returns JSON)
page_content = await self.playwright_page.content()
# 提取JSON数据(页面会包含<pre>标签或直接是JSON)
# Extract JSON data (page will contain <pre> tag or is directly JSON)
try:
# 尝试从页面中提取JSON
# Try to extract JSON from page
json_text = await self.playwright_page.evaluate("() => document.body.innerText")
result = json.loads(json_text)
utils.logger.info(f"[BaiduTieBaClient.get_notes_by_creator] 成功获取创作者帖子数据")
utils.logger.info(f"[BaiduTieBaClient.get_notes_by_creator] Successfully retrieved creator post data")
return result
except json.JSONDecodeError as e:
utils.logger.error(f"[BaiduTieBaClient.get_notes_by_creator] JSON解析失败: {e}")
utils.logger.error(f"[BaiduTieBaClient.get_notes_by_creator] 页面内容: {page_content[:500]}")
utils.logger.error(f"[BaiduTieBaClient.get_notes_by_creator] JSON parsing failed: {e}")
utils.logger.error(f"[BaiduTieBaClient.get_notes_by_creator] Page content: {page_content[:500]}")
raise Exception(f"Failed to parse JSON from creator notes page: {e}")
except Exception as e:
utils.logger.error(f"[BaiduTieBaClient.get_notes_by_creator] 获取创作者帖子列表失败: {e}")
utils.logger.error(f"[BaiduTieBaClient.get_notes_by_creator] Failed to get creator post list: {e}")
raise
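A minimal sketch (not part of the diff) of the JSON-extraction step above, assuming `page` is an already-navigated Playwright page: page.content() wraps a JSON response in HTML (often inside a <pre> tag), so the raw text is read via document.body.innerText and then parsed. read_json_from_page is a hypothetical helper name.

import json

async def read_json_from_page(page):
    # innerText returns the rendered text of the body, i.e. the raw JSON string
    json_text = await page.evaluate("() => document.body.innerText")
    return json.loads(json_text)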
async def get_all_notes_by_creator_user_name(
@@ -612,18 +612,18 @@ class BaiduTieBaClient(AbstractApiClient):
creator_page_html_content: str = None,
) -> List[TiebaNote]:
"""
根据创作者用户名获取创作者所有帖子
Get all creator posts by creator username
Args:
user_name: 创作者用户名
crawl_interval: 爬取一次笔记的延迟单位(秒)
callback: 一次笔记爬取结束后的回调函数是一个awaitable类型的函数
max_note_count: 帖子最大获取数量如果为0则获取所有
creator_page_html_content: 创作者主页HTML内容
user_name: Creator username
crawl_interval: Crawl delay interval in seconds
callback: Callback function after one post crawl completes, an awaitable function
max_note_count: Maximum number of posts to retrieve, if 0 then get all
creator_page_html_content: Creator homepage HTML content
Returns:
"""
# 百度贴吧比较特殊一些前10个帖子是直接展示在主页上的要单独处理通过API获取不到
# Baidu Tieba is special, the first 10 posts are directly displayed on the homepage and need special handling, cannot be obtained through API
result: List[TiebaNote] = []
if creator_page_html_content:
thread_id_list = (self._page_extractor.extract_tieba_thread_id_list_from_creator_page(creator_page_html_content))

View File

@@ -79,9 +79,9 @@ class TieBaCrawler(AbstractCrawler):
)
async with async_playwright() as playwright:
# 根据配置选择启动模式
# Choose startup mode based on configuration
if config.ENABLE_CDP_MODE:
utils.logger.info("[BaiduTieBaCrawler] 使用CDP模式启动浏览器")
utils.logger.info("[BaiduTieBaCrawler] Launching browser in CDP mode")
self.browser_context = await self.launch_browser_with_cdp(
playwright,
playwright_proxy_format,
@@ -89,7 +89,7 @@ class TieBaCrawler(AbstractCrawler):
headless=config.CDP_HEADLESS,
)
else:
utils.logger.info("[BaiduTieBaCrawler] 使用标准模式启动浏览器")
utils.logger.info("[BaiduTieBaCrawler] Launching browser in standard mode")
# Launch a browser context.
chromium = playwright.chromium
self.browser_context = await self.launch_browser(
@@ -99,12 +99,12 @@ class TieBaCrawler(AbstractCrawler):
headless=config.HEADLESS,
)
# 注入反检测脚本 - 针对百度的特殊检测
# Inject anti-detection scripts - for Baidu's special detection
await self._inject_anti_detection_scripts()
self.context_page = await self.browser_context.new_page()
# 先访问百度首页,再点击贴吧链接,避免触发安全验证
# First visit Baidu homepage, then click Tieba link to avoid triggering security verification
await self._navigate_to_tieba_via_baidu()
# Create a client to interact with the baidutieba website.
@@ -399,29 +399,29 @@ class TieBaCrawler(AbstractCrawler):
async def _navigate_to_tieba_via_baidu(self):
"""
模拟真实用户访问路径:
1. 先访问百度首页 (https://www.baidu.com/)
2. 等待页面加载
3. 点击顶部导航栏的"贴吧"链接
4. 跳转到贴吧首页
Simulate real user access path:
1. First visit Baidu homepage (https://www.baidu.com/)
2. Wait for page to load
3. Click "Tieba" link in top navigation bar
4. Jump to Tieba homepage
这样做可以避免触发百度的安全验证
This avoids triggering Baidu's security verification
"""
utils.logger.info("[TieBaCrawler] 模拟真实用户访问路径...")
utils.logger.info("[TieBaCrawler] Simulating real user access path...")
try:
# Step 1: 访问百度首页
utils.logger.info("[TieBaCrawler] Step 1: 访问百度首页 https://www.baidu.com/")
# Step 1: Visit Baidu homepage
utils.logger.info("[TieBaCrawler] Step 1: Visiting Baidu homepage https://www.baidu.com/")
await self.context_page.goto("https://www.baidu.com/", wait_until="domcontentloaded")
# Step 2: 等待页面加载,使用配置文件中的延时设置
utils.logger.info(f"[TieBaCrawler] Step 2: 等待 {config.CRAWLER_MAX_SLEEP_SEC}秒 模拟用户浏览...")
# Step 2: Wait for page loading, using delay setting from config file
utils.logger.info(f"[TieBaCrawler] Step 2: Waiting {config.CRAWLER_MAX_SLEEP_SEC} seconds to simulate user browsing...")
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
# Step 3: 查找并点击"贴吧"链接
utils.logger.info("[TieBaCrawler] Step 3: 查找并点击'贴吧'链接...")
# Step 3: Find and click "Tieba" link
utils.logger.info("[TieBaCrawler] Step 3: Finding and clicking 'Tieba' link...")
# 尝试多种选择器,确保能找到贴吧链接
# Try multiple selectors to ensure finding the Tieba link
tieba_selectors = [
'a[href="http://tieba.baidu.com/"]',
'a[href="https://tieba.baidu.com/"]',
@@ -434,74 +434,74 @@ class TieBaCrawler(AbstractCrawler):
try:
tieba_link = await self.context_page.wait_for_selector(selector, timeout=5000)
if tieba_link:
utils.logger.info(f"[TieBaCrawler] 找到贴吧链接 (selector: {selector})")
utils.logger.info(f"[TieBaCrawler] Found Tieba link (selector: {selector})")
break
except Exception:
continue
if not tieba_link:
utils.logger.warning("[TieBaCrawler] 未找到贴吧链接,直接访问贴吧首页")
utils.logger.warning("[TieBaCrawler] Tieba link not found, directly accessing Tieba homepage")
await self.context_page.goto(self.index_url, wait_until="domcontentloaded")
return
# Step 4: 点击贴吧链接 (检查是否会打开新标签页)
utils.logger.info("[TieBaCrawler] Step 4: 点击贴吧链接...")
# Step 4: Click Tieba link (check if it will open in a new tab)
utils.logger.info("[TieBaCrawler] Step 4: Clicking Tieba link...")
# 检查链接的target属性
# Check link's target attribute
target_attr = await tieba_link.get_attribute("target")
utils.logger.info(f"[TieBaCrawler] 链接target属性: {target_attr}")
utils.logger.info(f"[TieBaCrawler] Link target attribute: {target_attr}")
if target_attr == "_blank":
# 如果是新标签页,需要等待新页面并切换
utils.logger.info("[TieBaCrawler] 链接会在新标签页打开,等待新页面...")
# If it's a new tab, need to wait for new page and switch
utils.logger.info("[TieBaCrawler] Link will open in new tab, waiting for new page...")
async with self.browser_context.expect_page() as new_page_info:
await tieba_link.click()
# 获取新打开的页面
# Get newly opened page
new_page = await new_page_info.value
await new_page.wait_for_load_state("domcontentloaded")
# 关闭旧的百度首页
# Close old Baidu homepage
await self.context_page.close()
# 切换到新的贴吧页面
# Switch to new Tieba page
self.context_page = new_page
utils.logger.info("[TieBaCrawler] ✅ 已切换到新标签页 (贴吧页面)")
utils.logger.info("[TieBaCrawler] Successfully switched to new tab (Tieba page)")
else:
# 如果是同一标签页跳转,正常等待导航
utils.logger.info("[TieBaCrawler] 链接在当前标签页跳转...")
# If it's same tab navigation, wait for navigation normally
utils.logger.info("[TieBaCrawler] Link navigates in current tab...")
async with self.context_page.expect_navigation(wait_until="domcontentloaded"):
await tieba_link.click()
# Step 5: 等待页面稳定,使用配置文件中的延时设置
utils.logger.info(f"[TieBaCrawler] Step 5: 页面加载完成,等待 {config.CRAWLER_MAX_SLEEP_SEC}...")
# Step 5: Wait for page to stabilize, using delay setting from config file
utils.logger.info(f"[TieBaCrawler] Step 5: Page loaded, waiting {config.CRAWLER_MAX_SLEEP_SEC} seconds...")
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
current_url = self.context_page.url
utils.logger.info(f"[TieBaCrawler] ✅ 成功通过百度首页进入贴吧! 当前URL: {current_url}")
utils.logger.info(f"[TieBaCrawler] Successfully entered Tieba via Baidu homepage! Current URL: {current_url}")
except Exception as e:
utils.logger.error(f"[TieBaCrawler] 通过百度首页访问贴吧失败: {e}")
utils.logger.info("[TieBaCrawler] 回退:直接访问贴吧首页")
utils.logger.error(f"[TieBaCrawler] Failed to access Tieba via Baidu homepage: {e}")
utils.logger.info("[TieBaCrawler] Fallback: directly accessing Tieba homepage")
await self.context_page.goto(self.index_url, wait_until="domcontentloaded")
async def _inject_anti_detection_scripts(self):
"""
注入反检测JavaScript脚本
针对百度贴吧的特殊检测机制
Inject anti-detection JavaScript scripts
For Baidu Tieba's special detection mechanism
"""
utils.logger.info("[TieBaCrawler] Injecting anti-detection scripts...")
# 轻量级反检测脚本,只覆盖关键检测点
# Lightweight anti-detection script, only covering key detection points
anti_detection_js = """
// 覆盖 navigator.webdriver
// Override navigator.webdriver
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined,
configurable: true
});
// 覆盖 window.navigator.chrome
// Override window.navigator.chrome
if (!window.navigator.chrome) {
window.navigator.chrome = {
runtime: {},
@@ -511,7 +511,7 @@ class TieBaCrawler(AbstractCrawler):
};
}
// 覆盖 Permissions API
// Override Permissions API
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) => (
parameters.name === 'notifications' ?
@@ -519,19 +519,19 @@ class TieBaCrawler(AbstractCrawler):
originalQuery(parameters)
);
// 覆盖 plugins 长度(让它看起来有插件)
// Override plugins length (make it look like there are plugins)
Object.defineProperty(navigator, 'plugins', {
get: () => [1, 2, 3, 4, 5],
configurable: true
});
// 覆盖 languages
// Override languages
Object.defineProperty(navigator, 'languages', {
get: () => ['zh-CN', 'zh', 'en'],
configurable: true
});
// 移除 window.cdc_ ChromeDriver 残留
// Remove window.cdc_ and other ChromeDriver remnants
delete window.cdc_adoQpoasnfa76pfcZLmcfl_Array;
delete window.cdc_adoQpoasnfa76pfcZLmcfl_Promise;
delete window.cdc_adoQpoasnfa76pfcZLmcfl_Symbol;
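A minimal sketch (not part of the diff), assuming the JS string above is held in anti_detection_js and registered on the browser context before pages are created (e.g. via Playwright's add_init_script) so it runs ahead of any page script:

await browser_context.add_init_script(script=anti_detection_js)
page = await browser_context.new_page()
assert await page.evaluate("() => navigator.webdriver") is None  # override in effect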
@@ -548,21 +548,21 @@ class TieBaCrawler(AbstractCrawler):
"""
Create tieba client with real browser User-Agent and complete headers
Args:
httpx_proxy: HTTP代理
ip_pool: IP代理池
httpx_proxy: HTTP proxy
ip_pool: IP proxy pool
Returns:
BaiduTieBaClient实例
BaiduTieBaClient instance
"""
utils.logger.info("[TieBaCrawler.create_tieba_client] Begin create tieba API client...")
# 从真实浏览器提取User-Agent,避免被检测
# Extract User-Agent from real browser to avoid detection
user_agent = await self.context_page.evaluate("() => navigator.userAgent")
utils.logger.info(f"[TieBaCrawler.create_tieba_client] Extracted User-Agent from browser: {user_agent}")
cookie_str, cookie_dict = utils.convert_cookies(await self.browser_context.cookies())
# 构建完整的浏览器请求头,模拟真实浏览器行为
# Build complete browser request headers, simulating real browser behavior
tieba_client = BaiduTieBaClient(
timeout=10,
ip_pool=ip_pool,
@@ -572,7 +572,7 @@ class TieBaCrawler(AbstractCrawler):
"Accept-Language": "zh-CN,zh;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"User-Agent": user_agent, # 使用真实浏览器的UA
"User-Agent": user_agent, # Use real browser UA
"Cookie": cookie_str,
"Host": "tieba.baidu.com",
"Referer": "https://tieba.baidu.com/",
@@ -585,7 +585,7 @@ class TieBaCrawler(AbstractCrawler):
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"macOS"',
},
playwright_page=self.context_page, # 传入playwright页面对象
playwright_page=self.context_page, # Pass in playwright page object
)
return tieba_client
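A rough sketch (not part of the diff) of how the client headers are assembled from the live browser session; the cookie join approximates what utils.convert_cookies presumably returns and is an assumption, not the library's actual implementation.

user_agent = await page.evaluate("() => navigator.userAgent")
cookies = await browser_context.cookies()
cookie_str = "; ".join(f"{c['name']}={c['value']}" for c in cookies)   # assumed format
headers = {"User-Agent": user_agent, "Cookie": cookie_str, "Referer": "https://tieba.baidu.com/"}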
@@ -623,7 +623,7 @@ class TieBaCrawler(AbstractCrawler):
proxy=playwright_proxy, # type: ignore
viewport={"width": 1920, "height": 1080},
user_agent=user_agent,
channel="chrome", # 使用系统的Chrome稳定版
channel="chrome", # Use system's stable Chrome version
)
return browser_context
else:
@@ -641,7 +641,7 @@ class TieBaCrawler(AbstractCrawler):
headless: bool = True,
) -> BrowserContext:
"""
使用CDP模式启动浏览器
Launch browser using CDP mode
"""
try:
self.cdp_manager = CDPBrowserManager()
@@ -652,15 +652,15 @@ class TieBaCrawler(AbstractCrawler):
headless=headless,
)
# 显示浏览器信息
# Display browser information
browser_info = await self.cdp_manager.get_browser_info()
utils.logger.info(f"[TieBaCrawler] CDP浏览器信息: {browser_info}")
utils.logger.info(f"[TieBaCrawler] CDP browser info: {browser_info}")
return browser_context
except Exception as e:
utils.logger.error(f"[TieBaCrawler] CDP模式启动失败,回退到标准模式: {e}")
# 回退到标准模式
utils.logger.error(f"[TieBaCrawler] CDP mode launch failed, falling back to standard mode: {e}")
# Fall back to standard mode
chromium = playwright.chromium
return await self.launch_browser(
chromium, playwright_proxy, user_agent, headless
@@ -672,7 +672,7 @@ class TieBaCrawler(AbstractCrawler):
Returns:
"""
# 如果使用CDP模式需要特殊处理
# If using CDP mode, need special handling
if self.cdp_manager:
await self.cdp_manager.cleanup()
self.cdp_manager = None

View File

@@ -23,16 +23,16 @@ from enum import Enum
class SearchSortType(Enum):
"""search sort type"""
# 按时间倒序
# Sort by time in descending order
TIME_DESC = "1"
# 按时间顺序
# Sort by time in ascending order
TIME_ASC = "0"
# 按相关性顺序
# Sort by relevance
RELEVANCE_ORDER = "2"
class SearchNoteType(Enum):
# 只看主题贴
# Only view main posts
MAIN_THREAD = "1"
# 混合模式(帖子+回复)
# Mixed mode (posts + replies)
FIXED_THREAD = "0"

View File

@@ -42,12 +42,12 @@ class TieBaExtractor:
@staticmethod
def extract_search_note_list(page_content: str) -> List[TiebaNote]:
"""
提取贴吧帖子列表,这里提取的关键词搜索结果页的数据,还缺少帖子的回复数和回复页等数据
Extract Tieba post list from keyword search result pages, still missing reply count and reply page data
Args:
page_content: 页面内容的HTML字符串
page_content: HTML string of page content
Returns:
包含帖子信息的字典列表
List of Tieba post objects
"""
xpath_selector = "//div[@class='s_post']"
post_list = Selector(text=page_content).xpath(xpath_selector)
@@ -71,12 +71,12 @@ class TieBaExtractor:
def extract_tieba_note_list(self, page_content: str) -> List[TiebaNote]:
"""
提取贴吧帖子列表
Extract Tieba post list from Tieba page
Args:
page_content:
page_content: HTML string of page content
Returns:
List of Tieba post objects
"""
page_content = page_content.replace('<!--', "")
content_selector = Selector(text=page_content)
@@ -106,21 +106,21 @@ class TieBaExtractor:
def extract_note_detail(self, page_content: str) -> TiebaNote:
"""
提取贴吧帖子详情
Extract Tieba post details from post detail page
Args:
page_content:
page_content: HTML string of page content
Returns:
Tieba post detail object
"""
content_selector = Selector(text=page_content)
first_floor_selector = content_selector.xpath("//div[@class='p_postlist'][1]")
only_view_author_link = content_selector.xpath("//*[@id='lzonly_cntn']/@href").get(default='').strip()
note_id = only_view_author_link.split("?")[0].split("/")[-1]
# 帖子回复数、回复页数
# Post reply count and reply page count
thread_num_infos = content_selector.xpath(
"//div[@id='thread_theme_5']//li[@class='l_reply_num']//span[@class='red']")
# IP地理位置、发表时间
# IP location and publish time
other_info_content = content_selector.xpath(".//div[@class='post-tail-wrap']").get(default="").strip()
ip_location, publish_time = self.extract_ip_and_pub_time(other_info_content)
note = TiebaNote(note_id=note_id, title=content_selector.xpath("//title/text()").get(default='').strip(),
@@ -138,18 +138,18 @@ class TieBaExtractor:
publish_time=publish_time,
total_replay_num=thread_num_infos[0].xpath("./text()").get(default='').strip(),
total_replay_page=thread_num_infos[1].xpath("./text()").get(default='').strip(), )
note.title = note.title.replace(f"{note.tieba_name}】_百度贴吧", "")
note.title = note.title.replace(f"{note.tieba_name}】_Baidu Tieba", "")
return note
def extract_tieba_note_parment_comments(self, page_content: str, note_id: str) -> List[TiebaComment]:
"""
提取贴吧帖子一级评论
Extract Tieba post first-level comments from comment page
Args:
page_content:
note_id:
page_content: HTML string of page content
note_id: Post ID
Returns:
List of first-level comment objects
"""
xpath_selector = "//div[@class='l_post l_post_bright j_l_post clearfix ']"
comment_list = Selector(text=page_content).xpath(xpath_selector)
@@ -180,13 +180,13 @@ class TieBaExtractor:
def extract_tieba_note_sub_comments(self, page_content: str, parent_comment: TiebaComment) -> List[TiebaComment]:
"""
提取贴吧帖子二级评论
Extract Tieba post second-level comments from sub-comment page
Args:
page_content:
parent_comment:
page_content: HTML string of page content
parent_comment: Parent comment object
Returns:
List of second-level comment objects
"""
selector = Selector(page_content)
comments = []
@@ -215,12 +215,12 @@ class TieBaExtractor:
def extract_creator_info(self, html_content: str) -> TiebaCreator:
"""
提取贴吧创作者信息
Extract Tieba creator information from creator homepage
Args:
html_content:
html_content: HTML string of creator homepage
Returns:
Tieba creator object
"""
selector = Selector(text=html_content)
user_link_selector = selector.xpath("//p[@class='space']/a")
@@ -251,12 +251,12 @@ class TieBaExtractor:
html_content: str
) -> List[str]:
"""
提取贴吧创作者主页的帖子列表
Extract post ID list from Tieba creator's homepage
Args:
html_content:
html_content: HTML string of creator homepage
Returns:
List of post IDs
"""
selector = Selector(text=html_content)
thread_id_list = []
@@ -271,12 +271,12 @@ class TieBaExtractor:
def extract_ip_and_pub_time(self, html_content: str) -> Tuple[str, str]:
"""
提取IP位置和发布时间
Extract IP location and publish time from HTML content
Args:
html_content:
html_content: HTML string
Returns:
Tuple of (IP location, publish time)
"""
pattern_pub_time = re.compile(r'<span class="tail-info">(\d{4}-\d{2}-\d{2} \d{2}:\d{2})</span>')
time_match = pattern_pub_time.search(html_content)
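A minimal, runnable sketch (not part of the diff) of the publish-time regex above against a hypothetical post-tail snippet:

import re

html_snippet = '<div class="post-tail-wrap"><span class="tail-info">2024-06-01 12:30</span></div>'
pattern_pub_time = re.compile(r'<span class="tail-info">(\d{4}-\d{2}-\d{2} \d{2}:\d{2})</span>')
m = pattern_pub_time.search(html_snippet)
print(m.group(1) if m else "")  # -> 2024-06-01 12:30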
@@ -286,12 +286,12 @@ class TieBaExtractor:
@staticmethod
def extract_ip(html_content: str) -> str:
"""
提取IP
Extract IP location from HTML content
Args:
html_content:
html_content: HTML string
Returns:
IP location string
"""
pattern_ip = re.compile(r'IP属地:(\S+)</span>')
ip_match = pattern_ip.search(html_content)
@@ -301,28 +301,28 @@ class TieBaExtractor:
@staticmethod
def extract_gender(html_content: str) -> str:
"""
提取性别
Extract gender from HTML content
Args:
html_content:
html_content: HTML string
Returns:
Gender string ('Male', 'Female', or 'Unknown')
"""
if GENDER_MALE in html_content:
return '男'
return 'Male'
elif GENDER_FEMALE in html_content:
return '女'
return '未知'
return 'Female'
return 'Unknown'
@staticmethod
def extract_follow_and_fans(selectors: List[Selector]) -> Tuple[str, str]:
"""
提取关注数和粉丝数
Extract follow count and fan count from selectors
Args:
selectors:
selectors: List of selector objects
Returns:
Tuple of (follow count, fan count)
"""
pattern = re.compile(r'<span class="concern_num">\(<a[^>]*>(\d+)</a>\)</span>')
follow_match = pattern.findall(selectors[0].get())
@@ -334,9 +334,15 @@ class TieBaExtractor:
@staticmethod
def extract_registration_duration(html_content: str) -> str:
"""
"<span>吧龄:1.9年</span>"
Returns: 1.9年
Extract Tieba age from HTML content
Example: "<span>吧龄:1.9年</span>"
Returns: "1.9年"
Args:
html_content: HTML string
Returns:
Tieba age string
"""
pattern = re.compile(r'<span>吧龄:(\S+)</span>')
match = pattern.search(html_content)
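A minimal, runnable sketch (not part of the diff) of the Tieba-age regex above applied to the docstring's own example (note the full-width colon in the pattern):

import re

pattern = re.compile(r'<span>吧龄:(\S+)</span>')
m = pattern.search("<span>吧龄:1.9年</span>")
print(m.group(1) if m else "")  # -> 1.9年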
@@ -345,22 +351,22 @@ class TieBaExtractor:
@staticmethod
def extract_data_field_value(selector: Selector) -> Dict:
"""
提取data-field的值
Extract data-field value from selector
Args:
selector:
selector: Selector object
Returns:
Dictionary containing data-field value
"""
data_field_value = selector.xpath("./@data-field").get(default='').strip()
if not data_field_value or data_field_value == "{}":
return {}
try:
# 先使用 html.unescape 处理转义字符 再json.loads 将 JSON 字符串转换为 Python 字典
# First use html.unescape to handle escape characters, then json.loads to convert JSON string to Python dictionary
unescaped_json_str = html.unescape(data_field_value)
data_field_dict_value = json.loads(unescaped_json_str)
except Exception as ex:
print(f"extract_data_field_value,错误信息:{ex}, 尝试使用其他方式解析")
print(f"extract_data_field_value, error: {ex}, trying alternative parsing method")
data_field_dict_value = {}
return data_field_dict_value
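A minimal, runnable sketch (not part of the diff) of the unescape-then-parse step above; the data-field payload is a hypothetical, HTML-escaped JSON attribute value.

import html
import json

data_field_value = '{&quot;author&quot;:{&quot;user_id&quot;:12345},&quot;content&quot;:{&quot;post_id&quot;:678}}'
unescaped = html.unescape(data_field_value)  # -> {"author":{"user_id":12345},"content":{"post_id":678}}
parsed = json.loads(unescaped)
print(parsed["author"]["user_id"])           # -> 12345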

View File

@@ -50,7 +50,7 @@ class BaiduTieBaLogin(AbstractLogin):
@retry(stop=stop_after_attempt(600), wait=wait_fixed(1), retry=retry_if_result(lambda value: value is False))
async def check_login_state(self) -> bool:
"""
轮训检查登录状态是否成功成功返回True否则返回False
Poll to check if login status is successful, return True if successful, otherwise return False
Returns:

View File

@@ -20,7 +20,7 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/23 15:40
# @Desc : 微博爬虫 API 请求 client
# @Desc : Weibo crawler API request client
import asyncio
import copy
@@ -49,7 +49,7 @@ class WeiboClient(ProxyRefreshMixin):
def __init__(
self,
timeout=60, # 若开启爬取媒体选项weibo 的图片需要更久的超时时间
timeout=60, # If media crawling is enabled, Weibo images need a longer timeout
proxy=None,
*,
headers: Dict[str, str],
@@ -64,12 +64,12 @@ class WeiboClient(ProxyRefreshMixin):
self.playwright_page = playwright_page
self.cookie_dict = cookie_dict
self._image_agent_host = "https://i1.wp.com/"
# 初始化代理池(来自 ProxyRefreshMixin
# Initialize proxy pool (from ProxyRefreshMixin)
self.init_proxy_pool(proxy_ip_pool)
@retry(stop=stop_after_attempt(5), wait=wait_fixed(3))
async def request(self, method, url, **kwargs) -> Union[Response, Dict]:
# 每次请求前检测代理是否过期
# Check if proxy is expired before each request
await self._refresh_proxy_if_expired()
enable_return_response = kwargs.pop("return_response", False)
@@ -82,7 +82,7 @@ class WeiboClient(ProxyRefreshMixin):
try:
data: Dict = response.json()
except json.decoder.JSONDecodeError:
# issue: #771 搜索接口会报错432 多次重试 + 更新 h5 cookies
# issue: #771 Search API returns error 432, retry multiple times + update h5 cookies
utils.logger.error(f"[WeiboClient.request] request {method}:{url} err code: {response.status_code} res:{response.text}")
await self.playwright_page.goto(self._host)
await asyncio.sleep(2)
@@ -156,9 +156,9 @@ class WeiboClient(ProxyRefreshMixin):
) -> Dict:
"""
search note by keyword
:param keyword: 微博搜搜的关键词
:param page: 分页参数 -当前页码
:param search_type: 搜索的类型,见 weibo/filed.py 中的枚举SearchType
:param keyword: Search keyword for Weibo
:param page: Pagination parameter - current page number
:param search_type: Search type, see SearchType enum in weibo/field.py
:return:
"""
uri = "/api/container/getIndex"
@@ -172,9 +172,9 @@ class WeiboClient(ProxyRefreshMixin):
async def get_note_comments(self, mid_id: str, max_id: int, max_id_type: int = 0) -> Dict:
"""get notes comments
:param mid_id: 微博ID
:param max_id: 分页参数ID
:param max_id_type: 分页参数ID类型
:param mid_id: Weibo ID
:param max_id: Pagination parameter ID
:param max_id_type: Pagination parameter ID type
:return:
"""
uri = "/comments/hotflow"
@@ -218,7 +218,7 @@ class WeiboClient(ProxyRefreshMixin):
is_end = max_id == 0
if len(result) + len(comment_list) > max_count:
comment_list = comment_list[:max_count - len(result)]
if callback: # 如果有回调函数,就执行回调函数
if callback: # If callback function exists, execute it
await callback(note_id, comment_list)
await asyncio.sleep(crawl_interval)
result.extend(comment_list)
@@ -233,7 +233,7 @@ class WeiboClient(ProxyRefreshMixin):
callback: Optional[Callable] = None,
) -> List[Dict]:
"""
获取评论的所有子评论
Get all sub-comments of comments
Args:
note_id:
comment_list:
@@ -256,7 +256,7 @@ class WeiboClient(ProxyRefreshMixin):
async def get_note_info_by_id(self, note_id: str) -> Dict:
"""
根据帖子ID获取详情
Get note details by note ID
:param note_id:
:return:
"""
@@ -273,22 +273,22 @@ class WeiboClient(ProxyRefreshMixin):
note_item = {"mblog": note_detail}
return note_item
else:
utils.logger.info(f"[WeiboClient.get_note_info_by_id] 未找到$render_data的值")
utils.logger.info(f"[WeiboClient.get_note_info_by_id] $render_data value not found")
return dict()
async def get_note_image(self, image_url: str) -> bytes:
image_url = image_url[8:] # 去掉 https://
image_url = image_url[8:] # Remove https://
sub_url = image_url.split("/")
image_url = ""
for i in range(len(sub_url)):
if i == 1:
image_url += "large/" # 都获取高清大图
image_url += "large/" # Get high-resolution images
elif i == len(sub_url) - 1:
image_url += sub_url[i]
else:
image_url += sub_url[i] + "/"
# 微博图床对外存在防盗链,所以需要代理访问
# 由于微博图片是通过 i1.wp.com 来访问的,所以需要拼接一下
# Weibo image hosting has anti-hotlinking, so proxy access is needed
# Since Weibo images are accessed through i1.wp.com, we need to concatenate the URL
final_uri = (f"{self._image_agent_host}"
f"{image_url}")
async with httpx.AsyncClient(proxy=self.proxy) as client:
@@ -301,18 +301,18 @@ class WeiboClient(ProxyRefreshMixin):
else:
return response.content
except httpx.HTTPError as exc: # some wrong when call httpx.request method, such as connection error, client error, server error or response status code is not 2xx
utils.logger.error(f"[DouYinClient.get_aweme_media] {exc.__class__.__name__} for {exc.request.url} - {exc}") # 保留原始异常类型名称,以便开发者调试
utils.logger.error(f"[DouYinClient.get_aweme_media] {exc.__class__.__name__} for {exc.request.url} - {exc}") # Keep original exception type name for developer debugging
return None
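A minimal, runnable sketch (not part of the diff) of the URL rewriting in get_note_image above: a hypothetical Weibo image URL is switched to the "large" size and routed through the i1.wp.com image proxy to bypass hotlink protection.

image_url = "https://wx1.sinaimg.cn/orj360/abc123.jpg"[8:]  # drop "https://"
parts = image_url.split("/")                                 # [host, size, filename]
rebuilt = parts[0] + "/" + "large/" + parts[-1]
final_uri = "https://i1.wp.com/" + rebuilt
print(final_uri)  # -> https://i1.wp.com/wx1.sinaimg.cn/large/abc123.jpg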
async def get_creator_container_info(self, creator_id: str) -> Dict:
"""
获取用户的容器ID, 容器信息代表着真实请求的API路径
fid_container_id用户的微博详情API的容器ID
lfid_container_id用户的微博列表API的容器ID
Get user's container ID, container information represents the real API request path
fid_container_id: Container ID for user's Weibo detail API
lfid_container_id: Container ID for user's Weibo list API
Args:
creator_id:
creator_id: User ID
Returns: {
Returns: Dictionary with container IDs
"""
response = await self.get(f"/u/{creator_id}", return_response=True)
@@ -324,7 +324,7 @@ class WeiboClient(ProxyRefreshMixin):
async def get_creator_info_by_id(self, creator_id: str) -> Dict:
"""
根据用户ID获取用户详情
Get user details by user ID
Args:
creator_id:
@@ -349,11 +349,11 @@ class WeiboClient(ProxyRefreshMixin):
since_id: str = "0",
) -> Dict:
"""
获取博主的笔记
Get creator's notes
Args:
creator: 博主ID
container_id: 容器ID
since_id: 上一页最后一条笔记的ID
creator: Creator ID
container_id: Container ID
since_id: ID of the last note from previous page
Returns:
"""
@@ -376,14 +376,14 @@ class WeiboClient(ProxyRefreshMixin):
callback: Optional[Callable] = None,
) -> List[Dict]:
"""
获取指定用户下的所有发过的帖子,该方法会一直查找一个用户下的所有帖子信息
Get all posts published by a specified user, this method will continuously fetch all posts from a user
Args:
creator_id:
container_id:
crawl_interval:
callback:
creator_id: Creator user ID
container_id: Container ID for the user
crawl_interval: Interval between requests in seconds
callback: Optional callback function to process notes
Returns:
Returns: List of all notes
"""
result = []
@@ -393,7 +393,7 @@ class WeiboClient(ProxyRefreshMixin):
while notes_has_more:
notes_res = await self.get_notes_by_creator(creator_id, container_id, since_id)
if not notes_res:
utils.logger.error(f"[WeiboClient.get_notes_by_creator] The current creator may have been banned by xhs, so they cannot access the data.")
utils.logger.error(f"[WeiboClient.get_notes_by_creator] The current creator may have been banned by Weibo, so they cannot access the data.")
break
since_id = notes_res.get("cardlistInfo", {}).get("since_id", "0")
if "cards" not in notes_res:

View File

@@ -20,7 +20,7 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/23 15:41
# @Desc : 微博爬虫主流程代码
# @Desc : Weibo crawler main workflow code
import asyncio
import os
@@ -63,7 +63,7 @@ class WeiboCrawler(AbstractCrawler):
self.user_agent = utils.get_user_agent()
self.mobile_user_agent = utils.get_mobile_user_agent()
self.cdp_manager = None
self.ip_proxy_pool = None # 代理IP池用于代理自动刷新
self.ip_proxy_pool = None # Proxy IP pool for automatic proxy refresh
async def start(self):
playwright_proxy_format, httpx_proxy_format = None, None
@@ -73,9 +73,9 @@ class WeiboCrawler(AbstractCrawler):
playwright_proxy_format, httpx_proxy_format = utils.format_proxy_info(ip_proxy_info)
async with async_playwright() as playwright:
# 根据配置选择启动模式
# Select launch mode based on configuration
if config.ENABLE_CDP_MODE:
utils.logger.info("[WeiboCrawler] 使用CDP模式启动浏览器")
utils.logger.info("[WeiboCrawler] Launching browser with CDP mode")
self.browser_context = await self.launch_browser_with_cdp(
playwright,
playwright_proxy_format,
@@ -83,7 +83,7 @@ class WeiboCrawler(AbstractCrawler):
headless=config.CDP_HEADLESS,
)
else:
utils.logger.info("[WeiboCrawler] 使用标准模式启动浏览器")
utils.logger.info("[WeiboCrawler] Launching browser with standard mode")
# Launch a browser context.
chromium = playwright.chromium
self.browser_context = await self.launch_browser(chromium, None, self.mobile_user_agent, headless=config.HEADLESS)
@@ -109,11 +109,11 @@ class WeiboCrawler(AbstractCrawler):
)
await login_obj.begin()
# 登录成功后重定向到手机端的网站,再更新手机端登录成功的cookie
# After successful login, redirect to mobile website and update mobile cookies
utils.logger.info("[WeiboCrawler.start] redirect weibo mobile homepage and update cookies on mobile platform")
await self.context_page.goto(self.mobile_index_url)
await asyncio.sleep(3)
# 只获取移动端的 cookies避免 PC 端和移动端 cookies 混淆
# Only get mobile cookies to avoid confusion between PC and mobile cookies
await self.wb_client.update_cookies(
browser_context=self.browser_context,
urls=[self.mobile_index_url]
@@ -170,6 +170,8 @@ class WeiboCrawler(AbstractCrawler):
search_res = await self.wb_client.get_note_by_keyword(keyword=keyword, page=page, search_type=search_type)
note_id_list: List[str] = []
note_list = filter_search_result_card(search_res.get("cards"))
# If full text fetching is enabled, batch get full text of posts
note_list = await self.batch_get_notes_full_text(note_list)
for note_item in note_list:
if note_item:
mblog: Dict = note_item.get("mblog")
@@ -276,11 +278,18 @@ class WeiboCrawler(AbstractCrawler):
utils.logger.info(f"[WeiboCrawler.get_note_images] Crawling image mode is not enabled")
return
pics: Dict = mblog.get("pics")
pics: List = mblog.get("pics")
if not pics:
return
for pic in pics:
url = pic.get("url")
if isinstance(pic, str):
url = pic
pid = url.split("/")[-1].split(".")[0]
elif isinstance(pic, dict):
url = pic.get("url")
pid = pic.get("pid", "")
else:
continue
if not url:
continue
content = await self.wb_client.get_note_image(url)
@@ -288,7 +297,7 @@ class WeiboCrawler(AbstractCrawler):
utils.logger.info(f"[WeiboCrawler.get_note_images] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after fetching image")
if content != None:
extension_file_name = url.split(".")[-1]
await weibo_store.update_weibo_note_image(pic["pid"], content, extension_file_name)
await weibo_store.update_weibo_note_image(pid, content, extension_file_name)
async def get_creators_and_notes(self) -> None:
"""
@@ -306,12 +315,18 @@ class WeiboCrawler(AbstractCrawler):
raise DataFetchError("Get creator info error")
await weibo_store.save_creator(user_id, user_info=createor_info)
# Create a wrapper callback to get full text before saving data
async def save_notes_with_full_text(note_list: List[Dict]):
# If full text fetching is enabled, batch get full text first
updated_note_list = await self.batch_get_notes_full_text(note_list)
await weibo_store.batch_update_weibo_notes(updated_note_list)
# Get all note information of the creator
all_notes_list = await self.wb_client.get_all_notes_by_creator_id(
creator_id=user_id,
container_id=f"107603{user_id}",
crawl_interval=0,
callback=weibo_store.batch_update_weibo_notes,
callback=save_notes_with_full_text,
)
note_ids = [note_item.get("mblog", {}).get("id") for note_item in all_notes_list if note_item.get("mblog", {}).get("id")]
@@ -335,7 +350,7 @@ class WeiboCrawler(AbstractCrawler):
},
playwright_page=self.context_page,
cookie_dict=cookie_dict,
proxy_ip_pool=self.ip_proxy_pool, # 传递代理池用于自动刷新
proxy_ip_pool=self.ip_proxy_pool, # Pass proxy pool for automatic refresh
)
return weibo_client_obj
@@ -360,7 +375,7 @@ class WeiboCrawler(AbstractCrawler):
"height": 1080
},
user_agent=user_agent,
channel="chrome", # 使用系统的Chrome稳定版
channel="chrome", # Use system's Chrome stable version
)
return browser_context
else:
@@ -376,7 +391,7 @@ class WeiboCrawler(AbstractCrawler):
headless: bool = True,
) -> BrowserContext:
"""
使用CDP模式启动浏览器
Launch browser with CDP mode
"""
try:
self.cdp_manager = CDPBrowserManager()
@@ -387,21 +402,76 @@ class WeiboCrawler(AbstractCrawler):
headless=headless,
)
# 显示浏览器信息
# Display browser information
browser_info = await self.cdp_manager.get_browser_info()
utils.logger.info(f"[WeiboCrawler] CDP浏览器信息: {browser_info}")
utils.logger.info(f"[WeiboCrawler] CDP browser info: {browser_info}")
return browser_context
except Exception as e:
utils.logger.error(f"[WeiboCrawler] CDP模式启动失败,回退到标准模式: {e}")
# 回退到标准模式
utils.logger.error(f"[WeiboCrawler] CDP mode startup failed, falling back to standard mode: {e}")
# Fallback to standard mode
chromium = playwright.chromium
return await self.launch_browser(chromium, playwright_proxy, user_agent, headless)
async def get_note_full_text(self, note_item: Dict) -> Dict:
"""
Get full text content of a post
If the post content is truncated (isLongText=True), request the detail API to get complete content
:param note_item: Post data, contains mblog field
:return: Updated post data
"""
if not config.ENABLE_WEIBO_FULL_TEXT:
return note_item
mblog = note_item.get("mblog", {})
if not mblog:
return note_item
# Check if it's a long text
is_long_text = mblog.get("isLongText", False)
if not is_long_text:
return note_item
note_id = mblog.get("id")
if not note_id:
return note_item
try:
utils.logger.info(f"[WeiboCrawler.get_note_full_text] Fetching full text for note: {note_id}")
full_note = await self.wb_client.get_note_info_by_id(note_id)
if full_note and full_note.get("mblog"):
# Replace original content with complete content
note_item["mblog"] = full_note["mblog"]
utils.logger.info(f"[WeiboCrawler.get_note_full_text] Successfully fetched full text for note: {note_id}")
# Sleep after request to avoid rate limiting
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
except DataFetchError as ex:
utils.logger.error(f"[WeiboCrawler.get_note_full_text] Failed to fetch full text for note {note_id}: {ex}")
except Exception as ex:
utils.logger.error(f"[WeiboCrawler.get_note_full_text] Unexpected error for note {note_id}: {ex}")
return note_item
async def batch_get_notes_full_text(self, note_list: List[Dict]) -> List[Dict]:
"""
Batch get full text content of posts
:param note_list: List of posts
:return: Updated list of posts
"""
if not config.ENABLE_WEIBO_FULL_TEXT:
return note_list
result = []
for note_item in note_list:
updated_note = await self.get_note_full_text(note_item)
result.append(updated_note)
return result
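A minimal sketch (not part of the diff) of the wrapper-callback pattern used with batch_get_notes_full_text above, assuming `crawler` is a WeiboCrawler instance and `store_notes` is an awaitable saver such as weibo_store.batch_update_weibo_notes:

async def save_with_full_text(note_list):
    # expand truncated (isLongText) posts before persisting
    note_list = await crawler.batch_get_notes_full_text(note_list)
    await store_notes(note_list)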
async def close(self):
"""Close browser context"""
# 如果使用CDP模式需要特殊处理
# Special handling if using CDP mode
if self.cdp_manager:
await self.cdp_manager.cleanup()
self.cdp_manager = None

View File

@@ -26,14 +26,14 @@ from enum import Enum
class SearchType(Enum):
# 综合
# Comprehensive
DEFAULT = "1"
# 实时
# Real-time
REAL_TIME = "61"
# 热门
# Popular
POPULAR = "60"
# 视频
# Video
VIDEO = "64"

View File

@@ -28,9 +28,9 @@ from typing import Dict, List
def filter_search_result_card(card_list: List[Dict]) -> List[Dict]:
"""
过滤微博搜索的结果,只保留card_type为9类型的数据
:param card_list:
:return:
Filter Weibo search results, only keep data with card_type of 9
:param card_list: List of card items from search results
:return: Filtered list of note items
"""
note_list: List[Dict] = []
for card_item in card_list:

View File

@@ -21,7 +21,7 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/23 15:42
# @Desc : 微博登录实现
# @Desc : Weibo login implementation
import asyncio
import functools

View File

@@ -45,7 +45,7 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
def __init__(
self,
timeout=60, # 若开启爬取媒体选项xhs 的长视频需要更久的超时时间
timeout=60, # If media crawling is enabled, Xiaohongshu long videos need longer timeout
proxy=None,
*,
headers: Dict[str, str],
@@ -58,43 +58,46 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
self.headers = headers
self._host = "https://edith.xiaohongshu.com"
self._domain = "https://www.xiaohongshu.com"
self.IP_ERROR_STR = "网络连接异常,请检查网络设置或重启试试"
self.IP_ERROR_STR = "Network connection error, please check network settings or restart"
self.IP_ERROR_CODE = 300012
self.NOTE_ABNORMAL_STR = "笔记状态异常,请稍后查看"
self.NOTE_ABNORMAL_STR = "Note status abnormal, please check later"
self.NOTE_ABNORMAL_CODE = -510001
self.playwright_page = playwright_page
self.cookie_dict = cookie_dict
self._extractor = XiaoHongShuExtractor()
# 初始化代理池(来自 ProxyRefreshMixin
# Initialize proxy pool (from ProxyRefreshMixin)
self.init_proxy_pool(proxy_ip_pool)
async def _pre_headers(self, url: str, params: Optional[Dict] = None, payload: Optional[Dict] = None) -> Dict:
"""请求头参数签名(使用 playwright 注入方式)
"""Request header parameter signing (using playwright injection method)
Args:
url: 请求的URL
params: GET请求的参数
payload: POST请求的参数
url: Request URL
params: GET request parameters
payload: POST request parameters
Returns:
Dict: 请求头参数签名
Dict: Signed request header parameters
"""
a1_value = self.cookie_dict.get("a1", "")
# 确定请求数据和 URI
# Determine request data, method and URI
if params is not None:
data = params
method = "GET"
elif payload is not None:
data = payload
method = "POST"
else:
raise ValueError("params or payload is required")
# 使用 playwright 注入方式生成签名
# Generate signature using playwright injection method
signs = await sign_with_playwright(
page=self.playwright_page,
uri=url,
data=data,
a1=a1_value,
method=method,
)
headers = {
@@ -109,16 +112,16 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
@retry(stop=stop_after_attempt(3), wait=wait_fixed(1))
async def request(self, method, url, **kwargs) -> Union[str, Any]:
"""
封装httpx的公共请求方法对请求响应做一些处理
Wrapper for httpx common request method, processes request response
Args:
method: 请求方法
url: 请求的URL
**kwargs: 其他请求参数,例如请求头、请求体等
method: Request method
url: Request URL
**kwargs: Other request parameters, such as headers, body, etc.
Returns:
"""
# 每次请求前检测代理是否过期
# Check if proxy is expired before each request
await self._refresh_proxy_if_expired()
# return response.text
@@ -130,7 +133,7 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
# someday someone maybe will bypass captcha
verify_type = response.headers["Verifytype"]
verify_uuid = response.headers["Verifyuuid"]
msg = f"出现验证码,请求失败,Verifytype: {verify_type}Verifyuuid: {verify_uuid}, Response: {response}"
msg = f"CAPTCHA appeared, request failed, Verifytype: {verify_type}, Verifyuuid: {verify_uuid}, Response: {response}"
utils.logger.error(msg)
raise Exception(msg)
@@ -147,32 +150,27 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
async def get(self, uri: str, params: Optional[Dict] = None) -> Dict:
"""
GET请求,对请求头签名
GET request, signs request headers
Args:
uri: 请求路由
params: 请求参数
uri: Request route
params: Request parameters
Returns:
"""
headers = await self._pre_headers(uri, params)
if isinstance(params, dict):
# 构建带参数的完整 URL
query_string = urlencode(params)
full_url = f"{self._host}{uri}?{query_string}"
else:
full_url = f"{self._host}{uri}"
full_url = f"{self._host}{uri}"
return await self.request(
method="GET", url=full_url, headers=headers
method="GET", url=full_url, headers=headers, params=params
)
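A minimal, runnable sketch (not part of the diff) showing that passing `params` to httpx (as the revised get() does) produces the same request URL as manually appending an urlencode()d query string; the endpoint path is hypothetical.

import httpx
from urllib.parse import urlencode

params = {"keyword": "test", "page": 1}
base = "https://edith.xiaohongshu.com/api/sns/web/v1/search/notes"
manual = f"{base}?{urlencode(params)}"
req = httpx.Request("GET", base, params=params)
assert str(req.url) == manual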
async def post(self, uri: str, data: dict, **kwargs) -> Dict:
"""
POST请求,对请求头签名
POST request, signs request headers
Args:
uri: 请求路由
data: 请求体参数
uri: Request route
data: Request body parameters
Returns:
@@ -188,7 +186,7 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
)
async def get_note_media(self, url: str) -> Union[bytes, None]:
# 请求前检测代理是否过期
# Check if proxy is expired before request
await self._refresh_proxy_if_expired()
async with httpx.AsyncClient(proxy=self.proxy) as client:
@@ -207,12 +205,12 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
) as exc: # some wrong when call httpx.request method, such as connection error, client error, server error or response status code is not 2xx
utils.logger.error(
f"[XiaoHongShuClient.get_aweme_media] {exc.__class__.__name__} for {exc.request.url} - {exc}"
) # 保留原始异常类型名称,以便开发者调试
) # Keep original exception type name for developer debugging
return None
async def pong(self) -> bool:
"""
用于检查登录态是否失效了
Check if login state is still valid
Returns:
"""
@@ -220,7 +218,7 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
utils.logger.info("[XiaoHongShuClient.pong] Begin to pong xhs...")
ping_flag = False
try:
note_card: Dict = await self.get_note_by_keyword(keyword="小红书")
note_card: Dict = await self.get_note_by_keyword(keyword="Xiaohongshu")
if note_card.get("items"):
ping_flag = True
except Exception as e:
@@ -232,9 +230,9 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
async def update_cookies(self, browser_context: BrowserContext):
"""
API客户端提供的更新cookies方法一般情况下登录成功后会调用此方法
Update cookies method provided by API client, usually called after successful login
Args:
browser_context: 浏览器上下文对象
browser_context: Browser context object
Returns:
@@ -253,13 +251,13 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
note_type: SearchNoteType = SearchNoteType.ALL,
) -> Dict:
"""
根据关键词搜索笔记
Search notes by keyword
Args:
keyword: 关键词参数
page: 分页第几页
page_size: 分页数据长度
sort: 搜索结果排序指定
note_type: 搜索的笔记类型
keyword: Keyword parameter
page: Page number
page_size: Page data length
sort: Search result sorting specification
note_type: Type of note to search
Returns:
@@ -282,11 +280,11 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
xsec_token: str,
) -> Dict:
"""
获取笔记详情API
Get note detail API
Args:
note_id:笔记ID
xsec_source: 渠道来源
xsec_token: 搜索关键字之后返回的比较列表中返回的token
note_id: Note ID
xsec_source: Channel source
xsec_token: Token returned from search keyword result list
Returns:
@@ -306,7 +304,7 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
if res and res.get("items"):
res_dict: Dict = res["items"][0]["note_card"]
return res_dict
# 爬取频繁了可能会出现有的笔记能有结果有的没有
# When crawling frequently, some notes may have results while others don't
utils.logger.error(
f"[XiaoHongShuClient.get_note_by_id] get note id:{note_id} empty and res:{res}"
)
@@ -319,11 +317,11 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
cursor: str = "",
) -> Dict:
"""
获取一级评论的API
Get first-level comments API
Args:
note_id: 笔记ID
xsec_token: 验证token
cursor: 分页游标
note_id: Note ID
xsec_token: Verification token
cursor: Pagination cursor
Returns:
@@ -347,13 +345,13 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
cursor: str = "",
):
"""
获取指定父评论下的子评论的API
Get sub-comments under specified parent comment API
Args:
note_id: 子评论的帖子ID
root_comment_id: 根评论ID
xsec_token: 验证token
num: 分页数量
cursor: 分页游标
note_id: Post ID of sub-comments
root_comment_id: Root comment ID
xsec_token: Verification token
num: Pagination quantity
cursor: Pagination cursor
Returns:
@@ -362,7 +360,7 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
params = {
"note_id": note_id,
"root_comment_id": root_comment_id,
"num": num,
"num": str(num),
"cursor": cursor,
"image_formats": "jpg,webp,avif",
"top_comment_id": "",
@@ -379,13 +377,13 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
max_count: int = 10,
) -> List[Dict]:
"""
获取指定笔记下的所有一级评论,该方法会一直查找一个帖子下的所有评论信息
Get all first-level comments under specified note, this method will continuously find all comment information under a post
Args:
note_id: 笔记ID
xsec_token: 验证token
crawl_interval: 爬取一次笔记的延迟单位(秒)
callback: 一次笔记爬取结束后
max_count: 一次笔记爬取的最大评论数量
note_id: Note ID
xsec_token: Verification token
crawl_interval: Crawl delay per note (seconds)
callback: Callback after one note crawl ends
max_count: Maximum number of comments to crawl per note
Returns:
"""
@@ -427,12 +425,12 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
callback: Optional[Callable] = None,
) -> List[Dict]:
"""
获取指定一级评论下的所有二级评论, 该方法会一直查找一级评论下的所有二级评论信息
Get all second-level comments under specified first-level comments, this method will continuously find all second-level comment information under first-level comments
Args:
comments: 评论列表
xsec_token: 验证token
crawl_interval: 爬取一次评论的延迟单位(秒)
callback: 一次评论爬取结束后
comments: Comment list
xsec_token: Verification token
crawl_interval: Crawl delay per comment (seconds)
callback: Callback after one comment crawl ends
Returns:
@@ -489,18 +487,18 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
self, user_id: str, xsec_token: str = "", xsec_source: str = ""
) -> Dict:
"""
通过解析网页版的用户主页HTML获取用户个人简要信息
PC端用户主页的网页存在window.__INITIAL_STATE__这个变量上的解析它即可
Get user profile brief information by parsing user homepage HTML
The PC user homepage has window.__INITIAL_STATE__ variable, just parse it
Args:
user_id: 用户ID
xsec_token: 验证token (可选,如果URL中包含此参数则传入)
xsec_source: 渠道来源 (可选,如果URL中包含此参数则传入)
user_id: User ID
xsec_token: Verification token (optional, pass if included in URL)
xsec_source: Channel source (optional, pass if included in URL)
Returns:
Dict: 创作者信息
Dict: Creator information
"""
# 构建URI,如果有xsec参数则添加到URL中
# Build URI, add xsec parameters to URL if available
uri = f"/user/profile/{user_id}"
if xsec_token and xsec_source:
uri = f"{uri}?xsec_token={xsec_token}&xsec_source={xsec_source}"
@@ -519,13 +517,13 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
xsec_source: str = "pc_feed",
) -> Dict:
"""
获取博主的笔记
Get creator's notes
Args:
creator: 博主ID
cursor: 上一页最后一条笔记的ID
page_size: 分页数据长度
xsec_token: 验证token
xsec_source: 渠道来源
creator: Creator ID
cursor: Last note ID from previous page
page_size: Page data length
xsec_token: Verification token
xsec_source: Channel source
Returns:
@@ -549,13 +547,13 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
xsec_source: str = "pc_feed",
) -> List[Dict]:
"""
获取指定用户下的所有发过的帖子,该方法会一直查找一个用户下的所有帖子信息
Get all posts published by specified user, this method will continuously find all post information under a user
Args:
user_id: 用户ID
crawl_interval: 爬取一次的延迟单位(秒)
callback: 一次分页爬取结束后的更新回调函数
xsec_token: 验证token
xsec_source: 渠道来源
user_id: User ID
crawl_interval: Crawl delay (seconds)
callback: Update callback function after one pagination crawl ends
xsec_token: Verification token
xsec_source: Channel source
Returns:
@@ -604,9 +602,9 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
async def get_note_short_url(self, note_id: str) -> Dict:
"""
获取笔记的短链接
Get note short URL
Args:
note_id: 笔记ID
note_id: Note ID
Returns:
@@ -624,7 +622,7 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
enable_cookie: bool = False,
) -> Optional[Dict]:
"""
通过解析网页版的笔记详情页HTML获取笔记详情, 该接口可能会出现失败的情况这里尝试重试3次
Get note details by parsing note detail page HTML, this interface may fail, retry 3 times here
copy from https://github.com/ReaJason/xhs/blob/eb1c5a0213f6fbb592f0a2897ee552847c69ea2d/xhs/core.py#L217-L259
thanks for ReaJason
Args:

View File

@@ -34,7 +34,6 @@ from tenacity import RetryError
import config
from base.base_crawler import AbstractCrawler
from config import CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES
from model.m_xiaohongshu import NoteUrlInfo, CreatorUrlInfo
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
from store import xhs as xhs_store
@@ -60,7 +59,7 @@ class XiaoHongShuCrawler(AbstractCrawler):
# self.user_agent = utils.get_user_agent()
self.user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
self.cdp_manager = None
self.ip_proxy_pool = None # 代理IP池用于代理自动刷新
self.ip_proxy_pool = None # Proxy IP pool for automatic proxy refresh
async def start(self) -> None:
playwright_proxy_format, httpx_proxy_format = None, None
@@ -70,9 +69,9 @@ class XiaoHongShuCrawler(AbstractCrawler):
playwright_proxy_format, httpx_proxy_format = utils.format_proxy_info(ip_proxy_info)
async with async_playwright() as playwright:
# 根据配置选择启动模式
# Choose launch mode based on configuration
if config.ENABLE_CDP_MODE:
utils.logger.info("[XiaoHongShuCrawler] 使用CDP模式启动浏览器")
utils.logger.info("[XiaoHongShuCrawler] Launching browser using CDP mode")
self.browser_context = await self.launch_browser_with_cdp(
playwright,
playwright_proxy_format,
@@ -80,7 +79,7 @@ class XiaoHongShuCrawler(AbstractCrawler):
headless=config.CDP_HEADLESS,
)
else:
utils.logger.info("[XiaoHongShuCrawler] 使用标准模式启动浏览器")
utils.logger.info("[XiaoHongShuCrawler] Launching browser using standard mode")
# Launch a browser context.
chromium = playwright.chromium
self.browser_context = await self.launch_browser(
@@ -95,7 +94,7 @@ class XiaoHongShuCrawler(AbstractCrawler):
self.context_page = await self.browser_context.new_page()
await self.context_page.goto(self.index_url)
# Create a client to interact with the xiaohongshu website.
# Create a client to interact with the Xiaohongshu website.
self.xhs_client = await self.create_xhs_client(httpx_proxy_format)
if not await self.xhs_client.pong():
login_obj = XiaoHongShuLogin(
@@ -125,8 +124,8 @@ class XiaoHongShuCrawler(AbstractCrawler):
async def search(self) -> None:
"""Search for notes and retrieve their comment information."""
utils.logger.info("[XiaoHongShuCrawler.search] Begin search xiaohongshu keywords")
xhs_limit_count = 20 # xhs limit page fixed value
utils.logger.info("[XiaoHongShuCrawler.search] Begin search Xiaohongshu keywords")
xhs_limit_count = 20 # Xiaohongshu limit page fixed value
if config.CRAWLER_MAX_NOTES_COUNT < xhs_limit_count:
config.CRAWLER_MAX_NOTES_COUNT = xhs_limit_count
start_page = config.START_PAGE
@@ -142,7 +141,7 @@ class XiaoHongShuCrawler(AbstractCrawler):
continue
try:
utils.logger.info(f"[XiaoHongShuCrawler.search] search xhs keyword: {keyword}, page: {page}")
utils.logger.info(f"[XiaoHongShuCrawler.search] search Xiaohongshu keyword: {keyword}, page: {page}")
note_ids: List[str] = []
xsec_tokens: List[str] = []
notes_res = await self.xhs_client.get_note_by_keyword(
@@ -151,9 +150,9 @@ class XiaoHongShuCrawler(AbstractCrawler):
page=page,
sort=(SearchSortType(config.SORT_TYPE) if config.SORT_TYPE != "" else SearchSortType.GENERAL),
)
utils.logger.info(f"[XiaoHongShuCrawler.search] Search notes res:{notes_res}")
utils.logger.info(f"[XiaoHongShuCrawler.search] Search notes response: {notes_res}")
if not notes_res or not notes_res.get("has_more", False):
utils.logger.info("No more content!")
utils.logger.info("[XiaoHongShuCrawler.search] No more content!")
break
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list = [
@@ -184,7 +183,7 @@ class XiaoHongShuCrawler(AbstractCrawler):
async def get_creators_and_notes(self) -> None:
"""Get creator's notes and retrieve their comment information."""
utils.logger.info("[XiaoHongShuCrawler.get_creators_and_notes] Begin get xiaohongshu creators")
utils.logger.info("[XiaoHongShuCrawler.get_creators_and_notes] Begin get Xiaohongshu creators")
for creator_url in config.XHS_CREATOR_ID_LIST:
try:
# Parse creator URL to get user_id and security tokens
@@ -223,9 +222,7 @@ class XiaoHongShuCrawler(AbstractCrawler):
await self.batch_get_note_comments(note_ids, xsec_tokens)
async def fetch_creator_notes_detail(self, note_list: List[Dict]):
"""
Concurrently obtain the specified post list and save the data
"""
"""Concurrently obtain the specified post list and save the data"""
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list = [
self.get_note_detail_async_task(
@@ -243,11 +240,9 @@ class XiaoHongShuCrawler(AbstractCrawler):
await self.get_notice_media(note_detail)
async def get_specified_notes(self):
"""
Get the information and comments of the specified post
must be specified note_id, xsec_source, xsec_token⚠
Returns:
"""Get the information and comments of the specified post
Note: Must specify note_id, xsec_source, xsec_token
"""
get_note_detail_task_list = []
for full_note_url in config.XHS_SPECIFIED_NOTE_URL_LIST:
@@ -348,7 +343,7 @@ class XiaoHongShuCrawler(AbstractCrawler):
xsec_token=xsec_token,
crawl_interval=crawl_interval,
callback=xhs_store.batch_update_xhs_note_comments,
max_count=CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES,
max_count=config.CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES,
)
# Sleep after fetching comments
@@ -356,8 +351,8 @@ class XiaoHongShuCrawler(AbstractCrawler):
utils.logger.info(f"[XiaoHongShuCrawler.get_comments] Sleeping for {crawl_interval} seconds after fetching comments for note {note_id}")
async def create_xhs_client(self, httpx_proxy: Optional[str]) -> XiaoHongShuClient:
"""Create xhs client"""
utils.logger.info("[XiaoHongShuCrawler.create_xhs_client] Begin create xiaohongshu API client ...")
"""Create Xiaohongshu client"""
utils.logger.info("[XiaoHongShuCrawler.create_xhs_client] Begin create Xiaohongshu API client ...")
cookie_str, cookie_dict = utils.convert_cookies(await self.browser_context.cookies())
xhs_client_obj = XiaoHongShuClient(
proxy=httpx_proxy,
@@ -381,7 +376,7 @@ class XiaoHongShuCrawler(AbstractCrawler):
},
playwright_page=self.context_page,
cookie_dict=cookie_dict,
proxy_ip_pool=self.ip_proxy_pool, # 传递代理池用于自动刷新
proxy_ip_pool=self.ip_proxy_pool, # Pass proxy pool for automatic refresh
)
return xhs_client_obj
@@ -422,9 +417,7 @@ class XiaoHongShuCrawler(AbstractCrawler):
user_agent: Optional[str],
headless: bool = True,
) -> BrowserContext:
"""
使用CDP模式启动浏览器
"""
"""Launch browser using CDP mode"""
try:
self.cdp_manager = CDPBrowserManager()
browser_context = await self.cdp_manager.launch_and_connect(
@@ -434,21 +427,21 @@ class XiaoHongShuCrawler(AbstractCrawler):
headless=headless,
)
# 显示浏览器信息
# Display browser information
browser_info = await self.cdp_manager.get_browser_info()
utils.logger.info(f"[XiaoHongShuCrawler] CDP浏览器信息: {browser_info}")
utils.logger.info(f"[XiaoHongShuCrawler] CDP browser info: {browser_info}")
return browser_context
except Exception as e:
utils.logger.error(f"[XiaoHongShuCrawler] CDP模式启动失败,回退到标准模式: {e}")
# 回退到标准模式
utils.logger.error(f"[XiaoHongShuCrawler] CDP mode launch failed, falling back to standard mode: {e}")
# Fall back to standard mode
chromium = playwright.chromium
return await self.launch_browser(chromium, playwright_proxy, user_agent, headless)
async def close(self):
"""Close browser context"""
# 如果使用CDP模式需要特殊处理
# Special handling if using CDP mode
if self.cdp_manager:
await self.cdp_manager.cleanup()
self.cdp_manager = None
@@ -464,10 +457,10 @@ class XiaoHongShuCrawler(AbstractCrawler):
await self.get_notice_video(note_detail)
async def get_note_images(self, note_item: Dict):
"""
get note images. please use get_notice_media
:param note_item:
:return:
"""Get note images. Please use get_notice_media
Args:
note_item: Note item dictionary
"""
if not config.ENABLE_GET_MEIDAS:
return
@@ -494,10 +487,10 @@ class XiaoHongShuCrawler(AbstractCrawler):
await xhs_store.update_xhs_note_image(note_id, content, extension_file_name)
async def get_notice_video(self, note_item: Dict):
"""
get note videos. please use get_notice_media
:param note_item:
:return:
"""Get note videos. Please use get_notice_media
Args:
note_item: Note item dictionary
"""
if not config.ENABLE_GET_MEIDAS:
return

View File

@@ -29,16 +29,16 @@ class XiaoHongShuExtractor:
pass
def extract_note_detail_from_html(self, note_id: str, html: str) -> Optional[Dict]:
"""从html中提取笔记详情
"""Extract note details from HTML
Args:
html (str): html字符串
html (str): HTML string
Returns:
Dict: 笔记详情字典
Dict: Note details dictionary
"""
if "noteDetailMap" not in html:
# 这种情况要么是出了验证码了,要么是笔记不存在
# Either a CAPTCHA appeared or the note doesn't exist
return None
state = re.findall(r"window.__INITIAL_STATE__=({.*})</script>", html)[
@@ -50,13 +50,13 @@ class XiaoHongShuExtractor:
return None
def extract_creator_info_from_html(self, html: str) -> Optional[Dict]:
"""从html中提取用户信息
"""Extract user information from HTML
Args:
html (str): html字符串
html (str): HTML string
Returns:
Dict: 用户信息字典
Dict: User information dictionary
"""
match = re.search(
r"<script>window.__INITIAL_STATE__=(.+)<\/script>", html, re.M

View File

@@ -23,27 +23,27 @@ from typing import NamedTuple
class FeedType(Enum):
# 推荐
# Recommend
RECOMMEND = "homefeed_recommend"
# 穿搭
# Fashion
FASION = "homefeed.fashion_v3"
# 美食
# Food
FOOD = "homefeed.food_v3"
# 彩妆
# Cosmetics
COSMETICS = "homefeed.cosmetics_v3"
# 影视
# Movie and TV
MOVIE = "homefeed.movie_and_tv_v3"
# 职场
# Career
CAREER = "homefeed.career_v3"
# 情感
# Emotion
EMOTION = "homefeed.love_v3"
# 家居
# Home
HOURSE = "homefeed.household_product_v3"
# 游戏
# Gaming
GAME = "homefeed.gaming_v3"
# 旅行
# Travel
TRAVEL = "homefeed.travel_v3"
# 健身
# Fitness
FITNESS = "homefeed.fitness_v3"
@@ -53,28 +53,27 @@ class NoteType(Enum):
class SearchSortType(Enum):
"""search sort type"""
# default
"""Search sort type"""
# Default
GENERAL = "general"
# most popular
# Most popular
MOST_POPULAR = "popularity_descending"
# Latest
LATEST = "time_descending"
class SearchNoteType(Enum):
"""search note type
"""
# default
"""Search note type"""
# Default
ALL = 0
# only video
# Only video
VIDEO = 1
# only image
# Only image
IMAGE = 2
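As used in XiaoHongShuCrawler.search() earlier in this diff, the crawler maps the plain config.SORT_TYPE string onto this enum and falls back to GENERAL when the config value is empty; a trimmed, runnable sketch of that conversion (the enum values are copied from above, the helper name is illustrative):
from enum import Enum
class SearchSortType(Enum):
    GENERAL = "general"
    MOST_POPULAR = "popularity_descending"
    LATEST = "time_descending"
def resolve_sort(sort_type: str) -> SearchSortType:
    # Empty string means "not configured", which maps to the default general sort.
    return SearchSortType(sort_type) if sort_type else SearchSortType.GENERAL
print(resolve_sort(""))                  # SearchSortType.GENERAL
print(resolve_sort("time_descending"))   # SearchSortType.LATEST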
class Note(NamedTuple):
"""note tuple"""
"""Note tuple"""
note_id: str
title: str
desc: str

View File

@@ -297,13 +297,13 @@ def get_img_urls_by_trace_id(trace_id: str, format_type: str = "png"):
def get_trace_id(img_url: str):
# 浏览器端上传的图片多了 /spectrum/ 这个路径
# Browser-uploaded images have an additional /spectrum/ path
return f"spectrum/{img_url.split('/')[-1]}" if img_url.find("spectrum") != -1 else img_url.split("/")[-1]
def parse_note_info_from_note_url(url: str) -> NoteUrlInfo:
"""
从小红书笔记url中解析出笔记信息
Parse note information from Xiaohongshu note URL
Args:
url: "https://www.xiaohongshu.com/explore/66fad51c000000001b0224b8?xsec_token=AB3rO-QopW5sgrJ41GwN01WCXh6yWPxjSoFI9D5JIMgKw=&xsec_source=pc_search"
Returns:
@@ -318,44 +318,44 @@ def parse_note_info_from_note_url(url: str) -> NoteUrlInfo:
def parse_creator_info_from_url(url: str) -> CreatorUrlInfo:
"""
从小红书创作者主页URL中解析出创作者信息
支持以下格式:
1. 完整URL: "https://www.xiaohongshu.com/user/profile/5eb8e1d400000000010075ae?xsec_token=AB1nWBKCo1vE2HEkfoJUOi5B6BE5n7wVrbdpHoWIj5xHw=&xsec_source=pc_feed"
2. ID: "5eb8e1d400000000010075ae"
Parse creator information from Xiaohongshu creator homepage URL
Supports the following formats:
1. Full URL: "https://www.xiaohongshu.com/user/profile/5eb8e1d400000000010075ae?xsec_token=AB1nWBKCo1vE2HEkfoJUOi5B6BE5n7wVrbdpHoWIj5xHw=&xsec_source=pc_feed"
2. Pure ID: "5eb8e1d400000000010075ae"
Args:
url: 创作者主页URL或user_id
url: Creator homepage URL or user_id
Returns:
CreatorUrlInfo: 包含user_id, xsec_token, xsec_source的对象
CreatorUrlInfo: Object containing user_id, xsec_token, xsec_source
"""
# 如果是纯ID格式(24位十六进制字符),直接返回
# If it's a pure ID format (24 hexadecimal characters), return directly
if len(url) == 24 and all(c in "0123456789abcdef" for c in url):
return CreatorUrlInfo(user_id=url, xsec_token="", xsec_source="")
# 从URL中提取user_id: /user/profile/xxx
# Extract user_id from URL: /user/profile/xxx
import re
user_pattern = r'/user/profile/([^/?]+)'
match = re.search(user_pattern, url)
if match:
user_id = match.group(1)
# 提取xsec_tokenxsec_source参数
# Extract xsec_token and xsec_source parameters
params = extract_url_params_to_dict(url)
xsec_token = params.get("xsec_token", "")
xsec_source = params.get("xsec_source", "")
return CreatorUrlInfo(user_id=user_id, xsec_token=xsec_token, xsec_source=xsec_source)
raise ValueError(f"无法从URL中解析出创作者信息: {url}")
raise ValueError(f"Unable to parse creator info from URL: {url}")
if __name__ == '__main__':
_img_url = "https://sns-img-bd.xhscdn.com/7a3abfaf-90c1-a828-5de7-022c80b92aa3"
# 获取一个图片地址在多个cdn下的url地址
# Get image URL addresses under multiple CDNs for a single image
# final_img_urls = get_img_urls_by_trace_id(get_trace_id(_img_url))
final_img_url = get_img_url_by_trace_id(get_trace_id(_img_url))
print(final_img_url)
# 测试创作者URL解析
print("\n=== 创作者URL解析测试 ===")
# Test creator URL parsing
print("\n=== Creator URL Parsing Test ===")
test_creator_urls = [
"https://www.xiaohongshu.com/user/profile/5eb8e1d400000000010075ae?xsec_token=AB1nWBKCo1vE2HEkfoJUOi5B6BE5n7wVrbdpHoWIj5xHw=&xsec_source=pc_feed",
"5eb8e1d400000000010075ae",
@@ -364,7 +364,7 @@ if __name__ == '__main__':
try:
result = parse_creator_info_from_url(url)
print(f"✓ URL: {url[:80]}...")
print(f" 结果: {result}\n")
print(f" Result: {result}\n")
except Exception as e:
print(f"✗ URL: {url}")
print(f" 错误: {e}\n")
print(f" Error: {e}\n")

View File

@@ -51,19 +51,37 @@ class XiaoHongShuLogin(AbstractLogin):
@retry(stop=stop_after_attempt(600), wait=wait_fixed(1), retry=retry_if_result(lambda value: value is False))
async def check_login_state(self, no_logged_in_session: str) -> bool:
"""
Check if the current login status is successful and return True otherwise return False
retry decorator will retry 20 times if the return value is False, and the retry interval is 1 second
if max retry times reached, raise RetryError
Verify login status using dual-check: UI elements and Cookies.
"""
# 1. Priority check: Check if the "Me" (Profile) node appears in the sidebar
try:
# Selector for elements containing "Me" text with a link pointing to the profile
# XPath Explanation: Find a span with text "Me" inside an anchor tag (<a>)
# whose href attribute contains "/user/profile/"
user_profile_selector = "xpath=//a[contains(@href, '/user/profile/')]//span[text()='']"
# Set a short timeout since this is called within a retry loop
is_visible = await self.context_page.is_visible(user_profile_selector, timeout=500)
if is_visible:
utils.logger.info("[XiaoHongShuLogin.check_login_state] Login status confirmed by UI element ('Me' button).")
return True
except Exception:
pass
# 2. Alternative: Check for CAPTCHA prompt
if "请通过验证" in await self.context_page.content():
utils.logger.info("[XiaoHongShuLogin.check_login_state] 登录过程中出现验证码,请手动验证")
utils.logger.info("[XiaoHongShuLogin.check_login_state] CAPTCHA appeared, please verify manually.")
# 3. Compatibility fallback: Original Cookie-based change detection
current_cookie = await self.browser_context.cookies()
_, cookie_dict = utils.convert_cookies(current_cookie)
current_web_session = cookie_dict.get("web_session")
if current_web_session != no_logged_in_session:
# If web_session has changed, consider the login successful
if current_web_session and current_web_session != no_logged_in_session:
utils.logger.info("[XiaoHongShuLogin.check_login_state] Login status confirmed by Cookie (web_session changed).")
return True
return False
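The decorator on check_login_state (tenacity retrying for as long as the coroutine returns False) is the polling mechanism behind this dual check. A self-contained sketch of the same retry-until-True pattern, assuming only the tenacity package; the dummy condition stands in for the UI and cookie checks:
import asyncio
from tenacity import retry, retry_if_result, stop_after_attempt, wait_fixed
state = {"calls": 0}
@retry(stop=stop_after_attempt(10), wait=wait_fixed(0.1), retry=retry_if_result(lambda ok: ok is False))
async def check_condition() -> bool:
    # Pretend the login only becomes visible on the third poll.
    state["calls"] += 1
    return state["calls"] >= 3
print(asyncio.run(check_condition()))  # True, after three attempts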
async def begin(self):
@@ -83,14 +101,14 @@ class XiaoHongShuLogin(AbstractLogin):
utils.logger.info("[XiaoHongShuLogin.login_by_mobile] Begin login xiaohongshu by mobile ...")
await asyncio.sleep(1)
try:
# 小红书进入首页后,有可能不会自动弹出登录框,需要手动点击登录按钮
# After entering Xiaohongshu homepage, the login dialog may not pop up automatically, need to manually click login button
login_button_ele = await self.context_page.wait_for_selector(
selector="xpath=//*[@id='app']/div[1]/div[2]/div[1]/ul/div[1]/button",
timeout=5000
)
await login_button_ele.click()
# 弹窗的登录对话框也有两种形态,一种是直接可以看到手机号和验证码的
# 另一种是需要点击切换到手机登录的
# The login dialog has two forms: one shows phone number and verification code directly
# The other requires clicking to switch to phone login
element = await self.context_page.wait_for_selector(
selector='xpath=//div[@class="login-container"]//div[@class="other-method"]/div[1]',
timeout=5000
@@ -106,11 +124,11 @@ class XiaoHongShuLogin(AbstractLogin):
await asyncio.sleep(0.5)
send_btn_ele = await login_container_ele.query_selector("label.auth-code > span")
await send_btn_ele.click() # 点击发送验证码
await send_btn_ele.click() # Click to send verification code
sms_code_input_ele = await login_container_ele.query_selector("label.auth-code > input")
submit_btn_ele = await login_container_ele.query_selector("div.input-container > button")
cache_client = CacheFactory.create_cache(config.CACHE_TYPE_MEMORY)
max_get_sms_code_time = 60 * 2 # 最长获取验证码的时间为2分钟
max_get_sms_code_time = 60 * 2 # Maximum time to get verification code is 2 minutes
no_logged_in_session = ""
while max_get_sms_code_time > 0:
utils.logger.info(f"[XiaoHongShuLogin.login_by_mobile] get sms code from redis remaining time {max_get_sms_code_time}s ...")
@@ -125,15 +143,15 @@ class XiaoHongShuLogin(AbstractLogin):
_, cookie_dict = utils.convert_cookies(current_cookie)
no_logged_in_session = cookie_dict.get("web_session")
await sms_code_input_ele.fill(value=sms_code_value.decode()) # 输入短信验证码
await sms_code_input_ele.fill(value=sms_code_value.decode()) # Enter SMS verification code
await asyncio.sleep(0.5)
agree_privacy_ele = self.context_page.locator("xpath=//div[@class='agreements']//*[local-name()='svg']")
await agree_privacy_ele.click() # 点击同意隐私协议
await agree_privacy_ele.click() # Click to agree to privacy policy
await asyncio.sleep(0.5)
await submit_btn_ele.click() # 点击登录
await submit_btn_ele.click() # Click login
# todo ... 应该还需要检查验证码的正确性有可能输入的验证码不正确
# TODO: Should also check if the verification code is correct, as it may be incorrect
break
try:
@@ -196,7 +214,7 @@ class XiaoHongShuLogin(AbstractLogin):
"""login xiaohongshu website by cookies"""
utils.logger.info("[XiaoHongShuLogin.login_by_cookies] Begin login xiaohongshu by cookie ...")
for key, value in utils.convert_str_cookie_to_dict(self.cookie_str).items():
if key != "web_session": # only set web_session cookie attr
if key != "web_session": # Only set web_session cookie attribute
continue
await self.browser_context.add_cookies([{
'name': key,

View File

@@ -16,37 +16,71 @@
# For detailed license terms, please refer to the LICENSE file in the project root directory.
# By using this code, you agree to comply with the above principles and all terms in the LICENSE.
# 通过 Playwright 注入调用 window.mnsv2 生成小红书签名
# Generate Xiaohongshu signature by calling window.mnsv2 via Playwright injection
import hashlib
import json
import time
from typing import Any, Dict, Optional, Union
from urllib.parse import urlparse
from urllib.parse import urlparse, quote
from playwright.async_api import Page
from .xhs_sign import b64_encode, encode_utf8, get_trace_id, mrc
def _build_sign_string(uri: str, data: Optional[Union[Dict, str]] = None) -> str:
"""构建待签名字符串"""
c = uri
if data is not None:
def _build_sign_string(uri: str, data: Optional[Union[Dict, str]] = None, method: str = "POST") -> str:
"""Build string to be signed
Args:
uri: API path
data: Request data
method: Request method (GET or POST)
Returns:
String to be signed
"""
if method.upper() == "POST":
# POST request uses JSON format
c = uri
if data is not None:
if isinstance(data, dict):
c += json.dumps(data, separators=(",", ":"), ensure_ascii=False)
elif isinstance(data, str):
c += data
return c
else:
# GET request uses query string format
if not data or (isinstance(data, dict) and len(data) == 0):
return uri
if isinstance(data, dict):
c += json.dumps(data, separators=(",", ":"), ensure_ascii=False)
params = []
for key in data.keys():
value = data[key]
if isinstance(value, list):
value_str = ",".join(str(v) for v in value)
elif value is not None:
value_str = str(value)
else:
value_str = ""
# Use URL encoding (safe parameter preserves certain characters from encoding)
# Note: httpx will encode commas, equals signs, etc., we need to handle the same way
value_str = quote(value_str, safe='')
params.append(f"{key}={value_str}")
return f"{uri}?{'&'.join(params)}"
elif isinstance(data, str):
c += data
return c
return f"{uri}?{data}"
return uri
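To make the GET branch above concrete, a small runnable sketch of the query-string form that gets signed, with every value URL-encoded using safe='' as in the code (the example parameters are made up):
from urllib.parse import quote
def build_get_sign_string(uri: str, params: dict) -> str:
    parts = []
    for key, value in params.items():
        if isinstance(value, list):
            value_str = ",".join(str(v) for v in value)
        else:
            value_str = "" if value is None else str(value)
        # safe='' encodes commas, equals signs, etc., matching how httpx sends them.
        parts.append(f"{key}={quote(value_str, safe='')}")
    return f"{uri}?{'&'.join(parts)}" if parts else uri
print(build_get_sign_string("/api/sns/web/v1/search/notes", {"keyword": "咖啡", "page": 1}))
# /api/sns/web/v1/search/notes?keyword=%E5%92%96%E5%95%A1&page=1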
def _md5_hex(s: str) -> str:
"""计算 MD5 哈希值"""
"""Calculate MD5 hash value"""
return hashlib.md5(s.encode("utf-8")).hexdigest()
def _build_xs_payload(x3_value: str, data_type: str = "object") -> str:
"""构建 x-s 签名"""
"""Build x-s signature"""
s = {
"x0": "4.2.1",
"x1": "xhs-pc-web",
@@ -58,7 +92,7 @@ def _build_xs_payload(x3_value: str, data_type: str = "object") -> str:
def _build_xs_common(a1: str, b1: str, x_s: str, x_t: str) -> str:
"""构建 x-s-common 请求头"""
"""Build x-s-common request header"""
payload = {
"s0": 3,
"s1": "",
@@ -79,7 +113,7 @@ def _build_xs_common(a1: str, b1: str, x_s: str, x_t: str) -> str:
async def get_b1_from_localstorage(page: Page) -> str:
""" localStorage 获取 b1 值"""
"""Get b1 value from localStorage"""
try:
local_storage = await page.evaluate("() => window.localStorage")
return local_storage.get("b1", "")
@@ -89,15 +123,15 @@ async def get_b1_from_localstorage(page: Page) -> str:
async def call_mnsv2(page: Page, sign_str: str, md5_str: str) -> str:
"""
通过 playwright 调用 window.mnsv2 函数
Call window.mnsv2 function via playwright
Args:
page: playwright Page 对象
sign_str: 待签名字符串 (uri + JSON.stringify(data))
md5_str: sign_str 的 MD5 哈希值
page: playwright Page object
sign_str: String to be signed (uri + JSON.stringify(data))
md5_str: MD5 hash value of sign_str
Returns:
mnsv2 返回的签名字符串
Signature string returned by mnsv2
"""
sign_str_escaped = sign_str.replace("\\", "\\\\").replace("'", "\\'").replace("\n", "\\n")
md5_str_escaped = md5_str.replace("\\", "\\\\").replace("'", "\\'")
@@ -113,19 +147,21 @@ async def sign_xs_with_playwright(
page: Page,
uri: str,
data: Optional[Union[Dict, str]] = None,
method: str = "POST",
) -> str:
"""
通过 playwright 注入生成 x-s 签名
Generate x-s signature via playwright injection
Args:
page: playwright Page 对象(必须已打开小红书页面)
uri: API 路径,如 "/api/sns/web/v1/search/notes"
data: 请求数据(GET params POST payload
page: playwright Page object (must have Xiaohongshu page open)
uri: API path, e.g., "/api/sns/web/v1/search/notes"
data: Request data (GET params or POST payload)
method: Request method (GET or POST)
Returns:
x-s 签名字符串
x-s signature string
"""
sign_str = _build_sign_string(uri, data)
sign_str = _build_sign_string(uri, data, method)
md5_str = _md5_hex(sign_str)
x3_value = await call_mnsv2(page, sign_str, md5_str)
data_type = "object" if isinstance(data, (dict, list)) else "string"
@@ -137,21 +173,23 @@ async def sign_with_playwright(
uri: str,
data: Optional[Union[Dict, str]] = None,
a1: str = "",
method: str = "POST",
) -> Dict[str, Any]:
"""
通过 playwright 生成完整的签名请求头
Generate complete signature request headers via playwright
Args:
page: playwright Page 对象(必须已打开小红书页面)
uri: API 路径
data: 请求数据
a1: cookie 中的 a1 值
page: playwright Page object (must have Xiaohongshu page open)
uri: API path
data: Request data
a1: a1 value from cookie
method: Request method (GET or POST)
Returns:
包含 x-s, x-t, x-s-common, x-b3-traceid 的字典
Dictionary containing x-s, x-t, x-s-common, x-b3-traceid
"""
b1 = await get_b1_from_localstorage(page)
x_s = await sign_xs_with_playwright(page, uri, data)
x_s = await sign_xs_with_playwright(page, uri, data, method)
x_t = str(int(time.time() * 1000))
return {
@@ -170,30 +208,33 @@ async def pre_headers_with_playwright(
payload: Optional[Dict] = None,
) -> Dict[str, str]:
"""
使用 playwright 注入方式生成请求头签名
可直接替换 client.py 中的 _pre_headers 方法
Generate request header signature using playwright injection method
Can directly replace _pre_headers method in client.py
Args:
page: playwright Page 对象
url: 请求 URL
cookie_dict: cookie 字典
params: GET 请求参数
payload: POST 请求参数
page: playwright Page object
url: Request URL
cookie_dict: Cookie dictionary
params: GET request parameters
payload: POST request parameters
Returns:
签名后的请求头字典
Signed request header dictionary
"""
a1_value = cookie_dict.get("a1", "")
uri = urlparse(url).path
# Determine request data and method
if params is not None:
data = params
method = "GET"
elif payload is not None:
data = payload
method = "POST"
else:
raise ValueError("params or payload is required")
signs = await sign_with_playwright(page, uri, data, a1_value)
signs = await sign_with_playwright(page, uri, data, a1_value, method)
return {
"X-S": signs["x-s"],

View File

@@ -16,19 +16,19 @@
# For detailed license terms, please refer to the LICENSE file in the project root directory.
# By using this code, you agree to comply with the above principles and all terms in the LICENSE.
# 小红书签名算法核心函数
# 用于 playwright 注入方式生成签名
# Xiaohongshu signature algorithm core functions
# Used for generating signatures via playwright injection
import ctypes
import random
from urllib.parse import quote
# 自定义 Base64 字符表
# 标准 Base64: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
# 小红书打乱顺序用于混淆
# Custom Base64 character table
# Standard Base64: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
# Xiaohongshu shuffled order for obfuscation
BASE64_CHARS = list("ZmserbBoHQtNP+wOcza/LpngG8yJq42KWYj0DSfdikx3VT16IlUAFM97hECvuRX5")
# CRC32 查表
# CRC32 lookup table
CRC32_TABLE = [
0, 1996959894, 3993919788, 2567524794, 124634137, 1886057615, 3915621685,
2657392035, 249268274, 2044508324, 3772115230, 2547177864, 162941995,
@@ -77,14 +77,14 @@ CRC32_TABLE = [
def _right_shift_unsigned(num: int, bit: int = 0) -> int:
"""JavaScript 无符号右移 (>>>) 的 Python 实现"""
"""Python implementation of JavaScript unsigned right shift (>>>)"""
val = ctypes.c_uint32(num).value >> bit
MAX32INT = 4294967295
return (val + (MAX32INT + 1)) % (2 * (MAX32INT + 1)) - MAX32INT - 1
def mrc(e: str) -> int:
"""CRC32 变体,用于 x-s-common 的 x9 字段"""
"""CRC32 variant, used for x9 field in x-s-common"""
o = -1
for n in range(min(57, len(e))):
o = CRC32_TABLE[(o & 255) ^ ord(e[n])] ^ _right_shift_unsigned(o, 8)
@@ -92,7 +92,7 @@ def mrc(e: str) -> int:
def _triplet_to_base64(e: int) -> str:
"""将 24 位整数转换为 4 个 Base64 字符"""
"""Convert 24-bit integer to 4 Base64 characters"""
return (
BASE64_CHARS[(e >> 18) & 63]
+ BASE64_CHARS[(e >> 12) & 63]
@@ -102,7 +102,7 @@ def _triplet_to_base64(e: int) -> str:
def _encode_chunk(data: list, start: int, end: int) -> str:
"""编码数据块"""
"""Encode data chunk"""
result = []
for i in range(start, end, 3):
c = ((data[i] << 16) & 0xFF0000) + ((data[i + 1] << 8) & 0xFF00) + (data[i + 2] & 0xFF)
@@ -111,7 +111,7 @@ def _encode_chunk(data: list, start: int, end: int) -> str:
def encode_utf8(s: str) -> list:
"""将字符串编码为 UTF-8 字节列表"""
"""Encode string to UTF-8 byte list"""
encoded = quote(s, safe="~()*!.'")
result = []
i = 0
@@ -126,7 +126,7 @@ def encode_utf8(s: str) -> list:
def b64_encode(data: list) -> str:
"""自定义 Base64 编码"""
"""Custom Base64 encoding"""
length = len(data)
remainder = length % 3
chunks = []
@@ -148,5 +148,5 @@ def b64_encode(data: list) -> str:
def get_trace_id() -> str:
"""生成链路追踪 trace id"""
"""Generate trace id for link tracing"""
return "".join(random.choice("abcdef0123456789") for _ in range(16))

View File

@@ -60,14 +60,14 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
self.default_headers = headers
self.cookie_dict = cookie_dict
self._extractor = ZhihuExtractor()
# 初始化代理池(来自 ProxyRefreshMixin
# Initialize proxy pool (from ProxyRefreshMixin)
self.init_proxy_pool(proxy_ip_pool)
async def _pre_headers(self, url: str) -> Dict:
"""
请求头参数签名
Sign request headers
Args:
url: 请求的URL需要包含请求的参数
url: Request URL with query parameters
Returns:
"""
@@ -83,16 +83,16 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
@retry(stop=stop_after_attempt(3), wait=wait_fixed(1))
async def request(self, method, url, **kwargs) -> Union[str, Any]:
"""
封装httpx的公共请求方法对请求响应做一些处理
Wrapper for httpx common request method with response handling
Args:
method: 请求方法
url: 请求的URL
**kwargs: 其他请求参数,例如请求头、请求体等
method: Request method
url: Request URL
**kwargs: Other request parameters such as headers, body, etc.
Returns:
"""
# 每次请求前检测代理是否过期
# Check if proxy is expired before each request
await self._refresh_proxy_if_expired()
# return response.text
@@ -105,7 +105,7 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
utils.logger.error(f"[ZhiHuClient.request] Requset Url: {url}, Request error: {response.text}")
if response.status_code == 403:
raise ForbiddenError(response.text)
elif response.status_code == 404: # 如果一个content没有评论也是404
elif response.status_code == 404: # Content without comments also returns 404
return {}
raise DataFetchError(response.text)
@@ -124,10 +124,10 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
async def get(self, uri: str, params=None, **kwargs) -> Union[Response, Dict, str]:
"""
GET请求,对请求头签名
GET request with header signing
Args:
uri: 请求路由
params: 请求参数
uri: Request URI
params: Request parameters
Returns:
@@ -141,7 +141,7 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
async def pong(self) -> bool:
"""
用于检查登录态是否失效了
Check if login status is still valid
Returns:
"""
@@ -161,9 +161,9 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
async def update_cookies(self, browser_context: BrowserContext):
"""
API客户端提供的更新cookies方法一般情况下登录成功后会调用此方法
Update cookies method provided by API client, typically called after successful login
Args:
browser_context: 浏览器上下文对象
browser_context: Browser context object
Returns:
@@ -174,7 +174,7 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
async def get_current_user_info(self) -> Dict:
"""
获取当前登录用户信息
Get current logged-in user information
Returns:
"""
@@ -191,14 +191,14 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
search_time: SearchTime = SearchTime.DEFAULT,
) -> List[ZhihuContent]:
"""
根据关键词搜索
Search by keyword
Args:
keyword: 关键词
page: 第几页
page_size: 分页size
sort: 排序
note_type: 搜索结果类型
search_time: 搜索多久时间的结果
keyword: Search keyword
page: Page number
page_size: Page size
sort: Sorting method
note_type: Search result type
search_time: Time range for search results
Returns:
@@ -232,10 +232,10 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
order_by: str = "score",
) -> Dict:
"""
获取内容的一级评论
Get root-level comments for content
Args:
content_id: 内容ID
content_type: 内容类型(answer, article, zvideo)
content_id: Content ID
content_type: Content type (answer, article, zvideo)
offset:
limit:
order_by:
@@ -262,7 +262,7 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
order_by: str = "sort",
) -> Dict:
"""
获取一级评论下的子评论
Get child comments under a root comment
Args:
root_comment_id:
offset:
@@ -287,11 +287,11 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
callback: Optional[Callable] = None,
) -> List[ZhihuComment]:
"""
获取指定帖子下的所有一级评论,该方法会一直查找一个帖子下的所有评论信息
Get all root-level comments for a specified post, this method will retrieve all comment information under a post
Args:
content: 内容详情对象(问题|文章|视频)
crawl_interval: 爬取一次笔记的延迟单位(秒)
callback: 一次笔记爬取结束后
content: Content detail object (question|article|video)
crawl_interval: Crawl delay interval in seconds
callback: Callback after completing one crawl
Returns:
@@ -328,12 +328,12 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
callback: Optional[Callable] = None,
) -> List[ZhihuComment]:
"""
获取指定评论下的所有子评论
Get all sub-comments under specified comments
Args:
content: 内容详情对象(问题|文章|视频)
comments: 评论列表
crawl_interval: 爬取一次笔记的延迟单位(秒)
callback: 一次笔记爬取结束后
content: Content detail object (question|article|video)
comments: Comment list
crawl_interval: Crawl delay interval in seconds
callback: Callback after completing one crawl
Returns:
@@ -370,7 +370,7 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
async def get_creator_info(self, url_token: str) -> Optional[ZhihuCreator]:
"""
获取创作者信息
Get creator information
Args:
url_token:
@@ -383,7 +383,7 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
async def get_creator_answers(self, url_token: str, offset: int = 0, limit: int = 20) -> Dict:
"""
获取创作者的回答
Get creator's answers
Args:
url_token:
offset:
@@ -405,7 +405,7 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
async def get_creator_articles(self, url_token: str, offset: int = 0, limit: int = 20) -> Dict:
"""
获取创作者的文章
Get creator's articles
Args:
url_token:
offset:
@@ -426,7 +426,7 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
async def get_creator_videos(self, url_token: str, offset: int = 0, limit: int = 20) -> Dict:
"""
获取创作者的视频
Get creator's videos
Args:
url_token:
offset:
@@ -446,11 +446,11 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
async def get_all_anwser_by_creator(self, creator: ZhihuCreator, crawl_interval: float = 1.0, callback: Optional[Callable] = None) -> List[ZhihuContent]:
"""
获取创作者的所有回答
Get all answers by creator
Args:
creator: 创作者信息
crawl_interval: 爬取一次笔记的延迟单位(秒)
callback: 一次笔记爬取结束后
creator: Creator information
crawl_interval: Crawl delay interval in seconds
callback: Callback after completing one crawl
Returns:
@@ -481,7 +481,7 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
callback: Optional[Callable] = None,
) -> List[ZhihuContent]:
"""
获取创作者的所有文章
Get all articles by creator
Args:
creator:
crawl_interval:
@@ -515,7 +515,7 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
callback: Optional[Callable] = None,
) -> List[ZhihuContent]:
"""
获取创作者的所有视频
Get all videos by creator
Args:
creator:
crawl_interval:
@@ -548,7 +548,7 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
answer_id: str,
) -> Optional[ZhihuContent]:
"""
获取回答信息
Get answer information
Args:
question_id:
answer_id:
@@ -562,7 +562,7 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
async def get_article_info(self, article_id: str) -> Optional[ZhihuContent]:
"""
获取文章信息
Get article information
Args:
article_id:
@@ -575,7 +575,7 @@ class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
async def get_video_info(self, video_id: str) -> Optional[ZhihuContent]:
"""
获取视频信息
Get video information
Args:
video_id:

View File

@@ -61,7 +61,7 @@ class ZhihuCrawler(AbstractCrawler):
self.user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
self._extractor = ZhihuExtractor()
self.cdp_manager = None
self.ip_proxy_pool = None # 代理IP池用于代理自动刷新
self.ip_proxy_pool = None # Proxy IP pool for automatic proxy refresh
async def start(self) -> None:
"""
@@ -80,9 +80,9 @@ class ZhihuCrawler(AbstractCrawler):
)
async with async_playwright() as playwright:
# 根据配置选择启动模式
# Choose launch mode based on configuration
if config.ENABLE_CDP_MODE:
utils.logger.info("[ZhihuCrawler] 使用CDP模式启动浏览器")
utils.logger.info("[ZhihuCrawler] Launching browser in CDP mode")
self.browser_context = await self.launch_browser_with_cdp(
playwright,
playwright_proxy_format,
@@ -90,7 +90,7 @@ class ZhihuCrawler(AbstractCrawler):
headless=config.CDP_HEADLESS,
)
else:
utils.logger.info("[ZhihuCrawler] 使用标准模式启动浏览器")
utils.logger.info("[ZhihuCrawler] Launching browser in standard mode")
# Launch a browser context.
chromium = playwright.chromium
self.browser_context = await self.launch_browser(
@@ -117,9 +117,9 @@ class ZhihuCrawler(AbstractCrawler):
browser_context=self.browser_context
)
# 知乎的搜索接口需要打开搜索页面之后cookies才能访问API单独的首页不行
# Zhihu's search API requires opening the search page first to access cookies, homepage alone won't work
utils.logger.info(
"[ZhihuCrawler.start] Zhihu跳转到搜索页面获取搜索页面的Cookies该过程需要5秒左右"
"[ZhihuCrawler.start] Zhihu navigating to search page to get search page cookies, this process takes about 5 seconds"
)
await self.context_page.goto(
f"{self.index_url}/search?q=python&search_source=Guess&utm_content=search_hot&type=content"
@@ -273,7 +273,7 @@ class ZhihuCrawler(AbstractCrawler):
)
await zhihu_store.save_creator(creator=createor_info)
# 默认只提取回答信息,如果需要文章和视频,把下面的注释打开即可
# By default, only answer information is extracted, uncomment below if articles and videos are needed
# Get all anwser information of the creator
all_content_list = await self.zhihu_client.get_all_anwser_by_creator(
@@ -315,7 +315,7 @@ class ZhihuCrawler(AbstractCrawler):
utils.logger.info(
f"[ZhihuCrawler.get_specified_notes] Begin get specified note {full_note_url}"
)
# judge note type
# Judge note type
note_type: str = judge_zhihu_url(full_note_url)
if note_type == constant.ANSWER_NAME:
question_id = full_note_url.split("/")[-3]
@@ -412,7 +412,7 @@ class ZhihuCrawler(AbstractCrawler):
},
playwright_page=self.context_page,
cookie_dict=cookie_dict,
proxy_ip_pool=self.ip_proxy_pool, # 传递代理池用于自动刷新
proxy_ip_pool=self.ip_proxy_pool, # Pass proxy pool for automatic refresh
)
return zhihu_client_obj
@@ -440,7 +440,7 @@ class ZhihuCrawler(AbstractCrawler):
proxy=playwright_proxy, # type: ignore
viewport={"width": 1920, "height": 1080},
user_agent=user_agent,
channel="chrome", # 使用系统的Chrome稳定版
channel="chrome", # Use system Chrome stable version
)
return browser_context
else:
@@ -458,7 +458,7 @@ class ZhihuCrawler(AbstractCrawler):
headless: bool = True,
) -> BrowserContext:
"""
使用CDP模式启动浏览器
Launch browser using CDP mode
"""
try:
self.cdp_manager = CDPBrowserManager()
@@ -469,15 +469,15 @@ class ZhihuCrawler(AbstractCrawler):
headless=headless,
)
# 显示浏览器信息
# Display browser information
browser_info = await self.cdp_manager.get_browser_info()
utils.logger.info(f"[ZhihuCrawler] CDP浏览器信息: {browser_info}")
utils.logger.info(f"[ZhihuCrawler] CDP browser info: {browser_info}")
return browser_context
except Exception as e:
utils.logger.error(f"[ZhihuCrawler] CDP模式启动失败,回退到标准模式: {e}")
# 回退到标准模式
utils.logger.error(f"[ZhihuCrawler] CDP mode launch failed, falling back to standard mode: {e}")
# Fall back to standard mode
chromium = playwright.chromium
return await self.launch_browser(
chromium, playwright_proxy, user_agent, headless
@@ -485,7 +485,7 @@ class ZhihuCrawler(AbstractCrawler):
async def close(self):
"""Close browser context"""
# 如果使用CDP模式需要特殊处理
# Special handling if using CDP mode
if self.cdp_manager:
await self.cdp_manager.cleanup()
self.cdp_manager = None

View File

@@ -26,31 +26,31 @@ from constant import zhihu as zhihu_constant
class SearchTime(Enum):
"""
搜索时间范围
Search time range
"""
DEFAULT = "" # 不限时间
ONE_DAY = "a_day" # 一天内
ONE_WEEK = "a_week" # 一周内
ONE_MONTH = "a_month" # 一个月内
THREE_MONTH = "three_months" # 三个月内
HALF_YEAR = "half_a_year" # 半年内
ONE_YEAR = "a_year" # 一年内
DEFAULT = "" # No time limit
ONE_DAY = "a_day" # Within one day
ONE_WEEK = "a_week" # Within one week
ONE_MONTH = "a_month" # Within one month
THREE_MONTH = "three_months" # Within three months
HALF_YEAR = "half_a_year" # Within half a year
ONE_YEAR = "a_year" # Within one year
class SearchType(Enum):
"""
搜索结果类型
Search result type
"""
DEFAULT = "" # 不限类型
ANSWER = zhihu_constant.ANSWER_NAME # 只看回答
ARTICLE = zhihu_constant.ARTICLE_NAME # 只看文章
VIDEO = zhihu_constant.VIDEO_NAME # 只看视频
DEFAULT = "" # No type limit
ANSWER = zhihu_constant.ANSWER_NAME # Answers only
ARTICLE = zhihu_constant.ARTICLE_NAME # Articles only
VIDEO = zhihu_constant.VIDEO_NAME # Videos only
class SearchSort(Enum):
"""
搜索结果排序
Search result sorting
"""
DEFAULT = "" # 综合排序
UPVOTED_COUNT = "upvoted_count" # 最多赞同
CREATE_TIME = "created_time" # 最新发布
DEFAULT = "" # Default sorting
UPVOTED_COUNT = "upvoted_count" # Most upvoted
CREATE_TIME = "created_time" # Latest published

View File

@@ -168,7 +168,7 @@ class ZhihuExtractor:
"""
res = ZhihuContent()
if "video" in zvideo and isinstance(zvideo.get("video"), dict): # 说明是从创作者主页的视频列表接口来的
if "video" in zvideo and isinstance(zvideo.get("video"), dict): # This indicates data from the creator's homepage video list API
res.content_url = f"{zhihu_constant.ZHIHU_URL}/zvideo/{res.content_id}"
res.created_time = zvideo.get("published_at")
res.updated_time = zvideo.get("updated_at")
@@ -318,11 +318,11 @@ class ZhihuExtractor:
"""
if gender == 1:
return ""
return "Male"
elif gender == 0:
return ""
return "Female"
else:
return "未知"
return "Unknown"
def extract_creator(self, user_url_token: str, html_content: str) -> Optional[ZhihuCreator]:

View File

@@ -26,55 +26,55 @@ from pydantic import BaseModel, Field
class TiebaNote(BaseModel):
"""
百度贴吧帖子
Baidu Tieba post
"""
note_id: str = Field(..., description="帖子ID")
title: str = Field(..., description="帖子标题")
desc: str = Field(default="", description="帖子描述")
note_url: str = Field(..., description="帖子链接")
publish_time: str = Field(default="", description="发布时间")
user_link: str = Field(default="", description="用户主页链接")
user_nickname: str = Field(default="", description="用户昵称")
user_avatar: str = Field(default="", description="用户头像地址")
tieba_name: str = Field(..., description="贴吧名称")
tieba_link: str = Field(..., description="贴吧链接")
total_replay_num: int = Field(default=0, description="回复总数")
total_replay_page: int = Field(default=0, description="回复总页数")
ip_location: Optional[str] = Field(default="", description="IP地理位置")
source_keyword: str = Field(default="", description="来源关键词")
note_id: str = Field(..., description="Post ID")
title: str = Field(..., description="Post title")
desc: str = Field(default="", description="Post description")
note_url: str = Field(..., description="Post link")
publish_time: str = Field(default="", description="Publish time")
user_link: str = Field(default="", description="User homepage link")
user_nickname: str = Field(default="", description="User nickname")
user_avatar: str = Field(default="", description="User avatar URL")
tieba_name: str = Field(..., description="Tieba name")
tieba_link: str = Field(..., description="Tieba link")
total_replay_num: int = Field(default=0, description="Total reply count")
total_replay_page: int = Field(default=0, description="Total reply pages")
ip_location: Optional[str] = Field(default="", description="IP location")
source_keyword: str = Field(default="", description="Source keyword")
class TiebaComment(BaseModel):
"""
百度贴吧评论
Baidu Tieba comment
"""
comment_id: str = Field(..., description="评论ID")
parent_comment_id: str = Field(default="", description="父评论ID")
content: str = Field(..., description="评论内容")
user_link: str = Field(default="", description="用户主页链接")
user_nickname: str = Field(default="", description="用户昵称")
user_avatar: str = Field(default="", description="用户头像地址")
publish_time: str = Field(default="", description="发布时间")
ip_location: Optional[str] = Field(default="", description="IP地理位置")
sub_comment_count: int = Field(default=0, description="子评论数")
note_id: str = Field(..., description="帖子ID")
note_url: str = Field(..., description="帖子链接")
tieba_id: str = Field(..., description="所属的贴吧ID")
tieba_name: str = Field(..., description="所属的贴吧名称")
tieba_link: str = Field(..., description="贴吧链接")
comment_id: str = Field(..., description="Comment ID")
parent_comment_id: str = Field(default="", description="Parent comment ID")
content: str = Field(..., description="Comment content")
user_link: str = Field(default="", description="User homepage link")
user_nickname: str = Field(default="", description="User nickname")
user_avatar: str = Field(default="", description="User avatar URL")
publish_time: str = Field(default="", description="Publish time")
ip_location: Optional[str] = Field(default="", description="IP location")
sub_comment_count: int = Field(default=0, description="Sub-comment count")
note_id: str = Field(..., description="Post ID")
note_url: str = Field(..., description="Post link")
tieba_id: str = Field(..., description="Tieba ID")
tieba_name: str = Field(..., description="Tieba name")
tieba_link: str = Field(..., description="Tieba link")
class TiebaCreator(BaseModel):
"""
百度贴吧创作者
Baidu Tieba creator
"""
user_id: str = Field(..., description="用户ID")
user_name: str = Field(..., description="用户名")
nickname: str = Field(..., description="用户昵称")
gender: str = Field(default="", description="用户性别")
avatar: str = Field(..., description="用户头像地址")
ip_location: Optional[str] = Field(default="", description="IP地理位置")
follows: int = Field(default=0, description="关注数")
fans: int = Field(default=0, description="粉丝数")
registration_duration: str = Field(default="", description="注册时长")
user_id: str = Field(..., description="User ID")
user_name: str = Field(..., description="Username")
nickname: str = Field(..., description="User nickname")
gender: str = Field(default="", description="User gender")
avatar: str = Field(..., description="User avatar URL")
ip_location: Optional[str] = Field(default="", description="IP location")
follows: int = Field(default=0, description="Follows count")
fans: int = Field(default=0, description="Fans count")
registration_duration: str = Field(default="", description="Registration duration")
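As a quick illustration of how these pydantic models behave (fields declared with Field(...) are required, defaulted fields may be omitted), a hedged usage sketch against an abridged copy of TiebaNote; the field subset and sample values are illustrative only:
from pydantic import BaseModel, Field
class TiebaNote(BaseModel):
    note_id: str = Field(..., description="Post ID")
    title: str = Field(..., description="Post title")
    note_url: str = Field(..., description="Post link")
    tieba_name: str = Field(..., description="Tieba name")
    tieba_link: str = Field(..., description="Tieba link")
    total_replay_num: int = Field(default=0, description="Total reply count")
note = TiebaNote(
    note_id="123456",
    title="example post",
    note_url="https://tieba.baidu.com/p/123456",
    tieba_name="python",
    tieba_link="https://tieba.baidu.com/f?kw=python",
)
print(note.total_replay_num)  # 0, the default kicks in for omitted fields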

View File

@@ -33,11 +33,11 @@ from pydantic import BaseModel, Field
class VideoUrlInfo(BaseModel):
"""B站视频URL信息"""
"""Bilibili video URL information"""
video_id: str = Field(title="video id (BV id)")
video_type: str = Field(default="video", title="video type")
class CreatorUrlInfo(BaseModel):
"""B站创作者URL信息"""
"""Bilibili creator URL information"""
creator_id: str = Field(title="creator id (UID)")

View File

@@ -24,11 +24,11 @@ from pydantic import BaseModel, Field
class VideoUrlInfo(BaseModel):
"""抖音视频URL信息"""
"""Douyin video URL information"""
aweme_id: str = Field(title="aweme id (video id)")
url_type: str = Field(default="normal", title="url type: normal, short, modal")
class CreatorUrlInfo(BaseModel):
"""抖音创作者URL信息"""
"""Douyin creator URL information"""
sec_user_id: str = Field(title="sec_user_id (creator id)")

View File

@@ -24,11 +24,11 @@ from pydantic import BaseModel, Field
class VideoUrlInfo(BaseModel):
"""快手视频URL信息"""
"""Kuaishou video URL information"""
video_id: str = Field(title="video id (photo id)")
url_type: str = Field(default="normal", title="url type: normal")
class CreatorUrlInfo(BaseModel):
"""快手创作者URL信息"""
"""Kuaishou creator URL information"""
user_id: str = Field(title="user id (creator id)")

View File

@@ -31,7 +31,7 @@ class NoteUrlInfo(BaseModel):
class CreatorUrlInfo(BaseModel):
"""小红书创作者URL信息"""
"""Xiaohongshu creator URL information"""
user_id: str = Field(title="user id (creator id)")
xsec_token: str = Field(default="", title="xsec token")
xsec_source: str = Field(default="", title="xsec source")

View File

@@ -26,66 +26,66 @@ from pydantic import BaseModel, Field
class ZhihuContent(BaseModel):
"""
知乎内容(回答、文章、视频)
Zhihu content (answer, article, video)
"""
content_id: str = Field(default="", description="内容ID")
content_type: str = Field(default="", description="内容类型(article | answer | zvideo)")
content_text: str = Field(default="", description="内容文本, 如果是视频类型这里为空")
content_url: str = Field(default="", description="内容落地链接")
question_id: str = Field(default="", description="问题ID, type为answer时有值")
title: str = Field(default="", description="内容标题")
desc: str = Field(default="", description="内容描述")
created_time: int = Field(default=0, description="创建时间")
updated_time: int = Field(default=0, description="更新时间")
voteup_count: int = Field(default=0, description="赞同人数")
comment_count: int = Field(default=0, description="评论数量")
source_keyword: str = Field(default="", description="来源关键词")
content_id: str = Field(default="", description="Content ID")
content_type: str = Field(default="", description="Content type (article | answer | zvideo)")
content_text: str = Field(default="", description="Content text, empty for video type")
content_url: str = Field(default="", description="Content landing page URL")
question_id: str = Field(default="", description="Question ID, has value when type is answer")
title: str = Field(default="", description="Content title")
desc: str = Field(default="", description="Content description")
created_time: int = Field(default=0, description="Create time")
updated_time: int = Field(default=0, description="Update time")
voteup_count: int = Field(default=0, description="Upvote count")
comment_count: int = Field(default=0, description="Comment count")
source_keyword: str = Field(default="", description="Source keyword")
user_id: str = Field(default="", description="用户ID")
user_link: str = Field(default="", description="用户主页链接")
user_nickname: str = Field(default="", description="用户昵称")
user_avatar: str = Field(default="", description="用户头像地址")
user_url_token: str = Field(default="", description="用户url_token")
user_id: str = Field(default="", description="User ID")
user_link: str = Field(default="", description="User homepage link")
user_nickname: str = Field(default="", description="User nickname")
user_avatar: str = Field(default="", description="User avatar URL")
user_url_token: str = Field(default="", description="User url_token")
class ZhihuComment(BaseModel):
"""
知乎评论
Zhihu comment
"""
comment_id: str = Field(default="", description="评论ID")
parent_comment_id: str = Field(default="", description="父评论ID")
content: str = Field(default="", description="评论内容")
publish_time: int = Field(default=0, description="发布时间")
ip_location: Optional[str] = Field(default="", description="IP地理位置")
sub_comment_count: int = Field(default=0, description="子评论数")
like_count: int = Field(default=0, description="点赞数")
dislike_count: int = Field(default=0, description="踩数")
content_id: str = Field(default="", description="内容ID")
content_type: str = Field(default="", description="内容类型(article | answer | zvideo)")
comment_id: str = Field(default="", description="Comment ID")
parent_comment_id: str = Field(default="", description="Parent comment ID")
content: str = Field(default="", description="Comment content")
publish_time: int = Field(default=0, description="Publish time")
ip_location: Optional[str] = Field(default="", description="IP location")
sub_comment_count: int = Field(default=0, description="Sub-comment count")
like_count: int = Field(default=0, description="Like count")
dislike_count: int = Field(default=0, description="Dislike count")
content_id: str = Field(default="", description="Content ID")
content_type: str = Field(default="", description="Content type (article | answer | zvideo)")
user_id: str = Field(default="", description="用户ID")
user_link: str = Field(default="", description="用户主页链接")
user_nickname: str = Field(default="", description="用户昵称")
user_avatar: str = Field(default="", description="用户头像地址")
user_id: str = Field(default="", description="User ID")
user_link: str = Field(default="", description="User homepage link")
user_nickname: str = Field(default="", description="User nickname")
user_avatar: str = Field(default="", description="User avatar URL")
class ZhihuCreator(BaseModel):
"""
知乎创作者
Zhihu creator
"""
user_id: str = Field(default="", description="用户ID")
user_link: str = Field(default="", description="用户主页链接")
user_nickname: str = Field(default="", description="用户昵称")
user_avatar: str = Field(default="", description="用户头像地址")
url_token: str = Field(default="", description="用户url_token")
gender: str = Field(default="", description="用户性别")
ip_location: Optional[str] = Field(default="", description="IP地理位置")
follows: int = Field(default=0, description="关注数")
fans: int = Field(default=0, description="粉丝数")
anwser_count: int = Field(default=0, description="回答数")
video_count: int = Field(default=0, description="视频数")
question_count: int = Field(default=0, description="提问数")
article_count: int = Field(default=0, description="文章数")
column_count: int = Field(default=0, description="专栏数")
get_voteup_count: int = Field(default=0, description="获得的赞同数")
user_id: str = Field(default="", description="User ID")
user_link: str = Field(default="", description="User homepage link")
user_nickname: str = Field(default="", description="User nickname")
user_avatar: str = Field(default="", description="User avatar URL")
url_token: str = Field(default="", description="User url_token")
gender: str = Field(default="", description="User gender")
ip_location: Optional[str] = Field(default="", description="IP location")
follows: int = Field(default=0, description="Follows count")
fans: int = Field(default=0, description="Fans count")
anwser_count: int = Field(default=0, description="Answer count")
video_count: int = Field(default=0, description="Video count")
question_count: int = Field(default=0, description="Question count")
article_count: int = Field(default=0, description="Article count")
column_count: int = Field(default=0, description="Column count")
get_voteup_count: int = Field(default=0, description="Total upvotes received")

package-lock.json (generated, 1438 changed lines): diff suppressed because it is too large

View File

@@ -5,6 +5,8 @@
"docs:preview": "vitepress preview docs"
},
"devDependencies": {
"vitepress": "^1.3.4"
"mermaid": "^11.12.2",
"vitepress": "^1.3.4",
"vitepress-plugin-mermaid": "^2.0.17"
}
}

View File

@@ -21,5 +21,5 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/2 14:37
# @Desc : IP代理池入口
# @Desc : IP proxy pool entry point
from .base_proxy import *

Some files were not shown because too many files have changed in this diff.