61 Commits

Author SHA1 Message Date
程序员阿江(Relakkes)
f14242c239 docs: update data store 2025-11-28 22:21:20 +08:00
程序员阿江(Relakkes)
29832ded91 chore: add codeowner rules 2025-11-28 22:17:40 +08:00
程序员阿江(Relakkes)
11f2802624 docs: update README.md 2025-11-28 18:16:04 +08:00
程序员阿江-Relakkes
ab19494883 Merge pull request #785 from hsparks-codes/feat/update_readme
docs: Move data storage section to separate guide
2025-11-28 18:07:56 +08:00
hsparks.codes
2bc9297812 docs: Move data storage section to separate guide
- Create comprehensive data storage guide (docs/data_storage_guide.md)
- Update README.md with link to storage guide instead of full details
- Update README_en.md with link to storage guide
- Bilingual guide (Chinese and English) in single document
- Includes all storage options: CSV, JSON, Excel, SQLite, MySQL
- Detailed usage examples and documentation links

This change improves README readability by moving detailed storage
information to a dedicated document while keeping main README concise.
2025-11-28 10:18:09 +01:00
程序员阿江-Relakkes
ba64c8ff9c Merge pull request #784 from NanmiCoder/feature/excel-export-and-tests
feat: excel store with other platform
2025-11-28 15:15:31 +08:00
程序员阿江-Relakkes
ebbf86d67b Merge pull request #783 from hsparks-codes/feature/excel-export-and-tests
feat: Add Excel export functionality and unit tests
2025-11-28 15:14:25 +08:00
程序员阿江(Relakkes)
6e858c1a00 feat: excel store with other platform 2025-11-28 15:12:36 +08:00
hsparks.codes
324f09cf9f fix: Update tests to handle openpyxl color format and ContextVar
- Fix header color assertion to check only RGB values (not alpha channel)
- Remove ContextVar mock as it cannot be patched in Python 3.11+
- All 17 tests now passing successfully
2025-11-28 05:04:00 +01:00
hsparks.codes
46ef86ddef feat: Add Excel export functionality and unit tests
Features:
- Excel export with formatted multi-sheet workbooks (Contents, Comments, Creators)
- Professional styling: blue headers, auto-width columns, borders, text wrapping
- Smart export: empty sheets automatically removed
- Support for all platforms (xhs, dy, ks, bili, wb, tieba, zhihu)

Testing:
- Added pytest framework with asyncio support
- Unit tests for Excel store functionality
- Unit tests for store factory pattern
- Shared fixtures for test data
- Test coverage for edge cases

Documentation:
- Comprehensive Excel export guide (docs/excel_export_guide.md)
- Updated README.md and README_en.md with Excel examples
- Updated config comments to include excel option

Dependencies:
- Added openpyxl>=3.1.2 for Excel support
- Added pytest>=7.4.0 and pytest-asyncio>=0.21.0 for testing

This contribution adds immediate value for users who need data analysis
capabilities and establishes a testing foundation for future development.
2025-11-28 04:44:12 +01:00
程序员阿江-Relakkes
31a092c653 Merge pull request #782 from NanmiCoder/fix/xhs-sign-20251127
feat: xhs sign playwright version
2025-11-27 11:05:24 +08:00
程序员阿江(Relakkes)
f989ce0788 feat: xhs sign playwright version 2025-11-27 10:53:08 +08:00
程序员阿江-Relakkes
15b98fa511 ip proxy expired logic switch
Fix/proxy 20251125
2025-11-26 16:05:01 +08:00
程序员阿江(Relakkes)
f1e7124654 fix: proxy extract error 2025-11-26 16:01:54 +08:00
程序员阿江(Relakkes)
6eef02d08c feat: ip proxy expired check 2025-11-25 12:39:10 +08:00
程序员阿江(Relakkes)
1da347cbf8 docs: update index.md 2025-11-22 09:12:25 +08:00
程序员阿江(Relakkes)
422cc92dd1 docs: update README 2025-11-22 08:20:09 +08:00
程序员阿江(Relakkes)
13d2302c9c docs: update README 2025-11-18 17:56:55 +08:00
程序员阿江(Relakkes)
ff8c92daad chore: add copyright to every file 2025-11-18 12:24:02 +08:00
程序员阿江(Relakkes)
5288bddb42 refactor: weibo search #771 2025-11-17 17:24:47 +08:00
程序员阿江(Relakkes)
6dcfd7e0a5 refactor: weibo login 2025-11-17 17:11:35 +08:00
程序员阿江(Relakkes)
e89a6d5781 feat: cdp browser cleanup after crawler done 2025-11-17 12:21:53 +08:00
程序员阿江(Relakkes)
a1c5e07df8 fix: xhs sub comment bugfix #769 2025-11-17 11:47:33 +08:00
程序员阿江(Relakkes)
b6caa7a85e refactor: add xhs creator params 2025-11-10 21:10:03 +08:00
程序员阿江(Relakkes)
1e3637f238 refactor: update xhs note detail 2025-11-10 18:13:51 +08:00
程序员阿江(Relakkes)
b5dab6d1e8 refactor: replace the playwright signing scheme with xhshow
Thanks to the @Cloxl/xhshow open-source project

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-10 18:12:45 +08:00
程序员阿江-Relakkes
54f23b8d1c Merge pull request #768 from yangtao210/main
Optimize MongoDB config retrieval logic and move the storage base class; integration tests
2025-11-07 05:44:07 -05:00
yangtao210
58eb89f073 Merge branch 'NanmiCoder:main' into main 2025-11-07 17:44:09 +08:00
yt210
7888f4c6bd Optimize MongoDB config retrieval logic and move the storage base class; integration tests 2025-11-07 17:42:50 +08:00
yt210
b61ec54a72 Optimize MongoDB config retrieval logic and move the storage base class. 2025-11-07 17:42:28 +08:00
程序员阿江(Relakkes)
60cbb3e37d fix: weibo container error #568 2025-11-06 19:43:09 +08:00
程序员阿江-Relakkes
05a1782746 Merge pull request #764 from yangtao210/main
Add MongoDB storage
2025-11-06 06:10:49 -05:00
yt210
ef6948b305 Add MongoDB storage 2025-11-06 10:40:30 +08:00
程序员阿江(Relakkes)
45ec4b433a docs: update 2025-11-06 00:08:03 +08:00
程序员阿江(Relakkes)
0074e975dd fix: dy search 2025-11-04 00:14:16 +08:00
程序员阿江(Relakkes)
889fa01466 fix: repair the bilibili word cloud chart 2025-11-02 13:25:31 +08:00
程序员阿江(Relakkes)
3f5925e326 feat: update xhs sign 2025-10-27 19:06:07 +08:00
程序员阿江(Relakkes)
ed6e0bfb5f refactor: switch tieba to fetching data via the browser 2025-10-19 17:09:55 +08:00
程序员阿江(Relakkes)
26a261bc09 Merge branch 'feature/config-refactor-20251018' 2025-10-19 15:32:42 +08:00
程序员阿江(Relakkes)
03e384bbe2 refactor: remove stealth injection in CDP mode 2025-10-19 15:32:03 +08:00
程序员阿江-Relakkes
56bf5d226f The configuration file supports URL crawling
Feature/config refactor 20251018
2025-10-18 07:42:14 +08:00
程序员阿江(Relakkes)
ae7955787c feat: kuaishou support url link 2025-10-18 07:40:10 +08:00
程序员阿江(Relakkes)
a9dd08680f feat: xhs support creator url link 2025-10-18 07:20:09 +08:00
程序员阿江(Relakkes)
cae707cb2a feat: douyin support url link 2025-10-18 07:00:21 +08:00
程序员阿江(Relakkes)
906c259cc7 feat: bilibili support url link 2025-10-18 06:30:20 +08:00
程序员阿江(Relakkes)
3b6fae8a62 docs: update README.md 2025-10-17 15:30:44 +08:00
程序员阿江-Relakkes
a72504a33d Merge pull request #739 from callmeiks/add-tikhub-sponsor
docs: resize TikHub banner to smaller size
2025-10-16 16:54:18 +08:00
Callmeiks
e177f799df docs: resize TikHub banner to smaller size 2025-10-16 01:51:55 -07:00
程序员阿江-Relakkes
1a5dcb6db7 Merge pull request #738 from callmeiks/add-tikhub-sponsor
docs: add TikHub as sponsor
2025-10-16 16:41:19 +08:00
Callmeiks
2c9eec544d docs: add TikHub as sponsor 2025-10-16 01:22:40 -07:00
程序员阿江(Relakkes)
d1f73e811c docs: update README.md 2025-10-12 21:19:11 +08:00
程序员阿江(Relakkes)
2d3e7555c6 docs: update README.md 2025-10-11 16:16:11 +08:00
程序员阿江(Relakkes)
3c5b9e8035 docs: update wechat qrcode 2025-10-02 14:27:10 +08:00
程序员阿江(Relakkes)
e6f3182ed7 Merge branch 'codex/replace-argparse-with-typer-for-cli' 2025-09-26 18:11:02 +08:00
程序员阿江(Relakkes)
2cf143cc7c fix: #730 2025-09-26 18:10:30 +08:00
程序员阿江-Relakkes
eb625b0b48 Merge pull request #729 from NanmiCoder/codex/replace-argparse-with-typer-for-cli
feat(cli): migrate CLI argument parsing to Typer
2025-09-26 18:08:21 +08:00
程序员阿江(Relakkes)
84f6f650f8 fix: typer args bugfix 2025-09-26 18:07:57 +08:00
程序员阿江-Relakkes
9d6cf065e9 fix(cli): support runtime without peps604 2025-09-26 17:38:50 +08:00
程序员阿江-Relakkes
95c740dee2 refine: harden typer cli defaults 2025-09-26 17:38:44 +08:00
程序员阿江-Relakkes
f97e0c18cd feat(cli): migrate argument parsing to typer 2025-09-26 17:21:47 +08:00
程序员阿江-Relakkes
879a72ea30 fix: browser launched in CDP mode could not be closed
Improve BrowserLauncher shutdown reliability
2025-09-26 16:57:48 +08:00
166 changed files with 8373 additions and 2787 deletions

BIN
.DS_Store vendored Normal file

Binary file not shown.

18
.github/CODEOWNERS vendored Normal file

@@ -0,0 +1,18 @@
# 默认:仓库所有文件都需要 @NanmiCoder 审核
* @NanmiCoder
.github/workflows/** @NanmiCoder
requirements.txt @NanmiCoder
pyproject.toml @NanmiCoder
Pipfile @NanmiCoder
package.json @NanmiCoder
package-lock.json @NanmiCoder
pnpm-lock.yaml @NanmiCoder
Dockerfile @NanmiCoder
docker/** @NanmiCoder
scripts/deploy/** @NanmiCoder

46
.pre-commit-config.yaml Normal file

@@ -0,0 +1,46 @@
# Pre-commit hooks configuration for MediaCrawler project
# See https://pre-commit.com for more information
repos:
  # Local hooks
  - repo: local
    hooks:
      # Python file header copyright check
      - id: check-file-headers
        name: Check Python file headers
        entry: python tools/file_header_manager.py --check
        language: system
        types: [python]
        pass_filenames: true
        stages: [pre-commit]
      # Auto-fix Python file headers
      - id: add-file-headers
        name: Add copyright headers to Python files
        entry: python tools/file_header_manager.py
        language: system
        types: [python]
        pass_filenames: true
        stages: [pre-commit]
  # Standard pre-commit hooks (optional, can be enabled later)
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
        exclude: ^(.*\.md|.*\.txt)$
      - id: end-of-file-fixer
        exclude: ^(.*\.md|.*\.txt)$
      - id: check-yaml
      - id: check-added-large-files
        args: ['--maxkb=10240']  # 10MB limit
      - id: check-merge-conflict
      - id: check-case-conflict
      - id: mixed-line-ending
# Global configuration
default_language_version:
  python: python3
# Run hooks on all files during manual run
# Usage: pre-commit run --all-files

@@ -1 +1 @@
-3.9
+3.11

110
README.md

@@ -1,3 +1,5 @@
+# 🔥 MediaCrawler - 自媒体平台爬虫 🕷️
<div align="center" markdown="1">
<sup>Special thanks to:</sup>
<br>
@@ -12,8 +14,6 @@
</div>
<hr>
-# 🔥 MediaCrawler - 自媒体平台爬虫 🕷️
<div align="center">
<a href="https://trendshift.io/repositories/8291" target="_blank">
@@ -129,15 +129,10 @@ uv sync
uv run playwright install
```
-> **💡 提示**:MediaCrawler 目前已经支持使用 playwright 连接你本地的 Chrome 浏览器了,一些因为 Webdriver 导致的问题迎刃而解了。
->
-> 目前开放了 `xhs` 和 `dy` 这两个使用 CDP 的方式连接本地浏览器,如有需要,查看 `config/base_config.py` 中的配置项。
## 🚀 运行爬虫程序
```shell
-# 项目默认是没有开启评论爬取模式,如需评论请在 config/base_config.py 中的 ENABLE_GET_COMMENTS 变量修改
-# 一些其他支持项,也可以在 config/base_config.py 查看功能,写的有中文注释
+# 在 config/base_config.py 查看配置项目功能,写的有中文注释
# 从配置文件中读取关键词搜索相关的帖子并爬取帖子信息与评论
uv run main.py --platform xhs --lt qrcode --type search
@@ -163,7 +158,7 @@ uv run main.py --help
cd MediaCrawler
# 创建虚拟环境
-# 我的 python 版本是:3.9.6,requirements.txt 中的库是基于这个版本的
+# 我的 python 版本是:3.11,requirements.txt 中的库是基于这个版本的
# 如果是其他 python 版本,可能 requirements.txt 中的库不兼容,需自行解决
python -m venv venv
@@ -209,56 +204,45 @@ python main.py --help
## 💾 数据保存
-支持多种数据存储方式:
+MediaCrawler 支持多种数据存储方式,包括 CSV、JSON、Excel、SQLite 和 MySQL 数据库。
-- **CSV 文件**:支持保存到 CSV 中(`data/` 目录下)
-- **JSON 文件**:支持保存到 JSON 中(`data/` 目录下)
-- **数据库存储**:
-  - 使用参数 `--init_db` 进行数据库初始化(使用`--init_db`时不需要携带其他optional)
-  - **SQLite 数据库**:轻量级数据库,无需服务器,适合个人使用(推荐)
-    1. 初始化:`--init_db sqlite`
-    2. 数据存储:`--save_data_option sqlite`
-  - **MySQL 数据库**:支持关系型数据库 MySQL 中保存(需要提前创建数据库)
-    1. 初始化:`--init_db mysql`
-    2. 数据存储:`--save_data_option db`(db 参数为兼容历史更新保留)
+📖 **详细使用说明请查看:[数据存储指南](docs/data_storage_guide.md)**
-### 使用示例:
-```shell
-# 初始化 SQLite 数据库(使用'--init_db'时不需要携带其他optional)
-uv run main.py --init_db sqlite
-# 使用 SQLite 存储数据(推荐个人用户使用)
-uv run main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
-```
-```shell
-# 初始化 MySQL 数据库
-uv run main.py --init_db mysql
-# 使用 MySQL 存储数据(为适配历史更新,db参数进行沿用)
-uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
-```
[🚀 MediaCrawlerPro 重磅发布 🚀!更多的功能,更好的架构设计!](https://github.com/MediaCrawlerPro)
+---
+### 💬 交流群组
+- **微信交流群**:[点击加入](https://nanmicoder.github.io/MediaCrawler/%E5%BE%AE%E4%BF%A1%E4%BA%A4%E6%B5%81%E7%BE%A4.html)
### 💰 赞助商展示
-<a href="https://www.swiftproxy.net/?ref=nanmi">
-<img src="docs/static/images/img_5.png">
-<br>
-Swiftproxy - 90M+ 全球高质量纯净住宅IP,注册可领免费 500MB 测试流量,动态流量不过期!
-> 专属折扣码:**GHB5** 立享九折优惠!
-</a>
-<br>
-<br>
<a href="https://h.wandouip.com">
<img src="docs/static/images/img_8.jpg">
<br>
-豌豆HTTP自营千万级IP资源池,IP纯净度≥99.8%,每日保持IP高频更新,快速响应,稳定连接满足多种业务场景,支持按需定制,注册免费提取10000ip。
+豌豆HTTP自营千万级IP资源池,IP纯净度≥99.8%,每日保持IP高频更新,快速响应,稳定连接,满足多种业务场景,支持按需定制,注册免费提取10000ip。
</a>
+---
+<p align="center">
+<a href="https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad">
+<img style="border-radius:20px" width="500" alt="TikHub IO_Banner zh" src="docs/static/images/tikhub_banner_zh.png">
+</a>
+</p>
+[TikHub](https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad) 提供超过 **700 个端点**,可用于从 **14+ 个社交媒体平台** 获取与分析数据 —— 包括视频、用户、评论、商店、商品与趋势等,一站式完成所有数据访问与分析。
+通过每日签到,可以获取免费额度。可以使用我的注册链接:[https://user.tikhub.io/users/signup?referral_code=cfzyejV9](https://user.tikhub.io/users/signup?referral_code=cfzyejV9&utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad) 或使用邀请码:`cfzyejV9`,注册并充值即可获得 **$2 免费额度**。
+[TikHub](https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad) 提供以下服务:
+- 🚀 丰富的社交媒体数据接口(TikTok、Douyin、XHS、YouTube、Instagram等)
+- 💎 每日签到免费领取额度
+- ⚡ 高成功率与高并发支持
+- 🌐 官网:[https://tikhub.io/](https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad)
+- 💻 GitHub地址:[https://github.com/TikHubIO/](https://github.com/TikHubIO/)
### 🤝 成为赞助者
@@ -266,34 +250,16 @@ Swiftproxy - 90M+ 全球高质量纯净住宅IP注册可领免费 500MB 测
成为赞助者,可以将您的产品展示在这里,每天获得大量曝光!
**联系方式**:
-- 微信:`yzglan`
+- 微信:`relakkes`
- 邮箱:`relakkes@gmail.com`
-## 🤝 社区与支持
-### 💬 交流群组
-- **微信交流群**:[点击加入](https://nanmicoder.github.io/MediaCrawler/%E5%BE%AE%E4%BF%A1%E4%BA%A4%E6%B5%81%E7%BE%A4.html)
-### 📚 文档与教程
-- **在线文档**:[MediaCrawler 完整文档](https://nanmicoder.github.io/MediaCrawler/)
-- **爬虫教程**:[CrawlerTutorial 免费教程](https://github.com/NanmiCoder/CrawlerTutorial)
-# 其他常见问题可以查看在线文档
->
-> 在线文档包含使用方法、常见问题、加入项目交流群等。
-> [MediaCrawler在线文档](https://nanmicoder.github.io/MediaCrawler/)
->
-# 作者提供的知识服务
-> 如果想快速入门和学习该项目的使用、源码架构设计等、学习编程技术、亦或者想了解MediaCrawlerPro的源代码设计,可以看下我的知识付费栏目。
-[作者的知识付费栏目介绍](https://nanmicoder.github.io/MediaCrawler/%E7%9F%A5%E8%AF%86%E4%BB%98%E8%B4%B9%E4%BB%8B%E7%BB%8D.html)
---
+### 📚 其他
+- **常见问题**:[MediaCrawler 完整文档](https://nanmicoder.github.io/MediaCrawler/)
+- **爬虫入门教程**:[CrawlerTutorial 免费教程](https://github.com/NanmiCoder/CrawlerTutorial)
+- **新闻爬虫开源项目**:[NewsCrawlerCollection](https://github.com/NanmiCoder/NewsCrawlerCollection)
## ⭐ Star 趋势图
如果这个项目对您有帮助,请给个 ⭐ Star 支持一下,让更多的人看到 MediaCrawler!
@@ -301,9 +267,9 @@ Swiftproxy - 90M+ 全球高质量纯净住宅IP注册可领免费 500MB 测
[![Star History Chart](https://api.star-history.com/svg?repos=NanmiCoder/MediaCrawler&type=Date)](https://star-history.com/#NanmiCoder/MediaCrawler&Date)
## 📚 参考
+- **小红书签名仓库**:[Cloxl 的 xhs 签名仓库](https://github.com/Cloxl/xhshow)
- **小红书客户端**:[ReaJason 的 xhs 仓库](https://github.com/ReaJason/xhs)
- **短信转发**:[SmsForwarder 参考仓库](https://github.com/pppscn/SmsForwarder)
- **内网穿透工具**:[ngrok 官方文档](https://ngrok.com/docs/)

README_en.md

@@ -206,32 +206,9 @@ python main.py --help
## 💾 Data Storage
-Supports multiple data storage methods:
+MediaCrawler supports multiple data storage methods, including CSV, JSON, Excel, SQLite, and MySQL databases.
-- **CSV Files**: Supports saving to CSV (under `data/` directory)
-- **JSON Files**: Supports saving to JSON (under `data/` directory)
-- **Database Storage**:
-  - Use the `--init_db` parameter for database initialization (when using `--init_db`, no other optional arguments are needed)
-  - **SQLite Database**: Lightweight database, no server required, suitable for personal use (recommended)
-    1. Initialization: `--init_db sqlite`
-    2. Data Storage: `--save_data_option sqlite`
-  - **MySQL Database**: Supports saving to relational database MySQL (database needs to be created in advance)
-    1. Initialization: `--init_db mysql`
-    2. Data Storage: `--save_data_option db` (the db parameter is retained for compatibility with historical updates)
+📖 **For detailed usage instructions, please see: [Data Storage Guide](docs/data_storage_guide.md)**
-### Usage Examples:
-```shell
-# Initialize SQLite database (when using '--init_db', no other optional arguments are needed)
-uv run main.py --init_db sqlite
-# Use SQLite to store data (recommended for personal users)
-uv run main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
-```
-```shell
-# Initialize MySQL database
-uv run main.py --init_db mysql
-# Use MySQL to store data (the db parameter is retained for compatibility with historical updates)
-uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
-```
---
@@ -282,7 +259,7 @@ If this project helps you, please give a ⭐ Star to support and let more people
Become a sponsor and showcase your product here, getting massive exposure daily!
**Contact Information**:
-- WeChat: `yzglan`
+- WeChat: `relakkes`
- Email: `relakkes@gmail.com`

README_es.md

@@ -282,7 +282,7 @@ uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
¡Conviértase en patrocinador y muestre su producto aquí, obteniendo exposición masiva diariamente!
**Información de Contacto**:
-- WeChat: `yzglan`
+- WeChat: `relakkes`
- Email: `relakkes@gmail.com`

base/__init__.py

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/base/__init__.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -7,5 +16,3 @@
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。

base/base_crawler.py

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/base/base_crawler.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

11
cache/__init__.py vendored

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/cache/__init__.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -7,5 +16,3 @@
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。

9
cache/abs_cache.py vendored

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/cache/abs_cache.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

cache/cache_factory.py

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/cache/cache_factory.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

cache/local_cache.py

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/cache/local_cache.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

cache/redis_cache.py

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/cache/redis_cache.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

cmd_arg/__init__.py

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/cmd_arg/__init__.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

cmd_arg/arg.py

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/cmd_arg/arg.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则: # 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。 # 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。 # 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -9,52 +18,251 @@
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。 # 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
import argparse from __future__ import annotations
import sys
from enum import Enum
from types import SimpleNamespace
from typing import Iterable, Optional, Sequence, Type, TypeVar
import typer
from typing_extensions import Annotated
import config import config
from tools.utils import str2bool from tools.utils import str2bool
async def parse_cmd(): EnumT = TypeVar("EnumT", bound=Enum)
# 读取command arg
parser = argparse.ArgumentParser(description='Media crawler program. / 媒体爬虫程序')
parser.add_argument('--platform', type=str,
help='Media platform select / 选择媒体平台 (xhs=小红书 | dy=抖音 | ks=快手 | bili=哔哩哔哩 | wb=微博 | tieba=百度贴吧 | zhihu=知乎)',
choices=["xhs", "dy", "ks", "bili", "wb", "tieba", "zhihu"], default=config.PLATFORM)
parser.add_argument('--lt', type=str,
help='Login type / 登录方式 (qrcode=二维码 | phone=手机号 | cookie=Cookie)',
choices=["qrcode", "phone", "cookie"], default=config.LOGIN_TYPE)
parser.add_argument('--type', type=str,
help='Crawler type / 爬取类型 (search=搜索 | detail=详情 | creator=创作者)',
choices=["search", "detail", "creator"], default=config.CRAWLER_TYPE)
parser.add_argument('--start', type=int,
help='Number of start page / 起始页码', default=config.START_PAGE)
parser.add_argument('--keywords', type=str,
help='Please input keywords / 请输入关键词', default=config.KEYWORDS)
parser.add_argument('--get_comment', type=str2bool,
help='''Whether to crawl level one comment / 是否爬取一级评论, supported values case insensitive / 支持的值(不区分大小写) ('yes', 'true', 't', 'y', '1', 'no', 'false', 'f', 'n', '0')''', default=config.ENABLE_GET_COMMENTS)
parser.add_argument('--get_sub_comment', type=str2bool,
help=''''Whether to crawl level two comment / 是否爬取二级评论, supported values case insensitive / 支持的值(不区分大小写) ('yes', 'true', 't', 'y', '1', 'no', 'false', 'f', 'n', '0')''', default=config.ENABLE_GET_SUB_COMMENTS)
parser.add_argument('--save_data_option', type=str,
help='Where to save the data / 数据保存方式 (csv=CSV文件 | db=MySQL数据库 | json=JSON文件 | sqlite=SQLite数据库)',
choices=['csv', 'db', 'json', 'sqlite'], default=config.SAVE_DATA_OPTION)
parser.add_argument('--init_db', type=str,
help='Initialize database schema / 初始化数据库表结构 (sqlite | mysql)',
choices=['sqlite', 'mysql'], default=None)
parser.add_argument('--cookies', type=str,
help='Cookies used for cookie login type / Cookie登录方式使用的Cookie值', default=config.COOKIES)
args = parser.parse_args()
# override config class PlatformEnum(str, Enum):
config.PLATFORM = args.platform """支持的媒体平台枚举"""
config.LOGIN_TYPE = args.lt
config.CRAWLER_TYPE = args.type
config.START_PAGE = args.start
config.KEYWORDS = args.keywords
config.ENABLE_GET_COMMENTS = args.get_comment
config.ENABLE_GET_SUB_COMMENTS = args.get_sub_comment
config.SAVE_DATA_OPTION = args.save_data_option
config.COOKIES = args.cookies
return args XHS = "xhs"
DOUYIN = "dy"
KUAISHOU = "ks"
BILIBILI = "bili"
WEIBO = "wb"
TIEBA = "tieba"
ZHIHU = "zhihu"
class LoginTypeEnum(str, Enum):
"""登录方式枚举"""
QRCODE = "qrcode"
PHONE = "phone"
COOKIE = "cookie"
class CrawlerTypeEnum(str, Enum):
"""爬虫类型枚举"""
SEARCH = "search"
DETAIL = "detail"
CREATOR = "creator"
class SaveDataOptionEnum(str, Enum):
"""数据保存方式枚举"""
CSV = "csv"
DB = "db"
JSON = "json"
SQLITE = "sqlite"
MONGODB = "mongodb"
EXCEL = "excel"
class InitDbOptionEnum(str, Enum):
"""数据库初始化选项"""
SQLITE = "sqlite"
MYSQL = "mysql"
def _to_bool(value: bool | str) -> bool:
if isinstance(value, bool):
return value
return str2bool(value)
def _coerce_enum(
enum_cls: Type[EnumT],
value: EnumT | str,
default: EnumT,
) -> EnumT:
"""Safely convert a raw config value to an enum member."""
if isinstance(value, enum_cls):
return value
try:
return enum_cls(value)
except ValueError:
typer.secho(
f"⚠️ 配置值 '{value}' 不在 {enum_cls.__name__} 支持的范围内,已回退到默认值 '{default.value}'.",
fg=typer.colors.YELLOW,
)
return default
def _normalize_argv(argv: Optional[Sequence[str]]) -> Iterable[str]:
if argv is None:
return list(sys.argv[1:])
return list(argv)
def _inject_init_db_default(args: Sequence[str]) -> list[str]:
"""Ensure bare --init_db defaults to sqlite for backward compatibility."""
normalized: list[str] = []
i = 0
while i < len(args):
arg = args[i]
normalized.append(arg)
if arg == "--init_db":
next_arg = args[i + 1] if i + 1 < len(args) else None
if not next_arg or next_arg.startswith("-"):
normalized.append(InitDbOptionEnum.SQLITE.value)
i += 1
return normalized
async def parse_cmd(argv: Optional[Sequence[str]] = None):
"""使用 Typer 解析命令行参数。"""
app = typer.Typer(add_completion=False)
@app.callback(invoke_without_command=True)
def main(
platform: Annotated[
PlatformEnum,
typer.Option(
"--platform",
help="媒体平台选择 (xhs=小红书 | dy=抖音 | ks=快手 | bili=哔哩哔哩 | wb=微博 | tieba=百度贴吧 | zhihu=知乎)",
rich_help_panel="基础配置",
),
] = _coerce_enum(PlatformEnum, config.PLATFORM, PlatformEnum.XHS),
lt: Annotated[
LoginTypeEnum,
typer.Option(
"--lt",
help="登录方式 (qrcode=二维码 | phone=手机号 | cookie=Cookie)",
rich_help_panel="账号配置",
),
] = _coerce_enum(LoginTypeEnum, config.LOGIN_TYPE, LoginTypeEnum.QRCODE),
crawler_type: Annotated[
CrawlerTypeEnum,
typer.Option(
"--type",
help="爬取类型 (search=搜索 | detail=详情 | creator=创作者)",
rich_help_panel="基础配置",
),
] = _coerce_enum(CrawlerTypeEnum, config.CRAWLER_TYPE, CrawlerTypeEnum.SEARCH),
start: Annotated[
int,
typer.Option(
"--start",
help="起始页码",
rich_help_panel="基础配置",
),
] = config.START_PAGE,
keywords: Annotated[
str,
typer.Option(
"--keywords",
help="请输入关键词,多个关键词用逗号分隔",
rich_help_panel="基础配置",
),
] = config.KEYWORDS,
get_comment: Annotated[
str,
typer.Option(
"--get_comment",
help="是否爬取一级评论,支持 yes/true/t/y/1 或 no/false/f/n/0",
rich_help_panel="评论配置",
show_default=True,
),
] = str(config.ENABLE_GET_COMMENTS),
get_sub_comment: Annotated[
str,
typer.Option(
"--get_sub_comment",
help="是否爬取二级评论,支持 yes/true/t/y/1 或 no/false/f/n/0",
rich_help_panel="评论配置",
show_default=True,
),
] = str(config.ENABLE_GET_SUB_COMMENTS),
save_data_option: Annotated[
SaveDataOptionEnum,
typer.Option(
"--save_data_option",
help="数据保存方式 (csv=CSV文件 | db=MySQL数据库 | json=JSON文件 | sqlite=SQLite数据库 | mongodb=MongoDB数据库 | excel=Excel文件)",
rich_help_panel="存储配置",
),
] = _coerce_enum(
SaveDataOptionEnum, config.SAVE_DATA_OPTION, SaveDataOptionEnum.JSON
),
init_db: Annotated[
Optional[InitDbOptionEnum],
typer.Option(
"--init_db",
help="初始化数据库表结构 (sqlite | mysql)",
rich_help_panel="存储配置",
),
] = None,
cookies: Annotated[
str,
typer.Option(
"--cookies",
help="Cookie 登录方式使用的 Cookie 值",
rich_help_panel="账号配置",
),
] = config.COOKIES,
) -> SimpleNamespace:
"""MediaCrawler 命令行入口"""
enable_comment = _to_bool(get_comment)
enable_sub_comment = _to_bool(get_sub_comment)
init_db_value = init_db.value if init_db else None
# override global config
config.PLATFORM = platform.value
config.LOGIN_TYPE = lt.value
config.CRAWLER_TYPE = crawler_type.value
config.START_PAGE = start
config.KEYWORDS = keywords
config.ENABLE_GET_COMMENTS = enable_comment
config.ENABLE_GET_SUB_COMMENTS = enable_sub_comment
config.SAVE_DATA_OPTION = save_data_option.value
config.COOKIES = cookies
return SimpleNamespace(
platform=config.PLATFORM,
lt=config.LOGIN_TYPE,
type=config.CRAWLER_TYPE,
start=config.START_PAGE,
keywords=config.KEYWORDS,
get_comment=config.ENABLE_GET_COMMENTS,
get_sub_comment=config.ENABLE_GET_SUB_COMMENTS,
save_data_option=config.SAVE_DATA_OPTION,
init_db=init_db_value,
cookies=config.COOKIES,
)
command = typer.main.get_command(app)
cli_args = _normalize_argv(argv)
cli_args = _inject_init_db_default(cli_args)
try:
result = command.main(args=cli_args, standalone_mode=False)
if isinstance(result, int): # help/options handled by Typer; propagate exit code
raise SystemExit(result)
return result
except typer.Exit as exc: # pragma: no cover - CLI exit paths
raise SystemExit(exc.exit_code) from exc
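
To make the new entry point concrete, here is a minimal driving sketch (not part of the diff; the argv values are illustrative, and the import path follows the Repository line shown above):

```python
import asyncio
from cmd_arg.arg import parse_cmd  # module path taken from the diff's Repository line

async def demo() -> None:
    # Passing argv explicitly exercises _normalize_argv and _inject_init_db_default;
    # omitting it would fall back to sys.argv[1:].
    args = await parse_cmd([
        "--platform", "xhs",
        "--lt", "qrcode",
        "--type", "search",
        "--save_data_option", "excel",
    ])
    print(args.platform, args.save_data_option)  # fields of the returned SimpleNamespace

asyncio.run(demo())
```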

config/__init__.py

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/__init__.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

config/base_config.py

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/base_config.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -38,7 +47,7 @@ SAVE_LOGIN_STATE = True
# 是否启用CDP模式 - 使用用户现有的Chrome/Edge浏览器进行爬取,提供更好的反检测能力
# 启用后将自动检测并启动用户的Chrome/Edge浏览器,通过CDP协议进行控制
# 这种方式使用真实的浏览器环境,包括用户的扩展、Cookie和设置,大大降低被检测的风险
-ENABLE_CDP_MODE = False
+ENABLE_CDP_MODE = True
# CDP调试端口,用于与浏览器通信
# 如果端口被占用,系统会自动尝试下一个可用端口
@@ -55,14 +64,14 @@ CUSTOM_BROWSER_PATH = ""
CDP_HEADLESS = False
# 浏览器启动超时时间(秒)
-BROWSER_LAUNCH_TIMEOUT = 30
+BROWSER_LAUNCH_TIMEOUT = 60
# 是否在程序结束时自动关闭浏览器
# 设置为False可以保持浏览器运行,便于调试
AUTO_CLOSE_BROWSER = True
-# 数据保存类型选项配置,支持种类型:csv、db、json、sqlite, 最好保存到DB中,有排重的功能。
+# 数据保存类型选项配置,支持种类型:csv、db、json、sqlite、excel, 最好保存到DB中,有排重的功能。
-SAVE_DATA_OPTION = "json"  # csv or db or json or sqlite
+SAVE_DATA_OPTION = "json"  # csv or db or json or sqlite or excel
# 用户浏览器缓存的浏览器文件配置
USER_DATA_DIR = "%s_user_data_dir"  # %s will be replaced by platform name

config/bilibili_config.py

@@ -1,4 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/bilibili_config.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -13,16 +21,23 @@
# 每天爬取视频/帖子的数量控制
MAX_NOTES_PER_DAY = 1
-# 指定B站视频ID列表
+# 指定B站视频URL列表 (支持完整URL或BV号)
+# 示例:
+# - 完整URL: "https://www.bilibili.com/video/BV1dwuKzmE26/?spm_id_from=333.1387.homepage.video_card.click"
+# - BV号: "BV1d54y1g7db"
BILI_SPECIFIED_ID_LIST = [
-    "BV1d54y1g7db",
+    "https://www.bilibili.com/video/BV1dwuKzmE26/?spm_id_from=333.1387.homepage.video_card.click",
    "BV1Sz4y1U77N",
    "BV14Q4y1n7jz",
    # ........................
]
-# 指定B站用户ID列表
+# 指定B站创作者URL列表 (支持完整URL或UID)
+# 示例:
+# - 完整URL: "https://space.bilibili.com/434377496?spm_id_from=333.1007.0.0"
+# - UID: "20813884"
BILI_CREATOR_ID_LIST = [
+    "https://space.bilibili.com/434377496?spm_id_from=333.1007.0.0",
    "20813884",
    # ........................
]

config/db_config.py

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/db_config.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -43,3 +52,18 @@ SQLITE_DB_PATH = os.path.join(os.path.dirname(os.path.dirname(__file__)), "datab
sqlite_db_config = {
    "db_path": SQLITE_DB_PATH
}
# mongodb config
MONGODB_HOST = os.getenv("MONGODB_HOST", "localhost")
MONGODB_PORT = os.getenv("MONGODB_PORT", 27017)
MONGODB_USER = os.getenv("MONGODB_USER", "")
MONGODB_PWD = os.getenv("MONGODB_PWD", "")
MONGODB_DB_NAME = os.getenv("MONGODB_DB_NAME", "media_crawler")
mongodb_config = {
    "host": MONGODB_HOST,
    "port": int(MONGODB_PORT),
    "user": MONGODB_USER,
    "password": MONGODB_PWD,
    "db_name": MONGODB_DB_NAME,
}
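
Since every MongoDB setting above falls back to `os.getenv`, the values can be overridden from the environment without editing the file. A small sketch with placeholder values, mirroring the URL-building logic used by the MongoDB store base further below:

```python
import os

# Placeholder values for illustration; set real ones in your shell or .env.
os.environ.setdefault("MONGODB_HOST", "localhost")
os.environ.setdefault("MONGODB_PORT", "27017")

from config import db_config  # env vars are read when this module is imported

cfg = db_config.mongodb_config
if cfg["user"] and cfg["password"]:
    url = f"mongodb://{cfg['user']}:{cfg['password']}@{cfg['host']}:{cfg['port']}/"
else:
    url = f"mongodb://{cfg['host']}:{cfg['port']}/"
print(url)  # mongodb://localhost:27017/
```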

config/dy_config.py

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/dy_config.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -11,15 +20,27 @@
# 抖音平台配置
PUBLISH_TIME_TYPE = 0
-# 指定DY视频ID列表
+# 指定DY视频URL列表 (支持多种格式)
+# 支持格式:
+# 1. 完整视频URL: "https://www.douyin.com/video/7525538910311632128"
+# 2. 带modal_id的URL: "https://www.douyin.com/user/xxx?modal_id=7525538910311632128"
+# 3. 搜索页带modal_id: "https://www.douyin.com/root/search/python?modal_id=7525538910311632128"
+# 4. 短链接: "https://v.douyin.com/drIPtQ_WPWY/"
+# 5. 纯视频ID: "7280854932641664319"
DY_SPECIFIED_ID_LIST = [
-    "7280854932641664319",
+    "https://www.douyin.com/video/7525538910311632128",
+    "https://v.douyin.com/drIPtQ_WPWY/",
+    "https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main&modal_id=7525538910311632128",
    "7202432992642387233",
    # ........................
]
-# 指定DY用户ID列表
+# 指定DY创作者URL列表 (支持完整URL或sec_user_id)
+# 支持格式:
+# 1. 完整创作者主页URL: "https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main"
+# 2. sec_user_id: "MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE"
DY_CREATOR_ID_LIST = [
-    "MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE",
+    "https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main",
+    "MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE"
    # ........................
]

config/ks_config.py

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/ks_config.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -10,11 +19,22 @@
# 快手平台配置
-# 指定快手视频ID列表
-KS_SPECIFIED_ID_LIST = ["3xf8enb8dbj6uig", "3x6zz972bchmvqe"]
+# 指定快手视频URL列表 (支持完整URL或纯ID)
+# 支持格式:
+# 1. 完整视频URL: "https://www.kuaishou.com/short-video/3x3zxz4mjrsc8ke?authorId=3x84qugg4ch9zhs&streamSource=search"
+# 2. 纯视频ID: "3xf8enb8dbj6uig"
+KS_SPECIFIED_ID_LIST = [
+    "https://www.kuaishou.com/short-video/3x3zxz4mjrsc8ke?authorId=3x84qugg4ch9zhs&streamSource=search&area=searchxxnull&searchKey=python",
+    "3xf8enb8dbj6uig",
+    # ........................
+]
-# 指定快手用户ID列表
+# 指定快手创作者URL列表 (支持完整URL或纯ID)
+# 支持格式:
+# 1. 创作者主页URL: "https://www.kuaishou.com/profile/3x84qugg4ch9zhs"
+# 2. 纯user_id: "3x4sm73aye7jq7i"
KS_CREATOR_ID_LIST = [
+    "https://www.kuaishou.com/profile/3x84qugg4ch9zhs",
    "3x4sm73aye7jq7i",
    # ........................
]

config/tieba_config.py

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/tieba_config.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

config/weibo_config.py

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/weibo_config.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -12,7 +21,7 @@
# 微博平台配置
# 搜索类型,具体的枚举值在media_platform/weibo/field.py中
-WEIBO_SEARCH_TYPE = "popular"
+WEIBO_SEARCH_TYPE = "default"
# 指定微博ID列表
WEIBO_SPECIFIED_ID_LIST = [

config/xhs_config.py

@@ -1,4 +1,12 @@
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/xhs_config.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -17,12 +25,13 @@ SORT_TYPE = "popularity_descending"
# 指定笔记URL列表, 必须要携带xsec_token参数
XHS_SPECIFIED_NOTE_URL_LIST = [
-    "https://www.xiaohongshu.com/explore/66fad51c000000001b0224b8?xsec_token=AB3rO-QopW5sgrJ41GwN01WCXh6yWPxjSoFI9D5JIMgKw=&xsec_source=pc_search"
+    "https://www.xiaohongshu.com/explore/64b95d01000000000c034587?xsec_token=AB0EFqJvINCkj6xOCKCQgfNNh8GdnBC_6XecG4QOddo3Q=&xsec_source=pc_cfeed"
    # ........................
]
-# 指定用户ID列表
+# 指定创作者URL列表 (需要携带xsec_token和xsec_source参数)
XHS_CREATOR_ID_LIST = [
-    "63e36c9a000000002703502b",
+    "https://www.xiaohongshu.com/user/profile/5f58bd990000000001003753?xsec_token=ABYVg1evluJZZzpMX-VWzchxQ1qSNVW3r-jOEnKqMcgZw=&xsec_source=pc_search"
    # ........................
]

config/zhihu_config.py

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/zhihu_config.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

constant/__init__.py

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/constant/__init__.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

constant/baidu_tieba.py

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/constant/baidu_tieba.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

constant/zhihu.py

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/constant/zhihu.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -16,4 +25,3 @@ ZHIHU_ZHUANLAN_URL = "https://zhuanlan.zhihu.com"
ANSWER_NAME = "answer"
ARTICLE_NAME = "article"
VIDEO_NAME = "zvideo"

database/__init__.py

@@ -0,0 +1,17 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/database/__init__.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。

database/db.py

@@ -1,3 +1,21 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/database/db.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# persist-1<persist1@126.com>
# 原因:将 db.py 改造为模块,移除直接执行入口,修复相对导入问题。
# 副作用:无

database/db_session.py

@@ -1,3 +1,21 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/database/db_session.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker
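
Only the new license header and these imports are visible in this hunk; as a hedged sketch of the async-session plumbing such imports typically set up (the database URL here is hypothetical, not the project's actual configuration):

```python
from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker

# Hypothetical SQLite URL (requires aiosqlite); MediaCrawler derives its own
# connection settings from config/db_config.py.
engine = create_async_engine("sqlite+aiosqlite:///database/media_crawler.sqlite")
AsyncSessionFactory = sessionmaker(engine, class_=AsyncSession, expire_on_commit=False)

async def ping() -> None:
    # Open a session and run a trivial statement to verify connectivity.
    async with AsyncSessionFactory() as session:
        result = await session.execute(text("SELECT 1"))
        print(result.scalar())
```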

database/models.py

@@ -1,3 +1,21 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/database/models.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
from sqlalchemy import create_engine, Column, Integer, Text, String, BigInteger
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
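
Again, only the header and imports appear in this hunk; a hedged sketch of a declarative model built from exactly these imports (the table and columns are invented for illustration, not MediaCrawler's real schema):

```python
from sqlalchemy import Column, Integer, String, Text, BigInteger
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class DemoNote(Base):  # hypothetical table, for illustration only
    __tablename__ = "demo_note"
    id = Column(Integer, primary_key=True, autoincrement=True)
    note_id = Column(String(64), index=True)   # platform-side identifier
    title = Column(Text)
    liked_count = Column(BigInteger, default=0)
```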

database/mongodb_store_base.py

@@ -0,0 +1,143 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/database/mongodb_store_base.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
"""MongoDB存储基类提供连接管理和通用存储方法"""
import asyncio
from typing import Dict, List, Optional
from motor.motor_asyncio import AsyncIOMotorClient, AsyncIOMotorDatabase, AsyncIOMotorCollection
from config import db_config
from tools import utils
class MongoDBConnection:
"""MongoDB连接管理单例模式"""
_instance = None
_client: Optional[AsyncIOMotorClient] = None
_db: Optional[AsyncIOMotorDatabase] = None
_lock = asyncio.Lock()
def __new__(cls):
if cls._instance is None:
cls._instance = super(MongoDBConnection, cls).__new__(cls)
return cls._instance
async def get_client(self) -> AsyncIOMotorClient:
"""获取客户端"""
if self._client is None:
async with self._lock:
if self._client is None:
await self._connect()
return self._client
async def get_db(self) -> AsyncIOMotorDatabase:
"""获取数据库"""
if self._db is None:
async with self._lock:
if self._db is None:
await self._connect()
return self._db
async def _connect(self):
"""建立连接"""
try:
mongo_config = db_config.mongodb_config
host = mongo_config["host"]
port = mongo_config["port"]
user = mongo_config["user"]
password = mongo_config["password"]
db_name = mongo_config["db_name"]
# 构建连接URL有认证/无认证)
if user and password:
connection_url = f"mongodb://{user}:{password}@{host}:{port}/"
else:
connection_url = f"mongodb://{host}:{port}/"
self._client = AsyncIOMotorClient(connection_url, serverSelectionTimeoutMS=5000)
await self._client.server_info() # 测试连接
self._db = self._client[db_name]
utils.logger.info(f"[MongoDBConnection] Connected to {host}:{port}/{db_name}")
except Exception as e:
utils.logger.error(f"[MongoDBConnection] Connection failed: {e}")
raise
async def close(self):
"""关闭连接"""
if self._client is not None:
self._client.close()
self._client = None
self._db = None
utils.logger.info("[MongoDBConnection] Connection closed")
class MongoDBStoreBase:
"""MongoDB存储基类提供通用的CRUD操作"""
def __init__(self, collection_prefix: str):
"""初始化存储基类
Args:
collection_prefix: 平台前缀xhs/douyin/bilibili等
"""
self.collection_prefix = collection_prefix
self._connection = MongoDBConnection()
async def get_collection(self, collection_suffix: str) -> AsyncIOMotorCollection:
"""获取集合:{prefix}_{suffix}"""
db = await self._connection.get_db()
collection_name = f"{self.collection_prefix}_{collection_suffix}"
return db[collection_name]
async def save_or_update(self, collection_suffix: str, query: Dict, data: Dict) -> bool:
"""保存或更新数据upsert"""
try:
collection = await self.get_collection(collection_suffix)
await collection.update_one(query, {"$set": data}, upsert=True)
return True
except Exception as e:
utils.logger.error(f"[MongoDBStoreBase] Save failed ({self.collection_prefix}_{collection_suffix}): {e}")
return False
async def find_one(self, collection_suffix: str, query: Dict) -> Optional[Dict]:
"""查询单条数据"""
try:
collection = await self.get_collection(collection_suffix)
return await collection.find_one(query)
except Exception as e:
utils.logger.error(f"[MongoDBStoreBase] Find one failed ({self.collection_prefix}_{collection_suffix}): {e}")
return None
async def find_many(self, collection_suffix: str, query: Dict, limit: int = 0) -> List[Dict]:
"""查询多条数据limit=0表示不限制"""
try:
collection = await self.get_collection(collection_suffix)
cursor = collection.find(query)
if limit > 0:
cursor = cursor.limit(limit)
return await cursor.to_list(length=None)
except Exception as e:
utils.logger.error(f"[MongoDBStoreBase] Find many failed ({self.collection_prefix}_{collection_suffix}): {e}")
return []
async def create_index(self, collection_suffix: str, keys: List[tuple], unique: bool = False):
"""创建索引keys=[("field", 1)]"""
try:
collection = await self.get_collection(collection_suffix)
await collection.create_index(keys, unique=unique)
utils.logger.info(f"[MongoDBStoreBase] Index created on {self.collection_prefix}_{collection_suffix}")
except Exception as e:
utils.logger.error(f"[MongoDBStoreBase] Create index failed: {e}")


@@ -59,7 +59,6 @@ export default defineConfig({
    text: 'MediaCrawler源码剖析课',
    link: 'https://relakkes.feishu.cn/wiki/JUgBwdhIeiSbAwkFCLkciHdAnhh'
},
-{text: '知识星球文章专栏', link: '/知识星球介绍'},
{text: '开发者咨询服务', link: '/开发者咨询'},
]
},

docs/data_storage_guide.md

@@ -0,0 +1,57 @@
# 数据保存指南 / Data Storage Guide
### 💾 数据保存
MediaCrawler 支持多种数据存储方式,您可以根据需求选择最适合的方案:
#### 存储方式
- **CSV 文件**:支持保存到 CSV 中(`data/` 目录下)
- **JSON 文件**:支持保存到 JSON 中(`data/` 目录下)
- **Excel 文件**:支持保存到格式化的 Excel 文件(`data/` 目录下)✨ 新功能
  - 多工作表支持(内容、评论、创作者)
  - 专业格式化(标题样式、自动列宽、边框)
  - 易于分析和分享
- **数据库存储**:
  - 使用参数 `--init_db` 进行数据库初始化(使用`--init_db`时不需要携带其他optional)
  - **SQLite 数据库**:轻量级数据库,无需服务器,适合个人使用(推荐)
    1. 初始化:`--init_db sqlite`
    2. 数据存储:`--save_data_option sqlite`
  - **MySQL 数据库**:支持关系型数据库 MySQL 中保存(需要提前创建数据库)
    1. 初始化:`--init_db mysql`
    2. 数据存储:`--save_data_option db`(db 参数为兼容历史更新保留)
#### 使用示例
```shell
# 使用 Excel 存储数据(推荐用于数据分析)✨ 新功能
uv run main.py --platform xhs --lt qrcode --type search --save_data_option excel
# 初始化 SQLite 数据库
uv run main.py --init_db sqlite
# 使用 SQLite 存储数据
uv run main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
```
```shell
# 初始化 MySQL 数据库
uv run main.py --init_db mysql
# 使用 MySQL 存储数据为适配历史更新db参数进行沿用
uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
```
```shell
# 使用 CSV 存储数据
uv run main.py --platform xhs --lt qrcode --type search --save_data_option csv
# 使用 JSON 存储数据
uv run main.py --platform xhs --lt qrcode --type search --save_data_option json
```
#### 详细文档
- **Excel 导出详细指南**:查看 [Excel 导出指南](excel_export_guide.md)
- **数据库配置**:参考 [常见问题](常见问题.md)
---

244
docs/excel_export_guide.md Normal file

@@ -0,0 +1,244 @@
# Excel Export Guide
## Overview
MediaCrawler now supports exporting crawled data to formatted Excel files (.xlsx) with professional styling and multiple sheets for contents, comments, and creators.
## Features
- **Multi-sheet workbooks**: Separate sheets for Contents, Comments, and Creators
- **Professional formatting**:
- Styled headers with blue background and white text
- Auto-adjusted column widths
- Cell borders and text wrapping
- Clean, readable layout
- **Smart export**: Empty sheets are automatically removed
- **Organized storage**: Files saved to `data/{platform}/` directory with timestamps
## Installation
Excel export requires the `openpyxl` library:
```bash
# Using uv (recommended)
uv sync
# Or using pip
pip install openpyxl
```
## Usage
### Basic Usage
1. **Configure Excel export** in `config/base_config.py`:
```python
SAVE_DATA_OPTION = "excel" # Change from json/csv/db to excel
```
2. **Run the crawler**:
```bash
# Xiaohongshu example
uv run main.py --platform xhs --lt qrcode --type search
# Douyin example
uv run main.py --platform dy --lt qrcode --type search
# Bilibili example
uv run main.py --platform bili --lt qrcode --type search
```
3. **Find your Excel file** in `data/{platform}/` directory:
- Filename format: `{platform}_{crawler_type}_{timestamp}.xlsx`
- Example: `xhs_search_20250128_143025.xlsx`
### Command Line Examples
```bash
# Search by keywords and export to Excel
uv run main.py --platform xhs --lt qrcode --type search --save_data_option excel
# Crawl specific posts and export to Excel
uv run main.py --platform xhs --lt qrcode --type detail --save_data_option excel
# Crawl creator profile and export to Excel
uv run main.py --platform xhs --lt qrcode --type creator --save_data_option excel
```
## Excel File Structure
### Contents Sheet
Contains post/video information:
- `note_id`: Unique post identifier
- `title`: Post title
- `desc`: Post description
- `user_id`: Author user ID
- `nickname`: Author nickname
- `liked_count`: Number of likes
- `comment_count`: Number of comments
- `share_count`: Number of shares
- `ip_location`: IP location
- `image_list`: Comma-separated image URLs
- `tag_list`: Comma-separated tags
- `note_url`: Direct link to post
- And more platform-specific fields...
### Comments Sheet
Contains comment information:
- `comment_id`: Unique comment identifier
- `note_id`: Associated post ID
- `content`: Comment text
- `user_id`: Commenter user ID
- `nickname`: Commenter nickname
- `like_count`: Comment likes
- `create_time`: Comment timestamp
- `ip_location`: Commenter location
- `sub_comment_count`: Number of replies
- And more...
### Creators Sheet
Contains creator/author information:
- `user_id`: Unique user identifier
- `nickname`: Display name
- `gender`: Gender
- `avatar`: Profile picture URL
- `desc`: Bio/description
- `fans`: Follower count
- `follows`: Following count
- `interaction`: Total interactions
- And more...
## Advantages Over Other Formats
### vs CSV
- ✅ Multiple sheets in one file
- ✅ Professional formatting
- ✅ Better handling of special characters
- ✅ Auto-adjusted column widths
- ✅ No encoding issues
### vs JSON
- ✅ Human-readable tabular format
- ✅ Easy to open in Excel/Google Sheets
- ✅ Better for data analysis
- ✅ Easier to share with non-technical users
### vs Database
- ✅ No database setup required
- ✅ Portable single-file format
- ✅ Easy to share and archive
- ✅ Works offline
## Tips & Best Practices
1. **Large datasets**: For very large crawls (>10,000 rows), consider using database storage instead for better performance
2. **Data analysis**: Excel files work great with:
- Microsoft Excel
- Google Sheets
- LibreOffice Calc
- Python pandas: `pd.read_excel('file.xlsx')`
3. **Combining data**: You can merge multiple Excel files using:
```python
import pandas as pd
df1 = pd.read_excel('file1.xlsx', sheet_name='Contents')
df2 = pd.read_excel('file2.xlsx', sheet_name='Contents')
combined = pd.concat([df1, df2])
combined.to_excel('combined.xlsx', index=False)
```
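If the same post can appear in more than one file (overlapping keywords, repeated runs), deduplicate before saving, e.g. `combined = combined.drop_duplicates(subset="note_id")`.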
4. **File size**: Excel files are typically 2-3x larger than CSV but smaller than JSON
## Troubleshooting
### "openpyxl not installed" error
```bash
# Install openpyxl
uv add openpyxl
# or
pip install openpyxl
```
### Excel file not created
Check that:
1. `SAVE_DATA_OPTION = "excel"` in config
2. Crawler successfully collected data
3. No errors in console output
4. `data/{platform}/` directory exists
### Empty Excel file
This happens when:
- No data was crawled (check keywords/IDs)
- Login failed (check login status)
- Platform blocked requests (check IP/rate limits)
## Example Output
After running a successful crawl, you'll see:
```
[ExcelStoreBase] Initialized Excel export to: data/xhs/xhs_search_20250128_143025.xlsx
[ExcelStoreBase] Stored content to Excel: 7123456789
[ExcelStoreBase] Stored comment to Excel: comment_123
...
[Main] Excel files saved successfully
```
Your Excel file will have:
- Professional blue headers
- Clean borders
- Wrapped text for long content
- Auto-sized columns
- Separate organized sheets
## Advanced Usage
### Programmatic Access
```python
import asyncio

from store.excel_store_base import ExcelStoreBase

async def main():
    # Create a store bound to a platform and crawler type
    store = ExcelStoreBase(platform="xhs", crawler_type="search")

    # Store one content record (store_content is a coroutine, so await it)
    await store.store_content({
        "note_id": "123",
        "title": "Test Post",
        "liked_count": 100,
    })

    # Write the workbook to disk
    store.flush()

asyncio.run(main())
```
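When several stores are open at once, the program's entry point calls the class method `ExcelStoreBase.flush_all()` (see `main.py`) to write every pending workbook in one pass.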
### Custom Formatting
You can extend `ExcelStoreBase` to customize formatting:
```python
from store.excel_store_base import ExcelStoreBase
class CustomExcelStore(ExcelStoreBase):
def _apply_header_style(self, sheet, row_num=1):
# Custom header styling
super()._apply_header_style(sheet, row_num)
# Add your customizations here
```
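For example, a minimal sketch that recolors the header row using openpyxl styles — the green fill is an arbitrary choice, and `_apply_header_style` is assumed to be the hook shown above:
```python
from openpyxl.styles import Font, PatternFill

from store.excel_store_base import ExcelStoreBase

class GreenHeaderExcelStore(ExcelStoreBase):
    def _apply_header_style(self, sheet, row_num=1):
        # Keep the base styling (borders, widths, wrapping), then recolor the header row
        super()._apply_header_style(sheet, row_num)
        for cell in sheet[row_num]:
            cell.font = Font(bold=True, color="FFFFFF")
            cell.fill = PatternFill(start_color="2E7D32", end_color="2E7D32", fill_type="solid")
```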
## Support
For issues or questions:
- Check the [FAQ](常见问题.md)
- Open an issue on GitHub
- Join the WeChat discussion group
---
**Note**: Excel export is designed for learning and research purposes. Please respect platform terms of service and rate limits.


@@ -1,58 +1,76 @@
 # How to use MediaCrawler
-## Create and activate a Python virtual environment
-> If you crawl Douyin or Zhihu, install a Node.js environment (version >= 16) in advance.
-```shell
-# Enter the project root directory
-cd MediaCrawler
-# Create the virtual environment
-# My Python version is 3.9.6; the libraries in requirements.txt are pinned against it. Other Python versions may be incompatible — resolve on your own.
-python -m venv venv
-# macOS & Linux: activate the virtual environment
-source venv/bin/activate
-# Windows: activate the virtual environment
-venv\Scripts\activate
-```
-## Install the dependencies
-```shell
-pip install -r requirements.txt
-```
-## Install the Playwright browser driver
-```shell
-playwright install
-```
-## Run the crawler
-```shell
-### Comment crawling is disabled by default; to enable it, change the ENABLE_GET_COMMENTS variable in config/base_config.py
-### Other supported options can also be found in config/base_config.py, each documented with comments
-# Read keywords from the config file, search for matching posts, and crawl post details and comments
-python main.py --platform xhs --lt qrcode --type search
-# Read the configured post ID list and crawl the details and comments of those posts
-python main.py --platform xhs --lt qrcode --type detail
-# Store data in a SQLite database (recommended for individual users)
-python main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
-# Store data in a MySQL database
-python main.py --platform xhs --lt qrcode --type search --save_data_option db
-# Open the corresponding app and scan the QR code to log in
-# For usage examples on other platforms, run the command below
-python main.py --help
-```
+## Recommended: manage dependencies with uv
+### 1. Prerequisites
+- Install [uv](https://docs.astral.sh/uv/getting-started/installation) and verify with `uv --version`
+- Python **3.11** is recommended (the current dependencies are built against it).
+- Install Node.js (required by Douyin, Zhihu, etc.), version `>= 16.0.0`
+### 2. Sync the Python dependencies
+```shell
+# Enter the project root directory
+cd MediaCrawler
+# Use uv to keep the Python version and dependencies consistent
+uv sync
+```
+### 3. Install the Playwright browser driver
+```shell
+uv run playwright install
+```
+> The project supports driving a local Chrome through Playwright. To use CDP mode, adjust the `xhs` and `dy` settings in `config/base_config.py`.
+### 4. Run the crawler
+```shell
+# Comment crawling is disabled by default; to enable it, change ENABLE_GET_COMMENTS in config/base_config.py
+# Other feature switches are also in config/base_config.py, each documented with comments
+# Read keywords from the config, search for matching posts, and crawl posts and comments
+uv run main.py --platform xhs --lt qrcode --type search
+# Read the configured post ID list and crawl posts and comments
+uv run main.py --platform xhs --lt qrcode --type detail
+# Store data in a SQLite database (recommended for individual users)
+uv run main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
+# Store data in a MySQL database
+uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
+# Examples for other platforms
+uv run main.py --help
+```
+## Alternative: Python's native venv (not recommended)
+> If you crawl Douyin or Zhihu, install Node.js (version `>= 16`) in advance.
+```shell
+# Enter the project root directory
+cd MediaCrawler
+# Create the virtual environment (example Python version: 3.11; requirements are based on it)
+python -m venv venv
+# macOS & Linux: activate the virtual environment
+source venv/bin/activate
+# Windows: activate the virtual environment
+venv\Scripts\activate
+```
+```shell
+# Install the dependencies and browser driver
+pip install -r requirements.txt
+playwright install
+```
+```shell
+# Run the crawler (venv environment)
+python main.py --platform xhs --lt qrcode --type search
+python main.py --platform xhs --lt qrcode --type detail
+python main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
+python main.py --platform xhs --lt qrcode --type search --save_data_option db
+python main.py --help
+```
 ## 💾 Data storage
@@ -74,4 +92,3 @@
 > Please use this repository for learning purposes only. Cases of illegal and non-compliant crawler use: https://github.com/HiddenStrawberry/Crawler_Illegal_Cases_In_China
 >
 > All content in this project is for learning and reference only and must not be used commercially. No person or organization may use the contents of this repository for illegal purposes or to infringe the lawful rights of others. The crawling techniques involved are for study and research only and must not be used for large-scale crawling of other platforms or other illegal activity. This repository accepts no legal liability arising from the use of its contents. By using this repository you agree to all terms of this disclaimer.


@@ -17,7 +17,7 @@
-Scan my personal WeChat below and mention "pro version" (if the image does not display, you can add my WeChat ID directly: yzglan)
+Scan my personal WeChat below and mention "pro version" (if the image does not display, you can add my WeChat ID directly: relakkes)
 ![relakkes_weichat.JPG](static/images/relakkes_weichat.jpg)

BIN docs/static/images/nstbrowser.jpg (vendored, new file, 580 KiB)
BIN (filename not shown; binary image updated, 223 KiB → 230 KiB)
BIN docs/static/images/tikhub_banner.png (vendored, new file, 750 KiB)
BIN docs/static/images/tikhub_banner_zh.png (vendored, new file, 758 KiB)


@@ -1,12 +1,12 @@
 # About the author
-> Everyone calls me Ajiang; my online handle is 程序员阿江-Relakkes. I have quit my job and am exploring freelancing, hoping to build my ideal lifestyle on my own skills and effort
->
-> I have plenty of technical contacts around me — if you have crawler consulting work or programming gigs, feel free to send them my way
+> Everyone calls me Ajiang; my online handle is 程序员阿江-Relakkes. I am now an independent developer focused on AI Agent and crawler development — all in on AI
 - [Author of MediaCrawler, the 10k+ star open-source social-media crawler on GitHub](https://github.com/NanmiCoder/MediaCrawler)
 - Full-stack programmer familiar with Python, Golang, and JavaScript; mainly Golang at work.
 - Led and took part in the architecture design and coding of crawler collection systems at the million-record scale
 - Crawling is a technical hobby for me; it has an adversarial feel — the harder it is, the more exciting.
+- Currently focused on the AI Agent field, actively exploring applications and innovations of AI technology
+- If you have an AI Agent project to collaborate on, feel free to contact me — I have plenty of time to invest
 ## WeChat contact
 ![relakkes_weichat.JPG](static/images/relakkes_weichat.jpg)


@@ -1,52 +1,74 @@
-## Managing dependencies with Python's native venv (no longer recommended)
-## Create and activate a Python virtual environment
-> If you crawl Douyin or Zhihu, install a Node.js environment (version >= 16) in advance.
-> [uv](https://github.com/astral-sh/uv) has been added to manage project dependencies; it replaces Python version management and pip installs, and is faster and more convenient
-```shell
-# Enter the project root directory
-cd MediaCrawler
-# Create the virtual environment
-# My Python version is 3.9.6; the libraries in requirements.txt are pinned against it. Other Python versions may be incompatible — resolve on your own.
-python -m venv venv
-# macOS & Linux: activate the virtual environment
-source venv/bin/activate
-# Windows: activate the virtual environment
-venv\Scripts\activate
-```
-## Install the dependencies
-```shell
-pip install -r requirements.txt
-```
-## Review the configuration file
-## Install the Playwright browser driver (optional)
-```shell
-playwright install
-```
-## Run the crawler
-```shell
-### Comment crawling is disabled by default; to enable it, change the ENABLE_GET_COMMENTS variable in config/base_config.py
-### Other supported options can also be found in config/base_config.py, each documented with comments
-# Read keywords from the config file, search for matching posts, and crawl post details and comments
-python main.py --platform xhs --lt qrcode --type search
-# Read the configured post ID list and crawl the details and comments of those posts
-python main.py --platform xhs --lt qrcode --type detail
-# Open the corresponding app and scan the QR code to log in
-# For usage examples on other platforms, run the command below
-python main.py --help
-```
+# Local native environment management
+## Recommended: manage dependencies with uv
+### 1. Prerequisites
+- Install [uv](https://docs.astral.sh/uv/getting-started/installation) and verify with `uv --version`
+- Python **3.11** is recommended (the current dependencies are built against it).
+- Install Node.js (required by Douyin, Zhihu, etc.), version `>= 16.0.0`
+### 2. Sync the Python dependencies
+```shell
+# Enter the project root directory
+cd MediaCrawler
+# Use uv to keep the Python version and dependencies consistent
+uv sync
+```
+### 3. Install the Playwright browser driver
+```shell
+uv run playwright install
+```
+> The project supports driving a local Chrome through Playwright. To use CDP mode, adjust the `xhs` and `dy` settings in `config/base_config.py`.
+### 4. Run the crawler
+```shell
+# Comment crawling is disabled by default; to enable it, change ENABLE_GET_COMMENTS in config/base_config.py
+# Other feature switches are also in config/base_config.py, each documented with comments
+# Read keywords from the config, search for matching posts, and crawl posts and comments
+uv run main.py --platform xhs --lt qrcode --type search
+# Read the configured post ID list and crawl posts and comments
+uv run main.py --platform xhs --lt qrcode --type detail
+# Examples for other platforms
+uv run main.py --help
+```
+## Alternative: Python's native venv (not recommended)
+### Create and activate the virtual environment
+> If you crawl Douyin or Zhihu, install Node.js (version `>= 16`) in advance.
+```shell
+# Enter the project root directory
+cd MediaCrawler
+# Create the virtual environment (example Python version: 3.11; requirements are based on it)
+python -m venv venv
+# macOS & Linux: activate the virtual environment
+source venv/bin/activate
+# Windows: activate the virtual environment
+venv\Scripts\activate
+```
+### Install the dependencies and browser driver
+```shell
+pip install -r requirements.txt
+playwright install
+```
+### Run the crawler (venv environment)
+```shell
+# Read keywords from the config, search for matching posts, and crawl posts and comments
+python main.py --platform xhs --lt qrcode --type search
+# Read the configured post ID list and crawl posts and comments
+python main.py --platform xhs --lt qrcode --type detail
+# More examples
+python main.py --help
+```


@@ -7,6 +7,6 @@
 ## How to join the group
 > Mention "github" and the group assistant will automatically pull you into the group.
 >
-> If the image does not display or has expired, you can add my WeChat ID directly: yzglan, mention "github", and the group assistant will pull you in
+> If the image does not display or has expired, you can add my WeChat ID directly: relakkes, mention "github", and the group assistant will pull you in
 ![relakkes_wechat](static/images/relakkes_weichat.jpg)


@@ -15,5 +15,3 @@
 ## MediaCrawler source-code walkthrough video course
 [Introduction to the MediaCrawler source-code course](https://relakkes.feishu.cn/wiki/JUgBwdhIeiSbAwkFCLkciHdAnhh)
-## Knowledge Planet column: crawling, reverse engineering, programming
-[Introduction to the Knowledge Planet column](知识星球介绍.md)


@@ -1,31 +0,0 @@
-# Knowledge Planet column
-## Overview
-Articles:
-- 1. Crawler JS reverse-engineering case studies
-- 2. MediaCrawler implementation write-ups.
-- 3. Accumulated Python development experience and tips
-- ......................
-Questions:
-- 4. Ask me anything inside the planet about MediaCrawler, crawlers, or programming
-## Contents
-- [Reverse case — parameter analysis of a certain 16x8 platform's product-list API](https://articles.zsxq.com/id_x1qmtg8pzld9.html)
-- [Reverse case — analysis of the encrypted parameters of Product Hunt's monthly best-product ranking API](https://articles.zsxq.com/id_au4eich3x2sg.html)
-- [Reverse case — analysis of a certain Zhihu-like site's x-zse-96 parameter](https://articles.zsxq.com/id_dui2vil0ag1l.html)
-- [Reverse case — analysis of a certain Knowledge-Planet-like site's X-Signature encrypted parameter](https://articles.zsxq.com/id_pp4madwcwcg8.html)
-- [(Original) Obtaining a certain short-video platform's a_bogus parameter with Playwright, including encrypted-parameter analysis](https://articles.zsxq.com/id_u89al50jk9x0.html)
-- [(Original) Low-cost retrieval of a certain notes platform's X-s parameter with Playwright — a memoir of those days](https://articles.zsxq.com/id_u4lcrvqakuc7.html)
-- [MediaCrawler — refactoring the project cache around an abstract base class](https://articles.zsxq.com/id_4ju73oxewt9j.html)
-- [Hands-on: build your own IP proxy pool](https://articles.zsxq.com/id_38fza371ladm.html)
-- [A bug caused by mixing MySQL collations](https://articles.zsxq.com/id_pibwr1wnst2p.html)
-- [A hidden bug from misusing Python mutable types](https://articles.zsxq.com/id_f7vn89l1d303.html)
-- [(MediaCrawler) Weibo post-comment crawler tutorial](https://articles.zsxq.com/id_vrmuhw0ovj3t.html)
-- [Idempotency issues with Python coroutines under concurrency](https://articles.zsxq.com/id_wocdwsfmfcmp.html)
-- ........................................
-## Join the planet
-![星球qrcode.JPG](static/images/星球qrcode.jpg)

main.py

@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/main.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Notice: this code is for learning and research purposes only. Users must observe the following principles:
 # 1. It must not be used for any commercial purpose.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.
@@ -11,6 +20,7 @@
 import asyncio
 import sys
+import signal
 from typing import Optional

 import cmd_arg
@@ -24,6 +34,8 @@ from media_platform.tieba import TieBaCrawler
 from media_platform.weibo import WeiboCrawler
 from media_platform.xhs import XiaoHongShuCrawler
 from media_platform.zhihu import ZhihuCrawler
+from tools.async_file_writer import AsyncFileWriter
+from var import crawler_type_var

 class CrawlerFactory:
@@ -72,17 +84,84 @@ async def main():
     crawler = CrawlerFactory.create_crawler(platform=config.PLATFORM)
     await crawler.start()

+    # Flush Excel data if using Excel export
+    if config.SAVE_DATA_OPTION == "excel":
+        try:
+            from store.excel_store_base import ExcelStoreBase
+            ExcelStoreBase.flush_all()
+            print("[Main] Excel files saved successfully")
+        except Exception as e:
+            print(f"[Main] Error flushing Excel data: {e}")
+
+    # Generate wordcloud after crawling is complete
+    # Only for JSON save mode
+    if config.SAVE_DATA_OPTION == "json" and config.ENABLE_GET_WORDCLOUD:
+        try:
+            file_writer = AsyncFileWriter(
+                platform=config.PLATFORM,
+                crawler_type=crawler_type_var.get()
+            )
+            await file_writer.generate_wordcloud_from_comments()
+        except Exception as e:
+            print(f"Error generating wordcloud: {e}")
+
+async def async_cleanup():
+    """Async cleanup for CDP browsers and other async resources"""
+    global crawler
+    if crawler:
+        # Check for and clean up the CDP browser
+        if hasattr(crawler, 'cdp_manager') and crawler.cdp_manager:
+            try:
+                await crawler.cdp_manager.cleanup(force=True)  # force-clean browser processes
+            except Exception as e:
+                # Only report unexpected errors
+                error_msg = str(e).lower()
+                if "closed" not in error_msg and "disconnected" not in error_msg:
+                    print(f"[Main] Error cleaning up the CDP browser: {e}")
+        # Otherwise close the standard browser context (non-CDP mode only)
+        elif hasattr(crawler, 'browser_context') and crawler.browser_context:
+            try:
+                # Check whether the context is still open
+                if hasattr(crawler.browser_context, 'pages'):
+                    await crawler.browser_context.close()
+            except Exception as e:
+                # Only report unexpected errors
+                error_msg = str(e).lower()
+                if "closed" not in error_msg and "disconnected" not in error_msg:
+                    print(f"[Main] Error closing the browser context: {e}")
+    # Close database connections
+    if config.SAVE_DATA_OPTION in ["db", "sqlite"]:
+        await db.close()
+
 def cleanup():
-    if crawler:
-        # asyncio.run(crawler.close())
-        pass
-    if config.SAVE_DATA_OPTION in ["db", "sqlite"]:
-        asyncio.run(db.close())
+    """Synchronous cleanup wrapper"""
+    try:
+        # Create a fresh event loop to run the async cleanup
+        loop = asyncio.new_event_loop()
+        asyncio.set_event_loop(loop)
+        loop.run_until_complete(async_cleanup())
+        loop.close()
+    except Exception as e:
+        print(f"[Main] Error during cleanup: {e}")
+
+def signal_handler(signum, _frame):
+    """Signal handler for Ctrl+C and other interrupts"""
+    print(f"\n[Main] Received interrupt signal {signum}, cleaning up resources...")
+    cleanup()
+    sys.exit(0)

 if __name__ == "__main__":
+    # Register signal handlers
+    signal.signal(signal.SIGINT, signal_handler)   # Ctrl+C
+    signal.signal(signal.SIGTERM, signal_handler)  # termination signal
     try:
         asyncio.get_event_loop().run_until_complete(main())
+    except KeyboardInterrupt:
+        print("\n[Main] Keyboard interrupt received, cleaning up resources...")
     finally:
         cleanup()


@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/__init__.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Notice: this code is for learning and research purposes only. Users must observe the following principles:
 # 1. It must not be used for any commercial purpose.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.
@@ -7,5 +16,3 @@
 #
 # For detailed license terms, see the LICENSE file in the project root.
 # Using this code constitutes agreement to the above principles and all terms of the LICENSE.


@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/bilibili/__init__.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Notice: this code is for learning and research purposes only. Users must observe the following principles:
 # 1. It must not be used for any commercial purpose.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.


@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/bilibili/client.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Notice: this code is for learning and research purposes only. Users must observe the following principles:
 # 1. It must not be used for any commercial purpose.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.
@@ -15,7 +24,7 @@
 import asyncio
 import json
 import random
-from typing import Any, Callable, Dict, List, Optional, Tuple, Union
+from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Tuple, Union
 from urllib.parse import urlencode

 import httpx
@@ -23,14 +32,18 @@ from playwright.async_api import BrowserContext, Page

 import config
 from base.base_crawler import AbstractApiClient
+from proxy.proxy_mixin import ProxyRefreshMixin
 from tools import utils

+if TYPE_CHECKING:
+    from proxy.proxy_ip_pool import ProxyIpPool
+
 from .exception import DataFetchError
 from .field import CommentOrderType, SearchOrderType
 from .help import BilibiliSign

-class BilibiliClient(AbstractApiClient):
+class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
     def __init__(
         self,
@@ -40,6 +53,7 @@ class BilibiliClient(AbstractApiClient):
         headers: Dict[str, str],
         playwright_page: Page,
         cookie_dict: Dict[str, str],
+        proxy_ip_pool: Optional["ProxyIpPool"] = None,
     ):
         self.proxy = proxy
         self.timeout = timeout
@@ -47,8 +61,13 @@ class BilibiliClient(AbstractApiClient):
         self._host = "https://api.bilibili.com"
         self.playwright_page = playwright_page
         self.cookie_dict = cookie_dict
+        # Initialize the proxy pool (from ProxyRefreshMixin)
+        self.init_proxy_pool(proxy_ip_pool)

     async def request(self, method, url, **kwargs) -> Any:
+        # Check whether the proxy has expired before each request
+        await self._refresh_proxy_if_expired()
         async with httpx.AsyncClient(proxy=self.proxy) as client:
             response = await client.request(method, url, timeout=self.timeout, **kwargs)
         try:


@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/bilibili/core.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Notice: this code is for learning and research purposes only. Users must observe the following principles:
 # 1. It must not be used for any commercial purpose.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.
@@ -41,6 +50,7 @@ from var import crawler_type_var, source_keyword_var
 from .client import BilibiliClient
 from .exception import DataFetchError
 from .field import SearchOrderType
+from .help import parse_video_info_from_url, parse_creator_info_from_url
 from .login import BilibiliLogin
@@ -54,12 +64,13 @@ class BilibiliCrawler(AbstractCrawler):
         self.index_url = "https://www.bilibili.com"
         self.user_agent = utils.get_user_agent()
         self.cdp_manager = None
+        self.ip_proxy_pool = None  # proxy IP pool (used for automatic proxy refresh)

     async def start(self):
         playwright_proxy_format, httpx_proxy_format = None, None
         if config.ENABLE_IP_PROXY:
-            ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
-            ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy()
+            self.ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
+            ip_proxy_info: IpInfoModel = await self.ip_proxy_pool.get_proxy()
             playwright_proxy_format, httpx_proxy_format = utils.format_proxy_info(ip_proxy_info)

         async with async_playwright() as playwright:
@@ -79,6 +90,7 @@
                 self.browser_context = await self.launch_browser(chromium, None, self.user_agent, headless=config.HEADLESS)
                 # stealth.min.js is a js script to prevent the website from detecting the crawler.
                 await self.browser_context.add_init_script(path="libs/stealth.min.js")
             self.context_page = await self.browser_context.new_page()
             await self.context_page.goto(self.index_url)
@@ -103,8 +115,14 @@
                 await self.get_specified_videos(config.BILI_SPECIFIED_ID_LIST)
             elif config.CRAWLER_TYPE == "creator":
                 if config.CREATOR_MODE:
-                    for creator_id in config.BILI_CREATOR_ID_LIST:
-                        await self.get_creator_videos(int(creator_id))
+                    for creator_url in config.BILI_CREATOR_ID_LIST:
+                        try:
+                            creator_info = parse_creator_info_from_url(creator_url)
+                            utils.logger.info(f"[BilibiliCrawler.start] Parsed creator ID: {creator_info.creator_id} from {creator_url}")
+                            await self.get_creator_videos(int(creator_info.creator_id))
+                        except ValueError as e:
+                            utils.logger.error(f"[BilibiliCrawler.start] Failed to parse creator URL: {e}")
+                            continue
                 else:
                     await self.get_all_creator_details(config.BILI_CREATOR_ID_LIST)
             else:
@@ -362,11 +380,23 @@
                 utils.logger.info(f"[BilibiliCrawler.get_creator_videos] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after page {pn}")
             pn += 1

-    async def get_specified_videos(self, bvids_list: List[str]):
+    async def get_specified_videos(self, video_url_list: List[str]):
         """
-        get specified videos info
+        get specified videos info from URLs or BV IDs
+        :param video_url_list: List of video URLs or BV IDs
         :return:
         """
+        utils.logger.info("[BilibiliCrawler.get_specified_videos] Parsing video URLs...")
+        bvids_list = []
+        for video_url in video_url_list:
+            try:
+                video_info = parse_video_info_from_url(video_url)
+                bvids_list.append(video_info.video_id)
+                utils.logger.info(f"[BilibiliCrawler.get_specified_videos] Parsed video ID: {video_info.video_id} from {video_url}")
+            except ValueError as e:
+                utils.logger.error(f"[BilibiliCrawler.get_specified_videos] Failed to parse video URL: {e}")
+                continue
         semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
         task_list = [self.get_video_info_task(aid=0, bvid=video_id, semaphore=semaphore) for video_id in bvids_list]
         video_details = await asyncio.gather(*task_list)
@@ -444,6 +474,7 @@
             },
             playwright_page=self.context_page,
             cookie_dict=cookie_dict,
+            proxy_ip_pool=self.ip_proxy_pool,  # pass the proxy pool for automatic refresh
         )
         return bilibili_client_obj
@@ -477,11 +508,12 @@
                     "height": 1080
                 },
                 user_agent=user_agent,
+                channel="chrome",  # use the system's stable Chrome
             )
             return browser_context
         else:
             # type: ignore
-            browser = await chromium.launch(headless=headless, proxy=playwright_proxy)
+            browser = await chromium.launch(headless=headless, proxy=playwright_proxy, channel="chrome")
             browser_context = await browser.new_context(viewport={"width": 1920, "height": 1080}, user_agent=user_agent)
             return browser_context
@@ -568,18 +600,30 @@
             extension_file_name = f"video.mp4"
             await bilibili_store.store_video(aid, content, extension_file_name)

-    async def get_all_creator_details(self, creator_id_list: List[int]):
+    async def get_all_creator_details(self, creator_url_list: List[str]):
         """
-        creator_id_list: get details for creator from creator_id_list
+        creator_url_list: get details for creator from creator URL list
         """
-        utils.logger.info(f"[BilibiliCrawler.get_creator_details] Crawling the detalis of creator")
-        utils.logger.info(f"[BilibiliCrawler.get_creator_details] creator ids:{creator_id_list}")
+        utils.logger.info(f"[BilibiliCrawler.get_all_creator_details] Crawling the details of creators")
+        utils.logger.info(f"[BilibiliCrawler.get_all_creator_details] Parsing creator URLs...")
+        creator_id_list = []
+        for creator_url in creator_url_list:
+            try:
+                creator_info = parse_creator_info_from_url(creator_url)
+                creator_id_list.append(int(creator_info.creator_id))
+                utils.logger.info(f"[BilibiliCrawler.get_all_creator_details] Parsed creator ID: {creator_info.creator_id} from {creator_url}")
+            except ValueError as e:
+                utils.logger.error(f"[BilibiliCrawler.get_all_creator_details] Failed to parse creator URL: {e}")
+                continue
+
+        utils.logger.info(f"[BilibiliCrawler.get_all_creator_details] creator ids:{creator_id_list}")
         semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
         task_list: List[Task] = []
         try:
             for creator_id in creator_id_list:
-                task = asyncio.create_task(self.get_creator_details(creator_id, semaphore), name=creator_id)
+                task = asyncio.create_task(self.get_creator_details(creator_id, semaphore), name=str(creator_id))
                 task_list.append(task)
         except Exception as e:
             utils.logger.warning(f"[BilibiliCrawler.get_all_creator_details] error in the task list. The creator will not be included. {e}")


@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/bilibili/exception.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Notice: this code is for learning and research purposes only. Users must observe the following principles:
 # 1. It must not be used for any commercial purpose.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.


@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/bilibili/field.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Notice: this code is for learning and research purposes only. Users must observe the following principles:
 # 1. It must not be used for any commercial purpose.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.


@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/bilibili/help.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Notice: this code is for learning and research purposes only. Users must observe the following principles:
 # 1. It must not be used for any commercial purpose.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.
@@ -9,15 +18,17 @@
 # Using this code constitutes agreement to the above principles and all terms of the LICENSE.

 # -*- coding: utf-8 -*-
 # @Author : relakkes@gmail.com
 # @Time : 2023/12/2 23:26
 # @Desc : bilibili request-parameter signing
 # Reverse-engineering reference: https://socialsisteryi.github.io/bilibili-API-collect/docs/misc/sign/wbi.html#wbi%E7%AD%BE%E5%90%8D%E7%AE%97%E6%B3%95

+import re
 import urllib.parse
 from hashlib import md5
 from typing import Dict

+from model.m_bilibili import VideoUrlInfo, CreatorUrlInfo
 from tools import utils
@@ -66,16 +77,71 @@ class BilibiliSign:
         return req_data

+def parse_video_info_from_url(url: str) -> VideoUrlInfo:
+    """
+    Parse the video ID from a Bilibili video URL
+    Args:
+        url: Bilibili video link
+            - https://www.bilibili.com/video/BV1dwuKzmE26/?spm_id_from=333.1387.homepage.video_card.click
+            - https://www.bilibili.com/video/BV1d54y1g7db
+            - BV1d54y1g7db (a bare BV id)
+    Returns:
+        VideoUrlInfo: object containing the video ID
+    """
+    # If the input is already a BV id, return it directly
+    if url.startswith("BV"):
+        return VideoUrlInfo(video_id=url)
+
+    # Extract the BV id with a regular expression
+    # Matches the /video/BV... format
+    bv_pattern = r'/video/(BV[a-zA-Z0-9]+)'
+    match = re.search(bv_pattern, url)
+    if match:
+        video_id = match.group(1)
+        return VideoUrlInfo(video_id=video_id)
+
+    raise ValueError(f"Unable to parse a video ID from URL: {url}")
+
+def parse_creator_info_from_url(url: str) -> CreatorUrlInfo:
+    """
+    Parse the creator ID from a Bilibili creator-space URL
+    Args:
+        url: Bilibili creator-space link
+            - https://space.bilibili.com/434377496?spm_id_from=333.1007.0.0
+            - https://space.bilibili.com/20813884
+            - 434377496 (a bare UID)
+    Returns:
+        CreatorUrlInfo: object containing the creator ID
+    """
+    # If the input is already a bare numeric ID, return it directly
+    if url.isdigit():
+        return CreatorUrlInfo(creator_id=url)
+
+    # Extract the UID with a regular expression
+    # Matches the space.bilibili.com/<digits> format
+    uid_pattern = r'space\.bilibili\.com/(\d+)'
+    match = re.search(uid_pattern, url)
+    if match:
+        creator_id = match.group(1)
+        return CreatorUrlInfo(creator_id=creator_id)
+
+    raise ValueError(f"Unable to parse a creator ID from URL: {url}")

-if __name__ == '__main__':
-    _img_key = "7cd084941338484aae1ad9425b84077c"
-    _sub_key = "4932caff0ff746eab6f01bf08b70ac45"
-    _search_url = "__refresh__=true&_extra=&ad_resource=5654&category_id=&context=&dynamic_offset=0&from_source=&from_spmid=333.337&gaia_vtoken=&highlight=1&keyword=python&order=click&page=1&page_size=20&platform=pc&qv_id=OQ8f2qtgYdBV1UoEnqXUNUl8LEDAdzsD&search_type=video&single_column=0&source_tag=3&web_location=1430654"
-    _req_data = dict()
-    for params in _search_url.split("&"):
-        kvalues = params.split("=")
-        key = kvalues[0]
-        value = kvalues[1]
-        _req_data[key] = value
-    print("pre req_data", _req_data)
-    _req_data = BilibiliSign(img_key=_img_key, sub_key=_sub_key).sign(req_data={"aid":170001})
-    print(_req_data)
+if __name__ == '__main__':
+    # Test video URL parsing
+    video_url1 = "https://www.bilibili.com/video/BV1dwuKzmE26/?spm_id_from=333.1387.homepage.video_card.click"
+    video_url2 = "BV1d54y1g7db"
+    print("Video URL parsing test:")
+    print(f"URL1: {video_url1} -> {parse_video_info_from_url(video_url1)}")
+    print(f"URL2: {video_url2} -> {parse_video_info_from_url(video_url2)}")
+
+    # Test creator URL parsing
+    creator_url1 = "https://space.bilibili.com/434377496?spm_id_from=333.1007.0.0"
+    creator_url2 = "20813884"
+    print("\nCreator URL parsing test:")
+    print(f"URL1: {creator_url1} -> {parse_creator_info_from_url(creator_url1)}")
+    print(f"URL2: {creator_url2} -> {parse_creator_info_from_url(creator_url2)}")


@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/bilibili/login.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Notice: this code is for learning and research purposes only. Users must observe the following principles:
 # 1. It must not be used for any commercial purpose.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.


@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/douyin/__init__.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Notice: this code is for learning and research purposes only. Users must observe the following principles:
 # 1. It must not be used for any commercial purpose.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.


@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/douyin/client.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Notice: this code is for learning and research purposes only. Users must observe the following principles:
 # 1. It must not be used for any commercial purpose.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.
@@ -12,21 +21,25 @@ import asyncio
 import copy
 import json
 import urllib.parse
-from typing import Any, Callable, Dict, Union, Optional
+from typing import TYPE_CHECKING, Any, Callable, Dict, Union, Optional

 import httpx
 from playwright.async_api import BrowserContext

 from base.base_crawler import AbstractApiClient
+from proxy.proxy_mixin import ProxyRefreshMixin
 from tools import utils
 from var import request_keyword_var

+if TYPE_CHECKING:
+    from proxy.proxy_ip_pool import ProxyIpPool
+
 from .exception import *
 from .field import *
 from .help import *

-class DouYinClient(AbstractApiClient):
+class DouYinClient(AbstractApiClient, ProxyRefreshMixin):
     def __init__(
         self,
@@ -36,6 +49,7 @@ class DouYinClient(AbstractApiClient):
         headers: Dict,
         playwright_page: Optional[Page],
         cookie_dict: Dict,
+        proxy_ip_pool: Optional["ProxyIpPool"] = None,
     ):
         self.proxy = proxy
         self.timeout = timeout
@@ -43,6 +57,8 @@ class DouYinClient(AbstractApiClient):
         self._host = "https://www.douyin.com"
         self.playwright_page = playwright_page
         self.cookie_dict = cookie_dict
+        # Initialize the proxy pool (from ProxyRefreshMixin)
+        self.init_proxy_pool(proxy_ip_pool)

     async def __process_req_params(
         self,
@@ -91,10 +107,15 @@ class DouYinClient(AbstractApiClient):
         post_data = {}
         if request_method == "POST":
             post_data = params
-        a_bogus = await get_a_bogus(uri, query_string, post_data, headers["User-Agent"], self.playwright_page)
-        params["a_bogus"] = a_bogus
+        if "/v1/web/general/search" not in uri:
+            a_bogus = await get_a_bogus(uri, query_string, post_data, headers["User-Agent"], self.playwright_page)
+            params["a_bogus"] = a_bogus

     async def request(self, method, url, **kwargs):
+        # Check whether the proxy has expired before each request
+        await self._refresh_proxy_if_expired()
         async with httpx.AsyncClient(proxy=self.proxy) as client:
             response = await client.request(method, url, timeout=self.timeout, **kwargs)
         try:
@@ -324,3 +345,28 @@ class DouYinClient(AbstractApiClient):
         except httpx.HTTPError as exc:  # some wrong when call httpx.request method, such as connection error, client error, server error or response status code is not 2xx
             utils.logger.error(f"[DouYinClient.get_aweme_media] {exc.__class__.__name__} for {exc.request.url} - {exc}")  # keep the original exception type name to help developers debug
             return None
+
+    async def resolve_short_url(self, short_url: str) -> str:
+        """
+        Resolve a Douyin short link and return the redirected real URL
+        Args:
+            short_url: short link, e.g. https://v.douyin.com/iF12345ABC/
+        Returns:
+            the full URL after redirection
+        """
+        async with httpx.AsyncClient(proxy=self.proxy, follow_redirects=False) as client:
+            try:
+                utils.logger.info(f"[DouYinClient.resolve_short_url] Resolving short URL: {short_url}")
+                response = await client.get(short_url, timeout=10)
+                # Short links usually answer with a 302 redirect
+                if response.status_code in [301, 302, 303, 307, 308]:
+                    redirect_url = response.headers.get("Location", "")
+                    utils.logger.info(f"[DouYinClient.resolve_short_url] Resolved to: {redirect_url}")
+                    return redirect_url
+                else:
+                    utils.logger.warning(f"[DouYinClient.resolve_short_url] Unexpected status code: {response.status_code}")
+                    return ""
+            except Exception as e:
+                utils.logger.error(f"[DouYinClient.resolve_short_url] Failed to resolve short URL: {e}")
+                return ""


@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/douyin/core.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Notice: this code is for learning and research purposes only. Users must observe the following principles:
 # 1. It must not be used for any commercial purpose.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.
@@ -33,6 +42,7 @@ from var import crawler_type_var, source_keyword_var
 from .client import DouYinClient
 from .exception import DataFetchError
 from .field import PublishTimeType
+from .help import parse_video_info_from_url, parse_creator_info_from_url
 from .login import DouYinLogin
@@ -45,12 +55,13 @@ class DouYinCrawler(AbstractCrawler):
     def __init__(self) -> None:
         self.index_url = "https://www.douyin.com"
         self.cdp_manager = None
+        self.ip_proxy_pool = None  # proxy IP pool (used for automatic proxy refresh)

     async def start(self) -> None:
         playwright_proxy_format, httpx_proxy_format = None, None
         if config.ENABLE_IP_PROXY:
-            ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
-            ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy()
+            self.ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
+            ip_proxy_info: IpInfoModel = await self.ip_proxy_pool.get_proxy()
             playwright_proxy_format, httpx_proxy_format = utils.format_proxy_info(ip_proxy_info)

         async with async_playwright() as playwright:
@@ -75,6 +86,7 @@
                 )
                 # stealth.min.js is a js script to prevent the website from detecting the crawler.
                 await self.browser_context.add_init_script(path="libs/stealth.min.js")
             self.context_page = await self.browser_context.new_page()
             await self.context_page.goto(self.index_url)
@@ -154,15 +166,39 @@
         await self.batch_get_note_comments(aweme_list)

     async def get_specified_awemes(self):
-        """Get the information and comments of the specified post"""
+        """Get the information and comments of the specified post from URLs or IDs"""
+        utils.logger.info("[DouYinCrawler.get_specified_awemes] Parsing video URLs...")
+        aweme_id_list = []
+        for video_url in config.DY_SPECIFIED_ID_LIST:
+            try:
+                video_info = parse_video_info_from_url(video_url)
+                # Handle short links
+                if video_info.url_type == "short":
+                    utils.logger.info(f"[DouYinCrawler.get_specified_awemes] Resolving short link: {video_url}")
+                    resolved_url = await self.dy_client.resolve_short_url(video_url)
+                    if resolved_url:
+                        # Extract the video ID from the resolved URL
+                        video_info = parse_video_info_from_url(resolved_url)
+                        utils.logger.info(f"[DouYinCrawler.get_specified_awemes] Short link resolved to aweme ID: {video_info.aweme_id}")
+                    else:
+                        utils.logger.error(f"[DouYinCrawler.get_specified_awemes] Failed to resolve short link: {video_url}")
+                        continue
+                aweme_id_list.append(video_info.aweme_id)
+                utils.logger.info(f"[DouYinCrawler.get_specified_awemes] Parsed aweme ID: {video_info.aweme_id} from {video_url}")
+            except ValueError as e:
+                utils.logger.error(f"[DouYinCrawler.get_specified_awemes] Failed to parse video URL: {e}")
+                continue
+
         semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
-        task_list = [self.get_aweme_detail(aweme_id=aweme_id, semaphore=semaphore) for aweme_id in config.DY_SPECIFIED_ID_LIST]
+        task_list = [self.get_aweme_detail(aweme_id=aweme_id, semaphore=semaphore) for aweme_id in aweme_id_list]
         aweme_details = await asyncio.gather(*task_list)
         for aweme_detail in aweme_details:
             if aweme_detail is not None:
                 await douyin_store.update_douyin_aweme(aweme_item=aweme_detail)
                 await self.get_aweme_media(aweme_item=aweme_detail)
-        await self.batch_get_note_comments(config.DY_SPECIFIED_ID_LIST)
+        await self.batch_get_note_comments(aweme_id_list)
@@ -218,10 +254,20 @@
     async def get_creators_and_videos(self) -> None:
         """
-        Get the information and videos of the specified creator
+        Get the information and videos of the specified creator from URLs or IDs
         """
         utils.logger.info("[DouYinCrawler.get_creators_and_videos] Begin get douyin creators")
-        for user_id in config.DY_CREATOR_ID_LIST:
+        utils.logger.info("[DouYinCrawler.get_creators_and_videos] Parsing creator URLs...")
+        for creator_url in config.DY_CREATOR_ID_LIST:
+            try:
+                creator_info_parsed = parse_creator_info_from_url(creator_url)
+                user_id = creator_info_parsed.sec_user_id
+                utils.logger.info(f"[DouYinCrawler.get_creators_and_videos] Parsed sec_user_id: {user_id} from {creator_url}")
+            except ValueError as e:
+                utils.logger.error(f"[DouYinCrawler.get_creators_and_videos] Failed to parse creator URL: {e}")
+                continue
             creator_info: Dict = await self.dy_client.get_user_info(user_id)
             if creator_info:
                 await douyin_store.save_creator(user_id, creator=creator_info)
@@ -260,6 +306,7 @@
             },
             playwright_page=self.context_page,
             cookie_dict=cookie_dict,
+            proxy_ip_pool=self.ip_proxy_pool,  # pass the proxy pool for automatic refresh
         )
         return douyin_client


@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/douyin/exception.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Notice: this code is for learning and research purposes only. Users must observe the following principles:
 # 1. It must not be used for any commercial purpose.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.


@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/douyin/field.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Notice: this code is for learning and research purposes only. Users must observe the following principles:
 # 1. It must not be used for any commercial purpose.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.


@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/douyin/help.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Notice: this code is for learning and research purposes only. Users must observe the following principles:
 # 1. It must not be used for any commercial purpose.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.
@@ -16,10 +25,15 @@
 # @Desc : obtain the a_bogus parameter; for learning and exchange only, not for commercial use; contact the author for removal in case of infringement

 import random
+import re
+from typing import Optional

 import execjs
 from playwright.async_api import Page

+from model.m_douyin import VideoUrlInfo, CreatorUrlInfo
+from tools.crawler_util import extract_url_params_to_dict
+
 douyin_sign_obj = execjs.compile(open('libs/douyin.js', encoding='utf-8-sig').read())

 def get_web_id():
@@ -83,3 +97,102 @@ async def get_a_bogus_from_playright(params: str, post_data: dict, user_agent: s
     return a_bogus

+def parse_video_info_from_url(url: str) -> VideoUrlInfo:
+    """
+    Parse the video ID from a Douyin video URL
+    Supported formats:
+    1. Regular video link: https://www.douyin.com/video/7525082444551310602
+    2. Links carrying a modal_id parameter:
+       - https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?modal_id=7525082444551310602
+       - https://www.douyin.com/root/search/python?modal_id=7471165520058862848
+    3. Short link: https://v.douyin.com/iF12345ABC/ (must be resolved by the client)
+    4. Bare ID: 7525082444551310602
+    Args:
+        url: Douyin video link or ID
+    Returns:
+        VideoUrlInfo: object containing the video ID
+    """
+    # Bare numeric ID: return directly
+    if url.isdigit():
+        return VideoUrlInfo(aweme_id=url, url_type="normal")
+
+    # Check for a short link (v.douyin.com)
+    if "v.douyin.com" in url or url.startswith("http") and len(url) < 50 and "video" not in url:
+        return VideoUrlInfo(aweme_id="", url_type="short")  # must be resolved by the client
+
+    # Try to extract modal_id from the URL parameters
+    params = extract_url_params_to_dict(url)
+    modal_id = params.get("modal_id")
+    if modal_id:
+        return VideoUrlInfo(aweme_id=modal_id, url_type="modal")
+
+    # Extract the ID from a standard video URL: /video/<digits>
+    video_pattern = r'/video/(\d+)'
+    match = re.search(video_pattern, url)
+    if match:
+        aweme_id = match.group(1)
+        return VideoUrlInfo(aweme_id=aweme_id, url_type="normal")
+
+    raise ValueError(f"Unable to parse a video ID from URL: {url}")
+
+def parse_creator_info_from_url(url: str) -> CreatorUrlInfo:
+    """
+    Parse the creator ID (sec_user_id) from a Douyin creator profile URL
+    Supported formats:
+    1. Creator profile: https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main
+    2. Bare ID: MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE
+    Args:
+        url: Douyin creator profile link or sec_user_id
+    Returns:
+        CreatorUrlInfo: object containing the creator ID
+    """
+    # Bare ID (usually starting with MS4wLjABAAAA): return directly
+    if url.startswith("MS4wLjABAAAA") or (not url.startswith("http") and "douyin.com" not in url):
+        return CreatorUrlInfo(sec_user_id=url)
+
+    # Extract sec_user_id from a profile URL: /user/xxx
+    user_pattern = r'/user/([^/?]+)'
+    match = re.search(user_pattern, url)
+    if match:
+        sec_user_id = match.group(1)
+        return CreatorUrlInfo(sec_user_id=sec_user_id)
+
+    raise ValueError(f"Unable to parse a creator ID from URL: {url}")
+
+if __name__ == '__main__':
+    # Test video URL parsing
+    print("=== Video URL parsing test ===")
+    test_urls = [
+        "https://www.douyin.com/video/7525082444551310602",
+        "https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main&modal_id=7525082444551310602",
+        "https://www.douyin.com/root/search/python?aid=b733a3b0-4662-4639-9a72-c2318fba9f3f&modal_id=7471165520058862848&type=general",
+        "7525082444551310602",
+    ]
+    for url in test_urls:
+        try:
+            result = parse_video_info_from_url(url)
+            print(f"✓ URL: {url[:80]}...")
+            print(f"  result: {result}\n")
+        except Exception as e:
+            print(f"✗ URL: {url}")
+            print(f"  error: {e}\n")
+
+    # Test creator URL parsing
+    print("=== Creator URL parsing test ===")
+    test_creator_urls = [
+        "https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main",
+        "MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE",
+    ]
+    for url in test_creator_urls:
+        try:
+            result = parse_creator_info_from_url(url)
+            print(f"✓ URL: {url[:80]}...")
+            print(f"  result: {result}\n")
+        except Exception as e:
+            print(f"✗ URL: {url}")
+            print(f"  error: {e}\n")


@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/douyin/login.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Notice: this code is for learning and research purposes only. Users must observe the following principles:
 # 1. It must not be used for any commercial purpose.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.


@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/kuaishou/__init__.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Notice: this code is for learning and research purposes only. Users must observe the following principles:
 # 1. It must not be used for any commercial purpose.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.


@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/kuaishou/client.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Notice: this code is for learning and research purposes only. Users must observe the following principles:
 # 1. It must not be used for any commercial purpose.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.
@@ -12,7 +21,7 @@
 # -*- coding: utf-8 -*-
 import asyncio
 import json
-from typing import Any, Callable, Dict, List, Optional
+from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional
 from urllib.parse import urlencode

 import httpx
@@ -20,13 +29,17 @@ from playwright.async_api import BrowserContext, Page

 import config
 from base.base_crawler import AbstractApiClient
+from proxy.proxy_mixin import ProxyRefreshMixin
 from tools import utils

+if TYPE_CHECKING:
+    from proxy.proxy_ip_pool import ProxyIpPool
+
 from .exception import DataFetchError
 from .graphql import KuaiShouGraphQL

-class KuaiShouClient(AbstractApiClient):
+class KuaiShouClient(AbstractApiClient, ProxyRefreshMixin):
     def __init__(
         self,
         timeout=10,
@@ -35,6 +48,7 @@ class KuaiShouClient(AbstractApiClient):
         headers: Dict[str, str],
         playwright_page: Page,
         cookie_dict: Dict[str, str],
+        proxy_ip_pool: Optional["ProxyIpPool"] = None,
     ):
         self.proxy = proxy
         self.timeout = timeout
@@ -43,8 +57,13 @@ class KuaiShouClient(AbstractApiClient):
         self.playwright_page = playwright_page
         self.cookie_dict = cookie_dict
         self.graphql = KuaiShouGraphQL()
+        # Initialize the proxy pool (from ProxyRefreshMixin)
+        self.init_proxy_pool(proxy_ip_pool)

     async def request(self, method, url, **kwargs) -> Any:
+        # Check whether the proxy has expired before each request
+        await self._refresh_proxy_if_expired()
         async with httpx.AsyncClient(proxy=self.proxy) as client:
             response = await client.request(method, url, timeout=self.timeout, **kwargs)
         data: Dict = response.json()


@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/kuaishou/core.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则: # 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。 # 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。 # 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -26,6 +35,7 @@ from playwright.async_api import (
import config import config
from base.base_crawler import AbstractCrawler from base.base_crawler import AbstractCrawler
from model.m_kuaishou import VideoUrlInfo, CreatorUrlInfo
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
from store import kuaishou as kuaishou_store from store import kuaishou as kuaishou_store
from tools import utils from tools import utils
@@ -34,6 +44,7 @@ from var import comment_tasks_var, crawler_type_var, source_keyword_var
from .client import KuaiShouClient from .client import KuaiShouClient
from .exception import DataFetchError from .exception import DataFetchError
from .help import parse_video_info_from_url, parse_creator_info_from_url
from .login import KuaishouLogin from .login import KuaishouLogin
@@ -47,14 +58,15 @@ class KuaishouCrawler(AbstractCrawler):
self.index_url = "https://www.kuaishou.com" self.index_url = "https://www.kuaishou.com"
self.user_agent = utils.get_user_agent() self.user_agent = utils.get_user_agent()
self.cdp_manager = None self.cdp_manager = None
self.ip_proxy_pool = None # 代理IP池用于代理自动刷新
async def start(self): async def start(self):
playwright_proxy_format, httpx_proxy_format = None, None playwright_proxy_format, httpx_proxy_format = None, None
if config.ENABLE_IP_PROXY: if config.ENABLE_IP_PROXY:
ip_proxy_pool = await create_ip_pool( self.ip_proxy_pool = await create_ip_pool(
config.IP_PROXY_POOL_COUNT, enable_validate_ip=True config.IP_PROXY_POOL_COUNT, enable_validate_ip=True
) )
ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy() ip_proxy_info: IpInfoModel = await self.ip_proxy_pool.get_proxy()
playwright_proxy_format, httpx_proxy_format = utils.format_proxy_info( playwright_proxy_format, httpx_proxy_format = utils.format_proxy_info(
ip_proxy_info ip_proxy_info
) )
@@ -78,6 +90,8 @@ class KuaishouCrawler(AbstractCrawler):
) )
# stealth.min.js is a js script to prevent the website from detecting the crawler. # stealth.min.js is a js script to prevent the website from detecting the crawler.
await self.browser_context.add_init_script(path="libs/stealth.min.js") await self.browser_context.add_init_script(path="libs/stealth.min.js")
self.context_page = await self.browser_context.new_page() self.context_page = await self.browser_context.new_page()
await self.context_page.goto(f"{self.index_url}?isHome=1") await self.context_page.goto(f"{self.index_url}?isHome=1")
@@ -168,16 +182,27 @@ class KuaishouCrawler(AbstractCrawler):
    async def get_specified_videos(self):
        """Get the information and comments of the specified post"""
+        utils.logger.info("[KuaishouCrawler.get_specified_videos] Parsing video URLs...")
+        video_ids = []
+        for video_url in config.KS_SPECIFIED_ID_LIST:
+            try:
+                video_info = parse_video_info_from_url(video_url)
+                video_ids.append(video_info.video_id)
+                utils.logger.info(f"Parsed video ID: {video_info.video_id} from {video_url}")
+            except ValueError as e:
+                utils.logger.error(f"Failed to parse video URL: {e}")
+                continue
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        task_list = [
            self.get_video_info_task(video_id=video_id, semaphore=semaphore)
-            for video_id in config.KS_SPECIFIED_ID_LIST
+            for video_id in video_ids
        ]
        video_details = await asyncio.gather(*task_list)
        for video_detail in video_details:
            if video_detail is not None:
                await kuaishou_store.update_kuaishou_video(video_detail)
-        await self.batch_get_video_comments(config.KS_SPECIFIED_ID_LIST)
+        await self.batch_get_video_comments(video_ids)

    async def get_video_info_task(
        self, video_id: str, semaphore: asyncio.Semaphore
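The rewritten `get_specified_videos` parses URLs up front, then fans the surviving IDs out through a semaphore-bounded `asyncio.gather`. A minimal, self-contained sketch of that concurrency pattern (the `fetch` coroutine below is a hypothetical stand-in for `get_video_info_task`, not project code):

```python
import asyncio

async def fetch(video_id: str, semaphore: asyncio.Semaphore) -> str:
    # The semaphore caps how many fetches run at once, mirroring
    # what config.MAX_CONCURRENCY_NUM does in the crawler.
    async with semaphore:
        await asyncio.sleep(0.1)  # simulate network I/O
        return f"detail-for-{video_id}"

async def main() -> None:
    video_ids = ["3x3zxz4mjrsc8ke", "3xf8enb8dbj6uig"]
    semaphore = asyncio.Semaphore(2)
    details = await asyncio.gather(*(fetch(v, semaphore) for v in video_ids))
    print(details)

asyncio.run(main())
```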
@@ -293,6 +318,7 @@ class KuaishouCrawler(AbstractCrawler):
            },
            playwright_page=self.context_page,
            cookie_dict=cookie_dict,
+            proxy_ip_pool=self.ip_proxy_pool,  # 传递代理池用于自动刷新
        )
        return ks_client_obj
@@ -318,10 +344,11 @@ class KuaishouCrawler(AbstractCrawler):
                proxy=playwright_proxy,  # type: ignore
                viewport={"width": 1920, "height": 1080},
                user_agent=user_agent,
+                channel="chrome",  # 使用系统的Chrome稳定版
            )
            return browser_context
        else:
-            browser = await chromium.launch(headless=headless, proxy=playwright_proxy)  # type: ignore
+            browser = await chromium.launch(headless=headless, proxy=playwright_proxy, channel="chrome")  # type: ignore
            browser_context = await browser.new_context(
                viewport={"width": 1920, "height": 1080}, user_agent=user_agent
            )
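`channel="chrome"` points Playwright at the locally installed stable Chrome instead of its bundled Chromium, which generally presents a more ordinary browser fingerprint. A standalone sketch of the same launch option (it assumes Google Chrome is installed on the machine):

```python
import asyncio
from playwright.async_api import async_playwright

async def main() -> None:
    async with async_playwright() as p:
        # channel="chrome" requires a local Google Chrome install;
        # drop the argument to fall back to Playwright's bundled Chromium.
        browser = await p.chromium.launch(headless=True, channel="chrome")
        page = await browser.new_page()
        await page.goto("https://example.com")
        print(await page.title())
        await browser.close()

asyncio.run(main())
```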
@@ -367,16 +394,25 @@ class KuaishouCrawler(AbstractCrawler):
        utils.logger.info(
            "[KuaiShouCrawler.get_creators_and_videos] Begin get kuaishou creators"
        )
-        for user_id in config.KS_CREATOR_ID_LIST:
+        for creator_url in config.KS_CREATOR_ID_LIST:
+            try:
+                # Parse creator URL to get user_id
+                creator_info: CreatorUrlInfo = parse_creator_info_from_url(creator_url)
+                utils.logger.info(f"[KuaiShouCrawler.get_creators_and_videos] Parse creator URL info: {creator_info}")
+                user_id = creator_info.user_id
                # get creator detail info from web html content
                createor_info: Dict = await self.ks_client.get_creator_info(user_id=user_id)
                if createor_info:
                    await kuaishou_store.save_creator(user_id, creator=createor_info)
+            except ValueError as e:
+                utils.logger.error(f"[KuaiShouCrawler.get_creators_and_videos] Failed to parse creator URL: {e}")
+                continue

            # Get all video information of the creator
            all_video_list = await self.ks_client.get_all_videos_by_creator(
                user_id=user_id,
-                crawl_interval=random.random(),
+                crawl_interval=config.CRAWLER_MAX_SLEEP_SEC,
                callback=self.fetch_creator_video_detail,
            )
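The switch from `random.random()` to `config.CRAWLER_MAX_SLEEP_SEC` replaces a sub-second random jitter with a predictable, operator-tunable pause between pages. If some jitter is still wanted on top of a fixed floor, the two combine easily — a sketch, with a module-level constant standing in for the config value:

```python
import asyncio
import random

CRAWLER_MAX_SLEEP_SEC = 2  # stands in for config.CRAWLER_MAX_SLEEP_SEC

async def polite_sleep() -> None:
    # A fixed floor plus a small random jitter: pacing stays predictable
    # while avoiding perfectly regular request intervals.
    await asyncio.sleep(CRAWLER_MAX_SLEEP_SEC + random.uniform(0, 0.5))
```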

View File

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/kuaishou/exception.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

View File

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/kuaishou/field.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

View File

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/kuaishou/graphql.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

View File

@@ -0,0 +1,108 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/kuaishou/help.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
import re

from model.m_kuaishou import VideoUrlInfo, CreatorUrlInfo


def parse_video_info_from_url(url: str) -> VideoUrlInfo:
    """
    从快手视频URL中解析出视频ID
    支持以下格式:
    1. 完整视频URL: "https://www.kuaishou.com/short-video/3x3zxz4mjrsc8ke?authorId=3x84qugg4ch9zhs&streamSource=search"
    2. 纯视频ID: "3x3zxz4mjrsc8ke"
    Args:
        url: 快手视频链接或视频ID
    Returns:
        VideoUrlInfo: 包含视频ID的对象
    """
    # 如果不包含http且不包含kuaishou.com,认为是纯ID
    if not url.startswith("http") and "kuaishou.com" not in url:
        return VideoUrlInfo(video_id=url, url_type="normal")

    # 从标准视频URL中提取ID: /short-video/视频ID
    video_pattern = r'/short-video/([a-zA-Z0-9_-]+)'
    match = re.search(video_pattern, url)
    if match:
        video_id = match.group(1)
        return VideoUrlInfo(video_id=video_id, url_type="normal")

    raise ValueError(f"无法从URL中解析出视频ID: {url}")


def parse_creator_info_from_url(url: str) -> CreatorUrlInfo:
    """
    从快手创作者主页URL中解析出创作者ID
    支持以下格式:
    1. 创作者主页: "https://www.kuaishou.com/profile/3x84qugg4ch9zhs"
    2. 纯ID: "3x4sm73aye7jq7i"
    Args:
        url: 快手创作者主页链接或user_id
    Returns:
        CreatorUrlInfo: 包含创作者ID的对象
    """
    # 如果不包含http且不包含kuaishou.com,认为是纯ID
    if not url.startswith("http") and "kuaishou.com" not in url:
        return CreatorUrlInfo(user_id=url)

    # 从创作者主页URL中提取user_id: /profile/xxx
    user_pattern = r'/profile/([a-zA-Z0-9_-]+)'
    match = re.search(user_pattern, url)
    if match:
        user_id = match.group(1)
        return CreatorUrlInfo(user_id=user_id)

    raise ValueError(f"无法从URL中解析出创作者ID: {url}")


if __name__ == '__main__':
    # 测试视频URL解析
    print("=== 视频URL解析测试 ===")
    test_video_urls = [
        "https://www.kuaishou.com/short-video/3x3zxz4mjrsc8ke?authorId=3x84qugg4ch9zhs&streamSource=search&area=searchxxnull&searchKey=python",
        "3xf8enb8dbj6uig",
    ]
    for url in test_video_urls:
        try:
            result = parse_video_info_from_url(url)
            print(f"✓ URL: {url[:80]}...")
            print(f"  结果: {result}\n")
        except Exception as e:
            print(f"✗ URL: {url}")
            print(f"  错误: {e}\n")

    # 测试创作者URL解析
    print("=== 创作者URL解析测试 ===")
    test_creator_urls = [
        "https://www.kuaishou.com/profile/3x84qugg4ch9zhs",
        "3x4sm73aye7jq7i",
    ]
    for url in test_creator_urls:
        try:
            result = parse_creator_info_from_url(url)
            print(f"✓ URL: {url[:80]}...")
            print(f"  结果: {result}\n")
        except Exception as e:
            print(f"✗ URL: {url}")
            print(f"  错误: {e}\n")
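Since the repository now ships a pytest setup, these parsers are also easy to cover with table-driven tests. A sketch, assuming the import path follows the repository layout (`media_platform/kuaishou/help.py`):

```python
import pytest

from media_platform.kuaishou.help import (
    parse_creator_info_from_url,
    parse_video_info_from_url,
)

@pytest.mark.parametrize("raw, expected", [
    ("https://www.kuaishou.com/short-video/3x3zxz4mjrsc8ke?authorId=x", "3x3zxz4mjrsc8ke"),
    ("3xf8enb8dbj6uig", "3xf8enb8dbj6uig"),  # bare IDs pass through unchanged
])
def test_parse_video_id(raw: str, expected: str) -> None:
    assert parse_video_info_from_url(raw).video_id == expected

def test_parse_creator_id() -> None:
    url = "https://www.kuaishou.com/profile/3x84qugg4ch9zhs"
    assert parse_creator_info_from_url(url).user_id == "3x84qugg4ch9zhs"

def test_invalid_video_url_raises() -> None:
    # A kuaishou.com URL without /short-video/ should be rejected.
    with pytest.raises(ValueError):
        parse_video_info_from_url("https://www.kuaishou.com/profile/no-video-here")
```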

View File

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/kuaishou/login.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

View File

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/tieba/__init__.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

View File

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/tieba/client.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -11,10 +20,10 @@
import asyncio
import json
from typing import Any, Callable, Dict, List, Optional, Union
-from urllib.parse import urlencode
+from urllib.parse import urlencode, quote

-import httpx
+import requests
-from playwright.async_api import BrowserContext
+from playwright.async_api import BrowserContext, Page
from tenacity import RetryError, retry, stop_after_attempt, wait_fixed

import config
@@ -34,34 +43,97 @@ class BaiduTieBaClient(AbstractApiClient):
        timeout=10,
        ip_pool=None,
        default_ip_proxy=None,
+        headers: Dict[str, str] = None,
+        playwright_page: Optional[Page] = None,
    ):
        self.ip_pool: Optional[ProxyIpPool] = ip_pool
        self.timeout = timeout
-        self.headers = {
+        # 使用传入的headers(包含真实浏览器UA)或默认headers
+        self.headers = headers or {
            "User-Agent": utils.get_user_agent(),
-            "Cookies": "",
+            "Cookie": "",
        }
        self._host = "https://tieba.baidu.com"
        self._page_extractor = TieBaExtractor()
        self.default_ip_proxy = default_ip_proxy
+        self.playwright_page = playwright_page  # Playwright页面对象

+    def _sync_request(self, method, url, proxy=None, **kwargs):
+        """
+        同步的requests请求方法
+        Args:
+            method: 请求方法
+            url: 请求的URL
+            proxy: 代理IP
+            **kwargs: 其他请求参数
+        Returns:
+            response对象
+        """
+        # 构造代理字典
+        proxies = None
+        if proxy:
+            proxies = {
+                "http": proxy,
+                "https": proxy,
+            }
+
+        # 发送请求
+        response = requests.request(
+            method=method,
+            url=url,
+            headers=self.headers,
+            proxies=proxies,
+            timeout=self.timeout,
+            **kwargs
+        )
+        return response
+
+    async def _refresh_proxy_if_expired(self) -> None:
+        """
+        检测代理是否过期,如果过期则自动刷新
+        """
+        if self.ip_pool is None:
+            return
+
+        if self.ip_pool.is_current_proxy_expired():
+            utils.logger.info(
+                "[BaiduTieBaClient._refresh_proxy_if_expired] Proxy expired, refreshing..."
+            )
+            new_proxy = await self.ip_pool.get_or_refresh_proxy()
+            # 更新代理URL
+            _, self.default_ip_proxy = utils.format_proxy_info(new_proxy)
+            utils.logger.info(
+                f"[BaiduTieBaClient._refresh_proxy_if_expired] New proxy: {new_proxy.ip}:{new_proxy.port}"
+            )

    @retry(stop=stop_after_attempt(3), wait=wait_fixed(1))
    async def request(self, method, url, return_ori_content=False, proxy=None, **kwargs) -> Union[str, Any]:
        """
-        封装httpx的公共请求方法,对请求响应做一些处理
+        封装requests的公共请求方法,对请求响应做一些处理
        Args:
            method: 请求方法
            url: 请求的URL
            return_ori_content: 是否返回原始内容
-            proxies: 代理IP
+            proxy: 代理IP
            **kwargs: 其他请求参数,例如请求头、请求体等
        Returns:
        """
+        # 每次请求前检测代理是否过期
+        await self._refresh_proxy_if_expired()
+
        actual_proxy = proxy if proxy else self.default_ip_proxy
-        async with httpx.AsyncClient(proxy=actual_proxy) as client:
-            response = await client.request(method, url, timeout=self.timeout, headers=self.headers, **kwargs)
+        # 在线程池中执行同步的requests请求
+        response = await asyncio.to_thread(
+            self._sync_request,
+            method,
+            url,
+            actual_proxy,
+            **kwargs
+        )

        if response.status_code != 200:
            utils.logger.error(f"Request failed, method: {method}, url: {url}, status code: {response.status_code}")
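The `asyncio.to_thread` bridge above keeps the event loop responsive while the blocking `requests` call runs in a worker thread. A standalone sketch of the same pattern (the URL and timeout are illustrative, not project values):

```python
import asyncio
from typing import Optional

import requests

def sync_get(url: str, proxy: Optional[str] = None) -> str:
    # requests expects a scheme-keyed proxies dict rather than a single proxy string.
    proxies = {"http": proxy, "https": proxy} if proxy else None
    resp = requests.get(url, proxies=proxies, timeout=10)
    return resp.text

async def main() -> None:
    # to_thread runs the blocking call in a worker thread, so other
    # coroutines keep making progress while the request is in flight.
    body = await asyncio.to_thread(sync_get, "https://httpbin.org/get")
    print(len(body))

asyncio.run(main())
```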
@@ -69,7 +141,7 @@ class BaiduTieBaClient(AbstractApiClient):
            raise Exception(f"Request failed, method: {method}, url: {url}, status code: {response.status_code}")

        if response.text == "" or response.text == "blocked":
-            utils.logger.error(f"request params incrr, response.text: {response.text}")
+            utils.logger.error(f"request params incorrect, response.text: {response.text}")
            raise Exception("account blocked")

        if return_ori_content:
@@ -119,26 +191,41 @@ class BaiduTieBaClient(AbstractApiClient):
        json_str = json.dumps(data, separators=(',', ':'), ensure_ascii=False)
        return await self.request(method="POST", url=f"{self._host}{uri}", data=json_str, **kwargs)
-    async def pong(self) -> bool:
+    async def pong(self, browser_context: BrowserContext = None) -> bool:
        """
        用于检查登录态是否失效了
-        Returns:
+        使用Cookie检测而非API调用,避免被检测
+        Args:
+            browser_context: 浏览器上下文对象
+        Returns:
+            bool: True表示已登录,False表示未登录
        """
-        utils.logger.info("[BaiduTieBaClient.pong] Begin to pong tieba...")
+        utils.logger.info("[BaiduTieBaClient.pong] Begin to check tieba login state by cookies...")
+
+        if not browser_context:
+            utils.logger.warning("[BaiduTieBaClient.pong] browser_context is None, assume not logged in")
+            return False
+
        try:
-            uri = "/mo/q/sync"
-            res: Dict = await self.get(uri)
-            utils.logger.info(f"[BaiduTieBaClient.pong] res: {res}")
-            if res and res.get("no") == 0:
-                ping_flag = True
+            # 从浏览器获取cookies并检查关键登录cookie
+            _, cookie_dict = utils.convert_cookies(await browser_context.cookies())
+
+            # 百度贴吧的登录标识: STOKEN 或 PTOKEN
+            stoken = cookie_dict.get("STOKEN")
+            ptoken = cookie_dict.get("PTOKEN")
+            bduss = cookie_dict.get("BDUSS")  # 百度通用登录cookie
+
+            if stoken or ptoken or bduss:
+                utils.logger.info(f"[BaiduTieBaClient.pong] Login state verified by cookies (STOKEN: {bool(stoken)}, PTOKEN: {bool(ptoken)}, BDUSS: {bool(bduss)})")
+                return True
            else:
-                utils.logger.info(f"[BaiduTieBaClient.pong] user not login, will try to login again...")
-                ping_flag = False
+                utils.logger.info("[BaiduTieBaClient.pong] No valid login cookies found, need to login")
+                return False
        except Exception as e:
-            utils.logger.error(f"[BaiduTieBaClient.pong] Ping tieba failed: {e}, and try to login again...")
-            ping_flag = False
-        return ping_flag
+            utils.logger.error(f"[BaiduTieBaClient.pong] Check login state failed: {e}, assume not logged in")
+            return False
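The cookie heuristic above reduces to a single membership check. A minimal sketch (the cookie names mirror the ones the method looks for — an observed Baidu convention, not a documented contract):

```python
from typing import Dict

def is_logged_in(cookie_dict: Dict[str, str]) -> bool:
    # BDUSS is Baidu's account-wide login cookie; STOKEN/PTOKEN are
    # per-service tokens. Any one present is treated as logged in.
    return any(cookie_dict.get(k) for k in ("STOKEN", "PTOKEN", "BDUSS"))

print(is_logged_in({"BDUSS": "abc"}))  # True
print(is_logged_in({"other": "x"}))    # False
```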
    async def update_cookies(self, browser_context: BrowserContext):
        """
@@ -149,7 +236,9 @@ class BaiduTieBaClient(AbstractApiClient):
        Returns:
        """
-        pass
+        cookie_str, cookie_dict = utils.convert_cookies(await browser_context.cookies())
+        self.headers["Cookie"] = cookie_str
+        utils.logger.info("[BaiduTieBaClient.update_cookies] Cookie has been updated")
    async def get_notes_by_keyword(
        self,
@@ -160,7 +249,7 @@ class BaiduTieBaClient(AbstractApiClient):
        note_type: SearchNoteType = SearchNoteType.FIXED_THREAD,
    ) -> List[TiebaNote]:
        """
-        根据关键词搜索贴吧帖子
+        根据关键词搜索贴吧帖子 (使用Playwright访问页面,避免API检测)
        Args:
            keyword: 关键词
            page: 分页第几页
@@ -170,30 +259,81 @@ class BaiduTieBaClient(AbstractApiClient):
        Returns:
        """
-        uri = "/f/search/res"
+        if not self.playwright_page:
+            utils.logger.error("[BaiduTieBaClient.get_notes_by_keyword] playwright_page is None, cannot use browser mode")
+            raise Exception("playwright_page is required for browser-based search")
+
+        # 构造搜索URL
+        # 示例: https://tieba.baidu.com/f/search/res?ie=utf-8&qw=编程
+        search_url = f"{self._host}/f/search/res"
        params = {
-            "isnew": 1,
+            "ie": "utf-8",
            "qw": keyword,
            "rn": page_size,
            "pn": page,
            "sm": sort.value,
            "only_thread": note_type.value,
        }
-        page_content = await self.get(uri, params=params, return_ori_content=True)
-        return self._page_extractor.extract_search_note_list(page_content)
+
+        # 拼接完整URL
+        full_url = f"{search_url}?{urlencode(params)}"
+        utils.logger.info(f"[BaiduTieBaClient.get_notes_by_keyword] 访问搜索页面: {full_url}")
+
+        try:
+            # 使用Playwright访问搜索页面
+            await self.playwright_page.goto(full_url, wait_until="domcontentloaded")
+
+            # 等待页面加载,使用配置文件中的延时设置
+            await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
+
+            # 获取页面HTML内容
+            page_content = await self.playwright_page.content()
+            utils.logger.info(f"[BaiduTieBaClient.get_notes_by_keyword] 成功获取搜索页面HTML,长度: {len(page_content)}")
+
+            # 提取搜索结果
+            notes = self._page_extractor.extract_search_note_list(page_content)
+            utils.logger.info(f"[BaiduTieBaClient.get_notes_by_keyword] 提取到 {len(notes)} 条帖子")
+            return notes
+        except Exception as e:
+            utils.logger.error(f"[BaiduTieBaClient.get_notes_by_keyword] 搜索失败: {e}")
+            raise
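Every browser-mode fetch in this client follows the same goto → sleep → `content()` shape. A condensed, self-contained sketch of that pattern (the URL, query, and sleep value are illustrative):

```python
import asyncio
from urllib.parse import urlencode

from playwright.async_api import async_playwright

async def fetch_rendered_html(url: str, sleep_sec: float = 2.0) -> str:
    # Navigate with a real browser page, wait briefly, return the DOM HTML.
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="domcontentloaded")
        await asyncio.sleep(sleep_sec)  # stands in for config.CRAWLER_MAX_SLEEP_SEC
        html = await page.content()
        await browser.close()
        return html

params = {"ie": "utf-8", "qw": "python"}
html = asyncio.run(fetch_rendered_html(f"https://tieba.baidu.com/f/search/res?{urlencode(params)}"))
print(len(html))
```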
    async def get_note_by_id(self, note_id: str) -> TiebaNote:
        """
-        根据帖子ID获取帖子详情
+        根据帖子ID获取帖子详情 (使用Playwright访问页面,避免API检测)
        Args:
-            note_id:
+            note_id: 帖子ID
        Returns:
+            TiebaNote: 帖子详情对象
        """
-        uri = f"/p/{note_id}"
-        page_content = await self.get(uri, return_ori_content=True)
-        return self._page_extractor.extract_note_detail(page_content)
+        if not self.playwright_page:
+            utils.logger.error("[BaiduTieBaClient.get_note_by_id] playwright_page is None, cannot use browser mode")
+            raise Exception("playwright_page is required for browser-based note detail fetching")
+
+        # 构造帖子详情URL
+        note_url = f"{self._host}/p/{note_id}"
+        utils.logger.info(f"[BaiduTieBaClient.get_note_by_id] 访问帖子详情页面: {note_url}")
+
+        try:
+            # 使用Playwright访问帖子详情页面
+            await self.playwright_page.goto(note_url, wait_until="domcontentloaded")
+
+            # 等待页面加载,使用配置文件中的延时设置
+            await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
+
+            # 获取页面HTML内容
+            page_content = await self.playwright_page.content()
+            utils.logger.info(f"[BaiduTieBaClient.get_note_by_id] 成功获取帖子详情HTML,长度: {len(page_content)}")
+
+            # 提取帖子详情
+            note_detail = self._page_extractor.extract_note_detail(page_content)
+            return note_detail
+        except Exception as e:
+            utils.logger.error(f"[BaiduTieBaClient.get_note_by_id] 获取帖子详情失败: {e}")
+            raise
    async def get_note_all_comments(
        self,
@@ -203,35 +343,68 @@ class BaiduTieBaClient(AbstractApiClient):
        max_count: int = 10,
    ) -> List[TiebaComment]:
        """
-        获取指定帖子下的所有一级评论,该方法会一直查找一个帖子下的所有评论信息
+        获取指定帖子下的所有一级评论 (使用Playwright访问页面,避免API检测)
        Args:
            note_detail: 帖子详情对象
            crawl_interval: 爬取一次笔记的延迟单位(秒)
-            callback: 一次笔记爬取结束后
+            callback: 一次笔记爬取结束后的回调函数
            max_count: 一次帖子爬取的最大评论数量
        Returns:
+            List[TiebaComment]: 评论列表
        """
-        uri = f"/p/{note_detail.note_id}"
+        if not self.playwright_page:
+            utils.logger.error("[BaiduTieBaClient.get_note_all_comments] playwright_page is None, cannot use browser mode")
+            raise Exception("playwright_page is required for browser-based comment fetching")
+
        result: List[TiebaComment] = []
        current_page = 1
        while note_detail.total_replay_page >= current_page and len(result) < max_count:
-            params = {
-                "pn": current_page,
-            }
-            page_content = await self.get(uri, params=params, return_ori_content=True)
-            comments = self._page_extractor.extract_tieba_note_parment_comments(page_content, note_id=note_detail.note_id)
+            # 构造评论页URL
+            comment_url = f"{self._host}/p/{note_detail.note_id}?pn={current_page}"
+            utils.logger.info(f"[BaiduTieBaClient.get_note_all_comments] 访问评论页面: {comment_url}")
+
+            try:
+                # 使用Playwright访问评论页面
+                await self.playwright_page.goto(comment_url, wait_until="domcontentloaded")
+
+                # 等待页面加载,使用配置文件中的延时设置
+                await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
+
+                # 获取页面HTML内容
+                page_content = await self.playwright_page.content()
+
+                # 提取评论
+                comments = self._page_extractor.extract_tieba_note_parment_comments(
+                    page_content, note_id=note_detail.note_id
+                )
+
                if not comments:
+                    utils.logger.info(f"[BaiduTieBaClient.get_note_all_comments] 第{current_page}页没有评论,停止爬取")
                    break
+
+                # 限制评论数量
                if len(result) + len(comments) > max_count:
                    comments = comments[:max_count - len(result)]
                if callback:
                    await callback(note_detail.note_id, comments)
                result.extend(comments)
                # 获取所有子评论
-                await self.get_comments_all_sub_comments(comments, crawl_interval=crawl_interval, callback=callback)
+                await self.get_comments_all_sub_comments(
+                    comments, crawl_interval=crawl_interval, callback=callback
+                )
                await asyncio.sleep(crawl_interval)
                current_page += 1
+            except Exception as e:
+                utils.logger.error(f"[BaiduTieBaClient.get_note_all_comments] 获取第{current_page}页评论失败: {e}")
+                break
+
+        utils.logger.info(f"[BaiduTieBaClient.get_note_all_comments] 共获取 {len(result)} 条一级评论")
        return result
    async def get_comments_all_sub_comments(
@@ -241,93 +414,194 @@ class BaiduTieBaClient(AbstractApiClient):
        callback: Optional[Callable] = None,
    ) -> List[TiebaComment]:
        """
-        获取指定评论下的所有子评论
+        获取指定评论下的所有子评论 (使用Playwright访问页面,避免API检测)
        Args:
            comments: 评论列表
            crawl_interval: 爬取一次笔记的延迟单位(秒)
-            callback: 一次笔记爬取结束后
+            callback: 一次笔记爬取结束后的回调函数
        Returns:
+            List[TiebaComment]: 子评论列表
        """
-        uri = "/p/comment"
        if not config.ENABLE_GET_SUB_COMMENTS:
            return []

-        # # 贴吧获取所有子评论需要登录态
-        # if self.headers.get("Cookies") == "" or not self.pong():
-        #     raise Exception(f"[BaiduTieBaClient.pong] Cookies is empty, please login first...")
+        if not self.playwright_page:
+            utils.logger.error("[BaiduTieBaClient.get_comments_all_sub_comments] playwright_page is None, cannot use browser mode")
+            raise Exception("playwright_page is required for browser-based sub-comment fetching")
+
        all_sub_comments: List[TiebaComment] = []
        for parment_comment in comments:
            if parment_comment.sub_comment_count == 0:
                continue

            current_page = 1
            max_sub_page_num = parment_comment.sub_comment_count // 10 + 1
            while max_sub_page_num >= current_page:
-                params = {
-                    "tid": parment_comment.note_id,  # 帖子ID
-                    "pid": parment_comment.comment_id,  # 父级评论ID
-                    "fid": parment_comment.tieba_id,  # 贴吧ID
-                    "pn": current_page  # 页码
-                }
-                page_content = await self.get(uri, params=params, return_ori_content=True)
-                sub_comments = self._page_extractor.extract_tieba_note_sub_comments(page_content, parent_comment=parment_comment)
+                # 构造子评论URL
+                sub_comment_url = (
+                    f"{self._host}/p/comment?"
+                    f"tid={parment_comment.note_id}&"
+                    f"pid={parment_comment.comment_id}&"
+                    f"fid={parment_comment.tieba_id}&"
+                    f"pn={current_page}"
+                )
+                utils.logger.info(f"[BaiduTieBaClient.get_comments_all_sub_comments] 访问子评论页面: {sub_comment_url}")
+
+                try:
+                    # 使用Playwright访问子评论页面
+                    await self.playwright_page.goto(sub_comment_url, wait_until="domcontentloaded")
+
+                    # 等待页面加载,使用配置文件中的延时设置
+                    await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
+
+                    # 获取页面HTML内容
+                    page_content = await self.playwright_page.content()
+
+                    # 提取子评论
+                    sub_comments = self._page_extractor.extract_tieba_note_sub_comments(
+                        page_content, parent_comment=parment_comment
+                    )
+
                    if not sub_comments:
+                        utils.logger.info(
+                            f"[BaiduTieBaClient.get_comments_all_sub_comments] "
+                            f"评论{parment_comment.comment_id}第{current_page}页没有子评论,停止爬取"
+                        )
                        break
+
                    if callback:
                        await callback(parment_comment.note_id, sub_comments)
                    all_sub_comments.extend(sub_comments)
                    await asyncio.sleep(crawl_interval)
                    current_page += 1
+                except Exception as e:
+                    utils.logger.error(
+                        f"[BaiduTieBaClient.get_comments_all_sub_comments] "
+                        f"获取评论{parment_comment.comment_id}第{current_page}页子评论失败: {e}"
+                    )
+                    break
+
+        utils.logger.info(f"[BaiduTieBaClient.get_comments_all_sub_comments] 共获取 {len(all_sub_comments)} 条子评论")
        return all_sub_comments
    async def get_notes_by_tieba_name(self, tieba_name: str, page_num: int) -> List[TiebaNote]:
        """
-        根据贴吧名称获取帖子列表
+        根据贴吧名称获取帖子列表 (使用Playwright访问页面,避免API检测)
        Args:
            tieba_name: 贴吧名称
-            page_num: 分页数量
+            page_num: 分页页码
        Returns:
+            List[TiebaNote]: 帖子列表
        """
-        uri = f"/f?kw={tieba_name}&pn={page_num}"
-        page_content = await self.get(uri, return_ori_content=True)
-        return self._page_extractor.extract_tieba_note_list(page_content)
+        if not self.playwright_page:
+            utils.logger.error("[BaiduTieBaClient.get_notes_by_tieba_name] playwright_page is None, cannot use browser mode")
+            raise Exception("playwright_page is required for browser-based tieba note fetching")
+
+        # 构造贴吧帖子列表URL
+        tieba_url = f"{self._host}/f?kw={quote(tieba_name)}&pn={page_num}"
+        utils.logger.info(f"[BaiduTieBaClient.get_notes_by_tieba_name] 访问贴吧页面: {tieba_url}")
+
+        try:
+            # 使用Playwright访问贴吧页面
+            await self.playwright_page.goto(tieba_url, wait_until="domcontentloaded")
+
+            # 等待页面加载,使用配置文件中的延时设置
+            await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
+
+            # 获取页面HTML内容
+            page_content = await self.playwright_page.content()
+            utils.logger.info(f"[BaiduTieBaClient.get_notes_by_tieba_name] 成功获取贴吧页面HTML,长度: {len(page_content)}")
+
+            # 提取帖子列表
+            notes = self._page_extractor.extract_tieba_note_list(page_content)
+            utils.logger.info(f"[BaiduTieBaClient.get_notes_by_tieba_name] 提取到 {len(notes)} 条帖子")
+            return notes
+        except Exception as e:
+            utils.logger.error(f"[BaiduTieBaClient.get_notes_by_tieba_name] 获取贴吧帖子列表失败: {e}")
+            raise
    async def get_creator_info_by_url(self, creator_url: str) -> str:
        """
-        根据创作者ID获取创作者信息
+        根据创作者URL获取创作者信息 (使用Playwright访问页面,避免API检测)
        Args:
            creator_url: 创作者主页URL
        Returns:
+            str: 页面HTML内容
        """
-        page_content = await self.request(method="GET", url=creator_url, return_ori_content=True)
-        return page_content
+        if not self.playwright_page:
+            utils.logger.error("[BaiduTieBaClient.get_creator_info_by_url] playwright_page is None, cannot use browser mode")
+            raise Exception("playwright_page is required for browser-based creator info fetching")
+
+        utils.logger.info(f"[BaiduTieBaClient.get_creator_info_by_url] 访问创作者主页: {creator_url}")
+
+        try:
+            # 使用Playwright访问创作者主页
+            await self.playwright_page.goto(creator_url, wait_until="domcontentloaded")
+
+            # 等待页面加载,使用配置文件中的延时设置
+            await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
+
+            # 获取页面HTML内容
+            page_content = await self.playwright_page.content()
+            utils.logger.info(f"[BaiduTieBaClient.get_creator_info_by_url] 成功获取创作者主页HTML,长度: {len(page_content)}")
+            return page_content
+        except Exception as e:
+            utils.logger.error(f"[BaiduTieBaClient.get_creator_info_by_url] 获取创作者主页失败: {e}")
+            raise
    async def get_notes_by_creator(self, user_name: str, page_number: int) -> Dict:
        """
-        根据创作者获取创作者的所有帖子
+        根据创作者获取创作者的帖子 (使用Playwright访问页面,避免API检测)
        Args:
-            user_name:
-            page_number:
+            user_name: 创作者用户名
+            page_number: 页码
        Returns:
+            Dict: 包含帖子数据的字典
        """
-        uri = f"/home/get/getthread"
-        params = {
-            "un": user_name,
-            "pn": page_number,
-            "id": "utf-8",
-            "_": utils.get_current_timestamp(),
-        }
-        return await self.get(uri, params=params)
+        if not self.playwright_page:
+            utils.logger.error("[BaiduTieBaClient.get_notes_by_creator] playwright_page is None, cannot use browser mode")
+            raise Exception("playwright_page is required for browser-based creator notes fetching")
+
+        # 构造创作者帖子列表URL
+        creator_url = f"{self._host}/home/get/getthread?un={quote(user_name)}&pn={page_number}&id=utf-8&_={utils.get_current_timestamp()}"
+        utils.logger.info(f"[BaiduTieBaClient.get_notes_by_creator] 访问创作者帖子列表: {creator_url}")
+
+        try:
+            # 使用Playwright访问创作者帖子列表页面
+            await self.playwright_page.goto(creator_url, wait_until="domcontentloaded")
+
+            # 等待页面加载,使用配置文件中的延时设置
+            await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
+
+            # 获取页面内容(这个接口返回JSON)
+            page_content = await self.playwright_page.content()
+
+            # 提取JSON数据(页面会包含<pre>标签或直接是JSON)
+            try:
+                # 尝试从页面中提取JSON
+                json_text = await self.playwright_page.evaluate("() => document.body.innerText")
+                result = json.loads(json_text)
+                utils.logger.info(f"[BaiduTieBaClient.get_notes_by_creator] 成功获取创作者帖子数据")
+                return result
+            except json.JSONDecodeError as e:
+                utils.logger.error(f"[BaiduTieBaClient.get_notes_by_creator] JSON解析失败: {e}")
+                utils.logger.error(f"[BaiduTieBaClient.get_notes_by_creator] 页面内容: {page_content[:500]}")
+                raise Exception(f"Failed to parse JSON from creator notes page: {e}")
+        except Exception as e:
+            utils.logger.error(f"[BaiduTieBaClient.get_notes_by_creator] 获取创作者帖子列表失败: {e}")
+            raise
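`getthread` answers with raw JSON, so the browser renders it as a bare text document and `document.body.innerText` recovers the payload. A sketch of that trick against a public JSON endpoint (httpbin is used here only for illustration):

```python
import asyncio
import json

from playwright.async_api import Page, async_playwright

async def read_json_endpoint(page: Page, url: str) -> dict:
    # JSON endpoints render as a bare text document in the browser;
    # body.innerText recovers the payload string for json.loads.
    await page.goto(url, wait_until="domcontentloaded")
    text = await page.evaluate("() => document.body.innerText")
    return json.loads(text)

async def main() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        data = await read_json_endpoint(page, "https://httpbin.org/json")
        print(sorted(data.keys()))
        await browser.close()

asyncio.run(main())
```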
    async def get_all_notes_by_creator_user_name(
        self,

View File

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/tieba/core.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -11,7 +20,6 @@
import asyncio
import os
-import random  # removed: fixed config.CRAWLER_MAX_SLEEP_SEC intervals are used instead
from asyncio import Task
from typing import Dict, List, Optional, Tuple
@@ -26,7 +34,7 @@ from playwright.async_api import (
import config
from base.base_crawler import AbstractCrawler
from model.m_baidu_tieba import TiebaCreator, TiebaNote
-from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
+from proxy.proxy_ip_pool import IpInfoModel, ProxyIpPool, create_ip_pool
from store import tieba as tieba_store
from tools import utils
from tools.cdp_browser import CDPBrowserManager
@@ -56,7 +64,7 @@ class TieBaCrawler(AbstractCrawler):
        Returns:
        """
-        ip_proxy_pool, httpx_proxy_format = None, None
+        playwright_proxy_format, httpx_proxy_format = None, None
        if config.ENABLE_IP_PROXY:
            utils.logger.info(
                "[BaiduTieBaCrawler.start] Begin create ip proxy pool ..."
@@ -65,16 +73,58 @@ class TieBaCrawler(AbstractCrawler):
                config.IP_PROXY_POOL_COUNT, enable_validate_ip=True
            )
            ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy()
-            _, httpx_proxy_format = utils.format_proxy_info(ip_proxy_info)
+            playwright_proxy_format, httpx_proxy_format = utils.format_proxy_info(ip_proxy_info)
            utils.logger.info(
                f"[BaiduTieBaCrawler.start] Init default ip proxy, value: {httpx_proxy_format}"
            )

-        # Create a client to interact with the baidutieba website.
-        self.tieba_client = BaiduTieBaClient(
-            ip_pool=ip_proxy_pool,
-            default_ip_proxy=httpx_proxy_format,
-        )
+        async with async_playwright() as playwright:
+            # 根据配置选择启动模式
+            if config.ENABLE_CDP_MODE:
+                utils.logger.info("[BaiduTieBaCrawler] 使用CDP模式启动浏览器")
+                self.browser_context = await self.launch_browser_with_cdp(
+                    playwright,
+                    playwright_proxy_format,
+                    self.user_agent,
+                    headless=config.CDP_HEADLESS,
+                )
+            else:
+                utils.logger.info("[BaiduTieBaCrawler] 使用标准模式启动浏览器")
+                # Launch a browser context.
+                chromium = playwright.chromium
+                self.browser_context = await self.launch_browser(
+                    chromium,
+                    playwright_proxy_format,
+                    self.user_agent,
+                    headless=config.HEADLESS,
+                )
+
+            # 注入反检测脚本 - 针对百度的特殊检测
+            await self._inject_anti_detection_scripts()
+
+            self.context_page = await self.browser_context.new_page()
+
+            # 先访问百度首页,再点击贴吧链接,避免触发安全验证
+            await self._navigate_to_tieba_via_baidu()
+
+            # Create a client to interact with the baidutieba website.
+            self.tieba_client = await self.create_tieba_client(
+                httpx_proxy_format,
+                ip_proxy_pool if config.ENABLE_IP_PROXY else None
+            )
+
+            # Check login status and perform login if necessary
+            if not await self.tieba_client.pong(browser_context=self.browser_context):
+                login_obj = BaiduTieBaLogin(
+                    login_type=config.LOGIN_TYPE,
+                    login_phone="",  # your phone number
+                    browser_context=self.browser_context,
+                    context_page=self.context_page,
+                    cookie_str=config.COOKIES,
+                )
+                await login_obj.begin()
+                await self.tieba_client.update_cookies(browser_context=self.browser_context)

            crawler_type_var.set(config.CRAWLER_TYPE)
            if config.CRAWLER_TYPE == "search":
                # Search for notes and retrieve their comment information.
@@ -347,6 +397,198 @@
                f"[WeiboCrawler.get_creators_and_notes] get creator info error, creator_url:{creator_url}"
            )
    async def _navigate_to_tieba_via_baidu(self):
        """
        模拟真实用户访问路径:
        1. 先访问百度首页 (https://www.baidu.com/)
        2. 等待页面加载
        3. 点击顶部导航栏的"贴吧"链接
        4. 跳转到贴吧首页

        这样做可以避免触发百度的安全验证
        """
        utils.logger.info("[TieBaCrawler] 模拟真实用户访问路径...")

        try:
            # Step 1: 访问百度首页
            utils.logger.info("[TieBaCrawler] Step 1: 访问百度首页 https://www.baidu.com/")
            await self.context_page.goto("https://www.baidu.com/", wait_until="domcontentloaded")

            # Step 2: 等待页面加载,使用配置文件中的延时设置
            utils.logger.info(f"[TieBaCrawler] Step 2: 等待 {config.CRAWLER_MAX_SLEEP_SEC}秒 模拟用户浏览...")
            await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)

            # Step 3: 查找并点击"贴吧"链接
            utils.logger.info("[TieBaCrawler] Step 3: 查找并点击'贴吧'链接...")
            # 尝试多种选择器,确保能找到贴吧链接
            tieba_selectors = [
                'a[href="http://tieba.baidu.com/"]',
                'a[href="https://tieba.baidu.com/"]',
                'a.mnav:has-text("贴吧")',
                'text=贴吧',
            ]

            tieba_link = None
            for selector in tieba_selectors:
                try:
                    tieba_link = await self.context_page.wait_for_selector(selector, timeout=5000)
                    if tieba_link:
                        utils.logger.info(f"[TieBaCrawler] 找到贴吧链接 (selector: {selector})")
                        break
                except Exception:
                    continue

            if not tieba_link:
                utils.logger.warning("[TieBaCrawler] 未找到贴吧链接,直接访问贴吧首页")
                await self.context_page.goto(self.index_url, wait_until="domcontentloaded")
                return

            # Step 4: 点击贴吧链接 (检查是否会打开新标签页)
            utils.logger.info("[TieBaCrawler] Step 4: 点击贴吧链接...")

            # 检查链接的target属性
            target_attr = await tieba_link.get_attribute("target")
            utils.logger.info(f"[TieBaCrawler] 链接target属性: {target_attr}")

            if target_attr == "_blank":
                # 如果是新标签页,需要等待新页面并切换
                utils.logger.info("[TieBaCrawler] 链接会在新标签页打开,等待新页面...")
                async with self.browser_context.expect_page() as new_page_info:
                    await tieba_link.click()

                # 获取新打开的页面
                new_page = await new_page_info.value
                await new_page.wait_for_load_state("domcontentloaded")

                # 关闭旧的百度首页
                await self.context_page.close()

                # 切换到新的贴吧页面
                self.context_page = new_page
                utils.logger.info("[TieBaCrawler] ✅ 已切换到新标签页 (贴吧页面)")
            else:
                # 如果是同一标签页跳转,正常等待导航
                utils.logger.info("[TieBaCrawler] 链接在当前标签页跳转...")
                async with self.context_page.expect_navigation(wait_until="domcontentloaded"):
                    await tieba_link.click()

            # Step 5: 等待页面稳定,使用配置文件中的延时设置
            utils.logger.info(f"[TieBaCrawler] Step 5: 页面加载完成,等待 {config.CRAWLER_MAX_SLEEP_SEC}秒...")
            await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)

            current_url = self.context_page.url
            utils.logger.info(f"[TieBaCrawler] ✅ 成功通过百度首页进入贴吧! 当前URL: {current_url}")

        except Exception as e:
            utils.logger.error(f"[TieBaCrawler] 通过百度首页访问贴吧失败: {e}")
            utils.logger.info("[TieBaCrawler] 回退:直接访问贴吧首页")
            await self.context_page.goto(self.index_url, wait_until="domcontentloaded")
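The `target="_blank"` branch is the subtle part: the click spawns a new tab, and `expect_page()` is how the crawler captures and adopts it. A condensed sketch of the two branches (the function and parameter names are mine, not the project's):

```python
from playwright.async_api import BrowserContext, ElementHandle, Page

async def click_and_follow(context: BrowserContext, page: Page, link: ElementHandle) -> Page:
    # Links with target="_blank" open a new tab: expect_page() captures it
    # so the caller can adopt it; otherwise wait for in-place navigation.
    if await link.get_attribute("target") == "_blank":
        async with context.expect_page() as new_page_info:
            await link.click()
        new_page = await new_page_info.value
        await new_page.wait_for_load_state("domcontentloaded")
        await page.close()  # drop the old tab, mirroring the flow above
        return new_page
    async with page.expect_navigation(wait_until="domcontentloaded"):
        await link.click()
    return page
```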
    async def _inject_anti_detection_scripts(self):
        """
        注入反检测JavaScript脚本
        针对百度贴吧的特殊检测机制
        """
        utils.logger.info("[TieBaCrawler] Injecting anti-detection scripts...")

        # 轻量级反检测脚本,只覆盖关键检测点
        anti_detection_js = """
        // 覆盖 navigator.webdriver
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined,
            configurable: true
        });

        // 覆盖 window.navigator.chrome
        if (!window.navigator.chrome) {
            window.navigator.chrome = {
                runtime: {},
                loadTimes: function() {},
                csi: function() {},
                app: {}
            };
        }

        // 覆盖 Permissions API
        const originalQuery = window.navigator.permissions.query;
        window.navigator.permissions.query = (parameters) => (
            parameters.name === 'notifications' ?
                Promise.resolve({ state: Notification.permission }) :
                originalQuery(parameters)
        );

        // 覆盖 plugins 长度(让它看起来有插件)
        Object.defineProperty(navigator, 'plugins', {
            get: () => [1, 2, 3, 4, 5],
            configurable: true
        });

        // 覆盖 languages
        Object.defineProperty(navigator, 'languages', {
            get: () => ['zh-CN', 'zh', 'en'],
            configurable: true
        });

        // 移除 window.cdc_ 等 ChromeDriver 残留
        delete window.cdc_adoQpoasnfa76pfcZLmcfl_Array;
        delete window.cdc_adoQpoasnfa76pfcZLmcfl_Promise;
        delete window.cdc_adoQpoasnfa76pfcZLmcfl_Symbol;

        console.log('[Anti-Detection] Scripts injected successfully');
        """
        await self.browser_context.add_init_script(anti_detection_js)
        utils.logger.info("[TieBaCrawler] Anti-detection scripts injected")
    async def create_tieba_client(
        self, httpx_proxy: Optional[str], ip_pool: Optional[ProxyIpPool] = None
    ) -> BaiduTieBaClient:
        """
        Create tieba client with real browser User-Agent and complete headers
        Args:
            httpx_proxy: HTTP代理
            ip_pool: IP代理池
        Returns:
            BaiduTieBaClient实例
        """
        utils.logger.info("[TieBaCrawler.create_tieba_client] Begin create tieba API client...")

        # 从真实浏览器提取User-Agent,避免被检测
        user_agent = await self.context_page.evaluate("() => navigator.userAgent")
        utils.logger.info(f"[TieBaCrawler.create_tieba_client] Extracted User-Agent from browser: {user_agent}")

        cookie_str, cookie_dict = utils.convert_cookies(await self.browser_context.cookies())

        # 构建完整的浏览器请求头,模拟真实浏览器行为
        tieba_client = BaiduTieBaClient(
            timeout=10,
            ip_pool=ip_pool,
            default_ip_proxy=httpx_proxy,
            headers={
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
                "Accept-Language": "zh-CN,zh;q=0.9",
                "Accept-Encoding": "gzip, deflate, br",
                "Connection": "keep-alive",
                "User-Agent": user_agent,  # 使用真实浏览器的UA
                "Cookie": cookie_str,
                "Host": "tieba.baidu.com",
                "Referer": "https://tieba.baidu.com/",
                "Sec-Fetch-Dest": "document",
                "Sec-Fetch-Mode": "navigate",
                "Sec-Fetch-Site": "same-origin",
                "Sec-Fetch-User": "?1",
                "Upgrade-Insecure-Requests": "1",
                "sec-ch-ua": '"Google Chrome";v="141", "Not?A_Brand";v="8", "Chromium";v="141"',
                "sec-ch-ua-mobile": "?0",
                "sec-ch-ua-platform": '"macOS"',
            },
            playwright_page=self.context_page,  # 传入playwright页面对象
        )
        return tieba_client
    async def launch_browser(
        self,
        chromium: BrowserType,
@@ -381,10 +623,11 @@ class TieBaCrawler(AbstractCrawler):
                proxy=playwright_proxy,  # type: ignore
                viewport={"width": 1920, "height": 1080},
                user_agent=user_agent,
+                channel="chrome",  # 使用系统的Chrome稳定版
            )
            return browser_context
        else:
-            browser = await chromium.launch(headless=headless, proxy=playwright_proxy)  # type: ignore
+            browser = await chromium.launch(headless=headless, proxy=playwright_proxy, channel="chrome")  # type: ignore
            browser_context = await browser.new_context(
                viewport={"width": 1920, "height": 1080}, user_agent=user_agent
            )

View File

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/tieba/field.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

View File

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/tieba/help.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

View File

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/tieba/login.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

View File

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/weibo/__init__.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

View File

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/weibo/client.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -17,21 +26,26 @@ import asyncio
import copy
import json
import re
-from typing import Callable, Dict, List, Optional, Union
+from typing import TYPE_CHECKING, Callable, Dict, List, Optional, Union
from urllib.parse import parse_qs, unquote, urlencode

import httpx
from httpx import Response
from playwright.async_api import BrowserContext, Page
+from tenacity import retry, stop_after_attempt, wait_fixed

import config
+from proxy.proxy_mixin import ProxyRefreshMixin
from tools import utils

+if TYPE_CHECKING:
+    from proxy.proxy_ip_pool import ProxyIpPool
+
from .exception import DataFetchError
from .field import SearchType
-class WeiboClient:
+class WeiboClient(ProxyRefreshMixin):
    def __init__(
        self,
@@ -41,6 +55,7 @@ class WeiboClient:
        headers: Dict[str, str],
        playwright_page: Page,
        cookie_dict: Dict[str, str],
+        proxy_ip_pool: Optional["ProxyIpPool"] = None,
    ):
        self.proxy = proxy
        self.timeout = timeout
@@ -49,8 +64,14 @@ class WeiboClient:
        self.playwright_page = playwright_page
        self.cookie_dict = cookie_dict
        self._image_agent_host = "https://i1.wp.com/"
+        # 初始化代理池(来自 ProxyRefreshMixin)
+        self.init_proxy_pool(proxy_ip_pool)

+    @retry(stop=stop_after_attempt(5), wait=wait_fixed(3))
    async def request(self, method, url, **kwargs) -> Union[Response, Dict]:
+        # 每次请求前检测代理是否过期
+        await self._refresh_proxy_if_expired()
+
        enable_return_response = kwargs.pop("return_response", False)
        async with httpx.AsyncClient(proxy=self.proxy) as client:
            response = await client.request(method, url, timeout=self.timeout, **kwargs)
@@ -58,7 +79,16 @@ class WeiboClient:
        if enable_return_response:
            return response

-        data: Dict = response.json()
+        try:
+            data: Dict = response.json()
+        except json.decoder.JSONDecodeError:
+            # issue: #771 搜索接口会报错432,多次重试 + 更新 h5 cookies
+            utils.logger.error(f"[WeiboClient.request] request {method}:{url} err code: {response.status_code} res:{response.text}")
+            await self.playwright_page.goto(self._host)
+            await asyncio.sleep(2)
+            await self.update_cookies(browser_context=self.playwright_page.context)
+            raise DataFetchError(f"get response code error: {response.status_code}")
+
        ok_code = data.get("ok")
        if ok_code == 0:  # response error
            utils.logger.error(f"[WeiboClient.request] request {method}:{url} err, res:{data}")
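The pairing of a `tenacity` decorator with a raise inside the JSON fallback means a garbled (non-JSON) body triggers up to five spaced retries, with cookies refreshed in between. A minimal sketch of that pattern without the cookie-refresh step (`FetchError` is a hypothetical stand-in for `DataFetchError`):

```python
import httpx
from tenacity import retry, stop_after_attempt, wait_fixed

class FetchError(Exception):
    pass

@retry(stop=stop_after_attempt(5), wait=wait_fixed(3))
async def get_json(url: str) -> dict:
    async with httpx.AsyncClient() as client:
        resp = await client.request("GET", url, timeout=10)
    try:
        return resp.json()
    except ValueError:
        # Non-JSON body (e.g. an anti-bot interstitial): raising re-enters
        # the tenacity retry loop, giving the session a chance to recover.
        raise FetchError(f"unexpected response: {resp.status_code}")
```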
@@ -99,10 +129,24 @@ class WeiboClient:
            ping_flag = False
        return ping_flag
-    async def update_cookies(self, browser_context: BrowserContext):
-        cookie_str, cookie_dict = utils.convert_cookies(await browser_context.cookies())
+    async def update_cookies(self, browser_context: BrowserContext, urls: Optional[List[str]] = None):
+        """
+        Update cookies from browser context
+        :param browser_context: Browser context
+        :param urls: Optional list of URLs to filter cookies (e.g., ["https://m.weibo.cn"])
+                     If provided, only cookies for these URLs will be retrieved
+        """
+        if urls:
+            cookies = await browser_context.cookies(urls=urls)
+            utils.logger.info(f"[WeiboClient.update_cookies] Updating cookies for specific URLs: {urls}")
+        else:
+            cookies = await browser_context.cookies()
+            utils.logger.info("[WeiboClient.update_cookies] Updating all cookies")
+
+        cookie_str, cookie_dict = utils.convert_cookies(cookies)
        self.headers["Cookie"] = cookie_str
        self.cookie_dict = cookie_dict
+        utils.logger.info(f"[WeiboClient.update_cookies] Cookie updated successfully, total: {len(cookie_dict)} cookies")
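Playwright's `BrowserContext.cookies(urls=...)` does the domain scoping here, so PC-site cookies never reach the mobile API headers. A small sketch of the same filtering:

```python
from playwright.async_api import BrowserContext

async def mobile_cookie_header(browser_context: BrowserContext) -> str:
    # cookies(urls=...) returns only cookies in scope for those URLs,
    # keeping PC-site (.weibo.com) cookies out of the m.weibo.cn header.
    cookies = await browser_context.cookies(urls=["https://m.weibo.cn"])
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)
```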
    async def get_note_by_keyword(
        self,
@@ -288,27 +332,14 @@
        """
        uri = "/api/container/getIndex"
-        container_info = await self.get_creator_container_info(creator_id)
-        if container_info.get("fid_container_id") == "" or container_info.get("lfid_container_id") == "":
-            utils.logger.error(f"[WeiboClient.get_creator_info_by_id] get containerid failed")
-            raise DataFetchError("get containerid failed")
+        containerid = f"100505{creator_id}"
        params = {
            "jumpfrom": "weibocom",
            "type": "uid",
            "value": creator_id,
-            "containerid": container_info["fid_container_id"],
+            "containerid": containerid,
        }
        user_res = await self.get(uri, params)
-        if user_res.get("tabsInfo"):
-            tabs: List[Dict] = user_res.get("tabsInfo", {}).get("tabs", [])
-            for tab in tabs:
-                if tab.get("tabKey") == "weibo":
-                    container_info["lfid_container_id"] = tab.get("containerid")
-                    break
-            user_res.update(container_info)
        return user_res
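The hard-coded prefixes follow m.weibo.cn's container-id convention as used in this diff: `100505` + uid addresses the profile card, and `107603` + uid the creator's post feed (an observed convention, not a documented API contract). As trivial helpers:

```python
def profile_containerid(uid: str) -> str:
    # 100505<uid>: profile-card container on m.weibo.cn (observed convention)
    return f"100505{uid}"

def feed_containerid(uid: str) -> str:
    # 107603<uid>: the creator's post-feed container (observed convention)
    return f"107603{uid}"

print(profile_containerid("1234567890"))  # -> 1005051234567890
```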
    async def get_notes_by_creator(

View File

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/weibo/core.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -54,12 +63,13 @@ class WeiboCrawler(AbstractCrawler):
        self.user_agent = utils.get_user_agent()
        self.mobile_user_agent = utils.get_mobile_user_agent()
        self.cdp_manager = None
+        self.ip_proxy_pool = None  # 代理IP池用于代理自动刷新
    async def start(self):
        playwright_proxy_format, httpx_proxy_format = None, None
        if config.ENABLE_IP_PROXY:
-            ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
+            self.ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
-            ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy()
+            ip_proxy_info: IpInfoModel = await self.ip_proxy_pool.get_proxy()
            playwright_proxy_format, httpx_proxy_format = utils.format_proxy_info(ip_proxy_info)

        async with async_playwright() as playwright:
@@ -77,10 +87,15 @@ class WeiboCrawler(AbstractCrawler):
                # Launch a browser context.
                chromium = playwright.chromium
                self.browser_context = await self.launch_browser(chromium, None, self.mobile_user_agent, headless=config.HEADLESS)

            # stealth.min.js is a js script to prevent the website from detecting the crawler.
            await self.browser_context.add_init_script(path="libs/stealth.min.js")
            self.context_page = await self.browser_context.new_page()
-            await self.context_page.goto(self.mobile_index_url)
+            await self.context_page.goto(self.index_url)
+            await asyncio.sleep(2)

            # Create a client to interact with the xiaohongshu website.
            self.wb_client = await self.create_weibo_client(httpx_proxy_format)
@@ -97,8 +112,12 @@ class WeiboCrawler(AbstractCrawler):
                # 登录成功后重定向到手机端的网站再更新手机端登录成功的cookie
                utils.logger.info("[WeiboCrawler.start] redirect weibo mobile homepage and update cookies on mobile platform")
                await self.context_page.goto(self.mobile_index_url)
-                await asyncio.sleep(2)
+                await asyncio.sleep(3)
-                await self.wb_client.update_cookies(browser_context=self.browser_context)
+                # 只获取移动端的 cookies,避免 PC 端和移动端 cookies 混淆
+                await self.wb_client.update_cookies(
+                    browser_context=self.browser_context,
+                    urls=[self.mobile_index_url]
+                )
crawler_type_var.set(config.CRAWLER_TYPE) crawler_type_var.set(config.CRAWLER_TYPE)
if config.CRAWLER_TYPE == "search": if config.CRAWLER_TYPE == "search":
@@ -290,7 +309,7 @@ class WeiboCrawler(AbstractCrawler):
        # Get all note information of the creator
        all_notes_list = await self.wb_client.get_all_notes_by_creator_id(
            creator_id=user_id,
-            container_id=createor_info_res.get("lfid_container_id"),
+            container_id=f"107603{user_id}",
            crawl_interval=0,
            callback=weibo_store.batch_update_weibo_notes,
        )
@@ -304,7 +323,7 @@ class WeiboCrawler(AbstractCrawler):
    async def create_weibo_client(self, httpx_proxy: Optional[str]) -> WeiboClient:
        """Create xhs client"""
        utils.logger.info("[WeiboCrawler.create_weibo_client] Begin create weibo API client ...")
-        cookie_str, cookie_dict = utils.convert_cookies(await self.browser_context.cookies())
+        cookie_str, cookie_dict = utils.convert_cookies(await self.browser_context.cookies(urls=[self.mobile_index_url]))
        weibo_client_obj = WeiboClient(
            proxy=httpx_proxy,
            headers={
@@ -316,6 +335,7 @@ class WeiboCrawler(AbstractCrawler):
            },
            playwright_page=self.context_page,
            cookie_dict=cookie_dict,
+            proxy_ip_pool=self.ip_proxy_pool,  # 传递代理池用于自动刷新
        )
        return weibo_client_obj
@@ -340,10 +360,11 @@
                    "height": 1080
                },
                user_agent=user_agent,
+                channel="chrome",  # 使用系统的Chrome稳定版
            )
            return browser_context
        else:
-            browser = await chromium.launch(headless=headless, proxy=playwright_proxy)  # type: ignore
+            browser = await chromium.launch(headless=headless, proxy=playwright_proxy, channel="chrome")  # type: ignore
            browser_context = await browser.new_context(viewport={"width": 1920, "height": 1080}, user_agent=user_agent)
            return browser_context

View File

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/weibo/exception.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

View File

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/weibo/field.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

View File

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/weibo/help.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。

media_platform/weibo/login.py
@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/weibo/login.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
 # 1. It must not be used for any commercial purposes.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.

media_platform/xhs/__init__.py
@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/xhs/__init__.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
 # 1. It must not be used for any commercial purposes.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.

media_platform/xhs/client.py
@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/xhs/client.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
 # 1. It must not be used for any commercial purposes.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.
@@ -10,26 +19,29 @@
 import asyncio
 import json
-import re
-from typing import Any, Callable, Dict, List, Optional, Union
+from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Union
 from urllib.parse import urlencode
 import httpx
 from playwright.async_api import BrowserContext, Page
-from tenacity import retry, stop_after_attempt, wait_fixed, retry_if_result
+from tenacity import retry, stop_after_attempt, wait_fixed
 import config
 from base.base_crawler import AbstractApiClient
+from proxy.proxy_mixin import ProxyRefreshMixin
 from tools import utils
-from html import unescape
+if TYPE_CHECKING:
+    from proxy.proxy_ip_pool import ProxyIpPool
 from .exception import DataFetchError, IPBlockError
 from .field import SearchNoteType, SearchSortType
-from .help import get_search_id, sign
+from .help import get_search_id
 from .extractor import XiaoHongShuExtractor
+from .playwright_sign import sign_with_playwright
-class XiaoHongShuClient(AbstractApiClient):
+class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
     def __init__(
         self,
@@ -39,6 +51,7 @@ class XiaoHongShuClient(AbstractApiClient):
         headers: Dict[str, str],
         playwright_page: Page,
         cookie_dict: Dict[str, str],
+        proxy_ip_pool: Optional["ProxyIpPool"] = None,
     ):
         self.proxy = proxy
         self.timeout = timeout
@@ -52,26 +65,36 @@ class XiaoHongShuClient(AbstractApiClient):
         self.playwright_page = playwright_page
         self.cookie_dict = cookie_dict
         self._extractor = XiaoHongShuExtractor()
+        # Initialize the proxy pool (from ProxyRefreshMixin)
+        self.init_proxy_pool(proxy_ip_pool)
-    async def _pre_headers(self, url: str, data=None) -> Dict:
-        """
-        Sign the request headers
+    async def _pre_headers(self, url: str, params: Optional[Dict] = None, payload: Optional[Dict] = None) -> Dict:
+        """Sign the request headers (via playwright injection)
         Args:
-            url:
-            data:
+            url: the request URL
+            params: GET request parameters
+            payload: POST request parameters
         Returns:
+            Dict: signed request headers
         """
-        encrypt_params = await self.playwright_page.evaluate(
-            "([url, data]) => window._webmsxyw(url,data)", [url, data]
-        )
-        local_storage = await self.playwright_page.evaluate("() => window.localStorage")
-        signs = sign(
-            a1=self.cookie_dict.get("a1", ""),
-            b1=local_storage.get("b1", ""),
-            x_s=encrypt_params.get("X-s", ""),
-            x_t=str(encrypt_params.get("X-t", "")),
+        a1_value = self.cookie_dict.get("a1", "")
+        # Determine the request data and URI
+        if params is not None:
+            data = params
+        elif payload is not None:
+            data = payload
+        else:
+            raise ValueError("params or payload is required")
+        # Generate the signature via playwright injection
+        signs = await sign_with_playwright(
+            page=self.playwright_page,
+            uri=url,
+            data=data,
+            a1=a1_value,
         )
         headers = {
@@ -95,6 +118,9 @@ class XiaoHongShuClient(AbstractApiClient):
         Returns:
         """
+        # Check whether the proxy has expired before each request
+        await self._refresh_proxy_if_expired()
         # return response.text
         return_response = kwargs.pop("return_response", False)
         async with httpx.AsyncClient(proxy=self.proxy) as client:
@@ -116,9 +142,10 @@ class XiaoHongShuClient(AbstractApiClient):
             elif data["code"] == self.IP_ERROR_CODE:
                 raise IPBlockError(self.IP_ERROR_STR)
             else:
-                raise DataFetchError(data.get("msg", None))
+                err_msg = data.get("msg", None) or f"{response.text}"
+                raise DataFetchError(err_msg)
-    async def get(self, uri: str, params=None) -> Dict:
+    async def get(self, uri: str, params: Optional[Dict] = None) -> Dict:
         """
         GET request; sign the request headers
         Args:
@@ -128,12 +155,16 @@ class XiaoHongShuClient(AbstractApiClient):
         Returns:
         """
-        final_uri = uri
+        headers = await self._pre_headers(uri, params)
         if isinstance(params, dict):
-            final_uri = f"{uri}?" f"{urlencode(params)}"
-        headers = await self._pre_headers(final_uri)
+            # Build the full URL with query parameters
+            query_string = urlencode(params)
+            full_url = f"{self._host}{uri}?{query_string}"
+        else:
+            full_url = f"{self._host}{uri}"
         return await self.request(
-            method="GET", url=f"{self._host}{final_uri}", headers=headers
+            method="GET", url=full_url, headers=headers
         )
     async def post(self, uri: str, data: dict, **kwargs) -> Dict:
@@ -146,7 +177,7 @@ class XiaoHongShuClient(AbstractApiClient):
         Returns:
         """
-        headers = await self._pre_headers(uri, data)
+        headers = await self._pre_headers(uri, payload=data)
         json_str = json.dumps(data, separators=(",", ":"), ensure_ascii=False)
         return await self.request(
             method="POST",
@@ -157,6 +188,9 @@ class XiaoHongShuClient(AbstractApiClient):
         )
     async def get_note_media(self, url: str) -> Union[bytes, None]:
+        # Check whether the proxy has expired before the request
+        await self._refresh_proxy_if_expired()
         async with httpx.AsyncClient(proxy=self.proxy) as client:
             try:
                 response = await client.request("GET", url, timeout=self.timeout)
@@ -451,13 +485,26 @@ class XiaoHongShuClient(AbstractApiClient):
                 result.extend(comments)
         return result
-    async def get_creator_info(self, user_id: str) -> Dict:
+    async def get_creator_info(
+        self, user_id: str, xsec_token: str = "", xsec_source: str = ""
+    ) -> Dict:
         """
         Get brief creator info by parsing the HTML of the web profile page
         The PC profile page exposes the window.__INITIAL_STATE__ variable; parsing it is enough
+        eg: https://www.xiaohongshu.com/user/profile/59d8cb33de5fb4696bf17217
+        Args:
+            user_id: user ID
+            xsec_token: verification token (optional; pass it if the URL contains this parameter)
+            xsec_source: channel source (optional; pass it if the URL contains this parameter)
+        Returns:
+            Dict: creator info
         """
+        # Build the URI; append the xsec parameters to the URL if present
         uri = f"/user/profile/{user_id}"
+        if xsec_token and xsec_source:
+            uri = f"{uri}?xsec_token={xsec_token}&xsec_source={xsec_source}"
         html_content = await self.request(
             "GET", self._domain + uri, return_response=True, headers=self.headers
         )
@@ -468,6 +515,8 @@ class XiaoHongShuClient(AbstractApiClient):
         creator: str,
         cursor: str,
         page_size: int = 30,
+        xsec_token: str = "",
+        xsec_source: str = "pc_feed",
     ) -> Dict:
         """
         Get the creator's notes
@@ -475,24 +524,29 @@ class XiaoHongShuClient(AbstractApiClient):
             creator: creator ID
             cursor: ID of the last note on the previous page
             page_size: number of items per page
+            xsec_token: verification token
+            xsec_source: channel source
         Returns:
         """
-        uri = "/api/sns/web/v1/user_posted"
-        data = {
-            "user_id": creator,
-            "cursor": cursor,
+        uri = f"/api/sns/web/v1/user_posted"
+        params = {
             "num": page_size,
-            "image_formats": "jpg,webp,avif",
+            "cursor": cursor,
+            "user_id": creator,
+            "xsec_token": xsec_token,
+            "xsec_source": xsec_source,
         }
-        return await self.get(uri, data)
+        return await self.get(uri, params)
     async def get_all_notes_by_creator(
         self,
         user_id: str,
         crawl_interval: float = 1.0,
         callback: Optional[Callable] = None,
+        xsec_token: str = "",
+        xsec_source: str = "pc_feed",
     ) -> List[Dict]:
         """
         Get all notes the given user has posted; this method keeps paging until every note is fetched
@@ -500,6 +554,8 @@ class XiaoHongShuClient(AbstractApiClient):
             user_id: user ID
             crawl_interval: delay between requests (seconds)
             callback: update callback invoked after each page is crawled
+            xsec_token: verification token
+            xsec_source: channel source
         Returns:
@@ -508,7 +564,9 @@ class XiaoHongShuClient(AbstractApiClient):
         notes_has_more = True
         notes_cursor = ""
         while notes_has_more and len(result) < config.CRAWLER_MAX_NOTES_COUNT:
-            notes_res = await self.get_notes_by_creator(user_id, notes_cursor)
+            notes_res = await self.get_notes_by_creator(
+                user_id, notes_cursor, xsec_token=xsec_token, xsec_source=xsec_source
+            )
             if not notes_res:
                 utils.logger.error(
                     f"[XiaoHongShuClient.get_notes_by_creator] The current creator may have been banned by xhs, so they cannot access the data."

media_platform/xhs/core.py
@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/xhs/core.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
 # 1. It must not be used for any commercial purposes.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.
@@ -26,7 +35,7 @@ from tenacity import RetryError
 import config
 from base.base_crawler import AbstractCrawler
 from config import CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES
-from model.m_xiaohongshu import NoteUrlInfo
+from model.m_xiaohongshu import NoteUrlInfo, CreatorUrlInfo
 from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
 from store import xhs as xhs_store
 from tools import utils
@@ -36,7 +45,7 @@ from var import crawler_type_var, source_keyword_var
 from .client import XiaoHongShuClient
 from .exception import DataFetchError
 from .field import SearchSortType
-from .help import parse_note_info_from_note_url, get_search_id
+from .help import parse_note_info_from_note_url, parse_creator_info_from_url, get_search_id
 from .login import XiaoHongShuLogin
@@ -51,12 +60,13 @@ class XiaoHongShuCrawler(AbstractCrawler):
         # self.user_agent = utils.get_user_agent()
         self.user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
         self.cdp_manager = None
+        self.ip_proxy_pool = None  # proxy IP pool, used for automatic proxy refresh
     async def start(self) -> None:
         playwright_proxy_format, httpx_proxy_format = None, None
         if config.ENABLE_IP_PROXY:
-            ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
-            ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy()
+            self.ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
+            ip_proxy_info: IpInfoModel = await self.ip_proxy_pool.get_proxy()
             playwright_proxy_format, httpx_proxy_format = utils.format_proxy_info(ip_proxy_info)
         async with async_playwright() as playwright:
@@ -81,6 +91,7 @@ class XiaoHongShuCrawler(AbstractCrawler):
             )
             # stealth.min.js is a js script to prevent the website from detecting the crawler.
             await self.browser_context.add_init_script(path="libs/stealth.min.js")
             self.context_page = await self.browser_context.new_page()
             await self.context_page.goto(self.index_url)
@@ -174,11 +185,24 @@ class XiaoHongShuCrawler(AbstractCrawler):
     async def get_creators_and_notes(self) -> None:
         """Get creator's notes and retrieve their comment information."""
         utils.logger.info("[XiaoHongShuCrawler.get_creators_and_notes] Begin get xiaohongshu creators")
-        for user_id in config.XHS_CREATOR_ID_LIST:
+        for creator_url in config.XHS_CREATOR_ID_LIST:
+            try:
+                # Parse creator URL to get user_id and security tokens
+                creator_info: CreatorUrlInfo = parse_creator_info_from_url(creator_url)
+                utils.logger.info(f"[XiaoHongShuCrawler.get_creators_and_notes] Parse creator URL info: {creator_info}")
+                user_id = creator_info.user_id
                 # get creator detail info from web html content
-            createor_info: Dict = await self.xhs_client.get_creator_info(user_id=user_id)
+                createor_info: Dict = await self.xhs_client.get_creator_info(
+                    user_id=user_id,
+                    xsec_token=creator_info.xsec_token,
+                    xsec_source=creator_info.xsec_source
+                )
                 if createor_info:
                     await xhs_store.save_creator(user_id, creator=createor_info)
+            except ValueError as e:
+                utils.logger.error(f"[XiaoHongShuCrawler.get_creators_and_notes] Failed to parse creator URL: {e}")
+                continue
             # Use fixed crawling interval
             crawl_interval = config.CRAWLER_MAX_SLEEP_SEC
@@ -187,6 +211,8 @@ class XiaoHongShuCrawler(AbstractCrawler):
                 user_id=user_id,
                 crawl_interval=crawl_interval,
                 callback=self.fetch_creator_notes_detail,
+                xsec_token=creator_info.xsec_token,
+                xsec_source=creator_info.xsec_source,
             )
             note_ids = []
@@ -265,17 +291,17 @@ class XiaoHongShuCrawler(AbstractCrawler):
             Dict: note detail
         """
         note_detail = None
+        utils.logger.info(f"[get_note_detail_async_task] Begin get note detail, note_id: {note_id}")
         async with semaphore:
             try:
-                utils.logger.info(f"[get_note_detail_async_task] Begin get note detail, note_id: {note_id}")
                 try:
                     note_detail = await self.xhs_client.get_note_by_id(note_id, xsec_source, xsec_token)
-                except RetryError as e:
+                except RetryError:
                     pass
                 if not note_detail:
-                    note_detail = await self.xhs_client.get_note_by_id_from_html(note_id, xsec_source, xsec_token, enable_cookie=True)
+                    note_detail = await self.xhs_client.get_note_by_id_from_html(note_id, xsec_source, xsec_token,
+                                                                                 enable_cookie=True)
                 if not note_detail:
                     raise Exception(f"[get_note_detail_async_task] Failed to get note detail, Id: {note_id}")
@@ -355,6 +381,7 @@ class XiaoHongShuCrawler(AbstractCrawler):
             },
             playwright_page=self.context_page,
             cookie_dict=cookie_dict,
+            proxy_ip_pool=self.ip_proxy_pool,  # pass the proxy pool for automatic refresh
         )
         return xhs_client_obj
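With parse_creator_info_from_url in the loop above, config.XHS_CREATOR_ID_LIST can now mix full profile URLs and bare user IDs. An illustrative config entry (hypothetical values, mirroring the two formats documented in help.py further below):

# Illustrative config sketch; both entry styles parse to a CreatorUrlInfo.
XHS_CREATOR_ID_LIST = [
    # Full URL: xsec_token/xsec_source are extracted and forwarded to the API.
    "https://www.xiaohongshu.com/user/profile/5eb8e1d400000000010075ae?xsec_token=AB1nWBKCo1vE2HEkfoJUOi5B6BE5n7wVrbdpHoWIj5xHw=&xsec_source=pc_feed",
    # Bare 24-character user_id: the xsec fields default to empty strings.
    "5eb8e1d400000000010075ae",
]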

media_platform/xhs/exception.py
@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/xhs/exception.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
 # 1. It must not be used for any commercial purposes.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.

media_platform/xhs/extractor.py
@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/xhs/extractor.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
 # 1. It must not be used for any commercial purposes.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.

media_platform/xhs/field.py
@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/xhs/field.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
 # 1. It must not be used for any commercial purposes.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.

media_platform/xhs/help.py
@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/xhs/help.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
 # 1. It must not be used for any commercial purposes.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.
@@ -15,7 +24,7 @@ import random
 import time
 import urllib.parse
-from model.m_xiaohongshu import NoteUrlInfo
+from model.m_xiaohongshu import NoteUrlInfo, CreatorUrlInfo
 from tools.crawler_util import extract_url_params_to_dict
@@ -27,16 +36,17 @@ def sign(a1="", b1="", x_s="", x_t=""):
         "s0": 3,  # getPlatformCode
         "s1": "",
         "x0": "1",  # localStorage.getItem("b1b1")
-        "x1": "3.7.8-2",  # version
+        "x1": "4.2.2",  # version
         "x2": "Mac OS",
         "x3": "xhs-pc-web",
-        "x4": "4.27.2",
+        "x4": "4.74.0",
         "x5": a1,  # cookie of a1
         "x6": x_t,
         "x7": x_s,
         "x8": b1,  # localStorage.getItem("b1")
         "x9": mrc(x_t + x_s + b1),
         "x10": 154,  # getSigCount
+        "x11": "normal"
     }
     encode_str = encodeUtf8(json.dumps(common, separators=(',', ':')))
     x_s_common = b64Encode(encode_str)
@@ -306,6 +316,37 @@ def parse_note_info_from_note_url(url: str) -> NoteUrlInfo:
     return NoteUrlInfo(note_id=note_id, xsec_token=xsec_token, xsec_source=xsec_source)
+def parse_creator_info_from_url(url: str) -> CreatorUrlInfo:
+    """
+    Parse creator info from a xiaohongshu creator profile URL
+    Supported formats:
+    1. Full URL: "https://www.xiaohongshu.com/user/profile/5eb8e1d400000000010075ae?xsec_token=AB1nWBKCo1vE2HEkfoJUOi5B6BE5n7wVrbdpHoWIj5xHw=&xsec_source=pc_feed"
+    2. Bare ID: "5eb8e1d400000000010075ae"
+    Args:
+        url: creator profile URL or user_id
+    Returns:
+        CreatorUrlInfo: object containing user_id, xsec_token and xsec_source
+    """
+    # If it is a bare ID (24 hexadecimal characters), return it directly
+    if len(url) == 24 and all(c in "0123456789abcdef" for c in url):
+        return CreatorUrlInfo(user_id=url, xsec_token="", xsec_source="")
+    # Extract user_id from the URL: /user/profile/xxx
+    import re
+    user_pattern = r'/user/profile/([^/?]+)'
+    match = re.search(user_pattern, url)
+    if match:
+        user_id = match.group(1)
+        # Extract the xsec_token and xsec_source parameters
+        params = extract_url_params_to_dict(url)
+        xsec_token = params.get("xsec_token", "")
+        xsec_source = params.get("xsec_source", "")
+        return CreatorUrlInfo(user_id=user_id, xsec_token=xsec_token, xsec_source=xsec_source)
+    raise ValueError(f"Cannot parse creator info from URL: {url}")
 if __name__ == '__main__':
     _img_url = "https://sns-img-bd.xhscdn.com/7a3abfaf-90c1-a828-5de7-022c80b92aa3"
     # Get the URLs of one image address across multiple CDNs
@@ -313,4 +354,17 @@ if __name__ == '__main__':
     final_img_url = get_img_url_by_trace_id(get_trace_id(_img_url))
     print(final_img_url)
+    # Test creator URL parsing
+    print("\n=== Creator URL parsing test ===")
+    test_creator_urls = [
+        "https://www.xiaohongshu.com/user/profile/5eb8e1d400000000010075ae?xsec_token=AB1nWBKCo1vE2HEkfoJUOi5B6BE5n7wVrbdpHoWIj5xHw=&xsec_source=pc_feed",
+        "5eb8e1d400000000010075ae",
+    ]
+    for url in test_creator_urls:
+        try:
+            result = parse_creator_info_from_url(url)
+            print(f"✓ URL: {url[:80]}...")
+            print(f"  result: {result}\n")
+        except Exception as e:
+            print(f"✗ URL: {url}")
+            print(f"  error: {e}\n")

media_platform/xhs/login.py
@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/xhs/login.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
 # 1. It must not be used for any commercial purposes.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.

media_platform/xhs/playwright_sign.py
@@ -0,0 +1,203 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/xhs/playwright_sign.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
+# Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
+# 1. It must not be used for any commercial purposes.
+# 2. Usage must comply with the target platform's terms of service and robots.txt rules.
+# 3. No large-scale crawling, and no operational disruption to the platform.
+# 4. Request rates must be kept reasonable to avoid placing unnecessary load on the platform.
+# 5. It must not be used for any illegal or improper purposes.
+#
+# See the LICENSE file in the project root for the full license terms.
+# By using this code you agree to the principles above and to all terms in the LICENSE.
+
+# Generate xiaohongshu signatures by calling window.mnsv2 via Playwright injection
+import hashlib
+import json
+import time
+from typing import Any, Dict, Optional, Union
+from urllib.parse import urlparse
+
+from playwright.async_api import Page
+
+from .xhs_sign import b64_encode, encode_utf8, get_trace_id, mrc
+
+
+def _build_sign_string(uri: str, data: Optional[Union[Dict, str]] = None) -> str:
+    """Build the string to be signed"""
+    c = uri
+    if data is not None:
+        if isinstance(data, dict):
+            c += json.dumps(data, separators=(",", ":"), ensure_ascii=False)
+        elif isinstance(data, str):
+            c += data
+    return c
+
+
+def _md5_hex(s: str) -> str:
+    """Compute the MD5 hash"""
+    return hashlib.md5(s.encode("utf-8")).hexdigest()
+
+
+def _build_xs_payload(x3_value: str, data_type: str = "object") -> str:
+    """Build the x-s signature"""
+    s = {
+        "x0": "4.2.1",
+        "x1": "xhs-pc-web",
+        "x2": "Mac OS",
+        "x3": x3_value,
+        "x4": data_type,
+    }
+    return "XYS_" + b64_encode(encode_utf8(json.dumps(s, separators=(",", ":"))))
+
+
+def _build_xs_common(a1: str, b1: str, x_s: str, x_t: str) -> str:
+    """Build the x-s-common request header"""
+    payload = {
+        "s0": 3,
+        "s1": "",
+        "x0": "1",
+        "x1": "4.2.2",
+        "x2": "Mac OS",
+        "x3": "xhs-pc-web",
+        "x4": "4.74.0",
+        "x5": a1,
+        "x6": x_t,
+        "x7": x_s,
+        "x8": b1,
+        "x9": mrc(x_t + x_s + b1),
+        "x10": 154,
+        "x11": "normal",
+    }
+    return b64_encode(encode_utf8(json.dumps(payload, separators=(",", ":"))))
+
+
+async def get_b1_from_localstorage(page: Page) -> str:
+    """Read the b1 value from localStorage"""
+    try:
+        local_storage = await page.evaluate("() => window.localStorage")
+        return local_storage.get("b1", "")
+    except Exception:
+        return ""
+
+
+async def call_mnsv2(page: Page, sign_str: str, md5_str: str) -> str:
+    """
+    Call the window.mnsv2 function via playwright
+    Args:
+        page: playwright Page object
+        sign_str: string to be signed (uri + JSON.stringify(data))
+        md5_str: MD5 hash of sign_str
+    Returns:
+        the signature string returned by mnsv2
+    """
+    sign_str_escaped = sign_str.replace("\\", "\\\\").replace("'", "\\'").replace("\n", "\\n")
+    md5_str_escaped = md5_str.replace("\\", "\\\\").replace("'", "\\'")
+    try:
+        result = await page.evaluate(f"window.mnsv2('{sign_str_escaped}', '{md5_str_escaped}')")
+        return result if result else ""
+    except Exception:
+        return ""
+
+
+async def sign_xs_with_playwright(
+    page: Page,
+    uri: str,
+    data: Optional[Union[Dict, str]] = None,
+) -> str:
+    """
+    Generate the x-s signature via playwright injection
+    Args:
+        page: playwright Page object (a xiaohongshu page must already be open)
+        uri: API path, e.g. "/api/sns/web/v1/search/notes"
+        data: request data (GET params or POST payload)
+    Returns:
+        the x-s signature string
+    """
+    sign_str = _build_sign_string(uri, data)
+    md5_str = _md5_hex(sign_str)
+    x3_value = await call_mnsv2(page, sign_str, md5_str)
+    data_type = "object" if isinstance(data, (dict, list)) else "string"
+    return _build_xs_payload(x3_value, data_type)
+
+
+async def sign_with_playwright(
+    page: Page,
+    uri: str,
+    data: Optional[Union[Dict, str]] = None,
+    a1: str = "",
+) -> Dict[str, Any]:
+    """
+    Generate the complete set of signed request headers via playwright
+    Args:
+        page: playwright Page object (a xiaohongshu page must already be open)
+        uri: API path
+        data: request data
+        a1: the a1 value from the cookie
+    Returns:
+        dict containing x-s, x-t, x-s-common and x-b3-traceid
+    """
+    b1 = await get_b1_from_localstorage(page)
+    x_s = await sign_xs_with_playwright(page, uri, data)
+    x_t = str(int(time.time() * 1000))
+    return {
+        "x-s": x_s,
+        "x-t": x_t,
+        "x-s-common": _build_xs_common(a1, b1, x_s, x_t),
+        "x-b3-traceid": get_trace_id(),
+    }
+
+
+async def pre_headers_with_playwright(
+    page: Page,
+    url: str,
+    cookie_dict: Dict[str, str],
+    params: Optional[Dict] = None,
+    payload: Optional[Dict] = None,
+) -> Dict[str, str]:
+    """
+    Generate signed request headers via playwright injection
+    Can directly replace the _pre_headers method in client.py
+    Args:
+        page: playwright Page object
+        url: request URL
+        cookie_dict: cookie dict
+        params: GET request parameters
+        payload: POST request parameters
+    Returns:
+        dict of signed request headers
+    """
+    a1_value = cookie_dict.get("a1", "")
+    uri = urlparse(url).path
+    if params is not None:
+        data = params
+    elif payload is not None:
+        data = payload
+    else:
+        raise ValueError("params or payload is required")
+    signs = await sign_with_playwright(page, uri, data, a1_value)
+    return {
+        "X-S": signs["x-s"],
+        "X-T": signs["x-t"],
+        "x-S-Common": signs["x-s-common"],
+        "X-B3-Traceid": signs["x-b3-traceid"],
+    }
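A minimal usage sketch for the module above; it assumes a xiaohongshu page is already open in the Playwright context so that window.mnsv2 is injected and callable (the search parameters are hypothetical):

import asyncio
from playwright.async_api import async_playwright
from media_platform.xhs.playwright_sign import sign_with_playwright

async def main() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(channel="chrome", headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://www.xiaohongshu.com")  # page must define window.mnsv2
        cookies = await context.cookies()
        a1 = next((c["value"] for c in cookies if c["name"] == "a1"), "")
        headers = await sign_with_playwright(
            page=page,
            uri="/api/sns/web/v1/search/notes",  # hypothetical request
            data={"keyword": "coffee", "page": 1},
            a1=a1,
        )
        print(headers)  # x-s, x-t, x-s-common, x-b3-traceid
        await browser.close()

asyncio.run(main())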

media_platform/xhs/xhs_sign.py
@@ -0,0 +1,152 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/xhs/xhs_sign.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
+# Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
+# 1. It must not be used for any commercial purposes.
+# 2. Usage must comply with the target platform's terms of service and robots.txt rules.
+# 3. No large-scale crawling, and no operational disruption to the platform.
+# 4. Request rates must be kept reasonable to avoid placing unnecessary load on the platform.
+# 5. It must not be used for any illegal or improper purposes.
+#
+# See the LICENSE file in the project root for the full license terms.
+# By using this code you agree to the principles above and to all terms in the LICENSE.
+
+# Core functions of the xiaohongshu signature algorithm
+# Used to generate signatures via playwright injection
+import ctypes
+import random
+from urllib.parse import quote
+
+# Custom Base64 alphabet
+# Standard Base64: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
+# xiaohongshu shuffles the order for obfuscation
+BASE64_CHARS = list("ZmserbBoHQtNP+wOcza/LpngG8yJq42KWYj0DSfdikx3VT16IlUAFM97hECvuRX5")
+
+# CRC32 lookup table
+CRC32_TABLE = [
+    0, 1996959894, 3993919788, 2567524794, 124634137, 1886057615, 3915621685,
+    2657392035, 249268274, 2044508324, 3772115230, 2547177864, 162941995,
+    2125561021, 3887607047, 2428444049, 498536548, 1789927666, 4089016648,
+    2227061214, 450548861, 1843258603, 4107580753, 2211677639, 325883990,
+    1684777152, 4251122042, 2321926636, 335633487, 1661365465, 4195302755,
+    2366115317, 997073096, 1281953886, 3579855332, 2724688242, 1006888145,
+    1258607687, 3524101629, 2768942443, 901097722, 1119000684, 3686517206,
+    2898065728, 853044451, 1172266101, 3705015759, 2882616665, 651767980,
+    1373503546, 3369554304, 3218104598, 565507253, 1454621731, 3485111705,
+    3099436303, 671266974, 1594198024, 3322730930, 2970347812, 795835527,
+    1483230225, 3244367275, 3060149565, 1994146192, 31158534, 2563907772,
+    4023717930, 1907459465, 112637215, 2680153253, 3904427059, 2013776290,
+    251722036, 2517215374, 3775830040, 2137656763, 141376813, 2439277719,
+    3865271297, 1802195444, 476864866, 2238001368, 4066508878, 1812370925,
+    453092731, 2181625025, 4111451223, 1706088902, 314042704, 2344532202,
+    4240017532, 1658658271, 366619977, 2362670323, 4224994405, 1303535960,
+    984961486, 2747007092, 3569037538, 1256170817, 1037604311, 2765210733,
+    3554079995, 1131014506, 879679996, 2909243462, 3663771856, 1141124467,
+    855842277, 2852801631, 3708648649, 1342533948, 654459306, 3188396048,
+    3373015174, 1466479909, 544179635, 3110523913, 3462522015, 1591671054,
+    702138776, 2966460450, 3352799412, 1504918807, 783551873, 3082640443,
+    3233442989, 3988292384, 2596254646, 62317068, 1957810842, 3939845945,
+    2647816111, 81470997, 1943803523, 3814918930, 2489596804, 225274430,
+    2053790376, 3826175755, 2466906013, 167816743, 2097651377, 4027552580,
+    2265490386, 503444072, 1762050814, 4150417245, 2154129355, 426522225,
+    1852507879, 4275313526, 2312317920, 282753626, 1742555852, 4189708143,
+    2394877945, 397917763, 1622183637, 3604390888, 2714866558, 953729732,
+    1340076626, 3518719985, 2797360999, 1068828381, 1219638859, 3624741850,
+    2936675148, 906185462, 1090812512, 3747672003, 2825379669, 829329135,
+    1181335161, 3412177804, 3160834842, 628085408, 1382605366, 3423369109,
+    3138078467, 570562233, 1426400815, 3317316542, 2998733608, 733239954,
+    1555261956, 3268935591, 3050360625, 752459403, 1541320221, 2607071920,
+    3965973030, 1969922972, 40735498, 2617837225, 3943577151, 1913087877,
+    83908371, 2512341634, 3803740692, 2075208622, 213261112, 2463272603,
+    3855990285, 2094854071, 198958881, 2262029012, 4057260610, 1759359992,
+    534414190, 2176718541, 4139329115, 1873836001, 414664567, 2282248934,
+    4279200368, 1711684554, 285281116, 2405801727, 4167216745, 1634467795,
+    376229701, 2685067896, 3608007406, 1308918612, 956543938, 2808555105,
+    3495958263, 1231636301, 1047427035, 2932959818, 3654703836, 1088359270,
+    936918000, 2847714899, 3736837829, 1202900863, 817233897, 3183342108,
+    3401237130, 1404277552, 615818150, 3134207493, 3453421203, 1423857449,
+    601450431, 3009837614, 3294710456, 1567103746, 711928724, 3020668471,
+    3272380065, 1510334235, 755167117,
+]
+
+
+def _right_shift_unsigned(num: int, bit: int = 0) -> int:
+    """Python implementation of JavaScript's unsigned right shift (>>>)"""
+    val = ctypes.c_uint32(num).value >> bit
+    MAX32INT = 4294967295
+    return (val + (MAX32INT + 1)) % (2 * (MAX32INT + 1)) - MAX32INT - 1
+
+
+def mrc(e: str) -> int:
+    """CRC32 variant, used for the x9 field of x-s-common"""
+    o = -1
+    for n in range(min(57, len(e))):
+        o = CRC32_TABLE[(o & 255) ^ ord(e[n])] ^ _right_shift_unsigned(o, 8)
+    return o ^ -1 ^ 3988292384
+
+
+def _triplet_to_base64(e: int) -> str:
+    """Convert a 24-bit integer into 4 Base64 characters"""
+    return (
+        BASE64_CHARS[(e >> 18) & 63]
+        + BASE64_CHARS[(e >> 12) & 63]
+        + BASE64_CHARS[(e >> 6) & 63]
+        + BASE64_CHARS[e & 63]
+    )
+
+
+def _encode_chunk(data: list, start: int, end: int) -> str:
+    """Encode a chunk of data"""
+    result = []
+    for i in range(start, end, 3):
+        c = ((data[i] << 16) & 0xFF0000) + ((data[i + 1] << 8) & 0xFF00) + (data[i + 2] & 0xFF)
+        result.append(_triplet_to_base64(c))
+    return "".join(result)
+
+
+def encode_utf8(s: str) -> list:
+    """Encode a string into a list of UTF-8 bytes"""
+    encoded = quote(s, safe="~()*!.'")
+    result = []
+    i = 0
+    while i < len(encoded):
+        if encoded[i] == "%":
+            result.append(int(encoded[i + 1: i + 3], 16))
+            i += 3
+        else:
+            result.append(ord(encoded[i]))
+            i += 1
+    return result
+
+
+def b64_encode(data: list) -> str:
+    """Custom Base64 encoding"""
+    length = len(data)
+    remainder = length % 3
+    chunks = []
+    main_length = length - remainder
+    for i in range(0, main_length, 16383):
+        chunks.append(_encode_chunk(data, i, min(i + 16383, main_length)))
+    if remainder == 1:
+        a = data[length - 1]
+        chunks.append(BASE64_CHARS[a >> 2] + BASE64_CHARS[(a << 4) & 63] + "==")
+    elif remainder == 2:
+        a = (data[length - 2] << 8) + data[length - 1]
+        chunks.append(
+            BASE64_CHARS[a >> 10] + BASE64_CHARS[(a >> 4) & 63] + BASE64_CHARS[(a << 2) & 63] + "="
+        )
+    return "".join(chunks)
+
+
+def get_trace_id() -> str:
+    """Generate a trace id for request tracing"""
+    return "".join(random.choice("abcdef0123456789") for _ in range(16))

media_platform/zhihu/__init__.py
@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/zhihu/__init__.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
 # 1. It must not be used for any commercial purposes.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.

media_platform/zhihu/client.py
@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/zhihu/client.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
 # 1. It must not be used for any commercial purposes.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.
@@ -11,7 +20,7 @@
 # -*- coding: utf-8 -*-
 import asyncio
 import json
-from typing import Any, Callable, Dict, List, Optional, Union
+from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Union
 from urllib.parse import urlencode
 import httpx
@@ -23,14 +32,18 @@
 from base.base_crawler import AbstractApiClient
 from constant import zhihu as zhihu_constant
 from model.m_zhihu import ZhihuComment, ZhihuContent, ZhihuCreator
+from proxy.proxy_mixin import ProxyRefreshMixin
 from tools import utils
+if TYPE_CHECKING:
+    from proxy.proxy_ip_pool import ProxyIpPool
 from .exception import DataFetchError, ForbiddenError
 from .field import SearchSort, SearchTime, SearchType
 from .help import ZhihuExtractor, sign
-class ZhiHuClient(AbstractApiClient):
+class ZhiHuClient(AbstractApiClient, ProxyRefreshMixin):
     def __init__(
         self,
@@ -40,12 +53,15 @@ class ZhiHuClient(AbstractApiClient):
         headers: Dict[str, str],
         playwright_page: Page,
         cookie_dict: Dict[str, str],
+        proxy_ip_pool: Optional["ProxyIpPool"] = None,
     ):
         self.proxy = proxy
         self.timeout = timeout
         self.default_headers = headers
         self.cookie_dict = cookie_dict
         self._extractor = ZhihuExtractor()
+        # Initialize the proxy pool (from ProxyRefreshMixin)
+        self.init_proxy_pool(proxy_ip_pool)
     async def _pre_headers(self, url: str) -> Dict:
         """
@@ -76,6 +92,9 @@ class ZhiHuClient(AbstractApiClient):
         Returns:
         """
+        # Check whether the proxy has expired before each request
+        await self._refresh_proxy_if_expired()
         # return response.text
         return_response = kwargs.pop('return_response', False)
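proxy/proxy_mixin.py itself is not part of this diff. Judging only from the call sites (init_proxy_pool in the constructors, _refresh_proxy_if_expired before each request), a plausible sketch of the mixin might look as follows; the expiry bookkeeping and the IpInfoModel field names are assumptions, not the project's actual implementation:

import time
from typing import Optional, TYPE_CHECKING

if TYPE_CHECKING:
    from proxy.proxy_ip_pool import ProxyIpPool

class ProxyRefreshMixin:
    """Hypothetical reconstruction; the real code lives in proxy/proxy_mixin.py."""

    def init_proxy_pool(self, proxy_ip_pool: Optional["ProxyIpPool"]) -> None:
        # Store the pool; passing None disables automatic refresh.
        self._proxy_ip_pool = proxy_ip_pool
        self._current_ip = None

    async def _refresh_proxy_if_expired(self) -> None:
        # Called before every request: if the current IP is missing or past
        # its expiry timestamp, fetch a replacement and rebuild self.proxy.
        if self._proxy_ip_pool is None:
            return
        ip = self._current_ip
        if ip is None or getattr(ip, "expired_time_ts", 0) <= int(time.time()):  # assumed field name
            ip = await self._proxy_ip_pool.get_proxy()
            self._current_ip = ip
            self.proxy = f"http://{ip.ip}:{ip.port}"  # assumed IpInfoModel fields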

media_platform/zhihu/core.py
@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/zhihu/core.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
 # 1. It must not be used for any commercial purposes.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.
@@ -52,6 +61,7 @@ class ZhihuCrawler(AbstractCrawler):
         self.user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
         self._extractor = ZhihuExtractor()
         self.cdp_manager = None
+        self.ip_proxy_pool = None  # proxy IP pool, used for automatic proxy refresh
     async def start(self) -> None:
         """
@@ -61,10 +71,10 @@ class ZhihuCrawler(AbstractCrawler):
         """
         playwright_proxy_format, httpx_proxy_format = None, None
         if config.ENABLE_IP_PROXY:
-            ip_proxy_pool = await create_ip_pool(
+            self.ip_proxy_pool = await create_ip_pool(
                 config.IP_PROXY_POOL_COUNT, enable_validate_ip=True
             )
-            ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy()
+            ip_proxy_info: IpInfoModel = await self.ip_proxy_pool.get_proxy()
             playwright_proxy_format, httpx_proxy_format = utils.format_proxy_info(
                 ip_proxy_info
             )
@@ -402,6 +412,7 @@ class ZhihuCrawler(AbstractCrawler):
             },
             playwright_page=self.context_page,
             cookie_dict=cookie_dict,
+            proxy_ip_pool=self.ip_proxy_pool,  # pass the proxy pool for automatic refresh
         )
         return zhihu_client_obj
@@ -429,10 +440,11 @@ class ZhihuCrawler(AbstractCrawler):
                 proxy=playwright_proxy,  # type: ignore
                 viewport={"width": 1920, "height": 1080},
                 user_agent=user_agent,
+                channel="chrome",  # use the system's stable Chrome channel
             )
             return browser_context
         else:
-            browser = await chromium.launch(headless=headless, proxy=playwright_proxy)  # type: ignore
+            browser = await chromium.launch(headless=headless, proxy=playwright_proxy, channel="chrome")  # type: ignore
             browser_context = await browser.new_context(
                 viewport={"width": 1920, "height": 1080}, user_agent=user_agent
             )

media_platform/zhihu/exception.py
@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/zhihu/exception.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
 # 1. It must not be used for any commercial purposes.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.

media_platform/zhihu/field.py
@@ -1,3 +1,12 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025 relakkes@gmail.com
+#
+# This file is part of MediaCrawler project.
+# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/zhihu/field.py
+# GitHub: https://github.com/NanmiCoder
+# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
+#
 # Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
 # 1. It must not be used for any commercial purposes.
 # 2. Usage must comply with the target platform's terms of service and robots.txt rules.

Some files were not shown because too many files have changed in this diff.