Compare commits

87 commits · feature/co... · 51a7d94de8

Commit SHA1s:

51a7d94de8, df39d293de, 79048e265e, 94553fd818, 90f72536ba, f7d27ab43a, be5b786a74, 04fb716a44,
1f89713b90, 00a9e19139, 8a2c349d67, 4de2a325a9, 2517e51ed4, e3d7fa7bed, a59b385615, 7c240747b6,
70a6ca55bb, 57b688fea4, ee4539c8fa, c895f53e22, 99db95c499, 483c5ec8c6, c56b8c4c5d, a47c119303,
157ddfb21b, 1544d13dd5, 55d8c7783f, ff1b681311, 11500ef57a, b9663c6a6d, 1a38ae12bd, 4ceb94f9c8,
508675a251, eb66e57f60, a8930555ac, fb66ef016d, 26c511e35f, 08fcf68b98, 2426095123, 3c75d4f1d0,
332a07ce62, 8a0fd49b96, 9ade3b3eef, 2600c48359, ff9a1624f1, 630d4c1614, f14242c239, 29832ded91,
11f2802624, ab19494883, 2bc9297812, ba64c8ff9c, ebbf86d67b, 6e858c1a00, 324f09cf9f, 46ef86ddef,
31a092c653, f989ce0788, 15b98fa511, f1e7124654, 6eef02d08c, 1da347cbf8, 422cc92dd1, 13d2302c9c,
ff8c92daad, 5288bddb42, 6dcfd7e0a5, e89a6d5781, a1c5e07df8, b6caa7a85e, 1e3637f238, b5dab6d1e8,
54f23b8d1c, 58eb89f073, 7888f4c6bd, b61ec54a72, 60cbb3e37d, 05a1782746, ef6948b305, 45ec4b433a,
0074e975dd, 889fa01466, 3f5925e326, ed6e0bfb5f, 26a261bc09, 03e384bbe2, 56bf5d226f
.env.example (new file, +46 lines)

```ini
# MySQL Configuration
MYSQL_DB_PWD=123456
MYSQL_DB_USER=root
MYSQL_DB_HOST=localhost
MYSQL_DB_PORT=3306
MYSQL_DB_NAME=media_crawler

# Redis Configuration
REDIS_DB_HOST=127.0.0.1
REDIS_DB_PWD=123456
REDIS_DB_PORT=6379
REDIS_DB_NUM=0

# MongoDB Configuration
MONGODB_HOST=localhost
MONGODB_PORT=27017
MONGODB_USER=
MONGODB_PWD=
MONGODB_DB_NAME=media_crawler

# PostgreSQL Configuration
POSTGRES_DB_PWD=123456
POSTGRES_DB_USER=postgres
POSTGRES_DB_HOST=localhost
POSTGRES_DB_PORT=5432
POSTGRES_DB_NAME=media_crawler

# Proxy Configuration (Wandou HTTP)
# your_wandou_http_app_key
WANDOU_APP_KEY=

# Proxy Configuration (Kuaidaili)
# your_kuaidaili_secret_id
KDL_SECERT_ID=
# your_kuaidaili_signature
KDL_SIGNATURE=
# your_kuaidaili_username
KDL_USER_NAME=
# your_kuaidaili_password
KDL_USER_PWD=

# Proxy Configuration (Jisu HTTP)
# Get JiSu HTTP IP extraction key value
jisu_key=
# Get JiSu HTTP IP extraction encryption signature
jisu_crypto=
```
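These variables are read from the process environment at runtime. As a minimal sketch (assuming the usual workflow of exporting the file or loading it with a tool such as python-dotenv, which is not part of this diff), the MySQL settings above could be consumed like this; the variable names come from the file, the helper itself is illustrative only:

```python
import os

# Hypothetical helper: read the MySQL settings defined in .env.example.
# Defaults mirror the sample values in the file above.
def mysql_config() -> dict:
    return {
        "host": os.getenv("MYSQL_DB_HOST", "localhost"),
        "port": int(os.getenv("MYSQL_DB_PORT", "3306")),
        "user": os.getenv("MYSQL_DB_USER", "root"),
        "password": os.getenv("MYSQL_DB_PWD", "123456"),
        "database": os.getenv("MYSQL_DB_NAME", "media_crawler"),
    }

if __name__ == "__main__":
    print(mysql_config())
```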
.github/CODEOWNERS (new file, vendored, +18 lines)

```
# 默认:仓库所有文件都需要 @NanmiCoder 审核
* @NanmiCoder


.github/workflows/** @NanmiCoder


requirements.txt @NanmiCoder
pyproject.toml @NanmiCoder
Pipfile @NanmiCoder
package.json @NanmiCoder
package-lock.json @NanmiCoder
pnpm-lock.yaml @NanmiCoder


Dockerfile @NanmiCoder
docker/** @NanmiCoder
scripts/deploy/** @NanmiCoder
```
.gitignore (vendored, 2 changed lines)

```
@@ -178,4 +178,4 @@ docs/.vitepress/cache
agent_zone
debug_tools

database/*.db
database/*.db
```
.pre-commit-config.yaml (new file, +46 lines)

```yaml
# Pre-commit hooks configuration for MediaCrawler project
# See https://pre-commit.com for more information

repos:
  # Local hooks
  - repo: local
    hooks:
      # Python file header copyright check
      - id: check-file-headers
        name: Check Python file headers
        entry: python tools/file_header_manager.py --check
        language: system
        types: [python]
        pass_filenames: true
        stages: [pre-commit]

      # Auto-fix Python file headers
      - id: add-file-headers
        name: Add copyright headers to Python files
        entry: python tools/file_header_manager.py
        language: system
        types: [python]
        pass_filenames: true
        stages: [pre-commit]

  # Standard pre-commit hooks (optional, can be enabled later)
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
        exclude: ^(.*\.md|.*\.txt)$
      - id: end-of-file-fixer
        exclude: ^(.*\.md|.*\.txt)$
      - id: check-yaml
      - id: check-added-large-files
        args: ['--maxkb=10240']  # 10MB limit
      - id: check-merge-conflict
      - id: check-case-conflict
      - id: mixed-line-ending

# Global configuration
default_language_version:
  python: python3

# Run hooks on all files during manual run
# Usage: pre-commit run --all-files
```
README.md (modified, 132 changed lines)

````markdown
@@ -53,6 +53,7 @@
- **无需JS逆向**:利用保留登录态的浏览器上下文环境,通过 JS 表达式获取签名参数
- **优势特点**:无需逆向复杂的加密算法,大幅降低技术门槛

## ✨ 功能特性
| 平台 | 关键词搜索 | 指定帖子ID爬取 | 二级评论 | 指定创作者主页 | 登录态缓存 | IP代理池 | 生成评论词云图 |
| ------ | ---------- | -------------- | -------- | -------------- | ---------- | -------- | -------------- |
@@ -66,7 +67,8 @@

### 🚀 MediaCrawlerPro 重磅发布!
<details>
<summary>🚀 <strong>MediaCrawlerPro 重磅发布!开源不易,欢迎订阅支持</strong></summary>

> 专注于学习成熟项目的架构设计,不仅仅是爬虫技术,Pro 版本的代码设计思路同样值得深入学习!

@@ -90,10 +92,12 @@

点击查看:[MediaCrawlerPro 项目主页](https://github.com/MediaCrawlerPro) 更多介绍

</details>

## 🚀 快速开始

> 💡 **开源不易,如果这个项目对您有帮助,请给个 ⭐ Star 支持一下!**
> 💡 **如果这个项目对您有帮助,请给个 ⭐ Star 支持一下!**

## 📋 前置依赖

@@ -129,15 +133,10 @@ uv sync
uv run playwright install
```

> **💡 提示**:MediaCrawler 目前已经支持使用 playwright 连接你本地的 Chrome 浏览器了,一些因为 Webdriver 导致的问题迎刃而解了。
>
> 目前开放了 `xhs` 和 `dy` 这两个使用 CDP 的方式连接本地浏览器,如有需要,查看 `config/base_config.py` 中的配置项。

## 🚀 运行爬虫程序

```shell
# 项目默认是没有开启评论爬取模式,如需评论请在 config/base_config.py 中的 ENABLE_GET_COMMENTS 变量修改
# 一些其他支持项,也可以在 config/base_config.py 查看功能,写的有中文注释
# 在 config/base_config.py 查看配置项目功能,写的有中文注释

# 从配置文件中读取关键词搜索相关的帖子并爬取帖子信息与评论
uv run main.py --platform xhs --lt qrcode --type search
@@ -151,6 +150,37 @@ uv run main.py --platform xhs --lt qrcode --type detail
uv run main.py --help
```

## WebUI支持

<details>
<summary>🖥️ <strong>WebUI 可视化操作界面</strong></summary>

MediaCrawler 提供了基于 Web 的可视化操作界面,无需命令行也能轻松使用爬虫功能。

#### 启动 WebUI 服务

```shell
# 启动 API 服务器(默认端口 8080)
uv run uvicorn api.main:app --port 8080 --reload

# 或者使用模块方式启动
uv run python -m api.main
```

启动成功后,访问 `http://localhost:8080` 即可打开 WebUI 界面。

#### WebUI 功能特性

- 可视化配置爬虫参数(平台、登录方式、爬取类型等)
- 实时查看爬虫运行状态和日志
- 数据预览和导出

#### 界面预览

<img src="docs/static/images/img_8.png" alt="WebUI 界面预览">

</details>

<details>
<summary>🔗 <strong>使用 Python 原生 venv 管理环境(不推荐)</strong></summary>

@@ -163,7 +193,7 @@ uv run main.py --help
cd MediaCrawler

# 创建虚拟环境
# 我的 python 版本是:3.9.6,requirements.txt 中的库是基于这个版本的
# 我的 python 版本是:3.11 requirements.txt 中的库是基于这个版本的
# 如果是其他 python 版本,可能 requirements.txt 中的库不兼容,需自行解决
python -m venv venv

@@ -209,45 +239,18 @@ python main.py --help

## 💾 数据保存

支持多种数据存储方式:
- **CSV 文件**:支持保存到 CSV 中(`data/` 目录下)
- **JSON 文件**:支持保存到 JSON 中(`data/` 目录下)
- **数据库存储**
  - 使用参数 `--init_db` 进行数据库初始化(使用`--init_db`时不需要携带其他optional)
  - **SQLite 数据库**:轻量级数据库,无需服务器,适合个人使用(推荐)
    1. 初始化:`--init_db sqlite`
    2. 数据存储:`--save_data_option sqlite`
  - **MySQL 数据库**:支持关系型数据库 MySQL 中保存(需要提前创建数据库)
    1. 初始化:`--init_db mysql`
    2. 数据存储:`--save_data_option db`(db 参数为兼容历史更新保留)
MediaCrawler 支持多种数据存储方式,包括 CSV、JSON、Excel、SQLite 和 MySQL 数据库。

📖 **详细使用说明请查看:[数据存储指南](docs/data_storage_guide.md)**

### 使用示例:
```shell
# 初始化 SQLite 数据库(使用'--init_db'时不需要携带其他optional)
uv run main.py --init_db sqlite
# 使用 SQLite 存储数据(推荐个人用户使用)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
```
```shell
# 初始化 MySQL 数据库
uv run main.py --init_db mysql
# 使用 MySQL 存储数据(为适配历史更新,db参数进行沿用)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
```

[🚀 MediaCrawlerPro 重磅发布 🚀!更多的功能,更好的架构设计!](https://github.com/MediaCrawlerPro)
[🚀 MediaCrawlerPro 重磅发布 🚀!更多的功能,更好的架构设计!开源不易,欢迎订阅支持!](https://github.com/MediaCrawlerPro)

### 💬 交流群组
- **微信交流群**:[点击加入](https://nanmicoder.github.io/MediaCrawler/%E5%BE%AE%E4%BF%A1%E4%BA%A4%E6%B5%81%E7%BE%A4.html)
- **B站账号**:[关注我](https://space.bilibili.com/434377496),分享AI与爬虫技术知识

### 📚 其他
- **常见问题**:[MediaCrawler 完整文档](https://nanmicoder.github.io/MediaCrawler/)
- **爬虫入门教程**:[CrawlerTutorial 免费教程](https://github.com/NanmiCoder/CrawlerTutorial)
- **新闻爬虫开源项目**:[NewsCrawlerCollection](https://github.com/NanmiCoder/NewsCrawlerCollection)
---

### 💰 赞助商展示

@@ -259,39 +262,21 @@ uv run main.py --platform xhs --lt qrcode --type search --save_data_option db

---

<p align="center">
<a href="https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad">
<img style="border-radius:20px" width="500" alt="TikHub IO_Banner zh" src="docs/static/images/tikhub_banner_zh.png">
</a>
</p>

[TikHub](https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad) 提供超过 **700 个端点**,可用于从 **14+ 个社交媒体平台** 获取与分析数据 —— 包括视频、用户、评论、商店、商品与趋势等,一站式完成所有数据访问与分析。

通过每日签到,可以获取免费额度。可以使用我的注册链接:[https://user.tikhub.io/users/signup?referral_code=cfzyejV9](https://user.tikhub.io/users/signup?referral_code=cfzyejV9&utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad) 或使用邀请码:`cfzyejV9`,注册并充值即可获得 **$2 免费额度**。

[TikHub](https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad) 提供以下服务:

- 🚀 丰富的社交媒体数据接口(TikTok、Douyin、XHS、YouTube、Instagram等)
- 💎 每日签到免费领取额度
- ⚡ 高成功率与高并发支持
- 🌐 官网:[https://tikhub.io/](https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad)
- 💻 GitHub地址:[https://github.com/TikHubIO/](https://github.com/TikHubIO/)
<a href="https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad">
<img width="500" src="docs/static/images/tikhub_banner_zh.png">
<br>
TikHub.io 提供 900+ 高稳定性数据接口,覆盖 TK、DY、XHS、Y2B、Ins、X 等 14+ 海内外主流平台,支持用户、内容、商品、评论等多维度公开数据 API,并配套 4000 万+ 已清洗结构化数据集,使用邀请码 <code>cfzyejV9</code> 注册并充值,即可额外获得 $2 赠送额度。
</a>

---
<p align="center">
<a href="https://app.nstbrowser.io/account/register?utm_source=official&utm_term=mediacrawler">
<img style="border-radius:20px" alt="NstBrowser Banner " src="docs/static/images/nstbrowser.jpg">
</a>
</p>

Nstbrowser 指纹浏览器 — 多账号运营&自动化管理的最佳解决方案
<a href="https://www.thordata.com/?ls=github&lk=mediacrawler">
<img width="500" src="docs/static/images/Thordata.png">
<br>
多账号安全管理与会话隔离;指纹定制结合反检测浏览器环境,兼顾真实度与稳定性;覆盖店铺管理、电商监控、社媒营销、广告验证、Web3、投放监控与联盟营销等业务线;提供生产级并发与定制化企业服务;提供可一键部署的云端浏览器方案,配套全球高质量 IP 池,为您构建长期行业竞争力
Thordata:可靠且经济高效的代理服务提供商。为企业和开发者提供稳定、高效且合规的全球代理 IP 服务。立即注册,赠送1GB住宅代理免费试用和2000次serp-api调用。
</a>
<br>
[点击此处即刻开始免费使用](https://app.nstbrowser.io/account/register?utm_source=official&utm_term=mediacrawler)
<br>
使用 NSTBROWSER 可获得 10% 充值赠礼

<a href="https://www.thordata.com/products/residential-proxies/?ls=github&lk=mediacrawler">【住宅代理】</a> | <a href="https://www.thordata.com/products/web-scraper/?ls=github&lk=mediacrawler">【serp-api】</a>

### 🤝 成为赞助者

@@ -301,9 +286,14 @@ Nstbrowser 指纹浏览器 — 多账号运营&自动化管理的最佳解决方

**联系方式**:
- 微信:`relakkes`
- 邮箱:`relakkes@gmail.com`

---

### 📚 其他
- **常见问题**:[MediaCrawler 完整文档](https://nanmicoder.github.io/MediaCrawler/)
- **爬虫入门教程**:[CrawlerTutorial 免费教程](https://github.com/NanmiCoder/CrawlerTutorial)
- **新闻爬虫开源项目**:[NewsCrawlerCollection](https://github.com/NanmiCoder/NewsCrawlerCollection)

## ⭐ Star 趋势图

如果这个项目对您有帮助,请给个 ⭐ Star 支持一下,让更多的人看到 MediaCrawler!

@@ -311,9 +301,9 @@ Nstbrowser 指纹浏览器 — 多账号运营&自动化管理的最佳解决方

[](https://star-history.com/#NanmiCoder/MediaCrawler&Date)

## 📚 参考

- **小红书签名仓库**:[Cloxl 的 xhs 签名仓库](https://github.com/Cloxl/xhshow)
- **小红书客户端**:[ReaJason 的 xhs 仓库](https://github.com/ReaJason/xhs)
- **短信转发**:[SmsForwarder 参考仓库](https://github.com/pppscn/SmsForwarder)
- **内网穿透工具**:[ngrok 官方文档](https://ngrok.com/docs/)
````
README_en.md (modified, 124 changed lines)

````markdown
@@ -148,6 +148,37 @@ uv run main.py --platform xhs --lt qrcode --type detail
uv run main.py --help
```

## WebUI Support

<details>
<summary>🖥️ <strong>WebUI Visual Operation Interface</strong></summary>

MediaCrawler provides a web-based visual operation interface, allowing you to easily use crawler features without command line.

#### Start WebUI Service

```shell
# Start API server (default port 8080)
uv run uvicorn api.main:app --port 8080 --reload

# Or start using module method
uv run python -m api.main
```

After successful startup, visit `http://localhost:8080` to open the WebUI interface.

#### WebUI Features

- Visualize crawler parameter configuration (platform, login method, crawling type, etc.)
- Real-time view of crawler running status and logs
- Data preview and export

#### Interface Preview

<img src="docs/static/images/img_8.png" alt="WebUI Interface Preview">

</details>

<details>
<summary>🔗 <strong>Using Python native venv environment management (Not recommended)</strong></summary>

@@ -206,76 +237,45 @@ python main.py --help

## 💾 Data Storage

Supports multiple data storage methods:
- **CSV Files**: Supports saving to CSV (under `data/` directory)
- **JSON Files**: Supports saving to JSON (under `data/` directory)
- **Database Storage**
  - Use the `--init_db` parameter for database initialization (when using `--init_db`, no other optional arguments are needed)
  - **SQLite Database**: Lightweight database, no server required, suitable for personal use (recommended)
    1. Initialization: `--init_db sqlite`
    2. Data Storage: `--save_data_option sqlite`
  - **MySQL Database**: Supports saving to relational database MySQL (database needs to be created in advance)
    1. Initialization: `--init_db mysql`
    2. Data Storage: `--save_data_option db` (the db parameter is retained for compatibility with historical updates)
MediaCrawler supports multiple data storage methods, including CSV, JSON, Excel, SQLite, and MySQL databases.

### Usage Examples:
```shell
# Initialize SQLite database (when using '--init_db', no other optional arguments are needed)
uv run main.py --init_db sqlite
# Use SQLite to store data (recommended for personal users)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
```
```shell
# Initialize MySQL database
uv run main.py --init_db mysql
# Use MySQL to store data (the db parameter is retained for compatibility with historical updates)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
```
📖 **For detailed usage instructions, please see: [Data Storage Guide](docs/data_storage_guide.md)**

---

[🚀 MediaCrawlerPro Major Release 🚀! More features, better architectural design!](https://github.com/MediaCrawlerPro)

## 🤝 Community & Support

### 💬 Discussion Groups
- **WeChat Discussion Group**: [Click to join](https://nanmicoder.github.io/MediaCrawler/%E5%BE%AE%E4%BF%A1%E4%BA%A4%E6%B5%81%E7%BE%A4.html)
- **Bilibili Account**: [Follow me](https://space.bilibili.com/434377496), sharing AI and crawler technology knowledge

### 📚 Documentation & Tutorials
- **Online Documentation**: [MediaCrawler Complete Documentation](https://nanmicoder.github.io/MediaCrawler/)
- **Crawler Tutorial**: [CrawlerTutorial Free Tutorial](https://github.com/NanmiCoder/CrawlerTutorial)

# Other common questions can be viewed in the online documentation
>
> The online documentation includes usage methods, common questions, joining project discussion groups, etc.
> [MediaCrawler Online Documentation](https://nanmicoder.github.io/MediaCrawler/)
>

# Author's Knowledge Services
> If you want to quickly get started and learn the usage of this project, source code architectural design, learn programming technology, or want to understand the source code design of MediaCrawlerPro, you can check out my paid knowledge column.

[Author's Paid Knowledge Column Introduction](https://nanmicoder.github.io/MediaCrawler/%E7%9F%A5%E8%AF%86%E4%BB%98%E8%B4%B9%E4%BB%8B%E7%BB%8D.html)

---

## ⭐ Star Trend Chart

If this project helps you, please give a ⭐ Star to support and let more people see MediaCrawler!

[](https://star-history.com/#NanmiCoder/MediaCrawler&Date)

### 💰 Sponsor Display

<a href="https://www.swiftproxy.net/?ref=nanmi">
<img src="docs/static/images/img_5.png">
<a href="https://h.wandouip.com">
<img src="docs/static/images/img_8.jpg">
<br>
**Swiftproxy** - 90M+ global high-quality pure residential IPs, register to get free 500MB test traffic, dynamic traffic never expires!
> Exclusive discount code: **GHB5** Get 10% off instantly!
WandouHTTP - Self-operated tens of millions IP resource pool, IP purity ≥99.8%, daily high-frequency IP updates, fast response, stable connection, supports multiple business scenarios, customizable on demand, register to get 10000 free IPs.
</a>

---

<a href="https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad">
<img width="500" src="docs/static/images/tikhub_banner_zh.png">
<br>
TikHub.io provides 900+ highly stable data interfaces, covering 14+ mainstream domestic and international platforms including TK, DY, XHS, Y2B, Ins, X, etc. Supports multi-dimensional public data APIs for users, content, products, comments, etc., with 40M+ cleaned structured datasets. Use invitation code <code>cfzyejV9</code> to register and recharge, and get an additional $2 bonus.
</a>

---

<a href="https://www.thordata.com/?ls=github&lk=mediacrawler">
<img width="500" src="docs/static/images/Thordata.png">
<br>
Thordata: Reliable and cost-effective proxy service provider. Provides stable, efficient and compliant global proxy IP services for enterprises and developers. Register now to get 1GB free residential proxy trial and 2000 serp-api calls.
</a>
<br>
<a href="https://www.thordata.com/products/residential-proxies/?ls=github&lk=mediacrawler">【Residential Proxies】</a> | <a href="https://www.thordata.com/products/web-scraper/?ls=github&lk=mediacrawler">【serp-api】</a>

### 🤝 Become a Sponsor

@@ -284,10 +284,24 @@ Become a sponsor and showcase your product here, getting massive exposure daily!

**Contact Information**:
- WeChat: `relakkes`
- Email: `relakkes@gmail.com`
---

### 📚 Other
- **FAQ**: [MediaCrawler Complete Documentation](https://nanmicoder.github.io/MediaCrawler/)
- **Crawler Beginner Tutorial**: [CrawlerTutorial Free Tutorial](https://github.com/NanmiCoder/CrawlerTutorial)
- **News Crawler Open Source Project**: [NewsCrawlerCollection](https://github.com/NanmiCoder/NewsCrawlerCollection)

## ⭐ Star Trend Chart

If this project helps you, please give a ⭐ Star to support and let more people see MediaCrawler!

[](https://star-history.com/#NanmiCoder/MediaCrawler&Date)

## 📚 References

- **Xiaohongshu Signature Repository**: [Cloxl's xhs signature repository](https://github.com/Cloxl/xhshow)
- **Xiaohongshu Client**: [ReaJason's xhs repository](https://github.com/ReaJason/xhs)
- **SMS Forwarding**: [SmsForwarder reference repository](https://github.com/pppscn/SmsForwarder)
- **Intranet Penetration Tool**: [ngrok official documentation](https://ngrok.com/docs/)
````
README_es.md (modified, 125 changed lines)

````markdown
@@ -149,6 +149,37 @@ uv run main.py --platform xhs --lt qrcode --type detail
uv run main.py --help
```

## Soporte WebUI

<details>
<summary>🖥️ <strong>Interfaz de Operación Visual WebUI</strong></summary>

MediaCrawler proporciona una interfaz de operación visual basada en web, permitiéndole usar fácilmente las funciones del rastreador sin línea de comandos.

#### Iniciar Servicio WebUI

```shell
# Iniciar servidor API (puerto predeterminado 8080)
uv run uvicorn api.main:app --port 8080 --reload

# O iniciar usando método de módulo
uv run python -m api.main
```

Después de iniciar exitosamente, visite `http://localhost:8080` para abrir la interfaz WebUI.

#### Características de WebUI

- Configuración visual de parámetros del rastreador (plataforma, método de login, tipo de rastreo, etc.)
- Vista en tiempo real del estado de ejecución del rastreador y logs
- Vista previa y exportación de datos

#### Vista Previa de la Interfaz

<img src="docs/static/images/img_8.png" alt="Vista Previa de Interfaz WebUI">

</details>

<details>
<summary>🔗 <strong>Usando gestión de entorno venv nativo de Python (No recomendado)</strong></summary>

@@ -207,76 +238,46 @@ python main.py --help

## 💾 Almacenamiento de Datos

Soporta múltiples métodos de almacenamiento de datos:
- **Archivos CSV**: Soporta guardar en CSV (bajo el directorio `data/`)
- **Archivos JSON**: Soporta guardar en JSON (bajo el directorio `data/`)
- **Almacenamiento en Base de Datos**
  - Use el parámetro `--init_db` para la inicialización de la base de datos (cuando use `--init_db`, no se necesitan otros argumentos opcionales)
  - **Base de Datos SQLite**: Base de datos ligera, no requiere servidor, adecuada para uso personal (recomendado)
    1. Inicialización: `--init_db sqlite`
    2. Almacenamiento de Datos: `--save_data_option sqlite`
  - **Base de Datos MySQL**: Soporta guardar en la base de datos relacional MySQL (la base de datos debe crearse con anticipación)
    1. Inicialización: `--init_db mysql`
    2. Almacenamiento de Datos: `--save_data_option db` (el parámetro db se mantiene por compatibilidad con actualizaciones históricas)
MediaCrawler soporta múltiples métodos de almacenamiento de datos, incluyendo CSV, JSON, Excel, SQLite y bases de datos MySQL.

📖 **Para instrucciones de uso detalladas, por favor vea: [Guía de Almacenamiento de Datos](docs/data_storage_guide.md)**

### Ejemplos de Uso:
```shell
# Inicializar la base de datos SQLite (cuando use '--init_db', no se necesitan otros argumentos opcionales)
uv run main.py --init_db sqlite
# Usar SQLite para almacenar datos (recomendado para usuarios personales)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
```
```shell
# Inicializar la base de datos MySQL
uv run main.py --init_db mysql
# Usar MySQL para almacenar datos (el parámetro db se mantiene por compatibilidad con actualizaciones históricas)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
```

---

[🚀 ¡Lanzamiento Mayor de MediaCrawlerPro 🚀! ¡Más características, mejor diseño arquitectónico!](https://github.com/MediaCrawlerPro)

## 🤝 Comunidad y Soporte

### 💬 Grupos de Discusión
- **Grupo de Discusión WeChat**: [Haga clic para unirse](https://nanmicoder.github.io/MediaCrawler/%E5%BE%AE%E4%BF%A1%E4%BA%A4%E6%B5%81%E7%BE%A4.html)
- **Cuenta de Bilibili**: [Sígueme](https://space.bilibili.com/434377496), compartiendo conocimientos de tecnología de IA y rastreo

### 📚 Documentación y Tutoriales
- **Documentación en Línea**: [Documentación Completa de MediaCrawler](https://nanmicoder.github.io/MediaCrawler/)
- **Tutorial de Rastreador**: [Tutorial Gratuito CrawlerTutorial](https://github.com/NanmiCoder/CrawlerTutorial)

# Otras preguntas comunes pueden verse en la documentación en línea
>
> La documentación en línea incluye métodos de uso, preguntas comunes, unirse a grupos de discusión del proyecto, etc.
> [Documentación en Línea de MediaCrawler](https://nanmicoder.github.io/MediaCrawler/)
>

# Servicios de Conocimiento del Autor
> Si quiere comenzar rápidamente y aprender el uso de este proyecto, diseño arquitectónico del código fuente, aprender tecnología de programación, o quiere entender el diseño del código fuente de MediaCrawlerPro, puede revisar mi columna de conocimiento pagado.

[Introducción de la Columna de Conocimiento Pagado del Autor](https://nanmicoder.github.io/MediaCrawler/%E7%9F%A5%E8%AF%86%E4%BB%98%E8%B4%B9%E4%BB%8B%E7%BB%8D.html)

---

## ⭐ Gráfico de Tendencia de Estrellas

¡Si este proyecto te ayuda, por favor da una ⭐ Estrella para apoyar y que más personas vean MediaCrawler!

[](https://star-history.com/#NanmiCoder/MediaCrawler&Date)

### 💰 Exhibición de Patrocinadores

<a href="https://www.swiftproxy.net/?ref=nanmi">
<img src="docs/static/images/img_5.png">
<a href="https://h.wandouip.com">
<img src="docs/static/images/img_8.jpg">
<br>
**Swiftproxy** - ¡90M+ IPs residenciales puras de alta calidad globales, regístrese para obtener 500MB de tráfico de prueba gratuito, el tráfico dinámico nunca expira!
> Código de descuento exclusivo: **GHB5** ¡Obtenga 10% de descuento instantáneamente!
WandouHTTP - Pool de recursos IP auto-operado de decenas de millones, pureza de IP ≥99.8%, actualizaciones de IP de alta frecuencia diarias, respuesta rápida, conexión estable, soporta múltiples escenarios de negocio, personalizable según demanda, regístrese para obtener 10000 IPs gratis.
</a>

---

<a href="https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad">
<img width="500" src="docs/static/images/tikhub_banner_zh.png">
<br>
TikHub.io proporciona 900+ interfaces de datos altamente estables, cubriendo 14+ plataformas principales nacionales e internacionales incluyendo TK, DY, XHS, Y2B, Ins, X, etc. Soporta APIs de datos públicos multidimensionales para usuarios, contenido, productos, comentarios, etc., con 40M+ conjuntos de datos estructurados limpios. Use el código de invitación <code>cfzyejV9</code> para registrarse y recargar, y obtenga $2 adicionales de bonificación.
</a>

---

<a href="https://www.thordata.com/?ls=github&lk=mediacrawler">
<img width="500" src="docs/static/images/Thordata.png">
<br>
Thordata: Proveedor de servicios de proxy confiable y rentable. Proporciona servicios de IP proxy global estables, eficientes y conformes para empresas y desarrolladores. Regístrese ahora para obtener 1GB de prueba gratuita de proxy residencial y 2000 llamadas serp-api.
</a>
<br>
<a href="https://www.thordata.com/products/residential-proxies/?ls=github&lk=mediacrawler">【Proxies Residenciales】</a> | <a href="https://www.thordata.com/products/web-scraper/?ls=github&lk=mediacrawler">【serp-api】</a>

### 🤝 Conviértase en Patrocinador

¡Conviértase en patrocinador y muestre su producto aquí, obteniendo exposición masiva diariamente!
@@ -284,10 +285,24 @@ uv run main.py --platform xhs --lt qrcode --type search --save_data_option db

**Información de Contacto**:
- WeChat: `relakkes`
- Email: `relakkes@gmail.com`
---

### 📚 Otros
- **Preguntas Frecuentes**: [Documentación Completa de MediaCrawler](https://nanmicoder.github.io/MediaCrawler/)
- **Tutorial de Rastreador para Principiantes**: [Tutorial Gratuito CrawlerTutorial](https://github.com/NanmiCoder/CrawlerTutorial)
- **Proyecto de Código Abierto de Rastreador de Noticias**: [NewsCrawlerCollection](https://github.com/NanmiCoder/NewsCrawlerCollection)

## ⭐ Gráfico de Tendencia de Estrellas

¡Si este proyecto te ayuda, por favor da una ⭐ Estrella para apoyar y que más personas vean MediaCrawler!

[](https://star-history.com/#NanmiCoder/MediaCrawler&Date)

## 📚 Referencias

- **Repositorio de Firma Xiaohongshu**: [Repositorio de firma xhs de Cloxl](https://github.com/Cloxl/xhshow)
- **Cliente Xiaohongshu**: [Repositorio xhs de ReaJason](https://github.com/ReaJason/xhs)
- **Reenvío de SMS**: [Repositorio de referencia SmsForwarder](https://github.com/pppscn/SmsForwarder)
- **Herramienta de Penetración de Intranet**: [Documentación oficial de ngrok](https://ngrok.com/docs/)
````
api/__init__.py (new file, +19 lines)

```python
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/__init__.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。

# WebUI API Module for MediaCrawler
```
api/main.py (new file, +186 lines)

```python
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/main.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。

"""
MediaCrawler WebUI API Server
Start command: uvicorn api.main:app --port 8080 --reload
Or: python -m api.main
"""
import asyncio
import os
import subprocess
import uvicorn
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles
from fastapi.responses import FileResponse

from .routers import crawler_router, data_router, websocket_router

app = FastAPI(
    title="MediaCrawler WebUI API",
    description="API for controlling MediaCrawler from WebUI",
    version="1.0.0"
)

# Get webui static files directory
WEBUI_DIR = os.path.join(os.path.dirname(__file__), "webui")

# CORS configuration - allow frontend dev server access
app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "http://localhost:5173",  # Vite dev server
        "http://localhost:3000",  # Backup port
        "http://127.0.0.1:5173",
        "http://127.0.0.1:3000",
    ],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Register routers
app.include_router(crawler_router, prefix="/api")
app.include_router(data_router, prefix="/api")
app.include_router(websocket_router, prefix="/api")


@app.get("/")
async def serve_frontend():
    """Return frontend page"""
    index_path = os.path.join(WEBUI_DIR, "index.html")
    if os.path.exists(index_path):
        return FileResponse(index_path)
    return {
        "message": "MediaCrawler WebUI API",
        "version": "1.0.0",
        "docs": "/docs",
        "note": "WebUI not found, please build it first: cd webui && npm run build"
    }


@app.get("/api/health")
async def health_check():
    return {"status": "ok"}


@app.get("/api/env/check")
async def check_environment():
    """Check if MediaCrawler environment is configured correctly"""
    try:
        # Run uv run main.py --help command to check environment
        process = await asyncio.create_subprocess_exec(
            "uv", "run", "main.py", "--help",
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            cwd="."  # Project root directory
        )
        stdout, stderr = await asyncio.wait_for(
            process.communicate(),
            timeout=30.0  # 30 seconds timeout
        )

        if process.returncode == 0:
            return {
                "success": True,
                "message": "MediaCrawler environment configured correctly",
                "output": stdout.decode("utf-8", errors="ignore")[:500]  # Truncate to first 500 characters
            }
        else:
            error_msg = stderr.decode("utf-8", errors="ignore") or stdout.decode("utf-8", errors="ignore")
            return {
                "success": False,
                "message": "Environment check failed",
                "error": error_msg[:500]
            }
    except asyncio.TimeoutError:
        return {
            "success": False,
            "message": "Environment check timeout",
            "error": "Command execution exceeded 30 seconds"
        }
    except FileNotFoundError:
        return {
            "success": False,
            "message": "uv command not found",
            "error": "Please ensure uv is installed and configured in system PATH"
        }
    except Exception as e:
        return {
            "success": False,
            "message": "Environment check error",
            "error": str(e)
        }


@app.get("/api/config/platforms")
async def get_platforms():
    """Get list of supported platforms"""
    return {
        "platforms": [
            {"value": "xhs", "label": "Xiaohongshu", "icon": "book-open"},
            {"value": "dy", "label": "Douyin", "icon": "music"},
            {"value": "ks", "label": "Kuaishou", "icon": "video"},
            {"value": "bili", "label": "Bilibili", "icon": "tv"},
            {"value": "wb", "label": "Weibo", "icon": "message-circle"},
            {"value": "tieba", "label": "Baidu Tieba", "icon": "messages-square"},
            {"value": "zhihu", "label": "Zhihu", "icon": "help-circle"},
        ]
    }


@app.get("/api/config/options")
async def get_config_options():
    """Get all configuration options"""
    return {
        "login_types": [
            {"value": "qrcode", "label": "QR Code Login"},
            {"value": "cookie", "label": "Cookie Login"},
        ],
        "crawler_types": [
            {"value": "search", "label": "Search Mode"},
            {"value": "detail", "label": "Detail Mode"},
            {"value": "creator", "label": "Creator Mode"},
        ],
        "save_options": [
            {"value": "json", "label": "JSON File"},
            {"value": "csv", "label": "CSV File"},
            {"value": "excel", "label": "Excel File"},
            {"value": "sqlite", "label": "SQLite Database"},
            {"value": "db", "label": "MySQL Database"},
            {"value": "mongodb", "label": "MongoDB Database"},
        ],
    }


# Mount static resources - must be placed after all routes
if os.path.exists(WEBUI_DIR):
    assets_dir = os.path.join(WEBUI_DIR, "assets")
    if os.path.exists(assets_dir):
        app.mount("/assets", StaticFiles(directory=assets_dir), name="assets")
    # Mount logos directory
    logos_dir = os.path.join(WEBUI_DIR, "logos")
    if os.path.exists(logos_dir):
        app.mount("/logos", StaticFiles(directory=logos_dir), name="logos")
    # Mount other static files (e.g., vite.svg)
    app.mount("/static", StaticFiles(directory=WEBUI_DIR), name="webui-static")


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
```
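For a quick smoke test of the endpoints defined above, a small client can hit `/api/health` and `/api/config/platforms` once the server is running on port 8080. This is an illustrative sketch using only the standard library; it assumes the server was started as described in the module docstring.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # default port used by api/main.py

def get_json(path: str) -> dict:
    # Plain GET helper; raises if the server is not reachable.
    with urllib.request.urlopen(f"{BASE_URL}{path}", timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    print(get_json("/api/health"))            # expected: {'status': 'ok'}
    print(get_json("/api/config/platforms"))  # list of supported platforms
```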
api/routers/__init__.py (new file, +23 lines)

```python
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/routers/__init__.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。

from .crawler import router as crawler_router
from .data import router as data_router
from .websocket import router as websocket_router

__all__ = ["crawler_router", "data_router", "websocket_router"]
```
api/routers/crawler.py (new file, +63 lines)

```python
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/routers/crawler.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。

from fastapi import APIRouter, HTTPException

from ..schemas import CrawlerStartRequest, CrawlerStatusResponse
from ..services import crawler_manager

router = APIRouter(prefix="/crawler", tags=["crawler"])


@router.post("/start")
async def start_crawler(request: CrawlerStartRequest):
    """Start crawler task"""
    success = await crawler_manager.start(request)
    if not success:
        # Handle concurrent/duplicate requests: if process is already running, return 400 instead of 500
        if crawler_manager.process and crawler_manager.process.poll() is None:
            raise HTTPException(status_code=400, detail="Crawler is already running")
        raise HTTPException(status_code=500, detail="Failed to start crawler")

    return {"status": "ok", "message": "Crawler started successfully"}


@router.post("/stop")
async def stop_crawler():
    """Stop crawler task"""
    success = await crawler_manager.stop()
    if not success:
        # Handle concurrent/duplicate requests: if process already exited/doesn't exist, return 400 instead of 500
        if not crawler_manager.process or crawler_manager.process.poll() is not None:
            raise HTTPException(status_code=400, detail="No crawler is running")
        raise HTTPException(status_code=500, detail="Failed to stop crawler")

    return {"status": "ok", "message": "Crawler stopped successfully"}


@router.get("/status", response_model=CrawlerStatusResponse)
async def get_crawler_status():
    """Get crawler status"""
    return crawler_manager.get_status()


@router.get("/logs")
async def get_logs(limit: int = 100):
    """Get recent logs"""
    logs = crawler_manager.logs[-limit:] if limit > 0 else crawler_manager.logs
    return {"logs": [log.model_dump() for log in logs]}
```
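A start/status round trip against this router might look like the sketch below. The request fields mirror `CrawlerStartRequest` (defined in `api/schemas/crawler.py` later in this diff); the keyword value is a placeholder, and the block only illustrates the HTTP shape of the calls.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/api"

def post_json(path: str, payload: dict) -> dict:
    # Minimal JSON POST helper using the standard library.
    req = urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    # Start a search-mode crawl on Xiaohongshu, saving results to JSON.
    print(post_json("/crawler/start", {
        "platform": "xhs",
        "login_type": "qrcode",
        "crawler_type": "search",
        "keywords": "example keyword",  # placeholder
        "save_option": "json",
    }))
    # Poll the current status.
    with urllib.request.urlopen(f"{BASE_URL}/crawler/status", timeout=10) as resp:
        print(json.loads(resp.read().decode("utf-8")))
```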
api/routers/data.py (new file, +230 lines)

```python
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/routers/data.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。

import os
import json
from pathlib import Path
from typing import Optional

from fastapi import APIRouter, HTTPException
from fastapi.responses import FileResponse

router = APIRouter(prefix="/data", tags=["data"])

# Data directory
DATA_DIR = Path(__file__).parent.parent.parent / "data"


def get_file_info(file_path: Path) -> dict:
    """Get file information"""
    stat = file_path.stat()
    record_count = None

    # Try to get record count
    try:
        if file_path.suffix == ".json":
            with open(file_path, "r", encoding="utf-8") as f:
                data = json.load(f)
                if isinstance(data, list):
                    record_count = len(data)
        elif file_path.suffix == ".csv":
            with open(file_path, "r", encoding="utf-8") as f:
                record_count = sum(1 for _ in f) - 1  # Subtract header row
    except Exception:
        pass

    return {
        "name": file_path.name,
        "path": str(file_path.relative_to(DATA_DIR)),
        "size": stat.st_size,
        "modified_at": stat.st_mtime,
        "record_count": record_count,
        "type": file_path.suffix[1:] if file_path.suffix else "unknown"
    }


@router.get("/files")
async def list_data_files(platform: Optional[str] = None, file_type: Optional[str] = None):
    """Get data file list"""
    if not DATA_DIR.exists():
        return {"files": []}

    files = []
    supported_extensions = {".json", ".csv", ".xlsx", ".xls"}

    for root, dirs, filenames in os.walk(DATA_DIR):
        root_path = Path(root)
        for filename in filenames:
            file_path = root_path / filename
            if file_path.suffix.lower() not in supported_extensions:
                continue

            # Platform filter
            if platform:
                rel_path = str(file_path.relative_to(DATA_DIR))
                if platform.lower() not in rel_path.lower():
                    continue

            # Type filter
            if file_type and file_path.suffix[1:].lower() != file_type.lower():
                continue

            try:
                files.append(get_file_info(file_path))
            except Exception:
                continue

    # Sort by modification time (newest first)
    files.sort(key=lambda x: x["modified_at"], reverse=True)

    return {"files": files}


@router.get("/files/{file_path:path}")
async def get_file_content(file_path: str, preview: bool = True, limit: int = 100):
    """Get file content or preview"""
    full_path = DATA_DIR / file_path

    if not full_path.exists():
        raise HTTPException(status_code=404, detail="File not found")

    if not full_path.is_file():
        raise HTTPException(status_code=400, detail="Not a file")

    # Security check: ensure within DATA_DIR
    try:
        full_path.resolve().relative_to(DATA_DIR.resolve())
    except ValueError:
        raise HTTPException(status_code=403, detail="Access denied")

    if preview:
        # Return preview data
        try:
            if full_path.suffix == ".json":
                with open(full_path, "r", encoding="utf-8") as f:
                    data = json.load(f)
                    if isinstance(data, list):
                        return {"data": data[:limit], "total": len(data)}
                    return {"data": data, "total": 1}
            elif full_path.suffix == ".csv":
                import csv
                with open(full_path, "r", encoding="utf-8") as f:
                    reader = csv.DictReader(f)
                    rows = []
                    for i, row in enumerate(reader):
                        if i >= limit:
                            break
                        rows.append(row)
                    # Re-read to get total count
                    f.seek(0)
                    total = sum(1 for _ in f) - 1
                    return {"data": rows, "total": total}
            elif full_path.suffix.lower() in (".xlsx", ".xls"):
                import pandas as pd
                # Read first limit rows
                df = pd.read_excel(full_path, nrows=limit)
                # Get total row count (only read first column to save memory)
                df_count = pd.read_excel(full_path, usecols=[0])
                total = len(df_count)
                # Convert to list of dictionaries, handle NaN values
                rows = df.where(pd.notnull(df), None).to_dict(orient='records')
                return {
                    "data": rows,
                    "total": total,
                    "columns": list(df.columns)
                }
            else:
                raise HTTPException(status_code=400, detail="Unsupported file type for preview")
        except json.JSONDecodeError:
            raise HTTPException(status_code=400, detail="Invalid JSON file")
        except Exception as e:
            raise HTTPException(status_code=500, detail=str(e))
    else:
        # Return file download
        return FileResponse(
            path=full_path,
            filename=full_path.name,
            media_type="application/octet-stream"
        )


@router.get("/download/{file_path:path}")
async def download_file(file_path: str):
    """Download file"""
    full_path = DATA_DIR / file_path

    if not full_path.exists():
        raise HTTPException(status_code=404, detail="File not found")

    if not full_path.is_file():
        raise HTTPException(status_code=400, detail="Not a file")

    # Security check
    try:
        full_path.resolve().relative_to(DATA_DIR.resolve())
    except ValueError:
        raise HTTPException(status_code=403, detail="Access denied")

    return FileResponse(
        path=full_path,
        filename=full_path.name,
        media_type="application/octet-stream"
    )


@router.get("/stats")
async def get_data_stats():
    """Get data statistics"""
    if not DATA_DIR.exists():
        return {"total_files": 0, "total_size": 0, "by_platform": {}, "by_type": {}}

    stats = {
        "total_files": 0,
        "total_size": 0,
        "by_platform": {},
        "by_type": {}
    }

    supported_extensions = {".json", ".csv", ".xlsx", ".xls"}

    for root, dirs, filenames in os.walk(DATA_DIR):
        root_path = Path(root)
        for filename in filenames:
            file_path = root_path / filename
            if file_path.suffix.lower() not in supported_extensions:
                continue

            try:
                stat = file_path.stat()
                stats["total_files"] += 1
                stats["total_size"] += stat.st_size

                # Statistics by type
                file_type = file_path.suffix[1:].lower()
                stats["by_type"][file_type] = stats["by_type"].get(file_type, 0) + 1

                # Statistics by platform (inferred from path)
                rel_path = str(file_path.relative_to(DATA_DIR))
                for platform in ["xhs", "dy", "ks", "bili", "wb", "tieba", "zhihu"]:
                    if platform in rel_path.lower():
                        stats["by_platform"][platform] = stats["by_platform"].get(platform, 0) + 1
                        break
            except Exception:
                continue

    return stats
```
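The data router can be exercised the same way: list the exported files, then preview the newest one. This is a minimal sketch (standard library only) that assumes the server is running and that the `data/` directory already contains exported files.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8080/api/data"

def get_json(url: str) -> dict:
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    files = get_json(f"{BASE_URL}/files")["files"]
    if files:
        newest = files[0]["path"]  # the router sorts results newest-first
        quoted = urllib.parse.quote(newest)
        preview = get_json(f"{BASE_URL}/files/{quoted}?preview=true&limit=5")
        print(newest, "->", preview["total"], "records")
    else:
        print("No exported data files yet")
```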
api/routers/websocket.py (new file, +151 lines)

```python
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/routers/websocket.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。

import asyncio
from typing import Set, Optional

from fastapi import APIRouter, WebSocket, WebSocketDisconnect

from ..services import crawler_manager

router = APIRouter(tags=["websocket"])


class ConnectionManager:
    """WebSocket connection manager"""

    def __init__(self):
        self.active_connections: Set[WebSocket] = set()

    async def connect(self, websocket: WebSocket):
        await websocket.accept()
        self.active_connections.add(websocket)

    def disconnect(self, websocket: WebSocket):
        self.active_connections.discard(websocket)

    async def broadcast(self, message: dict):
        """Broadcast message to all connections"""
        if not self.active_connections:
            return

        disconnected = []
        for connection in list(self.active_connections):
            try:
                await connection.send_json(message)
            except Exception:
                disconnected.append(connection)

        # Clean up disconnected connections
        for conn in disconnected:
            self.disconnect(conn)


manager = ConnectionManager()


async def log_broadcaster():
    """Background task: read logs from queue and broadcast"""
    queue = crawler_manager.get_log_queue()
    while True:
        try:
            # Get log entry from queue
            entry = await queue.get()
            # Broadcast to all WebSocket connections
            await manager.broadcast(entry.model_dump())
        except asyncio.CancelledError:
            break
        except Exception as e:
            print(f"Log broadcaster error: {e}")
            await asyncio.sleep(0.1)


# Global broadcast task
_broadcaster_task: Optional[asyncio.Task] = None


def start_broadcaster():
    """Start broadcast task"""
    global _broadcaster_task
    if _broadcaster_task is None or _broadcaster_task.done():
        _broadcaster_task = asyncio.create_task(log_broadcaster())


@router.websocket("/ws/logs")
async def websocket_logs(websocket: WebSocket):
    """WebSocket log stream"""
    print("[WS] New connection attempt")

    try:
        # Ensure broadcast task is running
        start_broadcaster()

        await manager.connect(websocket)
        print(f"[WS] Connected, active connections: {len(manager.active_connections)}")

        # Send existing logs
        for log in crawler_manager.logs:
            try:
                await websocket.send_json(log.model_dump())
            except Exception as e:
                print(f"[WS] Error sending existing log: {e}")
                break

        print(f"[WS] Sent {len(crawler_manager.logs)} existing logs, entering main loop")

        while True:
            # Keep connection alive, receive heartbeat or any message
            try:
                data = await asyncio.wait_for(
                    websocket.receive_text(),
                    timeout=30.0
                )
                if data == "ping":
                    await websocket.send_text("pong")
            except asyncio.TimeoutError:
                # Send ping to keep connection alive
                try:
                    await websocket.send_text("ping")
                except Exception as e:
                    print(f"[WS] Error sending ping: {e}")
                    break

    except WebSocketDisconnect:
        print("[WS] Client disconnected")
    except Exception as e:
        print(f"[WS] Error: {type(e).__name__}: {e}")
    finally:
        manager.disconnect(websocket)
        print(f"[WS] Cleanup done, active connections: {len(manager.active_connections)}")


@router.websocket("/ws/status")
async def websocket_status(websocket: WebSocket):
    """WebSocket status stream"""
    await websocket.accept()

    try:
        while True:
            # Send status every second
            status = crawler_manager.get_status()
            await websocket.send_json(status)
            await asyncio.sleep(1)
    except WebSocketDisconnect:
        pass
    except Exception:
        pass
```
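A log consumer for `/api/ws/logs` has to answer the server's keep-alive pings, which the endpoint above sends after 30 seconds of silence. The sketch below uses the third-party `websockets` package (an assumption; it is not part of this diff) and simply prints each broadcast log entry.

```python
import asyncio
import json

import websockets  # assumed to be installed separately: pip install websockets

async def tail_logs() -> None:
    uri = "ws://localhost:8080/api/ws/logs"
    async with websockets.connect(uri) as ws:
        async for message in ws:
            if message == "ping":        # keep-alive sent by the server
                await ws.send("pong")
                continue
            if message == "pong":        # reply to a client-initiated ping, ignore
                continue
            entry = json.loads(message)  # a LogEntry dict: id, timestamp, level, message
            print(f'[{entry["level"]}] {entry["message"]}')

if __name__ == "__main__":
    asyncio.run(tail_logs())
```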
api/schemas/__init__.py (new file, +37 lines)

```python
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/schemas/__init__.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。

from .crawler import (
    PlatformEnum,
    LoginTypeEnum,
    CrawlerTypeEnum,
    SaveDataOptionEnum,
    CrawlerStartRequest,
    CrawlerStatusResponse,
    LogEntry,
)

__all__ = [
    "PlatformEnum",
    "LoginTypeEnum",
    "CrawlerTypeEnum",
    "SaveDataOptionEnum",
    "CrawlerStartRequest",
    "CrawlerStatusResponse",
    "LogEntry",
]
```
98
api/schemas/crawler.py
Normal file
@@ -0,0 +1,98 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/schemas/crawler.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。

from enum import Enum
from typing import Optional, Literal
from pydantic import BaseModel


class PlatformEnum(str, Enum):
    """Supported media platforms"""
    XHS = "xhs"
    DOUYIN = "dy"
    KUAISHOU = "ks"
    BILIBILI = "bili"
    WEIBO = "wb"
    TIEBA = "tieba"
    ZHIHU = "zhihu"


class LoginTypeEnum(str, Enum):
    """Login method"""
    QRCODE = "qrcode"
    PHONE = "phone"
    COOKIE = "cookie"


class CrawlerTypeEnum(str, Enum):
    """Crawler type"""
    SEARCH = "search"
    DETAIL = "detail"
    CREATOR = "creator"


class SaveDataOptionEnum(str, Enum):
    """Data save option"""
    CSV = "csv"
    DB = "db"
    JSON = "json"
    SQLITE = "sqlite"
    MONGODB = "mongodb"
    EXCEL = "excel"


class CrawlerStartRequest(BaseModel):
    """Crawler start request"""
    platform: PlatformEnum
    login_type: LoginTypeEnum = LoginTypeEnum.QRCODE
    crawler_type: CrawlerTypeEnum = CrawlerTypeEnum.SEARCH
    keywords: str = ""  # Keywords for search mode
    specified_ids: str = ""  # Post/video ID list for detail mode, comma-separated
    creator_ids: str = ""  # Creator ID list for creator mode, comma-separated
    start_page: int = 1
    enable_comments: bool = True
    enable_sub_comments: bool = False
    save_option: SaveDataOptionEnum = SaveDataOptionEnum.JSON
    cookies: str = ""
    headless: bool = False


class CrawlerStatusResponse(BaseModel):
    """Crawler status response"""
    status: Literal["idle", "running", "stopping", "error"]
    platform: Optional[str] = None
    crawler_type: Optional[str] = None
    started_at: Optional[str] = None
    error_message: Optional[str] = None


class LogEntry(BaseModel):
    """Log entry"""
    id: int
    timestamp: str
    level: Literal["info", "warning", "error", "success", "debug"]
    message: str


class DataFileInfo(BaseModel):
    """Data file information"""
    name: str
    path: str
    size: int
    modified_at: str
    record_count: Optional[int] = None
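To illustrate how these schemas fit together, a request built against them might look like the following; the field values are placeholders, and the `api.schemas` import path assumes the package is imported from the project root.

# Building a start request from the models above (illustrative values only).
from api.schemas import CrawlerStartRequest, CrawlerTypeEnum, PlatformEnum, SaveDataOptionEnum

req = CrawlerStartRequest(
    platform=PlatformEnum.XHS,
    crawler_type=CrawlerTypeEnum.SEARCH,
    keywords="python,fastapi",  # placeholder keywords
    save_option=SaveDataOptionEnum.JSON,
)
print(req.model_dump())  # Pydantic v2 serialization, matching the model_dump() call used for log pushes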
21
api/services/__init__.py
Normal file
@@ -0,0 +1,21 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/services/__init__.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。

from .crawler_manager import CrawlerManager, crawler_manager

__all__ = ["CrawlerManager", "crawler_manager"]
282
api/services/crawler_manager.py
Normal file
@@ -0,0 +1,282 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/services/crawler_manager.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。

import asyncio
import subprocess
import signal
import os
from typing import Optional, List
from datetime import datetime
from pathlib import Path

from ..schemas import CrawlerStartRequest, LogEntry


class CrawlerManager:
    """Crawler process manager"""

    def __init__(self):
        self._lock = asyncio.Lock()
        self.process: Optional[subprocess.Popen] = None
        self.status = "idle"
        self.started_at: Optional[datetime] = None
        self.current_config: Optional[CrawlerStartRequest] = None
        self._log_id = 0
        self._logs: List[LogEntry] = []
        self._read_task: Optional[asyncio.Task] = None
        # Project root directory
        self._project_root = Path(__file__).parent.parent.parent
        # Log queue - for pushing to WebSocket
        self._log_queue: Optional[asyncio.Queue] = None

    @property
    def logs(self) -> List[LogEntry]:
        return self._logs

    def get_log_queue(self) -> asyncio.Queue:
        """Get or create log queue"""
        if self._log_queue is None:
            self._log_queue = asyncio.Queue()
        return self._log_queue

    def _create_log_entry(self, message: str, level: str = "info") -> LogEntry:
        """Create log entry"""
        self._log_id += 1
        entry = LogEntry(
            id=self._log_id,
            timestamp=datetime.now().strftime("%H:%M:%S"),
            level=level,
            message=message
        )
        self._logs.append(entry)
        # Keep last 500 logs
        if len(self._logs) > 500:
            self._logs = self._logs[-500:]
        return entry

    async def _push_log(self, entry: LogEntry):
        """Push log to queue"""
        if self._log_queue is not None:
            try:
                self._log_queue.put_nowait(entry)
            except asyncio.QueueFull:
                pass

    def _parse_log_level(self, line: str) -> str:
        """Parse log level"""
        line_upper = line.upper()
        if "ERROR" in line_upper or "FAILED" in line_upper:
            return "error"
        elif "WARNING" in line_upper or "WARN" in line_upper:
            return "warning"
        elif "SUCCESS" in line_upper or "完成" in line or "成功" in line:
            return "success"
        elif "DEBUG" in line_upper:
            return "debug"
        return "info"

    async def start(self, config: CrawlerStartRequest) -> bool:
        """Start crawler process"""
        async with self._lock:
            if self.process and self.process.poll() is None:
                return False

            # Clear old logs
            self._logs = []
            self._log_id = 0

            # Clear pending queue (don't replace object to avoid WebSocket broadcast coroutine holding old queue reference)
            if self._log_queue is None:
                self._log_queue = asyncio.Queue()
            else:
                try:
                    while True:
                        self._log_queue.get_nowait()
                except asyncio.QueueEmpty:
                    pass

            # Build command line arguments
            cmd = self._build_command(config)

            # Log start information
            entry = self._create_log_entry(f"Starting crawler: {' '.join(cmd)}", "info")
            await self._push_log(entry)

            try:
                # Start subprocess
                self.process = subprocess.Popen(
                    cmd,
                    stdout=subprocess.PIPE,
                    stderr=subprocess.STDOUT,
                    text=True,
                    encoding='utf-8',
                    bufsize=1,
                    cwd=str(self._project_root),
                    env={**os.environ, "PYTHONUNBUFFERED": "1"}
                )

                self.status = "running"
                self.started_at = datetime.now()
                self.current_config = config

                entry = self._create_log_entry(
                    f"Crawler started on platform: {config.platform.value}, type: {config.crawler_type.value}",
                    "success"
                )
                await self._push_log(entry)

                # Start log reading task
                self._read_task = asyncio.create_task(self._read_output())

                return True
            except Exception as e:
                self.status = "error"
                entry = self._create_log_entry(f"Failed to start crawler: {str(e)}", "error")
                await self._push_log(entry)
                return False

    async def stop(self) -> bool:
        """Stop crawler process"""
        async with self._lock:
            if not self.process or self.process.poll() is not None:
                return False

            self.status = "stopping"
            entry = self._create_log_entry("Sending SIGTERM to crawler process...", "warning")
            await self._push_log(entry)

            try:
                self.process.send_signal(signal.SIGTERM)

                # Wait for graceful exit (up to 15 seconds)
                for _ in range(30):
                    if self.process.poll() is not None:
                        break
                    await asyncio.sleep(0.5)

                # If still not exited, force kill
                if self.process.poll() is None:
                    entry = self._create_log_entry("Process not responding, sending SIGKILL...", "warning")
                    await self._push_log(entry)
                    self.process.kill()

                entry = self._create_log_entry("Crawler process terminated", "info")
                await self._push_log(entry)

            except Exception as e:
                entry = self._create_log_entry(f"Error stopping crawler: {str(e)}", "error")
                await self._push_log(entry)

            self.status = "idle"
            self.current_config = None

            # Cancel log reading task
            if self._read_task:
                self._read_task.cancel()
                self._read_task = None

            return True

    def get_status(self) -> dict:
        """Get current status"""
        return {
            "status": self.status,
            "platform": self.current_config.platform.value if self.current_config else None,
            "crawler_type": self.current_config.crawler_type.value if self.current_config else None,
            "started_at": self.started_at.isoformat() if self.started_at else None,
            "error_message": None
        }

    def _build_command(self, config: CrawlerStartRequest) -> list:
        """Build main.py command line arguments"""
        cmd = ["uv", "run", "python", "main.py"]

        cmd.extend(["--platform", config.platform.value])
        cmd.extend(["--lt", config.login_type.value])
        cmd.extend(["--type", config.crawler_type.value])
        cmd.extend(["--save_data_option", config.save_option.value])

        # Pass different arguments based on crawler type
        if config.crawler_type.value == "search" and config.keywords:
            cmd.extend(["--keywords", config.keywords])
        elif config.crawler_type.value == "detail" and config.specified_ids:
            cmd.extend(["--specified_id", config.specified_ids])
        elif config.crawler_type.value == "creator" and config.creator_ids:
            cmd.extend(["--creator_id", config.creator_ids])

        if config.start_page != 1:
            cmd.extend(["--start", str(config.start_page)])

        cmd.extend(["--get_comment", "true" if config.enable_comments else "false"])
        cmd.extend(["--get_sub_comment", "true" if config.enable_sub_comments else "false"])

        if config.cookies:
            cmd.extend(["--cookies", config.cookies])

        cmd.extend(["--headless", "true" if config.headless else "false"])

        return cmd

    async def _read_output(self):
        """Asynchronously read process output"""
        loop = asyncio.get_event_loop()

        try:
            while self.process and self.process.poll() is None:
                # Read a line in thread pool
                line = await loop.run_in_executor(
                    None, self.process.stdout.readline
                )
                if line:
                    line = line.strip()
                    if line:
                        level = self._parse_log_level(line)
                        entry = self._create_log_entry(line, level)
                        await self._push_log(entry)

            # Read remaining output
            if self.process and self.process.stdout:
                remaining = await loop.run_in_executor(
                    None, self.process.stdout.read
                )
                if remaining:
                    for line in remaining.strip().split('\n'):
                        if line.strip():
                            level = self._parse_log_level(line)
                            entry = self._create_log_entry(line.strip(), level)
                            await self._push_log(entry)

            # Process ended
            if self.status == "running":
                exit_code = self.process.returncode if self.process else -1
                if exit_code == 0:
                    entry = self._create_log_entry("Crawler completed successfully", "success")
                else:
                    entry = self._create_log_entry(f"Crawler exited with code: {exit_code}", "warning")
                await self._push_log(entry)
                self.status = "idle"

        except asyncio.CancelledError:
            pass
        except Exception as e:
            entry = self._create_log_entry(f"Error reading output: {str(e)}", "error")
            await self._push_log(entry)


# Global singleton
crawler_manager = CrawlerManager()
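A possible driver for the singleton above, not taken from the diff itself: it starts a search run, polls status, then stops; the field values are placeholders. Per _build_command, start() shells out to something like `uv run python main.py --platform xhs --lt qrcode --type search --save_data_option json --keywords demo --get_comment true --get_sub_comment false --headless false`.

# Illustrative use of the crawler_manager singleton (placeholder values, not part of the diff).
import asyncio

from api.schemas import CrawlerStartRequest, PlatformEnum
from api.services import crawler_manager


async def demo():
    ok = await crawler_manager.start(CrawlerStartRequest(platform=PlatformEnum.XHS, keywords="demo"))
    if not ok:
        return  # another crawler process is still running
    await asyncio.sleep(5)
    print(crawler_manager.get_status())  # e.g. {"status": "running", "platform": "xhs", ...}
    await crawler_manager.stop()


asyncio.run(demo())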
353
api/webui/assets/index-DvClRayq.js
Normal file
1
api/webui/assets/index-OiBmsgXF.css
Normal file
17
api/webui/index.html
Normal file
@@ -0,0 +1,17 @@
<!doctype html>
<html lang="zh-CN">
  <head>
    <meta charset="UTF-8" />
    <link rel="icon" type="image/svg+xml" href="/vite.svg" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>MediaCrawler - Command Center</title>
    <link rel="preconnect" href="https://fonts.googleapis.com">
    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&family=JetBrains+Mono:wght@400;500&display=swap" rel="stylesheet">
    <script type="module" crossorigin src="/assets/index-DvClRayq.js"></script>
    <link rel="stylesheet" crossorigin href="/assets/index-OiBmsgXF.css">
  </head>
  <body>
    <div id="root"></div>
  </body>
</html>
BIN
api/webui/logos/bilibili_logo.png
Normal file
|
After Width: | Height: | Size: 42 KiB |
BIN
api/webui/logos/douyin.png
Normal file
|
After Width: | Height: | Size: 25 KiB |
BIN
api/webui/logos/github.png
Normal file
|
After Width: | Height: | Size: 7.8 KiB |
BIN
api/webui/logos/my_logo.png
Normal file
|
After Width: | Height: | Size: 312 KiB |
BIN
api/webui/logos/xiaohongshu_logo.png
Normal file
|
After Width: | Height: | Size: 6.2 KiB |
1
api/webui/vite.svg
Normal file
@@ -0,0 +1 @@
|
||||
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100"><circle cx="50" cy="50" r="40" fill="#de283b"/></svg>
|
||||
|
After Width: | Height: | Size: 116 B |
@@ -1,11 +1,18 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/base/__init__.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
@@ -1,3 +1,12 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/base/base_crawler.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
@@ -44,14 +53,14 @@ class AbstractCrawler(ABC):
|
||||
|
||||
async def launch_browser_with_cdp(self, playwright: Playwright, playwright_proxy: Optional[Dict], user_agent: Optional[str], headless: bool = True) -> BrowserContext:
|
||||
"""
|
||||
使用CDP模式启动浏览器(可选实现)
|
||||
:param playwright: playwright实例
|
||||
:param playwright_proxy: playwright代理配置
|
||||
:param user_agent: 用户代理
|
||||
:param headless: 无头模式
|
||||
:return: 浏览器上下文
|
||||
Launch browser using CDP mode (optional implementation)
|
||||
:param playwright: playwright instance
|
||||
:param playwright_proxy: playwright proxy configuration
|
||||
:param user_agent: user agent
|
||||
:param headless: headless mode
|
||||
:return: browser context
|
||||
"""
|
||||
# 默认实现:回退到标准模式
|
||||
# Default implementation: fallback to standard mode
|
||||
return await self.launch_browser(playwright.chromium, playwright_proxy, user_agent, headless)
|
||||
|
||||
|
||||
|
||||
27
cache/__init__.py
vendored
@@ -1,11 +1,18 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/cache/__init__.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
49
cache/abs_cache.py
vendored
@@ -1,19 +1,28 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/cache/abs_cache.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
# -*- coding: utf-8 -*-
|
||||
# @Author : relakkes@gmail.com
|
||||
# @Name : 程序员阿江-Relakkes
|
||||
# @Name : Programmer AJiang-Relakkes
|
||||
# @Time : 2024/6/2 11:06
|
||||
# @Desc : 抽象类
|
||||
# @Desc : Abstract class
|
||||
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import Any, List, Optional
|
||||
@@ -24,9 +33,9 @@ class AbstractCache(ABC):
|
||||
@abstractmethod
|
||||
def get(self, key: str) -> Optional[Any]:
|
||||
"""
|
||||
从缓存中获取键的值。
|
||||
这是一个抽象方法。子类必须实现这个方法。
|
||||
:param key: 键
|
||||
Get the value of a key from the cache.
|
||||
This is an abstract method. Subclasses must implement this method.
|
||||
:param key: The key
|
||||
:return:
|
||||
"""
|
||||
raise NotImplementedError
|
||||
@@ -34,11 +43,11 @@ class AbstractCache(ABC):
|
||||
@abstractmethod
|
||||
def set(self, key: str, value: Any, expire_time: int) -> None:
|
||||
"""
|
||||
将键的值设置到缓存中。
|
||||
这是一个抽象方法。子类必须实现这个方法。
|
||||
:param key: 键
|
||||
:param value: 值
|
||||
:param expire_time: 过期时间
|
||||
Set the value of a key in the cache.
|
||||
This is an abstract method. Subclasses must implement this method.
|
||||
:param key: The key
|
||||
:param value: The value
|
||||
:param expire_time: Expiration time
|
||||
:return:
|
||||
"""
|
||||
raise NotImplementedError
|
||||
@@ -46,8 +55,8 @@ class AbstractCache(ABC):
|
||||
@abstractmethod
|
||||
def keys(self, pattern: str) -> List[str]:
|
||||
"""
|
||||
获取所有符合pattern的key
|
||||
:param pattern: 匹配模式
|
||||
Get all keys matching the pattern
|
||||
:param pattern: Matching pattern
|
||||
:return:
|
||||
"""
|
||||
raise NotImplementedError
|
||||
|
||||
37
cache/cache_factory.py
vendored
@@ -1,33 +1,42 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/cache/cache_factory.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
# -*- coding: utf-8 -*-
|
||||
# @Author : relakkes@gmail.com
|
||||
# @Name : 程序员阿江-Relakkes
|
||||
# @Name : Programmer AJiang-Relakkes
|
||||
# @Time : 2024/6/2 11:23
|
||||
# @Desc :
|
||||
|
||||
|
||||
class CacheFactory:
|
||||
"""
|
||||
缓存工厂类
|
||||
Cache factory class
|
||||
"""
|
||||
|
||||
@staticmethod
|
||||
def create_cache(cache_type: str, *args, **kwargs):
|
||||
"""
|
||||
创建缓存对象
|
||||
:param cache_type: 缓存类型
|
||||
:param args: 参数
|
||||
:param kwargs: 关键字参数
|
||||
Create cache object
|
||||
:param cache_type: Cache type
|
||||
:param args: Arguments
|
||||
:param kwargs: Keyword arguments
|
||||
:return:
|
||||
"""
|
||||
if cache_type == 'memory':
|
||||
|
||||
57
cache/local_cache.py
vendored
@@ -1,19 +1,28 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/cache/local_cache.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
# -*- coding: utf-8 -*-
|
||||
# @Author : relakkes@gmail.com
|
||||
# @Name : 程序员阿江-Relakkes
|
||||
# @Name : Programmer AJiang-Relakkes
|
||||
# @Time : 2024/6/2 11:05
|
||||
# @Desc : 本地缓存
|
||||
# @Desc : Local cache
|
||||
|
||||
import asyncio
|
||||
import time
|
||||
@@ -26,19 +35,19 @@ class ExpiringLocalCache(AbstractCache):
|
||||
|
||||
def __init__(self, cron_interval: int = 10):
|
||||
"""
|
||||
初始化本地缓存
|
||||
:param cron_interval: 定时清楚cache的时间间隔
|
||||
Initialize local cache
|
||||
:param cron_interval: Time interval for scheduled cache cleanup
|
||||
:return:
|
||||
"""
|
||||
self._cron_interval = cron_interval
|
||||
self._cache_container: Dict[str, Tuple[Any, float]] = {}
|
||||
self._cron_task: Optional[asyncio.Task] = None
|
||||
# 开启定时清理任务
|
||||
# Start scheduled cleanup task
|
||||
self._schedule_clear()
|
||||
|
||||
def __del__(self):
|
||||
"""
|
||||
析构函数,清理定时任务
|
||||
Destructor function, cleanup scheduled task
|
||||
:return:
|
||||
"""
|
||||
if self._cron_task is not None:
|
||||
@@ -46,7 +55,7 @@ class ExpiringLocalCache(AbstractCache):
|
||||
|
||||
def get(self, key: str) -> Optional[Any]:
|
||||
"""
|
||||
从缓存中获取键的值
|
||||
Get the value of a key from the cache
|
||||
:param key:
|
||||
:return:
|
||||
"""
|
||||
@@ -54,7 +63,7 @@ class ExpiringLocalCache(AbstractCache):
|
||||
if value is None:
|
||||
return None
|
||||
|
||||
# 如果键已过期,则删除键并返回None
|
||||
# If the key has expired, delete it and return None
|
||||
if expire_time < time.time():
|
||||
del self._cache_container[key]
|
||||
return None
|
||||
@@ -63,7 +72,7 @@ class ExpiringLocalCache(AbstractCache):
|
||||
|
||||
def set(self, key: str, value: Any, expire_time: int) -> None:
|
||||
"""
|
||||
将键的值设置到缓存中
|
||||
Set the value of a key in the cache
|
||||
:param key:
|
||||
:param value:
|
||||
:param expire_time:
|
||||
@@ -73,14 +82,14 @@ class ExpiringLocalCache(AbstractCache):
|
||||
|
||||
def keys(self, pattern: str) -> List[str]:
|
||||
"""
|
||||
获取所有符合pattern的key
|
||||
:param pattern: 匹配模式
|
||||
Get all keys matching the pattern
|
||||
:param pattern: Matching pattern
|
||||
:return:
|
||||
"""
|
||||
if pattern == '*':
|
||||
return list(self._cache_container.keys())
|
||||
|
||||
# 本地缓存通配符暂时将*替换为空
|
||||
# For local cache wildcard, temporarily replace * with empty string
|
||||
if '*' in pattern:
|
||||
pattern = pattern.replace('*', '')
|
||||
|
||||
@@ -88,7 +97,7 @@ class ExpiringLocalCache(AbstractCache):
|
||||
|
||||
def _schedule_clear(self):
|
||||
"""
|
||||
开启定时清理任务,
|
||||
Start scheduled cleanup task
|
||||
:return:
|
||||
"""
|
||||
|
||||
@@ -102,7 +111,7 @@ class ExpiringLocalCache(AbstractCache):
|
||||
|
||||
def _clear(self):
|
||||
"""
|
||||
根据过期时间清理缓存
|
||||
Clean up cache based on expiration time
|
||||
:return:
|
||||
"""
|
||||
for key, (value, expire_time) in self._cache_container.items():
|
||||
@@ -111,7 +120,7 @@ class ExpiringLocalCache(AbstractCache):
|
||||
|
||||
async def _start_clear_cron(self):
|
||||
"""
|
||||
开启定时清理任务
|
||||
Start scheduled cleanup task
|
||||
:return:
|
||||
"""
|
||||
while True:
|
||||
@@ -121,7 +130,7 @@ class ExpiringLocalCache(AbstractCache):
|
||||
|
||||
if __name__ == '__main__':
|
||||
cache = ExpiringLocalCache(cron_interval=2)
|
||||
cache.set('name', '程序员阿江-Relakkes', 3)
|
||||
cache.set('name', 'Programmer AJiang-Relakkes', 3)
|
||||
print(cache.get('key'))
|
||||
print(cache.keys("*"))
|
||||
time.sleep(4)
|
||||
|
||||
41
cache/redis_cache.py
vendored
@@ -1,19 +1,28 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/cache/redis_cache.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
# -*- coding: utf-8 -*-
|
||||
# @Author : relakkes@gmail.com
|
||||
# @Name : 程序员阿江-Relakkes
|
||||
# @Name : Programmer AJiang-Relakkes
|
||||
# @Time : 2024/5/29 22:57
|
||||
# @Desc : RedisCache实现
|
||||
# @Desc : RedisCache implementation
|
||||
import pickle
|
||||
import time
|
||||
from typing import Any, List
|
||||
@@ -27,13 +36,13 @@ from config import db_config
|
||||
class RedisCache(AbstractCache):
|
||||
|
||||
def __init__(self) -> None:
|
||||
# 连接redis, 返回redis客户端
|
||||
# Connect to redis, return redis client
|
||||
self._redis_client = self._connet_redis()
|
||||
|
||||
@staticmethod
|
||||
def _connet_redis() -> Redis:
|
||||
"""
|
||||
连接redis, 返回redis客户端, 这里按需配置redis连接信息
|
||||
Connect to redis, return redis client, configure redis connection information as needed
|
||||
:return:
|
||||
"""
|
||||
return Redis(
|
||||
@@ -45,7 +54,7 @@ class RedisCache(AbstractCache):
|
||||
|
||||
def get(self, key: str) -> Any:
|
||||
"""
|
||||
从缓存中获取键的值, 并且反序列化
|
||||
Get the value of a key from the cache and deserialize it
|
||||
:param key:
|
||||
:return:
|
||||
"""
|
||||
@@ -56,7 +65,7 @@ class RedisCache(AbstractCache):
|
||||
|
||||
def set(self, key: str, value: Any, expire_time: int) -> None:
|
||||
"""
|
||||
将键的值设置到缓存中, 并且序列化
|
||||
Set the value of a key in the cache and serialize it
|
||||
:param key:
|
||||
:param value:
|
||||
:param expire_time:
|
||||
@@ -66,7 +75,7 @@ class RedisCache(AbstractCache):
|
||||
|
||||
def keys(self, pattern: str) -> List[str]:
|
||||
"""
|
||||
获取所有符合pattern的key
|
||||
Get all keys matching the pattern
|
||||
"""
|
||||
return [key.decode() for key in self._redis_client.keys(pattern)]
|
||||
|
||||
@@ -74,7 +83,7 @@ class RedisCache(AbstractCache):
|
||||
if __name__ == '__main__':
|
||||
redis_cache = RedisCache()
|
||||
# basic usage
|
||||
redis_cache.set("name", "程序员阿江-Relakkes", 1)
|
||||
redis_cache.set("name", "Programmer AJiang-Relakkes", 1)
|
||||
print(redis_cache.get("name")) # Relakkes
|
||||
print(redis_cache.keys("*")) # ['name']
|
||||
time.sleep(2)
|
||||
|
||||
@@ -1,12 +1,21 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/cmd_arg/__init__.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
from .arg import *
|
||||
|
||||
145
cmd_arg/arg.py
@@ -1,3 +1,12 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/cmd_arg/arg.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
@@ -28,7 +37,7 @@ EnumT = TypeVar("EnumT", bound=Enum)
|
||||
|
||||
|
||||
class PlatformEnum(str, Enum):
|
||||
"""支持的媒体平台枚举"""
|
||||
"""Supported media platform enumeration"""
|
||||
|
||||
XHS = "xhs"
|
||||
DOUYIN = "dy"
|
||||
@@ -40,7 +49,7 @@ class PlatformEnum(str, Enum):
|
||||
|
||||
|
||||
class LoginTypeEnum(str, Enum):
|
||||
"""登录方式枚举"""
|
||||
"""Login type enumeration"""
|
||||
|
||||
QRCODE = "qrcode"
|
||||
PHONE = "phone"
|
||||
@@ -48,7 +57,7 @@ class LoginTypeEnum(str, Enum):
|
||||
|
||||
|
||||
class CrawlerTypeEnum(str, Enum):
|
||||
"""爬虫类型枚举"""
|
||||
"""Crawler type enumeration"""
|
||||
|
||||
SEARCH = "search"
|
||||
DETAIL = "detail"
|
||||
@@ -56,19 +65,23 @@ class CrawlerTypeEnum(str, Enum):
|
||||
|
||||
|
||||
class SaveDataOptionEnum(str, Enum):
|
||||
"""数据保存方式枚举"""
|
||||
"""Data save option enumeration"""
|
||||
|
||||
CSV = "csv"
|
||||
DB = "db"
|
||||
JSON = "json"
|
||||
SQLITE = "sqlite"
|
||||
MONGODB = "mongodb"
|
||||
EXCEL = "excel"
|
||||
POSTGRES = "postgres"
|
||||
|
||||
|
||||
class InitDbOptionEnum(str, Enum):
|
||||
"""数据库初始化选项"""
|
||||
"""Database initialization option"""
|
||||
|
||||
SQLITE = "sqlite"
|
||||
MYSQL = "mysql"
|
||||
POSTGRES = "postgres"
|
||||
|
||||
|
||||
def _to_bool(value: bool | str) -> bool:
|
||||
@@ -91,7 +104,7 @@ def _coerce_enum(
|
||||
return enum_cls(value)
|
||||
except ValueError:
|
||||
typer.secho(
|
||||
f"⚠️ 配置值 '{value}' 不在 {enum_cls.__name__} 支持的范围内,已回退到默认值 '{default.value}'.",
|
||||
f"⚠️ Config value '{value}' is not within the supported range of {enum_cls.__name__}, falling back to default value '{default.value}'.",
|
||||
fg=typer.colors.YELLOW,
|
||||
)
|
||||
return default
|
||||
@@ -122,7 +135,7 @@ def _inject_init_db_default(args: Sequence[str]) -> list[str]:
|
||||
|
||||
|
||||
async def parse_cmd(argv: Optional[Sequence[str]] = None):
|
||||
"""使用 Typer 解析命令行参数。"""
|
||||
"""Parse command line arguments using Typer."""
|
||||
|
||||
app = typer.Typer(add_completion=False)
|
||||
|
||||
@@ -132,48 +145,48 @@ async def parse_cmd(argv: Optional[Sequence[str]] = None):
|
||||
PlatformEnum,
|
||||
typer.Option(
|
||||
"--platform",
|
||||
help="媒体平台选择 (xhs=小红书 | dy=抖音 | ks=快手 | bili=哔哩哔哩 | wb=微博 | tieba=百度贴吧 | zhihu=知乎)",
|
||||
rich_help_panel="基础配置",
|
||||
help="Media platform selection (xhs=XiaoHongShu | dy=Douyin | ks=Kuaishou | bili=Bilibili | wb=Weibo | tieba=Baidu Tieba | zhihu=Zhihu)",
|
||||
rich_help_panel="Basic Configuration",
|
||||
),
|
||||
] = _coerce_enum(PlatformEnum, config.PLATFORM, PlatformEnum.XHS),
|
||||
lt: Annotated[
|
||||
LoginTypeEnum,
|
||||
typer.Option(
|
||||
"--lt",
|
||||
help="登录方式 (qrcode=二维码 | phone=手机号 | cookie=Cookie)",
|
||||
rich_help_panel="账号配置",
|
||||
help="Login type (qrcode=QR Code | phone=Phone | cookie=Cookie)",
|
||||
rich_help_panel="Account Configuration",
|
||||
),
|
||||
] = _coerce_enum(LoginTypeEnum, config.LOGIN_TYPE, LoginTypeEnum.QRCODE),
|
||||
crawler_type: Annotated[
|
||||
CrawlerTypeEnum,
|
||||
typer.Option(
|
||||
"--type",
|
||||
help="爬取类型 (search=搜索 | detail=详情 | creator=创作者)",
|
||||
rich_help_panel="基础配置",
|
||||
help="Crawler type (search=Search | detail=Detail | creator=Creator)",
|
||||
rich_help_panel="Basic Configuration",
|
||||
),
|
||||
] = _coerce_enum(CrawlerTypeEnum, config.CRAWLER_TYPE, CrawlerTypeEnum.SEARCH),
|
||||
start: Annotated[
|
||||
int,
|
||||
typer.Option(
|
||||
"--start",
|
||||
help="起始页码",
|
||||
rich_help_panel="基础配置",
|
||||
help="Starting page number",
|
||||
rich_help_panel="Basic Configuration",
|
||||
),
|
||||
] = config.START_PAGE,
|
||||
keywords: Annotated[
|
||||
str,
|
||||
typer.Option(
|
||||
"--keywords",
|
||||
help="请输入关键词,多个关键词用逗号分隔",
|
||||
rich_help_panel="基础配置",
|
||||
help="Enter keywords, multiple keywords separated by commas",
|
||||
rich_help_panel="Basic Configuration",
|
||||
),
|
||||
] = config.KEYWORDS,
|
||||
get_comment: Annotated[
|
||||
str,
|
||||
typer.Option(
|
||||
"--get_comment",
|
||||
help="是否爬取一级评论,支持 yes/true/t/y/1 或 no/false/f/n/0",
|
||||
rich_help_panel="评论配置",
|
||||
help="Whether to crawl first-level comments, supports yes/true/t/y/1 or no/false/f/n/0",
|
||||
rich_help_panel="Comment Configuration",
|
||||
show_default=True,
|
||||
),
|
||||
] = str(config.ENABLE_GET_COMMENTS),
|
||||
@@ -181,17 +194,26 @@ async def parse_cmd(argv: Optional[Sequence[str]] = None):
|
||||
str,
|
||||
typer.Option(
|
||||
"--get_sub_comment",
|
||||
help="是否爬取二级评论,支持 yes/true/t/y/1 或 no/false/f/n/0",
|
||||
rich_help_panel="评论配置",
|
||||
help="Whether to crawl second-level comments, supports yes/true/t/y/1 or no/false/f/n/0",
|
||||
rich_help_panel="Comment Configuration",
|
||||
show_default=True,
|
||||
),
|
||||
] = str(config.ENABLE_GET_SUB_COMMENTS),
|
||||
headless: Annotated[
|
||||
str,
|
||||
typer.Option(
|
||||
"--headless",
|
||||
help="Whether to enable headless mode (applies to both Playwright and CDP), supports yes/true/t/y/1 or no/false/f/n/0",
|
||||
rich_help_panel="Runtime Configuration",
|
||||
show_default=True,
|
||||
),
|
||||
] = str(config.HEADLESS),
|
||||
save_data_option: Annotated[
|
||||
SaveDataOptionEnum,
|
||||
typer.Option(
|
||||
"--save_data_option",
|
||||
help="数据保存方式 (csv=CSV文件 | db=MySQL数据库 | json=JSON文件 | sqlite=SQLite数据库)",
|
||||
rich_help_panel="存储配置",
|
||||
help="Data save option (csv=CSV file | db=MySQL database | json=JSON file | sqlite=SQLite database | mongodb=MongoDB database | excel=Excel file | postgres=PostgreSQL database)",
|
||||
rich_help_panel="Storage Configuration",
|
||||
),
|
||||
] = _coerce_enum(
|
||||
SaveDataOptionEnum, config.SAVE_DATA_OPTION, SaveDataOptionEnum.JSON
|
||||
@@ -200,25 +222,62 @@ async def parse_cmd(argv: Optional[Sequence[str]] = None):
|
||||
Optional[InitDbOptionEnum],
|
||||
typer.Option(
|
||||
"--init_db",
|
||||
help="初始化数据库表结构 (sqlite | mysql)",
|
||||
rich_help_panel="存储配置",
|
||||
help="Initialize database table structure (sqlite | mysql | postgres)",
|
||||
rich_help_panel="Storage Configuration",
|
||||
),
|
||||
] = None,
|
||||
cookies: Annotated[
|
||||
str,
|
||||
typer.Option(
|
||||
"--cookies",
|
||||
help="Cookie 登录方式使用的 Cookie 值",
|
||||
rich_help_panel="账号配置",
|
||||
help="Cookie value used for Cookie login method",
|
||||
rich_help_panel="Account Configuration",
|
||||
),
|
||||
] = config.COOKIES,
|
||||
specified_id: Annotated[
|
||||
str,
|
||||
typer.Option(
|
||||
"--specified_id",
|
||||
help="Post/video ID list in detail mode, multiple IDs separated by commas (supports full URL or ID)",
|
||||
rich_help_panel="Basic Configuration",
|
||||
),
|
||||
] = "",
|
||||
creator_id: Annotated[
|
||||
str,
|
||||
typer.Option(
|
||||
"--creator_id",
|
||||
help="Creator ID list in creator mode, multiple IDs separated by commas (supports full URL or ID)",
|
||||
rich_help_panel="Basic Configuration",
|
||||
),
|
||||
] = "",
|
||||
max_comments_count_singlenotes: Annotated[
|
||||
int,
|
||||
typer.Option(
|
||||
"--max_comments_count_singlenotes",
|
||||
help="Maximum number of first-level comments to crawl per post/video",
|
||||
rich_help_panel="Comment Configuration",
|
||||
),
|
||||
] = config.CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES,
|
||||
max_concurrency_num: Annotated[
|
||||
int,
|
||||
typer.Option(
|
||||
"--max_concurrency_num",
|
||||
help="Maximum number of concurrent crawlers",
|
||||
rich_help_panel="Performance Configuration",
|
||||
),
|
||||
] = config.MAX_CONCURRENCY_NUM,
|
||||
) -> SimpleNamespace:
|
||||
"""MediaCrawler 命令行入口"""
|
||||
|
||||
enable_comment = _to_bool(get_comment)
|
||||
enable_sub_comment = _to_bool(get_sub_comment)
|
||||
enable_headless = _to_bool(headless)
|
||||
init_db_value = init_db.value if init_db else None
|
||||
|
||||
# Parse specified_id and creator_id into lists
|
||||
specified_id_list = [id.strip() for id in specified_id.split(",") if id.strip()] if specified_id else []
|
||||
creator_id_list = [id.strip() for id in creator_id.split(",") if id.strip()] if creator_id else []
|
||||
|
||||
# override global config
|
||||
config.PLATFORM = platform.value
|
||||
config.LOGIN_TYPE = lt.value
|
||||
@@ -227,8 +286,37 @@ async def parse_cmd(argv: Optional[Sequence[str]] = None):
|
||||
config.KEYWORDS = keywords
|
||||
config.ENABLE_GET_COMMENTS = enable_comment
|
||||
config.ENABLE_GET_SUB_COMMENTS = enable_sub_comment
|
||||
config.HEADLESS = enable_headless
|
||||
config.CDP_HEADLESS = enable_headless
|
||||
config.SAVE_DATA_OPTION = save_data_option.value
|
||||
config.COOKIES = cookies
|
||||
config.CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES = max_comments_count_singlenotes
|
||||
config.MAX_CONCURRENCY_NUM = max_concurrency_num
|
||||
|
||||
# Set platform-specific ID lists for detail/creator mode
|
||||
if specified_id_list:
|
||||
if platform == PlatformEnum.XHS:
|
||||
config.XHS_SPECIFIED_NOTE_URL_LIST = specified_id_list
|
||||
elif platform == PlatformEnum.BILIBILI:
|
||||
config.BILI_SPECIFIED_ID_LIST = specified_id_list
|
||||
elif platform == PlatformEnum.DOUYIN:
|
||||
config.DY_SPECIFIED_ID_LIST = specified_id_list
|
||||
elif platform == PlatformEnum.WEIBO:
|
||||
config.WEIBO_SPECIFIED_ID_LIST = specified_id_list
|
||||
elif platform == PlatformEnum.KUAISHOU:
|
||||
config.KS_SPECIFIED_ID_LIST = specified_id_list
|
||||
|
||||
if creator_id_list:
|
||||
if platform == PlatformEnum.XHS:
|
||||
config.XHS_CREATOR_ID_LIST = creator_id_list
|
||||
elif platform == PlatformEnum.BILIBILI:
|
||||
config.BILI_CREATOR_ID_LIST = creator_id_list
|
||||
elif platform == PlatformEnum.DOUYIN:
|
||||
config.DY_CREATOR_ID_LIST = creator_id_list
|
||||
elif platform == PlatformEnum.WEIBO:
|
||||
config.WEIBO_CREATOR_ID_LIST = creator_id_list
|
||||
elif platform == PlatformEnum.KUAISHOU:
|
||||
config.KS_CREATOR_ID_LIST = creator_id_list
|
||||
|
||||
return SimpleNamespace(
|
||||
platform=config.PLATFORM,
|
||||
@@ -238,9 +326,12 @@ async def parse_cmd(argv: Optional[Sequence[str]] = None):
|
||||
keywords=config.KEYWORDS,
|
||||
get_comment=config.ENABLE_GET_COMMENTS,
|
||||
get_sub_comment=config.ENABLE_GET_SUB_COMMENTS,
|
||||
headless=config.HEADLESS,
|
||||
save_data_option=config.SAVE_DATA_OPTION,
|
||||
init_db=init_db_value,
|
||||
cookies=config.COOKIES,
|
||||
specified_id=specified_id,
|
||||
creator_id=creator_id,
|
||||
)
|
||||
|
||||
command = typer.main.get_command(app)
|
||||
|
||||
@@ -1,13 +1,22 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/__init__.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
from .base_config import *
|
||||
from .db_config import *
|
||||
from .db_config import *
|
||||
|
||||
@@ -1,3 +1,12 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/base_config.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
@@ -55,14 +64,14 @@ CUSTOM_BROWSER_PATH = ""
|
||||
CDP_HEADLESS = False
|
||||
|
||||
# 浏览器启动超时时间(秒)
|
||||
BROWSER_LAUNCH_TIMEOUT = 30
|
||||
BROWSER_LAUNCH_TIMEOUT = 60
|
||||
|
||||
# 是否在程序结束时自动关闭浏览器
|
||||
# 设置为False可以保持浏览器运行,便于调试
|
||||
AUTO_CLOSE_BROWSER = True
|
||||
|
||||
# 数据保存类型选项配置,支持四种类型:csv、db、json、sqlite, 最好保存到DB,有排重的功能。
|
||||
SAVE_DATA_OPTION = "json" # csv or db or json or sqlite
|
||||
# 数据保存类型选项配置,支持六种类型:csv、db、json、sqlite、excel、postgres, 最好保存到DB,有排重的功能。
|
||||
SAVE_DATA_OPTION = "json" # csv or db or json or sqlite or excel or postgres
|
||||
|
||||
# 用户浏览器缓存的浏览器文件配置
|
||||
USER_DATA_DIR = "%s_user_data_dir" # %s will be replaced by platform name
|
||||
|
||||
@@ -1,4 +1,12 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/bilibili_config.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
|
||||
@@ -1,12 +1,21 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/db_config.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
import os
|
||||
@@ -28,7 +37,7 @@ mysql_db_config = {
|
||||
|
||||
|
||||
# redis config
|
||||
REDIS_DB_HOST = "127.0.0.1" # your redis host
|
||||
REDIS_DB_HOST = os.getenv("REDIS_DB_HOST", "127.0.0.1") # your redis host
|
||||
REDIS_DB_PWD = os.getenv("REDIS_DB_PWD", "123456") # your redis password
|
||||
REDIS_DB_PORT = os.getenv("REDIS_DB_PORT", 6379) # your redis port
|
||||
REDIS_DB_NUM = os.getenv("REDIS_DB_NUM", 0) # your redis db num
|
||||
@@ -42,4 +51,34 @@ SQLITE_DB_PATH = os.path.join(os.path.dirname(os.path.dirname(__file__)), "datab
|
||||
|
||||
sqlite_db_config = {
|
||||
"db_path": SQLITE_DB_PATH
|
||||
}
|
||||
}
|
||||
|
||||
# mongodb config
|
||||
MONGODB_HOST = os.getenv("MONGODB_HOST", "localhost")
|
||||
MONGODB_PORT = os.getenv("MONGODB_PORT", 27017)
|
||||
MONGODB_USER = os.getenv("MONGODB_USER", "")
|
||||
MONGODB_PWD = os.getenv("MONGODB_PWD", "")
|
||||
MONGODB_DB_NAME = os.getenv("MONGODB_DB_NAME", "media_crawler")
|
||||
|
||||
mongodb_config = {
|
||||
"host": MONGODB_HOST,
|
||||
"port": int(MONGODB_PORT),
|
||||
"user": MONGODB_USER,
|
||||
"password": MONGODB_PWD,
|
||||
"db_name": MONGODB_DB_NAME,
|
||||
}
|
||||
|
||||
# postgres config
|
||||
POSTGRES_DB_PWD = os.getenv("POSTGRES_DB_PWD", "123456")
|
||||
POSTGRES_DB_USER = os.getenv("POSTGRES_DB_USER", "postgres")
|
||||
POSTGRES_DB_HOST = os.getenv("POSTGRES_DB_HOST", "localhost")
|
||||
POSTGRES_DB_PORT = os.getenv("POSTGRES_DB_PORT", 5432)
|
||||
POSTGRES_DB_NAME = os.getenv("POSTGRES_DB_NAME", "media_crawler")
|
||||
|
||||
postgres_db_config = {
|
||||
"user": POSTGRES_DB_USER,
|
||||
"password": POSTGRES_DB_PWD,
|
||||
"host": POSTGRES_DB_HOST,
|
||||
"port": POSTGRES_DB_PORT,
|
||||
"db_name": POSTGRES_DB_NAME,
|
||||
}
|
||||
|
||||
@@ -1,3 +1,12 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/dy_config.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
@@ -22,7 +31,7 @@ DY_SPECIFIED_ID_LIST = [
|
||||
"https://www.douyin.com/video/7525538910311632128",
|
||||
"https://v.douyin.com/drIPtQ_WPWY/",
|
||||
"https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main&modal_id=7525538910311632128",
|
||||
"7202432992642387233",
|
||||
"7202432992642387233",
|
||||
# ........................
|
||||
]
|
||||
|
||||
|
||||
@@ -1,3 +1,12 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/ks_config.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
|
||||
@@ -1,3 +1,12 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/tieba_config.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
|
||||
@@ -1,3 +1,12 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/weibo_config.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
@@ -12,7 +21,7 @@
|
||||
# 微博平台配置
|
||||
|
||||
# 搜索类型,具体的枚举值在media_platform/weibo/field.py中
|
||||
WEIBO_SEARCH_TYPE = "popular"
|
||||
WEIBO_SEARCH_TYPE = "default"
|
||||
|
||||
# 指定微博ID列表
|
||||
WEIBO_SPECIFIED_ID_LIST = [
|
||||
@@ -22,6 +31,10 @@ WEIBO_SPECIFIED_ID_LIST = [
|
||||
|
||||
# 指定微博用户ID列表
|
||||
WEIBO_CREATOR_ID_LIST = [
|
||||
"5533390220",
|
||||
"5756404150",
|
||||
# ........................
|
||||
]
|
||||
|
||||
# 是否开启微博爬取全文的功能,默认开启
|
||||
# 如果开启的话会增加被风控的概率,相当于一个关键词搜索请求会再遍历所有帖子的时候,再请求一次帖子详情
|
||||
ENABLE_WEIBO_FULL_TEXT = True
|
||||
|
||||
@@ -1,4 +1,12 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/xhs_config.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
@@ -17,16 +25,13 @@ SORT_TYPE = "popularity_descending"
|
||||
|
||||
# 指定笔记URL列表, 必须要携带xsec_token参数
|
||||
XHS_SPECIFIED_NOTE_URL_LIST = [
|
||||
"https://www.xiaohongshu.com/explore/66fad51c000000001b0224b8?xsec_token=AB3rO-QopW5sgrJ41GwN01WCXh6yWPxjSoFI9D5JIMgKw=&xsec_source=pc_search"
|
||||
"https://www.xiaohongshu.com/explore/64b95d01000000000c034587?xsec_token=AB0EFqJvINCkj6xOCKCQgfNNh8GdnBC_6XecG4QOddo3Q=&xsec_source=pc_cfeed"
|
||||
# ........................
|
||||
]
|
||||
|
||||
# 指定创作者URL列表 (支持完整URL或纯ID)
|
||||
# 支持格式:
|
||||
# 1. 完整创作者主页URL (带xsec_token和xsec_source参数): "https://www.xiaohongshu.com/user/profile/5eb8e1d400000000010075ae?xsec_token=AB1nWBKCo1vE2HEkfoJUOi5B6BE5n7wVrbdpHoWIj5xHw=&xsec_source=pc_feed"
|
||||
# 2. 纯user_id: "63e36c9a000000002703502b"
|
||||
# 指定创作者URL列表,需要携带xsec_token和xsec_source参数
|
||||
|
||||
XHS_CREATOR_ID_LIST = [
|
||||
"https://www.xiaohongshu.com/user/profile/5eb8e1d400000000010075ae?xsec_token=AB1nWBKCo1vE2HEkfoJUOi5B6BE5n7wVrbdpHoWIj5xHw=&xsec_source=pc_feed",
|
||||
"63e36c9a000000002703502b",
|
||||
"https://www.xiaohongshu.com/user/profile/5f58bd990000000001003753?xsec_token=ABYVg1evluJZZzpMX-VWzchxQ1qSNVW3r-jOEnKqMcgZw=&xsec_source=pc_search"
|
||||
# ........................
|
||||
]
|
||||
|
||||
@@ -1,3 +1,12 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/config/zhihu_config.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
|
||||
@@ -1,12 +1,21 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/constant/__init__.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
@@ -1,14 +1,23 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/constant/baidu_tieba.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
TIEBA_URL = 'https://tieba.baidu.com'
|
||||
TIEBA_URL = 'https://tieba.baidu.com'
|
||||
|
||||
@@ -1,12 +1,21 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/constant/zhihu.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
# -*- coding: utf-8 -*-
|
||||
@@ -16,4 +25,3 @@ ZHIHU_ZHUANLAN_URL = "https://zhuanlan.zhihu.com"
|
||||
ANSWER_NAME = "answer"
|
||||
ARTICLE_NAME = "article"
|
||||
VIDEO_NAME = "zvideo"
|
||||
|
||||
|
||||
@@ -0,0 +1,17 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/database/__init__.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
@@ -1,7 +1,25 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/database/db.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
# persist-1<persist1@126.com>
|
||||
# 原因:将 db.py 改造为模块,移除直接执行入口,修复相对导入问题。
|
||||
# 副作用:无
|
||||
# 回滚策略:还原此文件。
|
||||
# Reason: Refactored db.py into a module, removed direct execution entry point, fixed relative import issues.
|
||||
# Side effects: None
|
||||
# Rollback strategy: Restore this file.
|
||||
import asyncio
|
||||
import sys
|
||||
from pathlib import Path
|
||||
@@ -16,7 +34,7 @@ from database.db_session import create_tables
|
||||
|
||||
async def init_table_schema(db_type: str):
|
||||
"""
|
||||
Initializes the database table schema.
|
||||
This will create tables based on the ORM models.
|
||||
Args:
|
||||
db_type: The type of database, 'sqlite' or 'mysql'.
|
||||
|
||||
@@ -1,10 +1,28 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/database/db_session.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
from sqlalchemy import text
|
||||
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
|
||||
from sqlalchemy.orm import sessionmaker
|
||||
from contextlib import asynccontextmanager
|
||||
from .models import Base
|
||||
import config
|
||||
from config.db_config import mysql_db_config, sqlite_db_config
|
||||
from config.db_config import mysql_db_config, sqlite_db_config, postgres_db_config
|
||||
|
||||
# Keep a cache of engines
|
||||
_engines = {}
|
||||
@@ -18,6 +36,18 @@ async def create_database_if_not_exists(db_type: str):
|
||||
async with engine.connect() as conn:
|
||||
await conn.execute(text(f"CREATE DATABASE IF NOT EXISTS {mysql_db_config['db_name']}"))
|
||||
await engine.dispose()
|
||||
elif db_type == "postgres":
|
||||
# Connect to the default 'postgres' database
|
||||
server_url = f"postgresql+asyncpg://{postgres_db_config['user']}:{postgres_db_config['password']}@{postgres_db_config['host']}:{postgres_db_config['port']}/postgres"
|
||||
print(f"[init_db] Connecting to Postgres: host={postgres_db_config['host']}, port={postgres_db_config['port']}, user={postgres_db_config['user']}, dbname=postgres")
|
||||
# Isolation level AUTOCOMMIT is required for CREATE DATABASE
|
||||
engine = create_async_engine(server_url, echo=False, isolation_level="AUTOCOMMIT")
|
||||
async with engine.connect() as conn:
|
||||
# Check if database exists
|
||||
result = await conn.execute(text(f"SELECT 1 FROM pg_database WHERE datname = '{postgres_db_config['db_name']}'"))
|
||||
if not result.scalar():
|
||||
await conn.execute(text(f"CREATE DATABASE {postgres_db_config['db_name']}"))
|
||||
await engine.dispose()
|
||||
|
||||
|
||||
def get_async_engine(db_type: str = None):
|
||||
@@ -34,6 +64,8 @@ def get_async_engine(db_type: str = None):
|
||||
db_url = f"sqlite+aiosqlite:///{sqlite_db_config['db_path']}"
|
||||
elif db_type == "mysql" or db_type == "db":
|
||||
db_url = f"mysql+asyncmy://{mysql_db_config['user']}:{mysql_db_config['password']}@{mysql_db_config['host']}:{mysql_db_config['port']}/{mysql_db_config['db_name']}"
|
||||
elif db_type == "postgres":
|
||||
db_url = f"postgresql+asyncpg://{postgres_db_config['user']}:{postgres_db_config['password']}@{postgres_db_config['host']}:{postgres_db_config['port']}/{postgres_db_config['db_name']}"
|
||||
else:
|
||||
raise ValueError(f"Unsupported database type: {db_type}")
|
||||
|
||||
@@ -67,4 +99,4 @@ async def get_session() -> AsyncSession:
|
||||
await session.rollback()
|
||||
raise e
|
||||
finally:
|
||||
await session.close()
|
||||
await session.close()
|
||||
|
||||
@@ -1,3 +1,21 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/database/models.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
from sqlalchemy import create_engine, Column, Integer, Text, String, BigInteger
|
||||
from sqlalchemy.ext.declarative import declarative_base
|
||||
from sqlalchemy.orm import sessionmaker
|
||||
@@ -388,9 +406,9 @@ class ZhihuContent(Base):
|
||||
last_modify_ts = Column(BigInteger)
|
||||
|
||||
# persist-1<persist1@126.com>
|
||||
# 原因:修复 ORM 模型定义错误,确保与数据库表结构一致。
|
||||
# 副作用:无
|
||||
# 回滚策略:还原此行
|
||||
# Reason: Fixed ORM model definition error, ensuring consistency with database table structure.
|
||||
# Side effects: None
|
||||
# Rollback strategy: Restore this line
|
||||
|
||||
class ZhihuComment(Base):
|
||||
__tablename__ = 'zhihu_comment'
|
||||
@@ -431,4 +449,4 @@ class ZhihuCreator(Base):
|
||||
column_count = Column(Integer, default=0)
|
||||
get_voteup_count = Column(Integer, default=0)
|
||||
add_ts = Column(BigInteger)
|
||||
last_modify_ts = Column(BigInteger)
|
||||
last_modify_ts = Column(BigInteger)
|
||||
|
||||
143
database/mongodb_store_base.py
Normal file
@@ -0,0 +1,143 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/database/mongodb_store_base.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
"""MongoDB storage base class: Provides connection management and common storage methods"""
|
||||
import asyncio
|
||||
from typing import Dict, List, Optional
|
||||
from motor.motor_asyncio import AsyncIOMotorClient, AsyncIOMotorDatabase, AsyncIOMotorCollection
|
||||
from config import db_config
|
||||
from tools import utils
|
||||
|
||||
|
||||
class MongoDBConnection:
|
||||
"""MongoDB connection management (singleton pattern)"""
|
||||
_instance = None
|
||||
_client: Optional[AsyncIOMotorClient] = None
|
||||
_db: Optional[AsyncIOMotorDatabase] = None
|
||||
_lock = asyncio.Lock()
|
||||
|
||||
def __new__(cls):
|
||||
if cls._instance is None:
|
||||
cls._instance = super(MongoDBConnection, cls).__new__(cls)
|
||||
return cls._instance
|
||||
|
||||
async def get_client(self) -> AsyncIOMotorClient:
|
||||
"""Get client"""
|
||||
if self._client is None:
|
||||
async with self._lock:
|
||||
if self._client is None:
|
||||
await self._connect()
|
||||
return self._client
|
||||
|
||||
async def get_db(self) -> AsyncIOMotorDatabase:
|
||||
"""Get database"""
|
||||
if self._db is None:
|
||||
async with self._lock:
|
||||
if self._db is None:
|
||||
await self._connect()
|
||||
return self._db
|
||||
|
||||
async def _connect(self):
|
||||
"""Establish connection"""
|
||||
try:
|
||||
mongo_config = db_config.mongodb_config
|
||||
host = mongo_config["host"]
|
||||
port = mongo_config["port"]
|
||||
user = mongo_config["user"]
|
||||
password = mongo_config["password"]
|
||||
db_name = mongo_config["db_name"]
|
||||
|
||||
# Build connection URL (with/without authentication)
|
||||
if user and password:
|
||||
connection_url = f"mongodb://{user}:{password}@{host}:{port}/"
|
||||
else:
|
||||
connection_url = f"mongodb://{host}:{port}/"
|
||||
|
||||
self._client = AsyncIOMotorClient(connection_url, serverSelectionTimeoutMS=5000)
|
||||
await self._client.server_info() # Test connection
|
||||
self._db = self._client[db_name]
|
||||
utils.logger.info(f"[MongoDBConnection] Connected to {host}:{port}/{db_name}")
|
||||
except Exception as e:
|
||||
utils.logger.error(f"[MongoDBConnection] Connection failed: {e}")
|
||||
raise
|
||||
|
||||
async def close(self):
|
||||
"""Close connection"""
|
||||
if self._client is not None:
|
||||
self._client.close()
|
||||
self._client = None
|
||||
self._db = None
|
||||
utils.logger.info("[MongoDBConnection] Connection closed")
|
||||
|
||||
|
||||
class MongoDBStoreBase:
|
||||
"""MongoDB storage base class: Provides common CRUD operations"""
|
||||
|
||||
def __init__(self, collection_prefix: str):
|
||||
"""Initialize storage base class
|
||||
Args:
|
||||
collection_prefix: Platform prefix (xhs/douyin/bilibili, etc.)
|
||||
"""
|
||||
self.collection_prefix = collection_prefix
|
||||
self._connection = MongoDBConnection()
|
||||
|
||||
async def get_collection(self, collection_suffix: str) -> AsyncIOMotorCollection:
|
||||
"""Get collection: {prefix}_{suffix}"""
|
||||
db = await self._connection.get_db()
|
||||
collection_name = f"{self.collection_prefix}_{collection_suffix}"
|
||||
return db[collection_name]
|
||||
|
||||
async def save_or_update(self, collection_suffix: str, query: Dict, data: Dict) -> bool:
|
||||
"""Save or update data (upsert)"""
|
||||
try:
|
||||
collection = await self.get_collection(collection_suffix)
|
||||
await collection.update_one(query, {"$set": data}, upsert=True)
|
||||
return True
|
||||
except Exception as e:
|
||||
utils.logger.error(f"[MongoDBStoreBase] Save failed ({self.collection_prefix}_{collection_suffix}): {e}")
|
||||
return False
|
||||
|
||||
async def find_one(self, collection_suffix: str, query: Dict) -> Optional[Dict]:
|
||||
"""Query a single record"""
|
||||
try:
|
||||
collection = await self.get_collection(collection_suffix)
|
||||
return await collection.find_one(query)
|
||||
except Exception as e:
|
||||
utils.logger.error(f"[MongoDBStoreBase] Find one failed ({self.collection_prefix}_{collection_suffix}): {e}")
|
||||
return None
|
||||
|
||||
async def find_many(self, collection_suffix: str, query: Dict, limit: int = 0) -> List[Dict]:
|
||||
"""Query multiple records (limit=0 means no limit)"""
|
||||
try:
|
||||
collection = await self.get_collection(collection_suffix)
|
||||
cursor = collection.find(query)
|
||||
if limit > 0:
|
||||
cursor = cursor.limit(limit)
|
||||
return await cursor.to_list(length=None)
|
||||
except Exception as e:
|
||||
utils.logger.error(f"[MongoDBStoreBase] Find many failed ({self.collection_prefix}_{collection_suffix}): {e}")
|
||||
return []
|
||||
|
||||
async def create_index(self, collection_suffix: str, keys: List[tuple], unique: bool = False):
|
||||
"""Create index: keys=[("field", 1)]"""
|
||||
try:
|
||||
collection = await self.get_collection(collection_suffix)
|
||||
await collection.create_index(keys, unique=unique)
|
||||
utils.logger.info(f"[MongoDBStoreBase] Index created on {self.collection_prefix}_{collection_suffix}")
|
||||
except Exception as e:
|
||||
utils.logger.error(f"[MongoDBStoreBase] Create index failed: {e}")
|
||||
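
The per-platform Mongo store implementations build on this base class; for orientation, here is a minimal usage sketch (the `xhs` prefix, the `note` suffix and the sample fields are illustrative assumptions, not taken from the code above):

```python
# Minimal sketch: upsert one document through MongoDBStoreBase and read it back.
# Assumes MongoDB is reachable with the settings from config/db_config.py.
import asyncio

from database.mongodb_store_base import MongoDBStoreBase


async def demo() -> None:
    store = MongoDBStoreBase(collection_prefix="xhs")  # collections become xhs_note, xhs_comment, ...
    await store.create_index("note", keys=[("note_id", 1)], unique=True)
    await store.save_or_update("note", query={"note_id": "123"}, data={"note_id": "123", "title": "demo"})
    print(await store.find_one("note", {"note_id": "123"}))


if __name__ == "__main__":
    asyncio.run(demo())
```
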
@@ -1,7 +1,8 @@
|
||||
import {defineConfig} from 'vitepress'
|
||||
import {withMermaid} from 'vitepress-plugin-mermaid'
|
||||
|
||||
// https://vitepress.dev/reference/site-config
|
||||
export default defineConfig({
|
||||
export default withMermaid(defineConfig({
|
||||
title: "MediaCrawler自媒体爬虫",
|
||||
description: "小红书爬虫,抖音爬虫, 快手爬虫, B站爬虫, 微博爬虫,百度贴吧爬虫,知乎爬虫...。 ",
|
||||
lastUpdated: true,
|
||||
@@ -43,6 +44,7 @@ export default defineConfig({
|
||||
text: 'MediaCrawler使用文档',
|
||||
items: [
|
||||
{text: '基本使用', link: '/'},
|
||||
{text: '项目架构文档', link: '/项目架构文档'},
|
||||
{text: '常见问题汇总', link: '/常见问题'},
|
||||
{text: 'IP代理使用', link: '/代理使用'},
|
||||
{text: '词云图使用', link: '/词云图使用配置'},
|
||||
@@ -59,7 +61,6 @@ export default defineConfig({
|
||||
text: 'MediaCrawler源码剖析课',
|
||||
link: 'https://relakkes.feishu.cn/wiki/JUgBwdhIeiSbAwkFCLkciHdAnhh'
|
||||
},
|
||||
{text: '知识星球文章专栏', link: '/知识星球介绍'},
|
||||
{text: '开发者咨询服务', link: '/开发者咨询'},
|
||||
]
|
||||
},
|
||||
@@ -86,4 +87,4 @@ export default defineConfig({
|
||||
{icon: 'github', link: 'https://github.com/NanmiCoder/MediaCrawler'}
|
||||
]
|
||||
}
|
||||
})
|
||||
}))
|
||||
|
||||
@@ -11,9 +11,9 @@ const fetchAds = async () => {
|
||||
return [
|
||||
{
|
||||
id: 1,
|
||||
imageUrl: 'https://github.com/NanmiCoder/MediaCrawler/raw/main/docs/static/images/auto_test.png',
|
||||
landingUrl: 'https://item.jd.com/10124939676219.html',
|
||||
text: '给好朋友虫师新书站台推荐 - 基于Python的自动化测试框架设计'
|
||||
imageUrl: 'https://github.com/NanmiCoder/MediaCrawler/raw/main/docs/static/images/MediaCrawlerPro.jpg',
|
||||
landingUrl: 'https://github.com/MediaCrawlerPro',
|
||||
text: '👏欢迎大家来订阅MediaCrawlerPro源代码'
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -63,7 +63,8 @@ onUnmounted(() => {
|
||||
}
|
||||
|
||||
.ad-image {
|
||||
max-width: 130px;
|
||||
max-width: 100%;
|
||||
width: 280px;
|
||||
height: auto;
|
||||
margin-bottom: 0.5rem;
|
||||
}
|
||||
|
||||
@@ -6,4 +6,5 @@
|
||||
:root {
|
||||
--vp-sidebar-width: 285px;
|
||||
--vp-sidebar-bg-color: var(--vp-c-bg-alt);
|
||||
}
|
||||
--vp-aside-width: 300px;
|
||||
}
|
||||
|
||||
67
docs/data_storage_guide.md
Normal file
@@ -0,0 +1,67 @@
|
||||
# 数据保存指南 / Data Storage Guide
|
||||
|
||||
|
||||
### 💾 数据保存
|
||||
|
||||
MediaCrawler 支持多种数据存储方式,您可以根据需求选择最适合的方案:
|
||||
|
||||
#### 存储方式
|
||||
|
||||
- **CSV 文件**:支持保存到 CSV 中(`data/` 目录下)
|
||||
- **JSON 文件**:支持保存到 JSON 中(`data/` 目录下)
|
||||
- **Excel 文件**:支持保存到格式化的 Excel 文件(`data/` 目录下)✨ 新功能
|
||||
- 多工作表支持(内容、评论、创作者)
|
||||
- 专业格式化(标题样式、自动列宽、边框)
|
||||
- 易于分析和分享
|
||||
- **数据库存储**
|
||||
- 使用参数 `--init_db` 进行数据库初始化(使用 `--init_db` 时无需携带其他可选参数)
|
||||
- **SQLite 数据库**:轻量级数据库,无需服务器,适合个人使用(推荐)
|
||||
1. 初始化:`--init_db sqlite`
|
||||
2. 数据存储:`--save_data_option sqlite`
|
||||
- **MySQL 数据库**:支持保存到关系型数据库 MySQL 中(需要提前创建数据库)
|
||||
1. 初始化:`--init_db mysql`
|
||||
2. 数据存储:`--save_data_option db`(db 参数为兼容历史更新保留)
|
||||
- **PostgreSQL 数据库**:支持保存到高级关系型数据库 PostgreSQL 中(推荐生产环境使用)
|
||||
1. 初始化:`--init_db postgres`
|
||||
2. 数据存储:`--save_data_option postgres`
|
||||
|
||||
#### 使用示例
|
||||
|
||||
```shell
|
||||
# 使用 Excel 存储数据(推荐用于数据分析)✨ 新功能
|
||||
uv run main.py --platform xhs --lt qrcode --type search --save_data_option excel
|
||||
|
||||
# 初始化 SQLite 数据库
|
||||
uv run main.py --init_db sqlite
|
||||
# 使用 SQLite 存储数据
|
||||
uv run main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
|
||||
```
|
||||
|
||||
```shell
|
||||
# 初始化 MySQL 数据库
|
||||
uv run main.py --init_db mysql
|
||||
# 使用 MySQL 存储数据(为适配历史更新,db参数进行沿用)
|
||||
uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
|
||||
```
|
||||
|
||||
```shell
|
||||
# 初始化 PostgreSQL 数据库
|
||||
uv run main.py --init_db postgres
|
||||
# 使用 PostgreSQL 存储数据
|
||||
uv run main.py --platform xhs --lt qrcode --type search --save_data_option postgres
|
||||
```
|
||||
|
||||
```shell
|
||||
# 使用 CSV 存储数据
|
||||
uv run main.py --platform xhs --lt qrcode --type search --save_data_option csv
|
||||
|
||||
# 使用 JSON 存储数据
|
||||
uv run main.py --platform xhs --lt qrcode --type search --save_data_option json
|
||||
```
|
||||
|
||||
#### 详细文档
|
||||
|
||||
- **Excel 导出详细指南**:查看 [Excel 导出指南](excel_export_guide.md)
|
||||
- **数据库配置**:参考 [常见问题](常见问题.md)
|
||||
|
||||
---
|
||||
244
docs/excel_export_guide.md
Normal file
@@ -0,0 +1,244 @@
|
||||
# Excel Export Guide
|
||||
|
||||
## Overview
|
||||
|
||||
MediaCrawler now supports exporting crawled data to formatted Excel files (.xlsx) with professional styling and multiple sheets for contents, comments, and creators.
|
||||
|
||||
## Features
|
||||
|
||||
- **Multi-sheet workbooks**: Separate sheets for Contents, Comments, and Creators
|
||||
- **Professional formatting**:
|
||||
- Styled headers with blue background and white text
|
||||
- Auto-adjusted column widths
|
||||
- Cell borders and text wrapping
|
||||
- Clean, readable layout
|
||||
- **Smart export**: Empty sheets are automatically removed
|
||||
- **Organized storage**: Files saved to `data/{platform}/` directory with timestamps
|
||||
|
||||
## Installation
|
||||
|
||||
Excel export requires the `openpyxl` library:
|
||||
|
||||
```bash
|
||||
# Using uv (recommended)
|
||||
uv sync
|
||||
|
||||
# Or using pip
|
||||
pip install openpyxl
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
### Basic Usage
|
||||
|
||||
1. **Configure Excel export** in `config/base_config.py`:
|
||||
|
||||
```python
|
||||
SAVE_DATA_OPTION = "excel" # Change from json/csv/db to excel
|
||||
```
|
||||
|
||||
2. **Run the crawler**:
|
||||
|
||||
```bash
|
||||
# Xiaohongshu example
|
||||
uv run main.py --platform xhs --lt qrcode --type search
|
||||
|
||||
# Douyin example
|
||||
uv run main.py --platform dy --lt qrcode --type search
|
||||
|
||||
# Bilibili example
|
||||
uv run main.py --platform bili --lt qrcode --type search
|
||||
```
|
||||
|
||||
3. **Find your Excel file** in `data/{platform}/` directory:
|
||||
- Filename format: `{platform}_{crawler_type}_{timestamp}.xlsx`
|
||||
- Example: `xhs_search_20250128_143025.xlsx`
|
||||
|
||||
### Command Line Examples
|
||||
|
||||
```bash
|
||||
# Search by keywords and export to Excel
|
||||
uv run main.py --platform xhs --lt qrcode --type search --save_data_option excel
|
||||
|
||||
# Crawl specific posts and export to Excel
|
||||
uv run main.py --platform xhs --lt qrcode --type detail --save_data_option excel
|
||||
|
||||
# Crawl creator profile and export to Excel
|
||||
uv run main.py --platform xhs --lt qrcode --type creator --save_data_option excel
|
||||
```
|
||||
|
||||
## Excel File Structure
|
||||
|
||||
### Contents Sheet
|
||||
Contains post/video information:
|
||||
- `note_id`: Unique post identifier
|
||||
- `title`: Post title
|
||||
- `desc`: Post description
|
||||
- `user_id`: Author user ID
|
||||
- `nickname`: Author nickname
|
||||
- `liked_count`: Number of likes
|
||||
- `comment_count`: Number of comments
|
||||
- `share_count`: Number of shares
|
||||
- `ip_location`: IP location
|
||||
- `image_list`: Comma-separated image URLs
|
||||
- `tag_list`: Comma-separated tags
|
||||
- `note_url`: Direct link to post
|
||||
- And more platform-specific fields...
|
||||
|
||||
### Comments Sheet
|
||||
Contains comment information:
|
||||
- `comment_id`: Unique comment identifier
|
||||
- `note_id`: Associated post ID
|
||||
- `content`: Comment text
|
||||
- `user_id`: Commenter user ID
|
||||
- `nickname`: Commenter nickname
|
||||
- `like_count`: Comment likes
|
||||
- `create_time`: Comment timestamp
|
||||
- `ip_location`: Commenter location
|
||||
- `sub_comment_count`: Number of replies
|
||||
- And more...
|
||||
|
||||
### Creators Sheet
|
||||
Contains creator/author information:
|
||||
- `user_id`: Unique user identifier
|
||||
- `nickname`: Display name
|
||||
- `gender`: Gender
|
||||
- `avatar`: Profile picture URL
|
||||
- `desc`: Bio/description
|
||||
- `fans`: Follower count
|
||||
- `follows`: Following count
|
||||
- `interaction`: Total interactions
|
||||
- And more...
|
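
To pull all three sheets into pandas at once, a small sketch (sheet names assumed to match the section headings above; the file name is the example from "Basic Usage"):

```python
# Load every sheet of an exported workbook into a dict of DataFrames.
import pandas as pd

sheets = pd.read_excel("data/xhs/xhs_search_20250128_143025.xlsx", sheet_name=None)
contents = sheets.get("Contents")
comments = sheets.get("Comments")
creators = sheets.get("Creators")
print({name: len(df) for name, df in sheets.items()})
```
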
||||
|
||||
## Advantages Over Other Formats
|
||||
|
||||
### vs CSV
|
||||
- ✅ Multiple sheets in one file
|
||||
- ✅ Professional formatting
|
||||
- ✅ Better handling of special characters
|
||||
- ✅ Auto-adjusted column widths
|
||||
- ✅ No encoding issues
|
||||
|
||||
### vs JSON
|
||||
- ✅ Human-readable tabular format
|
||||
- ✅ Easy to open in Excel/Google Sheets
|
||||
- ✅ Better for data analysis
|
||||
- ✅ Easier to share with non-technical users
|
||||
|
||||
### vs Database
|
||||
- ✅ No database setup required
|
||||
- ✅ Portable single-file format
|
||||
- ✅ Easy to share and archive
|
||||
- ✅ Works offline
|
||||
|
||||
## Tips & Best Practices
|
||||
|
||||
1. **Large datasets**: For very large crawls (>10,000 rows), consider using database storage instead for better performance
|
||||
|
||||
2. **Data analysis**: Excel files work great with:
|
||||
- Microsoft Excel
|
||||
- Google Sheets
|
||||
- LibreOffice Calc
|
||||
- Python pandas: `pd.read_excel('file.xlsx')`
|
||||
|
||||
3. **Combining data**: You can merge multiple Excel files using:
|
||||
```python
|
||||
import pandas as pd
|
||||
df1 = pd.read_excel('file1.xlsx', sheet_name='Contents')
|
||||
df2 = pd.read_excel('file2.xlsx', sheet_name='Contents')
|
||||
combined = pd.concat([df1, df2])
|
||||
combined.to_excel('combined.xlsx', index=False)
|
||||
```
|
||||
|
||||
4. **File size**: Excel files are typically 2-3x larger than CSV but smaller than JSON
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### "openpyxl not installed" error
|
||||
|
||||
```bash
|
||||
# Install openpyxl
|
||||
uv add openpyxl
|
||||
# or
|
||||
pip install openpyxl
|
||||
```
|
||||
|
||||
### Excel file not created
|
||||
|
||||
Check that:
|
||||
1. `SAVE_DATA_OPTION = "excel"` in config
|
||||
2. Crawler successfully collected data
|
||||
3. No errors in console output
|
||||
4. `data/{platform}/` directory exists
|
||||
|
||||
### Empty Excel file
|
||||
|
||||
This happens when:
|
||||
- No data was crawled (check keywords/IDs)
|
||||
- Login failed (check login status)
|
||||
- Platform blocked requests (check IP/rate limits)
|
||||
|
||||
## Example Output
|
||||
|
||||
After running a successful crawl, you'll see:
|
||||
|
||||
```
|
||||
[ExcelStoreBase] Initialized Excel export to: data/xhs/xhs_search_20250128_143025.xlsx
|
||||
[ExcelStoreBase] Stored content to Excel: 7123456789
|
||||
[ExcelStoreBase] Stored comment to Excel: comment_123
|
||||
...
|
||||
[Main] Excel file saved successfully
|
||||
```
|
||||
|
||||
Your Excel file will have:
|
||||
- Professional blue headers
|
||||
- Clean borders
|
||||
- Wrapped text for long content
|
||||
- Auto-sized columns
|
||||
- Separate organized sheets
|
||||
|
||||
## Advanced Usage
|
||||
|
||||
### Programmatic Access
|
||||
|
||||
```python
|
||||
from store.excel_store_base import ExcelStoreBase
|
||||
|
||||
# Create store
|
||||
store = ExcelStoreBase(platform="xhs", crawler_type="search")
|
||||
|
||||
# Store data
|
||||
await store.store_content({
|
||||
"note_id": "123",
|
||||
"title": "Test Post",
|
||||
"liked_count": 100
|
||||
})
|
||||
|
||||
# Save to file
|
||||
store.flush()
|
||||
```
|
||||
|
||||
### Custom Formatting
|
||||
|
||||
You can extend `ExcelStoreBase` to customize formatting:
|
||||
|
||||
```python
|
||||
from store.excel_store_base import ExcelStoreBase
|
||||
|
||||
class CustomExcelStore(ExcelStoreBase):
|
||||
def _apply_header_style(self, sheet, row_num=1):
|
||||
# Custom header styling
|
||||
super()._apply_header_style(sheet, row_num)
|
||||
# Add your customizations here
|
||||
```
|
||||
|
||||
## Support
|
||||
|
||||
For issues or questions:
|
||||
- Check [常见问题](常见问题.md)
|
||||
- Open an issue on GitHub
|
||||
- Join the WeChat discussion group
|
||||
|
||||
---
|
||||
|
||||
**Note**: Excel export is designed for learning and research purposes. Please respect platform terms of service and rate limits.
|
||||
113
docs/index.md
@@ -1,58 +1,80 @@
|
||||
# MediaCrawler使用方法
|
||||
|
||||
## 创建并激活 python 虚拟环境
|
||||
> 如果是爬取抖音和知乎,需要提前安装nodejs环境,版本大于等于:`16`即可 <br>
|
||||
```shell
|
||||
# 进入项目根目录
|
||||
cd MediaCrawler
|
||||
|
||||
# 创建虚拟环境
|
||||
# 我的python版本是:3.9.6,requirements.txt中的库是基于这个版本的,如果是其他python版本,可能requirements.txt中的库不兼容,自行解决一下。
|
||||
python -m venv venv
|
||||
|
||||
# macos & linux 激活虚拟环境
|
||||
source venv/bin/activate
|
||||
## 项目文档
|
||||
|
||||
# windows 激活虚拟环境
|
||||
venv\Scripts\activate
|
||||
- [项目架构文档](项目架构文档.md) - 系统架构、模块设计、数据流向(含 Mermaid 图表)
|
||||
|
||||
```
|
||||
## 推荐:使用 uv 管理依赖
|
||||
|
||||
## 安装依赖库
|
||||
### 1. 前置依赖
|
||||
- 安装 [uv](https://docs.astral.sh/uv/getting-started/installation),并用 `uv --version` 验证。
|
||||
- Python 版本建议使用 **3.11**(当前依赖基于该版本构建)。
|
||||
- 安装 Node.js(抖音、知乎等平台需要),版本需 `>= 16.0.0`。
|
||||
|
||||
```shell
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
### 2. 同步 Python 依赖
|
||||
```shell
|
||||
# 进入项目根目录
|
||||
cd MediaCrawler
|
||||
|
||||
## 安装 playwright浏览器驱动
|
||||
# 使用 uv 保证 Python 版本和依赖一致性
|
||||
uv sync
|
||||
```
|
||||
|
||||
```shell
|
||||
playwright install
|
||||
```
|
||||
### 3. 安装 Playwright 浏览器驱动
|
||||
```shell
|
||||
uv run playwright install
|
||||
```
|
||||
> 项目已支持使用 Playwright 连接本地 Chrome。如需使用 CDP 方式,可在 `config/base_config.py` 中调整 `xhs` 和 `dy` 的相关配置。
|
||||
|
||||
## 运行爬虫程序
|
||||
### 4. 运行爬虫程序
|
||||
```shell
|
||||
# 项目默认未开启评论爬取,如需评论请在 config/base_config.py 中修改 ENABLE_GET_COMMENTS
|
||||
# 其他功能开关也可在 config/base_config.py 查看,均有中文注释
|
||||
|
||||
```shell
|
||||
### 项目默认是没有开启评论爬取模式,如需评论请在config/base_config.py中的 ENABLE_GET_COMMENTS 变量修改
|
||||
### 一些其他支持项,也可以在config/base_config.py查看功能,写的有中文注释
|
||||
|
||||
# 从配置文件中读取关键词搜索相关的帖子并爬取帖子信息与评论
|
||||
python main.py --platform xhs --lt qrcode --type search
|
||||
|
||||
# 从配置文件中读取指定的帖子ID列表获取指定帖子的信息与评论信息
|
||||
python main.py --platform xhs --lt qrcode --type detail
|
||||
|
||||
# 使用SQLite数据库存储数据(推荐个人用户使用)
|
||||
python main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
|
||||
|
||||
# 使用MySQL数据库存储数据
|
||||
python main.py --platform xhs --lt qrcode --type search --save_data_option db
|
||||
|
||||
# 打开对应APP扫二维码登录
|
||||
|
||||
# 其他平台爬虫使用示例,执行下面的命令查看
|
||||
python main.py --help
|
||||
```
|
||||
# 从配置中读取关键词搜索并爬取帖子与评论
|
||||
uv run main.py --platform xhs --lt qrcode --type search
|
||||
|
||||
# 从配置中读取指定帖子ID列表并爬取帖子与评论
|
||||
uv run main.py --platform xhs --lt qrcode --type detail
|
||||
|
||||
# 使用 SQLite 数据库存储数据(推荐个人用户使用)
|
||||
uv run main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
|
||||
|
||||
# 使用 MySQL 数据库存储数据
|
||||
uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
|
||||
|
||||
# 其他平台示例
|
||||
uv run main.py --help
|
||||
```
|
||||
|
||||
## 备选:Python 原生 venv(不推荐)
|
||||
> 如果爬取抖音或知乎,需要提前安装 Node.js,版本 `>= 16`。
|
||||
```shell
|
||||
# 进入项目根目录
|
||||
cd MediaCrawler
|
||||
|
||||
# 创建虚拟环境(示例 Python 版本:3.11,requirements 基于该版本)
|
||||
python -m venv venv
|
||||
|
||||
# macOS & Linux 激活虚拟环境
|
||||
source venv/bin/activate
|
||||
|
||||
# Windows 激活虚拟环境
|
||||
venv\Scripts\activate
|
||||
```
|
||||
```shell
|
||||
# 安装依赖与驱动
|
||||
pip install -r requirements.txt
|
||||
playwright install
|
||||
```
|
||||
```shell
|
||||
# 运行爬虫程序(venv 环境)
|
||||
python main.py --platform xhs --lt qrcode --type search
|
||||
python main.py --platform xhs --lt qrcode --type detail
|
||||
python main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
|
||||
python main.py --platform xhs --lt qrcode --type search --save_data_option db
|
||||
python main.py --help
|
||||
```
|
||||
|
||||
## 💾 数据存储
|
||||
|
||||
@@ -74,4 +96,3 @@
|
||||
> 大家请以学习为目的使用本仓库,爬虫违法违规的案件:https://github.com/HiddenStrawberry/Crawler_Illegal_Cases_In_China <br>
|
||||
>
|
||||
>本项目的所有内容仅供学习和参考之用,禁止用于商业用途。任何人或组织不得将本仓库的内容用于非法用途或侵犯他人合法权益。本仓库所涉及的爬虫技术仅用于学习和研究,不得用于对其他平台进行大规模爬虫或其他非法行为。对于因使用本仓库内容而引起的任何法律责任,本仓库不承担任何责任。使用本仓库的内容即表示您同意本免责声明的所有条款和条件。
|
||||
|
||||
|
||||
BIN
docs/static/images/11群二维码.JPG
vendored
|
Before Width: | Height: | Size: 171 KiB |
BIN
docs/static/images/12群二维码.JPG
vendored
|
Before Width: | Height: | Size: 170 KiB |
BIN
docs/static/images/13群二维码.JPG
vendored
|
Before Width: | Height: | Size: 168 KiB |
BIN
docs/static/images/14群二维码.jpeg
vendored
|
Before Width: | Height: | Size: 161 KiB |
BIN
docs/static/images/MediaCrawlerPro.jpg
vendored
Normal file
|
After Width: | Height: | Size: 158 KiB |
BIN
docs/static/images/QIWEI.png
vendored
Normal file
|
After Width: | Height: | Size: 22 KiB |
BIN
docs/static/images/Thordata.png
vendored
Normal file
|
After Width: | Height: | Size: 486 KiB |
BIN
docs/static/images/img_8.png
vendored
Normal file
|
After Width: | Height: | Size: 944 KiB |
BIN
docs/static/images/xingqiu.jpg
vendored
|
Before Width: | Height: | Size: 241 KiB |
BIN
docs/static/images/星球qrcode.jpg
vendored
|
Before Width: | Height: | Size: 229 KiB |
@@ -1,12 +1,12 @@
|
||||
# 关于作者
|
||||
> 大家都叫我阿江,网名:程序员阿江-Relakkes,目前裸辞正探索自由职业,希望能靠自己的技术能力和努力,实现自己理想的生活方式。
|
||||
>
|
||||
> 我身边有大量的技术人脉资源,如果大家有一些爬虫咨询或者编程单子可以向我丢过来
|
||||
> 大家都叫我阿江,网名:程序员阿江-Relakkes,目前是一名独立开发者,专注于 AI Agent 和爬虫相关的开发工作,All in AI。
|
||||
|
||||
- [Github万星开源自媒体爬虫仓库MediaCrawler作者](https://github.com/NanmiCoder/MediaCrawler)
|
||||
- 全栈程序员,熟悉Python、Golang、JavaScript,工作中主要用Golang。
|
||||
- 曾经主导并参与过百万级爬虫采集系统架构设计与编码
|
||||
- 爬虫是一种技术兴趣爱好,参与爬虫有一种对抗的感觉,越难越兴奋。
|
||||
- 目前专注于 AI Agent 领域,积极探索 AI 技术的应用与创新
|
||||
- 如果你有 AI Agent 相关的项目需要合作,欢迎联系我,我有很多时间可以投入
|
||||
|
||||
## 微信联系方式
|
||||

|
||||
|
||||
106
docs/原生环境管理文档.md
@@ -1,52 +1,74 @@
|
||||
## 使用python原生venv管理依赖(不推荐了)
|
||||
# 本地原生环境管理
|
||||
|
||||
## 创建并激活 python 虚拟环境
|
||||
> 如果是爬取抖音和知乎,需要提前安装nodejs环境,版本大于等于:`16`即可 <br>
|
||||
> 新增 [uv](https://github.com/astral-sh/uv) 来管理项目依赖,使用uv来替代python版本管理、pip进行依赖安装,更加方便快捷
|
||||
```shell
|
||||
# 进入项目根目录
|
||||
cd MediaCrawler
|
||||
|
||||
# 创建虚拟环境
|
||||
# 我的python版本是:3.9.6,requirements.txt中的库是基于这个版本的,如果是其他python版本,可能requirements.txt中的库不兼容,自行解决一下。
|
||||
python -m venv venv
|
||||
|
||||
# macos & linux 激活虚拟环境
|
||||
source venv/bin/activate
|
||||
## 推荐方案:使用 uv 管理依赖
|
||||
|
||||
# windows 激活虚拟环境
|
||||
venv\Scripts\activate
|
||||
### 1. 前置依赖
|
||||
- 安装 [uv](https://docs.astral.sh/uv/getting-started/installation),并使用 `uv --version` 验证。
|
||||
- Python 版本建议使用 **3.11**(当前依赖基于该版本构建)。
|
||||
- 安装 Node.js(抖音、知乎等平台需要),版本需 `>= 16.0.0`。
|
||||
|
||||
```
|
||||
### 2. 同步 Python 依赖
|
||||
```shell
|
||||
# 进入项目根目录
|
||||
cd MediaCrawler
|
||||
|
||||
## 安装依赖库
|
||||
# 使用 uv 保证 Python 版本和依赖一致性
|
||||
uv sync
|
||||
```
|
||||
|
||||
```shell
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
### 3. 安装 Playwright 浏览器驱动
|
||||
```shell
|
||||
uv run playwright install
|
||||
```
|
||||
> 项目已支持使用 Playwright 连接本地 Chrome。如需使用 CDP 方式,可在 `config/base_config.py` 中调整 `xhs` 和 `dy` 的相关配置。
|
||||
|
||||
## 查看配置文件
|
||||
### 4. 运行爬虫程序
|
||||
```shell
|
||||
# 项目默认未开启评论爬取,如需评论请在 config/base_config.py 中修改 ENABLE_GET_COMMENTS
|
||||
# 其他功能开关也可在 config/base_config.py 查看,均有中文注释
|
||||
|
||||
## 安装 playwright浏览器驱动 (非必需)
|
||||
# 从配置中读取关键词搜索并爬取帖子与评论
|
||||
uv run main.py --platform xhs --lt qrcode --type search
|
||||
|
||||
```shell
|
||||
playwright install
|
||||
```
|
||||
# 从配置中读取指定帖子ID列表并爬取帖子与评论
|
||||
uv run main.py --platform xhs --lt qrcode --type detail
|
||||
|
||||
## 运行爬虫程序
|
||||
# 其他平台示例
|
||||
uv run main.py --help
|
||||
```
|
||||
|
||||
```shell
|
||||
### 项目默认是没有开启评论爬取模式,如需评论请在config/base_config.py中的 ENABLE_GET_COMMENTS 变量修改
|
||||
### 一些其他支持项,也可以在config/base_config.py查看功能,写的有中文注释
|
||||
|
||||
# 从配置文件中读取关键词搜索相关的帖子并爬取帖子信息与评论
|
||||
python main.py --platform xhs --lt qrcode --type search
|
||||
|
||||
# 从配置文件中读取指定的帖子ID列表获取指定帖子的信息与评论信息
|
||||
python main.py --platform xhs --lt qrcode --type detail
|
||||
|
||||
# 打开对应APP扫二维码登录
|
||||
|
||||
# 其他平台爬虫使用示例,执行下面的命令查看
|
||||
python main.py --help
|
||||
```
|
||||
## 备选方案:Python 原生 venv(不推荐)
|
||||
|
||||
### 创建并激活虚拟环境
|
||||
> 如果爬取抖音或知乎,需要提前安装 Node.js,版本 `>= 16`。
|
||||
```shell
|
||||
# 进入项目根目录
|
||||
cd MediaCrawler
|
||||
|
||||
# 创建虚拟环境(示例 Python 版本:3.11,requirements 基于该版本)
|
||||
python -m venv venv
|
||||
|
||||
# macOS & Linux 激活虚拟环境
|
||||
source venv/bin/activate
|
||||
|
||||
# Windows 激活虚拟环境
|
||||
venv\Scripts\activate
|
||||
```
|
||||
|
||||
### 安装依赖与驱动
|
||||
```shell
|
||||
pip install -r requirements.txt
|
||||
playwright install
|
||||
```
|
||||
|
||||
### 运行爬虫程序(venv 环境)
|
||||
```shell
|
||||
# 从配置中读取关键词搜索并爬取帖子与评论
|
||||
python main.py --platform xhs --lt qrcode --type search
|
||||
|
||||
# 从配置中读取指定帖子ID列表并爬取帖子与评论
|
||||
python main.py --platform xhs --lt qrcode --type detail
|
||||
|
||||
# 更多示例
|
||||
python main.py --help
|
||||
```
|
||||
|
||||
@@ -9,4 +9,4 @@
|
||||
>
|
||||
> 如果图片展示不出来或过期,可以直接添加我的微信号:relakkes,并备注github,会有拉群小助手自动拉你进群
|
||||
|
||||

|
||||

|
||||
@@ -15,5 +15,3 @@
|
||||
## MediaCrawler源码剖析视频课程
|
||||
[mediacrawler源码课程介绍](https://relakkes.feishu.cn/wiki/JUgBwdhIeiSbAwkFCLkciHdAnhh)
|
||||
|
||||
## 知识星球爬虫逆向、编程专栏
|
||||
[知识星球专栏介绍](知识星球介绍.md)
|
||||
|
||||
@@ -1,31 +0,0 @@
|
||||
# 知识星球专栏
|
||||
|
||||
## 基本介绍
|
||||
|
||||
文章:
|
||||
- 1.爬虫JS逆向案例分享
|
||||
- 2.MediaCrawler技术实现分享。
|
||||
- 3.沉淀python开发经验和技巧
|
||||
- ......................
|
||||
|
||||
提问:
|
||||
- 4.在星球内向我提问关于MediaCrawler、爬虫、编程任何问题
|
||||
|
||||
## 章节内容
|
||||
- [逆向案例 - 某16x8平台商品列表接口逆向参数分析](https://articles.zsxq.com/id_x1qmtg8pzld9.html)
|
||||
- [逆向案例 - Product Hunt月度最佳产品榜单接口加密参数分析](https://articles.zsxq.com/id_au4eich3x2sg.html)
|
||||
- [逆向案例 - 某zhi乎x-zse-96参数分析过程](https://articles.zsxq.com/id_dui2vil0ag1l.html)
|
||||
- [逆向案例 - 某x识星球X-Signature加密参数分析过程](https://articles.zsxq.com/id_pp4madwcwcg8.html)
|
||||
- [【独创】使用Playwright获取某音a_bogus参数流程(包含加密参数分析)](https://articles.zsxq.com/id_u89al50jk9x0.html)
|
||||
- [【独创】使用Playwright低成本获取某书X-s参数流程分析(当年的回忆录)](https://articles.zsxq.com/id_u4lcrvqakuc7.html)
|
||||
- [ MediaCrawler-基于抽象类设计重构项目缓存](https://articles.zsxq.com/id_4ju73oxewt9j.html)
|
||||
- [ 手把手带你撸一个自己的IP代理池](https://articles.zsxq.com/id_38fza371ladm.html)
|
||||
- [一次Mysql数据库中混用collation排序规则带来的bug](https://articles.zsxq.com/id_pibwr1wnst2p.html)
|
||||
- [错误使用 Python 可变类型带来的隐藏 Bug](https://articles.zsxq.com/id_f7vn89l1d303.html)
|
||||
- [【MediaCrawler】微博帖子评论爬虫教程](https://articles.zsxq.com/id_vrmuhw0ovj3t.html)
|
||||
- [Python协程在并发场景下的幂等性问题](https://articles.zsxq.com/id_wocdwsfmfcmp.html)
|
||||
- ........................................
|
||||
|
||||
## 加入星球
|
||||

|
||||
|
||||
883
docs/项目架构文档.md
Normal file
@@ -0,0 +1,883 @@
|
||||
# MediaCrawler 项目架构文档
|
||||
|
||||
## 1. 项目概述
|
||||
|
||||
### 1.1 项目简介
|
||||
|
||||
MediaCrawler 是一个多平台自媒体爬虫框架,采用 Python 异步编程实现,支持爬取主流社交媒体平台的内容、评论和创作者信息。
|
||||
|
||||
### 1.2 支持的平台
|
||||
|
||||
| 平台 | 代号 | 主要功能 |
|
||||
|------|------|---------|
|
||||
| 小红书 | `xhs` | 笔记搜索、详情、创作者 |
|
||||
| 抖音 | `dy` | 视频搜索、详情、创作者 |
|
||||
| 快手 | `ks` | 视频搜索、详情、创作者 |
|
||||
| B站 | `bili` | 视频搜索、详情、UP主 |
|
||||
| 微博 | `wb` | 微博搜索、详情、博主 |
|
||||
| 百度贴吧 | `tieba` | 帖子搜索、详情 |
|
||||
| 知乎 | `zhihu` | 问答搜索、详情、答主 |
|
||||
|
||||
### 1.3 核心功能特性
|
||||
|
||||
- **多平台支持**:统一的爬虫接口,支持 7 大主流平台
|
||||
- **多种登录方式**:二维码、手机号、Cookie 三种登录方式
|
||||
- **多种存储方式**:CSV、JSON、SQLite、MySQL、MongoDB、Excel
|
||||
- **反爬虫对策**:CDP 模式、代理 IP 池、请求签名
|
||||
- **异步高并发**:基于 asyncio 的异步架构,高效并发爬取
|
||||
- **词云生成**:自动生成评论词云图
|
||||
|
||||
---
|
||||
|
||||
## 2. 系统架构总览
|
||||
|
||||
### 2.1 高层架构图
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
subgraph Entry["入口层"]
|
||||
main["main.py<br/>程序入口"]
|
||||
cmdarg["cmd_arg<br/>命令行参数"]
|
||||
config["config<br/>配置管理"]
|
||||
end
|
||||
|
||||
subgraph Core["核心爬虫层"]
|
||||
factory["CrawlerFactory<br/>爬虫工厂"]
|
||||
base["AbstractCrawler<br/>爬虫基类"]
|
||||
|
||||
subgraph Platforms["平台实现"]
|
||||
xhs["XiaoHongShuCrawler"]
|
||||
dy["DouYinCrawler"]
|
||||
ks["KuaishouCrawler"]
|
||||
bili["BilibiliCrawler"]
|
||||
wb["WeiboCrawler"]
|
||||
tieba["TieBaCrawler"]
|
||||
zhihu["ZhihuCrawler"]
|
||||
end
|
||||
end
|
||||
|
||||
subgraph Client["API客户端层"]
|
||||
absClient["AbstractApiClient<br/>客户端基类"]
|
||||
xhsClient["XiaoHongShuClient"]
|
||||
dyClient["DouYinClient"]
|
||||
ksClient["KuaiShouClient"]
|
||||
biliClient["BilibiliClient"]
|
||||
wbClient["WeiboClient"]
|
||||
tiebaClient["BaiduTieBaClient"]
|
||||
zhihuClient["ZhiHuClient"]
|
||||
end
|
||||
|
||||
subgraph Storage["数据存储层"]
|
||||
storeFactory["StoreFactory<br/>存储工厂"]
|
||||
csv["CSV存储"]
|
||||
json["JSON存储"]
|
||||
sqlite["SQLite存储"]
|
||||
mysql["MySQL存储"]
|
||||
mongodb["MongoDB存储"]
|
||||
excel["Excel存储"]
|
||||
end
|
||||
|
||||
subgraph Infra["基础设施层"]
|
||||
browser["浏览器管理<br/>Playwright/CDP"]
|
||||
proxy["代理IP池"]
|
||||
cache["缓存系统"]
|
||||
login["登录管理"]
|
||||
end
|
||||
|
||||
main --> factory
|
||||
cmdarg --> main
|
||||
config --> main
|
||||
factory --> base
|
||||
base --> Platforms
|
||||
Platforms --> Client
|
||||
Client --> Storage
|
||||
Client --> Infra
|
||||
Storage --> storeFactory
|
||||
storeFactory --> csv & json & sqlite & mysql & mongodb & excel
|
||||
```
|
||||
|
||||
### 2.2 数据流向图
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
subgraph Input["输入"]
|
||||
keywords["关键词/ID"]
|
||||
config["配置参数"]
|
||||
end
|
||||
|
||||
subgraph Process["处理流程"]
|
||||
browser["启动浏览器"]
|
||||
login["登录认证"]
|
||||
search["搜索/爬取"]
|
||||
parse["数据解析"]
|
||||
comment["获取评论"]
|
||||
end
|
||||
|
||||
subgraph Output["输出"]
|
||||
content["内容数据"]
|
||||
comments["评论数据"]
|
||||
creator["创作者数据"]
|
||||
media["媒体文件"]
|
||||
end
|
||||
|
||||
subgraph Storage["存储"]
|
||||
file["文件存储<br/>CSV/JSON/Excel"]
|
||||
db["数据库<br/>SQLite/MySQL"]
|
||||
nosql["NoSQL<br/>MongoDB"]
|
||||
end
|
||||
|
||||
keywords --> browser
|
||||
config --> browser
|
||||
browser --> login
|
||||
login --> search
|
||||
search --> parse
|
||||
parse --> comment
|
||||
parse --> content
|
||||
comment --> comments
|
||||
parse --> creator
|
||||
parse --> media
|
||||
content & comments & creator --> file & db & nosql
|
||||
media --> file
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 3. 目录结构
|
||||
|
||||
```
|
||||
MediaCrawler/
|
||||
├── main.py # 程序入口
|
||||
├── var.py # 全局上下文变量
|
||||
├── pyproject.toml # 项目配置
|
||||
│
|
||||
├── base/ # 基础抽象类
|
||||
│ └── base_crawler.py # 爬虫、登录、存储、客户端基类
|
||||
│
|
||||
├── config/ # 配置管理
|
||||
│ ├── base_config.py # 核心配置
|
||||
│ ├── db_config.py # 数据库配置
|
||||
│ └── {platform}_config.py # 平台特定配置
|
||||
│
|
||||
├── media_platform/ # 平台爬虫实现
|
||||
│ ├── xhs/ # 小红书
|
||||
│ ├── douyin/ # 抖音
|
||||
│ ├── kuaishou/ # 快手
|
||||
│ ├── bilibili/ # B站
|
||||
│ ├── weibo/ # 微博
|
||||
│ ├── tieba/ # 百度贴吧
|
||||
│ └── zhihu/ # 知乎
|
||||
│
|
||||
├── store/ # 数据存储
|
||||
│ ├── excel_store_base.py # Excel存储基类
|
||||
│ └── {platform}/ # 各平台存储实现
|
||||
│
|
||||
├── database/ # 数据库层
|
||||
│ ├── models.py # ORM模型定义
|
||||
│ ├── db_session.py # 数据库会话管理
|
||||
│ └── mongodb_store_base.py # MongoDB基类
|
||||
│
|
||||
├── proxy/ # 代理管理
|
||||
│ ├── proxy_ip_pool.py # IP池管理
|
||||
│ ├── proxy_mixin.py # 代理刷新混入
|
||||
│ └── providers/ # 代理提供商
|
||||
│
|
||||
├── cache/ # 缓存系统
|
||||
│ ├── abs_cache.py # 缓存抽象类
|
||||
│ ├── local_cache.py # 本地缓存
|
||||
│ └── redis_cache.py # Redis缓存
|
||||
│
|
||||
├── tools/ # 工具模块
|
||||
│ ├── app_runner.py # 应用运行管理
|
||||
│ ├── browser_launcher.py # 浏览器启动
|
||||
│ ├── cdp_browser.py # CDP浏览器管理
|
||||
│ ├── crawler_util.py # 爬虫工具
|
||||
│ └── async_file_writer.py # 异步文件写入
|
||||
│
|
||||
├── model/ # 数据模型
|
||||
│ └── m_{platform}.py # Pydantic模型
|
||||
│
|
||||
├── libs/ # JS脚本库
|
||||
│ └── stealth.min.js # 反检测脚本
|
||||
│
|
||||
└── cmd_arg/ # 命令行参数
|
||||
└── arg.py # 参数定义
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. 核心模块详解
|
||||
|
||||
### 4.1 爬虫基类体系
|
||||
|
||||
```mermaid
|
||||
classDiagram
|
||||
class AbstractCrawler {
|
||||
<<abstract>>
|
||||
+start()* 启动爬虫
|
||||
+search()* 搜索功能
|
||||
+launch_browser() 启动浏览器
|
||||
+launch_browser_with_cdp() CDP模式启动
|
||||
}
|
||||
|
||||
class AbstractLogin {
|
||||
<<abstract>>
|
||||
+begin()* 开始登录
|
||||
+login_by_qrcode()* 二维码登录
|
||||
+login_by_mobile()* 手机号登录
|
||||
+login_by_cookies()* Cookie登录
|
||||
}
|
||||
|
||||
class AbstractStore {
|
||||
<<abstract>>
|
||||
+store_content()* 存储内容
|
||||
+store_comment()* 存储评论
|
||||
+store_creator()* 存储创作者
|
||||
+store_image()* 存储图片
|
||||
+store_video()* 存储视频
|
||||
}
|
||||
|
||||
class AbstractApiClient {
|
||||
<<abstract>>
|
||||
+request()* HTTP请求
|
||||
+update_cookies()* 更新Cookies
|
||||
}
|
||||
|
||||
class ProxyRefreshMixin {
|
||||
+init_proxy_pool() 初始化代理池
|
||||
+_refresh_proxy_if_expired() 刷新过期代理
|
||||
}
|
||||
|
||||
class XiaoHongShuCrawler {
|
||||
+xhs_client: XiaoHongShuClient
|
||||
+start()
|
||||
+search()
|
||||
+get_specified_notes()
|
||||
+get_creators_and_notes()
|
||||
}
|
||||
|
||||
class XiaoHongShuClient {
|
||||
+playwright_page: Page
|
||||
+cookie_dict: Dict
|
||||
+request()
|
||||
+pong() 检查登录状态
|
||||
+get_note_by_keyword()
|
||||
+get_note_by_id()
|
||||
}
|
||||
|
||||
AbstractCrawler <|-- XiaoHongShuCrawler
|
||||
AbstractApiClient <|-- XiaoHongShuClient
|
||||
ProxyRefreshMixin <|-- XiaoHongShuClient
|
||||
```
|
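
按照上面的类图,接入一个新平台时主要工作是继承 `AbstractCrawler` 并实现抽象方法。下面是一个极简示意(`DemoCrawler` 为示例假设,并非仓库中的真实类,方法体仅为占位):

```python
# 示意:AbstractCrawler 子类的大致形态,真实实现见 media_platform/*/core.py
from base.base_crawler import AbstractCrawler


class DemoCrawler(AbstractCrawler):
    async def start(self) -> None:
        # 真实实现中:启动浏览器、完成登录,再按 CRAWLER_TYPE 分发到 search/detail/creator
        await self.search()

    async def search(self) -> None:
        # 真实实现中:调用平台 client 搜索,并交由 store 持久化
        print("searching...")
```
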
||||
|
||||
### 4.2 爬虫生命周期
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Main as main.py
|
||||
participant Factory as CrawlerFactory
|
||||
participant Crawler as XiaoHongShuCrawler
|
||||
participant Browser as Playwright/CDP
|
||||
participant Login as XiaoHongShuLogin
|
||||
participant Client as XiaoHongShuClient
|
||||
participant Store as StoreFactory
|
||||
|
||||
Main->>Factory: create_crawler("xhs")
|
||||
Factory-->>Main: crawler实例
|
||||
|
||||
Main->>Crawler: start()
|
||||
|
||||
alt 启用IP代理
|
||||
Crawler->>Crawler: create_ip_pool()
|
||||
end
|
||||
|
||||
alt CDP模式
|
||||
Crawler->>Browser: launch_browser_with_cdp()
|
||||
else 标准模式
|
||||
Crawler->>Browser: launch_browser()
|
||||
end
|
||||
Browser-->>Crawler: browser_context
|
||||
|
||||
Crawler->>Crawler: create_xhs_client()
|
||||
Crawler->>Client: pong() 检查登录状态
|
||||
|
||||
alt 未登录
|
||||
Crawler->>Login: begin()
|
||||
Login->>Login: login_by_qrcode/mobile/cookie
|
||||
Login-->>Crawler: 登录成功
|
||||
end
|
||||
|
||||
alt search模式
|
||||
Crawler->>Client: get_note_by_keyword()
|
||||
Client-->>Crawler: 搜索结果
|
||||
loop 获取详情
|
||||
Crawler->>Client: get_note_by_id()
|
||||
Client-->>Crawler: 笔记详情
|
||||
end
|
||||
else detail模式
|
||||
Crawler->>Client: get_note_by_id()
|
||||
else creator模式
|
||||
Crawler->>Client: get_creator_info()
|
||||
end
|
||||
|
||||
Crawler->>Store: store_content/comment/creator
|
||||
Store-->>Crawler: 存储完成
|
||||
|
||||
Main->>Crawler: cleanup()
|
||||
Crawler->>Browser: close()
|
||||
```
|
||||
|
||||
### 4.3 平台爬虫实现结构
|
||||
|
||||
每个平台目录包含以下核心文件:
|
||||
|
||||
```
|
||||
media_platform/{platform}/
|
||||
├── __init__.py # 模块导出
|
||||
├── core.py # 爬虫主实现类
|
||||
├── client.py # API客户端
|
||||
├── login.py # 登录实现
|
||||
├── field.py # 字段/枚举定义
|
||||
├── exception.py # 异常定义
|
||||
├── help.py # 辅助函数
|
||||
└── {特殊实现}.py # 平台特定逻辑
|
||||
```
|
||||
|
||||
### 4.4 三种爬虫模式
|
||||
|
||||
| 模式 | 配置值 | 功能描述 | 适用场景 |
|
||||
|------|--------|---------|---------|
|
||||
| 搜索模式 | `search` | 根据关键词搜索内容 | 批量获取特定主题内容 |
|
||||
| 详情模式 | `detail` | 获取指定ID的详情 | 精确获取已知内容 |
|
||||
| 创作者模式 | `creator` | 获取创作者所有内容 | 追踪特定博主/UP主 |
|
||||
|
||||
---
|
||||
|
||||
## 5. 数据存储层
|
||||
|
||||
### 5.1 存储架构图
|
||||
|
||||
```mermaid
|
||||
classDiagram
|
||||
class AbstractStore {
|
||||
<<abstract>>
|
||||
+store_content()*
|
||||
+store_comment()*
|
||||
+store_creator()*
|
||||
}
|
||||
|
||||
class StoreFactory {
|
||||
+STORES: Dict
|
||||
+create_store() AbstractStore
|
||||
}
|
||||
|
||||
class CsvStoreImplement {
|
||||
+async_file_writer: AsyncFileWriter
|
||||
+store_content()
|
||||
+store_comment()
|
||||
}
|
||||
|
||||
class JsonStoreImplement {
|
||||
+async_file_writer: AsyncFileWriter
|
||||
+store_content()
|
||||
+store_comment()
|
||||
}
|
||||
|
||||
class DbStoreImplement {
|
||||
+session: AsyncSession
|
||||
+store_content()
|
||||
+store_comment()
|
||||
}
|
||||
|
||||
class SqliteStoreImplement {
|
||||
+session: AsyncSession
|
||||
+store_content()
|
||||
+store_comment()
|
||||
}
|
||||
|
||||
class MongoStoreImplement {
|
||||
+mongo_base: MongoDBStoreBase
|
||||
+store_content()
|
||||
+store_comment()
|
||||
}
|
||||
|
||||
class ExcelStoreImplement {
|
||||
+excel_base: ExcelStoreBase
|
||||
+store_content()
|
||||
+store_comment()
|
||||
}
|
||||
|
||||
AbstractStore <|-- CsvStoreImplement
|
||||
AbstractStore <|-- JsonStoreImplement
|
||||
AbstractStore <|-- DbStoreImplement
|
||||
AbstractStore <|-- SqliteStoreImplement
|
||||
AbstractStore <|-- MongoStoreImplement
|
||||
AbstractStore <|-- ExcelStoreImplement
|
||||
StoreFactory --> AbstractStore
|
||||
```
|
||||
|
||||
### 5.2 存储工厂模式
|
||||
|
||||
```python
|
||||
# 以抖音为例
|
||||
class DouyinStoreFactory:
|
||||
STORES = {
|
||||
"csv": DouyinCsvStoreImplement,
|
||||
"db": DouyinDbStoreImplement,
|
||||
"json": DouyinJsonStoreImplement,
|
||||
"sqlite": DouyinSqliteStoreImplement,
|
||||
"mongodb": DouyinMongoStoreImplement,
|
||||
"excel": DouyinExcelStoreImplement,
|
||||
}
|
||||
|
||||
@staticmethod
|
||||
def create_store() -> AbstractStore:
|
||||
store_class = DouyinStoreFactory.STORES.get(config.SAVE_DATA_OPTION)
|
||||
return store_class()
|
||||
```
|
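
工厂的调用方式大致如下(示意代码,接上文的 `DouyinStoreFactory`;字段仅为示例,`SAVE_DATA_OPTION` 读取自 `config/base_config.py`):

```python
# 示意:根据配置创建对应的存储实现,上层只依赖 AbstractStore 接口
async def save_demo_item() -> None:
    store = DouyinStoreFactory.create_store()
    await store.store_content({"aweme_id": "123", "title": "demo"})
```

这样爬虫代码不感知具体存储后端,切换存储方式只需修改 `SAVE_DATA_OPTION`。
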
||||
|
||||
### 5.3 存储方式对比
|
||||
|
||||
| 存储方式 | 配置值 | 优点 | 适用场景 |
|
||||
|---------|--------|-----|---------|
|
||||
| CSV | `csv` | 简单、通用 | 小规模数据、快速查看 |
|
||||
| JSON | `json` | 结构完整、易解析 | API对接、数据交换 |
|
||||
| SQLite | `sqlite` | 轻量、无需服务 | 本地开发、小型项目 |
|
||||
| MySQL | `db` | 性能好、支持并发 | 生产环境、大规模数据 |
|
||||
| MongoDB | `mongodb` | 灵活、易扩展 | 非结构化数据、快速迭代 |
|
||||
| Excel | `excel` | 可视化、易分享 | 报告、数据分析 |
|
||||
|
||||
---
|
||||
|
||||
## 6. 基础设施层
|
||||
|
||||
### 6.1 代理系统架构
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
subgraph Config["配置"]
|
||||
enable["ENABLE_IP_PROXY"]
|
||||
provider["IP_PROXY_PROVIDER"]
|
||||
count["IP_PROXY_POOL_COUNT"]
|
||||
end
|
||||
|
||||
subgraph Pool["代理池管理"]
|
||||
pool["ProxyIpPool"]
|
||||
load["load_proxies()"]
|
||||
validate["_is_valid_proxy()"]
|
||||
get["get_proxy()"]
|
||||
refresh["get_or_refresh_proxy()"]
|
||||
end
|
||||
|
||||
subgraph Providers["代理提供商"]
|
||||
kuaidl["快代理<br/>KuaiDaiLiProxy"]
|
||||
wandou["万代理<br/>WanDouHttpProxy"]
|
||||
jishu["技术IP<br/>JiShuHttpProxy"]
|
||||
end
|
||||
|
||||
subgraph Client["API客户端"]
|
||||
mixin["ProxyRefreshMixin"]
|
||||
request["request()"]
|
||||
end
|
||||
|
||||
enable --> pool
|
||||
provider --> Providers
|
||||
count --> load
|
||||
pool --> load
|
||||
load --> validate
|
||||
validate --> Providers
|
||||
pool --> get
|
||||
pool --> refresh
|
||||
mixin --> refresh
|
||||
mixin --> Client
|
||||
request --> mixin
|
||||
```
|
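
代理池的典型接入方式如下(示意代码:`create_ip_pool` 的名称来自 4.2 节的时序图,确切签名与参数以 `proxy/proxy_ip_pool.py` 为准):

```python
# 示意:按配置初始化代理池并取出一个代理
import config
from proxy.proxy_ip_pool import create_ip_pool  # 假设由该模块导出


async def fetch_one_proxy():
    if not config.ENABLE_IP_PROXY:
        return None
    pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)  # 参数为示意
    return await pool.get_proxy()
```
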
||||
|
||||
### 6.2 登录流程
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
Start([开始登录]) --> CheckType{登录类型?}
|
||||
|
||||
CheckType -->|qrcode| QR[显示二维码]
|
||||
QR --> WaitScan[等待扫描]
|
||||
WaitScan --> CheckQR{扫描成功?}
|
||||
CheckQR -->|是| SaveCookie[保存Cookie]
|
||||
CheckQR -->|否| WaitScan
|
||||
|
||||
CheckType -->|phone| Phone[输入手机号]
|
||||
Phone --> SendCode[发送验证码]
|
||||
SendCode --> Slider{需要滑块?}
|
||||
Slider -->|是| DoSlider[滑动验证]
|
||||
DoSlider --> InputCode[输入验证码]
|
||||
Slider -->|否| InputCode
|
||||
InputCode --> Verify[验证登录]
|
||||
Verify --> SaveCookie
|
||||
|
||||
CheckType -->|cookie| LoadCookie[加载已保存Cookie]
|
||||
LoadCookie --> VerifyCookie{Cookie有效?}
|
||||
VerifyCookie -->|是| SaveCookie
|
||||
VerifyCookie -->|否| Fail[登录失败]
|
||||
|
||||
SaveCookie --> UpdateContext[更新浏览器上下文]
|
||||
UpdateContext --> End([登录完成])
|
||||
```
|
||||
|
||||
### 6.3 浏览器管理
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
subgraph Mode["启动模式"]
|
||||
standard["标准模式<br/>Playwright"]
|
||||
cdp["CDP模式<br/>Chrome DevTools"]
|
||||
end
|
||||
|
||||
subgraph Standard["标准模式流程"]
|
||||
launch["chromium.launch()"]
|
||||
context["new_context()"]
|
||||
stealth["注入stealth.js"]
|
||||
end
|
||||
|
||||
subgraph CDP["CDP模式流程"]
|
||||
detect["检测浏览器路径"]
|
||||
start["启动浏览器进程"]
|
||||
connect["connect_over_cdp()"]
|
||||
cdpContext["获取已有上下文"]
|
||||
end
|
||||
|
||||
subgraph Features["特性"]
|
||||
f1["用户数据持久化"]
|
||||
f2["扩展和设置继承"]
|
||||
f3["反检测能力增强"]
|
||||
end
|
||||
|
||||
standard --> Standard
|
||||
cdp --> CDP
|
||||
CDP --> Features
|
||||
```
|
||||
|
||||
### 6.4 缓存系统
|
||||
|
||||
```mermaid
|
||||
classDiagram
|
||||
class AbstractCache {
|
||||
<<abstract>>
|
||||
+get(key)* 获取缓存
|
||||
+set(key, value, expire)* 设置缓存
|
||||
+keys(pattern)* 获取所有键
|
||||
}
|
||||
|
||||
class ExpiringLocalCache {
|
||||
-_cache: Dict
|
||||
-_expire_times: Dict
|
||||
+get(key)
|
||||
+set(key, value, expire_time)
|
||||
+keys(pattern)
|
||||
-_is_expired(key)
|
||||
}
|
||||
|
||||
class RedisCache {
|
||||
-_client: Redis
|
||||
+get(key)
|
||||
+set(key, value, expire_time)
|
||||
+keys(pattern)
|
||||
}
|
||||
|
||||
class CacheFactory {
|
||||
+create_cache(type) AbstractCache
|
||||
}
|
||||
|
||||
AbstractCache <|-- ExpiringLocalCache
|
||||
AbstractCache <|-- RedisCache
|
||||
CacheFactory --> AbstractCache
|
||||
```
|
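
缓存的典型用法如下(示意代码:`CacheFactory` 的导入路径与 `create_cache` 的取值均为假设,以 `cache/` 下实际实现为准):

```python
# 示意:通过缓存工厂获取缓存实例并读写,键名与过期时间仅为示例
from cache.cache_factory import CacheFactory  # 假设的导入路径

cache = CacheFactory.create_cache("memory")       # 或 "redis",取值以实现为准
cache.set("cookie:xhs", "web_session=...", 3600)  # 过期时间 3600 秒
print(cache.get("cookie:xhs"))
```
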
||||
|
||||
---
|
||||
|
||||
## 7. 数据模型
|
||||
|
||||
### 7.1 ORM模型关系
|
||||
|
||||
```mermaid
|
||||
erDiagram
|
||||
DouyinAweme {
|
||||
int id PK
|
||||
string aweme_id UK
|
||||
string aweme_type
|
||||
string title
|
||||
string desc
|
||||
int create_time
|
||||
int liked_count
|
||||
int collected_count
|
||||
int comment_count
|
||||
int share_count
|
||||
string user_id FK
|
||||
datetime add_ts
|
||||
datetime last_modify_ts
|
||||
}
|
||||
|
||||
DouyinAwemeComment {
|
||||
int id PK
|
||||
string comment_id UK
|
||||
string aweme_id FK
|
||||
string content
|
||||
int create_time
|
||||
int sub_comment_count
|
||||
string user_id
|
||||
datetime add_ts
|
||||
datetime last_modify_ts
|
||||
}
|
||||
|
||||
DyCreator {
|
||||
int id PK
|
||||
string user_id UK
|
||||
string nickname
|
||||
string avatar
|
||||
string desc
|
||||
int follower_count
|
||||
int total_favorited
|
||||
datetime add_ts
|
||||
datetime last_modify_ts
|
||||
}
|
||||
|
||||
DouyinAweme ||--o{ DouyinAwemeComment : "has"
|
||||
DyCreator ||--o{ DouyinAweme : "creates"
|
||||
```
|
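
上图中的实体最终落在 `database/models.py` 的 SQLAlchemy 声明式模型上,形如下面的片段(字段有删减,表名与类型为示意,准确定义以 models.py 为准):

```python
# 示意:与 ER 图对应的声明式模型片段
from sqlalchemy import Column, Integer, String, BigInteger
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class DouyinAweme(Base):
    __tablename__ = "douyin_aweme"  # 表名为示意

    id = Column(Integer, primary_key=True, autoincrement=True)
    aweme_id = Column(String(64), index=True)  # 视频 ID
    title = Column(String(500))
    liked_count = Column(Integer, default=0)
    add_ts = Column(BigInteger)          # 记录添加时间戳
    last_modify_ts = Column(BigInteger)  # 记录修改时间戳
```
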
||||
|
||||
### 7.2 各平台数据表
|
||||
|
||||
| 平台 | 内容表 | 评论表 | 创作者表 |
|
||||
|------|--------|--------|---------|
|
||||
| 抖音 | DouyinAweme | DouyinAwemeComment | DyCreator |
|
||||
| 小红书 | XHSNote | XHSNoteComment | XHSCreator |
|
||||
| 快手 | KuaishouVideo | KuaishouVideoComment | KsCreator |
|
||||
| B站 | BilibiliVideo | BilibiliVideoComment | BilibiliUpInfo |
|
||||
| 微博 | WeiboNote | WeiboNoteComment | WeiboCreator |
|
||||
| 贴吧 | TiebaNote | TiebaNoteComment | - |
|
||||
| 知乎 | ZhihuContent | ZhihuContentComment | ZhihuCreator |
|
||||
|
||||
---
|
||||
|
||||
## 8. 配置系统
|
||||
|
||||
### 8.1 核心配置项
|
||||
|
||||
```python
|
||||
# config/base_config.py
|
||||
|
||||
# 平台选择
|
||||
PLATFORM = "xhs" # xhs, dy, ks, bili, wb, tieba, zhihu
|
||||
|
||||
# 登录配置
|
||||
LOGIN_TYPE = "qrcode" # qrcode, phone, cookie
|
||||
SAVE_LOGIN_STATE = True
|
||||
|
||||
# 爬虫配置
|
||||
CRAWLER_TYPE = "search" # search, detail, creator
|
||||
KEYWORDS = "编程副业,编程兼职"
|
||||
CRAWLER_MAX_NOTES_COUNT = 15
|
||||
MAX_CONCURRENCY_NUM = 1
|
||||
|
||||
# 评论配置
|
||||
ENABLE_GET_COMMENTS = True
|
||||
ENABLE_GET_SUB_COMMENTS = False
|
||||
CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES = 10
|
||||
|
||||
# 浏览器配置
|
||||
HEADLESS = False
|
||||
ENABLE_CDP_MODE = True
|
||||
CDP_DEBUG_PORT = 9222
|
||||
|
||||
# 代理配置
|
||||
ENABLE_IP_PROXY = False
|
||||
IP_PROXY_PROVIDER = "kuaidaili"
|
||||
IP_PROXY_POOL_COUNT = 2
|
||||
|
||||
# 存储配置
|
||||
SAVE_DATA_OPTION = "json" # csv, db, json, sqlite, mongodb, excel
|
||||
```
|
||||
|
||||
### 8.2 数据库配置
|
||||
|
||||
```python
|
||||
# config/db_config.py
|
||||
|
||||
# MySQL
|
||||
MYSQL_DB_HOST = "localhost"
|
||||
MYSQL_DB_PORT = 3306
|
||||
MYSQL_DB_NAME = "media_crawler"
|
||||
|
||||
# Redis
|
||||
REDIS_DB_HOST = "127.0.0.1"
|
||||
REDIS_DB_PORT = 6379
|
||||
|
||||
# MongoDB
|
||||
MONGODB_HOST = "localhost"
|
||||
MONGODB_PORT = 27017
|
||||
|
||||
# SQLite
|
||||
SQLITE_DB_PATH = "database/sqlite_tables.db"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 9. 工具模块
|
||||
|
||||
### 9.1 工具函数概览
|
||||
|
||||
| 模块 | 文件 | 主要功能 |
|
||||
|------|------|---------|
|
||||
| 应用运行器 | `app_runner.py` | 信号处理、优雅退出、清理管理 |
|
||||
| 浏览器启动 | `browser_launcher.py` | 检测浏览器路径、启动浏览器进程 |
|
||||
| CDP管理 | `cdp_browser.py` | CDP连接、浏览器上下文管理 |
|
||||
| 爬虫工具 | `crawler_util.py` | 二维码识别、验证码处理、User-Agent |
|
||||
| 文件写入 | `async_file_writer.py` | 异步CSV/JSON写入、词云生成 |
|
||||
| 滑块验证 | `slider_util.py` | 滑动验证码破解 |
|
||||
| 时间工具 | `time_util.py` | 时间戳转换、日期处理 |
|
||||
|
||||
### 9.2 应用运行管理
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
Start([程序启动]) --> Run["run(app_main, app_cleanup)"]
|
||||
Run --> Main["执行 app_main()"]
|
||||
Main --> Running{运行中}
|
||||
|
||||
Running -->|正常完成| Cleanup1["执行 app_cleanup()"]
|
||||
Running -->|SIGINT/SIGTERM| Signal["捕获信号"]
|
||||
|
||||
Signal --> First{第一次信号?}
|
||||
First -->|是| Cleanup2["启动清理流程"]
|
||||
First -->|否| Force["强制退出"]
|
||||
|
||||
Cleanup1 & Cleanup2 --> Cancel["取消其他任务"]
|
||||
Cancel --> Wait["等待任务完成<br/>(超时15秒)"]
|
||||
Wait --> End([程序退出])
|
||||
Force --> End
|
||||
```
|
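
对应到代码上,`main.py` 末尾即通过 `tools.app_runner.run` 托管启动与清理,大致形态如下(示意代码,函数名与参数形式以 `tools/app_runner.py` 和真实 `main.py` 为准):

```python
# 示意:run() 负责注册信号处理,先执行主协程,退出前执行清理协程
from tools.app_runner import run


async def app_main() -> None:
    ...  # 启动爬虫


async def app_cleanup() -> None:
    ...  # 关闭浏览器、数据库连接等


if __name__ == "__main__":
    run(app_main, app_cleanup)
```
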
||||
|
||||
---
|
||||
|
||||
## 10. 模块依赖关系
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
subgraph Entry["入口层"]
|
||||
main["main.py"]
|
||||
config["config/"]
|
||||
cmdarg["cmd_arg/"]
|
||||
end
|
||||
|
||||
subgraph Core["核心层"]
|
||||
base["base/base_crawler.py"]
|
||||
platforms["media_platform/*/"]
|
||||
end
|
||||
|
||||
subgraph Client["客户端层"]
|
||||
client["*/client.py"]
|
||||
login["*/login.py"]
|
||||
end
|
||||
|
||||
subgraph Storage["存储层"]
|
||||
store["store/"]
|
||||
database["database/"]
|
||||
end
|
||||
|
||||
subgraph Infra["基础设施"]
|
||||
proxy["proxy/"]
|
||||
cache["cache/"]
|
||||
tools["tools/"]
|
||||
end
|
||||
|
||||
subgraph External["外部依赖"]
|
||||
playwright["Playwright"]
|
||||
httpx["httpx"]
|
||||
sqlalchemy["SQLAlchemy"]
|
||||
motor["Motor/MongoDB"]
|
||||
end
|
||||
|
||||
main --> config
|
||||
main --> cmdarg
|
||||
main --> Core
|
||||
|
||||
Core --> base
|
||||
platforms --> base
|
||||
platforms --> Client
|
||||
|
||||
client --> proxy
|
||||
client --> httpx
|
||||
login --> tools
|
||||
|
||||
platforms --> Storage
|
||||
Storage --> sqlalchemy
|
||||
Storage --> motor
|
||||
|
||||
client --> playwright
|
||||
tools --> playwright
|
||||
|
||||
proxy --> cache
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 11. 扩展指南
|
||||
|
||||
### 11.1 添加新平台
|
||||
|
||||
1. 在 `media_platform/` 下创建新目录
|
||||
2. 实现以下核心文件:
|
||||
- `core.py` - 继承 `AbstractCrawler`
|
||||
- `client.py` - 继承 `AbstractApiClient` 和 `ProxyRefreshMixin`
|
||||
- `login.py` - 继承 `AbstractLogin`
|
||||
- `field.py` - 定义平台枚举
|
||||
3. 在 `store/` 下创建对应存储目录
|
||||
4. 在 `main.py` 的 `CrawlerFactory.CRAWLERS` 中注册
|
||||
|
||||
### 11.2 添加新存储方式
|
||||
|
||||
1. 在 `store/` 下创建新的存储实现类
|
||||
2. 继承 `AbstractStore` 基类
|
||||
3. 实现 `store_content`、`store_comment`、`store_creator` 方法
|
||||
4. 在各平台的 `StoreFactory.STORES` 中注册(示意代码见下方)
|
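
一个最小化的示意实现如下(类名、文件路径与参数名均为示例假设;若基类还要求 `store_image`/`store_video`,可按需补充空实现):

```python
# 示意:把数据逐行追加写入文本文件的自定义存储实现,仅演示接口形态
import json

from base.base_crawler import AbstractStore


class MyPlatformTxtStoreImplement(AbstractStore):
    def __init__(self, file_path: str = "data/demo_store.txt"):
        self.file_path = file_path

    async def store_content(self, content_item: dict) -> None:
        with open(self.file_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(content_item, ensure_ascii=False) + "\n")

    async def store_comment(self, comment_item: dict) -> None:
        await self.store_content(comment_item)

    async def store_creator(self, creator: dict) -> None:
        await self.store_content(creator)
```

注册方式:在对应平台的 `StoreFactory.STORES` 中增加一项,例如 `"txt": MyPlatformTxtStoreImplement`。
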
||||
|
||||
### 11.3 添加新代理提供商
|
||||
|
||||
1. 在 `proxy/providers/` 下创建新的代理类
|
||||
2. 继承 `BaseProxy` 基类
|
||||
3. 实现 `get_proxy()` 方法
|
||||
4. 在配置中注册(示意代码见下方)
|
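
示意实现如下(`BaseProxy` 的真实接口、导入路径以及 `get_proxy()` 的签名与返回值结构均以 `proxy/` 下代码为准,此处仅为假设):

```python
# 示意:一个从固定列表轮询返回代理的 Provider,仅演示“继承基类并实现 get_proxy()”的形态
from proxy.base_proxy import BaseProxy  # 假设的导入路径


class StaticListProxy(BaseProxy):
    def __init__(self, proxies: list[str]):
        self._proxies = proxies
        self._index = 0

    async def get_proxy(self, num: int = 1) -> list[str]:
        # 轮询取出 num 个代理地址;真实 Provider 通常在这里调用厂商 API 并做校验
        result = [self._proxies[(self._index + i) % len(self._proxies)] for i in range(num)]
        self._index = (self._index + num) % len(self._proxies)
        return result
```
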
||||
|
||||
---
|
||||
|
||||
## 12. 快速参考
|
||||
|
||||
### 12.1 常用命令
|
||||
|
||||
```bash
|
||||
# 启动爬虫
|
||||
python main.py
|
||||
|
||||
# 指定平台
|
||||
python main.py --platform xhs
|
||||
|
||||
# 指定登录方式
|
||||
python main.py --lt qrcode
|
||||
|
||||
# 指定爬虫类型
|
||||
python main.py --type search
|
||||
```
|
||||
|
||||
### 12.2 关键文件路径
|
||||
|
||||
| 用途 | 文件路径 |
|
||||
|------|---------|
|
||||
| 程序入口 | `main.py` |
|
||||
| 核心配置 | `config/base_config.py` |
|
||||
| 数据库配置 | `config/db_config.py` |
|
||||
| 爬虫基类 | `base/base_crawler.py` |
|
||||
| ORM模型 | `database/models.py` |
|
||||
| 代理池 | `proxy/proxy_ip_pool.py` |
|
||||
| CDP浏览器 | `tools/cdp_browser.py` |
|
||||
|
||||
---
|
||||
|
||||
*文档生成时间: 2025-12-18*
|
||||
123
main.py
@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/main.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#

# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -8,10 +17,20 @@
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。

import sys
import io

# Force UTF-8 encoding for stdout/stderr to prevent encoding errors
# when outputting Chinese characters in non-UTF-8 terminals
if sys.stdout and hasattr(sys.stdout, 'buffer'):
if sys.stdout.encoding and sys.stdout.encoding.lower() != 'utf-8':
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8', errors='replace')
if sys.stderr and hasattr(sys.stderr, 'buffer'):
if sys.stderr.encoding and sys.stderr.encoding.lower() != 'utf-8':
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8', errors='replace')

import asyncio
import sys
from typing import Optional
from typing import Optional, Type

import cmd_arg
import config
@@ -24,10 +43,12 @@ from media_platform.tieba import TieBaCrawler
from media_platform.weibo import WeiboCrawler
from media_platform.xhs import XiaoHongShuCrawler
from media_platform.zhihu import ZhihuCrawler
from tools.async_file_writer import AsyncFileWriter
from var import crawler_type_var


class CrawlerFactory:
CRAWLERS = {
CRAWLERS: dict[str, Type[AbstractCrawler]] = {
"xhs": XiaoHongShuCrawler,
"dy": DouYinCrawler,
"ks": KuaishouCrawler,
@@ -41,48 +62,96 @@ class CrawlerFactory:
def create_crawler(platform: str) -> AbstractCrawler:
crawler_class = CrawlerFactory.CRAWLERS.get(platform)
if not crawler_class:
raise ValueError(
"Invalid Media Platform Currently only supported xhs or dy or ks or bili ..."
)
supported = ", ".join(sorted(CrawlerFactory.CRAWLERS))
raise ValueError(f"Invalid media platform: {platform!r}. Supported: {supported}")
return crawler_class()


crawler: Optional[AbstractCrawler] = None


# persist-1<persist1@126.com>
# 原因:增加 --init_db 功能,用于数据库初始化。
# 副作用:无
# 回滚策略:还原此文件。
async def main():
# Init crawler
def _flush_excel_if_needed() -> None:
if config.SAVE_DATA_OPTION != "excel":
return

try:
from store.excel_store_base import ExcelStoreBase

ExcelStoreBase.flush_all()
print("[Main] Excel files saved successfully")
except Exception as e:
print(f"[Main] Error flushing Excel data: {e}")


async def _generate_wordcloud_if_needed() -> None:
if config.SAVE_DATA_OPTION != "json" or not config.ENABLE_GET_WORDCLOUD:
return

try:
file_writer = AsyncFileWriter(
platform=config.PLATFORM,
crawler_type=crawler_type_var.get(),
)
await file_writer.generate_wordcloud_from_comments()
except Exception as e:
print(f"[Main] Error generating wordcloud: {e}")


async def main() -> None:
global crawler

# parse cmd
args = await cmd_arg.parse_cmd()

# init db
if args.init_db:
await db.init_db(args.init_db)
print(f"Database {args.init_db} initialized successfully.")
return # Exit the main function cleanly


return

crawler = CrawlerFactory.create_crawler(platform=config.PLATFORM)
await crawler.start()

_flush_excel_if_needed()

def cleanup():
# Generate wordcloud after crawling is complete
# Only for JSON save mode
await _generate_wordcloud_if_needed()


async def async_cleanup() -> None:
global crawler
if crawler:
# asyncio.run(crawler.close())
pass
if config.SAVE_DATA_OPTION in ["db", "sqlite"]:
asyncio.run(db.close())
if getattr(crawler, "cdp_manager", None):
try:
await crawler.cdp_manager.cleanup(force=True)
except Exception as e:
error_msg = str(e).lower()
if "closed" not in error_msg and "disconnected" not in error_msg:
print(f"[Main] Error cleaning up CDP browser: {e}")

elif getattr(crawler, "browser_context", None):
try:
await crawler.browser_context.close()
except Exception as e:
error_msg = str(e).lower()
if "closed" not in error_msg and "disconnected" not in error_msg:
print(f"[Main] Error closing browser context: {e}")

if config.SAVE_DATA_OPTION in ("db", "sqlite"):
await db.close()

if __name__ == "__main__":
try:
asyncio.get_event_loop().run_until_complete(main())
finally:
cleanup()
from tools.app_runner import run

def _force_stop() -> None:
c = crawler
if not c:
return
cdp_manager = getattr(c, "cdp_manager", None)
launcher = getattr(cdp_manager, "launcher", None)
if not launcher:
return
try:
launcher.cleanup()
except Exception:
pass

run(main, async_cleanup, cleanup_timeout_seconds=15.0, on_first_interrupt=_force_stop)

@@ -1,11 +1,18 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/__init__.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
@@ -1,12 +1,21 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/bilibili/__init__.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
# -*- coding: utf-8 -*-
|
||||
@@ -14,4 +23,4 @@
|
||||
# @Time : 2023/12/2 18:36
|
||||
# @Desc :
|
||||
|
||||
from .core import *
|
||||
from .core import *
|
||||
|
||||
@@ -1,3 +1,12 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/bilibili/client.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
@@ -11,11 +20,11 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# @Author : relakkes@gmail.com
|
||||
# @Time : 2023/12/2 18:44
|
||||
# @Desc : bilibili 请求客户端
|
||||
# @Desc : bilibili request client
|
||||
import asyncio
|
||||
import json
|
||||
import random
|
||||
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
|
||||
from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Tuple, Union
|
||||
from urllib.parse import urlencode
|
||||
|
||||
import httpx
|
||||
@@ -23,23 +32,28 @@ from playwright.async_api import BrowserContext, Page
|
||||
|
||||
import config
|
||||
from base.base_crawler import AbstractApiClient
|
||||
from proxy.proxy_mixin import ProxyRefreshMixin
|
||||
from tools import utils
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from proxy.proxy_ip_pool import ProxyIpPool
|
||||
|
||||
from .exception import DataFetchError
|
||||
from .field import CommentOrderType, SearchOrderType
|
||||
from .help import BilibiliSign
|
||||
|
||||
|
||||
class BilibiliClient(AbstractApiClient):
|
||||
class BilibiliClient(AbstractApiClient, ProxyRefreshMixin):
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
timeout=60, # 若开启爬取媒体选项,b 站的长视频需要更久的超时时间
|
||||
timeout=60, # For media crawling, Bilibili long videos need a longer timeout
|
||||
proxy=None,
|
||||
*,
|
||||
headers: Dict[str, str],
|
||||
playwright_page: Page,
|
||||
cookie_dict: Dict[str, str],
|
||||
proxy_ip_pool: Optional["ProxyIpPool"] = None,
|
||||
):
|
||||
self.proxy = proxy
|
||||
self.timeout = timeout
|
||||
@@ -47,8 +61,13 @@ class BilibiliClient(AbstractApiClient):
|
||||
self._host = "https://api.bilibili.com"
|
||||
self.playwright_page = playwright_page
|
||||
self.cookie_dict = cookie_dict
|
||||
# Initialize proxy pool (from ProxyRefreshMixin)
|
||||
self.init_proxy_pool(proxy_ip_pool)
|
||||
|
||||
async def request(self, method, url, **kwargs) -> Any:
|
||||
# Check if proxy has expired before each request
|
||||
await self._refresh_proxy_if_expired()
|
||||
|
||||
async with httpx.AsyncClient(proxy=self.proxy) as client:
|
||||
response = await client.request(method, url, timeout=self.timeout, **kwargs)
|
||||
try:
|
||||
@@ -63,8 +82,8 @@ class BilibiliClient(AbstractApiClient):
|
||||
|
||||
async def pre_request_data(self, req_data: Dict) -> Dict:
|
||||
"""
|
||||
发送请求进行请求参数签名
|
||||
需要从 localStorage 拿 wbi_img_urls 这参数,值如下:
|
||||
Send request to sign request parameters
|
||||
Need to get wbi_img_urls parameter from localStorage, value as follows:
|
||||
https://i0.hdslb.com/bfs/wbi/7cd084941338484aae1ad9425b84077c.png-https://i0.hdslb.com/bfs/wbi/4932caff0ff746eab6f01bf08b70ac45.png
|
||||
:param req_data:
|
||||
:return:
|
||||
@@ -76,7 +95,7 @@ class BilibiliClient(AbstractApiClient):
|
||||
|
||||
async def get_wbi_keys(self) -> Tuple[str, str]:
|
||||
"""
|
||||
获取最新的 img_key 和 sub_key
|
||||
Get the latest img_key and sub_key
|
||||
:return:
|
||||
"""
|
||||
local_storage = await self.playwright_page.evaluate("() => window.localStorage")
|
||||
@@ -141,12 +160,12 @@ class BilibiliClient(AbstractApiClient):
|
||||
) -> Dict:
|
||||
"""
|
||||
KuaiShou web search api
|
||||
:param keyword: 搜索关键词
|
||||
:param page: 分页参数具体第几页
|
||||
:param page_size: 每一页参数的数量
|
||||
:param order: 搜索结果排序,默认位综合排序
|
||||
:param pubtime_begin_s: 发布时间开始时间戳
|
||||
:param pubtime_end_s: 发布时间结束时间戳
|
||||
:param keyword: Search keyword
|
||||
:param page: Page number for pagination
|
||||
:param page_size: Number of items per page
|
||||
:param order: Sort order for search results, default is comprehensive sorting
|
||||
:param pubtime_begin_s: Publish time start timestamp
|
||||
:param pubtime_end_s: Publish time end timestamp
|
||||
:return:
|
||||
"""
|
||||
uri = "/x/web-interface/wbi/search/type"
|
||||
@@ -163,13 +182,13 @@ class BilibiliClient(AbstractApiClient):
|
||||
|
||||
async def get_video_info(self, aid: Union[int, None] = None, bvid: Union[str, None] = None) -> Dict:
|
||||
"""
|
||||
Bilibli web video detail api, aid 和 bvid任选一个参数
|
||||
:param aid: 稿件avid
|
||||
:param bvid: 稿件bvid
|
||||
Bilibli web video detail api, choose one parameter between aid and bvid
|
||||
:param aid: Video aid
|
||||
:param bvid: Video bvid
|
||||
:return:
|
||||
"""
|
||||
if not aid and not bvid:
|
||||
raise ValueError("请提供 aid 或 bvid 中的至少一个参数")
|
||||
raise ValueError("Please provide at least one parameter: aid or bvid")
|
||||
|
||||
uri = "/x/web-interface/view/detail"
|
||||
params = dict()
|
||||
@@ -182,12 +201,12 @@ class BilibiliClient(AbstractApiClient):
|
||||
async def get_video_play_url(self, aid: int, cid: int) -> Dict:
|
||||
"""
|
||||
Bilibli web video play url api
|
||||
:param aid: 稿件avid
|
||||
:param aid: Video aid
|
||||
:param cid: cid
|
||||
:return:
|
||||
"""
|
||||
if not aid or not cid or aid <= 0 or cid <= 0:
|
||||
raise ValueError("aid 和 cid 必须存在")
|
||||
raise ValueError("aid and cid must exist")
|
||||
uri = "/x/player/wbi/playurl"
|
||||
qn_value = getattr(config, "BILI_QN", 80)
|
||||
params = {
|
||||
@@ -214,7 +233,7 @@ class BilibiliClient(AbstractApiClient):
|
||||
)
|
||||
return None
|
||||
except httpx.HTTPError as exc: # some wrong when call httpx.request method, such as connection error, client error, server error or response status code is not 2xx
|
||||
utils.logger.error(f"[BilibiliClient.get_video_media] {exc.__class__.__name__} for {exc.request.url} - {exc}") # 保留原始异常类型名称,以便开发者调试
|
||||
utils.logger.error(f"[BilibiliClient.get_video_media] {exc.__class__.__name__} for {exc.request.url} - {exc}") # Keep original exception type name for developer debugging
|
||||
return None
|
||||
|
||||
async def get_video_comments(
|
||||
@@ -224,9 +243,9 @@ class BilibiliClient(AbstractApiClient):
|
||||
next: int = 0,
|
||||
) -> Dict:
|
||||
"""get video comments
|
||||
:param video_id: 视频 ID
|
||||
:param order_mode: 排序方式
|
||||
:param next: 评论页选择
|
||||
:param video_id: Video ID
|
||||
:param order_mode: Sort order
|
||||
:param next: Comment page selection
|
||||
:return:
|
||||
"""
|
||||
uri = "/x/v2/reply/wbi/main"
|
||||
@@ -247,7 +266,7 @@ class BilibiliClient(AbstractApiClient):
|
||||
:param crawl_interval:
|
||||
:param is_fetch_sub_comments:
|
||||
:param callback:
|
||||
max_count: 一次笔记爬取的最大评论数量
|
||||
max_count: Maximum number of comments to crawl per note
|
||||
|
||||
:return:
|
||||
"""
|
||||
@@ -280,7 +299,7 @@ class BilibiliClient(AbstractApiClient):
|
||||
|
||||
comment_list: List[Dict] = comments_res.get("replies", [])
|
||||
|
||||
# 检查 is_end 和 next 是否存在
|
||||
# Check if is_end and next exist
|
||||
if "is_end" not in cursor_info or "next" not in cursor_info:
|
||||
utils.logger.warning(f"[BilibiliClient.get_video_all_comments] 'is_end' or 'next' not in cursor for video_id: {video_id}. Assuming end of comments.")
|
||||
is_end = True
|
||||
@@ -298,7 +317,7 @@ class BilibiliClient(AbstractApiClient):
|
||||
{await self.get_video_all_level_two_comments(video_id, comment_id, CommentOrderType.DEFAULT, 10, crawl_interval, callback)}
|
||||
if len(result) + len(comment_list) > max_count:
|
||||
comment_list = comment_list[:max_count - len(result)]
|
||||
if callback: # 如果有回调函数,就执行回调函数
|
||||
if callback: # If there is a callback function, execute it
|
||||
await callback(video_id, comment_list)
|
||||
await asyncio.sleep(crawl_interval)
|
||||
if not is_fetch_sub_comments:
|
||||
@@ -317,10 +336,10 @@ class BilibiliClient(AbstractApiClient):
|
||||
) -> Dict:
|
||||
"""
|
||||
get video all level two comments for a level one comment
|
||||
:param video_id: 视频 ID
|
||||
:param level_one_comment_id: 一级评论 ID
|
||||
:param video_id: Video ID
|
||||
:param level_one_comment_id: Level one comment ID
|
||||
:param order_mode:
|
||||
:param ps: 一页评论数
|
||||
:param ps: Number of comments per page
|
||||
:param crawl_interval:
|
||||
:param callback:
|
||||
:return:
|
||||
@@ -330,7 +349,7 @@ class BilibiliClient(AbstractApiClient):
|
||||
while True:
|
||||
result = await self.get_video_level_two_comments(video_id, level_one_comment_id, pn, ps, order_mode)
|
||||
comment_list: List[Dict] = result.get("replies", [])
|
||||
if callback: # 如果有回调函数,就执行回调函数
|
||||
if callback: # If there is a callback function, execute it
|
||||
await callback(video_id, comment_list)
|
||||
await asyncio.sleep(crawl_interval)
|
||||
if (int(result["page"]["count"]) <= pn * ps):
|
||||
@@ -347,9 +366,9 @@ class BilibiliClient(AbstractApiClient):
|
||||
order_mode: CommentOrderType,
|
||||
) -> Dict:
|
||||
"""get video level two comments
|
||||
:param video_id: 视频 ID
|
||||
:param level_one_comment_id: 一级评论 ID
|
||||
:param order_mode: 排序方式
|
||||
:param video_id: Video ID
|
||||
:param level_one_comment_id: Level one comment ID
|
||||
:param order_mode: Sort order
|
||||
|
||||
:return:
|
||||
"""
|
||||
@@ -367,10 +386,10 @@ class BilibiliClient(AbstractApiClient):
|
||||
|
||||
async def get_creator_videos(self, creator_id: str, pn: int, ps: int = 30, order_mode: SearchOrderType = SearchOrderType.LAST_PUBLISH) -> Dict:
|
||||
"""get all videos for a creator
|
||||
:param creator_id: 创作者 ID
|
||||
:param pn: 页数
|
||||
:param ps: 一页视频数
|
||||
:param order_mode: 排序方式
|
||||
:param creator_id: Creator ID
|
||||
:param pn: Page number
|
||||
:param ps: Number of videos per page
|
||||
:param order_mode: Sort order
|
||||
|
||||
:return:
|
||||
"""
|
||||
@@ -386,7 +405,7 @@ class BilibiliClient(AbstractApiClient):
|
||||
async def get_creator_info(self, creator_id: int) -> Dict:
|
||||
"""
|
||||
get creator info
|
||||
:param creator_id: 作者 ID
|
||||
:param creator_id: Creator ID
|
||||
"""
|
||||
uri = "/x/space/wbi/acc/info"
|
||||
post_data = {
|
||||
@@ -402,9 +421,9 @@ class BilibiliClient(AbstractApiClient):
|
||||
) -> Dict:
|
||||
"""
|
||||
get creator fans
|
||||
:param creator_id: 创作者 ID
|
||||
:param pn: 开始页数
|
||||
:param ps: 每页数量
|
||||
:param creator_id: Creator ID
|
||||
:param pn: Start page number
|
||||
:param ps: Number of items per page
|
||||
:return:
|
||||
"""
|
||||
uri = "/x/relation/fans"
|
||||
@@ -424,9 +443,9 @@ class BilibiliClient(AbstractApiClient):
|
||||
) -> Dict:
|
||||
"""
|
||||
get creator followings
|
||||
:param creator_id: 创作者 ID
|
||||
:param pn: 开始页数
|
||||
:param ps: 每页数量
|
||||
:param creator_id: Creator ID
|
||||
:param pn: Start page number
|
||||
:param ps: Number of items per page
|
||||
:return:
|
||||
"""
|
||||
uri = "/x/relation/followings"
|
||||
@@ -441,8 +460,8 @@ class BilibiliClient(AbstractApiClient):
|
||||
async def get_creator_dynamics(self, creator_id: int, offset: str = ""):
|
||||
"""
|
||||
get creator comments
|
||||
:param creator_id: 创作者 ID
|
||||
:param offset: 发送请求所需参数
|
||||
:param creator_id: Creator ID
|
||||
:param offset: Parameter required for sending request
|
||||
:return:
|
||||
"""
|
||||
uri = "/x/polymer/web-dynamic/v1/feed/space"
|
||||
@@ -466,9 +485,9 @@ class BilibiliClient(AbstractApiClient):
|
||||
:param creator_info:
|
||||
:param crawl_interval:
|
||||
:param callback:
|
||||
:param max_count: 一个up主爬取的最大粉丝数量
|
||||
:param max_count: Maximum number of fans to crawl for a creator
|
||||
|
||||
:return: up主粉丝数列表
|
||||
:return: List of creator fans
|
||||
"""
|
||||
creator_id = creator_info["id"]
|
||||
result = []
|
||||
@@ -480,7 +499,7 @@ class BilibiliClient(AbstractApiClient):
|
||||
pn += 1
|
||||
if len(result) + len(fans_list) > max_count:
|
||||
fans_list = fans_list[:max_count - len(result)]
|
||||
if callback: # 如果有回调函数,就执行回调函数
|
||||
if callback: # If there is a callback function, execute it
|
||||
await callback(creator_info, fans_list)
|
||||
await asyncio.sleep(crawl_interval)
|
||||
if not fans_list:
|
||||
@@ -500,9 +519,9 @@ class BilibiliClient(AbstractApiClient):
|
||||
:param creator_info:
|
||||
:param crawl_interval:
|
||||
:param callback:
|
||||
:param max_count: 一个up主爬取的最大关注者数量
|
||||
:param max_count: Maximum number of followings to crawl for a creator
|
||||
|
||||
:return: up主关注者列表
|
||||
:return: List of creator followings
|
||||
"""
|
||||
creator_id = creator_info["id"]
|
||||
result = []
|
||||
@@ -514,7 +533,7 @@ class BilibiliClient(AbstractApiClient):
|
||||
pn += 1
|
||||
if len(result) + len(followings_list) > max_count:
|
||||
followings_list = followings_list[:max_count - len(result)]
|
||||
if callback: # 如果有回调函数,就执行回调函数
|
||||
if callback: # If there is a callback function, execute it
|
||||
await callback(creator_info, followings_list)
|
||||
await asyncio.sleep(crawl_interval)
|
||||
if not followings_list:
|
||||
@@ -534,9 +553,9 @@ class BilibiliClient(AbstractApiClient):
|
||||
:param creator_info:
|
||||
:param crawl_interval:
|
||||
:param callback:
|
||||
:param max_count: 一个up主爬取的最大动态数量
|
||||
:param max_count: Maximum number of dynamics to crawl for a creator
|
||||
|
||||
:return: up主关注者列表
|
||||
:return: List of creator dynamics
|
||||
"""
|
||||
creator_id = creator_info["id"]
|
||||
result = []
|
||||
|
||||
@@ -1,3 +1,12 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/bilibili/core.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
@@ -11,7 +20,7 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# @Author : relakkes@gmail.com
|
||||
# @Time : 2023/12/2 18:44
|
||||
# @Desc : B站爬虫
|
||||
# @Desc : Bilibili Crawler
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
@@ -55,18 +64,19 @@ class BilibiliCrawler(AbstractCrawler):
|
||||
self.index_url = "https://www.bilibili.com"
|
||||
self.user_agent = utils.get_user_agent()
|
||||
self.cdp_manager = None
|
||||
self.ip_proxy_pool = None # Proxy IP pool for automatic proxy refresh
|
||||
|
||||
async def start(self):
|
||||
playwright_proxy_format, httpx_proxy_format = None, None
|
||||
if config.ENABLE_IP_PROXY:
|
||||
ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
|
||||
ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy()
|
||||
self.ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
|
||||
ip_proxy_info: IpInfoModel = await self.ip_proxy_pool.get_proxy()
|
||||
playwright_proxy_format, httpx_proxy_format = utils.format_proxy_info(ip_proxy_info)
|
||||
|
||||
async with async_playwright() as playwright:
|
||||
# 根据配置选择启动模式
|
||||
# Choose launch mode based on configuration
|
||||
if config.ENABLE_CDP_MODE:
|
||||
utils.logger.info("[BilibiliCrawler] 使用CDP模式启动浏览器")
|
||||
utils.logger.info("[BilibiliCrawler] Launching browser using CDP mode")
|
||||
self.browser_context = await self.launch_browser_with_cdp(
|
||||
playwright,
|
||||
playwright_proxy_format,
|
||||
@@ -74,12 +84,13 @@ class BilibiliCrawler(AbstractCrawler):
|
||||
headless=config.CDP_HEADLESS,
|
||||
)
|
||||
else:
|
||||
utils.logger.info("[BilibiliCrawler] 使用标准模式启动浏览器")
|
||||
utils.logger.info("[BilibiliCrawler] Launching browser using standard mode")
|
||||
# Launch a browser context.
|
||||
chromium = playwright.chromium
|
||||
self.browser_context = await self.launch_browser(chromium, None, self.user_agent, headless=config.HEADLESS)
|
||||
# stealth.min.js is a js script to prevent the website from detecting the crawler.
|
||||
await self.browser_context.add_init_script(path="libs/stealth.min.js")
|
||||
# stealth.min.js is a js script to prevent the website from detecting the crawler.
|
||||
await self.browser_context.add_init_script(path="libs/stealth.min.js")
|
||||
|
||||
self.context_page = await self.browser_context.new_page()
|
||||
await self.context_page.goto(self.index_url)
|
||||
|
||||
@@ -138,31 +149,31 @@ class BilibiliCrawler(AbstractCrawler):
|
||||
end: str = config.END_DAY,
|
||||
) -> Tuple[str, str]:
|
||||
"""
|
||||
获取 bilibili 作品发布日期起始时间戳 pubtime_begin_s 与发布日期结束时间戳 pubtime_end_s
|
||||
Get bilibili publish start timestamp pubtime_begin_s and publish end timestamp pubtime_end_s
|
||||
---
|
||||
:param start: 发布日期起始时间,YYYY-MM-DD
|
||||
:param end: 发布日期结束时间,YYYY-MM-DD
|
||||
:param start: Publish date start time, YYYY-MM-DD
|
||||
:param end: Publish date end time, YYYY-MM-DD
|
||||
|
||||
Note
|
||||
---
|
||||
- 搜索的时间范围为 start 至 end,包含 start 和 end
|
||||
- 若要搜索同一天的内容,为了包含 start 当天的搜索内容,则 pubtime_end_s 的值应该为 pubtime_begin_s 的值加上一天再减去一秒,即 start 当天的最后一秒
|
||||
- 如仅搜索 2024-01-05 的内容,pubtime_begin_s = 1704384000,pubtime_end_s = 1704470399
|
||||
转换为可读的 datetime 对象:pubtime_begin_s = datetime.datetime(2024, 1, 5, 0, 0),pubtime_end_s = datetime.datetime(2024, 1, 5, 23, 59, 59)
|
||||
- 若要搜索 start 至 end 的内容,为了包含 end 当天的搜索内容,则 pubtime_end_s 的值应该为 pubtime_end_s 的值加上一天再减去一秒,即 end 当天的最后一秒
|
||||
- 如搜索 2024-01-05 - 2024-01-06 的内容,pubtime_begin_s = 1704384000,pubtime_end_s = 1704556799
|
||||
转换为可读的 datetime 对象:pubtime_begin_s = datetime.datetime(2024, 1, 5, 0, 0),pubtime_end_s = datetime.datetime(2024, 1, 6, 23, 59, 59)
|
||||
- Search time range is from start to end, including both start and end
|
||||
- To search content from the same day, to include search content from that day, pubtime_end_s should be pubtime_begin_s plus one day minus one second, i.e., the last second of start day
|
||||
- For example, searching only 2024-01-05 content, pubtime_begin_s = 1704384000, pubtime_end_s = 1704470399
|
||||
Converted to readable datetime objects: pubtime_begin_s = datetime.datetime(2024, 1, 5, 0, 0), pubtime_end_s = datetime.datetime(2024, 1, 5, 23, 59, 59)
|
||||
- To search content from start to end, to include search content from end day, pubtime_end_s should be pubtime_end_s plus one day minus one second, i.e., the last second of end day
|
||||
- For example, searching 2024-01-05 - 2024-01-06 content, pubtime_begin_s = 1704384000, pubtime_end_s = 1704556799
|
||||
Converted to readable datetime objects: pubtime_begin_s = datetime.datetime(2024, 1, 5, 0, 0), pubtime_end_s = datetime.datetime(2024, 1, 6, 23, 59, 59)
|
||||
"""
|
||||
# 转换 start 与 end 为 datetime 对象
|
||||
# Convert start and end to datetime objects
|
||||
start_day: datetime = datetime.strptime(start, "%Y-%m-%d")
|
||||
end_day: datetime = datetime.strptime(end, "%Y-%m-%d")
|
||||
if start_day > end_day:
|
||||
raise ValueError("Wrong time range, please check your start and end argument, to ensure that the start cannot exceed end")
|
||||
elif start_day == end_day: # 搜索同一天的内容
|
||||
end_day = (start_day + timedelta(days=1) - timedelta(seconds=1)) # 则将 end_day 设置为 start_day + 1 day - 1 second
|
||||
else: # 搜索 start 至 end
|
||||
end_day = (end_day + timedelta(days=1) - timedelta(seconds=1)) # 则将 end_day 设置为 end_day + 1 day - 1 second
|
||||
# 将其重新转换为时间戳
|
||||
elif start_day == end_day: # Searching content from the same day
|
||||
end_day = (start_day + timedelta(days=1) - timedelta(seconds=1)) # Set end_day to start_day + 1 day - 1 second
|
||||
else: # Searching from start to end
|
||||
end_day = (end_day + timedelta(days=1) - timedelta(seconds=1)) # Set end_day to end_day + 1 day - 1 second
|
||||
# Convert back to timestamps
|
||||
return str(int(start_day.timestamp())), str(int(end_day.timestamp()))
|
||||
|
||||
async def search_by_keywords(self):
|
||||
@@ -192,8 +203,8 @@ class BilibiliCrawler(AbstractCrawler):
|
||||
page=page,
|
||||
page_size=bili_limit_count,
|
||||
order=SearchOrderType.DEFAULT,
|
||||
pubtime_begin_s=0, # 作品发布日期起始时间戳
|
||||
pubtime_end_s=0, # 作品发布日期结束日期时间戳
|
||||
pubtime_begin_s=0, # Publish date start timestamp
|
||||
pubtime_end_s=0, # Publish date end timestamp
|
||||
)
|
||||
video_list: List[Dict] = videos_res.get("result")
|
||||
|
||||
@@ -215,11 +226,11 @@ class BilibiliCrawler(AbstractCrawler):
|
||||
await bilibili_store.update_up_info(video_item)
|
||||
await self.get_bilibili_video(video_item, semaphore)
|
||||
page += 1
|
||||
|
||||
|
||||
# Sleep after page navigation
|
||||
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
|
||||
utils.logger.info(f"[BilibiliCrawler.search_by_keywords] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after page {page-1}")
|
||||
|
||||
|
||||
await self.batch_get_video_comments(video_id_list)
|
||||
|
||||
async def search_by_keywords_in_time_range(self, daily_limit: bool):
|
||||
@@ -296,11 +307,11 @@ class BilibiliCrawler(AbstractCrawler):
|
||||
await self.get_bilibili_video(video_item, semaphore)
|
||||
|
||||
page += 1
|
||||
|
||||
|
||||
# Sleep after page navigation
|
||||
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
|
||||
utils.logger.info(f"[BilibiliCrawler.search_by_keywords_in_time_range] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after page {page-1}")
|
||||
|
||||
|
||||
await self.batch_get_video_comments(video_id_list)
|
||||
|
||||
except Exception as e:
|
||||
@@ -412,11 +423,11 @@ class BilibiliCrawler(AbstractCrawler):
|
||||
async with semaphore:
|
||||
try:
|
||||
result = await self.bili_client.get_video_info(aid=aid, bvid=bvid)
|
||||
|
||||
|
||||
# Sleep after fetching video details
|
||||
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
|
||||
utils.logger.info(f"[BilibiliCrawler.get_video_info_task] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after fetching video details {bvid or aid}")
|
||||
|
||||
|
||||
return result
|
||||
except DataFetchError as ex:
|
||||
utils.logger.error(f"[BilibiliCrawler.get_video_info_task] Get video detail error: {ex}")
|
||||
@@ -463,6 +474,7 @@ class BilibiliCrawler(AbstractCrawler):
|
||||
},
|
||||
playwright_page=self.context_page,
|
||||
cookie_dict=cookie_dict,
|
||||
proxy_ip_pool=self.ip_proxy_pool, # 传递代理池用于自动刷新
|
||||
)
|
||||
return bilibili_client_obj
|
||||
|
||||
@@ -496,11 +508,12 @@ class BilibiliCrawler(AbstractCrawler):
|
||||
"height": 1080
|
||||
},
|
||||
user_agent=user_agent,
|
||||
channel="chrome", # Use system's stable Chrome version
|
||||
)
|
||||
return browser_context
|
||||
else:
|
||||
# type: ignore
|
||||
browser = await chromium.launch(headless=headless, proxy=playwright_proxy)
|
||||
browser = await chromium.launch(headless=headless, proxy=playwright_proxy, channel="chrome")
|
||||
browser_context = await browser.new_context(viewport={"width": 1920, "height": 1080}, user_agent=user_agent)
|
||||
return browser_context
|
||||
|
||||
@@ -512,7 +525,7 @@ class BilibiliCrawler(AbstractCrawler):
|
||||
headless: bool = True,
|
||||
) -> BrowserContext:
|
||||
"""
|
||||
使用CDP模式启动浏览器
|
||||
Launch browser using CDP mode
|
||||
"""
|
||||
try:
|
||||
self.cdp_manager = CDPBrowserManager()
|
||||
@@ -523,22 +536,22 @@ class BilibiliCrawler(AbstractCrawler):
|
||||
headless=headless,
|
||||
)
|
||||
|
||||
# 显示浏览器信息
|
||||
# Display browser information
|
||||
browser_info = await self.cdp_manager.get_browser_info()
|
||||
utils.logger.info(f"[BilibiliCrawler] CDP浏览器信息: {browser_info}")
|
||||
utils.logger.info(f"[BilibiliCrawler] CDP browser info: {browser_info}")
|
||||
|
||||
return browser_context
|
||||
|
||||
except Exception as e:
|
||||
utils.logger.error(f"[BilibiliCrawler] CDP模式启动失败,回退到标准模式: {e}")
|
||||
# 回退到标准模式
|
||||
utils.logger.error(f"[BilibiliCrawler] CDP mode launch failed, fallback to standard mode: {e}")
|
||||
# Fallback to standard mode
|
||||
chromium = playwright.chromium
|
||||
return await self.launch_browser(chromium, playwright_proxy, user_agent, headless)
|
||||
|
||||
async def close(self):
|
||||
"""Close browser context"""
|
||||
try:
|
||||
# 如果使用CDP模式,需要特殊处理
|
||||
# If using CDP mode, special handling is required
|
||||
if self.cdp_manager:
|
||||
await self.cdp_manager.cleanup()
|
||||
self.cdp_manager = None
|
||||
|
||||
@@ -1,12 +1,21 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/bilibili/exception.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
@@ -1,12 +1,21 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/bilibili/field.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
# -*- coding: utf-8 -*-
|
||||
@@ -18,28 +27,28 @@ from enum import Enum
|
||||
|
||||
|
||||
class SearchOrderType(Enum):
|
||||
# 综合排序
|
||||
# Comprehensive sorting
|
||||
DEFAULT = ""
|
||||
|
||||
# 最多点击
|
||||
# Most clicks
|
||||
MOST_CLICK = "click"
|
||||
|
||||
# 最新发布
|
||||
# Latest published
|
||||
LAST_PUBLISH = "pubdate"
|
||||
|
||||
# 最多弹幕
|
||||
# Most danmu (comments)
|
||||
MOST_DANMU = "dm"
|
||||
|
||||
# 最多收藏
|
||||
# Most bookmarks
|
||||
MOST_MARK = "stow"
|
||||
|
||||
|
||||
class CommentOrderType(Enum):
|
||||
# 仅按热度
|
||||
# By popularity only
|
||||
DEFAULT = 0
|
||||
|
||||
# 按热度+按时间
|
||||
# By popularity + time
|
||||
MIXED = 1
|
||||
|
||||
# 按时间
|
||||
# By time
|
||||
TIME = 2
|
||||
|
||||
@@ -1,19 +1,28 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/bilibili/help.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
# -*- coding: utf-8 -*-
|
||||
# @Author : relakkes@gmail.com
|
||||
# @Time : 2023/12/2 23:26
|
||||
# @Desc : bilibili 请求参数签名
|
||||
# 逆向实现参考:https://socialsisteryi.github.io/bilibili-API-collect/docs/misc/sign/wbi.html#wbi%E7%AD%BE%E5%90%8D%E7%AE%97%E6%B3%95
|
||||
# @Desc : bilibili request parameter signing
|
||||
# Reverse engineering implementation reference: https://socialsisteryi.github.io/bilibili-API-collect/docs/misc/sign/wbi.html#wbi%E7%AD%BE%E5%90%8D%E7%AE%97%E6%B3%95
|
||||
import re
|
||||
import urllib.parse
|
||||
from hashlib import md5
|
||||
@@ -36,7 +45,7 @@ class BilibiliSign:
|
||||
|
||||
def get_salt(self) -> str:
|
||||
"""
|
||||
获取加盐的 key
|
||||
Get the salted key
|
||||
:return:
|
||||
"""
|
||||
salt = ""
|
||||
@@ -47,8 +56,8 @@ class BilibiliSign:
|
||||
|
||||
def sign(self, req_data: Dict) -> Dict:
|
||||
"""
|
||||
请求参数中加上当前时间戳对请求参数中的key进行字典序排序
|
||||
再将请求参数进行 url 编码集合 salt 进行 md5 就可以生成w_rid参数了
|
||||
Add current timestamp to request parameters, sort keys in dictionary order,
|
||||
then URL encode the parameters and combine with salt to generate md5 for w_rid parameter
|
||||
:param req_data:
|
||||
:return:
|
||||
"""
|
||||
@@ -56,35 +65,35 @@ class BilibiliSign:
|
||||
req_data.update({"wts": current_ts})
|
||||
req_data = dict(sorted(req_data.items()))
|
||||
req_data = {
|
||||
# 过滤 value 中的 "!'()*" 字符
|
||||
# Filter "!'()*" characters from values
|
||||
k: ''.join(filter(lambda ch: ch not in "!'()*", str(v)))
|
||||
for k, v
|
||||
in req_data.items()
|
||||
}
|
||||
query = urllib.parse.urlencode(req_data)
|
||||
salt = self.get_salt()
|
||||
wbi_sign = md5((query + salt).encode()).hexdigest() # 计算 w_rid
|
||||
wbi_sign = md5((query + salt).encode()).hexdigest() # Calculate w_rid
|
||||
req_data['w_rid'] = wbi_sign
|
||||
return req_data
|
||||
|
||||
|
||||
def parse_video_info_from_url(url: str) -> VideoUrlInfo:
|
||||
"""
|
||||
从B站视频URL中解析出视频ID
|
||||
Parse video ID from Bilibili video URL
|
||||
Args:
|
||||
url: B站视频链接
|
||||
url: Bilibili video link
|
||||
- https://www.bilibili.com/video/BV1dwuKzmE26/?spm_id_from=333.1387.homepage.video_card.click
|
||||
- https://www.bilibili.com/video/BV1d54y1g7db
|
||||
- BV1d54y1g7db (直接传入BV号)
|
||||
- BV1d54y1g7db (directly pass BV number)
|
||||
Returns:
|
||||
VideoUrlInfo: 包含视频ID的对象
|
||||
VideoUrlInfo: Object containing video ID
|
||||
"""
|
||||
# 如果传入的已经是BV号,直接返回
|
||||
# If the input is already a BV number, return directly
|
||||
if url.startswith("BV"):
|
||||
return VideoUrlInfo(video_id=url)
|
||||
|
||||
# 使用正则表达式提取BV号
|
||||
# 匹配 /video/BV... 或 /video/av... 格式
|
||||
# Use regex to extract BV number
|
||||
# Match /video/BV... or /video/av... format
|
||||
bv_pattern = r'/video/(BV[a-zA-Z0-9]+)'
|
||||
match = re.search(bv_pattern, url)
|
||||
|
||||
@@ -92,26 +101,26 @@ def parse_video_info_from_url(url: str) -> VideoUrlInfo:
|
||||
video_id = match.group(1)
|
||||
return VideoUrlInfo(video_id=video_id)
|
||||
|
||||
raise ValueError(f"无法从URL中解析出视频ID: {url}")
|
||||
raise ValueError(f"Unable to parse video ID from URL: {url}")
|
||||
|
||||
|
||||
def parse_creator_info_from_url(url: str) -> CreatorUrlInfo:
|
||||
"""
|
||||
从B站创作者空间URL中解析出创作者ID
|
||||
Parse creator ID from Bilibili creator space URL
|
||||
Args:
|
||||
url: B站创作者空间链接
|
||||
url: Bilibili creator space link
|
||||
- https://space.bilibili.com/434377496?spm_id_from=333.1007.0.0
|
||||
- https://space.bilibili.com/20813884
|
||||
- 434377496 (直接传入UID)
|
||||
- 434377496 (directly pass UID)
|
||||
Returns:
|
||||
CreatorUrlInfo: 包含创作者ID的对象
|
||||
CreatorUrlInfo: Object containing creator ID
|
||||
"""
|
||||
# 如果传入的已经是纯数字ID,直接返回
|
||||
# If the input is already a numeric ID, return directly
|
||||
if url.isdigit():
|
||||
return CreatorUrlInfo(creator_id=url)
|
||||
|
||||
# 使用正则表达式提取UID
|
||||
# 匹配 /space.bilibili.com/数字 格式
|
||||
# Use regex to extract UID
|
||||
# Match /space.bilibili.com/number format
|
||||
uid_pattern = r'space\.bilibili\.com/(\d+)'
|
||||
match = re.search(uid_pattern, url)
|
||||
|
||||
@@ -119,20 +128,20 @@ def parse_creator_info_from_url(url: str) -> CreatorUrlInfo:
|
||||
creator_id = match.group(1)
|
||||
return CreatorUrlInfo(creator_id=creator_id)
|
||||
|
||||
raise ValueError(f"无法从URL中解析出创作者ID: {url}")
|
||||
raise ValueError(f"Unable to parse creator ID from URL: {url}")
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
# 测试视频URL解析
|
||||
# Test video URL parsing
|
||||
video_url1 = "https://www.bilibili.com/video/BV1dwuKzmE26/?spm_id_from=333.1387.homepage.video_card.click"
|
||||
video_url2 = "BV1d54y1g7db"
|
||||
print("视频URL解析测试:")
|
||||
print("Video URL parsing test:")
|
||||
print(f"URL1: {video_url1} -> {parse_video_info_from_url(video_url1)}")
|
||||
print(f"URL2: {video_url2} -> {parse_video_info_from_url(video_url2)}")
|
||||
|
||||
# 测试创作者URL解析
|
||||
# Test creator URL parsing
|
||||
creator_url1 = "https://space.bilibili.com/434377496?spm_id_from=333.1007.0.0"
|
||||
creator_url2 = "20813884"
|
||||
print("\n创作者URL解析测试:")
|
||||
print("\nCreator URL parsing test:")
|
||||
print(f"URL1: {creator_url1} -> {parse_creator_info_from_url(creator_url1)}")
|
||||
print(f"URL2: {creator_url2} -> {parse_creator_info_from_url(creator_url2)}")
|
||||
|
||||
@@ -1,18 +1,27 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/bilibili/login.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
# -*- coding: utf-8 -*-
|
||||
# @Author : relakkes@gmail.com
|
||||
# @Time : 2023/12/2 18:44
|
||||
# @Desc : bilibli登录实现类
|
||||
# @Desc : bilibili login implementation class
|
||||
|
||||
import asyncio
|
||||
import functools
|
||||
|
||||
@@ -1,12 +1,21 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/douyin/__init__.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
from .core import DouYinCrawler
|
||||
|
||||
@@ -1,3 +1,12 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/douyin/client.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
@@ -12,21 +21,25 @@ import asyncio
|
||||
import copy
|
||||
import json
|
||||
import urllib.parse
|
||||
from typing import Any, Callable, Dict, Union, Optional
|
||||
from typing import TYPE_CHECKING, Any, Callable, Dict, Union, Optional
|
||||
|
||||
import httpx
|
||||
from playwright.async_api import BrowserContext
|
||||
|
||||
from base.base_crawler import AbstractApiClient
|
||||
from proxy.proxy_mixin import ProxyRefreshMixin
|
||||
from tools import utils
|
||||
from var import request_keyword_var
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from proxy.proxy_ip_pool import ProxyIpPool
|
||||
|
||||
from .exception import *
|
||||
from .field import *
|
||||
from .help import *
|
||||
|
||||
|
||||
class DouYinClient(AbstractApiClient):
|
||||
class DouYinClient(AbstractApiClient, ProxyRefreshMixin):
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
@@ -36,6 +49,7 @@ class DouYinClient(AbstractApiClient):
|
||||
headers: Dict,
|
||||
playwright_page: Optional[Page],
|
||||
cookie_dict: Dict,
|
||||
proxy_ip_pool: Optional["ProxyIpPool"] = None,
|
||||
):
|
||||
self.proxy = proxy
|
||||
self.timeout = timeout
|
||||
@@ -43,6 +57,8 @@ class DouYinClient(AbstractApiClient):
|
||||
self._host = "https://www.douyin.com"
|
||||
self.playwright_page = playwright_page
|
||||
self.cookie_dict = cookie_dict
|
||||
# 初始化代理池(来自 ProxyRefreshMixin)
|
||||
self.init_proxy_pool(proxy_ip_pool)
|
||||
|
||||
async def __process_req_params(
|
||||
self,
|
||||
@@ -91,10 +107,15 @@ class DouYinClient(AbstractApiClient):
|
||||
post_data = {}
|
||||
if request_method == "POST":
|
||||
post_data = params
|
||||
a_bogus = await get_a_bogus(uri, query_string, post_data, headers["User-Agent"], self.playwright_page)
|
||||
params["a_bogus"] = a_bogus
|
||||
|
||||
if "/v1/web/general/search" not in uri:
|
||||
a_bogus = await get_a_bogus(uri, query_string, post_data, headers["User-Agent"], self.playwright_page)
|
||||
params["a_bogus"] = a_bogus
|
||||
|
||||
async def request(self, method, url, **kwargs):
|
||||
# 每次请求前检测代理是否过期
|
||||
await self._refresh_proxy_if_expired()
|
||||
|
||||
async with httpx.AsyncClient(proxy=self.proxy) as client:
|
||||
response = await client.request(method, url, timeout=self.timeout, **kwargs)
|
||||
try:
|
||||
|
||||
@@ -1,3 +1,12 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/douyin/core.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
@@ -46,12 +55,13 @@ class DouYinCrawler(AbstractCrawler):
|
||||
def __init__(self) -> None:
|
||||
self.index_url = "https://www.douyin.com"
|
||||
self.cdp_manager = None
|
||||
self.ip_proxy_pool = None # 代理IP池,用于代理自动刷新
|
||||
|
||||
async def start(self) -> None:
|
||||
playwright_proxy_format, httpx_proxy_format = None, None
|
||||
if config.ENABLE_IP_PROXY:
|
||||
ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
|
||||
ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy()
|
||||
self.ip_proxy_pool = await create_ip_pool(config.IP_PROXY_POOL_COUNT, enable_validate_ip=True)
|
||||
ip_proxy_info: IpInfoModel = await self.ip_proxy_pool.get_proxy()
|
||||
playwright_proxy_format, httpx_proxy_format = utils.format_proxy_info(ip_proxy_info)
|
||||
|
||||
async with async_playwright() as playwright:
|
||||
@@ -74,8 +84,9 @@ class DouYinCrawler(AbstractCrawler):
|
||||
user_agent=None,
|
||||
headless=config.HEADLESS,
|
||||
)
|
||||
# stealth.min.js is a js script to prevent the website from detecting the crawler.
|
||||
await self.browser_context.add_init_script(path="libs/stealth.min.js")
|
||||
# stealth.min.js is a js script to prevent the website from detecting the crawler.
|
||||
await self.browser_context.add_init_script(path="libs/stealth.min.js")
|
||||
|
||||
self.context_page = await self.browser_context.new_page()
|
||||
await self.context_page.goto(self.index_url)
|
||||
|
||||
@@ -140,19 +151,24 @@ class DouYinCrawler(AbstractCrawler):
|
||||
utils.logger.error(f"[DouYinCrawler.search] search douyin keyword: {keyword} failed,账号也许被风控了。")
|
||||
break
|
||||
dy_search_id = posts_res.get("extra", {}).get("logid", "")
|
||||
page_aweme_list = []
|
||||
for post_item in posts_res.get("data"):
|
||||
try:
|
||||
aweme_info: Dict = (post_item.get("aweme_info") or post_item.get("aweme_mix_info", {}).get("mix_items")[0])
|
||||
except TypeError:
|
||||
continue
|
||||
aweme_list.append(aweme_info.get("aweme_id", ""))
|
||||
page_aweme_list.append(aweme_info.get("aweme_id", ""))
|
||||
await douyin_store.update_douyin_aweme(aweme_item=aweme_info)
|
||||
await self.get_aweme_media(aweme_item=aweme_info)
|
||||
|
||||
# Batch get note comments for the current page
|
||||
await self.batch_get_note_comments(page_aweme_list)
|
||||
|
||||
# Sleep after each page navigation
|
||||
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
|
||||
utils.logger.info(f"[DouYinCrawler.search] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after page {page-1}")
|
||||
utils.logger.info(f"[DouYinCrawler.search] keyword:{keyword}, aweme_list:{aweme_list}")
|
||||
await self.batch_get_note_comments(aweme_list)
|
||||
|
||||
async def get_specified_awemes(self):
|
||||
"""Get the information and comments of the specified post from URLs or IDs"""
|
||||
@@ -295,6 +311,7 @@ class DouYinCrawler(AbstractCrawler):
|
||||
},
|
||||
playwright_page=self.context_page,
|
||||
cookie_dict=cookie_dict,
|
||||
proxy_ip_pool=self.ip_proxy_pool, # 传递代理池用于自动刷新
|
||||
)
|
||||
return douyin_client
|
||||
|
||||
@@ -392,7 +409,7 @@ class DouYinCrawler(AbstractCrawler):
|
||||
async def get_aweme_images(self, aweme_item: Dict):
|
||||
"""
|
||||
get aweme images. please use get_aweme_media
|
||||
|
||||
|
||||
Args:
|
||||
aweme_item (Dict): 抖音作品详情
|
||||
"""
|
||||
|
||||
@@ -1,12 +1,21 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/douyin/exception.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
from httpx import RequestError
|
||||
|
||||
@@ -1,12 +1,21 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/douyin/field.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
from enum import Enum
|
||||
@@ -14,21 +23,21 @@ from enum import Enum
|
||||
|
||||
class SearchChannelType(Enum):
|
||||
"""search channel type"""
|
||||
GENERAL = "aweme_general" # 综合
|
||||
VIDEO = "aweme_video_web" # 视频
|
||||
USER = "aweme_user_web" # 用户
|
||||
LIVE = "aweme_live" # 直播
|
||||
GENERAL = "aweme_general" # General
|
||||
VIDEO = "aweme_video_web" # Video
|
||||
USER = "aweme_user_web" # User
|
||||
LIVE = "aweme_live" # Live
|
||||
|
||||
|
||||
class SearchSortType(Enum):
|
||||
"""search sort type"""
|
||||
GENERAL = 0 # 综合排序
|
||||
MOST_LIKE = 1 # 最多点赞
|
||||
LATEST = 2 # 最新发布
|
||||
GENERAL = 0 # Comprehensive sorting
|
||||
MOST_LIKE = 1 # Most likes
|
||||
LATEST = 2 # Latest published
|
||||
|
||||
class PublishTimeType(Enum):
|
||||
"""publish time type"""
|
||||
UNLIMITED = 0 # 不限
|
||||
ONE_DAY = 1 # 一天内
|
||||
ONE_WEEK = 7 # 一周内
|
||||
SIX_MONTH = 180 # 半年内
|
||||
UNLIMITED = 0 # Unlimited
|
||||
ONE_DAY = 1 # Within one day
|
||||
ONE_WEEK = 7 # Within one week
|
||||
SIX_MONTH = 180 # Within six months
|
||||
|
||||
@@ -1,19 +1,28 @@
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (c) 2025 relakkes@gmail.com
|
||||
#
|
||||
# This file is part of MediaCrawler project.
|
||||
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/douyin/help.py
|
||||
# GitHub: https://github.com/NanmiCoder
|
||||
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
|
||||
#
|
||||
|
||||
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
|
||||
# 1. 不得用于任何商业用途。
|
||||
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
|
||||
# 3. 不得进行大规模爬取或对平台造成运营干扰。
|
||||
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
|
||||
# 5. 不得用于任何非法或不当的用途。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
#
|
||||
# 详细许可条款请参阅项目根目录下的LICENSE文件。
|
||||
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
|
||||
|
||||
|
||||
# -*- coding: utf-8 -*-
|
||||
# @Author : relakkes@gmail.com
|
||||
# @Name : 程序员阿江-Relakkes
|
||||
# @Time : 2024/6/10 02:24
|
||||
# @Desc : 获取 a_bogus 参数, 学习交流使用,请勿用作商业用途,侵权联系作者删除
|
||||
# @Desc : Get a_bogus parameter, for learning and communication only, do not use for commercial purposes, contact author to delete if infringement
|
||||
|
||||
import random
|
||||
import re
|
||||
@@ -29,7 +38,7 @@ douyin_sign_obj = execjs.compile(open('libs/douyin.js', encoding='utf-8-sig').re
|
||||
|
||||
def get_web_id():
|
||||
"""
|
||||
生成随机的webid
|
||||
Generate random webid
|
||||
Returns:
|
||||
|
||||
"""
|
||||
@@ -51,13 +60,13 @@ def get_web_id():
|
||||
|
||||
async def get_a_bogus(url: str, params: str, post_data: dict, user_agent: str, page: Page = None):
|
||||
"""
|
||||
获取 a_bogus 参数, 目前不支持post请求类型的签名
|
||||
Get a_bogus parameter, currently does not support POST request type signature
|
||||
"""
|
||||
return get_a_bogus_from_js(url, params, user_agent)
|
||||
|
||||
def get_a_bogus_from_js(url: str, params: str, user_agent: str):
|
||||
"""
|
||||
通过js获取 a_bogus 参数
|
||||
Get a_bogus parameter through js
|
||||
Args:
|
||||
url:
|
||||
params:
|
||||
@@ -73,10 +82,10 @@ def get_a_bogus_from_js(url: str, params: str, user_agent: str):
|
||||
|
||||
|
||||
|
||||
async def get_a_bogus_from_playright(params: str, post_data: dict, user_agent: str, page: Page):
|
||||
async def get_a_bogus_from_playwright(params: str, post_data: dict, user_agent: str, page: Page):
|
||||
"""
|
||||
通过playright获取 a_bogus 参数
|
||||
playwright版本已失效
|
||||
Get a_bogus parameter through playwright
|
||||
playwright version is deprecated
|
||||
Returns:
|
||||
|
||||
"""
|
||||
@@ -91,73 +100,73 @@ async def get_a_bogus_from_playright(params: str, post_data: dict, user_agent: s
|
||||
|
||||
def parse_video_info_from_url(url: str) -> VideoUrlInfo:
|
||||
"""
|
||||
从抖音视频URL中解析出视频ID
|
||||
支持以下格式:
|
||||
1. 普通视频链接: https://www.douyin.com/video/7525082444551310602
|
||||
2. 带modal_id参数的链接:
|
||||
Parse video ID from Douyin video URL
|
||||
Supports the following formats:
|
||||
1. Normal video link: https://www.douyin.com/video/7525082444551310602
|
||||
2. Link with modal_id parameter:
|
||||
- https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?modal_id=7525082444551310602
|
||||
- https://www.douyin.com/root/search/python?modal_id=7471165520058862848
|
||||
3. 短链接: https://v.douyin.com/iF12345ABC/ (需要client解析)
|
||||
4. 纯ID: 7525082444551310602
|
||||
3. Short link: https://v.douyin.com/iF12345ABC/ (requires client parsing)
|
||||
4. Pure ID: 7525082444551310602
|
||||
|
||||
Args:
|
||||
url: 抖音视频链接或ID
|
||||
url: Douyin video link or ID
|
||||
Returns:
|
||||
VideoUrlInfo: 包含视频ID的对象
|
||||
VideoUrlInfo: Object containing video ID
|
||||
"""
|
||||
# 如果是纯数字ID,直接返回
|
||||
# If it's a pure numeric ID, return directly
|
||||
if url.isdigit():
|
||||
return VideoUrlInfo(aweme_id=url, url_type="normal")
|
||||
|
||||
# 检查是否是短链接 (v.douyin.com)
|
||||
# Check if it's a short link (v.douyin.com)
|
||||
if "v.douyin.com" in url or url.startswith("http") and len(url) < 50 and "video" not in url:
|
||||
return VideoUrlInfo(aweme_id="", url_type="short") # 需要通过client解析
|
||||
return VideoUrlInfo(aweme_id="", url_type="short") # Requires client parsing
|
||||
|
||||
# 尝试从URL参数中提取modal_id
|
||||
# Try to extract modal_id from URL parameters
|
||||
params = extract_url_params_to_dict(url)
|
||||
modal_id = params.get("modal_id")
|
||||
if modal_id:
|
||||
return VideoUrlInfo(aweme_id=modal_id, url_type="modal")
|
||||
|
||||
# 从标准视频URL中提取ID: /video/数字
|
||||
# Extract ID from standard video URL: /video/number
|
||||
video_pattern = r'/video/(\d+)'
|
||||
match = re.search(video_pattern, url)
|
||||
if match:
|
||||
aweme_id = match.group(1)
|
||||
return VideoUrlInfo(aweme_id=aweme_id, url_type="normal")
|
||||
|
||||
raise ValueError(f"无法从URL中解析出视频ID: {url}")
|
||||
raise ValueError(f"Unable to parse video ID from URL: {url}")


def parse_creator_info_from_url(url: str) -> CreatorUrlInfo:
    """
    从抖音创作者主页URL中解析出创作者ID (sec_user_id)
    支持以下格式:
    1. 创作者主页: https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main
    2. 纯ID: MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE
    Parse creator ID (sec_user_id) from Douyin creator homepage URL
    Supports the following formats:
    1. Creator homepage: https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main
    2. Pure ID: MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE

    Args:
        url: 抖音创作者主页链接或sec_user_id
        url: Douyin creator homepage link or sec_user_id
    Returns:
        CreatorUrlInfo: 包含创作者ID的对象
        CreatorUrlInfo: Object containing creator ID
    """
    # 如果是纯ID格式(通常以MS4wLjABAAAA开头),直接返回
    # If it's a pure ID format (usually starts with MS4wLjABAAAA), return directly
    if url.startswith("MS4wLjABAAAA") or (not url.startswith("http") and "douyin.com" not in url):
        return CreatorUrlInfo(sec_user_id=url)

    # 从创作者主页URL中提取sec_user_id: /user/xxx
    # Extract sec_user_id from creator homepage URL: /user/xxx
    user_pattern = r'/user/([^/?]+)'
    match = re.search(user_pattern, url)
    if match:
        sec_user_id = match.group(1)
        return CreatorUrlInfo(sec_user_id=sec_user_id)

    raise ValueError(f"无法从URL中解析出创作者ID: {url}")
    raise ValueError(f"Unable to parse creator ID from URL: {url}")


if __name__ == '__main__':
    # 测试视频URL解析
    print("=== 视频URL解析测试 ===")
    # Test video URL parsing
    print("=== Video URL Parsing Test ===")
    test_urls = [
        "https://www.douyin.com/video/7525082444551310602",
        "https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main&modal_id=7525082444551310602",
@@ -168,13 +177,13 @@ if __name__ == '__main__':
        try:
            result = parse_video_info_from_url(url)
            print(f"✓ URL: {url[:80]}...")
            print(f"  结果: {result}\n")
            print(f"  Result: {result}\n")
        except Exception as e:
            print(f"✗ URL: {url}")
            print(f"  错误: {e}\n")
            print(f"  Error: {e}\n")

    # 测试创作者URL解析
    print("=== 创作者URL解析测试 ===")
    # Test creator URL parsing
    print("=== Creator URL Parsing Test ===")
    test_creator_urls = [
        "https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main",
        "MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE",
@@ -183,8 +192,7 @@ if __name__ == '__main__':
        try:
            result = parse_creator_info_from_url(url)
            print(f"✓ URL: {url[:80]}...")
            print(f"  结果: {result}\n")
            print(f"  Result: {result}\n")
        except Exception as e:
            print(f"✗ URL: {url}")
            print(f"  错误: {e}\n")

            print(f"  Error: {e}\n")

@@ -1,12 +1,21 @@
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/douyin/login.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#

# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。


import asyncio
@@ -44,7 +53,7 @@ class DouYinLogin(AbstractLogin):
    async def begin(self):
        """
        Start login douyin website
        滑块中间页面的验证准确率不太OK... 如果没有特俗要求,建议不开抖音登录,或者使用cookies登录
        The verification accuracy of the slider verification is not very good... If there are no special requirements, it is recommended not to use Douyin login, or use cookie login
        """

        # popup login dialog
@@ -60,7 +69,7 @@ class DouYinLogin(AbstractLogin):
        else:
            raise ValueError("[DouYinLogin.begin] Invalid Login Type Currently only supported qrcode or phone or cookie ...")

        # 如果页面重定向到滑动验证码页面,需要再次滑动滑块
        # If the page redirects to the slider verification page, need to slide again
        await asyncio.sleep(6)
        current_page_title = await self.context_page.title()
        if "验证码中间页" in current_page_title:
@@ -138,10 +147,10 @@ class DouYinLogin(AbstractLogin):
        send_sms_code_btn = self.context_page.locator("xpath=//span[text() = '获取验证码']")
        await send_sms_code_btn.click()

        # 检查是否有滑动验证码
        # Check if there is slider verification
        await self.check_page_display_slider(move_step=10, slider_level="easy")
        cache_client = CacheFactory.create_cache(config.CACHE_TYPE_MEMORY)
        max_get_sms_code_time = 60 * 2  # 最长获取验证码的时间为2分钟
        max_get_sms_code_time = 60 * 2  # Maximum time to get verification code is 2 minutes
        while max_get_sms_code_time > 0:
            utils.logger.info(f"[DouYinLogin.login_by_mobile] get douyin sms code from redis remaining time {max_get_sms_code_time}s ...")
            await asyncio.sleep(1)
@@ -155,20 +164,20 @@ class DouYinLogin(AbstractLogin):
                await sms_code_input_ele.fill(value=sms_code_value.decode())
                await asyncio.sleep(0.5)
                submit_btn_ele = self.context_page.locator("xpath=//button[@class='web-login-button']")
                await submit_btn_ele.click()  # 点击登录
                # todo ... 应该还需要检查验证码的正确性有可能输入的验证码不正确
                await submit_btn_ele.click()  # Click login
                # todo ... should also check the correctness of the verification code, it may be incorrect
                break

    async def check_page_display_slider(self, move_step: int = 10, slider_level: str = "easy"):
        """
        检查页面是否出现滑动验证码
        Check if slider verification appears on the page
        :return:
        """
        # 等待滑动验证码的出现
        # Wait for slider verification to appear
        back_selector = "#captcha-verify-image"
        try:
            await self.context_page.wait_for_selector(selector=back_selector, state="visible", timeout=30 * 1000)
        except PlaywrightTimeoutError:  # 没有滑动验证码,直接返回
        except PlaywrightTimeoutError:  # No slider verification, return directly
            return

        gap_selector = 'xpath=//*[@id="captcha_container"]/div/div[2]/img[2]'
@@ -182,16 +191,16 @@ class DouYinLogin(AbstractLogin):
                await self.move_slider(back_selector, gap_selector, move_step, slider_level)
                await asyncio.sleep(1)

                # 如果滑块滑动慢了,或者验证失败了,会提示操作过慢,这里点一下刷新按钮
                # If the slider is too slow or verification failed, it will prompt "操作过慢", click the refresh button here
                page_content = await self.context_page.content()
                if "操作过慢" in page_content or "提示重新操作" in page_content:
                    utils.logger.info("[DouYinLogin.check_page_display_slider] slider verify failed, retry ...")
                    await self.context_page.click(selector="//a[contains(@class, 'secsdk_captcha_refresh')]")
                    continue

                # 滑动成功后,等待滑块消失
                # After successful sliding, wait for the slider to disappear
                await self.context_page.wait_for_selector(selector=back_selector, state="hidden", timeout=1000)
                # 如果滑块消失了,说明验证成功了,跳出循环,如果没有消失,说明验证失败了,上面这一行代码会抛出异常被捕获后继续循环滑动验证码
                # If the slider disappears, it means the verification is successful, break the loop. If not, it means the verification failed, the above line will throw an exception and be caught to continue the loop
                utils.logger.info("[DouYinLogin.check_page_display_slider] slider verify success ...")
                slider_verify_success = True
            except Exception as e:
@@ -204,10 +213,10 @@ class DouYinLogin(AbstractLogin):
    async def move_slider(self, back_selector: str, gap_selector: str, move_step: int = 10, slider_level="easy"):
        """
        Move the slider to the right to complete the verification
        :param back_selector: 滑动验证码背景图片的选择器
        :param gap_selector: 滑动验证码的滑块选择器
        :param move_step: 是控制单次移动速度的比例是1/10 默认是1 相当于 传入的这个距离不管多远0.1秒钟移动完 越大越慢
        :param slider_level: 滑块难度 easy hard,分别对应手机验证码的滑块和验证码中间的滑块
        :param back_selector: Selector for the slider verification background image
        :param gap_selector: Selector for the slider verification slider
        :param move_step: Controls the ratio of single movement speed, default is 1, meaning the distance moves in 0.1 seconds no matter how far, larger value means slower
        :param slider_level: Slider difficulty easy hard, corresponding to the slider for mobile verification code and the slider in the middle of verification code
        :return:
        """

@@ -225,31 +234,31 @@ class DouYinLogin(AbstractLogin):
        )
        gap_src = str(await gap_elements.get_property("src"))  # type: ignore

        # 识别滑块位置
        # Identify slider position
        slide_app = utils.Slide(gap=gap_src, bg=slide_back)
        distance = slide_app.discern()

        # 获取移动轨迹
        # Get movement trajectory
        tracks = utils.get_tracks(distance, slider_level)
        new_1 = tracks[-1] - (sum(tracks) - distance)
        tracks.pop()
        tracks.append(new_1)

        # 根据轨迹拖拽滑块到指定位置
        # Drag slider to specified position according to trajectory
        element = await self.context_page.query_selector(gap_selector)
        bounding_box = await element.bounding_box()  # type: ignore

        await self.context_page.mouse.move(bounding_box["x"] + bounding_box["width"] / 2,  # type: ignore
                                           bounding_box["y"] + bounding_box["height"] / 2)  # type: ignore
        # 这里获取到x坐标中心点位置
        # Get x coordinate center position
        x = bounding_box["x"] + bounding_box["width"] / 2  # type: ignore
        # 模拟滑动操作
        # Simulate sliding operation
        await element.hover()  # type: ignore
        await self.context_page.mouse.down()

        for track in tracks:
            # 循环鼠标按照轨迹移动
            # steps 是控制单次移动速度的比例是1/10 默认是1 相当于 传入的这个距离不管多远0.1秒钟移动完 越大越慢
            # Loop mouse movement according to trajectory
            # steps controls the ratio of single movement speed, default is 1, meaning the distance moves in 0.1 seconds no matter how far, larger value means slower
            await self.context_page.mouse.move(x + track, 0, steps=move_step)
            x += track
        await self.context_page.mouse.up()
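Editor's note: the drag loop above assumes utils.get_tracks returns a list of per-step x offsets whose sum roughly equals the measured gap distance; the code then rewrites the last step so the total matches exactly (new_1 = tracks[-1] - (sum(tracks) - distance)). A simplified sketch of such a generator, for illustration only (the project's real easing curve is likely more elaborate):

# Illustrative sketch of a track generator whose steps sum to the target distance.
import random


def get_tracks_sketch(distance: int, level: str = "easy") -> list:
    """Split a pixel distance into small forward steps that sum exactly to distance."""
    tracks, moved = [], 0
    base_step = 4 if level == "easy" else 2
    while moved < distance:
        move = min(base_step + random.randint(0, 2), distance - moved)
        tracks.append(move)
        moved += move
    return tracks

Because this sketch never overshoots, the caller's final-step correction is a no-op here; in the real helper it compensates for randomized easing that may land past the gap.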

@@ -1,13 +1,22 @@
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/kuaishou/__init__.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#

# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。


# -*- coding: utf-8 -*-
from .core import KuaishouCrawler
from .core import KuaishouCrawler
@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/kuaishou/client.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#

# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -12,7 +21,7 @@
# -*- coding: utf-8 -*-
import asyncio
import json
from typing import Any, Callable, Dict, List, Optional
from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional
from urllib.parse import urlencode

import httpx
@@ -20,13 +29,17 @@ from playwright.async_api import BrowserContext, Page

import config
from base.base_crawler import AbstractApiClient
from proxy.proxy_mixin import ProxyRefreshMixin
from tools import utils

if TYPE_CHECKING:
    from proxy.proxy_ip_pool import ProxyIpPool

from .exception import DataFetchError
from .graphql import KuaiShouGraphQL


class KuaiShouClient(AbstractApiClient):
class KuaiShouClient(AbstractApiClient, ProxyRefreshMixin):
    def __init__(
        self,
        timeout=10,
@@ -35,16 +48,23 @@ class KuaiShouClient(AbstractApiClient):
        headers: Dict[str, str],
        playwright_page: Page,
        cookie_dict: Dict[str, str],
        proxy_ip_pool: Optional["ProxyIpPool"] = None,
    ):
        self.proxy = proxy
        self.timeout = timeout
        self.headers = headers
        self._host = "https://www.kuaishou.com/graphql"
        self._rest_host = "https://www.kuaishou.com"
        self.playwright_page = playwright_page
        self.cookie_dict = cookie_dict
        self.graphql = KuaiShouGraphQL()
        # Initialize proxy pool (from ProxyRefreshMixin)
        self.init_proxy_pool(proxy_ip_pool)

    async def request(self, method, url, **kwargs) -> Any:
        # Check if proxy is expired before each request
        await self._refresh_proxy_if_expired()

        async with httpx.AsyncClient(proxy=self.proxy) as client:
            response = await client.request(method, url, timeout=self.timeout, **kwargs)
            data: Dict = response.json()
@@ -67,6 +87,29 @@ class KuaiShouClient(AbstractApiClient):
            method="POST", url=f"{self._host}{uri}", data=json_str, headers=self.headers
        )

    async def request_rest_v2(self, uri: str, data: dict) -> Dict:
        """
        Make REST API V2 request (for comment endpoints)
        :param uri: API endpoint path
        :param data: request body
        :return: response data
        """
        await self._refresh_proxy_if_expired()

        json_str = json.dumps(data, separators=(",", ":"), ensure_ascii=False)
        async with httpx.AsyncClient(proxy=self.proxy) as client:
            response = await client.request(
                method="POST",
                url=f"{self._rest_host}{uri}",
                data=json_str,
                timeout=self.timeout,
                headers=self.headers,
            )
            result: Dict = response.json()
            if result.get("result") != 1:
                raise DataFetchError(f"REST API V2 error: {result}")
            return result
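Editor's note: outside the class, the same V2 call can be reproduced with httpx alone, which is handy when checking an endpoint in isolation. A hedged, standalone sketch (the Content-Type/Cookie headers and the exact payload fields are assumptions based on the methods in this diff):

# Standalone sketch mirroring request_rest_v2; not the project's client code.
import json

import httpx


async def fetch_first_level_comments(photo_id: str, cookies: str, pcursor: str = "") -> dict:
    payload = json.dumps({"photoId": photo_id, "pcursor": pcursor}, separators=(",", ":"))
    headers = {"Content-Type": "application/json", "Cookie": cookies}
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://www.kuaishou.com/rest/v/photo/comment/list",
            content=payload,
            headers=headers,
            timeout=10,
        )
    result = response.json()
    # The V2 endpoints in this diff signal success with result == 1.
    if result.get("result") != 1:
        raise RuntimeError(f"REST API V2 error: {result}")
    return result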

    async def pong(self) -> bool:
        """get a note to check if login state is ok"""
        utils.logger.info("[KuaiShouClient.pong] Begin pong kuaishou...")
@@ -130,36 +173,32 @@ class KuaiShouClient(AbstractApiClient):
        return await self.post("", post_data)

    async def get_video_comments(self, photo_id: str, pcursor: str = "") -> Dict:
        """get video comments
        :param photo_id: photo id you want to fetch
        :param pcursor: last you get pcursor, defaults to ""
        :return:
        """Get video first-level comments using REST API V2
        :param photo_id: video id you want to fetch
        :param pcursor: pagination cursor, defaults to ""
        :return: dict with rootCommentsV2, pcursorV2, commentCountV2
        """
        post_data = {
            "operationName": "commentListQuery",
            "variables": {"photoId": photo_id, "pcursor": pcursor},
            "query": self.graphql.get("comment_list"),
            "photoId": photo_id,
            "pcursor": pcursor,
        }
        return await self.post("", post_data)
        return await self.request_rest_v2("/rest/v/photo/comment/list", post_data)

    async def get_video_sub_comments(
        self, photo_id: str, rootCommentId: str, pcursor: str = ""
        self, photo_id: str, root_comment_id: int, pcursor: str = ""
    ) -> Dict:
        """get video sub comments
        :param photo_id: photo id you want to fetch
        :param pcursor: last you get pcursor, defaults to ""
        :return:
        """Get video second-level comments using REST API V2
        :param photo_id: video id you want to fetch
        :param root_comment_id: parent comment id (must be int type)
        :param pcursor: pagination cursor, defaults to ""
        :return: dict with subCommentsV2, pcursorV2
        """
        post_data = {
            "operationName": "visionSubCommentList",
            "variables": {
                "photoId": photo_id,
                "pcursor": pcursor,
                "rootCommentId": rootCommentId,
            },
            "query": self.graphql.get("vision_sub_comment_list"),
            "photoId": photo_id,
            "pcursor": pcursor,
            "rootCommentId": root_comment_id,  # Must be int type for V2 API
        }
        return await self.post("", post_data)
        return await self.request_rest_v2("/rest/v/photo/comment/sublist", post_data)

    async def get_creator_profile(self, userId: str) -> Dict:
        post_data = {
@@ -185,12 +224,12 @@ class KuaiShouClient(AbstractApiClient):
        max_count: int = 10,
    ):
        """
        get video all comments include sub comments
        :param photo_id:
        :param crawl_interval:
        :param callback:
        :param max_count:
        :return:
        Get video all comments including sub comments (V2 REST API)
        :param photo_id: video id
        :param crawl_interval: delay between requests (seconds)
        :param callback: callback function for processing comments
        :param max_count: max number of comments to fetch
        :return: list of all comments
        """

        result = []
@@ -198,12 +237,12 @@ class KuaiShouClient(AbstractApiClient):

        while pcursor != "no_more" and len(result) < max_count:
            comments_res = await self.get_video_comments(photo_id, pcursor)
            vision_commen_list = comments_res.get("visionCommentList", {})
            pcursor = vision_commen_list.get("pcursor", "")
            comments = vision_commen_list.get("rootComments", [])
            # V2 API returns data at top level, not nested in visionCommentList
            pcursor = comments_res.get("pcursorV2", "no_more")
            comments = comments_res.get("rootCommentsV2", [])
            if len(result) + len(comments) > max_count:
                comments = comments[: max_count - len(result)]
            if callback:  # 如果有回调函数,就执行回调函数
            if callback:  # If there is a callback function, execute the callback function
                await callback(photo_id, comments)
            result.extend(comments)
            await asyncio.sleep(crawl_interval)
@@ -221,14 +260,14 @@ class KuaiShouClient(AbstractApiClient):
        callback: Optional[Callable] = None,
    ) -> List[Dict]:
        """
        获取指定一级评论下的所有二级评论, 该方法会一直查找一级评论下的所有二级评论信息
        Get all second-level comments under specified first-level comments (V2 REST API)
        Args:
            comments: 评论列表
            photo_id: 视频id
            crawl_interval: 爬取一次评论的延迟单位(秒)
            callback: 一次评论爬取结束后
            comments: Comment list
            photo_id: Video ID
            crawl_interval: Delay unit for crawling comments once (seconds)
            callback: Callback after one comment crawl ends
        Returns:

            List of sub comments
        """
        if not config.ENABLE_GET_SUB_COMMENTS:
            utils.logger.info(
@@ -238,35 +277,36 @@ class KuaiShouClient(AbstractApiClient):

        result = []
        for comment in comments:
            sub_comments = comment.get("subComments")
            if sub_comments and callback:
                await callback(photo_id, sub_comments)

            sub_comment_pcursor = comment.get("subCommentsPcursor")
            if sub_comment_pcursor == "no_more":
            # V2 API uses hasSubComments (boolean) instead of subCommentsPcursor (string)
            has_sub_comments = comment.get("hasSubComments", False)
            if not has_sub_comments:
                continue

            # V2 API uses comment_id (int) instead of commentId (string)
            root_comment_id = comment.get("comment_id")
            if not root_comment_id:
                continue

            root_comment_id = comment.get("commentId")
            sub_comment_pcursor = ""

            while sub_comment_pcursor != "no_more":
                comments_res = await self.get_video_sub_comments(
                    photo_id, root_comment_id, sub_comment_pcursor
                )
                vision_sub_comment_list = comments_res.get("visionSubCommentList", {})
                sub_comment_pcursor = vision_sub_comment_list.get("pcursor", "no_more")
                # V2 API returns data at top level
                sub_comment_pcursor = comments_res.get("pcursorV2", "no_more")
                sub_comments = comments_res.get("subCommentsV2", [])

                comments = vision_sub_comment_list.get("subComments", {})
                if callback:
                    await callback(photo_id, comments)
                if callback and sub_comments:
                    await callback(photo_id, sub_comments)
                await asyncio.sleep(crawl_interval)
                result.extend(comments)
                result.extend(sub_comments)
        return result
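Editor's note: taken together, the V2 flow in this diff is to page first-level comments on pcursorV2/rootCommentsV2 and then, for each root comment with hasSubComments, walk /rest/v/photo/comment/sublist on its comment_id. A condensed sketch of that loop against the two client methods above (field names exactly as used in this diff; the helper itself is illustrative):

# Condensed pagination sketch using the V2 response fields shown in this diff.
async def collect_all_comments(client, photo_id: str, max_count: int = 100) -> list:
    collected, pcursor = [], ""
    while pcursor != "no_more" and len(collected) < max_count:
        page = await client.get_video_comments(photo_id, pcursor)
        pcursor = page.get("pcursorV2", "no_more")
        roots = page.get("rootCommentsV2", [])[: max_count - len(collected)]
        collected.extend(roots)
        for root in roots:
            if not root.get("hasSubComments"):
                continue
            sub_cursor = ""
            while sub_cursor != "no_more":
                sub_page = await client.get_video_sub_comments(photo_id, root.get("comment_id"), sub_cursor)
                sub_cursor = sub_page.get("pcursorV2", "no_more")
                collected.extend(sub_page.get("subCommentsV2", []))
    return collected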

    async def get_creator_info(self, user_id: str) -> Dict:
        """
        eg: https://www.kuaishou.com/profile/3x4jtnbfter525a
        快手用户主页
        Kuaishou user homepage
        """

        visionProfile = await self.get_creator_profile(user_id)
@@ -279,11 +319,11 @@ class KuaiShouClient(AbstractApiClient):
        callback: Optional[Callable] = None,
    ) -> List[Dict]:
        """
        获取指定用户下的所有发过的帖子,该方法会一直查找一个用户下的所有帖子信息
        Get all posts published by the specified user, this method will continue to find all post information under a user
        Args:
            user_id: 用户ID
            crawl_interval: 爬取一次的延迟单位(秒)
            callback: 一次分页爬取结束后的更新回调函数
            user_id: User ID
            crawl_interval: Delay unit for crawling once (seconds)
            callback: Update callback function after one page crawl ends
        Returns:

        """

@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/kuaishou/core.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#

# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -49,22 +58,23 @@ class KuaishouCrawler(AbstractCrawler):
        self.index_url = "https://www.kuaishou.com"
        self.user_agent = utils.get_user_agent()
        self.cdp_manager = None
        self.ip_proxy_pool = None  # Proxy IP pool, used for automatic proxy refresh

    async def start(self):
        playwright_proxy_format, httpx_proxy_format = None, None
        if config.ENABLE_IP_PROXY:
            ip_proxy_pool = await create_ip_pool(
            self.ip_proxy_pool = await create_ip_pool(
                config.IP_PROXY_POOL_COUNT, enable_validate_ip=True
            )
            ip_proxy_info: IpInfoModel = await ip_proxy_pool.get_proxy()
            ip_proxy_info: IpInfoModel = await self.ip_proxy_pool.get_proxy()
            playwright_proxy_format, httpx_proxy_format = utils.format_proxy_info(
                ip_proxy_info
            )

        async with async_playwright() as playwright:
            # 根据配置选择启动模式
            # Select startup mode based on configuration
            if config.ENABLE_CDP_MODE:
                utils.logger.info("[KuaishouCrawler] 使用CDP模式启动浏览器")
                utils.logger.info("[KuaishouCrawler] Launching browser using CDP mode")
                self.browser_context = await self.launch_browser_with_cdp(
                    playwright,
                    playwright_proxy_format,
@@ -72,14 +82,16 @@ class KuaishouCrawler(AbstractCrawler):
                    headless=config.CDP_HEADLESS,
                )
            else:
                utils.logger.info("[KuaishouCrawler] 使用标准模式启动浏览器")
                utils.logger.info("[KuaishouCrawler] Launching browser using standard mode")
                # Launch a browser context.
                chromium = playwright.chromium
                self.browser_context = await self.launch_browser(
                    chromium, None, self.user_agent, headless=config.HEADLESS
                )
            # stealth.min.js is a js script to prevent the website from detecting the crawler.
            await self.browser_context.add_init_script(path="libs/stealth.min.js")
            # stealth.min.js is a js script to prevent the website from detecting the crawler.
            await self.browser_context.add_init_script(path="libs/stealth.min.js")


            self.context_page = await self.browser_context.new_page()
            await self.context_page.goto(f"{self.index_url}?isHome=1")

@@ -161,11 +173,11 @@ class KuaishouCrawler(AbstractCrawler):

                # batch fetch video comments
                page += 1


                # Sleep after page navigation
                await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
                utils.logger.info(f"[KuaishouCrawler.search] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after page {page-1}")


            await self.batch_get_video_comments(video_id_list)

    async def get_specified_videos(self):
@@ -199,11 +211,11 @@ class KuaishouCrawler(AbstractCrawler):
        async with semaphore:
            try:
                result = await self.ks_client.get_video_info(video_id)


                # Sleep after fetching video details
                await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
                utils.logger.info(f"[KuaishouCrawler.get_video_info_task] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after fetching video details {video_id}")


                utils.logger.info(
                    f"[KuaishouCrawler.get_video_info_task] Get video_id:{video_id} info result: {result} ..."
                )
@@ -257,11 +269,11 @@ class KuaishouCrawler(AbstractCrawler):
        utils.logger.info(
            f"[KuaishouCrawler.get_comments] begin get video_id: {video_id} comments ..."
        )


        # Sleep before fetching comments
        await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
        utils.logger.info(f"[KuaishouCrawler.get_comments] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds before fetching comments for video {video_id}")


        await self.ks_client.get_video_all_comments(
            photo_id=video_id,
            crawl_interval=config.CRAWLER_MAX_SLEEP_SEC,
@@ -306,6 +318,7 @@ class KuaishouCrawler(AbstractCrawler):
            },
            playwright_page=self.context_page,
            cookie_dict=cookie_dict,
            proxy_ip_pool=self.ip_proxy_pool,  # Pass proxy pool for automatic refresh
        )
        return ks_client_obj
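Editor's note: the proxy_ip_pool handed over here feeds ProxyRefreshMixin, whose implementation lives in proxy/proxy_mixin.py and is not part of this diff. A rough sketch of what such a mixin might do, with init_proxy_pool and _refresh_proxy_if_expired named after the calls above; the expiry bookkeeping is an assumption:

# Hedged sketch only -- the project's real ProxyRefreshMixin may differ.
import time
from typing import Optional

from tools import utils  # same helper module used by the crawler above


class ProxyRefreshMixinSketch:
    def init_proxy_pool(self, proxy_ip_pool: Optional[object]) -> None:
        self._proxy_ip_pool = proxy_ip_pool
        self._proxy_expire_ts = 0.0  # force a refresh on first use

    async def _refresh_proxy_if_expired(self) -> None:
        if not self._proxy_ip_pool or time.time() < self._proxy_expire_ts:
            return  # no pool configured, or the current proxy is still considered valid
        ip_info = await self._proxy_ip_pool.get_proxy()
        # format_proxy_info is the helper start() uses; it returns (playwright_proxy, httpx_proxy).
        _, httpx_proxy = utils.format_proxy_info(ip_info)
        self.proxy = httpx_proxy
        self._proxy_expire_ts = time.time() + 60  # assumed lifetime; real code would read it from the IP model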

@@ -331,10 +344,11 @@ class KuaishouCrawler(AbstractCrawler):
                proxy=playwright_proxy,  # type: ignore
                viewport={"width": 1920, "height": 1080},
                user_agent=user_agent,
                channel="chrome",  # Use system's stable Chrome version
            )
            return browser_context
        else:
            browser = await chromium.launch(headless=headless, proxy=playwright_proxy)  # type: ignore
            browser = await chromium.launch(headless=headless, proxy=playwright_proxy, channel="chrome")  # type: ignore
            browser_context = await browser.new_context(
                viewport={"width": 1920, "height": 1080}, user_agent=user_agent
            )
@@ -348,7 +362,7 @@ class KuaishouCrawler(AbstractCrawler):
        headless: bool = True,
    ) -> BrowserContext:
        """
        使用CDP模式启动浏览器
        Launch browser using CDP mode
        """
        try:
            self.cdp_manager = CDPBrowserManager()
@@ -359,17 +373,17 @@ class KuaishouCrawler(AbstractCrawler):
                headless=headless,
            )

            # 显示浏览器信息
            # Display browser information
            browser_info = await self.cdp_manager.get_browser_info()
            utils.logger.info(f"[KuaishouCrawler] CDP浏览器信息: {browser_info}")
            utils.logger.info(f"[KuaishouCrawler] CDP browser info: {browser_info}")

            return browser_context

        except Exception as e:
            utils.logger.error(
                f"[KuaishouCrawler] CDP模式启动失败,回退到标准模式: {e}"
                f"[KuaishouCrawler] CDP mode launch failed, fallback to standard mode: {e}"
            )
            # 回退到标准模式
            # Fallback to standard mode
            chromium = playwright.chromium
            return await self.launch_browser(
                chromium, playwright_proxy, user_agent, headless
@@ -424,7 +438,7 @@ class KuaishouCrawler(AbstractCrawler):

    async def close(self):
        """Close browser context"""
        # 如果使用CDP模式,需要特殊处理
        # If using CDP mode, need special handling
        if self.cdp_manager:
            await self.cdp_manager.cleanup()
            self.cdp_manager = None

@@ -1,12 +1,21 @@
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/kuaishou/exception.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#

# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。


from httpx import RequestError

@@ -1,12 +1,21 @@
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/kuaishou/field.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#

# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。


# -*- coding: utf-8 -*-
# -*- coding: utf-8 -*-

@@ -1,16 +1,25 @@
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/kuaishou/graphql.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#

# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。


# 快手的数据传输是基于GraphQL实现的
# 这个类负责获取一些GraphQL的schema
# Kuaishou's data transmission is based on GraphQL
# This class is responsible for obtaining some GraphQL schemas
from typing import Dict
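Editor's note: the class described here is consumed above as self.graphql.get("comment_list"), i.e. it maps a query name to a GraphQL document string. A minimal sketch of such a loader, assuming the schemas live as *.graphql files in a sibling directory (the real file layout may differ):

# Minimal illustrative sketch of a GraphQL schema loader keyed by file name.
import os
from typing import Dict


class GraphQLSketch:
    def __init__(self, schema_dir: str = "graphql"):
        self._queries: Dict[str, str] = {}
        for name in os.listdir(schema_dir):
            if name.endswith(".graphql"):
                with open(os.path.join(schema_dir, name), "r", encoding="utf-8") as f:
                    # "comment_list.graphql" becomes the key "comment_list".
                    self._queries[name[: -len(".graphql")]] = f.read()

    def get(self, query_name: str) -> str:
        return self._queries.get(query_name, "")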


@@ -1,3 +1,12 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/kuaishou/help.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#

# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
@@ -17,59 +26,59 @@ from model.m_kuaishou import VideoUrlInfo, CreatorUrlInfo

def parse_video_info_from_url(url: str) -> VideoUrlInfo:
    """
    从快手视频URL中解析出视频ID
    支持以下格式:
    1. 完整视频URL: "https://www.kuaishou.com/short-video/3x3zxz4mjrsc8ke?authorId=3x84qugg4ch9zhs&streamSource=search"
    2. 纯视频ID: "3x3zxz4mjrsc8ke"
    Parse video ID from Kuaishou video URL
    Supports the following formats:
    1. Full video URL: "https://www.kuaishou.com/short-video/3x3zxz4mjrsc8ke?authorId=3x84qugg4ch9zhs&streamSource=search"
    2. Pure video ID: "3x3zxz4mjrsc8ke"

    Args:
        url: 快手视频链接或视频ID
        url: Kuaishou video link or video ID
    Returns:
        VideoUrlInfo: 包含视频ID的对象
        VideoUrlInfo: Object containing video ID
    """
    # 如果不包含http且不包含kuaishou.com,认为是纯ID
    # If it doesn't contain http and doesn't contain kuaishou.com, consider it as pure ID
    if not url.startswith("http") and "kuaishou.com" not in url:
        return VideoUrlInfo(video_id=url, url_type="normal")

    # 从标准视频URL中提取ID: /short-video/视频ID
    # Extract ID from standard video URL: /short-video/video_ID
    video_pattern = r'/short-video/([a-zA-Z0-9_-]+)'
    match = re.search(video_pattern, url)
    if match:
        video_id = match.group(1)
        return VideoUrlInfo(video_id=video_id, url_type="normal")

    raise ValueError(f"无法从URL中解析出视频ID: {url}")
    raise ValueError(f"Unable to parse video ID from URL: {url}")


def parse_creator_info_from_url(url: str) -> CreatorUrlInfo:
    """
    从快手创作者主页URL中解析出创作者ID
    支持以下格式:
    1. 创作者主页: "https://www.kuaishou.com/profile/3x84qugg4ch9zhs"
    2. 纯ID: "3x4sm73aye7jq7i"
    Parse creator ID from Kuaishou creator homepage URL
    Supports the following formats:
    1. Creator homepage: "https://www.kuaishou.com/profile/3x84qugg4ch9zhs"
    2. Pure ID: "3x4sm73aye7jq7i"

    Args:
        url: 快手创作者主页链接或user_id
        url: Kuaishou creator homepage link or user_id
    Returns:
        CreatorUrlInfo: 包含创作者ID的对象
        CreatorUrlInfo: Object containing creator ID
    """
    # 如果不包含http且不包含kuaishou.com,认为是纯ID
    # If it doesn't contain http and doesn't contain kuaishou.com, consider it as pure ID
    if not url.startswith("http") and "kuaishou.com" not in url:
        return CreatorUrlInfo(user_id=url)

    # 从创作者主页URL中提取user_id: /profile/xxx
    # Extract user_id from creator homepage URL: /profile/xxx
    user_pattern = r'/profile/([a-zA-Z0-9_-]+)'
    match = re.search(user_pattern, url)
    if match:
        user_id = match.group(1)
        return CreatorUrlInfo(user_id=user_id)

    raise ValueError(f"无法从URL中解析出创作者ID: {url}")
    raise ValueError(f"Unable to parse creator ID from URL: {url}")


if __name__ == '__main__':
    # 测试视频URL解析
    print("=== 视频URL解析测试 ===")
    # Test video URL parsing
    print("=== Video URL Parsing Test ===")
    test_video_urls = [
        "https://www.kuaishou.com/short-video/3x3zxz4mjrsc8ke?authorId=3x84qugg4ch9zhs&streamSource=search&area=searchxxnull&searchKey=python",
        "3xf8enb8dbj6uig",
@@ -78,13 +87,13 @@ if __name__ == '__main__':
        try:
            result = parse_video_info_from_url(url)
            print(f"✓ URL: {url[:80]}...")
            print(f"  结果: {result}\n")
            print(f"  Result: {result}\n")
        except Exception as e:
            print(f"✗ URL: {url}")
            print(f"  错误: {e}\n")
            print(f"  Error: {e}\n")

    # 测试创作者URL解析
    print("=== 创作者URL解析测试 ===")
    # Test creator URL parsing
    print("=== Creator URL Parsing Test ===")
    test_creator_urls = [
        "https://www.kuaishou.com/profile/3x84qugg4ch9zhs",
        "3x4sm73aye7jq7i",
@@ -93,7 +102,7 @@ if __name__ == '__main__':
        try:
            result = parse_creator_info_from_url(url)
            print(f"✓ URL: {url[:80]}...")
            print(f"  结果: {result}\n")
            print(f"  Result: {result}\n")
        except Exception as e:
            print(f"✗ URL: {url}")
            print(f"  错误: {e}\n")
            print(f"  Error: {e}\n")

@@ -1,12 +1,21 @@
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/kuaishou/login.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#

# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。


import asyncio

@@ -1,13 +1,22 @@
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/media_platform/tieba/__init__.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#

# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。


# -*- coding: utf-8 -*-
from .core import TieBaCrawler
from .core import TieBaCrawler