52 Commits

Author SHA1 Message Date
程序员阿江(Relakkes)
ae7955787c feat: kuaishou support url link 2025-10-18 07:40:10 +08:00
程序员阿江(Relakkes)
a9dd08680f feat: xhs support creator url link 2025-10-18 07:20:09 +08:00
程序员阿江(Relakkes)
cae707cb2a feat: douyin support url link 2025-10-18 07:00:21 +08:00
程序员阿江(Relakkes)
906c259cc7 feat: bilibili support url link 2025-10-18 06:30:20 +08:00
程序员阿江(Relakkes)
3b6fae8a62 docs: update README.md 2025-10-17 15:30:44 +08:00
程序员阿江-Relakkes
a72504a33d Merge pull request #739 from callmeiks/add-tikhub-sponsor
docs: resize TikHub banner to smaller size
2025-10-16 16:54:18 +08:00
Callmeiks
e177f799df docs: resize TikHub banner to smaller size 2025-10-16 01:51:55 -07:00
程序员阿江-Relakkes
1a5dcb6db7 Merge pull request #738 from callmeiks/add-tikhub-sponsor
docs: add TikHub as sponsor
2025-10-16 16:41:19 +08:00
Callmeiks
2c9eec544d docs: add TikHub as sponsor 2025-10-16 01:22:40 -07:00
程序员阿江(Relakkes)
d1f73e811c docs: update README.md 2025-10-12 21:19:11 +08:00
程序员阿江(Relakkes)
2d3e7555c6 docs: update README.md 2025-10-11 16:16:11 +08:00
程序员阿江(Relakkes)
3c5b9e8035 docs: update wechat qrcode 2025-10-02 14:27:10 +08:00
程序员阿江(Relakkes)
e6f3182ed7 Merge branch 'codex/replace-argparse-with-typer-for-cli' 2025-09-26 18:11:02 +08:00
程序员阿江(Relakkes)
2cf143cc7c fix: #730 2025-09-26 18:10:30 +08:00
程序员阿江-Relakkes
eb625b0b48 Merge pull request #729 from NanmiCoder/codex/replace-argparse-with-typer-for-cli
feat(cli): migrate CLI argument parsing to Typer
2025-09-26 18:08:21 +08:00
程序员阿江(Relakkes)
84f6f650f8 fix: typer args bugfix 2025-09-26 18:07:57 +08:00
程序员阿江-Relakkes
9d6cf065e9 fix(cli): support runtime without peps604 2025-09-26 17:38:50 +08:00
程序员阿江-Relakkes
95c740dee2 refine: harden typer cli defaults 2025-09-26 17:38:44 +08:00
程序员阿江-Relakkes
f97e0c18cd feat(cli): migrate argument parsing to typer 2025-09-26 17:21:47 +08:00
程序员阿江-Relakkes
879a72ea30 fix: fix bug where the CDP-launched browser could not be closed
Improve BrowserLauncher shutdown reliability
2025-09-26 16:57:48 +08:00
程序员阿江-Relakkes
3237073a0e Improve BrowserLauncher cleanup handling 2025-09-26 16:52:38 +08:00
程序员阿江-Relakkes
7b9db2f748 Merge pull request #726 from LePao1/main
feat(bilibili): add a video quality parameter; the downloaded video quality can be changed via `BILI_QN`
2025-09-25 01:58:24 +08:00
LePao1
3954c40e69 feat(bilibili): add a video quality parameter; the downloaded video quality can be changed via BILI_QN
Add video quality configuration to BilibiliClient and improve error handling. Fixes download requests being 302-redirected to the CDN: the old code did not follow redirects and only accepted "OK", so downloads failed. Now even low-quality / CDN-redirected links download correctly.
2025-09-24 12:27:16 +08:00
程序员阿江(Relakkes)
e2554288e0 docs: update README 2025-09-23 09:46:44 +08:00
程序员阿江-Relakkes
1342797486 Merge pull request #718 from persist-1/refactor
fix(store): fix a bug where improper use of 'crawler_type_var' produced abnormal csv/json file names
2025-09-11 06:45:26 +08:00
persist-1
926ea9dc42 fix: fix path format issues caused by improper path-separator concatenation
- Change the '_get_file_path' return value in 'async_file_writer.py' from string concatenation to joining with forward slashes, ensuring consistent path separators
- Switch the file-save time suffix to 'get_current_date', partitioning file contents by day
2025-09-11 00:35:02 +08:00
persist-1
a6d85b4194 sync #717 2025-09-11 00:00:06 +08:00
persist-1
0d0af57a01 fix(store): fix a bug where improper use of 'crawler_type_var' produced abnormal csv/json file names 2025-09-10 23:47:05 +08:00
程序员阿江-Relakkes
4b346cfb61 Merge pull request #716 from persist-1/refactor
Refactor: fully rebuild the database layer with SQLAlchemy ORM
2025-09-10 14:38:39 +08:00
程序员阿江-Relakkes
bf7a0098bd Merge pull request #717 from wisty/patch-1
log client modify
2025-09-09 17:14:37 +08:00
刘小龙
c87df59996 log client modify 2025-09-09 15:27:46 +08:00
persist-1
d3bebd039e refactor(database): move the database module, rename the init arg, and update docs
- Move the db.py database module into 'database/' and fix all call sites
- Rename the init parameter `--init-db` to `--init_db`
- Update the related documentation
2025-09-08 01:14:31 +08:00
persist-1
99756612b4 chore: remove the previously synced sqlite database so users initialize it themselves 2025-09-08 00:40:55 +08:00
persist-1
95a3dc8ce1 chore: remove unnecessary comments 2025-09-08 00:37:57 +08:00
persist-1
40de0e47e5 fix(store): replace the async for loop with an async with statement to fix zhihu database session management 2025-09-08 00:29:04 +08:00
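The session-management fix in this commit follows a standard asyncio pattern: an `async with` block guarantees cleanup even when the body raises. A minimal standalone sketch (the `get_session` factory here is a hypothetical stand-in for the project's SQLAlchemy session helper):

```python
import asyncio
from contextlib import asynccontextmanager

events = []

@asynccontextmanager
async def get_session():  # hypothetical stand-in for the project's session factory
    events.append("open")
    try:
        yield "session"
    finally:
        events.append("close")  # cleanup is guaranteed, even on exceptions

async def store_comment(comment):
    # the fix: consume the factory with `async with`, not `async for`
    async with get_session() as session:
        return (session, comment)

result = asyncio.run(store_comment({"id": 1}))
assert result == ("session", {"id": 1})
assert events == ["open", "close"]
```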
persist-1
a38058856f test: add a database sync test script to compare and sync the ORM with the database schema
fix(database): fix many incorrect field types
2025-09-08 00:13:00 +08:00
persist-1
684a16ed9a fix(database): fix model field types to support a wider range of data formats;
fix the xhs comment storage method, switching from batch processing to per-comment processing
2025-09-07 04:10:49 +08:00
persist-1
b04f5bcd6f feat(database): optimize database models and connection management
- Add the cryptography dependency for encryption features
- Refactor database model field types and constraints
- Add automatic database creation
- Improve database connection management and error handling
- Update related dependency files (pyproject.toml, requirements.txt)
2025-09-06 06:08:28 +08:00
persist-1
0965bd6c96 fix: use get_current_time() instead of get_current_date() to avoid file name collisions on the same date 2025-09-06 04:43:56 +08:00
persist-1
e92c6130e1 fix(store): fix the AsyncFileWriter import in storage implementations
Refactor the Xiaohongshu storage implementation, changing the store_comments method to a per-comment store_comment
Add AsyncFileWriter utility imports for multiple platforms
2025-09-06 04:41:37 +08:00
persist-1
be306c6f54 refactor(database): rebuild the database storage layer, replacing raw SQL with SQLAlchemy ORM
- Remove the old async_db.py and async_sqlite_db.py implementations
- Add SQLAlchemy ORM models and database session management
- Unify per-platform storage implementations into _store_impl.py files
- Add database initialization support
- Update .gitignore and pyproject.toml dependency configuration
- Optimize file storage paths and naming conventions
2025-09-06 04:10:20 +08:00
程序员阿江(Relakkes)
fa5f07e9ee docs: update README.md 2025-09-05 17:51:36 +08:00
程序员阿江(Relakkes)
6b6fedd031 fix: #711 2025-09-02 18:57:18 +08:00
程序员阿江(Relakkes)
2bce3593f7 feat: support time delay for all platforms 2025-09-02 16:43:09 +08:00
程序员阿江(Relakkes)
eb799e1fa7 refactor: xhs extractor 2025-09-02 14:50:32 +08:00
程序员阿江-Relakkes
ce52c58b98 Merge pull request #707 from CzsGit/fix-douyin-json-format
fix: add formatted output to Douyin JSON storage
2025-08-18 19:15:50 +08:00
Czs-HF
48da268bc5 fix: add formatted output to Douyin JSON storage
- Add the indent=4 argument in DouyinJsonStoreImplement.save_data_to_json
- Make Douyin JSON output consistent with Xiaohongshu's for better readability
- Fix JSON files having all content on a single line
2025-08-16 12:52:37 +08:00
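The formatting change amounts to passing `indent=4` to the JSON serializer; a quick sketch with the stdlib `json` module (field names here are illustrative, not the project's actual schema):

```python
import json

record = {"aweme_id": "123456", "desc": "demo note"}       # hypothetical Douyin fields
flat = json.dumps(record, ensure_ascii=False)              # old behaviour: one line
pretty = json.dumps(record, ensure_ascii=False, indent=4)  # after the fix: readable

assert "\n" not in flat
assert pretty.splitlines()[1] == '    "aweme_id": "123456",'
```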
程序员阿江(Relakkes)
9e8c979164 fix: note_download_url field length error 2025-08-14 14:57:24 +08:00
程序员阿江(Relakkes)
4a68e79ed0 docs: update README.md 2025-08-12 22:25:21 +08:00
程序员阿江-Relakkes
526c37822b Merge pull request #700 from 2513502304/main
Change the caught exception type from HTTPStatusError to the base class HTTPError, so that any error while crawling media resources no longer interrupts comment crawling; see the commit history for details
2025-08-06 17:26:29 +08:00
翟持江
2c11e64dc9 Merge branch 'NanmiCoder:main' into main 2025-08-06 11:39:42 +08:00
未来可欺
6a10d0d11c The original HTTPStatusError could not catch exception types such as ConnectError or ReadError. This commit changes the caught type to HTTPError, the base class for httpx request exceptions, so that any exception raised in the httpx.request method (for example a banned IP, or the server refusing connections) is caught, and an interrupted media crawl correctly no longer interrupts text crawling. 2025-08-06 11:24:51 +08:00
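The idea behind this fix, sketched with stand-in classes (the real hierarchy lives in httpx, where HTTPStatusError and transport errors such as ConnectError both derive from HTTPError):

```python
# Minimal stand-in hierarchy mirroring httpx's exception classes.
class HTTPError(Exception): ...        # base class for all request errors
class HTTPStatusError(HTTPError): ...  # raised for non-OK status codes
class ConnectError(HTTPError): ...     # transport error, e.g. connection refused

def fetch_media(error_cls):
    """Simulate a media download that fails with the given error type."""
    try:
        raise error_cls("simulated request failure")
    except HTTPError:
        return None  # any request error is swallowed; comment crawling continues

assert fetch_media(HTTPStatusError) is None  # caught before and after the change
assert fetch_media(ConnectError) is None     # only caught after the change
```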
88 changed files with 4179 additions and 6235 deletions

.gitignore vendored

@@ -173,4 +173,9 @@ docs/.vitepress/cache
# other gitignore
.venv
.refer
agent_zone
debug_tools
database/*.db


@@ -1 +1 @@
3.9
3.11

README.md

@@ -1,5 +1,19 @@
# 🔥 MediaCrawler - Social Media Platform Crawler 🕷️
<div align="center" markdown="1">
<sup>Special thanks to:</sup>
<br>
<br>
<a href="https://go.warp.dev/MediaCrawler">
<img alt="Warp sponsorship" width="400" src="https://github.com/warpdotdev/brand-assets/blob/main/Github/Sponsor/Warp-Github-LG-02.png?raw=true">
</a>
### [Warp is built for coding with multiple AI agents](https://go.warp.dev/MediaCrawler)
</div>
<hr>
<div align="center">
<a href="https://trendshift.io/repositories/8291" target="_blank">
@@ -51,8 +65,6 @@
| Zhihu | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
<details id="pro-version">
<summary>🔗 <strong>🚀 MediaCrawlerPro is out! More features, better architecture!</strong></summary>
### 🚀 MediaCrawlerPro is out!
@@ -77,7 +89,7 @@
- [ ] **An AI Agent built on social media platforms is under development 🚀🚀**
See the [MediaCrawlerPro project homepage](https://github.com/MediaCrawlerPro) for more
</details>
## 🚀 Quick Start
@@ -198,49 +210,97 @@ python main.py --help
## 💾 Data Storage
Supports multiple data storage methods:
- **SQLite Database**: Lightweight database without a server, ideal for personal use (recommended)
- Parameter: `--save_data_option sqlite`
- The database file is created automatically
- **MySQL Database**: Supports saving to the relational database MySQL (the database needs to be created in advance)
- Run `python db.py` to initialize the database table structure (only on the first run)
- **CSV Files**: Supports saving to CSV (under the `data/` directory)
- **JSON Files**: Supports saving to JSON (under the `data/` directory)
- **Database Storage**
- Use the `--init_db` parameter for database initialization (when using `--init_db`, no other optional arguments are needed)
- **SQLite Database**: Lightweight database, no server required, suitable for personal use (recommended)
1. Initialization: `--init_db sqlite`
2. Data Storage: `--save_data_option sqlite`
- **MySQL Database**: Supports saving to the relational database MySQL (the database needs to be created in advance)
1. Initialization: `--init_db mysql`
2. Data Storage: `--save_data_option db` (the db parameter is retained for compatibility with historical updates)
### Usage Examples:
```shell
# Use SQLite (recommended for personal users)
# Initialize the SQLite database (when using '--init_db', no other optional arguments are needed)
uv run main.py --init_db sqlite
# Use SQLite to store data (recommended for personal users)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
# Use MySQL
```
```shell
# Initialize the MySQL database
uv run main.py --init_db mysql
# Use MySQL to store data (the db parameter is retained for compatibility with historical updates)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
```
---
[🚀 MediaCrawlerPro is out 🚀! More features, better architecture!](https://github.com/MediaCrawlerPro)
## 🤝 Community & Support
### 💬 Chat Groups
- **WeChat group**: [Click to join](https://nanmicoder.github.io/MediaCrawler/%E5%BE%AE%E4%BF%A1%E4%BA%A4%E6%B5%81%E7%BE%A4.html)
### 📚 Docs & Tutorials
- **Online docs**: [Complete MediaCrawler documentation](https://nanmicoder.github.io/MediaCrawler/)
- **Crawler tutorial**: [CrawlerTutorial, a free tutorial](https://github.com/NanmiCoder/CrawlerTutorial)
### 📚 Other
- **FAQ**: [Complete MediaCrawler documentation](https://nanmicoder.github.io/MediaCrawler/)
- **Beginner crawler tutorial**: [CrawlerTutorial, a free tutorial](https://github.com/NanmiCoder/CrawlerTutorial)
- **Open-source news crawler**: [NewsCrawlerCollection](https://github.com/NanmiCoder/NewsCrawlerCollection)
---
# Other FAQs are covered in the online docs
>
> The online docs cover usage, FAQs, and how to join the project chat group.
> [MediaCrawler online docs](https://nanmicoder.github.io/MediaCrawler/)
>
### 💰 Sponsors
# Knowledge services from the author
> If you want to get up to speed quickly with this project's usage and source architecture, learn programming skills, or understand the MediaCrawlerPro source design, take a look at my paid knowledge column.
<a href="https://h.wandouip.com">
<img src="docs/static/images/img_8.jpg">
<br>
WandouHTTP: a self-operated pool of tens of millions of IPs with ≥99.8% purity, refreshed at high frequency every day. Fast response and stable connections cover a wide range of business scenarios, with custom plans supported; sign up to extract 10,000 IPs for free.
</a>
[About the author's paid knowledge column](https://nanmicoder.github.io/MediaCrawler/%E7%9F%A5%E8%AF%86%E4%BB%98%E8%B4%B9%E4%BB%8B%E7%BB%8D.html)
---
<p align="center">
<a href="https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad">
<img style="border-radius:20px" width="500" alt="TikHub IO_Banner zh" src="docs/static/images/tikhub_banner_zh.png">
</a>
</p>
[TikHub](https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad) offers over **700 endpoints** for fetching and analyzing data from **14+ social media platforms**, including videos, users, comments, shops, products, and trends, all in one place.
Free credits are available via daily check-in. You can use my referral link: [https://user.tikhub.io/users/signup?referral_code=cfzyejV9](https://user.tikhub.io/users/signup?referral_code=cfzyejV9&utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad) or the invite code `cfzyejV9`; sign up and top up to receive **$2 in free credits**.
[TikHub](https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad) provides the following services:
- 🚀 Rich social media data APIs (TikTok, Douyin, XHS, YouTube, Instagram, etc.)
- 💎 Free daily check-in credits
- ⚡ High success rate and high-concurrency support
- 🌐 Website: [https://tikhub.io/](https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad)
- 💻 GitHub: [https://github.com/TikHubIO/](https://github.com/TikHubIO/)
---
<p align="center">
<a href="https://app.nstbrowser.io/account/register?utm_source=official&utm_term=mediacrawler">
<img style="border-radius:20px" alt="NstBrowser Banner " src="docs/static/images/nstbrowser.jpg">
</a>
</p>
Nstbrowser antidetect browser: the best solution for multi-account operation & automated management
<br>
Secure multi-account management with session isolation; fingerprint customization plus an anti-detection browser environment balances realism and stability. Covers store management, e-commerce monitoring, social media marketing, ad verification, Web3, campaign monitoring, and affiliate marketing; offers production-grade concurrency and customized enterprise services, a one-click cloud browser deployment, and a global pool of high-quality IPs to build long-term competitive advantage.
<br>
[Click here to start using it for free](https://app.nstbrowser.io/account/register?utm_source=official&utm_term=mediacrawler)
<br>
Get a 10% top-up bonus when using NSTBROWSER
### 🤝 Become a Sponsor
Become a sponsor to showcase your product here and get massive exposure every day!
**Contact**:
- WeChat: `relakkes`
- Email: `relakkes@gmail.com`
---
@@ -250,54 +310,6 @@ uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
[![Star History Chart](https://api.star-history.com/svg?repos=NanmiCoder/MediaCrawler&type=Date)](https://star-history.com/#NanmiCoder/MediaCrawler&Date)
### 💰 Sponsors
<a href="https://www.swiftproxy.net/?ref=nanmi">
<img src="docs/static/images/img_5.png">
<br>
Swiftproxy - 90M+ global high-quality clean residential IPs. Sign up for 500MB of free test traffic; dynamic traffic never expires!
> Exclusive discount code: **GHB5** for 10% off!
</a>
<br>
<br>
<a href="https://www.tkyds.com/?=MediaCrawler">
<img src="docs/static/images/img_6.png">
<br>
TK Cloud Master, a professional TikTok matrix system with AI-powered automation; one person can easily manage tens of thousands of accounts!
</a>
<br>
<br>
<a href="https://www.thordata.com/?ls=github&lk=Crawler">
<img src="docs/static/images/img_7.png">
<br>
Thordata is a global proxy IP solution provider supporting large-scale collection of public web data: 195+ countries and cities, 60M residential IPs, prices from $0.65/GB, with unlimited-traffic / unlimited-IP / unlimited-concurrency plans, plus dedicated local ISP static proxies and high-performance datacenter proxies, both $0.75/IP with flexible pricing. Register via the image link, then contact the Chinese-language support team for a free trial; first top-ups are currently matched with bonus credit. Works well with the EasySpider tool for efficient web data collection.
</a>
<br>
<br>
<a href="https://h.wandouip.com">
<img src="docs/static/images/img_8.jpg">
<br>
WandouHTTP: a self-operated pool of tens of millions of IPs with ≥99.8% purity, refreshed at high frequency every day. Fast response and stable connections cover a wide range of business scenarios, with custom plans supported; sign up to extract 10,000 IPs for free.
</a>
<br>
<br>
<a href="https://sider.ai/ad-land-redirect?source=github&p1=mi&p2=kk">**Sider** - The hottest ChatGPT plugin on the web, an amazing experience!</a>
### 🤝 Become a Sponsor
Become a sponsor to showcase your product here and get massive exposure every day!
**Contact**:
- WeChat: `yzglan`
- Email: `relakkes@gmail.com`
## 📚 References
@@ -328,14 +340,3 @@ Thordata is a global proxy IP solution provider supporting large-scale collection of public web
## 6. Final Right of Interpretation
The final right to interpret this project belongs to the developer, who reserves the right to change or update this disclaimer at any time without prior notice.
</div>
## 🙏 Acknowledgements
### JetBrains Open Source License Support
Thanks to JetBrains for supporting this project with a free open-source license!
<a href="https://www.jetbrains.com/?from=MediaCrawler">
<img src="https://www.jetbrains.com/company/brand/img/jetbrains_logo.png" width="100" alt="JetBrains" />
</a>


@@ -1,3 +1,16 @@
<div align="center" markdown="1">
<sup>Special thanks to:</sup>
<br>
<br>
<a href="https://go.warp.dev/MediaCrawler">
<img alt="Warp sponsorship" width="400" src="https://github.com/warpdotdev/brand-assets/blob/main/Github/Sponsor/Warp-Github-LG-02.png?raw=true">
</a>
### [Warp is built for coding with multiple AI agents](https://go.warp.dev/MediaCrawler)
</div>
<hr>
# 🔥 MediaCrawler - Social Media Platform Crawler 🕷️
<div align="center">
@@ -194,21 +207,29 @@ python main.py --help
## 💾 Data Storage
Supports multiple data storage methods:
- **SQLite Database**: Lightweight database without server, ideal for personal use (recommended)
- Parameter: `--save_data_option sqlite`
- Database file created automatically
- **MySQL Database**: Supports saving to relational database MySQL (need to create database in advance)
- Execute `python db.py` to initialize database table structure (only execute on first run)
- **CSV Files**: Supports saving to CSV (under `data/` directory)
- **JSON Files**: Supports saving to JSON (under `data/` directory)
- **Database Storage**
- Use the `--init_db` parameter for database initialization (when using `--init_db`, no other optional arguments are needed)
- **SQLite Database**: Lightweight database, no server required, suitable for personal use (recommended)
1. Initialization: `--init_db sqlite`
2. Data Storage: `--save_data_option sqlite`
- **MySQL Database**: Supports saving to relational database MySQL (database needs to be created in advance)
1. Initialization: `--init_db mysql`
2. Data Storage: `--save_data_option db` (the db parameter is retained for compatibility with historical updates)
### Usage Examples:
```shell
# Use SQLite (recommended for personal users)
# Initialize SQLite database (when using '--init_db', no other optional arguments are needed)
uv run main.py --init_db sqlite
# Use SQLite to store data (recommended for personal users)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
# Use MySQL
```
```shell
# Initialize MySQL database
uv run main.py --init_db mysql
# Use MySQL to store data (the db parameter is retained for compatibility with historical updates)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
```
@@ -255,16 +276,13 @@ If this project helps you, please give a ⭐ Star to support and let more people
> Exclusive discount code: **GHB5** Get 10% off instantly!
</a>
<br><br>
<a href="https://sider.ai/ad-land-redirect?source=github&p1=mi&p2=kk">**Sider** - The hottest ChatGPT plugin on the web, amazing experience!</a>
### 🤝 Become a Sponsor
Become a sponsor and showcase your product here, getting massive exposure daily!
**Contact Information**:
- WeChat: `yzglan`
- WeChat: `relakkes`
- Email: `relakkes@gmail.com`


@@ -1,3 +1,17 @@
<div align="center" markdown="1">
<sup>Special thanks to:</sup>
<br>
<br>
<a href="https://go.warp.dev/MediaCrawler">
<img alt="Warp sponsorship" width="400" src="https://github.com/warpdotdev/brand-assets/blob/main/Github/Sponsor/Warp-Github-LG-02.png?raw=true">
</a>
### [Warp is built for coding with multiple AI agents](https://go.warp.dev/MediaCrawler)
</div>
<hr>
# 🔥 MediaCrawler - Social Media Platform Crawler 🕷️
<div align="center">
@@ -194,21 +208,29 @@ python main.py --help
## 💾 Data Storage
Supports multiple data storage methods:
- **SQLite Database**: Lightweight database without a server, ideal for personal use (recommended)
- Parameter: `--save_data_option sqlite`
- The database file is created automatically
- **MySQL Database**: Supports saving to the relational database MySQL (the database needs to be created in advance)
- Run `python db.py` to initialize the database table structure (only on the first run)
- **CSV Files**: Supports saving to CSV (under the `data/` directory)
- **JSON Files**: Supports saving to JSON (under the `data/` directory)
- **Database Storage**
- Use the `--init_db` parameter for database initialization (when using `--init_db`, no other optional arguments are needed)
- **SQLite Database**: Lightweight database, no server required, suitable for personal use (recommended)
1. Initialization: `--init_db sqlite`
2. Data Storage: `--save_data_option sqlite`
- **MySQL Database**: Supports saving to the relational database MySQL (the database must be created in advance)
1. Initialization: `--init_db mysql`
2. Data Storage: `--save_data_option db` (the db parameter is retained for compatibility with historical updates)
### Usage Examples:
```shell
# Use SQLite (recommended for personal users)
# Initialize the SQLite database (when using '--init_db', no other optional arguments are needed)
uv run main.py --init_db sqlite
# Use SQLite to store data (recommended for personal users)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
# Use MySQL
```
```shell
# Initialize the MySQL database
uv run main.py --init_db mysql
# Use MySQL to store data (the db parameter is retained for compatibility with historical updates)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
```
@@ -255,16 +277,12 @@ uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
> Exclusive discount code: **GHB5** Get 10% off instantly!
</a>
<br><br>
<a href="https://sider.ai/ad-land-redirect?source=github&p1=mi&p2=kk">**Sider** - The hottest ChatGPT plugin on the web, amazing experience!</a>
### 🤝 Become a Sponsor
Become a sponsor and showcase your product here, getting massive exposure daily!
**Contact Information**:
- WeChat: `yzglan`
- WeChat: `relakkes`
- Email: `relakkes@gmail.com`


@@ -1,107 +0,0 @@
# Disclaimer: this code is for learning and research purposes only. Users must:
# 1. Not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Not perform large-scale crawling or disrupt the platform's operation.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the target platform.
# 5. Not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to the principles above and all terms in LICENSE.
# -*- coding: utf-8 -*-
# @Author  : relakkes@gmail.com
# @Time    : 2024/4/6 14:21
# @Desc    : async CRUD wrapper around aiomysql
from typing import Any, Dict, List, Union

import aiomysql


class AsyncMysqlDB:
    def __init__(self, pool: aiomysql.Pool) -> None:
        self.__pool = pool

    async def query(self, sql: str, *args: Union[str, int]) -> List[Dict[str, Any]]:
        """
        Run the given SQL query and return a list of records
        :param sql: query SQL
        :param args: dynamic parameters passed into the SQL
        :return:
        """
        async with self.__pool.acquire() as conn:
            async with conn.cursor(aiomysql.DictCursor) as cur:
                await cur.execute(sql, args)
                data = await cur.fetchall()
                return data or []

    async def get_first(self, sql: str, *args: Union[str, int]) -> Union[Dict[str, Any], None]:
        """
        Run the given SQL query and return the first matching record
        :param sql: query SQL
        :param args: dynamic parameters passed into the SQL
        :return:
        """
        async with self.__pool.acquire() as conn:
            async with conn.cursor(aiomysql.DictCursor) as cur:
                await cur.execute(sql, args)
                data = await cur.fetchone()
                return data

    async def item_to_table(self, table_name: str, item: Dict[str, Any]) -> int:
        """
        Insert a record into a table
        :param table_name: table name
        :param item: dict holding one record
        :return:
        """
        fields = list(item.keys())
        values = list(item.values())
        fields = [f'`{field}`' for field in fields]
        fieldstr = ','.join(fields)
        valstr = ','.join(['%s'] * len(item))
        sql = "INSERT INTO %s (%s) VALUES(%s)" % (table_name, fieldstr, valstr)
        async with self.__pool.acquire() as conn:
            async with conn.cursor(aiomysql.DictCursor) as cur:
                await cur.execute(sql, values)
                lastrowid = cur.lastrowid
                return lastrowid

    async def update_table(self, table_name: str, updates: Dict[str, Any], field_where: str,
                           value_where: Union[str, int, float]) -> int:
        """
        Update records in the given table
        :param table_name: table name
        :param updates: key-value mapping of fields to update
        :param field_where: field name in the WHERE clause of the UPDATE statement
        :param value_where: field value in the WHERE clause of the UPDATE statement
        :return:
        """
        upsets = []
        values = []
        for k, v in updates.items():
            s = '`%s`=%%s' % k
            upsets.append(s)
            values.append(v)
        upsets = ','.join(upsets)
        sql = 'UPDATE %s SET %s WHERE %s="%s"' % (
            table_name,
            upsets,
            field_where, value_where,
        )
        async with self.__pool.acquire() as conn:
            async with conn.cursor() as cur:
                rows = await cur.execute(sql, values)
                return rows

    async def execute(self, sql: str, *args: Union[str, int]) -> int:
        """
        Execute statements that update or write data
        :param sql:
        :param args:
        :return:
        """
        async with self.__pool.acquire() as conn:
            async with conn.cursor() as cur:
                rows = await cur.execute(sql, args)
                return rows
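One design caveat in the class above: `update_table` interpolates `value_where` directly into the SQL string, while the SET values are bound parameters. A fully parameterized sketch of the same statement builder (a hypothetical helper, not part of the original file):

```python
def build_update_sql(table_name, updates, field_where):
    # Build "UPDATE t SET `a`=%s,`b`=%s WHERE `w`=%s" with every value
    # passed as a bind parameter, including the WHERE value.
    set_clause = ",".join(f"`{k}`=%s" for k in updates)
    sql = f"UPDATE {table_name} SET {set_clause} WHERE `{field_where}`=%s"
    params = list(updates.values())  # caller appends the WHERE value last
    return sql, params

sql, params = build_update_sql("note", {"title": "t", "liked_count": 3}, "note_id")
assert sql == "UPDATE note SET `title`=%s,`liked_count`=%s WHERE `note_id`=%s"
assert params == ["t", 3]
```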


@@ -1,111 +0,0 @@
# Disclaimer: this code is for learning and research purposes only. Users must:
# 1. Not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Not perform large-scale crawling or disrupt the platform's operation.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the target platform.
# 5. Not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to the principles above and all terms in LICENSE.
# -*- coding: utf-8 -*-
# @Author  : relakkes@gmail.com
# @Time    : 2024/4/6 14:21
# @Desc    : async CRUD wrapper around aiosqlite
from typing import Any, Dict, List, Union

import aiosqlite


class AsyncSqliteDB:
    def __init__(self, db_path: str) -> None:
        self.__db_path = db_path

    async def query(self, sql: str, *args: Union[str, int]) -> List[Dict[str, Any]]:
        """
        Run the given SQL query and return a list of records
        :param sql: query SQL
        :param args: dynamic parameters passed into the SQL
        :return:
        """
        async with aiosqlite.connect(self.__db_path) as conn:
            conn.row_factory = aiosqlite.Row
            async with conn.execute(sql, args) as cursor:
                rows = await cursor.fetchall()
                return [dict(row) for row in rows] if rows else []

    async def get_first(self, sql: str, *args: Union[str, int]) -> Union[Dict[str, Any], None]:
        """
        Run the given SQL query and return the first matching record
        :param sql: query SQL
        :param args: dynamic parameters passed into the SQL
        :return:
        """
        async with aiosqlite.connect(self.__db_path) as conn:
            conn.row_factory = aiosqlite.Row
            async with conn.execute(sql, args) as cursor:
                row = await cursor.fetchone()
                return dict(row) if row else None

    async def item_to_table(self, table_name: str, item: Dict[str, Any]) -> int:
        """
        Insert a record into a table
        :param table_name: table name
        :param item: dict holding one record
        :return:
        """
        fields = list(item.keys())
        values = list(item.values())
        fieldstr = ','.join(fields)
        valstr = ','.join(['?'] * len(item))
        sql = f"INSERT INTO {table_name} ({fieldstr}) VALUES({valstr})"
        async with aiosqlite.connect(self.__db_path) as conn:
            async with conn.execute(sql, values) as cursor:
                await conn.commit()
                return cursor.lastrowid

    async def update_table(self, table_name: str, updates: Dict[str, Any], field_where: str,
                           value_where: Union[str, int, float]) -> int:
        """
        Update records in the given table
        :param table_name: table name
        :param updates: key-value mapping of fields to update
        :param field_where: field name in the WHERE clause of the UPDATE statement
        :param value_where: field value in the WHERE clause of the UPDATE statement
        :return:
        """
        upsets = []
        values = []
        for k, v in updates.items():
            upsets.append(f'{k}=?')
            values.append(v)
        upsets_str = ','.join(upsets)
        values.append(value_where)
        sql = f'UPDATE {table_name} SET {upsets_str} WHERE {field_where}=?'
        async with aiosqlite.connect(self.__db_path) as conn:
            async with conn.execute(sql, values) as cursor:
                await conn.commit()
                return cursor.rowcount

    async def execute(self, sql: str, *args: Union[str, int]) -> int:
        """
        Execute statements that update or write data
        :param sql:
        :param args:
        :return:
        """
        async with aiosqlite.connect(self.__db_path) as conn:
            async with conn.execute(sql, args) as cursor:
                await conn.commit()
                return cursor.rowcount

    async def executescript(self, sql_script: str) -> None:
        """
        Execute a SQL script, used to initialize the database table structure
        :param sql_script: SQL script content
        :return:
        """
        async with aiosqlite.connect(self.__db_path) as conn:
            await conn.executescript(sql_script)
            await conn.commit()
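The INSERT construction used by `item_to_table` can be exercised standalone against the stdlib's synchronous `sqlite3` (the original file used aiosqlite, which mirrors this API; the table and data below are illustrative):

```python
import sqlite3

def item_to_table(conn, table_name, item):
    # Same statement-building logic as the async method above:
    # field names from the dict keys, one "?" placeholder per value.
    fieldstr = ",".join(item.keys())
    valstr = ",".join(["?"] * len(item))
    sql = f"INSERT INTO {table_name} ({fieldstr}) VALUES({valstr})"
    cur = conn.execute(sql, list(item.values()))
    conn.commit()
    return cur.lastrowid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE note (note_id TEXT, title TEXT)")
rowid = item_to_table(conn, "note", {"note_id": "n1", "title": "hello"})
assert rowid == 1
assert conn.execute("SELECT title FROM note").fetchone() == ("hello",)
```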


@@ -1,55 +1,257 @@
# Disclaimer: this code is for learning and research purposes only. Users must:
# 1. Not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Not perform large-scale crawling or disrupt the platform's operation.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the target platform.
# Disclaimer: this code is for learning and research purposes only. Users must:
# 1. Not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Not perform large-scale crawling or disrupt the platform's operation.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the target platform.
# 5. Not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to the principles above and all terms in LICENSE.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to the principles above and all terms in LICENSE.
import argparse
from __future__ import annotations
import sys
from enum import Enum
from types import SimpleNamespace
from typing import Iterable, Optional, Sequence, Type, TypeVar
import typer
from typing_extensions import Annotated
import config
from tools.utils import str2bool
async def parse_cmd():
    # read command args
    parser = argparse.ArgumentParser(description='Media crawler program. / 媒体爬虫程序')
    parser.add_argument('--platform', type=str,
                        help='Media platform select / 选择媒体平台 (xhs=小红书 | dy=抖音 | ks=快手 | bili=哔哩哔哩 | wb=微博 | tieba=百度贴吧 | zhihu=知乎)',
                        choices=["xhs", "dy", "ks", "bili", "wb", "tieba", "zhihu"], default=config.PLATFORM)
    parser.add_argument('--lt', type=str,
                        help='Login type / 登录方式 (qrcode=二维码 | phone=手机号 | cookie=Cookie)',
                        choices=["qrcode", "phone", "cookie"], default=config.LOGIN_TYPE)
    parser.add_argument('--type', type=str,
                        help='Crawler type / 爬取类型 (search=搜索 | detail=详情 | creator=创作者)',
                        choices=["search", "detail", "creator"], default=config.CRAWLER_TYPE)
    parser.add_argument('--start', type=int,
                        help='Number of start page / 起始页码', default=config.START_PAGE)
    parser.add_argument('--keywords', type=str,
                        help='Please input keywords / 请输入关键词', default=config.KEYWORDS)
    parser.add_argument('--get_comment', type=str2bool,
                        help='''Whether to crawl level one comment / 是否爬取一级评论, supported values case insensitive / 支持的值(不区分大小写) ('yes', 'true', 't', 'y', '1', 'no', 'false', 'f', 'n', '0')''', default=config.ENABLE_GET_COMMENTS)
    parser.add_argument('--get_sub_comment', type=str2bool,
                        help='''Whether to crawl level two comment / 是否爬取二级评论, supported values case insensitive / 支持的值(不区分大小写) ('yes', 'true', 't', 'y', '1', 'no', 'false', 'f', 'n', '0')''', default=config.ENABLE_GET_SUB_COMMENTS)
    parser.add_argument('--save_data_option', type=str,
                        help='Where to save the data / 数据保存方式 (csv=CSV文件 | db=MySQL数据库 | json=JSON文件 | sqlite=SQLite数据库)',
                        choices=['csv', 'db', 'json', 'sqlite'], default=config.SAVE_DATA_OPTION)
    parser.add_argument('--cookies', type=str,
                        help='Cookies used for cookie login type / Cookie登录方式使用的Cookie值', default=config.COOKIES)
EnumT = TypeVar("EnumT", bound=Enum)
    args = parser.parse_args()

    # override config
    config.PLATFORM = args.platform
    config.LOGIN_TYPE = args.lt
    config.CRAWLER_TYPE = args.type
    config.START_PAGE = args.start
    config.KEYWORDS = args.keywords
    config.ENABLE_GET_COMMENTS = args.get_comment
    config.ENABLE_GET_SUB_COMMENTS = args.get_sub_comment
    config.SAVE_DATA_OPTION = args.save_data_option
    config.COOKIES = args.cookies
class PlatformEnum(str, Enum):
    """Supported media platform enum"""

    XHS = "xhs"
    DOUYIN = "dy"
    KUAISHOU = "ks"
    BILIBILI = "bili"
    WEIBO = "wb"
    TIEBA = "tieba"
    ZHIHU = "zhihu"


class LoginTypeEnum(str, Enum):
    """Login type enum"""

    QRCODE = "qrcode"
    PHONE = "phone"
    COOKIE = "cookie"


class CrawlerTypeEnum(str, Enum):
    """Crawler type enum"""

    SEARCH = "search"
    DETAIL = "detail"
    CREATOR = "creator"


class SaveDataOptionEnum(str, Enum):
    """Data save option enum"""

    CSV = "csv"
    DB = "db"
    JSON = "json"
    SQLITE = "sqlite"


class InitDbOptionEnum(str, Enum):
    """Database initialization options"""

    SQLITE = "sqlite"
    MYSQL = "mysql"


def _to_bool(value: bool | str) -> bool:
    if isinstance(value, bool):
        return value
    return str2bool(value)


def _coerce_enum(
    enum_cls: Type[EnumT],
    value: EnumT | str,
    default: EnumT,
) -> EnumT:
    """Safely convert a raw config value to an enum member."""
    if isinstance(value, enum_cls):
        return value
    try:
        return enum_cls(value)
    except ValueError:
        typer.secho(
            f"⚠️ 配置值 '{value}' 不在 {enum_cls.__name__} 支持的范围内,已回退到默认值 '{default.value}'.",
            fg=typer.colors.YELLOW,
        )
        return default


def _normalize_argv(argv: Optional[Sequence[str]]) -> Iterable[str]:
    if argv is None:
        return list(sys.argv[1:])
    return list(argv)


def _inject_init_db_default(args: Sequence[str]) -> list[str]:
    """Ensure bare --init_db defaults to sqlite for backward compatibility."""
    normalized: list[str] = []
    i = 0
    while i < len(args):
        arg = args[i]
        normalized.append(arg)
        if arg == "--init_db":
            next_arg = args[i + 1] if i + 1 < len(args) else None
            if not next_arg or next_arg.startswith("-"):
                normalized.append(InitDbOptionEnum.SQLITE.value)
        i += 1
    return normalized
async def parse_cmd(argv: Optional[Sequence[str]] = None):
    """Parse command-line arguments with Typer."""
    app = typer.Typer(add_completion=False)

    @app.callback(invoke_without_command=True)
    def main(
        platform: Annotated[
            PlatformEnum,
            typer.Option(
                "--platform",
                help="媒体平台选择 (xhs=小红书 | dy=抖音 | ks=快手 | bili=哔哩哔哩 | wb=微博 | tieba=百度贴吧 | zhihu=知乎)",
                rich_help_panel="基础配置",
            ),
        ] = _coerce_enum(PlatformEnum, config.PLATFORM, PlatformEnum.XHS),
        lt: Annotated[
            LoginTypeEnum,
            typer.Option(
                "--lt",
                help="登录方式 (qrcode=二维码 | phone=手机号 | cookie=Cookie)",
                rich_help_panel="账号配置",
            ),
        ] = _coerce_enum(LoginTypeEnum, config.LOGIN_TYPE, LoginTypeEnum.QRCODE),
        crawler_type: Annotated[
            CrawlerTypeEnum,
            typer.Option(
                "--type",
                help="爬取类型 (search=搜索 | detail=详情 | creator=创作者)",
                rich_help_panel="基础配置",
            ),
        ] = _coerce_enum(CrawlerTypeEnum, config.CRAWLER_TYPE, CrawlerTypeEnum.SEARCH),
        start: Annotated[
            int,
            typer.Option(
                "--start",
                help="起始页码",
                rich_help_panel="基础配置",
            ),
        ] = config.START_PAGE,
        keywords: Annotated[
            str,
            typer.Option(
                "--keywords",
                help="请输入关键词,多个关键词用逗号分隔",
                rich_help_panel="基础配置",
            ),
        ] = config.KEYWORDS,
        get_comment: Annotated[
            str,
            typer.Option(
                "--get_comment",
                help="是否爬取一级评论,支持 yes/true/t/y/1 或 no/false/f/n/0",
                rich_help_panel="评论配置",
                show_default=True,
            ),
        ] = str(config.ENABLE_GET_COMMENTS),
        get_sub_comment: Annotated[
            str,
            typer.Option(
                "--get_sub_comment",
                help="是否爬取二级评论,支持 yes/true/t/y/1 或 no/false/f/n/0",
                rich_help_panel="评论配置",
                show_default=True,
            ),
        ] = str(config.ENABLE_GET_SUB_COMMENTS),
        save_data_option: Annotated[
            SaveDataOptionEnum,
            typer.Option(
                "--save_data_option",
                help="数据保存方式 (csv=CSV文件 | db=MySQL数据库 | json=JSON文件 | sqlite=SQLite数据库)",
                rich_help_panel="存储配置",
            ),
        ] = _coerce_enum(
            SaveDataOptionEnum, config.SAVE_DATA_OPTION, SaveDataOptionEnum.JSON
        ),
        init_db: Annotated[
            Optional[InitDbOptionEnum],
            typer.Option(
                "--init_db",
                help="初始化数据库表结构 (sqlite | mysql)",
                rich_help_panel="存储配置",
            ),
        ] = None,
        cookies: Annotated[
            str,
            typer.Option(
                "--cookies",
                help="Cookie 登录方式使用的 Cookie 值",
                rich_help_panel="账号配置",
            ),
        ] = config.COOKIES,
    ) -> SimpleNamespace:
        """MediaCrawler command-line entry point"""
        enable_comment = _to_bool(get_comment)
        enable_sub_comment = _to_bool(get_sub_comment)
        init_db_value = init_db.value if init_db else None

        # override global config
        config.PLATFORM = platform.value
        config.LOGIN_TYPE = lt.value
        config.CRAWLER_TYPE = crawler_type.value
        config.START_PAGE = start
        config.KEYWORDS = keywords
        config.ENABLE_GET_COMMENTS = enable_comment
        config.ENABLE_GET_SUB_COMMENTS = enable_sub_comment
        config.SAVE_DATA_OPTION = save_data_option.value
        config.COOKIES = cookies

        return SimpleNamespace(
            platform=config.PLATFORM,
            lt=config.LOGIN_TYPE,
            type=config.CRAWLER_TYPE,
            start=config.START_PAGE,
            keywords=config.KEYWORDS,
            get_comment=config.ENABLE_GET_COMMENTS,
            get_sub_comment=config.ENABLE_GET_SUB_COMMENTS,
            save_data_option=config.SAVE_DATA_OPTION,
            init_db=init_db_value,
            cookies=config.COOKIES,
        )

    command = typer.main.get_command(app)
    cli_args = _normalize_argv(argv)
    cli_args = _inject_init_db_default(cli_args)
    try:
        result = command.main(args=cli_args, standalone_mode=False)
        if isinstance(result, int):  # help/options handled by Typer; propagate exit code
            raise SystemExit(result)
        return result
    except typer.Exit as exc:  # pragma: no cover - CLI exit paths
        raise SystemExit(exc.exit_code) from exc


@@ -38,7 +38,7 @@ SAVE_LOGIN_STATE = True
# 是否启用CDP模式 - 使用用户现有的Chrome/Edge浏览器进行爬取提供更好的反检测能力
# 启用后将自动检测并启动用户的Chrome/Edge浏览器通过CDP协议进行控制
# 这种方式使用真实的浏览器环境包括用户的扩展、Cookie和设置大大降低被检测的风险
-ENABLE_CDP_MODE = False
+ENABLE_CDP_MODE = True
# CDP调试端口用于与浏览器通信
# 如果端口被占用,系统会自动尝试下一个可用端口
@@ -71,7 +71,7 @@ USER_DATA_DIR = "%s_user_data_dir" # %s will be replaced by platform name
START_PAGE = 1
# 爬取视频/帖子的数量控制
-CRAWLER_MAX_NOTES_COUNT = 200
+CRAWLER_MAX_NOTES_COUNT = 15
# 并发爬虫数量控制
MAX_CONCURRENCY_NUM = 1


@@ -13,16 +13,23 @@
# 每天爬取视频/帖子的数量控制
MAX_NOTES_PER_DAY = 1
-# 指定B站视频ID列表
+# 指定B站视频URL列表 (支持完整URL或BV号)
+# 示例:
+# - 完整URL: "https://www.bilibili.com/video/BV1dwuKzmE26/?spm_id_from=333.1387.homepage.video_card.click"
+# - BV号: "BV1d54y1g7db"
BILI_SPECIFIED_ID_LIST = [
"BV1d54y1g7db",
"https://www.bilibili.com/video/BV1dwuKzmE26/?spm_id_from=333.1387.homepage.video_card.click",
"BV1Sz4y1U77N",
"BV14Q4y1n7jz",
# ........................
]
-# 指定B站用户ID列表
+# 指定B站创作者URL列表 (支持完整URL或UID)
+# 示例:
+# - 完整URL: "https://space.bilibili.com/434377496?spm_id_from=333.1007.0.0"
+# - UID: "20813884"
BILI_CREATOR_ID_LIST = [
"https://space.bilibili.com/434377496?spm_id_from=333.1007.0.0",
"20813884",
# ........................
]
@@ -34,6 +41,11 @@ END_DAY = "2024-01-01"
# 搜索模式
BILI_SEARCH_MODE = "normal"
+# 视频清晰度qn配置常见取值
+# 16=360p, 32=480p, 64=720p, 80=1080p, 112=1080p高码率, 116=1080p60, 120=4K
+# 注意:更高清晰度需要账号/视频本身支持
+BILI_QN = 80
# 是否爬取用户信息
CREATOR_MODE = True
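The qn values documented above can be mapped to readable labels; a small lookup mirroring the config comment (illustrative only, not part of the repository):

```python
# Quality labels for the common Bilibili qn values listed in the config comment.
BILI_QN_LABELS = {
    16: "360p",
    32: "480p",
    64: "720p",
    80: "1080p",
    112: "1080p high bitrate",
    116: "1080p60",
    120: "4K",
}

def describe_qn(qn: int) -> str:
    """Return a human-readable label for a qn value, or a fallback string."""
    return BILI_QN_LABELS.get(qn, f"unknown qn={qn}")
```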


@@ -18,6 +18,14 @@ MYSQL_DB_HOST = os.getenv("MYSQL_DB_HOST", "localhost")
MYSQL_DB_PORT = os.getenv("MYSQL_DB_PORT", 3306)
MYSQL_DB_NAME = os.getenv("MYSQL_DB_NAME", "media_crawler")
+mysql_db_config = {
+    "user": MYSQL_DB_USER,
+    "password": MYSQL_DB_PWD,
+    "host": MYSQL_DB_HOST,
+    "port": MYSQL_DB_PORT,
+    "db_name": MYSQL_DB_NAME,
+}
# redis config
REDIS_DB_HOST = "127.0.0.1" # your redis host
@@ -30,4 +38,8 @@ CACHE_TYPE_REDIS = "redis"
CACHE_TYPE_MEMORY = "memory"
# sqlite config
-SQLITE_DB_PATH = os.path.join(os.path.dirname(os.path.dirname(__file__)), "schema", "sqlite_tables.db")
+SQLITE_DB_PATH = os.path.join(os.path.dirname(os.path.dirname(__file__)), "database", "sqlite_tables.db")
+sqlite_db_config = {
+    "db_path": SQLITE_DB_PATH
+}
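`database/db_session.py`, added later in this diff, composes SQLAlchemy async URLs from these two dicts. The construction amounts to the following sketch (helper names are hypothetical):

```python
def build_mysql_url(cfg: dict) -> str:
    """Compose an asyncmy DSN from the mysql_db_config dict shown above."""
    return (
        f"mysql+asyncmy://{cfg['user']}:{cfg['password']}"
        f"@{cfg['host']}:{cfg['port']}/{cfg['db_name']}"
    )

def build_sqlite_url(cfg: dict) -> str:
    """Compose an aiosqlite DSN from the sqlite_db_config dict shown above."""
    return f"sqlite+aiosqlite:///{cfg['db_path']}"
```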


@@ -11,15 +11,27 @@
# 抖音平台配置
PUBLISH_TIME_TYPE = 0
-# 指定DY视频ID列表
+# 指定DY视频URL列表 (支持多种格式)
+# 支持格式:
+# 1. 完整视频URL: "https://www.douyin.com/video/7525538910311632128"
+# 2. 带modal_id的URL: "https://www.douyin.com/user/xxx?modal_id=7525538910311632128"
+# 3. 搜索页带modal_id: "https://www.douyin.com/root/search/python?modal_id=7525538910311632128"
+# 4. 短链接: "https://v.douyin.com/drIPtQ_WPWY/"
+# 5. 纯视频ID: "7280854932641664319"
DY_SPECIFIED_ID_LIST = [
"7280854932641664319",
"7202432992642387233",
"https://www.douyin.com/video/7525538910311632128",
"https://v.douyin.com/drIPtQ_WPWY/",
"https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main&modal_id=7525538910311632128",
"7202432992642387233",
# ........................
]
-# 指定DY用户ID列表
+# 指定DY创作者URL列表 (支持完整URL或sec_user_id)
+# 支持格式:
+# 1. 完整创作者主页URL: "https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main"
+# 2. sec_user_id: "MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE"
DY_CREATOR_ID_LIST = [
"MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE",
"https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main",
"MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE"
# ........................
]
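Parsing these URL formats down to a bare video id is handled by the platform's helper module, which is outside this diff. A rough, hypothetical sketch of the rules listed above (short links excluded, since resolving them requires following an HTTP redirect):

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

def extract_douyin_video_id(url_or_id: str) -> Optional[str]:
    """Best-effort id extraction for the formats listed above (illustrative only).

    Short links (v.douyin.com) return None here: a pure string parser cannot
    resolve the redirect they point at.
    """
    s = url_or_id.strip()
    if s.isdigit():  # format 5: bare video id
        return s
    parsed = urlparse(s)
    if parsed.netloc == "v.douyin.com":  # format 4: short link, needs a redirect hop
        return None
    modal = parse_qs(parsed.query).get("modal_id")  # formats 2 and 3
    if modal:
        return modal[0]
    match = re.search(r"/video/(\d+)", parsed.path)  # format 1
    return match.group(1) if match else None
```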


@@ -10,11 +10,22 @@
# 快手平台配置
-# 指定快手视频ID列表
-KS_SPECIFIED_ID_LIST = ["3xf8enb8dbj6uig", "3x6zz972bchmvqe"]
+# 指定快手视频URL列表 (支持完整URL或纯ID)
+# 支持格式:
+# 1. 完整视频URL: "https://www.kuaishou.com/short-video/3x3zxz4mjrsc8ke?authorId=3x84qugg4ch9zhs&streamSource=search"
+# 2. 纯视频ID: "3xf8enb8dbj6uig"
+KS_SPECIFIED_ID_LIST = [
+    "https://www.kuaishou.com/short-video/3x3zxz4mjrsc8ke?authorId=3x84qugg4ch9zhs&streamSource=search&area=searchxxnull&searchKey=python",
+    "3xf8enb8dbj6uig",
+    # ........................
+]
-# 指定快手用户ID列表
+# 指定快手创作者URL列表 (支持完整URL或纯ID)
+# 支持格式:
+# 1. 创作者主页URL: "https://www.kuaishou.com/profile/3x84qugg4ch9zhs"
+# 2. 纯user_id: "3x4sm73aye7jq7i"
KS_CREATOR_ID_LIST = [
"https://www.kuaishou.com/profile/3x84qugg4ch9zhs",
"3x4sm73aye7jq7i",
# ........................
]


@@ -21,8 +21,12 @@ XHS_SPECIFIED_NOTE_URL_LIST = [
# ........................
]
-# 指定用户ID列表
+# 指定创作者URL列表 (支持完整URL或纯ID)
+# 支持格式:
+# 1. 完整创作者主页URL (带xsec_token和xsec_source参数): "https://www.xiaohongshu.com/user/profile/5eb8e1d400000000010075ae?xsec_token=AB1nWBKCo1vE2HEkfoJUOi5B6BE5n7wVrbdpHoWIj5xHw=&xsec_source=pc_feed"
+# 2. 纯user_id: "63e36c9a000000002703502b"
XHS_CREATOR_ID_LIST = [
"63e36c9a000000002703502b",
"https://www.xiaohongshu.com/user/profile/5eb8e1d400000000010075ae?xsec_token=AB1nWBKCo1vE2HEkfoJUOi5B6BE5n7wVrbdpHoWIj5xHw=&xsec_source=pc_feed",
"63e36c9a000000002703502b",
# ........................
]

database/__init__.py (new file, empty)

database/db.py (new file, 35 lines)

@@ -0,0 +1,35 @@
# persist-1<persist1@126.com>
# 原因:将 db.py 改造为模块,移除直接执行入口,修复相对导入问题。
# 副作用:无
# 回滚策略:还原此文件。
import asyncio
import sys
from pathlib import Path
# Add project root to sys.path
project_root = Path(__file__).resolve().parents[1]
if str(project_root) not in sys.path:
sys.path.append(str(project_root))
from tools import utils
from database.db_session import create_tables
async def init_table_schema(db_type: str):
"""
Initializes the database table schema.
This will create tables based on the ORM models.
Args:
db_type: The type of database, 'sqlite' or 'mysql'.
"""
utils.logger.info(f"[init_table_schema] begin init {db_type} table schema ...")
await create_tables(db_type)
utils.logger.info(f"[init_table_schema] {db_type} table schema init successful")
async def init_db(db_type: str = None):
await init_table_schema(db_type)
async def close():
"""
Placeholder for closing database connections if needed in the future.
"""
pass

database/db_session.py (new file, 70 lines)

@@ -0,0 +1,70 @@
from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker
from contextlib import asynccontextmanager
from .models import Base
import config
from config.db_config import mysql_db_config, sqlite_db_config
# Keep a cache of engines
_engines = {}
async def create_database_if_not_exists(db_type: str):
if db_type == "mysql" or db_type == "db":
# Connect to the server without a database
server_url = f"mysql+asyncmy://{mysql_db_config['user']}:{mysql_db_config['password']}@{mysql_db_config['host']}:{mysql_db_config['port']}"
engine = create_async_engine(server_url, echo=False)
async with engine.connect() as conn:
await conn.execute(text(f"CREATE DATABASE IF NOT EXISTS {mysql_db_config['db_name']}"))
await engine.dispose()
def get_async_engine(db_type: str = None):
if db_type is None:
db_type = config.SAVE_DATA_OPTION
if db_type in _engines:
return _engines[db_type]
if db_type in ["json", "csv"]:
return None
if db_type == "sqlite":
db_url = f"sqlite+aiosqlite:///{sqlite_db_config['db_path']}"
elif db_type == "mysql" or db_type == "db":
db_url = f"mysql+asyncmy://{mysql_db_config['user']}:{mysql_db_config['password']}@{mysql_db_config['host']}:{mysql_db_config['port']}/{mysql_db_config['db_name']}"
else:
raise ValueError(f"Unsupported database type: {db_type}")
engine = create_async_engine(db_url, echo=False)
_engines[db_type] = engine
return engine
async def create_tables(db_type: str = None):
if db_type is None:
db_type = config.SAVE_DATA_OPTION
await create_database_if_not_exists(db_type)
engine = get_async_engine(db_type)
if engine:
async with engine.begin() as conn:
await conn.run_sync(Base.metadata.create_all)
@asynccontextmanager
async def get_session() -> AsyncSession:
engine = get_async_engine(config.SAVE_DATA_OPTION)
if not engine:
yield None
return
AsyncSessionFactory = sessionmaker(engine, class_=AsyncSession, expire_on_commit=False)
session = AsyncSessionFactory()
try:
yield session
await session.commit()
except Exception as e:
await session.rollback()
raise e
finally:
await session.close()
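The commit-on-success / rollback-on-error shape of `get_session` can be demonstrated in isolation with a stand-in session object (hypothetical, for illustration; the real code wraps a SQLAlchemy `AsyncSession`):

```python
import asyncio
from contextlib import asynccontextmanager

class FakeSession:
    """Stand-in for AsyncSession, recording lifecycle calls."""
    def __init__(self):
        self.events = []
    async def commit(self):
        self.events.append("commit")
    async def rollback(self):
        self.events.append("rollback")
    async def close(self):
        self.events.append("close")

@asynccontextmanager
async def get_session(session):
    # Mirrors database/db_session.py: commit on success, rollback on error,
    # always close the session.
    try:
        yield session
        await session.commit()
    except Exception:
        await session.rollback()
        raise
    finally:
        await session.close()

async def demo():
    ok = FakeSession()
    async with get_session(ok):
        pass  # success path: commit then close
    bad = FakeSession()
    try:
        async with get_session(bad):
            raise RuntimeError("boom")  # error path: rollback then close
    except RuntimeError:
        pass
    return ok.events, bad.events
```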

database/models.py (new file, 434 lines)

@@ -0,0 +1,434 @@
from sqlalchemy import create_engine, Column, Integer, Text, String, BigInteger
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
Base = declarative_base()
class BilibiliVideo(Base):
__tablename__ = 'bilibili_video'
id = Column(Integer, primary_key=True)
video_id = Column(BigInteger, nullable=False, index=True, unique=True)
video_url = Column(Text, nullable=False)
user_id = Column(BigInteger, index=True)
nickname = Column(Text)
avatar = Column(Text)
liked_count = Column(Integer)
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
video_type = Column(Text)
title = Column(Text)
desc = Column(Text)
create_time = Column(BigInteger, index=True)
disliked_count = Column(Text)
video_play_count = Column(Text)
video_favorite_count = Column(Text)
video_share_count = Column(Text)
video_coin_count = Column(Text)
video_danmaku = Column(Text)
video_comment = Column(Text)
video_cover_url = Column(Text)
source_keyword = Column(Text, default='')
class BilibiliVideoComment(Base):
__tablename__ = 'bilibili_video_comment'
id = Column(Integer, primary_key=True)
user_id = Column(String(255))
nickname = Column(Text)
sex = Column(Text)
sign = Column(Text)
avatar = Column(Text)
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
comment_id = Column(BigInteger, index=True)
video_id = Column(BigInteger, index=True)
content = Column(Text)
create_time = Column(BigInteger)
sub_comment_count = Column(Text)
parent_comment_id = Column(String(255))
like_count = Column(Text, default='0')
class BilibiliUpInfo(Base):
__tablename__ = 'bilibili_up_info'
id = Column(Integer, primary_key=True)
user_id = Column(BigInteger, index=True)
nickname = Column(Text)
sex = Column(Text)
sign = Column(Text)
avatar = Column(Text)
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
total_fans = Column(Integer)
total_liked = Column(Integer)
user_rank = Column(Integer)
is_official = Column(Integer)
class BilibiliContactInfo(Base):
__tablename__ = 'bilibili_contact_info'
id = Column(Integer, primary_key=True)
up_id = Column(BigInteger, index=True)
fan_id = Column(BigInteger, index=True)
up_name = Column(Text)
fan_name = Column(Text)
up_sign = Column(Text)
fan_sign = Column(Text)
up_avatar = Column(Text)
fan_avatar = Column(Text)
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
class BilibiliUpDynamic(Base):
__tablename__ = 'bilibili_up_dynamic'
id = Column(Integer, primary_key=True)
dynamic_id = Column(BigInteger, index=True)
user_id = Column(String(255))
user_name = Column(Text)
text = Column(Text)
type = Column(Text)
pub_ts = Column(BigInteger)
total_comments = Column(Integer)
total_forwards = Column(Integer)
total_liked = Column(Integer)
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
class DouyinAweme(Base):
__tablename__ = 'douyin_aweme'
id = Column(Integer, primary_key=True)
user_id = Column(String(255))
sec_uid = Column(String(255))
short_user_id = Column(String(255))
user_unique_id = Column(String(255))
nickname = Column(Text)
avatar = Column(Text)
user_signature = Column(Text)
ip_location = Column(Text)
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
aweme_id = Column(BigInteger, index=True)
aweme_type = Column(Text)
title = Column(Text)
desc = Column(Text)
create_time = Column(BigInteger, index=True)
liked_count = Column(Text)
comment_count = Column(Text)
share_count = Column(Text)
collected_count = Column(Text)
aweme_url = Column(Text)
cover_url = Column(Text)
video_download_url = Column(Text)
music_download_url = Column(Text)
note_download_url = Column(Text)
source_keyword = Column(Text, default='')
class DouyinAwemeComment(Base):
__tablename__ = 'douyin_aweme_comment'
id = Column(Integer, primary_key=True)
user_id = Column(String(255))
sec_uid = Column(String(255))
short_user_id = Column(String(255))
user_unique_id = Column(String(255))
nickname = Column(Text)
avatar = Column(Text)
user_signature = Column(Text)
ip_location = Column(Text)
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
comment_id = Column(BigInteger, index=True)
aweme_id = Column(BigInteger, index=True)
content = Column(Text)
create_time = Column(BigInteger)
sub_comment_count = Column(Text)
parent_comment_id = Column(String(255))
like_count = Column(Text, default='0')
pictures = Column(Text, default='')
class DyCreator(Base):
__tablename__ = 'dy_creator'
id = Column(Integer, primary_key=True)
user_id = Column(String(255))
nickname = Column(Text)
avatar = Column(Text)
ip_location = Column(Text)
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
desc = Column(Text)
gender = Column(Text)
follows = Column(Text)
fans = Column(Text)
interaction = Column(Text)
videos_count = Column(String(255))
class KuaishouVideo(Base):
__tablename__ = 'kuaishou_video'
id = Column(Integer, primary_key=True)
user_id = Column(String(64))
nickname = Column(Text)
avatar = Column(Text)
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
video_id = Column(String(255), index=True)
video_type = Column(Text)
title = Column(Text)
desc = Column(Text)
create_time = Column(BigInteger, index=True)
liked_count = Column(Text)
viewd_count = Column(Text)
video_url = Column(Text)
video_cover_url = Column(Text)
video_play_url = Column(Text)
source_keyword = Column(Text, default='')
class KuaishouVideoComment(Base):
__tablename__ = 'kuaishou_video_comment'
id = Column(Integer, primary_key=True)
user_id = Column(Text)
nickname = Column(Text)
avatar = Column(Text)
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
comment_id = Column(BigInteger, index=True)
video_id = Column(String(255), index=True)
content = Column(Text)
create_time = Column(BigInteger)
sub_comment_count = Column(Text)
class WeiboNote(Base):
__tablename__ = 'weibo_note'
id = Column(Integer, primary_key=True)
user_id = Column(String(255))
nickname = Column(Text)
avatar = Column(Text)
gender = Column(Text)
profile_url = Column(Text)
ip_location = Column(Text, default='')
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
note_id = Column(BigInteger, index=True)
content = Column(Text)
create_time = Column(BigInteger, index=True)
create_date_time = Column(String(255), index=True)
liked_count = Column(Text)
comments_count = Column(Text)
shared_count = Column(Text)
note_url = Column(Text)
source_keyword = Column(Text, default='')
class WeiboNoteComment(Base):
__tablename__ = 'weibo_note_comment'
id = Column(Integer, primary_key=True)
user_id = Column(String(255))
nickname = Column(Text)
avatar = Column(Text)
gender = Column(Text)
profile_url = Column(Text)
ip_location = Column(Text, default='')
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
comment_id = Column(BigInteger, index=True)
note_id = Column(BigInteger, index=True)
content = Column(Text)
create_time = Column(BigInteger)
create_date_time = Column(String(255), index=True)
comment_like_count = Column(Text)
sub_comment_count = Column(Text)
parent_comment_id = Column(String(255))
class WeiboCreator(Base):
__tablename__ = 'weibo_creator'
id = Column(Integer, primary_key=True)
user_id = Column(String(255))
nickname = Column(Text)
avatar = Column(Text)
ip_location = Column(Text)
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
desc = Column(Text)
gender = Column(Text)
follows = Column(Text)
fans = Column(Text)
tag_list = Column(Text)
class XhsCreator(Base):
__tablename__ = 'xhs_creator'
id = Column(Integer, primary_key=True)
user_id = Column(String(255))
nickname = Column(Text)
avatar = Column(Text)
ip_location = Column(Text)
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
desc = Column(Text)
gender = Column(Text)
follows = Column(Text)
fans = Column(Text)
interaction = Column(Text)
tag_list = Column(Text)
class XhsNote(Base):
__tablename__ = 'xhs_note'
id = Column(Integer, primary_key=True)
user_id = Column(String(255))
nickname = Column(Text)
avatar = Column(Text)
ip_location = Column(Text)
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
note_id = Column(String(255), index=True)
type = Column(Text)
title = Column(Text)
desc = Column(Text)
video_url = Column(Text)
time = Column(BigInteger, index=True)
last_update_time = Column(BigInteger)
liked_count = Column(Text)
collected_count = Column(Text)
comment_count = Column(Text)
share_count = Column(Text)
image_list = Column(Text)
tag_list = Column(Text)
note_url = Column(Text)
source_keyword = Column(Text, default='')
xsec_token = Column(Text)
class XhsNoteComment(Base):
__tablename__ = 'xhs_note_comment'
id = Column(Integer, primary_key=True)
user_id = Column(String(255))
nickname = Column(Text)
avatar = Column(Text)
ip_location = Column(Text)
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
comment_id = Column(String(255), index=True)
create_time = Column(BigInteger, index=True)
note_id = Column(String(255))
content = Column(Text)
sub_comment_count = Column(Integer)
pictures = Column(Text)
parent_comment_id = Column(String(255))
like_count = Column(Text)
class TiebaNote(Base):
__tablename__ = 'tieba_note'
id = Column(Integer, primary_key=True)
note_id = Column(String(644), index=True)
title = Column(Text)
desc = Column(Text)
note_url = Column(Text)
publish_time = Column(String(255), index=True)
user_link = Column(Text, default='')
user_nickname = Column(Text, default='')
user_avatar = Column(Text, default='')
tieba_id = Column(String(255), default='')
tieba_name = Column(Text)
tieba_link = Column(Text)
total_replay_num = Column(Integer, default=0)
total_replay_page = Column(Integer, default=0)
ip_location = Column(Text, default='')
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
source_keyword = Column(Text, default='')
class TiebaComment(Base):
__tablename__ = 'tieba_comment'
id = Column(Integer, primary_key=True)
comment_id = Column(String(255), index=True)
parent_comment_id = Column(String(255), default='')
content = Column(Text)
user_link = Column(Text, default='')
user_nickname = Column(Text, default='')
user_avatar = Column(Text, default='')
tieba_id = Column(String(255), default='')
tieba_name = Column(Text)
tieba_link = Column(Text)
publish_time = Column(String(255), index=True)
ip_location = Column(Text, default='')
sub_comment_count = Column(Integer, default=0)
note_id = Column(String(255), index=True)
note_url = Column(Text)
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
class TiebaCreator(Base):
__tablename__ = 'tieba_creator'
id = Column(Integer, primary_key=True)
user_id = Column(String(64))
user_name = Column(Text)
nickname = Column(Text)
avatar = Column(Text)
ip_location = Column(Text)
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
gender = Column(Text)
follows = Column(Text)
fans = Column(Text)
registration_duration = Column(Text)
class ZhihuContent(Base):
__tablename__ = 'zhihu_content'
id = Column(Integer, primary_key=True)
content_id = Column(String(64), index=True)
content_type = Column(Text)
content_text = Column(Text)
content_url = Column(Text)
question_id = Column(String(255))
title = Column(Text)
desc = Column(Text)
created_time = Column(String(32), index=True)
updated_time = Column(Text)
voteup_count = Column(Integer, default=0)
comment_count = Column(Integer, default=0)
source_keyword = Column(Text)
user_id = Column(String(255))
user_link = Column(Text)
user_nickname = Column(Text)
user_avatar = Column(Text)
user_url_token = Column(Text)
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
# persist-1<persist1@126.com>
# 原因:修复 ORM 模型定义错误,确保与数据库表结构一致。
# 副作用:无
# 回滚策略:还原此行
class ZhihuComment(Base):
__tablename__ = 'zhihu_comment'
id = Column(Integer, primary_key=True)
comment_id = Column(String(64), index=True)
parent_comment_id = Column(String(64))
content = Column(Text)
publish_time = Column(String(32), index=True)
ip_location = Column(Text)
sub_comment_count = Column(Integer, default=0)
like_count = Column(Integer, default=0)
dislike_count = Column(Integer, default=0)
content_id = Column(String(64), index=True)
content_type = Column(Text)
user_id = Column(String(64))
user_link = Column(Text)
user_nickname = Column(Text)
user_avatar = Column(Text)
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
class ZhihuCreator(Base):
__tablename__ = 'zhihu_creator'
id = Column(Integer, primary_key=True)
user_id = Column(String(64), unique=True, index=True)
user_link = Column(Text)
user_nickname = Column(Text)
user_avatar = Column(Text)
url_token = Column(Text)
gender = Column(Text)
ip_location = Column(Text)
follows = Column(Integer, default=0)
fans = Column(Integer, default=0)
anwser_count = Column(Integer, default=0)
video_count = Column(Integer, default=0)
question_count = Column(Integer, default=0)
article_count = Column(Integer, default=0)
column_count = Column(Integer, default=0)
get_voteup_count = Column(Integer, default=0)
add_ts = Column(BigInteger)
last_modify_ts = Column(BigInteger)
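Columns such as `add_ts` and `last_modify_ts` are stored as `BigInteger` epoch timestamps; a helper along the lines of the repository's own time utilities (assumed here to use milliseconds) would be:

```python
import time

def current_ts_ms() -> int:
    """Epoch milliseconds, the format assumed for add_ts / last_modify_ts."""
    return int(time.time() * 1000)
```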

db.py (deleted, 209 lines)

@@ -1,209 +0,0 @@
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/6 14:54
# @Desc : mediacrawler db 管理
import asyncio
from typing import Dict
from urllib.parse import urlparse
import aiofiles
import aiomysql
import config
from async_db import AsyncMysqlDB
from async_sqlite_db import AsyncSqliteDB
from tools import utils
from var import db_conn_pool_var, media_crawler_db_var
async def init_mediacrawler_db():
"""
初始化数据库链接池对象并将该对象塞给media_crawler_db_var上下文变量
Returns:
"""
pool = await aiomysql.create_pool(
host=config.MYSQL_DB_HOST,
port=config.MYSQL_DB_PORT,
user=config.MYSQL_DB_USER,
password=config.MYSQL_DB_PWD,
db=config.MYSQL_DB_NAME,
autocommit=True,
)
async_db_obj = AsyncMysqlDB(pool)
# 将连接池对象和封装的CRUD sql接口对象放到上下文变量中
db_conn_pool_var.set(pool)
media_crawler_db_var.set(async_db_obj)
async def init_sqlite_db():
"""
初始化SQLite数据库对象并将该对象塞给media_crawler_db_var上下文变量
Returns:
"""
async_db_obj = AsyncSqliteDB(config.SQLITE_DB_PATH)
# 将SQLite数据库对象放到上下文变量中
media_crawler_db_var.set(async_db_obj)
async def init_db():
"""
初始化db连接池
Returns:
"""
utils.logger.info("[init_db] start init mediacrawler db connect object")
if config.SAVE_DATA_OPTION == "sqlite":
await init_sqlite_db()
utils.logger.info("[init_db] end init sqlite db connect object")
else:
await init_mediacrawler_db()
utils.logger.info("[init_db] end init mysql db connect object")
async def close():
"""
关闭数据库连接
Returns:
"""
utils.logger.info("[close] close mediacrawler db connection")
if config.SAVE_DATA_OPTION == "sqlite":
# SQLite数据库连接会在AsyncSqliteDB对象销毁时自动关闭
utils.logger.info("[close] sqlite db connection will be closed automatically")
else:
# MySQL连接池关闭
db_pool: aiomysql.Pool = db_conn_pool_var.get()
if db_pool is not None:
db_pool.close()
utils.logger.info("[close] mysql db pool closed")
async def init_table_schema(db_type: str = None):
"""
用来初始化数据库表结构,请在第一次需要创建表结构的时候使用,多次执行该函数会将已有的表以及数据全部删除
Args:
db_type: 数据库类型,可选值为 'sqlite''mysql',如果不指定则使用配置文件中的设置
Returns:
"""
# 如果没有指定数据库类型,则使用配置文件中的设置
if db_type is None:
db_type = config.SAVE_DATA_OPTION
if db_type == "sqlite":
utils.logger.info("[init_table_schema] begin init sqlite table schema ...")
# 检查并删除可能存在的损坏数据库文件
import os
if os.path.exists(config.SQLITE_DB_PATH):
try:
# 尝试删除现有的数据库文件
os.remove(config.SQLITE_DB_PATH)
utils.logger.info(f"[init_table_schema] removed existing sqlite db file: {config.SQLITE_DB_PATH}")
except Exception as e:
utils.logger.warning(f"[init_table_schema] failed to remove existing sqlite db file: {e}")
# 如果删除失败,尝试重命名文件
try:
backup_path = f"{config.SQLITE_DB_PATH}.backup_{utils.get_current_timestamp()}"
os.rename(config.SQLITE_DB_PATH, backup_path)
utils.logger.info(f"[init_table_schema] renamed existing sqlite db file to: {backup_path}")
except Exception as rename_e:
utils.logger.error(f"[init_table_schema] failed to rename existing sqlite db file: {rename_e}")
raise rename_e
await init_sqlite_db()
async_db_obj: AsyncSqliteDB = media_crawler_db_var.get()
async with aiofiles.open("schema/sqlite_tables.sql", mode="r", encoding="utf-8") as f:
schema_sql = await f.read()
await async_db_obj.executescript(schema_sql)
utils.logger.info("[init_table_schema] sqlite table schema init successful")
elif db_type == "mysql":
utils.logger.info("[init_table_schema] begin init mysql table schema ...")
await init_mediacrawler_db()
async_db_obj: AsyncMysqlDB = media_crawler_db_var.get()
async with aiofiles.open("schema/tables.sql", mode="r", encoding="utf-8") as f:
schema_sql = await f.read()
await async_db_obj.execute(schema_sql)
utils.logger.info("[init_table_schema] mysql table schema init successful")
await close()
else:
utils.logger.error(f"[init_table_schema] 不支持的数据库类型: {db_type}")
raise ValueError(f"不支持的数据库类型: {db_type},支持的类型: sqlite, mysql")
def show_database_options():
"""
显示支持的数据库选项
"""
print("\n=== MediaCrawler 数据库初始化工具 ===")
print("支持的数据库类型:")
print("1. sqlite - SQLite 数据库 (轻量级,无需额外配置)")
print("2. mysql - MySQL 数据库 (需要配置数据库连接信息)")
print("3. config - 使用配置文件中的设置")
print("4. exit - 退出程序")
print("="*50)
def get_user_choice():
"""
获取用户选择的数据库类型
Returns:
str: 用户选择的数据库类型
"""
while True:
choice = input("请输入数据库类型 (sqlite/mysql/config/exit): ").strip().lower()
if choice in ['sqlite', 'mysql', 'config', 'exit']:
return choice
else:
print("❌ 无效的选择,请输入: sqlite, mysql, config 或 exit")
async def main():
"""
主函数,处理用户交互和数据库初始化
"""
try:
show_database_options()
while True:
choice = get_user_choice()
if choice == 'exit':
print("👋 程序已退出")
break
elif choice == 'config':
print(f"📋 使用配置文件中的设置: {config.SAVE_DATA_OPTION}")
await init_table_schema()
print("✅ 数据库表结构初始化完成!")
break
else:
print(f"🚀 开始初始化 {choice.upper()} 数据库...")
await init_table_schema(choice)
print("✅ 数据库表结构初始化完成!")
break
except KeyboardInterrupt:
print("\n\n⚠️ 用户中断操作")
except Exception as e:
print(f"\n❌ 初始化失败: {str(e)}")
utils.logger.error(f"[main] 数据库初始化失败: {str(e)}")
if __name__ == '__main__':
asyncio.get_event_loop().run_until_complete(main())


@@ -54,15 +54,19 @@
python main.py --help
```
-## 数据保存
-- 支持关系型数据库Mysql中保存需要提前创建数据库
-  - 执行 `python db.py` 初始化数据库数据库表结构(只在首次执行)
-- 支持轻量级SQLite数据库保存无需额外安装数据库服务器
-  - 本地文件数据库,适合个人使用和小规模数据存储
-  - 使用参数 `--save_data_option sqlite` 启用SQLite存储
-  - 数据库文件自动创建在项目目录下schema/sqlite_tables.db
-- 支持保存到csv中data/目录下)
-- 支持保存到json中data/目录下)
+## 💾 数据存储
+支持多种数据存储方式:
+- **CSV 文件**: 支持保存至 CSV (位于 `data/` 目录下)
+- **JSON 文件**: 支持保存至 JSON (位于 `data/` 目录下)
+- **数据库存储**
+  - 使用 `--init_db` 参数进行数据库初始化 (使用 `--init_db` 时,无需其他可选参数)
+  - **SQLite 数据库**: 轻量级数据库,无需服务器,适合个人使用 (推荐)
+    1. 初始化: `--init_db sqlite`
+    2. 数据存储: `--save_data_option sqlite`
+  - **MySQL 数据库**: 支持保存至关系型数据库 MySQL (需提前创建数据库)
+    1. 初始化: `--init_db mysql`
+    2. 数据存储: `--save_data_option db` (db 参数为兼容历史更新保留)
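For example, combining the flags documented above (illustrative invocation; the `--platform`, `--lt`, and `--type` flag names are assumed from the Typer CLI elsewhere in this diff):

```
# One-time schema initialization
python main.py --init_db sqlite
# Subsequent crawls persist to SQLite
python main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
```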
## 免责声明
> **免责声明:**


@@ -17,7 +17,7 @@
-扫描下方我的个人微信备注pro版本如果图片展示不出来可以直接添加我的微信号yzglan
+扫描下方我的个人微信备注pro版本如果图片展示不出来可以直接添加我的微信号relakkes
![relakkes_weichat.JPG](static/images/relakkes_weichat.jpg)

docs/static/images/nstbrowser.jpg (binary file added, 580 KiB; not shown)

(binary image updated: 223 KiB → 230 KiB; file not shown)

docs/static/images/tikhub_banner.png (binary file added, 750 KiB; not shown)

docs/static/images/tikhub_banner_zh.png (binary file added, 758 KiB; not shown)


@@ -7,6 +7,6 @@
## 加群方式
> 备注github会有拉群小助手自动拉你进群。
>
-> 如果图片展示不出来或过期,可以直接添加我的微信号:yzglan并备注github会有拉群小助手自动拉你进群
+> 如果图片展示不出来或过期,可以直接添加我的微信号:relakkes并备注github会有拉群小助手自动拉你进群
![relakkes_wechat](static/images/relakkes_weichat.jpg)


@@ -2,36 +2,70 @@
```
MediaCrawler
├── base
│   └── base_crawler.py       # 项目的抽象类
├── cache
│   ├── abs_cache.py          # 缓存抽象基类
│   ├── cache_factory.py      # 缓存工厂
│   ├── local_cache.py        # 本地缓存实现
│   └── redis_cache.py        # Redis缓存实现
├── cmd_arg
│   └── arg.py                # 命令行参数定义
├── config
│   ├── base_config.py        # 基础配置
│   ├── db_config.py          # 数据库配置
│   └── ...                   # 各平台配置文件
├── constant
│   └── ...                   # 各平台常量定义
├── data                      # 数据保存目录
├── database
│   ├── db.py                 # 数据库ORM封装增删改查
│   ├── db_session.py         # 数据库会话管理
│   └── models.py             # 数据库模型定义
├── docs
│   └── ...                   # 项目文档
├── libs
│   ├── douyin.js             # 抖音Sign函数
│   ├── stealth.min.js        # 去除浏览器自动化特征的JS
│   └── zhihu.js              # 知乎Sign函数
├── media_platform
│   ├── bilibili              # B站采集实现
│   ├── douyin                # 抖音采集实现
│   ├── kuaishou              # 快手采集实现
│   ├── tieba                 # 百度贴吧采集实现
│   ├── weibo                 # 微博采集实现
│   ├── xhs                   # 小红书采集实现
│   └── zhihu                 # 知乎采集实现
├── model
│   ├── m_baidu_tieba.py      # 百度贴吧数据模型
│   ├── m_douyin.py           # 抖音数据模型
│   ├── m_kuaishou.py         # 快手数据模型
│   ├── m_weibo.py            # 微博数据模型
│   ├── m_xiaohongshu.py      # 小红书数据模型
│   └── m_zhihu.py            # 知乎数据模型
├── proxy
│   ├── base_proxy.py         # 代理基类
│   ├── providers             # 代理提供商实现
│   ├── proxy_ip_pool.py      # 代理IP池
│   └── types.py              # 代理类型定义
├── store
│   ├── bilibili              # B站数据存储实现
│   ├── douyin                # 抖音数据存储实现
│   ├── kuaishou              # 快手数据存储实现
│   ├── tieba                 # 贴吧数据存储实现
│   ├── weibo                 # 微博数据存储实现
│   ├── xhs                   # 小红书数据存储实现
│   └── zhihu                 # 知乎数据存储实现
├── test
│   ├── test_db_sync.py       # 数据库同步测试
│   ├── test_proxy_ip_pool.py # 代理IP池测试
│   └── ...                   # 其他测试用例
├── tools
│   ├── browser_launcher.py   # 浏览器启动器
│   ├── cdp_browser.py        # CDP浏览器控制
│   ├── crawler_util.py       # 爬虫工具函数
│   ├── utils.py              # 通用工具函数
│   └── ...
├── main.py                   # 程序入口, 支持 --init_db 参数来初始化数据库
├── recv_sms.py               # 短信转发HTTP SERVER接口
└── var.py                    # 全局上下文变量定义
```

main.py (16 lines changed)

@@ -15,7 +15,7 @@ from typing import Optional
import cmd_arg
import config
-import db
+from database import db
from base.base_crawler import AbstractCrawler
from media_platform.bilibili import BilibiliCrawler
from media_platform.douyin import DouYinCrawler
@@ -50,16 +50,24 @@ class CrawlerFactory:
crawler: Optional[AbstractCrawler] = None
+# persist-1<persist1@126.com>
+# 原因:增加 --init_db 功能,用于数据库初始化。
+# 副作用:无
+# 回滚策略:还原此文件。
async def main():
# Init crawler
global crawler
# parse cmd
-    await cmd_arg.parse_cmd()
+    args = await cmd_arg.parse_cmd()
# init db
-    if config.SAVE_DATA_OPTION in ["db", "sqlite"]:
-        await db.init_db()
+    if args.init_db:
+        await db.init_db(args.init_db)
+        print(f"Database {args.init_db} initialized successfully.")
+        return  # Exit the main function cleanly
crawler = CrawlerFactory.create_crawler(platform=config.PLATFORM)
await crawler.start()


@@ -189,10 +189,11 @@ class BilibiliClient(AbstractApiClient):
if not aid or not cid or aid <= 0 or cid <= 0:
raise ValueError("aid 和 cid 必须存在")
uri = "/x/player/wbi/playurl"
+        qn_value = getattr(config, "BILI_QN", 80)
params = {
"avid": aid,
"cid": cid,
"qn": 80,
"qn": qn_value,
"fourk": 1,
"fnval": 1,
"platform": "pc",
@@ -201,17 +202,19 @@ class BilibiliClient(AbstractApiClient):
return await self.get(uri, params, enable_params_sign=True)
async def get_video_media(self, url: str) -> Union[bytes, None]:
-        async with httpx.AsyncClient(proxy=self.proxy) as client:
+        # Follow CDN 302 redirects and treat any 2xx as success (some endpoints return 206)
+        async with httpx.AsyncClient(proxy=self.proxy, follow_redirects=True) as client:
try:
response = await client.request("GET", url, timeout=self.timeout, headers=self.headers)
response.raise_for_status()
if not response.reason_phrase == "OK":
utils.logger.error(f"[BilibiliClient.get_video_media] request {url} err, res:{response.text}")
return None
else:
if 200 <= response.status_code < 300:
return response.content
except httpx.HTTPStatusError as exc: # some wrong when call httpx.request method, such as connection error, client error or server error
utils.logger.error(f"[BilibiliClient.get_video_media] {exc}")
utils.logger.error(
f"[BilibiliClient.get_video_media] Unexpected status {response.status_code} for {url}"
)
return None
except httpx.HTTPError as exc:  # raised when httpx.request fails: connection error, client/server error, or non-2xx status
utils.logger.error(f"[BilibiliClient.get_video_media] {exc.__class__.__name__} for {exc.request.url} - {exc}")  # keep the original exception class name to aid debugging
return None
async def get_video_comments(
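The hunk above replaces the `reason_phrase == "OK"` check with a status-range check. A 206 Partial Content response, common from media CDNs once redirects are followed, has the reason phrase "Partial Content", so the old check silently dropped valid content. The predicate can be isolated as a small sketch:

```python
def is_success(status_code: int) -> bool:
    """Accept any 2xx status, matching the range check in the diff above."""
    return 200 <= status_code < 300

# 206 Partial Content now counts as success, whereas a reason-phrase
# comparison against "OK" (i.e. 200 only) would have rejected it.
for code in (200, 206, 301, 404):
    print(code, is_success(code))
```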


@@ -15,7 +15,7 @@
import asyncio
import os
import random
# import random # Removed as we now use fixed config.CRAWLER_MAX_SLEEP_SEC intervals
from asyncio import Task
from typing import Dict, List, Optional, Tuple, Union
from datetime import datetime, timedelta
@@ -41,6 +41,7 @@ from var import crawler_type_var, source_keyword_var
from .client import BilibiliClient
from .exception import DataFetchError
from .field import SearchOrderType
from .help import parse_video_info_from_url, parse_creator_info_from_url
from .login import BilibiliLogin
@@ -103,8 +104,14 @@ class BilibiliCrawler(AbstractCrawler):
await self.get_specified_videos(config.BILI_SPECIFIED_ID_LIST)
elif config.CRAWLER_TYPE == "creator":
if config.CREATOR_MODE:
for creator_id in config.BILI_CREATOR_ID_LIST:
await self.get_creator_videos(int(creator_id))
for creator_url in config.BILI_CREATOR_ID_LIST:
try:
creator_info = parse_creator_info_from_url(creator_url)
utils.logger.info(f"[BilibiliCrawler.start] Parsed creator ID: {creator_info.creator_id} from {creator_url}")
await self.get_creator_videos(int(creator_info.creator_id))
except ValueError as e:
utils.logger.error(f"[BilibiliCrawler.start] Failed to parse creator URL: {e}")
continue
else:
await self.get_all_creator_details(config.BILI_CREATOR_ID_LIST)
else:
@@ -208,6 +215,11 @@ class BilibiliCrawler(AbstractCrawler):
await bilibili_store.update_up_info(video_item)
await self.get_bilibili_video(video_item, semaphore)
page += 1
# Sleep after page navigation
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[BilibiliCrawler.search_by_keywords] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after page {page-1}")
await self.batch_get_video_comments(video_id_list)
async def search_by_keywords_in_time_range(self, daily_limit: bool):
@@ -284,6 +296,11 @@ class BilibiliCrawler(AbstractCrawler):
await self.get_bilibili_video(video_item, semaphore)
page += 1
# Sleep after page navigation
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[BilibiliCrawler.search_by_keywords_in_time_range] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after page {page-1}")
await self.batch_get_video_comments(video_id_list)
except Exception as e:
@@ -318,10 +335,11 @@ class BilibiliCrawler(AbstractCrawler):
async with semaphore:
try:
utils.logger.info(f"[BilibiliCrawler.get_comments] begin get video_id: {video_id} comments ...")
await asyncio.sleep(random.uniform(0.5, 1.5))
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[BilibiliCrawler.get_comments] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after fetching comments for video {video_id}")
await self.bili_client.get_video_all_comments(
video_id=video_id,
crawl_interval=random.random(),
crawl_interval=config.CRAWLER_MAX_SLEEP_SEC,
is_fetch_sub_comments=config.ENABLE_GET_SUB_COMMENTS,
callback=bilibili_store.batch_update_bilibili_video_comments,
max_count=config.CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES,
@@ -347,14 +365,27 @@ class BilibiliCrawler(AbstractCrawler):
await self.get_specified_videos(video_bvids_list)
if int(result["page"]["count"]) <= pn * ps:
break
await asyncio.sleep(random.random())
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[BilibiliCrawler.get_creator_videos] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after page {pn}")
pn += 1
async def get_specified_videos(self, bvids_list: List[str]):
async def get_specified_videos(self, video_url_list: List[str]):
"""
get specified videos info
get specified videos info from URLs or BV IDs
:param video_url_list: List of video URLs or BV IDs
:return:
"""
utils.logger.info("[BilibiliCrawler.get_specified_videos] Parsing video URLs...")
bvids_list = []
for video_url in video_url_list:
try:
video_info = parse_video_info_from_url(video_url)
bvids_list.append(video_info.video_id)
utils.logger.info(f"[BilibiliCrawler.get_specified_videos] Parsed video ID: {video_info.video_id} from {video_url}")
except ValueError as e:
utils.logger.error(f"[BilibiliCrawler.get_specified_videos] Failed to parse video URL: {e}")
continue
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list = [self.get_video_info_task(aid=0, bvid=video_id, semaphore=semaphore) for video_id in bvids_list]
video_details = await asyncio.gather(*task_list)
@@ -381,6 +412,11 @@ class BilibiliCrawler(AbstractCrawler):
async with semaphore:
try:
result = await self.bili_client.get_video_info(aid=aid, bvid=bvid)
# Sleep after fetching video details
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[BilibiliCrawler.get_video_info_task] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after fetching video details {bvid or aid}")
return result
except DataFetchError as ex:
utils.logger.error(f"[BilibiliCrawler.get_video_info_task] Get video detail error: {ex}")
@@ -544,24 +580,37 @@ class BilibiliCrawler(AbstractCrawler):
return
content = await self.bili_client.get_video_media(video_url)
await asyncio.sleep(random.random())
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[BilibiliCrawler.get_bilibili_video] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after fetching video {aid}")
if content is None:
return
extension_file_name = f"video.mp4"
await bilibili_store.store_video(aid, content, extension_file_name)
async def get_all_creator_details(self, creator_id_list: List[int]):
async def get_all_creator_details(self, creator_url_list: List[str]):
"""
creator_id_list: get details for creator from creator_id_list
creator_url_list: get details for creator from creator URL list
"""
utils.logger.info(f"[BilibiliCrawler.get_creator_details] Crawling the detalis of creator")
utils.logger.info(f"[BilibiliCrawler.get_creator_details] creator ids:{creator_id_list}")
utils.logger.info(f"[BilibiliCrawler.get_all_creator_details] Crawling the details of creators")
utils.logger.info(f"[BilibiliCrawler.get_all_creator_details] Parsing creator URLs...")
creator_id_list = []
for creator_url in creator_url_list:
try:
creator_info = parse_creator_info_from_url(creator_url)
creator_id_list.append(int(creator_info.creator_id))
utils.logger.info(f"[BilibiliCrawler.get_all_creator_details] Parsed creator ID: {creator_info.creator_id} from {creator_url}")
except ValueError as e:
utils.logger.error(f"[BilibiliCrawler.get_all_creator_details] Failed to parse creator URL: {e}")
continue
utils.logger.info(f"[BilibiliCrawler.get_all_creator_details] creator ids:{creator_id_list}")
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list: List[Task] = []
try:
for creator_id in creator_id_list:
task = asyncio.create_task(self.get_creator_details(creator_id, semaphore), name=creator_id)
task = asyncio.create_task(self.get_creator_details(creator_id, semaphore), name=str(creator_id))
task_list.append(task)
except Exception as e:
utils.logger.warning(f"[BilibiliCrawler.get_all_creator_details] error in the task list. The creator will not be included. {e}")
@@ -600,7 +649,7 @@ class BilibiliCrawler(AbstractCrawler):
utils.logger.info(f"[BilibiliCrawler.get_fans] begin get creator_id: {creator_id} fans ...")
await self.bili_client.get_creator_all_fans(
creator_info=creator_info,
crawl_interval=random.random(),
crawl_interval=config.CRAWLER_MAX_SLEEP_SEC,
callback=bilibili_store.batch_update_bilibili_creator_fans,
max_count=config.CRAWLER_MAX_CONTACTS_COUNT_SINGLENOTES,
)
@@ -623,7 +672,7 @@ class BilibiliCrawler(AbstractCrawler):
utils.logger.info(f"[BilibiliCrawler.get_followings] begin get creator_id: {creator_id} followings ...")
await self.bili_client.get_creator_all_followings(
creator_info=creator_info,
crawl_interval=random.random(),
crawl_interval=config.CRAWLER_MAX_SLEEP_SEC,
callback=bilibili_store.batch_update_bilibili_creator_followings,
max_count=config.CRAWLER_MAX_CONTACTS_COUNT_SINGLENOTES,
)
@@ -646,7 +695,7 @@ class BilibiliCrawler(AbstractCrawler):
utils.logger.info(f"[BilibiliCrawler.get_dynamics] begin get creator_id: {creator_id} dynamics ...")
await self.bili_client.get_creator_all_dynamics(
creator_info=creator_info,
crawl_interval=random.random(),
crawl_interval=config.CRAWLER_MAX_SLEEP_SEC,
callback=bilibili_store.batch_update_bilibili_creator_dynamics,
max_count=config.CRAWLER_MAX_DYNAMICS_COUNT_SINGLENOTES,
)
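Throughout this hunk the per-call `random.random()` jitter is replaced with the fixed `config.CRAWLER_MAX_SLEEP_SEC`. A helper that supports both policies could look like this (a sketch; the config name comes from the diff, the optional jitter mode is an assumption, not something the repo provides):

```python
import random

def crawl_interval(max_sleep_sec: float, jitter: bool = False) -> float:
    """Return the delay between requests.

    jitter=False reproduces the fixed interval used in the diff above;
    jitter=True restores randomized spacing, bounded by max_sleep_sec
    so the overall crawl rate stays predictable.
    """
    if jitter:
        return random.uniform(0.0, max_sleep_sec)
    return max_sleep_sec

print(crawl_interval(2.0))  # 2.0
print(0.0 <= crawl_interval(2.0, jitter=True) <= 2.0)  # True
```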


@@ -9,15 +9,17 @@
# By using this code, you agree to the principles above and all terms in the LICENSE.
# -*- coding: utf-8 -*-
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2023/12/2 23:26
# @Desc : bilibili request-parameter signing
# Reverse-engineering reference: https://socialsisteryi.github.io/bilibili-API-collect/docs/misc/sign/wbi.html#wbi%E7%AD%BE%E5%90%8D%E7%AE%97%E6%B3%95
import re
import urllib.parse
from hashlib import md5
from typing import Dict
from model.m_bilibili import VideoUrlInfo, CreatorUrlInfo
from tools import utils
@@ -66,16 +68,71 @@ class BilibiliSign:
return req_data
def parse_video_info_from_url(url: str) -> VideoUrlInfo:
"""
Parse the video ID from a bilibili video URL
Args:
url: bilibili video link
- https://www.bilibili.com/video/BV1dwuKzmE26/?spm_id_from=333.1387.homepage.video_card.click
- https://www.bilibili.com/video/BV1d54y1g7db
- BV1d54y1g7db (a bare BV id)
Returns:
VideoUrlInfo: object containing the video ID
"""
# If the input is already a BV id, return it directly
if url.startswith("BV"):
return VideoUrlInfo(video_id=url)
# Extract the BV id with a regex
# matches the /video/BV... format
bv_pattern = r'/video/(BV[a-zA-Z0-9]+)'
match = re.search(bv_pattern, url)
if match:
video_id = match.group(1)
return VideoUrlInfo(video_id=video_id)
raise ValueError(f"Could not parse a video ID from URL: {url}")
def parse_creator_info_from_url(url: str) -> CreatorUrlInfo:
"""
Parse the creator ID from a bilibili creator-space URL
Args:
url: bilibili creator-space link
- https://space.bilibili.com/434377496?spm_id_from=333.1007.0.0
- https://space.bilibili.com/20813884
- 434377496 (a bare UID)
Returns:
CreatorUrlInfo: object containing the creator ID
"""
# If the input is already a numeric ID, return it directly
if url.isdigit():
return CreatorUrlInfo(creator_id=url)
# Extract the UID with a regex
# matches the space.bilibili.com/<digits> format
uid_pattern = r'space\.bilibili\.com/(\d+)'
match = re.search(uid_pattern, url)
if match:
creator_id = match.group(1)
return CreatorUrlInfo(creator_id=creator_id)
raise ValueError(f"Could not parse a creator ID from URL: {url}")
if __name__ == '__main__':
_img_key = "7cd084941338484aae1ad9425b84077c"
_sub_key = "4932caff0ff746eab6f01bf08b70ac45"
_search_url = "__refresh__=true&_extra=&ad_resource=5654&category_id=&context=&dynamic_offset=0&from_source=&from_spmid=333.337&gaia_vtoken=&highlight=1&keyword=python&order=click&page=1&page_size=20&platform=pc&qv_id=OQ8f2qtgYdBV1UoEnqXUNUl8LEDAdzsD&search_type=video&single_column=0&source_tag=3&web_location=1430654"
_req_data = dict()
for params in _search_url.split("&"):
kvalues = params.split("=")
key = kvalues[0]
value = kvalues[1]
_req_data[key] = value
print("pre req_data", _req_data)
_req_data = BilibiliSign(img_key=_img_key, sub_key=_sub_key).sign(req_data={"aid":170001})
print(_req_data)
# Test video URL parsing
video_url1 = "https://www.bilibili.com/video/BV1dwuKzmE26/?spm_id_from=333.1387.homepage.video_card.click"
video_url2 = "BV1d54y1g7db"
print("Video URL parsing tests:")
print(f"URL1: {video_url1} -> {parse_video_info_from_url(video_url1)}")
print(f"URL2: {video_url2} -> {parse_video_info_from_url(video_url2)}")
# Test creator URL parsing
creator_url1 = "https://space.bilibili.com/434377496?spm_id_from=333.1007.0.0"
creator_url2 = "20813884"
print("\nCreator URL parsing tests:")
print(f"URL1: {creator_url1} -> {parse_creator_info_from_url(creator_url1)}")
print(f"URL2: {creator_url2} -> {parse_creator_info_from_url(creator_url2)}")


@@ -321,6 +321,31 @@ class DouYinClient(AbstractApiClient):
return None
else:
return response.content
except httpx.HTTPStatusError as exc: # some wrong when call httpx.request method, such as connection error, client error or server error
utils.logger.error(f"[DouYinClient.get_aweme_media] {exc}")
except httpx.HTTPError as exc:  # raised when httpx.request fails: connection error, client/server error, or non-2xx status
utils.logger.error(f"[DouYinClient.get_aweme_media] {exc.__class__.__name__} for {exc.request.url} - {exc}")  # keep the original exception class name to aid debugging
return None
async def resolve_short_url(self, short_url: str) -> str:
"""
Resolve a douyin short link to get the real URL after redirection
Args:
short_url: short link, e.g. https://v.douyin.com/iF12345ABC/
Returns:
The full URL after redirection
"""
async with httpx.AsyncClient(proxy=self.proxy, follow_redirects=False) as client:
try:
utils.logger.info(f"[DouYinClient.resolve_short_url] Resolving short URL: {short_url}")
response = await client.get(short_url, timeout=10)
# Short links usually respond with a 302 redirect
if response.status_code in [301, 302, 303, 307, 308]:
redirect_url = response.headers.get("Location", "")
utils.logger.info(f"[DouYinClient.resolve_short_url] Resolved to: {redirect_url}")
return redirect_url
else:
utils.logger.warning(f"[DouYinClient.resolve_short_url] Unexpected status code: {response.status_code}")
return ""
except Exception as e:
utils.logger.error(f"[DouYinClient.resolve_short_url] Failed to resolve short URL: {e}")
return ""
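The resolver above follows only the first hop and reads the `Location` header by hand. The status/header decision can be isolated into a pure function for testing (a sketch; the redirect-code set follows the HTTP spec, and the lowercase-header fallback is an extra assumption, not behaviour shown in the diff):

```python
REDIRECT_CODES = {301, 302, 303, 307, 308}

def extract_redirect(status_code: int, headers: dict) -> str:
    """Return the redirect target, or "" when the response is not a redirect."""
    if status_code in REDIRECT_CODES:
        # httpx normalizes header names, but .get() keeps this tolerant
        # of plain dicts with either capitalization.
        return headers.get("Location", "") or headers.get("location", "")
    return ""

print(extract_redirect(302, {"Location": "https://www.douyin.com/video/123"}))
print(repr(extract_redirect(200, {})))  # ''
```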


@@ -33,6 +33,7 @@ from var import crawler_type_var, source_keyword_var
from .client import DouYinClient
from .exception import DataFetchError
from .field import PublishTimeType
from .help import parse_video_info_from_url, parse_creator_info_from_url
from .login import DouYinLogin
@@ -147,25 +148,56 @@ class DouYinCrawler(AbstractCrawler):
aweme_list.append(aweme_info.get("aweme_id", ""))
await douyin_store.update_douyin_aweme(aweme_item=aweme_info)
await self.get_aweme_media(aweme_item=aweme_info)
# Sleep after each page navigation
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[DouYinCrawler.search] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after page {page-1}")
utils.logger.info(f"[DouYinCrawler.search] keyword:{keyword}, aweme_list:{aweme_list}")
await self.batch_get_note_comments(aweme_list)
async def get_specified_awemes(self):
"""Get the information and comments of the specified post"""
"""Get the information and comments of the specified post from URLs or IDs"""
utils.logger.info("[DouYinCrawler.get_specified_awemes] Parsing video URLs...")
aweme_id_list = []
for video_url in config.DY_SPECIFIED_ID_LIST:
try:
video_info = parse_video_info_from_url(video_url)
# Handle short links
if video_info.url_type == "short":
utils.logger.info(f"[DouYinCrawler.get_specified_awemes] Resolving short link: {video_url}")
resolved_url = await self.dy_client.resolve_short_url(video_url)
if resolved_url:
# Extract the video ID from the resolved URL
video_info = parse_video_info_from_url(resolved_url)
utils.logger.info(f"[DouYinCrawler.get_specified_awemes] Short link resolved to aweme ID: {video_info.aweme_id}")
else:
utils.logger.error(f"[DouYinCrawler.get_specified_awemes] Failed to resolve short link: {video_url}")
continue
aweme_id_list.append(video_info.aweme_id)
utils.logger.info(f"[DouYinCrawler.get_specified_awemes] Parsed aweme ID: {video_info.aweme_id} from {video_url}")
except ValueError as e:
utils.logger.error(f"[DouYinCrawler.get_specified_awemes] Failed to parse video URL: {e}")
continue
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list = [self.get_aweme_detail(aweme_id=aweme_id, semaphore=semaphore) for aweme_id in config.DY_SPECIFIED_ID_LIST]
task_list = [self.get_aweme_detail(aweme_id=aweme_id, semaphore=semaphore) for aweme_id in aweme_id_list]
aweme_details = await asyncio.gather(*task_list)
for aweme_detail in aweme_details:
if aweme_detail is not None:
await douyin_store.update_douyin_aweme(aweme_item=aweme_detail)
await self.get_aweme_media(aweme_item=aweme_detail)
await self.batch_get_note_comments(config.DY_SPECIFIED_ID_LIST)
await self.batch_get_note_comments(aweme_id_list)
async def get_aweme_detail(self, aweme_id: str, semaphore: asyncio.Semaphore) -> Any:
"""Get note detail"""
async with semaphore:
try:
return await self.dy_client.get_video_by_id(aweme_id)
result = await self.dy_client.get_video_by_id(aweme_id)
# Sleep after fetching aweme detail
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[DouYinCrawler.get_aweme_detail] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after fetching aweme {aweme_id}")
return result
except DataFetchError as ex:
utils.logger.error(f"[DouYinCrawler.get_aweme_detail] Get aweme detail error: {ex}")
return None
@@ -193,23 +225,38 @@ class DouYinCrawler(AbstractCrawler):
async with semaphore:
try:
# Pass the keyword list to the get_aweme_all_comments method
# Use fixed crawling interval
crawl_interval = config.CRAWLER_MAX_SLEEP_SEC
await self.dy_client.get_aweme_all_comments(
aweme_id=aweme_id,
crawl_interval=random.random(),
crawl_interval=crawl_interval,
is_fetch_sub_comments=config.ENABLE_GET_SUB_COMMENTS,
callback=douyin_store.batch_update_dy_aweme_comments,
max_count=config.CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES,
)
# Sleep after fetching comments
await asyncio.sleep(crawl_interval)
utils.logger.info(f"[DouYinCrawler.get_comments] Sleeping for {crawl_interval} seconds after fetching comments for aweme {aweme_id}")
utils.logger.info(f"[DouYinCrawler.get_comments] aweme_id: {aweme_id} comments have all been obtained and filtered ...")
except DataFetchError as e:
utils.logger.error(f"[DouYinCrawler.get_comments] aweme_id: {aweme_id} get comments failed, error: {e}")
async def get_creators_and_videos(self) -> None:
"""
Get the information and videos of the specified creator
Get the information and videos of the specified creator from URLs or IDs
"""
utils.logger.info("[DouYinCrawler.get_creators_and_videos] Begin get douyin creators")
for user_id in config.DY_CREATOR_ID_LIST:
utils.logger.info("[DouYinCrawler.get_creators_and_videos] Parsing creator URLs...")
for creator_url in config.DY_CREATOR_ID_LIST:
try:
creator_info_parsed = parse_creator_info_from_url(creator_url)
user_id = creator_info_parsed.sec_user_id
utils.logger.info(f"[DouYinCrawler.get_creators_and_videos] Parsed sec_user_id: {user_id} from {creator_url}")
except ValueError as e:
utils.logger.error(f"[DouYinCrawler.get_creators_and_videos] Failed to parse creator URL: {e}")
continue
creator_info: Dict = await self.dy_client.get_user_info(user_id)
if creator_info:
await douyin_store.save_creator(user_id, creator=creator_info)


@@ -16,10 +16,15 @@
# @Desc : fetch the a_bogus parameter; for study and exchange only, not for commercial use; contact the author for removal in case of infringement
import random
import re
from typing import Optional
import execjs
from playwright.async_api import Page
from model.m_douyin import VideoUrlInfo, CreatorUrlInfo
from tools.crawler_util import extract_url_params_to_dict
douyin_sign_obj = execjs.compile(open('libs/douyin.js', encoding='utf-8-sig').read())
def get_web_id():
@@ -83,3 +88,103 @@ async def get_a_bogus_from_playright(params: str, post_data: dict, user_agent: s
return a_bogus
def parse_video_info_from_url(url: str) -> VideoUrlInfo:
"""
Parse the video ID from a douyin video URL
Supported formats:
1. Regular video link: https://www.douyin.com/video/7525082444551310602
2. Links with a modal_id parameter:
- https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?modal_id=7525082444551310602
- https://www.douyin.com/root/search/python?modal_id=7471165520058862848
3. Short link: https://v.douyin.com/iF12345ABC/ (must be resolved via the client)
4. Bare ID: 7525082444551310602
Args:
url: douyin video link or ID
Returns:
VideoUrlInfo: object containing the video ID
"""
# If the input is a bare numeric ID, return it directly
if url.isdigit():
return VideoUrlInfo(aweme_id=url, url_type="normal")
# Check for a short link (v.douyin.com)
if "v.douyin.com" in url or (url.startswith("http") and len(url) < 50 and "video" not in url):
return VideoUrlInfo(aweme_id="", url_type="short")  # must be resolved via the client
# Try to extract modal_id from the URL query parameters
params = extract_url_params_to_dict(url)
modal_id = params.get("modal_id")
if modal_id:
return VideoUrlInfo(aweme_id=modal_id, url_type="modal")
# Extract the ID from a standard video URL: /video/<digits>
video_pattern = r'/video/(\d+)'
match = re.search(video_pattern, url)
if match:
aweme_id = match.group(1)
return VideoUrlInfo(aweme_id=aweme_id, url_type="normal")
raise ValueError(f"Could not parse a video ID from URL: {url}")
def parse_creator_info_from_url(url: str) -> CreatorUrlInfo:
"""
Parse the creator ID (sec_user_id) from a douyin creator profile URL
Supported formats:
1. Creator profile: https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main
2. Bare ID: MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE
Args:
url: douyin creator profile link or sec_user_id
Returns:
CreatorUrlInfo: object containing the creator ID
"""
# If the input is a bare ID (usually starting with MS4wLjABAAAA), return it directly
if url.startswith("MS4wLjABAAAA") or (not url.startswith("http") and "douyin.com" not in url):
return CreatorUrlInfo(sec_user_id=url)
# Extract sec_user_id from the profile URL: /user/xxx
user_pattern = r'/user/([^/?]+)'
match = re.search(user_pattern, url)
if match:
sec_user_id = match.group(1)
return CreatorUrlInfo(sec_user_id=sec_user_id)
raise ValueError(f"Could not parse a creator ID from URL: {url}")
if __name__ == '__main__':
# Test video URL parsing
print("=== Video URL parsing tests ===")
test_urls = [
"https://www.douyin.com/video/7525082444551310602",
"https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main&modal_id=7525082444551310602",
"https://www.douyin.com/root/search/python?aid=b733a3b0-4662-4639-9a72-c2318fba9f3f&modal_id=7471165520058862848&type=general",
"7525082444551310602",
]
for url in test_urls:
try:
result = parse_video_info_from_url(url)
print(f"✓ URL: {url[:80]}...")
print(f" Result: {result}\n")
except Exception as e:
print(f"✗ URL: {url}")
print(f" Error: {e}\n")
# Test creator URL parsing
print("=== Creator URL parsing tests ===")
test_creator_urls = [
"https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main",
"MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE",
]
for url in test_creator_urls:
try:
result = parse_creator_info_from_url(url)
print(f"✓ URL: {url[:80]}...")
print(f" Result: {result}\n")
except Exception as e:
print(f"✗ URL: {url}")
print(f" Error: {e}\n")
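`extract_url_params_to_dict` is imported from `tools.crawler_util`, but its implementation isn't shown in this diff. Under the assumption that it returns the URL's query parameters as a flat dict, the `modal_id` lookup used above can be reproduced with the standard library:

```python
from urllib.parse import urlparse, parse_qs

def get_modal_id(url: str) -> str:
    """Return the modal_id query parameter, or "" when absent."""
    # parse_qs maps each key to a list of values; take the first one.
    query = parse_qs(urlparse(url).query)
    return query.get("modal_id", [""])[0]

url = "https://www.douyin.com/root/search/python?modal_id=7471165520058862848&type=general"
print(get_modal_id(url))  # 7471165520058862848
```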


@@ -11,7 +11,7 @@
import asyncio
import os
import random
# import random # Removed as we now use fixed config.CRAWLER_MAX_SLEEP_SEC intervals
import time
from asyncio import Task
from typing import Dict, List, Optional, Tuple
@@ -26,6 +26,7 @@ from playwright.async_api import (
import config
from base.base_crawler import AbstractCrawler
from model.m_kuaishou import VideoUrlInfo, CreatorUrlInfo
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
from store import kuaishou as kuaishou_store
from tools import utils
@@ -34,6 +35,7 @@ from var import comment_tasks_var, crawler_type_var, source_keyword_var
from .client import KuaiShouClient
from .exception import DataFetchError
from .help import parse_video_info_from_url, parse_creator_info_from_url
from .login import KuaishouLogin
@@ -159,20 +161,36 @@ class KuaishouCrawler(AbstractCrawler):
# batch fetch video comments
page += 1
# Sleep after page navigation
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[KuaishouCrawler.search] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after page {page-1}")
await self.batch_get_video_comments(video_id_list)
async def get_specified_videos(self):
"""Get the information and comments of the specified post"""
utils.logger.info("[KuaishouCrawler.get_specified_videos] Parsing video URLs...")
video_ids = []
for video_url in config.KS_SPECIFIED_ID_LIST:
try:
video_info = parse_video_info_from_url(video_url)
video_ids.append(video_info.video_id)
utils.logger.info(f"Parsed video ID: {video_info.video_id} from {video_url}")
except ValueError as e:
utils.logger.error(f"Failed to parse video URL: {e}")
continue
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
task_list = [
self.get_video_info_task(video_id=video_id, semaphore=semaphore)
for video_id in config.KS_SPECIFIED_ID_LIST
for video_id in video_ids
]
video_details = await asyncio.gather(*task_list)
for video_detail in video_details:
if video_detail is not None:
await kuaishou_store.update_kuaishou_video(video_detail)
await self.batch_get_video_comments(config.KS_SPECIFIED_ID_LIST)
await self.batch_get_video_comments(video_ids)
async def get_video_info_task(
self, video_id: str, semaphore: asyncio.Semaphore
@@ -181,6 +199,11 @@ class KuaishouCrawler(AbstractCrawler):
async with semaphore:
try:
result = await self.ks_client.get_video_info(video_id)
# Sleep after fetching video details
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[KuaishouCrawler.get_video_info_task] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after fetching video details {video_id}")
utils.logger.info(
f"[KuaishouCrawler.get_video_info_task] Get video_id:{video_id} info result: {result} ..."
)
@@ -234,9 +257,14 @@ class KuaishouCrawler(AbstractCrawler):
utils.logger.info(
f"[KuaishouCrawler.get_comments] begin get video_id: {video_id} comments ..."
)
# Sleep before fetching comments
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[KuaishouCrawler.get_comments] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds before fetching comments for video {video_id}")
await self.ks_client.get_video_all_comments(
photo_id=video_id,
crawl_interval=random.random(),
crawl_interval=config.CRAWLER_MAX_SLEEP_SEC,
callback=kuaishou_store.batch_update_ks_video_comments,
max_count=config.CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES,
)
@@ -352,16 +380,25 @@ class KuaishouCrawler(AbstractCrawler):
utils.logger.info(
"[KuaiShouCrawler.get_creators_and_videos] Begin get kuaishou creators"
)
for user_id in config.KS_CREATOR_ID_LIST:
# get creator detail info from web html content
createor_info: Dict = await self.ks_client.get_creator_info(user_id=user_id)
if createor_info:
await kuaishou_store.save_creator(user_id, creator=createor_info)
for creator_url in config.KS_CREATOR_ID_LIST:
try:
# Parse creator URL to get user_id
creator_info: CreatorUrlInfo = parse_creator_info_from_url(creator_url)
utils.logger.info(f"[KuaiShouCrawler.get_creators_and_videos] Parse creator URL info: {creator_info}")
user_id = creator_info.user_id
# get creator detail info from web html content
createor_info: Dict = await self.ks_client.get_creator_info(user_id=user_id)
if createor_info:
await kuaishou_store.save_creator(user_id, creator=createor_info)
except ValueError as e:
utils.logger.error(f"[KuaiShouCrawler.get_creators_and_videos] Failed to parse creator URL: {e}")
continue
# Get all video information of the creator
all_video_list = await self.ks_client.get_all_videos_by_creator(
user_id=user_id,
crawl_interval=random.random(),
crawl_interval=config.CRAWLER_MAX_SLEEP_SEC,
callback=self.fetch_creator_video_detail,
)


@@ -0,0 +1,99 @@
# Disclaimer: this code is for learning and research purposes only. Users must:
# 1. Not use it for any commercial purpose.
# 2. Follow the target platform's terms of service and robots.txt rules.
# 3. Not crawl at large scale or disrupt the platform's operation.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the platform.
# 5. Not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code, you agree to the principles above and all terms in the LICENSE.
# -*- coding: utf-8 -*-
import re
from model.m_kuaishou import VideoUrlInfo, CreatorUrlInfo
def parse_video_info_from_url(url: str) -> VideoUrlInfo:
"""
Parse the video ID from a kuaishou video URL
Supported formats:
1. Full video URL: "https://www.kuaishou.com/short-video/3x3zxz4mjrsc8ke?authorId=3x84qugg4ch9zhs&streamSource=search"
2. Bare video ID: "3x3zxz4mjrsc8ke"
Args:
url: kuaishou video link or video ID
Returns:
VideoUrlInfo: object containing the video ID
"""
# Treat input without http and without kuaishou.com as a bare ID
if not url.startswith("http") and "kuaishou.com" not in url:
return VideoUrlInfo(video_id=url, url_type="normal")
# Extract the ID from a standard video URL: /short-video/<video id>
video_pattern = r'/short-video/([a-zA-Z0-9_-]+)'
match = re.search(video_pattern, url)
if match:
video_id = match.group(1)
return VideoUrlInfo(video_id=video_id, url_type="normal")
raise ValueError(f"Could not parse a video ID from URL: {url}")
def parse_creator_info_from_url(url: str) -> CreatorUrlInfo:
"""
Parse the creator ID from a kuaishou creator profile URL
Supported formats:
1. Creator profile: "https://www.kuaishou.com/profile/3x84qugg4ch9zhs"
2. Bare ID: "3x4sm73aye7jq7i"
Args:
url: kuaishou creator profile link or user_id
Returns:
CreatorUrlInfo: object containing the creator ID
"""
# Treat input without http and without kuaishou.com as a bare ID
if not url.startswith("http") and "kuaishou.com" not in url:
return CreatorUrlInfo(user_id=url)
# Extract user_id from the profile URL: /profile/xxx
user_pattern = r'/profile/([a-zA-Z0-9_-]+)'
match = re.search(user_pattern, url)
if match:
user_id = match.group(1)
return CreatorUrlInfo(user_id=user_id)
raise ValueError(f"Could not parse a creator ID from URL: {url}")
if __name__ == '__main__':
# Test video URL parsing
print("=== Video URL parsing tests ===")
test_video_urls = [
"https://www.kuaishou.com/short-video/3x3zxz4mjrsc8ke?authorId=3x84qugg4ch9zhs&streamSource=search&area=searchxxnull&searchKey=python",
"3xf8enb8dbj6uig",
]
for url in test_video_urls:
try:
result = parse_video_info_from_url(url)
print(f"✓ URL: {url[:80]}...")
print(f" Result: {result}\n")
except Exception as e:
print(f"✗ URL: {url}")
print(f" Error: {e}\n")
# Test creator URL parsing
print("=== Creator URL parsing tests ===")
test_creator_urls = [
"https://www.kuaishou.com/profile/3x84qugg4ch9zhs",
"3x4sm73aye7jq7i",
]
for url in test_creator_urls:
try:
result = parse_creator_info_from_url(url)
print(f"✓ URL: {url[:80]}...")
print(f" Result: {result}\n")
except Exception as e:
print(f"✗ URL: {url}")
print(f" Error: {e}\n")


@@ -11,7 +11,7 @@
import asyncio
import os
import random
# import random # Removed as we now use fixed config.CRAWLER_MAX_SLEEP_SEC intervals
from asyncio import Task
from typing import Dict, List, Optional, Tuple
@@ -141,6 +141,11 @@ class TieBaCrawler(AbstractCrawler):
await self.get_specified_notes(
note_id_list=[note_detail.note_id for note_detail in notes_list]
)
# Sleep after page navigation
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[TieBaCrawler.search] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after page {page}")
page += 1
except Exception as ex:
utils.logger.error(
@@ -178,6 +183,11 @@ class TieBaCrawler(AbstractCrawler):
f"[BaiduTieBaCrawler.get_specified_tieba_notes] tieba name: {tieba_name} note list len: {len(note_list)}"
)
await self.get_specified_notes([note.note_id for note in note_list])
# Sleep after processing notes
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[TieBaCrawler.get_specified_tieba_notes] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after processing notes from page {page_number}")
page_number += tieba_limit_count
async def get_specified_notes(
@@ -222,6 +232,11 @@ class TieBaCrawler(AbstractCrawler):
f"[BaiduTieBaCrawler.get_note_detail] Begin get note detail, note_id: {note_id}"
)
note_detail: TiebaNote = await self.tieba_client.get_note_by_id(note_id)
# Sleep after fetching note details
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[TieBaCrawler.get_note_detail_async_task] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after fetching note details {note_id}")
if not note_detail:
utils.logger.error(
f"[BaiduTieBaCrawler.get_note_detail] Get note detail error, note_id: {note_id}"
@@ -277,9 +292,14 @@ class TieBaCrawler(AbstractCrawler):
utils.logger.info(
f"[BaiduTieBaCrawler.get_comments] Begin get note id comments {note_detail.note_id}"
)
# Sleep before fetching comments
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[TieBaCrawler.get_comments_async_task] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds before fetching comments for note {note_detail.note_id}")
await self.tieba_client.get_note_all_comments(
note_detail=note_detail,
crawl_interval=random.random(),
crawl_interval=config.CRAWLER_MAX_SLEEP_SEC,
callback=tieba_store.batch_update_tieba_note_comments,
max_count=config.CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES,
)
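The recurring change in these hunks — replace `random`-based delays with a fixed `asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)` after each request — can be sketched as a small helper. Everything below (`throttled`, `fetch_page`, the 0.01-second interval) is illustrative and not part of the repository.

```python
import asyncio

CRAWLER_MAX_SLEEP_SEC = 0.01  # stand-in for config.CRAWLER_MAX_SLEEP_SEC


async def throttled(coro_fn, *args, sleep_sec: float = CRAWLER_MAX_SLEEP_SEC):
    """Run one crawl step, then sleep a fixed interval (instead of random.random())."""
    result = await coro_fn(*args)
    await asyncio.sleep(sleep_sec)
    return result


async def fetch_page(page: int) -> str:
    # Stand-in for a real page fetch
    return f"page-{page}"


async def main() -> list:
    return [await throttled(fetch_page, p) for p in range(1, 4)]


pages = asyncio.run(main())
```

A fixed interval makes request pacing predictable; the trade-off is a more regular traffic pattern than the jittered `random.uniform` delays it replaces.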

View File

@@ -256,8 +256,8 @@ class WeiboClient:
return None
else:
return response.content
except httpx.HTTPStatusError as exc: # some wrong when call httpx.request method, such as connection error, client error or server error
utils.logger.error(f"[DouYinClient.get_aweme_media] {exc}")
        except httpx.HTTPError as exc:  # something went wrong when calling httpx.request, such as a connection error, client error, server error, or a non-2xx response status
            utils.logger.error(f"[WeiboClient.get_note_image] {exc.__class__.__name__} for {exc.request.url} - {exc}")  # keep the original exception class name so developers can debug
return None
async def get_creator_container_info(self, creator_id: str) -> Dict:

View File

@@ -15,7 +15,7 @@
import asyncio
import os
import random
# import random # Removed as we now use fixed config.CRAWLER_MAX_SLEEP_SEC intervals
from asyncio import Task
from typing import Dict, List, Optional, Tuple
@@ -160,6 +160,11 @@ class WeiboCrawler(AbstractCrawler):
await self.get_note_images(mblog)
page += 1
# Sleep after page navigation
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[WeiboCrawler.search] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after page {page-1}")
await self.batch_get_notes_comments(note_id_list)
async def get_specified_notes(self):
@@ -185,6 +190,11 @@ class WeiboCrawler(AbstractCrawler):
async with semaphore:
try:
result = await self.wb_client.get_note_info_by_id(note_id)
# Sleep after fetching note details
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[WeiboCrawler.get_note_info_task] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after fetching note details {note_id}")
return result
except DataFetchError as ex:
utils.logger.error(f"[WeiboCrawler.get_note_info_task] Get note detail error: {ex}")
@@ -221,9 +231,14 @@ class WeiboCrawler(AbstractCrawler):
async with semaphore:
try:
utils.logger.info(f"[WeiboCrawler.get_note_comments] begin get note_id: {note_id} comments ...")
# Sleep before fetching comments
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[WeiboCrawler.get_note_comments] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds before fetching comments for note {note_id}")
await self.wb_client.get_note_all_comments(
note_id=note_id,
                        crawl_interval=random.randint(1, 3),  # Weibo rate-limits its API aggressively, so use a longer delay
crawl_interval=config.CRAWLER_MAX_SLEEP_SEC, # Use fixed interval instead of random
callback=weibo_store.batch_update_weibo_note_comments,
max_count=config.CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES,
)
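These weibo hunks keep the `async with semaphore:` pattern around each task. A minimal, self-contained sketch of semaphore-bounded fan-out (all names here are illustrative):

```python
import asyncio


async def fetch_comments(note_id: str, semaphore: asyncio.Semaphore) -> str:
    # The semaphore caps how many fetches run concurrently
    async with semaphore:
        await asyncio.sleep(0)  # stand-in for the real network call
        return f"comments-for-{note_id}"


async def main() -> list:
    semaphore = asyncio.Semaphore(2)  # at most 2 concurrent fetches
    tasks = [fetch_comments(str(i), semaphore) for i in range(5)]
    return await asyncio.gather(*tasks)  # gather preserves task order


results = asyncio.run(main())
```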
@@ -250,7 +265,8 @@ class WeiboCrawler(AbstractCrawler):
if not url:
continue
content = await self.wb_client.get_note_image(url)
await asyncio.sleep(random.random())
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[WeiboCrawler.get_note_images] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after fetching image")
                if content is not None:
extension_file_name = url.split(".")[-1]
await weibo_store.update_weibo_note_image(pic["pid"], content, extension_file_name)
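The context line above derives the file extension with `url.split(".")[-1]`, which also captures any query string for URLs like `...abc.png?x=1`. A hedged stdlib alternative (the function name is illustrative, not from the repo):

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def image_extension(url: str, default: str = "jpg") -> str:
    """Extension of the URL's path component only, ignoring any query string."""
    suffix = PurePosixPath(urlparse(url).path).suffix
    return suffix.lstrip(".") or default
```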

View File

@@ -26,6 +26,7 @@ from html import unescape
from .exception import DataFetchError, IPBlockError
from .field import SearchNoteType, SearchSortType
from .help import get_search_id, sign
from .extractor import XiaoHongShuExtractor
class XiaoHongShuClient(AbstractApiClient):
@@ -50,6 +51,7 @@ class XiaoHongShuClient(AbstractApiClient):
self.NOTE_ABNORMAL_CODE = -510001
self.playwright_page = playwright_page
self.cookie_dict = cookie_dict
self._extractor = XiaoHongShuExtractor()
async def _pre_headers(self, url: str, data=None) -> Dict:
"""
@@ -61,7 +63,9 @@ class XiaoHongShuClient(AbstractApiClient):
Returns:
"""
encrypt_params = await self.playwright_page.evaluate("([url, data]) => window._webmsxyw(url,data)", [url, data])
encrypt_params = await self.playwright_page.evaluate(
"([url, data]) => window._webmsxyw(url,data)", [url, data]
)
local_storage = await self.playwright_page.evaluate("() => window.localStorage")
signs = sign(
a1=self.cookie_dict.get("a1", ""),
@@ -128,7 +132,9 @@ class XiaoHongShuClient(AbstractApiClient):
if isinstance(params, dict):
final_uri = f"{uri}?" f"{urlencode(params)}"
headers = await self._pre_headers(final_uri)
return await self.request(method="GET", url=f"{self._host}{final_uri}", headers=headers)
return await self.request(
method="GET", url=f"{self._host}{final_uri}", headers=headers
)
async def post(self, uri: str, data: dict, **kwargs) -> Dict:
"""
@@ -156,12 +162,18 @@ class XiaoHongShuClient(AbstractApiClient):
response = await client.request("GET", url, timeout=self.timeout)
response.raise_for_status()
if not response.reason_phrase == "OK":
utils.logger.error(f"[XiaoHongShuClient.get_note_media] request {url} err, res:{response.text}")
utils.logger.error(
f"[XiaoHongShuClient.get_note_media] request {url} err, res:{response.text}"
)
return None
else:
return response.content
except httpx.HTTPStatusError as exc: # some wrong when call httpx.request method, such as connection error, client error or server error
utils.logger.error(f"[DouYinClient.get_aweme_media] {exc}")
        except httpx.HTTPError as exc:  # something went wrong when calling httpx.request, such as a connection error, client error, server error, or a non-2xx response status
            utils.logger.error(
                f"[XiaoHongShuClient.get_note_media] {exc.__class__.__name__} for {exc.request.url} - {exc}"
            )  # keep the original exception class name so developers can debug
return None
async def pong(self) -> bool:
@@ -178,7 +190,9 @@ class XiaoHongShuClient(AbstractApiClient):
if note_card.get("items"):
ping_flag = True
except Exception as e:
utils.logger.error(f"[XiaoHongShuClient.pong] Ping xhs failed: {e}, and try to login again...")
utils.logger.error(
f"[XiaoHongShuClient.pong] Ping xhs failed: {e}, and try to login again..."
)
ping_flag = False
return ping_flag
@@ -249,9 +263,7 @@ class XiaoHongShuClient(AbstractApiClient):
data = {
"source_note_id": note_id,
"image_formats": ["jpg", "webp", "avif"],
"extra": {
"need_body_topic": 1
},
"extra": {"need_body_topic": 1},
"xsec_source": xsec_source,
"xsec_token": xsec_token,
}
@@ -261,7 +273,9 @@ class XiaoHongShuClient(AbstractApiClient):
res_dict: Dict = res["items"][0]["note_card"]
return res_dict
        # Frequent crawling may cause some notes to return results while others return nothing
utils.logger.error(f"[XiaoHongShuClient.get_note_by_id] get note id:{note_id} empty and res:{res}")
utils.logger.error(
f"[XiaoHongShuClient.get_note_by_id] get note id:{note_id} empty and res:{res}"
)
return dict()
async def get_note_comments(
@@ -345,15 +359,19 @@ class XiaoHongShuClient(AbstractApiClient):
comments_has_more = True
comments_cursor = ""
while comments_has_more and len(result) < max_count:
comments_res = await self.get_note_comments(note_id=note_id, xsec_token=xsec_token, cursor=comments_cursor)
comments_res = await self.get_note_comments(
note_id=note_id, xsec_token=xsec_token, cursor=comments_cursor
)
comments_has_more = comments_res.get("has_more", False)
comments_cursor = comments_res.get("cursor", "")
if "comments" not in comments_res:
utils.logger.info(f"[XiaoHongShuClient.get_note_all_comments] No 'comments' key found in response: {comments_res}")
utils.logger.info(
f"[XiaoHongShuClient.get_note_all_comments] No 'comments' key found in response: {comments_res}"
)
break
comments = comments_res["comments"]
if len(result) + len(comments) > max_count:
comments = comments[:max_count - len(result)]
comments = comments[: max_count - len(result)]
if callback:
await callback(note_id, comments)
await asyncio.sleep(crawl_interval)
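The loop above is cursor pagination with a `max_count` cap: follow `cursor` while `has_more` is set, and trim the final page so the cap is exact. A self-contained sketch of the same shape against hypothetical in-memory pages:

```python
import asyncio

# Hypothetical pages keyed by cursor, mimicking the has_more/cursor response shape
PAGES = {
    "": {"comments": [1, 2], "cursor": "c1", "has_more": True},
    "c1": {"comments": [3, 4], "cursor": "c2", "has_more": True},
    "c2": {"comments": [5, 6], "cursor": "", "has_more": False},
}


async def get_all_comments(max_count: int) -> list:
    result, cursor, has_more = [], "", True
    while has_more and len(result) < max_count:
        page = PAGES[cursor]
        has_more, cursor = page["has_more"], page["cursor"]
        comments = page["comments"]
        if len(result) + len(comments) > max_count:
            comments = comments[: max_count - len(result)]  # trim the final page
        result.extend(comments)
    return result


all_comments = asyncio.run(get_all_comments(max_count=5))
```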
@@ -386,7 +404,9 @@ class XiaoHongShuClient(AbstractApiClient):
"""
if not config.ENABLE_GET_SUB_COMMENTS:
utils.logger.info(f"[XiaoHongShuCrawler.get_comments_all_sub_comments] Crawling sub_comment mode is not enabled")
utils.logger.info(
f"[XiaoHongShuCrawler.get_comments_all_sub_comments] Crawling sub_comment mode is not enabled"
)
return []
result = []
@@ -413,12 +433,16 @@ class XiaoHongShuClient(AbstractApiClient):
)
if comments_res is None:
utils.logger.info(f"[XiaoHongShuClient.get_comments_all_sub_comments] No response found for note_id: {note_id}")
utils.logger.info(
f"[XiaoHongShuClient.get_comments_all_sub_comments] No response found for note_id: {note_id}"
)
continue
sub_comment_has_more = comments_res.get("has_more", False)
sub_comment_cursor = comments_res.get("cursor", "")
if "comments" not in comments_res:
utils.logger.info(f"[XiaoHongShuClient.get_comments_all_sub_comments] No 'comments' key found in response: {comments_res}")
utils.logger.info(
f"[XiaoHongShuClient.get_comments_all_sub_comments] No 'comments' key found in response: {comments_res}"
)
break
comments = comments_res["comments"]
if callback:
@@ -427,23 +451,30 @@ class XiaoHongShuClient(AbstractApiClient):
result.extend(comments)
return result
async def get_creator_info(self, user_id: str) -> Dict:
async def get_creator_info(
self, user_id: str, xsec_token: str = "", xsec_source: str = ""
) -> Dict:
"""
        Get brief creator info by parsing the web-version user profile HTML
        The PC user profile page carries the data on window.__INITIAL_STATE__; parse that variable
        eg: https://www.xiaohongshu.com/user/profile/59d8cb33de5fb4696bf17217
        Args:
            user_id: user ID
            xsec_token: verification token (optional; pass it if the URL contains this param)
            xsec_source: channel source (optional; pass it if the URL contains this param)
        Returns:
            Dict: creator info
        """
        # Build the URI; append the xsec params when present
uri = f"/user/profile/{user_id}"
html_content = await self.request("GET", self._domain + uri, return_response=True, headers=self.headers)
match = re.search(r"<script>window.__INITIAL_STATE__=(.+)<\/script>", html_content, re.M)
if xsec_token and xsec_source:
uri = f"{uri}?xsec_token={xsec_token}&xsec_source={xsec_source}"
if match is None:
return {}
info = json.loads(match.group(1).replace(":undefined", ":null"), strict=False)
if info is None:
return {}
return info.get("user").get("userPageData")
html_content = await self.request(
"GET", self._domain + uri, return_response=True, headers=self.headers
)
return self._extractor.extract_creator_info_from_html(html_content)
async def get_notes_by_creator(
self,
@@ -492,17 +523,23 @@ class XiaoHongShuClient(AbstractApiClient):
while notes_has_more and len(result) < config.CRAWLER_MAX_NOTES_COUNT:
notes_res = await self.get_notes_by_creator(user_id, notes_cursor)
if not notes_res:
utils.logger.error(f"[XiaoHongShuClient.get_notes_by_creator] The current creator may have been banned by xhs, so they cannot access the data.")
utils.logger.error(
f"[XiaoHongShuClient.get_notes_by_creator] The current creator may have been banned by xhs, so they cannot access the data."
)
break
notes_has_more = notes_res.get("has_more", False)
notes_cursor = notes_res.get("cursor", "")
if "notes" not in notes_res:
utils.logger.info(f"[XiaoHongShuClient.get_all_notes_by_creator] No 'notes' key found in response: {notes_res}")
utils.logger.info(
f"[XiaoHongShuClient.get_all_notes_by_creator] No 'notes' key found in response: {notes_res}"
)
break
notes = notes_res["notes"]
utils.logger.info(f"[XiaoHongShuClient.get_all_notes_by_creator] got user_id:{user_id} notes len : {len(notes)}")
utils.logger.info(
f"[XiaoHongShuClient.get_all_notes_by_creator] got user_id:{user_id} notes len : {len(notes)}"
)
remaining = config.CRAWLER_MAX_NOTES_COUNT - len(result)
if remaining <= 0:
@@ -515,7 +552,9 @@ class XiaoHongShuClient(AbstractApiClient):
result.extend(notes_to_add)
await asyncio.sleep(crawl_interval)
utils.logger.info(f"[XiaoHongShuClient.get_all_notes_by_creator] Finished getting notes for user {user_id}, total: {len(result)}")
utils.logger.info(
f"[XiaoHongShuClient.get_all_notes_by_creator] Finished getting notes for user {user_id}, total: {len(result)}"
)
return result
async def get_note_short_url(self, note_id: str) -> Dict:
@@ -552,41 +591,17 @@ class XiaoHongShuClient(AbstractApiClient):
Returns:
"""
def camel_to_underscore(key):
return re.sub(r"(?<!^)(?=[A-Z])", "_", key).lower()
def transform_json_keys(json_data):
data_dict = json.loads(json_data)
dict_new = {}
for key, value in data_dict.items():
new_key = camel_to_underscore(key)
if not value:
dict_new[new_key] = value
elif isinstance(value, dict):
dict_new[new_key] = transform_json_keys(json.dumps(value))
elif isinstance(value, list):
dict_new[new_key] = [(transform_json_keys(json.dumps(item)) if (item and isinstance(item, dict)) else item) for item in value]
else:
dict_new[new_key] = value
return dict_new
url = ("https://www.xiaohongshu.com/explore/" + note_id + f"?xsec_token={xsec_token}&xsec_source={xsec_source}")
url = (
"https://www.xiaohongshu.com/explore/"
+ note_id
+ f"?xsec_token={xsec_token}&xsec_source={xsec_source}"
)
copy_headers = self.headers.copy()
if not enable_cookie:
del copy_headers["Cookie"]
html = await self.request(method="GET", url=url, return_response=True, headers=copy_headers)
html = await self.request(
method="GET", url=url, return_response=True, headers=copy_headers
)
def get_note_dict(html):
state = re.findall(r"window.__INITIAL_STATE__=({.*})</script>", html)[0].replace("undefined", '""')
if state != "{}":
note_dict = transform_json_keys(state)
return note_dict["note"]["note_detail_map"][note_id]["note"]
return {}
try:
return get_note_dict(html)
except:
return None
return self._extractor.extract_note_detail_from_html(note_id, html)

View File

@@ -11,9 +11,8 @@
import asyncio
import os
import random
import time
from asyncio import Task
from typing import Dict, List, Optional, Tuple
from typing import Dict, List, Optional
from playwright.async_api import (
BrowserContext,
@@ -27,7 +26,7 @@ from tenacity import RetryError
import config
from base.base_crawler import AbstractCrawler
from config import CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES
from model.m_xiaohongshu import NoteUrlInfo
from model.m_xiaohongshu import NoteUrlInfo, CreatorUrlInfo
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
from store import xhs as xhs_store
from tools import utils
@@ -37,7 +36,7 @@ from var import crawler_type_var, source_keyword_var
from .client import XiaoHongShuClient
from .exception import DataFetchError
from .field import SearchSortType
from .help import parse_note_info_from_note_url, get_search_id
from .help import parse_note_info_from_note_url, parse_creator_info_from_url, get_search_id
from .login import XiaoHongShuLogin
@@ -164,6 +163,10 @@ class XiaoHongShuCrawler(AbstractCrawler):
page += 1
utils.logger.info(f"[XiaoHongShuCrawler.search] Note details: {note_details}")
await self.batch_get_note_comments(note_ids, xsec_tokens)
# Sleep after each page navigation
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[XiaoHongShuCrawler.search] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after page {page-1}")
except DataFetchError:
utils.logger.error("[XiaoHongShuCrawler.search] Get note detail error")
break
@@ -171,17 +174,27 @@ class XiaoHongShuCrawler(AbstractCrawler):
async def get_creators_and_notes(self) -> None:
"""Get creator's notes and retrieve their comment information."""
utils.logger.info("[XiaoHongShuCrawler.get_creators_and_notes] Begin get xiaohongshu creators")
for user_id in config.XHS_CREATOR_ID_LIST:
# get creator detail info from web html content
createor_info: Dict = await self.xhs_client.get_creator_info(user_id=user_id)
if createor_info:
await xhs_store.save_creator(user_id, creator=createor_info)
for creator_url in config.XHS_CREATOR_ID_LIST:
try:
# Parse creator URL to get user_id and security tokens
creator_info: CreatorUrlInfo = parse_creator_info_from_url(creator_url)
utils.logger.info(f"[XiaoHongShuCrawler.get_creators_and_notes] Parse creator URL info: {creator_info}")
user_id = creator_info.user_id
# When proxy is not enabled, increase the crawling interval
if config.ENABLE_IP_PROXY:
crawl_interval = random.random()
else:
crawl_interval = random.uniform(1, config.CRAWLER_MAX_SLEEP_SEC)
# get creator detail info from web html content
createor_info: Dict = await self.xhs_client.get_creator_info(
user_id=user_id,
xsec_token=creator_info.xsec_token,
xsec_source=creator_info.xsec_source
)
if createor_info:
await xhs_store.save_creator(user_id, creator=createor_info)
except ValueError as e:
utils.logger.error(f"[XiaoHongShuCrawler.get_creators_and_notes] Failed to parse creator URL: {e}")
continue
# Use fixed crawling interval
crawl_interval = config.CRAWLER_MAX_SLEEP_SEC
# Get all note information of the creator
all_notes_list = await self.xhs_client.get_all_notes_by_creator(
user_id=user_id,
@@ -271,7 +284,7 @@ class XiaoHongShuCrawler(AbstractCrawler):
try:
note_detail = await self.xhs_client.get_note_by_id(note_id, xsec_source, xsec_token)
except RetryError as e:
except RetryError:
pass
if not note_detail:
@@ -280,6 +293,11 @@ class XiaoHongShuCrawler(AbstractCrawler):
raise Exception(f"[get_note_detail_async_task] Failed to get note detail, Id: {note_id}")
note_detail.update({"xsec_token": xsec_token, "xsec_source": xsec_source})
# Sleep after fetching note detail
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[get_note_detail_async_task] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after fetching note {note_id}")
return note_detail
except DataFetchError as ex:
@@ -310,11 +328,8 @@ class XiaoHongShuCrawler(AbstractCrawler):
"""Get note comments with keyword filtering and quantity limitation"""
async with semaphore:
utils.logger.info(f"[XiaoHongShuCrawler.get_comments] Begin get note id comments {note_id}")
# When proxy is not enabled, increase the crawling interval
if config.ENABLE_IP_PROXY:
crawl_interval = random.random()
else:
crawl_interval = random.uniform(1, config.CRAWLER_MAX_SLEEP_SEC)
# Use fixed crawling interval
crawl_interval = config.CRAWLER_MAX_SLEEP_SEC
await self.xhs_client.get_note_all_comments(
note_id=note_id,
xsec_token=xsec_token,
@@ -322,6 +337,10 @@ class XiaoHongShuCrawler(AbstractCrawler):
callback=xhs_store.batch_update_xhs_note_comments,
max_count=CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES,
)
# Sleep after fetching comments
await asyncio.sleep(crawl_interval)
utils.logger.info(f"[XiaoHongShuCrawler.get_comments] Sleeping for {crawl_interval} seconds after fetching comments for note {note_id}")
async def create_xhs_client(self, httpx_proxy: Optional[str]) -> XiaoHongShuClient:
"""Create xhs client"""

View File

@@ -0,0 +1,60 @@
# Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operation.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the detailed license terms.
# Using this code signifies your agreement to the above principles and all terms in the LICENSE.
import json
import re
from typing import Dict, Optional
import humps
class XiaoHongShuExtractor:
def __init__(self):
pass
def extract_note_detail_from_html(self, note_id: str, html: str) -> Optional[Dict]:
"""从html中提取笔记详情
Args:
html (str): html字符串
Returns:
Dict: 笔记详情字典
"""
if "noteDetailMap" not in html:
# 这种情况要么是出了验证码了,要么是笔记不存在
return None
state = re.findall(r"window.__INITIAL_STATE__=({.*})</script>", html)[
0
].replace("undefined", '""')
if state != "{}":
note_dict = humps.decamelize(json.loads(state))
return note_dict["note"]["note_detail_map"][note_id]["note"]
return None
def extract_creator_info_from_html(self, html: str) -> Optional[Dict]:
"""从html中提取用户信息
Args:
html (str): html字符串
Returns:
Dict: 用户信息字典
"""
match = re.search(
r"<script>window.__INITIAL_STATE__=(.+)<\/script>", html, re.M
)
if match is None:
return None
info = json.loads(match.group(1).replace(":undefined", ":null"), strict=False)
if info is None:
return None
return info.get("user").get("userPageData")
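Both extractor methods parse the JSON embedded in `window.__INITIAL_STATE__`; `extract_note_detail_from_html` then snake_cases the keys with `humps.decamelize`. A stdlib sketch of that flow on synthetic HTML — the `decamelize` helper below is a stand-in for the `humps` call, and the HTML string is made up:

```python
import json
import re


def decamelize(obj):
    """Stdlib stand-in for humps.decamelize: camelCase keys -> snake_case, recursively."""
    if isinstance(obj, dict):
        return {
            re.sub(r"(?<!^)(?=[A-Z])", "_", k).lower(): decamelize(v)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [decamelize(v) for v in obj]
    return obj


html = '<html><script>window.__INITIAL_STATE__={"noteDetailMap":{"n1":{"note":{"noteId":"n1"}}}}</script></html>'
state = re.findall(r"window.__INITIAL_STATE__=({.*})</script>", html)[0].replace("undefined", '""')
note = decamelize(json.loads(state))["note_detail_map"]["n1"]["note"]
```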

View File

@@ -15,7 +15,7 @@ import random
import time
import urllib.parse
from model.m_xiaohongshu import NoteUrlInfo
from model.m_xiaohongshu import NoteUrlInfo, CreatorUrlInfo
from tools.crawler_util import extract_url_params_to_dict
@@ -306,6 +306,37 @@ def parse_note_info_from_note_url(url: str) -> NoteUrlInfo:
return NoteUrlInfo(note_id=note_id, xsec_token=xsec_token, xsec_source=xsec_source)
def parse_creator_info_from_url(url: str) -> CreatorUrlInfo:
"""
从小红书创作者主页URL中解析出创作者信息
支持以下格式:
1. 完整URL: "https://www.xiaohongshu.com/user/profile/5eb8e1d400000000010075ae?xsec_token=AB1nWBKCo1vE2HEkfoJUOi5B6BE5n7wVrbdpHoWIj5xHw=&xsec_source=pc_feed"
2. 纯ID: "5eb8e1d400000000010075ae"
Args:
url: 创作者主页URL或user_id
Returns:
CreatorUrlInfo: 包含user_id, xsec_token, xsec_source的对象
"""
# 如果是纯ID格式(24位十六进制字符),直接返回
if len(url) == 24 and all(c in "0123456789abcdef" for c in url):
return CreatorUrlInfo(user_id=url, xsec_token="", xsec_source="")
    # Extract user_id from the URL path: /user/profile/xxx
import re
user_pattern = r'/user/profile/([^/?]+)'
match = re.search(user_pattern, url)
if match:
user_id = match.group(1)
        # Extract the xsec_token and xsec_source params
params = extract_url_params_to_dict(url)
xsec_token = params.get("xsec_token", "")
xsec_source = params.get("xsec_source", "")
return CreatorUrlInfo(user_id=user_id, xsec_token=xsec_token, xsec_source=xsec_source)
    raise ValueError(f"Failed to parse creator info from URL: {url}")
if __name__ == '__main__':
_img_url = "https://sns-img-bd.xhscdn.com/7a3abfaf-90c1-a828-5de7-022c80b92aa3"
    # Get the URLs of one image address across multiple CDNs
@@ -313,4 +344,19 @@ if __name__ == '__main__':
final_img_url = get_img_url_by_trace_id(get_trace_id(_img_url))
print(final_img_url)
    # Test creator URL parsing
    print("\n=== Creator URL parsing test ===")
    test_creator_urls = [
        "https://www.xiaohongshu.com/user/profile/5eb8e1d400000000010075ae?xsec_token=AB1nWBKCo1vE2HEkfoJUOi5B6BE5n7wVrbdpHoWIj5xHw=&xsec_source=pc_feed",
        "5eb8e1d400000000010075ae",
    ]
    for url in test_creator_urls:
        try:
            result = parse_creator_info_from_url(url)
            print(f"✓ URL: {url[:80]}...")
            print(f"  Result: {result}\n")
        except Exception as e:
            print(f"✗ URL: {url}")
            print(f"  Error: {e}\n")

View File

@@ -12,7 +12,7 @@
# -*- coding: utf-8 -*-
import asyncio
import os
import random
# import random # Removed as we now use fixed config.CRAWLER_MAX_SLEEP_SEC intervals
from asyncio import Task
from typing import Dict, List, Optional, Tuple, cast
@@ -170,6 +170,10 @@ class ZhihuCrawler(AbstractCrawler):
utils.logger.info("No more content!")
break
# Sleep after page navigation
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[ZhihuCrawler.search] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after page {page-1}")
page += 1
for content in content_list:
await zhihu_store.update_zhihu_content(content)
@@ -219,9 +223,14 @@ class ZhihuCrawler(AbstractCrawler):
utils.logger.info(
f"[ZhihuCrawler.get_comments] Begin get note id comments {content_item.content_id}"
)
# Sleep before fetching comments
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[ZhihuCrawler.get_comments] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds before fetching comments for content {content_item.content_id}")
await self.zhihu_client.get_note_all_comments(
content=content_item,
crawl_interval=random.random(),
crawl_interval=config.CRAWLER_MAX_SLEEP_SEC,
callback=zhihu_store.batch_update_zhihu_note_comments,
)
@@ -259,21 +268,21 @@ class ZhihuCrawler(AbstractCrawler):
# Get all anwser information of the creator
all_content_list = await self.zhihu_client.get_all_anwser_by_creator(
creator=createor_info,
crawl_interval=random.random(),
crawl_interval=config.CRAWLER_MAX_SLEEP_SEC,
callback=zhihu_store.batch_update_zhihu_contents,
)
# Get all articles of the creator's contents
# all_content_list = await self.zhihu_client.get_all_articles_by_creator(
# creator=createor_info,
# crawl_interval=random.random(),
# crawl_interval=config.CRAWLER_MAX_SLEEP_SEC,
# callback=zhihu_store.batch_update_zhihu_contents
# )
# Get all videos of the creator's contents
# all_content_list = await self.zhihu_client.get_all_videos_by_creator(
# creator=createor_info,
# crawl_interval=random.random(),
# crawl_interval=config.CRAWLER_MAX_SLEEP_SEC,
# callback=zhihu_store.batch_update_zhihu_contents
# )
@@ -304,21 +313,39 @@ class ZhihuCrawler(AbstractCrawler):
utils.logger.info(
f"[ZhihuCrawler.get_specified_notes] Get answer info, question_id: {question_id}, answer_id: {answer_id}"
)
return await self.zhihu_client.get_answer_info(question_id, answer_id)
result = await self.zhihu_client.get_answer_info(question_id, answer_id)
# Sleep after fetching answer details
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[ZhihuCrawler.get_note_detail] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after fetching answer details {answer_id}")
return result
elif note_type == constant.ARTICLE_NAME:
article_id = full_note_url.split("/")[-1]
utils.logger.info(
f"[ZhihuCrawler.get_specified_notes] Get article info, article_id: {article_id}"
)
return await self.zhihu_client.get_article_info(article_id)
result = await self.zhihu_client.get_article_info(article_id)
# Sleep after fetching article details
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[ZhihuCrawler.get_note_detail] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after fetching article details {article_id}")
return result
elif note_type == constant.VIDEO_NAME:
video_id = full_note_url.split("/")[-1]
utils.logger.info(
f"[ZhihuCrawler.get_specified_notes] Get video info, video_id: {video_id}"
)
return await self.zhihu_client.get_video_info(video_id)
result = await self.zhihu_client.get_video_info(video_id)
# Sleep after fetching video details
await asyncio.sleep(config.CRAWLER_MAX_SLEEP_SEC)
utils.logger.info(f"[ZhihuCrawler.get_note_detail] Sleeping for {config.CRAWLER_MAX_SLEEP_SEC} seconds after fetching video details {video_id}")
return result
async def get_specified_notes(self):
"""

25
model/m_bilibili.py Normal file
View File

@@ -0,0 +1,25 @@
# Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operation.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the detailed license terms.
# Using this code signifies your agreement to the above principles and all terms in the LICENSE.
# -*- coding: utf-8 -*-
from pydantic import BaseModel, Field
class VideoUrlInfo(BaseModel):
"""B站视频URL信息"""
video_id: str = Field(title="video id (BV id)")
video_type: str = Field(default="video", title="video type")
class CreatorUrlInfo(BaseModel):
"""B站创作者URL信息"""
creator_id: str = Field(title="creator id (UID)")

View File

@@ -1,12 +1,25 @@
# Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operation.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the platform.
# Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operation.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the detailed license terms.
# Using this code signifies your agreement to the above principles and all terms in the LICENSE.
#
# See the LICENSE file in the project root for the detailed license terms.
# Using this code signifies your agreement to the above principles and all terms in the LICENSE.
# -*- coding: utf-8 -*-
from pydantic import BaseModel, Field
class VideoUrlInfo(BaseModel):
"""抖音视频URL信息"""
aweme_id: str = Field(title="aweme id (video id)")
url_type: str = Field(default="normal", title="url type: normal, short, modal")
class CreatorUrlInfo(BaseModel):
"""抖音创作者URL信息"""
sec_user_id: str = Field(title="sec_user_id (creator id)")

View File

@@ -1,12 +1,25 @@
# Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operation.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the platform.
# Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operation.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the detailed license terms.
# Using this code signifies your agreement to the above principles and all terms in the LICENSE.
#
# See the LICENSE file in the project root for the detailed license terms.
# Using this code signifies your agreement to the above principles and all terms in the LICENSE.
# -*- coding: utf-8 -*-
from pydantic import BaseModel, Field
class VideoUrlInfo(BaseModel):
"""快手视频URL信息"""
video_id: str = Field(title="video id (photo id)")
url_type: str = Field(default="normal", title="url type: normal")
class CreatorUrlInfo(BaseModel):
"""快手创作者URL信息"""
user_id: str = Field(title="user id (creator id)")

View File

@@ -18,4 +18,11 @@ from pydantic import BaseModel, Field
class NoteUrlInfo(BaseModel):
note_id: str = Field(title="note id")
xsec_token: str = Field(title="xsec token")
xsec_source: str = Field(title="xsec source")
xsec_source: str = Field(title="xsec source")
class CreatorUrlInfo(BaseModel):
"""小红书创作者URL信息"""
user_id: str = Field(title="user id (creator id)")
xsec_token: str = Field(default="", title="xsec token")
xsec_source: str = Field(default="", title="xsec source")

View File

@@ -4,11 +4,14 @@ author = "程序员阿江-Relakkes <relakkes@gmail.com>"
version = "0.1.0"
description = "A social media crawler project, support Xiaohongshu, Weibo, Zhihu, Bilibili, Douyin, BaiduTieBa etc."
readme = "README.md"
requires-python = ">=3.9"
requires-python = ">=3.11"
dependencies = [
"aiofiles~=23.2.1",
"aiomysql==0.2.0",
"aiosqlite>=0.21.0",
"alembic>=1.16.5",
"asyncmy>=0.2.10",
"cryptography>=45.0.7",
"fastapi==0.110.2",
"httpx==0.28.1",
"jieba==0.42.1",
@@ -20,10 +23,13 @@ dependencies = [
"playwright==1.45.0",
"pydantic==2.5.2",
"pyexecjs==1.5.1",
"pyhumps>=3.8.0",
"python-dotenv==1.0.1",
"redis~=4.6.0",
"requests==2.32.3",
"sqlalchemy>=2.0.43",
"tenacity==8.2.2",
"typer>=0.12.3",
"uvicorn==0.29.0",
"wordcloud==1.9.3",
]

View File

@@ -2,6 +2,7 @@ httpx==0.28.1
Pillow==9.5.0
playwright==1.45.0
tenacity==8.2.2
typer>=0.12.3
opencv-python
aiomysql==0.2.0
redis~=4.6.0
@@ -17,4 +18,9 @@ requests==2.32.3
parsel==1.9.1
pyexecjs==1.5.1
pandas==2.2.3
aiosqlite==0.21.0
aiosqlite==0.21.0
pyhumps==3.8.0
cryptography>=45.0.7
alembic>=1.16.5
asyncmy>=0.2.10
sqlalchemy>=2.0.43

View File

Binary file not shown.

View File

@@ -1,569 +0,0 @@
-- SQLite version of the MediaCrawler database schema
-- Converted from the MySQL tables.sql and adapted to SQLite syntax
-- ----------------------------
-- Table structure for bilibili_video
-- ----------------------------
DROP TABLE IF EXISTS bilibili_video;
CREATE TABLE bilibili_video (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT DEFAULT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
video_id TEXT NOT NULL,
video_type TEXT NOT NULL,
title TEXT DEFAULT NULL,
desc TEXT,
create_time INTEGER NOT NULL,
liked_count TEXT DEFAULT NULL,
disliked_count TEXT DEFAULT NULL,
video_play_count TEXT DEFAULT NULL,
video_favorite_count TEXT DEFAULT NULL,
video_share_count TEXT DEFAULT NULL,
video_coin_count TEXT DEFAULT NULL,
video_danmaku TEXT DEFAULT NULL,
video_comment TEXT DEFAULT NULL,
video_url TEXT DEFAULT NULL,
video_cover_url TEXT DEFAULT NULL,
source_keyword TEXT DEFAULT ''
);
CREATE INDEX idx_bilibili_vi_video_i_31c36e ON bilibili_video(video_id);
CREATE INDEX idx_bilibili_vi_create__73e0ec ON bilibili_video(create_time);
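For reference, the dropped DDL is plain SQLite and runs as-is under the stdlib `sqlite3` module. A sketch abridged to a few `bilibili_video` columns (the inserted row is made up):

```python
import sqlite3

# Abridged bilibili_video DDL from the dropped file, executed in-memory
ddl = """
CREATE TABLE bilibili_video (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    video_id TEXT NOT NULL,
    title TEXT DEFAULT NULL,
    create_time INTEGER NOT NULL,
    add_ts INTEGER NOT NULL,
    last_modify_ts INTEGER NOT NULL,
    video_type TEXT NOT NULL
);
CREATE INDEX idx_bilibili_vi_video_i_31c36e ON bilibili_video(video_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
conn.execute(
    "INSERT INTO bilibili_video (video_id, title, create_time, add_ts, last_modify_ts, video_type)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    ("BV1xx411c7mD", "demo", 1700000000, 1700000000, 1700000000, "video"),
)
row = conn.execute("SELECT video_id, title FROM bilibili_video").fetchone()
```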
-- ----------------------------
-- Table structure for bilibili_video_comment
-- ----------------------------
DROP TABLE IF EXISTS bilibili_video_comment;
CREATE TABLE bilibili_video_comment (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT DEFAULT NULL,
nickname TEXT DEFAULT NULL,
sex TEXT DEFAULT NULL,
sign TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
comment_id TEXT NOT NULL,
video_id TEXT NOT NULL,
content TEXT,
create_time INTEGER NOT NULL,
sub_comment_count TEXT NOT NULL,
parent_comment_id TEXT DEFAULT NULL,
like_count TEXT NOT NULL DEFAULT '0'
);
CREATE INDEX idx_bilibili_vi_comment_41c34e ON bilibili_video_comment(comment_id);
CREATE INDEX idx_bilibili_vi_video_i_f22873 ON bilibili_video_comment(video_id);
-- ----------------------------
-- Table structure for bilibili_up_info
-- ----------------------------
DROP TABLE IF EXISTS bilibili_up_info;
CREATE TABLE bilibili_up_info (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT DEFAULT NULL,
nickname TEXT DEFAULT NULL,
sex TEXT DEFAULT NULL,
sign TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
total_fans INTEGER DEFAULT NULL,
total_liked INTEGER DEFAULT NULL,
user_rank INTEGER DEFAULT NULL,
is_official INTEGER DEFAULT NULL
);
CREATE INDEX idx_bilibili_vi_user_123456 ON bilibili_up_info(user_id);
-- ----------------------------
-- Table structure for bilibili_contact_info
-- ----------------------------
DROP TABLE IF EXISTS bilibili_contact_info;
CREATE TABLE bilibili_contact_info (
id INTEGER PRIMARY KEY AUTOINCREMENT,
up_id TEXT DEFAULT NULL,
fan_id TEXT DEFAULT NULL,
up_name TEXT DEFAULT NULL,
fan_name TEXT DEFAULT NULL,
up_sign TEXT DEFAULT NULL,
fan_sign TEXT DEFAULT NULL,
up_avatar TEXT DEFAULT NULL,
fan_avatar TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL
);
CREATE INDEX idx_bilibili_contact_info_up_id ON bilibili_contact_info(up_id);
CREATE INDEX idx_bilibili_contact_info_fan_id ON bilibili_contact_info(fan_id);
-- ----------------------------
-- Table structure for bilibili_up_dynamic
-- ----------------------------
DROP TABLE IF EXISTS bilibili_up_dynamic;
CREATE TABLE bilibili_up_dynamic (
id INTEGER PRIMARY KEY AUTOINCREMENT,
dynamic_id TEXT DEFAULT NULL,
user_id TEXT DEFAULT NULL,
user_name TEXT DEFAULT NULL,
text TEXT DEFAULT NULL,
type TEXT DEFAULT NULL,
pub_ts INTEGER DEFAULT NULL,
total_comments INTEGER DEFAULT NULL,
total_forwards INTEGER DEFAULT NULL,
total_liked INTEGER DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL
);
CREATE INDEX idx_bilibili_up_dynamic_dynamic_id ON bilibili_up_dynamic(dynamic_id);
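The comment tables above model reply threads with a self-referencing `parent_comment_id` (NULL for top-level comments) rather than a foreign key. A minimal sketch of rebuilding a thread from `bilibili_video_comment` with Python's bundled `sqlite3` — the DDL is trimmed to the threading-relevant columns, and the sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed to the columns relevant for threading; see the full DDL above.
conn.execute("""
CREATE TABLE bilibili_video_comment (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    comment_id TEXT NOT NULL,
    video_id TEXT NOT NULL,
    content TEXT,
    create_time INTEGER NOT NULL,
    sub_comment_count TEXT NOT NULL,
    parent_comment_id TEXT DEFAULT NULL,
    like_count TEXT NOT NULL DEFAULT '0',
    add_ts INTEGER NOT NULL,
    last_modify_ts INTEGER NOT NULL
)""")
rows = [
    # (comment_id, video_id, content, create_time, sub_comment_count, parent_comment_id)
    ("c1", "v1", "top-level comment", 100, "1", None),
    ("c2", "v1", "reply to c1", 101, "0", "c1"),
]
conn.executemany(
    "INSERT INTO bilibili_video_comment "
    "(comment_id, video_id, content, create_time, sub_comment_count, parent_comment_id, add_ts, last_modify_ts) "
    "VALUES (?, ?, ?, ?, ?, ?, 0, 0)",
    rows,
)
# Top-level comments (parent IS NULL) joined to their replies via a self-join.
threads = conn.execute("""
    SELECT p.comment_id, r.comment_id
    FROM bilibili_video_comment AS p
    LEFT JOIN bilibili_video_comment AS r
           ON r.parent_comment_id = p.comment_id
    WHERE p.parent_comment_id IS NULL AND p.video_id = ?
    ORDER BY p.create_time
""", ("v1",)).fetchall()
print(threads)  # [('c1', 'c2')]
```

The same pattern applies to the other `*_comment` tables, since they all share the `comment_id` / `parent_comment_id` pair.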
-- ----------------------------
-- Table structure for douyin_aweme
-- ----------------------------
DROP TABLE IF EXISTS douyin_aweme;
CREATE TABLE douyin_aweme (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT DEFAULT NULL,
sec_uid TEXT DEFAULT NULL,
short_user_id TEXT DEFAULT NULL,
user_unique_id TEXT DEFAULT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
user_signature TEXT DEFAULT NULL,
ip_location TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
aweme_id TEXT NOT NULL,
aweme_type TEXT NOT NULL,
title TEXT DEFAULT NULL,
desc TEXT,
create_time INTEGER NOT NULL,
liked_count TEXT DEFAULT NULL,
comment_count TEXT DEFAULT NULL,
share_count TEXT DEFAULT NULL,
collected_count TEXT DEFAULT NULL,
aweme_url TEXT DEFAULT NULL,
cover_url TEXT DEFAULT NULL,
video_download_url TEXT DEFAULT NULL,
music_download_url TEXT DEFAULT NULL,
note_download_url TEXT DEFAULT NULL,
source_keyword TEXT DEFAULT ''
);
CREATE INDEX idx_douyin_awem_aweme_i_6f7bc6 ON douyin_aweme(aweme_id);
CREATE INDEX idx_douyin_awem_create__299dfe ON douyin_aweme(create_time);
-- ----------------------------
-- Table structure for douyin_aweme_comment
-- ----------------------------
DROP TABLE IF EXISTS douyin_aweme_comment;
CREATE TABLE douyin_aweme_comment (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT DEFAULT NULL,
sec_uid TEXT DEFAULT NULL,
short_user_id TEXT DEFAULT NULL,
user_unique_id TEXT DEFAULT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
user_signature TEXT DEFAULT NULL,
ip_location TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
comment_id TEXT NOT NULL,
aweme_id TEXT NOT NULL,
content TEXT,
create_time INTEGER NOT NULL,
sub_comment_count TEXT NOT NULL,
parent_comment_id TEXT DEFAULT NULL,
like_count TEXT NOT NULL DEFAULT '0',
pictures TEXT NOT NULL DEFAULT ''
);
CREATE INDEX idx_douyin_awem_comment_fcd7e4 ON douyin_aweme_comment(comment_id);
CREATE INDEX idx_douyin_awem_aweme_i_c50049 ON douyin_aweme_comment(aweme_id);
-- ----------------------------
-- Table structure for dy_creator
-- ----------------------------
DROP TABLE IF EXISTS dy_creator;
CREATE TABLE dy_creator (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT NOT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
ip_location TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
desc TEXT,
gender TEXT DEFAULT NULL,
follows TEXT DEFAULT NULL,
fans TEXT DEFAULT NULL,
interaction TEXT DEFAULT NULL,
videos_count TEXT DEFAULT NULL
);
-- ----------------------------
-- Table structure for kuaishou_video
-- ----------------------------
DROP TABLE IF EXISTS kuaishou_video;
CREATE TABLE kuaishou_video (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT DEFAULT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
video_id TEXT NOT NULL,
video_type TEXT NOT NULL,
title TEXT DEFAULT NULL,
desc TEXT,
create_time INTEGER NOT NULL,
liked_count TEXT DEFAULT NULL,
viewd_count TEXT DEFAULT NULL,
video_url TEXT DEFAULT NULL,
video_cover_url TEXT DEFAULT NULL,
video_play_url TEXT DEFAULT NULL,
source_keyword TEXT DEFAULT ''
);
CREATE INDEX idx_kuaishou_vi_video_i_c5c6a6 ON kuaishou_video(video_id);
CREATE INDEX idx_kuaishou_vi_create__a10dee ON kuaishou_video(create_time);
-- ----------------------------
-- Table structure for kuaishou_video_comment
-- ----------------------------
DROP TABLE IF EXISTS kuaishou_video_comment;
CREATE TABLE kuaishou_video_comment (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT DEFAULT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
comment_id TEXT NOT NULL,
video_id TEXT NOT NULL,
content TEXT,
create_time INTEGER NOT NULL,
sub_comment_count TEXT NOT NULL
);
CREATE INDEX idx_kuaishou_vi_comment_ed48fa ON kuaishou_video_comment(comment_id);
CREATE INDEX idx_kuaishou_vi_video_i_e50914 ON kuaishou_video_comment(video_id);
-- ----------------------------
-- Table structure for weibo_note
-- ----------------------------
DROP TABLE IF EXISTS weibo_note;
CREATE TABLE weibo_note (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT DEFAULT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
gender TEXT DEFAULT NULL,
profile_url TEXT DEFAULT NULL,
ip_location TEXT DEFAULT '',
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
note_id TEXT NOT NULL,
content TEXT,
create_time INTEGER NOT NULL,
create_date_time TEXT NOT NULL,
liked_count TEXT DEFAULT NULL,
comments_count TEXT DEFAULT NULL,
shared_count TEXT DEFAULT NULL,
note_url TEXT DEFAULT NULL,
source_keyword TEXT DEFAULT ''
);
CREATE INDEX idx_weibo_note_note_id_f95b1a ON weibo_note(note_id);
CREATE INDEX idx_weibo_note_create__692709 ON weibo_note(create_time);
CREATE INDEX idx_weibo_note_create__d05ed2 ON weibo_note(create_date_time);
-- ----------------------------
-- Table structure for weibo_note_comment
-- ----------------------------
DROP TABLE IF EXISTS weibo_note_comment;
CREATE TABLE weibo_note_comment (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT DEFAULT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
gender TEXT DEFAULT NULL,
profile_url TEXT DEFAULT NULL,
ip_location TEXT DEFAULT '',
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
comment_id TEXT NOT NULL,
note_id TEXT NOT NULL,
content TEXT,
create_time INTEGER NOT NULL,
create_date_time TEXT NOT NULL,
comment_like_count TEXT NOT NULL,
sub_comment_count TEXT NOT NULL,
parent_comment_id TEXT DEFAULT NULL
);
CREATE INDEX idx_weibo_note__comment_c7611c ON weibo_note_comment(comment_id);
CREATE INDEX idx_weibo_note__note_id_24f108 ON weibo_note_comment(note_id);
CREATE INDEX idx_weibo_note__create__667fe3 ON weibo_note_comment(create_date_time);
-- ----------------------------
-- Table structure for weibo_creator
-- ----------------------------
DROP TABLE IF EXISTS weibo_creator;
CREATE TABLE weibo_creator (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT NOT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
ip_location TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
desc TEXT,
gender TEXT DEFAULT NULL,
follows TEXT DEFAULT NULL,
fans TEXT DEFAULT NULL,
tag_list TEXT
);
-- ----------------------------
-- Table structure for xhs_creator
-- ----------------------------
DROP TABLE IF EXISTS xhs_creator;
CREATE TABLE xhs_creator (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT NOT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
ip_location TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
desc TEXT,
gender TEXT DEFAULT NULL,
follows TEXT DEFAULT NULL,
fans TEXT DEFAULT NULL,
interaction TEXT DEFAULT NULL,
tag_list TEXT
);
-- ----------------------------
-- Table structure for xhs_note
-- ----------------------------
DROP TABLE IF EXISTS xhs_note;
CREATE TABLE xhs_note (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT NOT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
ip_location TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
note_id TEXT NOT NULL,
type TEXT DEFAULT NULL,
title TEXT DEFAULT NULL,
desc TEXT,
video_url TEXT,
time INTEGER NOT NULL,
last_update_time INTEGER NOT NULL,
liked_count TEXT DEFAULT NULL,
collected_count TEXT DEFAULT NULL,
comment_count TEXT DEFAULT NULL,
share_count TEXT DEFAULT NULL,
image_list TEXT,
tag_list TEXT,
note_url TEXT DEFAULT NULL,
source_keyword TEXT DEFAULT '',
xsec_token TEXT DEFAULT NULL
);
CREATE INDEX idx_xhs_note_note_id_209457 ON xhs_note(note_id);
CREATE INDEX idx_xhs_note_time_eaa910 ON xhs_note(time);
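Note that `xhs_note` has no UNIQUE constraint on `note_id`, so repeated crawls of the same note can leave duplicate rows distinguished only by `last_modify_ts`. One way a consumer might pick the freshest row per note is SQLite's documented bare-column behavior with `MAX()`; a sketch with trimmed columns and invented sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed to the columns needed for the dedup query; see the full DDL above.
conn.execute("""
CREATE TABLE xhs_note (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    note_id TEXT NOT NULL,
    title TEXT DEFAULT NULL,
    liked_count TEXT DEFAULT NULL,
    last_modify_ts INTEGER NOT NULL,
    source_keyword TEXT DEFAULT ''
)""")
conn.executemany(
    "INSERT INTO xhs_note (note_id, title, liked_count, last_modify_ts, source_keyword) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("n1", "first crawl", "10", 100, "coffee"),
        ("n1", "second crawl", "25", 200, "coffee"),  # same note, re-crawled later
        ("n2", "other note", "3", 150, "coffee"),
    ],
)
# In SQLite, bare columns in a GROUP BY query containing MAX() are taken
# from the row that holds the maximum, so this keeps the newest crawl.
fresh = conn.execute("""
    SELECT note_id, title, liked_count, MAX(last_modify_ts)
    FROM xhs_note
    WHERE source_keyword = ?
    GROUP BY note_id
    ORDER BY note_id
""", ("coffee",)).fetchall()
print(fresh)  # [('n1', 'second crawl', '25', 200), ('n2', 'other note', '3', 150)]
```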
-- ----------------------------
-- Table structure for xhs_note_comment
-- ----------------------------
DROP TABLE IF EXISTS xhs_note_comment;
CREATE TABLE xhs_note_comment (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT NOT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
ip_location TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
comment_id TEXT NOT NULL,
create_time INTEGER NOT NULL,
note_id TEXT NOT NULL,
content TEXT NOT NULL,
sub_comment_count INTEGER NOT NULL,
pictures TEXT DEFAULT NULL,
parent_comment_id TEXT DEFAULT NULL,
like_count TEXT DEFAULT NULL
);
CREATE INDEX idx_xhs_note_co_comment_8e8349 ON xhs_note_comment(comment_id);
CREATE INDEX idx_xhs_note_co_create__204f8d ON xhs_note_comment(create_time);
-- ----------------------------
-- Table structure for tieba_note
-- ----------------------------
DROP TABLE IF EXISTS tieba_note;
CREATE TABLE tieba_note (
id INTEGER PRIMARY KEY AUTOINCREMENT,
note_id TEXT NOT NULL,
title TEXT NOT NULL,
desc TEXT,
note_url TEXT NOT NULL,
publish_time TEXT NOT NULL,
user_link TEXT DEFAULT '',
user_nickname TEXT DEFAULT '',
user_avatar TEXT DEFAULT '',
tieba_id TEXT DEFAULT '',
tieba_name TEXT NOT NULL,
tieba_link TEXT NOT NULL,
total_replay_num INTEGER DEFAULT 0,
total_replay_page INTEGER DEFAULT 0,
ip_location TEXT DEFAULT '',
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
source_keyword TEXT DEFAULT ''
);
CREATE INDEX idx_tieba_note_note_id ON tieba_note(note_id);
CREATE INDEX idx_tieba_note_publish_time ON tieba_note(publish_time);
-- ----------------------------
-- Table structure for tieba_comment
-- ----------------------------
DROP TABLE IF EXISTS tieba_comment;
CREATE TABLE tieba_comment (
id INTEGER PRIMARY KEY AUTOINCREMENT,
comment_id TEXT NOT NULL,
parent_comment_id TEXT DEFAULT '',
content TEXT NOT NULL,
user_link TEXT DEFAULT '',
user_nickname TEXT DEFAULT '',
user_avatar TEXT DEFAULT '',
tieba_id TEXT DEFAULT '',
tieba_name TEXT NOT NULL,
tieba_link TEXT NOT NULL,
publish_time TEXT DEFAULT '',
ip_location TEXT DEFAULT '',
sub_comment_count INTEGER DEFAULT 0,
note_id TEXT NOT NULL,
note_url TEXT NOT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL
);
CREATE INDEX idx_tieba_comment_comment_id ON tieba_comment(comment_id);
CREATE INDEX idx_tieba_comment_note_id ON tieba_comment(note_id);
CREATE INDEX idx_tieba_comment_publish_time ON tieba_comment(publish_time);
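The tieba tables store `publish_time` as TEXT yet still index it. That works for range queries as long as the stored strings are zero-padded and most-significant-first (e.g. `YYYY-MM-DD HH:MM`), so lexicographic order equals chronological order — an assumption about the crawler's timestamp format, illustrated here with invented rows and a trimmed DDL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE tieba_note (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    note_id TEXT NOT NULL,
    title TEXT NOT NULL,
    publish_time TEXT NOT NULL,
    add_ts INTEGER NOT NULL,
    last_modify_ts INTEGER NOT NULL
)""")
conn.execute("CREATE INDEX idx_tieba_note_publish_time ON tieba_note(publish_time)")
conn.executemany(
    "INSERT INTO tieba_note (note_id, title, publish_time, add_ts, last_modify_ts) "
    "VALUES (?, ?, ?, 0, 0)",
    [
        ("t1", "a", "2024-01-05 10:00"),
        ("t2", "b", "2024-03-01 09:00"),
        ("t3", "c", "2023-12-31 23:59"),
    ],
)
# Zero-padded timestamp strings sort lexicographically in time order,
# so the TEXT index supports this range scan directly.
recent = [r[0] for r in conn.execute(
    "SELECT note_id FROM tieba_note WHERE publish_time >= ? ORDER BY publish_time",
    ("2024-01-01",),
)]
print(recent)  # ['t1', 't2']
```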
-- ----------------------------
-- Table structure for tieba_creator
-- ----------------------------
DROP TABLE IF EXISTS tieba_creator;
CREATE TABLE tieba_creator (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT NOT NULL,
user_name TEXT NOT NULL,
nickname TEXT DEFAULT NULL,
avatar TEXT DEFAULT NULL,
ip_location TEXT DEFAULT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL,
gender TEXT DEFAULT NULL,
follows TEXT DEFAULT NULL,
fans TEXT DEFAULT NULL,
registration_duration TEXT DEFAULT NULL
);
-- ----------------------------
-- Table structure for zhihu_content
-- ----------------------------
DROP TABLE IF EXISTS zhihu_content;
CREATE TABLE zhihu_content (
id INTEGER PRIMARY KEY AUTOINCREMENT,
content_id TEXT NOT NULL,
content_type TEXT NOT NULL,
content_text TEXT,
content_url TEXT NOT NULL,
question_id TEXT DEFAULT NULL,
title TEXT NOT NULL,
desc TEXT,
created_time TEXT NOT NULL,
updated_time TEXT NOT NULL,
voteup_count INTEGER NOT NULL DEFAULT 0,
comment_count INTEGER NOT NULL DEFAULT 0,
source_keyword TEXT DEFAULT NULL,
user_id TEXT NOT NULL,
user_link TEXT NOT NULL,
user_nickname TEXT NOT NULL,
user_avatar TEXT NOT NULL,
user_url_token TEXT NOT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL
);
CREATE INDEX idx_zhihu_content_content_id ON zhihu_content(content_id);
CREATE INDEX idx_zhihu_content_created_time ON zhihu_content(created_time);
-- ----------------------------
-- Table structure for zhihu_comment
-- ----------------------------
DROP TABLE IF EXISTS zhihu_comment;
CREATE TABLE zhihu_comment (
id INTEGER PRIMARY KEY AUTOINCREMENT,
comment_id TEXT NOT NULL,
parent_comment_id TEXT DEFAULT NULL,
content TEXT NOT NULL,
publish_time TEXT NOT NULL,
ip_location TEXT DEFAULT NULL,
sub_comment_count INTEGER NOT NULL DEFAULT 0,
like_count INTEGER NOT NULL DEFAULT 0,
dislike_count INTEGER NOT NULL DEFAULT 0,
content_id TEXT NOT NULL,
content_type TEXT NOT NULL,
user_id TEXT NOT NULL,
user_link TEXT NOT NULL,
user_nickname TEXT NOT NULL,
user_avatar TEXT NOT NULL,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL
);
CREATE INDEX idx_zhihu_comment_comment_id ON zhihu_comment(comment_id);
CREATE INDEX idx_zhihu_comment_content_id ON zhihu_comment(content_id);
CREATE INDEX idx_zhihu_comment_publish_time ON zhihu_comment(publish_time);
-- ----------------------------
-- Table structure for zhihu_creator
-- ----------------------------
DROP TABLE IF EXISTS zhihu_creator;
CREATE TABLE zhihu_creator (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT NOT NULL UNIQUE,
user_link TEXT NOT NULL,
user_nickname TEXT NOT NULL,
user_avatar TEXT NOT NULL,
url_token TEXT NOT NULL,
gender TEXT DEFAULT NULL,
ip_location TEXT DEFAULT NULL,
follows INTEGER NOT NULL DEFAULT 0,
fans INTEGER NOT NULL DEFAULT 0,
anwser_count INTEGER NOT NULL DEFAULT 0,
video_count INTEGER NOT NULL DEFAULT 0,
question_count INTEGER NOT NULL DEFAULT 0,
article_count INTEGER NOT NULL DEFAULT 0,
column_count INTEGER NOT NULL DEFAULT 0,
get_voteup_count INTEGER NOT NULL DEFAULT 0,
add_ts INTEGER NOT NULL,
last_modify_ts INTEGER NOT NULL
);
CREATE UNIQUE INDEX idx_zhihu_creator_user_id ON zhihu_creator(user_id);
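The SQLite DDL above is a multi-statement script, which `sqlite3` applies via `executescript` (plain `execute` runs only one statement). The UNIQUE constraint on `zhihu_creator.user_id` also enables idempotent upserts on re-crawls. A sketch using a trimmed two-statement excerpt of the schema (in practice the whole section would be read from the project's schema file):

```python
import sqlite3

# Trimmed excerpt of the zhihu_creator DDL above.
DDL = """
DROP TABLE IF EXISTS zhihu_creator;
CREATE TABLE zhihu_creator (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id TEXT NOT NULL UNIQUE,
    user_nickname TEXT NOT NULL,
    fans INTEGER NOT NULL DEFAULT 0,
    add_ts INTEGER NOT NULL,
    last_modify_ts INTEGER NOT NULL
);
CREATE UNIQUE INDEX idx_zhihu_creator_user_id ON zhihu_creator(user_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)  # runs all statements in the script at once

UPSERT = (
    "INSERT INTO zhihu_creator (user_id, user_nickname, fans, add_ts, last_modify_ts) "
    "VALUES (?, ?, ?, 0, 0) "
    "ON CONFLICT(user_id) DO UPDATE SET "
    "fans = excluded.fans, last_modify_ts = excluded.last_modify_ts"
)
conn.execute(UPSERT, ("u1", "alice", 10))   # first crawl
conn.execute(UPSERT, ("u1", "alice", 42))   # re-crawl updates in place
count, fans = conn.execute("SELECT COUNT(*), MAX(fans) FROM zhihu_creator").fetchone()
print(count, fans)  # 1 42
```

`ON CONFLICT ... DO UPDATE` requires SQLite 3.24+, which ships with all currently supported Python versions.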

-- ----------------------------
-- MySQL schema
-- ----------------------------
-- ----------------------------
-- Table structure for bilibili_video
-- ----------------------------
DROP TABLE IF EXISTS `bilibili_video`;
CREATE TABLE `bilibili_video`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`video_id` varchar(64) NOT NULL COMMENT '视频ID',
`video_type` varchar(16) NOT NULL COMMENT '视频类型',
`title` varchar(500) DEFAULT NULL COMMENT '视频标题',
`desc` longtext COMMENT '视频描述',
`create_time` bigint NOT NULL COMMENT '视频发布时间戳',
`liked_count` varchar(16) DEFAULT NULL COMMENT '视频点赞数',
`disliked_count` varchar(16) DEFAULT NULL COMMENT '视频点踩数',
`video_play_count` varchar(16) DEFAULT NULL COMMENT '视频播放数量',
`video_favorite_count` varchar(16) DEFAULT NULL COMMENT '视频收藏数量',
`video_share_count` varchar(16) DEFAULT NULL COMMENT '视频分享数量',
`video_coin_count` varchar(16) DEFAULT NULL COMMENT '视频投币数量',
`video_danmaku` varchar(16) DEFAULT NULL COMMENT '视频弹幕数量',
`video_comment` varchar(16) DEFAULT NULL COMMENT '视频评论数量',
`video_url` varchar(512) DEFAULT NULL COMMENT '视频详情URL',
`video_cover_url` varchar(512) DEFAULT NULL COMMENT '视频封面图 URL',
PRIMARY KEY (`id`),
KEY `idx_bilibili_vi_video_i_31c36e` (`video_id`),
KEY `idx_bilibili_vi_create__73e0ec` (`create_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='B站视频';
-- ----------------------------
-- Table structure for bilibili_video_comment
-- ----------------------------
DROP TABLE IF EXISTS `bilibili_video_comment`;
CREATE TABLE `bilibili_video_comment`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`sex` varchar(64) DEFAULT NULL COMMENT '用户性别',
`sign` text DEFAULT NULL COMMENT '用户签名',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`comment_id` varchar(64) NOT NULL COMMENT '评论ID',
`video_id` varchar(64) NOT NULL COMMENT '视频ID',
`content` longtext COMMENT '评论内容',
`create_time` bigint NOT NULL COMMENT '评论时间戳',
`sub_comment_count` varchar(16) NOT NULL COMMENT '评论回复数',
PRIMARY KEY (`id`),
KEY `idx_bilibili_vi_comment_41c34e` (`comment_id`),
KEY `idx_bilibili_vi_video_i_f22873` (`video_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='B 站视频评论';
-- ----------------------------
-- Table structure for bilibili_up_info
-- ----------------------------
DROP TABLE IF EXISTS `bilibili_up_info`;
CREATE TABLE `bilibili_up_info`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`sex` varchar(64) DEFAULT NULL COMMENT '用户性别',
`sign` text DEFAULT NULL COMMENT '用户签名',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`total_fans` bigint DEFAULT NULL COMMENT '粉丝数',
`total_liked` bigint DEFAULT NULL COMMENT '总获赞数',
`user_rank` int DEFAULT NULL COMMENT '用户等级',
`is_official` int DEFAULT NULL COMMENT '是否官号',
PRIMARY KEY (`id`),
KEY `idx_bilibili_vi_user_123456` (`user_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='B 站UP主信息';
-- ----------------------------
-- Table structure for bilibili_contact_info
-- ----------------------------
DROP TABLE IF EXISTS `bilibili_contact_info`;
CREATE TABLE `bilibili_contact_info`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`up_id` varchar(64) DEFAULT NULL COMMENT 'up主ID',
`fan_id` varchar(64) DEFAULT NULL COMMENT '粉丝ID',
`up_name` varchar(64) DEFAULT NULL COMMENT 'up主昵称',
`fan_name` varchar(64) DEFAULT NULL COMMENT '粉丝昵称',
`up_sign` longtext DEFAULT NULL COMMENT 'up主签名',
`fan_sign` longtext DEFAULT NULL COMMENT '粉丝签名',
`up_avatar` varchar(255) DEFAULT NULL COMMENT 'up主头像地址',
`fan_avatar` varchar(255) DEFAULT NULL COMMENT '粉丝头像地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
PRIMARY KEY (`id`),
KEY `idx_bilibili_contact_info_up_id` (`up_id`),
KEY `idx_bilibili_contact_info_fan_id` (`fan_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='B 站联系人信息';
-- ----------------------------
-- Table structure for bilibili_up_dynamic
-- ----------------------------
DROP TABLE IF EXISTS `bilibili_up_dynamic`;
CREATE TABLE `bilibili_up_dynamic`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`dynamic_id` varchar(64) DEFAULT NULL COMMENT '动态ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`user_name` varchar(64) DEFAULT NULL COMMENT '用户名',
`text` longtext DEFAULT NULL COMMENT '动态文本',
`type` varchar(64) DEFAULT NULL COMMENT '动态类型',
`pub_ts` bigint DEFAULT NULL COMMENT '动态发布时间',
`total_comments` bigint DEFAULT NULL COMMENT '评论数',
`total_forwards` bigint DEFAULT NULL COMMENT '转发数',
`total_liked` bigint DEFAULT NULL COMMENT '点赞数',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
PRIMARY KEY (`id`),
KEY `idx_bilibili_up_dynamic_dynamic_id` (`dynamic_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='B 站up主动态信息';
-- ----------------------------
-- Table structure for douyin_aweme
-- ----------------------------
DROP TABLE IF EXISTS `douyin_aweme`;
CREATE TABLE `douyin_aweme`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`sec_uid` varchar(128) DEFAULT NULL COMMENT '用户sec_uid',
`short_user_id` varchar(64) DEFAULT NULL COMMENT '用户短ID',
`user_unique_id` varchar(64) DEFAULT NULL COMMENT '用户唯一ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`user_signature` varchar(500) DEFAULT NULL COMMENT '用户签名',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`aweme_id` varchar(64) NOT NULL COMMENT '视频ID',
`aweme_type` varchar(16) NOT NULL COMMENT '视频类型',
`title` varchar(1024) DEFAULT NULL COMMENT '视频标题',
`desc` longtext COMMENT '视频描述',
`create_time` bigint NOT NULL COMMENT '视频发布时间戳',
`liked_count` varchar(16) DEFAULT NULL COMMENT '视频点赞数',
`comment_count` varchar(16) DEFAULT NULL COMMENT '视频评论数',
`share_count` varchar(16) DEFAULT NULL COMMENT '视频分享数',
`collected_count` varchar(16) DEFAULT NULL COMMENT '视频收藏数',
`aweme_url` varchar(255) DEFAULT NULL COMMENT '视频详情页URL',
`cover_url` varchar(500) DEFAULT NULL COMMENT '视频封面图URL',
`video_download_url` varchar(1024) DEFAULT NULL COMMENT '视频下载地址',
`music_download_url` varchar(1024) DEFAULT NULL COMMENT '音乐下载地址',
`note_download_url` varchar(5120) DEFAULT NULL COMMENT '笔记下载地址',
PRIMARY KEY (`id`),
KEY `idx_douyin_awem_aweme_i_6f7bc6` (`aweme_id`),
KEY `idx_douyin_awem_create__299dfe` (`create_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='抖音视频';
-- ----------------------------
-- Table structure for douyin_aweme_comment
-- ----------------------------
DROP TABLE IF EXISTS `douyin_aweme_comment`;
CREATE TABLE `douyin_aweme_comment`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`sec_uid` varchar(128) DEFAULT NULL COMMENT '用户sec_uid',
`short_user_id` varchar(64) DEFAULT NULL COMMENT '用户短ID',
`user_unique_id` varchar(64) DEFAULT NULL COMMENT '用户唯一ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`user_signature` varchar(500) DEFAULT NULL COMMENT '用户签名',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`comment_id` varchar(64) NOT NULL COMMENT '评论ID',
`aweme_id` varchar(64) NOT NULL COMMENT '视频ID',
`content` longtext COMMENT '评论内容',
`create_time` bigint NOT NULL COMMENT '评论时间戳',
`sub_comment_count` varchar(16) NOT NULL COMMENT '评论回复数',
PRIMARY KEY (`id`),
KEY `idx_douyin_awem_comment_fcd7e4` (`comment_id`),
KEY `idx_douyin_awem_aweme_i_c50049` (`aweme_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='抖音视频评论';
-- ----------------------------
-- Table structure for dy_creator
-- ----------------------------
DROP TABLE IF EXISTS `dy_creator`;
CREATE TABLE `dy_creator`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(128) NOT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`desc` longtext COMMENT '用户描述',
`gender` varchar(1) DEFAULT NULL COMMENT '性别',
`follows` varchar(16) DEFAULT NULL COMMENT '关注数',
`fans` varchar(16) DEFAULT NULL COMMENT '粉丝数',
`interaction` varchar(16) DEFAULT NULL COMMENT '获赞数',
`videos_count` varchar(16) DEFAULT NULL COMMENT '作品数',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='抖音博主信息';
-- ----------------------------
-- Table structure for kuaishou_video
-- ----------------------------
DROP TABLE IF EXISTS `kuaishou_video`;
CREATE TABLE `kuaishou_video`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`video_id` varchar(64) NOT NULL COMMENT '视频ID',
`video_type` varchar(16) NOT NULL COMMENT '视频类型',
`title` varchar(500) DEFAULT NULL COMMENT '视频标题',
`desc` longtext COMMENT '视频描述',
`create_time` bigint NOT NULL COMMENT '视频发布时间戳',
`liked_count` varchar(16) DEFAULT NULL COMMENT '视频点赞数',
`viewd_count` varchar(16) DEFAULT NULL COMMENT '视频浏览数量',
`video_url` varchar(512) DEFAULT NULL COMMENT '视频详情URL',
`video_cover_url` varchar(512) DEFAULT NULL COMMENT '视频封面图 URL',
`video_play_url` varchar(512) DEFAULT NULL COMMENT '视频播放 URL',
PRIMARY KEY (`id`),
KEY `idx_kuaishou_vi_video_i_c5c6a6` (`video_id`),
KEY `idx_kuaishou_vi_create__a10dee` (`create_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='快手视频';
-- ----------------------------
-- Table structure for kuaishou_video_comment
-- ----------------------------
DROP TABLE IF EXISTS `kuaishou_video_comment`;
CREATE TABLE `kuaishou_video_comment`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`comment_id` varchar(64) NOT NULL COMMENT '评论ID',
`video_id` varchar(64) NOT NULL COMMENT '视频ID',
`content` longtext COMMENT '评论内容',
`create_time` bigint NOT NULL COMMENT '评论时间戳',
`sub_comment_count` varchar(16) NOT NULL COMMENT '评论回复数',
PRIMARY KEY (`id`),
KEY `idx_kuaishou_vi_comment_ed48fa` (`comment_id`),
KEY `idx_kuaishou_vi_video_i_e50914` (`video_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='快手视频评论';
-- ----------------------------
-- Table structure for weibo_note
-- ----------------------------
DROP TABLE IF EXISTS `weibo_note`;
CREATE TABLE `weibo_note`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`gender` varchar(12) DEFAULT NULL COMMENT '用户性别',
`profile_url` varchar(255) DEFAULT NULL COMMENT '用户主页地址',
`ip_location` varchar(32) DEFAULT '' COMMENT '发布微博的地理信息',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`note_id` varchar(64) NOT NULL COMMENT '帖子ID',
`content` longtext COMMENT '帖子正文内容',
`create_time` bigint NOT NULL COMMENT '帖子发布时间戳',
`create_date_time` varchar(32) NOT NULL COMMENT '帖子发布日期时间',
`liked_count` varchar(16) DEFAULT NULL COMMENT '帖子点赞数',
`comments_count` varchar(16) DEFAULT NULL COMMENT '帖子评论数量',
`shared_count` varchar(16) DEFAULT NULL COMMENT '帖子转发数量',
`note_url` varchar(512) DEFAULT NULL COMMENT '帖子详情URL',
PRIMARY KEY (`id`),
KEY `idx_weibo_note_note_id_f95b1a` (`note_id`),
KEY `idx_weibo_note_create__692709` (`create_time`),
KEY `idx_weibo_note_create__d05ed2` (`create_date_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='微博帖子';
-- ----------------------------
-- Table structure for weibo_note_comment
-- ----------------------------
DROP TABLE IF EXISTS `weibo_note_comment`;
CREATE TABLE `weibo_note_comment`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) DEFAULT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`gender` varchar(12) DEFAULT NULL COMMENT '用户性别',
`profile_url` varchar(255) DEFAULT NULL COMMENT '用户主页地址',
`ip_location` varchar(32) DEFAULT '' COMMENT '发布微博的地理信息',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`comment_id` varchar(64) NOT NULL COMMENT '评论ID',
`note_id` varchar(64) NOT NULL COMMENT '帖子ID',
`content` longtext COMMENT '评论内容',
`create_time` bigint NOT NULL COMMENT '评论时间戳',
`create_date_time` varchar(32) NOT NULL COMMENT '评论日期时间',
`comment_like_count` varchar(16) NOT NULL COMMENT '评论点赞数量',
`sub_comment_count` varchar(16) NOT NULL COMMENT '评论回复数',
PRIMARY KEY (`id`),
KEY `idx_weibo_note__comment_c7611c` (`comment_id`),
KEY `idx_weibo_note__note_id_24f108` (`note_id`),
KEY `idx_weibo_note__create__667fe3` (`create_date_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='微博帖子评论';
-- ----------------------------
-- Table structure for xhs_creator
-- ----------------------------
DROP TABLE IF EXISTS `xhs_creator`;
CREATE TABLE `xhs_creator`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) NOT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`desc` longtext COMMENT '用户描述',
`gender` varchar(1) DEFAULT NULL COMMENT '性别',
`follows` varchar(16) DEFAULT NULL COMMENT '关注数',
`fans` varchar(16) DEFAULT NULL COMMENT '粉丝数',
`interaction` varchar(16) DEFAULT NULL COMMENT '获赞和收藏数',
`tag_list` longtext COMMENT '标签列表',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='小红书博主';
-- ----------------------------
-- Table structure for xhs_note
-- ----------------------------
DROP TABLE IF EXISTS `xhs_note`;
CREATE TABLE `xhs_note`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) NOT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`note_id` varchar(64) NOT NULL COMMENT '笔记ID',
`type` varchar(16) DEFAULT NULL COMMENT '笔记类型(normal | video)',
`title` varchar(255) DEFAULT NULL COMMENT '笔记标题',
`desc` longtext COMMENT '笔记描述',
`video_url` longtext COMMENT '视频地址',
`time` bigint NOT NULL COMMENT '笔记发布时间戳',
`last_update_time` bigint NOT NULL COMMENT '笔记最后更新时间戳',
`liked_count` varchar(16) DEFAULT NULL COMMENT '笔记点赞数',
`collected_count` varchar(16) DEFAULT NULL COMMENT '笔记收藏数',
`comment_count` varchar(16) DEFAULT NULL COMMENT '笔记评论数',
`share_count` varchar(16) DEFAULT NULL COMMENT '笔记分享数',
`image_list` longtext COMMENT '笔记封面图片列表',
`tag_list` longtext COMMENT '标签列表',
`note_url` varchar(255) DEFAULT NULL COMMENT '笔记详情页的URL',
PRIMARY KEY (`id`),
KEY `idx_xhs_note_note_id_209457` (`note_id`),
KEY `idx_xhs_note_time_eaa910` (`time`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='小红书笔记';
-- ----------------------------
-- Table structure for xhs_note_comment
-- ----------------------------
DROP TABLE IF EXISTS `xhs_note_comment`;
CREATE TABLE `xhs_note_comment`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) NOT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`comment_id` varchar(64) NOT NULL COMMENT '评论ID',
`create_time` bigint NOT NULL COMMENT '评论时间戳',
`note_id` varchar(64) NOT NULL COMMENT '笔记ID',
`content` longtext NOT NULL COMMENT '评论内容',
`sub_comment_count` int NOT NULL COMMENT '子评论数量',
`pictures` varchar(512) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_xhs_note_co_comment_8e8349` (`comment_id`),
KEY `idx_xhs_note_co_create__204f8d` (`create_time`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='小红书笔记评论';
-- ----------------------------
-- alter table xhs_note_comment to support parent_comment_id
-- ----------------------------
ALTER TABLE `xhs_note_comment`
ADD COLUMN `parent_comment_id` VARCHAR(64) DEFAULT NULL COMMENT '父评论ID';
ALTER TABLE `douyin_aweme_comment`
ADD COLUMN `parent_comment_id` VARCHAR(64) DEFAULT NULL COMMENT '父评论ID';
ALTER TABLE `bilibili_video_comment`
ADD COLUMN `parent_comment_id` VARCHAR(64) DEFAULT NULL COMMENT '父评论ID';
ALTER TABLE `weibo_note_comment`
ADD COLUMN `parent_comment_id` VARCHAR(64) DEFAULT NULL COMMENT '父评论ID';
DROP TABLE IF EXISTS `tieba_note`;
CREATE TABLE tieba_note
(
id BIGINT AUTO_INCREMENT PRIMARY KEY,
note_id VARCHAR(64) NOT NULL COMMENT '帖子ID',
title VARCHAR(255) NOT NULL COMMENT '帖子标题',
`desc` TEXT COMMENT '帖子描述',
note_url VARCHAR(255) NOT NULL COMMENT '帖子链接',
publish_time VARCHAR(255) NOT NULL COMMENT '发布时间',
user_link VARCHAR(255) DEFAULT '' COMMENT '用户主页链接',
user_nickname VARCHAR(255) DEFAULT '' COMMENT '用户昵称',
user_avatar VARCHAR(255) DEFAULT '' COMMENT '用户头像地址',
tieba_id VARCHAR(255) DEFAULT '' COMMENT '贴吧ID',
tieba_name VARCHAR(255) NOT NULL COMMENT '贴吧名称',
tieba_link VARCHAR(255) NOT NULL COMMENT '贴吧链接',
total_replay_num INT DEFAULT 0 COMMENT '帖子回复总数',
total_replay_page INT DEFAULT 0 COMMENT '帖子回复总页数',
ip_location VARCHAR(255) DEFAULT '' COMMENT 'IP地理位置',
add_ts BIGINT NOT NULL COMMENT '添加时间戳',
last_modify_ts BIGINT NOT NULL COMMENT '最后修改时间戳',
KEY `idx_tieba_note_note_id` (`note_id`),
KEY `idx_tieba_note_publish_time` (`publish_time`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='贴吧帖子表';
DROP TABLE IF EXISTS `tieba_comment`;
CREATE TABLE tieba_comment
(
id BIGINT AUTO_INCREMENT PRIMARY KEY,
comment_id VARCHAR(255) NOT NULL COMMENT '评论ID',
parent_comment_id VARCHAR(255) DEFAULT '' COMMENT '父评论ID',
content TEXT NOT NULL COMMENT '评论内容',
user_link VARCHAR(255) DEFAULT '' COMMENT '用户主页链接',
user_nickname VARCHAR(255) DEFAULT '' COMMENT '用户昵称',
user_avatar VARCHAR(255) DEFAULT '' COMMENT '用户头像地址',
tieba_id VARCHAR(255) DEFAULT '' COMMENT '贴吧ID',
tieba_name VARCHAR(255) NOT NULL COMMENT '贴吧名称',
tieba_link VARCHAR(255) NOT NULL COMMENT '贴吧链接',
publish_time VARCHAR(255) DEFAULT '' COMMENT '发布时间',
ip_location VARCHAR(255) DEFAULT '' COMMENT 'IP地理位置',
sub_comment_count INT DEFAULT 0 COMMENT '子评论数',
note_id VARCHAR(255) NOT NULL COMMENT '帖子ID',
note_url VARCHAR(255) NOT NULL COMMENT '帖子链接',
add_ts BIGINT NOT NULL COMMENT '添加时间戳',
last_modify_ts BIGINT NOT NULL COMMENT '最后修改时间戳',
KEY `idx_tieba_comment_comment_id` (`comment_id`),
KEY `idx_tieba_comment_note_id` (`note_id`),
KEY `idx_tieba_comment_publish_time` (`publish_time`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='贴吧评论表';
-- 增加搜索来源关键字字段
alter table bilibili_video
add column `source_keyword` varchar(255) default '' comment '搜索来源关键字';
alter table douyin_aweme
add column `source_keyword` varchar(255) default '' comment '搜索来源关键字';
alter table kuaishou_video
add column `source_keyword` varchar(255) default '' comment '搜索来源关键字';
alter table weibo_note
add column `source_keyword` varchar(255) default '' comment '搜索来源关键字';
alter table xhs_note
add column `source_keyword` varchar(255) default '' comment '搜索来源关键字';
alter table tieba_note
add column `source_keyword` varchar(255) default '' comment '搜索来源关键字';
DROP TABLE IF EXISTS `weibo_creator`;
CREATE TABLE `weibo_creator`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) NOT NULL COMMENT '用户ID',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`desc` longtext COMMENT '用户描述',
`gender` varchar(2) DEFAULT NULL COMMENT '性别',
`follows` varchar(16) DEFAULT NULL COMMENT '关注数',
`fans` varchar(16) DEFAULT NULL COMMENT '粉丝数',
`tag_list` longtext COMMENT '标签列表',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='微博博主';
ALTER TABLE `xhs_note_comment`
ADD COLUMN `like_count` VARCHAR(64) DEFAULT NULL COMMENT '评论点赞数量';
DROP TABLE IF EXISTS `tieba_creator`;
CREATE TABLE `tieba_creator`
(
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) NOT NULL COMMENT '用户ID',
`user_name` varchar(64) NOT NULL COMMENT '用户名',
`nickname` varchar(64) DEFAULT NULL COMMENT '用户昵称',
`avatar` varchar(255) DEFAULT NULL COMMENT '用户头像地址',
`ip_location` varchar(255) DEFAULT NULL COMMENT '评论时的IP地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
`gender` varchar(2) DEFAULT NULL COMMENT '性别',
`follows` varchar(16) DEFAULT NULL COMMENT '关注数',
`fans` varchar(16) DEFAULT NULL COMMENT '粉丝数',
`registration_duration` varchar(16) DEFAULT NULL COMMENT '吧龄',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='贴吧创作者';
DROP TABLE IF EXISTS `zhihu_content`;
CREATE TABLE `zhihu_content` (
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`content_id` varchar(64) NOT NULL COMMENT '内容ID',
`content_type` varchar(16) NOT NULL COMMENT '内容类型(article | answer | zvideo)',
`content_text` longtext COMMENT '内容文本, 如果是视频类型这里为空',
`content_url` varchar(255) NOT NULL COMMENT '内容落地链接',
`question_id` varchar(64) DEFAULT NULL COMMENT '问题ID, type为answer时有值',
`title` varchar(255) NOT NULL COMMENT '内容标题',
`desc` longtext COMMENT '内容描述',
`created_time` varchar(32) NOT NULL COMMENT '创建时间',
`updated_time` varchar(32) NOT NULL COMMENT '更新时间',
`voteup_count` int NOT NULL DEFAULT '0' COMMENT '赞同人数',
`comment_count` int NOT NULL DEFAULT '0' COMMENT '评论数量',
`source_keyword` varchar(64) DEFAULT NULL COMMENT '来源关键词',
`user_id` varchar(64) NOT NULL COMMENT '用户ID',
`user_link` varchar(255) NOT NULL COMMENT '用户主页链接',
`user_nickname` varchar(64) NOT NULL COMMENT '用户昵称',
`user_avatar` varchar(255) NOT NULL COMMENT '用户头像地址',
`user_url_token` varchar(255) NOT NULL COMMENT '用户url_token',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
PRIMARY KEY (`id`),
KEY `idx_zhihu_content_content_id` (`content_id`),
KEY `idx_zhihu_content_created_time` (`created_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='知乎内容(回答、文章、视频)';
DROP TABLE IF EXISTS `zhihu_comment`;
CREATE TABLE `zhihu_comment` (
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`comment_id` varchar(64) NOT NULL COMMENT '评论ID',
`parent_comment_id` varchar(64) DEFAULT NULL COMMENT '父评论ID',
`content` text NOT NULL COMMENT '评论内容',
`publish_time` varchar(32) NOT NULL COMMENT '发布时间',
`ip_location` varchar(64) DEFAULT NULL COMMENT 'IP地理位置',
`sub_comment_count` int NOT NULL DEFAULT '0' COMMENT '子评论数',
`like_count` int NOT NULL DEFAULT '0' COMMENT '点赞数',
`dislike_count` int NOT NULL DEFAULT '0' COMMENT '踩数',
`content_id` varchar(64) NOT NULL COMMENT '内容ID',
`content_type` varchar(16) NOT NULL COMMENT '内容类型(article | answer | zvideo)',
`user_id` varchar(64) NOT NULL COMMENT '用户ID',
`user_link` varchar(255) NOT NULL COMMENT '用户主页链接',
`user_nickname` varchar(64) NOT NULL COMMENT '用户昵称',
`user_avatar` varchar(255) NOT NULL COMMENT '用户头像地址',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
PRIMARY KEY (`id`),
KEY `idx_zhihu_comment_comment_id` (`comment_id`),
KEY `idx_zhihu_comment_content_id` (`content_id`),
KEY `idx_zhihu_comment_publish_time` (`publish_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='知乎评论';
DROP TABLE IF EXISTS `zhihu_creator`;
CREATE TABLE `zhihu_creator` (
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
`user_id` varchar(64) NOT NULL COMMENT '用户ID',
`user_link` varchar(255) NOT NULL COMMENT '用户主页链接',
`user_nickname` varchar(64) NOT NULL COMMENT '用户昵称',
`user_avatar` varchar(255) NOT NULL COMMENT '用户头像地址',
`url_token` varchar(64) NOT NULL COMMENT '用户URL Token',
`gender` varchar(16) DEFAULT NULL COMMENT '用户性别',
`ip_location` varchar(64) DEFAULT NULL COMMENT 'IP地理位置',
`follows` int NOT NULL DEFAULT 0 COMMENT '关注数',
`fans` int NOT NULL DEFAULT 0 COMMENT '粉丝数',
`anwser_count` int NOT NULL DEFAULT 0 COMMENT '回答数',
`video_count` int NOT NULL DEFAULT 0 COMMENT '视频数',
`question_count` int NOT NULL DEFAULT 0 COMMENT '问题数',
`article_count` int NOT NULL DEFAULT 0 COMMENT '文章数',
`column_count` int NOT NULL DEFAULT 0 COMMENT '专栏数',
`get_voteup_count` int NOT NULL DEFAULT 0 COMMENT '获得的赞同数',
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
PRIMARY KEY (`id`),
UNIQUE KEY `idx_zhihu_creator_user_id` (`user_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='知乎创作者';
-- add column `like_count` to douyin_aweme_comment
alter table douyin_aweme_comment add column `like_count` varchar(255) NOT NULL DEFAULT '0' COMMENT '点赞数';
alter table xhs_note add column xsec_token varchar(50) default null comment '签名算法';
alter table douyin_aweme_comment add column `pictures` varchar(500) NOT NULL DEFAULT '' COMMENT '评论图片列表';
alter table bilibili_video_comment add column `like_count` varchar(255) NOT NULL DEFAULT '0' COMMENT '点赞数';
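The `ALTER TABLE ... ADD COLUMN` statements in this script are not idempotent: rerunning it against an already-migrated database fails as soon as a column exists. A minimal guard sketch (SQLite syntax for illustration; against MySQL you would consult `information_schema.COLUMNS` instead of `PRAGMA table_info`):

```python
import sqlite3

def add_column_if_missing(conn: sqlite3.Connection, table: str, column: str, decl: str) -> bool:
    """Add `column` to `table` unless it already exists; True means it was added."""
    existing = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xhs_note (id INTEGER PRIMARY KEY)")
first = add_column_if_missing(conn, "xhs_note", "source_keyword", "TEXT DEFAULT ''")
second = add_column_if_missing(conn, "xhs_note", "source_keyword", "TEXT DEFAULT ''")  # no-op on rerun
```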


@@ -18,7 +18,7 @@ from typing import List
import config
from var import source_keyword_var
from .bilibili_store_impl import *
from ._store_impl import *
from .bilibilli_store_media import *


@@ -0,0 +1,299 @@
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# -*- coding: utf-8 -*-
# @Author : persist1@126.com
# @Time : 2025/9/5 19:34
# @Desc : B站存储实现类
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
from sqlalchemy import select
from sqlalchemy.orm import sessionmaker
import config
from base.base_crawler import AbstractStore
from database.db_session import get_session
from database.models import BilibiliVideoComment, BilibiliVideo, BilibiliUpInfo, BilibiliUpDynamic, BilibiliContactInfo
from tools.async_file_writer import AsyncFileWriter
from tools import utils, words
from var import crawler_type_var
class BiliCsvStoreImplement(AbstractStore):
def __init__(self):
self.file_writer = AsyncFileWriter(
crawler_type=crawler_type_var.get(),
platform="bilibili"
)
async def store_content(self, content_item: Dict):
"""
content CSV storage implementation
Args:
content_item:
Returns:
"""
await self.file_writer.write_to_csv(
item=content_item,
item_type="videos"
)
async def store_comment(self, comment_item: Dict):
"""
comment CSV storage implementation
Args:
comment_item:
Returns:
"""
await self.file_writer.write_to_csv(
item=comment_item,
item_type="comments"
)
async def store_creator(self, creator: Dict):
"""
creator CSV storage implementation
Args:
creator:
Returns:
"""
await self.file_writer.write_to_csv(
item=creator,
item_type="creators"
)
async def store_contact(self, contact_item: Dict):
"""
creator contact CSV storage implementation
Args:
contact_item: creator's contact item dict
Returns:
"""
await self.file_writer.write_to_csv(
item=contact_item,
item_type="contacts"
)
async def store_dynamic(self, dynamic_item: Dict):
"""
creator dynamic CSV storage implementation
Args:
dynamic_item: creator's dynamic item dict
Returns:
"""
await self.file_writer.write_to_csv(
item=dynamic_item,
item_type="dynamics"
)
class BiliDbStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Bilibili content DB storage implementation
Args:
content_item: content item dict
"""
video_id = content_item.get("video_id")
async with get_session() as session:
result = await session.execute(select(BilibiliVideo).where(BilibiliVideo.video_id == video_id))
video_detail = result.scalar_one_or_none()
if not video_detail:
content_item["add_ts"] = utils.get_current_timestamp()
new_content = BilibiliVideo(**content_item)
session.add(new_content)
else:
for key, value in content_item.items():
setattr(video_detail, key, value)
await session.commit()
async def store_comment(self, comment_item: Dict):
"""
Bilibili comment DB storage implementation
Args:
comment_item: comment item dict
"""
comment_id = comment_item.get("comment_id")
async with get_session() as session:
result = await session.execute(select(BilibiliVideoComment).where(BilibiliVideoComment.comment_id == comment_id))
comment_detail = result.scalar_one_or_none()
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
new_comment = BilibiliVideoComment(**comment_item)
session.add(new_comment)
else:
for key, value in comment_item.items():
setattr(comment_detail, key, value)
await session.commit()
async def store_creator(self, creator: Dict):
"""
Bilibili creator DB storage implementation
Args:
creator: creator item dict
"""
creator_id = creator.get("user_id")
async with get_session() as session:
result = await session.execute(select(BilibiliUpInfo).where(BilibiliUpInfo.user_id == creator_id))
creator_detail = result.scalar_one_or_none()
if not creator_detail:
creator["add_ts"] = utils.get_current_timestamp()
new_creator = BilibiliUpInfo(**creator)
session.add(new_creator)
else:
for key, value in creator.items():
setattr(creator_detail, key, value)
await session.commit()
async def store_contact(self, contact_item: Dict):
"""
Bilibili contact DB storage implementation
Args:
contact_item: contact item dict
"""
up_id = contact_item.get("up_id")
fan_id = contact_item.get("fan_id")
async with get_session() as session:
result = await session.execute(
select(BilibiliContactInfo).where(BilibiliContactInfo.up_id == up_id, BilibiliContactInfo.fan_id == fan_id)
)
contact_detail = result.scalar_one_or_none()
if not contact_detail:
contact_item["add_ts"] = utils.get_current_timestamp()
new_contact = BilibiliContactInfo(**contact_item)
session.add(new_contact)
else:
for key, value in contact_item.items():
setattr(contact_detail, key, value)
await session.commit()
async def store_dynamic(self, dynamic_item):
"""
Bilibili dynamic DB storage implementation
Args:
dynamic_item: dynamic item dict
"""
dynamic_id = dynamic_item.get("dynamic_id")
async with get_session() as session:
result = await session.execute(select(BilibiliUpDynamic).where(BilibiliUpDynamic.dynamic_id == dynamic_id))
dynamic_detail = result.scalar_one_or_none()
if not dynamic_detail:
dynamic_item["add_ts"] = utils.get_current_timestamp()
new_dynamic = BilibiliUpDynamic(**dynamic_item)
session.add(new_dynamic)
else:
for key, value in dynamic_item.items():
setattr(dynamic_detail, key, value)
await session.commit()
class BiliJsonStoreImplement(AbstractStore):
def __init__(self):
self.file_writer = AsyncFileWriter(
crawler_type=crawler_type_var.get(),
platform="bilibili"
)
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.file_writer.write_single_item_to_json(
item=content_item,
item_type="contents"
)
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.file_writer.write_single_item_to_json(
item=comment_item,
item_type="comments"
)
async def store_creator(self, creator: Dict):
"""
creator JSON storage implementation
Args:
creator:
Returns:
"""
await self.file_writer.write_single_item_to_json(
item=creator,
item_type="creators"
)
async def store_contact(self, contact_item: Dict):
"""
creator contact JSON storage implementation
Args:
contact_item: creator's contact item dict
Returns:
"""
await self.file_writer.write_single_item_to_json(
item=contact_item,
item_type="contacts"
)
async def store_dynamic(self, dynamic_item: Dict):
"""
creator dynamic JSON storage implementation
Args:
dynamic_item: creator's dynamic item dict
Returns:
"""
await self.file_writer.write_single_item_to_json(
item=dynamic_item,
item_type="dynamics"
)
class BiliSqliteStoreImplement(BiliDbStoreImplement):
pass
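Every `store_*` method in `BiliDbStoreImplement` above follows the same query-then-insert-or-update pattern, keyed on a business ID (`video_id`, `comment_id`, ...) rather than the autoincrement primary key, and stamping `add_ts` only on first insert. A minimal sqlite3 sketch of that pattern (table and column names simplified, not the project's actual schema):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE video (id INTEGER PRIMARY KEY, video_id TEXT, title TEXT, add_ts INTEGER)")

def store_content(content_item: dict) -> None:
    # Look up by business key; insert with add_ts on first sight, otherwise update in place
    row = conn.execute("SELECT id FROM video WHERE video_id = ?", (content_item["video_id"],)).fetchone()
    if row is None:
        content_item["add_ts"] = int(time.time() * 1000)
        conn.execute(
            "INSERT INTO video (video_id, title, add_ts) VALUES (:video_id, :title, :add_ts)",
            content_item,
        )
    else:
        conn.execute("UPDATE video SET title = :title WHERE video_id = :video_id", content_item)
    conn.commit()

store_content({"video_id": "BV1", "title": "first"})
store_content({"video_id": "BV1", "title": "updated"})  # same key: updates, does not duplicate
```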


@@ -1,465 +0,0 @@
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/1/14 19:34
# @Desc : B站存储实现类
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
import config
from base.base_crawler import AbstractStore
from tools import utils, words
from var import crawler_type_var
def calculate_number_of_files(file_store_path: str) -> int:
"""计算数据保存文件的前部分排序数字,支持每次运行代码不写到同一个文件中
Args:
file_store_path:
Returns:
file nums
"""
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
except ValueError:
return 1
class BiliCsvStoreImplement(AbstractStore):
csv_store_path: str = "data/bilibili"
file_count:int=calculate_number_of_files(csv_store_path)
def make_save_file_name(self, store_type: str) -> str:
"""
make save file name by store type
Args:
store_type: contents or comments
Returns: eg: data/bilibili/search_comments_20240114.csv ...
"""
return f"{self.csv_store_path}/{self.file_count}_{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.csv"
async def save_data_to_csv(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in CSV format.
Args:
save_item: save content dict info
store_type: save type, one of contents | comments
Returns: no returns
"""
pathlib.Path(self.csv_store_path).mkdir(parents=True, exist_ok=True)
save_file_name = self.make_save_file_name(store_type=store_type)
async with aiofiles.open(save_file_name, mode='a+', encoding="utf-8-sig", newline="") as f:
writer = csv.writer(f)
if await f.tell() == 0:
await writer.writerow(save_item.keys())
await writer.writerow(save_item.values())
async def store_content(self, content_item: Dict):
"""
Bilibili content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.save_data_to_csv(save_item=content_item, store_type="contents")
async def store_comment(self, comment_item: Dict):
"""
Bilibili comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.save_data_to_csv(save_item=comment_item, store_type="comments")
async def store_creator(self, creator: Dict):
"""
Bilibili creator CSV storage implementation
Args:
creator: creator item dict
Returns:
"""
await self.save_data_to_csv(save_item=creator, store_type="creators")
async def store_contact(self, contact_item: Dict):
"""
Bilibili contact CSV storage implementation
Args:
contact_item: creator's contact item dict
Returns:
"""
await self.save_data_to_csv(save_item=contact_item, store_type="contacts")
async def store_dynamic(self, dynamic_item: Dict):
"""
Bilibili dynamic CSV storage implementation
Args:
dynamic_item: creator's dynamic item dict
Returns:
"""
await self.save_data_to_csv(save_item=dynamic_item, store_type="dynamics")
class BiliDbStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Bilibili content DB storage implementation
Args:
content_item: content item dict
Returns:
"""
from .bilibili_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
video_id = content_item.get("video_id")
video_detail: Dict = await query_content_by_content_id(content_id=video_id)
if not video_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(video_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Bilibili content DB storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .bilibili_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Bilibili creator DB storage implementation
Args:
creator: creator item dict
Returns:
"""
from .bilibili_store_sql import (add_new_creator,
query_creator_by_creator_id,
update_creator_by_creator_id)
creator_id = creator.get("user_id")
creator_detail: Dict = await query_creator_by_creator_id(creator_id=creator_id)
if not creator_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_creator_id(creator_id, creator_item=creator)
async def store_contact(self, contact_item: Dict):
"""
Bilibili contact DB storage implementation
Args:
contact_item: contact item dict
Returns:
"""
from .bilibili_store_sql import (add_new_contact,
query_contact_by_up_and_fan,
update_contact_by_id, )
up_id = contact_item.get("up_id")
fan_id = contact_item.get("fan_id")
contact_detail: Dict = await query_contact_by_up_and_fan(up_id=up_id, fan_id=fan_id)
if not contact_detail:
contact_item["add_ts"] = utils.get_current_timestamp()
await add_new_contact(contact_item)
else:
key_id = contact_detail.get("id")
await update_contact_by_id(id=key_id, contact_item=contact_item)
async def store_dynamic(self, dynamic_item):
"""
Bilibili dynamic DB storage implementation
Args:
dynamic_item: dynamic item dict
Returns:
"""
from .bilibili_store_sql import (add_new_dynamic,
query_dynamic_by_dynamic_id,
update_dynamic_by_dynamic_id)
dynamic_id = dynamic_item.get("dynamic_id")
dynamic_detail = await query_dynamic_by_dynamic_id(dynamic_id=dynamic_id)
if not dynamic_detail:
dynamic_item["add_ts"] = utils.get_current_timestamp()
await add_new_dynamic(dynamic_item)
else:
await update_dynamic_by_dynamic_id(dynamic_id, dynamic_item=dynamic_item)
class BiliJsonStoreImplement(AbstractStore):
json_store_path: str = "data/bilibili/json"
words_store_path: str = "data/bilibili/words"
lock = asyncio.Lock()
file_count:int=calculate_number_of_files(json_store_path)
WordCloud = words.AsyncWordCloudGenerator()
def make_save_file_name(self, store_type: str) -> (str,str):
"""
make save file name by store type
Args:
store_type: save type, one of contents | comments
Returns:
"""
return (
f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
)
async def save_data_to_json(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in json format.
Args:
save_item: save content dict info
store_type: save type, one of contents | comments
Returns:
"""
pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
save_file_name,words_file_name_prefix = self.make_save_file_name(store_type=store_type)
save_data = []
async with self.lock:
if os.path.exists(save_file_name):
async with aiofiles.open(save_file_name, 'r', encoding='utf-8') as file:
save_data = json.loads(await file.read())
save_data.append(save_item)
async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
await file.write(json.dumps(save_data, ensure_ascii=False))
if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
try:
await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
except Exception:
pass
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.save_data_to_json(content_item, "contents")
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.save_data_to_json(comment_item, "comments")
async def store_creator(self, creator: Dict):
"""
creator JSON storage implementation
Args:
creator:
Returns:
"""
await self.save_data_to_json(creator, "creators")
async def store_contact(self, contact_item: Dict):
"""
creator contact JSON storage implementation
Args:
contact_item: creator's contact item dict
Returns:
"""
await self.save_data_to_json(save_item=contact_item, store_type="contacts")
async def store_dynamic(self, dynamic_item: Dict):
"""
creator dynamic JSON storage implementation
Args:
dynamic_item: creator's dynamic item dict
Returns:
"""
await self.save_data_to_json(save_item=dynamic_item, store_type="dynamics")
class BiliSqliteStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Bilibili content SQLite storage implementation
Args:
content_item: content item dict
Returns:
"""
from .bilibili_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
video_id = content_item.get("video_id")
video_detail: Dict = await query_content_by_content_id(content_id=video_id)
if not video_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(video_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Bilibili comment SQLite storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .bilibili_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Bilibili creator SQLite storage implementation
Args:
creator: creator item dict
Returns:
"""
from .bilibili_store_sql import (add_new_creator,
query_creator_by_creator_id,
update_creator_by_creator_id)
creator_id = creator.get("user_id")
creator_detail: Dict = await query_creator_by_creator_id(creator_id=creator_id)
if not creator_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_creator_id(creator_id, creator_item=creator)
async def store_contact(self, contact_item: Dict):
"""
Bilibili contact SQLite storage implementation
Args:
contact_item: contact item dict
Returns:
"""
from .bilibili_store_sql import (add_new_contact,
query_contact_by_up_and_fan,
update_contact_by_id, )
up_id = contact_item.get("up_id")
fan_id = contact_item.get("fan_id")
contact_detail: Dict = await query_contact_by_up_and_fan(up_id=up_id, fan_id=fan_id)
if not contact_detail:
contact_item["add_ts"] = utils.get_current_timestamp()
await add_new_contact(contact_item)
else:
key_id = contact_detail.get("id")
await update_contact_by_id(id=key_id, contact_item=contact_item)
async def store_dynamic(self, dynamic_item):
"""
Bilibili dynamic SQLite storage implementation
Args:
dynamic_item: dynamic item dict
Returns:
"""
from .bilibili_store_sql import (add_new_dynamic,
query_dynamic_by_dynamic_id,
update_dynamic_by_dynamic_id)
dynamic_id = dynamic_item.get("dynamic_id")
dynamic_detail = await query_dynamic_by_dynamic_id(dynamic_id=dynamic_id)
if not dynamic_detail:
dynamic_item["add_ts"] = utils.get_current_timestamp()
await add_new_dynamic(dynamic_item)
else:
await update_dynamic_by_dynamic_id(dynamic_id, dynamic_item=dynamic_item)
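The removed file's `calculate_number_of_files` helper derives the next numeric file prefix from existing filenames so each run writes to a fresh CSV. Its behavior can be checked standalone (a sketch, assuming the `N_type_date.csv` naming shown above):

```python
import os
import tempfile

def calculate_number_of_files(file_store_path: str) -> int:
    # Next run-number prefix: 1 for a missing or empty dir, else max existing prefix + 1
    if not os.path.exists(file_store_path):
        return 1
    try:
        return max(int(name.split("_")[0]) for name in os.listdir(file_store_path)) + 1
    except ValueError:
        return 1

tmp = tempfile.mkdtemp()
n_empty = calculate_number_of_files(tmp)  # empty dir
open(os.path.join(tmp, "1_search_contents_2024-01-14.csv"), "w").close()
open(os.path.join(tmp, "2_search_comments_2024-01-14.csv"), "w").close()
n_next = calculate_number_of_files(tmp)   # follows the highest existing prefix
```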


@@ -1,253 +0,0 @@
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/6 15:30
# @Desc : sql接口集合
from typing import Dict, List, Union
from async_db import AsyncMysqlDB
from async_sqlite_db import AsyncSqliteDB
from var import media_crawler_db_var
async def query_content_by_content_id(content_id: str) -> Dict:
"""
查询一条内容记录xhs的帖子 抖音的视频 微博 快手视频 ...
Args:
content_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from bilibili_video where video_id = '{content_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_content(content_item: Dict) -> int:
"""
新增一条内容记录xhs的帖子 抖音的视频 微博 快手视频 ...
Args:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("bilibili_video", content_item)
return last_row_id
async def update_content_by_content_id(content_id: str, content_item: Dict) -> int:
"""
更新一条记录xhs的帖子 抖音的视频 微博 快手视频 ...
Args:
content_id:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("bilibili_video", content_item, "video_id", content_id)
return effect_row
async def query_comment_by_comment_id(comment_id: str) -> Dict:
"""
查询一条评论内容
Args:
comment_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from bilibili_video_comment where comment_id = '{comment_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_comment(comment_item: Dict) -> int:
"""
新增一条评论记录
Args:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("bilibili_video_comment", comment_item)
return last_row_id
async def update_comment_by_comment_id(comment_id: str, comment_item: Dict) -> int:
"""
更新一条评论记录
Args:
comment_id:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("bilibili_video_comment", comment_item, "comment_id", comment_id)
return effect_row
async def query_creator_by_creator_id(creator_id: str) -> Dict:
"""
查询up主信息
Args:
creator_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from bilibili_up_info where user_id = '{creator_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_creator(creator_item: Dict) -> int:
"""
新增up主信息
Args:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("bilibili_up_info", creator_item)
return last_row_id
async def update_creator_by_creator_id(creator_id: str, creator_item: Dict) -> int:
"""
更新up主信息
Args:
creator_id:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("bilibili_up_info", creator_item, "user_id", creator_id)
return effect_row
async def query_contact_by_up_and_fan(up_id: str, fan_id: str) -> Dict:
"""
查询一条关联关系
Args:
up_id:
fan_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from bilibili_contact_info where up_id = '{up_id}' and fan_id = '{fan_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_contact(contact_item: Dict) -> int:
"""
新增关联关系
Args:
contact_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("bilibili_contact_info", contact_item)
return last_row_id
async def update_contact_by_id(id: str, contact_item: Dict) -> int:
"""
更新关联关系
Args:
id:
contact_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("bilibili_contact_info", contact_item, "id", id)
return effect_row
async def query_dynamic_by_dynamic_id(dynamic_id: str) -> Dict:
"""
查询一条动态信息
Args:
dynamic_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from bilibili_up_dynamic where dynamic_id = '{dynamic_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_dynamic(dynamic_item: Dict) -> int:
"""
Add a new dynamic record
Args:
dynamic_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("bilibili_up_dynamic", dynamic_item)
return last_row_id
async def update_dynamic_by_dynamic_id(dynamic_id: str, dynamic_item: Dict) -> int:
"""
Update a dynamic record
Args:
dynamic_id:
dynamic_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("bilibili_up_dynamic", dynamic_item, "dynamic_id", dynamic_id)
return effect_row


@@ -17,7 +17,7 @@ from typing import List
import config
from var import source_keyword_var
from .douyin_store_impl import *
from ._store_impl import *
from .douyin_store_media import *

store/douyin/_store_impl.py

@@ -0,0 +1,198 @@
# Disclaimer: This code is for learning and research purposes only. Users must observe the following principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operation.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to the above principles and all terms of the LICENSE.
# -*- coding: utf-8 -*-
# @Author : persist1@126.com
# @Time : 2025/9/5 19:34
# @Desc : Douyin storage implementation classes
import asyncio
import json
import os
import pathlib
from typing import Dict
from sqlalchemy import select
import config
from base.base_crawler import AbstractStore
from database.db_session import get_session
from database.models import DouyinAweme, DouyinAwemeComment, DyCreator
from tools import utils, words
from tools.async_file_writer import AsyncFileWriter
from var import crawler_type_var
class DouyinCsvStoreImplement(AbstractStore):
def __init__(self):
self.file_writer = AsyncFileWriter(
crawler_type=crawler_type_var.get(),
platform="douyin"
)
async def store_content(self, content_item: Dict):
"""
Douyin content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.file_writer.write_to_csv(
item=content_item,
item_type="contents"
)
async def store_comment(self, comment_item: Dict):
"""
Douyin comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.file_writer.write_to_csv(
item=comment_item,
item_type="comments"
)
async def store_creator(self, creator: Dict):
"""
Douyin creator CSV storage implementation
Args:
creator: creator item dict
Returns:
"""
await self.file_writer.write_to_csv(
item=creator,
item_type="creators"
)
class DouyinDbStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Douyin content DB storage implementation
Args:
content_item: content item dict
"""
aweme_id = content_item.get("aweme_id")
async with get_session() as session:
result = await session.execute(select(DouyinAweme).where(DouyinAweme.aweme_id == aweme_id))
aweme_detail = result.scalar_one_or_none()
if not aweme_detail:
content_item["add_ts"] = utils.get_current_timestamp()
if content_item.get("title"):
new_content = DouyinAweme(**content_item)
session.add(new_content)
else:
for key, value in content_item.items():
setattr(aweme_detail, key, value)
await session.commit()
async def store_comment(self, comment_item: Dict):
"""
Douyin comment DB storage implementation
Args:
comment_item: comment item dict
"""
comment_id = comment_item.get("comment_id")
async with get_session() as session:
result = await session.execute(select(DouyinAwemeComment).where(DouyinAwemeComment.comment_id == comment_id))
comment_detail = result.scalar_one_or_none()
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
new_comment = DouyinAwemeComment(**comment_item)
session.add(new_comment)
else:
for key, value in comment_item.items():
setattr(comment_detail, key, value)
await session.commit()
async def store_creator(self, creator: Dict):
"""
Douyin creator DB storage implementation
Args:
creator: creator dict
"""
user_id = creator.get("user_id")
async with get_session() as session:
result = await session.execute(select(DyCreator).where(DyCreator.user_id == user_id))
user_detail = result.scalar_one_or_none()
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
new_creator = DyCreator(**creator)
session.add(new_creator)
else:
for key, value in creator.items():
setattr(user_detail, key, value)
await session.commit()
class DouyinJsonStoreImplement(AbstractStore):
def __init__(self):
self.file_writer = AsyncFileWriter(
crawler_type=crawler_type_var.get(),
platform="douyin"
)
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.file_writer.write_single_item_to_json(
item=content_item,
item_type="contents"
)
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.file_writer.write_single_item_to_json(
item=comment_item,
item_type="comments"
)
async def store_creator(self, creator: Dict):
"""
creator JSON storage implementation
Args:
creator:
Returns:
"""
await self.file_writer.write_single_item_to_json(
item=creator,
item_type="creators"
)
class DouyinSqliteStoreImplement(DouyinDbStoreImplement):
pass
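The DB implementations above all follow the same read-then-upsert pattern: select by the natural key, insert with an `add_ts` if missing, otherwise copy the new fields onto the loaded row. A minimal self-contained sketch of that pattern, using a plain dict-backed "table" instead of the project's SQLAlchemy models so it runs without a database (the names here are illustrative, not part of the project):

```python
import time
from typing import Dict

# Illustrative in-memory "table" keyed by the natural key (e.g. aweme_id).
_TABLE: Dict[str, Dict] = {}

def upsert(key_field: str, item: Dict) -> Dict:
    """Insert item if its key is unseen, else merge fields onto the stored row."""
    key = item[key_field]
    row = _TABLE.get(key)
    if row is None:
        item = dict(item)
        item["add_ts"] = int(time.time() * 1000)  # stands in for utils.get_current_timestamp()
        _TABLE[key] = item
        return item
    row.update(item)  # same effect as setattr() per field on the ORM object
    return row

first = upsert("aweme_id", {"aweme_id": "1", "title": "hello"})
second = upsert("aweme_id", {"aweme_id": "1", "title": "hello again"})
```

Running it twice with the same key updates in place: `second` is the same row as `first`, with the title from the latest write and the original `add_ts` preserved.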


@@ -1,324 +0,0 @@
# Disclaimer: This code is for learning and research purposes only. Users must observe the following principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operation.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to the above principles and all terms of the LICENSE.
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/1/14 18:46
# @Desc : Douyin storage implementation classes
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
import config
from base.base_crawler import AbstractStore
from tools import utils, words
from var import crawler_type_var
def calculate_number_of_files(file_store_path: str) -> int:
"""Compute the leading sequence number for saved data files, so each run writes to a new file instead of the previous one
Args:
file_store_path:
Returns:
next file sequence number
"""
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
except ValueError:
return 1
class DouyinCsvStoreImplement(AbstractStore):
csv_store_path: str = "data/douyin"
file_count: int = calculate_number_of_files(csv_store_path)
def make_save_file_name(self, store_type: str) -> str:
"""
make save file name by store type
Args:
store_type: contents or comments
Returns: eg: data/douyin/search_comments_20240114.csv ...
"""
return f"{self.csv_store_path}/{self.file_count}_{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.csv"
async def save_data_to_csv(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in CSV format.
Args:
save_item: save content dict info
store_type: save type, one of "contents" or "comments"
Returns: no returns
"""
pathlib.Path(self.csv_store_path).mkdir(parents=True, exist_ok=True)
save_file_name = self.make_save_file_name(store_type=store_type)
async with aiofiles.open(save_file_name, mode='a+', encoding="utf-8-sig", newline="") as f:
writer = csv.writer(f)
if await f.tell() == 0:
await writer.writerow(save_item.keys())
await writer.writerow(save_item.values())
async def store_content(self, content_item: Dict):
"""
Douyin content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.save_data_to_csv(save_item=content_item, store_type="contents")
async def store_comment(self, comment_item: Dict):
"""
Douyin comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.save_data_to_csv(save_item=comment_item, store_type="comments")
async def store_creator(self, creator: Dict):
"""
Douyin creator CSV storage implementation
Args:
creator: creator item dict
Returns:
"""
await self.save_data_to_csv(save_item=creator, store_type="creator")
class DouyinDbStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Douyin content DB storage implementation
Args:
content_item: content item dict
Returns:
"""
from .douyin_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
aweme_id = content_item.get("aweme_id")
aweme_detail: Dict = await query_content_by_content_id(content_id=aweme_id)
if not aweme_detail:
content_item["add_ts"] = utils.get_current_timestamp()
if content_item.get("title"):
await add_new_content(content_item)
else:
await update_content_by_content_id(aweme_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Douyin comment DB storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .douyin_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Douyin creator DB storage implementation
Args:
creator: creator dict
Returns:
"""
from .douyin_store_sql import (add_new_creator,
query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)
class DouyinJsonStoreImplement(AbstractStore):
json_store_path: str = "data/douyin/json"
words_store_path: str = "data/douyin/words"
lock = asyncio.Lock()
file_count: int = calculate_number_of_files(json_store_path)
WordCloud = words.AsyncWordCloudGenerator()
def make_save_file_name(self, store_type: str) -> tuple[str, str]:
"""
make save file name by store type
Args:
store_type: save type, one of "contents" or "comments"
Returns:
"""
return (
f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
)
async def save_data_to_json(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in json format.
Args:
save_item: save content dict info
store_type: save type, one of "contents" or "comments"
Returns:
"""
pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
save_file_name, words_file_name_prefix = self.make_save_file_name(store_type=store_type)
save_data = []
async with self.lock:
if os.path.exists(save_file_name):
async with aiofiles.open(save_file_name, 'r', encoding='utf-8') as file:
save_data = json.loads(await file.read())
save_data.append(save_item)
async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
await file.write(json.dumps(save_data, ensure_ascii=False))
if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
try:
await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
except Exception:
# word-cloud generation is best-effort; skip on failure
pass
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.save_data_to_json(content_item, "contents")
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.save_data_to_json(comment_item, "comments")
async def store_creator(self, creator: Dict):
"""
Douyin creator JSON storage implementation
Args:
creator: creator item dict
Returns:
"""
await self.save_data_to_json(save_item=creator, store_type="creator")
class DouyinSqliteStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Douyin content SQLite storage implementation
Args:
content_item: content item dict
Returns:
"""
from .douyin_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
aweme_id = content_item.get("aweme_id")
aweme_detail: Dict = await query_content_by_content_id(content_id=aweme_id)
if not aweme_detail:
content_item["add_ts"] = utils.get_current_timestamp()
if content_item.get("title"):
await add_new_content(content_item)
else:
await update_content_by_content_id(aweme_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Douyin comment SQLite storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .douyin_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Douyin creator SQLite storage implementation
Args:
creator: creator dict
Returns:
"""
from .douyin_store_sql import (add_new_creator,
query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)


@@ -1,160 +0,0 @@
# Disclaimer: This code is for learning and research purposes only. Users must observe the following principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operation.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to the above principles and all terms of the LICENSE.
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/6 15:30
# @Desc : Collection of SQL helper functions
from typing import Dict, List, Union
from async_db import AsyncMysqlDB
from async_sqlite_db import AsyncSqliteDB
from var import media_crawler_db_var
async def query_content_by_content_id(content_id: str) -> Dict:
"""
Query one content record (XHS note, Douyin video, Weibo post, Kuaishou video, ...)
Args:
content_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from douyin_aweme where aweme_id = '{content_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_content(content_item: Dict) -> int:
"""
Add a new content record (XHS note, Douyin video, Weibo post, Kuaishou video, ...)
Args:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("douyin_aweme", content_item)
return last_row_id
async def update_content_by_content_id(content_id: str, content_item: Dict) -> int:
"""
Update a content record (XHS note, Douyin video, Weibo post, Kuaishou video, ...)
Args:
content_id:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("douyin_aweme", content_item, "aweme_id", content_id)
return effect_row
async def query_comment_by_comment_id(comment_id: str) -> Dict:
"""
Query one comment record
Args:
comment_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from douyin_aweme_comment where comment_id = '{comment_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_comment(comment_item: Dict) -> int:
"""
Add a new comment record
Args:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("douyin_aweme_comment", comment_item)
return last_row_id
async def update_comment_by_comment_id(comment_id: str, comment_item: Dict) -> int:
"""
Update a comment record
Args:
comment_id:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("douyin_aweme_comment", comment_item, "comment_id", comment_id)
return effect_row
async def query_creator_by_user_id(user_id: str) -> Dict:
"""
Query one creator record
Args:
user_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from dy_creator where user_id = '{user_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_creator(creator_item: Dict) -> int:
"""
Add a new creator record
Args:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("dy_creator", creator_item)
return last_row_id
async def update_creator_by_user_id(user_id: str, creator_item: Dict) -> int:
"""
Update a creator record
Args:
user_id:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("dy_creator", creator_item, "user_id", user_id)
return effect_row
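The deleted helpers above build SQL by interpolating IDs directly into f-strings, which is an injection risk if an id is ever untrusted. As a hedged illustration of the parameterized alternative (shown with the stdlib `sqlite3`, since the `AsyncMysqlDB`/`AsyncSqliteDB` query signatures are not visible in this diff):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE douyin_aweme (aweme_id TEXT PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO douyin_aweme VALUES (?, ?)", ("42", "demo"))

def query_content_by_content_id(content_id: str) -> dict:
    # Placeholders keep the id out of the SQL text entirely.
    cur = conn.execute(
        "SELECT aweme_id, title FROM douyin_aweme WHERE aweme_id = ?", (content_id,)
    )
    row = cur.fetchone()
    if row is None:
        return {}
    return {"aweme_id": row[0], "title": row[1]}

found = query_content_by_content_id("42")
missing = query_content_by_content_id("x' OR '1'='1")  # treated as a literal, not SQL
```

The malicious-looking id matches nothing because it is bound as a value, never parsed as SQL.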


@@ -18,7 +18,7 @@ from typing import List
import config
from var import source_keyword_var
from .kuaishou_store_impl import *
from ._store_impl import *
class KuaishouStoreFactory:


@@ -0,0 +1,160 @@
# Disclaimer: This code is for learning and research purposes only. Users must observe the following principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operation.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to the above principles and all terms of the LICENSE.
# -*- coding: utf-8 -*-
# @Author : persist1@126.com
# @Time : 2025/9/5 19:34
# @Desc : Kuaishou storage implementation classes
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
from tools.async_file_writer import AsyncFileWriter
import aiofiles
from sqlalchemy import select
import config
from base.base_crawler import AbstractStore
from database.db_session import get_session
from database.models import KuaishouVideo, KuaishouVideoComment
from tools import utils, words
from var import crawler_type_var
def calculate_number_of_files(file_store_path: str) -> int:
"""Compute the leading sequence number for saved data files, so each run writes to a new file instead of the previous one
Args:
file_store_path:
Returns:
next file sequence number
"""
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
except ValueError:
return 1
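A quick check of the numbering helper's behavior, re-declared here so the snippet is self-contained (the directory is a throwaway temp dir, not a project path):

```python
import os
import tempfile

def calculate_number_of_files(file_store_path: str) -> int:
    # Next sequence number = highest existing "N_..." prefix + 1, else 1.
    if not os.path.exists(file_store_path):
        return 1
    try:
        return max(int(name.split("_")[0]) for name in os.listdir(file_store_path)) + 1
    except ValueError:
        return 1

with tempfile.TemporaryDirectory() as d:
    empty = calculate_number_of_files(d)  # no files yet -> 1
    open(os.path.join(d, "3_search_contents.csv"), "w").close()
    nxt = calculate_number_of_files(d)    # continues after the highest prefix -> 4
```

Note the helper also falls back to 1 if any filename lacks a numeric prefix, since `int()` then raises the same `ValueError` that an empty directory does.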
class KuaishouCsvStoreImplement(AbstractStore):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.writer = AsyncFileWriter(platform="kuaishou", crawler_type=crawler_type_var.get())
async def store_content(self, content_item: Dict):
"""
Kuaishou content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.writer.write_to_csv(item_type="contents", item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Kuaishou comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.writer.write_to_csv(item_type="comments", item=comment_item)
async def store_creator(self, creator: Dict):
pass
class KuaishouDbStoreImplement(AbstractStore):
async def store_creator(self, creator: Dict):
pass
async def store_content(self, content_item: Dict):
"""
Kuaishou content DB storage implementation
Args:
content_item: content item dict
"""
video_id = content_item.get("video_id")
async with get_session() as session:
result = await session.execute(select(KuaishouVideo).where(KuaishouVideo.video_id == video_id))
video_detail = result.scalar_one_or_none()
if not video_detail:
content_item["add_ts"] = utils.get_current_timestamp()
new_content = KuaishouVideo(**content_item)
session.add(new_content)
else:
for key, value in content_item.items():
setattr(video_detail, key, value)
await session.commit()
async def store_comment(self, comment_item: Dict):
"""
Kuaishou comment DB storage implementation
Args:
comment_item: comment item dict
"""
comment_id = comment_item.get("comment_id")
async with get_session() as session:
result = await session.execute(
select(KuaishouVideoComment).where(KuaishouVideoComment.comment_id == comment_id))
comment_detail = result.scalar_one_or_none()
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
new_comment = KuaishouVideoComment(**comment_item)
session.add(new_comment)
else:
for key, value in comment_item.items():
setattr(comment_detail, key, value)
await session.commit()
class KuaishouJsonStoreImplement(AbstractStore):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.writer = AsyncFileWriter(platform="kuaishou", crawler_type=crawler_type_var.get())
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.writer.write_single_item_to_json(item_type="contents", item=content_item)
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.writer.write_single_item_to_json(item_type="comments", item=comment_item)
async def store_creator(self, creator: Dict):
pass
class KuaishouSqliteStoreImplement(KuaishouDbStoreImplement):
async def store_creator(self, creator: Dict):
pass


@@ -1,290 +0,0 @@
# Disclaimer: This code is for learning and research purposes only. Users must observe the following principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operation.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to the above principles and all terms of the LICENSE.
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/1/14 20:03
# @Desc : Kuaishou storage implementation classes
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
import config
from base.base_crawler import AbstractStore
from tools import utils, words
from var import crawler_type_var
def calculate_number_of_files(file_store_path: str) -> int:
"""Compute the leading sequence number for saved data files, so each run writes to a new file instead of the previous one
Args:
file_store_path:
Returns:
next file sequence number
"""
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
except ValueError:
return 1
class KuaishouCsvStoreImplement(AbstractStore):
async def store_creator(self, creator: Dict):
pass
csv_store_path: str = "data/kuaishou"
file_count: int = calculate_number_of_files(csv_store_path)
def make_save_file_name(self, store_type: str) -> str:
"""
make save file name by store type
Args:
store_type: contents or comments
Returns: eg: data/kuaishou/search_comments_20240114.csv ...
"""
return f"{self.csv_store_path}/{self.file_count}_{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.csv"
async def save_data_to_csv(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in CSV format.
Args:
save_item: save content dict info
store_type: save type, one of "contents" or "comments"
Returns: no returns
"""
pathlib.Path(self.csv_store_path).mkdir(parents=True, exist_ok=True)
save_file_name = self.make_save_file_name(store_type=store_type)
async with aiofiles.open(save_file_name, mode='a+', encoding="utf-8-sig", newline="") as f:
writer = csv.writer(f)
if await f.tell() == 0:
await writer.writerow(save_item.keys())
await writer.writerow(save_item.values())
async def store_content(self, content_item: Dict):
"""
Kuaishou content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.save_data_to_csv(save_item=content_item, store_type="contents")
async def store_comment(self, comment_item: Dict):
"""
Kuaishou comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.save_data_to_csv(save_item=comment_item, store_type="comments")
class KuaishouDbStoreImplement(AbstractStore):
async def store_creator(self, creator: Dict):
pass
async def store_content(self, content_item: Dict):
"""
Kuaishou content DB storage implementation
Args:
content_item: content item dict
Returns:
"""
from .kuaishou_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
video_id = content_item.get("video_id")
video_detail: Dict = await query_content_by_content_id(content_id=video_id)
if not video_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(video_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Kuaishou comment DB storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .kuaishou_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
class KuaishouJsonStoreImplement(AbstractStore):
json_store_path: str = "data/kuaishou/json"
words_store_path: str = "data/kuaishou/words"
lock = asyncio.Lock()
file_count: int = calculate_number_of_files(json_store_path)
WordCloud = words.AsyncWordCloudGenerator()
def make_save_file_name(self, store_type: str) -> tuple[str, str]:
"""
make save file name by store type
Args:
store_type: save type, one of "contents" or "comments"
Returns:
"""
return (
f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
)
async def save_data_to_json(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in json format.
Args:
save_item: save content dict info
store_type: save type, one of "contents" or "comments"
Returns:
"""
pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
save_file_name, words_file_name_prefix = self.make_save_file_name(store_type=store_type)
save_data = []
async with self.lock:
if os.path.exists(save_file_name):
async with aiofiles.open(save_file_name, 'r', encoding='utf-8') as file:
save_data = json.loads(await file.read())
save_data.append(save_item)
async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
await file.write(json.dumps(save_data, ensure_ascii=False))
if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
try:
await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
except Exception:
# word-cloud generation is best-effort; skip on failure
pass
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.save_data_to_json(content_item, "contents")
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.save_data_to_json(comment_item, "comments")
async def store_creator(self, creator: Dict):
"""
Kuaishou creator JSON storage implementation
Args:
creator: creator dict
Returns:
"""
await self.save_data_to_json(creator, "creator")
class KuaishouSqliteStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Kuaishou content SQLite storage implementation
Args:
content_item: content item dict
Returns:
"""
from .kuaishou_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
video_id = content_item.get("video_id")
video_detail: Dict = await query_content_by_content_id(content_id=video_id)
if not video_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(video_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Kuaishou comment SQLite storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .kuaishou_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Kuaishou creator SQLite storage implementation
Args:
creator: creator dict
Returns:
"""
pass


@@ -1,114 +0,0 @@
# Disclaimer: This code is for learning and research purposes only. Users must observe the following principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operation.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to the above principles and all terms of the LICENSE.
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/6 15:30
# @Desc : Collection of SQL helper functions
from typing import Dict, List, Union
from async_db import AsyncMysqlDB
from async_sqlite_db import AsyncSqliteDB
from var import media_crawler_db_var
async def query_content_by_content_id(content_id: str) -> Dict:
"""
Query one content record (XHS note, Douyin video, Weibo post, Kuaishou video, ...)
Args:
content_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from kuaishou_video where video_id = '{content_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_content(content_item: Dict) -> int:
"""
Add a new content record (XHS note, Douyin video, Weibo post, Kuaishou video, ...)
Args:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("kuaishou_video", content_item)
return last_row_id
async def update_content_by_content_id(content_id: str, content_item: Dict) -> int:
"""
Update a content record (XHS note, Douyin video, Weibo post, Kuaishou video, ...)
Args:
content_id:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("kuaishou_video", content_item, "video_id", content_id)
return effect_row
async def query_comment_by_comment_id(comment_id: str) -> Dict:
"""
Query one comment record
Args:
comment_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from kuaishou_video_comment where comment_id = '{comment_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_comment(comment_item: Dict) -> int:
"""
Add a new comment record
Args:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("kuaishou_video_comment", comment_item)
return last_row_id
async def update_comment_by_comment_id(comment_id: str, comment_item: Dict) -> int:
"""
Update a comment record
Args:
comment_id:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("kuaishou_video_comment", comment_item, "comment_id", comment_id)
return effect_row


@@ -15,8 +15,7 @@ from typing import List
from model.m_baidu_tieba import TiebaComment, TiebaCreator, TiebaNote
from var import source_keyword_var
from . import tieba_store_impl
from .tieba_store_impl import *
from ._store_impl import *
class TieBaStoreFactory:

store/tieba/_store_impl.py

@@ -0,0 +1,192 @@
# Disclaimer: This code is for learning and research purposes only. Users must observe the following principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operation.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to the above principles and all terms of the LICENSE.
# -*- coding: utf-8 -*-
# @Author : persist1@126.com
# @Time : 2025/9/5 19:34
# @Desc : Tieba storage implementation classes
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession
import config
from base.base_crawler import AbstractStore
from database.models import TiebaNote, TiebaComment, TiebaCreator
from tools import utils, words
from database.db_session import get_session
from var import crawler_type_var
from tools.async_file_writer import AsyncFileWriter
def calculate_number_of_files(file_store_path: str) -> int:
"""计算数据保存文件的前部分排序数字,支持每次运行代码不写到同一个文件中
Args:
file_store_path:
Returns:
file nums
"""
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
except ValueError:
return 1
class TieBaCsvStoreImplement(AbstractStore):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.writer = AsyncFileWriter(platform="tieba", crawler_type=crawler_type_var.get())
async def store_content(self, content_item: Dict):
"""
tieba content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.writer.write_to_csv(item_type="contents", item=content_item)
async def store_comment(self, comment_item: Dict):
"""
tieba comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.writer.write_to_csv(item_type="comments", item=comment_item)
async def store_creator(self, creator: Dict):
"""
tieba creator CSV storage implementation
Args:
creator: creator dict
Returns:
"""
await self.writer.write_to_csv(item_type="creators", item=creator)
class TieBaDbStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
tieba content DB storage implementation
Args:
content_item: content item dict
"""
note_id = content_item.get("note_id")
async with get_session() as session:
stmt = select(TiebaNote).where(TiebaNote.note_id == note_id)
res = await session.execute(stmt)
db_note = res.scalar_one_or_none()
if db_note:
for key, value in content_item.items():
setattr(db_note, key, value)
else:
db_note = TiebaNote(**content_item)
session.add(db_note)
await session.commit()
async def store_comment(self, comment_item: Dict):
"""
tieba comment DB storage implementation
Args:
comment_item: comment item dict
"""
comment_id = comment_item.get("comment_id")
async with get_session() as session:
stmt = select(TiebaComment).where(TiebaComment.comment_id == comment_id)
res = await session.execute(stmt)
db_comment = res.scalar_one_or_none()
if db_comment:
for key, value in comment_item.items():
setattr(db_comment, key, value)
else:
db_comment = TiebaComment(**comment_item)
session.add(db_comment)
await session.commit()
async def store_creator(self, creator: Dict):
"""
tieba creator DB storage implementation
Args:
creator: creator dict
"""
user_id = creator.get("user_id")
async with get_session() as session:
stmt = select(TiebaCreator).where(TiebaCreator.user_id == user_id)
res = await session.execute(stmt)
db_creator = res.scalar_one_or_none()
if db_creator:
for key, value in creator.items():
setattr(db_creator, key, value)
else:
db_creator = TiebaCreator(**creator)
session.add(db_creator)
await session.commit()
class TieBaJsonStoreImplement(AbstractStore):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.writer = AsyncFileWriter(platform="tieba", crawler_type=crawler_type_var.get())
async def store_content(self, content_item: Dict):
"""
tieba content JSON storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.writer.write_single_item_to_json(item_type="contents", item=content_item)
async def store_comment(self, comment_item: Dict):
"""
tieba comment JSON storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.writer.write_single_item_to_json(item_type="comments", item=comment_item)
async def store_creator(self, creator: Dict):
"""
tieba creator JSON storage implementation
Args:
creator: creator dict
Returns:
"""
await self.writer.write_single_item_to_json(item_type="creators", item=creator)
class TieBaSqliteStoreImplement(TieBaDbStoreImplement):
"""
Tieba sqlite store implement
"""
pass
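The DB implementation above follows a select-then-upsert flow: look the record up by its natural key, mutate the mapped object if it exists, otherwise insert a new row. A minimal synchronous sketch of the same flow with the stdlib `sqlite3` driver (the table and column names are illustrative, not the project's actual schema):

```python
import sqlite3

def upsert_comment(conn: sqlite3.Connection, comment: dict) -> None:
    """Select-then-upsert: update the row keyed by comment_id, or insert it."""
    row = conn.execute(
        "SELECT rowid FROM tieba_comment WHERE comment_id = ?",
        (comment["comment_id"],),
    ).fetchone()
    if row is None:
        # Build an INSERT from the dict keys, mirroring TiebaComment(**comment_item)
        cols = ", ".join(comment)
        marks = ", ".join("?" for _ in comment)
        conn.execute(
            f"INSERT INTO tieba_comment ({cols}) VALUES ({marks})",
            tuple(comment.values()),
        )
    else:
        # Mirror the setattr loop: overwrite every supplied column
        sets = ", ".join(f"{col} = ?" for col in comment)
        conn.execute(
            f"UPDATE tieba_comment SET {sets} WHERE comment_id = ?",
            (*comment.values(), comment["comment_id"]),
        )
    conn.commit()
```

Calling it twice with the same `comment_id` leaves a single, updated row, which is the idempotency the store relies on when the crawler revisits a page.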


@@ -1,318 +0,0 @@
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# -*- coding: utf-8 -*-
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
import config
from base.base_crawler import AbstractStore
from tools import utils, words
from var import crawler_type_var
def calculate_number_of_files(file_store_path: str) -> int:
"""计算数据保存文件的前部分排序数字,支持每次运行代码不写到同一个文件中
Args:
file_store_path:
Returns:
file nums
"""
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
except ValueError:
return 1
class TieBaCsvStoreImplement(AbstractStore):
csv_store_path: str = "data/tieba"
file_count: int = calculate_number_of_files(csv_store_path)
def make_save_file_name(self, store_type: str) -> str:
"""
make save file name by store type
Args:
store_type: contents or comments
Returns: eg: data/tieba/search_comments_20240114.csv ...
"""
return f"{self.csv_store_path}/{self.file_count}_{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.csv"
async def save_data_to_csv(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in CSV format.
Args:
save_item: save content dict info
store_type: Save type, contents | comments
Returns: no returns
"""
pathlib.Path(self.csv_store_path).mkdir(parents=True, exist_ok=True)
save_file_name = self.make_save_file_name(store_type=store_type)
async with aiofiles.open(save_file_name, mode='a+', encoding="utf-8-sig", newline="") as f:
writer = csv.writer(f)
if await f.tell() == 0:
await writer.writerow(save_item.keys())
await writer.writerow(save_item.values())
async def store_content(self, content_item: Dict):
"""
tieba content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.save_data_to_csv(save_item=content_item, store_type="contents")
async def store_comment(self, comment_item: Dict):
"""
tieba comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.save_data_to_csv(save_item=comment_item, store_type="comments")
async def store_creator(self, creator: Dict):
"""
tieba creator CSV storage implementation
Args:
creator: creator dict
Returns:
"""
await self.save_data_to_csv(save_item=creator, store_type="creator")
class TieBaDbStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
tieba content DB storage implementation
Args:
content_item: content item dict
Returns:
"""
from .tieba_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
note_id = content_item.get("note_id")
note_detail: Dict = await query_content_by_content_id(content_id=note_id)
if not note_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(note_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
tieba comment DB storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .tieba_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
tieba creator DB storage implementation
Args:
creator: creator dict
Returns:
"""
from .tieba_store_sql import (add_new_creator,
query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)
class TieBaJsonStoreImplement(AbstractStore):
json_store_path: str = "data/tieba/json"
words_store_path: str = "data/tieba/words"
lock = asyncio.Lock()
file_count: int = calculate_number_of_files(json_store_path)
WordCloud = words.AsyncWordCloudGenerator()
def make_save_file_name(self, store_type: str) -> (str, str):
"""
make save file name by store type
Args:
store_type: Save type, contents | comments
Returns:
"""
return (
f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
)
async def save_data_to_json(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in json format.
Args:
save_item: save content dict info
store_type: Save type, contents | comments
Returns:
"""
pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
save_file_name, words_file_name_prefix = self.make_save_file_name(store_type=store_type)
save_data = []
async with self.lock:
if os.path.exists(save_file_name):
async with aiofiles.open(save_file_name, 'r', encoding='utf-8') as file:
save_data = json.loads(await file.read())
save_data.append(save_item)
async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
await file.write(json.dumps(save_data, ensure_ascii=False))
if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
try:
await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
except:
pass
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.save_data_to_json(content_item, "contents")
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.save_data_to_json(comment_item, "comments")
async def store_creator(self, creator: Dict):
"""
tieba creator JSON storage implementation
Args:
creator: creator dict
Returns:
"""
await self.save_data_to_json(creator, "creator")
class TieBaSqliteStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
tieba content SQLite storage implementation
Args:
content_item: content item dict
Returns:
"""
from .tieba_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
note_id = content_item.get("note_id")
note_detail: Dict = await query_content_by_content_id(content_id=note_id)
if not note_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(note_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
tieba comment SQLite storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .tieba_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
tieba creator SQLite storage implementation
Args:
creator: creator dict
Returns:
"""
from .tieba_store_sql import (add_new_creator,
query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)


@@ -1,156 +0,0 @@
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# -*- coding: utf-8 -*-
from typing import Dict, List, Union
from async_db import AsyncMysqlDB
from async_sqlite_db import AsyncSqliteDB
from var import media_crawler_db_var
async def query_content_by_content_id(content_id: str) -> Dict:
"""
查询一条内容记录(xhs的帖子、抖音的视频、微博、快手视频 ...)
Args:
content_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from tieba_note where note_id = '{content_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_content(content_item: Dict) -> int:
"""
新增一条内容记录(xhs的帖子、抖音的视频、微博、快手视频 ...)
Args:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("tieba_note", content_item)
return last_row_id
async def update_content_by_content_id(content_id: str, content_item: Dict) -> int:
"""
更新一条记录(xhs的帖子、抖音的视频、微博、快手视频 ...)
Args:
content_id:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("tieba_note", content_item, "note_id", content_id)
return effect_row
async def query_comment_by_comment_id(comment_id: str) -> Dict:
"""
查询一条评论内容
Args:
comment_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from tieba_comment where comment_id = '{comment_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_comment(comment_item: Dict) -> int:
"""
新增一条评论记录
Args:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("tieba_comment", comment_item)
return last_row_id
async def update_comment_by_comment_id(comment_id: str, comment_item: Dict) -> int:
"""
更新一条评论记录
Args:
comment_id:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("tieba_comment", comment_item, "comment_id", comment_id)
return effect_row
async def query_creator_by_user_id(user_id: str) -> Dict:
"""
查询一条创作者记录
Args:
user_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from tieba_creator where user_id = '{user_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_creator(creator_item: Dict) -> int:
"""
新增一条创作者信息
Args:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("tieba_creator", creator_item)
return last_row_id
async def update_creator_by_user_id(user_id: str, creator_item: Dict) -> int:
"""
更新一条创作者信息
Args:
user_id:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("tieba_creator", creator_item, "user_id", user_id)
return effect_row
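Note that these query helpers interpolate IDs directly into SQL f-strings, which is only safe while the IDs are platform-generated tokens; a quote character in the value would break (or inject into) the statement. A hedged sketch of the parameterized equivalent, shown with the stdlib `sqlite3` placeholder style rather than the project's AsyncMysqlDB/AsyncSqliteDB wrappers:

```python
import sqlite3

def query_creator_by_user_id(conn: sqlite3.Connection, user_id: str) -> dict:
    """Parameterized version of the f-string query: the driver escapes user_id,
    so a value like "x' OR '1'='1" cannot alter the statement."""
    conn.row_factory = sqlite3.Row
    row = conn.execute(
        "SELECT * FROM tieba_creator WHERE user_id = ?", (user_id,)
    ).fetchone()
    return dict(row) if row else {}
```

The `?` placeholder is sqlite3-specific; MySQL drivers typically use `%s`, so the exact marker depends on the wrapper in use.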


@@ -19,7 +19,7 @@ from typing import List
from var import source_keyword_var
from .weibo_store_media import *
from .weibo_store_impl import *
from ._store_impl import *
class WeibostoreFactory:

store/weibo/_store_impl.py Normal file

@@ -0,0 +1,214 @@
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# -*- coding: utf-8 -*-
# @Author : persist1@126.com
# @Time : 2025/9/5 19:34
# @Desc : 微博存储实现类
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession
import config
from base.base_crawler import AbstractStore
from database.models import WeiboCreator, WeiboNote, WeiboNoteComment
from tools import utils, words
from tools.async_file_writer import AsyncFileWriter
from database.db_session import get_session
from var import crawler_type_var
def calculate_number_of_files(file_store_path: str) -> int:
"""计算数据保存文件的前部分排序数字,支持每次运行代码不写到同一个文件中
Args:
file_store_path:
Returns:
file nums
"""
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
except ValueError:
return 1
class WeiboCsvStoreImplement(AbstractStore):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.writer = AsyncFileWriter(platform="weibo", crawler_type=crawler_type_var.get())
async def store_content(self, content_item: Dict):
"""
Weibo content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.writer.write_to_csv(item_type="contents", item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Weibo comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.writer.write_to_csv(item_type="comments", item=comment_item)
async def store_creator(self, creator: Dict):
"""
Weibo creator CSV storage implementation
Args:
creator:
Returns:
"""
await self.writer.write_to_csv(item_type="creators", item=creator)
class WeiboDbStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Weibo content DB storage implementation
Args:
content_item: content item dict
Returns:
"""
note_id = content_item.get("note_id")
async with get_session() as session:
stmt = select(WeiboNote).where(WeiboNote.note_id == note_id)
res = await session.execute(stmt)
db_note = res.scalar_one_or_none()
if db_note:
db_note.last_modify_ts = utils.get_current_timestamp()
for key, value in content_item.items():
if hasattr(db_note, key):
setattr(db_note, key, value)
else:
content_item["add_ts"] = utils.get_current_timestamp()
content_item["last_modify_ts"] = utils.get_current_timestamp()
db_note = WeiboNote(**content_item)
session.add(db_note)
await session.commit()
async def store_comment(self, comment_item: Dict):
"""
Weibo comment DB storage implementation
Args:
comment_item: comment item dict
Returns:
"""
comment_id = comment_item.get("comment_id")
async with get_session() as session:
stmt = select(WeiboNoteComment).where(WeiboNoteComment.comment_id == comment_id)
res = await session.execute(stmt)
db_comment = res.scalar_one_or_none()
if db_comment:
db_comment.last_modify_ts = utils.get_current_timestamp()
for key, value in comment_item.items():
if hasattr(db_comment, key):
setattr(db_comment, key, value)
else:
comment_item["add_ts"] = utils.get_current_timestamp()
comment_item["last_modify_ts"] = utils.get_current_timestamp()
db_comment = WeiboNoteComment(**comment_item)
session.add(db_comment)
await session.commit()
async def store_creator(self, creator: Dict):
"""
Weibo creator DB storage implementation
Args:
creator:
Returns:
"""
user_id = creator.get("user_id")
async with get_session() as session:
stmt = select(WeiboCreator).where(WeiboCreator.user_id == user_id)
res = await session.execute(stmt)
db_creator = res.scalar_one_or_none()
if db_creator:
db_creator.last_modify_ts = utils.get_current_timestamp()
for key, value in creator.items():
if hasattr(db_creator, key):
setattr(db_creator, key, value)
else:
creator["add_ts"] = utils.get_current_timestamp()
creator["last_modify_ts"] = utils.get_current_timestamp()
db_creator = WeiboCreator(**creator)
session.add(db_creator)
await session.commit()
class WeiboJsonStoreImplement(AbstractStore):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.writer = AsyncFileWriter(platform="weibo", crawler_type=crawler_type_var.get())
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.writer.write_single_item_to_json(item_type="contents", item=content_item)
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.writer.write_single_item_to_json(item_type="comments", item=comment_item)
async def store_creator(self, creator: Dict):
"""
creator JSON storage implementation
Args:
creator:
Returns:
"""
await self.writer.write_single_item_to_json(item_type="creators", item=creator)
class WeiboSqliteStoreImplement(WeiboDbStoreImplement):
"""
Weibo SQLite store implement
"""
pass


@@ -1,326 +0,0 @@
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/1/14 21:35
# @Desc : 微博存储实现类
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
import config
from base.base_crawler import AbstractStore
from tools import utils, words
from var import crawler_type_var
def calculate_number_of_files(file_store_path: str) -> int:
"""计算数据保存文件的前部分排序数字,支持每次运行代码不写到同一个文件中
Args:
file_store_path:
Returns:
file nums
"""
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
except ValueError:
return 1
class WeiboCsvStoreImplement(AbstractStore):
csv_store_path: str = "data/weibo"
file_count: int = calculate_number_of_files(csv_store_path)
def make_save_file_name(self, store_type: str) -> str:
"""
make save file name by store type
Args:
store_type: contents or comments
Returns: eg: data/weibo/search_comments_20240114.csv ...
"""
return f"{self.csv_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.csv"
async def save_data_to_csv(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in CSV format.
Args:
save_item: save content dict info
store_type: Save type, contents | comments
Returns: no returns
"""
pathlib.Path(self.csv_store_path).mkdir(parents=True, exist_ok=True)
save_file_name = self.make_save_file_name(store_type=store_type)
async with aiofiles.open(save_file_name, mode='a+', encoding="utf-8-sig", newline="") as f:
writer = csv.writer(f)
if await f.tell() == 0:
await writer.writerow(save_item.keys())
await writer.writerow(save_item.values())
async def store_content(self, content_item: Dict):
"""
Weibo content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.save_data_to_csv(save_item=content_item, store_type="contents")
async def store_comment(self, comment_item: Dict):
"""
Weibo comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.save_data_to_csv(save_item=comment_item, store_type="comments")
async def store_creator(self, creator: Dict):
"""
Weibo creator CSV storage implementation
Args:
creator:
Returns:
"""
await self.save_data_to_csv(save_item=creator, store_type="creators")
class WeiboDbStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Weibo content DB storage implementation
Args:
content_item: content item dict
Returns:
"""
from .weibo_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
note_id = content_item.get("note_id")
note_detail: Dict = await query_content_by_content_id(content_id=note_id)
if not note_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(note_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Weibo comment DB storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .weibo_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Weibo creator DB storage implementation
Args:
creator:
Returns:
"""
from .weibo_store_sql import (add_new_creator,
query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)
class WeiboJsonStoreImplement(AbstractStore):
json_store_path: str = "data/weibo/json"
words_store_path: str = "data/weibo/words"
lock = asyncio.Lock()
file_count: int = calculate_number_of_files(json_store_path)
WordCloud = words.AsyncWordCloudGenerator()
def make_save_file_name(self, store_type: str) -> (str, str):
"""
make save file name by store type
Args:
store_type: Save type, contents | comments
Returns:
"""
return (
f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
)
async def save_data_to_json(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in json format.
Args:
save_item: save content dict info
store_type: Save type, contents | comments
Returns:
"""
pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
save_file_name, words_file_name_prefix = self.make_save_file_name(store_type=store_type)
save_data = []
async with self.lock:
if os.path.exists(save_file_name):
async with aiofiles.open(save_file_name, 'r', encoding='utf-8') as file:
save_data = json.loads(await file.read())
save_data.append(save_item)
async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
await file.write(json.dumps(save_data, ensure_ascii=False))
if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
try:
await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
except:
pass
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.save_data_to_json(content_item, "contents")
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.save_data_to_json(comment_item, "comments")
async def store_creator(self, creator: Dict):
"""
creator JSON storage implementation
Args:
creator:
Returns:
"""
await self.save_data_to_json(creator, "creators")
class WeiboSqliteStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Weibo content SQLite storage implementation
Args:
content_item: content item dict
Returns:
"""
from .weibo_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
note_id = content_item.get("note_id")
note_detail: Dict = await query_content_by_content_id(content_id=note_id)
if not note_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(note_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Weibo comment SQLite storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .weibo_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Weibo creator SQLite storage implementation
Args:
creator:
Returns:
"""
from .weibo_store_sql import (add_new_creator,
query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)


@@ -1,160 +0,0 @@
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/6 15:30
# @Desc : sql接口集合
from typing import Dict, List, Union
from async_db import AsyncMysqlDB
from async_sqlite_db import AsyncSqliteDB
from var import media_crawler_db_var
async def query_content_by_content_id(content_id: str) -> Dict:
"""
查询一条内容记录(xhs的帖子、抖音的视频、微博、快手视频 ...)
Args:
content_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from weibo_note where note_id = '{content_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_content(content_item: Dict) -> int:
"""
新增一条内容记录(xhs的帖子、抖音的视频、微博、快手视频 ...)
Args:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("weibo_note", content_item)
return last_row_id
async def update_content_by_content_id(content_id: str, content_item: Dict) -> int:
"""
更新一条记录(xhs的帖子、抖音的视频、微博、快手视频 ...)
Args:
content_id:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("weibo_note", content_item, "note_id", content_id)
return effect_row
async def query_comment_by_comment_id(comment_id: str) -> Dict:
"""
查询一条评论内容
Args:
comment_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from weibo_note_comment where comment_id = '{comment_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_comment(comment_item: Dict) -> int:
"""
新增一条评论记录
Args:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("weibo_note_comment", comment_item)
return last_row_id
async def update_comment_by_comment_id(comment_id: str, comment_item: Dict) -> int:
"""
更新一条评论记录
Args:
comment_id:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("weibo_note_comment", comment_item, "comment_id", comment_id)
return effect_row
async def query_creator_by_user_id(user_id: str) -> Dict:
"""
查询一条创作者记录
Args:
user_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from weibo_creator where user_id = '{user_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_creator(creator_item: Dict) -> int:
"""
新增一条创作者信息
Args:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("weibo_creator", creator_item)
return last_row_id
async def update_creator_by_user_id(user_id: str, creator_item: Dict) -> int:
"""
更新一条创作者信息
Args:
user_id:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("weibo_creator", creator_item, "user_id", user_id)
return effect_row


@@ -17,9 +17,8 @@ from typing import List
import config
from var import source_keyword_var
from . import xhs_store_impl
from .xhs_store_media import *
from .xhs_store_impl import *
from ._store_impl import *
class XhsStoreFactory:

store/xhs/_store_impl.py Normal file

@@ -0,0 +1,260 @@
# -*- coding: utf-8 -*-
# @Author : persist1@126.com
# @Time : 2025/9/5 19:34
# @Desc : Xiaohongshu (XHS) storage implementations
import json
import os
from datetime import datetime
from typing import List, Dict, Any
from sqlalchemy import select, update, delete
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import Session
from base.base_crawler import AbstractStore
from database.db_session import get_session
from database.models import XhsNote, XhsNoteComment, XhsCreator
from tools.async_file_writer import AsyncFileWriter
from tools.time_util import get_current_timestamp
from var import crawler_type_var
class XhsCsvStoreImplement(AbstractStore):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.writer = AsyncFileWriter(platform="xhs", crawler_type=crawler_type_var.get())
async def store_content(self, content_item: Dict):
"""
store content data to csv file
:param content_item:
:return:
"""
await self.writer.write_to_csv(item_type="contents", item=content_item)
async def store_comment(self, comment_item: Dict):
"""
store comment data to csv file
:param comment_item:
:return:
"""
await self.writer.write_to_csv(item_type="comments", item=comment_item)
async def store_creator(self, creator_item: Dict):
pass
def flush(self):
pass
class XhsJsonStoreImplement(AbstractStore):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.writer = AsyncFileWriter(platform="xhs", crawler_type=crawler_type_var.get())
async def store_content(self, content_item: Dict):
"""
store content data to json file
:param content_item:
:return:
"""
await self.writer.write_single_item_to_json(item_type="contents", item=content_item)
async def store_comment(self, comment_item: Dict):
"""
store comment data to json file
:param comment_item:
:return:
"""
await self.writer.write_single_item_to_json(item_type="comments", item=comment_item)
async def store_creator(self, creator_item: Dict):
pass
def flush(self):
"""
flush data to json file
:return:
"""
pass
class XhsDbStoreImplement(AbstractStore):
def __init__(self, **kwargs):
super().__init__(**kwargs)
async def store_content(self, content_item: Dict):
note_id = content_item.get("note_id")
if not note_id:
return
async with get_session() as session:
if await self.content_is_exist(session, note_id):
await self.update_content(session, content_item)
else:
await self.add_content(session, content_item)
async def add_content(self, session: AsyncSession, content_item: Dict):
add_ts = int(get_current_timestamp())
last_modify_ts = int(get_current_timestamp())
note = XhsNote(
user_id=content_item.get("user_id"),
nickname=content_item.get("nickname"),
avatar=content_item.get("avatar"),
ip_location=content_item.get("ip_location"),
add_ts=add_ts,
last_modify_ts=last_modify_ts,
note_id=content_item.get("note_id"),
type=content_item.get("type"),
title=content_item.get("title"),
desc=content_item.get("desc"),
video_url=content_item.get("video_url"),
time=content_item.get("time"),
last_update_time=content_item.get("last_update_time"),
liked_count=str(content_item.get("liked_count")),
collected_count=str(content_item.get("collected_count")),
comment_count=str(content_item.get("comment_count")),
share_count=str(content_item.get("share_count")),
image_list=json.dumps(content_item.get("image_list")),
tag_list=json.dumps(content_item.get("tag_list")),
note_url=content_item.get("note_url"),
source_keyword=content_item.get("source_keyword", ""),
xsec_token=content_item.get("xsec_token", "")
)
session.add(note)
async def update_content(self, session: AsyncSession, content_item: Dict):
note_id = content_item.get("note_id")
last_modify_ts = int(get_current_timestamp())
update_data = {
"last_modify_ts": last_modify_ts,
"liked_count": str(content_item.get("liked_count")),
"collected_count": str(content_item.get("collected_count")),
"comment_count": str(content_item.get("comment_count")),
"share_count": str(content_item.get("share_count")),
"last_update_time": content_item.get("last_update_time"),
}
stmt = update(XhsNote).where(XhsNote.note_id == note_id).values(**update_data)
await session.execute(stmt)
async def content_is_exist(self, session: AsyncSession, note_id: str) -> bool:
stmt = select(XhsNote).where(XhsNote.note_id == note_id)
result = await session.execute(stmt)
return result.first() is not None
async def store_comment(self, comment_item: Dict):
if not comment_item:
return
async with get_session() as session:
comment_id = comment_item.get("comment_id")
if not comment_id:
return
if await self.comment_is_exist(session, comment_id):
await self.update_comment(session, comment_item)
else:
await self.add_comment(session, comment_item)
async def add_comment(self, session: AsyncSession, comment_item: Dict):
add_ts = int(get_current_timestamp())
last_modify_ts = int(get_current_timestamp())
comment = XhsNoteComment(
user_id=comment_item.get("user_id"),
nickname=comment_item.get("nickname"),
avatar=comment_item.get("avatar"),
ip_location=comment_item.get("ip_location"),
add_ts=add_ts,
last_modify_ts=last_modify_ts,
comment_id=comment_item.get("comment_id"),
create_time=comment_item.get("create_time"),
note_id=comment_item.get("note_id"),
content=comment_item.get("content"),
sub_comment_count=comment_item.get("sub_comment_count"),
pictures=json.dumps(comment_item.get("pictures")),
parent_comment_id=comment_item.get("parent_comment_id"),
like_count=str(comment_item.get("like_count"))
)
session.add(comment)
async def update_comment(self, session: AsyncSession, comment_item: Dict):
comment_id = comment_item.get("comment_id")
last_modify_ts = int(get_current_timestamp())
update_data = {
"last_modify_ts": last_modify_ts,
"like_count": str(comment_item.get("like_count")),
"sub_comment_count": comment_item.get("sub_comment_count"),
}
stmt = update(XhsNoteComment).where(XhsNoteComment.comment_id == comment_id).values(**update_data)
await session.execute(stmt)
async def comment_is_exist(self, session: AsyncSession, comment_id: str) -> bool:
stmt = select(XhsNoteComment).where(XhsNoteComment.comment_id == comment_id)
result = await session.execute(stmt)
return result.first() is not None
async def store_creator(self, creator_item: Dict):
user_id = creator_item.get("user_id")
if not user_id:
return
async with get_session() as session:
if await self.creator_is_exist(session, user_id):
await self.update_creator(session, creator_item)
else:
await self.add_creator(session, creator_item)
async def add_creator(self, session: AsyncSession, creator_item: Dict):
add_ts = int(get_current_timestamp())
last_modify_ts = int(get_current_timestamp())
creator = XhsCreator(
user_id=creator_item.get("user_id"),
nickname=creator_item.get("nickname"),
avatar=creator_item.get("avatar"),
ip_location=creator_item.get("ip_location"),
add_ts=add_ts,
last_modify_ts=last_modify_ts,
desc=creator_item.get("desc"),
gender=creator_item.get("gender"),
follows=str(creator_item.get("follows")),
fans=str(creator_item.get("fans")),
interaction=str(creator_item.get("interaction")),
tag_list=json.dumps(creator_item.get("tag_list"))
)
session.add(creator)
async def update_creator(self, session: AsyncSession, creator_item: Dict):
user_id = creator_item.get("user_id")
last_modify_ts = int(get_current_timestamp())
update_data = {
"last_modify_ts": last_modify_ts,
"nickname": creator_item.get("nickname"),
"avatar": creator_item.get("avatar"),
"desc": creator_item.get("desc"),
"follows": str(creator_item.get("follows")),
"fans": str(creator_item.get("fans")),
"interaction": str(creator_item.get("interaction")),
"tag_list": json.dumps(creator_item.get("tag_list"))
}
stmt = update(XhsCreator).where(XhsCreator.user_id == user_id).values(**update_data)
await session.execute(stmt)
async def creator_is_exist(self, session: AsyncSession, user_id: str) -> bool:
stmt = select(XhsCreator).where(XhsCreator.user_id == user_id)
result = await session.execute(stmt)
return result.first() is not None
async def get_all_content(self) -> List[Dict]:
async with get_session() as session:
stmt = select(XhsNote)
result = await session.execute(stmt)
return [item.__dict__ for item in result.scalars().all()]
async def get_all_comments(self) -> List[Dict]:
async with get_session() as session:
stmt = select(XhsNoteComment)
result = await session.execute(stmt)
return [item.__dict__ for item in result.scalars().all()]
class XhsSqliteStoreImplement(XhsDbStoreImplement):
def __init__(self, **kwargs):
super().__init__(**kwargs)
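XhsDbStoreImplement follows the same check-then-upsert flow for every entity: look the record up by its natural key, update it if present, insert it otherwise. A minimal sketch of that flow with a plain dict standing in for the database session (`upsert_note` and `store` are illustrative names, not project API):

```python
# Dict-backed sketch of the check-then-upsert flow; `upsert_note` and
# `store` are illustrative names, not part of the project API.
def upsert_note(store: dict, content_item: dict) -> str:
    note_id = content_item.get("note_id")
    if not note_id:
        return "skipped"          # mirrors the early return on a missing key
    if note_id in store:
        store[note_id].update(content_item)   # update_content path
        return "updated"
    store[note_id] = dict(content_item)       # add_content path
    return "added"

store = {}
print(upsert_note(store, {"note_id": "n1", "title": "hello"}))     # added
print(upsert_note(store, {"note_id": "n1", "liked_count": "10"}))  # updated
print(upsert_note(store, {"title": "no id"}))                      # skipped
```

The real implementation does the existence check with a `select` inside `async with get_session()`, but the branching logic is the same.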


@@ -1,318 +0,0 @@
# Disclaimer: this code is for learning and research purposes only. Users must observe the following principles:
# 1. It must not be used for any commercial purpose.
# 2. Usage must comply with the target platform's terms of service and robots.txt rules.
# 3. No large-scale crawling or interference with the platform's operation.
# 4. Request rates must be reasonably throttled to avoid placing undue load on the target platform.
# 5. It must not be used for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the detailed license terms.
# By using this code you agree to the principles above and all terms of the LICENSE.
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/1/14 16:58
# @Desc : Xiaohongshu (XHS) storage implementations
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
import config
from base.base_crawler import AbstractStore
from tools import utils, words
from var import crawler_type_var
def calculate_number_of_files(file_store_path: str) -> int:
"""Compute the numeric prefix for saved data files, so each run writes to a new file instead of the same one.
Args:
file_store_path:
Returns:
file nums
"""
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
except ValueError:
return 1
class XhsCsvStoreImplement(AbstractStore):
csv_store_path: str = "data/xhs"
file_count:int=calculate_number_of_files(csv_store_path)
def make_save_file_name(self, store_type: str) -> str:
"""
make save file name by store type
Args:
store_type: contents or comments
Returns: eg: data/xhs/search_comments_20240114.csv ...
"""
return f"{self.csv_store_path}/{self.file_count}_{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.csv"
async def save_data_to_csv(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in CSV format.
Args:
save_item: save content dict info
store_type: save type, one of contents | comments
Returns: no returns
"""
pathlib.Path(self.csv_store_path).mkdir(parents=True, exist_ok=True)
save_file_name = self.make_save_file_name(store_type=store_type)
async with aiofiles.open(save_file_name, mode='a+', encoding="utf-8-sig", newline="") as f:
writer = csv.writer(f)
if await f.tell() == 0:
await writer.writerow(save_item.keys())
await writer.writerow(save_item.values())
async def store_content(self, content_item: Dict):
"""
Xiaohongshu content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.save_data_to_csv(save_item=content_item, store_type="contents")
async def store_comment(self, comment_item: Dict):
"""
Xiaohongshu comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.save_data_to_csv(save_item=comment_item, store_type="comments")
async def store_creator(self, creator: Dict):
"""
Xiaohongshu creator CSV storage implementation
Args:
creator: creator dict
Returns:
"""
await self.save_data_to_csv(save_item=creator, store_type="creator")
class XhsDbStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Xiaohongshu content DB storage implementation
Args:
content_item: content item dict
Returns:
"""
from .xhs_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
note_id = content_item.get("note_id")
note_detail: Dict = await query_content_by_content_id(content_id=note_id)
if not note_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(note_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Xiaohongshu comment DB storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .xhs_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Xiaohongshu creator DB storage implementation
Args:
creator: creator dict
Returns:
"""
from .xhs_store_sql import (add_new_creator, query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)
class XhsJsonStoreImplement(AbstractStore):
json_store_path: str = "data/xhs/json"
words_store_path: str = "data/xhs/words"
lock = asyncio.Lock()
file_count:int=calculate_number_of_files(json_store_path)
WordCloud = words.AsyncWordCloudGenerator()
def make_save_file_name(self, store_type: str) -> (str, str):
"""
make save file name by store type
Args:
store_type: save type, one of contents | comments
Returns:
"""
return (
f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
)
async def save_data_to_json(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in json format.
Args:
save_item: save content dict info
store_type: save type, one of contents | comments
Returns:
"""
pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
save_file_name,words_file_name_prefix = self.make_save_file_name(store_type=store_type)
save_data = []
async with self.lock:
if os.path.exists(save_file_name):
async with aiofiles.open(save_file_name, 'r', encoding='utf-8') as file:
save_data = json.loads(await file.read())
save_data.append(save_item)
async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
await file.write(json.dumps(save_data, ensure_ascii=False, indent=4))
if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
try:
await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
except Exception:
pass
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.save_data_to_json(content_item, "contents")
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.save_data_to_json(comment_item, "comments")
async def store_creator(self, creator: Dict):
"""
Xiaohongshu creator JSON storage implementation
Args:
creator: creator dict
Returns:
"""
await self.save_data_to_json(creator, "creator")
class XhsSqliteStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Xiaohongshu content SQLite storage implementation
Args:
content_item: content item dict
Returns:
"""
from .xhs_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
note_id = content_item.get("note_id")
note_detail: Dict = await query_content_by_content_id(content_id=note_id)
if not note_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(note_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Xiaohongshu comment SQLite storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .xhs_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Xiaohongshu creator SQLite storage implementation
Args:
creator: creator dict
Returns:
"""
from .xhs_store_sql import (add_new_creator, query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)


@@ -1,160 +0,0 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Time : 2024/4/6 15:30
# @Desc : Collection of SQL interfaces
from typing import Dict, List, Union
from async_db import AsyncMysqlDB
from async_sqlite_db import AsyncSqliteDB
from var import media_crawler_db_var
async def query_content_by_content_id(content_id: str) -> Dict:
"""
Query a single content record (an XHS note, a Douyin video, a Weibo post, a Kuaishou video, ...)
Args:
content_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from xhs_note where note_id = '{content_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_content(content_item: Dict) -> int:
"""
Insert a new content record (an XHS note, a Douyin video, a Weibo post, a Kuaishou video, ...)
Args:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("xhs_note", content_item)
return last_row_id
async def update_content_by_content_id(content_id: str, content_item: Dict) -> int:
"""
Update a content record (an XHS note, a Douyin video, a Weibo post, a Kuaishou video, ...)
Args:
content_id:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("xhs_note", content_item, "note_id", content_id)
return effect_row
async def query_comment_by_comment_id(comment_id: str) -> Dict:
"""
Query a single comment record
Args:
comment_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from xhs_note_comment where comment_id = '{comment_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_comment(comment_item: Dict) -> int:
"""
Insert a new comment record
Args:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("xhs_note_comment", comment_item)
return last_row_id
async def update_comment_by_comment_id(comment_id: str, comment_item: Dict) -> int:
"""
Update a comment record
Args:
comment_id:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("xhs_note_comment", comment_item, "comment_id", comment_id)
return effect_row
async def query_creator_by_user_id(user_id: str) -> Dict:
"""
Query a single creator record
Args:
user_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from xhs_creator where user_id = '{user_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_creator(creator_item: Dict) -> int:
"""
Insert a new creator record
Args:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("xhs_creator", creator_item)
return last_row_id
async def update_creator_by_user_id(user_id: str, creator_item: Dict) -> int:
"""
Update a creator record
Args:
user_id:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("xhs_creator", creator_item, "user_id", user_id)
return effect_row


@@ -15,7 +15,7 @@ from typing import List
import config
from base.base_crawler import AbstractStore
from model.m_zhihu import ZhihuComment, ZhihuContent, ZhihuCreator
from store.zhihu.zhihu_store_impl import (ZhihuCsvStoreImplement,
from ._store_impl import (ZhihuCsvStoreImplement,
ZhihuDbStoreImplement,
ZhihuJsonStoreImplement,
ZhihuSqliteStoreImplement)

store/zhihu/_store_impl.py Normal file

@@ -0,0 +1,191 @@
# -*- coding: utf-8 -*-
# @Author : persist1@126.com
# @Time : 2025/9/5 19:34
# @Desc : Zhihu storage implementations
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession
import config
from base.base_crawler import AbstractStore
from database.db_session import get_session
from database.models import ZhihuContent, ZhihuComment, ZhihuCreator
from tools import utils, words
from var import crawler_type_var
from tools.async_file_writer import AsyncFileWriter
def calculate_number_of_files(file_store_path: str) -> int:
"""Compute the numeric prefix for saved data files, so each run writes to a new file instead of the same one.
Args:
file_store_path:
Returns:
file nums
"""
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
except ValueError:
return 1
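The numbering scheme above can be exercised in isolation. A small sketch (re-declaring the function so it runs standalone) showing how existing file prefixes drive the next run's number:

```python
import os
import tempfile

# Standalone copy of calculate_number_of_files, for illustration only.
def calculate_number_of_files(file_store_path: str) -> int:
    if not os.path.exists(file_store_path):
        return 1
    try:
        return max(int(name.split("_")[0]) for name in os.listdir(file_store_path)) + 1
    except ValueError:  # empty directory, or no numeric prefixes
        return 1

with tempfile.TemporaryDirectory() as d:
    print(calculate_number_of_files(d))  # 1: empty directory starts at 1
    open(os.path.join(d, "1_search_contents_20240114.csv"), "w").close()
    open(os.path.join(d, "2_search_comments_20240114.csv"), "w").close()
    print(calculate_number_of_files(d))  # 3: max existing prefix + 1
```

Note that a non-numeric prefix anywhere in the directory raises `ValueError` inside `max`, which resets the counter to 1.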
class ZhihuCsvStoreImplement(AbstractStore):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.writer = AsyncFileWriter(platform="zhihu", crawler_type=crawler_type_var.get())
async def store_content(self, content_item: Dict):
"""
Zhihu content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.writer.write_to_csv(item_type="contents", item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Zhihu comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.writer.write_to_csv(item_type="comments", item=comment_item)
async def store_creator(self, creator: Dict):
"""
Zhihu creator CSV storage implementation
Args:
creator: creator dict
Returns:
"""
await self.writer.write_to_csv(item_type="creators", item=creator)
class ZhihuDbStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Zhihu content DB storage implementation
Args:
content_item: content item dict
"""
content_id = content_item.get("content_id")
async with get_session() as session:
stmt = select(ZhihuContent).where(ZhihuContent.content_id == content_id)
result = await session.execute(stmt)
existing_content = result.scalars().first()
if existing_content:
for key, value in content_item.items():
setattr(existing_content, key, value)
else:
new_content = ZhihuContent(**content_item)
session.add(new_content)
await session.commit()
async def store_comment(self, comment_item: Dict):
"""
Zhihu comment DB storage implementation
Args:
comment_item: comment item dict
"""
comment_id = comment_item.get("comment_id")
async with get_session() as session:
stmt = select(ZhihuComment).where(ZhihuComment.comment_id == comment_id)
result = await session.execute(stmt)
existing_comment = result.scalars().first()
if existing_comment:
for key, value in comment_item.items():
setattr(existing_comment, key, value)
else:
new_comment = ZhihuComment(**comment_item)
session.add(new_comment)
await session.commit()
async def store_creator(self, creator: Dict):
"""
Zhihu creator DB storage implementation
Args:
creator: creator dict
"""
user_id = creator.get("user_id")
async with get_session() as session:
stmt = select(ZhihuCreator).where(ZhihuCreator.user_id == user_id)
result = await session.execute(stmt)
existing_creator = result.scalars().first()
if existing_creator:
for key, value in creator.items():
setattr(existing_creator, key, value)
else:
new_creator = ZhihuCreator(**creator)
session.add(new_creator)
await session.commit()
class ZhihuJsonStoreImplement(AbstractStore):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.writer = AsyncFileWriter(platform="zhihu", crawler_type=crawler_type_var.get())
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.writer.write_single_item_to_json(item_type="contents", item=content_item)
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.writer.write_single_item_to_json(item_type="comments", item=comment_item)
async def store_creator(self, creator: Dict):
"""
Zhihu creator JSON storage implementation
Args:
creator: creator dict
Returns:
"""
await self.writer.write_single_item_to_json(item_type="creators", item=creator)
class ZhihuSqliteStoreImplement(ZhihuDbStoreImplement):
"""
Zhihu SQLite storage implementation (reuses the DB implementation)
"""
pass
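The Db implementation above merges an incoming dict onto an existing ORM row by setting each attribute in a loop. A minimal sketch of that merge step with a plain object in place of the SQLAlchemy model (`Row` and `merge` are illustrative names):

```python
# Plain-object sketch of the attribute merge ZhihuDbStoreImplement applies
# to an existing row; `Row` and `merge` are illustrative names.
class Row:
    def __init__(self, **kw):
        self.__dict__.update(kw)

def merge(existing: Row, item: dict) -> Row:
    for key, value in item.items():
        setattr(existing, key, value)  # same loop as the DB store's update branch
    return existing

row = Row(content_id="c1", title="old", like_count=1)
merge(row, {"title": "new", "like_count": 5})
print(row.title, row.like_count)  # new 5
```

With a real ORM instance the mutated attributes are flushed by the `session.commit()` that follows the loop.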


@@ -1,318 +0,0 @@
# -*- coding: utf-8 -*-
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict
import aiofiles
import config
from base.base_crawler import AbstractStore
from tools import utils, words
from var import crawler_type_var
def calculate_number_of_files(file_store_path: str) -> int:
"""Compute the numeric prefix for saved data files, so each run writes to a new file instead of the same one.
Args:
file_store_path:
Returns:
file nums
"""
if not os.path.exists(file_store_path):
return 1
try:
return max([int(file_name.split("_")[0]) for file_name in os.listdir(file_store_path)]) + 1
except ValueError:
return 1
class ZhihuCsvStoreImplement(AbstractStore):
csv_store_path: str = "data/zhihu"
file_count: int = calculate_number_of_files(csv_store_path)
def make_save_file_name(self, store_type: str) -> str:
"""
make save file name by store type
Args:
store_type: contents or comments
Returns: eg: data/zhihu/search_comments_20240114.csv ...
"""
return f"{self.csv_store_path}/{self.file_count}_{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.csv"
async def save_data_to_csv(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in CSV format.
Args:
save_item: save content dict info
store_type: save type, one of contents | comments
Returns: no returns
"""
pathlib.Path(self.csv_store_path).mkdir(parents=True, exist_ok=True)
save_file_name = self.make_save_file_name(store_type=store_type)
async with aiofiles.open(save_file_name, mode='a+', encoding="utf-8-sig", newline="") as f:
writer = csv.writer(f)
if await f.tell() == 0:
await writer.writerow(save_item.keys())
await writer.writerow(save_item.values())
async def store_content(self, content_item: Dict):
"""
Zhihu content CSV storage implementation
Args:
content_item: note item dict
Returns:
"""
await self.save_data_to_csv(save_item=content_item, store_type="contents")
async def store_comment(self, comment_item: Dict):
"""
Zhihu comment CSV storage implementation
Args:
comment_item: comment item dict
Returns:
"""
await self.save_data_to_csv(save_item=comment_item, store_type="comments")
async def store_creator(self, creator: Dict):
"""
Zhihu creator CSV storage implementation
Args:
creator: creator dict
Returns:
"""
await self.save_data_to_csv(save_item=creator, store_type="creator")
class ZhihuDbStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Zhihu content DB storage implementation
Args:
content_item: content item dict
Returns:
"""
from .zhihu_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
content_id = content_item.get("content_id")
content_detail: Dict = await query_content_by_content_id(content_id=content_id)
if not content_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(content_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Zhihu comment DB storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .zhihu_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Zhihu creator DB storage implementation
Args:
creator: creator dict
Returns:
"""
from .zhihu_store_sql import (add_new_creator,
query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)
class ZhihuJsonStoreImplement(AbstractStore):
json_store_path: str = "data/zhihu/json"
words_store_path: str = "data/zhihu/words"
lock = asyncio.Lock()
file_count: int = calculate_number_of_files(json_store_path)
WordCloud = words.AsyncWordCloudGenerator()
def make_save_file_name(self, store_type: str) -> (str, str):
"""
make save file name by store type
Args:
store_type: save type, one of contents | comments
Returns:
"""
return (
f"{self.json_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}.json",
f"{self.words_store_path}/{crawler_type_var.get()}_{store_type}_{utils.get_current_date()}"
)
async def save_data_to_json(self, save_item: Dict, store_type: str):
"""
Below is a simple way to save it in json format.
Args:
save_item: save content dict info
store_type: save type, one of contents | comments
Returns:
"""
pathlib.Path(self.json_store_path).mkdir(parents=True, exist_ok=True)
pathlib.Path(self.words_store_path).mkdir(parents=True, exist_ok=True)
save_file_name, words_file_name_prefix = self.make_save_file_name(store_type=store_type)
save_data = []
async with self.lock:
if os.path.exists(save_file_name):
async with aiofiles.open(save_file_name, 'r', encoding='utf-8') as file:
save_data = json.loads(await file.read())
save_data.append(save_item)
async with aiofiles.open(save_file_name, 'w', encoding='utf-8') as file:
await file.write(json.dumps(save_data, ensure_ascii=False, indent=4))
if config.ENABLE_GET_COMMENTS and config.ENABLE_GET_WORDCLOUD:
try:
await self.WordCloud.generate_word_frequency_and_cloud(save_data, words_file_name_prefix)
except Exception:
pass
async def store_content(self, content_item: Dict):
"""
content JSON storage implementation
Args:
content_item:
Returns:
"""
await self.save_data_to_json(content_item, "contents")
async def store_comment(self, comment_item: Dict):
"""
comment JSON storage implementation
Args:
comment_item:
Returns:
"""
await self.save_data_to_json(comment_item, "comments")
async def store_creator(self, creator: Dict):
"""
Zhihu creator JSON storage implementation
Args:
creator: creator dict
Returns:
"""
await self.save_data_to_json(creator, "creator")
class ZhihuSqliteStoreImplement(AbstractStore):
async def store_content(self, content_item: Dict):
"""
Zhihu content SQLite storage implementation
Args:
content_item: content item dict
Returns:
"""
from .zhihu_store_sql import (add_new_content,
query_content_by_content_id,
update_content_by_content_id)
content_id = content_item.get("content_id")
content_detail: Dict = await query_content_by_content_id(content_id=content_id)
if not content_detail:
content_item["add_ts"] = utils.get_current_timestamp()
await add_new_content(content_item)
else:
await update_content_by_content_id(content_id, content_item=content_item)
async def store_comment(self, comment_item: Dict):
"""
Zhihu comment SQLite storage implementation
Args:
comment_item: comment item dict
Returns:
"""
from .zhihu_store_sql import (add_new_comment,
query_comment_by_comment_id,
update_comment_by_comment_id)
comment_id = comment_item.get("comment_id")
comment_detail: Dict = await query_comment_by_comment_id(comment_id=comment_id)
if not comment_detail:
comment_item["add_ts"] = utils.get_current_timestamp()
await add_new_comment(comment_item)
else:
await update_comment_by_comment_id(comment_id, comment_item=comment_item)
async def store_creator(self, creator: Dict):
"""
Zhihu creator SQLite storage implementation
Args:
creator: creator dict
Returns:
"""
from .zhihu_store_sql import (add_new_creator,
query_creator_by_user_id,
update_creator_by_user_id)
user_id = creator.get("user_id")
user_detail: Dict = await query_creator_by_user_id(user_id)
if not user_detail:
creator["add_ts"] = utils.get_current_timestamp()
await add_new_creator(creator)
else:
await update_creator_by_user_id(user_id, creator)
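The JSON store above relies on a lock-guarded read-append-rewrite cycle over a single JSON array. A minimal sketch of the same pattern (file name and item shape are illustrative; plain synchronous file I/O stands in for aiofiles):

```python
import asyncio
import json
import os
import tempfile

async def append_json(path: str, item: dict, lock: asyncio.Lock) -> None:
    # Read the existing array (if any), append, and rewrite; the shared
    # lock serializes concurrent tasks so no append is lost.
    async with lock:
        data = []
        if os.path.exists(path) and os.path.getsize(path) > 0:
            with open(path, "r", encoding="utf-8") as f:
                data = json.load(f)
        data.append(item)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False, indent=4)

async def main() -> list:
    lock = asyncio.Lock()
    path = os.path.join(tempfile.mkdtemp(), "contents.json")
    await asyncio.gather(*(append_json(path, {"id": i}, lock) for i in range(5)))
    with open(path, encoding="utf-8") as f:
        return json.load(f)

result = asyncio.run(main())
print(len(result))  # all 5 concurrent appends survive
```

Without the lock, two tasks could read the same array and the later rewrite would drop one item; the rewrite-whole-file approach is simple but O(n) per append, which is acceptable for crawl-sized result sets.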

View File

@@ -1,156 +0,0 @@
# Disclaimer: this code is for learning and research purposes only. Users must follow these principles:
# 1. Do not use it for any commercial purpose.
# 2. Comply with the target platform's terms of service and robots.txt rules.
# 3. Do not crawl at large scale or disrupt the platform's operations.
# 4. Keep request rates reasonable to avoid placing unnecessary load on the target platform.
# 5. Do not use it for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# Using this code signifies your agreement to the principles above and to all terms in LICENSE.
# -*- coding: utf-8 -*-
from typing import Dict, List, Union
from async_db import AsyncMysqlDB
from async_sqlite_db import AsyncSqliteDB
from var import media_crawler_db_var
async def query_content_by_content_id(content_id: str) -> Dict:
"""
Query a single content record (Zhihu post, Douyin video, Weibo post, Kuaishou video, ...)
Args:
content_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
# NOTE: content_id is interpolated directly into the SQL string; callers must pass trusted IDs
sql: str = f"select * from zhihu_content where content_id = '{content_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_content(content_item: Dict) -> int:
"""
Insert a new content record (Zhihu post, Douyin video, Weibo post, Kuaishou video, ...)
Args:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("zhihu_content", content_item)
return last_row_id
async def update_content_by_content_id(content_id: str, content_item: Dict) -> int:
"""
Update a content record (Zhihu post, Douyin video, Weibo post, Kuaishou video, ...)
Args:
content_id:
content_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("zhihu_content", content_item, "content_id", content_id)
return effect_row
async def query_comment_by_comment_id(comment_id: str) -> Dict:
"""
Query a single comment record
Args:
comment_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from zhihu_comment where comment_id = '{comment_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_comment(comment_item: Dict) -> int:
"""
Insert a new comment record
Args:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("zhihu_comment", comment_item)
return last_row_id
async def update_comment_by_comment_id(comment_id: str, comment_item: Dict) -> int:
"""
Update a comment record
Args:
comment_id:
comment_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("zhihu_comment", comment_item, "comment_id", comment_id)
return effect_row
async def query_creator_by_user_id(user_id: str) -> Dict:
"""
Query a single creator record
Args:
user_id:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
sql: str = f"select * from zhihu_creator where user_id = '{user_id}'"
rows: List[Dict] = await async_db_conn.query(sql)
if len(rows) > 0:
return rows[0]
return dict()
async def add_new_creator(creator_item: Dict) -> int:
"""
Insert a new creator record
Args:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
last_row_id: int = await async_db_conn.item_to_table("zhihu_creator", creator_item)
return last_row_id
async def update_creator_by_user_id(user_id: str, creator_item: Dict) -> int:
"""
Update a creator record
Args:
user_id:
creator_item:
Returns:
"""
async_db_conn: Union[AsyncMysqlDB, AsyncSqliteDB] = media_crawler_db_var.get()
effect_row: int = await async_db_conn.update_table("zhihu_creator", creator_item, "user_id", user_id)
return effect_row
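The store_* methods built on these helpers all follow the same query-then-insert-or-update flow. A sketch of that flow against an in-memory SQLite database (columns reduced for illustration; note the `?` parameter binding, which sidesteps the quoting concerns of f-string SQL):

```python
import sqlite3

def upsert_content(conn: sqlite3.Connection, item: dict) -> None:
    # Query first; insert when the row is missing, otherwise update.
    row = conn.execute(
        "SELECT content_id FROM zhihu_content WHERE content_id = ?",
        (item["content_id"],),
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO zhihu_content (content_id, title) VALUES (?, ?)",
            (item["content_id"], item["title"]),
        )
    else:
        conn.execute(
            "UPDATE zhihu_content SET title = ? WHERE content_id = ?",
            (item["title"], item["content_id"]),
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE zhihu_content (content_id TEXT PRIMARY KEY, title TEXT)")
upsert_content(conn, {"content_id": "c1", "title": "first"})
upsert_content(conn, {"content_id": "c1", "title": "updated"})
title = conn.execute(
    "SELECT title FROM zhihu_content WHERE content_id = 'c1'"
).fetchone()[0]
print(title)  # updated
```

This two-step upsert is not atomic under concurrent writers; for the single-writer crawler pipeline here that is an acceptable simplification.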

233
test/test_db_sync.py Normal file
View File

@@ -0,0 +1,233 @@
# -*- coding: utf-8 -*-
# @Author : persist-1<persist1@126.com>
# @Time : 2025/9/8 00:02
# @Desc : Compares the ORM models in database/models.py with the actual structure of both databases and updates them (connect to DB -> compare structures -> diff report -> interactive sync)
# @Tips : This script requires the extra dependency 'pymysql==1.1.0'
import os
import sys
from sqlalchemy import create_engine, inspect as sqlalchemy_inspect
from sqlalchemy.schema import MetaData
# Add the project root to sys.path
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from config.db_config import mysql_db_config, sqlite_db_config
from database.models import Base
def get_mysql_engine():
"""Create and return a MySQL database engine"""
conn_str = f"mysql+pymysql://{mysql_db_config['user']}:{mysql_db_config['password']}@{mysql_db_config['host']}:{mysql_db_config['port']}/{mysql_db_config['db_name']}"
return create_engine(conn_str)
def get_sqlite_engine():
"""Create and return a SQLite database engine"""
conn_str = f"sqlite:///{sqlite_db_config['db_path']}"
return create_engine(conn_str)
def get_db_schema(engine):
"""Get the database's current table structure"""
inspector = sqlalchemy_inspect(engine)
schema = {}
for table_name in inspector.get_table_names():
columns = {}
for column in inspector.get_columns(table_name):
columns[column['name']] = str(column['type'])
schema[table_name] = columns
return schema
def get_orm_schema():
"""Get the table structure defined by the ORM models"""
schema = {}
for table_name, table in Base.metadata.tables.items():
columns = {}
for column in table.columns:
columns[column.name] = str(column.type)
schema[table_name] = columns
return schema
def compare_schemas(db_schema, orm_schema):
"""Compare the database schema with the ORM schema and return the differences"""
db_tables = set(db_schema.keys())
orm_tables = set(orm_schema.keys())
added_tables = orm_tables - db_tables
deleted_tables = db_tables - orm_tables
common_tables = db_tables.intersection(orm_tables)
changed_tables = {}
for table in common_tables:
db_cols = set(db_schema[table].keys())
orm_cols = set(orm_schema[table].keys())
added_cols = orm_cols - db_cols
deleted_cols = db_cols - orm_cols
modified_cols = {}
for col in db_cols.intersection(orm_cols):
if db_schema[table][col] != orm_schema[table][col]:
modified_cols[col] = (db_schema[table][col], orm_schema[table][col])
if added_cols or deleted_cols or modified_cols:
changed_tables[table] = {
"added": list(added_cols),
"deleted": list(deleted_cols),
"modified": modified_cols
}
return {
"added_tables": list(added_tables),
"deleted_tables": list(deleted_tables),
"changed_tables": changed_tables
}
def print_diff(db_name, diff):
"""Print the diff report"""
print(f"--- {db_name} 数据库结构差异报告 ---")
if not any(diff.values()):
print("数据库结构与ORM模型一致无需同步。")
return
if diff.get("added_tables"):
print("\n[+] 新增的表:")
for table in diff["added_tables"]:
print(f" - {table}")
if diff.get("deleted_tables"):
print("\n[-] 删除的表:")
for table in diff["deleted_tables"]:
print(f" - {table}")
if diff.get("changed_tables"):
print("\n[*] 变动的表:")
for table, changes in diff["changed_tables"].items():
print(f" - {table}:")
if changes.get("added"):
print(" [+] 新增字段:", ", ".join(changes["added"]))
if changes.get("deleted"):
print(" [-] 删除字段:", ", ".join(changes["deleted"]))
if changes.get("modified"):
print(" [*] 修改字段:")
for col, types in changes["modified"].items():
print(f" - {col}: {types[0]} -> {types[1]}")
print("--- 报告结束 ---")
def sync_database(engine, diff):
"""Sync the ORM models to the database"""
metadata = Base.metadata
# Alembic migration context setup
from alembic.migration import MigrationContext
from alembic.operations import Operations
conn = engine.connect()
ctx = MigrationContext.configure(conn)
op = Operations(ctx)
# Drop removed tables
for table_name in diff['deleted_tables']:
op.drop_table(table_name)
print(f"已删除表: {table_name}")
# Create added tables
for table_name in diff['added_tables']:
table = metadata.tables.get(table_name)
if table is not None:
table.create(engine)
print(f"已创建表: {table_name}")
# Apply column changes
for table_name, changes in diff['changed_tables'].items():
# Drop removed columns
for col_name in changes['deleted']:
op.drop_column(table_name, col_name)
print(f"在表 {table_name} 中已删除字段: {col_name}")
# Add new columns
for col_name in changes['added']:
table = metadata.tables.get(table_name)
column = table.columns.get(col_name)
if column is not None:
op.add_column(table_name, column)
print(f"在表 {table_name} 中已新增字段: {col_name}")
# Alter modified columns
for col_name, types in changes['modified'].items():
table = metadata.tables.get(table_name)
if table is not None:
column = table.columns.get(col_name)
if column is not None:
op.alter_column(table_name, col_name, type_=column.type)
print(f"在表 {table_name} 中已修改字段: {col_name} (类型变为 {column.type})")
def main():
"""Entry point"""
orm_schema = get_orm_schema()
# Handle MySQL
try:
mysql_engine = get_mysql_engine()
mysql_schema = get_db_schema(mysql_engine)
mysql_diff = compare_schemas(mysql_schema, orm_schema)
print_diff("MySQL", mysql_diff)
if any(mysql_diff.values()):
choice = input(">>> 需要人工确认是否要将ORM模型同步到MySQL数据库? (y/N): ")
if choice.lower() == 'y':
sync_database(mysql_engine, mysql_diff)
print("MySQL数据库同步完成。")
except Exception as e:
print(f"处理MySQL时出错: {e}")
# Handle SQLite
try:
sqlite_engine = get_sqlite_engine()
sqlite_schema = get_db_schema(sqlite_engine)
sqlite_diff = compare_schemas(sqlite_schema, orm_schema)
print_diff("SQLite", sqlite_diff)
if any(sqlite_diff.values()):
choice = input(">>> 需要人工确认是否要将ORM模型同步到SQLite数据库? (y/N): ")
if choice.lower() == 'y':
# Note: SQLite does not support ALTER COLUMN to change a column's type; handling here is simplified
print("警告SQLite的字段修改支持有限此脚本不会执行修改字段类型的操作。")
sync_database(sqlite_engine, sqlite_diff)
print("SQLite数据库同步完成。")
except Exception as e:
print(f"处理SQLite时出错: {e}")
if __name__ == "__main__":
main()
######################### Example output #########################
# [*] 变动的表:
# - kuaishou_video:
# [*] 修改字段:
# - user_id: TEXT -> VARCHAR(64)
# - xhs_note_comment:
# [*] 修改字段:
# - comment_id: BIGINT -> VARCHAR(255)
# - zhihu_content:
# [*] 修改字段:
# - created_time: BIGINT -> VARCHAR(32)
# - content_id: BIGINT -> VARCHAR(64)
# - zhihu_creator:
# [*] 修改字段:
# - user_id: INTEGER -> VARCHAR(64)
# - tieba_note:
# [*] 修改字段:
# - publish_time: BIGINT -> VARCHAR(255)
# - tieba_id: INTEGER -> VARCHAR(255)
# - note_id: BIGINT -> VARCHAR(644)
# --- 报告结束 ---
# >>> 需要人工确认是否要将ORM模型同步到MySQL数据库? (y/N): y
# 在表 kuaishou_video 中已修改字段: user_id (类型变为 VARCHAR(64))
# 在表 xhs_note_comment 中已修改字段: comment_id (类型变为 VARCHAR(255))
# 在表 zhihu_content 中已修改字段: created_time (类型变为 VARCHAR(32))
# 在表 zhihu_content 中已修改字段: content_id (类型变为 VARCHAR(64))
# 在表 zhihu_creator 中已修改字段: user_id (类型变为 VARCHAR(64))
# 在表 tieba_note 中已修改字段: publish_time (类型变为 VARCHAR(255))
# 在表 tieba_note 中已修改字段: tieba_id (类型变为 VARCHAR(255))
# 在表 tieba_note 中已修改字段: note_id (类型变为 VARCHAR(644))
# MySQL数据库同步完成。
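compare_schemas is plain set arithmetic over `{table: {column: type}}` mappings, and the core logic can be exercised standalone. A sketch with illustrative sample schemas (the type mismatches mirror the example report above):

```python
def diff_schemas(db_schema: dict, orm_schema: dict) -> dict:
    # Same set arithmetic as compare_schemas in test/test_db_sync.py.
    added = sorted(set(orm_schema) - set(db_schema))
    deleted = sorted(set(db_schema) - set(orm_schema))
    changed = {}
    for table in set(db_schema) & set(orm_schema):
        db_cols, orm_cols = db_schema[table], orm_schema[table]
        modified = {
            col: (db_cols[col], orm_cols[col])
            for col in set(db_cols) & set(orm_cols)
            if db_cols[col] != orm_cols[col]
        }
        added_cols = sorted(set(orm_cols) - set(db_cols))
        deleted_cols = sorted(set(db_cols) - set(orm_cols))
        if added_cols or deleted_cols or modified:
            changed[table] = {
                "added": added_cols,
                "deleted": deleted_cols,
                "modified": modified,
            }
    return {
        "added_tables": added,
        "deleted_tables": deleted,
        "changed_tables": changed,
    }

db = {"tieba_note": {"note_id": "BIGINT", "title": "TEXT"}}
orm = {
    "tieba_note": {"note_id": "VARCHAR(64)", "title": "TEXT"},
    "zhihu_creator": {"user_id": "VARCHAR(64)"},
}
diff = diff_schemas(db, orm)
print(diff["added_tables"])                               # ['zhihu_creator']
print(diff["changed_tables"]["tieba_note"]["modified"])   # {'note_id': ('BIGINT', 'VARCHAR(64)')}
```

Comparing `str(column.type)` is a coarse equality check: dialects render the same logical type differently (e.g. TEXT vs VARCHAR), so a "modified" entry may be a rendering difference rather than a real schema drift, which is why the script asks for confirmation before syncing.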

View File

@@ -0,0 +1,50 @@
import asyncio
import csv
import json
import os
import pathlib
from typing import Dict, List
import aiofiles
from tools.utils import utils
class AsyncFileWriter:
def __init__(self, platform: str, crawler_type: str):
self.lock = asyncio.Lock()
self.platform = platform
self.crawler_type = crawler_type
def _get_file_path(self, file_type: str, item_type: str) -> str:
base_path = f"data/{self.platform}/{file_type}"
pathlib.Path(base_path).mkdir(parents=True, exist_ok=True)
file_name = f"{self.crawler_type}_{item_type}_{utils.get_current_date()}.{file_type}"
return f"{base_path}/{file_name}"
async def write_to_csv(self, item: Dict, item_type: str):
file_path = self._get_file_path('csv', item_type)
async with self.lock:
file_exists = os.path.exists(file_path)
async with aiofiles.open(file_path, 'a', newline='', encoding='utf-8-sig') as f:
writer = csv.DictWriter(f, fieldnames=item.keys())
if not file_exists or await f.tell() == 0:
await writer.writeheader()
await writer.writerow(item)
async def write_single_item_to_json(self, item: Dict, item_type: str):
file_path = self._get_file_path('json', item_type)
async with self.lock:
existing_data = []
if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
async with aiofiles.open(file_path, 'r', encoding='utf-8') as f:
try:
content = await f.read()
if content:
existing_data = json.loads(content)
if not isinstance(existing_data, list):
existing_data = [existing_data]
except json.JSONDecodeError:
existing_data = []
existing_data.append(item)
async with aiofiles.open(file_path, 'w', encoding='utf-8') as f:
await f.write(json.dumps(existing_data, ensure_ascii=False, indent=4))
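write_to_csv appends rows and emits the header only when the file is new or empty. The same idea in a synchronous sketch (path and field names are illustrative; plain utf-8 is used here, since reopening an append-mode stream with utf-8-sig can emit an extra BOM per open):

```python
import csv
import os
import tempfile

def append_row(path: str, row: dict) -> None:
    # Emit the header only when the file is new or still empty,
    # mirroring the file_exists / f.tell() check in write_to_csv.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=row.keys())
        if write_header:
            writer.writeheader()
        writer.writerow(row)

path = os.path.join(tempfile.mkdtemp(), "search_contents_2025-10-18.csv")
append_row(path, {"id": "1", "title": "a"})
append_row(path, {"id": "2", "title": "b"})
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print(len(rows))  # 2 data rows under a single header
```

One caveat the original shares: DictWriter takes its fieldnames from the first row written, so later items with extra keys would raise ValueError unless the schema is stable per item_type.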

View File

@@ -14,6 +14,7 @@ import platform
import subprocess
import time
import socket
import signal
from typing import Optional, List, Tuple
import asyncio
from pathlib import Path
@@ -106,7 +107,7 @@ class BrowserLauncher:
raise RuntimeError(f"无法找到可用的端口,已尝试 {start_port}{port-1}")
def launch_browser(self, browser_path: str, debug_port: int, headless: bool = False,
user_data_dir: Optional[str] = None) -> subprocess.Popen:
"""
启动浏览器进程
@@ -169,7 +170,8 @@ class BrowserLauncher:
stderr=subprocess.DEVNULL,
preexec_fn=os.setsid # create a new process group
)
self.browser_process = process
return process
except Exception as e:
@@ -230,20 +232,48 @@ class BrowserLauncher:
"""
Clean up resources and shut down the browser process
"""
if self.browser_process:
try:
utils.logger.info("[BrowserLauncher] 正在关闭浏览器进程...")
if self.system == "Windows":
# On Windows, use taskkill to force-terminate the process tree
subprocess.run(["taskkill", "/F", "/T", "/PID", str(self.browser_process.pid)],
capture_output=True)
if not self.browser_process:
return
process = self.browser_process
if process.poll() is not None:
utils.logger.info("[BrowserLauncher] 浏览器进程已退出,无需清理")
self.browser_process = None
return
utils.logger.info("[BrowserLauncher] 正在关闭浏览器进程...")
try:
if self.system == "Windows":
# Try graceful termination first
process.terminate()
try:
process.wait(timeout=5)
except subprocess.TimeoutExpired:
utils.logger.warning("[BrowserLauncher] 正常终止超时使用taskkill强制结束")
subprocess.run(
["taskkill", "/F", "/T", "/PID", str(process.pid)],
capture_output=True,
check=False,
)
process.wait(timeout=5)
else:
pgid = os.getpgid(process.pid)
try:
os.killpg(pgid, signal.SIGTERM)
except ProcessLookupError:
utils.logger.info("[BrowserLauncher] 浏览器进程组不存在,可能已退出")
else:
# On Unix, terminate the process group
os.killpg(os.getpgid(self.browser_process.pid), 9)
self.browser_process = None
utils.logger.info("[BrowserLauncher] 浏览器进程已关闭")
except Exception as e:
utils.logger.warning(f"[BrowserLauncher] 关闭浏览器进程时出错: {e}")
try:
process.wait(timeout=5)
except subprocess.TimeoutExpired:
utils.logger.warning("[BrowserLauncher] 优雅关闭超时发送SIGKILL")
os.killpg(pgid, signal.SIGKILL)
process.wait(timeout=5)
utils.logger.info("[BrowserLauncher] 浏览器进程已关闭")
except Exception as e:
utils.logger.warning(f"[BrowserLauncher] 关闭浏览器进程时出错: {e}")
finally:
self.browser_process = None
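The hardened cleanup escalates from SIGTERM on the whole process group to SIGKILL after a wait timeout. A POSIX-only sketch of that escalation using a throwaway child (assumes a Unix `sleep` binary; `start_new_session=True` plays the role of `preexec_fn=os.setsid`):

```python
import os
import signal
import subprocess

def stop_process_group(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    # Ask the whole process group to exit, escalating to SIGKILL on timeout.
    pgid = os.getpgid(proc.pid)
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        # Group already gone; just reap the child.
        return proc.wait()
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        return proc.wait()

proc = subprocess.Popen(["sleep", "60"], start_new_session=True)
code = stop_process_group(proc)
print(code)  # negative signal number, e.g. -15 for SIGTERM
```

Signaling the group rather than the single pid matters for browsers, which fork renderer and GPU helper processes that would otherwise be orphaned.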

View File

@@ -291,16 +291,28 @@ class CDPBrowserManager:
"""
try:
# Close the browser context
# if self.browser_context:
# await self.browser_context.close()
# self.browser_context = None
# utils.logger.info("[CDPBrowserManager] 浏览器上下文已关闭")
if self.browser_context:
try:
await self.browser_context.close()
utils.logger.info("[CDPBrowserManager] 浏览器上下文已关闭")
except Exception as context_error:
utils.logger.warning(
f"[CDPBrowserManager] 关闭浏览器上下文失败: {context_error}"
)
finally:
self.browser_context = None
# # Disconnect from the browser
# if self.browser:
# await self.browser.close()
# self.browser = None
# utils.logger.info("[CDPBrowserManager] 浏览器连接已断开")
# Disconnect from the browser
if self.browser:
try:
await self.browser.close()
utils.logger.info("[CDPBrowserManager] 浏览器连接已断开")
except Exception as browser_error:
utils.logger.warning(
f"[CDPBrowserManager] 关闭浏览器连接失败: {browser_error}"
)
finally:
self.browser = None
# Close the browser process (if configured to close automatically)
if config.AUTO_CLOSE_BROWSER:
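The cleanup rewrite above applies one pattern throughout: close in `try`, log on failure, and always clear the reference in `finally`. Reduced to its essentials (class names are illustrative, not the project's real types):

```python
import asyncio

class FlakyResource:
    async def close(self) -> None:
        raise RuntimeError("already disconnected")

class Manager:
    def __init__(self) -> None:
        self.browser_context = FlakyResource()

    async def cleanup(self) -> None:
        if self.browser_context:
            try:
                await self.browser_context.close()
            except Exception as err:
                # Log and keep going; cleanup must never raise.
                print(f"close failed: {err}")
            finally:
                # Never leave a stale handle behind.
                self.browser_context = None

m = Manager()
asyncio.run(m.cleanup())
print(m.browser_context)  # None
```

The `finally` is the key change versus the commented-out code it replaces: even when `close()` fails (e.g. the CDP connection already dropped), the manager no longer holds a dead handle that a later call might try to reuse.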

View File

@@ -33,6 +33,12 @@ def get_current_time() -> str:
"""
return time.strftime('%Y-%m-%d %X', time.localtime())
def get_current_time_hour() -> str:
"""
Get the current time as a string, e.g. '2023-12-02-13'
:return:
"""
return time.strftime('%Y-%m-%d-%H', time.localtime())
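The new helper's format string can be checked directly; `%Y-%m-%d-%H` yields an hour-granular stamp suitable for hourly file names:

```python
import re
import time

# Same format string as get_current_time_hour()
stamp = time.strftime('%Y-%m-%d-%H', time.localtime())
print(stamp)  # e.g. 2025-10-18-07
```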
def get_current_date() -> str:
"""

1022
uv.lock generated
View File

File diff suppressed because it is too large

3
var.py
View File

@@ -15,11 +15,8 @@ from typing import List
import aiomysql
from async_db import AsyncMysqlDB
request_keyword_var: ContextVar[str] = ContextVar("request_keyword", default="")
crawler_type_var: ContextVar[str] = ContextVar("crawler_type", default="")
comment_tasks_var: ContextVar[List[Task]] = ContextVar("comment_tasks", default=[])
media_crawler_db_var: ContextVar[AsyncMysqlDB] = ContextVar("media_crawler_db_var")
db_conn_pool_var: ContextVar[aiomysql.Pool] = ContextVar("db_conn_pool_var")
source_keyword_var: ContextVar[str] = ContextVar("source_keyword", default="")
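The ContextVars in var.py give each asyncio task its own view of crawler state. A minimal sketch of crawler_type_var-style usage (worker names are illustrative): asyncio copies the context when it wraps each coroutine in a Task, so a `set()` in one task never leaks into another.

```python
import asyncio
from contextvars import ContextVar

crawler_type_var: ContextVar[str] = ContextVar("crawler_type", default="")

async def worker(kind: str) -> str:
    # Each Task runs in a copy of the context, so this set() is task-local.
    crawler_type_var.set(kind)
    await asyncio.sleep(0)  # yield; other tasks run in between
    return crawler_type_var.get()

async def main() -> list:
    return list(await asyncio.gather(worker("search"), worker("detail")))

results = asyncio.run(main())
print(results)  # ['search', 'detail']
```

This is why the store code can call `crawler_type_var.get()` deep inside file-name helpers without threading the crawl type through every function signature.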