108 Commits

Author SHA1 Message Date
程序员阿江(Relakkes)
0282e626c9 feat: 新增 JSONL 存储格式支持,默认存储格式改为 jsonl
JSONL(JSON Lines)每行一个 JSON 对象,采用 append 模式写入,
无需读取已有数据,大数据量下性能远优于 JSON 格式。

- 新增 AsyncFileWriter.write_to_jsonl() 核心方法
- 7 个平台新增 JsonlStoreImplement 类并注册到工厂
- 配置默认值从 json 改为 jsonl,CLI/API 枚举同步更新
- db_session.py 守卫条件加入 jsonl,避免误触 ValueError
- 词云生成支持读取 JSONL 文件,优先 jsonl 回退 json
- 原有 json 选项完全保留,向后兼容
- 更新相关文档和测试
2026-03-03 23:31:07 +08:00
程序员阿江(Relakkes)
d614ccf247 docs: translate comments and metadata to English
Update Chinese comments, variable descriptions, and metadata across
multiple configuration and core files to English. This improves
codebase accessibility for international developers. Additionally,
removed the sponsorship section from README files.
2026-02-12 05:30:11 +08:00
ouzhuowei
7484156f02 新增数据保存路径,默认不指定则保存到data文件夹下
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-03 11:24:22 +08:00
Doiiars
70a6ca55bb feat(database): add PostgreSQL support and fix Windows subprocess encoding 2026-01-09 00:41:59 +08:00
hsparks.codes
46ef86ddef feat: Add Excel export functionality and unit tests
Features:
- Excel export with formatted multi-sheet workbooks (Contents, Comments, Creators)
- Professional styling: blue headers, auto-width columns, borders, text wrapping
- Smart export: empty sheets automatically removed
- Support for all platforms (xhs, dy, ks, bili, wb, tieba, zhihu)

Testing:
- Added pytest framework with asyncio support
- Unit tests for Excel store functionality
- Unit tests for store factory pattern
- Shared fixtures for test data
- Test coverage for edge cases

Documentation:
- Comprehensive Excel export guide (docs/excel_export_guide.md)
- Updated README.md and README_en.md with Excel examples
- Updated config comments to include excel option

Dependencies:
- Added openpyxl>=3.1.2 for Excel support
- Added pytest>=7.4.0 and pytest-asyncio>=0.21.0 for testing

This contribution adds immediate value for users who need data analysis
capabilities and establishes a testing foundation for future development.
2025-11-28 04:44:12 +01:00
程序员阿江(Relakkes)
ff8c92daad chore: add copyright to every file 2025-11-18 12:24:02 +08:00
yt210
ef6948b305 新增存储到mongoDB 2025-11-06 10:40:30 +08:00
程序员阿江(Relakkes)
ae7955787c feat: kuaishou support url link 2025-10-18 07:40:10 +08:00
persist-1
be306c6f54 refactor(database): 重构数据库存储实现,使用SQLAlchemy ORM替代原始SQL操作
- 删除旧的async_db.py和async_sqlite_db.py实现
- 新增SQLAlchemy ORM模型和数据库会话管理
- 统一各平台存储实现到_store_impl.py文件
- 添加数据库初始化功能支持
- 更新.gitignore和pyproject.toml依赖配置
- 优化文件存储路径和命名规范
2025-09-06 04:10:20 +08:00
程序员阿江(Relakkes)
12450759d8 fix: httpx proxy format error
feat: add a ip proxy provider
2025-08-01 01:05:11 +08:00
未来可欺
e9f976117a 将配置文件恢复原状 2025-07-30 21:32:00 +08:00
未来可欺
173bc08a9d 添加了抖音存储视频以及图片的逻辑,并将config.py中ENABLE_GET_IMAGES参数更名为ENABLE_GET_MEIDAS,在此基础上略微修改存储逻辑 2025-07-30 18:24:08 +08:00
persist-1
19df1734f1 chore: 增加--help参数中文显示支持及douyin_aweme表music_download_url字段\n\n- 为命令行参数增加中文显示支持,提升用户体验\n- 在douyin_aweme表中新增music_download_url字段用于存储视频音乐下载链接\n- 更新相关数据库表结构文件(tables.sql, sqlite_tables.sql)\n- 实现音乐下载URL提取逻辑并集成到数据存储流程 2025-07-24 22:39:53 +08:00
程序员阿江(Relakkes)
a4d9aaa34a refactor: xhs update 2025-07-21 21:26:16 +08:00
程序员阿江(Relakkes)
26a43358cb chore: update config 2025-07-20 14:34:56 +08:00
程序员阿江(Relakkes)
13b00f7a36 refactor: config update 2025-07-18 23:26:52 +08:00
gaoxiaobei
b913db64bb refactor(config): move platform-specific configs to separate files
- Remove platform-specific configurations from base_config.py
- Create separate config files for each platform in their respective directories
- Update import statements in core files to use new platform-specific config modules
- Clean up unused and deprecated configuration options
2025-07-18 17:27:37 +08:00
gaoxiaobei
1dc8c1789f docs(config): update Bilibili search mode options
- Clarify the three search mode options for Bilibili
- Add note about setting MAX_NOTES_PER_DAY in bilibili config
2025-07-17 07:51:27 +08:00
gaoxiaobei
6ced357096 Merge branch 'main' into dev 2025-07-17 06:45:30 +08:00
gaoxiaobei
fb846e9060 Merge branch 'NanmiCoder:main' into main 2025-07-17 06:39:04 +08:00
gaoxiaobei
4d743f6c17 debug & resume default configuration 2025-07-14 08:00:48 +08:00
买定不离手
1673bd5c0c feat: 增强SQLite数据库配置和命令行参数支持
- 更新 cmd_arg/arg.py 文件,添加SQLite数据库选项的命令行参数解析支持
- 更新 config/base_config.py 文件,集成SQLite数据库的基础配置项和默认设置
- 更新 config/db_config.py 文件,扩展数据库配置以支持SQLite连接和参数管理
- 更新 pyproject.toml 文件,添加SQLite相关依赖包的版本管理和项目配置
2025-07-14 03:50:54 +08:00
gaoxiaobei
e91ec750bb feat: Enhance Bilibili crawler with retry logic and robustness
This commit introduces several improvements to enhance the stability and functionality of the Bilibili crawler.

- **Add Retry Logic:** Implement a retry mechanism with exponential backoff when fetching video comments. This makes the crawler more resilient to transient network issues or API errors.
- **Improve Error Handling:** Add a `try...except` block to handle potential `JSONDecodeError` in the Bilibili client, preventing crashes when the API returns an invalid response.
- **Ensure Clean Shutdown:** Refactor `main.py` to use a `try...finally` block, guaranteeing that the crawler and database connections are properly closed on exit, error, or `KeyboardInterrupt`.
- **Update Default Config:** Adjust default configuration values to increase concurrency, enable word cloud generation by default, and refine the Bilibili search mode for more practical usage.
2025-07-13 10:42:15 +08:00
gaoxiaobei
d0d7293926 feat(bilibili): Add flexible search modes and fix limit logic
Refactors the Bilibili keyword search functionality to provide more flexible crawling strategies and corrects a flaw in how crawl limits were applied.

Previously, the `ALL_DAY` boolean flag offered a rigid choice for time-based searching and contained a logical issue where `CRAWLER_MAX_NOTES_COUNT` was incorrectly applied on a per-day basis instead of as an overall total.

This commit introduces the `BILI_SEARCH_MODE` configuration option with three distinct modes:
- `normal`: The default search behavior without time constraints.
- `all_in_time_range`: Maximizes data collection within a specified date range, replicating the original intent of `ALL_DAY=True`.
- `daily_limit_in_time_range`: A new mode that strictly enforces both the daily `MAX_NOTES_PER_DAY` and the total `CRAWLER_MAX_NOTES_COUNT` limits across the entire date range.

This change resolves the limit logic bug and gives users more precise control over the crawling process.

Changes include:
- Modified `config/base_config.py` to replace `ALL_DAY` with `BILI_SEARCH_MODE`.
- Refactored `media_platform/bilibili/core.py` to implement the new search mode logic.
2025-07-13 06:07:13 +08:00
gaoxiaobei
cad9fc7af8 feat: Add daily limit for video/post crawling in Bilibili and base config 2025-07-12 14:50:59 +08:00
Lei Cao
355ed183dd 增加选择微博搜索类型的配置 2025-07-05 22:14:31 +00:00
程序员阿江(Relakkes)
848df2b491 feat: other platfrom support the cdp mode 2025-07-03 17:13:32 +08:00
程序员阿江(Relakkes)
e83b2422d9 feat: 支持playwright通过cdp协议连接本地chrome浏览器
docs: 增加uv来管理python依赖的文档
2025-06-25 23:22:39 +08:00
Bowenwin
66843f216a finish_all_for_expand_bili 2025-05-22 22:26:30 +08:00
Bowenwin
59619fff0a finish_all 2025-05-22 22:06:06 +08:00
Bowenwin
44e3d370ff fix_words 2025-05-22 20:31:48 +08:00
Bowenwin
a356358c21 get_fans_and_get_followings 2025-05-19 19:57:36 +08:00
Relakkes
b43d6b7b91 chore: update config 2025-02-12 10:58:48 +08:00
Relakkes
66a7ab1db8 refactor: bibi default to get without time data 2025-02-12 10:58:15 +08:00
翟持江
bf87821d4b 为 base_config.py 添加了是否开启按每一天进行爬取的选项
`bilibili 关键词搜索`仅返回 1000 条视频记录,共计 34 页,前 33 页视频记录数为 30,第 34 页视频记录数为 30。若要获取更多视频,需要设置更多筛选中的时间段选项,最高支持细分为 1 天的视频记录
---
此处添加了`START_DAY`与`END_DAY`以及`ALL_DAY`选项,在`ALL_DAY`为False时,使用原先关键词搜索策略,与原先版本保持不变;`ALL_DAY`为True时,从`START_DAY`与`END_DAY`中解析`client.py`中的search_video_by_keyword函数接收的`pubtime_begin_s`和`pubtime_end_s`参数,以实现最高支持细分为 1 天的视频记录,具体更改见之后提交的`client.py`和`core.py`
2025-01-15 18:06:16 +08:00
Relakkes
ea5223c708 feat: 知乎支持详情模式 2024-12-26 17:36:33 +08:00
liudongkai
33e7ef016d feat: xhs 非代理模式下增加随机等待间隔, db存储模式下增加存储xsec_token字段 2024-12-05 21:10:31 +08:00
Relakkes
43dffeb2d1 feat: xhs帖子详情获取优化 2024-11-26 13:37:53 +08:00
Relakkes
6a96c00b4b chore: config update & xinqqiu img update 2024-10-24 15:35:34 +08:00
unknown
7e53c4acfc All_platform_comments_restrict 2024-10-23 16:32:02 +08:00
unknown
19269c66fd xiaohongshu_comment_number_restrict 2024-10-22 20:33:10 +08:00
Relakkes
03e393949a fix: xhs帖子详情问题更新 2024-10-20 00:59:08 +08:00
Relakkes
9fe3e47b0f chore: 增加代码学习声明,严格禁止非法、禁止商业、不当用途 2024-10-20 00:43:25 +08:00
Relakkes
aa0f920369 feat: B站搜索接口增加发布日期筛选 2024-10-17 15:11:25 +08:00
Relakkes
da8f1c62b8 feat: 知乎支持创作者主页数据爬取(回答、文章、视频) 2024-10-16 21:02:27 +08:00
Relakkes
b7e57da0d2 feat: 知乎支持(关键词、评论) 2024-09-08 00:00:04 +08:00
Relakkes Yang
acb29add28 feat: 百度贴吧支持创作者主页帖子爬取 2024-08-24 11:03:23 +08:00
Relakkes
8adb593ba6 temp commit 2024-08-24 09:12:03 +08:00
Relakkes
ab7d8142af feat: weibo支持指定创作者主页 2024-08-24 05:52:11 +08:00
Relakkes
f371675d47 fix: xhs指定笔记ID获取方式增加解析html方式,原来的由于xsec_token导致失效 2024-08-11 22:37:10 +08:00