Commit Graph

221 Commits

Author SHA1 Message Date
LePao1
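3954c40e69 feat(bilibili): add a video quality parameter; the quality of downloaded videos can be changed via BILI_QN.
Add video quality configuration to BilibiliClient and improve error handling. Fix download requests being 302-redirected to the CDN: the old code did not follow redirects and accepted only "OK", so downloads failed. Links that use low quality or redirect to the CDN now download correctly.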
2025-09-24 12:27:16 +08:00
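The redirect fix described in this commit can be illustrated with a minimal sketch. It is not the project's code: `download` and the injected `fetch` callable are hypothetical names, and a response is modeled as a plain `(status, headers, body)` tuple so the example runs without network access. With httpx itself, redirects are normally followed by passing `follow_redirects=True` to the client or request.

```python
from typing import Callable, Dict, Tuple
from urllib.parse import urljoin

# Hypothetical response model: (status code, headers, body).
Response = Tuple[int, Dict[str, str], bytes]

REDIRECT_CODES = {301, 302, 303, 307, 308}


def download(url: str, fetch: Callable[[str], Response], max_redirects: int = 5) -> bytes:
    """Fetch url, following redirects and accepting any 2xx status."""
    for _ in range(max_redirects + 1):
        status, headers, body = fetch(url)
        if status in REDIRECT_CODES:
            location = headers.get("Location")
            if not location:
                raise RuntimeError(f"redirect without Location header from {url}")
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
            continue
        if 200 <= status < 300:
            # Accept every success status, not only 200 "OK".
            return body
        raise RuntimeError(f"unexpected status {status} for {url}")
    raise RuntimeError("too many redirects")
```

Injecting `fetch` keeps the redirect logic separate from the HTTP library, which makes it easy to exercise with canned responses.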
刘小龙
c87df59996 log client modify 2025-09-09 15:27:46 +08:00
程序员阿江(Relakkes)
2bce3593f7 feat: support time delay for all platforms 2025-09-02 16:43:09 +08:00
程序员阿江(Relakkes)
eb799e1fa7 refactor: xhs extractor 2025-09-02 14:50:32 +08:00
未来可欺
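6a10d0d11c The original HTTPStatusError could not catch exception types such as ConnectError and ReadError. This commit changes the caught exception type to HTTPError, the base class for request exceptions in the httpx module, so that any exception raised in httpx.request (for example, a banned IP or a server refusing the connection) is caught, and an interrupted media crawl no longer interrupts the text crawl 2025-08-06 11:24:51 +08:00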
未来可欺
81f2dbe4ab Add exception handling for the media resource server; see issue #691 2025-08-05 13:11:00 +08:00
程序员阿江(Relakkes)
b9d30bbabb fix: #693 2025-08-01 15:55:21 +08:00
未来可欺
a6fd9ebdbc Simplify the naming of saved Douyin images and videos: one video ID corresponds to exactly one short video and returns one video_download_url, so numeric naming is unnecessary 2025-07-31 23:11:45 +08:00
未来可欺
0b81240aed Upgrade httpx to 0.28.1 and change the keyword argument proxies to proxy 2025-07-31 22:48:02 +08:00
未来可欺
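9d90e9fc6d fix issue #689. For now this appears to be a problem in the httpx library: with either the sync or the async version, and whether or not an httpx.***Client object is constructed to send the request, the response comes back empty (response.content = b'', response.text = ''), whereas switching to the requests library returns the data normally 2025-07-31 22:01:48 +08:00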
未来可欺
93a1c27fff Fix several runtime bugs found by testing search mode, and set a longer timeout for platforms that can crawl media 2025-07-30 21:19:56 +08:00
未来可欺
173bc08a9d Add logic for storing Douyin videos and images, rename the ENABLE_GET_IMAGES parameter in config.py to ENABLE_GET_MEIDAS, and slightly adjust the storage logic accordingly 2025-07-30 18:24:08 +08:00
korruz
07a6e387ea refactor: move format_proxy_info to utils and update crawler classes to use it 2025-07-29 14:16:24 +08:00
程序员阿江(Relakkes)
fc06c783f5 fix: fixed xhs req headers 2025-07-23 13:28:58 +08:00
程序员阿江(Relakkes)
a4d9aaa34a refactor: xhs update 2025-07-21 21:26:16 +08:00
程序员阿江(Relakkes)
13b00f7a36 refactor: config update 2025-07-18 23:26:52 +08:00
gaoxiaobei
8105b053ed Merge remote-tracking branch 'origin/dev' into devdev 2025-07-18 17:37:29 +08:00
gaoxiaobei
7176956e51 Merge branch 'NanmiCoder:main' into dev 2025-07-18 17:32:04 +08:00
gaoxiaobei
b913db64bb refactor(config): move platform-specific configs to separate files
- Remove platform-specific configurations from base_config.py
- Create separate config files for each platform in their respective directories
- Update import statements in core files to use new platform-specific config modules
- Clean up unused and deprecated configuration options
2025-07-18 17:27:37 +08:00
chenfangliang
aa54dad9a5 feat: fix missing geolocation in Douyin second-level comments 2025-07-18 10:48:43 +08:00
gaoxiaobei
1dc8c1789f docs(config): update Bilibili search mode options
- Clarify the three search mode options for Bilibili
- Add note about setting MAX_NOTES_PER_DAY in bilibili config
2025-07-17 07:51:27 +08:00
gaoxiaobei
6ced357096 Merge branch 'main' into dev 2025-07-17 06:45:30 +08:00
gaoxiaobei
9fb396c7d1 fix(media_platform): handle edge cases and improve error handling for Bilibili client and crawler
- BilibiliClient:
  - Improve wbi_img_urls handling for better compatibility
  - Add error handling for missing or invalid 'is_end' and 'next' in comment cursor

- BilibiliCrawler:
  - Fix daily limit logic for keyword-based searches
  - Improve logging and break conditions for max notes count limits
  - Ensure proper tracking of total notes crawled for each keyword
2025-07-17 06:40:56 +08:00
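The cursor handling mentioned in this commit can be sketched as below. This is only an illustration under assumptions: the function name `read_comment_cursor` and the enclosing `cursor` key are hypothetical, and only the `is_end` and `next` field names come from the commit message.

```python
from typing import Any, Dict, Tuple


def read_comment_cursor(data: Dict[str, Any]) -> Tuple[bool, int]:
    """Return (is_end, next_page), stopping pagination on bad input."""
    # Assumed layout: the fields live under a "cursor" key that may be absent or None.
    cursor = data.get("cursor") or {}
    is_end = cursor.get("is_end")
    next_page = cursor.get("next")
    if not isinstance(is_end, bool) or not isinstance(next_page, int):
        # Missing or invalid fields: treat as the last page instead of crashing.
        return True, 0
    return is_end, next_page
```

Treating a malformed cursor as "end of comments" trades completeness for robustness; a caller that prefers to retry could raise instead.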
gaoxiaobei
fb846e9060 Merge branch 'NanmiCoder:main' into main 2025-07-17 06:39:04 +08:00
程序员阿江(Relakkes)
c795b1316a fix: import error for #663 2025-07-16 10:58:11 +08:00
gaoxiaobei
4d743f6c17 debug & resume default configuration 2025-07-14 08:00:48 +08:00
gaoxiaobei
e91ec750bb feat: Enhance Bilibili crawler with retry logic and robustness
This commit introduces several improvements to enhance the stability and functionality of the Bilibili crawler.

- **Add Retry Logic:** Implement a retry mechanism with exponential backoff when fetching video comments. This makes the crawler more resilient to transient network issues or API errors.
- **Improve Error Handling:** Add a `try...except` block to handle potential `JSONDecodeError` in the Bilibili client, preventing crashes when the API returns an invalid response.
- **Ensure Clean Shutdown:** Refactor `main.py` to use a `try...finally` block, guaranteeing that the crawler and database connections are properly closed on exit, error, or `KeyboardInterrupt`.
- **Update Default Config:** Adjust default configuration values to increase concurrency, enable word cloud generation by default, and refine the Bilibili search mode for more practical usage.
2025-07-13 10:42:15 +08:00
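A retry mechanism with exponential backoff, as named in this commit, might look like the following minimal sketch. It is synchronous for brevity and is not taken from the crawler; `retry_with_backoff`, its parameters, and the injectable `sleep` are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with delays of base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                # Out of attempts: let the last error propagate to the caller.
                raise
            sleep(base_delay * (2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter lets a test record the delays instead of actually waiting. An async crawler would use `await asyncio.sleep(...)` in the same position.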
gaoxiaobei
d0d7293926 feat(bilibili): Add flexible search modes and fix limit logic
Refactors the Bilibili keyword search functionality to provide more flexible crawling strategies and corrects a flaw in how crawl limits were applied.

Previously, the `ALL_DAY` boolean flag offered a rigid choice for time-based searching and contained a logical issue where `CRAWLER_MAX_NOTES_COUNT` was incorrectly applied on a per-day basis instead of as an overall total.

This commit introduces the `BILI_SEARCH_MODE` configuration option with three distinct modes:
- `normal`: The default search behavior without time constraints.
- `all_in_time_range`: Maximizes data collection within a specified date range, replicating the original intent of `ALL_DAY=True`.
- `daily_limit_in_time_range`: A new mode that strictly enforces both the daily `MAX_NOTES_PER_DAY` and the total `CRAWLER_MAX_NOTES_COUNT` limits across the entire date range.

This change resolves the limit logic bug and gives users more precise control over the crawling process.

Changes include:
- Modified `config/base_config.py` to replace `ALL_DAY` with `BILI_SEARCH_MODE`.
- Refactored `media_platform/bilibili/core.py` to implement the new search mode logic.
2025-07-13 06:07:13 +08:00
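The interaction of the two limits in `daily_limit_in_time_range` can be sketched as follows. This is an assumption-laden illustration, not the code in `media_platform/bilibili/core.py`: `crawl_with_limits` is a hypothetical function, and the number of notes available per day is passed in as plain data rather than fetched.

```python
from typing import Dict


def crawl_with_limits(
    available_per_day: Dict[str, int],
    max_notes_per_day: int,
    max_notes_total: int,
) -> Dict[str, int]:
    """Return how many notes are taken on each day under both limits."""
    total = 0
    taken: Dict[str, int] = {}
    for day, available in available_per_day.items():
        if total >= max_notes_total:
            # The overall limit applies across the whole date range.
            break
        # The daily quota shrinks once the overall budget is nearly used up.
        quota = min(max_notes_per_day, max_notes_total - total)
        count = min(available, quota)
        taken[day] = count
        total += count
    return taken
```

The key point is that the total is tracked outside the per-day loop, which is the behavior the commit describes as missing when the limit was applied per day.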
gaoxiaobei
cad9fc7af8 feat: Add daily limit for video/post crawling in Bilibili and base config 2025-07-12 14:50:59 +08:00
Lei Cao
355ed183dd Add a configuration option for selecting the Weibo search type 2025-07-05 22:14:31 +00:00
mirza-samad-ahmed-baig
7edf3bcc15 refactor(bilibili): process creator videos in batches 2025-07-04 21:04:10 +05:00
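Processing items in batches, as this commit title describes, can be sketched with a small generator. The helper name `iter_batches` and the batch size are hypothetical; the commit does not say how the project splits creator videos.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

Each batch can then be fetched and stored before the next one starts, which bounds memory use and the number of concurrent requests.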
程序员阿江(Relakkes)
848df2b491 feat: other platforms support the cdp mode 2025-07-03 17:13:32 +08:00
程序员阿江(Relakkes)
e83b2422d9 feat: support connecting Playwright to a local Chrome browser over the CDP protocol
docs: add documentation on managing Python dependencies with uv
2025-06-25 23:22:39 +08:00
chimeElm
26a845581e Update client.py
Fix CRAWLER_MAX_NOTES_COUNT having no effect when crawling Xiaohongshu author posts
2025-06-07 02:41:09 +08:00
Bowenwin
66843f216a finish_all_for_expand_bili 2025-05-22 22:26:30 +08:00
Bowenwin
59619fff0a finish_all 2025-05-22 22:06:06 +08:00
Bowenwin
44e3d370ff fix_words 2025-05-22 20:31:48 +08:00
Bowenwin
a356358c21 get_fans_and_get_followings 2025-05-19 19:57:36 +08:00
Relakkes
67d31bf42a fix: dy update fp params 2025-04-30 13:26:22 +08:00
翟持江
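af5a393a7a Update core.py: remove the try-catch statement added by another contributor. That statement broke the code's final logic, so it could only crawl the current day's data and never advanced to the next day. (The original logic relies on try-catch catching the exception to move on to the next day; do not add further exception handling or a finally clause to this statement!) 2025-04-19 04:34:24 +08:00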
Relakkes
0d715a9f32 fix: bili qrcode login fix 2025-04-08 21:11:40 +08:00
Relakkes
660fd18a95 fix: dy login fix 2025-04-08 20:58:04 +08:00
crpa33
274d64aefc Handle the case where xhs comment data is unexpectedly empty
Any error interrupts the run, and I had no other way around it
2025-04-02 11:59:27 +08:00
crpa33
a39b571d27 Log output: handle errors in building the task list for the video search page 2025-04-02 11:57:28 +08:00
crpa33
413d91a520 Log output: author is banned or has an error 2025-04-02 11:52:36 +08:00
crpa33
eaf14721f8 Log output: comprehension error caused by NoneType 2025-04-02 11:48:36 +08:00
crpa33
2c4af2337e Skip to the next keyword when the douyin search page is empty
Skip even if the expected page count has not been reached
2025-03-27 23:32:21 +08:00
crpa33
3c72fc48b0 Guard against the case where author is None but goes undetected 2025-03-27 23:22:47 +08:00
crpa33
6b6e2b8ba0 Fix comprehension error caused by NoneType 2025-03-27 23:18:01 +08:00
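A comprehension over a value that may be None raises a TypeError, which is the class of failure this commit names. A minimal guard, assuming a hypothetical payload with an `items` list of dicts carrying an `id` key (none of these names come from the commit), could look like this:

```python
from typing import Any, Dict, List


def extract_ids(payload: Dict[str, Any]) -> List[str]:
    """Collect item ids, tolerating a missing or None list and None entries."""
    # "or []" covers both a missing key and an explicit None value.
    items = payload.get("items") or []
    return [item["id"] for item in items if item and "id" in item]
```

The same `or []` idiom applies wherever an API may return null in place of an empty list.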
Relakkes
061d1c15e2 feat: kuaishou search params update 2025-03-11 23:42:34 +08:00