Commit Graph

217 Commits

Author SHA1 Message Date
未来可欺
6a10d0d11c 原始的HTTPStatusError不能捕获像ConnectError、ReadError这些异常类型,本次提交修改了捕获异常的类型为httpx模块请求异常的基类:HTTPError,以便捕获在httpx.request方法中引发的任何异常(例如ip被封,服务器拒接连接),正确处理爬取媒体被中断时并不会导致爬取文本的中断逻辑 2025-08-06 11:24:51 +08:00
未来可欺
81f2dbe4ab 添加了对媒体资源服务器的异常处理,参见 issue #691 2025-08-05 13:11:00 +08:00
程序员阿江(Relakkes)
b9d30bbabb fix: #693 2025-08-01 15:55:21 +08:00
未来可欺
a6fd9ebdbc 简单更改了抖音保存图片与视频的命名方式,一个视频 id 仅对应一个短视频,返回一个 video_download_url,因此不需要使用数字方式进行命名 2025-07-31 23:11:45 +08:00
未来可欺
0b81240aed 升级 httpx 版本至 0.28.1,并修改关键字参数 proxies 至 proxy 2025-07-31 22:48:02 +08:00
未来可欺
9d90e9fc6d fix issue #689,目前来看,应该是 httpx 库的问题,因为无论是使用同步还是异步版本,构不构造 httpx.***Client 对象来发起请求,返回的响应都是为空,response.content = b'',response.text = ’‘,但换成 requests 库就能正常获取数据了 2025-07-31 22:01:48 +08:00
未来可欺
93a1c27fff 通过测试search模式,修复部分运行时的bug,并对能够爬取媒体的平台设置了较长的超时时间 2025-07-30 21:19:56 +08:00
未来可欺
173bc08a9d 添加了抖音存储视频以及图片的逻辑,并将config.py中ENABLE_GET_IMAGES参数更名为ENABLE_GET_MEIDAS,在此基础上略微修改存储逻辑 2025-07-30 18:24:08 +08:00
korruz
07a6e387ea refactor: move format_proxy_info to utils and update crawler classes to use it 2025-07-29 14:16:24 +08:00
程序员阿江(Relakkes)
fc06c783f5 fix: fixed xhs req headers 2025-07-23 13:28:58 +08:00
程序员阿江(Relakkes)
a4d9aaa34a refactor: xhs update 2025-07-21 21:26:16 +08:00
程序员阿江(Relakkes)
13b00f7a36 refactor: config update 2025-07-18 23:26:52 +08:00
gaoxiaobei
8105b053ed Merge remote-tracking branch 'origin/dev' into devdev 2025-07-18 17:37:29 +08:00
gaoxiaobei
7176956e51 Merge branch 'NanmiCoder:main' into dev 2025-07-18 17:32:04 +08:00
gaoxiaobei
b913db64bb refactor(config): move platform-specific configs to separate files
- Remove platform-specific configurations from base_config.py
- Create separate config files for each platform in their respective directories
- Update import statements in core files to use new platform-specific config modules
- Clean up unused and deprecated configuration options
2025-07-18 17:27:37 +08:00
chenfangliang
aa54dad9a5 feat: 修复抖音二级评论地理位置缺失问题 2025-07-18 10:48:43 +08:00
gaoxiaobei
1dc8c1789f docs(config): update Bilibili search mode options
- Clarify the three search mode options for Bilibili
- Add note about setting MAX_NOTES_PER_DAY in bilibili config
2025-07-17 07:51:27 +08:00
gaoxiaobei
6ced357096 Merge branch 'main' into dev 2025-07-17 06:45:30 +08:00
gaoxiaobei
9fb396c7d1 fix(media_platform): handle edge cases and improve error handling for Bilibili client and crawler
- BilibiliClient:
  - Improve wbi_img_urls handling for better compatibility
  - Add error handling for missing or invalid 'is_end' and 'next' in comment cursor

- BilibiliCrawler:
  - Fix daily limit logic for keyword-based searches
  - Improve logging and break conditions for max notes count limits
  - Ensure proper tracking of total notes crawled for each keyword
2025-07-17 06:40:56 +08:00
gaoxiaobei
fb846e9060 Merge branch 'NanmiCoder:main' into main 2025-07-17 06:39:04 +08:00
程序员阿江(Relakkes)
c795b1316a fix: import error for #663 2025-07-16 10:58:11 +08:00
gaoxiaobei
4d743f6c17 debug & resume default configuration 2025-07-14 08:00:48 +08:00
gaoxiaobei
e91ec750bb feat: Enhance Bilibili crawler with retry logic and robustness
This commit introduces several improvements to enhance the stability and functionality of the Bilibili crawler.

- **Add Retry Logic:** Implement a retry mechanism with exponential backoff when fetching video comments. This makes the crawler more resilient to transient network issues or API errors.
- **Improve Error Handling:** Add a `try...except` block to handle potential `JSONDecodeError` in the Bilibili client, preventing crashes when the API returns an invalid response.
- **Ensure Clean Shutdown:** Refactor `main.py` to use a `try...finally` block, guaranteeing that the crawler and database connections are properly closed on exit, error, or `KeyboardInterrupt`.
- **Update Default Config:** Adjust default configuration values to increase concurrency, enable word cloud generation by default, and refine the Bilibili search mode for more practical usage.
2025-07-13 10:42:15 +08:00
gaoxiaobei
d0d7293926 feat(bilibili): Add flexible search modes and fix limit logic
Refactors the Bilibili keyword search functionality to provide more flexible crawling strategies and corrects a flaw in how crawl limits were applied.

Previously, the `ALL_DAY` boolean flag offered a rigid choice for time-based searching and contained a logical issue where `CRAWLER_MAX_NOTES_COUNT` was incorrectly applied on a per-day basis instead of as an overall total.

This commit introduces the `BILI_SEARCH_MODE` configuration option with three distinct modes:
- `normal`: The default search behavior without time constraints.
- `all_in_time_range`: Maximizes data collection within a specified date range, replicating the original intent of `ALL_DAY=True`.
- `daily_limit_in_time_range`: A new mode that strictly enforces both the daily `MAX_NOTES_PER_DAY` and the total `CRAWLER_MAX_NOTES_COUNT` limits across the entire date range.

This change resolves the limit logic bug and gives users more precise control over the crawling process.

Changes include:
- Modified `config/base_config.py` to replace `ALL_DAY` with `BILI_SEARCH_MODE`.
- Refactored `media_platform/bilibili/core.py` to implement the new search mode logic.
2025-07-13 06:07:13 +08:00
gaoxiaobei
cad9fc7af8 feat: Add daily limit for video/post crawling in Bilibili and base config 2025-07-12 14:50:59 +08:00
Lei Cao
355ed183dd 增加选择微博搜索类型的配置 2025-07-05 22:14:31 +00:00
mirza-samad-ahmed-baig
7edf3bcc15 refactor(bilibili): process creator videos in batches 2025-07-04 21:04:10 +05:00
程序员阿江(Relakkes)
848df2b491 feat: other platfrom support the cdp mode 2025-07-03 17:13:32 +08:00
程序员阿江(Relakkes)
e83b2422d9 feat: 支持playwright通过cdp协议连接本地chrome浏览器
docs: 增加uv来管理python依赖的文档
2025-06-25 23:22:39 +08:00
chimeElm
26a845581e Update client.py
修复CRAWLER_MAX_NOTES_COUNT在爬取小红书作者帖子时失效的问题
2025-06-07 02:41:09 +08:00
Bowenwin
66843f216a finish_all_for_expand_bili 2025-05-22 22:26:30 +08:00
Bowenwin
59619fff0a finish_all 2025-05-22 22:06:06 +08:00
Bowenwin
44e3d370ff fix_words 2025-05-22 20:31:48 +08:00
Bowenwin
a356358c21 get_fans_and_get_followings 2025-05-19 19:57:36 +08:00
Relakkes
67d31bf42a fix: dy update fp params 2025-04-30 13:26:22 +08:00
翟持江
af5a393a7a Update core.py,删除了其它代码贡献者所添加的try-catch语句,该段try-catch语句将会影响其代码的最终逻辑并令其失效,使其仅能爬取当天一天数据而无法跳转到下一天(原先的逻辑就是try-catch捕获异常从而进入下一天,不要再向该语句中添加捕获异常操作或者finally语句!) 2025-04-19 04:34:24 +08:00
Relakkes
0d715a9f32 fix: bili qrcode login fix 2025-04-08 21:11:40 +08:00
Relakkes
660fd18a95 fix: dy login fix 2025-04-08 20:58:04 +08:00
crpa33
274d64aefc 处理xhs意外的评论信息为空的情况
报错就会打断我,我没辙
2025-04-02 11:59:27 +08:00
crpa33
a39b571d27 输出到日志-处理视频搜索页任务列表构造的错误 2025-04-02 11:57:28 +08:00
crpa33
413d91a520 输出到日志-author被封禁或存在错误 2025-04-02 11:52:36 +08:00
crpa33
eaf14721f8 输出到日志-NoneType导致的推导式错误 2025-04-02 11:48:36 +08:00
crpa33
2c4af2337e douyin搜索页为空跳下一关键词
预计页数没到,空了也跳
2025-03-27 23:32:21 +08:00
crpa33
3c72fc48b0 保护author为None但未被识别的情况 2025-03-27 23:22:47 +08:00
crpa33
6b6e2b8ba0 修复NoneType导致的推导式错误 2025-03-27 23:18:01 +08:00
Relakkes
061d1c15e2 feat: kuaishou search params update 2025-03-11 23:42:34 +08:00
Relakkes
f2cf864c27 fix: zhihu article url error #564 2025-03-03 18:18:41 +08:00
Relakkes
678ce1bfac fix: bilibili bugfix 2025-02-10 17:13:37 +08:00
翟持江
0364b23b5b Update core.py,为爬取类型为detailcreator的任务,添加了和search任务一样的,用于转存up主信息的bilibili_store.update_up_info的函数调用
正如`search`函数中一样,在调用`get_video_info_task`后,`bilibili_video`和`bilibili_up_info`信息都将获得。
原先的`get_specified_videos`在`detail`任务中仅保存了指定`bilibili_video`的信息,而`bilibili_up_info`信息尚未保存,`creator`任务的`get_creator_videos`中也调用了`get_specified_videos`获取指定创作者下所有的视频信息,同理也未保存`bilibili_up_info`信息。
所以只需为`get_specified_videos`添加一句`await bilibili_store.update_up_info(video_detail)`即可和`search`任务下获得的数据文件个数保持一致,不会缺少对应up主的个人信息。
已测试:
- 原先仅`search`任务下产生`*_creator.csv`、`*_contents.csv`、`*_comments.csv`,而`detail`和`creator`任务下缺少`*_creator.csv`文件。
- 此次提交后将使三种模式下的数据文件个数一致。
2025-01-19 19:55:18 +08:00
翟持江
2d93ec5a82 Update core.py,更改了错误的缩进 2025-01-15 18:33:12 +08:00