MediaCrawler

mirror of https://github.com/NanmiCoder/MediaCrawler.git synced 2026-05-08 11:37:36 +08:00

Author	SHA1	Message	Date
程序员阿江(Relakkes)	fc06c783f5	fix: fixed xhs req headers	2025-07-23 13:28:58 +08:00
程序员阿江(Relakkes)	a4d9aaa34a	refactor: xhs update	2025-07-21 21:26:16 +08:00
程序员阿江(Relakkes)	13b00f7a36	refactor: config update	2025-07-18 23:26:52 +08:00
gaoxiaobei	8105b053ed	Merge remote-tracking branch 'origin/dev' into devdev	2025-07-18 17:37:29 +08:00
gaoxiaobei	7176956e51	Merge branch 'NanmiCoder:main' into dev	2025-07-18 17:32:04 +08:00
gaoxiaobei	b913db64bb	refactor(config): move platform-specific configs to separate files - Remove platform-specific configurations from base_config.py - Create separate config files for each platform in their respective directories - Update import statements in core files to use new platform-specific config modules - Clean up unused and deprecated configuration options	2025-07-18 17:27:37 +08:00
chenfangliang	aa54dad9a5	feat: fix missing geolocation in Douyin second-level comments	2025-07-18 10:48:43 +08:00
gaoxiaobei	1dc8c1789f	docs(config): update Bilibili search mode options - Clarify the three search mode options for Bilibili - Add note about setting MAX_NOTES_PER_DAY in bilibili config	2025-07-17 07:51:27 +08:00
gaoxiaobei	6ced357096	Merge branch 'main' into dev	2025-07-17 06:45:30 +08:00
gaoxiaobei	9fb396c7d1	fix(media_platform): handle edge cases and improve error handling for Bilibili client and crawler - BilibiliClient: - Improve wbi_img_urls handling for better compatibility - Add error handling for missing or invalid 'is_end' and 'next' in comment cursor - BilibiliCrawler: - Fix daily limit logic for keyword-based searches - Improve logging and break conditions for max notes count limits - Ensure proper tracking of total notes crawled for each keyword	2025-07-17 06:40:56 +08:00
gaoxiaobei	fb846e9060	Merge branch 'NanmiCoder:main' into main	2025-07-17 06:39:04 +08:00
程序员阿江(Relakkes)	c795b1316a	fix: import error for #663	2025-07-16 10:58:11 +08:00
gaoxiaobei	4d743f6c17	debug & resume default configuration	2025-07-14 08:00:48 +08:00
gaoxiaobei	e91ec750bb	feat: Enhance Bilibili crawler with retry logic and robustness This commit introduces several improvements to enhance the stability and functionality of the Bilibili crawler. - Add Retry Logic: Implement a retry mechanism with exponential backoff when fetching video comments. This makes the crawler more resilient to transient network issues or API errors. - Improve Error Handling: Add a `try...except` block to handle potential `JSONDecodeError` in the Bilibili client, preventing crashes when the API returns an invalid response. - Ensure Clean Shutdown: Refactor `main.py` to use a `try...finally` block, guaranteeing that the crawler and database connections are properly closed on exit, error, or `KeyboardInterrupt`. - Update Default Config: Adjust default configuration values to increase concurrency, enable word cloud generation by default, and refine the Bilibili search mode for more practical usage.	2025-07-13 10:42:15 +08:00
gaoxiaobei	d0d7293926	feat(bilibili): Add flexible search modes and fix limit logic Refactors the Bilibili keyword search functionality to provide more flexible crawling strategies and corrects a flaw in how crawl limits were applied. Previously, the `ALL_DAY` boolean flag offered a rigid choice for time-based searching and contained a logical issue where `CRAWLER_MAX_NOTES_COUNT` was incorrectly applied on a per-day basis instead of as an overall total. This commit introduces the `BILI_SEARCH_MODE` configuration option with three distinct modes: - `normal`: The default search behavior without time constraints. - `all_in_time_range`: Maximizes data collection within a specified date range, replicating the original intent of `ALL_DAY=True`. - `daily_limit_in_time_range`: A new mode that strictly enforces both the daily `MAX_NOTES_PER_DAY` and the total `CRAWLER_MAX_NOTES_COUNT` limits across the entire date range. This change resolves the limit logic bug and gives users more precise control over the crawling process. Changes include: - Modified `config/base_config.py` to replace `ALL_DAY` with `BILI_SEARCH_MODE`. - Refactored `media_platform/bilibili/core.py` to implement the new search mode logic.	2025-07-13 06:07:13 +08:00
gaoxiaobei	cad9fc7af8	feat: Add daily limit for video/post crawling in Bilibili and base config	2025-07-12 14:50:59 +08:00
Lei Cao	355ed183dd	Add a config option for selecting the Weibo search type	2025-07-05 22:14:31 +00:00
mirza-samad-ahmed-baig	7edf3bcc15	refactor(bilibili): process creator videos in batches	2025-07-04 21:04:10 +05:00
程序员阿江(Relakkes)	848df2b491	feat: other platforms support the cdp mode	2025-07-03 17:13:32 +08:00
程序员阿江(Relakkes)	e83b2422d9	feat: support connecting playwright to a local chrome browser over the cdp protocol docs: add documentation on managing python dependencies with uv	2025-06-25 23:22:39 +08:00
chimeElm	26a845581e	Update client.py: fix CRAWLER_MAX_NOTES_COUNT having no effect when crawling a Xiaohongshu creator's posts	2025-06-07 02:41:09 +08:00
Bowenwin	66843f216a	finish_all_for_expand_bili	2025-05-22 22:26:30 +08:00
Bowenwin	59619fff0a	finish_all	2025-05-22 22:06:06 +08:00
Bowenwin	44e3d370ff	fix_words	2025-05-22 20:31:48 +08:00
Bowenwin	a356358c21	get_fans_and_get_followings	2025-05-19 19:57:36 +08:00
Relakkes	67d31bf42a	fix: dy update fp params	2025-04-30 13:26:22 +08:00
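翟持江	af5a393a7a	Update core.py: removed the try-catch statement added by another contributor. That try-catch broke the code's final logic, so it could only crawl the current day's data and could not advance to the next day. (The original logic relies on the try-catch catching the exception to move on to the next day; do not add further exception handling or a finally clause to this statement!)	2025-04-19 04:34:24 +08:00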
Relakkes	0d715a9f32	fix: bili qrcode login fix	2025-04-08 21:11:40 +08:00
Relakkes	660fd18a95	fix: dy login fix	2025-04-08 20:58:04 +08:00
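crpa33	274d64aefc	Handle the case where xhs comment data is unexpectedly empty; the error kept interrupting the run and there was no other way around it	2025-04-02 11:59:27 +08:00
crpa33	a39b571d27	Log output: handle errors when building the task list for the video search page	2025-04-02 11:57:28 +08:00
crpa33	413d91a520	Log output: author is banned or has an error	2025-04-02 11:52:36 +08:00
crpa33	eaf14721f8	Log output: comprehension error caused by NoneType	2025-04-02 11:48:36 +08:00
crpa33	2c4af2337e	douyin: skip to the next keyword when the search page is empty, even if the expected page count has not been reached	2025-03-27 23:32:21 +08:00
crpa33	3c72fc48b0	Guard against the case where author is None but was not detected	2025-03-27 23:22:47 +08:00
crpa33	6b6e2b8ba0	Fix comprehension error caused by NoneType	2025-03-27 23:18:01 +08:00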
Relakkes	061d1c15e2	feat: kuaishou search params update	2025-03-11 23:42:34 +08:00
Relakkes	f2cf864c27	fix: zhihu article url error #564	2025-03-03 18:18:41 +08:00
Relakkes	678ce1bfac	fix: bilibili bugfix	2025-02-10 17:13:37 +08:00
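翟持江	0364b23b5b	Update core.py: for tasks of crawl type `detail` and `creator`, added the same `bilibili_store.update_up_info` call that the `search` task uses to store uploader (UP) info. As in the `search` function, after `get_video_info_task` is called, both `bilibili_video` and `bilibili_up_info` data are obtained. Previously, `get_specified_videos` in the `detail` task saved only the specified `bilibili_video` data and not the `bilibili_up_info` data; `get_creator_videos` in the `creator` task also calls `get_specified_videos` to fetch all videos of the specified creator, so it likewise did not save `bilibili_up_info`. Adding the single line `await bilibili_store.update_up_info(video_detail)` to `get_specified_videos` therefore makes the number of data files match the `search` task, so the uploader's profile info is no longer missing. Tested: - Previously only the `search` task produced `_creator.csv`, `_contents.csv`, and `_comments.csv`, while the `detail` and `creator` tasks lacked the `_creator.csv` file. - After this commit, all three modes produce the same number of data files.	2025-01-19 19:55:18 +08:00
翟持江	2d93ec5a82	Update core.py: fixed incorrect indentation	2025-01-15 18:33:12 +08:00
翟持江	d2ecd3b11d	Update client.py: corrected the wrong request parameters in `post_data` of `search_video_by_keyword`: `pubtime_begin` changed to `pubtime_begin_s`, `pubtime_end` changed to `pubtime_end_s`. Tested	2025-01-15 18:21:03 +08:00
翟持江	f2b41b573b	Update core.py: filter day by day from START_DAY to END_DAY, which gets past the 1000-video limit and crawls as many videos for the keyword as possible. Added the `get_pubtime_datetime` function to obtain the `pubtime_begin_s` and `pubtime_end_s` parameters, and added an `ALL_DAY` option to the `search` function. If `ALL_DAY` is off, the original search strategy is kept, but each keyword returns at most 1000 items. If `ALL_DAY` is on, the new strategy is used: filtering day by day from START_DAY to END_DAY, which gets past the 1000-video limit and crawls as many videos for the keyword as possible. The new `get_pubtime_datetime` function is used only in `search` and requires the user to install the `datetime` and `pandas` modules. Testing complete	2025-01-15 18:18:36 +08:00
翟持江	0118621a79	Made max_id_type a variable request parameter in the Weibo comment-crawling function get_note_all_comments. Besides the existing max_id parameter, max_id_type is now also parsed from the previous api result; it starts at 0 but changes to 1 as more comments are fetched. Also modified the request function of the WeiboClient class to refine handling of the returned ok_code, splitting it into 0, 1, else.... This way, even if the max_id and max_id_type obtained are None, it no longer triggers an ambiguous exception message such as '>' not supported between instances of 'NoneType' and 'int', which makes it easier to trace the problem to its source, namely an api response error. On incomplete comment data: the browser shows 1000+ comments; before this commit 308 were fetched, and after it 319. The last comment reached by manually scrolling in the web client matches the last one fetched by the program, so Weibo's default featured-comments feature may be what prevents fetching all of the Weibo...	2025-01-10 19:20:01 +08:00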
Relakkes	fbbead814a	fix: Tieba creator bug fix	2025-01-02 20:29:05 +08:00
Relakkes	ea5223c708	feat: Zhihu supports detail mode	2024-12-26 17:36:33 +08:00
liudongkai	33e7ef016d	feat: xhs adds a random wait interval in non-proxy mode, and stores the xsec_token field in db storage mode	2024-12-05 21:10:31 +08:00
leantli	e830ada574	feat: xhs comments add xsec_token	2024-12-03 18:25:21 +08:00
Trojx	f9eedc59b1	fix: comment crawling failed when crawling Weibo notes by creator, because the parsed parameter key was wrong	2024-11-29 10:47:40 +08:00
Relakkes	ca9b47ef63	fix: xhs post detail optimization	2024-11-27 09:41:24 +08:00

208 Commits
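
Commit d0d7293926 above describes a `daily_limit_in_time_range` mode that enforces both a per-day cap (`MAX_NOTES_PER_DAY`) and an overall cap (`CRAWLER_MAX_NOTES_COUNT`) across a date range. The following is only a minimal sketch of that idea, not the project's actual code: `crawl_with_limits`, `iter_days`, `fetch_day`, and `fake_fetch` are hypothetical names introduced for illustration, and the real crawler is asynchronous and talks to the Bilibili API.

```python
from datetime import date, timedelta
from typing import Callable, Iterator, List


def iter_days(start: date, end: date) -> Iterator[date]:
    """Yield every calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def crawl_with_limits(
    start: date,
    end: date,
    fetch_day: Callable[[date], List[str]],  # hypothetical: item ids for one day
    max_per_day: int,  # plays the role of MAX_NOTES_PER_DAY
    max_total: int,    # plays the role of CRAWLER_MAX_NOTES_COUNT
) -> List[str]:
    """Collect items day by day, honoring a daily cap and an overall cap."""
    collected: List[str] = []
    for day in iter_days(start, end):
        remaining = max_total - len(collected)
        if remaining <= 0:
            break  # overall cap reached; stop before querying further days
        quota = min(max_per_day, remaining)
        collected.extend(fetch_day(day)[:quota])
    return collected


def fake_fetch(day: date) -> List[str]:
    """Stand-in data source (assumption): five items per day."""
    return [f"{day.isoformat()}-{i}" for i in range(5)]


result = crawl_with_limits(
    date(2025, 7, 1), date(2025, 7, 4), fake_fetch, max_per_day=3, max_total=8
)
print(len(result), result[-1])
```

The overall cap is checked against the running total rather than reset each day, which is the distinction the commit message draws between a per-day limit and a limit on the whole crawl.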