MediaCrawler

mirror of https://github.com/NanmiCoder/MediaCrawler.git synced 2026-02-20 05:51:00 +08:00

Author	SHA1	Message	Date
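LePao1	3954c40e69	feat(bilibili): add a video quality parameter; the download quality can be changed via `BILI_QN`. Add video quality configuration to BilibiliClient and improve error handling. Fix downloads failing when the request was 302-redirected to a CDN: the old code did not follow redirects and only accepted "OK". Low-quality and CDN-redirected links now download correctly.	2025-09-24 12:27:16 +08:00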
刘小龙	c87df59996	log client modify	2025-09-09 15:27:46 +08:00
程序员阿江(Relakkes)	2bce3593f7	feat: support time delay for all platforms	2025-09-02 16:43:09 +08:00
程序员阿江(Relakkes)	eb799e1fa7	refactor: xhs extractor	2025-09-02 14:50:32 +08:00
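未来可欺	6a10d0d11c	The original HTTPStatusError could not catch exception types such as ConnectError and ReadError. This commit changes the caught exception type to HTTPError, the base class of httpx request exceptions, so that any exception raised in httpx.request is caught (for example, a banned IP or a server refusing the connection). This correctly implements the logic that an interruption while crawling media does not interrupt text crawling.	2025-08-06 11:24:51 +08:00
未来可欺	81f2dbe4ab	Add exception handling for media resource servers; see issue #691	2025-08-05 13:11:00 +08:00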
程序员阿江(Relakkes)	b9d30bbabb	fix: #693	2025-08-01 15:55:21 +08:00
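未来可欺	a6fd9ebdbc	Slightly change how Douyin images and videos are named when saved: one video ID corresponds to exactly one short video and returns one video_download_url, so numeric naming is unnecessary	2025-07-31 23:11:45 +08:00
未来可欺	0b81240aed	Upgrade httpx to 0.28.1 and change the keyword argument proxies to proxy	2025-07-31 22:48:02 +08:00
未来可欺	9d90e9fc6d	fix issue #689. For now this appears to be a problem with the httpx library: whether the sync or async version is used, and whether or not an httpx.***Client object is constructed to send the request, the response comes back empty (response.content = b'', response.text = ''), whereas switching to the requests library retrieves the data normally	2025-07-31 22:01:48 +08:00
未来可欺	93a1c27fff	Fix several runtime bugs found by testing search mode, and set a longer timeout for platforms that can crawl media	2025-07-30 21:19:56 +08:00
未来可欺	173bc08a9d	Add logic for storing Douyin videos and images, rename the ENABLE_GET_IMAGES parameter in config.py to ENABLE_GET_MEIDAS, and slightly adjust the storage logic accordingly	2025-07-30 18:24:08 +08:00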
korruz	07a6e387ea	refactor: move format_proxy_info to utils and update crawler classes to use it	2025-07-29 14:16:24 +08:00
程序员阿江(Relakkes)	fc06c783f5	fix: fixed xhs req headers	2025-07-23 13:28:58 +08:00
程序员阿江(Relakkes)	a4d9aaa34a	refactor: xhs update	2025-07-21 21:26:16 +08:00
程序员阿江(Relakkes)	13b00f7a36	refactor: config update	2025-07-18 23:26:52 +08:00
gaoxiaobei	8105b053ed	Merge remote-tracking branch 'origin/dev' into devdev	2025-07-18 17:37:29 +08:00
gaoxiaobei	7176956e51	Merge branch 'NanmiCoder:main' into dev	2025-07-18 17:32:04 +08:00
gaoxiaobei	b913db64bb	refactor(config): move platform-specific configs to separate files - Remove platform-specific configurations from base_config.py - Create separate config files for each platform in their respective directories - Update import statements in core files to use new platform-specific config modules - Clean up unused and deprecated configuration options	2025-07-18 17:27:37 +08:00
chenfangliang	aa54dad9a5	feat: fix missing geolocation in Douyin second-level comments	2025-07-18 10:48:43 +08:00
gaoxiaobei	1dc8c1789f	docs(config): update Bilibili search mode options - Clarify the three search mode options for Bilibili - Add note about setting MAX_NOTES_PER_DAY in bilibili config	2025-07-17 07:51:27 +08:00
gaoxiaobei	6ced357096	Merge branch 'main' into dev	2025-07-17 06:45:30 +08:00
gaoxiaobei	9fb396c7d1	fix(media_platform): handle edge cases and improve error handling for Bilibili client and crawler - BilibiliClient: - Improve wbi_img_urls handling for better compatibility - Add error handling for missing or invalid 'is_end' and 'next' in comment cursor - BilibiliCrawler: - Fix daily limit logic for keyword-based searches - Improve logging and break conditions for max notes count limits - Ensure proper tracking of total notes crawled for each keyword	2025-07-17 06:40:56 +08:00
gaoxiaobei	fb846e9060	Merge branch 'NanmiCoder:main' into main	2025-07-17 06:39:04 +08:00
程序员阿江(Relakkes)	c795b1316a	fix: import error for #663	2025-07-16 10:58:11 +08:00
gaoxiaobei	4d743f6c17	debug & resume default configuration	2025-07-14 08:00:48 +08:00
gaoxiaobei	e91ec750bb	feat: Enhance Bilibili crawler with retry logic and robustness This commit introduces several improvements to enhance the stability and functionality of the Bilibili crawler. - Add Retry Logic: Implement a retry mechanism with exponential backoff when fetching video comments. This makes the crawler more resilient to transient network issues or API errors. - Improve Error Handling: Add a `try...except` block to handle potential `JSONDecodeError` in the Bilibili client, preventing crashes when the API returns an invalid response. - Ensure Clean Shutdown: Refactor `main.py` to use a `try...finally` block, guaranteeing that the crawler and database connections are properly closed on exit, error, or `KeyboardInterrupt`. - Update Default Config: Adjust default configuration values to increase concurrency, enable word cloud generation by default, and refine the Bilibili search mode for more practical usage.	2025-07-13 10:42:15 +08:00
gaoxiaobei	d0d7293926	feat(bilibili): Add flexible search modes and fix limit logic Refactors the Bilibili keyword search functionality to provide more flexible crawling strategies and corrects a flaw in how crawl limits were applied. Previously, the `ALL_DAY` boolean flag offered a rigid choice for time-based searching and contained a logical issue where `CRAWLER_MAX_NOTES_COUNT` was incorrectly applied on a per-day basis instead of as an overall total. This commit introduces the `BILI_SEARCH_MODE` configuration option with three distinct modes: - `normal`: The default search behavior without time constraints. - `all_in_time_range`: Maximizes data collection within a specified date range, replicating the original intent of `ALL_DAY=True`. - `daily_limit_in_time_range`: A new mode that strictly enforces both the daily `MAX_NOTES_PER_DAY` and the total `CRAWLER_MAX_NOTES_COUNT` limits across the entire date range. This change resolves the limit logic bug and gives users more precise control over the crawling process. Changes include: - Modified `config/base_config.py` to replace `ALL_DAY` with `BILI_SEARCH_MODE`. - Refactored `media_platform/bilibili/core.py` to implement the new search mode logic.	2025-07-13 06:07:13 +08:00
gaoxiaobei	cad9fc7af8	feat: Add daily limit for video/post crawling in Bilibili and base config	2025-07-12 14:50:59 +08:00
Lei Cao	355ed183dd	Add a configuration option for selecting the Weibo search type	2025-07-05 22:14:31 +00:00
mirza-samad-ahmed-baig	7edf3bcc15	refactor(bilibili): process creator videos in batches	2025-07-04 21:04:10 +05:00
程序员阿江(Relakkes)	848df2b491	feat: other platforms support the cdp mode	2025-07-03 17:13:32 +08:00
程序员阿江(Relakkes)	e83b2422d9	feat: support Playwright connecting to a local Chrome browser via the CDP protocol docs: add documentation on managing Python dependencies with uv	2025-06-25 23:22:39 +08:00
chimeElm	26a845581e	Update client.py: fix CRAWLER_MAX_NOTES_COUNT having no effect when crawling posts by a Xiaohongshu author	2025-06-07 02:41:09 +08:00
Bowenwin	66843f216a	finish_all_for_expand_bili	2025-05-22 22:26:30 +08:00
Bowenwin	59619fff0a	finish_all	2025-05-22 22:06:06 +08:00
Bowenwin	44e3d370ff	fix_words	2025-05-22 20:31:48 +08:00
Bowenwin	a356358c21	get_fans_and_get_followings	2025-05-19 19:57:36 +08:00
Relakkes	67d31bf42a	fix: dy update fp params	2025-04-30 13:26:22 +08:00
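翟持江	af5a393a7a	Update core.py: remove the try-catch statement added by another contributor. That try-catch broke the final logic of the code, so that only the current day's data could be crawled and the crawler could not advance to the next day. (The original logic relies on try-catch catching an exception to move on to the next day; do not add further exception handling or a finally clause to this statement!)	2025-04-19 04:34:24 +08:00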
Relakkes	0d715a9f32	fix: bili qrcode login fix	2025-04-08 21:11:40 +08:00
Relakkes	660fd18a95	fix: dy login fix	2025-04-08 20:58:04 +08:00
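crpa33	274d64aefc	Handle the unexpected case of empty xhs comment data; the error kept interrupting the run and there was no other way around it	2025-04-02 11:59:27 +08:00
crpa33	a39b571d27	Log output: handle errors in building the task list for the video search page	2025-04-02 11:57:28 +08:00
crpa33	413d91a520	Log output: author is banned or has an error	2025-04-02 11:52:36 +08:00
crpa33	eaf14721f8	Log output: comprehension error caused by NoneType	2025-04-02 11:48:36 +08:00
crpa33	2c4af2337e	Douyin: skip to the next keyword when the search page is empty, even if the expected page count has not been reached	2025-03-27 23:32:21 +08:00
crpa33	3c72fc48b0	Guard against the case where author is None but not detected	2025-03-27 23:22:47 +08:00
crpa33	6b6e2b8ba0	Fix comprehension error caused by NoneType	2025-03-27 23:18:01 +08:00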
Relakkes	061d1c15e2	feat: kuaishou search params update	2025-03-11 23:42:34 +08:00
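
Commit e91ec750bb in the log above describes a retry mechanism with exponential backoff for fetching video comments. The repository's actual implementation is not shown here, so the following is only a minimal sketch of the general technique. The function name `retry_with_backoff` and its parameters (`max_attempts`, `base_delay`, `sleep`) are hypothetical and are not taken from MediaCrawler.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or max_attempts is exhausted.

    Hypothetical helper: the wait doubles after each failure
    (base_delay, 2 * base_delay, 4 * base_delay, ...).
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter is a design choice for this sketch: it lets a caller substitute a recorder in tests instead of actually waiting.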


221 Commits
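
Commit d0d7293926 describes a `daily_limit_in_time_range` mode that enforces both a per-day limit (`MAX_NOTES_PER_DAY`) and an overall limit (`CRAWLER_MAX_NOTES_COUNT`) across a date range. The sketch below illustrates that two-limit logic in isolation, assuming the number of available items per day is already known. The function `plan_daily_limit` and its arguments are hypothetical and do not reflect the code in `media_platform/bilibili/core.py`.

```python
from typing import Dict


def plan_daily_limit(
    available_per_day: Dict[str, int],
    max_per_day: int,
    max_total: int,
) -> Dict[str, int]:
    """Return how many items to take on each day.

    Hypothetical helper: each day is capped by max_per_day, and the
    running sum over all days is capped by max_total.
    """
    taken: Dict[str, int] = {}
    total = 0
    for day, available in available_per_day.items():
        if total >= max_total:
            break  # overall limit reached; skip the remaining days
        quota = min(available, max_per_day, max_total - total)
        taken[day] = quota
        total += quota
    return taken


plan = plan_daily_limit(
    {"2025-07-01": 10, "2025-07-02": 3, "2025-07-03": 10},
    max_per_day=5,
    max_total=9,
)
print(plan)  # {'2025-07-01': 5, '2025-07-02': 3, '2025-07-03': 1}
```

The overall total is tracked across days rather than reset each day, which is the distinction the commit message draws against the earlier per-day application of the limit.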