MediaCrawler

mirror of https://github.com/NanmiCoder/MediaCrawler.git synced 2026-05-30 22:47:28 +08:00

Author	SHA1	Message	Date
程序员阿江(Relakkes)	8e93438fe5	Keep PR 900 overrides bounded and opt-in The PR adds API limit overrides and static proxy support, but the review found that the default proxy provider changed to an invalid static placeholder and the new API fields accepted unbounded values. This keeps the existing proxy default intact, makes static proxy explicit via config or CLI, validates API limit ranges, and adds focused regression coverage for both paths. Constraint: PR branch must remain contributor-branch compatible and avoid adding dependencies Rejected: Keep static as the default provider | breaks existing --enable_ip_proxy defaults with an invalid placeholder URL Rejected: Accept arbitrary integer limits | lets API callers request negative or excessive crawl sizes Confidence: high Scope-risk: narrow Directive: Do not change proxy provider defaults when adding new providers; new providers should be opt-in and covered by provider-specific tests Tested: uv run pytest tests/test_api_limits.py tests/test_static_proxy_provider.py Tested: uv run pytest tests Tested: uv run pytest test/test_utils.py Tested: uv run python -m compileall api cmd_arg config proxy tests Tested: git diff --cached --check Not-tested: Live crawler run against external platforms or real proxy vendor endpoints	2026-05-29 21:27:52 +08:00
钟保罗	5a362aebeb	feat: add static proxy mode	2026-05-20 12:50:24 +08:00
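程序员阿江(Relakkes)	5294b6d9b7	feat: support connecting to the user's existing Chrome browser for crawling. Add the CDP_CONNECT_EXISTING config option, enabled by default, which connects directly to the browser the user is currently running via Chrome remote debugging (chrome://inspect/#remote-debugging), reusing real cookies, extensions, and browsing history to greatly reduce the risk of platform risk-control detection. Main changes: - Add the _connect_existing_browser method, which connects to an existing browser directly over ws:// - Support waiting for the user to confirm the connection dialog in the browser (60-second timeout) - Do not close the user's browser process during cleanup - Fix signing failures on Xiaohongshu caused by too many cookies in a real browser - Update the README, CDP usage guide, and FAQ docs	2026-04-15 10:54:29 +08:00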
Junwen	2a52c15fb3	feat: add support for the overseas version of Xiaohongshu (rednote.com)	2026-04-08 23:12:53 +08:00
程序员阿江(Relakkes)	699a90f830	fix: xhs creator error	2026-04-07 12:54:39 +08:00
Wei Liu	2970488f40	docs: change newly added comments and documentation to Chinese Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-18 12:44:37 +13:00
Wei Liu	125e02a4b9	fix: make SSL verification opt-in via config, extend fix to all platforms - Add DISABLE_SSL_VERIFY = False to base_config.py (default: verification on) - Add tools/httpx_util.py with make_async_client() factory that reads the config - Replace all httpx.AsyncClient() call sites across all platforms (bilibili, weibo, zhihu, xhs, douyin, kuaishou) and crawler_util with make_async_client() - Extends SSL fix to previously missed platforms: xhs, douyin, kuaishou Users running behind an intercepting proxy can set DISABLE_SSL_VERIFY = True in config/base_config.py. All other users retain certificate verification. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-18 12:31:49 +13:00
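程序员阿江(Relakkes)	0282e626c9	feat: add JSONL storage format support and change the default storage format to jsonl. JSONL (JSON Lines) stores one JSON object per line and writes in append mode without reading existing data, so it performs far better than the JSON format on large datasets. - Add the core method AsyncFileWriter.write_to_jsonl() - Add a JsonlStoreImplement class for 7 platforms and register it with the factory - Change the config default from json to jsonl, with the CLI/API enums updated to match - Add jsonl to the guard condition in db_session.py to avoid raising ValueError by mistake - Word cloud generation supports reading JSONL files, preferring jsonl and falling back to json - The existing json option is fully retained for backward compatibility - Update related docs and tests	2026-03-03 23:31:07 +08:00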
程序员阿江(Relakkes)	d614ccf247	docs: translate comments and metadata to English Update Chinese comments, variable descriptions, and metadata across multiple configuration and core files to English. This improves codebase accessibility for international developers. Additionally, removed the sponsorship section from README files.	2026-02-12 05:30:11 +08:00
ouzhuowei	7484156f02	Add a data save path option; if not specified, data is saved to the data folder by default Co-Authored-By: ouzhuowei <190020754@qq.com>	2026-02-03 11:24:22 +08:00
Doiiars	70a6ca55bb	feat(database): add PostgreSQL support and fix Windows subprocess encoding	2026-01-09 00:41:59 +08:00
程序员阿江(Relakkes)	55d8c7783f	feat: weibo full context support	2025-12-26 19:22:24 +08:00
hsparks.codes	46ef86ddef	feat: Add Excel export functionality and unit tests Features: - Excel export with formatted multi-sheet workbooks (Contents, Comments, Creators) - Professional styling: blue headers, auto-width columns, borders, text wrapping - Smart export: empty sheets automatically removed - Support for all platforms (xhs, dy, ks, bili, wb, tieba, zhihu) Testing: - Added pytest framework with asyncio support - Unit tests for Excel store functionality - Unit tests for store factory pattern - Shared fixtures for test data - Test coverage for edge cases Documentation: - Comprehensive Excel export guide (docs/excel_export_guide.md) - Updated README.md and README_en.md with Excel examples - Updated config comments to include excel option Dependencies: - Added openpyxl>=3.1.2 for Excel support - Added pytest>=7.4.0 and pytest-asyncio>=0.21.0 for testing This contribution adds immediate value for users who need data analysis capabilities and establishes a testing foundation for future development.	2025-11-28 04:44:12 +01:00
程序员阿江(Relakkes)	ff8c92daad	chore: add copyright to every file	2025-11-18 12:24:02 +08:00
程序员阿江(Relakkes)	6dcfd7e0a5	refactor: weibo login	2025-11-17 17:11:35 +08:00
程序员阿江(Relakkes)	b6caa7a85e	refactor: add xhs creator params	2025-11-10 21:10:03 +08:00
程序员阿江(Relakkes)	1e3637f238	refactor: update xhs note detail	2025-11-10 18:13:51 +08:00
程序员阿江-Relakkes	05a1782746	Merge pull request #764 from yangtao210/main Add MongoDB storage	2025-11-06 06:10:49 -05:00
yt210	ef6948b305	Add MongoDB storage	2025-11-06 10:40:30 +08:00
程序员阿江(Relakkes)	3f5925e326	feat: update xhs sign	2025-10-27 19:06:07 +08:00
程序员阿江(Relakkes)	ae7955787c	feat: kuaishou support url link	2025-10-18 07:40:10 +08:00
程序员阿江(Relakkes)	a9dd08680f	feat: xhs support creator url link	2025-10-18 07:20:09 +08:00
程序员阿江(Relakkes)	cae707cb2a	feat: douyin support url link	2025-10-18 07:00:21 +08:00
程序员阿江(Relakkes)	906c259cc7	feat: bilibili support url link	2025-10-18 06:30:20 +08:00
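LePao1	3954c40e69	feat(bilibili): add a video quality parameter; the quality of downloaded videos can be changed via `BILI_QN`. Add video quality configuration to BilibiliClient and improve error handling. Fix download requests that were 302-redirected to a CDN: the old code did not follow redirects and only accepted "OK", which caused failures. Links with low quality or CDN redirects now download correctly.	2025-09-24 12:27:16 +08:00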
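persist-1	be306c6f54	refactor(database): refactor the database storage implementation to use SQLAlchemy ORM instead of raw SQL operations - Remove the old async_db.py and async_sqlite_db.py implementations - Add SQLAlchemy ORM models and database session management - Consolidate each platform's storage implementation into _store_impl.py files - Add support for database initialization - Update .gitignore and the pyproject.toml dependency configuration - Improve file storage paths and naming conventions	2025-09-06 04:10:20 +08:00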
程序员阿江(Relakkes)	12450759d8	fix: httpx proxy format error feat: add an IP proxy provider	2025-08-01 01:05:11 +08:00
未来可欺	e9f976117a	Restore the config file to its original state	2025-07-30 21:32:00 +08:00
未来可欺	173bc08a9d	Add logic for storing Douyin videos and images, rename the ENABLE_GET_IMAGES parameter in config.py to ENABLE_GET_MEIDAS, and slightly adjust the storage logic accordingly	2025-07-30 18:24:08 +08:00
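persist-1	19df1734f1	chore: add Chinese display support for the --help option and a music_download_url field to the douyin_aweme table - Add Chinese display support for command-line arguments to improve the user experience - Add a music_download_url field to the douyin_aweme table to store the download link for a video's music - Update the related database schema files (tables.sql, sqlite_tables.sql) - Implement music download URL extraction and integrate it into the data storage flow	2025-07-24 22:39:53 +08:00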
程序员阿江(Relakkes)	a4d9aaa34a	refactor: xhs update	2025-07-21 21:26:16 +08:00
程序员阿江(Relakkes)	26a43358cb	chore: update config	2025-07-20 14:34:56 +08:00
程序员阿江(Relakkes)	13b00f7a36	refactor: config update	2025-07-18 23:26:52 +08:00
gaoxiaobei	b913db64bb	refactor(config): move platform-specific configs to separate files - Remove platform-specific configurations from base_config.py - Create separate config files for each platform in their respective directories - Update import statements in core files to use new platform-specific config modules - Clean up unused and deprecated configuration options	2025-07-18 17:27:37 +08:00
gaoxiaobei	1dc8c1789f	docs(config): update Bilibili search mode options - Clarify the three search mode options for Bilibili - Add note about setting MAX_NOTES_PER_DAY in bilibili config	2025-07-17 07:51:27 +08:00
gaoxiaobei	6ced357096	Merge branch 'main' into dev	2025-07-17 06:45:30 +08:00
gaoxiaobei	fb846e9060	Merge branch 'NanmiCoder:main' into main	2025-07-17 06:39:04 +08:00
gaoxiaobei	4d743f6c17	debug & resume default configuration	2025-07-14 08:00:48 +08:00
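买定不离手	1673bd5c0c	feat: enhance SQLite database configuration and command-line argument support - Update cmd_arg/arg.py to add command-line argument parsing for the SQLite database option - Update config/base_config.py to integrate the base config items and default settings for SQLite - Update config/db_config.py to extend the database configuration to support SQLite connections and parameter management - Update pyproject.toml to add version management and project configuration for SQLite-related dependencies	2025-07-14 03:50:54 +08:00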
gaoxiaobei	e91ec750bb	feat: Enhance Bilibili crawler with retry logic and robustness This commit introduces several improvements to enhance the stability and functionality of the Bilibili crawler. - Add Retry Logic: Implement a retry mechanism with exponential backoff when fetching video comments. This makes the crawler more resilient to transient network issues or API errors. - Improve Error Handling: Add a `try...except` block to handle potential `JSONDecodeError` in the Bilibili client, preventing crashes when the API returns an invalid response. - Ensure Clean Shutdown: Refactor `main.py` to use a `try...finally` block, guaranteeing that the crawler and database connections are properly closed on exit, error, or `KeyboardInterrupt`. - Update Default Config: Adjust default configuration values to increase concurrency, enable word cloud generation by default, and refine the Bilibili search mode for more practical usage.	2025-07-13 10:42:15 +08:00
gaoxiaobei	d0d7293926	feat(bilibili): Add flexible search modes and fix limit logic Refactors the Bilibili keyword search functionality to provide more flexible crawling strategies and corrects a flaw in how crawl limits were applied. Previously, the `ALL_DAY` boolean flag offered a rigid choice for time-based searching and contained a logical issue where `CRAWLER_MAX_NOTES_COUNT` was incorrectly applied on a per-day basis instead of as an overall total. This commit introduces the `BILI_SEARCH_MODE` configuration option with three distinct modes: - `normal`: The default search behavior without time constraints. - `all_in_time_range`: Maximizes data collection within a specified date range, replicating the original intent of `ALL_DAY=True`. - `daily_limit_in_time_range`: A new mode that strictly enforces both the daily `MAX_NOTES_PER_DAY` and the total `CRAWLER_MAX_NOTES_COUNT` limits across the entire date range. This change resolves the limit logic bug and gives users more precise control over the crawling process. Changes include: - Modified `config/base_config.py` to replace `ALL_DAY` with `BILI_SEARCH_MODE`. - Refactored `media_platform/bilibili/core.py` to implement the new search mode logic.	2025-07-13 06:07:13 +08:00
gaoxiaobei	cad9fc7af8	feat: Add daily limit for video/post crawling in Bilibili and base config	2025-07-12 14:50:59 +08:00
Lei Cao	355ed183dd	Add a config option for selecting the Weibo search type	2025-07-05 22:14:31 +00:00
程序员阿江(Relakkes)	848df2b491	feat: other platforms support the cdp mode	2025-07-03 17:13:32 +08:00
程序员阿江(Relakkes)	e83b2422d9	feat: support connecting Playwright to a local Chrome browser via the CDP protocol docs: add documentation on using uv to manage Python dependencies	2025-06-25 23:22:39 +08:00
Bowenwin	66843f216a	finish_all_for_expand_bili	2025-05-22 22:26:30 +08:00
Bowenwin	59619fff0a	finish_all	2025-05-22 22:06:06 +08:00
Bowenwin	44e3d370ff	fix_words	2025-05-22 20:31:48 +08:00
Bowenwin	a356358c21	get_fans_and_get_followings	2025-05-19 19:57:36 +08:00
Relakkes	b43d6b7b91	chore: update config	2025-02-12 10:58:48 +08:00

133 Commits