feat: other platforms support CDP mode

程序员阿江(Relakkes)
2025-07-03 17:13:32 +08:00
parent c892c3324c
commit 848df2b491
9 changed files with 565 additions and 102 deletions

CLAUDE.local.md (new file, 166 lines)
View File

@@ -0,0 +1,166 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
MediaCrawler is a multi-platform social media data collection tool supporting platforms like Xiaohongshu (Little Red Book), Douyin (TikTok), Kuaishou, Bilibili, Weibo, Tieba, and Zhihu. The project uses Playwright for browser automation and maintains login states to crawl public information without needing JS reverse engineering.
## Development Environment Setup
### Prerequisites
- **Python**: >= 3.9 (verified with 3.9.6)
- **Node.js**: >= 16.0.0 (required for Douyin and Zhihu crawlers)
- **uv**: Modern Python package manager (recommended)
### Installation Commands
```bash
# Using uv (recommended)
uv sync
uv run playwright install
# Using traditional pip (fallback)
pip install -r requirements.txt
playwright install
```
### Running the Application
```bash
# Basic crawling command
uv run main.py --platform xhs --lt qrcode --type search
# View all available options
uv run main.py --help
# Using traditional Python
python main.py --platform xhs --lt qrcode --type search
```
## Architecture Overview
### Core Components
1. **Platform Crawlers** (`media_platform/`):
- Each platform has its own crawler implementation
- Follows abstract base class pattern (`base/base_crawler.py`)
- Platforms: `xhs`, `dy`, `ks`, `bili`, `wb`, `tieba`, `zhihu`
2. **Configuration System** (`config/`):
- `base_config.py`: Main configuration file with extensive options
- `db_config.py`: Database configuration
- Key settings: login types, proxy settings, CDP mode, data storage options
3. **Data Storage** (`store/`):
- Multiple storage backends: CSV, JSON, MySQL
- Platform-specific storage implementations
- Image download capabilities
4. **Caching System** (`cache/`):
- Local cache and Redis cache implementations
- Factory pattern for cache selection
5. **Proxy Support** (`proxy/`):
- IP proxy pool management
- Multiple proxy provider support (Kuaidaili, Jishu)
6. **Browser Automation** (`tools/`):
- Playwright browser launcher
- CDP (Chrome DevTools Protocol) support
- Slider validation utilities
### Key Configuration Options
- `PLATFORM`: Target platform (xhs, dy, ks, bili, wb, tieba, zhihu)
- `KEYWORDS`: Search keywords (comma-separated)
- `CRAWLER_TYPE`: Type of crawling (search, detail, creator)
- `ENABLE_CDP_MODE`: Use Chrome DevTools Protocol for better anti-detection
- `SAVE_DATA_OPTION`: Data storage format (csv, db, json)
- `ENABLE_GET_COMMENTS`: Enable comment crawling
- `ENABLE_IP_PROXY`: Enable proxy IP rotation
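These options are plain module-level constants in `config/base_config.py`. A minimal sketch (the values shown are illustrative, not the repository defaults):
```python
# config/base_config.py -- illustrative values only; check the file for the real defaults
PLATFORM = "xhs"                # xhs | dy | ks | bili | wb | tieba | zhihu
KEYWORDS = "python,playwright"  # comma-separated search keywords
CRAWLER_TYPE = "search"         # search | detail | creator
ENABLE_CDP_MODE = True          # use Chrome DevTools Protocol for better anti-detection
SAVE_DATA_OPTION = "json"       # csv | db | json
ENABLE_GET_COMMENTS = True      # also crawl comments
ENABLE_IP_PROXY = False         # rotate proxy IPs
```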
## Testing
### Available Test Commands
```bash
# Run all tests
python -m unittest discover test
# Run specific test files
python -m unittest test.test_expiring_local_cache
python -m unittest test.test_proxy_ip_pool
python -m unittest test.test_redis_cache
python -m unittest test.test_utils
# Install and use pytest (enhanced testing)
uv add pytest
uv run pytest test/
```
### Test Coverage
- Cache functionality tests
- Proxy IP pool tests
- Utility function tests
- Redis cache tests (requires Redis server)
## Database Setup
### MySQL Database Initialization
```bash
# Initialize database tables (first time only)
python db.py
# Or with uv
uv run db.py
```
### Supported Storage Options
- **MySQL**: Full relational database with deduplication
- **CSV**: Simple file-based storage in `data/` directory
- **JSON**: Structured file-based storage in `data/` directory
## Common Development Tasks
### Adding New Platform Support
1. Create new directory in `media_platform/`
2. Implement crawler class inheriting from `AbstractCrawler`
3. Add platform-specific client, core, field, and login modules
4. Update `CrawlerFactory` in `main.py`
5. Add storage implementation in `store/`
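A minimal sketch of step 2, assuming the base class exposes `start`, `search`, and `close` hooks (check `base/base_crawler.py` for the exact abstract methods; the platform name and URL below are hypothetical):
```python
# media_platform/newplatform/core.py -- hypothetical platform, sketch only
from base.base_crawler import AbstractCrawler


class NewPlatformCrawler(AbstractCrawler):
    def __init__(self):
        self.index_url = "https://www.example.com"  # hypothetical entry URL

    async def start(self):
        # launch the browser context (CDP or standard mode), restore login state,
        # then dispatch to search/detail/creator crawling based on CRAWLER_TYPE
        ...

    async def search(self):
        # call the platform client and persist results via the store/ implementation
        ...

    async def close(self):
        # close the browser context (and the CDP manager, if one was used)
        ...
```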
### Debugging CDP Mode
- Set `ENABLE_CDP_MODE = True` in config
- Use `CDP_HEADLESS = False` for visual debugging
- Check browser console for CDP connection issues
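A debugging configuration sketch (these CDP options are the ones referenced by this commit; the port value is an assumed default, and the launcher falls back to the next free port if it is taken):
```python
# config/base_config.py -- CDP debugging sketch
ENABLE_CDP_MODE = True     # route the crawler through a locally launched Chrome/Edge
CDP_HEADLESS = False       # keep the browser window visible for visual debugging
CDP_DEBUG_PORT = 9222      # preferred remote-debugging port (assumed default)
AUTO_CLOSE_BROWSER = True  # close the launched browser process during cleanup
```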
### Managing Login States
- Login states are cached in `browser_data/` directory
- Platform-specific user data directories maintain session cookies
- Set `SAVE_LOGIN_STATE = True` to preserve login across runs
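In CDP mode the profile directory is derived from the same settings; a sketch of the logic added in `tools/cdp_browser.py`:
```python
import os

import config

# Where CDP mode keeps its per-platform browser profile when SAVE_LOGIN_STATE is enabled
# (mirrors the user-data-dir handling in tools/cdp_browser.py).
if config.SAVE_LOGIN_STATE:
    user_data_dir = os.path.join(
        os.getcwd(), "browser_data", f"cdp_{config.USER_DATA_DIR % config.PLATFORM}"
    )
    os.makedirs(user_data_dir, exist_ok=True)
```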
## Platform-Specific Notes
### Xiaohongshu (XHS)
- Supports search, detail, and creator crawling
- Requires `xsec_token` and `xsec_source` parameters for specific note URLs
- Custom User-Agent configuration available
### Douyin (DY)
- Requires Node.js environment
- Supports publish time filtering
- Has specific creator ID format (sec_id)
### Bilibili (BILI)
- Supports date range filtering with `START_DAY` and `END_DAY`
- Can crawl creator fans/following lists
- Uses BV video ID format
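For example, restricting a Bilibili keyword search to January 2024 (option names come from `config/base_config.py`; the end date is illustrative):
```python
# config/base_config.py -- Bilibili date-range sketch
START_DAY = "2024-01-01"  # YYYY-MM-DD; None disables the time range
END_DAY = "2024-01-31"    # YYYY-MM-DD; None disables the time range
```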
## Legal and Usage Notes
This project is for educational and research purposes only. Users must:
- Comply with platform terms of service
- Follow robots.txt rules
- Control request frequency appropriately
- Not use for commercial purposes
- Respect platform rate limits
The project includes comprehensive legal disclaimers and usage guidelines in the README.md file.

View File

@@ -22,7 +22,7 @@ CRAWLER_TYPE = (
"search" # 爬取类型search(关键词搜索) | detail(帖子详情)| creator(创作者主页数据)
)
# Custom User-Agent (currently only effective for XHS)
UA = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0'
UA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0"
# Whether to enable IP proxy
ENABLE_IP_PROXY = False
@@ -190,9 +190,9 @@ ZHIHU_CREATOR_URL_LIST = [
# List of specified Zhihu post IDs to crawl
ZHIHU_SPECIFIED_ID_LIST = [
"https://www.zhihu.com/question/826896610/answer/4885821440", # 回答
"https://zhuanlan.zhihu.com/p/673461588", # 文章
"https://www.zhihu.com/zvideo/1539542068422144000" # 视频
"https://www.zhihu.com/question/826896610/answer/4885821440", # 回答
"https://zhuanlan.zhihu.com/p/673461588", # 文章
"https://www.zhihu.com/zvideo/1539542068422144000", # 视频
]
# Word cloud related settings
@@ -212,10 +212,10 @@ STOP_WORDS_FILE = "./docs/hit_stopwords.txt"
FONT_PATH = "./docs/STZHONGS.TTF"
# Start date for crawling, bilibili keyword search only, YYYY-MM-DD format; if None, no time range is applied and the default keyword search returns at most 1000 videos
START_DAY = '2024-01-01'
START_DAY = "2024-01-01"
# End date for crawling, bilibili keyword search only, YYYY-MM-DD format; if None, no time range is applied and the default keyword search returns at most 1000 videos
END_DAY = '2024-01-01'
END_DAY = "2024-01-01"
# Whether to crawl day by day, bilibili keyword search only
# If False, the START_DAY and END_DAY values are ignored
@@ -233,4 +233,4 @@ START_CONTACTS_PAGE = 1
CRAWLER_MAX_CONTACTS_COUNT_SINGLENOTES = 100
# Limit on the number of dynamics crawled per creator (single creator)
CRAWLER_MAX_DYNAMICS_COUNT_SINGLENOTES = 50
CRAWLER_MAX_DYNAMICS_COUNT_SINGLENOTES = 50

View File

@@ -22,13 +22,14 @@ from typing import Dict, List, Optional, Tuple, Union
from datetime import datetime, timedelta
import pandas as pd
from playwright.async_api import (BrowserContext, BrowserType, Page, async_playwright)
from playwright.async_api import (BrowserContext, BrowserType, Page, Playwright, async_playwright)
import config
from base.base_crawler import AbstractCrawler
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
from store import bilibili as bilibili_store
from tools import utils
from tools.cdp_browser import CDPBrowserManager
from var import crawler_type_var, source_keyword_var
from .client import BilibiliClient
@@ -41,10 +42,12 @@ class BilibiliCrawler(AbstractCrawler):
context_page: Page
bili_client: BilibiliClient
browser_context: BrowserContext
cdp_manager: Optional[CDPBrowserManager]
def __init__(self):
self.index_url = "https://www.bilibili.com"
self.user_agent = utils.get_user_agent()
self.cdp_manager = None
async def start(self):
playwright_proxy_format, httpx_proxy_format = None, None
@@ -55,14 +58,23 @@ class BilibiliCrawler(AbstractCrawler):
ip_proxy_info)
async with async_playwright() as playwright:
# Launch a browser context.
chromium = playwright.chromium
self.browser_context = await self.launch_browser(
chromium,
None,
self.user_agent,
headless=config.HEADLESS
)
# Choose the launch mode based on configuration
if config.ENABLE_CDP_MODE:
utils.logger.info("[BilibiliCrawler] 使用CDP模式启动浏览器")
self.browser_context = await self.launch_browser_with_cdp(
playwright, playwright_proxy_format, self.user_agent,
headless=config.CDP_HEADLESS
)
else:
utils.logger.info("[BilibiliCrawler] 使用标准模式启动浏览器")
# Launch a browser context.
chromium = playwright.chromium
self.browser_context = await self.launch_browser(
chromium,
None,
self.user_agent,
headless=config.HEADLESS
)
# stealth.min.js is a js script to prevent the website from detecting the crawler.
await self.browser_context.add_init_script(path="libs/stealth.min.js")
self.context_page = await self.browser_context.new_page()
@@ -434,6 +446,42 @@ class BilibiliCrawler(AbstractCrawler):
)
return browser_context
async def launch_browser_with_cdp(self, playwright: Playwright, playwright_proxy: Optional[Dict],
user_agent: Optional[str], headless: bool = True) -> BrowserContext:
"""
Launch the browser in CDP mode
"""
try:
self.cdp_manager = CDPBrowserManager()
browser_context = await self.cdp_manager.launch_and_connect(
playwright=playwright,
playwright_proxy=playwright_proxy,
user_agent=user_agent,
headless=headless
)
# Log browser info
browser_info = await self.cdp_manager.get_browser_info()
utils.logger.info(f"[BilibiliCrawler] CDP浏览器信息: {browser_info}")
return browser_context
except Exception as e:
utils.logger.error(f"[BilibiliCrawler] CDP模式启动失败回退到标准模式: {e}")
# Fall back to standard mode
chromium = playwright.chromium
return await self.launch_browser(chromium, playwright_proxy, user_agent, headless)
async def close(self):
"""Close browser context"""
# CDP mode needs special cleanup handling
if self.cdp_manager:
await self.cdp_manager.cleanup()
self.cdp_manager = None
else:
await self.browser_context.close()
utils.logger.info("[BilibiliCrawler.close] Browser context closed ...")
async def get_bilibili_video(self, video_item: Dict, semaphore: asyncio.Semaphore):
"""
download bilibili video

View File

@@ -16,13 +16,14 @@ import time
from asyncio import Task
from typing import Dict, List, Optional, Tuple
from playwright.async_api import BrowserContext, BrowserType, Page, async_playwright
from playwright.async_api import BrowserContext, BrowserType, Page, Playwright, async_playwright
import config
from base.base_crawler import AbstractCrawler
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
from store import kuaishou as kuaishou_store
from tools import utils
from tools.cdp_browser import CDPBrowserManager
from var import comment_tasks_var, crawler_type_var, source_keyword_var
from .client import KuaiShouClient
@@ -34,10 +35,12 @@ class KuaishouCrawler(AbstractCrawler):
context_page: Page
ks_client: KuaiShouClient
browser_context: BrowserContext
cdp_manager: Optional[CDPBrowserManager]
def __init__(self):
self.index_url = "https://www.kuaishou.com"
self.user_agent = utils.get_user_agent()
self.cdp_manager = None
async def start(self):
playwright_proxy_format, httpx_proxy_format = None, None
@@ -51,11 +54,20 @@ class KuaishouCrawler(AbstractCrawler):
)
async with async_playwright() as playwright:
# Launch a browser context.
chromium = playwright.chromium
self.browser_context = await self.launch_browser(
chromium, None, self.user_agent, headless=config.HEADLESS
)
# Choose the launch mode based on configuration
if config.ENABLE_CDP_MODE:
utils.logger.info("[KuaishouCrawler] 使用CDP模式启动浏览器")
self.browser_context = await self.launch_browser_with_cdp(
playwright, playwright_proxy_format, self.user_agent,
headless=config.CDP_HEADLESS
)
else:
utils.logger.info("[KuaishouCrawler] 使用标准模式启动浏览器")
# Launch a browser context.
chromium = playwright.chromium
self.browser_context = await self.launch_browser(
chromium, None, self.user_agent, headless=config.HEADLESS
)
# stealth.min.js is a js script to prevent the website from detecting the crawler.
await self.browser_context.add_init_script(path="libs/stealth.min.js")
self.context_page = await self.browser_context.new_page()
@@ -307,6 +319,32 @@ class KuaishouCrawler(AbstractCrawler):
)
return browser_context
async def launch_browser_with_cdp(self, playwright: Playwright, playwright_proxy: Optional[Dict],
user_agent: Optional[str], headless: bool = True) -> BrowserContext:
"""
Launch the browser in CDP mode
"""
try:
self.cdp_manager = CDPBrowserManager()
browser_context = await self.cdp_manager.launch_and_connect(
playwright=playwright,
playwright_proxy=playwright_proxy,
user_agent=user_agent,
headless=headless
)
# Log browser info
browser_info = await self.cdp_manager.get_browser_info()
utils.logger.info(f"[KuaishouCrawler] CDP浏览器信息: {browser_info}")
return browser_context
except Exception as e:
utils.logger.error(f"[KuaishouCrawler] CDP模式启动失败回退到标准模式: {e}")
# Fall back to standard mode
chromium = playwright.chromium
return await self.launch_browser(chromium, playwright_proxy, user_agent, headless)
async def get_creators_and_videos(self) -> None:
"""Get creator's videos and retrieve their comment information."""
utils.logger.info(
@@ -347,5 +385,10 @@ class KuaishouCrawler(AbstractCrawler):
async def close(self):
"""Close browser context"""
await self.browser_context.close()
# CDP mode needs special cleanup handling
if self.cdp_manager:
await self.cdp_manager.cleanup()
self.cdp_manager = None
else:
await self.browser_context.close()
utils.logger.info("[KuaishouCrawler.close] Browser context closed ...")

View File

@@ -15,7 +15,7 @@ import random
from asyncio import Task
from typing import Dict, List, Optional, Tuple
from playwright.async_api import (BrowserContext, BrowserType, Page,
from playwright.async_api import (BrowserContext, BrowserType, Page, Playwright,
async_playwright)
import config
@@ -24,6 +24,7 @@ from model.m_baidu_tieba import TiebaCreator, TiebaNote
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
from store import tieba as tieba_store
from tools import utils
from tools.cdp_browser import CDPBrowserManager
from tools.crawler_util import format_proxy_info
from var import crawler_type_var, source_keyword_var
@@ -37,11 +38,13 @@ class TieBaCrawler(AbstractCrawler):
context_page: Page
tieba_client: BaiduTieBaClient
browser_context: BrowserContext
cdp_manager: Optional[CDPBrowserManager]
def __init__(self) -> None:
self.index_url = "https://tieba.baidu.com"
self.user_agent = utils.get_user_agent()
self._page_extractor = TieBaExtractor()
self.cdp_manager = None
async def start(self) -> None:
"""
@@ -305,11 +308,42 @@ class TieBaCrawler(AbstractCrawler):
)
return browser_context
async def launch_browser_with_cdp(self, playwright: Playwright, playwright_proxy: Optional[Dict],
user_agent: Optional[str], headless: bool = True) -> BrowserContext:
"""
Launch the browser in CDP mode
"""
try:
self.cdp_manager = CDPBrowserManager()
browser_context = await self.cdp_manager.launch_and_connect(
playwright=playwright,
playwright_proxy=playwright_proxy,
user_agent=user_agent,
headless=headless
)
# Log browser info
browser_info = await self.cdp_manager.get_browser_info()
utils.logger.info(f"[TieBaCrawler] CDP浏览器信息: {browser_info}")
return browser_context
except Exception as e:
utils.logger.error(f"[TieBaCrawler] CDP模式启动失败回退到标准模式: {e}")
# Fall back to standard mode
chromium = playwright.chromium
return await self.launch_browser(chromium, playwright_proxy, user_agent, headless)
async def close(self):
"""
Close browser context
Returns:
"""
await self.browser_context.close()
# CDP mode needs special cleanup handling
if self.cdp_manager:
await self.cdp_manager.cleanup()
self.cdp_manager = None
else:
await self.browser_context.close()
utils.logger.info("[BaiduTieBaCrawler.close] Browser context closed ...")

View File

@@ -21,7 +21,7 @@ import random
from asyncio import Task
from typing import Dict, List, Optional, Tuple
from playwright.async_api import (BrowserContext, BrowserType, Page,
from playwright.async_api import (BrowserContext, BrowserType, Page, Playwright,
async_playwright)
import config
@@ -29,6 +29,7 @@ from base.base_crawler import AbstractCrawler
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
from store import weibo as weibo_store
from tools import utils
from tools.cdp_browser import CDPBrowserManager
from var import crawler_type_var, source_keyword_var
from .client import WeiboClient
@@ -42,12 +43,14 @@ class WeiboCrawler(AbstractCrawler):
context_page: Page
wb_client: WeiboClient
browser_context: BrowserContext
cdp_manager: Optional[CDPBrowserManager]
def __init__(self):
self.index_url = "https://www.weibo.com"
self.mobile_index_url = "https://m.weibo.cn"
self.user_agent = utils.get_user_agent()
self.mobile_user_agent = utils.get_mobile_user_agent()
self.cdp_manager = None
async def start(self):
playwright_proxy_format, httpx_proxy_format = None, None
@@ -57,14 +60,23 @@ class WeiboCrawler(AbstractCrawler):
playwright_proxy_format, httpx_proxy_format = self.format_proxy_info(ip_proxy_info)
async with async_playwright() as playwright:
# Launch a browser context.
chromium = playwright.chromium
self.browser_context = await self.launch_browser(
chromium,
None,
self.mobile_user_agent,
headless=config.HEADLESS
)
# Choose the launch mode based on configuration
if config.ENABLE_CDP_MODE:
utils.logger.info("[WeiboCrawler] 使用CDP模式启动浏览器")
self.browser_context = await self.launch_browser_with_cdp(
playwright, playwright_proxy_format, self.mobile_user_agent,
headless=config.CDP_HEADLESS
)
else:
utils.logger.info("[WeiboCrawler] 使用标准模式启动浏览器")
# Launch a browser context.
chromium = playwright.chromium
self.browser_context = await self.launch_browser(
chromium,
None,
self.mobile_user_agent,
headless=config.HEADLESS
)
# stealth.min.js is a js script to prevent the website from detecting the crawler.
await self.browser_context.add_init_script(path="libs/stealth.min.js")
self.context_page = await self.browser_context.new_page()
@@ -330,3 +342,39 @@ class WeiboCrawler(AbstractCrawler):
user_agent=user_agent
)
return browser_context
async def launch_browser_with_cdp(self, playwright: Playwright, playwright_proxy: Optional[Dict],
user_agent: Optional[str], headless: bool = True) -> BrowserContext:
"""
Launch the browser in CDP mode
"""
try:
self.cdp_manager = CDPBrowserManager()
browser_context = await self.cdp_manager.launch_and_connect(
playwright=playwright,
playwright_proxy=playwright_proxy,
user_agent=user_agent,
headless=headless
)
# Log browser info
browser_info = await self.cdp_manager.get_browser_info()
utils.logger.info(f"[WeiboCrawler] CDP浏览器信息: {browser_info}")
return browser_context
except Exception as e:
utils.logger.error(f"[WeiboCrawler] CDP模式启动失败回退到标准模式: {e}")
# Fall back to standard mode
chromium = playwright.chromium
return await self.launch_browser(chromium, playwright_proxy, user_agent, headless)
async def close(self):
"""Close browser context"""
# CDP mode needs special cleanup handling
if self.cdp_manager:
await self.cdp_manager.cleanup()
self.cdp_manager = None
else:
await self.browser_context.close()
utils.logger.info("[WeiboCrawler.close] Browser context closed ...")

View File

@@ -16,7 +16,7 @@ import random
from asyncio import Task
from typing import Dict, List, Optional, Tuple, cast
from playwright.async_api import (BrowserContext, BrowserType, Page,
from playwright.async_api import (BrowserContext, BrowserType, Page, Playwright,
async_playwright)
import config
@@ -26,6 +26,7 @@ from model.m_zhihu import ZhihuContent, ZhihuCreator
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
from store import zhihu as zhihu_store
from tools import utils
from tools.cdp_browser import CDPBrowserManager
from var import crawler_type_var, source_keyword_var
from .client import ZhiHuClient
@@ -38,12 +39,14 @@ class ZhihuCrawler(AbstractCrawler):
context_page: Page
zhihu_client: ZhiHuClient
browser_context: BrowserContext
cdp_manager: Optional[CDPBrowserManager]
def __init__(self) -> None:
self.index_url = "https://www.zhihu.com"
# self.user_agent = utils.get_user_agent()
self.user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
self._extractor = ZhihuExtractor()
self.cdp_manager = None
async def start(self) -> None:
"""
@@ -58,14 +61,23 @@ class ZhihuCrawler(AbstractCrawler):
playwright_proxy_format, httpx_proxy_format = self.format_proxy_info(ip_proxy_info)
async with async_playwright() as playwright:
# Launch a browser context.
chromium = playwright.chromium
self.browser_context = await self.launch_browser(
chromium,
None,
self.user_agent,
headless=config.HEADLESS
)
# Choose the launch mode based on configuration
if config.ENABLE_CDP_MODE:
utils.logger.info("[ZhihuCrawler] 使用CDP模式启动浏览器")
self.browser_context = await self.launch_browser_with_cdp(
playwright, playwright_proxy_format, self.user_agent,
headless=config.CDP_HEADLESS
)
else:
utils.logger.info("[ZhihuCrawler] 使用标准模式启动浏览器")
# Launch a browser context.
chromium = playwright.chromium
self.browser_context = await self.launch_browser(
chromium,
None,
self.user_agent,
headless=config.HEADLESS
)
# stealth.min.js is a js script to prevent the website from detecting the crawler.
await self.browser_context.add_init_script(path="libs/stealth.min.js")
@@ -365,7 +377,38 @@ class ZhihuCrawler(AbstractCrawler):
)
return browser_context
async def launch_browser_with_cdp(self, playwright: Playwright, playwright_proxy: Optional[Dict],
user_agent: Optional[str], headless: bool = True) -> BrowserContext:
"""
Launch the browser in CDP mode
"""
try:
self.cdp_manager = CDPBrowserManager()
browser_context = await self.cdp_manager.launch_and_connect(
playwright=playwright,
playwright_proxy=playwright_proxy,
user_agent=user_agent,
headless=headless
)
# Log browser info
browser_info = await self.cdp_manager.get_browser_info()
utils.logger.info(f"[ZhihuCrawler] CDP浏览器信息: {browser_info}")
return browser_context
except Exception as e:
utils.logger.error(f"[ZhihuCrawler] CDP模式启动失败回退到标准模式: {e}")
# Fall back to standard mode
chromium = playwright.chromium
return await self.launch_browser(chromium, playwright_proxy, user_agent, headless)
async def close(self):
"""Close browser context"""
await self.browser_context.close()
# CDP mode needs special cleanup handling
if self.cdp_manager:
await self.cdp_manager.cleanup()
self.cdp_manager = None
else:
await self.browser_context.close()
utils.logger.info("[ZhihuCrawler.close] Browser context closed ...")

View File

@@ -115,6 +115,7 @@ class BrowserLauncher:
args = [
browser_path,
f"--remote-debugging-port={debug_port}",
"--remote-debugging-address=0.0.0.0", # 允许远程访问
"--no-first-run",
"--no-default-browser-check",
"--disable-background-timer-throttling",
@@ -127,8 +128,8 @@ class BrowserLauncher:
"--disable-sync",
"--disable-web-security", # 可能有助于某些网站的访问
"--disable-features=VizDisplayCompositor",
"--disable-extensions-except", # 保留用户扩展
"--load-extension", # 允许加载扩展
"--disable-dev-shm-usage", # 避免共享内存问题
"--no-sandbox", # 在CDP模式下关闭沙箱
]
# Headless mode
@@ -136,7 +137,12 @@ class BrowserLauncher:
args.extend([
"--headless",
"--disable-gpu",
"--no-sandbox",
])
else:
# Keep some stability flags in non-headless mode as well
args.extend([
"--disable-blink-features=AutomationControlled",
"--disable-infobars",
])
# User data directory

View File

@@ -11,6 +11,8 @@
import os
import asyncio
import socket
import httpx
from typing import Optional, Dict, Any
from playwright.async_api import Browser, BrowserContext, Playwright
@@ -23,72 +25,102 @@ class CDPBrowserManager:
"""
CDP browser manager, responsible for launching and managing browsers connected via CDP
"""
def __init__(self):
self.launcher = BrowserLauncher()
self.browser: Optional[Browser] = None
self.browser_context: Optional[BrowserContext] = None
self.debug_port: Optional[int] = None
async def launch_and_connect(self, playwright: Playwright,
playwright_proxy: Optional[Dict] = None,
user_agent: Optional[str] = None,
headless: bool = False) -> BrowserContext:
async def launch_and_connect(
self,
playwright: Playwright,
playwright_proxy: Optional[Dict] = None,
user_agent: Optional[str] = None,
headless: bool = False,
) -> BrowserContext:
"""
Launch the browser and connect via CDP
"""
try:
# 1. Detect the browser path
browser_path = await self._get_browser_path()
# 2. Find an available port
self.debug_port = self.launcher.find_available_port(config.CDP_DEBUG_PORT)
# 3. Launch the browser
await self._launch_browser(browser_path, headless)
# 4. Connect via CDP
await self._connect_via_cdp(playwright)
# 5. Create the browser context
browser_context = await self._create_browser_context(
playwright_proxy, user_agent
)
self.browser_context = browser_context
return browser_context
except Exception as e:
utils.logger.error(f"[CDPBrowserManager] CDP浏览器启动失败: {e}")
await self.cleanup()
raise
async def _get_browser_path(self) -> str:
"""
Get the browser path
"""
# Prefer a user-specified custom path
if config.CUSTOM_BROWSER_PATH and os.path.isfile(config.CUSTOM_BROWSER_PATH):
utils.logger.info(f"[CDPBrowserManager] 使用自定义浏览器路径: {config.CUSTOM_BROWSER_PATH}")
utils.logger.info(
f"[CDPBrowserManager] 使用自定义浏览器路径: {config.CUSTOM_BROWSER_PATH}"
)
return config.CUSTOM_BROWSER_PATH
# Auto-detect browser paths
browser_paths = self.launcher.detect_browser_paths()
if not browser_paths:
raise RuntimeError(
"未找到可用的浏览器。请确保已安装Chrome或Edge浏览器"
"或在配置文件中设置CUSTOM_BROWSER_PATH指定浏览器路径。"
)
browser_path = browser_paths[0] # use the first browser found
browser_name, browser_version = self.launcher.get_browser_info(browser_path)
utils.logger.info(f"[CDPBrowserManager] 检测到浏览器: {browser_name} ({browser_version})")
utils.logger.info(
f"[CDPBrowserManager] 检测到浏览器: {browser_name} ({browser_version})"
)
utils.logger.info(f"[CDPBrowserManager] 浏览器路径: {browser_path}")
return browser_path
async def _test_cdp_connection(self, debug_port: int) -> bool:
"""
Test whether the CDP connection is reachable
"""
try:
# Simple socket connection test
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
s.settimeout(5)
result = s.connect_ex(("localhost", debug_port))
if result == 0:
utils.logger.info(
f"[CDPBrowserManager] CDP端口 {debug_port} 可访问"
)
return True
else:
utils.logger.warning(
f"[CDPBrowserManager] CDP端口 {debug_port} 不可访问"
)
return False
except Exception as e:
utils.logger.warning(f"[CDPBrowserManager] CDP连接测试失败: {e}")
return False
async def _launch_browser(self, browser_path: str, headless: bool):
"""
Launch the browser process
@@ -97,55 +129,94 @@ class CDPBrowserManager:
user_data_dir = None
if config.SAVE_LOGIN_STATE:
user_data_dir = os.path.join(
os.getcwd(), "browser_data",
f"cdp_{config.USER_DATA_DIR % config.PLATFORM}"
os.getcwd(),
"browser_data",
f"cdp_{config.USER_DATA_DIR % config.PLATFORM}",
)
os.makedirs(user_data_dir, exist_ok=True)
utils.logger.info(f"[CDPBrowserManager] 用户数据目录: {user_data_dir}")
# Launch the browser
self.launcher.browser_process = self.launcher.launch_browser(
browser_path=browser_path,
debug_port=self.debug_port,
headless=headless,
user_data_dir=user_data_dir
user_data_dir=user_data_dir,
)
# Wait for the browser to become ready
if not self.launcher.wait_for_browser_ready(
self.debug_port, config.BROWSER_LAUNCH_TIMEOUT
):
raise RuntimeError(f"浏览器在 {config.BROWSER_LAUNCH_TIMEOUT} 秒内未能启动")
# Wait one extra second so the CDP service is fully up
await asyncio.sleep(1)
# Test the CDP connection
if not await self._test_cdp_connection(self.debug_port):
utils.logger.warning(
"[CDPBrowserManager] CDP连接测试失败但将继续尝试连接"
)
async def _get_browser_websocket_url(self, debug_port: int) -> str:
"""
Get the browser's WebSocket debugger URL
"""
try:
async with httpx.AsyncClient() as client:
response = await client.get(
f"http://localhost:{debug_port}/json/version", timeout=10
)
if response.status_code == 200:
data = response.json()
ws_url = data.get("webSocketDebuggerUrl")
if ws_url:
utils.logger.info(
f"[CDPBrowserManager] 获取到浏览器WebSocket URL: {ws_url}"
)
return ws_url
else:
raise RuntimeError("未找到webSocketDebuggerUrl")
else:
raise RuntimeError(f"HTTP {response.status_code}: {response.text}")
except Exception as e:
utils.logger.error(f"[CDPBrowserManager] 获取WebSocket URL失败: {e}")
raise
async def _connect_via_cdp(self, playwright: Playwright):
"""
Connect to the browser via CDP
"""
cdp_url = f"http://localhost:{self.debug_port}"
utils.logger.info(f"[CDPBrowserManager] 正在通过CDP连接到浏览器: {cdp_url}")
try:
# Get the correct WebSocket URL
ws_url = await self._get_browser_websocket_url(self.debug_port)
utils.logger.info(f"[CDPBrowserManager] 正在通过CDP连接到浏览器: {ws_url}")
# Connect using Playwright's connect_over_cdp
self.browser = await playwright.chromium.connect_over_cdp(cdp_url)
self.browser = await playwright.chromium.connect_over_cdp(ws_url)
if self.browser.is_connected():
utils.logger.info("[CDPBrowserManager] 成功连接到浏览器")
utils.logger.info(f"[CDPBrowserManager] 浏览器上下文数量: {len(self.browser.contexts)}")
utils.logger.info(
f"[CDPBrowserManager] 浏览器上下文数量: {len(self.browser.contexts)}"
)
else:
raise RuntimeError("CDP连接失败")
except Exception as e:
utils.logger.error(f"[CDPBrowserManager] CDP连接失败: {e}")
raise
async def _create_browser_context(self, playwright_proxy: Optional[Dict] = None,
user_agent: Optional[str] = None) -> BrowserContext:
async def _create_browser_context(
self, playwright_proxy: Optional[Dict] = None, user_agent: Optional[str] = None
) -> BrowserContext:
"""
Create or reuse a browser context
"""
if not self.browser:
raise RuntimeError("浏览器未连接")
# Reuse an existing context or create a new one
contexts = self.browser.contexts
@@ -159,24 +230,24 @@ class CDPBrowserManager:
"viewport": {"width": 1920, "height": 1080},
"accept_downloads": True,
}
# Set the user agent
if user_agent:
context_options["user_agent"] = user_agent
utils.logger.info(f"[CDPBrowserManager] 设置用户代理: {user_agent}")
# Note: proxy settings may not take effect in CDP mode because the browser is already running
if playwright_proxy:
utils.logger.warning(
"[CDPBrowserManager] 警告: CDP模式下代理设置可能不生效"
"建议在浏览器启动前配置系统代理或浏览器代理扩展"
)
browser_context = await self.browser.new_context(**context_options)
utils.logger.info("[CDPBrowserManager] 创建新的浏览器上下文")
return browser_context
async def add_stealth_script(self, script_path: str = "libs/stealth.min.js"):
"""
Add the anti-detection (stealth) script
@@ -184,10 +255,12 @@ class CDPBrowserManager:
if self.browser_context and os.path.exists(script_path):
try:
await self.browser_context.add_init_script(path=script_path)
utils.logger.info(f"[CDPBrowserManager] 已添加反检测脚本: {script_path}")
utils.logger.info(
f"[CDPBrowserManager] 已添加反检测脚本: {script_path}"
)
except Exception as e:
utils.logger.warning(f"[CDPBrowserManager] 添加反检测脚本失败: {e}")
async def add_cookies(self, cookies: list):
"""
Add cookies
@@ -198,7 +271,7 @@ class CDPBrowserManager:
utils.logger.info(f"[CDPBrowserManager] 已添加 {len(cookies)} 个Cookie")
except Exception as e:
utils.logger.warning(f"[CDPBrowserManager] 添加Cookie失败: {e}")
async def get_cookies(self) -> list:
"""
Get the current cookies
@@ -211,7 +284,7 @@ class CDPBrowserManager:
utils.logger.warning(f"[CDPBrowserManager] 获取Cookie失败: {e}")
return []
return []
async def cleanup(self):
"""
Clean up resources
@@ -222,35 +295,37 @@ class CDPBrowserManager:
await self.browser_context.close()
self.browser_context = None
utils.logger.info("[CDPBrowserManager] 浏览器上下文已关闭")
# Disconnect from the browser
if self.browser:
await self.browser.close()
self.browser = None
utils.logger.info("[CDPBrowserManager] 浏览器连接已断开")
# Close the browser process (if configured to auto-close)
if config.AUTO_CLOSE_BROWSER:
self.launcher.cleanup()
else:
utils.logger.info("[CDPBrowserManager] 浏览器进程保持运行AUTO_CLOSE_BROWSER=False")
utils.logger.info(
"[CDPBrowserManager] 浏览器进程保持运行AUTO_CLOSE_BROWSER=False"
)
except Exception as e:
utils.logger.error(f"[CDPBrowserManager] 清理资源时出错: {e}")
def is_connected(self) -> bool:
"""
Check whether the browser is connected
"""
return self.browser is not None and self.browser.is_connected()
async def get_browser_info(self) -> Dict[str, Any]:
"""
Get browser information
"""
if not self.browser:
return {}
try:
version = self.browser.version
contexts_count = len(self.browser.contexts)
@@ -259,7 +334,7 @@ class CDPBrowserManager:
"version": version,
"contexts_count": contexts_count,
"debug_port": self.debug_port,
"is_connected": self.is_connected()
"is_connected": self.is_connected(),
}
except Exception as e:
utils.logger.warning(f"[CDPBrowserManager] 获取浏览器信息失败: {e}")