fix: avoid request failures caused by oversized cross-domain Cookie headers when reusing a browser

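Connecting to an existing Chrome instance carries the cookies of the entire browser context into the platform client.
Apart from xhs, most platforms still read the full cookie set directly, which makes the request header too long and amplifies cross-domain contamination.
This change scopes each platform's cookie reads to that platform's own domain and adds basic regression tests.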

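Constraint: must keep reusing the platform login state from the user's real browser
Rejected: fix xhs only | other platforms would still carry an oversized Cookie header when connecting to an existing browser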
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: when adding platforms or changing the update_cookies and create client flows in the future, read cookies only by platform domain
Tested: uv run pytest test/test_utils.py; python3 -m compileall tools/crawler_util.py media_platform/douyin/core.py media_platform/douyin/client.py media_platform/kuaishou/core.py media_platform/kuaishou/client.py media_platform/bilibili/core.py media_platform/bilibili/client.py media_platform/zhihu/core.py media_platform/zhihu/client.py media_platform/tieba/core.py media_platform/tieba/client.py media_platform/xhs/core.py media_platform/xhs/client.py media_platform/weibo/core.py media_platform/weibo/client.py test/test_utils.py
Not-tested: end-to-end crawl flow of each platform over a real CDP browser connection
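
The domain-scoped read described in this message can be sketched as follows. This is a minimal illustration, not the project's actual code: the body of `convert_browser_context_cookies` is not shown in this excerpt, so the implementation below is an assumption that relies only on Playwright's `BrowserContext.cookies(urls)`, which returns the cookies that apply to the given URLs. `FakeBrowserContext` is a hypothetical stand-in that lets the sketch run without a browser, and its domain matching is deliberately simplified.

```python
import asyncio
from typing import Any, Dict, List, Optional, Tuple
from urllib.parse import urlparse

Cookie = Dict[str, Any]


def convert_cookies(cookies: List[Cookie]) -> Tuple[str, Dict[str, str]]:
    """Turn a cookie list into a Cookie header string and a name -> value dict."""
    cookie_dict = {c["name"]: c["value"] for c in cookies}
    cookie_str = "; ".join(f"{name}={value}" for name, value in cookie_dict.items())
    return cookie_str, cookie_dict


async def convert_browser_context_cookies(
    browser_context: Any, urls: Optional[List[str]] = None
) -> Tuple[str, Dict[str, str]]:
    """Assumed shape of the helper: read only the cookies that apply to `urls`."""
    if urls:
        cookies = await browser_context.cookies(urls)
    else:
        # No URLs given: falls back to every cookie in the context (the old behavior).
        cookies = await browser_context.cookies()
    return convert_cookies(cookies)


class FakeBrowserContext:
    """Hypothetical stand-in for a Playwright BrowserContext (demo only)."""

    def __init__(self, cookies: List[Cookie]) -> None:
        self._cookies = cookies

    async def cookies(self, urls: Optional[List[str]] = None) -> List[Cookie]:
        if not urls:
            return list(self._cookies)
        hosts = [urlparse(u).hostname or "" for u in urls]

        def applies(cookie: Cookie) -> bool:
            # Simplified match: host equals the cookie domain or is a subdomain of it.
            domain = cookie["domain"].lstrip(".")
            return any(h == domain or h.endswith("." + domain) for h in hosts)

        return [c for c in self._cookies if applies(c)]


if __name__ == "__main__":
    context = FakeBrowserContext(
        [
            {"name": "z_c0", "value": "abc", "domain": ".zhihu.com"},
            {"name": "SESSDATA", "value": "xyz", "domain": ".bilibili.com"},
        ]
    )
    header, as_dict = asyncio.run(
        convert_browser_context_cookies(context, urls=["https://www.zhihu.com"])
    )
    print(header, as_dict)
```

Passing `urls` keeps the user's real login cookies for the platform while leaving every other site's cookies out of the request header, which is the behavior the Directive asks future platforms to preserve.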
程序员阿江(Relakkes)
2026-04-21 13:49:37 +08:00
parent 15a20a7983
commit 0c5f281212
16 changed files with 155 additions and 43 deletions

@@ -57,6 +57,7 @@ class ZhihuCrawler(AbstractCrawler):
def __init__(self) -> None:
self.index_url = "https://www.zhihu.com"
+self.cookie_urls = [self.index_url]
# self.user_agent = utils.get_user_agent()
self.user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
self._extractor = ZhihuExtractor()
@@ -114,7 +115,8 @@ class ZhihuCrawler(AbstractCrawler):
)
await login_obj.begin()
await self.zhihu_client.update_cookies(
-browser_context=self.browser_context
+browser_context=self.browser_context,
+urls=self.cookie_urls,
)
# Zhihu's search API requires opening the search page first to access cookies, homepage alone won't work
@@ -125,7 +127,10 @@ class ZhihuCrawler(AbstractCrawler):
f"{self.index_url}/search?q=python&search_source=Guess&utm_content=search_hot&type=content"
)
await asyncio.sleep(5)
-await self.zhihu_client.update_cookies(browser_context=self.browser_context)
+await self.zhihu_client.update_cookies(
+browser_context=self.browser_context,
+urls=self.cookie_urls,
+)
crawler_type_var.set(config.CRAWLER_TYPE)
if config.CRAWLER_TYPE == "search":
@@ -393,8 +398,9 @@ class ZhihuCrawler(AbstractCrawler):
utils.logger.info(
"[ZhihuCrawler.create_zhihu_client] Begin create zhihu API client ..."
)
-cookie_str, cookie_dict = utils.convert_cookies(
-await self.browser_context.cookies()
+cookie_str, cookie_dict = await utils.convert_browser_context_cookies(
+self.browser_context,
+urls=self.cookie_urls,
)
zhihu_client_obj = ZhiHuClient(
proxy=httpx_proxy,