mirror of
https://github.com/NanmiCoder/MediaCrawler.git
synced 2026-05-10 20:47:39 +08:00
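Connecting to an existing Chrome instance carries the cookies of the entire browser context into the platform client. Apart from xhs, most platforms still read the full cookie set directly, which makes request headers overly long and amplifies cross-domain pollution. This change restricts cookie reads on every platform to that platform's domain and adds basic regression tests.
Constraint: Must keep reusing the platform login state from the user's real browser
Rejected: Fix xhs only | Other platforms would still send an oversized Cookie header when connecting to an existing browser
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: When adding platforms or changing the update_cookies and create client flows, read cookies by platform domain only
Tested: uv run pytest test/test_utils.py; python3 -m compileall tools/crawler_util.py media_platform/douyin/core.py media_platform/douyin/client.py media_platform/kuaishou/core.py media_platform/kuaishou/client.py media_platform/bilibili/core.py media_platform/bilibili/client.py media_platform/zhihu/core.py media_platform/zhihu/client.py media_platform/tieba/core.py media_platform/tieba/client.py media_platform/xhs/core.py media_platform/xhs/client.py media_platform/weibo/core.py media_platform/weibo/client.py test/test_utils.py
Not-tested: End-to-end crawling flows for each platform over a real CDP browser connection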
50 lines
1.7 KiB
Python
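# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/test/test_utils.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#

# Notice: This code is provided for learning and research purposes only. Users must observe the following principles:
# 1. It must not be used for any commercial purpose.
# 2. Use must comply with the target platform's terms of service and robots.txt rules.
# 3. It must not be used for large-scale crawling or to disrupt platform operations.
# 4. Request frequency should be kept reasonable to avoid placing unnecessary load on the target platform.
# 5. It must not be used for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# By using this code you agree to abide by the principles above and all terms in the LICENSE.
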
# -*- coding: utf-8 -*-

from unittest.mock import AsyncMock

import pytest

from tools import utils


def test_convert_cookies():
    """A cookie header string is parsed into a name -> value dict."""
    xhs_cookies = "a1=x000101360; webId=1190c4d3cxxxx125xxx; "
    cookie_dict = utils.convert_str_cookie_to_dict(xhs_cookies)
    assert cookie_dict.get("webId") == "1190c4d3cxxxx125xxx"
    assert cookie_dict.get("a1") == "x000101360"


@pytest.mark.asyncio
async def test_convert_browser_context_cookies_uses_url_filter():
    """The urls argument is forwarded so only the platform's cookies are read."""
    # Stand-in for a browser context; cookies() is awaitable.
    browser_context = AsyncMock()
    browser_context.cookies.return_value = [{"name": "sessionid", "value": "abc"}]

    cookie_str, cookie_dict = await utils.convert_browser_context_cookies(
        browser_context,
        urls=["https://www.douyin.com"],
    )

    # The helper must query cookies exactly once, scoped to the given URL.
    browser_context.cookies.assert_awaited_once_with(urls=["https://www.douyin.com"])
    assert cookie_str == "sessionid=abc"
    assert cookie_dict == {"sessionid": "abc"}
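For context, the behaviour the second test pins down can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `convert_browser_context_cookies_sketch` and the `demo` helper are hypothetical, and the real helper in `tools/crawler_util.py` may differ in signature and details. The only assumption about the browser context is that it exposes an awaitable `cookies(urls=...)` method, as the mock in the test does.

```python
import asyncio
from typing import Dict, List, Optional, Tuple
from unittest.mock import AsyncMock


async def convert_browser_context_cookies_sketch(
    browser_context,
    urls: Optional[List[str]] = None,
) -> Tuple[str, Dict[str, str]]:
    """Hypothetical stand-in for utils.convert_browser_context_cookies."""
    if urls:
        # Ask the context only for cookies that apply to the platform URLs,
        # so cookies from unrelated sites never reach the request header.
        cookies = await browser_context.cookies(urls=urls)
    else:
        # No filter given: fall back to every cookie in the context.
        cookies = await browser_context.cookies()

    cookie_str = "; ".join(f"{c['name']}={c['value']}" for c in cookies)
    cookie_dict = {c["name"]: c["value"] for c in cookies}
    return cookie_str, cookie_dict


def demo() -> Tuple[str, Dict[str, str]]:
    # Mocked context, mirroring the approach used in the test above.
    ctx = AsyncMock()
    ctx.cookies.return_value = [
        {"name": "sessionid", "value": "abc"},
        {"name": "ttwid", "value": "1"},
    ]
    return asyncio.run(
        convert_browser_context_cookies_sketch(ctx, urls=["https://www.douyin.com"])
    )
```

Passing the filter down to the context, rather than fetching everything and filtering afterwards, keeps the domain-matching logic in the browser layer, which is what the regression test asserts through `assert_awaited_once_with`.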