Files
MediaCrawler/api/schemas/crawler.py
程序员阿江(Relakkes) 8e93438fe5 Keep PR 900 overrides bounded and opt-in
The PR adds API limit overrides and static proxy support, but the review found that the default proxy provider changed to an invalid static placeholder and the new API fields accepted unbounded values. This keeps the existing proxy default intact, makes static proxy explicit via config or CLI, validates API limit ranges, and adds focused regression coverage for both paths.

Constraint: PR branch must remain contributor-branch compatible and avoid adding dependencies

Rejected: Keep static as the default provider | breaks existing --enable_ip_proxy defaults with an invalid placeholder URL

Rejected: Accept arbitrary integer limits | lets API callers request negative or excessive crawl sizes

Confidence: high

Scope-risk: narrow

Directive: Do not change proxy provider defaults when adding new providers; new providers should be opt-in and covered by provider-specific tests

Tested: uv run pytest tests/test_api_limits.py tests/test_static_proxy_provider.py

Tested: uv run pytest tests

Tested: uv run pytest test/test_utils.py

Tested: uv run python -m compileall api cmd_arg config proxy tests

Tested: git diff --cached --check

Not-tested: Live crawler run against external platforms or real proxy vendor endpoints
2026-05-29 21:27:52 +08:00

105 lines
3.0 KiB
Python

# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler/blob/main/api/schemas/crawler.py
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# 声明:本代码仅供学习和研究目的使用。使用者应遵守以下原则:
# 1. 不得用于任何商业用途。
# 2. 使用时应遵守目标平台的使用条款和robots.txt规则。
# 3. 不得进行大规模爬取或对平台造成运营干扰。
# 4. 应合理控制请求频率,避免给目标平台带来不必要的负担。
# 5. 不得用于任何非法或不当的用途。
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
from enum import Enum
from typing import Optional, Literal
from pydantic import BaseModel, Field
MAX_API_LIMIT_COUNT = 10000
class PlatformEnum(str, Enum):
"""Supported media platforms"""
XHS = "xhs"
DOUYIN = "dy"
KUAISHOU = "ks"
BILIBILI = "bili"
WEIBO = "wb"
TIEBA = "tieba"
ZHIHU = "zhihu"
class LoginTypeEnum(str, Enum):
"""Login method"""
QRCODE = "qrcode"
PHONE = "phone"
COOKIE = "cookie"
class CrawlerTypeEnum(str, Enum):
"""Crawler type"""
SEARCH = "search"
DETAIL = "detail"
CREATOR = "creator"
class SaveDataOptionEnum(str, Enum):
"""Data save option"""
CSV = "csv"
DB = "db"
JSON = "json"
JSONL = "jsonl"
SQLITE = "sqlite"
MONGODB = "mongodb"
EXCEL = "excel"
class CrawlerStartRequest(BaseModel):
"""Crawler start request"""
platform: PlatformEnum
login_type: LoginTypeEnum = LoginTypeEnum.QRCODE
crawler_type: CrawlerTypeEnum = CrawlerTypeEnum.SEARCH
keywords: str = "" # Keywords for search mode
specified_ids: str = "" # Post/video ID list for detail mode, comma-separated
creator_ids: str = "" # Creator ID list for creator mode, comma-separated
start_page: int = 1
enable_comments: bool = True
enable_sub_comments: bool = False
save_option: SaveDataOptionEnum = SaveDataOptionEnum.JSONL
cookies: str = ""
headless: bool = False
max_notes_count: Optional[int] = Field(default=None, ge=1, le=MAX_API_LIMIT_COUNT)
max_comments_count: Optional[int] = Field(default=None, ge=1, le=MAX_API_LIMIT_COUNT)
class CrawlerStatusResponse(BaseModel):
"""Crawler status response"""
status: Literal["idle", "running", "stopping", "error"]
platform: Optional[str] = None
crawler_type: Optional[str] = None
started_at: Optional[str] = None
error_message: Optional[str] = None
class LogEntry(BaseModel):
"""Log entry"""
id: int
timestamp: str
level: Literal["info", "warning", "error", "success", "debug"]
message: str
class DataFileInfo(BaseModel):
"""Data file information"""
name: str
path: str
size: int
modified_at: str
record_count: Optional[int] = None