jiazhizhong/MediaCrawler

Fork 0

mirror of https://github.com/NanmiCoder/MediaCrawler.git synced 2026-03-07 14:40:48 +08:00

Files

程序员阿江(Relakkes) 848df2b491 feat: other platfrom support the cdp mode

2025-07-03 17:13:32 +08:00

4.9 KiB

Raw Blame History

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

MediaCrawler is a multi-platform social media data collection tool supporting platforms like Xiaohongshu (Little Red Book), Douyin (TikTok), Kuaishou, Bilibili, Weibo, Tieba, and Zhihu. The project uses Playwright for browser automation and maintains login states to crawl public information without needing JS reverse engineering.

Development Environment Setup

Prerequisites

Python: >= 3.9 (verified with 3.9.6)
Node.js: >= 16.0.0 (required for Douyin and Zhihu crawlers)
uv: Modern Python package manager (recommended)

Installation Commands

# Using uv (recommended)
uv sync
uv run playwright install

# Using traditional pip (fallback)
pip install -r requirements.txt
playwright install

Running the Application

# Basic crawling command
uv run main.py --platform xhs --lt qrcode --type search

# View all available options
uv run main.py --help

# Using traditional Python
python main.py --platform xhs --lt qrcode --type search

Architecture Overview

Core Components

Platform Crawlers (media_platform/):
- Each platform has its own crawler implementation
- Follows abstract base class pattern (base/base_crawler.py)
- Platforms: xhs, dy, ks, bili, wb, tieba, zhihu
Configuration System (config/):
- base_config.py: Main configuration file with extensive options
- db_config.py: Database configuration
- Key settings: login types, proxy settings, CDP mode, data storage options
Data Storage (store/):
- Multiple storage backends: CSV, JSON, MySQL
- Platform-specific storage implementations
- Image download capabilities
Caching System (cache/):
- Local cache and Redis cache implementations
- Factory pattern for cache selection
Proxy Support (proxy/):
- IP proxy pool management
- Multiple proxy provider support (Kuaidaili, Jishu)
Browser Automation (tools/):
- Playwright browser launcher
- CDP (Chrome DevTools Protocol) support
- Slider validation utilities

Key Configuration Options

PLATFORM: Target platform (xhs, dy, ks, bili, wb, tieba, zhihu)
KEYWORDS: Search keywords (comma-separated)
CRAWLER_TYPE: Type of crawling (search, detail, creator)
ENABLE_CDP_MODE: Use Chrome DevTools Protocol for better anti-detection
SAVE_DATA_OPTION: Data storage format (csv, db, json)
ENABLE_GET_COMMENTS: Enable comment crawling
ENABLE_IP_PROXY: Enable proxy IP rotation

Testing

Available Test Commands

# Run all tests
python -m unittest discover test

# Run specific test files
python -m unittest test.test_expiring_local_cache
python -m unittest test.test_proxy_ip_pool
python -m unittest test.test_redis_cache
python -m unittest test.test_utils

# Install and use pytest (enhanced testing)
uv add pytest
uv run pytest test/

Test Coverage

Cache functionality tests
Proxy IP pool tests
Utility function tests
Redis cache tests (requires Redis server)

Database Setup

MySQL Database Initialization

# Initialize database tables (first time only)
python db.py

# Or with uv
uv run db.py

Supported Storage Options

MySQL: Full relational database with deduplication
CSV: Simple file-based storage in data/ directory
JSON: Structured file-based storage in data/ directory

Common Development Tasks

Adding New Platform Support

Create new directory in media_platform/
Implement crawler class inheriting from AbstractCrawler
Add platform-specific client, core, field, and login modules
Update CrawlerFactory in main.py
Add storage implementation in store/

Debugging CDP Mode

Set ENABLE_CDP_MODE = True in config
Use CDP_HEADLESS = False for visual debugging
Check browser console for CDP connection issues

Login states are cached in browser_data/ directory
Platform-specific user data directories maintain session cookies
Set SAVE_LOGIN_STATE = True to preserve login across runs

Platform-Specific Notes

Xiaohongshu (XHS)

Supports search, detail, and creator crawling
Requires xsec_token and xsec_source parameters for specific note URLs
Custom User-Agent configuration available

Douyin (DY)

Requires Node.js environment
Supports publish time filtering
Has specific creator ID format (sec_id)

Bilibili (BILI)

Supports date range filtering with START_DAY and END_DAY
Can crawl creator fans/following lists
Uses BV video ID format

Legal and Usage Notes

This project is for educational and research purposes only. Users must:

Comply with platform terms of service
Follow robots.txt rules
Control request frequency appropriately
Not use for commercial purposes
Respect platform rate limits

The project includes comprehensive legal disclaimers and usage guidelines in the README.md file.

4.9 KiB Raw Blame History