Files
MediaCrawler/docs/excel_export_guide.md
程序员阿江(Relakkes) 0282e626c9 feat: 新增 JSONL 存储格式支持,默认存储格式改为 jsonl
JSONL(JSON Lines)每行一个 JSON 对象,采用 append 模式写入,
无需读取已有数据,大数据量下性能远优于 JSON 格式。

- 新增 AsyncFileWriter.write_to_jsonl() 核心方法
- 7 个平台新增 JsonlStoreImplement 类并注册到工厂
- 配置默认值从 json 改为 jsonl,CLI/API 枚举同步更新
- db_session.py 守卫条件加入 jsonl,避免误触 ValueError
- 词云生成支持读取 JSONL 文件,优先 jsonl 回退 json
- 原有 json 选项完全保留,向后兼容
- 更新相关文档和测试
2026-03-03 23:31:07 +08:00

6.0 KiB

Excel Export Guide

Overview

MediaCrawler now supports exporting crawled data to formatted Excel files (.xlsx) with professional styling and multiple sheets for contents, comments, and creators.

Features

  • Multi-sheet workbooks: Separate sheets for Contents, Comments, and Creators
  • Professional formatting:
    • Styled headers with blue background and white text
    • Auto-adjusted column widths
    • Cell borders and text wrapping
    • Clean, readable layout
  • Smart export: Empty sheets are automatically removed
  • Organized storage: Files saved to data/{platform}/ directory with timestamps

Installation

Excel export requires the openpyxl library:

# Using uv (recommended)
uv sync

# Or using pip
pip install openpyxl

Usage

Basic Usage

  1. Configure Excel export in config/base_config.py:
SAVE_DATA_OPTION = "excel"  # Change from jsonl/json/csv/db to excel
  1. Run the crawler:
# Xiaohongshu example
uv run main.py --platform xhs --lt qrcode --type search

# Douyin example
uv run main.py --platform dy --lt qrcode --type search

# Bilibili example
uv run main.py --platform bili --lt qrcode --type search
  1. Find your Excel file in data/{platform}/ directory:
    • Filename format: {platform}_{crawler_type}_{timestamp}.xlsx
    • Example: xhs_search_20250128_143025.xlsx

Command Line Examples

# Search by keywords and export to Excel
uv run main.py --platform xhs --lt qrcode --type search --save_data_option excel

# Crawl specific posts and export to Excel
uv run main.py --platform xhs --lt qrcode --type detail --save_data_option excel

# Crawl creator profile and export to Excel
uv run main.py --platform xhs --lt qrcode --type creator --save_data_option excel

Excel File Structure

Contents Sheet

Contains post/video information:

  • note_id: Unique post identifier
  • title: Post title
  • desc: Post description
  • user_id: Author user ID
  • nickname: Author nickname
  • liked_count: Number of likes
  • comment_count: Number of comments
  • share_count: Number of shares
  • ip_location: IP location
  • image_list: Comma-separated image URLs
  • tag_list: Comma-separated tags
  • note_url: Direct link to post
  • And more platform-specific fields...

Comments Sheet

Contains comment information:

  • comment_id: Unique comment identifier
  • note_id: Associated post ID
  • content: Comment text
  • user_id: Commenter user ID
  • nickname: Commenter nickname
  • like_count: Comment likes
  • create_time: Comment timestamp
  • ip_location: Commenter location
  • sub_comment_count: Number of replies
  • And more...

Creators Sheet

Contains creator/author information:

  • user_id: Unique user identifier
  • nickname: Display name
  • gender: Gender
  • avatar: Profile picture URL
  • desc: Bio/description
  • fans: Follower count
  • follows: Following count
  • interaction: Total interactions
  • And more...

Advantages Over Other Formats

vs CSV

  • Multiple sheets in one file
  • Professional formatting
  • Better handling of special characters
  • Auto-adjusted column widths
  • No encoding issues

vs JSON

  • Human-readable tabular format
  • Easy to open in Excel/Google Sheets
  • Better for data analysis
  • Easier to share with non-technical users

vs Database

  • No database setup required
  • Portable single-file format
  • Easy to share and archive
  • Works offline

Tips & Best Practices

  1. Large datasets: For very large crawls (>10,000 rows), consider using database storage instead for better performance

  2. Data analysis: Excel files work great with:

    • Microsoft Excel
    • Google Sheets
    • LibreOffice Calc
    • Python pandas: pd.read_excel('file.xlsx')
  3. Combining data: You can merge multiple Excel files using:

    import pandas as pd
    df1 = pd.read_excel('file1.xlsx', sheet_name='Contents')
    df2 = pd.read_excel('file2.xlsx', sheet_name='Contents')
    combined = pd.concat([df1, df2])
    combined.to_excel('combined.xlsx', index=False)
    
  4. File size: Excel files are typically 2-3x larger than CSV but smaller than JSON

Troubleshooting

"openpyxl not installed" error

# Install openpyxl
uv add openpyxl
# or
pip install openpyxl

Excel file not created

Check that:

  1. SAVE_DATA_OPTION = "excel" in config
  2. Crawler successfully collected data
  3. No errors in console output
  4. data/{platform}/ directory exists

Empty Excel file

This happens when:

  • No data was crawled (check keywords/IDs)
  • Login failed (check login status)
  • Platform blocked requests (check IP/rate limits)

Example Output

After running a successful crawl, you'll see:

[ExcelStoreBase] Initialized Excel export to: data/xhs/xhs_search_20250128_143025.xlsx
[ExcelStoreBase] Stored content to Excel: 7123456789
[ExcelStoreBase] Stored comment to Excel: comment_123
...
[Main] Excel file saved successfully

Your Excel file will have:

  • Professional blue headers
  • Clean borders
  • Wrapped text for long content
  • Auto-sized columns
  • Separate organized sheets

Advanced Usage

Programmatic Access

from store.excel_store_base import ExcelStoreBase

# Create store
store = ExcelStoreBase(platform="xhs", crawler_type="search")

# Store data
await store.store_content({
    "note_id": "123",
    "title": "Test Post",
    "liked_count": 100
})

# Save to file
store.flush()

Custom Formatting

You can extend ExcelStoreBase to customize formatting:

from store.excel_store_base import ExcelStoreBase

class CustomExcelStore(ExcelStoreBase):
    def _apply_header_style(self, sheet, row_num=1):
        # Custom header styling
        super()._apply_header_style(sheet, row_num)
        # Add your customizations here

Support

For issues or questions:

  • Check 常见问题
  • Open an issue on GitHub
  • Join the WeChat discussion group

Note: Excel export is designed for learning and research purposes. Please respect platform terms of service and rate limits.