Excel Export Guide

Overview

MediaCrawler now supports exporting crawled data to formatted Excel files (.xlsx) with professional styling and multiple sheets for contents, comments, and creators.

Features

  • Multi-sheet workbooks: Separate sheets for Contents, Comments, and Creators
  • Professional formatting:
    • Styled headers with blue background and white text
    • Auto-adjusted column widths
    • Cell borders and text wrapping
    • Clean, readable layout
  • Smart export: Empty sheets are automatically removed
  • Organized storage: Files are saved to the data/{platform}/ directory with timestamped names
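The empty-sheet cleanup can be illustrated with plain openpyxl. This is only a sketch of the idea, not MediaCrawler's actual implementation: a sheet that never received a data row is dropped before saving.

```python
from openpyxl import Workbook

wb = Workbook()
for name in ("Contents", "Comments", "Creators"):
    wb.create_sheet(name)
del wb["Sheet"]  # drop openpyxl's default sheet

# Only the Contents sheet receives data in this example
wb["Contents"].append(["note_id", "title"])
wb["Contents"].append(["123", "Test Post"])

# "Smart export": drop sheets with at most one row
# (openpyxl reports max_row == 1 even for a fully empty sheet)
for name in list(wb.sheetnames):
    if wb[name].max_row <= 1:
        del wb[name]

print(wb.sheetnames)  # → ['Contents']
```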

Installation

Excel export requires the openpyxl library:

```shell
# Using uv (recommended)
uv sync

# Or using pip
pip install openpyxl
```
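A quick way to confirm the dependency resolved correctly before running a crawl:

```python
# Confirm openpyxl is importable; Excel export will fail without it
import openpyxl

print(openpyxl.__version__)
```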

Usage

Basic Usage

  1. Configure Excel export in config/base_config.py:

    ```python
    SAVE_DATA_OPTION = "excel"  # Change from json/csv/db to excel
    ```

  2. Run the crawler:

    ```shell
    # Xiaohongshu example
    uv run main.py --platform xhs --lt qrcode --type search

    # Douyin example
    uv run main.py --platform dy --lt qrcode --type search

    # Bilibili example
    uv run main.py --platform bili --lt qrcode --type search
    ```

  3. Find your Excel file in the data/{platform}/ directory:
    • Filename format: {platform}_{crawler_type}_{timestamp}.xlsx
    • Example: xhs_search_20250128_143025.xlsx
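The timestamped name can be reproduced with a small helper. This is a sketch of the documented pattern, not the project's actual code:

```python
from datetime import datetime

def excel_filename(platform: str, crawler_type: str, now: datetime) -> str:
    # {platform}_{crawler_type}_{timestamp}.xlsx, as documented above
    return f"{platform}_{crawler_type}_{now.strftime('%Y%m%d_%H%M%S')}.xlsx"

print(excel_filename("xhs", "search", datetime(2025, 1, 28, 14, 30, 25)))
# → xhs_search_20250128_143025.xlsx
```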

Command Line Examples

```shell
# Search by keywords and export to Excel
uv run main.py --platform xhs --lt qrcode --type search --save_data_option excel

# Crawl specific posts and export to Excel
uv run main.py --platform xhs --lt qrcode --type detail --save_data_option excel

# Crawl creator profile and export to Excel
uv run main.py --platform xhs --lt qrcode --type creator --save_data_option excel
```

Excel File Structure

Contents Sheet

Contains post/video information:

  • note_id: Unique post identifier
  • title: Post title
  • desc: Post description
  • user_id: Author user ID
  • nickname: Author nickname
  • liked_count: Number of likes
  • comment_count: Number of comments
  • share_count: Number of shares
  • ip_location: IP location
  • image_list: Comma-separated image URLs
  • tag_list: Comma-separated tags
  • note_url: Direct link to post
  • And more platform-specific fields...

Comments Sheet

Contains comment information:

  • comment_id: Unique comment identifier
  • note_id: Associated post ID
  • content: Comment text
  • user_id: Commenter user ID
  • nickname: Commenter nickname
  • like_count: Comment likes
  • create_time: Comment timestamp
  • ip_location: Commenter location
  • sub_comment_count: Number of replies
  • And more...

Creators Sheet

Contains creator/author information:

  • user_id: Unique user identifier
  • nickname: Display name
  • gender: Gender
  • avatar: Profile picture URL
  • desc: Bio/description
  • fans: Follower count
  • follows: Following count
  • interaction: Total interactions
  • And more...
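If pandas is installed alongside openpyxl, all sheets can be pulled into DataFrames in one call. The file name and values below are made up for illustration; real exports live under data/{platform}/:

```python
import pandas as pd

# Build a tiny stand-in for a MediaCrawler export (two of the three sheets)
with pd.ExcelWriter("demo_export.xlsx") as writer:
    pd.DataFrame(
        {"note_id": ["1", "2"], "title": ["A", "B"], "liked_count": [10, 250]}
    ).to_excel(writer, sheet_name="Contents", index=False)
    pd.DataFrame(
        {"comment_id": ["c1"], "note_id": ["1"], "content": ["Nice"]}
    ).to_excel(writer, sheet_name="Comments", index=False)

# sheet_name=None loads every sheet into a {name: DataFrame} dict
sheets = pd.read_excel("demo_export.xlsx", sheet_name=None)
top = sheets["Contents"].nlargest(1, "liked_count")
print(str(top["title"].iloc[0]))  # → B
```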

Advantages Over Other Formats

vs CSV

  • Multiple sheets in one file
  • Professional formatting
  • Better handling of special characters
  • Auto-adjusted column widths
  • No encoding issues

vs JSON

  • Human-readable tabular format
  • Easy to open in Excel/Google Sheets
  • Better for data analysis
  • Easier to share with non-technical users

vs Database

  • No database setup required
  • Portable single-file format
  • Easy to share and archive
  • Works offline

Tips & Best Practices

  1. Large datasets: For very large crawls (>10,000 rows), consider using database storage instead for better performance

  2. Data analysis: Excel files work great with:

    • Microsoft Excel
    • Google Sheets
    • LibreOffice Calc
    • Python pandas: pd.read_excel('file.xlsx')
  3. Combining data: You can merge multiple Excel files using:

    ```python
    import pandas as pd

    df1 = pd.read_excel('file1.xlsx', sheet_name='Contents')
    df2 = pd.read_excel('file2.xlsx', sheet_name='Contents')
    combined = pd.concat([df1, df2])
    combined.to_excel('combined.xlsx', index=False)
    ```

  4. File size: Excel files are typically 2-3x larger than CSV but smaller than JSON

Troubleshooting

"openpyxl not installed" error

```shell
# Install openpyxl
uv add openpyxl
# or
pip install openpyxl
```

Excel file not created

Check that:

  1. SAVE_DATA_OPTION = "excel" in config
  2. Crawler successfully collected data
  3. No errors in console output
  4. data/{platform}/ directory exists

Empty Excel file

This happens when:

  • No data was crawled (check keywords/IDs)
  • Login failed (check login status)
  • Platform blocked requests (check IP/rate limits)

Example Output

After running a successful crawl, you'll see:

```
[ExcelStoreBase] Initialized Excel export to: data/xhs/xhs_search_20250128_143025.xlsx
[ExcelStoreBase] Stored content to Excel: 7123456789
[ExcelStoreBase] Stored comment to Excel: comment_123
...
[Main] Excel file saved successfully
```

Your Excel file will have:

  • Professional blue headers
  • Clean borders
  • Wrapped text for long content
  • Auto-sized columns
  • Separate organized sheets
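The styling itself is ordinary openpyxl. The sketch below reproduces the documented look on a throwaway workbook; the hex color is an assumption, not the project's exact palette:

```python
from openpyxl import Workbook
from openpyxl.styles import Alignment, Border, Font, PatternFill, Side

wb = Workbook()
ws = wb.active
ws.title = "Contents"
ws.append(["note_id", "title", "liked_count"])
ws.append(["123", "A long title that benefits from text wrapping", 100])

# Blue header with white bold text (4472C4 is an assumed, Excel-like blue)
header_fill = PatternFill(start_color="4472C4", end_color="4472C4", fill_type="solid")
header_font = Font(color="FFFFFF", bold=True)
for cell in ws[1]:
    cell.fill = header_fill
    cell.font = header_font

# Thin borders and wrapped text on every cell
thin = Side(style="thin")
border = Border(left=thin, right=thin, top=thin, bottom=thin)
for row in ws.iter_rows():
    for cell in row:
        cell.border = border
        cell.alignment = Alignment(wrap_text=True)

# Auto-size each column from its longest value
for column_cells in ws.columns:
    width = max(len(str(cell.value)) for cell in column_cells) + 2
    ws.column_dimensions[column_cells[0].column_letter].width = width

wb.save("demo_styled.xlsx")
```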

Advanced Usage

Programmatic Access

```python
import asyncio

from store.excel_store_base import ExcelStoreBase

async def main():
    # Create a store for the target platform and crawler type
    store = ExcelStoreBase(platform="xhs", crawler_type="search")

    # store_content is a coroutine, so it must be awaited
    await store.store_content({
        "note_id": "123",
        "title": "Test Post",
        "liked_count": 100,
    })

    # Write the workbook to disk
    store.flush()

asyncio.run(main())
```

Custom Formatting

You can extend ExcelStoreBase to customize formatting:

```python
from store.excel_store_base import ExcelStoreBase

class CustomExcelStore(ExcelStoreBase):
    def _apply_header_style(self, sheet, row_num=1):
        # Keep the default styling, then layer customizations on top
        super()._apply_header_style(sheet, row_num)
        # Add your customizations here
```

Support

For issues or questions:

  • Check the FAQ (常见问题)
  • Open an issue on GitHub
  • Join the WeChat discussion group

Note: Excel export is designed for learning and research purposes. Please respect platform terms of service and rate limits.