Excel Export Guide

Overview

MediaCrawler now supports exporting crawled data to formatted Excel files (.xlsx) with professional styling and multiple sheets for contents, comments, and creators.

Features

  • Multi-sheet workbooks: Separate sheets for Contents, Comments, and Creators
  • Professional formatting:
    • Styled headers with blue background and white text
    • Auto-adjusted column widths
    • Cell borders and text wrapping
    • Clean, readable layout
  • Smart export: Empty sheets are automatically removed
  • Organized storage: Files are saved to the data/{platform}/ directory with timestamped names
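The empty-sheet cleanup can be illustrated with plain openpyxl. This is only a sketch of the idea, not MediaCrawler's actual implementation: a sheet that never received a data row is dropped before saving.

```python
from openpyxl import Workbook

wb = Workbook()
for name in ("Contents", "Comments", "Creators"):
    wb.create_sheet(name)
del wb["Sheet"]  # drop openpyxl's default sheet

# Only the Contents sheet receives data in this example
wb["Contents"].append(["note_id", "title"])
wb["Contents"].append(["123", "Test Post"])

# "Smart export": drop sheets with at most one row
# (openpyxl reports max_row == 1 even for a fully empty sheet)
for name in list(wb.sheetnames):
    if wb[name].max_row <= 1:
        del wb[name]

print(wb.sheetnames)  # → ['Contents']
```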

Installation

Excel export requires the openpyxl library:

```shell
# Using uv (recommended)
uv sync

# Or using pip
pip install openpyxl
```
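A quick way to confirm the dependency resolved correctly before running a crawl:

```python
# Confirm openpyxl is importable; Excel export will fail without it
import openpyxl

print(openpyxl.__version__)
```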

Usage

Basic Usage

  1. Configure Excel export in config/base_config.py:

    ```python
    SAVE_DATA_OPTION = "excel"  # Change from json/csv/db to excel
    ```

  2. Run the crawler:

    ```shell
    # Xiaohongshu example
    uv run main.py --platform xhs --lt qrcode --type search

    # Douyin example
    uv run main.py --platform dy --lt qrcode --type search

    # Bilibili example
    uv run main.py --platform bili --lt qrcode --type search
    ```

  3. Find your Excel file in the data/{platform}/ directory:
    • Filename format: {platform}_{crawler_type}_{timestamp}.xlsx
    • Example: xhs_search_20250128_143025.xlsx
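The timestamped name can be reproduced with a small helper. This is a sketch of the documented pattern, not the project's actual code:

```python
from datetime import datetime

def excel_filename(platform: str, crawler_type: str, now: datetime) -> str:
    # {platform}_{crawler_type}_{timestamp}.xlsx, as documented above
    return f"{platform}_{crawler_type}_{now.strftime('%Y%m%d_%H%M%S')}.xlsx"

print(excel_filename("xhs", "search", datetime(2025, 1, 28, 14, 30, 25)))
# → xhs_search_20250128_143025.xlsx
```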

Command Line Examples

```shell
# Search by keywords and export to Excel
uv run main.py --platform xhs --lt qrcode --type search --save_data_option excel

# Crawl specific posts and export to Excel
uv run main.py --platform xhs --lt qrcode --type detail --save_data_option excel

# Crawl creator profile and export to Excel
uv run main.py --platform xhs --lt qrcode --type creator --save_data_option excel
```

Excel File Structure

Contents Sheet

Contains post/video information:

  • note_id: Unique post identifier
  • title: Post title
  • desc: Post description
  • user_id: Author user ID
  • nickname: Author nickname
  • liked_count: Number of likes
  • comment_count: Number of comments
  • share_count: Number of shares
  • ip_location: IP location
  • image_list: Comma-separated image URLs
  • tag_list: Comma-separated tags
  • note_url: Direct link to post
  • And more platform-specific fields...

Comments Sheet

Contains comment information:

  • comment_id: Unique comment identifier
  • note_id: Associated post ID
  • content: Comment text
  • user_id: Commenter user ID
  • nickname: Commenter nickname
  • like_count: Comment likes
  • create_time: Comment timestamp
  • ip_location: Commenter location
  • sub_comment_count: Number of replies
  • And more...

Creators Sheet

Contains creator/author information:

  • user_id: Unique user identifier
  • nickname: Display name
  • gender: Gender
  • avatar: Profile picture URL
  • desc: Bio/description
  • fans: Follower count
  • follows: Following count
  • interaction: Total interactions
  • And more...
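If pandas is installed alongside openpyxl, all sheets can be pulled into DataFrames in one call. The file name and values below are made up for illustration; real exports live under data/{platform}/:

```python
import pandas as pd

# Build a tiny stand-in for a MediaCrawler export (two of the three sheets)
with pd.ExcelWriter("demo_export.xlsx") as writer:
    pd.DataFrame(
        {"note_id": ["1", "2"], "title": ["A", "B"], "liked_count": [10, 250]}
    ).to_excel(writer, sheet_name="Contents", index=False)
    pd.DataFrame(
        {"comment_id": ["c1"], "note_id": ["1"], "content": ["Nice"]}
    ).to_excel(writer, sheet_name="Comments", index=False)

# sheet_name=None loads every sheet into a {name: DataFrame} dict
sheets = pd.read_excel("demo_export.xlsx", sheet_name=None)
top = sheets["Contents"].nlargest(1, "liked_count")
print(str(top["title"].iloc[0]))  # → B
```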

Advantages Over Other Formats

vs CSV

  • Multiple sheets in one file
  • Professional formatting
  • Better handling of special characters
  • Auto-adjusted column widths
  • No encoding issues

vs JSON

  • Human-readable tabular format
  • Easy to open in Excel/Google Sheets
  • Better for data analysis
  • Easier to share with non-technical users

vs Database

  • No database setup required
  • Portable single-file format
  • Easy to share and archive
  • Works offline

Tips & Best Practices

  1. Large datasets: For very large crawls (>10,000 rows), consider using database storage instead for better performance

  2. Data analysis: Excel files work great with:

    • Microsoft Excel
    • Google Sheets
    • LibreOffice Calc
    • Python pandas: pd.read_excel('file.xlsx')
  3. Combining data: You can merge multiple Excel files using:

    ```python
    import pandas as pd

    df1 = pd.read_excel('file1.xlsx', sheet_name='Contents')
    df2 = pd.read_excel('file2.xlsx', sheet_name='Contents')
    combined = pd.concat([df1, df2])
    combined.to_excel('combined.xlsx', index=False)
    ```

  4. File size: Excel files are typically 2-3x larger than CSV but smaller than JSON

Troubleshooting

"openpyxl not installed" error

```shell
# Install openpyxl
uv add openpyxl
# or
pip install openpyxl
```

Excel file not created

Check that:

  1. SAVE_DATA_OPTION = "excel" in config
  2. Crawler successfully collected data
  3. No errors in console output
  4. data/{platform}/ directory exists

Empty Excel file

This happens when:

  • No data was crawled (check keywords/IDs)
  • Login failed (check login status)
  • Platform blocked requests (check IP/rate limits)

Example Output

After running a successful crawl, you'll see:

```
[ExcelStoreBase] Initialized Excel export to: data/xhs/xhs_search_20250128_143025.xlsx
[ExcelStoreBase] Stored content to Excel: 7123456789
[ExcelStoreBase] Stored comment to Excel: comment_123
...
[Main] Excel file saved successfully
```

Your Excel file will have:

  • Professional blue headers
  • Clean borders
  • Wrapped text for long content
  • Auto-sized columns
  • Separate organized sheets
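The styling itself is ordinary openpyxl. The sketch below reproduces the documented look on a throwaway workbook; the hex color is an assumption, not the project's exact palette:

```python
from openpyxl import Workbook
from openpyxl.styles import Alignment, Border, Font, PatternFill, Side

wb = Workbook()
ws = wb.active
ws.title = "Contents"
ws.append(["note_id", "title", "liked_count"])
ws.append(["123", "A long title that benefits from text wrapping", 100])

# Blue header with white bold text (4472C4 is an assumed, Excel-like blue)
header_fill = PatternFill(start_color="4472C4", end_color="4472C4", fill_type="solid")
header_font = Font(color="FFFFFF", bold=True)
for cell in ws[1]:
    cell.fill = header_fill
    cell.font = header_font

# Thin borders and wrapped text on every cell
thin = Side(style="thin")
border = Border(left=thin, right=thin, top=thin, bottom=thin)
for row in ws.iter_rows():
    for cell in row:
        cell.border = border
        cell.alignment = Alignment(wrap_text=True)

# Auto-size each column from its longest value
for column_cells in ws.columns:
    width = max(len(str(cell.value)) for cell in column_cells) + 2
    ws.column_dimensions[column_cells[0].column_letter].width = width

wb.save("demo_styled.xlsx")
```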

Advanced Usage

Programmatic Access

```python
import asyncio

from store.excel_store_base import ExcelStoreBase

async def main():
    # Create a store for the target platform and crawler type
    store = ExcelStoreBase(platform="xhs", crawler_type="search")

    # store_content is a coroutine, so it must be awaited
    await store.store_content({
        "note_id": "123",
        "title": "Test Post",
        "liked_count": 100,
    })

    # Write the workbook to disk
    store.flush()

asyncio.run(main())
```

Custom Formatting

You can extend ExcelStoreBase to customize formatting:

```python
from store.excel_store_base import ExcelStoreBase

class CustomExcelStore(ExcelStoreBase):
    def _apply_header_style(self, sheet, row_num=1):
        # Keep the default styling, then layer customizations on top
        super()._apply_header_style(sheet, row_num)
        # Add your customizations here
```

Support

For issues or questions:

  • Check the FAQ (常见问题)
  • Open an issue on GitHub
  • Join the WeChat discussion group

Note: Excel export is designed for learning and research purposes. Please respect platform terms of service and rate limits.