mirror of
https://github.com/NanmiCoder/MediaCrawler.git
synced 2026-04-21 03:07:37 +08:00
feat: Add Excel export functionality and unit tests
Features:
- Excel export with formatted multi-sheet workbooks (Contents, Comments, Creators)
- Professional styling: blue headers, auto-width columns, borders, text wrapping
- Smart export: empty sheets automatically removed
- Support for all platforms (xhs, dy, ks, bili, wb, tieba, zhihu)

Testing:
- Added pytest framework with asyncio support
- Unit tests for Excel store functionality
- Unit tests for store factory pattern
- Shared fixtures for test data
- Test coverage for edge cases

Documentation:
- Comprehensive Excel export guide (docs/excel_export_guide.md)
- Updated README.md and README_en.md with Excel examples
- Updated config comments to include excel option

Dependencies:
- Added openpyxl>=3.1.2 for Excel support
- Added pytest>=7.4.0 and pytest-asyncio>=0.21.0 for testing

This contribution adds immediate value for users who need data analysis capabilities and establishes a testing foundation for future development.
244
docs/excel_export_guide.md
Normal file
@@ -0,0 +1,244 @@
# Excel Export Guide

## Overview

MediaCrawler now supports exporting crawled data to formatted Excel files (.xlsx) with professional styling and multiple sheets for contents, comments, and creators.

## Features

- **Multi-sheet workbooks**: Separate sheets for Contents, Comments, and Creators
- **Professional formatting**:
  - Styled headers with blue background and white text
  - Auto-adjusted column widths
  - Cell borders and text wrapping
  - Clean, readable layout
- **Smart export**: Empty sheets are automatically removed
- **Organized storage**: Files saved to `data/{platform}/` directory with timestamps
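
As an illustration of the styling and empty-sheet behaviour listed above, here is a minimal standalone sketch that uses `openpyxl` directly. It is not the project's `ExcelStoreBase` code: the `build_workbook` helper, the hex colour `4472C4`, and the 50-character width cap are assumptions made for this example.

```python
from openpyxl import Workbook
from openpyxl.styles import Alignment, Border, Font, PatternFill, Side
from openpyxl.utils import get_column_letter

# Assumed styling values; the real exporter may use different ones.
HEADER_FILL = PatternFill(start_color="4472C4", end_color="4472C4", fill_type="solid")
HEADER_FONT = Font(bold=True, color="FFFFFF")
THIN = Side(style="thin")
CELL_BORDER = Border(left=THIN, right=THIN, top=THIN, bottom=THIN)


def build_workbook(sheets):
    """Build a workbook from {sheet_name: [row_dict, ...]}, skipping empty sheets."""
    wb = Workbook()
    wb.remove(wb.active)  # drop the default sheet that openpyxl creates
    for name, rows in sheets.items():
        if not rows:
            continue  # no sheet is created for empty data
        ws = wb.create_sheet(title=name)
        headers = list(rows[0].keys())
        ws.append(headers)
        for row in rows:
            ws.append([row.get(h, "") for h in headers])
        # Borders and text wrapping on every cell
        for row_cells in ws.iter_rows():
            for cell in row_cells:
                cell.border = CELL_BORDER
                cell.alignment = Alignment(wrap_text=True, vertical="top")
        # Blue background and bold white text on the header row
        for cell in ws[1]:
            cell.fill = HEADER_FILL
            cell.font = HEADER_FONT
        # Column width follows the longest value, capped at 50 characters
        for col_idx, header in enumerate(headers, start=1):
            longest = max(len(str(header)), *(len(str(r.get(header, ""))) for r in rows))
            ws.column_dimensions[get_column_letter(col_idx)].width = min(longest + 2, 50)
    return wb
```

Calling `build_workbook({"Contents": rows, "Comments": []}).save("example.xlsx")` would write a file with a single `Contents` sheet. A caller would likely need to handle the case where every input is empty, since a workbook needs at least one sheet to be saved.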

## Installation

Excel export requires the `openpyxl` library:

```bash
# Using uv (recommended)
uv sync

# Or using pip
pip install openpyxl
```

## Usage

### Basic Usage

1. **Configure Excel export** in `config/base_config.py`:

   ```python
   SAVE_DATA_OPTION = "excel"  # Change from json/csv/db to excel
   ```

2. **Run the crawler**:

   ```bash
   # Xiaohongshu example
   uv run main.py --platform xhs --lt qrcode --type search

   # Douyin example
   uv run main.py --platform dy --lt qrcode --type search

   # Bilibili example
   uv run main.py --platform bili --lt qrcode --type search
   ```

3. **Find your Excel file** in `data/{platform}/` directory:
   - Filename format: `{platform}_{crawler_type}_{timestamp}.xlsx`
   - Example: `xhs_search_20250128_143025.xlsx`
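
The naming scheme above can be reproduced with the standard library, which is handy when a script needs to locate the newest export. This sketch assumes the timestamp follows the `%Y%m%d_%H%M%S` pattern, inferred from the example filename. `excel_output_path` and `latest_export` are hypothetical helpers, not functions in MediaCrawler.

```python
from datetime import datetime
from pathlib import Path


def excel_output_path(platform, crawler_type, now=None, base_dir="data"):
    """Return data/{platform}/{platform}_{crawler_type}_{timestamp}.xlsx."""
    now = now or datetime.now()
    timestamp = now.strftime("%Y%m%d_%H%M%S")  # assumed timestamp pattern
    return Path(base_dir) / platform / f"{platform}_{crawler_type}_{timestamp}.xlsx"


def latest_export(platform, base_dir="data"):
    """Return the export with the most recent timestamp in its name, or None."""
    folder = Path(base_dir) / platform
    # The last 15 characters of the stem are the timestamp, e.g. 20250128_143025
    files = sorted(folder.glob(f"{platform}_*.xlsx"), key=lambda p: p.stem[-15:])
    return files[-1] if files else None
```

Sorting on the timestamp suffix rather than the whole filename keeps the result correct when `search`, `detail`, and `creator` exports share one directory.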

### Command Line Examples

```bash
# Search by keywords and export to Excel
uv run main.py --platform xhs --lt qrcode --type search --save_data_option excel

# Crawl specific posts and export to Excel
uv run main.py --platform xhs --lt qrcode --type detail --save_data_option excel

# Crawl creator profile and export to Excel
uv run main.py --platform xhs --lt qrcode --type creator --save_data_option excel
```

## Excel File Structure

### Contents Sheet
Contains post/video information:
- `note_id`: Unique post identifier
- `title`: Post title
- `desc`: Post description
- `user_id`: Author user ID
- `nickname`: Author nickname
- `liked_count`: Number of likes
- `comment_count`: Number of comments
- `share_count`: Number of shares
- `ip_location`: IP location
- `image_list`: Comma-separated image URLs
- `tag_list`: Comma-separated tags
- `note_url`: Direct link to post
- And more platform-specific fields...
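
Because `image_list` and `tag_list` are stored as comma-separated text, analysis code usually needs to turn a cell back into a list. A minimal sketch follows; `split_list_cell` is a hypothetical helper, and it assumes that individual URLs or tags do not themselves contain commas.

```python
def split_list_cell(cell):
    """Split a comma-separated cell value into a list of non-empty strings."""
    if cell is None:
        return []  # empty cells are read back as None
    # Assumption: items never contain a comma themselves
    return [item.strip() for item in str(cell).split(",") if item.strip()]
```

For example, `split_list_cell("travel, food")` gives `["travel", "food"]`, and an empty cell gives an empty list.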

### Comments Sheet
Contains comment information:
- `comment_id`: Unique comment identifier
- `note_id`: Associated post ID
- `content`: Comment text
- `user_id`: Commenter user ID
- `nickname`: Commenter nickname
- `like_count`: Comment likes
- `create_time`: Comment timestamp
- `ip_location`: Commenter location
- `sub_comment_count`: Number of replies
- And more...
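
Since each comment row carries the `note_id` of its post, the two sheets can be joined on that column. The sketch below groups comment rows by post using plain dictionaries. `group_comments_by_note` is a hypothetical helper, and the rows are assumed to have already been loaded as dicts keyed by column name.

```python
from collections import defaultdict


def group_comments_by_note(comments):
    """Map each note_id to the list of comment rows that reference it."""
    grouped = defaultdict(list)
    for comment in comments:
        grouped[comment["note_id"]].append(comment)
    return dict(grouped)
```

Looking up a post's `note_id` from the Contents sheet in the returned mapping yields that post's comments. Posts without comments are simply absent from the mapping.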

### Creators Sheet
Contains creator/author information:
- `user_id`: Unique user identifier
- `nickname`: Display name
- `gender`: Gender
- `avatar`: Profile picture URL
- `desc`: Bio/description
- `fans`: Follower count
- `follows`: Following count
- `interaction`: Total interactions
- And more...

## Advantages Over Other Formats

### vs CSV
- ✅ Multiple sheets in one file
- ✅ Professional formatting
- ✅ Better handling of special characters
- ✅ Auto-adjusted column widths
- ✅ No encoding issues

### vs JSON
- ✅ Human-readable tabular format
- ✅ Easy to open in Excel/Google Sheets
- ✅ Better for data analysis
- ✅ Easier to share with non-technical users

### vs Database
- ✅ No database setup required
- ✅ Portable single-file format
- ✅ Easy to share and archive
- ✅ Works offline

## Tips & Best Practices

1. **Large datasets**: For very large crawls (>10,000 rows), consider using database storage instead for better performance

2. **Data analysis**: Excel files work great with:
   - Microsoft Excel
   - Google Sheets
   - LibreOffice Calc
   - Python pandas: `pd.read_excel('file.xlsx')`

3. **Combining data**: You can merge the same sheet from multiple Excel files using:
   ```python
   import pandas as pd

   # Read the Contents sheet from each export
   df1 = pd.read_excel('file1.xlsx', sheet_name='Contents')
   df2 = pd.read_excel('file2.xlsx', sheet_name='Contents')

   # Stack the rows and renumber the index so it has no duplicates
   combined = pd.concat([df1, df2], ignore_index=True)
   combined.to_excel('combined.xlsx', index=False)
   ```

4. **File size**: Excel files are typically 2-3x larger than CSV but smaller than JSON

## Troubleshooting

### "openpyxl not installed" error

```bash
# Install openpyxl
uv add openpyxl
# or
pip install openpyxl
```

### Excel file not created

Check that:
1. `SAVE_DATA_OPTION = "excel"` in config
2. Crawler successfully collected data
3. No errors in console output
4. `data/{platform}/` directory exists
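
Several of these checks can be automated. The sketch below covers the configuration value, the `openpyxl` dependency, and the output directory. `diagnose_excel_export` is a hypothetical helper and not part of MediaCrawler. Whether the crawl itself collected data still has to be read from the console output.

```python
import importlib.util
from pathlib import Path


def diagnose_excel_export(save_data_option, platform, base_dir="data"):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if save_data_option != "excel":
        problems.append(f'SAVE_DATA_OPTION is "{save_data_option}", expected "excel"')
    # find_spec returns None when the package cannot be imported
    if importlib.util.find_spec("openpyxl") is None:
        problems.append("openpyxl is not installed")
    output_dir = Path(base_dir) / platform
    if not output_dir.is_dir():
        problems.append(f"{output_dir} does not exist")
    return problems
```

A call such as `diagnose_excel_export("csv", "xhs")` would report the wrong storage option, plus any missing dependency or directory.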

### Empty Excel file

This happens when:
- No data was crawled (check keywords/IDs)
- Login failed (check login status)
- Platform blocked requests (check IP/rate limits)

## Example Output

After running a successful crawl, you'll see:

```
[ExcelStoreBase] Initialized Excel export to: data/xhs/xhs_search_20250128_143025.xlsx
[ExcelStoreBase] Stored content to Excel: 7123456789
[ExcelStoreBase] Stored comment to Excel: comment_123
...
[Main] Excel file saved successfully
```

Your Excel file will have:
- Professional blue headers
- Clean borders
- Wrapped text for long content
- Auto-sized columns
- Separate organized sheets

## Advanced Usage

### Programmatic Access

```python
import asyncio

from store.excel_store_base import ExcelStoreBase


async def main():
    # Create store
    store = ExcelStoreBase(platform="xhs", crawler_type="search")

    # Store data (store_content is awaited, so it must run inside a coroutine)
    await store.store_content({
        "note_id": "123",
        "title": "Test Post",
        "liked_count": 100
    })

    # Save to file
    store.flush()


asyncio.run(main())
```

### Custom Formatting

You can extend `ExcelStoreBase` to customize formatting:

```python
from store.excel_store_base import ExcelStoreBase

class CustomExcelStore(ExcelStoreBase):
    def _apply_header_style(self, sheet, row_num=1):
        # Custom header styling
        super()._apply_header_style(sheet, row_num)
        # Add your customizations here
```

## Support

For issues or questions:
- Check the [FAQ (常见问题)](常见问题.md)
- Open an issue on GitHub
- Join the WeChat discussion group

---

**Note**: Excel export is designed for learning and research purposes. Please respect platform terms of service and rate limits.