Excel Export Guide
Overview
MediaCrawler now supports exporting crawled data to formatted Excel files (.xlsx) with professional styling and multiple sheets for contents, comments, and creators.
Features
- Multi-sheet workbooks: Separate sheets for Contents, Comments, and Creators
- Professional formatting:
- Styled headers with blue background and white text
- Auto-adjusted column widths
- Cell borders and text wrapping
- Clean, readable layout
- Smart export: Empty sheets are automatically removed
- Organized storage: Files saved to the data/{platform}/ directory with timestamps
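To make the "auto-adjusted column widths" feature above concrete, here is a minimal sketch of how such widths can be computed in plain Python. The function name and the clamping values are illustrative assumptions, not the project's actual implementation.

```python
def fit_column_widths(rows, min_width=8, max_width=60, padding=2):
    """Pick a display width per column: longest cell plus padding, clamped to a sane range."""
    widths = []
    for column in zip(*rows):
        longest = max(len(str(cell)) for cell in column)
        widths.append(max(min_width, min(longest + padding, max_width)))
    return widths

# Header row plus one data row, as they might appear in a Contents sheet
rows = [
    ("note_id", "title", "liked_count"),
    ("7123456789", "Test Post", 100),
]
print(fit_column_widths(rows))  # [12, 11, 13]
```

In a real exporter these widths would be assigned to the worksheet's column dimensions; the point here is only the clamp-to-range sizing logic.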
Installation
Excel export requires the openpyxl library:
# Using uv (recommended)
uv sync
# Or using pip
pip install openpyxl
Usage
Basic Usage
- Configure Excel export in
config/base_config.py:
SAVE_DATA_OPTION = "excel" # Change from jsonl/json/csv/db to excel
- Run the crawler:
# Xiaohongshu example
uv run main.py --platform xhs --lt qrcode --type search
# Douyin example
uv run main.py --platform dy --lt qrcode --type search
# Bilibili example
uv run main.py --platform bili --lt qrcode --type search
- Find your Excel file in the data/{platform}/ directory:
- Filename format: {platform}_{crawler_type}_{timestamp}.xlsx
- Example: xhs_search_20250128_143025.xlsx
Command Line Examples
# Search by keywords and export to Excel
uv run main.py --platform xhs --lt qrcode --type search --save_data_option excel
# Crawl specific posts and export to Excel
uv run main.py --platform xhs --lt qrcode --type detail --save_data_option excel
# Crawl creator profile and export to Excel
uv run main.py --platform xhs --lt qrcode --type creator --save_data_option excel
Excel File Structure
Contents Sheet
Contains post/video information:
- note_id: Unique post identifier
- title: Post title
- desc: Post description
- user_id: Author user ID
- nickname: Author nickname
- liked_count: Number of likes
- comment_count: Number of comments
- share_count: Number of shares
- ip_location: IP location
- image_list: Comma-separated image URLs
- tag_list: Comma-separated tags
- note_url: Direct link to post
- And more platform-specific fields...
Comments Sheet
Contains comment information:
- comment_id: Unique comment identifier
- note_id: Associated post ID
- content: Comment text
- user_id: Commenter user ID
- nickname: Commenter nickname
- like_count: Comment likes
- create_time: Comment timestamp
- ip_location: Commenter location
- sub_comment_count: Number of replies
- And more...
Creators Sheet
Contains creator/author information:
- user_id: Unique user identifier
- nickname: Display name
- gender: Gender
- avatar: Profile picture URL
- desc: Bio/description
- fans: Follower count
- follows: Following count
- interaction: Total interactions
- And more...
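The Contents sheet stores list-valued fields such as image_list and tag_list as comma-separated strings. A minimal sketch of that flattening step might look like the following; the helper name is a hypothetical illustration, not the project's actual code.

```python
def to_excel_cell(value):
    """Flatten list-valued fields (image_list, tag_list, ...) into one comma-separated cell."""
    if isinstance(value, (list, tuple)):
        return ",".join(str(v) for v in value)
    return value

print(to_excel_cell(["https://a.jpg", "https://b.jpg"]))  # https://a.jpg,https://b.jpg
print(to_excel_cell("plain text"))                        # plain text
```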
Advantages Over Other Formats
vs CSV
- ✅ Multiple sheets in one file
- ✅ Professional formatting
- ✅ Better handling of special characters
- ✅ Auto-adjusted column widths
- ✅ No encoding issues
vs JSON
- ✅ Human-readable tabular format
- ✅ Easy to open in Excel/Google Sheets
- ✅ Better for data analysis
- ✅ Easier to share with non-technical users
vs Database
- ✅ No database setup required
- ✅ Portable single-file format
- ✅ Easy to share and archive
- ✅ Works offline
Tips & Best Practices
- Large datasets: For very large crawls (>10,000 rows), consider using database storage instead for better performance
- Data analysis: Excel files work great with:
- Microsoft Excel
- Google Sheets
- LibreOffice Calc
- Python pandas: pd.read_excel('file.xlsx')
- Combining data: You can merge multiple Excel files using:
import pandas as pd
df1 = pd.read_excel('file1.xlsx', sheet_name='Contents')
df2 = pd.read_excel('file2.xlsx', sheet_name='Contents')
combined = pd.concat([df1, df2])
combined.to_excel('combined.xlsx', index=False)
- File size: Excel files are typically 2-3x larger than CSV but smaller than JSON
Troubleshooting
"openpyxl not installed" error
# Install openpyxl
uv add openpyxl
# or
pip install openpyxl
Excel file not created
Check that:
- SAVE_DATA_OPTION = "excel" in config
- Crawler successfully collected data
- No errors in console output
- data/{platform}/ directory exists
Empty Excel file
This happens when:
- No data was crawled (check keywords/IDs)
- Login failed (check login status)
- Platform blocked requests (check IP/rate limits)
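The first two checks in the troubleshooting section above (openpyxl installed, output directory present) can be scripted. This is a hedged sketch; the function name is an assumption and is not part of MediaCrawler itself.

```python
import importlib.util
from pathlib import Path

def excel_export_ready(platform, data_root="data"):
    """Return a list of problems that would block Excel export; an empty list means ready."""
    problems = []
    # find_spec returns None when the module cannot be imported
    if importlib.util.find_spec("openpyxl") is None:
        problems.append("openpyxl is not installed (run: pip install openpyxl)")
    out_dir = Path(data_root) / platform
    if not out_dir.is_dir():
        problems.append("output directory %s does not exist" % out_dir)
    return problems

for problem in excel_export_ready("xhs"):
    print(problem)
```

Run it before a long crawl to fail fast instead of discovering a missing dependency at export time.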
Example Output
After running a successful crawl, you'll see:
[ExcelStoreBase] Initialized Excel export to: data/xhs/xhs_search_20250128_143025.xlsx
[ExcelStoreBase] Stored content to Excel: 7123456789
[ExcelStoreBase] Stored comment to Excel: comment_123
...
[Main] Excel file saved successfully
Your Excel file will have:
- Professional blue headers
- Clean borders
- Wrapped text for long content
- Auto-sized columns
- Separate organized sheets
Advanced Usage
Programmatic Access
from store.excel_store_base import ExcelStoreBase
# Create store
store = ExcelStoreBase(platform="xhs", crawler_type="search")
# Store data
await store.store_content({
"note_id": "123",
"title": "Test Post",
"liked_count": 100
})
# Save to file
store.flush()
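The snippet above depends on a running MediaCrawler checkout. To illustrate the same store/flush pattern in isolation, here is a toy in-memory stand-in, assuming only the behavior documented in this guide (per-sheet buffering, empty sheets dropped). It is not the real ExcelStoreBase.

```python
import asyncio

class InMemoryExcelStore:
    """Toy stand-in for ExcelStoreBase: buffers rows per sheet, drops empty sheets on flush."""

    def __init__(self, platform, crawler_type):
        self.platform = platform
        self.crawler_type = crawler_type
        self.sheets = {"Contents": [], "Comments": [], "Creators": []}

    async def store_content(self, item):
        self.sheets["Contents"].append(item)

    async def store_comment(self, item):
        self.sheets["Comments"].append(item)

    def flush(self):
        # Mirror the "smart export" behavior: empty sheets are removed
        return {name: rows for name, rows in self.sheets.items() if rows}

async def demo():
    store = InMemoryExcelStore(platform="xhs", crawler_type="search")
    await store.store_content({"note_id": "123", "title": "Test Post", "liked_count": 100})
    return store.flush()

result = asyncio.run(demo())
print(sorted(result))  # ['Contents'] -- Comments and Creators were empty
```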
Custom Formatting
You can extend ExcelStoreBase to customize formatting:
from store.excel_store_base import ExcelStoreBase
class CustomExcelStore(ExcelStoreBase):
def _apply_header_style(self, sheet, row_num=1):
# Custom header styling
super()._apply_header_style(sheet, row_num)
# Add your customizations here
Support
For issues or questions:
- Check the FAQ (常见问题)
- Open an issue on GitHub
- Join the WeChat discussion group
Note: Excel export is designed for learning and research purposes. Please respect platform terms of service and rate limits.