Excel Export Guide
Overview
MediaCrawler now supports exporting crawled data to formatted Excel files (.xlsx) with professional styling and multiple sheets for contents, comments, and creators.
Features
- Multi-sheet workbooks: Separate sheets for Contents, Comments, and Creators
- Professional formatting:
- Styled headers with blue background and white text
- Auto-adjusted column widths
- Cell borders and text wrapping
- Clean, readable layout
- Smart export: Empty sheets are automatically removed
- Organized storage: Files saved to the data/{platform}/ directory with timestamps
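To make the "auto-adjusted column widths" feature above concrete, here is a minimal sketch of how such widths can be computed in plain Python. The function name and the clamping values are illustrative assumptions, not the project's actual implementation.

```python
def fit_column_widths(rows, min_width=8, max_width=60, padding=2):
    """Pick a display width per column: longest cell plus padding, clamped to a sane range."""
    widths = []
    for column in zip(*rows):
        longest = max(len(str(cell)) for cell in column)
        widths.append(max(min_width, min(longest + padding, max_width)))
    return widths

# Header row plus one data row, as they might appear in a Contents sheet
rows = [
    ("note_id", "title", "liked_count"),
    ("7123456789", "Test Post", 100),
]
print(fit_column_widths(rows))  # [12, 11, 13]
```

In a real exporter these widths would be assigned to the worksheet's column dimensions; the point here is only the clamp-to-range sizing logic.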
Installation
Excel export requires the openpyxl library:
# Using uv (recommended)
uv sync
# Or using pip
pip install openpyxl
Usage
Basic Usage
- Configure Excel export in
config/base_config.py:
SAVE_DATA_OPTION = "excel" # Change from jsonl/json/csv/db to excel
- Run the crawler:
# Xiaohongshu example
uv run main.py --platform xhs --lt qrcode --type search
# Douyin example
uv run main.py --platform dy --lt qrcode --type search
# Bilibili example
uv run main.py --platform bili --lt qrcode --type search
- Find your Excel file in the data/{platform}/ directory:
- Filename format: {platform}_{crawler_type}_{timestamp}.xlsx
- Example: xhs_search_20250128_143025.xlsx
Command Line Examples
# Search by keywords and export to Excel
uv run main.py --platform xhs --lt qrcode --type search --save_data_option excel
# Crawl specific posts and export to Excel
uv run main.py --platform xhs --lt qrcode --type detail --save_data_option excel
# Crawl creator profile and export to Excel
uv run main.py --platform xhs --lt qrcode --type creator --save_data_option excel
Excel File Structure
Contents Sheet
Contains post/video information:
- note_id: Unique post identifier
- title: Post title
- desc: Post description
- user_id: Author user ID
- nickname: Author nickname
- liked_count: Number of likes
- comment_count: Number of comments
- share_count: Number of shares
- ip_location: IP location
- image_list: Comma-separated image URLs
- tag_list: Comma-separated tags
- note_url: Direct link to post
- And more platform-specific fields...
Comments Sheet
Contains comment information:
- comment_id: Unique comment identifier
- note_id: Associated post ID
- content: Comment text
- user_id: Commenter user ID
- nickname: Commenter nickname
- like_count: Comment likes
- create_time: Comment timestamp
- ip_location: Commenter location
- sub_comment_count: Number of replies
- And more...
Creators Sheet
Contains creator/author information:
- user_id: Unique user identifier
- nickname: Display name
- gender: Gender
- avatar: Profile picture URL
- desc: Bio/description
- fans: Follower count
- follows: Following count
- interaction: Total interactions
- And more...
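The Contents sheet stores list-valued fields such as image_list and tag_list as comma-separated strings. A minimal sketch of that flattening step might look like the following; the helper name is a hypothetical illustration, not the project's actual code.

```python
def to_excel_cell(value):
    """Flatten list-valued fields (image_list, tag_list, ...) into one comma-separated cell."""
    if isinstance(value, (list, tuple)):
        return ",".join(str(v) for v in value)
    return value

print(to_excel_cell(["https://a.jpg", "https://b.jpg"]))  # https://a.jpg,https://b.jpg
print(to_excel_cell("plain text"))                        # plain text
```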
Advantages Over Other Formats
vs CSV
- ✅ Multiple sheets in one file
- ✅ Professional formatting
- ✅ Better handling of special characters
- ✅ Auto-adjusted column widths
- ✅ No encoding issues
vs JSON
- ✅ Human-readable tabular format
- ✅ Easy to open in Excel/Google Sheets
- ✅ Better for data analysis
- ✅ Easier to share with non-technical users
vs Database
- ✅ No database setup required
- ✅ Portable single-file format
- ✅ Easy to share and archive
- ✅ Works offline
Tips & Best Practices
- Large datasets: For very large crawls (>10,000 rows), consider using database storage instead for better performance
- Data analysis: Excel files work great with:
- Microsoft Excel
- Google Sheets
- LibreOffice Calc
- Python pandas: pd.read_excel('file.xlsx')
- Combining data: You can merge multiple Excel files using:
import pandas as pd
df1 = pd.read_excel('file1.xlsx', sheet_name='Contents')
df2 = pd.read_excel('file2.xlsx', sheet_name='Contents')
combined = pd.concat([df1, df2])
combined.to_excel('combined.xlsx', index=False)
- File size: Excel files are typically 2-3x larger than CSV but smaller than JSON
Troubleshooting
"openpyxl not installed" error
# Install openpyxl
uv add openpyxl
# or
pip install openpyxl
Excel file not created
Check that:
- SAVE_DATA_OPTION = "excel" in config
- Crawler successfully collected data
- No errors in console output
- data/{platform}/ directory exists
Empty Excel file
This happens when:
- No data was crawled (check keywords/IDs)
- Login failed (check login status)
- Platform blocked requests (check IP/rate limits)
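The first two checks in the troubleshooting section above (openpyxl installed, output directory present) can be scripted. This is a hedged sketch; the function name is an assumption and is not part of MediaCrawler itself.

```python
import importlib.util
from pathlib import Path

def excel_export_ready(platform, data_root="data"):
    """Return a list of problems that would block Excel export; an empty list means ready."""
    problems = []
    # find_spec returns None when the module cannot be imported
    if importlib.util.find_spec("openpyxl") is None:
        problems.append("openpyxl is not installed (run: pip install openpyxl)")
    out_dir = Path(data_root) / platform
    if not out_dir.is_dir():
        problems.append("output directory %s does not exist" % out_dir)
    return problems

for problem in excel_export_ready("xhs"):
    print(problem)
```

Run it before a long crawl to fail fast instead of discovering a missing dependency at export time.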
Example Output
After running a successful crawl, you'll see:
[ExcelStoreBase] Initialized Excel export to: data/xhs/xhs_search_20250128_143025.xlsx
[ExcelStoreBase] Stored content to Excel: 7123456789
[ExcelStoreBase] Stored comment to Excel: comment_123
...
[Main] Excel file saved successfully
Your Excel file will have:
- Professional blue headers
- Clean borders
- Wrapped text for long content
- Auto-sized columns
- Separate organized sheets
Advanced Usage
Programmatic Access
from store.excel_store_base import ExcelStoreBase
# Create store
store = ExcelStoreBase(platform="xhs", crawler_type="search")
# Store data
await store.store_content({
"note_id": "123",
"title": "Test Post",
"liked_count": 100
})
# Save to file
store.flush()
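The snippet above depends on a running MediaCrawler checkout. To illustrate the same store/flush pattern in isolation, here is a toy in-memory stand-in, assuming only the behavior documented in this guide (per-sheet buffering, empty sheets dropped). It is not the real ExcelStoreBase.

```python
import asyncio

class InMemoryExcelStore:
    """Toy stand-in for ExcelStoreBase: buffers rows per sheet, drops empty sheets on flush."""

    def __init__(self, platform, crawler_type):
        self.platform = platform
        self.crawler_type = crawler_type
        self.sheets = {"Contents": [], "Comments": [], "Creators": []}

    async def store_content(self, item):
        self.sheets["Contents"].append(item)

    async def store_comment(self, item):
        self.sheets["Comments"].append(item)

    def flush(self):
        # Mirror the "smart export" behavior: empty sheets are removed
        return {name: rows for name, rows in self.sheets.items() if rows}

async def demo():
    store = InMemoryExcelStore(platform="xhs", crawler_type="search")
    await store.store_content({"note_id": "123", "title": "Test Post", "liked_count": 100})
    return store.flush()

result = asyncio.run(demo())
print(sorted(result))  # ['Contents'] -- Comments and Creators were empty
```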
Custom Formatting
You can extend ExcelStoreBase to customize formatting:
from store.excel_store_base import ExcelStoreBase
class CustomExcelStore(ExcelStoreBase):
def _apply_header_style(self, sheet, row_num=1):
# Custom header styling
super()._apply_header_style(sheet, row_num)
# Add your customizations here
Support
For issues or questions:
- Check the FAQ (常见问题)
- Open an issue on GitHub
- Join the WeChat discussion group
Note: Excel export is designed for learning and research purposes. Please respect platform terms of service and rate limits.