mirror of
https://github.com/NanmiCoder/MediaCrawler.git
synced 2026-03-09 15:40:48 +08:00
Merge pull request #783 from hsparks-codes/feature/excel-export-and-tests
feat: Add Excel export functionality and unit tests
@@ -212,6 +212,10 @@ python main.py --help
Supports multiple data storage methods:
- **CSV files**: Supports saving to CSV (under the `data/` directory)
- **JSON files**: Supports saving to JSON (under the `data/` directory)
- **Excel files**: Supports saving to formatted Excel files (under the `data/` directory) ✨ New feature
  - Multi-sheet support (contents, comments, creators)
  - Professional formatting (styled headers, auto-width columns, borders)
  - Easy to analyze and share
- **Database storage**
  - Use the `--init_db` parameter for database initialization (no other optional arguments are needed when using `--init_db`)
- **SQLite database**: Lightweight database, no server required, suitable for personal use (recommended)
@@ -224,6 +228,9 @@ python main.py --help

### Usage examples:

```shell
# Use Excel to store data (recommended for data analysis) ✨ New feature
uv run main.py --platform xhs --lt qrcode --type search --save_data_option excel

# Initialize the SQLite database (no other optional arguments are needed when using '--init_db')
uv run main.py --init_db sqlite

# Use SQLite to store data (recommended for personal users)
@@ -209,6 +209,10 @@ python main.py --help
Supports multiple data storage methods:
- **CSV Files**: Supports saving to CSV (under `data/` directory)
- **JSON Files**: Supports saving to JSON (under `data/` directory)
- **Excel Files**: Supports saving to formatted Excel files (under `data/` directory) ✨ New Feature
  - Multi-sheet support (Contents, Comments, Creators)
  - Professional formatting (styled headers, auto-width columns, borders)
  - Easy to analyze and share
- **Database Storage**
  - Use the `--init_db` parameter for database initialization (when using `--init_db`, no other optional arguments are needed)
- **SQLite Database**: Lightweight database, no server required, suitable for personal use (recommended)
@@ -221,6 +225,9 @@ Supports multiple data storage methods:

### Usage Examples:

```shell
# Use Excel to store data (recommended for data analysis) ✨ New Feature
uv run main.py --platform xhs --lt qrcode --type search --save_data_option excel

# Initialize SQLite database (when using '--init_db', no other optional arguments are needed)
uv run main.py --init_db sqlite

# Use SQLite to store data (recommended for personal users)
@@ -70,8 +70,8 @@ BROWSER_LAUNCH_TIMEOUT = 60
# Set to False to keep the browser running, which is convenient for debugging
AUTO_CLOSE_BROWSER = True

-# Data save type option. Four types are supported: csv, db, json, sqlite. Saving to the DB is recommended because it deduplicates records.
-SAVE_DATA_OPTION = "json"  # csv or db or json or sqlite
+# Data save type option. Five types are supported: csv, db, json, sqlite, excel. Saving to the DB is recommended because it deduplicates records.
+SAVE_DATA_OPTION = "json"  # csv or db or json or sqlite or excel

# User browser cache file configuration
USER_DATA_DIR = "%s_user_data_dir"  # %s will be replaced by platform name
244
docs/excel_export_guide.md
Normal file
@@ -0,0 +1,244 @@
# Excel Export Guide

## Overview

MediaCrawler now supports exporting crawled data to formatted Excel files (.xlsx) with professional styling and multiple sheets for contents, comments, and creators.

## Features

- **Multi-sheet workbooks**: Separate sheets for Contents, Comments, and Creators
- **Professional formatting**:
  - Styled headers with blue background and white text
  - Auto-adjusted column widths
  - Cell borders and text wrapping
  - Clean, readable layout
- **Smart export**: Empty sheets are automatically removed
- **Organized storage**: Files saved to `data/{platform}/` directory with timestamps
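The timestamped path scheme described above can be sketched in a few lines of standard-library Python. `build_export_path` is a hypothetical helper for illustration, not a MediaCrawler function; it mirrors the documented `data/{platform}/{platform}_{crawler_type}_{timestamp}.xlsx` layout.

```python
from datetime import datetime
from pathlib import Path

def build_export_path(platform: str, crawler_type: str, now: datetime) -> Path:
    """Build data/{platform}/{platform}_{crawler_type}_{timestamp}.xlsx (hypothetical helper)."""
    timestamp = now.strftime("%Y%m%d_%H%M%S")
    return Path("data") / platform / f"{platform}_{crawler_type}_{timestamp}.xlsx"

path = build_export_path("xhs", "search", datetime(2025, 1, 28, 14, 30, 25))
print(path.as_posix())
# → data/xhs/xhs_search_20250128_143025.xlsx
```

Passing the timestamp in explicitly (instead of calling `datetime.now()` inside) keeps the helper deterministic and easy to test.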
## Installation

Excel export requires the `openpyxl` library:

```bash
# Using uv (recommended)
uv sync

# Or using pip
pip install openpyxl
```

## Usage

### Basic Usage

1. **Configure Excel export** in `config/base_config.py`:

```python
SAVE_DATA_OPTION = "excel"  # Change from json/csv/db to excel
```

2. **Run the crawler**:

```bash
# Xiaohongshu example
uv run main.py --platform xhs --lt qrcode --type search

# Douyin example
uv run main.py --platform dy --lt qrcode --type search

# Bilibili example
uv run main.py --platform bili --lt qrcode --type search
```

3. **Find your Excel file** in the `data/{platform}/` directory:
   - Filename format: `{platform}_{crawler_type}_{timestamp}.xlsx`
   - Example: `xhs_search_20250128_143025.xlsx`

### Command Line Examples

```bash
# Search by keywords and export to Excel
uv run main.py --platform xhs --lt qrcode --type search --save_data_option excel

# Crawl specific posts and export to Excel
uv run main.py --platform xhs --lt qrcode --type detail --save_data_option excel

# Crawl a creator profile and export to Excel
uv run main.py --platform xhs --lt qrcode --type creator --save_data_option excel
```

## Excel File Structure

### Contents Sheet
Contains post/video information:
- `note_id`: Unique post identifier
- `title`: Post title
- `desc`: Post description
- `user_id`: Author user ID
- `nickname`: Author nickname
- `liked_count`: Number of likes
- `comment_count`: Number of comments
- `share_count`: Number of shares
- `ip_location`: IP location
- `image_list`: Comma-separated image URLs
- `tag_list`: Comma-separated tags
- `note_url`: Direct link to post
- And more platform-specific fields...

### Comments Sheet
Contains comment information:
- `comment_id`: Unique comment identifier
- `note_id`: Associated post ID
- `content`: Comment text
- `user_id`: Commenter user ID
- `nickname`: Commenter nickname
- `like_count`: Comment likes
- `create_time`: Comment timestamp
- `ip_location`: Commenter location
- `sub_comment_count`: Number of replies
- And more...

### Creators Sheet
Contains creator/author information:
- `user_id`: Unique user identifier
- `nickname`: Display name
- `gender`: Gender
- `avatar`: Profile picture URL
- `desc`: Bio/description
- `fans`: Follower count
- `follows`: Following count
- `interaction`: Total interactions
- And more...
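Each sheet is a flat table: column headers are taken from the keys of the stored items, and non-scalar values are flattened to strings before being written to a cell. A minimal sketch of that flattening, with `flatten_row` as a hypothetical stand-in for the exporter's cell handling:

```python
from typing import Any, Dict, List

def flatten_row(item: Dict[str, Any], headers: List[str]) -> List[Any]:
    """Flatten one item into a row: lists/dicts become strings, None becomes "" (sketch)."""
    row = []
    for header in headers:
        value = item.get(header, "")
        if isinstance(value, (list, dict)):
            value = str(value)      # e.g. a list of image URLs becomes one cell
        elif value is None:
            value = ""
        row.append(value)
    return row

headers = ["note_id", "title", "image_list", "ip_location"]
item = {"note_id": "123", "title": "Test", "image_list": ["a.jpg", "b.jpg"], "ip_location": None}
print(flatten_row(item, headers))
# → ['123', 'Test', "['a.jpg', 'b.jpg']", '']
```

Driving column order from an explicit `headers` list means every row lines up under the same columns even if individual items are missing fields.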
## Advantages Over Other Formats

### vs CSV
- ✅ Multiple sheets in one file
- ✅ Professional formatting
- ✅ Better handling of special characters
- ✅ Auto-adjusted column widths
- ✅ No encoding issues

### vs JSON
- ✅ Human-readable tabular format
- ✅ Easy to open in Excel/Google Sheets
- ✅ Better for data analysis
- ✅ Easier to share with non-technical users

### vs Database
- ✅ No database setup required
- ✅ Portable single-file format
- ✅ Easy to share and archive
- ✅ Works offline

## Tips & Best Practices

1. **Large datasets**: For very large crawls (>10,000 rows), consider using database storage instead for better performance

2. **Data analysis**: Excel files work well with:
   - Microsoft Excel
   - Google Sheets
   - LibreOffice Calc
   - Python pandas: `pd.read_excel('file.xlsx')`

3. **Combining data**: You can merge multiple Excel files using:

```python
import pandas as pd

df1 = pd.read_excel('file1.xlsx', sheet_name='Contents')
df2 = pd.read_excel('file2.xlsx', sheet_name='Contents')
combined = pd.concat([df1, df2])
combined.to_excel('combined.xlsx', index=False)
```

4. **File size**: Excel files are typically 2-3x larger than CSV but smaller than JSON

## Troubleshooting

### "openpyxl not installed" error

```bash
# Install openpyxl
uv add openpyxl
# or
pip install openpyxl
```

### Excel file not created

Check that:
1. `SAVE_DATA_OPTION = "excel"` is set in the config
2. The crawler successfully collected data
3. There are no errors in the console output
4. The `data/{platform}/` directory exists
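If you are unsure whether anything was written, a quick standard-library check can locate the newest export. `find_latest_export` is a hypothetical helper for debugging, not part of MediaCrawler; it only assumes the documented `data/{platform}/` layout.

```python
import tempfile
from pathlib import Path
from typing import Optional

def find_latest_export(platform: str, data_root: str = "data") -> Optional[Path]:
    """Return the most recently modified .xlsx export for a platform, or None (sketch)."""
    platform_dir = Path(data_root) / platform
    if not platform_dir.is_dir():
        return None
    exports = sorted(platform_dir.glob("*.xlsx"), key=lambda p: p.stat().st_mtime)
    return exports[-1] if exports else None

# Demo against a throwaway directory; the real crawler writes under data/{platform}/.
root = tempfile.mkdtemp()
missing = find_latest_export("xhs", root)   # directory does not exist yet → None
export_dir = Path(root) / "xhs"
export_dir.mkdir()
newest = export_dir / "xhs_search_20250128_143025.xlsx"
newest.touch()
found = find_latest_export("xhs", root)
print(missing, found.name)
# → None xhs_search_20250128_143025.xlsx
```

Run it from the project root so the relative `data/` path resolves the same way the crawler's does.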
### Empty Excel file

This happens when:
- No data was crawled (check keywords/IDs)
- Login failed (check login status)
- The platform blocked requests (check IP/rate limits)

## Example Output

After running a successful crawl, you'll see:

```
[ExcelStoreBase] Initialized Excel export to: data/xhs/xhs_search_20250128_143025.xlsx
[ExcelStoreBase] Stored content to Excel: 7123456789
[ExcelStoreBase] Stored comment to Excel: comment_123
...
[Main] Excel file saved successfully
```

Your Excel file will have:
- Professional blue headers
- Clean borders
- Wrapped text for long content
- Auto-sized columns
- Separate, organized sheets

## Advanced Usage

### Programmatic Access

```python
from store.excel_store_base import ExcelStoreBase

# Create store
store = ExcelStoreBase(platform="xhs", crawler_type="search")

# Store data
await store.store_content({
    "note_id": "123",
    "title": "Test Post",
    "liked_count": 100
})

# Save to file
store.flush()
```

### Custom Formatting

You can extend `ExcelStoreBase` to customize formatting:

```python
from store.excel_store_base import ExcelStoreBase

class CustomExcelStore(ExcelStoreBase):
    def _apply_header_style(self, sheet, row_num=1):
        # Custom header styling
        super()._apply_header_style(sheet, row_num)
        # Add your customizations here
```

## Support

For issues or questions:
- Check the [FAQ](常见问题.md)
- Open an issue on GitHub
- Join the WeChat discussion group

---

**Note**: Excel export is designed for learning and research purposes. Please respect platform terms of service and rate limits.
12
main.py
@@ -84,6 +84,18 @@ async def main():
    crawler = CrawlerFactory.create_crawler(platform=config.PLATFORM)
    await crawler.start()

    # Flush Excel data if using Excel export
    if config.SAVE_DATA_OPTION == "excel":
        try:
            # Get the store instance and flush data
            from store.xhs import XhsStoreFactory
            store = XhsStoreFactory.create_store()
            if hasattr(store, 'flush'):
                store.flush()
            print("[Main] Excel file saved successfully")
        except Exception as e:
            print(f"Error flushing Excel data: {e}")

    # Generate wordcloud after crawling is complete
    # Only for JSON save mode
    if config.SAVE_DATA_OPTION == "json" and config.ENABLE_GET_WORDCLOUD:
@@ -35,6 +35,9 @@ dependencies = [
    "wordcloud==1.9.3",
    "xhshow>=0.1.3",
    "pre-commit>=3.5.0",
    "openpyxl>=3.1.2",
    "pytest>=7.4.0",
    "pytest-asyncio>=0.21.0",
]

[[tool.uv.index]]
@@ -25,4 +25,7 @@ alembic>=1.16.5
asyncmy>=0.2.10
sqlalchemy>=2.0.43
motor>=3.3.0
xhshow>=0.1.3
openpyxl>=3.1.2
pytest>=7.4.0
pytest-asyncio>=0.21.0
263
store/excel_store_base.py
Normal file
@@ -0,0 +1,263 @@
# -*- coding: utf-8 -*-
# Copyright (c) 2025 relakkes@gmail.com
#
# This file is part of MediaCrawler project.
# Repository: https://github.com/NanmiCoder/MediaCrawler
# GitHub: https://github.com/NanmiCoder
# Licensed under NON-COMMERCIAL LEARNING LICENSE 1.1
#
# Disclaimer: This code is for learning and research purposes only. Users must follow these principles:
# 1. It must not be used for any commercial purpose.
# 2. Usage must comply with the target platform's terms of service and robots.txt rules.
# 3. Do not perform large-scale crawling or disrupt the platform's operations.
# 4. Control the request rate reasonably and avoid placing unnecessary load on the target platform.
# 5. It must not be used for any illegal or improper purpose.
#
# See the LICENSE file in the project root for the full license terms.
# Using this code means you agree to follow the principles above and all terms in the LICENSE.

"""
Excel Store Base Implementation
Provides Excel export functionality for crawled data with formatted sheets
"""

import os
from datetime import datetime
from typing import Dict, List, Any
from pathlib import Path

try:
    import openpyxl
    from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
    from openpyxl.utils import get_column_letter
    EXCEL_AVAILABLE = True
except ImportError:
    EXCEL_AVAILABLE = False

from base.base_crawler import AbstractStore
from tools import utils


class ExcelStoreBase(AbstractStore):
    """
    Base class for Excel storage implementation
    Provides formatted Excel export with multiple sheets for contents, comments, and creators
    """

    def __init__(self, platform: str, crawler_type: str = "search"):
        """
        Initialize Excel store

        Args:
            platform: Platform name (xhs, dy, ks, etc.)
            crawler_type: Type of crawler (search, detail, creator)
        """
        if not EXCEL_AVAILABLE:
            raise ImportError(
                "openpyxl is required for Excel export. "
                "Install it with: pip install openpyxl"
            )

        super().__init__()
        self.platform = platform
        self.crawler_type = crawler_type

        # Create data directory
        self.data_dir = Path("data") / platform
        self.data_dir.mkdir(parents=True, exist_ok=True)

        # Initialize workbook
        self.workbook = openpyxl.Workbook()
        self.workbook.remove(self.workbook.active)  # Remove default sheet

        # Create sheets
        self.contents_sheet = self.workbook.create_sheet("Contents")
        self.comments_sheet = self.workbook.create_sheet("Comments")
        self.creators_sheet = self.workbook.create_sheet("Creators")

        # Track if headers are written
        self.contents_headers_written = False
        self.comments_headers_written = False
        self.creators_headers_written = False

        # Generate filename
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        self.filename = self.data_dir / f"{platform}_{crawler_type}_{timestamp}.xlsx"

        utils.logger.info(f"[ExcelStoreBase] Initialized Excel export to: {self.filename}")

    def _apply_header_style(self, sheet, row_num: int = 1):
        """
        Apply formatting to the header row

        Args:
            sheet: Worksheet object
            row_num: Row number for headers (default: 1)
        """
        header_fill = PatternFill(start_color="366092", end_color="366092", fill_type="solid")
        header_font = Font(bold=True, color="FFFFFF", size=11)
        header_alignment = Alignment(horizontal="center", vertical="center", wrap_text=True)
        border = Border(
            left=Side(style='thin'),
            right=Side(style='thin'),
            top=Side(style='thin'),
            bottom=Side(style='thin')
        )

        for cell in sheet[row_num]:
            cell.fill = header_fill
            cell.font = header_font
            cell.alignment = header_alignment
            cell.border = border

    def _auto_adjust_column_width(self, sheet):
        """
        Auto-adjust column widths based on content

        Args:
            sheet: Worksheet object
        """
        for column in sheet.columns:
            max_length = 0
            column_letter = get_column_letter(column[0].column)

            for cell in column:
                try:
                    if cell.value:
                        max_length = max(max_length, len(str(cell.value)))
                except Exception:
                    pass

            # Set width with min/max constraints
            adjusted_width = min(max(max_length + 2, 10), 50)
            sheet.column_dimensions[column_letter].width = adjusted_width

    def _write_headers(self, sheet, headers: List[str]):
        """
        Write headers to a sheet

        Args:
            sheet: Worksheet object
            headers: List of header names
        """
        for col_num, header in enumerate(headers, 1):
            sheet.cell(row=1, column=col_num, value=header)

        self._apply_header_style(sheet)

    def _write_row(self, sheet, data: Dict[str, Any], headers: List[str]):
        """
        Write a data row to a sheet

        Args:
            sheet: Worksheet object
            data: Data dictionary
            headers: List of header names (defines column order)
        """
        row_num = sheet.max_row + 1

        for col_num, header in enumerate(headers, 1):
            value = data.get(header, "")

            # Handle different data types
            if isinstance(value, (list, dict)):
                value = str(value)
            elif value is None:
                value = ""

            cell = sheet.cell(row=row_num, column=col_num, value=value)

            # Apply basic formatting
            cell.alignment = Alignment(vertical="top", wrap_text=True)
            cell.border = Border(
                left=Side(style='thin'),
                right=Side(style='thin'),
                top=Side(style='thin'),
                bottom=Side(style='thin')
            )

    async def store_content(self, content_item: Dict):
        """
        Store content data to Excel

        Args:
            content_item: Content data dictionary
        """
        # Define headers (customize based on platform)
        headers = list(content_item.keys())

        # Write headers if first time
        if not self.contents_headers_written:
            self._write_headers(self.contents_sheet, headers)
            self.contents_headers_written = True

        # Write data row
        self._write_row(self.contents_sheet, content_item, headers)

        utils.logger.info(f"[ExcelStoreBase] Stored content to Excel: {content_item.get('note_id', 'N/A')}")

    async def store_comment(self, comment_item: Dict):
        """
        Store comment data to Excel

        Args:
            comment_item: Comment data dictionary
        """
        # Define headers
        headers = list(comment_item.keys())

        # Write headers if first time
        if not self.comments_headers_written:
            self._write_headers(self.comments_sheet, headers)
            self.comments_headers_written = True

        # Write data row
        self._write_row(self.comments_sheet, comment_item, headers)

        utils.logger.info(f"[ExcelStoreBase] Stored comment to Excel: {comment_item.get('comment_id', 'N/A')}")

    async def store_creator(self, creator_item: Dict):
        """
        Store creator data to Excel

        Args:
            creator_item: Creator data dictionary
        """
        # Define headers
        headers = list(creator_item.keys())

        # Write headers if first time
        if not self.creators_headers_written:
            self._write_headers(self.creators_sheet, headers)
            self.creators_headers_written = True

        # Write data row
        self._write_row(self.creators_sheet, creator_item, headers)

        utils.logger.info(f"[ExcelStoreBase] Stored creator to Excel: {creator_item.get('user_id', 'N/A')}")

    def flush(self):
        """
        Save the workbook to file
        """
        try:
            # Auto-adjust column widths
            self._auto_adjust_column_width(self.contents_sheet)
            self._auto_adjust_column_width(self.comments_sheet)
            self._auto_adjust_column_width(self.creators_sheet)

            # Remove empty sheets
            if self.contents_sheet.max_row == 1:
                self.workbook.remove(self.contents_sheet)
            if self.comments_sheet.max_row == 1:
                self.workbook.remove(self.comments_sheet)
            if self.creators_sheet.max_row == 1:
                self.workbook.remove(self.creators_sheet)

            # Save workbook
            self.workbook.save(self.filename)
            utils.logger.info(f"[ExcelStoreBase] Excel file saved successfully: {self.filename}")

        except Exception as e:
            utils.logger.error(f"[ExcelStoreBase] Error saving Excel file: {e}")
            raise
@@ -37,13 +37,14 @@ class XhsStoreFactory:
        "json": XhsJsonStoreImplement,
        "sqlite": XhsSqliteStoreImplement,
        "mongodb": XhsMongoStoreImplement,
        "excel": XhsExcelStoreImplement,
    }

    @staticmethod
    def create_store() -> AbstractStore:
        store_class = XhsStoreFactory.STORES.get(config.SAVE_DATA_OPTION)
        if not store_class:
-            raise ValueError("[XhsStoreFactory.create_store] Invalid save option, only csv, db, json, sqlite, or mongodb are supported ...")
+            raise ValueError("[XhsStoreFactory.create_store] Invalid save option, only csv, db, json, sqlite, mongodb, or excel are supported ...")
        return store_class()
@@ -37,6 +37,7 @@ from tools.time_util import get_current_timestamp
from var import crawler_type_var
from database.mongodb_store_base import MongoDBStoreBase
from tools import utils
from store.excel_store_base import ExcelStoreBase


class XhsCsvStoreImplement(AbstractStore):
    def __init__(self, **kwargs):
@@ -336,3 +337,11 @@ class XhsMongoStoreImplement(AbstractStore):
            data=creator_item
        )
        utils.logger.info(f"[XhsMongoStoreImplement.store_creator] Saved creator {user_id} to MongoDB")


class XhsExcelStoreImplement(ExcelStoreBase):
    """Xiaohongshu Excel store implementation"""

    def __init__(self, **kwargs):
        super().__init__(platform="xhs", crawler_type=crawler_type_var.get())
        utils.logger.info("[XhsExcelStoreImplement] Excel store initialized")
2
tests/__init__.py
Normal file
@@ -0,0 +1,2 @@
# -*- coding: utf-8 -*-
# MediaCrawler Test Suite
81
tests/conftest.py
Normal file
@@ -0,0 +1,81 @@
# -*- coding: utf-8 -*-
"""
Pytest configuration and shared fixtures
"""

import pytest
import sys
from pathlib import Path

# Add project root to Python path
project_root = Path(__file__).parent.parent
sys.path.insert(0, str(project_root))


@pytest.fixture(scope="session")
def project_root_path():
    """Return project root path"""
    return project_root


@pytest.fixture
def sample_xhs_note():
    """Sample Xiaohongshu note data for testing"""
    return {
        "note_id": "test_note_123",
        "type": "normal",
        "title": "测试标题 Test Title",
        "desc": "这是一个测试描述 This is a test description",
        "video_url": "",
        "time": 1700000000,
        "last_update_time": 1700000000,
        "user_id": "user_123",
        "nickname": "测试用户",
        "avatar": "https://example.com/avatar.jpg",
        "liked_count": 100,
        "collected_count": 50,
        "comment_count": 25,
        "share_count": 10,
        "ip_location": "上海",
        "image_list": "https://example.com/img1.jpg,https://example.com/img2.jpg",
        "tag_list": "测试,编程,Python",
        "note_url": "https://www.xiaohongshu.com/explore/test_note_123",
        "source_keyword": "测试关键词",
        "xsec_token": "test_token_123"
    }


@pytest.fixture
def sample_xhs_comment():
    """Sample Xiaohongshu comment data for testing"""
    return {
        "comment_id": "comment_123",
        "create_time": 1700000000,
        "ip_location": "北京",
        "note_id": "test_note_123",
        "content": "这是一条测试评论 This is a test comment",
        "user_id": "user_456",
        "nickname": "评论用户",
        "avatar": "https://example.com/avatar2.jpg",
        "sub_comment_count": 5,
        "pictures": "",
        "parent_comment_id": 0,
        "like_count": 15
    }


@pytest.fixture
def sample_xhs_creator():
    """Sample Xiaohongshu creator data for testing"""
    return {
        "user_id": "creator_123",
        "nickname": "创作者名称",
        "gender": "女",
        "avatar": "https://example.com/creator_avatar.jpg",
        "desc": "这是创作者简介",
        "ip_location": "广州",
        "follows": 500,
        "fans": 10000,
        "interaction": 50000,
        "tag_list": '{"profession": "设计师", "interest": "摄影"}'
    }
171
tests/test_excel_store.py
Normal file
@@ -0,0 +1,171 @@
# -*- coding: utf-8 -*-
"""
Unit tests for Excel export functionality
"""

import pytest
import asyncio
import os
from pathlib import Path
import tempfile
import shutil

try:
    import openpyxl
    EXCEL_AVAILABLE = True
except ImportError:
    EXCEL_AVAILABLE = False

from store.excel_store_base import ExcelStoreBase


@pytest.mark.skipif(not EXCEL_AVAILABLE, reason="openpyxl not installed")
class TestExcelStoreBase:
    """Test cases for ExcelStoreBase"""

    @pytest.fixture
    def temp_dir(self):
        """Create temporary directory for test files"""
        temp_path = tempfile.mkdtemp()
        yield temp_path
        # Cleanup
        shutil.rmtree(temp_path, ignore_errors=True)

    @pytest.fixture
    def excel_store(self, temp_dir, monkeypatch):
        """Create ExcelStoreBase instance for testing"""
        # Change into the temp directory so data/ is created there
        monkeypatch.chdir(temp_dir)
        store = ExcelStoreBase(platform="test", crawler_type="search")
        yield store
        # Cleanup is handled by temp_dir fixture

    def test_initialization(self, excel_store):
        """Test Excel store initialization"""
        assert excel_store.platform == "test"
        assert excel_store.crawler_type == "search"
        assert excel_store.workbook is not None
        assert excel_store.contents_sheet is not None
        assert excel_store.comments_sheet is not None
        assert excel_store.creators_sheet is not None

    @pytest.mark.asyncio
    async def test_store_content(self, excel_store):
        """Test storing content data"""
        content_item = {
            "note_id": "test123",
            "title": "Test Title",
            "desc": "Test Description",
            "user_id": "user456",
            "nickname": "TestUser",
            "liked_count": 100,
            "comment_count": 50
        }

        await excel_store.store_content(content_item)

        # Verify data was written
        assert excel_store.contents_sheet.max_row == 2  # Header + 1 data row
        assert excel_store.contents_headers_written is True

    @pytest.mark.asyncio
    async def test_store_comment(self, excel_store):
        """Test storing comment data"""
        comment_item = {
            "comment_id": "comment123",
            "note_id": "note456",
            "content": "Great post!",
            "user_id": "user789",
            "nickname": "Commenter",
            "like_count": 10
        }

        await excel_store.store_comment(comment_item)

        # Verify data was written
        assert excel_store.comments_sheet.max_row == 2  # Header + 1 data row
        assert excel_store.comments_headers_written is True

    @pytest.mark.asyncio
    async def test_store_creator(self, excel_store):
        """Test storing creator data"""
        creator_item = {
            "user_id": "creator123",
            "nickname": "Creator Name",
            "fans": 10000,
            "follows": 500,
            "interaction": 50000
        }

        await excel_store.store_creator(creator_item)

        # Verify data was written
        assert excel_store.creators_sheet.max_row == 2  # Header + 1 data row
        assert excel_store.creators_headers_written is True

    @pytest.mark.asyncio
    async def test_multiple_items(self, excel_store):
        """Test storing multiple items"""
        # Store multiple content items
        for i in range(5):
            await excel_store.store_content({
                "note_id": f"note{i}",
                "title": f"Title {i}",
                "liked_count": i * 10
            })

        # Verify all items were stored
        assert excel_store.contents_sheet.max_row == 6  # Header + 5 data rows

    def test_flush(self, excel_store):
        """Test flushing data to file"""
        # Add some test data
        asyncio.run(excel_store.store_content({
            "note_id": "test",
            "title": "Test"
        }))

        # Flush to file
        excel_store.flush()

        # Verify file was created
        assert excel_store.filename.exists()

        # Verify file can be opened
        wb = openpyxl.load_workbook(excel_store.filename)
        assert "Contents" in wb.sheetnames
        wb.close()

    def test_header_formatting(self, excel_store):
        """Test header row formatting"""
        asyncio.run(excel_store.store_content({"note_id": "test", "title": "Test"}))

        # Check header formatting
        header_cell = excel_store.contents_sheet.cell(row=1, column=1)
        assert header_cell.font.bold is True
        # RGB color may have different prefix (00 or FF), check the actual color part
        assert header_cell.fill.start_color.rgb[-6:] == "366092"

    def test_empty_sheets_removed(self, excel_store):
        """Test that empty sheets are removed on flush"""
        # Only add content, leave comments and creators empty
        asyncio.run(excel_store.store_content({"note_id": "test"}))

        excel_store.flush()

        # Reload workbook
        wb = openpyxl.load_workbook(excel_store.filename)

        # Only Contents sheet should exist
        assert "Contents" in wb.sheetnames
        assert "Comments" not in wb.sheetnames
        assert "Creators" not in wb.sheetnames
        wb.close()


@pytest.mark.skipif(not EXCEL_AVAILABLE, reason="openpyxl not installed")
def test_excel_import_availability():
    """Test that openpyxl is available"""
    assert EXCEL_AVAILABLE is True
    import openpyxl
    assert openpyxl is not None
75
tests/test_store_factory.py
Normal file
@@ -0,0 +1,75 @@
# -*- coding: utf-8 -*-
"""
Unit tests for Store Factory functionality
"""

import pytest
from unittest.mock import patch, MagicMock

from store.xhs import XhsStoreFactory
from store.xhs._store_impl import (
    XhsCsvStoreImplement,
    XhsJsonStoreImplement,
    XhsDbStoreImplement,
    XhsSqliteStoreImplement,
    XhsMongoStoreImplement,
    XhsExcelStoreImplement
)


class TestXhsStoreFactory:
    """Test cases for XhsStoreFactory"""

    @patch('config.SAVE_DATA_OPTION', 'csv')
    def test_create_csv_store(self):
        """Test creating a CSV store"""
        store = XhsStoreFactory.create_store()
        assert isinstance(store, XhsCsvStoreImplement)

    @patch('config.SAVE_DATA_OPTION', 'json')
    def test_create_json_store(self):
        """Test creating a JSON store"""
        store = XhsStoreFactory.create_store()
        assert isinstance(store, XhsJsonStoreImplement)

    @patch('config.SAVE_DATA_OPTION', 'db')
    def test_create_db_store(self):
        """Test creating a database store"""
        store = XhsStoreFactory.create_store()
        assert isinstance(store, XhsDbStoreImplement)

    @patch('config.SAVE_DATA_OPTION', 'sqlite')
    def test_create_sqlite_store(self):
        """Test creating a SQLite store"""
        store = XhsStoreFactory.create_store()
        assert isinstance(store, XhsSqliteStoreImplement)

    @patch('config.SAVE_DATA_OPTION', 'mongodb')
    def test_create_mongodb_store(self):
        """Test creating a MongoDB store"""
        store = XhsStoreFactory.create_store()
        assert isinstance(store, XhsMongoStoreImplement)

    @patch('config.SAVE_DATA_OPTION', 'excel')
    def test_create_excel_store(self):
        """Test creating an Excel store"""
        # ContextVar cannot be mocked, so we test with the actual value
        store = XhsStoreFactory.create_store()
        assert isinstance(store, XhsExcelStoreImplement)

    @patch('config.SAVE_DATA_OPTION', 'invalid')
    def test_invalid_store_option(self):
        """Test that an invalid store option raises ValueError"""
        with pytest.raises(ValueError) as exc_info:
            XhsStoreFactory.create_store()

        assert "Invalid save option" in str(exc_info.value)

    def test_all_stores_registered(self):
        """Test that all store types are registered"""
        expected_stores = ['csv', 'json', 'db', 'sqlite', 'mongodb', 'excel']

        for store_type in expected_stores:
            assert store_type in XhsStoreFactory.STORES

        assert len(XhsStoreFactory.STORES) == len(expected_stores)