mirror of
https://github.com/NanmiCoder/MediaCrawler.git
synced 2026-05-28 13:37:25 +08:00
feat: add JSONL storage format support and make jsonl the default storage format
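JSONL (JSON Lines) stores one JSON object per line and is written in append mode, so existing data never has to be read back. With large data volumes it performs far better than the JSON format.

- Add the core method AsyncFileWriter.write_to_jsonl()
- Add a JsonlStoreImplement class for each of the 7 platforms and register it with the factory
- Change the configured default from json to jsonl, and update the CLI/API enums to match
- Add jsonl to the guard condition in db_session.py so that it no longer triggers a ValueError by mistake
- Word cloud generation can read JSONL files, preferring jsonl and falling back to json
- The existing json option is fully retained for backward compatibility
- Update the related documentation and tests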
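The append-mode behaviour described in the commit message can be illustrated with a minimal sketch. This is not the project's `AsyncFileWriter`: the class `JsonlWriterSketch`, its constructor arguments, the file naming scheme, and the helper `read_jsonl` are all assumptions made for illustration, and the standard library's `asyncio.to_thread` stands in for whatever async file I/O the real writer uses.

```python
import asyncio
import json
import tempfile
from pathlib import Path
from typing import Dict, List


class JsonlWriterSketch:
    """Hypothetical stand-in for AsyncFileWriter, not the project's code."""

    def __init__(self, base_dir: Path, platform: str, crawler_type: str) -> None:
        self.base_dir = base_dir
        self.platform = platform
        self.crawler_type = crawler_type
        self._lock = asyncio.Lock()  # serialize appends from concurrent tasks

    def _path(self, item_type: str) -> Path:
        # Assumed naming scheme, for illustration only.
        return self.base_dir / self.platform / f"{self.crawler_type}_{item_type}.jsonl"

    @staticmethod
    def _append_line(path: Path, line: str) -> None:
        path.parent.mkdir(parents=True, exist_ok=True)
        # Mode "a" appends; the existing file content is never read or rewritten.
        with path.open("a", encoding="utf-8") as f:
            f.write(line + "\n")

    async def write_to_jsonl(self, item_type: str, item: Dict) -> None:
        # json.dumps escapes embedded newlines, so each item stays on one line.
        line = json.dumps(item, ensure_ascii=False)
        async with self._lock:
            await asyncio.to_thread(self._append_line, self._path(item_type), line)


def read_jsonl(path: Path) -> List[Dict]:
    """Parse one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


async def demo() -> List[Dict]:
    with tempfile.TemporaryDirectory() as tmp:
        writer = JsonlWriterSketch(Path(tmp), platform="kuaishou", crawler_type="search")
        await writer.write_to_jsonl("contents", {"id": 1, "title": "第一条"})
        await writer.write_to_jsonl("contents", {"id": 2, "title": "second"})
        return read_jsonl(Path(tmp) / "kuaishou" / "search_contents.jsonl")


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A plain JSON array file would have to be loaded, extended, and rewritten on every save, so its cost grows with the file size. The append above costs the same no matter how large the file already is, which is the performance argument the commit message makes.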
@@ -36,6 +36,7 @@ class KuaishouStoreFactory:
         "db": KuaishouDbStoreImplement,
         "postgres": KuaishouDbStoreImplement,
         "json": KuaishouJsonStoreImplement,
+        "jsonl": KuaishouJsonlStoreImplement,
         "sqlite": KuaishouSqliteStoreImplement,
         "mongodb": KuaishouMongoStoreImplement,
         "excel": KuaishouExcelStoreImplement,
@@ -167,6 +167,21 @@ class KuaishouJsonStoreImplement(AbstractStore):
         pass
 
 
+class KuaishouJsonlStoreImplement(AbstractStore):
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+        self.writer = AsyncFileWriter(platform="kuaishou", crawler_type=crawler_type_var.get())
+
+    async def store_content(self, content_item: Dict):
+        await self.writer.write_to_jsonl(item_type="contents", item=content_item)
+
+    async def store_comment(self, comment_item: Dict):
+        await self.writer.write_to_jsonl(item_type="comments", item=comment_item)
+
+    async def store_creator(self, creator: Dict):
+        pass
+
+
 class KuaishouSqliteStoreImplement(KuaishouDbStoreImplement):
     async def store_creator(self, creator: Dict):
         pass
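To show why registering the new class under the "jsonl" key is enough to make the option selectable, here is a minimal sketch of dictionary-based factory dispatch. Everything in it is hypothetical: `StoreSketch`, `JsonStoreSketch`, `JsonlStoreSketch`, `StoreFactorySketch`, and `create_store` are illustrative names, and the lookup inside the real KuaishouStoreFactory is not shown in this diff.

```python
import asyncio
from typing import Dict, List, Type


class StoreSketch:
    """Hypothetical base class standing in for AbstractStore."""

    def __init__(self) -> None:
        self.saved: List[Dict] = []  # in-memory stand-in for real persistence

    async def store_content(self, content_item: Dict) -> None:
        raise NotImplementedError


class JsonStoreSketch(StoreSketch):
    async def store_content(self, content_item: Dict) -> None:
        self.saved.append({"format": "json", **content_item})


class JsonlStoreSketch(StoreSketch):
    async def store_content(self, content_item: Dict) -> None:
        self.saved.append({"format": "jsonl", **content_item})


class StoreFactorySketch:
    # A new backend becomes selectable by adding one entry to this mapping.
    STORES: Dict[str, Type[StoreSketch]] = {
        "json": JsonStoreSketch,
        "jsonl": JsonlStoreSketch,
    }

    @staticmethod
    def create_store(save_option: str) -> StoreSketch:
        store_class = StoreFactorySketch.STORES.get(save_option)
        if store_class is None:
            raise ValueError(f"unsupported save option: {save_option!r}")
        return store_class()


if __name__ == "__main__":
    store = StoreFactorySketch.create_store("jsonl")
    asyncio.run(store.store_content({"id": 1}))
    print(store.saved)
```

With this layout the calling code only passes the configured option string, so keeping the "json" entry beside the new "jsonl" entry preserves the old behaviour for existing configurations.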