领域本体学习语料的自动获取与预处理方法研究 | |
Alternative Title | Research on Automatic Acquisition and Preprocessing Methods of Domain Ontology Learning Corpus |
王思丽1,2,3 | |
2019-10-25 | |
Source Publication | 图书馆学研究 |
ISSN | 1001-0424 |
Issue | 20 |
Pages | 54-64 |
Contribution Rank | 1 |
Abstract | [Purpose/Significance] To automatically acquire and preprocess domain corpora, providing the data and data-processing foundation for automatic domain ontology construction driven by machine learning and deep learning. [Method/Process] First, the types of corpora involved, their acquisition methods, and the state of applied research are analyzed, and automatic acquisition methods for multi-source heterogeneous domain corpora are proposed, including Web Spider-based acquisition of open web domain corpora and Web API-based acquisition of scientific literature domain corpora. Second, a method for automatically constructing a dictionary of basic domain knowledge is proposed, laying the groundwork for corpus preprocessing. Finally, after testing and evaluating mainstream word segmentation methods and open-source word segmentation tools, a multi-strategy hybrid method for automatic word segmentation and new word discovery is proposed, based on an incrementally trained HanLP-SP domain segmentation model, and experiments are carried out. [Result/Conclusion] The method effectively acquires domain corpora and accomplishes preprocessing tasks such as word segmentation. |
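Keyword | domain corpus ; ontology learning ; automatic acquisition ; preprocessing ; word segmentation |
MOST Discipline Catalogue | Management ; Management::Library, Information and Archives Management |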
Indexed By | CSSCI ; A Guide to the Core Journals of China (中文核心期刊要目总览) |
Language | Chinese |
Funding Project | Research on Automatic Construction Methods of Domain Ontology Based on Deep Learning (基于深度学习的领域本体自动构建方法研究) |
Document Type | Journal article |
Identifier | http://ir.las.ac.cn/handle/12502/10533 |
Collection | 中国科学院兰州文献情报中心_资源系统建设部 |
Affiliation | 1.中国科学院西北生态环境资源研究院 文献情报中心 2.中国科学院兰州文献情报中心 3.中国科学院大学 |
First Author Affiliation | 中国科学院文献情报中心 |
Recommended Citation GB/T 7714 | 王思丽,祝忠明,刘巍,等. 领域本体学习语料的自动获取与预处理方法研究[J]. 图书馆学研究,2019(20):54-64. |
APA | 王思丽,祝忠明,刘巍,&杨恒.(2019).领域本体学习语料的自动获取与预处理方法研究.图书馆学研究(20),54-64. |
MLA | 王思丽,et al."领域本体学习语料的自动获取与预处理方法研究".图书馆学研究 .20(2019):54-64. |
Files in This Item: |
File Name/Size | DocType | Version | Access | License |
领域本体学习语料的自动获取与预处理方法研(909KB) | Journal article | Author accepted manuscript | Open access | CC BY-NC-SA |