领域本体学习语料的自动获取与预处理方法研究
Alternative Title: Research on Automatic Acquisition and Preprocessing Methods of Domain Ontology Learning Corpus
王思丽1,2,3; 祝忠明1,2; 刘巍1,2; 杨恒1,2
2019-10-25
Source Publication: 图书馆学研究
ISSN: 1001-0424
Issue: 20; Pages: 54-64
Contribution Rank: 1
Abstract

[Purpose/Significance] To realize the automatic acquisition and preprocessing of domain corpora, providing the data and data-processing foundation for the automatic construction of domain ontologies driven by machine learning and deep learning. [Method/Process] First, the types of corpora involved, their acquisition methods, and the state of applied research are analyzed, and methods for automatically acquiring multi-source heterogeneous domain corpora are proposed, including Web Spider-based acquisition of open web domain corpora and Web API-based acquisition of scientific literature domain corpora. Second, a method for automatically constructing a dictionary of basic domain knowledge is proposed, laying the foundation for corpus preprocessing. Finally, after testing and evaluating mainstream word segmentation methods and open-source word segmentation tools, a multi-strategy hybrid method for automatic word segmentation and new word discovery, based on an incrementally trained HanLP-SP domain segmentation model, is proposed and studied experimentally. [Result/Conclusion] The method can effectively acquire domain corpora and carry out preprocessing tasks such as word segmentation.
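
The abstract describes dictionary-supported word segmentation combined with new word discovery. The sketch below illustrates that general idea only. It is not the authors' HanLP-SP model or their multi-strategy method, and it does not use HanLP at all. It swaps in plain forward maximum matching against a small domain dictionary, then flags runs of adjacent out-of-dictionary single characters as candidate new words. The function names, the `max_len` and `min_count` parameters, and the sample dictionary and sentence are all assumptions made for illustration.

```python
from collections import Counter


def fmm_segment(text, dictionary, max_len=6):
    """Forward maximum matching: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until a dictionary
        # word matches; fall back to a single character if nothing does.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def new_word_candidates(tokens, dictionary, min_count=2):
    """Return runs of adjacent out-of-dictionary single characters seen at least min_count times."""
    counts = Counter()
    run = ""
    for tok in tokens + [""]:  # the empty sentinel flushes the final run
        if len(tok) == 1 and tok not in dictionary:
            run += tok
        else:
            if len(run) > 1:
                counts[run] += 1
            run = ""
    return [word for word, count in counts.items() if count >= min_count]


# Hypothetical sample data: "冻土" (permafrost) is deliberately missing from the dictionary.
domain_dict = {"语料", "本体", "和"}
tokens = fmm_segment("冻土语料和冻土本体", domain_dict)
print(tokens)                                    # ['冻', '土', '语料', '和', '冻', '土', '本体']
print(new_word_candidates(tokens, domain_dict))  # ['冻土']
```

In a pipeline of the kind the abstract outlines, candidates found this way would presumably be reviewed and fed back into the domain dictionary or the training data. That feedback step is an assumption here, not something the record states.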

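Keyword: domain corpus; ontology learning; automatic acquisition; preprocessing; word segmentation
MOST Discipline Catalogue: Management ; Management::Library, Information and Archives Management
Indexed By: CSSCI ; 中文核心期刊要目总览 (A Guide to the Core Journals of China)
Language: Chinese
Funding Project: Research on Automatic Domain Ontology Construction Methods Based on Deep Learning
Document Type: Journal article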
Identifier: http://ir.las.ac.cn/handle/12502/10533
Collection: 中国科学院兰州文献情报中心_资源系统建设部
Affiliation: 1. 中国科学院西北生态环境资源研究院 文献情报中心
2. 中国科学院兰州文献情报中心
3. 中国科学院大学
First Author Affiliation: 中国科学院文献情报中心
Recommended Citation
GB/T 7714: 王思丽,祝忠明,刘巍,等. 领域本体学习语料的自动获取与预处理方法研究[J]. 图书馆学研究,2019(20):54-64.
APA: 王思丽,祝忠明,刘巍,&杨恒.(2019).领域本体学习语料的自动获取与预处理方法研究.图书馆学研究(20),54-64.
MLA: 王思丽,et al."领域本体学习语料的自动获取与预处理方法研究".图书馆学研究.20(2019):54-64.
Files in This Item:
File Name/Size | DocType | Version | Access | License
领域本体学习语料的自动获取与预处理方法研 (909KB) | Journal article | Author accepted manuscript | Open access | CC BY-NC-SA
File name: 领域本体学习语料的自动获取与预处理方法研究-王思丽-修改稿20190803.pdf
Format: Adobe PDF

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.