Title: Can Automatic Classification Help to Increase Accuracy in Data Collection?
Author: Frederique Lang (1); Diego Chavarro (1); Yuxian Liu (2)
Source: Journal of Data and Information Science
Issued Date: 2016-09-18
Volume: 1, Issue:3, Pages:42-58
Keyword: Disambiguation ; Machine learning ; Data cleaning ; Classification ; Accuracy ; Recall ; Coverage
Subject: Journalism & Communication; Library, Information & Documentation Science
Indexed Type: Other
DOI: 10.20309/jdis.201619
Corresponding Author: Yuxian Liu (E-mail: yxliu@tongji.edu.cn).
DOC Type: Research Papers
Abstract:
Purpose: The authors aim to test the performance of a set of machine learning algorithms that could improve the process of data cleaning when building datasets.
Design/methodology/approach: The paper centers on cleaning datasets gathered from publishers and online resources through the use of specific keywords. In this case, we analyzed data from the Web of Science. The accuracy of various forms of automatic classification was tested against manual coding in order to determine their usefulness for data collection and cleaning. We assessed the performance of seven supervised classification algorithms (Support Vector Machine (SVM), Scaled Linear Discriminant Analysis, Lasso and elastic-net regularized generalized linear models, Maximum Entropy, Regression Tree, Boosting, and Random Forest) and analyzed two properties: accuracy and recall. We assessed not only each algorithm individually, but also their combinations through a voting scheme. We also tested the performance of these algorithms with different sizes of training data. When assessing the performance of different combinations, we used an indicator of coverage to account for the agreement and disagreement in classification between algorithms.
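A minimal sketch of this kind of pipeline, assuming Python with scikit-learn as the toolkit; the record does not include the authors' code, so the input file, column names, and the scikit-learn stand-ins chosen for SVM and Boosting are assumptions for illustration only:

# Illustrative sketch, not the authors' implementation: train supervised classifiers
# on a small manually coded sample of Web of Science records and evaluate them.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC                         # stand-in for the SVM classifier
from sklearn.ensemble import GradientBoostingClassifier   # stand-in for Boosting
from sklearn.metrics import accuracy_score, recall_score

records = pd.read_csv("wos_records.csv")                # hypothetical keyword-based WoS download
texts, labels = records["text"], records["relevant"]    # manual codes: 1 = on-topic, 0 = off-topic

# Keep the manually coded training sample small (e.g. 10%) and classify the rest automatically.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, train_size=0.1, stratify=labels, random_state=0)

vectorizer = TfidfVectorizer(min_df=2)
Xtr = vectorizer.fit_transform(X_train)
Xte = vectorizer.transform(X_test)

predictions = {}
for name, clf in {"SVM": LinearSVC(), "Boosting": GradientBoostingClassifier()}.items():
    clf.fit(Xtr, y_train)
    predictions[name] = clf.predict(Xte)
    print(name,
          "accuracy:", round(accuracy_score(y_test, predictions[name]), 3),
          "recall:", round(recall_score(y_test, predictions[name]), 3))

A simple voting scheme extends this by adding the remaining classifiers to the dictionary and taking the majority label across their predictions.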
Findings: We found that the performance of the algorithms used varies with the size of the training sample. However, for the classification exercise in this paper, the best-performing algorithms were SVM and Boosting. The combination of these two algorithms achieved a high level of agreement (coverage) and was highly accurate. This combination performs well with a small training dataset (10%), which may reduce the manual work needed for classification tasks.
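Continuing the hypothetical sketch above, coverage for the SVM and Boosting combination can be read as the share of records on which the two classifiers agree, and accuracy can then be checked on that agreed subset:

# Hedged continuation of the illustrative sketch above, not the authors' code.
pred_svm, pred_boost = predictions["SVM"], predictions["Boosting"]

agree = pred_svm == pred_boost          # records that both classifiers label the same way
coverage = agree.mean()                 # share of records covered by the agreement
accuracy_on_agreed = accuracy_score(y_test[agree], pred_svm[agree])

print(f"coverage: {coverage:.1%}, accuracy on agreed records: {accuracy_on_agreed:.1%}")
# Records on which the two classifiers disagree (1 - coverage) can be routed to manual coding.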
Research limitations: The dataset gathered has significantly more records related to the topic of interest than records on unrelated topics. This may affect the performance of some algorithms, especially in their identification of unrelated papers.
Practical implications: Although the classification achieved by this means is not completely accurate, the amount of manual coding needed can be greatly reduced by using classification algorithms. This can be of great help when the dataset is large. With the help of accuracy, recall, and coverage measures, it is possible to estimate the error involved in this classification, which opens the possibility of incorporating these algorithms into software specifically designed for data cleaning and classification.
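As a purely hypothetical back-of-envelope example (the figures below are invented for illustration and are not taken from the paper), the remaining manual workload can be estimated from the training share and the coverage of the chosen combination:

# Invented numbers, for illustration only.
n_records   = 10_000      # size of the raw keyword-based download
train_share = 0.10        # fraction coded by hand to train the classifiers
coverage    = 0.95        # assumed agreement rate of the combined classifiers

n_training  = int(train_share * n_records)
n_disagreed = int((1 - coverage) * (n_records - n_training))   # routed back to manual review

manual_work = n_training + n_disagreed
print(f"hand-coded records: {manual_work} of {n_records} "
      f"({manual_work / n_records:.0%} of the original workload)")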
Originality/value: We analyzed the performance of seven algorithms and whether combinations of these algorithms improve accuracy in data collection. Use of these algorithms could reduce the time needed for manual data cleaning.
Project Number: No.: 71173154 ; No.: 08BZX076
Project: The authors are grateful to Peter Bone for his help with the proofreading of this paper. We also wish to thank Jose Christian for helpful discussion and support. Thanks also go to the colleagues who gave us comments during the 4th Global TechMining Conference and in the Science Policy Research Unit (SPRU) Wednesday Seminar. Yuxian Liu's work was supported by the National Natural Science Foundation of China (NSFC) (Grant No. 71173154), the National Social Science Fund of China (NSSFC) (Grant No. 08BZX076), and the Fundamental Research Funds for the Central Universities.
Related URLs: View full text
Language: English
Content Type: Journal article
URI: http://ir.las.ac.cn/handle/12502/8731
Appears in Collections: Journal of Data and Information Science > Journal of Data and Information Science-2016 > Journal Articles

Files in This Item:
File Name: 20160304.pdf
File Size: 945 KB
Content Type: Journal article
Version: Author's accepted manuscript
Access: Open access

Institution: 1. Science Policy Research Unit (SPRU), School of Business, Management and Economics, University of Sussex, Falmer, Brighton, BN1 9SL, United Kingdom
2. Tongji University Library, Tongji University, Shanghai 200092, China

Recommended Citation:
Frederique Lang, Diego Chavarro, Yuxian Liu. Can Automatic Classification Help to Increase Accuracy in Data Collection?[J]. Journal of Data and Information Science, 2016, 1(3): 42-58.
