Identifying Scientific Project-generated Data Citation from Full-text Articles: An Investigation of TCGA Data Citation
Jiao Li; Si Zheng; Hongyu Kang; Zhen Hou; Qing Qian (E-mail: qian.qing@imicams.ac.cn)
2016-06-17
Journal: Journal of Data and Information Science
Volume: 1; Issue: 2; Pages: 32-44
Abstract

Purpose: In the open science era, it is typical to share project-generated scientific data by depositing it in an open and accessible database. Moreover, scientific publications are preserved in a digital library archive. It is challenging to identify the data usage that is mentioned in literature and associate it with its source. Here, we investigated the data usage of a government-funded cancer genomics project, The Cancer Genome Atlas (TCGA), via a full-text literature analysis.
Design/methodology/approach: We focused on identifying articles that use the TCGA dataset and on constructing linkages between those articles and the specific TCGA dataset. First, we collected 5,372 TCGA-related articles from PubMed Central (PMC). Second, we constructed a benchmark set of 25 full-text articles that truly used the TCGA data in their studies, and we summarized the key features of the benchmark set. Third, we applied the key features to the remaining full-text articles collected from PMC.
Findings: The number of publications that use TCGA data has increased significantly since 2011, although the TCGA project was launched in 2005. Additionally, we found that the critical areas of focus in the studies that use the TCGA data were glioblastoma multiforme, lung cancer, and breast cancer; meanwhile, data from the RNA-sequencing (RNA-seq) platform were the most preferred for use.
Research limitations: The current workflow to identify articles that truly used TCGA data is labor-intensive. An automatic method is expected to improve the performance.
Practical implications: This study will help cancer genomics researchers determine the latest advancements in cancer molecular therapy, and it will promote data sharing and data-intensive scientific discovery.
Originality/value: Few studies have been conducted to investigate data usage by government-funded projects/programs since their launch. In this preliminary study, we extracted articles that use TCGA data from PMC, and we created a link between the full-text articles and the source data.


Article type: Research Paper
Keywords: Scientific data; Full-text literature; Open access; PubMed Central; Data citation
Subject area: Journalism and Communication; Library, Information and Documentation Science
DOI: 10.20309/jdis.201600
Indexed by: Other
Project number: Grant No. 13R0101
Language: English
Funding: the Fundamental Research Funds for the Central Universities; the National Population and Health Scientific Data Sharing Program of China; the Knowledge Centre for Engineering Sciences and Technology (Medical Centre)
Document type: Journal article
Item identifier: http://ir.las.ac.cn/handle/12502/8596
Collection: Journal of Data and Information Science_Journal of Data and Information Science-2016
Corresponding author: Qing Qian (E-mail: qian.qing@imicams.ac.cn).
Affiliation: Institute of Medical Information and Library, Chinese Academy of Medical Sciences, Beijing 100020, China
Recommended citation
GB/T 7714
Jiao Li, Si Zheng, Hongyu Kang, et al. Identifying Scientific Project-generated Data Citation from Full-text Articles: An Investigation of TCGA Data Citation[J]. Journal of Data and Information Science, 2016, 1(2): 32-44.
APA Jiao Li, Si Zheng, Hongyu Kang, Zhen Hou, & Qing Qian. (2016). Identifying Scientific Project-generated Data Citation from Full-text Articles: An Investigation of TCGA Data Citation. Journal of Data and Information Science, 1(2), 32-44.
MLA Jiao Li, et al. "Identifying Scientific Project-generated Data Citation from Full-text Articles: An Investigation of TCGA Data Citation." Journal of Data and Information Science 1.2 (2016): 32-44.
Files in this item
File: 20160204.pdf (1807KB); Document type: Journal article; Version: Published version; Access: Open access; License: CC BY-NC-ND