An analysis of data paper templates and guidelines: types of contextual information described by data journals

Article information

Sci Ed. 2020;7(1):16-23
Publication date (electronic) : 2020 February 20
doi: https://doi.org/10.6087/kcse.185
Department of Library and Information Science, Ewha Womans University, Seoul, Korea
Correspondence to Jihyun Kim kim.jh@ewha.ac.kr
Received 2020 January 20; Accepted 2020 January 27.

Abstract

Purpose

Data papers are a promising genre of scholarly communication, in which research data are described, shared, and published. Rich documentation of data, including adequate contextual information, enhances the potential of data reuse. This study investigated the extent to which the components of data papers specified by journals represented the types of contextual information necessary for data reuse.

Methods

A content analysis of 15 data paper templates/guidelines from 24 data journals indexed by the Web of Science was performed. A coding scheme was developed based on previous studies, consisting of four categories: general data set properties, data production information, repository information, and reuse information.

Results

Only a few types of contextual information were commonly requested by the journals. Except for data format information and file names, general data set properties were specified less often than other categories of contextual information. Researchers were frequently asked to provide data production information, such as information on the data collection, data producer, and related project. Repository information focused on data identifiers, while information about repository reputation and curation practices was rarely requested. Reuse information mostly involved advice on the reuse of data and terms of use.

Conclusion

These findings imply that data journals should provide a more standardized set of data paper components to inform reusers of relevant contextual information in a consistent manner. Information about repository reputation and curation could also be provided by data journals to complement the repository information provided by the authors of data papers and to help researchers evaluate the reusability of data.

Introduction

Data sharing is an emerging scholarly communication practice that facilitates the progress of science by making data accessible, verifiable, and reproducible [1]. There are several ways of sharing data, including personally exchanging data sets, posting data on researchers’ or laboratories’ websites, and depositing data sets in repositories. A relatively novel means of releasing data sets is the publication of data papers, which describe how data were collected, processed, and verified, thereby improving data provenance [2]. Data papers are published by data journals, and the publication process is similar to that of conventional journals, in that data papers and data are both peer-reviewed, amended, and publicly accessible under unique identifiers [3]. Since data papers take the form of academic papers and can be cited by primary research articles, credit can be given to data creators [4].

Data papers contain facts about data instead of hypotheses and arguments resulting from an analysis of the data, as commonly presented in traditional research articles [5]. Their primary purpose is thus to explain data sets by providing “information on the what, where, why, how, and who of the data” [6]. The primary advantage of data papers is their rich documentation of data, which is essential for data reuse. A data paper is usually short and consists of an abstract, collection methods, and a description of the relevant data set(s) [7].

However, previous studies have identified a lack of shared templates/guidelines for data papers across data journals. Candela et al. identified 10 classes of data paper components recommended by data journals: availability, competing interests, coverage, format, license, microattribution, project, provenance, quality, and reuse. The authors noted that a unique identifier indicating data availability, such as a DOI or URI, was the only information provided by all the data journals they examined. Less than half of the data journals asked for information on coverage, license, microattribution, project, and reuse [8]. In addition, Chen explained that data paper templates/guidelines mostly focus on single data sets—that is, on the item level—and only a few provide collection-level descriptions of data, such as multiple data sets or databases. The author suggested that the granularity of research data that a data paper describes should be specified by data journals [9].

The lack of standardization and the problem of granularity in describing data have been discussed in other studies regarding data documentation and metadata [10-12]. Those studies also indicated that documenting an adequate amount of proper contextual information about data would increase the potential for data reuse. The underlying goal of publishing data papers likewise is to enable data reuse, and such papers are expected to address the challenges that elicit “data reuse failure” [13,14]. In this context, the present study examined the types of contextual information that data journals request to be described and determined how common or variable these types of information are across such journals.

This study aimed to identify the components of a data paper as defined by the templates/guidelines of 24 data journals indexed by Web of Science (WoS), with the document type restricted to data papers. The data paper components were mapped onto the types of contextual information suggested by previous studies [15,16]. Therefore, it may be possible to determine the extent to which data papers published in various journals cover the contextual information that researchers need for data reuse and to identify common and unique components across the data journals. The results would help researchers better understand areas for improvement in the guidance provided by data journals for documenting data and in the roles of data journals.

Methods

This study initially identified a broad set of data journals on the basis of 1) two studies [8,9] that conducted a content analysis of data paper templates and/or guidelines and 2) a list of data journals reported by Akers [17]. Candela et al. [8] analyzed 116 data journals from 15 publishers via web-based searches. Chen [9] created an initial list of 93 data journals on the basis of Akers’ list and searches on UlrichsWeb and selected 26 data journals from 16 publishers with appropriate consideration of disciplinary domains. Excluding duplicate journals from the two studies resulted in 106 data journals. As previous studies [8,9] suggested, the vast majority of the data journals were mixed journals (i.e., journals publishing any type of paper, including data papers), and pure data journals (i.e., journals publishing only data papers) accounted for only a small proportion.

This study utilized WoS as a tool for selecting the data journals for the analysis. Despite debates over the trustworthiness of the journal impact factors generated by WoS, journals indexed by WoS generally maintain good standing because they must meet the quality criteria set by the database. Of the 106 data journals, 79 were indexed by WoS. The search was restricted to the “data papers” document type in the advanced search function of the database, and 24 data journals were ultimately selected (Fig. 1) [8,9]. Eighteen of those 24 journals overlapped with those examined by either or both of the aforementioned studies (10 by Candela et al. [8], 1 by Chen [9], and 7 by both). The six other journals, namely, BioInvasions Records, Data, Ecological Research, Journal of Hymenoptera Research, Frontiers in Marine Science, and Comparative Cytogenetics, were also analyzed in this study.
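The selection funnel described above (merging two source lists, restricting to WoS-indexed journals, then restricting to the “data papers” document type) can be sketched as set operations. This is only an illustrative sketch, not the author's procedure; the journal names below are placeholders, and the comments give the counts from the actual study.

```python
# Hypothetical sketch of the journal-selection funnel; names are placeholders.
candela_list = {"Journal A", "Scientific Data", "Data in Brief"}  # 116 journals in [8]
chen_list = {"Scientific Data", "Journal B"}                      # 26 journals in [9]
combined = candela_list | chen_list                               # union drops duplicates (106 in the study)

wos_indexed = {"Scientific Data", "Data in Brief", "Journal B"}   # 79 of the 106 were WoS-indexed
candidates = combined & wos_indexed

# Restrict to journals whose WoS records include the "data papers"
# document type (24 journals in the study).
with_data_paper_type = {"Scientific Data", "Data in Brief"}
selected = candidates & with_data_paper_type
print(sorted(selected))
```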

Fig. 1.

The process of finalizing the selection of the 24 data journals analyzed in this study [8,9].

Of the 24 data journals, seven were published by Springer Nature and six by Pensoft (Appendix 1). All the journals published by Pensoft used the same data paper guideline. Among the seven Springer Nature journals, five BioMed Central (BMC) journals shared a single data paper guideline. The remaining data journals provided their own data paper templates and/or guidelines. Therefore, this study collected 15 distinctive data paper templates and/or guidelines for the analysis.

To investigate the contextual information covered by the data papers, I used the types of contextual information suggested by Faniel et al. [15] and Chin and Lansing [16], who elaborated a range of data contexts reflecting the perspectives of data reusers. Chin and Lansing [16] originally proposed various attributes of scientific and social contexts that facilitated data sharing in biological science collaboratories. Four of these contextual attributes are particularly relevant to the scientific context, and therefore were employed for the analysis (Table 1) [8,9,15,16]. I then mapped the types of contextual information onto the data paper components identified by Candela et al. [8] and Chen [9]. This mapping enabled a preliminary assessment of the relationship between data paper components and contextual information and the development of a coding scheme for the content analysis.

Mapping between types of contextual information and data paper components

Table 1 shows that not all the types of contextual information suggested by Faniel et al. [15] matched the data paper components. Specifically, no data paper component identified by the previous studies corresponded to “data analysis,” “missing data,” “research objectives,” “repository reputation and history,” or “curation and digitization.” The term “provenance” was also defined inconsistently across the studies: Candela et al. [8] defined it as the “methodology leading to the production of the data set,” which is closer to “data collection” than to the definition given by Faniel et al. [15], “sources of the material or traceability.”

“General data set properties,” suggested by Chin and Lansing [16], corresponded to a data paper component relating to the description of data formats, versions, and creation dates. Moreover, “project,” mentioned by Candela et al. [8] to refer to information about the initiatives within which data are generated, was the only component that did not match any of the types of contextual information. “Funding statement,” identified by Chen [9], was also related to the “project” element. Awareness of the project and the funding sources that led to data creation would be useful when considering the possibility of data reuse. Thus, this study treated project information as additional contextual information.

The coding scheme for analyzing the data paper components was largely based on the types of contextual information suggested by Faniel et al. [15]. In addition, the “general data set properties” component, which was noted by Chin and Lansing [16], was added to the coding scheme. The “project” component was included under the “data production information” category, which was proposed by Faniel et al. [15], since it was a contextual factor relevant to data creation.
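The resulting scheme can be expressed as a simple data structure. The category and code names below follow the paper; the dictionary representation itself is only an illustrative assumption, not the author's actual coding instrument.

```python
# Illustrative sketch of the four-category coding scheme; category and code
# names follow the paper, the dict representation is an assumption.
coding_scheme = {
    "general data set properties": [
        "data creation date", "data format", "data type", "data version",
        "file name/title", "file size", "language",
    ],
    "data production information": [
        "data collection", "specimens and artifacts", "data producer",
        "data analysis", "missing data", "research objectives", "project",
    ],
    "repository information": [
        "provenance", "repository reputation and history",
        "curation and digitization",
    ],
    "data reuse information": [
        "prior reuse", "advice on reuse", "terms of use",
    ],
}

# Coding a template/guideline then reduces to recording, per journal,
# which of these codes its instructions mention.
print(len(coding_scheme))  # 4 categories
```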

Results

The types of contextual information examined in this study were categorized into four groups: general data set properties, data production information, repository information, and data reuse information. Concerning general data set properties, the study explored whether the 15 data paper templates and/or guidelines required the authors to describe the attributes listed in Table 2. Data creation dates, formats, and versions were mentioned by Chen [9], and the remaining properties were identified during the coding process.

General data set properties identified in the data paper templates/guidelines

Data file name/title was the most frequently specified property, requested by eight of the 15 templates/guidelines (53.3%), and a data format description was requested by seven (46.7%). The remaining data set properties were required infrequently, and descriptions of data creation dates and languages were rare. Data type was defined differently across the journals; for example, Journal of Open Archaeology Data (JOAD) distinguished among primary data, secondary data, processed data, interpretation of data, and final reports, whereas Data in Brief (DiB) classified data by type into tables, images, charts, graphs, and figures.
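The reported shares follow from simple arithmetic over the 15 templates/guidelines; the helper function below is purely illustrative.

```python
# Worked arithmetic for the reported frequencies: counts out of the
# 15 templates/guidelines analyzed in the study.
TOTAL = 15

def share(count: int, total: int = TOTAL) -> float:
    """Percentage of templates/guidelines specifying a given property."""
    return round(100 * count / total, 1)

print(share(8))  # file name/title -> 53.3
print(share(7))  # data format    -> 46.7
```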

Data production information tended to be requested more frequently by the data journals than general data set properties (Table 3). All the templates and guidelines required information relating to data collection, mainly regarding data collection steps, sampling strategies, and quality control mechanisms. Descriptions of data producers were required by nine templates/guidelines, five of which specifically asked for information about the data creators (author list of data sets [GigaScience] or creators [Jpensoft, Geoscience Data Journal, Data, JOAD]). The four remaining journals asked for a description of the authors’ contributions or information, which possibly corresponded to the role of data creator. The “project” component was mentioned by seven templates/guidelines, and only two of these required overall project descriptions (Ecology and Jpensoft). The remaining five journals required information about funding.

Data production information identified in the data paper templates/guidelines

Specimen and artifact information was requested by five templates/guidelines of biological science, geoscience, and archaeology journals, and researchers in these disciplines needed such information for data reuse [15]. Temporal, spatial, or taxonomic coverage (Jpensoft, JOAD), sample availability or location (Earth System Science Data, DiB), and descriptions of organisms or tissues (GigaScience) were identified. Information about data analysis, missing data, and research objectives was also requested by four or five templates/guidelines. Data analysis information involved how data were processed, and missing data information dealt with data anomalies or noise. The research objectives were often expressed as motivations or rationales for collecting the data sets.

In terms of repository information, all but one journal asked authors to describe the provenance of the data, indicating the identifier or location of the data (Table 4). Data provenance also refers to the relationship of data with other materials, although only one journal (DiB) required a description of any research articles related to the data. None of the data journals asked for descriptions of repository reputation and history. Regarding curation and digitization, one journal (Ecology) required information on archival procedures, including a description of how the data were archived for long-term storage and access.

Repository information and data reuse information identified in the data paper templates/guidelines

Data reuse information, mostly concerning advice on reuse and terms of use, was also requested (Table 4). Regarding advice on reuse, authors were required to describe the potential reuse and value of their data. Data journals also required authors to describe the terms of use, primarily relating to competing interests, although several other aspects were covered as well, including ethical approval and consent for participation, consent for publication, license, copyright, and accessibility requirements.

Discussion

The findings revealed common and unique types of contextual information that the data journals requested authors to describe. The most common form of contextual information documented by the journals was data collection methods, followed by data provenance (repository locations and/or data identifiers). More than half, or almost half, of the templates/guidelines specified general data set properties (data file names/titles and data formats), data production information (including data producer and project), and reuse information (including advice on reuse and terms of use). The results are mostly consistent with those of previous studies [8,9]. However, although Candela et al. [8] noted that descriptions of reuse information are often neglected by data journals, most of the data journals examined in this study addressed the potential reuse of data. In terms of data provenance (indicating the relationship of data to other objects), only one journal (DiB) in this study required information on this relationship, although Chen [9] identified more instances of this information being required.

The types of contextual information that the data journals never or rarely requested included repository information (repository reputation and history, and curation and digitization) and data set properties (data creation dates and languages). In particular, Faniel et al. [15] stated that repository reputation and history are less easily documented since they are more social and relative than other types of context. Two of the data journals examined in this study (Scientific Data and Earth System Science Data) provided criteria for recommending data repositories, namely, data access conditions and long-term availability [18,19]. The provision of such repository information by data journals would help reusers understand the trustworthiness of the repositories where certain data are deposited. While data creation dates would help reusers develop sampling frames and identify changes in data creation contexts [20], only one journal asked for this information.

Data production information regarding data analysis, missing data, and research objectives was not specified in the studies of Candela et al. [8] and Chen [9] (Table 1). However, four or five of the templates/guidelines asked for a description of such information. Furthermore, data journals infrequently requested information on data version, file size, and prior reuse, each of which was mentioned by only three of the templates/guidelines.

Overall, only a small amount of contextual information was commonly requested by the data journals. They tended to focus more on data production information (data collection, data producer, and project) and reuse information (potential reuse and terms of use) than general data set properties or repository information. With the exception of file names and data formats, descriptions of data set properties were generally lacking. Repository information mostly involved unique identifiers of data, but information about repositories’ reputation or their curation practices could be provided by data journals to help readers of data papers assess the reusability of data.

In conclusion, the present study examined types of contextual information that data journals asked authors to describe and determined the extent of variation in this information across certain data journals. The primary motivation of publishing data papers is to make data reusable and reproducible, and data papers should provide extensive data documentation that reflects sufficient contextual information. This study suggests that data journals should provide a more standardized set of data paper components to inform reusers of the various types of context in a consistent manner. Furthermore, data journals should not only require data availability information, but also provide details about the quality of data repositories that would complement the repository information described by data paper authors.

Notes

No potential conflict of interest relevant to this article was reported.

References

1. Rowhani-Farid A, Allen M, Barnett AG. What incentives increase data sharing in health and medical research? A systematic review. Res Integr Peer Rev 2017;2:4. https://doi.org/10.1186/s41073-017-0028-9.
2. Pasquetto IV, Randles BM, Borgman CL. On the reuse of scientific data. Data Sci J 2017;16:8. http://doi.org/10.5334/dsj-2017-008.
3. Gray S. Case study: publishing a data paper [Internet]. [place unknown]: DOCZZ; 2015. [cited 2020 Jan 15]. Available from: http://doczz.net/doc/6729472/case-study%E2%80%94publishing-a-data-paper---research-data-service.
4. Jefferies N, Murphy F, Ranganathan A, Murray H. Data2paper: giving researchers credit for their data. Publications 2019;7:36. https://doi.org/10.3390/publications7020036.
5. Chavan V, Penev L. The data paper: a mechanism to incentivize data publishing in biodiversity science. BMC Bioinformatics 2011;12 Suppl 15:S2. https://doi.org/10.1186/1471-2105-12-S15-S2.
6. Callaghan S, Donegan S, Pepler S, et al. Making data a first class scientific output: data citation and publication by NERC’s environmental data centres. Int J Digit Curation 2012;7:107–13. http://www.ijdc.net/article/view/208.
7. Kratz J, Strasser C. Data publication consensus and controversies. F1000Research 2014;3:94. https://doi.org/10.12688/f1000research.3979.3.
8. Candela L, Castelli D, Manghi P, Tani A. Data journals: a survey. J Assoc Inf Sci Technol 2015;66:1747–62. https://doi.org/10.1002/asi.23358.
9. Chen YN. An analysis of characteristics and structures embedded in data papers: a preliminary study. Libellarium 2017;Mar. [Epub]. http://dx.doi.org/10.15291/libellarium.v9i2.266.
10. Atici L, Kansa SW, Lev-Tov J, Kansa EC. Other people’s data: a demonstration of the imperative of publishing primary data. J Archaeol Method Theory 2013;20:663–81. https://doi.org/10.1007/s10816-012-9132-9.
11. Friedhoff S, Meier zu Verl C, Pietsch C, et al. Social research data: documentation, management, and technical implementation within the SFB 882 [Internet]. Bielefeld: University of Bielefeld; 2013. [cited 2020 Jan 15]. Available from: https://pub.uni-bielefeld.de/download/2560035/2560036/SFB_882_WP_0016_Friedhoff_Meier-zu-Verl_Pietsch_Meyer_Vompras_Liebig.pdf.
12. Kim J, Yakel E, Faniel IM. Exposing standardization and consistency issues in repository metadata requirements for data deposition. Coll Res Libr 2019;80:6. https://doi.org/10.5860/crl.80.6.843.
13. Rees J. Recommendations for independent scholarly publication of data sets. San Francisco, CA: Creative Commons; 2010.
14. Costello MJ, Michener WK, Gahegan M, Zhang ZQ, Bourne PE. Biodiversity data should be published, cited, and peer reviewed. Trends Ecol Evol 2013;28:454–61. https://doi.org/10.1016/j.tree.2013.05.002.
15. Faniel IM, Frank RD, Yakel E. Context from the data reuser’s point of view. J Doc 2019;75(6):1274–97. https://doi.org/10.1108/JD-08-2018-0133.
16. Chin G, Lansing CS. Capturing and supporting contexts for scientific data sharing via the biological sciences collaboratory. Paper presented at: 2004 ACM Conference on Computer Supported Cooperative Work; 2004. Nov. https://doi.org/10.1145/1031607.1031677.
17. Akers K. A growing list of data journals [Internet]. [place unknown]: Data@MLibrary; 2014. [cited 2020 Jan 20]. Available from: https://mlibrarydata.wordpress.com/2014/05/09/data-journals/.
18. Earth System Science Data. Repository criteria [Internet]. Gottingen: Copernicus Publications; [cited 2020 Jan 20]. Available from: https://www.earth-system-science-data.net/for_authors/repository_criteria.html.
19. Scientific Data. Suggesting additional repositories [Internet]. [place unknown]: Nature Research; [cited 2020 Jan 20]. Available from: https://www.nature.com/sdata/policies/data-policies#repo-suggest.
20. Zimmerman A. Not by metadata alone: the use of diverse forms of knowledge to locate data for reuse. Int J Digit Lib 2007;7:5–16. https://doi.org/10.1007/s00799-007-0015-8.

Appendices

Appendix 1. Twenty-four data journals selected for the analysis


Table 1.

Mapping between types of contextual information and data paper components

Types of contextual information (Faniel et al. [15]; Chin and Lansing [16]) mapped to data paper components (Candela et al. [8]; Chen [9]); “-” indicates no match.

Category | Faniel et al. [15] | Chin and Lansing [16] | Candela et al. [8] | Chen [9]
Data production information | Data collection | Experimental properties | Quality, provenance | Collection
Data production information | Specimen and artifact | - | Coverage | Coverage
Data production information | Data producer | - | Microattribution | Description (file creators), author’s contribution
Data production information | Data analysis | Analysis and interpretation | - | -
Data production information | Missing data | - | - | -
Data production information | Research objectives | - | - | -
Repository information | Provenance | Data provenance | Availability | Identifier, relationship
Repository information | Repository reputation and history | - | - | -
Repository information | Curation and digitization | - | - | -
Data reuse information | Prior reuse | - | - | Reuse
Data reuse information | Advice on reuse | - | Reuse | -
Data reuse information | Terms of use | - | License, competing interests | Copyright, ethical approval, consent to publication, competing interests
- | - | General data set properties | Format | Description (file format, version, creation date)
- | - | - | Project | Funding statement

Table 2.

General data set properties identified in the data paper templates/guidelines

Journals (columns): JBMCa), SD, BMC genet, Jpensoftb), ER, Ecol, GDJ, ESSD, DiB, FiMS, Data, GigaSci, BIR, IJRR, JOAD
General data set properties (rows): data creation date; data format; data type; data version; file name/title; file size; language

SD, Scientific Data; BMC genet, BMC Genetics; ER, Ecological Research; Ecol, Ecology; GDJ, Geoscience Data Journal; ESSD, Earth System Science Data; DiB, Data in Brief; FiMS, Frontiers in Marine Science; GigaSci, GigaScience; BIR, BioInvasions Records; IJRR, International Journal of Robotics Research; JOAD, Journal of Open Archaeology Data.

a) Five BMC data journals that used the same guideline were included: BMC Bioinformatics, BMC Genomics, BMC Medical Genomics, BMC Medical Informatics and Decision Making, and BMC Musculoskeletal Disorders; b) Six data journals published by Pensoft that used the same guideline were included: Biodiversity Data Journal, Comparative Cytogenetics, Journal of Hymenoptera Research, Neobiota, Phytokeys, and Zookeys.

Table 3.

Data production information identified in the data paper templates/guidelines

Journals (columns): JBMCa), SD, BMC genet, Jpensoftb), ER, Ecol, GDJ, ESSD, DiB, FiMS, Data, GigaSci, BIR, IJRR, JOAD
Data production information (rows): data collection; specimens and artifacts; data producer; data analysis; missing data; research objectives; project


Table 4.

Repository information and data reuse information identified in the data paper templates/guidelines

Journals (columns): JBMCa), SD, BMC genet, Jpensoftb), ER, Ecol, GDJ, ESSD, DiB, FiMS, Data, GigaSci, BIR, IJRR, JOAD
Repository information (rows): provenance; repository reputation and history; curation and digitization
Data reuse information (rows): prior reuse; advice on reuse; terms of use


Publisher | Data journal (full name) | Subject area | Publishing model | Pure vs. mixed
Springer Nature | BMC Bioinformaticsa) | Biochemical research methods | OA | Mixed
Springer Nature | BMC Genetics | Genetics, heredity | OA | Mixed
Springer Nature | BMC Genomicsa) | Biotechnology, applied microbiology | OA | Mixed
Springer Nature | BMC Medical Genomicsa) | Genetics, heredity | OA | Mixed
Springer Nature | BMC Medical Informatics and Decision Makinga) | Medical informatics | OA | Mixed
Springer Nature | BMC Musculoskeletal Disordersa) | Orthopedics | OA | Mixed
Springer Nature | Scientific Data | Multidisciplinary sciences | OA | Pure
Pensoftb) | Biodiversity Data Journal | Biodiversity conservation | OA | Mixed
Pensoftb) | Comparative Cytogenetics | Genetics, heredity | OA | Mixed
Pensoftb) | Journal of Hymenoptera Research | Entomology | OA | Mixed
Pensoftb) | Neobiota | Biodiversity conservation | OA | Mixed
Pensoftb) | Phytokeys | Plant sciences | OA | Mixed
Pensoftb) | Zookeys | Zoology | OA | Mixed
Wiley | Ecological Research | Ecology | Hybrid | Mixed
Wiley | Ecology | Ecology | Hybrid | Mixed
Wiley | Geoscience Data Journal | Geosciences, multidisciplinary | OA | Pure
Copernicus Publications | Earth System Science Data | Geosciences, multidisciplinary | OA | Pure
Elsevier | Data in Brief | Multidisciplinary sciences | OA | Pure
Frontiers Media S.A. | Frontiers in Marine Science | Environmental sciences | OA | Mixed
MDPI | Data | Computer science, information systems | OA | Mixed
Oxford University Press | GigaScience | Multidisciplinary sciences | OA | Pure
REABIC Journals | BioInvasions Records | Biodiversity conservation | OA | Mixed
SAGE Publications | International Journal of Robotics Research | Robotics | Subscription | Mixed
Ubiquity Press | Journal of Open Archaeology Data | Archaeology | OA | Pure

OA, open access.

a) These journals share a single data paper guideline; b) The journals published by Pensoft use the same data paper guideline.