An ontology-based biomedical research paper authoring support tool

Senator Jeong; Sejin Nam; Hyun-Young Park

doi:10.6087/kcse.2014.1.37

Original Article An ontology-based biomedical research paper authoring support tool: Senator Jeong¹, Sejin Nam², Hyun-Young Park¹; Science Editing 2014;1(1):37-42.
DOI: https://doi.org/10.6087/kcse.2014.1.37
Published online: February 13, 2014

¹National Center for Medical Information and Knowledge, Korea National Institute of Health, Cheongwon, Korea

²Biomedical Knowledge Engineering Laboratory, Seoul National University, Seoul, Korea

Correspondence to Senator Jeong E-mail: senatorjeong@gmail.com

This paper was posted on the Journal Article Tag Suite Conference proceedings website available from: http://www.ncbi.nlm.nih.gov/books/NBK159968/

• Received: October 4, 2013 • Accepted: December 1, 2013

This is an open access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

This work aims to develop a paper authoring support system that helps biomedical scientists to organize their ideas for a specific discourse purpose. As an initial step toward this goal, this study developed an abstract authoring support tool that provides candidate lexical bundles organized according to the introduction, methods, results, and discussion (IMRAD) structure. Lexical bundles were extracted from the sentences in 152,083 structured abstracts of the PubMed Central open access subset, and their distribution was analyzed by IMRAD section. To organize lexical bundles according to IMRAD, the Lexical Bundle Ontology was built. A Journal Article Tag Suite compliant authoring support tool was implemented. This tool lists candidate lexical bundles corresponding to authors’ discourse purposes in a specific section and thereby helps to complete sentences. We expect this tool to be useful, at least in biomedical abstract writing, for organizing an author’s ideas to achieve a specific discourse purpose. This tool is targeted primarily at biomedical scientists whose mother tongue is not English; however, native English speakers may find it useful as well.
Keywords: Lexical bundle; Ontology; Paper authoring tool; Research paper abstract

Introduction

Biomedical research papers are typically formatted according to the introduction, methods, results, and discussion (IMRAD) structure. Even with the guidance of this format, writing a paper in English remains challenging, because authors must still organize and present their ideas with appropriate expressions. This is true to some extent for all authors, regardless of whether their mother tongue is English or not. Our goal is to develop a paper authoring support system that helps biomedical scientists to organize their ideas for a specific discourse purpose. As an initial step toward this goal, this study developed an abstract authoring support tool that helps to write sentences using lexical bundles organized according to the IMRAD structure. Lexical bundles function as the basic building blocks of this discourse structure. For example, the lexical bundle ‘the purpose of this study was’ indicates the research purpose in the introduction section.

The motivation for this study arises from the fact that each IMRAD section has a highly codified discourse purpose and frequently occurring lexical bundles. Our approach also draws on the literature that has determined that (1) communicative meanings and functions are often realized by formulaic expression [¹]; and that (2) in complying with IMRAD format, academic authors adopt predetermined phraseological patterns in discourse steps [²].

One study showed that a more compact set of vocabulary is used in biomedical science texts than in general English [³]. Along the same lines, Wang et al. [⁴] compiled the medical academic word list, which is a list of the most frequently used medical academic words in medical research papers. However, individual word forms are not sufficient for building an author support tool because most rhetorical functions are not accomplished by choosing a word from a word list.

Lexical bundles are combinations of three or more words that frequently occur in a corpus [⁵]. They have many names: ‘n-grams’, ‘multi-word patterns’, ‘formulaic patterns’, ‘clusters’, or ‘collocations’. Though they are not complete structural units, lexical bundles work as basic building blocks of discourse. Several researchers have focused on characterizing lexical bundles in research articles. Cortes [⁶] analyzed the relationship of lexical bundles to the ‘moves’ in the Introduction section of research articles and found that some lexical bundles composed of more than five words trigger the communicative function of a move (e.g., ‘the purpose of this study was’ describes the objective of the study). Saber analyzed the distribution of lexical bundles among the different IMRAD sections in biomedical research articles and found a limited repertoire of key standardized phraseological patterns specific to certain rhetorical steps [²]. Daniel [⁷] demonstrated that lexical bundles work as clear signals of discourse purpose in abstract texts. Lorenzo [⁸] investigated the frequency, structure, and functions of lexical bundles in English research papers.

Previous studies have demonstrated the utility of lexical bundles in achieving a specific discourse purpose. A formulaic expression compendium is another example of their utility [⁹]. Drawing upon the utility of lexical bundles, we developed an abstract writing support tool that can help an author to combine the appropriate lexical bundles with a given discourse step.

Methods

Data source: To collect lexical bundles for the authoring support tool, we identified the salient lexical bundles in each IMRAD section. The data source for the lexical bundles was the structured abstracts of research articles (n=152,083) in the PubMed Central open access subset [¹⁰]. We repurposed the data collected for another of our projects, which classifies sentences in biomedical paper abstracts. Therefore, the data processing steps in this study are partially the same as those in that project.
Extraction of lexical bundles: Our structured abstract corpus contains a variety of section headings (n=1,628). These variants include plurals (conclusion, conclusions), modifiers (conclusion, major conclusion), different word sequences (conclusions and significance, significance and conclusion), and joint section headings (method and result, result and discussion) [¹¹]. Different section headings representing the same concept were merged into 1,001 headings using OpenRefine (http://openrefine.org), and then further normalized into IMRAD using the National Library of Medicine’s category mapping table [¹²]. The mapping table uses five categories: background, objective, methods, results, and conclusions. In our study, background and objective were merged into introduction, and conclusions was renamed discussion. To extract lexical bundles, LingPipe was used for sentence splitting [¹³]. Each sentence was labeled with its corresponding IMRAD heading.

After grouping the sections into IMRAD, lexical bundles were extracted from the sentences in each section, and their occurrence frequency was analyzed using our in-house program. Lexical bundles occurring fewer than 50 times were ignored. In this study, lexical bundles are defined as all n-grams (from 2- to 10-gram) taken without further processing, except that numbers were replaced with a special tag, NUM. Table 1 shows that each IMRAD section has commonly occurring lexical bundles.
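
The extraction step described above can be illustrated with a minimal Python sketch. It is not the in-house program used in the study: the whitespace tokenizer, the number pattern, and the function names (`tokenize`, `ngrams`, `count_bundles`) are assumptions made for illustration. Only the five-category-to-IMRAD mapping, the 2- to 10-gram range, the NUM tag, and the frequency cut-off of 50 come from the text.

```python
import re
from collections import Counter

# Mapping described in the text: the five NLM categories are folded into IMRAD.
NLM_TO_IMRAD = {
    "BACKGROUND": "introduction",
    "OBJECTIVE": "introduction",
    "METHODS": "methods",
    "RESULTS": "results",
    "CONCLUSIONS": "discussion",
}

# Assumed number pattern: digits with optional decimal or thousands groups.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def tokenize(sentence):
    """Replace numbers with the NUM tag, then split on whitespace (assumed tokenizer)."""
    return NUMBER.sub("NUM", sentence).split()


def ngrams(tokens, n_min=2, n_max=10):
    """Yield every n-gram of length n_min..n_max as a space-joined string."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def count_bundles(labelled_sentences, min_freq=50):
    """Count n-grams per IMRAD section and drop those occurring fewer than min_freq times.

    labelled_sentences is an iterable of (nlm_category, sentence) pairs.
    """
    counts = {}
    for category, sentence in labelled_sentences:
        section = NLM_TO_IMRAD[category.upper()]
        counts.setdefault(section, Counter()).update(ngrams(tokenize(sentence)))
    return {
        section: {bundle: freq for bundle, freq in counter.items() if freq >= min_freq}
        for section, counter in counts.items()
    }
```

Calling `count_bundles(pairs)` on the labeled sentences returns one frequency dictionary per IMRAD section, which is the kind of data summarized in Table 1.
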
Lexical bundle ontology: We designed a simple ontology, Lexical Bundle Ontology (LBO), that organizes lexical bundles according to IMRAD. Thus, LBO extends the generic IMRAD model to produce its own classes. LBO has four classes under the top class LexicalBundle: IntroductionBundle, MethodBundle, ResultBundle, and DiscussionBundle. Each class has three datatype properties: frequency, nGram, and rdfs:label. The prefix ‘lbo’ was declared for the dereferenceable uniform resource identifier (http://lbo.studiosusan.kr/), where an application can find and access bundle instances. The box below shows an LBO instance in the Turtle format. The 9-gram lexical bundle (‘The aim of the present study was to evaluate’) is identified by ‘IntroLBl1108’ and its frequency is 326.
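
As a rough illustration of how such an instance might be written out, the following Python sketch builds a Turtle description from the values given in the text (the identifier, the class and property names, the frequency, and the ‘lbo’ namespace). It is a reconstruction, not the authors' original listing: the exact Turtle layout, the use of plain integer literals, and the helper name `bundle_to_turtle` are assumptions.

```python
# Namespaces: the lbo URI comes from the text; rdfs is the standard RDF Schema namespace.
PREFIXES = (
    "@prefix lbo: <http://lbo.studiosusan.kr/> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
)

# The four LBO classes named in the text, keyed by IMRAD section.
SECTION_CLASSES = {
    "introduction": "IntroductionBundle",
    "methods": "MethodBundle",
    "results": "ResultBundle",
    "discussion": "DiscussionBundle",
}


def bundle_to_turtle(identifier, section, text, frequency):
    """Serialize one lexical bundle as a Turtle resource (assumed layout)."""
    n_gram = len(text.split())  # assumption: nGram stores the number of words
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"lbo:{identifier} a lbo:{SECTION_CLASSES[section]} ;\n"
        f'    rdfs:label "{escaped}" ;\n'
        f"    lbo:nGram {n_gram} ;\n"
        f"    lbo:frequency {frequency} .\n"
    )


if __name__ == "__main__":
    print(PREFIXES)
    print(bundle_to_turtle(
        "IntroLBl1108",
        "introduction",
        "The aim of the present study was to evaluate",
        326,
    ))
```
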
Implementation of authoring support tool: A Journal Article Tag Suite (JATS)-compliant authoring support tool prototype was implemented. This proof-of-concept tool was designed to list candidate lexical bundles corresponding to a writer’s discourse purpose and thereby help to complete sentences in the abstract. The Apache Lucene library was used for indexing and searching the lexical bundles in the LBO. Recommended lexical bundles for a specific discourse purpose can be sorted by occurrence frequency, length of n-gram, or alphabetical order. In our tool, the bundles are sorted by their length or occurrence frequency.

A user completes a sentence by joining the last words of a chosen bundle to the first words of the bundles that follow. To support this, the left token-first-partial matching method was applied. When a user types a string, the tool matches bundles from the index file in real time and returns the matched items to the user. For an interactive user interface, AJAX was used.
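
The matching behavior can be sketched in a few lines of Python. The tool itself uses an Apache Lucene index; the sketch below swaps in a plain in-memory list and `str.startswith`, and reads "left token-first-partial matching" as a case-sensitive prefix match from the left edge of the bundle. The function name `suggest`, the longest-first ordering for the length sort, and the tie-breaking rule are assumptions. The default limit of 10 follows the number of candidates reported in the Results section.

```python
def suggest(bundles, typed, sort_by="frequency", limit=10):
    """Return up to `limit` bundles that start with the typed string.

    bundles is a list of (text, frequency) pairs for one IMRAD section.
    Matching is case-sensitive and anchored at the left edge of the bundle.
    """
    matches = [(text, freq) for text, freq in bundles if text.startswith(typed)]
    if sort_by == "frequency":
        # Most frequent first; ties broken alphabetically (assumed).
        matches.sort(key=lambda b: (-b[1], b[0]))
    elif sort_by == "length":
        # Longest n-gram first (assumed direction); ties broken alphabetically.
        matches.sort(key=lambda b: (-len(b[0].split()), b[0]))
    else:
        raise ValueError(f"unknown sort key: {sort_by}")
    return [text for text, _ in matches[:limit]]


if __name__ == "__main__":
    # Frequencies taken from Table 1.
    introduction = [
        ("The aim of this study", 12788),
        ("The aim of this study was to", 10560),
        ("The purpose of this study was to", 5558),
    ]
    print(suggest(introduction, "The aim"))
    print(suggest(introduction, "The", sort_by="length"))
```

In a real deployment the list scan would be replaced by an index lookup, which is what makes per-keystroke matching practical for a large bundle inventory.
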

Results

When a user writes an abstract according to the IMRAD format, the tool lists up to 10 candidate lexical bundles (Fig. 1). The tool is case-sensitive. When a user enters an uppercase letter at the very beginning of a string, the tool returns all bundles beginning with the typed letter. The completed texts are transformed to JATS format by clicking the button labeled ‘Convert to JATS format.’ The Paper Authoring Support Tool is available at http://147.46.70.42:7000/jats/demo.html.
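
The conversion step can be pictured with the short Python sketch below. The article does not show the markup the tool emits, so this is only an assumption about its shape: it uses the common JATS pattern for structured abstracts, an `abstract` element containing one `sec` per section with a `title` and a `p`. The function name `to_jats_abstract` and the section titles are illustrative.

```python
import xml.etree.ElementTree as ET

# Section order follows the IMRAD structure used throughout the article.
SECTION_TITLES = ["Introduction", "Methods", "Results", "Discussion"]


def to_jats_abstract(sections):
    """Build a JATS-style structured abstract from a {title: text} mapping.

    Empty or missing sections are skipped; text is escaped by ElementTree.
    """
    abstract = ET.Element("abstract")
    for title in SECTION_TITLES:
        text = sections.get(title, "").strip()
        if not text:
            continue
        sec = ET.SubElement(abstract, "sec")
        ET.SubElement(sec, "title").text = title
        ET.SubElement(sec, "p").text = text
    return ET.tostring(abstract, encoding="unicode")


if __name__ == "__main__":
    print(to_jats_abstract({
        "Introduction": "However little is known about this relationship.",
        "Discussion": "Our results indicate that further study is warranted.",
    }))
```
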

Use-case scenario

In this section, we present a use-case scenario to show the utility of the tool for authors. Let us demonstrate the writing of the abstract of a published paper [¹⁴] with this tool.

In the introduction section, the author begins by describing background knowledge and presenting the knowledge gap with the lexical bundle ‘However little is known about’ (Fig. 2). Then she states the research purpose with an expression such as ‘In this study we investigated the’ (Fig. 3). The methods section often involves describing a study type (Fig. 4), subject (Fig. 5), and experimental procedure (Fig. 6). The results section often has statistical test results (Fig. 7) with a confidence interval (Fig. 8). The discussion section often ends with an expression such as ‘Our results indicate that’ (Fig. 9). In each step of writing, the author selects a proper bundle serving her discourse purpose. Finally, by clicking the appropriate button, the author can transform the text to the JATS extensible mark-up language.

Discussion

This study presented an abstract authoring support tool that lists candidate lexical bundles responding to authors’ discourse purposes in a specific IMRAD section and can help to complete sentences. We hope the use case scenario has demonstrated the viability of this proof-of-concept tool.

The practical implication of the tool is straightforward. It would be useful, at least in abstract writing, for organizing the author’s ideas to achieve a specific discourse purpose, and could thereby help to produce a well-organized and clearly written manuscript. This tool is targeted primarily at biomedical scientists whose mother tongue is not English. However, native speakers may still find it useful. This tool could also be useful for authors in other scientific domains because many lexical bundles are commonly used in papers across the sciences.

Admittedly, this exploratory study extracted the lexical bundles from the abstracts in the PubMed Central open access subset; thus it does not show all of the thematic contents in the bodies of research papers. Further studies are required to cover full-text papers. Further work would include refining the authoring support tool. Grouping the lexical bundles with a similar role into a single group would be one approach (e.g., the two lexical bundles ‘the objective of this study’ and ‘the aim of this study’ have a similar role). This semantic structure will assist in highly selective searching for lexical bundles.

In conclusion, we hope that the present work will prompt the science community to discuss, and eventually endorse, this authoring support tool.

Notes

No potential conflict of interest relevant to this article was reported.

Acknowledgements

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (no. 2011-0010515) and by the Korea National Institute of Health (no. 4800-4848-307 and no. 3000-3334-300).

Fig. 1.

Prototype interface for the biomedical research paper authoring support tool.

Fig. 2.

The author begins by describing background knowledge and presenting the knowledge gap with the lexical bundle, ‘However little is known about.’

Fig. 3.

The author states the research purpose with a lexical bundle, ‘In this study we investigated the.’

Fig. 4.

The author describes a study type by selecting a lexical bundle, ‘prospective cohort study,’ that matches her discourse purpose.

Fig. 5.

In the methods section, the author describes the experimental subjects with a lexical bundle, ‘total of NUM patients.’

Fig. 6.

In the methods section, the author describes the experimental procedure with a lexical bundle, ‘classified according to the.’

Fig. 7.

The results section often has statistical test results.

Fig. 8.

Description of test results with confidence interval in the Results section.

Fig. 9.

The author writes the discussion section with a lexical bundle, ‘Our results indicate that.’

Table 1.

Distribution of the top 3 lexical bundles in each IMRAD (introduction, methods, results, and discussion) section of the PubMed Central open access subset (from 2- to 10-gram bundles)

n-Gram	Introduction	Frequency	Methods	Frequency	Results	Frequency	Discussion	Frequency
2	of the	130,996	of the	63,454	NUM %	480,724	of the	94,020
	in the	94,588	and NUM	61,634	NUM NUM	294,308	in the	79,248
	of this	45,916	NUM NUM	55,754	of the	191,242	to the	32,302
3	study was to	31,444	n=NUM	25,918	NUM NUM %	101,998	as well as	8,192
	of this study	29,622	NUM and NUM	21,810	NUM±NUM	58,342	results suggest that	7,828
	this study was	23,354	was used to	12,098	NUM % of	56,900	in patients with	7,182
4	of this study was	22,792	A total of NUM	6,004	NUM % CI NUM	47,016	These results suggest that	2,818
	this study was to	22,518	between NUM and NUM	4,764	NUM NUM % CI	39,362	Our results suggest that	2,670
	The aim of this	15,124	NUM mg / kg	4,460	% CI NUM NUM	32,358	can be used to	2,524
5	of this study was to	22,346	were divided into NUM groups	2,002	NUM NUM % CI NUM	32,792	The results of this study	1,382
	The aim of this study	12,788	A total of NUM patients	1,340	NUM % CI NUM NUM	32,302	play an important role in	1,048
	aim of this study was	11,530	were included in the study	1,302	NUM % and NUM %	17,240	for the first time that	830
6	aim of this study was to	11,286	mean age NUM ± NUM years	702	NUM NUM % CI NUM NUM	24,766	our knowledge this is the first	774
	The aim of this study was	10,796	A total of NUM patients with	506	=NUM NUM % CI NUM	8,602	To our knowledge this is the	494
	purpose of this study was to	5,870	were randomly divided into NUM groups	500	OR NUM NUM % CI NUM	8,126	To the best of our knowledge	442
7	The aim of this study was to	10,560	were divided into NUM groups according to	282	=NUM NUM % CI NUM NUM	6,924	To our knowledge this is the first	472
	The purpose of this study was to	5,558	NUM/NUM and NUM/NUM	200	OR NUM NUM % CI NUM NUM	6,220	To the best of our knowledge this	342
	The objective of this study was to	3,642	ratios ORs and NUM % confidence intervals	180	NUM NUM % CI NUM to NUM	5,190	the best of our knowledge this is	314
8	The aim of the present study was to	1,896	ratios ORs and NUM % confidence intervals CIs	142	OR =NUM NUM % CI NUM NUM	4,050	To the best of our knowledge this is	304
	The aim of this study was to investigate	1,758	odds ratios ORs and NUM % confidence intervals	126	= NUM NUM % CI = NUM NUM	3,478	the best of our knowledge this is the	300
	The aim of this study was to evaluate	1,660	were divided into NUM groups according to the	120	OR =NUM NUM % CI =NUM	2,916	best of our knowledge this is the first	286
9	The aim of this study was to evaluate the	1,200	odds ratios ORs and NUM % confidence intervals CIs	102	OR = NUM NUM % CI = NUM NUM	2,396	To the best of our knowledge this is the	292
	The aim of this study was to investigate the	1,126	with a mean age of NUM±NUM years	96	NUM NUM % CI NUM NUM p=NUM	1,922	the best of our knowledge this is the first	284
	The aim of this study was to determine the	820	odds ratios OR and NUM % confidence intervals CI	82	NUM % NUM % NUM % and NUM %	1,418	best of our knowledge this is the first report	88
10	The aim of the present study was to investigate the	246	/NUM/NUM and NUM/NUM/NUM	70	NUM NUM % confidence interval [CI] NUM NUM	890	To the best of our knowledge this is the first	276
	The aim of the present study was to evaluate the	224	NUM/NUM/NUM and NUM/NUM	70	NUM % NUM % NUM % and NUM % respectively	744	the best of our knowledge this is the first report	86
	aim of this study was to evaluate the effect of	142	between NUM/NUM/NUM and NUM/ NUM	62	NUM % NUM % CI NUM % to NUM %	626	the best of our knowledge this is the first case	60

References

1. Martinez R, Schmitt N. A phrasal expressions list. Appl Linguist 2012;33:299-320. http://dx.doi.org/10.1093/applin/ams010.
2. Saber A. Phraseological patterns in a large corpus of biomedical articles. In: Boulton A, Carter-Thomas S, Rowley-Jolivet E, editors. Corpus-informed research and learning in ESP: issues and applications. Amsterdam: John Benjamins Publishing Company; 2012:45-82.
3. Waxmonsky S, Goldsmith J, Rzhetsky A. Discovering and counting biomedical verbs. Paper presented at: 2010 Ninth International Conference on Machine Learning and Applications (ICMLA). 2010 Dec 12-14; Washington, DC, USA.
4. Wang J, Liang SI, Ge GC. Establishment of a medical academic word list. Engl Specif Purp 2008;27:442-58. http://dx.doi.org/10.1016/j.esp.2008.05.003.
5. Biber D, Conrad S, Cortes V. If you look at …: lexical bundles in university teaching and textbooks. Appl Linguist 2004;25:371-405. http://dx.doi.org/10.1093/applin/23.371.
6. Cortes V. The purpose of this study is to: connecting lexical bundles and moves in research article introductions. J Engl Acad Purp 2013;12:33-43. http://dx.doi.org/10.1016/j.jeap.2012.11.002.
7. Daniel R. Domain-independent mining of abstracts using indicator phrases. DLib Mag 2012;18. http://dx.doi.org/10.1045/july2012-daniel.
8. Lorenzo SD. Lexical bundles in scientific English: a corpus-based study of native and non-native writing [dissertation]. Barcelona: Universitat de Barcelona; 2011.
9. Morley J. Academic Phrasebank [Internet]. [place unknown]: University of Manchester; 2013 [cited 2013 Apr 6]. Available from: http://www.phrasebank.manchester.ac.uk.
10. US National Library of Medicine. PMC: FTP service [Internet]. Bethesda: US National Library of Medicine; [cited 2011 Apr 12]. Available from: http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/.
11. Ripple AM, Mork JG, Rozier JM, Knecht LS. Structured abstracts in MEDLINE: twenty-five years later [Internet]. Bethesda: US National Library of Medicine; 2012 [cited 2013 Dec 29]. Available from: http://structuredabstracts.nlm.nih.gov/Structured_Abstracts_in_MEDLINE_Twenty-Years_Later.pdf.
12. USA.gov. Structured abstracts in MEDLINE: implementation information [Internet]. Bethesda: US National Library of Medicine; 2012 [cited 2012 Sep 22]. Available from: http://structuredabstracts.nlm.nih.gov/Implementation.shtml.
13. Smith L, Rindflesch T, Wilbur WJ. MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics 2004;20:2320-1. http://dx.doi.org/10.1093/bioinformatics/bth227.
14. Kim MJ, Lim NK, Park HY. Relationship between prehypertension and chronic kidney disease in middle-aged people in Korea: the Korean genome and epidemiology study. BMC Public Health 2012;12:960. http://dx.doi.org/10.1186/1471-2458-12-960.
