Abstract
-
Purpose
- Reliable bibliometric analysis requires the accurate linkage of heterogeneous affiliation strings to persistent organizational identifiers. Generic natural language processing tools frequently fail at this task because they tend to prioritize coverage rather than precision. This study evaluated whether anchoring an entity-linking model to the Research Organization Registry improved precision relative to generic tools.
-
Methods
- We developed a conservative, two-stage model. First, using a normalized registry corpus, we applied rule-based exact matching with geographic validation. Second, selective fuzzy matching was applied only to the remaining nonmatched affiliations. We evaluated model performance against an off-the-shelf spaCy named entity recognition baseline using a manually adjudicated gold standard dataset derived from PubMed Digital Health records. Finally, we assessed the comparative advantage of our model using nonparametric paired comparison tests and bootstrap methods.
-
Results
- Our two-stage approach achieved substantially higher precision (0.97) and recall (0.93) than both the generic baseline (precision, 0.75; recall, 0.47) and unconstrained fuzzy matching models (precision, 0.77; recall, 0.83). This balanced improvement in precision and recall resulted in the highest F1 score (0.95). The ablation study further confirmed that the “exact matching first” strategy was structurally necessary to prevent the inflation of false positives observed when unconstrained fuzzy matching was applied.
-
Conclusion
- Anchoring entity resolution to a canonical registry using a tiered matching strategy substantially enhances the precision of institutional attribution. This approach provides a robust method for correcting metadata quality in editorial and repository workflows.
-
Keywords: Institutional affiliation; Entity-linking model; Research Organization Registry; PubMed; Tiered matching strategy
Introduction
- Background/rationale
- Reliable bibliometric analysis depends on precise institutional attribution. To map collaboration networks or track funding, researchers must link raw affiliation strings in publication metadata to persistent organizational identifiers. However, data heterogeneity continues to undermine this process [1–3]. Raw affiliation strings, particularly those in author-generated PubMed fields, often diverge substantially from canonical organizational names and frequently contain ambiguous abbreviations, omitted hierarchical relationships, and irregular orthography.
- This divergence leads to systematic misattribution. When analysts attempt to resolve these noisy strings using generic, off-the-shelf natural language processing (NLP) tools or unconstrained string similarity metrics, a critical failure mode emerges [4]. Lacking domain-specific constraints, these models prioritize coverage at the expense of accuracy, producing false positives that obscure the identities of true performing institutions [2,4]. As a result, the field lacks a method capable of parsing high-noise academic metadata with the level of precision required for authoritative bibliometric indicators.
- Objectives
- To address this trade-off, we developed a conservative, registry-anchored entity-linking model that explicitly prioritizes precision. Our model is anchored to the Research Organization Registry (ROR), which constrains the candidate space of possible institutional entities. Building on this anchor, our approach implements two stages: (1) rule-based exact matching with geographic validation as the primary filter; and (2) selective fuzzy matching applied only to the bounded residual set. This study evaluates whether this exact-matching-first strategy outperforms a generic baseline by improving precision without sacrificing recall. We further examine the structural necessity of the exact-matching-first constraint through an ablation experiment.
Methods
- Ethics statement
- As this study analyzed publicly available bibliographic metadata and involved no human subjects, neither institutional review board approval nor informed consent was required.
- Study design
- We conducted a descriptive, benchmark-based evaluation comparing an off-the-shelf generic named-entity recognition (NER) approach with the ROR-anchored entity-linking pipeline. Performance was assessed using standard classification metrics, including precision, recall, F1 score, and accuracy. In addition, we applied a paired statistical comparison framework, including the McNemar test and bootstrap confidence intervals with Bonferroni correction, to validate the performance advantage of our proposed model.
- Data
ROR corpus
- Institutional reference data were sourced from the official ROR distribution [5], comprising 163,963 records. To construct a domain-specific corpus, we consolidated name variants, removed ambiguous tokens (e.g., short acronyms), and manually added high-frequency variants identified in prior work [1,4].
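- The corpus construction described above can be illustrated with a minimal sketch. It assumes a simplified record layout (an `id` plus a flat list of `names`) rather than the actual ROR schema, and the length cutoff `MIN_ACRONYM_LEN`, the placeholder identifiers, and the rule that drops variants shared by several organizations are illustrative assumptions, not the authors' settings.

```python
from collections import defaultdict

MIN_ACRONYM_LEN = 4  # assumed cutoff; the study does not report one


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so variants compare consistently."""
    return " ".join(name.lower().split())


def build_variant_index(records, extra_variants=None):
    """Map each unambiguous, normalized name variant to a single registry ID.

    records: iterable of {"id": str, "names": [str, ...]} (simplified layout).
    extra_variants: optional {variant: id} of manually added variants.
    """
    index = defaultdict(set)
    for rec in records:
        for name in rec["names"]:
            variant = normalize(name)
            # Drop very short tokens such as two- or three-letter acronyms.
            if len(variant) < MIN_ACRONYM_LEN:
                continue
            index[variant].add(rec["id"])
    for variant, ror_id in (extra_variants or {}).items():
        index[normalize(variant)].add(ror_id)
    # Assumption: a variant claimed by more than one organization is ambiguous.
    return {v: next(iter(ids)) for v, ids in index.items() if len(ids) == 1}
```

Keeping only variants that resolve to exactly one identifier is one conservative way to trade coverage for precision at the corpus level; other disambiguation rules are possible.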
PubMed data
- We retrieved 482,086 PubMed records (PMIDs) indexed between 2014 and 2024 using the E-utilities (Entrez programming utilities), a public API (application programming interface) to the US National Center for Biotechnology Information (NCBI) Entrez system (most recent access on December 20, 2025) [6]. The query targeted digital health–related publications and excluded nonhuman research.
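- As a rough sketch of how such a retrieval request can be formed, the snippet below builds an E-utilities ESearch URL. The search term is a hypothetical placeholder (the actual query is not reproduced here), and the choice of `datetype` and the page size are assumptions.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Hypothetical query term for illustration only; not the study's query.
EXAMPLE_TERM = '"digital health"[Title/Abstract] AND humans[MeSH Terms]'


def build_esearch_url(term, min_year, max_year, retmax=1000, retstart=0):
    """Return an ESearch URL that lists PMIDs matching a term and date range."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",  # assumption: filter on publication date
        "mindate": str(min_year),
        "maxdate": str(max_year),
        "retmax": retmax,
        "retstart": retstart,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"
```

The function only constructs the URL; sending the request, paging through results, and respecting NCBI rate limits are left to the caller.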
Gold standard benchmark
- To evaluate model performance, we constructed a gold standard benchmark [7] by randomly sampling 4,000 PMIDs from the full set of 482,086 records and extracting 46,378 author-affiliation strings. Four trained annotators manually mapped affiliation strings to ROR identifiers, labeling unmappable cases as “abstain” or “unlinkable” through a two-pass adjudication process. Detailed procedures for constructing the ROR corpus, PubMed dataset (Dataset 1), and gold standard benchmark are provided in Suppl. 1.
- Proposed method: ROR-anchored exact-matching-first model
- Detailed descriptions of preprocessing, stage 1, and stage 2 procedures are provided in Suppl. 2.
Preprocessing
- We parsed 4,000 PubMed XML records to generate 46,378 distinct author-affiliation rows. Preprocessing steps removed noninstitutional noise (e.g., email addresses and postal codes) and normalized text to reduce orthographic inconsistency.
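- The cleaning steps can be sketched as follows. The regular expressions are illustrative assumptions (for example, only five-digit postal codes are handled), and the exact rules used in the study are described in Suppl. 2.

```python
import re
import unicodedata

EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
# Assumed postal-code shape: five digits with an optional four-digit suffix.
POSTAL_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")
LABEL_RE = re.compile(r"electronic address:", re.IGNORECASE)


def preprocess(affiliation: str) -> str:
    """Strip noninstitutional noise and normalize an affiliation string."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", affiliation)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = LABEL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = POSTAL_RE.sub(" ", text)
    text = text.lower()
    # Keep word characters, whitespace, commas, ampersands, and hyphens.
    text = re.sub(r"[^\w\s,&-]", " ", text)
    text = re.sub(r"\s+", " ", text)
    # Tidy comma spacing and merge fields left empty by the removals.
    text = re.sub(r"\s*,(?:\s*,)*\s*", ", ", text)
    return text.strip(" ,")
```

For example, `preprocess("Dept. of Médecine, Example University, Boston, MA 02115, USA. Electronic address: a.b@example.edu.")` returns `"dept of medecine, example university, boston, ma, usa"`.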
Stage 1 (exact matching with geographic validation)
- Stage 1 functioned as a high-precision filtering step. Exact matching was performed using ROR-derived patterns, with longest-first variant prioritization to prevent partial overlaps. Administrative tokens (e.g., “Oxford”) were excluded to reduce spurious matches. In parallel, a spaCy EntityRuler validated candidate matches by intersecting extracted geographic entities with ROR metadata; affiliations lacking geographic alignment were deferred to stage 2.
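- The logic of stage 1 can be approximated in plain Python. This sketch substitutes a simple phrase lookup for the spaCy EntityRuler used in the study, and the registry entries, identifiers, and geographic terms are hypothetical placeholders.

```python
import re

# Hypothetical registry: variant -> (placeholder ID, accepted geographic terms).
REGISTRY = {
    "example university": ("ror-A", ("boston", "usa", "united states")),
    "example university hospital": ("ror-B", ("boston", "usa", "united states")),
    "sample institute": ("ror-C", ("london", "uk", "united kingdom")),
}


def contains_phrase(text, phrase):
    """True if phrase occurs in text bounded by non-word characters."""
    return re.search(rf"(?<!\w){re.escape(phrase)}(?!\w)", text) is not None


def find_longest_variant(text, variants):
    """Try longer variants first so a short name cannot shadow a longer one."""
    for variant in sorted(variants, key=len, reverse=True):
        if contains_phrase(text, variant):
            return variant
    return None


def stage1_link(text, registry=REGISTRY):
    """Return a registry ID, or None to defer the string to stage 2."""
    variant = find_longest_variant(text, registry)
    if variant is None:
        return None
    ror_id, geo_terms = registry[variant]
    # Geographic validation: some registry location must appear in the string.
    if any(contains_phrase(text, term) for term in geo_terms):
        return ror_id
    return None
```

With these placeholder entries, a string containing "example university hospital, boston" resolves to the hospital rather than the university, while "example university, paris, france" is deferred because no geographic term aligns.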
Stage 2 (selective fuzzy matching)
- Selective fuzzy matching was applied only to affiliations unresolved by stage 1. We computed a weighted similarity score incorporating token overlap, edit distance, and rare-token matching. Final acceptance was determined using calibrated score thresholds.
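- A weighted score of this kind might look like the sketch below. The component weights, the threshold of 0.6, and the inverse-frequency formula are illustrative assumptions rather than the calibrated values from the study, and `difflib.SequenceMatcher` is used as a stand-in for an edit-distance measure.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def tokens(text):
    return text.lower().split()


def rare_token_weights(variants):
    """Weight tokens by inverse frequency across registry variants."""
    df = Counter(tok for v in variants for tok in set(tokens(v)))
    n = len(variants)
    return {tok: math.log((n + 1) / (cnt + 1)) + 1.0 for tok, cnt in df.items()}


def similarity(query, variant, idf, weights=(0.4, 0.3, 0.3)):
    """Combine token overlap, character similarity, and rare-token coverage."""
    q, v = set(tokens(query)), set(tokens(variant))
    union = q | v
    overlap = len(q & v) / len(union) if union else 0.0
    char_sim = SequenceMatcher(None, query.lower(), variant.lower()).ratio()
    total = sum(idf.get(t, 0.0) for t in v)
    rare = sum(idf.get(t, 0.0) for t in q & v) / total if total else 0.0
    w_tok, w_edit, w_rare = weights
    return w_tok * overlap + w_edit * char_sim + w_rare * rare


def stage2_link(query, registry, threshold=0.6):
    """Return the best-scoring registry ID, or None (abstain) below threshold.

    registry: non-empty {variant: id} mapping.
    """
    idf = rare_token_weights(list(registry))
    best_score, best_variant = max(
        (similarity(query, v, idf), v) for v in registry
    )
    return registry[best_variant] if best_score >= threshold else None
```

Returning `None` below the threshold mirrors the abstention behavior: an unresolved string stays unlinked rather than receiving a low-confidence identifier.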
- Evaluation strategy
Metrics
- Predicted identifiers from each model were compared against gold standard labels using standard classification metrics, including accuracy, precision, recall, and F1 score. Correctness was prioritized to minimize false-positive institutional assignments.
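- For reference, the four metrics follow directly from the confusion counts. The sketch below uses the standard definitions and, as a worked example, the stage 2 counts reported in Table 1; rounding to two decimals is an assumption made to match the table.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, F1 score, and accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": round(precision, 2),
        "recall": round(recall, 2),
        "f1": round(f1, 2),
        "accuracy": round(accuracy, 2),
    }


# Stage 2 counts from Table 1.
stage2 = classification_metrics(tp=38305, fp=1388, tn=3705, fn=2980)
# stage2 == {"precision": 0.97, "recall": 0.93, "f1": 0.95, "accuracy": 0.91}
```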
Baseline method
- For comparison, we implemented a baseline using a pretrained, general-domain spaCy NER model with strict surface-form linking. This baseline represents a generic NLP strategy without domain-specific anchoring. Additional details on evaluation metrics and the baseline implementation are provided in Suppl. 3.
- Statistical methods
- Predicted ROR identifiers were evaluated against the manually adjudicated gold standard benchmark comprising 46,378 affiliation strings. Performance of the spaCy baseline, stage 1 (exact matching with geographic validation), stage 2 (selective fuzzy matching following stage 1), and an unconstrained fuzzy matching ablation was summarized using accuracy, precision, recall, and F1 score. Correctness was prioritized to minimize false-positive institutional assignments, and calibrated thresholds governed acceptance of fuzzy matches in stage 2. The McNemar test was applied using paired correctness indicators derived from the gold standard benchmark to assess the direction of performance differences. Specifically, discordant pairs were compared for stage 2 relative to the baseline, stage 1, and the unconstrained fuzzy matching ablation. In addition, differences in F1 score were estimated using bootstrap confidence intervals. Bonferroni correction was applied to both statistical procedures. The Python code used in this study is available in Suppl. 4.
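- The two procedures can be sketched with the standard library alone. This is not the code in Suppl. 4: the continuity correction, the percentile bootstrap, the number of resamples, and the mapping of predictions to true positives, false positives, and false negatives are all assumptions made for illustration.

```python
import math
import random


def mcnemar(b, c, n_comparisons=3):
    """McNemar test on discordant counts b and c (continuity-corrected).

    Returns the chi-squared statistic and a Bonferroni-adjusted p-value.
    """
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of the chi-squared distribution with 1 df.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, min(1.0, p * n_comparisons)


def f1(gold, pred):
    """F1 score where None denotes an abstention or an unlinkable string."""
    tp = sum(1 for g, p in zip(gold, pred) if p is not None and p == g)
    fp = sum(1 for g, p in zip(gold, pred) if p is not None and p != g)
    fn = sum(1 for g, p in zip(gold, pred) if p is None and g is not None)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_diff(gold, pred_a, pred_b, n_boot=1000, alpha=0.05,
                      n_comparisons=3, seed=0):
    """Percentile bootstrap interval for F1(a) - F1(b), Bonferroni-adjusted."""
    rng = random.Random(seed)
    n = len(gold)
    diffs = []
    for _ in range(n_boot):
        # Resample affiliations with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(f1(g, a) - f1(g, b))
    diffs.sort()
    adj = alpha / n_comparisons  # e.g., 0.05 / 3 gives a 98.3% interval
    lo = diffs[int((adj / 2) * n_boot)]
    hi = diffs[min(n_boot - 1, int((1 - adj / 2) * n_boot))]
    return lo, hi
```

With three pairwise comparisons, dividing a 5% alpha by three yields the 98.3% interval width; resampling indices rather than predictions keeps each model's output paired with the same gold label.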
Results
- Performance metrics across the baseline, stage 1, stage 2, and ablation models are summarized in Table 1.
- Baseline performance: Generic NER
- The off-the-shelf spaCy NER model (baseline) missed more correct institutions than it detected, identifying 17,276 correct matches while missing 19,508. Its relatively low precision (0.75) and the lowest recall (0.47) among all evaluated models resulted in the lowest F1 score (0.58) and accuracy (0.46).
- Stage 1 performance: Exact matching with geographic validation
- Stage 1, which applied registry-derived exact matching with geographic validation, increased precision to 0.98 and recall to 0.87. This configuration achieved the highest precision (0.98) among all tested models. Stage 1 yielded 9,183 negative cases, comprising 5,317 false negatives and 3,866 true negatives. These cases shared characteristics such as noncanonical naming, partial institutional mentions, typographical errors, and other lexical variations, which could reduce the correctness of our model. To preserve high correctness, all negative cases were subsequently forwarded to stage 2 for further resolution.
- Stage 2 performance: Selective fuzzy matching after stage 1
- This configuration was selected as the preferred model. It applied selective fuzzy matching to the 9,183 residual cases passed from stage 1 and achieved a precision of 0.97, a recall of 0.93, and the highest F1 score (0.95). Although precision decreased marginally by 1 percentage point compared with stage 1, recall and F1 score increased by 6 and 3 percentage points, respectively. Accuracy also increased by 4 percentage points, reaching the highest observed level (0.91).
- Ablation study: Unconstrained fuzzy matching
- Removing the stage 1 filter (exact matching with geographic validation) resulted in reduced precision (0.77) and recall (0.83). This unconstrained configuration generated 8,314 false positives, representing a nearly ninefold increase relative to stage 1. This result demonstrates that unconstrained fuzzy matching tends to over-match plausible but incorrect institutions when affiliation strings are noisy.
Discussion
- Key results
- Our proposed two-stage ROR-anchored model (stage 2) substantially outperformed both the generic spaCy baseline and the unconstrained fuzzy matching approach. Using the McNemar test applied to paired correctness indicators, we confirmed that stage 2 significantly outperformed all other models in achieving correctness at the 1% significance level with the Bonferroni correction (Table 1). In addition, the win-loss odds ratio for stage 2 ranged from 12.7 to 34.6, indicating that, among discordant pairs, stage 2 was correct where the comparator was wrong 12.7 to 34.6 times for every case in which the comparator was correct and stage 2 was wrong (Table S3.1 in Suppl. 3). Finally, bootstrap confidence intervals for the F1 score difference confirmed that stage 2 increased the F1 score by 2.8% to 63.5% (Fig. 1, Table S3.2 in Suppl. 3).
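- Assuming the win-loss odds ratio is computed from discordant pairs, as is conventional alongside the McNemar test, it can be derived from paired correctness indicators as in the sketch below. The function names and the toy indicators are hypothetical; the reported ratios come from Table S3.1 in Suppl. 3.

```python
import math


def discordant_counts(stage2_correct, comparator_correct):
    """Count pairs where exactly one of the two models is correct."""
    wins = losses = 0
    for s2, comp in zip(stage2_correct, comparator_correct):
        if s2 and not comp:
            wins += 1    # stage 2 correct, comparator wrong
        elif comp and not s2:
            losses += 1  # comparator correct, stage 2 wrong
    return wins, losses


def win_loss_odds_ratio(wins, losses):
    """Ratio of wins to losses; infinite when the comparator never wins."""
    return math.inf if losses == 0 else wins / losses


# Toy indicators for five affiliation strings (not study data).
wins, losses = discordant_counts(
    [True, True, True, False, True],
    [False, False, True, True, False],
)
ratio = win_loss_odds_ratio(wins, losses)  # 3 wins, 1 loss
```

Pairs on which both models agree, whether both correct or both wrong, do not enter the ratio, which is why it can be large even when the comparator's overall accuracy is moderate.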
- Interpretation
- Our analysis yields three primary insights regarding institutional entity-linking. First, generic NER models without domain-specific anchoring exhibit limited recall. The baseline model’s failure to detect the majority of valid institutions stems from the mismatch between its general-domain training data and the highly heterogeneous structure of scholarly affiliation strings. Orthographic inconsistencies, abbreviations, truncated institution names, and domain-specific naming conventions hindered precise NER.
- In bibliometric applications, such omissions systematically bias collaboration and productivity indicators by undercounting institutions represented through nonstandard, abbreviated, or multilingual forms.
- Second, anchoring entity-linking to a canonical registry such as ROR, combined with an exact-matching-first strategy, achieves a strong balance between precision and recall, as reflected by the highest F1 score for stage 2 and the second highest for stage 1. The two-stage models validated the hypothesis that high-precision exact matching provides a stable foundation for institutional resolution. This stage operationalized a conservative standard of correctness by accepting institution mentions only when they aligned with registry-derived surface forms and geographic metadata. As a result, the approach mitigated homonym-based conflation and the overmatching behavior inherent in models that rely on unconstrained similarity metrics.
- Third, fuzzy matching is effective only when applied selectively. Stage 2 demonstrated that selective fuzzy matching improved recall by 6 percentage points by resolving residual nonmatches passed from stage 1, while preserving high precision (0.97). The ablation study showed that unconstrained fuzzy matching inflated false positives by 46.9% because the algorithm overmatched plausible but incorrect institutions in noisy affiliation strings. These findings indicate that exact matching in the first stage is a structural requirement for controlling false positives, rather than a minor optimization.
- Implications for bibliometrics and workflows
- Correctness represents the limiting factor in institutional affiliation linking. False positives contaminate institutional indicators, distort collaboration networks, and undermine trust in policy-relevant evaluations. The model’s high correctness therefore supports reliable institutional analytics derived from PubMed data, while its registry-anchored design enables principled abstention. In repository ingestion pipelines or editorial quality control workflows, abstaining on uncertain cases, such as department-only mentions, hospitals lacking disambiguating metadata, or organizations absent from the registry, is preferable to forcing low-confidence assignments.
- Limitations
- This study has several limitations. First, the gold standard benchmark was derived from a digital health–focused subset of PubMed, and performance may differ across disciplines with distinct affiliation conventions. Second, recall recovery in the model depends on calibrated thresholds and weighting schemes used in fuzzy matching, which may require adjustment for other domains. Third, some affiliation strings are intrinsically ambiguous, such as department-only mentions or records lacking geographic context, and cannot be resolved without external evidence.
- Conclusions
- The heterogeneity of affiliation strings in PubMed prevents generic NER models and unconstrained fuzzy matching approaches from delivering reliable institutional attribution at scale. The ROR-anchored, two-stage approach, which enforces an exact-matching-first strategy followed by selective, tiered fuzzy matching, provides a practical foundation for improving the integrity of institutional metadata used in bibliometrics and research evaluation.
Notes
-
Conflict of interest
No potential conflict of interest relevant to this article was reported.
-
Funding
This work was supported by the Support Program for Fostering Talent in Industrial Innovation (No. RS-2025-02217527), funded by the Korean Ministry of Industry, Trade and Resources.
-
Data availability
The dataset file is available from the Harvard Dataverse at https://doi.org/10.7910/DVN/M5PRZB.
Dataset 1. The dataset analyzed during the current study.
kcse-396-dataset-1.csv
Supplementary materials
Supplementary files are available from https://doi.org/10.7910/DVN/M5PRZB and GitHub (Suppl. 4; https://github.com/peterkang222/DEIR_ROR-anchored_Two-stage_NER).
Fig. 1. Model performance differences. Error bars indicate Bonferroni-corrected 98.3% confidence intervals. Baseline, generic named-entity recognition. Stage 1, exact matching with geographic validation. Stage 2, selective fuzzy matching after stage 1. Ablation study, unconstrained fuzzy matching.
Table 1. Comparative performance
| Metric | Baseline | Stage 1 | Stage 2 | Ablation study |
|---|---|---|---|---|
| True positive | 17,276 | 36,269 | 38,305 | 28,452 |
| False positive | 5,660 | 926 | 1,388 | 8,314 |
| True negative | 3,934 | 3,866 | 3,705 | 3,895 |
| False negative | 19,508 | 5,317 | 2,980 | 5,717 |
| Precision | 0.75 | 0.98 | 0.97 | 0.77 |
| Recall | 0.47 | 0.87 | 0.93 | 0.83 |
| F1 score | 0.58 | 0.92 | 0.95 | 0.80 |
| Accuracy | 0.46 | 0.87 | 0.91 | 0.70 |
| χ² (stage 2 vs. model) | 19,629.76 | 1,600.19 | – | 8,931.00 |
| P-value | <0.001*** | <0.001*** | – | <0.001*** |
References
- 1. Jonnalagadda SR, Topham P. NEMO: extraction and normalization of organization names from PubMed affiliations. J Biomed Discov Collab 2010;5:50-75. https://doi.org/10.5210/disco.v5i0.3047.
- 2. Huang S, Yang B, Yan S, Rousseau R. Institution name disambiguation for research assessment. Scientometrics 2014;99:823-38. https://doi.org/10.1007/s11192-013-1214-2.
- 3. Torvik VI. MapAffil: a bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide. Dlib Mag 2015;21:10.1045/november2015-torvik. https://doi.org/10.1045/november2015-torvik.
- 4. Caron E, Daniels H. Identification of organization name variants in large databases using rule-based scoring and clustering: with a case study on the Web of Science database. In: Hammoudi S, Maciaszek L, Missikoff MM, Camp O, Cordeiro J, editors. Proceedings of the 18th International Conference on Enterprise Information Systems, Volume 1. 18th International Conference on Enterprise Information Systems (ICEIS 2016); 2016 Apr 25–28; Rome, Italy. SciTePress; 2016. p. 182–7. https://doi.org/10.5220/0005836701820187.
- 5. Research Organization Registry. ROR data [Internet]. Version 2. Zenodo; 2025 [cited 2025 Dec 20]. Available from: https://zenodo.org/records/17953395.
- 6. Sayers E. A general introduction to the E-utilities [updated 2022 Nov 17]. In: Entrez programming utilities help [Internet]. US National Center for Biotechnology Information; [cited 2025 Dec 20]. Available from: https://www.ncbi.nlm.nih.gov/books/NBK25497.
- 7. Wissler L, Almashraee M, Monett D, Paschke A. The gold standard in corpus annotation. Presented at: 5th IEEE Germany Student Conference; 2014 Jun 26–27; Passau, Germany. https://doi.org/10.13140/2.1.4316.3523.