Science Editing : Science Editing

Original Article
A two-stage registry-anchored approach for precision improvement in organization name recognition from PubMed affiliation strings: a validation study
Inmo Kang1, Joonmo Park1, Heesoo Jeong2, Seyoung Chung3, Changmin Jeon4, Seongwuk Moon1
Science Editing 2026;13(1):46-50.
DOI: https://doi.org/10.6087/kcse.396
Published online: February 9, 2026

1Graduate School of Management of Technology, Sogang University, Seoul, Korea

2Department of Psychology, College of Social Science, Sogang University, Seoul, Korea

3Department of Economics, Graduate School, Sogang University, Seoul, Korea

4Department of Mathematics, College of Natural Science, Sogang University, Seoul, Korea

Correspondence to Seongwuk Moon seongwuk@sogang.ac.kr
• Received: January 5, 2026   • Accepted: January 28, 2026

Copyright © 2026 Korean Council of Science Editors

This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

  • Purpose
    Reliable bibliometric analysis requires the accurate linkage of heterogeneous affiliation strings to persistent organizational identifiers. Generic natural language processing tools frequently fail at this task because they tend to prioritize coverage rather than precision. This study evaluated whether anchoring an entity-linking model to the Research Organization Registry improved precision relative to generic tools.
  • Methods
    We developed a conservative, two-stage model. First, using a normalized registry corpus, we applied rule-based exact matching with geographic validation. Second, selective fuzzy matching was applied only to the remaining nonmatched affiliations. We evaluated model performance against an off-the-shelf spaCy named entity recognition baseline using a manually adjudicated gold standard dataset derived from PubMed Digital Health records. Finally, we assessed the comparative advantage of our model using nonparametric paired comparison tests and bootstrap methods.
  • Results
    Our two-stage approach achieved substantially higher precision (0.97) and recall (0.93) than both the generic baseline (precision, 0.75; recall, 0.47) and unconstrained fuzzy matching models (precision, 0.77; recall, 0.83). This balanced improvement in precision and recall resulted in the highest F1 score (0.95). The ablation study further confirmed that the “exact matching first” strategy was structurally necessary to prevent the inflation of false positives observed when unconstrained fuzzy matching was applied.
  • Conclusion
    Anchoring entity resolution to a canonical registry using a tiered matching strategy substantially enhances the precision of institutional attribution. This approach provides a robust method for correcting metadata quality in editorial and repository workflows.
Background/rationale
Reliable bibliometric analysis depends on precise institutional attribution. To map collaboration networks or track funding, researchers must link raw affiliation strings in publication metadata to persistent organizational identifiers. However, data heterogeneity continues to undermine this process [1–3]. Raw affiliation strings, particularly those in author-generated PubMed fields, often diverge substantially from canonical organizational names and frequently contain ambiguous abbreviations, omitted hierarchical relationships, and irregular orthography.
This divergence leads to systematic misattribution. When analysts attempt to resolve these noisy strings using generic, off-the-shelf natural language processing (NLP) tools or unconstrained string similarity metrics, a critical failure mode emerges [4]. Lacking domain-specific constraints, these models prioritize coverage at the expense of accuracy, producing false positives that obscure the identities of true performing institutions [2,4]. As a result, the field lacks a method capable of parsing high-noise academic metadata with the level of precision required for authoritative bibliometric metrics.
Objectives
To address this trade-off, we developed a conservative, registry-anchored entity-linking model that explicitly prioritizes precision. Our model is anchored to the Research Organization Registry (ROR), which constrains the candidate space of possible institutional entities. Building on this anchor, our approach implements two stages: (1) rule-based exact matching with geographic validation as the primary filter; and (2) selective fuzzy matching applied only to the bounded residual set. This study evaluates whether this exact-matching-first strategy outperforms a generic baseline by improving precision without sacrificing recall. We further examine the structural necessity of the exact-matching-first constraint through an ablation experiment.
Ethics statement
As this study analyzed publicly available bibliographic metadata and involved no human subjects, neither institutional review board approval nor informed consent was required.
Study design
We conducted a descriptive, benchmark-based evaluation comparing an off-the-shelf generic named-entity recognition (NER) approach with the ROR-anchored entity-linking pipeline. Performance was assessed using standard classification metrics, including precision, recall, F1 score, and accuracy. In addition, we applied a paired statistical comparison framework, including the McNemar test and bootstrap confidence intervals with Bonferroni correction, to validate the performance advantage of our proposed model.
Data

ROR corpus

Institutional reference data were sourced from the official ROR distribution [5], comprising 163,963 records. To construct a domain-specific corpus, we consolidated name variants, removed ambiguous tokens (e.g., short acronyms), and manually added high-frequency variants identified in prior work [1,4].
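
The consolidation step described above could look roughly like the following. This is a minimal sketch, not the authors' procedure (see Suppl. 1): the record layout (`id`, `names` with `value` and `types`) is a simplified assumption loosely modeled on the ROR data dump, and the `MIN_ACRONYM_LEN` cutoff, the rule of dropping variants shared by several organizations, and the placeholder IDs are all hypothetical.

```python
from collections import defaultdict

MIN_ACRONYM_LEN = 4  # hypothetical cutoff for what counts as a "short" acronym


def build_variant_index(records, extra_variants=None):
    """Map each lower-cased name variant to a single registry ID."""
    index = defaultdict(set)
    for rec in records:
        for name in rec["names"]:
            value = " ".join(name["value"].split())  # collapse whitespace
            is_acronym = "acronym" in name.get("types", [])
            if is_acronym and len(value) < MIN_ACRONYM_LEN:
                continue  # drop short, ambiguity-prone acronyms
            index[value.lower()].add(rec["id"])
    # Manually curated high-frequency variants (variant -> registry ID)
    for variant, ror_id in (extra_variants or {}).items():
        index[variant.lower()].add(ror_id)
    # Assumption: keep only variants that point to exactly one organization
    return {v: next(iter(ids)) for v, ids in index.items() if len(ids) == 1}


# Placeholder records; "ror-A" and "ror-B" are not real ROR identifiers.
records = [
    {"id": "ror-A", "names": [
        {"value": "Sogang University", "types": ["ror_display", "label"]},
        {"value": "SU", "types": ["acronym"]},
    ]},
    {"id": "ror-B", "names": [
        {"value": "Example Institute of Technology", "types": ["label"]},
        {"value": "SU", "types": ["acronym"]},
    ]},
]
index = build_variant_index(records, {"Sogang Univ": "ror-A"})
```

In this toy input the shared acronym "SU" is discarded, while the manually added variant "Sogang Univ" resolves to the same ID as the canonical name.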

PubMed data

We retrieved 482,086 PubMed records (PMIDs) indexed between 2014 and 2024 using the E-utilities (Entrez programming utilities), a public API (application programming interface) to the US National Center for Biotechnology Information (NCBI) Entrez system (most recent access on December 20, 2025) [6]. The query targeted digital health–related publications and excluded nonhuman research.
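
As an illustration of the retrieval step, the sketch below only builds an ESearch request URL and sends nothing over the network. The search term is a hypothetical stand-in (the study's actual query string is not given in this section), and filtering on publication date via `datetype=pdat` is an assumption about how "indexed between 2014 and 2024" was operationalized.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, start_year, end_year, retmax=100, retstart=0):
    """Build an ESearch URL for PubMed restricted to a date range."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",          # assumption: filter on publication date
        "mindate": str(start_year),
        "maxdate": str(end_year),
        "retmax": retmax,            # number of PMIDs returned per request
        "retstart": retstart,        # offset used for paging through results
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Hypothetical query string, for illustration only.
url = build_esearch_url('"digital health" AND humans[MeSH Terms]', 2014, 2024)
```

Result sets of this size have to be fetched in pages, so a real client would loop over `retstart` or use the history server, and should respect NCBI's request-rate limits.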

Gold standard benchmark

To evaluate model performance, we constructed a gold standard benchmark [7] by randomly sampling 4,000 PMIDs from the full set of 482,086 records and extracting 46,378 author-affiliation strings. Four trained annotators manually mapped affiliation strings to ROR identifiers, labeling unmappable cases as “abstain” or “unlinkable” through a two-pass adjudication process. Detailed procedures for constructing the ROR corpus, PubMed dataset (Dataset 1), and gold standard benchmark are provided in Suppl. 1.
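
The random draw of 4,000 PMIDs can be made reproducible with a seeded generator. The sketch below uses integers as stand-ins for real PMIDs, and the seed value is arbitrary; neither reflects the authors' actual sampling code.

```python
import random


def sample_pmids(pmids, k, seed=0):
    """Draw k distinct PMIDs without replacement, reproducibly."""
    rng = random.Random(seed)  # local generator, independent of global state
    return sorted(rng.sample(pmids, k))


# Stand-in population of 482,086 integer IDs (not real PMIDs).
population = list(range(1, 482_087))
sample = sample_pmids(population, 4_000, seed=42)
```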
Proposed method: ROR-anchored exact-matching-first model
Detailed descriptions of preprocessing, stage 1, and stage 2 procedures are provided in Suppl. 2.

Preprocessing

We parsed 4,000 PubMed XML records to generate 46,378 distinct author-affiliation rows. Preprocessing steps removed noninstitutional noise (e.g., email addresses and postal codes) and normalized text to reduce orthographic inconsistency.
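
A minimal sketch of this kind of cleaning is shown below. The regular expressions are illustrative assumptions, not the authors' rules (Suppl. 2): the postal-code pattern covers only 5-digit codes with an optional 4-digit suffix, and normalization is limited to accent stripping, whitespace collapsing, and lower-casing.

```python
import re
import unicodedata

EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
# Hypothetical postal-code pattern: 5 digits with an optional 4-digit suffix.
POSTAL_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")
LABEL_RE = re.compile(r"electronic address:", re.IGNORECASE)


def clean_affiliation(raw):
    """Strip contact noise and normalize an affiliation string."""
    # Decompose accented characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = LABEL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = POSTAL_RE.sub(" ", text)
    text = re.sub(r"\s+", " ", text)            # collapse whitespace
    text = re.sub(r"\s+([,.;])", r"\1", text)   # no space before punctuation
    return text.strip(" ,.;").lower()


cleaned = clean_affiliation(
    "Department of Psychology, Sogang  University, Seoul 04107, Korea. "
    "Electronic address: someone@example.org."
)
```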

Stage 1 (exact matching with geographic validation)

Stage 1 functioned as a high-precision filtering step. Exact matching was performed using ROR-derived patterns, with longest-first variant prioritization to prevent partial overlaps. Administrative tokens (e.g., “Oxford”) were excluded to reduce spurious matches. In parallel, a spaCy EntityRuler validated candidate matches by intersecting extracted geographic entities with ROR metadata; affiliations lacking geographic alignment were deferred to stage 2.
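
The logic of stage 1 can be sketched in plain Python as follows. This is a simplified stand-in under stated assumptions: a word-boundary search replaces the spaCy EntityRuler used in the study, geographic validation is reduced to finding a known city or country outside the matched name, and the variant table, geographic metadata, and IDs are placeholders.

```python
import re


def contains_phrase(text, phrase):
    """True if phrase occurs in text on word boundaries."""
    return re.search(rf"(?<!\w){re.escape(phrase)}(?!\w)", text) is not None


def stage1_match(affiliation, variants, geo):
    """Return a registry ID, or None to defer the string to stage 2.

    variants: dict of lower-cased name variant -> registry ID
    geo: dict of registry ID -> set of lower-cased city/country names
    """
    text = affiliation.lower()
    # Longest-first, so a full name wins over a shorter overlapping variant
    for variant in sorted(variants, key=len, reverse=True):
        if not contains_phrase(text, variant):
            continue
        ror_id = variants[variant]
        # Assumption: a place name inside the organization name itself
        # does not count as geographic evidence
        remainder = text.replace(variant, " ")
        if any(contains_phrase(remainder, place) for place in geo.get(ror_id, ())):
            return ror_id
        return None  # name matched but geography did not: defer
    return None


# Placeholder data; the IDs are not real ROR identifiers.
variants = {
    "seoul national university": "ror-X",
    "national university": "ror-Y",
}
geo = {
    "ror-X": {"seoul", "korea"},
    "ror-Y": {"san diego", "united states"},
}
```

With these inputs, "Seoul National University, Seoul, Korea" resolves to the longer variant rather than to the embedded "National University", and a bare "Seoul National University" with no location is deferred.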

Stage 2 (selective fuzzy matching)

Selective fuzzy matching was applied only to affiliations unresolved by stage 1. We computed a weighted similarity score incorporating token overlap, edit distance, and rare-token matching. Final acceptance was determined using calibrated score thresholds.
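
One way to combine the three signals is sketched below. The weights, the acceptance threshold, and the definition of a "rare" token (one that appears in a single registry name) are hypothetical, and `difflib.SequenceMatcher.ratio()` is used as a stand-in for an edit-distance similarity; the calibrated values used in the study are described in Suppl. 2.

```python
from collections import Counter
from difflib import SequenceMatcher

WEIGHTS = {"overlap": 0.3, "edit": 0.5, "rare": 0.2}  # hypothetical weights
ACCEPT_THRESHOLD = 0.65                               # hypothetical threshold


def segments(affiliation):
    """Split an affiliation into lower-cased, comma-delimited segments."""
    return [s.strip().lower() for s in affiliation.split(",") if s.strip()]


def segment_score(segment, name, doc_freq):
    """Weighted similarity between one segment and one registry name."""
    seg_tokens, name_tokens = set(segment.split()), set(name.split())
    overlap = len(seg_tokens & name_tokens) / len(seg_tokens | name_tokens)
    edit = SequenceMatcher(None, segment, name).ratio()
    rare = {t for t in name_tokens if doc_freq[t] == 1}
    rare_hit = len(rare & seg_tokens) / len(rare) if rare else 0.0
    return (WEIGHTS["overlap"] * overlap + WEIGHTS["edit"] * edit
            + WEIGHTS["rare"] * rare_hit)


def stage2_match(affiliation, names):
    """names: dict of lower-cased registry name -> ID. Returns ID or None."""
    doc_freq = Counter(t for name in names for t in set(name.split()))
    best_id, best_score = None, 0.0
    for name, ror_id in names.items():
        for seg in segments(affiliation):
            score = segment_score(seg, name, doc_freq)
            if score > best_score:
                best_id, best_score = ror_id, score
    # Abstain when no candidate clears the threshold
    return best_id if best_score >= ACCEPT_THRESHOLD else None


# Placeholder registry; the IDs are not real ROR identifiers.
names = {"sogang university": "ror-A", "yonsei university": "ror-B"}
```

Under these settings a misspelled segment such as "Sogang Univeristy" is still linked, because the rare token "sogang" and the high character similarity together clear the threshold, whereas a department-only string is left unlinked.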
Evaluation strategy

Metrics

Predicted identifiers from each model were compared against gold standard labels using standard classification metrics, including accuracy, precision, recall, and F1 score. Correctness was prioritized to minimize false-positive institutional assignments.
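
For reference, the four metrics follow directly from the confusion-matrix counts. The sketch below applies the standard definitions to the stage 2 counts reported in Table 1; the function name is illustrative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Stage 2 counts from Table 1
stage2 = classification_metrics(tp=38_305, fp=1_388, tn=3_705, fn=2_980)
rounded = {k: round(v, 2) for k, v in stage2.items()}
# rounded == {"precision": 0.97, "recall": 0.93, "f1": 0.95, "accuracy": 0.91}
```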

Baseline method

For comparison, we implemented a baseline using a pretrained, general-domain spaCy NER model with strict surface-form linking. This baseline represents a generic NLP strategy without domain-specific anchoring. Additional details on evaluation metrics and the baseline implementation are provided in Suppl. 3.
Statistical methods
Predicted ROR identifiers were evaluated against the manually adjudicated gold standard benchmark comprising 46,378 affiliation strings. Performance of the spaCy baseline, stage 1 (exact matching with geographic validation), stage 2 (selective fuzzy matching following stage 1), and an unconstrained fuzzy matching ablation was summarized using accuracy, precision, recall, and F1 score. Correctness was prioritized to minimize false-positive institutional assignments, and calibrated thresholds governed acceptance of fuzzy matches in stage 2. The McNemar test was applied using paired correctness indicators derived from the gold standard benchmark to assess the direction of performance differences. Specifically, discordant pairs were compared for stage 2 relative to the baseline, stage 1, and the unconstrained fuzzy matching ablation. In addition, differences in F1 score were estimated using bootstrap confidence intervals. Bonferroni correction was applied to both statistical procedures. The Python code used in this study is available in Suppl. 4.
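The paired comparison can be sketched as follows, assuming the uncorrected form of the McNemar statistic (whether the study applied a continuity correction is not stated here) and three comparisons for the Bonferroni adjustment. The correctness vectors are toy data, not study results.

```python
import math


def discordant_counts(correct_a, correct_b):
    """Count pairs where exactly one of the two models was correct."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    return b, c


def mcnemar(b, c):
    """McNemar chi-square (1 df, no continuity correction), p-value, win-loss ratio."""
    if b + c == 0:
        return 0.0, 1.0, math.nan
    chi2 = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    odds_ratio = b / c if c else math.inf
    return chi2, p_value, odds_ratio


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Toy paired correctness indicators (model A vs. model B)
correct_a = [True] * 30 + [False] * 10 + [True] * 60
correct_b = [False] * 30 + [True] * 10 + [True] * 60
b, c = discordant_counts(correct_a, correct_b)
chi2, p, odds = mcnemar(b, c)
p_adj = bonferroni(p, n_tests=3)
```

Only the discordant pairs enter the statistic; the 60 items that both toy models get right carry no information about which model is better.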
Performance metrics across the baseline, stage 1, stage 2, and ablation models are summarized in Table 1.
Baseline performance: Generic NER
The off-the-shelf spaCy NER model (baseline) missed more correct institutions than it detected, identifying 17,276 correct matches while missing 19,508. Its relatively low precision (0.75) and the lowest recall (0.47) among all evaluated models resulted in the lowest F1 score (0.58) and accuracy (0.46).
Stage 1 performance: Exact matching with geographic validation
Stage 1, which applied registry-derived exact matching with geographic validation, increased precision to 0.98 and recall to 0.87. This configuration achieved the highest precision (0.98) among all tested models. Stage 1 yielded 9,183 negative cases, comprising 5,317 false negatives and 3,866 true negatives. These cases shared characteristics such as noncanonical naming, partial institutional mentions, typographical errors, and other lexical variations, which could reduce the correctness of our model. To preserve high correctness, all negative cases were subsequently forwarded to stage 2 for further resolution.
Stage 2 performance: Selective fuzzy matching after stage 1
This configuration was selected as the preferred model. It applied selective fuzzy matching to the 9,183 residual cases passed from stage 1 and achieved a precision of 0.97, a recall of 0.93, and the highest F1 score (0.95). Although precision decreased marginally by 1 percentage point compared with stage 1, recall and F1 score increased by 6 and 3 percentage points, respectively. Accuracy also increased by 4 percentage points, reaching the highest observed level (0.91).
Ablation study: Unconstrained fuzzy matching
Removing the stage 1 filter (exact matching with geographic validation) resulted in reduced precision (0.77) and recall (0.83). This unconstrained configuration generated 8,314 false positives, representing a nearly ninefold increase relative to stage 1. This result demonstrates that unconstrained fuzzy matching tends to over-match plausible but incorrect institutions when affiliation strings are noisy.
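The size of this increase can be checked directly from the false-positive counts in Table 1; the short calculation below is only an arithmetic check, and the helper name is illustrative.

```python
# False-positive counts as reported in Table 1
false_positives = {"baseline": 5_660, "stage_1": 926, "stage_2": 1_388, "ablation": 8_314}


def fold_change(new, old):
    """How many times larger `new` is than `old`."""
    return new / old


ratio_vs_stage1 = fold_change(false_positives["ablation"], false_positives["stage_1"])
ratio_vs_stage2 = fold_change(false_positives["ablation"], false_positives["stage_2"])
# ratio_vs_stage1 is about 8.98 (nearly ninefold); ratio_vs_stage2 is about 5.99
```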
Key results
Our proposed two-stage ROR-anchored model (stage 2) substantially outperformed both the generic spaCy baseline and the unconstrained fuzzy matching approach. Using the McNemar test applied to paired correctness indicators, we confirmed that stage 2 significantly outperformed all other models in achieving correctness at the 1% significance level with the Bonferroni correction (Table 1). In addition, the win-loss odds ratio for stage 2 ranged from 12.7 to 34.6, indicating that, among discordant cases, stage 2 was correct 12.7 to 34.6 times for every case in which only the comparator model was correct (Table S3.1 in Suppl. 3). Finally, bootstrap confidence intervals for the F1 score difference confirmed that stage 2 increased the F1 score by 2.8% to 63.5% (Fig. 1, Table S3.2 in Suppl. 3).
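A paired percentile bootstrap for the F1 difference could be set up along the following lines. This is a sketch under assumptions: each affiliation is reduced to one outcome label per model, the interval is for the absolute F1 difference (whereas the study reports percentage changes), the number of resamples and the seed are arbitrary, and the data are toy values. The default `alpha = 0.05 / 3` corresponds to the Bonferroni-corrected 98.3% intervals shown in Fig. 1.

```python
import random


def f1_from_outcomes(outcomes):
    """F1 from a list of per-item outcome labels: 'tp', 'fp', 'tn', 'fn'."""
    tp, fp, fn = outcomes.count("tp"), outcomes.count("fp"), outcomes.count("fn")
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_diff(outcomes_a, outcomes_b, n_boot=1000, alpha=0.05 / 3, seed=0):
    """Percentile CI for F1(a) - F1(b), resampling items jointly (paired)."""
    rng = random.Random(seed)
    n = len(outcomes_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # same indices for both models
        a = [outcomes_a[i] for i in idx]
        b = [outcomes_b[i] for i in idx]
        diffs.append(f1_from_outcomes(a) - f1_from_outcomes(b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy paired outcomes for two models over the same 100 items
outcomes_a = ["tp"] * 90 + ["fn"] * 10
outcomes_b = ["tp"] * 60 + ["fn"] * 40
lower, upper = bootstrap_f1_diff(outcomes_a, outcomes_b)
```

Resampling the same indices for both models preserves the pairing, so the interval reflects the variability of the difference rather than of each F1 score separately.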
Interpretation
Our analysis yields three primary insights regarding institutional entity-linking. First, generic NER models without domain-specific anchoring exhibit limited recall. The baseline model’s failure to detect the majority of valid institutions stems from the mismatch between its general-domain training data and the highly heterogeneous structure of scholarly affiliation strings. Orthographic inconsistencies, abbreviations, truncated institution names, and domain-specific naming conventions hindered precise NER.
In bibliometric applications, such omissions systematically bias collaboration and productivity indicators by undercounting institutions represented through nonstandard, abbreviated, or multilingual forms.
Second, anchoring entity-linking to a canonical registry such as ROR, combined with an exact-matching-first strategy, achieves a strong balance between precision and recall, as reflected by the highest F1 score of stage 2 and the second highest of stage 1. The two-stage models validated the hypothesis that high-precision exact matching provides a stable foundation for institutional resolution. This stage operationalized a conservative standard of correctness by accepting institution mentions only when they aligned with registry-derived surface forms and geographic metadata. As a result, the approach mitigated homonym-based conflation and the overmatching behavior inherent in models that rely on unconstrained similarity metrics.
Third, fuzzy matching is effective only when applied selectively. Stage 2 demonstrated that selective fuzzy matching improved recall by 6 percentage points by resolving residual nonmatches passed from stage 1, while preserving high precision (0.97). The ablation study showed that unconstrained fuzzy matching inflated false positives by 46.9% relative to the baseline (8,314 vs. 5,660; Table 1) because the algorithm overmatched plausible but incorrect institutions in noisy affiliation strings. These findings indicate that exact matching in the first stage is a structural requirement for controlling false positives, rather than a minor optimization.
Implications for bibliometrics and workflows
Correctness represents the limiting factor in institutional affiliation linking. False positives contaminate institutional indicators, distort collaboration networks, and undermine trust in policy-relevant evaluations. The model’s high correctness therefore supports reliable institutional analytics derived from PubMed data, while its registry-anchored design enables principled abstention. In repository ingestion pipelines or editorial quality control workflows, abstaining on uncertain cases, such as department-only mentions, hospitals lacking disambiguating metadata, or organizations absent from the registry, is preferable to forcing low-confidence assignments.
Limitations
This study has several limitations. First, the gold standard benchmark was derived from a digital health–focused subset of PubMed, and performance may differ across disciplines with distinct affiliation conventions. Second, recall recovery in the model depends on calibrated thresholds and weighting schemes used in fuzzy matching, which may require adjustment for other domains. Third, some affiliation strings are intrinsically ambiguous, such as department-only mentions or records lacking geographic context, and cannot be resolved without external evidence.
Conclusions
The heterogeneity of affiliation strings in PubMed prevents generic NER models and unconstrained fuzzy matching approaches from delivering reliable institutional attribution at scale. The ROR-anchored, two-stage approach, which enforces an exact-matching-first strategy followed by selective, tiered fuzzy matching, provides a practical foundation for improving the integrity of institutional metadata used in bibliometrics and research evaluation.

Conflict of interest

No potential conflict of interest relevant to this article was reported.

Funding

This work was supported by the Support Program for Fostering Talent in Industrial Innovation (No. RS-2025-02217527), funded by the Korean Ministry of Industry, Trade and Resources.

Data availability

The dataset file is available from the Harvard Dataverse at https://doi.org/10.7910/DVN/M5PRZB.

Dataset 1. The dataset analyzed during the current study.

kcse-396-dataset-1.csv

Supplementary files are available from https://doi.org/10.7910/DVN/M5PRZB and GitHub (Suppl. 4; https://github.com/peterkang222/DEIR_ROR-anchored_Two-stage_NER).
Suppl. 1. Methods for constructing the ROR corpus, PubMed sample, and gold standard.
kcse-396-Supplementary-1.pdf
Suppl. 2. Model implementation details.
kcse-396-Supplementary-2.pdf
Suppl. 3. Evaluation metrics and baseline implementation.
kcse-396-Supplementary-3.pdf
Suppl. 4. Python codes for data analysis.
kcse-396-Supplementary-4.zip
Fig. 1.
Model performance differences. Error bars indicate Bonferroni-corrected 98.3% confidence intervals. Baseline, generic named-entity recognition. Stage 1, exact matching with geographic validation. Stage 2, selective fuzzy matching after stage 1. Ablation study, unconstrained fuzzy matching.
Table 1.
Comparative performance

| Metric | Baseline | Stage 1 | Stage 2 | Ablation study |
|---|---|---|---|---|
| True positive | 17,276 | 36,269 | 38,305 | 28,452 |
| False positive | 5,660 | 926 | 1,388 | 8,314 |
| True negative | 3,934 | 3,866 | 3,705 | 3,895 |
| False negative | 19,508 | 5,317 | 2,980 | 5,717 |
| Precision | 0.75 | 0.98 | 0.97 | 0.77 |
| Recall | 0.47 | 0.87 | 0.93 | 0.83 |
| F1 score | 0.58 | 0.92 | 0.95 | 0.80 |
| Accuracy | 0.46 | 0.87 | 0.91 | 0.70 |
| χ² (stage 2 vs. model) | 19,629.76 | 1,600.19 | - | 8,931.00 |
| P-value | <0.001*** | <0.001*** | - | <0.001*** |

The McNemar test was implemented to test the superiority of the stage 2 model (three test cases: stage 2 vs. baseline; stage 2 vs. stage 1; stage 2 vs. ablation study). Baseline, generic named-entity recognition. Stage 1, exact matching with geographic validation. Stage 2, selective fuzzy matching after stage 1. Ablation study, unconstrained fuzzy matching.

*** P<0.001 (The significance level was 1% with the Bonferroni correction).

  • 1. Jonnalagadda SR, Topham P. NEMO: extraction and normalization of organization names from PubMed affiliations. J Biomed Discov Collab 2010;5:50-75. https://doi.org/10.5210/disco.v5i0.3047
  • 2. Huang S, Yang B, Yan S, Rousseau R. Institution name disambiguation for research assessment. Scientometrics 2014;99:823-38. https://doi.org/10.1007/s11192-013-1214-2
  • 3. Torvik VI. MapAffil: a bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide. Dlib Mag 2015;21:10.1045/november2015-torvik. https://doi.org/10.1045/november2015-torvik
  • 4. Caron E, Daniels H. Identification of organization name variants in large databases using rule-based scoring and clustering: with a case study on the Web of Science database. In: Hammoudi S, Maciaszek L, Missikoff MM, Camp O, Cordeiro J, editors. Proceedings of the 18th International Conference on Enterprise Information Systems, Volume 1. 18th International Conference on Enterprise Information Systems (ICEIS 2016); 2016 Apr 25–28; Rome, Italy. SciTePress; 2016. p. 182–7. https://doi.org/10.5220/0005836701820187
  • 5. Research Organization Registry. ROR data [Internet]. Version 2. Zenodo; 2025 [cited 2025 Dec 20]. Available from: https://zenodo.org/records/17953395
  • 6. Sayers E. A general introduction to the E-utilities [updated 2022 Nov 17]. In: Entrez programming utilities help [Internet]. US National Center for Biotechnology Information; [cited 2025 Dec 20]. Available from: https://www.ncbi.nlm.nih.gov/books/NBK25497
  • 7. Wissler L, Almashraee M, Monett D, Paschke A. The gold standard in corpus annotation. Presented at: 5th IEEE Germany Student Conference; 2014 Jun 26–27; Passau, Germany. https://doi.org/10.13140/2.1.4316.3523
