Named Entity Recognition (NER) and Medical Ontology Mapping for Clinical Documentation

Research Proposal

Dinith Perera
Computer Science
FYP 2026/27

1. Introduction

1.1 Background

Electronic Health Records (EHRs) contain large volumes of patient data, but research shows that approximately 80% of this clinical documentation exists as unstructured free text (Dash et al., 2019). While doctors use this open format to write detailed notes, machines cannot easily process or interpret unstructured information. This prevents healthcare systems from effectively searching, analyzing, or utilizing the data at scale. To solve this issue, unstructured clinical notes must be converted into structured formats. This requires extracting key medical concepts from the text and mapping them to universally recognized medical ontologies, such as ICD-11 and SNOMED CT (Wang et al., 2018). Standardizing the data makes it possible for algorithms to process the information, enabling large-scale analysis and automated medical research.

1.2 Problem Definition

Despite advances in Natural Language Processing (NLP), existing systems struggle with clinical documentation from low-resource settings. Most commercial clinical NLP models are exclusively trained on US-based datasets, such as MIMIC-III. Consequently, they perform poorly when applied to healthcare data from South Asian countries, where clinical notes exhibit distinct linguistic patterns, localized drug naming conventions, and non-standard abbreviation usage (Ramachandran et al., 2021).

To illustrate this disparity, the following is a fictional sample, modelled on real note formats written by Sri Lankan doctors and collected during a problem-identification survey:

No ischemic changes. Poor R wave progression. BP 216/115 mmHg. B/L B profile+ IV cannula inserted RBS 185mg/dl VBG done Inj GTN IV infusion started at 6p.m

This note demonstrates the heavy use of abbreviations (e.g., "B/L", "RBS", "VBG") and specific medication formatting (e.g., "Inj GTN") that are not commonly represented in Western training data. Because current NER models and LLMs fail to accurately interpret these localized variations, they are highly prone to hallucination or silent medical errors. Thus, there is a critical need to adapt and evaluate clinical NLP architectures to ensure accuracy and safety within South Asian healthcare contexts.
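To illustrate one preprocessing step such localization requires, the sketch below expands locally common abbreviations using a hand-built lexicon. The mapping shown is a small hypothetical sample for illustration only; the project's actual lexicon would be compiled and validated with clinicians.

```python
import re

# Hypothetical lexicon of locally common clinical abbreviations.
# A real lexicon would be compiled and validated with clinicians.
LOCAL_ABBREVIATIONS = {
    "B/L": "bilateral",
    "RBS": "random blood sugar",
    "VBG": "venous blood gas",
    "Inj": "injection",
    "GTN": "glyceryl trinitrate",
}

def expand_abbreviations(note: str) -> str:
    """Replace known abbreviations with their expansions (whole tokens only)."""
    pattern = re.compile(
        r"(?<![\w/])("
        + "|".join(re.escape(a) for a in LOCAL_ABBREVIATIONS)
        + r")(?![\w/])"
    )
    return pattern.sub(lambda m: LOCAL_ABBREVIATIONS[m.group(1)], note)

note = "B/L B profile+ IV cannula inserted RBS 185mg/dl VBG done Inj GTN IV infusion started"
print(expand_abbreviations(note))
# → bilateral B profile+ IV cannula inserted random blood sugar 185mg/dl
#   venous blood gas done injection glyceryl trinitrate IV infusion started
```

Even this toy example shows why a dictionary alone is insufficient: abbreviations like "B" are ambiguous without context, which motivates the model-based NER approach proposed below.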

1.3 Research Objectives

  • To compile and annotate a specialized clinical corpus representative of non-western English (Sri Lankan) medical documentation, capturing local abbreviations, drug formats, and linguistic patterns.
  • To develop and fine-tune a Named Entity Recognition (NER) architecture optimized for extracting clinical entities from low-resource, highly abbreviated text.
  • To implement a mapping system that links extracted localized clinical entities to standard global medical ontologies (e.g., SNOMED CT, ICD-11).
  • To evaluate the performance of the proposed NLP pipeline against existing commercial or US-trained state-of-the-art models to demonstrate improvements in accuracy and reductions in hallucination within the target context.

2. Literature Review: Current Gaps

The literature review for this proposal was conducted using PubMed, Google Scholar, and IEEE Xplore. The search strategy targeted subjects including "Clinical NLP," "Clinical Documentation," and "ICD-11." The identified research highlights several limitations in existing systems, with the specific gaps synthesized below:

2.1 Multilingual & Non-English Clinical Notes

Description: Models fail on non-English clinical records.

The framework lacks evaluation on unstructured or multilingual healthcare records. Future work could consider semantic adjustments for multilingual records to enhance global applicability.
JMIR Medical Informatics (2025)
Most prior studies focus on a single type of data, a specific task, or a single language, which limits their generalisability.
Evaluating Open-Weight LLMs for Structured Extraction (2025)

2.2 Cross-Institution Generalization

Description: Models trained at one hospital fail at another due to differing documentation styles.

While models performed well when trained and tested on individual datasets, cross-dataset generalization highlighted remaining obstacles.
npj Digital Medicine, Social Determinants of Health Extraction (2025)
Future research should extend validation across multiple institutions and diversify annotated datasets.
JMIR Medical Informatics, NLP & ICD-10 Bleeding Events (2025)

2.3 Negation and Uncertainty Detection

Description: “No evidence of pneumonia” must NOT generate a pneumonia code. Still unsolved.

Negation detection performance suffers when there is no in-domain development or training data: practical negation detection can be optimized for a domain but not generalized.
PMC, Negation’s Not Solved: Generalizability Versus Optimizability in Clinical NLP
Future work should further refine temporal reasoning and negation capabilities within NLP algorithms.
JMIR Medical Informatics, NLP & ICD-10 Bleeding Events (2025)

2.4 Rare Disease & Unseen Code Coverage

Description: Models have never seen ~90% of ICD-10 codes in training, and fail on rare diseases.

LLMs severely underperform in rare disease differential diagnosis. ICD codes significantly undercount rare diseases since many lack direct mappings to comprehensive rare disease databases like Orphanet.
MIMIC-RD, arXiv (2026)
Even MIMIC-IV includes only about 10% of the 70K ICD-10-CM codes, severely restricting generalization to unseen codes.
From Documents to Spans, arXiv (2026)

2.5 Hallucination in Clinical Extraction

Description: Models fabricate codes, citations, and clinical facts not present in the source text.

In medical case summaries, hallucinations reached 64.1% without mitigation prompts.
LLM Hallucination Statistics (2026)
Medical LLMs struggle to generalize beyond their training data, particularly when faced with rare diseases, novel treatments, or atypical clinical presentations — producing erroneous or irrelevant outputs.
Medical Hallucination in Foundation Models, medRxiv (2025)

2.6 Interdependent Variable Consistency

Description: Extracting one field (e.g., tumour stage) that logically constrains another (e.g., treatment type) — models produce internally inconsistent structured outputs.

Existing LLM-based extraction pipelines often struggle to capture interdependencies among variables, leading to clinically inconsistent outputs.
Deep Reflective Reasoning, arXiv (2026)

2.7 Privacy-Preserving On-Premise Deployment

Description: Most high-performing models are cloud-based, making them legally unusable in many hospitals.

Most published work has relied on commercial, cloud-based models such as ChatGPT. Keeping data under local control and ensuring patient privacy are critical when working with clinical records.
Evaluating Open-Weight LLMs for Structured Extraction (2025)

2.8 Structured Output Format Reliability (JSON/Schema Validity)

Description: Small local models frequently produce malformed structured outputs that downstream systems cannot parse.

Syntactic validity — whether outputs conformed to a given format — was assumed rather than explicitly evaluated in most studies.
Evaluating Structured Output Robustness of Small Language Models, arXiv (2025)

2.9 Dual Ontology Consistency (ICD + SNOMED simultaneously)

Description: No method reliably maps a single mention to both ICD and SNOMED without conflicts.

The parallel use of ICD-10-CM for discharge diagnoses and SNOMED CT for the health problem list might affect user consistency. Future developments aim to bridge this gap by mapping ICD-10-CM codes to SNOMED CT concepts to enrich or validate the problem list over time.
Journal of Medical Systems, SNOMED CT Upstream Coding (2025)

2.10 Temporal Reasoning Across Longitudinal Records

Description: Models cannot reliably track when a condition was active vs. resolved across multiple visits.

LLMs are still struggling with temporal progression and rare disease prediction across long patient trajectories.
Zero-Shot LLMs for Long Clinical Text Summarization with Temporal Reasoning, medRxiv (2025)

3. Proposed Methodology

This section outlines the twofold methodological framework: first, the design of a specialized technical architecture for structuring localized clinical data; second, the systematic research protocol used to construct the dataset, adapt the models, and measure their efficacy against established baselines.

3.1 Proposed System Architecture

The target system processes raw, localized clinical text through a sequence of extraction and mapping modules to produce structured, interoperable outputs:

  • Raw Clinical Notes: unstructured local clinical text.
  • NER & Features: identifying entities and relations.
  • Ontology Linking: normalizing entities to ICD-11 & SNOMED CT.
  • Structured Database: standardized JSON for clinical analytics.

Figure 1: Proposed System Architecture for Clinical Data Structuring Pipeline

The proposed system architecture follows a modular extraction-normalization-storage pipeline. It begins with raw, unstructured clinical notes often containing localized medical terminology. The NER (Named Entity Recognition) layer identifies crucial medical entities, which are then passed through an ontology linking module that maps these concepts to international standards like SNOMED CT and ICD-11, ensuring semantic interoperability across healthcare systems. Finally, the structured output is stored in a relational format optimized for downstream clinical analytics and research database integration.
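A minimal sketch of this extraction-normalization-storage flow is shown below, assuming Python. The rule-based patterns stand in for the fine-tuned NER model, and the ontology lookup table (including the code shown) is a placeholder for real SNOMED CT / ICD-11 terminology services, not an actual mapping:

```python
import json
import re
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Entity:
    text: str
    label: str
    code: Optional[str] = None     # ontology code after linking
    system: Optional[str] = None   # e.g. "SNOMED CT" or "ICD-11"

# Stage 1 (placeholder NER): simple patterns standing in for the
# fine-tuned NER model developed in Phase 2.
PATTERNS = {
    "MEDICATION": re.compile(r"\bGTN\b"),
    "OBSERVATION": re.compile(r"\bBP \d{2,3}/\d{2,3} mmHg\b"),
}

# Stage 2 (placeholder ontology linking): the code below is a dummy
# placeholder, not a real SNOMED CT identifier.
ONTOLOGY = {
    "GTN": ("0000000", "SNOMED CT"),
}

def run_pipeline(note: str) -> str:
    """Extract entities, link them to an ontology, and emit standardized JSON."""
    entities = []
    for label, pat in PATTERNS.items():
        for m in pat.finditer(note):
            ent = Entity(text=m.group(), label=label)
            if ent.text in ONTOLOGY:
                ent.code, ent.system = ONTOLOGY[ent.text]
            entities.append(ent)
    # Stage 3: standardized JSON for downstream storage and analytics.
    return json.dumps([asdict(e) for e in entities], indent=2)

print(run_pipeline("BP 216/115 mmHg. Inj GTN IV infusion started"))
```

The modular boundaries matter more than the placeholder internals: each stage (NER, linking, serialization) can be swapped independently as the fine-tuned models become available.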

3.2 Research Execution Phases

To successfully build and validate the architecture above, this research will proceed through three systematic execution phases:

  • Phase 1 (Dataset & Annotation): acquiring notes and establishing a clinician-annotated gold standard.
  • Phase 2 (Model Adaptation): adapting open-weight models to localized syntax.
  • Phase 3 (Evaluation & Benchmarking): testing against baselines to validate accuracy.

Figure 2: Phased Execution Plan for Model Development and Evaluation

The research is divided into three distinct execution phases to ensure systematic development. Phase 1 focuses on data acquisition and the creation of a clinician-annotated "gold standard" corpus, which serves as the ground-truth benchmark for all subsequent experiments. Phase 2 involves the core technical work of adapting and fine-tuning open-weight NLP models to account for the unique medical syntax and localized vocabulary found in the target context. Phase 3 concludes with a rigorous evaluation against existing US-trained baseline models to measure improvements in precision, recall, and practical clinical utility.

3.3 Evaluation Framework & Metrics

The performance of the fine-tuned architecture will be evaluated using standard NLP metrics, including Precision, Recall, and the F1-score. For the annotation phase (Phase 1), the reliability of the "gold standard" will be validated using Cohen’s Kappa or Fleiss’ Kappa to measure inter-annotator agreement between participating clinicians.

Beyond standard extraction metrics, the framework will undergo a detailed error analysis to quantify the reduction in hallucinated entities (incorrectly generated medical facts) relative to baseline models such as GPT-4 or standard ClinicalBERT. This ensures the system is not only accurate in extracting text but also safe for clinical decision support.
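The core metrics can be computed as in the sketch below. It assumes exact span-and-label matching for entity scoring and two annotators for Cohen's Kappa; both are illustrative choices rather than the finalized evaluation protocol.

```python
def precision_recall_f1(gold: set, predicted: set):
    """Entity-level scores with exact span+label matching (one common
    evaluation choice; partial-match schemes are also used in NER work)."""
    tp = len(gold & predicted)  # true positives: exact matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0

gold = {("GTN", "MEDICATION"), ("hypertension", "DIAGNOSIS")}
pred = {("GTN", "MEDICATION"), ("pneumonia", "DIAGNOSIS")}
p, r, f = precision_recall_f1(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # → P=0.50 R=0.50 F1=0.50
```

Fleiss' Kappa generalizes the same observed-versus-expected agreement idea to more than two annotators, which is why Phase 1 keeps both options open depending on how many clinicians participate.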

4. Expected Outcomes & Significance

4.1 Expected Outcomes

The primary output of this research will be a validated framework for processing clinical text in low-resource settings. Specifically, the following deliverables are expected:

  • Localized Clinical Corpus: A curated and de-identified dataset of clinical notes representing the linguistic and formatting nuances of non-Western (Sri Lankan) medical documentation.
  • Clinician-Annotated Gold Standard: A high-fidelity reference dataset with multi-layered annotations, verified for inter-annotator agreement (e.g., Fleiss' Kappa).
  • Domain-Adapted NER Architecture: A fine-tuned open-weight NLP model specifically optimized to handle localized medical shorthand and non-standard syntax with high precision.
  • Comparative Performance Benchmark: A comprehensive evaluation report demonstrating the model's superiority in reducing hallucination and improving entity extraction accuracy compared to general-purpose Western baselines.

4.2 Significance of the Research

This research holds significant value for both the technical NLP community and the broader healthcare system:

  • Addressing Global Data Gaps: By focusing on underrepresented regional data, this study contributes to the "de-Westernization" of clinical NLP, making AI tools more equitable and effective globally.
  • Enhancing Clinical Decision Support: Accurate extraction of clinical data is a prerequisite for reliable AI-driven healthcare. This work directly impacts patient safety by minimizing errors in automated medical records.
  • Enabling Digital Transformation in Resource-Limited Settings: The findings will provide a roadmap for developing countries to utilize their existing legacy documentation for modern health analytics and evidence-based policy making.

References