The foundational capabilities of large language models in predicting postoperative risks using clinical notes

Our dataset originated from electronic anesthesia records (Epic) for all adult patients undergoing surgery at Barnes-Jewish Hospital (BJH), the largest healthcare system in the greater St. Louis (MO) region, spanning four years from 2018 to 2021. The dataset included n = 84,875 preoperative notes and their associated surgical outcomes.

The text-based notes were single-sentence documents with a vocabulary size of V = 3203, and mean word and vocabulary lengths of \(\overline{l}_w = 8.9\) (SD: 6.9) and \(\overline{l}_v = 7.3\) (SD: 4.4), respectively. The textual data contained detailed information on planned surgical procedures, which were derived from smart text records during patient consultations. To preserve patient privacy, the data was provided in a de-identified and censored format, with patient identifiers removed and procedural information presented in an arbitrary order, ensuring that no information could be traced back to any uniquely identifiable patient.
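The corpus statistics above (vocabulary size V, mean note length in words, and mean per-note vocabulary size) can be computed directly from the raw notes. The following is an illustrative sketch with toy notes, not the study's code; the whitespace tokenization is an assumption.

```python
# Toy sketch (not the study's pipeline): computing overall vocabulary size V,
# mean note length in words, and mean per-note vocabulary size for a list of
# single-sentence notes, using simple whitespace tokenization.
from statistics import mean

def corpus_stats(notes):
    """Return (vocabulary size V, mean word length, mean per-note vocab size)."""
    tokenized = [note.lower().split() for note in notes]
    vocab = set()
    for tokens in tokenized:
        vocab.update(tokens)
    mean_words = mean(len(tokens) for tokens in tokenized)
    mean_vocab = mean(len(set(tokens)) for tokens in tokenized)
    return len(vocab), mean_words, mean_vocab

# Hypothetical single-sentence procedure notes for illustration only.
notes = [
    "laparoscopic cholecystectomy with intraoperative cholangiogram",
    "total knee arthroplasty right knee",
    "cataract extraction with intraocular lens implant left eye",
]
V, lw, lv = corpus_stats(notes)
print(V, lw, lv)
```

Per-note vocabulary length is smaller than word length whenever a note repeats a token, which matches the reported gap between \(\overline{l}_w\) and \(\overline{l}_v\).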

Of the 84,875 patients, the distribution of patient types included 17% (14,412) in orthopedics, 8.8% (7442) in ophthalmology, and 7.4% (6236) in urology. The gender distribution was 50.3% male (42,722 patients). Ethnic representation included 74% White (62,563 patients), 22.6% African American (19,239 patients), 1.7% Hispanic (1488 patients), and 1.2% Asian (1015 patients). The mean weight was 86 kg (± 24.7 kg) and the mean height was 170 cm (± 11 cm). Other characteristics of the patients can be found in Table 6.

Outcomes

Our six outcomes were: 30-day mortality, acute kidney injury (AKI), pulmonary embolism (PE), pneumonia, deep vein thrombosis (DVT), and delirium. These six outcomes remain pertinent in perioperative care, particularly during OR-ICU handoff1,56,57.

AKI was defined according to the Kidney Disease Improving Global Outcomes criteria and was determined using a combination of laboratory values (serum creatinine) and dialysis event records. Exclusion criteria for AKI included baseline end-stage renal disease, as indicated by structured anesthesia assessments, laboratory data, and billing data. Delirium was determined from nurse flow-sheets (positive Confusion Assessment Method for the Intensive Care unit test result); pneumonia, DVT, and PE were determined based on the International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10) diagnosis codes. Patients without delirium screenings were excluded from the analysis of that complication.

Data quality

The dataset originally included a total of 90,005 records with postoperative outcomes. However, 5130 instances did not have clinical notes associated with their EHR records, resulting in a final sample size of n = 84,875. A total of 72,697 patients without delirium screenings were excluded from the analysis pertaining to the delirium outcome, resulting in an 86% missing rate for that complication. The remaining five outcomes—30-day mortality, DVT, PE, pneumonia, and AKI—had complete data and no missing outcomes among all cases with associated clinical notes.

Traditional word embeddings and pretrained LLMs

We employed a combination of BERT30 and GPT-based31 large language models (LLMs) that were trained on either biomedical or open-source clinical corpora—specifically BioGPT21, ClinicalBERT17, and BioClinicalBERT18—for predicting risk factors from notes taken during perioperative care. BioGPT is a 347-million-parameter model trained on 15 million PubMed abstracts21. It adopts the GPT-2 model architecture, making it an auto-regressive model trained on the language modeling task, which seeks to predict the next word given all preceding words. In contrast, ClinicalBERT was trained on the publicly available Medical Information Mart for Intensive Care III (MIMIC-III) dataset, which contains 2,083,180 de-identified clinical notes associated with critical care unit admissions17. It was initialized from the BERT-base architecture with the masked language modeling (MLM) and next sentence prediction (NSP) objectives30. Similarly, BioClinicalBERT was pretrained on all the available clinical notes associated with the MIMIC-III dataset18. However, unlike ClinicalBERT, BioClinicalBERT was based on the BioBERT model58, which itself was trained on 4.5 billion words of PubMed abstracts and 13.5 billion words of PubMed Central full-text articles. This allowed BioClinicalBERT to leverage texts from both the biomedical and clinical domains.

These models have been tested across representative NLP benchmarks in the medical domains, including Question-Answering tasks benchmarked by PubMedQA21, recognizing entities from texts18 and logical relationships in clinical text pairs59. In addition to the clinically-oriented LLMs, we included traditional NLP word embeddings as baselines for comparison. These include word2vec’s continuous bag-of-words (CBOW)22, doc2vec23, GloVe24, and FastText25.

By comparing transformer-based LLMs with traditional word embeddings, we aim to quantify the improvements that clinically oriented pretrained LLMs offer in grasping and contextualizing perioperative text for postoperative risk prediction, relative to traditional word embeddings that represent each token as an independent vector.

Transfer learning: self-supervised fine-tuning

A comparison of the distinct fine-tuning methods employed in our study is illustrated in Fig. 1.

To bridge the gap between the corpora of the pretrained models and that of perioperative notes, we first expose and adjust these models to our perioperative text through self-supervised fine-tuning. We adapt the pretrained LLMs with our training data in accordance with their existing training objectives. This process leverages the information contained within the source domain and exploits it to align the distributions of source and target data. For BioGPT, this entails the language modeling task21,31. For ClinicalBERT and BioClinicalBERT, it involves the masked language modeling (MLM), as well as the Next Sentence Prediction (NSP) objectives if the document contains multiple sentences, as elaborated in the “Traditional word embeddings and pretrained LLMs” section above30.

Transfer learning: incorporating labels into fine-tuning

Building on the anticipated improvements from self-supervised fine-tuning, we took a step further by incorporating labels into the fine-tuning process, in the hope of boosting predictive performance as demonstrated in past studies17,45. We do this through a semi-supervised approach, as illustrated in Fig. 1. In contrast to the self-supervised approach, which adjusts weights based solely on training texts, the semi-supervised method infuses label information during fine-tuning. In doing so, the model leverages an auxiliary fully-connected feed-forward neural network atop the contextual representations found in the final layer of hidden states to predict the labels as part of its fine-tuning process. The auxiliary neural network uses the Binary Cross-Entropy (BCE), Cross-Entropy (CE), and Mean-Square-Error (MSE) losses for binary classification, multi-label classification, and regression tasks, respectively. A λ parameter was introduced to balance the magnitude of losses between the supervised and self-supervised objectives. The λ parameter, as well as all other parameters used to fine-tune the models, is detailed in Table 7. The values for λ were selected to balance each model's self-supervised objective against the losses across distinct labels within the same training batch, ensuring both losses converged at similar rates. Hence, in addition to the potential improvements brought by the self-supervised training objectives, we can leverage the labels to supervise the textual embeddings to better align with the training labels.
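The loss-balancing idea can be sketched numerically: the total fine-tuning loss combines the model's self-supervised objective with a λ-weighted supervised loss from the auxiliary head (BCE in the binary case). The following toy example is an illustration of this combination only; the function names and the numeric values are hypothetical, not taken from the study.

```python
# Illustrative sketch (hypothetical values, not the authors' code): the total
# fine-tuning loss is the self-supervised loss plus a lambda-weighted
# supervised loss from an auxiliary head (BCE for binary labels).
import math

def bce_loss(probs, labels):
    """Mean binary cross-entropy over a batch of predicted probabilities."""
    eps = 1e-12  # guards log(0)
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for p, y in zip(probs, labels)
    ) / len(labels)

def combined_loss(self_supervised_loss, probs, labels, lam):
    """Self-supervised objective plus lambda-weighted supervised BCE term."""
    return self_supervised_loss + lam * bce_loss(probs, labels)

# Toy batch: an MLM/LM loss of 2.0 and two auxiliary binary predictions.
total = combined_loss(2.0, probs=[0.9, 0.2], labels=[1, 0], lam=0.5)
print(round(total, 4))
```

In practice λ is chosen, as described above, so that the two terms shrink at comparable rates during training rather than one dominating the gradient signal.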

Table 7 Details of parameters selected when fine-tuning each large language model

Foundation fine-tuning strategy

To build a foundation model with knowledge across all tasks, we extended the above-mentioned strategies and exploited all labels available within the dataset, including but not limited to the selected tasks, as inspired by Aghajanyan et al.46. This involves employing a multi-task learning framework for knowledge sharing across all available labels from all six outcomes—death in 30 days, AKI, PE, pneumonia, DVT, and delirium—present in the dataset. This task-agnostic approach therefore enables knowledge sharing across all available labels. The model thereby becomes foundational in the sense that it solves various tasks simultaneously, meaning a single robust model can be deployed for a wide range of postoperative risks. This contrasts with previous approaches that required separate models dedicated to each specific outcome. To achieve this, each label is assigned a task-specific auxiliary fully-connected feed-forward neural network, and the losses across all labels are pooled together. To control the magnitude of losses between the task-specific auxiliary networks, a λ parameter is introduced as a weight on the pooled loss, \(\lambda \cdot \sum_{i=1}^{m} \mathrm{loss}_i\) for outcome i across m total outcomes, contributing to the overall loss calculation across all labels from each training batch. The λ parameter, as well as all other parameters used to fine-tune the models, is detailed in Table 7. The values for λ were selected to balance each model's self-supervised objective against the losses across distinct labels within the same training batch, ensuring both losses converged at similar rates.
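The pooling step above reduces to a short computation: each of the m outcome heads contributes one loss, and λ weights their sum against the self-supervised term. The sketch below illustrates only this arithmetic; the values and the λ of 0.1 are hypothetical.

```python
# Illustrative sketch (not the authors' code) of foundation-style loss
# pooling: total = L_self + lambda * sum_i loss_i over m outcome heads.

def foundation_loss(self_supervised_loss, head_losses, lam):
    """Self-supervised term plus lambda-weighted sum of per-outcome losses."""
    return self_supervised_loss + lam * sum(head_losses)

# Toy batch: one auxiliary-head loss per outcome (e.g., 30-day mortality,
# AKI, PE, pneumonia, DVT, delirium), pooled with a hypothetical lambda.
losses = [0.7, 0.4, 0.6, 0.5, 0.3, 0.8]
total = foundation_loss(2.0, losses, lam=0.1)
print(total)  # 2.0 + 0.1 * (sum of the six head losses)
```

Because every head shares the same encoder, gradients from all six outcomes update a single set of weights, which is what makes the resulting model deployable across tasks.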

Predictors

We defaulted to XGBoost27 for predicting outcomes from the generated word embeddings or text representations, facilitating consistent comparisons between traditional word embeddings, pretrained LLMs, and their fine-tuned variants. This choice allows us to accommodate a diverse range of input types while leveraging XGBoost's widespread use in healthcare, owing to its robust performance in various clinical prediction tasks. Nonetheless, the choice of predictor remains flexible. This includes utilizing the task-specific fully-connected feed-forward neural network found in models that were fine-tuned with respect to the labels.

To examine whether the choice of predictor makes a noticeable difference in predictive performance, we experimented with various predictors among the models of the best-performing fine-tuning strategy. These include (1) the default XGBoost, (2) Random Forest, (3) Logistic Regression, and (4) the feed-forward network found in models that incorporate labels into the fine-tuning process. The range of hyperparameters used for each predictor can be found in Table 8.

Table 8 Details of cross-validated hyperparameters that were experimented when using the XGBoost, Logistic Regression, and Random Forest model

Evaluation metrics and validation strategies

For a rigorous evaluation, experiments were stratified into 5-folds for cross-validation48. A nested cross-validation approach was used to ensure a robust and fair comparison among the best combination of hyperparameters across all models and approaches spanning multiple folds.
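Stratified splitting preserves each outcome's class ratio within every fold, which matters here because several complications are rare. The following is a minimal pure-Python illustration of the idea (scikit-learn's `StratifiedKFold` would typically be used in practice); it is a sketch, not the study's pipeline, and it omits the inner hyperparameter-selection loop of nested cross-validation.

```python
# Minimal illustration (assumed, not the study's code) of stratified k-fold
# splitting: indices are grouped by class label and dealt round-robin into
# folds so that each fold preserves the overall class ratio.
from collections import defaultdict

def stratified_kfold(labels, k=5):
    """Return a list of k folds, each a sorted list of example indices."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return [sorted(f) for f in folds]

# Toy imbalanced labels: 2 positives per 10 examples, as with rare outcomes.
labels = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0] * 5
folds = stratified_kfold(labels, k=5)
print([sum(labels[i] for i in f) for f in folds])  # positives per fold
```

In the nested setting, each outer fold is held out for evaluation while the remaining folds are split again to select hyperparameters, so test data never influences model selection.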

The main evaluation metrics included the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC), providing a comprehensive evaluation of the models' overall prediction performance in the face of class imbalance. To contextualize our results, we fixed the specificity at 95% and reported the improvements in sensitivity across all the experimented strategies for the postoperative complication demonstrating the greatest gains compared to baseline traditional word embeddings. In our case, pneumonia showed the largest gains, while 30-day mortality was excluded as it is not classified as a postoperative surgical complication. These gains can be interpreted as the number of correctly identified high-risk patients (per 100 cases) that would have otherwise been missed by baseline models. This underscores the model's ability to detect previously overlooked cases across complications. In addition, for our best-performing foundation models, we reported accuracy, sensitivity, specificity, precision, and F-scores.
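Sensitivity at a fixed specificity is computed by choosing the score threshold at which at least 95% of negatives are correctly rejected, then measuring the fraction of positives above that threshold. The sketch below illustrates this with hypothetical scores; it assumes higher scores mean higher predicted risk and is not the study's evaluation code.

```python
# Hedged sketch of the reported metric: pick the threshold that achieves the
# target specificity on the negatives, then report sensitivity on positives.

def sensitivity_at_specificity(scores, labels, specificity=0.95):
    """Sensitivity at the threshold meeting the target specificity."""
    negatives = sorted(s for s, y in zip(scores, labels) if y == 0)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    # Threshold above which at most (1 - specificity) of negatives fall.
    cutoff_index = int(specificity * len(negatives))
    threshold = negatives[min(cutoff_index, len(negatives) - 1)]
    return sum(s > threshold for s in positives) / len(positives)

# Hypothetical scores: 100 negatives spread over [0, 1) plus 4 positives.
neg_scores = [i / 100 for i in range(100)]
pos_scores = [0.99, 0.97, 0.90, 0.50]
scores = neg_scores + pos_scores
labels = [0] * 100 + [1] * 4
print(sensitivity_at_specificity(scores, labels))
```

A gain of, say, 0.10 in this metric between two models means 10 additional true high-risk patients flagged per 100 positive cases at the same false-alarm budget, which is the interpretation given above.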

Qualitative evaluation of our foundation models

Beyond the quantitative results, we qualitatively evaluated our models for safety while ensuring their adaptation to the perioperative corpus. To this end, we ran broad case prompts through our best-performing foundation models using their objective loss functions. For the BERT-based models, this encompassed the masked language modeling (MLM) objective, where the model fills a single masked token in a 'fill-in-the-blank' format30. For the GPT models, this involved the language modeling objective, where the model 'completes the sentence' given an incomplete sentence31.

Incorporation of tabular-based features

To evaluate how predictive performance differs with the incorporation of tabular-based data and assess their contribution to predictive power, we utilized tabular features tied to the respective clinical notes of each patient. These tabular features include demographics (e.g., age, race, and sex), medical history (e.g., smoking status and heart failure), physiological measurements (e.g., blood pressure and heart rate), laboratory results (e.g., white blood cell count), and medications. These features were concatenated with the textual representations of the clinical notes obtained from each fine-tuned language model.

Tabular features with a missing rate greater than 50% were excluded, while missing values were imputed using the most frequent class for categorical features and median values for continuous features. XGBoost was then re-applied to the combined textual representations and encoded tabular features, using the same parameters as in our experiments without tabular data, as specified in Table 8.
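The filtering and imputation rules above can be sketched in a few lines. The column-wise dictionary layout, the column names, and the use of `None` for missing values are assumptions for illustration; the study's actual preprocessing code is not shown here.

```python
# Simplified sketch (assumptions: column-wise dicts, None marks missing) of
# the tabular preprocessing described above: drop features with >50% missing
# values, impute categoricals with the most frequent class and continuous
# features with the median.
from statistics import median, mode

def preprocess(columns, categorical):
    """columns: {name: list of values (None = missing)}; categorical: set of
    column names treated as categorical. Returns the imputed columns."""
    cleaned = {}
    for name, values in columns.items():
        missing = sum(v is None for v in values)
        if missing / len(values) > 0.5:
            continue  # feature exceeds the 50% missing-rate cutoff
        observed = [v for v in values if v is not None]
        fill = mode(observed) if name in categorical else median(observed)
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned

# Hypothetical columns for four patients.
cols = {
    "smoking": ["no", None, "no", "yes"],  # categorical -> mode imputation
    "heart_rate": [72, 80, None, 64],      # continuous -> median imputation
    "rare_lab": [None, None, None, 1.2],   # 75% missing -> dropped
}
out = preprocess(cols, categorical={"smoking"})
print(sorted(out))
```

The retained, imputed columns would then be encoded and concatenated with the note embeddings before being passed to XGBoost.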

Explainability of models

SHAP29, a technique that uses Shapley values from game theory to explain how each feature affects a prediction, was employed to explain the influence of distinct tokens in the clinical notes on the prediction. In this context, each feature corresponds to a token—typically representing a sub-word or symbol—present among the clinical texts. Consequently, each token is assigned a feature value based on its influence on the model’s prediction. We utilize the best-performing model and the corresponding tokenizer of each outcome to generate our SHAP visualizations among all six outcomes.

Comparison with NSQIP Surgical Risk Calculator

The ACS NSQIP Risk Calculator33, often regarded as the ‘gold standard’ for risk estimation, uses 20 distinct discrete variables to predict the risk of complications. These variables include procedural details, demographic information such as age and sex, chronic conditions or comorbidities like congestive heart failure and hypertension, and patient status indicators such as ASA class and functional status. The ACS NSQIP Risk Calculator is proprietary software developed and maintained by the American College of Surgeons.

To compare the performances of our models with the classifications obtained from the NSQIP Surgical Risk Calculator, we randomly selected 100 cases from the test data of the first fold, stratified by the occurrence of serious postoperative complications. Each patient's discrete features were manually entered into the ACS NSQIP Risk Calculator, and the predicted outcomes were compared with those generated by our best-performing foundation model, the BioGPT variant. For this comparison, a prediction of a "higher than average" risk of serious complications by the NSQIP calculator was considered a high-risk patient (i.e., a positive case). Since the NSQIP Risk Calculator incorporates procedural information as discrete variables, we ensured a fair comparison by evaluating our model's performance both with clinical notes alone and with clinical notes combined with tabular features.

Replication of methods on MIMIC-III

To ensure that our methods generalize beyond perioperative care and the BJH dataset, we replicated them on the publicly available MIMIC-III dataset35,36, which contains de-identified reports and clinical notes of critical care patients at the Beth Israel Deaconess Medical Center between 2001 and 2012. The notable difference arises from MIMIC-III's primary focus on critical care cases, including emergency operations across various specialties, whereas our dataset encompasses a broader range of both emergency and elective surgical cases, with Orthopedics, Ophthalmology, and Urology among the most represented specialties. To closely mimic the approach and settings employed for BJH's clinical notes, we utilized the long-form descriptive texts of procedural codes in MIMIC-III. Specifically, ICD-10 codes containing procedural information for each patient were traced and formatted into their respective long-form titles. For each patient, these long-form titles were then combined to form a single-sentence clinical note, thereby aligning with the textual characteristics of BJH's clinical notes. The selected outcomes were length-of-stay (LOS), in-hospital mortality, 12-h discharge status, and death in 30 days, as adapted from previous studies1,35,36,60.
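The note-construction step described above amounts to a lookup-and-join over each patient's procedure codes. The sketch below illustrates the idea; the code-to-title mapping, the codes shown, and the semicolon joining convention are hypothetical, not taken from the study or from the actual ICD long-title tables.

```python
# Hypothetical sketch of the MIMIC-III note construction described above:
# each patient's procedural codes are mapped to long-form titles and joined
# into one single-sentence note, mimicking the BJH note format.

def build_note(procedure_codes, long_titles):
    """Join the long-form titles of a patient's procedure codes into one note."""
    titles = [long_titles[c] for c in procedure_codes if c in long_titles]
    return "; ".join(titles) + "." if titles else ""

# Toy code-to-title lookup (illustrative values, not real ICD long titles).
long_titles = {
    "0FT44ZZ": "Resection of gallbladder, percutaneous endoscopic approach",
    "0SRC0J9": "Replacement of right knee joint with synthetic substitute",
}
note = build_note(["0FT44ZZ", "0SRC0J9"], long_titles)
print(note)
```

The resulting string plays the same role as a BJH preoperative note: a single sentence enumerating planned procedures, suitable for the same embedding and fine-tuning pipeline.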

Computing resources

A single RTX A6000 with 40 GB of memory was used to fine-tune all the models. The memory required ranged from 8 GB for bioClinicalBERT and clinicalBERT with self-supervised fine-tuning to 37 GB for bioGPT with foundation fine-tuning. As bioClinicalBERT and clinicalBERT contain 110 million parameters and bioGPT contains 347 million parameters, the ranking of computing resources required to fine-tune these models is bioClinicalBERT ≈ clinicalBERT < bioGPT. Similarly, the self-supervised models were fine-tuned on a single objective and could accommodate smaller batch sizes, therefore requiring fewer resources than their supervised counterparts. The foundation model naturally required the most resources, as it was fine-tuned on multiple outcomes, with each outcome represented by a separate feed-forward head, thereby requiring larger batch sizes to ensure each batch fairly represented all outcomes. Therefore, methods-wise, the ranking of computing resources required to fine-tune these models was self-supervised < semi-supervised < foundation.

Ethics consideration

For the BJH dataset, the internal review board of Washington University School of Medicine in St. Louis (IRB #201903026) approved the study with a waiver of patient consent. The establishment of the MIMIC-III database was approved by the Massachusetts Institute of Technology and Beth Israel Deaconess Medical Center, and informed consent was obtained for the use of data. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).
