Natural language processing of admission notes to predict severe maternal morbidity during the delivery encounter





Background


Severe maternal morbidity and mortality remain public health priorities in the United States, given their high rates relative to other high-income countries and the notable racial and ethnic disparities that exist. In general, accurate risk stratification methods are needed to help patients, providers, hospitals, and health systems plan for and potentially avert adverse outcomes.


Objective


Our objective was to determine whether machine learning methods with natural language processing of history and physical notes could identify a group of patients at high risk of maternal morbidity on admission for delivery without relying on any additional patient information (eg, demographics and diagnosis codes).


Study Design


This was a retrospective study of people admitted for delivery at 2 hospitals (hospitals A and B) in a single healthcare system between July 1, 2016, and June 30, 2020. The primary outcome was severe maternal morbidity, as defined by the Centers for Disease Control and Prevention; furthermore, we examined nontransfusion severe maternal morbidity. Clinician documents designated as history and physical notes were extracted from the electronic health record for processing and analysis. A bag-of-words approach was used for this natural language processing analysis (ie, each history and physical note was converted into a matrix of counts of the individual words [or phrases] that occurred within the document). Least absolute shrinkage and selection operator models were used to generate prediction probabilities for severe maternal morbidity and nontransfusion severe maternal morbidity for each note. Model discrimination was assessed via the area under the receiver operating curve. Discrimination was compared between models using the DeLong test. Calibration plots were generated to assess model calibration. Moreover, the natural language processing models using the history and physical note text were compared with a validated obstetrical comorbidity risk score based on diagnosis codes.


Results


There were 13,572 delivery encounters with history and physical notes from hospital A, split between training (A train, n=10,250) and testing (A test, n=3,322) datasets for model derivation and internal validation. There were 23,397 delivery encounters with history and physical notes from hospital B (B valid) used for external validation. For the outcome of severe maternal morbidity, the natural language processing model had an area under the receiver operating curve of 0.67 (95% confidence interval, 0.63–0.72) and 0.72 (95% confidence interval, 0.70–0.74) in the A test and B valid datasets, respectively. For the outcome of nontransfusion severe maternal morbidity, the area under the receiver operating curve was 0.72 (95% confidence interval, 0.65–0.80) and 0.76 (95% confidence interval, 0.73–0.79) in the A test and B valid datasets, respectively. The calibration plots demonstrated the bag-of-words model’s ability to distinguish a group of individuals at a substantially higher risk of severe maternal morbidity and nontransfusion severe maternal morbidity, notably those in the top deciles of predicted risk. Areas under the receiver operating curve in the natural language processing–based models were similar to those generated using a validated, retrospectively derived, diagnosis code–based comorbidity score.


Conclusion


In this practical application of machine learning, we demonstrated the capabilities of natural language processing for the prediction of severe maternal morbidity based on provider documentation inherently generated at the time of admission. This work should serve as a catalyst for providers, hospitals, and electronic health record systems to explore ways that artificial intelligence can be incorporated into clinical practice and rigorously evaluated for its ability to improve health.


Introduction


Severe maternal morbidity (SMM) and mortality remain public health priorities in the United States, given their high rates relative to other developed countries and the notable racial and ethnic disparities that exist. The levels of maternal care designation system has been proposed as a method for improving maternal outcomes by directing the highest-risk patients to facilities with the appropriate resources. Other approaches for those at high risk, such as consultations or predelivery planning and communication, could also be employed to combat SMM. In general, accurate risk stratification methods are needed to help patients, providers, hospitals, and health systems plan for and potentially avert adverse outcomes.



AJOG at a Glance


Why was this study conducted?


This study was conducted to determine whether machine learning methods with natural language processing (NLP) of history and physical notes could identify a group of patients at high risk of severe maternal morbidity (SMM) on admission for delivery without relying on any additional patient information.


Key findings


In this analysis involving both internal and external validation datasets, we demonstrated the ability of NLP to assist in maternal morbidity risk stratification. A bag-of-words modeling approach using a vocabulary of single words with minimal preprocessing resulted in areas under the curve of >0.70 for SMM in the external validation dataset.


What does this add to what is known?


This work provides early evidence of the potential for machine learning methods and NLP to advance our ability to identify patients at high risk of complications.



The availability of electronic health record (EHR) data and the application of machine learning present an opportunity to advance maternal risk stratification. Several groups have demonstrated the value of intrapartum EHR data in modifying or updating clinical risk. Our objective was to determine whether machine learning methods with natural language processing (NLP) could identify a group of patients at high risk on admission for delivery without relying on any additional patient information (eg, demographics and diagnosis codes). NLP, the use of software to extract informative signals from human-authored free text such as clinical documentation, has improved a wide range of clinical prediction tasks. Unlike many other machine learning methods and some risk stratification scores, NLP can be applied to inherently generated clinical documentation; it does not require the creation of any additional labels or structured data elements, which demand extra steps and resources and can introduce errors. We hypothesized that text processing of history and physical (H&P) notes could be successfully used for maternal risk stratification.


Methods and Materials


This was a retrospective study of people admitted for delivery at 2 hospitals in a single healthcare system between July 1, 2016, and June 30, 2020. Both hospitals used the same EHR system (Epic EHR software; Epic Systems Corporation, Verona, WI), and both upload their EHR data to the system’s Research Patient Data Registry (RPDR), which is available for use for clinical research. This registry was used to obtain clinical notes, encounter diagnosis codes, and demographic data. All encounters with a delivery were included in the analysis.


The primary outcome was SMM, as defined by the Centers for Disease Control and Prevention (CDC). This International Classification of Diseases, Tenth Revision (ICD-10), diagnosis and procedure code definition of SMM includes such conditions as acute myocardial infarction, stroke, and peripartum hysterectomy. The full list of conditions and corresponding ICD-10 codes is maintained and published on the CDC website. As a secondary outcome, we also examined SMM excluding transfusion (nontransfusion SMM [nt-SMM]), as the SMM composite outcome is known to be heavily weighted by blood transfusion.


Clinician documents designated in the EHR as H&P notes were extracted from the RPDR for processing and analysis. Notes were verified to be associated with the delivery encounter by ensuring that the note occurred within the dates of the delivery encounter. A bag-of-words (BOW) model was used for this NLP analysis. In a BOW model, the clinical document is converted into the counts of the individual words (or phrases) that occur within it. Ultimately, a document-term matrix is constructed in which the rows represent individual documents, each column represents a unique word from the vocabulary (the set of words occurring in any note within the document dataset), and the cells are counts of each word’s occurrences in a single document. The BOW model was developed after standard preprocessing of the raw clinical text, which included white space normalization, stop word removal using the ‘SMART’ dictionary list in the ‘stopwords’ package in R, and elimination of machine-generated templated text (eg, document headers, such as “Physical exam”). The BOW model included unigrams, or single words. To limit the sparseness of the BOW model and improve analytical tractability, only terms that passed frequency thresholds were considered. Frequency thresholds were set a priori at ≥5% and ≤80% based on experience; that is, only words that occurred in ≥5% but ≤80% of notes were included.
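
As a concrete illustration of this pipeline (the study does not publish its code), the document-term matrix could be assembled in R roughly as follows. This is a minimal sketch: the text2vec package and the variable name `notes` are assumptions for illustration, not the authors’ implementation; only the SMART stop word list and the 5%/80% thresholds come from the text.

```r
library(text2vec)
library(stopwords)

# Tokenize each H&P note: lowercase, then split into words
# (`notes` is a hypothetical character vector, one note per element)
it <- itoken(notes, preprocessor = tolower, tokenizer = word_tokenizer)

# Vocabulary of unigrams, with SMART stop words removed
vocab <- create_vocabulary(it, stopwords = stopwords(source = "smart"))

# A priori frequency thresholds: keep words in >=5% and <=80% of notes
vocab <- prune_vocabulary(vocab,
                          doc_proportion_min = 0.05,
                          doc_proportion_max = 0.80)

# Document-term matrix: rows = notes, columns = retained words, cells = counts
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)
```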


To facilitate model development and testing, we divided the notes from hospital A into a training dataset (75%, “A train”) and a testing dataset (25%, “A test”) (Figure 1). The dictionary for the BOW model was generated from the A train dataset of H&P notes. We then used least absolute shrinkage and selection operator (LASSO) models to determine the relationship between the occurrence counts of individual words (using the dictionaries) and each outcome. Model discrimination was assessed via the area under the receiver operating curve (AUC). The AUC reflects the ability of a model to predict an outcome and ranges from 0.5 (no better than chance) to 1.0 (perfectly predictive). In general, the discrimination of a test can be classified as “excellent,” “good,” “fair,” or “poor” based on AUC ranges of 0.90 to 0.99, 0.80 to 0.89, 0.70 to 0.79, and <0.70, respectively. Calibration plots were generated to assess model calibration.
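
A minimal sketch of this modeling step, assuming document-term matrices `dtm_train` and `dtm_test` from a split such as the one above and 0/1 outcome vectors `y_train` and `y_test` (all names hypothetical); glmnet and pROC are illustrative package choices, not necessarily those used in the study:

```r
library(glmnet)
library(pROC)

# L1-penalized (LASSO) logistic regression; lambda chosen by cross-validation
fit <- cv.glmnet(dtm_train, y_train, family = "binomial", alpha = 1)

# Predicted probability of the outcome for each held-out test note
p_test <- as.numeric(predict(fit, newx = dtm_test,
                             type = "response", s = "lambda.min"))

# Discrimination: AUC with a 95% confidence interval
roc_test <- roc(y_test, p_test)
ci.auc(roc_test)
```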




Figure 1


Overview of analysis


Clapp et al. Natural language processing for maternal risk stratification. Am J Obstet Gynecol 2022.


As a sensitivity analysis for the primary outcome, we compared the AUCs from the primary model with those from a model that also used maternal demographic information, namely maternal age, self-reported race and ethnicity (White, Black, Hispanic, Asian, and Other), and self-reported primary language spoken (English or non-English). This demographic-enriched model was used to determine whether performance varied after accounting for structured information commonly known at the time of admission. Furthermore, we compared model performance using dictionaries constructed of 2- and 3-word phrases with the primary unigram model to determine whether using combinations of words significantly improved performance. Discrimination between models was compared using the DeLong test.
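
In R, a paired DeLong comparison of this kind can be run with pROC, for example; `p_text` and `p_demo` are hypothetical names for the two models’ predicted probabilities on the same test notes:

```r
library(pROC)

# ROC curves for the text-only and demographic-enriched models
roc_text <- roc(y_test, p_text)
roc_demo <- roc(y_test, p_demo)

# Paired DeLong test: do the two correlated AUCs differ?
roc.test(roc_text, roc_demo, method = "delong", paired = TRUE)
```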


To understand how the NLP models using the H&P note text compared with a validated obstetrical comorbidity score, we compared AUCs between the NLP models and the expanded obstetrical comorbidity score (EOCS) described by Leonard et al. The EOCS is calculated for each patient by summing the derived weights for approximately 30 specific ICD-10 diagnosis codes that may occur during a delivery encounter (ie, diagnoses with strong associations with morbidity have high “point” values, and higher comorbidity scores equate to a higher risk of morbidity). EOCSs were generated for SMM and nt-SMM per the corresponding diagnosis code weights in the original published article. AUCs for the NLP model and the EOCS models were compared using the DeLong test.
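
Computationally, a score of this form reduces to a weighted sum over the codes recorded for an encounter. The sketch below illustrates the idea in R with placeholder codes and weights; these are not the published values from Leonard et al:

```r
# Placeholder weights, NOT the published EOCS values
eocs_weights <- c("I21.3" = 5, "O99.42" = 3, "O36.4" = 2)

# Score an encounter by summing weights over its recorded ICD-10 codes;
# codes without a published weight contribute nothing
score_encounter <- function(icd10_codes, weights = eocs_weights) {
  sum(weights[icd10_codes], na.rm = TRUE)
}

score_encounter(c("I21.3", "O36.4", "Z37.0"))  # 5 + 2 + 0 = 7
```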


For external validation, we applied the model and its vocabulary weights developed in the hospital A training cohort to the new dataset (hospital B or “B valid”) for SMM and nt-SMM. Furthermore, we compared the discrimination between the NLP model and the EOCS model in the B valid dataset.
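
In code, external validation amounts to freezing the vectorizer and model from the training sketches above and applying them unchanged to hospital B; `notes_b` and `y_b` are hypothetical names for the hospital B notes and outcomes:

```r
# Hospital B notes pass through the SAME vectorizer fitted on A train,
# so the columns line up with the trained LASSO coefficients
it_b  <- itoken(notes_b, preprocessor = tolower, tokenizer = word_tokenizer)
dtm_b <- create_dtm(it_b, vectorizer)

# Apply the frozen model and assess discrimination externally
p_b <- as.numeric(predict(fit, newx = dtm_b, type = "response", s = "lambda.min"))
ci.auc(roc(y_b, p_b))
```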


The following patient characteristics were compared between the A train and A test groups, to verify random allocation, and between the A test and B valid groups, to examine potential differences between hospital sites: age, self-reported race and ethnicity, primary language, comorbidity scores for SMM and nt-SMM, and rates of SMM and nt-SMM. Chi-squared tests were used for categorical variable comparisons. A 2-sided t test and the Wilcoxon rank-sum test were used for age and comorbidity score comparisons, respectively.
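
These are all standard base R tests. Assuming data frames `a_test` and `b_valid` holding the listed variables (column names hypothetical), the comparisons could look like:

```r
# Cohort indicator for the pooled samples
cohort <- rep(c("A_test", "B_valid"), c(nrow(a_test), nrow(b_valid)))

# Categorical characteristics (eg, race and ethnicity): chi-squared test
chisq.test(table(c(a_test$race_eth, b_valid$race_eth), cohort))

# Age: t test (2-sided by default in R)
t.test(a_test$age, b_valid$age)

# Comorbidity score: Wilcoxon rank-sum test
wilcox.test(a_test$smm_score, b_valid$smm_score)
```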


The analysis was conducted using R (R Foundation for Statistical Computing, Vienna, Austria) and Stata 16.1 MP (StataCorp, College Station, TX). The Mass General Brigham Institutional Review Board reviewed and approved this study. P values of <.05 were considered statistically significant.


Results


There were 13,572 delivery encounters with H&P notes from hospital A, split between the A train (n=10,250) and A test (n=3,322) datasets for model derivation and internal validation, respectively. There were 23,397 delivery encounters with H&P notes from hospital B (B valid), which were used for external validation. Figure 1 shows a schematic of how the cohorts and analysis were constructed. The Table shows the characteristics of patients in each cohort: A train, A test, and B valid. The A train and A test cohorts did not differ; however, the internal validation cohort (A test) and the external validation cohort (B valid) differed significantly in the following characteristics: maternal age was slightly lower in the B valid dataset (mean, 32.1 vs 32.5 years; P<.001); the racial and ethnic distribution differed, with more Black (13.8% vs 7.9%) and Hispanic (5.8% vs 4.6%) individuals in the B valid dataset; there were more primarily English-speaking patients (90.4% vs 86.0%; P<.001) in the B valid dataset; and comorbidity scores (median, 7 vs 11; P<.001) and rates of SMM (3.2% vs 4.2%) were lower in the B valid dataset.



Table

Sample characteristics between the different datasets used in the model development and validation analyses

| Sample characteristics | Hospital A training data: A train (n=10,250) | Hospital A testing data: A test (n=3322) | Hospital B validation data: B valid (n=23,397) | P value a |
|---|---|---|---|---|
| Age (y) | 32.5 (5.0) | 32.5 (5.0) | 32.1 (5.2) | <.001 |
| Demographics | | | | |
| Race and ethnicity | | | | <.001 |
| White | 6344 (61.9) | 2026 (61.0) | 13,434 (57.4) | |
| Asian | 1229 (12.0) | 409 (12.3) | 2541 (10.9) | |
| Black | 749 (7.3) | 263 (7.9) | 3226 (13.8) | |
| Hispanic | 480 (4.7) | 154 (4.6) | 1358 (5.8) | |
| Other | 1216 (11.9) | 393 (11.8) | 2315 (9.9) | |
| Missing | 232 (2.3) | 77 (2.3) | 523 (2.2) | |
| Primary language | | | | <.001 |
| English | 8803 (85.9) | 2858 (86.0) | 21,151 (90.4) | |
| Non-English | 1447 (14.1) | 464 (14.0) | 2246 (9.6) | |
| Comorbidity score b | | | | |
| For SMM | 11 (0–23) | 11 (0–26) | 7 (0–21) | <.001 |
| For nt-SMM | 7 (0–13) | 7 (0–14) | 4 (0–12) | <.001 |
| SMM | | | | |
| Including transfusion | 453 (4.4) | 141 (4.2) | 751 (3.2) | .002 |
| Excluding transfusion | 184 (1.8) | 54 (1.6) | 353 (1.5) | .61 |
