Background
With an estimated incidence of 2% to 5%, preeclampsia poses a substantial burden on pregnant women. Preeclampsia increases the maternal risk of death 20-fold and is one of the main causes of perinatal morbidity and mortality. Novel biomarkers, such as soluble fms-like tyrosine kinase-1 and placental growth factor, in addition to a wide span of conventional clinical data (medical history, physical symptoms, laboratory parameters, etc.), present an excellent basis for the application of early-detection machine-learning models.
Objective
This study aimed to develop, train, and test an automated machine-learning model for the prediction of adverse outcomes in patients with suspected preeclampsia.
Study Design
Our real-world dataset of 1647 women (2472 samples) was retrospectively recruited from women who presented to the Department of Obstetrics at the Charité – Universitätsmedizin Berlin, Berlin, Germany, between July 2010 and March 2019. After standardization and data cleaning, we calculated additional features regarding the biomarkers soluble fms-like tyrosine kinase-1 and placental growth factor and sonography data (umbilical artery pulsatility index, middle cerebral artery pulsatility index, mean uterine artery pulsatility index), resulting in a total of 114 features. The target metric was the occurrence of adverse outcomes throughout the remaining pregnancy and 2 weeks after delivery. We trained 2 different models, a gradient-boosted tree and a random forest classifier. Hyperparameter training was performed using a grid search approach. All results were evaluated via a 10 × 10-fold cross-validation regimen.
Results
We obtained metrics for the 2 naive machine-learning models. The gradient-boosted tree model performed with a positive predictive value of 88%±6%, a negative predictive value of 89%±3%, a sensitivity of 66%±5%, a specificity of 97%±2%, an overall accuracy of 89%±3%, an area under the receiver operating characteristic curve of 0.82±0.03, an F1 score of 0.76±0.04, and a threat score of 0.61±0.05. The random forest classifier returned an equal positive predictive value (88%±6%) and specificity (97%±1%) while performing slightly inferior on the other available metrics. Applying differential cutoffs instead of a naive cutoff of ≥0.5 for positive prediction yielded additional increases in performance.
Conclusion
Machine-learning techniques proved a valid approach to improving the prediction of adverse outcomes in pregnant women at high risk of preeclampsia compared with current clinical standard techniques. Furthermore, we presented an automated system that did not rely on manual tuning or adjustments.
Introduction
With an incidence of 2% to 5% and severe associated maternal and fetal complications, preeclampsia poses a global burden. Women with preeclampsia have a 20 times higher maternal mortality rate, and hypertensive pregnancy disorders account for 16% of maternal deaths in industrialized countries. Current diagnosis relies on a combination of clinical parameters and the treating physicians’ evaluation, achieving a positive predictive value (PPV) of approximately 20% for adverse outcomes (AOs).
Why was this study conducted?
There is an unmet medical need for a reliable way of predicting adverse outcomes (AOs) in preeclampsia. Machine-learning (ML) techniques can contribute to improving the prediction of AOs in patients with suspected preeclampsia.
Key findings
We showed that ML classifiers (gradient-boosted tree and random forest) can reliably predict AOs, with a higher positive predictive value, sensitivity, and area under the curve than conventional methods.
What does this add to what is known?
This work further substantiates the potential of ML as part of the clinical decision-making process.
This is further complicated by difficulties in estimating the severity of the disease, which often lead to increased hospitalization and high costs for the healthcare system; both could be greatly reduced by more precise prediction of AOs.
Biomarkers, such as soluble fms-like tyrosine kinase-1 (sFlt-1) and placental growth factor (PlGF) for the prediction of preeclampsia in high-risk women, or PlGF and pregnancy-associated plasma protein A for first-trimester preeclampsia screening, have shown great promise for the development of diagnostic tools for women at risk of AOs, with additional biomarkers (eg, N-terminal prohormone of brain natriuretic peptide) undergoing research. Recently, real-world data analyses have confirmed the high predictive accuracy reported by prospective cohort studies and randomized trials.
Including the angiogenic and antiangiogenic biomarkers at fixed cutoffs in the diagnostic workup for preeclampsia leads to a negative predictive value (NPV) of 99.3%. Thus, the current recommendation is to use biomarkers as a rule-out test in women presenting at high risk of preeclampsia. However, when used for detecting preeclampsia-related AOs, the NPV and PPV of these novel biomarkers as standalone measurements remain unsatisfyingly low.
We and others have shown that integrating several different parameters derived from a woman’s medical history (family predisposition, preeclampsia or related diseases in earlier pregnancies, etc.), current condition (body mass index, age, systolic and diastolic blood pressures, etc.), and laboratory variables (platelet count, alanine aminotransferase, aspartate aminotransferase, maternal sodium levels, sFlt-1, PlGF, etc.) into a multifactorial analysis improves diagnostics and enables doctors to more accurately assess a woman’s risk of preeclampsia-associated diseases. This setting suits the deployment of a machine-learning (ML) solution perfectly, as these methods can handle high numbers of features and set them in context, allowing us to uncover traits and signals that are hidden from conventional analysis.
Here, we hypothesized that an ML approach applied to real-world patient data would prove superior to current clinical predictive approaches, yield a prediction of AOs as accurate as that of “traditional” statistical models, and require substantially less overhead in terms of manual tuning.
A glossary for machine-learning terms ( Table 10 ) and a table of abbreviations ( Table 11 ) can be found in the Supplementary Appendix .
Materials and Methods
Study population
The real-world study population was retrospectively recruited from pregnant women presenting with a clinical suspicion of preeclampsia to the obstetrics department, Charité – Universitätsmedizin Berlin, Berlin, Germany, between July 2010 and March 2019. Most of the dataset was previously used in Dröge et al and has been updated and expanded with additional patients (525 additional patients and 837 additional samples).
We included all women aged ≥18 years, with a singleton pregnancy, and a gestational age of ≥20 weeks with signs and symptoms of preeclampsia, available laboratory values for the sFlt-1–to–PlGF ratio, and complete outcome data. “Signs and symptoms” of preeclampsia were defined as elevated blood pressure (systolic blood pressure of ≥140 mm Hg and diastolic blood pressure of ≥90 mm Hg) or preexisting hypertension without concurrent proteinuria; proteinuria determined by urine dipstick reading >2+ in 2 separate tests at least 6 hours apart or ≥300 mg/L in 24-hour urine sampling without concurrent hypertension; at least 1 preeclampsia-related symptom (headache, visual disturbances, progressive edema and/or excessive weight gain, or upper epigastric pain); abnormalities in specific laboratory values (isolated low platelet count or elevated liver enzymes); fetal abnormalities, such as intrauterine growth restriction (IUGR; <10th percentile) or presenting as small for gestational age (SGA) in sonography; and abnormal values in Doppler sonography of the uterine artery, umbilical artery, or fetal middle cerebral artery.
Per-patient data were organized into units of individual days. Measurements returned from the laboratory on the same day were consolidated into a single entry for that patient; when an identical laboratory test was reported twice on the same day, only the most recently performed analysis was used. In case of multiple blood pressure measurements at the same visit, we calculated a mean blood pressure.
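The same-day consolidation rules above can be sketched with pandas; the column names and values below are illustrative, not from the study data:

```python
# Hypothetical sketch of per-day consolidation: same-day repeats of an
# identical laboratory test keep only the most recent value, and repeated
# blood-pressure readings at one visit are averaged.
import pandas as pd

labs = pd.DataFrame({
    "patient": [1, 1, 1],
    "date": ["2018-05-02", "2018-05-02", "2018-05-03"],
    "test": ["ALT", "ALT", "ALT"],
    "time": ["08:00", "14:00", "09:00"],
    "value": [30.0, 34.0, 28.0],
})

# Keep only the most recently performed analysis per patient, day, and test.
labs_daily = (labs.sort_values("time")
                  .groupby(["patient", "date", "test"], as_index=False)
                  .last())

bp = pd.DataFrame({
    "patient": [1, 1],
    "date": ["2018-05-02", "2018-05-02"],
    "systolic": [142.0, 138.0],
})
# Multiple measurements at the same visit -> mean blood pressure.
bp_daily = bp.groupby(["patient", "date"], as_index=False).mean()

print(labs_daily[["date", "value"]].to_dict("list"))
print(bp_daily["systolic"].iloc[0])  # 140.0
```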
On average, patients had 1.5±0.91 day entries in our database, with a median of 1.
Data values that could vary in a clinically relevant manner from visit to visit (eg, blood pressure, liver enzymes, symptoms related to preeclampsia) were not carried forward between per-day entries and were thus marked as absent or missing.
Entries from the patient history, such as date of birth, height, and medical history, were assumed to be constant and were carried forward to the new per-patient day entries. Finally, in marking up the AOs, if a patient subsequently (in the given pregnancy) suffered an AO, all visits (day entries) were considered potentially indicative of this subsequent AO and were labeled as such.
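A minimal pandas sketch of the two rules above, with hypothetical column names: static history fields are carried forward to every per-day entry, while every visit of a pregnancy that later had an AO is labeled positive.

```python
# Illustrative sketch (assumed column names) of carry-forward and AO labeling.
import pandas as pd

visits = pd.DataFrame({
    "patient": [1, 1, 2],
    "day": [1, 2, 1],
    "height_cm": [165.0, None, 170.0],   # recorded once, assumed constant
    "bp_sys": [150.0, None, 120.0],      # visit-dependent: NOT carried forward
})

# Carry forward only features assumed constant within a pregnancy.
visits["height_cm"] = visits.groupby("patient")["height_cm"].ffill()

# Any pregnancy with a subsequent AO marks all of its visits as positive.
ao_by_pregnancy = {1: True, 2: False}
visits["label_ao"] = visits["patient"].map(ao_by_pregnancy)

print(visits.to_string(index=False))
```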
In the case of a patient presenting with >1 pregnancy within the time frame of our investigation, we treated each pregnancy as that of a patient presenting without previous knowledge, except for adapting her medical history and biographic information. Repeated admissions that did not lead to an AO were treated as separate visits. All women presented without a manifest clinical diagnosis of preeclampsia.
Study outcome
The aim of this study was the assessment of automated ML algorithms as predictive tools in the diagnosis of preeclampsia-associated AOs.
This study’s main target variable was the occurrence of a composite fetal or maternal AO at any time after a visit, up until 14 days after delivery of the neonate, if not specified otherwise. Maternal AOs were cerebral hemorrhage; disseminated intravascular coagulopathy (DIC); pulmonary edema; renal failure; hemolysis, elevated liver enzymes, and low platelet count (HELLP syndrome); eclampsia; and death, whereas AOs in the child were defined as IUGR, SGA, premature delivery because of preeclampsia at ≤34 weeks of gestation, respiratory distress syndrome (RDS), placental abruption, intraventricular hemorrhage (IVH), necrotizing enterocolitis (NEC), or fetal death.
Feature processing
We focused on analyzing routinely available clinical features derived from the current standard of care. These were divided into the broad categories of patient history, laboratory values, ultrasound findings, symptoms at presentation, and previous diagnoses, resulting in a total of 114 features ( Appendix ). Additional values considered were the assessments of sFlt-1 and PlGF and their ratio. sFlt-1 and PlGF were determined using the Elecsys immunoassays for sFlt-1 and PlGF (Roche Diagnostics GmbH, Mannheim, Germany).
To prepare the dataset for ML, we standardized laboratory values measured on different systems to standard units for each parameter. Moreover, we replaced each categorical variable with indicator flags. We analyzed patients on a per-visit basis, maintaining biographic variables, such as medical history and diagnoses occurring before the current pregnancy, across all subsequent visits. We did not perform any imputation; all missing data entries were marked with a consistent value that did not occur within the dataset and could be detected by the algorithms and treated accordingly.
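These preprocessing steps can be sketched as follows; the sentinel value and column names are assumptions for illustration, not taken from the study:

```python
# Sketch of the preprocessing: categorical variables become indicator flags,
# and missing entries get a sentinel value outside the observed range
# (no imputation), which tree learners can detect and treat accordingly.
import pandas as pd

df = pd.DataFrame({
    "proteinuria_dipstick": ["++", None, "+++"],
    "alt_iu_l": [29.0, None, 46.0],
})

# Categorical -> indicator flags (one column per level).
df = pd.get_dummies(df, columns=["proteinuria_dipstick"], dtype=int)

# Missing numeric entries -> sentinel detectable by the algorithms.
SENTINEL = -999.0  # assumed: chosen to not occur within the dataset
df["alt_iu_l"] = df["alt_iu_l"].fillna(SENTINEL)

print(sorted(df.columns))
```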
Furthermore, we classified the sFlt-1 and PlGF measurements and their ratio into reference ranges along known distributions and added these as separate features to the datasets. We did the same for both the umbilical and middle cerebral artery pulsatility indices.
Moreover, we transformed each feature into an additional metric expressing them as a multiple of median (MoM) based on the median obtained from our given dataset.
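The MoM transform amounts to dividing each value by the in-dataset median of that feature, e.g.:

```python
# Multiple-of-median (MoM) transform, as described above.
import numpy as np

def to_mom(values):
    """Express each value as a multiple of the in-dataset median."""
    values = np.asarray(values, dtype=float)
    return values / np.nanmedian(values)

sflt1 = [2000.0, 4000.0, 8000.0]
print(to_mom(sflt1))  # multiples of the median 4000: 0.5, 1.0, 2.0
```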
Dataset splitting
We split the dataset using a train-test approach, randomly sampling at the patient level so that each woman’s entire record moved into exactly one of the splits.
Because of limitations regarding the size of our dataset, we chose a 90% to 10% train-test split.
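A patient-level split of this kind can be sketched with scikit-learn's `GroupShuffleSplit`; the data below are synthetic, with the study's 90/10 proportions:

```python
# Patient-level 90/10 split: sampling is by patient (group), so all samples
# from one woman land in the same split and no patient leaks across the
# train/test boundary. Group IDs and features are illustrative.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
patients = np.repeat(np.arange(50), 2)  # two samples per patient

splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patients))

overlap = set(patients[train_idx]) & set(patients[test_idx])
print(len(test_idx), len(overlap))  # 10 test samples, 0 shared patients
```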
The different classifiers’ performance was compared using the “corrected repeated k-fold cross-validation test” proposed by Bouckaert and Frank.
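The corrected test inflates the variance of the per-fold score differences by the test-to-train size ratio before forming a t statistic; a sketch with synthetic score differences (the 10×10-fold setting gives 100 paired differences with a 90/10 train-test split per fold):

```python
# Sketch of the corrected repeated k-fold cross-validation test
# (Bouckaert & Frank, building on the Nadeau-Bengio variance correction).
import numpy as np
from scipy import stats

def corrected_cv_ttest(diffs, n_train, n_test):
    """t statistic and two-sided p value for per-fold score differences."""
    diffs = np.asarray(diffs, dtype=float)
    n = len(diffs)
    # Variance correction: 1/n is replaced by 1/n + n_test/n_train.
    var_corrected = np.var(diffs, ddof=1) * (1.0 / n + n_test / n_train)
    t = diffs.mean() / np.sqrt(var_corrected)
    p = 2.0 * stats.t.sf(abs(t), df=n - 1)
    return t, p

rng = np.random.default_rng(0)
diffs = rng.normal(loc=0.02, scale=0.03, size=100)  # synthetic ROCAUC deltas
t, p = corrected_cv_ttest(diffs, n_train=90, n_test=10)
print(round(t, 2), round(p, 4))
```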
Algorithms
Thoroughly following best practice guidelines for the development of ML systems, we chose 2 different ML algorithms, a random forest (RF) classifier and a gradient-boosted tree (GBTree), for the examination of our dataset and focused on methods based on decision tree learning using classification trees. The Appendix and a recent article on medical artificial intelligence (AI) best practices provide a deeper explanation of the algorithms and the developmental practices followed throughout. These methods provide some limited interpretability of the algorithms’ decision-making and allow clinicians to intuitively understand key aspects of the decision process. We used Shapley values for both methods to provide interpretable metrics for the models’ decision-making.
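For intuition, Shapley values can be computed exactly on a toy model by enumerating feature coalitions; the study's tree models would use an efficient tree-specific implementation (eg, the `shap` package), so this brute-force version and its toy "risk score" are purely illustrative:

```python
# Exact Shapley values by coalition enumeration (exponential in the number
# of features; fine for a 2-feature toy, not for the study's 114 features).
from itertools import combinations
from math import factorial

def shapley(model, baseline, x):
    """Exact Shapley value of each feature for prediction model(x)."""
    n = len(x)
    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for s in combinations(others, size):
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += w * (value(set(s) | {i}) - value(set(s)))
    return phi

# Toy "risk score": weighted sum of two hypothetical features.
model = lambda z: 0.7 * z[0] + 0.3 * z[1]
phi = shapley(model, baseline=[0.0, 0.0], x=[1.0, 1.0])
print(phi)  # contributions sum to model(x) - model(baseline)
```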
Throughout hyperparameter tuning and training, we did not manually intervene or tune parameters in any way. Further information regarding the model creation can be found in the Appendix ( Supplemental Table 7 ).
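The automated tuning and evaluation regimen can be sketched with scikit-learn; the parameter grids and data below are illustrative, and the study's 10×10-fold cross-validation is reduced to 2×3-fold to keep the sketch fast:

```python
# Grid search hyperparameter tuning for a gradient-boosted tree and a random
# forest, scored under a repeated k-fold cross-validation regimen.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     cross_val_score)

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=0)

models = {
    "gbtree": (GradientBoostingClassifier(random_state=0),
               {"n_estimators": [50, 100], "max_depth": [2, 3]}),
    "rf": (RandomForestClassifier(random_state=0),
           {"n_estimators": [50, 100], "max_features": ["sqrt", None]}),
}

scores = {}
for name, (est, grid) in models.items():
    # Hyperparameter tuning without manual intervention.
    search = GridSearchCV(est, grid, cv=3, scoring="roc_auc").fit(X, y)
    scores[name] = cross_val_score(search.best_estimator_, X, y,
                                   cv=cv, scoring="roc_auc")

for name, s in scores.items():
    print(f"{name}: ROCAUC {s.mean():.2f} ± {s.std():.2f}")
```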
A condensed statement regarding the algorithms’ setup and structure can be found in the Supplementary Appendix ( Table 12 ).
Study population characteristics
Continuous variables were tested for normal distribution using the Kolmogorov-Smirnov test.
If both instances of the compared variable were normally distributed, we used the Welch t test for the evaluation of the likelihood that both were drawn from the same distribution; if not, we used the Wilcoxon signed-rank test. For categorical variables, we used the Fisher exact test. Where multiple comparisons were performed, the Bonferroni correction was applied.
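The testing scheme can be sketched with SciPy. Because the compared groups are independent, the rank-based fallback below is the unpaired Mann-Whitney U (Wilcoxon rank-sum) test; all data are synthetic except the headache counts, which come from Table 1:

```python
# Sketch of the univariate testing scheme: Kolmogorov-Smirnov for normality,
# Welch t test if both groups look normal, rank-based test otherwise,
# Fisher's exact test for categorical variables, Bonferroni for multiplicity.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
no_ao = rng.normal(80, 15, size=200)   # e.g. diastolic BP without AO
ao = rng.normal(87, 16, size=80)       # ... with AO

def compare_continuous(a, b, alpha=0.05):
    normal = all(stats.kstest(stats.zscore(g), "norm").pvalue > alpha
                 for g in (a, b))
    if normal:
        return stats.ttest_ind(a, b, equal_var=False).pvalue  # Welch
    return stats.mannwhitneyu(a, b).pvalue                    # rank-based

p = compare_continuous(no_ao, ao)

# Categorical feature: 2x2 counts (headaches by outcome, from Table 1).
table = [[207, 45], [1575, 611]]
p_cat = stats.fisher_exact(table)[1]

n_tests = 50  # Bonferroni: divide alpha by the number of comparisons
print(p < 0.05 / n_tests, p_cat < 0.05 / n_tests)
```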
Direct biomarker models
To cover different approaches and to draw comparisons with clinical practice, we explored several approaches to derive alternative predictions for our datasets.
- 1.
We used blood pressure as a predictive factor, classifying any women with either a systolic blood pressure ≥140 mm Hg or a diastolic blood pressure ≥90 mm Hg as being predictive of an AO in the future.
- 2.
We used a cutoff of 38 for the sFlt-1–to–PlGF ratio, which has previously demonstrated a high predictive value for the absence of AO, and used it to classify women as “at risk” if the ratio was >38.
- 3.
We combined the blood pressure threshold, proteinuria (indicated by a urine dipstick measurement of “++” or above or by >300 mg/L in 24-hour urine analysis), and the sFlt-1–to–PlGF ratio cutoff as binary flags joined via a logical OR and analyzed the predictive value of that ensemble.
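The three comparison rules can be sketched as simple classifiers, with the third joining the binary flags via a logical OR; the thresholds come from the text, while the input arrays are illustrative:

```python
# Direct biomarker/clinical rule models used for comparison.
import numpy as np

def predict_bp(sys_bp, dia_bp):
    # Rule 1: elevated blood pressure flags a future AO.
    return (sys_bp >= 140) | (dia_bp >= 90)

def predict_ratio(sflt1_plgf_ratio):
    # Rule 2: sFlt-1-to-PlGF ratio above the cutoff of 38.
    return sflt1_plgf_ratio > 38

def predict_combined(sys_bp, dia_bp, proteinuria_flag, ratio):
    # Rule 3: logical OR of the blood pressure, proteinuria, and ratio flags.
    return predict_bp(sys_bp, dia_bp) | proteinuria_flag | predict_ratio(ratio)

sys_bp = np.array([150.0, 120.0, 118.0])
dia_bp = np.array([95.0, 80.0, 78.0])
proteinuria = np.array([False, False, True])
ratio = np.array([20.0, 85.0, 10.0])

print(predict_combined(sys_bp, dia_bp, proteinuria, ratio))  # each woman flagged
```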
Cutoff analysis
The basic operation of our tree classifiers was set up to produce a probability for AOs. For standard analysis, we chose a naive cutoff of ≥0.5 to classify outputs as predictive for the probability of an AO.
In a second step, we iterated over the entire range of possible values to derive decision thresholds from the training set, maximizing accuracy, the F1 score, and the area under the receiver operating characteristic curve (ROCAUC), respectively. Moreover, we applied the individual decision thresholds to the training dataset and analyzed the resulting classification compared with the naive approach.
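The cutoff analysis reduces to a sweep over candidate thresholds on the training predictions; a sketch maximizing the F1 score on synthetic probabilities (accuracy and ROCAUC optimization are analogous):

```python
# Threshold sweep: replace the naive >=0.5 cutoff with the threshold that
# maximizes a chosen composite metric on the training predictions.
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, proba, metric=f1_score):
    grid = np.linspace(0.01, 0.99, 99)
    scores = [metric(y_true, proba >= t) for t in grid]
    return grid[int(np.argmax(scores))]

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 1])
proba = np.array([0.1, 0.2, 0.35, 0.4, 0.45, 0.6, 0.7, 0.9])

t = best_threshold(y_true, proba)
naive_f1 = f1_score(y_true, proba >= 0.5)
tuned_f1 = f1_score(y_true, proba >= t)
print(t, naive_f1, tuned_f1)
```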
Model evaluation
We based the evaluation of our model on simple confusion matrix–derived metrics, such as sensitivity, specificity, PPV, and NPV, and 4 composite measures: the F1 score, threat score, accuracy, and ROCAUC.
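These metrics all derive from the four confusion-matrix cells; the threat score (also known as the critical success index) is TP/(TP+FN+FP):

```python
# Confusion-matrix metrics used for model evaluation.
import numpy as np

def metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "threat_score": tp / (tp + fn + fp),
    }

m = metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 1])
print(m)
```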
Additional data concerning the calibration of the models are available in the Appendix ( Figure 2 ; Supplemental Table 5 ).
Data gathering
Data were gathered manually in an Excel spreadsheet from the hospital information system (SAP Hana, SAP, Walldorf, Germany), ultrasound records (Viewpoint; GE Healthcare, Solingen, Germany), or paper records if none of the others applied.
Ethics approval
All participants consented to the gathering of their data. The study was approved by the ethics committee of the Charité – Universitätsmedizin Berlin, Berlin, Germany.
Results
Study population characteristics
The study population was composed of 1647 patients with a mean maternal age of 33±6 years and a mean gestational age of 34 6/7 weeks (204 gestational days) with a standard deviation of 41 days.
Between patients with an AO and without an AO, there were highly significant ( P <.001) differences in all diastolic blood pressure– and systolic blood pressure–derived features, gestational age in days, height, blood potassium levels, 24-hour proteinuria, blood urea measurements, and headache or visual disturbance symptoms.
Further features with highly significant differences were the absence of symptoms or other pathologic findings at the current visit, all features derived from the mean pulsatility index of the uterine arteries, a medical history of hypertension, and all sFlt-1–, PlGF-, and sFlt-1–to–PlGF ratio–derived features ( Table 1 ). Full study population characteristics can be found in the Supplemental Appendix ( Supplemental Table 3 ).
Feature name | Absolute P value | Number of samples in patients without AOs | Number of samples in patients with AOs | Mean value for patients without AOs | Mean value for patients with AOs | SD for patients without AOs | SD for patients with AOs | P value level |
---|---|---|---|---|---|---|---|---|
Age | .01 | 1800 | 672 | 32.33 | 32.77 | 6.07 | 5.83 | <.05 |
BMI (kg/m 2 ) | .08 | 1778 | 657 | 26.97 | 27.23 | 6.56 | 6.88 | >.05 |
Creatinine level (mg/dL) | .00 | 599 | 230 | 0.62 | 0.72 | 0.14 | 0.33 | <.05 |
Pulsatility index for middle cerebral artery | .10 | 548 | 344 | 1.76 | 1.71 | 0.47 | 0.51 | >.05 |
ALT (IU/L) | .00 | 665 | 242 | 29.18 | 46.47 | 58.43 | 88.38 | <.05 |
AST (IU/L) | .00 | 661 | 242 | 35.08 | 55.88 | 65.13 | 101.78 | <.05 |
Diastolic blood pressure (mm Hg) | .00 | 1500 | 526 | 80.23 | 86.95 | 14.78 | 16.36 | <.001 |
Gestational age in days | .00 | 1779 | 654 | 234.27 | 204.33 | 42.73 | 40.74 | <.001 |
Height (cm) | .00 | 1791 | 662 | 165.81 | 164.43 | 6.73 | 6.44 | <.001 |
Hematocrit level (%) | .20 | 690 | 253 | 0.31 | 0.34 | 17.87 | 0.04 | >.05 |
Hemoglobin level (mg/dL) | .37 | 690 | 253 | 11.78 | 11.75 | 1.31 | 1.62 | >.05 |
Potassium level (mmol/L) | .00 | 526 | 207 | 4.13 | 4.18 | 1.30 | 0.48 | <.001 |
LDH (U/L) | .28 | 215 | 79 | 246.14 | 272.07 | 54.23 | 166.39 | >.05 |
Mean pulsatility index for uterine arteries | .00 | 1116 | 457 | 1.15 | 99.45 | 4.33 | 2085.90 | <.001 |
24-h proteinuria (mg/day) | .00 | 562 | 271 | 632.06 | 1665.65 | 1968.10 | 2731.16 | <.001 |
Prothrombin time (s) | .00 | 275 | 121 | 104.78 | 108.36 | 14.24 | 13.99 | <.05 |
aPTT (s) | .03 | 283 | 123 | 33.10 | 33.49 | 9.13 | 5.43 | <.05 |
sFlt-1 (ng/dL) | .00 | 1782 | 656 | 4650.25 | 8347.30 | 4444.52 | 7090.54 | <.001 |
sFlt-1 (ng/dL) percentiles | .00 | 1769 | 649 | 0.43 | 0.66 | 0.35 | 0.35 | <.001 |
sFlt-1–to–PlGF ratio | .00 | 1782 | 656 | 57.58 | 243.70 | 102.51 | 310.47 | <.001 |
Pulsatility index for umbilical arteries | .00 | 1636 | 569 | 0.98 | 1.36 | 0.27 | 0.58 | <.001 |
Weight (kg) | .14 | 1783 | 657 | 74.24 | 73.64 | 18.76 | 19.03 | >.05 |
Sodium level in serum (mmol/L) | .15 | 273 | 147 | 137.01 | 136.71 | 2.66 | 2.98 | >.05 |
Systolic blood pressure (mm Hg) | .00 | 1500 | 526 | 129.35 | 139.98 | 21.32 | 23.38 | <.001 |
Thrombocyte count (/nL) | .71 | 690 | 253 | 219.96 | 224.68 | 70.51 | 75.06 | >.05 |
Urea (mg/dL) | .00 | 254 | 138 | 16.04 | 21.94 | 9.17 | 12.34 | <.001 |
PlGF (ng/dL) | .00 | 1782 | 656 | 314.18 | 123.21 | 386.36 | 218.24 | <.001 |
Antiphospholipid syndrome present | .62 | 1800 | 672 | 36 | 11 | — | — | >.05 |
Any autoimmune disease present | .54 | 1800 | 672 | 169 | 69 | — | — | >.05 |
Diabetes mellitus present | 1.00 | 1800 | 672 | 99 | 37 | — | — | >.05 |
Elevated liver enzymes | .01 | 1782 | 656 | 99 | 57 | — | — | <.05 |
Epigastric pain | .01 | 1782 | 656 | 156 | 37 | — | — | <.05 |
First parity | .01 | 1798 | 672 | 640 | 277 | — | — | <.05 |
Gestational hypertension in current pregnancy | .03 | 1782 | 656 | 205 | 55 | — | — | <.05 |
Headaches | .00 | 1782 | 656 | 207 | 45 | — | — | <.001 |
No symptom or pathologic blood measurement | .00 | 1782 | 656 | 667 | 133 | — | — | <.001 |
Thrombocyte count (/nL) <150 | .92 | 1782 | 656 | 96 | 34 | — | — | >.05 |
Family history for preeclampsia-related diseases | .08 | 1800 | 672 | 177 | 83 | — | — | >.05 |
Medical history of gestational hypertension in previous pregnancies | .33 | 1782 | 656 | 162 | 51 | — | — | >.05 |
Medical history of hypertension | .00 | 1800 | 672 | 219 | 132 | — | — | <.001 |
Medical history of preeclampsia in previous pregnancies | .04 | 1800 | 672 | 349 | 156 | — | — | <.05 |
sFlt-1–to–PlGF ratio>38 | .00 | 1800 | 672 | 651 | 491 | — | — | <.001 |
sFlt-1–to–PlGF ratio>84 | .00 | 1800 | 672 | 347 | 412 | — | — | <.001 |
Visual disturbances present | .00 | 1782 | 656 | 57 | 6 | — | — | <.001 |
Fetal sex | .27 | 784 | 404 | 384 | 212 | — | — | >.05 |
Smoking (current or past) | .31 | 1781 | 666 | 147 | 46 | — | — | >.05 |
Asian ethnicity | .21 | 1800 | 672 | 44 | 23 | — | — | >.05 |
African ethnicity | 1.00 | 1800 | 672 | 53 | 20 | — | — | >.05 |
Latin ethnicity | .51 | 1800 | 672 | 18 | 9 | — | — | >.05 |
White ethnicity | .13 | 1800 | 672 | 1670 | 611 | — | — | >.05 |
Renal disease present | .18 | 1800 | 672 | 57 | 29 | — | — | >.05 |
New onset of hypertension at visit | .00 | 1782 | 656 | 451 | 227 | — | — | <.001 |
New onset of proteinuria at visit | .01 | 1782 | 656 | 284 | 137 | — | — | <.05 |
Obesity (BMI≥30) | .96 | 1778 | 657 | 504 | 187 | — | — | >.05 |
Age>40 y | .43 | 1800 | 672 | 127 | 54 | — | — | >.05 |
Adverse outcomes
We observed an overall number of 386 AOs (23.4% of patients), with most being fetal AOs. Of note, 24 children died within 1 week after delivery, 5 children developed IVH, 3 children developed NEC, 12 sustained a placental abruption, 253 neonates were delivered prematurely (<34 0/7 weeks of gestation), and 190 children sustained any form of RDS. The maternal AOs included 1 death during or after delivery, 5 occurrences of lung edema, 8 cases of renal failure, 1 case of DIC, and 33 cases of HELLP syndrome.
The overall number of patients presenting at <34 weeks of gestation was 917 (55.5%) with the remaining 733 patients (44.4%) presenting at ≥34 weeks of gestation.
Of all AOs, 339 cases (87.8%) manifested at <34 weeks of gestation, whereas the remaining 47 cases (12.2%) manifested at ≥34 weeks of gestation.
The fraction of AOs manifesting within 2 weeks after any measurement was 80.8% (312/386) ( Supplemental Table 4 ).
Algorithm performance
We obtained metrics for the 2 ML models at the naive decision threshold. The GBTree model performed with a PPV of 81.8%±10%, an NPV of 88.5%±3.5%, a sensitivity of 67.6%±4.3%, a specificity of 94.6%±3%, an overall accuracy of 87.1%±2.8%, an ROCAUC of 0.811±0.029, and an F1 score of 0.737±0.057.
The RF classifier returned a comparable PPV (80.8%±9%) and specificity (94.9%±1.9%) while performing slightly inferior on the other available metrics.
Because of their relative similarity, we presented the GBTree figures in the following sections. The results for the RF classifier can be found in the Supplemental Appendix .
Direct biomarker models
We investigated several different approaches to better compare the ML models’ results with clinical approaches represented by standard of care metrics and the sFlt-1–to–PlGF ratio–derived metrics.
- 1.
To understand our dataset, we first examined the predictive value of treating patients with elevated blood pressure as at risk of AOs. This approach performed worse overall than both ML methods across all metrics, with a PPV of 32.9%±8.7%, an NPV of 74.5%±6.2%, a sensitivity of 28.9%±7%, a specificity of 78.2%±3.4%, an accuracy of 64.7%±3.2%, an F1 score of 0.181±0.046, and an ROCAUC of 0.535±0.037.
- 2.
Examining the sFlt-1–to–PlGF ratio on its own performed well in terms of NPV (85.5%±3.9%) and sensitivity (69.9%±3.8%), with noticeably lower performance than the classifiers in terms of specificity (67.1%±5.9%) and accuracy (67.8%±4.5%), a lower F1 score of 0.372±0.071, and an ROCAUC of 0.685±0.036. The PPV, at 44.5%±9.8%, was considerably lower than that of the ML classifiers. For further information on the patients missed by the biomarker ratio but classified correctly by the algorithm, see Supplementary Appendix Table 8 . An exemplary subset of patients missed by both the biomarkers and the algorithm can be found in Supplementary Table 9 .
- 3.
We next combined the blood pressure decision thresholds with proteinuria indicators and the sFlt-1–to–PlGF ratio cutoff. This model performed similarly to the sFlt-1–to–PlGF predictor alone, with high NPV (85.5%±4.1%) and sensitivity (70.4%±4.2%) and comparatively high ROCAUC (0.683±0.034) ( Table 2 ).
Table 2

Model | Metric | PPV | NPV | Sensitivity | Specificity | Accuracy | ROCAUC | F1 |
---|---|---|---|---|---|---|---|---|
GBTree | Average | 0.82 | 0.88 | 0.68 | 0.95 | 0.87 | 0.81 | 0.74 |
| SD | 0.10 | 0.04 | 0.04 | 0.03 | 0.03 | 0.03 | 0.06 |
RF | Average | 0.81 | 0.86 | 0.59 | 0.95 | 0.85 | 0.77 | 0.68 |
| SD | 0.09 | 0.05 | 0.05 | 0.02 | 0.03 | 0.03 | 0.05 |
Blood pressure cutoffs | Average | 0.33 | 0.74 | 0.29 | 0.78 | 0.65 | 0.54 | 0.18 |
| SD | 0.09 | 0.06 | 0.07 | 0.03 | 0.04 | 0.04 | 0.05 |
sFlt-1–to–PlGF cutoff | Average | 0.44 | 0.86 | 0.70 | 0.67 | 0.68 | 0.69 | 0.37 |
| SD | 0.10 | 0.04 | 0.04 | 0.06 | 0.05 | 0.04 | 0.07 |
Blood pressure and proteinuria and sFlt-1–to–PlGF cutoff | Average | 0.44 | 0.86 | 0.70 | 0.66 | 0.67 | 0.68 | 0.37 |
| SD | 0.09 | 0.04 | 0.04 | 0.06 | 0.04 | 0.03 | 0.07 |
Maximizing one of the composite metrics (accuracy, F1 score, or ROCAUC) resulted in higher performance across all composite metrics and comparable performance in terms of direct metrics (PPV, NPV, sensitivity, and specificity). Specific values for the mean cutoff by maximized metric can be found in the Appendix ( Supplemental Table 1 ).
The accuracy-optimized GBTree model returned a PPV of 86.9%±7.3%, an NPV of 89.2%±3.6%, a sensitivity of 69.3%±7.1%, a specificity of 95.9%±2.3%, an overall accuracy of 88.7%±2.6%, an ROCAUC of 0.826±0.032, and an F1 score of 0.767±0.049.
Compared with the accuracy optimization, the F1-optimized GBTree model performed with a PPV of 82.8%±6.9%, an NPV of 90.3%±2.2 %, a sensitivity of 72.5%±6.8%, a specificity of 93.8%±4.1%, an accuracy of 88.4%±2.8%, an ROCAUC of 0.832±0.03, and an F1 score of 0.77±0.05.
The ROCAUC-optimized models for the GBTree model returned a PPV of 72.3%±12%, an NPV of 91.9%±2.5%, a sensitivity of 79.4%±4.6%, a specificity of 88.1%±5.9%, an accuracy of 85.7%±3.9%, an ROCAUC of 0.837±0.026, and an F1 score of 0.75±0.066 ( Table 3 ; Figure 1 ).
Optimized parameter | Model | Metric | PPV | NPV | Sensitivity | Specificity | Accuracy | ROCAUC | F1 |
---|---|---|---|---|---|---|---|---|---|
Accuracy | GBTree | Average | 0.87 | 0.89 | 0.69 | 0.96 | 0.89 | 0.83 | 0.77 |
SD | 0.07 | 0.04 | 0.07 | 0.02 | 0.03 | 0.03 | 0.05 | ||
RF | Average | 0.81 | 0.88 | 0.66 | 0.93 | 0.86 | 0.80 | 0.72 | |
SD | 0.05 | 0.03 | 0.10 | 0.05 | 0.03 | 0.04 | 0.05 | ||
F1 | GBTree | Average | 0.83 | 0.90 | 0.73 | 0.94 | 0.88 | 0.83 | 0.77 |
SD | 0.07 | 0.02 | 0.07 | 0.04 | 0.03 | 0.03 | 0.05 | ||
RF | Average | 0.72 | 0.91 | 0.75 | 0.89 | 0.85 | 0.82 | 0.73 | |
SD | 0.07 | 0.03 | 0.07 | 0.05 | 0.03 | 0.03 | 0.05 | ||
ROCAUC | GBTree | Average | 0.72 | 0.92 | 0.79 | 0.88 | 0.86 | 0.84 | 0.75 |
SD | 0.12 | 0.03 | 0.05 | 0.06 | 0.04 | 0.03 | 0.07 | ||
RF | Average | 0.67 | 0.92 | 0.80 | 0.84 | 0.83 | 0.82 | 0.72 | |
SD | 0.11 | 0.03 | 0.03 | 0.06 | 0.04 | 0.02 | 0.06 |
Data concerning the model’s calibration and statistical differences between the optimized threshold models compared with the naive model can be found in the Appendix ( Supplemental Tables 5-7 and Supplemental Figure 2 ).
Interpreting the model
Interpreting the ML models, particularly in the domain of medical AI, is an emerging topic. A larger absolute Shapley value means a parameter typically contributes more to decision-making.
The highest impact features derived by Shapley values were the gestational age (0.725±0.073), measurement of the sFlt-1–to–PlGF ratio outside of the 95th percentile (0.237±0.021), sFlt-1–to–PlGF ratio MoM (0.144±0.025), sFlt-1 MoM (0.184±0.023), PlGF deviation from the median (0.157±0.028), and height (0.14±0.032) ( Figure 2 ). Concrete values for each feature’s mean absolute Shapley value can be found in the Appendix ( Supplemental Table 2 ), along with an exemplary illustration of the interpretation of Shapley values for a specific example ( Supplemental Figure 1 ).
Discussion
Principal findings
We developed a fully automated, easily expandable approach for creating a predictive model out of real-world patient data from a high-risk pregnant population. The resulting companion diagnostic tool was capable of predicting the risk of impending AOs with high accuracy, sensitivity, and PPV. These findings were a preliminary step toward a more integrative diagnostic system for disease in pregnancy.
This was an example of how decision threshold tuning can be used to steer an ML model toward desirable performance characteristics. However, tuning the decision threshold can lead to overfitting and a lack of generalization. We addressed this issue by comparing the results across several repetitions through cross-validation and by following best practices for developing medical AI.
Results of the study in context with other observations
The current standard of care for preeclampsia has a low predictive power to detect preeclampsia-related AOs. The discrepancy between the heterogeneity of signs and symptoms for preeclampsia and the severity of the potential complications leads to overdiagnosis and potentially overtherapy. The addition of the sFlt-1–to–PlGF ratio threshold test, which has yet to see widespread adoption, represented a major step toward more precise detection of the disease and its severe outcomes. Our ML model, building on top of this and other biomarkers, demonstrated an improvement over this test and other direct biomarker-based clinical decision criteria.
The prediction of preeclampsia employing ML models was recently examined in several scientific papers.
Jhee et al used a multitude of statistical models and found the most effective to be a stochastic gradient boosting method with an ROCAUC of 0.924 while predicting late-onset preeclampsia in a regional cohort of 11,006 pregnant women with a preeclampsia development rate of 4.7%.
Sandström et al used a dataset of 62,562 nulliparous women to evaluate and compare the performance of prespecified variables, backward selection of variables, and a RF classifier. The best performance, with an ROCAUC of 0.68 for the prediction of preeclampsia at <37 weeks of gestation, was obtained by using the prespecified variables model. The RF predictor did not prove superior. In their dataset, 4.4% of women developed preeclampsia.
Marić et al investigated the usage of ML as a tool for early prediction of preeclampsia. They investigated 2 different methods, elastic net and GBTree as algorithms, with the GBTree outperforming the elastic net with an ROCAUC of 0.89. They established the measurements on a retrospective cohort of 16,370 births.
These examples aimed to predict preeclampsia according to clinical definitions that are currently undergoing substantial change, rather than AOs directly. The latter outcome metric has been examined by several studies, although with very little attention from ML researchers.
As published in the PROGNOSIS study, the sFlt-1–to–PlGF ratio at the cutoff of 38 has a very high NPV of 99.3% for ruling out the onset of preeclampsia within 1 week and of 99.5% for ruling out AOs. The PPV for ruling in the disease within 4 weeks was 36.7%. The high NPV was a substantial improvement over conventional predictive tools and led to a widespread recommendation to use the sFlt-1–to–PlGF ratio as a rule-out test. However, the low PPV has not substantially improved positive prediction and has a lower impact on management decisions in case of increased values. Others have reported a comparable NPV of 98.8% and a PPV of 53.5% for the sFlt-1–to–PlGF ratio.
We trained and tested our ML model using a real-world database, which is an expanded and updated version of the database described in Dröge et al. In using an ML approach rather than conventional statistics, we could substantially improve the positive rule-in metrics, such as sensitivity and PPV, which were the main metrics still underperforming in the current standard of care and purely biomarker-based predictive models.
Dröge et al evaluated the application of “conventional” statistics on a previous version of our dataset. Our colleagues showed that multivariate regression analysis yields markedly better results than the sFlt-1–to–PlGF ratio or blood pressure and proteinuria alone. The area under the curve for the full model, at 0.887, proved even higher than the one presented in this paper while retaining high positive and negative predictive power (specificity, 87.3%; sensitivity, 80%; NPV, 90.2%; PPV, 75%). This signifies the validity of more traditional statistical approaches. However, creating that model required considerably more effort and hands-on corrections. Herein lies a key advantage of our methodology: our automated pipeline was capable of generating noninferior models with remarkably less work overhead.
These examples showed the potential of ML models to match conventional statistical approaches and outperform the current standard of care. The application to the prediction of AOs, toward which our paper presented a first step, has yet to be thoroughly investigated.
Moreover, although we acknowledge that screening for preeclampsia is recommended in the first trimester in low-risk women, we focused on short-term prediction of women presenting with signs and symptoms in later stages of pregnancy because of the higher intrinsic risk of our study population.
Clinical implications
Integrating ML-based algorithms in decision support tools provides a veritable improvement in the clinician’s ability to make informed decisions about their patients’ health status and appropriate treatments.
By following a formalized and established approach for the development of ML in healthcare, our work represents a relevant step toward a structured pipeline that connects medicine to modern ML practices.
Furthermore, we showed how a step toward a more interpretable ML model can be made. By using Shapley values, we gained limited insight into the weights associated with certain features as assigned by the algorithms.
This can provide a step toward truly interpretable models, especially with the adoption of the General Data Protection Regulation in the European Union, requiring “meaningful information about the logic involved” in automated decision processes.
Research implications
The presented research provided several different avenues for further investigation in both the clinical and theoretical fields.
First, the set of features we used to derive the ML model should be further investigated for additions (such as other biomarkers) or exclusions, based on the statistics we provided for measuring the features’ impact on the decisions made by the ML models.
Second, the research should be verified in other populations than the one presented in this paper, especially regarding ethnic composition and preliminary risk of preeclampsia.
Another topic for follow-up could be the application in different contexts, especially concerning modern capabilities of data collection, such as home monitoring and/or high-resolution time series data obtained in a clinical setting.
In addition, the topic of postpartum preeclampsia, which was not a subject of our study, could prove worthy of further research using the presented methods.
Strengths and limitations
Our approach presented a fully automated system that was capable of incorporating considerably more clinical parameters than traditional approaches. It required very little interaction, handled missing features without the need for data imputation, and was, to a limited extent, tunable for specific characteristics.
The most significant limitation of our study was, from an ML perspective, the small size of the dataset, which posed the risk of overfitting the obtained results.
Second, the underlying patient collective represents a high-risk population from a single hospital group, which limits generalization to low-risk pregnant women. Moreover, we acknowledge the potential for intervention bias, which could contribute to the relatively low predictive ability of the serum markers, as elevated serum levels would lead to increased attention and possible intervention.
Third, we were bound to use manual data gathering. Direct access to the clinical information system and its output would most likely increase the results’ reliability.
Fourth, there was an inherent risk for ML approaches to obtain patterns specific to a certain dataset and generate an “overfitted” model. High dimensionality and limited understanding of the decision-making processes of a fitted model represented significant obstacles in detecting erroneous pattern matching.
The need for data availability and interoperability was a key concern throughout all fields of medical research and posed an especially important challenge for the development of medical AI.
These limitations were strongly mitigated by the depth of our methodologic approach and the scale and international nature of the population served by the treating clinic.
Conclusion
We presented a working, automated, and scalable system for the prediction of AOs in high-risk pregnant women. The presented ML models performed with high accuracy and overall reliability and could be easily deployed to a clinical environment as a companion diagnostic tool for pregnant women and provide valuable decision support for physicians in both clinical and private practices.
Using ML to combine patient history with basic laboratory results within an sFlt-1–to–PlGF ratio decision paradigm greatly improved the rule-in predictive metrics while maintaining excellent performance on rule-out criteria.
Although the application of these methods is still in the research phase, the results, given the current extent of research performed, show definite promise for a clinical decision support tool.
Acknowledgments
We want to thank Andreas Busjahn for the preliminary work on the database. The work was funded by the Berlin Institute of Health at Charité – Universitätsmedizin Berlin (Charité BIH Innovation, BIH Digital Health Accelerator Program).
Supplementary Data
Appendix
Detailed description of machine-learning techniques
A decision tree is a machine-learning (ML) method that iteratively builds an algorithm from a given set of features a = {a_1, …, a_N}, with |a| = N, and a set of examples S, with |S| = M, where each example can take different values for each of the features and is labeled with y = 0/1 as belonging to one of 2 distinctive categories.
Initially, the algorithm examines several thresholds j for each feature a_k and categorizes all samples along this threshold to find an optimal partition of the original data points. An optimal partition is defined as the one maximizing information gain, found using an exact greedy algorithm. Once the optimal value pair (a_k, j) is found, the dataset is split, and the algorithm is carried out recursively for each of the created leaves of the tree.
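The per-feature threshold search can be sketched as follows, using Shannon entropy as the information measure. This is a didactic illustration, not the study's implementation (the study used the optimized routines in xgboost and scikit-learn); all function names here are ours.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of 0/1 class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Exact greedy search: test every candidate threshold j on one
    feature and return the (threshold, gain) pair maximizing the
    information gain over the parent node."""
    parent = entropy(labels)
    best = (None, 0.0)
    for j in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= j]
        right = [y for x, y in zip(values, labels) if x > j]
        if not left or not right:
            continue  # a partition must have samples on both sides
        child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = parent - child
        if gain > best[1]:
            best = (j, gain)
    return best

# A perfectly separable toy feature: threshold 2 splits the classes.
thr, gain = best_split([1, 2, 3, 4], [0, 0, 1, 1])
print(thr, round(gain, 3))  # -> 2 1.0
```

A full decision tree repeats this search over all features a_k, splits at the winning (a_k, j) pair, and recurses into each resulting leaf.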
A random forest algorithm selects random subsets a_s = {a_s1, …, a_sn} of features, with each a_si = {a_h, …, a_k} ⊂ a, h, k ∈ ℕ, and |a_si| = d, and builds a set of shallow decision trees T = {T_1, …, T_n} of depth d from them, where T_i uses the features in a_si for classification. Given an input vector x, it then computes a prediction vector y = T(x) = {T_1(x), …, T_n(x)} and uses the majority of predictions, maj(y) = ŷ, as the overall predicted class for x.
The researcher has to determine 2 variables for this approach: the number of trees (n in the example above) and their depth d.
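A minimal sketch of the ensemble-and-vote mechanism, assuming trivially simple depth-1 "trees" that each see one randomly chosen feature (the study's forests are deeper and trained on data; names here are ours):

```python
import random
from collections import Counter

def make_stump(feature_index, threshold):
    """A depth-1 'tree' T_i voting class 1 if its feature exceeds
    the threshold, class 0 otherwise."""
    return lambda x: 1 if x[feature_index] > threshold else 0

def forest_predict(trees, x):
    """maj(y): majority vote over the individual tree predictions."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(42)  # seeded for reproducibility
# n = 5 stumps, each assigned a random feature of a 3-dimensional input
trees = [make_stump(rng.randrange(3), 0.5) for _ in range(5)]
print(forest_predict(trees, [0.9, 0.9, 0.9]))  # every stump votes 1 -> 1
print(forest_predict(trees, [0.1, 0.1, 0.1]))  # every stump votes 0 -> 0
```

In practice, both hyperparameters (number of trees n and depth d) were chosen via the grid search described in the Study Design.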
Gradient boosting performs several iterative steps to find an optimal approximation of a predictor. At each step m, it evaluates an imperfect predictive function f_m(x) by first calculating its prediction y and then subtracting y from the correct labeling to obtain the residual ŷ = y_true − y. This residual, representing the errors made by the previous iteration of the algorithm, is used as the basis to train a new predictor (in our instance, a new decision tree), g_m(x), which can be viewed as a correction of the errors made by f_m(x). The 2 are combined by the formula f_{m+1}(x) = f_m(x) + α·g_m(x), with α representing the learning rate, that is, the proportion of adjustment each new predictor is allowed to contribute to the overall function. This function forms the basis for the next step, m + 1. The process is repeated until a cutoff for improvement compared with the last iteration is reached.
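The update f_{m+1}(x) = f_m(x) + α·g_m(x) can be demonstrated with a toy sketch in which the weak learners g_m are depth-1 regression stumps fit to the residuals of the previous iteration. This is a simplified stand-in for the trees used by GBTree, with a fixed number of steps instead of an improvement cutoff; all names are ours.

```python
def fit_stump(xs, residuals):
    """Fit g_m: a threshold stump minimizing squared error on the
    residuals, predicting the mean residual on each side of the split."""
    best = None
    for j in xs:
        left = [r for x, r in zip(xs, residuals) if x <= j]
        right = [r for x, r in zip(xs, residuals) if x > j]
        lm = sum(left) / len(left) if left else 0.0
        rm = sum(right) / len(right) if right else 0.0
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, j, lm, rm)
    _, j, lm, rm = best
    return lambda x: lm if x <= j else rm

def boost(xs, ys, alpha=0.5, steps=20):
    f = [0.0] * len(xs)  # f_0: predict 0 everywhere
    stumps = []
    for _ in range(steps):
        residuals = [y - p for y, p in zip(ys, f)]       # y_hat = y_true - y
        g = fit_stump(xs, residuals)                     # g_m corrects f_m
        stumps.append(g)
        f = [p + alpha * g(x) for p, x in zip(f, xs)]    # f_{m+1} = f_m + alpha*g_m
    return lambda x: sum(alpha * g(x) for g in stumps)

model = boost([1, 2, 3, 4], [0.0, 0.0, 1.0, 1.0])
print(round(model(1), 2), round(model(4), 2))  # converges toward 0.0 1.0
```

Each iteration shrinks the remaining residual by the factor (1 − α), illustrating why the learning rate trades convergence speed against the size of each correction.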
Cross-validation
The different classifiers’ performance was compared using the model for a “corrected repeated k-fold cv test” proposed by Bouckaert and Frank in 2004.
All models were evaluated via 10×10-fold cross-validation, meaning that, for 10 unique initializations, we split the dataset into 10 approximately equally sized subsets and trained 10 models on each unique combination of 9 out of 10 of the subsets, testing on the remaining subset. This results in 100 models trained on unique, but sometimes overlapping, subsets of data for statistical comparison.
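The generation of these 100 train/test splits can be sketched as follows. This is an illustration of the general scheme, assuming simple shuffling with a fresh seed per repetition; the study's exact splitting procedure and function names may differ.

```python
import random

def repeated_kfold(n_samples, k=10, repetitions=10, base_seed=0):
    """For each of `repetitions` unique initializations, shuffle the
    sample indices and cut them into k folds; every fold serves once as
    the test set, yielding repetitions * k train/test splits."""
    splits = []
    for rep in range(repetitions):
        indices = list(range(n_samples))
        random.Random(base_seed + rep).shuffle(indices)  # unique initialization
        folds = [indices[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            splits.append((train, test))
    return splits

splits = repeated_kfold(100)
print(len(splits))            # 10 * 10 = 100 models to train
train, test = splits[0]
print(len(train), len(test))  # 90 / 10 split
```

Within one repetition the test sets are disjoint; across repetitions they overlap, which is precisely what the corrected repeated k-fold test accounts for statistically.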
Software packages
The analysis was performed using custom scripts written in the Python programming language. For statistical analysis, we used the libraries scipy, numpy, and pandas. For data processing, we used the pandas package, and for training and evaluating the ML models, we used xgboost and scikit-learn, with the “shap” package for calculation of the Shapley values.
All manually set random seeds were generated using the Mersenne Twister implementation provided by the numpy library.
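The practical effect of fixed seeding is reproducibility of every downstream random draw (fold shuffling, subsampling, and so on). As a minimal illustration, here we use Python's standard random module, which implements the same Mersenne Twister (MT19937) algorithm underlying numpy's legacy RandomState; the seed value is arbitrary.

```python
import random

# Two generators seeded identically produce identical sequences, making
# any pipeline built on them exactly reproducible.
gen_a = random.Random(1234)
gen_b = random.Random(1234)
draws_a = [gen_a.randrange(100) for _ in range(5)]
draws_b = [gen_b.randrange(100) for _ in range(5)]
print(draws_a == draws_b)  # -> True
```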