Abstract
Objective
This study aims to develop and validate a model based on the weighted random forest (WRF) algorithm to predict early-onset preeclampsia (PE) and to assess the importance of various clinical and biochemical markers in early risk identification.
Materials and methods
This study was conducted at the Jiangxi Maternal and Child Health Hospital and involved 12,699 pregnant women from January 2019 to June 2022. A broad set of clinical and biochemical markers was collected from prenatal care records and used to construct a predictive model for early-onset PE. The model was developed using the WRF and logistic regression methods, and multivariable analysis was employed to identify markers significantly associated with the risk of PE.
Results
The relative importance of various markers was evaluated using the random forest (RF) model in a sample of 1200 patients diagnosed with PE. Blood pressure and pre-pregnancy body mass index (BMI) were identified as the most critical variables affecting the accuracy of the PE prediction model. The WRF model achieved a higher AUC (0.9614) than the logistic regression model (0.9138), highlighting its superiority in early risk identification for PE.
Conclusion
The WRF-based predictive model developed in this study effectively predicts the risk of early-onset PE, with blood pressure and BMI as vital predictive factors. These findings underscore the importance of employing a comprehensive predictive model for risk assessment in early pregnancy, facilitating early intervention and improving health outcomes for pregnant women and their newborns.
Introduction
Preeclampsia (PE) is a pregnancy-specific multisystem disorder that arises after 20 weeks of gestation and is characterized by new-onset hypertension and significant proteinuria [ ]. This condition can lead to liver and renal dysfunction and coagulation impairment in pregnant women, with severe cases potentially causing maternal pulmonary edema, eclampsia, cerebral damage, or even death [ ]. Furthermore, fetal complications associated with PE include fetal growth restriction, neonatal respiratory distress syndrome, and stillbirth [ ]. According to the World Health Organization, PE affects up to 8 % of pregnancies globally and is a leading cause of maternal and perinatal morbidity and mortality [ ]. Despite the severe clinical consequences of PE, there are currently no accurate and effective predictive measures. At present, delivery of the placenta is considered the only definitive way to resolve the maternal symptoms of PE, and timely identification and management can significantly improve maternal and neonatal outcomes [ , ].
Typically, clinical prediction of PE can be categorized into four types: clinical risk factors, uterine artery Doppler screening, serological markers, and comprehensive screening [ ]. Despite the numerous predictive models for PE, they have yet to gain universal acceptance [ , ]. Many studies have indicated that individual and combined serological markers lack the specificity and sensitivity required for clinical application [ ]. Furthermore, more external validation is needed for many predictive models to be implemented widely in clinical practice [ ].
With the advancement of machine learning (ML) and the establishment of large databases, we can now identify significant connections between data points collected from various datasets, which may lead to breakthroughs in predicting PE. ML has been widely applied in the medical field, including in robotics, medical diagnostics, and prognostics [ ]. Random forest (RF), a method of ML, is recognized for its excellent noise resistance and capability to handle missing data, making it a promising approach for diagnosing diseases or predicting clinical outcomes [ ]. To address the issue where equal weighting of decision trees in RF methods may reduce overall classification performance, we have developed the weighted random forest (WRF) algorithm. This algorithm enhances the accuracy of decision trees by introducing dual training processes, thereby reducing misclassification rates and improving the classifier’s overall performance [ ].
This study aims to analyze all available clinical and laboratory data obtained from routine prenatal visits in early pregnancy using the WRF learning method and to compare the performance of models developed with WRF to those using traditional statistical methods. Yang et al. provided a predictive model for assessing the 3-year risk of cardiovascular diseases in a large Eastern Chinese population using the WRF algorithm, demonstrating that the WRF model outperformed multivariable regression models with an area under the curve (AUC) of 0.787 [ ]. Additionally, Guo Zhifei et al. developed a predictive model for classifying triple-negative breast cancer, where the WRF model’s performance significantly surpassed five other methods, with sensitivities, specificities, accuracies, AUC, and G-means of 0.852, 0.873, 0.871, 0.862, and 0.861, respectively.
The ultimate goal of this study is to provide clinicians with a routinely useable tool for assessing the risk of PE, enabling the identification of high-risk patients and guiding new physicians to take timely individualized measures, thus significantly improving the quality of life for mothers and children. By integrating the precise WRF model with extensive clinical and laboratory data, we aim to develop an innovative predictive model that can accurately forecast the risk of PE in early pregnancy, which is crucial for enhancing health outcomes for pregnant women and newborns.
Materials and methods
Study design and participants
This study was conducted at the Jiangxi Maternal and Child Health Hospital, a leading institution specializing in maternal and child healthcare in Jiangxi Province, with an annual delivery volume of 22,000 and over 2000 PE patients treated each year. It encompassed data from 12,699 pregnant women who received prenatal care at the hospital between January 2019 and June 2022, including all singleton pregnancies screened with nuchal translucency (NT) ultrasound or Down syndrome screening between 11 + 0 and 13 + 6 weeks of gestation. The consistency and accuracy of data collection were ensured through adherence to the hospital’s routine prenatal care and assessment protocols.
The main inclusion criterion was a singleton pregnancy with a viable fetus between 11 + 0 and 13 + 6 weeks of gestation, ensuring consistency and validity in comparative analyses. All risk factors included in the analysis were obtained before the diagnosis of preeclampsia, so the predictive model is based solely on pre-diagnostic data. Exclusion criteria were designed to remove factors that might affect the accuracy of the results, including multiple gestations, embryonic developmental arrest or severe structural anomalies, incomplete medical records, transfers after delivery, and mental health issues in pregnant women. Additionally, cases terminated due to medical or social reasons were excluded to minimize potential biases and uncertainties ( Fig. 1 ).

The study design took into account the representativeness of the study population, the accuracy of data collection, and the ethical conduct of the research, in order to support the validity and reliability of the results. Through this investigation of a specific population, we aim to provide new insights and methods for predicting early-onset PE, ultimately contributing to improved health outcomes for pregnant women and their newborns.
Research process and data collection
Data collection encompassed various clinical and biochemical markers randomly retrieved from the obstetric database. The sources included discrete data entered directly into the electronic medical record system by clinicians or nurses, structured data, and diagnostic codes entered by professional coders. The clinical data covered age, blood pressure, pre-pregnancy body mass index (BMI), number of pregnancies and deliveries, conception methods, educational level, dietary and lifestyle habits, and current gestational age. These data provided a comprehensive view for assessing the health status of pregnant women and their risk of PE.
In addition to clinical data, we collected a series of critical biochemical markers crucial for assessing the overall health of pregnant women and their potential risk of PE. This included fundamental blood composition analysis such as white blood cell count, hemoglobin levels, and other significant biochemical markers like liver function tests (alanine transaminase (ALT) and aspartate transaminase (AST)), kidney function tests (blood urea nitrogen (BUN) and serum creatinine), and blood lipid levels (total cholesterol, triglycerides, and lipoprotein levels). The integrated analysis of these data helps us better understand the relationship between pregnant women’s health status and PE risk.
Development of the predictive model
To develop a model capable of accurately predicting early-onset PE, this study analyzed 12,699 valid samples after thorough data cleansing, including 1200 positive cases diagnosed with PE and 11,499 control samples. The samples were randomly assigned into two datasets: 70 % formed the training set for model fitting, while the remaining 30 % served as the test set to evaluate model performance. We maintained a consistent distribution of PE cases in both datasets, approximately 9.45 %.
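The stratified 70/30 split described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn's `train_test_split` with its `stratify` argument; the text does not state which function was used, and the feature matrix and `random_state` here are placeholders, not study data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the cohort's class counts (1 = PE, 0 = non-PE).
y = np.array([1] * 1200 + [0] * 11499)
# Placeholder feature matrix; the real predictors would go here.
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the PE proportion the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

print(f"train: n={len(y_train)}, PE prevalence={y_train.mean():.4f}")
print(f"test:  n={len(y_test)}, PE prevalence={y_test.mean():.4f}")
```

Without stratification, a random split of an imbalanced cohort can leave the two subsets with noticeably different case rates, which complicates comparison of training and test performance.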
For model development, we employed both the WRF and logistic regression models. RF, an ensemble learning method, makes its final prediction by constructing multiple decision trees and aggregating their predictions. To address the class imbalance in this study, we assigned a higher weight to the PE class, specifically 9.5 times the weight of the non-PE class. This weighting reduces the model’s bias toward the majority class (non-PE cases) and enhances the detection of the minority class (PE cases).
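One way to express this class weighting in scikit-learn is the `class_weight` parameter of `RandomForestClassifier`. The sketch below assumes that mechanism; the text does not confirm that this exact parameter was used, and the data are synthetic stand-ins generated with `make_classification`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in data (about 9.5 % positives); not study data.
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.905, 0.095], random_state=0,
)

# Class ratio in the study cohort: 11,499 controls per 1200 PE cases.
cohort_ratio = 11499 / 1200  # about 9.58, close to the 9.5 used in the text

# Assumed implementation: errors on the PE class (label 1) are weighted
# 9.5 times more heavily than errors on the non-PE class (label 0).
wrf = RandomForestClassifier(
    n_estimators=100,
    class_weight={0: 1.0, 1: 9.5},
    random_state=0,
)
wrf.fit(X, y)

pe_probability = wrf.predict_proba(X)[:, 1]  # predicted probability of PE
```

The weight of 9.5 roughly matches the ratio of controls to cases, so the two classes contribute about equally to the split criterion during training.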
To optimize the construction of decision trees, we utilized both the Gini index and entropy as measures. These metrics help assess the purity of data splits, allowing for the most effective selection of features that reduce uncertainty. This method improves the accuracy and generalizability of the decision trees on unseen data.
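For reference, the two impurity measures can be computed directly from class counts. The sketch below uses the standard textbook definitions (Gini impurity 1 − Σ p², Shannon entropy −Σ p log₂ p); the function names are illustrative, and the example node is simply the full cohort before any split.

```python
import math

def gini_impurity(class_counts):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def entropy(class_counts):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)), skipping empty classes."""
    total = sum(class_counts)
    proportions = [c / total for c in class_counts if c > 0]
    return sum(-p * math.log2(p) for p in proportions)

# Root node of the full cohort: 1200 PE cases and 11,499 controls.
root = [1200, 11499]
print(f"Gini: {gini_impurity(root):.4f}, entropy: {entropy(root):.4f} bits")
```

Both measures are zero for a pure node and largest for an evenly mixed one, so a split is preferred when it lowers the weighted impurity of the child nodes relative to the parent.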
Additionally, we employed grid search techniques to identify the optimal hyperparameter settings, systematically exploring various parameter combinations to find those that yield the best model performance. We also ranked the importance of various features within the predictive model, providing valuable insights for the clinical assessment of PE risk.
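A grid search followed by a feature-importance ranking might look like the sketch below. It assumes scikit-learn's `GridSearchCV` scored by ROC-AUC; the parameter grid, the number of folds, and the feature names are hypothetical choices for illustration, since the text does not report the grid that was searched, and the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical feature names; the data are synthetic, not study data.
feature_names = ["sbp_mean", "dbp_mean", "bmi", "age", "uric_acid", "creatinine"]
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3,
    weights=[0.9, 0.1], random_state=0,
)

# Illustrative grid: every combination is evaluated by cross-validation.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(
        n_estimators=50, class_weight={0: 1.0, 1: 9.5}, random_state=0
    ),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)
print("best parameters:", search.best_params_)

# Rank features by the impurity-based importance of the best model.
importances = search.best_estimator_.feature_importances_
ranking = sorted(zip(feature_names, importances), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances sum to 1 across features, so they describe relative contribution within the model and not an effect size.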
The entire model development and evaluation process was conducted in a Python 3.7 environment using the scikit-learn 1.0.1 library. The application of these models, combined with data preprocessing and optimization strategies, aims to provide medical professionals with a reliable tool for effectively identifying and managing high-risk pregnancies, thereby making strides in the early prevention of PE.
Model evaluation metrics
To comprehensively evaluate model performance, we used accuracy, sensitivity, specificity, and the area under the receiver operating characteristic (ROC) curve (AUC). Accuracy measures the proportion of correct predictions across all cases. Sensitivity is the proportion of PE cases that the model correctly identifies, while specificity is the proportion of non-PE cases that it correctly identifies. The AUC summarizes the model’s performance across all possible classification thresholds; a value closer to 1 indicates stronger discriminative ability.
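The definitions above can be made concrete with a small worked example. The sketch below computes the threshold-based metrics from confusion-matrix counts and the AUC as the probability that a randomly chosen PE case scores higher than a randomly chosen non-PE case (ties count one half). The labels, scores, and the 0.5 threshold are invented for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary labels (1 = PE)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented example: 3 PE cases and 5 controls with model scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # illustrative 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, f"AUC={auc:.3f}")
```

Accuracy, sensitivity, and specificity depend on the chosen threshold, whereas the AUC does not, which is why the AUC is the usual headline figure when comparing models on imbalanced data.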
Statistical analysis
Initially, all collected data underwent preliminary processing, which included descriptive statistical analysis for quantitative data (expressed as mean ± standard deviation or median with interquartile range) and percentage representation for categorical data. Quantitative differences between groups were assessed using either the Student’s t-test or the Kruskal–Wallis test, while differences in proportions of categorical data were evaluated using the chi-square test. Additionally, univariate binary logistic regression analysis was used to identify variables significantly associated with PE (P < 0.05); these variables were then assessed for independent predictive value using multivariable binary logistic regression analysis.
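As an illustration of the chi-square comparison of proportions, the sketch below applies an uncorrected Pearson chi-square test to the primiparity counts reported in Table 1 (752 of 1200 in the PE group, 6216 of 11,499 in the non-PE group). It is a hand-rolled calculation for a 2 × 2 table, not the SPSS procedure used in the study, and it omits any continuity correction; the helper name is illustrative.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value). With 1 degree of freedom, the upper-tail
    probability equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Primiparity: 752 of 1200 (PE group) versus 6216 of 11,499 (non-PE group).
pe_yes, pe_total = 752, 1200
ctrl_yes, ctrl_total = 6216, 11499
chi2, p = chi_square_2x2(pe_yes, pe_total - pe_yes, ctrl_yes, ctrl_total - ctrl_yes)
print(f"chi-square = {chi2:.2f}, P = {p:.2e}")
```

The resulting P value falls below 0.001, in line with the entry for primiparity in Table 1.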
All statistical analyses were conducted using SPSS software version 24.0. The development and evaluation of the traditional Logistic regression and the WRF models were performed in a Python 3.7 environment, utilizing the scikit-learn 1.0.1 library. The performance of these models was assessed based on metrics such as ROC-AUC, accuracy, sensitivity, specificity, precision, and concordance. Any model with an AUC value greater than 0.5 was considered predictive, and results with a P-value less than 0.05 were deemed statistically significant.
Research ethics and approval
All procedures and methods in this study on the predictive model for early-onset PE adhered to strict ethical guidelines to protect participant rights, and the corresponding ethical approval was obtained.
The research protocol was submitted for detailed evaluation to the Jiangxi Provincial Medical Ethics Committee and was approved on June 23, 2022 (Approval Number: C-KT-202221). This approval was a critical prerequisite for initiating the study, ensuring its legality and ethical compliance while safeguarding participants’ health and privacy.
Throughout the study, we strictly adhered to the standards and treatment policies defined in the “Hypertension in Pregnancy (2020) Clinical Guidelines,” ensuring that all medical practices conformed to current medical knowledge and guidelines. Admission decisions for PE patients were made in a tiered obstetric emergency area with the support of experienced clinical personnel, further ensuring that care was delivered professionally and correctly.
To ensure the accuracy and reliability of data, the research facility was equipped with a comprehensive electronic medical record registration system. All personnel involved in the study underwent standardized training, ensuring high standards and consistency in data collection, processing, and analysis. These measures not only helped to enhance the quality of the research but also safeguarded participant information and the accuracy of the research outcomes.
Results
Analysis of clinical markers and their importance in early detection of PE risk
In this study, we tracked and analyzed 12,699 pregnant women to develop an early-pregnancy predictive model for PE ( Fig. 2 ). Clinical markers obtained from the subjects during the first trimester are detailed in Table 1 . Of our study cohort, 1200 women (9.45 %) were diagnosed with PE during the follow-up period. This finding highlights the relative prevalence of PE among pregnant women and underscores the importance of early identification and prevention of this complication.

| Variable | Preeclampsia group (n = 1200) | Non-preeclampsia group (n = 11,499) | P value |
|---|---|---|---|
| Maternal age (years) | 30.09 ± 5.40 | 29.19 ± 4.64 | <0.001∗ |
| BMI before pregnancy (kg/m²) | 26.43 ± 4.20 | 22.99 ± 3.17 | <0.001∗ |
| Parity | 1.49 ± 0.72 | 1.68 ± 0.92 | <0.001∗ |
| Maternal history, n (%) | | | |
| Primiparity, n (%) | 752 (62.7) | 6216 (54.1) | <0.001 |
| Natural conception, n (%) | 1066 (88.8) | 10,936 (95.1) | <0.001 |
| Residence, n (%) | | | <0.001 |
| Rural area | 478 (39.8) | 1721 (15) | |
| City | 722 (60.2) | 9778 (85) | |
| Educational level, n (%) | | | <0.001 |
| Primary education or below | 39 (3.3) | 130 (1.1) | |
| Secondary education | 447 (37.3) | 1603 (13.9) | |
| University degree and above | 714 (59.4) | 9766 (85) | |
| Preconception health education, n (%) | 715 (59.6) | 9102 (79.2) | <0.001 |
| Hormone use, n (%) | 429 (35.8) | 1010 (8.8) | <0.001 |
| Hyperemesis gravidarum, n (%) | 40 (3.3) | 26 (0.2) | <0.001 |
| Smoking/Alcohol, n (%) | 21 (1.8) | 3 (0.0) | <0.001 |
| Folic acid supplementation in early pregnancy, n (%) | 1100 (91.7) | 11,497 (100) | <0.001 |
| Gestational weight gain, n (%) | | | 0.102 |
| ≤10 kg | 41 (3.4) | 509 (4.4) | |
| >10 kg | 1159 (96.6) | 10,990 (95.6) | |
| Diabetes, n (%) | 332 (27.7) | 1798 (15.6) | <0.001 |
| Thyroid disorder, n (%) | 93 (7.8) | 447 (3.9) | <0.001 |
| Family history of preeclampsia, n (%) | 157 (13.1) | 0 (0.0) | <0.001 |
| History of preeclampsia, n (%) | 137 (11.4) | 1 (0.0) | <0.001 |
| Chronic disease: Cardiovascular disease/nephrosis, n (%) | 162 (13.5) | 44 (0.4) | <0.001 |
| Autoimmune diseases, n (%) | 79 (6.6) | 11 (0.1) | <0.001 |
| APS, n (%) | 31 (2.6) | 25 (0.2) | <0.001 |
| Obstetric abnormality, n (%) | 143 (11.9) | 637 (5.5) | <0.001 |
| Anemia, n (%) | 454 (37.8) | 2346 (20.4) | <0.001 |
| BP (mmHg) | | | |
| SBP (m) | 134.76 ± 7.95 | 115.28 ± 12.95 | <0.001∗ |
| DBP (m) | 84.53 ± 7.36 | 71.56 ± 6.57 | <0.001∗ |
| SBP-max | 149.70 ± 12.94 | 124.23 ± 8.45 | <0.001∗ |
| DBP-max | 96.15 ± 8.97 | 77.87 ± 6.23 | <0.001∗ |
A comparison of clinical markers between the two groups of women (those with PE and those without) revealed that women who developed PE were typically older and more likely to be first-time mothers, suggesting that age and first pregnancy are potential risk factors for developing PE. Notably, although the gestational weight gain was similar between the two groups, women who developed PE had significantly higher BMI, mean systolic blood pressure (SBP(m)), mean diastolic blood pressure (DBP(m)), maximum systolic pressure (SBP-max), and maximum diastolic pressure (DBP-max) compared to those who did not develop PE. These findings highlight the potential role of blood pressure control in preventing the development of PE.
In addition to physiological parameters, our observations indicated that women who developed PE had a significantly higher prevalence of underlying conditions, such as cardiovascular diseases, renal disorders, diabetes, and autoimmune diseases, compared to those who did not develop PE. Furthermore, these women were more likely to have a family history of PE or a prior diagnosis of the condition in previous pregnancies. These findings underscore the critical importance of personal and family medical histories in assessing the risk of PE.
Regarding lifestyle habits and health behaviors, significant differences were noted between women with and without PE in aspects such as mode of conception, urban residency, educational level, hormone use, hyperemesis gravidarum, smoking/drinking habits, and early pregnancy folic acid supplementation. These results provide valuable insights for future preventive strategies and recommendations for prenatal care.
In summary, this study highlights the significance of various clinical markers in predicting the development of PE. By identifying these risk factors, we aim to furnish healthcare professionals with enhanced information for risk assessment and intervention in early pregnancy, ultimately aiming to reduce the incidence of PE and improve health outcomes for pregnant women.
Significant differences in biochemical markers for early prediction of PE
In this study, we compared a series of laboratory markers between pregnant women diagnosed with PE and those without to explore their potential roles in predicting the condition. The analysis revealed significant differences across multiple laboratory markers between the two groups, providing crucial insights into the risk factors for PE ( Table 2 ).
| Variable | Preeclampsia group (n = 1200) | Non-preeclampsia group (n = 11,499) | P value |
|---|---|---|---|
| Coagulation function indicators | | | |
| Prothrombin time (s) | 10.03 ± 0.66 | 10.20 ± 0.57 | <0.001∗ |
| Partial prothrombin time (s) | 27.25 ± 2.46 | 26.84 ± 2.37 | <0.001∗ |
| Hepatobiliary function index | | | |
| ALT (U/L) | 11 (8, 15) | 9 (7, 12) | <0.001 |
| AST (U/L) | 20 (16, 25) | 18 (15, 21) | <0.001 |
| Seroglobulin (g/L) | 27.65 ± 3.81 | 27.56 ± 3.73 | 0.395∗ |
| Total protein (g/L) | 60.15 ± 6.5 | 61.65 ± 5.82 | <0.001∗ |
| Total bilirubin (μmol/L) | 10.92 ± 3.74 | 11.63 ± 4.04 | <0.001∗ |
| Renal function index | | | |
| Serum creatinine (μmol/L) | 53 (45.25, 63) | 46 (40, 54) | <0.001 |
| BUN (mmol/L) | 3.8 (3.0, 4.7) | 3.3 (2.7, 4.0) | <0.001 |
| Uric acid (μmol/L) | 390 (321, 462.75) | 328 (280, 384) | <0.001 |
| Serum lipid parameters | | | |
| Total cholesterol (mmol/L) | 5.99 (5.2, 6.84) | 5.77 (5.03, 6.59) | <0.001 |
| Triglycerides (mmol/L) | 3.53 (2.69, 4.87) | 3.12 (2.42, 4.12) | <0.001 |
| LDL (mmol/L) | 3.25 (2.59, 3.98) | 3.26 (2.58, 3.99) | 0.774 |
| HDL (mmol/L) | 1.77 (1.45, 2.22) | 1.97 (1.63, 2.32) | <0.001 |
| Lipoprotein (a) (g/L) | 0.15 (0.07, 0.31) | 0.13 (0.06, 0.27) | 0.001 |
| ApoA1 (g/L) | 2.22 (1.88, 2.5) | 2.12 (1.76, 2.45) | <0.001 |
| Blood cell count | | | |
| Leukocytes (10⁹/L) | 9.71 (8.24, 11.94) | 9.53 (7.89, 11.61) | 0.001 |
| Neutrophils (10⁹/L) | 7.31 (5.94, 9.40) | 7.16 (5.73, 9.05) | 0.001 |
| Lymphocytes (10⁹/L) | 1.55 (1.25, 1.96) | 1.51 (1.22, 1.83) | <0.001 |
| Platelets (10⁹/L) | 200 (165, 241) | 198 (165, 235) | 0.141 |
| Hemoglobin (g/L) | 114 (103, 124) | 111 (102, 120) | <0.001 |
| Urine routine index, n (%) | | | |
| Urine protein, n (%) | 833 (69.4) | 5175 (45) | <0.001 |