Objective
The objective of the study was to measure the performance of a 5-tier, color-coded graded classification of electronic fetal monitoring (EFM).
Study Design
We used specialized software to analyze and categorize 7416 hours of EFM from term pregnancies. We measured how often and for how long each of the color-coded levels appeared in 3 groups of babies: (A) 60 babies with neonatal encephalopathy (NE) and umbilical artery base deficit (BD) levels greater than 12 mmol/L; (I) 280 babies without NE but with BD greater than 12 mmol/L; and (N) 2132 babies with normal gases.
Results
The frequency and duration of EFM abnormalities considered more severe in the classification method were highest in group A and lowest in group N. Detecting an equivalent percentage of cases with adverse outcomes required only minutes spent with marked EFM abnormalities compared with much longer periods with lesser abnormalities.
Conclusion
Both degree and duration of tracing abnormality are related to outcome. We present empirical data quantifying that relationship in a systematic fashion.
The general goal of electronic fetal monitoring (EFM) is to identify fetuses at increased risk of hypoxic injury so that intervention can avoid an adverse outcome without also causing excessive numbers of unnecessary interventions. Currently, clinicians visually assess EFM tracings to infer what physiological conditions are present and how well the fetus is compensating. Inconsistency in tracing evaluation is well recognized, especially when assessments are carried out over long periods of time or in the presence of fatigue or distractions.
Written definitions for graded classifications of tracings are designed to reduce assessment inconsistency and help clinicians communicate more precisely the level of abnormality and hence the urgency and nature of clinical interventions required. Several recent publications define graded classification schema, but none has reported how often the various levels occur in labors with normal or abnormal outcomes. The objective of this study was to measure the performance of a 5-level classification system of EFM in 3 groups of term babies, defined by functional and biochemical markers of perinatal abnormality.
Materials and Methods
We studied the last 3 hours of EFM traces from 2472 babies of more than 35 weeks’ gestation who were born without apparent congenital malformations or inborn errors of metabolism. All babies had umbilical artery blood gas measurements and tracings that met preestablished criteria for recording quality.
The abnormal (A) or index group included 60 babies with umbilical artery base deficit levels greater than 12 mmol/L who developed neurological signs of encephalopathy in the early neonatal period. Encephalopathy was defined clinically and required the presence of at least 2 of the following criteria lasting longer than 24 hours: altered level of consciousness, hypotonia, feeding difficulty of central origin, or respiratory difficulty of central origin.
The very low natural incidence of this condition necessitated collecting cases from a number of hospital series (n = 40) and medicolegal files (n = 20).
The reference groups were convenience samples of consecutive vaginal births from an urban university teaching hospital, which used EFM extensively and routinely measured umbilical artery blood gases at birth. The normal (N) group included 2132 babies with normal umbilical artery gases and no neurological signs of encephalopathy. The intermediate (I) group comprised 280 babies with base deficit values greater than 12 mmol/L but without encephalopathy. This base deficit level corresponds to the second percentile and hence is a reasonable way to define a group with increased risk of evolution to a more adverse state. All data were provided in compliance with institutional regulations.
The digital versions of the 7416 hours of tracings were analyzed using CALM Patterns (LMS Medical Systems, Montreal, Canada with PeriGen, Princeton, NJ), software that identifies and measures fetal heart rate (FHR) baseline, baseline variability, accelerations, and decelerations according to the National Institute of Child Health and Human Development (NICHD) definitions. The software uses a variety of proprietary digital signal processing techniques and pattern recognition algorithms to identify and measure the FHR features.
In brief, baseline was determined by a polynomial best-fit line in flat portions of the FHR that excluded accelerations and decelerations. Variability was the range enclosed by 2 SD of FHR values around the baseline segments. Internal neural networks classified deceleration segments as either gradual or variable decelerations using a number of criteria including slope of the onset and end of the deceleration, deceleration depth, and duration as well as ambient variability and baseline levels. Gradual decelerations were further classified as late or early, depending on their timing in relation to associated contractions.
No national standardized set of marked EFM tracings existed to test the performance of such computer programs. Performance was measured previously by comparing the computer-generated marking on approximately 50 hours of tracings with the independent markings of 5 obstetricians instructed to use the NICHD definitions. The performance of the computerized pattern recognition software was comparable with that of the clinicians.
Comparing the computerized method and the majority opinion of the clinicians, sensitivity for decelerations was 92% and the proportion of agreement was 73%. The correlation coefficient for baseline was 0.96. In comparison, published reports measuring agreement between clinicians generally describe lower proportions of agreement that range between 27% and 60%.
The EFM classification system of Parer and Ikeda first graded decelerations according to their size and persistence and then defined 5 color-coded categories based on 134 different possible combinations of decelerations with the classical gradations of variability and baseline. A summary of the combinations of features defining the various colors is outlined in Table 1. Segments that were within normal limits for all features were coded as “green.” Progressively more abnormal combinations of features result in a tracing classification of “blue,” “yellow,” “orange,” or “red.” We created software to categorize the tracings every 2 minutes according to these definitions.
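Variability and baseline | No decelerations | Early decelerations | Recurrent variable, mild | Recurrent variable, moderate | Recurrent variable, severe | Recurrent late, mild | Recurrent late, moderate | Recurrent late, severe | Prolonged, mild | Prolonged, moderate | Prolonged, severe |
---|---|---|---|---|---|---|---|---|---|---|---|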
MODERATE (NORMAL) VARIABILITY | |||||||||||
Baseline | |||||||||||
Tachycardia | B | B | B | Y | O | Y | Y | O | Y | Y | O |
Normal | G | G | G | B | Y | B | Y | Y | Y | Y | O |
Mild bradycardia | Y | Y | Y | Y | O | Y | Y | O | Y | Y | O |
Moderate bradycardia | Y | Y | O | O | O | O | |||||
Severe bradycardia | O | O | O | O | O | ||||||
MINIMAL VARIABILITY | |||||||||||
Baseline | |||||||||||
Tachycardia | B | Y | Y | O | O | O | O | R | O | O | O |
Normal | B | O | Y | O | O | O | O | R | O | O | R |
Mild bradycardia | O | O | R | R | R | R | R | R | R | R | R |
Moderate bradycardia | O | O | R | R | R | R | |||||
Severe bradycardia | R | R | R | R | R | ||||||
ABSENT VARIABILITY | |||||||||||
Baseline | |||||||||||
Tachycardia | R | R | R | R | R | R | R | R | R | R | R |
Normal | O | R | R | R | R | R | R | R | R | R | R |
Mild bradycardia | R | R | R | R | R | R | R | R | R | R | R |
Moderate bradycardia | R | R | R | R | R | R | |||||
Severe bradycardia | R | R | R | R | R | ||||||
Sinusoidal | R | ||||||||||
Marked variability | Y |
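The lookup implied by Table 1 can be illustrated with a short sketch. This is not the software used in the study: it transcribes only the rows of Table 1 that have all 11 cells filled (tachycardia, normal baseline, and mild bradycardia), and the names `DECEL_COLUMNS`, `ROWS`, and `classify_segment` are hypothetical.

```python
# Illustrative sketch only; names and structure are assumptions, not the study's code.

# Column order of Table 1 (deceleration type and grade).
DECEL_COLUMNS = [
    "none", "early",
    "variable_mild", "variable_moderate", "variable_severe",
    "late_mild", "late_moderate", "late_severe",
    "prolonged_mild", "prolonged_moderate", "prolonged_severe",
]

# Partial transcription of Table 1: only rows with all 11 cells present.
# Keys are (variability, baseline); each string lists color letters in column order.
ROWS = {
    ("moderate", "tachycardia"): "BBBYOYYOYYO",
    ("moderate", "normal"): "GGGBYBYYYYO",
    ("moderate", "mild_bradycardia"): "YYYYOYYOYYO",
    ("minimal", "tachycardia"): "BYYOOOOROOO",
    ("minimal", "normal"): "BOYOOOOROOR",
    ("minimal", "mild_bradycardia"): "OORRRRRRRRR",
    ("absent", "tachycardia"): "RRRRRRRRRRR",
    ("absent", "normal"): "ORRRRRRRRRR",
    ("absent", "mild_bradycardia"): "RRRRRRRRRRR",
}

COLOR_NAMES = {"G": "green", "B": "blue", "Y": "yellow", "O": "orange", "R": "red"}


def classify_segment(variability, baseline, deceleration):
    """Return the color for one 2-minute segment, or None when the
    combination is not covered by this partial table."""
    row = ROWS.get((variability, baseline))
    if row is None or deceleration not in DECEL_COLUMNS:
        return None
    return COLOR_NAMES[row[DECEL_COLUMNS.index(deceleration)]]
```

For example, `classify_segment("moderate", "normal", "none")` returns `"green"`, matching the first cell of the normal-baseline row under moderate variability.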
Some of the FHR conditions in the classification by Parer and Ikeda required interpretation. For example, in mathematical terms, absent baseline variability (0 beats/min) would be equivalent to a perfectly flat line, which does not exist in living biological entities. Thus, for this exercise, absent variability was defined as a measured variability of less than 2 beats/min and minimal variability was defined to be between 2 and 5 beats/min.
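These working cutoffs can be written as a small function. The sketch below assumes that "between 2 and 5 beats/min" is inclusive at both ends and borrows the NICHD ranges for moderate (6 to 25 beats/min) and marked (greater than 25 beats/min) variability, which are not restated in this section; values between 5 and 6 beats/min are assigned to moderate as a further assumption. The function name is hypothetical.

```python
def classify_variability(range_bpm):
    """Map a measured baseline variability (beats/min) to a category.

    Absent and minimal cutoffs follow the working definitions in the text;
    the 25 beats/min boundary is taken from the NICHD definitions (assumption).
    """
    if range_bpm < 2:
        return "absent"
    if range_bpm <= 5:
        return "minimal"
    if range_bpm <= 25:
        return "moderate"
    return "marked"
```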
Late and variable decelerations were defined as recurrent if there were at least 2 decelerations in the preceding 20 minutes and at least 50% of the contractions in that interval were associated with a deceleration.
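A minimal sketch of this rule follows, assuming the counts for the 20-minute window have already been extracted. The function and parameter names are hypothetical, and treating a window with no contractions as not recurrent is an assumption the text does not address.

```python
def is_recurrent(n_decelerations, n_contractions, n_contractions_with_decel):
    """Apply the recurrence rule to counts from one 20-minute window."""
    if n_decelerations < 2:
        return False
    if n_contractions == 0:
        # Assumption: the 50% criterion cannot be met without contractions.
        return False
    return n_contractions_with_decel / n_contractions >= 0.5
```

With 3 decelerations and 4 of 8 contractions associated with a deceleration, the rule is met; with only 3 of 8, it is not.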
If variable decelerations were recurrent, they were said to be severe if any 1 of the variable decelerations in the 20-minute period was longer than 1 minute and went down to 70 beats/min or if it was longer than 2 minutes and went down to 80 beats/min. Recurrent variable decelerations were said to be moderate if any 1 of the variable decelerations was longer than 30 seconds and went down to 70 beats/min or was longer than 1 minute and went down to 80 beats/min.
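The grading of recurrent variable decelerations can be sketched as follows. The sketch assumes that "went down to 70 beats/min" means a nadir at or below 70 beats/min, that "mild" is the residual grade, and that the input holds only the variable decelerations from a window already judged recurrent. Each deceleration is represented as a hypothetical `(duration_seconds, nadir_bpm)` pair.

```python
def _meets(decel, min_duration_s, nadir_limit_bpm):
    """True if the deceleration lasted longer than min_duration_s and its
    nadir reached nadir_limit_bpm or lower (interpretation is an assumption)."""
    duration_s, nadir_bpm = decel
    return duration_s > min_duration_s and nadir_bpm <= nadir_limit_bpm


def grade_recurrent_variable(decels):
    """Grade a 20-minute window of recurrent variable decelerations."""
    if any(_meets(d, 60, 70) or _meets(d, 120, 80) for d in decels):
        return "severe"
    if any(_meets(d, 30, 70) or _meets(d, 60, 80) for d in decels):
        return "moderate"
    return "mild"
```

A single deceleration lasting 70 seconds with a nadir of 65 beats/min is enough to grade the window as severe, because the rule applies if any 1 deceleration qualifies.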
We defined mild baseline bradycardia to be between 100 and 110 beats/min, moderate bradycardia between 90 and 100 beats/min, and severe bradycardia less than 90 beats/min. Since completing this work, we have learned that our interpretations of these levels were higher than those intended by the authors of the framework.
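As an illustration, the baseline grades used in this study can be expressed as a simple threshold function. How the shared boundary values (90 and 100 beats/min) are assigned is not stated, so the sketch assigns each to the milder grade as an assumption; the normal range of 110 to 160 beats/min and tachycardia above 160 beats/min follow the NICHD definitions rather than this section. The function name is hypothetical.

```python
def classify_baseline(baseline_bpm):
    """Grade an FHR baseline (beats/min) using this study's bradycardia cutoffs."""
    if baseline_bpm < 90:
        return "severe_bradycardia"
    if baseline_bpm < 100:
        return "moderate_bradycardia"
    if baseline_bpm < 110:
        return "mild_bradycardia"
    if baseline_bpm <= 160:  # NICHD normal range (assumption, not from this section)
        return "normal"
    return "tachycardia"
```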
We measured how long each tracing spent in every color over the course of the 3 hours. Thus, it was possible to measure how many tracings in each group attained any of the 5 color-coded levels as well as how long they spent in or above each level.
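A minimal sketch of this bookkeeping is shown below, assuming each tracing is available as a list of color labels, one per 2-minute segment. The function names and the example values are hypothetical.

```python
from collections import Counter

# Levels ordered from least to most abnormal.
LEVELS = ["green", "blue", "yellow", "orange", "red"]
EPOCH_MIN = 2  # the tracings were categorized every 2 minutes


def minutes_in_each(colors):
    """Minutes spent in each color for one tracing."""
    counts = Counter(colors)
    return {level: counts[level] * EPOCH_MIN for level in LEVELS}


def minutes_at_or_above(colors):
    """Minutes spent at or above each color (cumulative from red downward)."""
    per_level = minutes_in_each(colors)
    result = {}
    running = 0
    for level in reversed(LEVELS):
        running += per_level[level]
        result[level] = running
    return result
```

For a 3-hour tracing with 80 green, 7 yellow, 2 orange, and 1 red segments, `minutes_at_or_above` gives 2 minutes at red, 6 minutes at or above orange, and 20 minutes at or above yellow.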
Receiver-operating characteristic (ROC) curves are a standard way of examining how well a test can identify members from the group with the adverse outcome (sensitivity) and the associated rate of incorrectly detecting members in the normal or control group (false-positive rate) as one varies a component of the test. This technique was useful to examine the effect of increasing duration of tracing abnormality on the associated sensitivity and false-positive rates.
We constructed ROC curves by varying the amount of time spent at or above each of the colors. For example, each point along the red curve in the collection of ROC curves in the Figure represents a certain duration of time spent in red. The corresponding location along the vertical axis shows how often these conditions were found among the group A cases (sensitivity or true-positive rate), and the location along the horizontal axis shows how often these same conditions were found in group N (false-positive rate).
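The construction of one such curve can be sketched as follows. The sketch assumes a tracing counts as test positive when its time at or above the chosen color meets or exceeds the duration threshold; the function name is hypothetical and the numbers in the usage note are made up for illustration, not study data.

```python
def roc_points(minutes_group_a, minutes_group_n, thresholds):
    """Return (threshold, false_positive_rate, sensitivity) for each duration threshold.

    minutes_group_a: minutes at or above a color for each group A tracing.
    minutes_group_n: the same measure for each group N tracing.
    """
    points = []
    for threshold in thresholds:
        # Sensitivity: fraction of group A tracings meeting the threshold.
        sensitivity = sum(m >= threshold for m in minutes_group_a) / len(minutes_group_a)
        # False-positive rate: fraction of group N tracings meeting the threshold.
        false_positive_rate = sum(m >= threshold for m in minutes_group_n) / len(minutes_group_n)
        points.append((threshold, false_positive_rate, sensitivity))
    return points
```

With made-up durations `[0, 4, 10, 20]` for group A and `[0, 0, 2, 6]` for group N, a threshold of 6 minutes yields a sensitivity of 0.5 at a false-positive rate of 0.25. Raising the threshold moves the point toward the lower left of the curve.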