Objective
Our purpose was to test the reliability of the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) 3-Tier Fetal Heart Rate (FHR) classification system.
Study Design
Individual 15- to 20-minute FHR segments (n = 154) were independently reviewed without clinical data by 3 maternal-fetal medicine examiners and classified by NICHD category (I, II, III).
Results
Interobserver reliability was moderate (kappa 0.45) and varied by NICHD category (category I moderate [kappa 0.48], category II moderate [kappa 0.44], and category III poor [kappa 0.0]). The intraobserver agreement ranged from substantial to perfect (kappa 0.74-1.0).
Conclusion
Interobserver agreement of the 3-Tier FHR classification system was moderate for NICHD categories I and II. Agreement for category III tracings was poor, mainly because of a lack of agreement regarding absent vs minimal variability.
In April 2008, the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD), American College of Obstetricians and Gynecologists, and Society for Maternal-Fetal Medicine cosponsored a workshop on electronic fetal monitoring (EFM) that recommended a new 3-Tier Fetal Heart Rate (FHR) classification system for use in the United States. One purpose for creating a new classification system was the potential ability to develop “evidence-based clinical management strategies of intrapartum fetal compromise.”
Another potential benefit of a 3-Tier classification system may be improved agreement of FHR interpretation between different observers. Given that the new FHR classification system (eg, the number of categories and the FHR patterns to be included in each category) was largely based on consensus opinion of the members attending the workshop, this system is untested with respect to its reliability, validity, and effectiveness. The purpose of our study was to assess the interobserver and intraobserver reliability of the NICHD 3-Tier FHR classification system. We hypothesized that the reliability of the 3-Tier system would be greatest with more “normal” and “very abnormal” FHR patterns.
Materials and Methods
A computerized perinatal database was used to identify women who delivered at ≥37 weeks 0 days’ gestation from Jan. 1, 2008 through Dec. 31, 2009 at an institution that performed umbilical cord blood gas analysis on all deliveries. Cases with planned cesarean delivery or absence of umbilical artery (UA) blood gas results were excluded. Chart review was performed to obtain relevant clinical and outcome data. FHR tracings were examined (by one of the investigators not involved in FHR interpretation) to ensure there was adequate tracing for review and that the time from last EFM period to delivery was <30 minutes. To evaluate a broad range of FHR patterns, we selected cases from 3 groups of subjects based on UA pH at delivery (UA pH >7.10, 7.00-7.10, and <7.00 with base excess <–12 mEq/L). Based on the sample sizes of prior studies in the literature, as well as the time and effort required to select and deidentify FHR tracing segments, we chose to evaluate 120 FHR tracing segments from 40 women (n = 15 with UA pH >7.10, n = 15 with UA pH 7.00-7.10, and n = 10 with UA pH <7.00 with base excess <–12 mEq/L).
FHR tracings were printed from an archived EFM system onto 11 × 8-in paper. Three FHR segments were selected for each case; each segment represented a 15- to 20-minute epoch. One segment was chosen from the last 60 minutes prior to birth and the other 2 segments were randomly selected from the last 180 minutes prior to birth (no segments overlapped). Approximately 1 in 4 FHR segments was randomly chosen for duplication to assess intraobserver reliability. FHR segments (total n = 154; n = 120 original, n = 34 duplicate) were deidentified, placed in random order, and given in bulk at one time to each reviewer.
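The assembly of the review set described above can be sketched in Python. This is only an illustration under stated assumptions: the study does not report how randomization was performed, and the function name `build_review_packet`, the seed, and the use of sequential review IDs for deidentification are hypothetical.

```python
import random


def build_review_packet(n_cases=40, segments_per_case=3, n_duplicates=34, seed=0):
    """Return a key mapping a blinded review ID to its (case, segment) source.

    Hypothetical sketch: 120 original segments plus 34 randomly chosen
    duplicates, shuffled into a single random order of 154 items.
    """
    rng = random.Random(seed)  # fixed seed so the packet is reproducible

    # 40 cases x 3 segments = 120 original segments.
    originals = [
        (case, segment)
        for case in range(1, n_cases + 1)
        for segment in range(1, segments_per_case + 1)
    ]

    # Choose segments to duplicate, without replacement, so that each
    # duplicated segment appears exactly twice in the packet.
    duplicates = rng.sample(originals, n_duplicates)

    packet = originals + duplicates
    rng.shuffle(packet)  # random presentation order

    # Reviewers would see only the review ID; the key stays with the
    # investigator who is not involved in FHR interpretation.
    return {review_id: source for review_id, source in enumerate(packet, start=1)}


key = build_review_packet()
print(len(key))                # total items handed to each reviewer
print(len(set(key.values())))  # number of unique segments
```

Keeping the key separate from the printed segments is what allows duplicate pairs to be matched afterward for the intraobserver analysis.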
An interactive training session was performed prior to FHR tracing review. The EFM workshop summary was discussed and nonstudy “training” FHR tracings were collaboratively examined to achieve consensus on definitions and criteria for each category. Three maternal-fetal medicine (MFM) board-certified practitioners who participated in the NICHD EFM workshop (S.C.B., W.G., C.G.B.) independently reviewed the FHR tracings without clinical information. The 3 MFM reviewers had completed fellowship training within the last 5-10 years and as part of their routine clinical practice reviewed FHR tracings while caring for intrapartum patients. Laminated cards of the NICHD classification system were used during review. A structured data collection instrument was utilized that included assessment of NICHD category (I, II, III), FHR baseline, presence/absence of accelerations, decelerations (early, variable, late), and FHR variability (absent, minimal, moderate, or marked). FHR variability was assessed by visual interpretation. Examiners could also describe an FHR tracing as “uninterpretable.”
Statistical analysis was performed with SPSS version 19.0 (SPSS, Inc, Chicago, IL). Cohen kappa was used to assess interobserver and intraobserver reliability (ie, to assess level of agreement beyond chance). Predefined criteria for agreement were used: kappa 0.00-0.20 (poor), 0.21-0.40 (fair), 0.41-0.60 (moderate), 0.61-0.80 (substantial), and 0.81-1.00 (almost perfect). A P value < .05 was considered significant. This study was submitted to our local institutional review board for review and was determined to qualify for exempt status.
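To illustrate the statistic, the following Python sketch computes Cohen kappa for 2 raters and maps the result to the predefined agreement bands. It is not the SPSS procedure used in the study. The function names and the example ratings are hypothetical, and labeling negative kappa values as "poor" is an assumption, since the bands start at 0.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen kappa for two raters who classified the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)

    # Observed agreement: fraction of items given the same category.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of each rater's marginal proportions,
    # summed over categories (Counter returns 0 for unused categories).
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category; kappa is formally
        # undefined here, and returning 1.0 is a convention of this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def agreement_label(kappa):
    """Map kappa to the bands used in this study (negative -> poor, assumed)."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical ratings of 6 tracings by 2 reviewers (not study data).
reviewer_a = ["I", "II", "II", "III", "I", "II"]
reviewer_b = ["I", "II", "I", "II", "I", "II"]
kappa = cohen_kappa(reviewer_a, reviewer_b)
print(round(kappa, 3), agreement_label(kappa))
```

In this made-up example the reviewers agree on 4 of 6 tracings (observed agreement 0.667), chance agreement is 15/36 (0.417), and kappa is therefore 3/7, which is about 0.43 and falls in the moderate band.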
Results
A total of 154 FHR tracings from 40 subjects were independently examined by the 3 examiners. The Figure describes the classification of the 120 unique FHR tracings by NICHD category by the 3 examiners. Overall, of the 360 individual classifications (120 unique tracings × 3 examiners), 28.3% (n = 102) were category I, 62.2% (n = 224) were category II, and 1.9% (n = 7) were category III. Twenty-seven (7.5%) were described as “uninterpretable.” There were no cases where one examiner classified an FHR tracing as category I and another examiner classified it as category III. Thus, when disagreement between examiners did occur, it involved adjacent categories. All 3 examiners agreed on the NICHD category in 56.7% of cases (n = 68 of 120). The distribution of unanimous agreement mirrored the frequency of FHR tracings in each category (category I, n = 15; category II, n = 49; category III, n = 4).
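The reported proportions can be checked with simple arithmetic. The sketch below assumes that the denominator is the 360 individual classifications (120 unique tracings × 3 examiners); this is inferred from the counts rather than stated by the authors, and the variable names are illustrative.

```python
# Counts as reported in the Results; the denominator is an inference.
n_unique_tracings = 120
n_examiners = 3
total_classifications = n_unique_tracings * n_examiners

classification_counts = {
    "category I": 102,
    "category II": 224,
    "category III": 7,
    "uninterpretable": 27,
}

# The four counts should account for every individual classification.
assert sum(classification_counts.values()) == total_classifications

percentages = {
    label: round(100 * count / total_classifications, 1)
    for label, count in classification_counts.items()
}
for label, pct in percentages.items():
    print(f"{label}: {pct}%")

# Tracings on which all 3 examiners chose the same category.
unanimous_by_category = {"category I": 15, "category II": 49, "category III": 4}
unanimous_total = sum(unanimous_by_category.values())
print("unanimous tracings:", unanimous_total)
```

The four counts sum to 360, which supports reading the category percentages as shares of individual classifications rather than of tracings.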
Reviewer 1 had moderate agreement with reviewer 2 (kappa 0.59; 95% confidence interval [CI], 0.44–0.73; P < .001) and fair agreement with reviewer 3 (kappa 0.38; 95% CI, 0.23–0.53; P < .001), while reviewer 2 had only fair agreement with reviewer 3 (kappa 0.39; 95% CI, 0.22–0.56; P < .001).
The overall interobserver agreement was moderate (kappa 0.45). Table 1 describes the reliability for each individual NICHD category. Agreement was moderate for both categories I and II but poor for category III. The low kappa for category III was based on disagreement between examiners regarding absent vs minimal variability. We further analyzed overall interobserver agreement of the NICHD classification system by umbilical pH at delivery: umbilical pH <7.0 cases had kappa 0.13 (poor), umbilical pH 7.0-7.10 cases had kappa 0.39 (fair), and umbilical pH >7.10 cases had kappa 0.41 (moderate).
Variable | Kappa | Agreement |
---|---|---|
Category I | 0.48 | Moderate |
Category II | 0.45 | Moderate |
Category III | 0.00 | Poor |
Overall | 0.45 | Moderate |
There were 20 FHR tracings that were evaluated for intraobserver agreement. Reliability ranged from substantial to perfect (reviewer 1 kappa = 0.74, reviewer 2 kappa = 0.81, reviewer 3 kappa = 1.0; P < .001 for all reviewers). Interobserver kappa for the individual FHR parameters was as follows: accelerations (0.58), early decelerations (0.49), variable decelerations (0.48), late decelerations (0.54), and variability (absent [0.16], minimal [0.53], moderate [0.69]).
Comment
Our findings indicate that the NICHD 3-Tier FHR classification system has moderate agreement between “expert” reviewers under study conditions. Because the classification system was published only recently (2008), we are not aware of any prior studies that have evaluated its reliability. We hypothesized that agreement would be better with very “normal” and “abnormal” tracings, which was partially true (better agreement for category I tracings). Despite the poor agreement regarding category III tracings, we speculate that in actual practice this may not have an adverse impact on patient care. Evaluation of these cases indicates there is reasonable agreement on most parameters except FHR variability. Since the disagreement of classification was based on whether the variability was considered absent or minimal, these FHR tracings were still categorized as category II, and we would expect them to be interpreted as sufficiently concerning for representing a fetus with (or at risk for) metabolic acidemia such that similar clinical management should occur. However, this hypothesis remains untested.
Multiple prior studies evaluating the reliability of FHR interpretation have been conducted over the past 30 years (Table 2). The literature is difficult to summarize because some studies report the level of agreement, while others use the kappa statistic. However, most studies have shown suboptimal agreement regarding presence/absence of individual FHR parameters as well as FHR pattern recognition. These findings are consistent regardless of the country of origin of the study, are present in more recent studies (<10 years), and persist regardless of level of expertise/experience of examiners. Thus, the reliability noted in this study for the new NICHD 3-Tier FHR classification appears at least as good as, if not better than, that of prior studies and methods of FHR categorization.