Comparison of experts and computer analysis in fetal heart rate interpretation: we need to agree on what agreement is




We read with interest the paper by Parer and Hamilton evaluating interobserver agreement between experts and a computer system in fetal heart rate (FHR) tracing interpretation using a 5-level classification method. We would like to share with the authors some concerns regarding the statistical methods used to estimate agreement, bearing in mind the impact that disagreement may have on medical research.


The term “proportions of agreement” (PA) seems to have been used to describe simple percentages of agreement among observers, whereas the method described in most statistical textbooks and papers to calculate the PA is referred to as the “average percentage of exact agreement.” The generalization of the PA for more than 2 observers is the number of agreements obtained divided by the number of possible agreements. Thus, we believe that the PA obtained in this study should be between 0.403 and 0.486 (Table 3), and not between 0.73 and 0.93, as reported in Table 5. Agreement with the majority opinion may be an acceptable standard, but it is substantially different from total agreement and may conceal substantial disagreement. Suppose, for example, that clinicians A, B, and C agree that a certain FHR segment should be classified as green, whereas clinicians D and E consider that the same segment should be classified as yellow and orange, respectively. Clinicians A, B, and C would agree with the majority, whereas clinicians D and E would not. There would be 3 agreements with the majority, out of 5 possible agreements (0.60). However, the PA would only be 0.30: 3 agreements (A with B, A with C, and B with C) and 7 disagreements (A with D, A with E, B with D, B with E, C with D, C with E, and D with E), out of 10 possible pairs.
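To make the distinction concrete, here is a minimal Python sketch of the worked example above; the labels and counts are exactly those of the hypothetical clinicians A through E.

```python
from itertools import combinations
from collections import Counter

# Hypothetical ratings from the example above: clinicians A, B, and C say green,
# D says yellow, and E says orange.
ratings = ["green", "green", "green", "yellow", "orange"]

# Proportion of agreement (PA): agreeing pairs divided by all possible pairs of observers.
pairs = list(combinations(ratings, 2))
pa = sum(a == b for a, b in pairs) / len(pairs)          # 3 / 10 = 0.30

# Agreement with the majority opinion: observers matching the modal category.
majority = Counter(ratings).most_common(1)[0][0]
majority_agreement = sum(r == majority for r in ratings) / len(ratings)  # 3 / 5 = 0.60

print(f"pairwise PA:             {pa:.2f}")
print(f"agreement with majority: {majority_agreement:.2f}")
```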


The 95% confidence intervals (CIs) of the averages of agreement presented in Tables 3-6 appear to have been estimated as 95% CIs of simple averages rather than as 95% CIs of averages of proportions, which can also substantially alter the results.
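As a rough illustration of why the two constructions differ, the following sketch contrasts a binomial (Wald) interval for a proportion with the usual interval for a mean of a few values; all numbers in it are hypothetical and are not taken from the study.

```python
import math

# Hypothetical numbers, for illustration only: an observed proportion of agreement
# and the number of pairwise comparisons it is based on.
p, n_comparisons = 0.45, 200

# 95% CI treating the agreement as a binomial proportion (Wald interval).
se_prop = math.sqrt(p * (1 - p) / n_comparisons)
ci_prop = (p - 1.96 * se_prop, p + 1.96 * se_prop)

# 95% CI obtained by treating a few per-segment averages as plain numbers
# (normal approximation used for brevity; the values below are made up).
values = [0.40, 0.42, 0.45, 0.48, 0.50]
mean = sum(values) / len(values)
sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
ci_mean = (mean - 1.96 * sd / math.sqrt(len(values)),
           mean + 1.96 * sd / math.sqrt(len(values)))

print(f"CI as a proportion:     {ci_prop[0]:.3f} to {ci_prop[1]:.3f}")
print(f"CI as a simple average: {ci_mean[0]:.3f} to {ci_mean[1]:.3f}")
```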


Estimation of the weighted kappa is important to give a picture of the degree of agreement beyond chance in adjacent categories, but it needs to be interpreted with caution, as it can be influenced by particular aspects of class attribution. For instance, a systematic classification of tracings as yellow by 1 of the observers, together with a low prevalence of green and red evaluations, will yield very high agreement results. This measure needs to be provided together with the simple kappa, so that the reader can estimate the influence of adjacent classifications, an aspect that could by itself explain the higher levels of agreement obtained in this study compared with previous reports.
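A small sketch with invented ratings, using scikit-learn's cohen_kappa_score (one possible implementation; any weighted-kappa routine would serve), shows how reporting both statistics side by side exposes the contribution of adjacent-category agreement.

```python
from sklearn.metrics import cohen_kappa_score

# Clinically ordered 5-level scale; passing `labels` preserves this order,
# which the weighted kappa depends on.
levels = ["green", "blue", "yellow", "orange", "red"]

# Hypothetical ratings for 10 segments: observer 2 favors yellow, and every
# disagreement falls in a category adjacent to observer 1's choice.
obs1 = ["green", "blue", "blue", "yellow", "yellow",
        "yellow", "yellow", "orange", "orange", "red"]
obs2 = ["blue", "blue", "yellow", "yellow", "yellow",
        "yellow", "orange", "orange", "yellow", "orange"]

# Simple (unweighted) kappa treats every disagreement equally; linearly weighted
# kappa discounts adjacent-category disagreements, so it comes out higher here.
print("simple kappa:  ", round(cohen_kappa_score(obs1, obs2, labels=levels), 2))
print("weighted kappa:", round(cohen_kappa_score(obs1, obs2, labels=levels, weights="linear"), 2))
```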


Finally, much of the clinical relevance of this study relates to the possible advantages of using a 5-level classification system for FHR interpretation, as compared with the more widely used 3-level alternatives. A larger number of categories usually leads to a less reproducible evaluation, and this also appears to be the case with this classification system. Assessment of agreement within each classification category, as shown in Table 6, yields a low PA for all classes except green. Accepting the interpretation of PA values proposed by its developers, if the lower limit of the 95% CI is below 0.50, there is an insufficient level of agreement and the method needs to be improved or abandoned.
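As an illustration of how that criterion could be applied, the sketch below computes the lower limit of a 95% Wilson interval for hypothetical per-category PA values and checks it against the 0.50 threshold; the figures are invented and are not those of Table 6.

```python
import math

def lower_wilson_bound(p, n, z=1.96):
    """Lower limit of the 95% Wilson confidence interval for a proportion."""
    center = (p + z * z / (2 * n)) / (1 + z * z / n)
    half_width = (z / (1 + z * z / n)) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half_width

# Hypothetical per-category PA values and numbers of comparisons (made up).
for label, p, n in [("green", 0.80, 150), ("yellow", 0.45, 150), ("orange", 0.30, 80)]:
    verdict = "sufficient" if lower_wilson_bound(p, n) >= 0.50 else "insufficient"
    print(f"{label}: lower 95% bound = {lower_wilson_bound(p, n):.2f} -> {verdict}")
```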
