2023
Cohort Study

Cytologic scoring of equine exercise-induced pulmonary hemorrhage: Performance of human experts and a deep learning-based algorithm.

Authors: Bertram Christof A, Marzahl Christian, Bartel Alexander, Stayt Jason, Bonsembiante Federico, Beeler-Marfisi Janet, Barton Ann K, Brocca Ginevra, Gelain Maria E, Gläsel Agnes, Preez Kelly du, Weiler Kristina, Weissenbacher-Lang Christiane, Breininger Katharina, Aubreville Marc, Maier Andreas, Klopfleisch Robert, Hill Jenny

Journal: Veterinary pathology

Summary

Exercise-induced pulmonary haemorrhage (EIPH) remains a significant concern in sport horses. It is typically diagnosed by scoring haemosiderin-laden macrophages in bronchoalveolar lavage fluid (BALF) samples using the total haemosiderin score (THS), but clinicians have long suspected that subjective scoring makes the diagnosis inconsistent. The researchers analysed cytological specimens from 52 equine cases. Ten pathologists independently assigned a THS to each case, and their scores were compared against a ground truth dataset (derived from standardised grading criteria) and against a deep learning-based algorithm trained on the same data. Human observers showed poor reproducibility: interobserver variation was significant and was driven mainly by inconsistent grading of haemosiderin content in individual macrophages, although 87.7% of this variance could be eliminated through standardised grading. The algorithm outperformed the human experts, achieving 92.3% diagnostic accuracy for EIPH classification (THS ≥75 versus <75) compared with 75.7% for the pathologists, while maintaining equivalent correlation with direct chemical iron measurements. These findings suggest that incorporating algorithmic scoring into routine EIPH diagnosis would improve reliability for clinical decision-making. The authors recommend treating human-derived THSs between 40 and 110 as diagnostically uncertain pending algorithmic or repeat assessment.


Practical Takeaways

  • Human scoring of BALF cytology for EIPH showed poor reproducibility; human-derived THSs between 40 and 110 should be interpreted with caution and treated as diagnostically uncertain
  • Deep learning-based scoring offered better diagnostic accuracy (92.3% vs 75.7%) and more consistent grading than human experts, and should be considered for routine clinical use
  • Standardised grading criteria and algorithmic support can substantially improve the diagnostic reliability of haemosiderin scoring in respiratory cases

Key Findings

  • Human annotators showed significant interobserver variability in total haemosiderin score (THS) and reached only 75.7% diagnostic accuracy for EIPH detection, primarily because of systematic grading differences between observers
  • The deep learning algorithm achieved 92.3% diagnostic accuracy for EIPH against the ground truth, with high consistency in haemosiderin grade assignment
  • Standardised grading based on the ground truth could reduce measurement variance by 87.7%, and a diagnostic uncertainty interval of THS 40 to 110 is proposed for human expert assessment
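
The cutoff and uncertainty interval reported above can be illustrated with a minimal Python sketch. The THS ≥75 cutoff and the 40 to 110 interval come from the summary. The THS formula is not given on this page, so the function assumes the conventional definition (each macrophage graded 0 to 4, THS = sum of grade × percentage of cells with that grade, giving a range of 0 to 400). The function names and the inclusive treatment of the interval endpoints are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

EIPH_CUTOFF = 75              # THS >= 75 is classed as EIPH (from the summary)
UNCERTAIN_RANGE = (40, 110)   # proposed uncertainty interval for human-derived THS


def total_haemosiderin_score(grades):
    """Compute a THS from per-macrophage grades.

    Assumption: grades are integers 0-4 and THS is the sum over grades of
    grade * (percentage of cells with that grade), so THS lies in 0-400.
    """
    if not grades:
        raise ValueError("at least one graded macrophage is required")
    if any(g not in (0, 1, 2, 3, 4) for g in grades):
        raise ValueError("grades must be integers from 0 to 4")
    counts = Counter(grades)
    n = len(grades)
    return sum(grade * 100.0 * count / n for grade, count in counts.items())


def interpret_human_ths(ths):
    """Return (label, uncertain) for a human-derived THS.

    Assumption: both ends of the uncertainty interval are treated as inclusive.
    """
    label = "EIPH" if ths >= EIPH_CUTOFF else "no EIPH"
    low, high = UNCERTAIN_RANGE
    return label, low <= ths <= high


if __name__ == "__main__":
    # Hypothetical sample: half the cells grade 0, half grade 1.
    example_grades = [0] * 50 + [1] * 50
    ths = total_haemosiderin_score(example_grades)
    print(ths, interpret_human_ths(ths))
```

In this sketch, a score that falls inside the interval keeps its threshold label but is flagged, which mirrors the recommendation to confirm such cases by algorithmic or repeat assessment.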

Conditions Studied

Exercise-induced pulmonary haemorrhage (EIPH)