Inter-rater and intra-rater reliability and agreement of echocardiographic diagnosis of rheumatic heart disease using the World Heart Federation evidence-based criteria
  1. Bo Remenyi1,2,
  2. Jonathan Carapetis3,
  3. John W Stirling4,
  4. Beatrice Ferreira5,
  5. Krishnan Kumar6,
  6. John Lawrenson7,8,
  7. Eloi Marijon9,
  8. Mariana Mirabel10,
  9. A O Mocumbi11,
  10. Cleonice Mota12,
  11. John Paar13,
  12. Anita Saxena14,
  13. Janet Scheel15,
  14. Satu Viali16,
  15. I B Vijayalakshmi17,
  16. Gavin R Wheaton18,
  17. Liesl Zuhlke19,
  18. Karishma Sidhu2,
  19. Eliazar Dimalapang2,
  20. Thomas L Gentles20,
  21. Nigel J Wilson2,21
  1. 1Menzies School of Health Research, Casuarina, Northern Territory, Australia
  2. 2Green Lane Cardiovascular Services, Auckland City Hospital, Auckland, New Zealand
  3. 3Telethon Kids Institute, University of Western Australia, Subiaco, Western Australia, Australia
  4. 4Paediatric and Congenital Cardiac Services, Starship Children’s Hospital, Auckland, New Zealand
  5. 5Maputo Heart Institute, Maputo, Mozambique
  6. 6Amrita Institute of Medical Sciences and Research Centre, Kochi, India
  7. 7Paediatrics and Child Health, Stellenbosch University, Cape Town, South Africa
  8. 8Department of Paediatrics and Child Health, Cape Town, South Africa
  9. 9Hop Europeen Georges Pompidou, Paris, France
  10. 10INSERM U970, Paris Cardiovascular Research Center PARCC, Paris, France
  11. 11Inst Coracao, New York City, New York, USA
  12. 12Federal University of Minas Gerais, Belo Horizonte, Brazil
  13. 13Cardiology, Project Health for León, Raleigh, North Carolina, USA
  14. 14All India Institute of Medical Sciences, New Delhi, India
  15. 15Pediatric Cardiology, Children’s National Health System, Washington, District of Columbia, USA
  16. 16Cardiology, Samoa National Hospital, Apia, Samoa
  17. 17Pediatric Cardiology, Sri Jayadeva Institute of Cardiovascular Sciences and Research, Bangalore, Karnataka, India
  18. 18Cardiology, Women’s and Children’s Hospital, Adelaide, South Australia, Australia
  19. 19Groote Schuur Hospital and University of Cape Town, Cape Town, South Africa
  20. 20Paediatric and Congenital Cardiology, Starship Children’s Hospital, Auckland, New Zealand
  21. 21University of Auckland, Auckland, New Zealand
  1. Correspondence to Dr Bo Remenyi, Menzies School of Health Research, Casuarina, NT 0810, Australia; Bo.Remenyi{at}


Objective Different definitions have been used for screening for rheumatic heart disease (RHD). This led to the development of the 2012 evidence-based World Heart Federation (WHF) echocardiographic criteria. The objective of this study is to determine the intra-rater and inter-rater reliability and agreement in differentiating no RHD from mild RHD using the WHF echocardiographic criteria.

Methods A standard set of 200 echocardiograms was collated from prior population-based surveys and uploaded for blinded web-based reporting. Fifteen international cardiologists reported on and categorised each echocardiogram as no RHD, borderline or definite RHD. Intra-rater and inter-rater reliability was calculated using Cohen’s and Fleiss’ free-marginal multirater kappa (κ) statistics, respectively. Agreement assessment was expressed as percentages. Subanalyses assessed reproducibility and agreement parameters in detecting individual components of WHF criteria.

Results Sample size from a statistical standpoint was 3000, based on repeated reporting of the 200 studies. The inter-rater and intra-rater reliability of diagnosing definite RHD was substantial with a kappa of 0.65 and 0.69, respectively. The diagnosis of pathological mitral and aortic regurgitation was reliable and almost perfect, kappa of 0.79 and 0.86, respectively. Agreement for morphological changes of RHD was variable ranging from 0.54 to 0.93 κ.

Conclusions The WHF echocardiographic criteria enable reproducible categorisation of echocardiograms as definite RHD versus no or borderline RHD and hence would be a suitable tool for screening and monitoring disease progression. The study highlights the strengths and limitations of the WHF echo criteria and provides a platform for future revisions.

  • mitral regurgitation
  • aortic valve disease
  • paediatric echocardiography
  • rheumatic fever

Key messages

What is already known about this subject?

  • Different definitions have been used for screening for rheumatic heart disease (RHD). This led to the development of the 2012 evidence-based World Heart Federation (WHF) echocardiographic criteria.

What does this study add?

  • This study demonstrates that if the WHF echocardiographic criteria are strictly applied to screening echocardiograms, then no RHD can be reliably differentiated from mild RHD. Physiological regurgitation can usually be differentiated from mild pathological regurgitation; however, the agreement over the presence of morphological features is more variable.

How might this impact on clinical practice?

  • The WHF echocardiographic criteria enable reproducible categorisation of echocardiograms as no RHD, borderline and definite RHD. The criteria are a suitable tool for RHD screening programmes and can be used in the clinical setting for undifferentiated valve disease and when a diagnosis of RHD is being considered.

Introduction


Rheumatic heart disease (RHD), a sequel of acute rheumatic fever (ARF), remains a major global health problem affecting an estimated 33.4 million people worldwide and leads to substantial morbidity and 319 400 deaths per year.1 ARF may go undetected for several reasons: symptoms may be mild or atypical, patients may not seek medical care, or medical staff may not be equipped to make the diagnosis. On a global basis, most patients with RHD who seek medical attention do not have a history of ARF.2

Asymptomatic patients with mild to moderate RHD likely benefit the most from secondary prophylaxis.3 4 Auscultation does not have sufficient sensitivity (just 20%) and specificity to be useful in diagnostic testing for RHD and is no longer recommended as a screening tool.5 6 Echocardiography is the gold standard for the diagnosis of both acute and chronic RHD.7 8

To allow for rapid and consistent case identification of patients with mild RHD without a prior history of ARF, in 2012 the evidence-based World Heart Federation (WHF) echocardiographic criteria for RHD were developed (table 1).7 The criteria were developed to discriminate at the milder end of the spectrum of RHD. The echocardiography of severe RHD has been well characterised.9

Table 1

2012 WHF criteria for echocardiographic diagnosis of RHD for individuals aged ≤20 years7

Since its publication, the 2012 WHF echocardiographic criteria for RHD have proven to be highly sensitive compared with auscultation5 10 and highly specific in the school-aged population.11–13 Three large population-based surveys showed that no ‘low-risk’ children were labelled with ‘definite RHD’ using the WHF definitions.11–13 Importantly, the criteria have been widely adopted for use since 2012 and have in essence become the gold standard.8 10 14 15

Concerns have been raised that the use of WHF criteria may be too complex for population-based screening.14 16 The interpretation of echocardiograms and specifically grading of severity of valvular regurgitation is known to have variable reproducibility.17 18 If echocardiography is to be used for population-based screening of school-aged children or for monitoring of disease progression and regression, then it is essential to ensure that the diagnosis of mild RHD is reproducible. This has not been formally evaluated to date.

The primary objective of this study is to assess the intra-rater and inter-rater reliability and the agreement parameters associated with the 2012 WHF echocardiographic criteria in terms of differentiating no RHD from borderline and definite RHD.

Methods


This study is reported in accordance with the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) 2011.19

Sample size

A sample size of 200 was chosen based on the prevalence of disease and the precision to be expected in estimates of the kappa index and agreement parameters. Sample size calculations were performed using nQuery software. According to nQuery, if kappa (κ)=0.8, a precision of ±0.1 can be expected with n=200 if the prevalence of RHD is 0.25.

Study participants

Members of the WHF Advisory Group on echocardiographic screening of RHD participated as raters or reporters in the study: 15 cardiologists from 9 countries (Australia, Brazil, France, India, Mozambique, New Zealand, Samoa, South Africa and USA).


Two hundred de-identified digital echocardiographic studies were uploaded onto a secure website for viewing and reporting. Images were obtained prospectively from two large echocardiographic epidemiologic RHD screening studies conducted between 2008 and 2010 in New Zealand20 and Australia.10 Echocardiography was performed by qualified echocardiographers on Vivid E and Vivid I machines. From each site, 100 studies were selected. Normal case distribution during echocardiographic screening is 97% no RHD, 1%–2% borderline RHD and 1% definite RHD.10 In order to attain case distribution ideal for the evaluation of the reliability of the WHF criteria with kappa statistics, a non-probabilistic sampling methodology was used. The target distribution was 1/3 no RHD, 1/3 borderline RHD and 1/3 definite RHD. To achieve this, from each site consecutive abnormal studies (borderline and definite RHD as judged by the original reporting team) were enrolled as well as consecutive subtly abnormal studies that did not meet WHF definitions for RHD. Completely normal echocardiograms were excluded. Subtly abnormal studies included those with physiological mitral or aortic regurgitation, isolated morphological feature of RHD such as valvular or chordal thickening, and minor congenital defects such as a bicuspid valve. Excluding completely normal studies decreased the sample size required for statistical validity and made the study feasible with a large number of reporters.

Echocardiographic studies included the following moving images: parasternal-long-axis, parasternal-short-axis, apical-four-chamber and apical-five-chamber views (2D and colour Doppler). Still-frame images included in studies were continuous wave (CW) Doppler, image of the anterior mitral valve leaflet (AMVL) in diastole with measurement, and images of aortic and mitral regurgitant jets with measurements. The study participants were directed to re-measure these parameters using strict protocols as per WHF guidelines.7


Reporting cardiologists independently reviewed all 200 echocardiographic studies and entered reports in a standardised secure website that was specifically designed to view echocardiograms, perform measurements and report on echocardiograms, based on the 2012 WHF criteria. Cardiologists were blinded to all clinical information and case distribution. The flow of echocardiogram reports is depicted in figure 1.

Figure 1

Flow of echocardiogram reports.

To measure intra-observer variability, 100 images were re-coded and randomly re-uploaded to the website for re-reporting. Cardiologists were blinded to their original reading. Thirteen out of the 15 cardiologists participated in the intra-observer component of the study. The interval between first and second reading was >6 months.


The primary outcomes were to assess intra-rater and inter-rater reliability and proportion of agreement in categorising echocardiograms as no RHD, borderline or definite RHD, as per 2012 WHF criteria.7 Secondary outcomes were to assess agreement in identifying individual components of the 2012 WHF criteria such as pathological regurgitation, valvular thickening and chordal thickening as detailed in table 1.

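The interpretation of kappa values was based on the Landis and Koch guidelines21: <0, poor; 0–0.20, slight; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, substantial; 0.81–1.00, almost perfect.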
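The interpretation step can be sketched as a small lookup. This is a minimal illustration, assuming the commonly cited Landis and Koch cut-offs, which are encoded in the function below; the function name is hypothetical and is not part of the study software.

```python
def landis_koch_label(kappa: float) -> str:
    """Map a kappa value to its Landis and Koch descriptive band.

    Cut-offs are the commonly cited bands (an assumption of this sketch):
    <0 poor, 0-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate,
    0.61-0.80 substantial, 0.81-1.00 almost perfect.
    """
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Example: values of the size reported in this study
labels = {k: landis_koch_label(k) for k in (0.49, 0.65, 0.86)}
```

Values that fall exactly on a cut-off (for example 0.60) are assigned to the lower band in this sketch.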


Ethics approvals were obtained from Australia and New Zealand and individual patient consent was waived. All patients had previously provided formal written consent for the echocardiographic screening programmes.10 20 This study used de-identified and non-re-identifiable images for secondary research use.

Statistical analysis

Data were exported in Excel format from the designated research website. Statistical calculations were performed with the Statistical Package SAS software V.9.4 (SAS Institute, Cary, North Carolina, USA).

Inter-rater reliability was calculated using Fleiss’ free-marginal multi-rater kappa, as this was deemed to be the most appropriate statistic when marginals are not fixed and hence raters are unaware of case distribution.22 Intra-rater reliability was measured using Cohen’s kappa coefficient for dichotomous variables and linearly weighted Cohen’s kappa for trichotomous variables (no RHD, borderline RHD and definite RHD). Inter-rater reliability was expressed as mean kappa values and reported with a 95% CI. Intra-rater measurements were expressed as median kappa values with an IQR. Proportions of agreement were reported as mean percentages with a 95% CI for inter-rater agreement and as medians with IQR for intra-rater agreement. Individual intra-rater reliability and agreement parameters are depicted in figures 2–6. In the absence of a gold standard, it was not statistically possible to provide individual inter-rater results.
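For readers unfamiliar with the free-marginal multi-rater kappa, the following is a minimal sketch of the statistic, not the SAS code used in the study. It assumes every case is rated by the same number of raters and that chance agreement is 1/k for k categories; the function name and the toy ratings are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def free_marginal_kappa(
    ratings: Sequence[Sequence[Hashable]], n_categories: int
) -> float:
    """Free-marginal multi-rater kappa.

    ratings[i] holds the category assigned to case i by each rater.
    Chance agreement is fixed at 1 / n_categories (free marginals).
    """
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(case) != n_raters for case in ratings):
        raise ValueError("every case needs the same number (>= 2) of ratings")

    # Count ordered pairs of raters that agree within each case.
    agreeing_pairs = 0
    for case in ratings:
        agreeing_pairs += sum(c * (c - 1) for c in Counter(case).values())

    p_observed = agreeing_pairs / (len(ratings) * n_raters * (n_raters - 1))
    p_chance = 1.0 / n_categories
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy example: 4 echocardiograms, 3 raters, 3 categories (illustrative only)
toy_ratings = [
    ["no", "no", "no"],
    ["definite", "definite", "definite"],
    ["borderline", "borderline", "no"],
    ["no", "borderline", "definite"],
]
toy_kappa = free_marginal_kappa(toy_ratings, n_categories=3)  # 0.375
```

Because chance agreement depends only on the number of categories, the statistic does not change with how often each category is actually used.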

Figure 2

Definite rheumatic heart disease: inter-rater and intra-rater reliability and agreement.

Figure 3

Any rheumatic heart disease (borderline and definite): inter-rater and intra-rater reliability and agreement.

Figure 4

Mitral regurgitation: inter-rater and intra-rater reliability and agreement.

Figure 5

Aortic regurgitation: inter-rater and intra-rater reliability and agreement.

Figure 6

Presence of two or more morphological features of rheumatic heart disease of the mitral valve: inter-rater and intra-rater reliability and agreement.

Figure 7

Categorising echocardiograms as ‘no RHD’, ‘borderline RHD’ and ‘definite RHD’: inter-rater and intra-rater reliability and agreement. RHD, rheumatic heart disease.

Prevalence of many of the secondary endpoints (morphological features of RHD) was low. Both kappa values and proportions of agreement were reported. Kappa values were not adjusted for disease prevalence as per standard reporting requirements. When disease prevalence is very high or very low (rather than intermediate), the κ values decrease relative to the percentage of agreement, as κ is a relative measure of reliability and is heavily influenced by disease prevalence.23

Results


Echocardiograms were obtained from RHD screening studies conducted at schools in children aged 5–15 years in Australia10 and 11–13 years in New Zealand.20 In those studies, 79% of individuals identified as indigenous Australian, Maori or Pacific Islander and 49% were female.

A total of 3000 reports by 15 cardiologists were analysed for the inter-observer assessment. One cardiologist only reported on final diagnosis and not on subcategories. Thirteen cardiologists participated in the intra-rater assessment. Each reported on 99 echocardiograms as one study was uploaded to the website erroneously and hence 1287 reports were analysed. The flow of echocardiogram reports is depicted in figure 1.
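The report counts follow directly from the design. A short check of the arithmetic, using only figures stated in the text (variable names are illustrative):

```python
# Inter-rater component: every cardiologist reported on every study.
n_studies = 200
n_cardiologists = 15
inter_rater_reports = n_studies * n_cardiologists  # 3000

# Intra-rater component: 100 studies were re-uploaded, one in error,
# and 13 of the 15 cardiologists took part.
n_repeat_studies = 100 - 1
n_repeat_cardiologists = 13
intra_rater_reports = n_repeat_studies * n_repeat_cardiologists  # 1287
```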

Among those without the target condition of RHD, 13 had congenital heart disease according to the reports from the original screening programmes: 7 had a bicuspid aortic valve (AV), 4 had mitral valve (MV) prolapse, 1 had a ventricular septal defect and 1 had an atrioventricular septal defect.

Primary endpoint: RHD

Overall, the inter-rater reproducibility in categorising echocardiograms as no RHD, borderline and definite RHD (primary endpoint) was moderate, with a mean Fleiss’ free-marginal multi-rater kappa of 0.49 (95% CI 0.45 to 0.54) (figure 2). When inter-rater readings were dichotomised to ask “is there definite RHD?” or “is there any RHD?”, the agreement was substantial, with κ of 0.65 (95% CI 0.59 to 0.70) and 0.6 (95% CI 0.55 to 0.65), respectively (figures 3 and 4). Total proportion of agreement was highest when results were dichotomised to answer the question “Is there definite RHD?”, with a total agreement of 82.27% (95% CI 79.54% to 84.99%) (figure 3). Table 2 details reliability and agreement parameters for inter-rater and intra-rater reproducibility.
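The dichotomisation described above can be illustrated with a minimal sketch. How the total proportion of agreement was computed in SAS is not stated here, so the pooled pairwise agreement below is an assumption made for illustration, as are the function names and the toy data.

```python
from itertools import combinations
from typing import Hashable, List, Sequence


def dichotomise(label: str, question: str) -> bool:
    """Collapse a three-category reading into a yes/no answer."""
    if question == "definite":
        return label == "definite RHD"
    if question == "any":
        return label in ("borderline RHD", "definite RHD")
    raise ValueError("question must be 'definite' or 'any'")


def pairwise_agreement(cases: Sequence[Sequence[Hashable]]) -> float:
    """Proportion of rater pairs that agree, pooled over all cases."""
    agree = total = 0
    for ratings in cases:
        for a, b in combinations(ratings, 2):
            agree += int(a == b)
            total += 1
    return agree / total


def dichotomise_cases(cases: Sequence[Sequence[str]], question: str) -> List[List[bool]]:
    return [[dichotomise(r, question) for r in ratings] for ratings in cases]


# Toy data: 2 echocardiograms read by 3 raters (illustrative only)
toy_cases = [
    ["no RHD", "borderline RHD", "definite RHD"],
    ["definite RHD", "definite RHD", "borderline RHD"],
]
agreement_three_way = pairwise_agreement(toy_cases)
agreement_definite = pairwise_agreement(dichotomise_cases(toy_cases, "definite"))
agreement_any = pairwise_agreement(dichotomise_cases(toy_cases, "any"))
```

In this toy data, collapsing to two categories removes some disagreements between adjacent categories, so pooled agreement rises from 1/6 to 1/3 ("definite") and 2/3 ("any").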

Table 2

Inter-rater and intra-rater reproducibility of the WHF criteria

The intra-rater reproducibility (reliability and agreement) parameters in categorising echocardiograms as no RHD, borderline and definite RHD were as follows: the median linearly weighted Cohen κ was 0.68 (IQR 0.60–0.72) and total proportion of agreement was 74.75% (IQR 68.69%–80.81%). Median results are detailed in table 2 and individual results of reporting cardiologists are depicted in figures 2–4.
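A linearly weighted Cohen κ gives partial credit when two readings differ by one ordered category (for example, borderline versus definite) and none when they differ by two. The following is a minimal sketch for one rater's first and second readings, not the study's SAS code; the function name and toy readings are hypothetical.

```python
from typing import Sequence

CATEGORIES = ("no RHD", "borderline RHD", "definite RHD")


def linear_weighted_kappa(
    first: Sequence[str],
    second: Sequence[str],
    categories: Sequence[str] = CATEGORIES,
) -> float:
    """Linearly weighted Cohen's kappa for two readings of the same cases."""
    if len(first) != len(second) or not first:
        raise ValueError("readings must be non-empty and of equal length")
    index = {c: i for i, c in enumerate(categories)}
    k, n = len(categories), len(first)

    # Cross-tabulate first reading (rows) against second reading (columns).
    table = [[0] * k for _ in range(k)]
    for a, b in zip(first, second):
        table[index[a]][index[b]] += 1
    row = [sum(table[i]) for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]

    # Disagreement is weighted by the distance between ordered categories.
    observed = sum(abs(i - j) * table[i][j] for i in range(k) for j in range(k)) / n
    expected = sum(abs(i - j) * row[i] * col[j] for i in range(k) for j in range(k)) / n**2
    if expected == 0:
        raise ValueError("kappa is undefined when all readings fall in one category")
    return 1.0 - observed / expected


# Toy example: six studies read twice by the same rater (illustrative only)
first_reading = ["no RHD", "no RHD", "borderline RHD",
                 "definite RHD", "definite RHD", "borderline RHD"]
second_reading = ["no RHD", "borderline RHD", "borderline RHD",
                  "definite RHD", "borderline RHD", "borderline RHD"]
toy_weighted_kappa = linear_weighted_kappa(first_reading, second_reading)  # 4/7
```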

Secondary endpoints

The inter-rater reliability of identifying isolated pathological mitral and aortic regurgitation was ‘good’ and ‘almost perfect’, with κ of 0.79 (95% CI 0.75 to 0.84) and 0.86 (95% CI 0.83 to 0.90), respectively (see table 2 and figures 5 and 6). The inter-rater reliability of detecting ≥2 morphological features of RHD of the MV was ‘substantial’ with a κ of 0.57 (95% CI 0.51 to 0.62), with a proportion of agreement of 78.3% (95% CI 75.49% to 81.11%) (see table 2 and figure 7). The most reliably detected morphological feature of the MV was the objective measure of thickening of the AMVL with an inter-rater κ of 0.75 (95% CI 0.7 to 0.8). The least reliably detected morphological feature of the MV was chordal thickening with an inter-rater κ of 0.54 (95% CI 0.49 to 0.59). The most reliably detected morphological feature of the AV was restricted leaflet motion with an inter-rater κ of 0.97 (95% CI 0.96 to 0.98). The least reliably detected morphological feature of the AV was the subjective measure of thickening with an inter-rater κ of 0.67 (95% CI 0.62 to 0.72). Further details are provided in table 2 and individual results are detailed in figures 2–7.

Discussion


This study demonstrates that the WHF echocardiographic criteria enable reliable categorisation of screening echocardiograms as no RHD, borderline and definite RHD. The inter-rater and intra-rater reliability were moderate and substantial, with κ of 0.49 and 0.68, respectively. This level of reliability is comparable with that of other screening tests such as mammography for breast cancer screening (κ 0.53–0.77)24 25 and surpasses the reliability associated with other tests such as the cytological assessment of screening Papanicolaou (Pap) smears for cervical cancer (κ 0.46).26 Reliability improved when categorisation of echocardiograms was dichotomised to ask “is there any RHD (borderline or definite)?” or “is there definite RHD?”, with inter-rater κ values of 0.6 and 0.65, respectively.

Similarly, there was a good level of absolute agreement in deciding if definite RHD was present, with a total proportion of agreement being 82.27%. A test that is associated with a high level of absolute proportion of agreement is deemed to be a suitable tool to detect change over time,23 indicating that the WHF criteria should be a suitable tool to monitor disease progression or resolution.

There was almost perfect inter-rater and intra-rater agreement detecting pathological mitral regurgitation with κ of 0.79 and 0.92, respectively. This is substantially superior to agreement over the presence of severe MR regardless of methodology used17 18 and is likely the result of having very strict definitions where all four criteria must be met for regurgitation to be considered pathological (table 1).7 Therefore, physiological mitral regurgitation, which occurs in up to 18% of healthy children, can be very reliably differentiated from pathological mitral regurgitation that occurs in less than 0.5% of low-risk and up to 3% of children in high-risk populations for RHD.10

The reproducibility of identifying two or more morphological features of RHD of the MV (borderline category A) in a given echocardiogram was substantial with a κ of 0.57 and an absolute proportion of agreement of 78.3%. The most reliably detected morphological features of the MV were AMVL thickening and excessive leaflet motion, while for the AV, they were restricted leaflet motion and AV prolapse.

Cohen’s kappa, which was used to analyse intra-rater agreement, is a relative measure of agreement: the observed agreement in excess of that expected by chance, expressed as a proportion of the maximum possible excess. When disease distribution is skewed and prevalence is either very high or very low, the expected level of agreement by chance rises and the kappa value falls. Hence, the kappa value is influenced by disease prevalence. For inter-rater agreement, Fleiss’ free-marginal multi-rater kappa was used, which better compensates for skewed distribution. By necessity, different kappa statistics were used for multi-rater inter-observer agreement and bi-rater intra-observer agreement. The results varied for echocardiographic features that were rare, which highlights some of the limitations of kappa statistics.
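The prevalence effect can be made concrete with a small worked example. Cohen's kappa is (p_o − p_e)/(1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each reading's own marginal rates; the free-marginal variant fixes p_e at 0.5 for two categories. The counts below are invented for illustration and are not study data; the function names are hypothetical.

```python
def cohen_kappa_2x2(both_pos: int, first_only: int, second_only: int, both_neg: int) -> float:
    """Cohen's kappa for two dichotomous readings of the same cases."""
    n = both_pos + first_only + second_only + both_neg
    p_observed = (both_pos + both_neg) / n
    first_pos = (both_pos + first_only) / n
    second_pos = (both_pos + second_only) / n
    # Chance agreement depends on the marginal (prevalence-driven) rates.
    p_chance = first_pos * second_pos + (1 - first_pos) * (1 - second_pos)
    return (p_observed - p_chance) / (1 - p_chance)


def free_marginal_kappa_2x2(both_pos: int, first_only: int, second_only: int, both_neg: int) -> float:
    """Free-marginal kappa for two categories: chance agreement fixed at 0.5."""
    n = both_pos + first_only + second_only + both_neg
    p_observed = (both_pos + both_neg) / n
    return (p_observed - 0.5) / 0.5


# Two invented scenarios, each with 90% observed agreement over 100 cases.
balanced = (45, 5, 5, 45)      # feature present in about half of readings
rare_feature = (2, 5, 5, 88)   # feature present in about 7% of readings

kappa_balanced = cohen_kappa_2x2(*balanced)        # 0.80
kappa_rare = cohen_kappa_2x2(*rare_feature)        # about 0.23
free_balanced = free_marginal_kappa_2x2(*balanced)
free_rare = free_marginal_kappa_2x2(*rare_feature)
```

With identical observed agreement, Cohen's kappa falls sharply when the feature is rare, whereas the free-marginal statistic is unchanged. This is consistent with the divergence between kappa values and proportions of agreement for rare morphological features.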

The total proportions of agreement over the presence of thickening of the MV and AV were similar: 87.27% and 83.29%, respectively. Similarly, inter-rater kappa values were 0.75 and 0.67, respectively. This is despite the fact that AMVL thickening had an objective measure (of >3 mm) while AV thickening was a subjective observation. Webb and colleagues found similarly high inter-observer agreement in relation to MV thickness measurements with an inter-class correlation coefficient of 0.85.27 They applied the same strict methodology as described in the WHF diagnostic guidelines.7

The absolute proportion of agreement in identifying individual morphological features of RHD was high for all features and ranged from 76.78% for chordal thickening to 98.39% for restricted motion of the AV.

To implement active surveillance for RHD on a global scale, as recommended by WHO some decades ago,28 would require considerable increase in human resources. Task shifting, through echocardiography performed by health workers, could provide part of the solution to make active case finding a reality in resource-poor settings. Concerns have been raised that the use of WHF criteria may be too complex for population-based screening and simplified criteria might be more practical in the field.14 16 As a result, the WHF criteria have already been modified by some researchers to allow for the use of hand-held echocardiography machines without CW capabilities and for health worker–led echocardiographic screening.15 29 Those criteria have focused on detecting mitral and/or aortic regurgitation and have ignored morphological features of RHD.

Our study supports the use of simplified criteria in the field. The most reliable component of the WHF criteria was the diagnosis of pathological mitral and aortic regurgitation, and hence it is appropriate to focus on these features when large-scale screening is being considered. The current study supports the use of the WHF guidelines for the final diagnosis of RHD for those individuals detected as positive for RHD by simplified screening protocols.

Regardless of skill level and whether the full WHF criteria or modified criteria are used for simplicity, rigorous training protocols and evaluation of competency prior to engaging in performing or reporting on screening echocardiograms for RHD should be mandatory.30

Many unknowns remain about echocardiographic screening for RHD. Perhaps the most important of these is the natural history of echocardiography-detected RHD. This study demonstrated that the WHF criteria could be useful in detecting change over time and therefore could be an appropriate tool to evaluate the impact of secondary prophylaxis on disease progression of borderline RHD. A randomised controlled trial is currently under way to determine the absolute benefit of secondary prophylaxis in the setting of subclinical mild, definite and borderline RHD (The GOAL trial, NCT03346525).

The WHF echocardiographic criteria have shown good discriminating capacity and hence would be a suitable tool for population-based screening, active case finding and for diagnosis of RHD in the clinical setting. Having a reliable diagnostic method also permits the monitoring of epidemiological patterns and could aid the evaluation of interventions that are designed to reduce RHD burden, for example, sore throat programmes, Group A streptococcal vaccine trials or echocardiographic screening programmes.

Limitations


This study was limited to interpretation of echocardiograms by cardiologists experienced in RHD. It is recognised that acquisition of high-quality images is fundamental to accurate diagnoses. In this study, all images were obtained by highly qualified echocardiographers in Australia and New Zealand, which may not be the case in screening studies in many resource-limited settings. Echocardiograms were obtained from screening studies from Australia and New Zealand only and may not be representative of demographics or disease pattern elsewhere. The 2012 WHF echocardiographic definitions for RHD are considered to be the current gold standard and were based on the best available echocardiographic, pathological and postmortem evidence of RHD.7 The current study represents the definitive validation of their reliability and agreement. Randomised controlled trials or carefully designed longitudinal studies are needed to ascertain risk of disease progression and the benefit of secondary prophylaxis for borderline RHD. Finally, the provision of still-frame images in our study may have inadvertently increased agreement.

Conclusion


This study demonstrates that application of the WHF echocardiographic criteria by specialist cardiologists enables reliable categorisation of screening echocardiograms as no RHD, borderline RHD and definite RHD. Pathological regurgitation is reliably differentiated from physiological regurgitation by experienced cardiologists. Agreement over the presence of morphological features of RHD was substantial, but the reliability was lower due to low prevalence of individual features. This study has demonstrated that the WHF criteria are useful tools for screening for RHD and for monitoring disease progression and resolution. They can also be used for clinical evaluation of new cases of MV and AV disease. Longitudinal studies are needed to evaluate the clinical significance of echocardiography-detected mild borderline and definite RHD.


BR received a scholarship from Heart Foundation of New Zealand and from the Lowitja Institute of Australia.




  • Contributors JC, NJW, TLG, KS and BR made substantial contributions to the conception and design of the work. BR, JC, JWS, BF, KK, JL, EM, MM, AOM, CM, JP, AS, JS, SV, IBV, GRW, LZ, KS, TLG and NJW made substantial contributions to the acquisition, analysis or interpretation of data for the work. BR prepared the draft of the manuscript. All authors made substantial contributions to the work or to revising it critically for important intellectual content, and gave final approval of the version to be published.

  • Funding Funding was received from the Green Lane Research and Education Fund, Auckland, New Zealand for the development of the study website.

  • Competing interests None declared.

  • Patient consent for publication Not required.

  • Ethics approval Ethics approvals were obtained for the study from the Northern X Regional Ethics Committee of the Ministry of Health of New Zealand and from the Human Research Ethics Committee of the Northern Territory Department of Health and Community Services of Australia. Both Ethics Committees waived individual patient consent.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data availability statement All data relevant to the study are included in the article or uploaded as online supplementary information.