For the study described in our previous paper, we prospectively included 60 patients from August 2018 to May 2019 (Ismail et al. 2020). All patients were diagnosed with PHPT and referred to preoperative localisation imaging using Di-SPECT and 11C-Choline PET/CT, and later to PTx. Of these, the first 40 consecutive patients to complete both imaging modalities and PTx were included in the present study. PET/CT images were re-anonymised and reassessed to determine inter- and intra-observer variation. Only the 11C-Choline PET/CT images were re-evaluated in this paper.
All 40 PET/CT images were given three new anonymised IDs in three randomised orders and thus evaluated three times. Each PET/CT image was evaluated by three readers, all blinded to the established truth (as determined by surgery and postoperative pathology and biochemistry) as well as to each other’s and their own previous results.
The three readers were all nuclear medicine specialists:
Expert reader: Highly experienced in PET/CT imaging as well as in parathyroid imaging in general, including experience in 11C-Choline PET/CT. At the time the study began, the expert reader had evaluated approx. 70 parathyroid 11C-Choline PET/CT studies and had more than 20 years of nuclear medicine experience.
Non-expert 1: Experienced in PET/CT imaging (primarily oncology) and some experience with parathyroid imaging. No experience with 11C-Choline PET/CT. Eight years of nuclear medicine experience.
Non-expert 2: Experienced in PET/CT imaging (primarily oncology) and sparse experience with parathyroid imaging. Some experience with 11C-Choline PET/CT for detection of prostate cancer metastases. Ten years of nuclear medicine experience.
Readers were given a coded sheet and asked to assess (a) number of HPGs, (b) which side (right, left or both), (c) location relative to the thyroid gland (upper, middle, lower third or ectopic) and (d) confidence of response (low, moderate or high). All 40 images were assessed three times by each reader. Analyses were performed using the MIM software (MIM Software Inc., USA).
Assessments (b) and (c) leave it to the observer to choose between seven possible locations. Differentiation between the upper and middle third of the thyroid gland is notoriously difficult and imprecise. Therefore, we combined the upper and middle third into one location, and assigned each ectopic location to the appropriate standard gland location. The result was a total of four possible locations, consistent with normal anatomy. In order to simplify statistics, we assumed a total of four parathyroid glands per patient.
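The collapsing of reported positions into four locations can be sketched as follows. This is a minimal illustration only: the function and variable names are hypothetical, each finding is assumed to describe one gland with side "right" or "left", and because the text gives no rule for ectopic glands, the caller must state explicitly which standard location an ectopic gland is assigned to.

```python
from typing import Optional

# The four collapsed locations, in a fixed (arbitrary) order.
LOCATIONS = ("upper right", "lower right", "upper left", "lower left")


def collapse_location(side: str, level: str,
                      ectopic_assigned_to: Optional[str] = None) -> str:
    """Map one reported gland position onto one of the four collapsed locations."""
    if level == "ectopic":
        # Assumption: the assignment of an ectopic gland is a manual decision,
        # so it has to be supplied by the caller.
        if ectopic_assigned_to not in LOCATIONS:
            raise ValueError("ectopic glands need an explicit standard location")
        return ectopic_assigned_to
    if side not in ("right", "left"):
        raise ValueError(f"unknown side: {side!r}")
    if level in ("upper", "middle"):
        # Upper and middle third are merged into one location.
        return f"upper {side}"
    if level == "lower":
        return f"lower {side}"
    raise ValueError(f"unknown level: {level!r}")


def to_binary_vector(findings):
    """Return one 0/1 entry per collapsed location (four glands assumed per patient)."""
    positive = {collapse_location(*finding) for finding in findings}
    return [int(location in positive) for location in LOCATIONS]


# Example: one reading reporting a lower-left and a middle-right gland.
reading = [("left", "lower"), ("right", "middle")]
vector = to_binary_vector(reading)
```

Coding each reading as a fixed-length 0/1 vector gives four observations per patient, which is the unit assumed in the agreement analyses.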
We have not taken into consideration the ‘true’ gland pathology as determined by surgery because our focus was on the reproducibility of image evaluation. For results on the former, we refer to our publication on this subject (Ismail et al. 2020).
Statistical analyses were conducted using R version 4.0.2 (R Core Team 2020; R Foundation for Statistical Computing, Vienna, Austria; https://www.R-project.org/) and two additional packages: irr (Various Coefficients of Interrater Reliability and Agreement, version 0.84.1; Gamer et al. 2019) and vcd (Visualizing Categorical Data, version 1.4-8; Meyer et al. 2020).
Intra-observer agreement was assessed using Fleiss’ kappa method, comparing all three rounds for each reader using four possible parathyroid locations (i.e. upper right, lower right, upper left and lower left). Kappa values were interpreted as follows: < 0.00, poor; 0.00–0.20, slight; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, good and > 0.80, almost perfect agreement (Landis and Koch 1977).
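For illustration, Fleiss' kappa and the interpretation bands above can be written out in a few lines. The study itself used the R package irr; the Python sketch below is not that implementation. It assumes that each rated unit is one patient-location combination, that the three rounds of one reader play the role of the three "raters", and that every unit has the same number of ratings. The function names are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of units, each holding one label per rating."""
    n_units = len(ratings)
    n_ratings = len(ratings[0])
    if n_ratings < 2 or any(len(row) != n_ratings for row in ratings):
        raise ValueError("every unit needs the same number (>= 2) of ratings")

    categories = sorted({label for row in ratings for label in row})
    counts = [Counter(row) for row in ratings]

    # Observed agreement: mean proportion of agreeing rating pairs per unit.
    per_unit = [
        (sum(c[k] ** 2 for k in categories) - n_ratings)
        / (n_ratings * (n_ratings - 1))
        for c in counts
    ]
    p_observed = sum(per_unit) / n_units

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(c[k] for c in counts) / (n_units * n_ratings) for k in categories
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")

    return (p_observed - p_expected) / (1.0 - p_expected)


def interpret_kappa(kappa):
    """Interpretation bands as listed in the text."""
    if kappa < 0.0:
        return "poor"
    for upper_bound, label in ((0.20, "slight"), (0.40, "fair"),
                               (0.60, "moderate"), (0.80, "good")):
        if kappa <= upper_bound:
            return label
    return "almost perfect"


# Toy data: four units, three rounds each, 1 = HPG reported, 0 = not reported.
rounds = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 0, 1]]
kappa = fleiss_kappa(rounds)
label = interpret_kappa(kappa)
```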
To limit the influence of a possible learning curve, the third and final round of image analysis from each reader was used to assess inter-observer agreement. Fleiss’ kappa was used in the same manner as described for intra-observer agreement to compare each of the two non-experts to the expert reader. We then calculated sensitivities and specificities of each of the non-experts, using the expert reader as the gold standard. All patients were diagnosed with PHPT and, as such, analysis at the patient level (i.e. HPG/no HPG) is irrelevant. Rather, sensitivities and specificities were calculated with regard to the side of the thyroid gland (left/right, N = 80) and location in relation to the thyroid (upper left, upper right, lower left or lower right, N = 160), whilst adjusting for clustered observations (i.e. multiple observations per patient) using a so-called ‘sandwich estimator’ of variance with correlation-adjusted confidence intervals (Pustejovsky 2020; Gopstein 2018; Genders et al. 2012; Obuchowski 1998). ‘True positives’ and ‘true negatives’ were defined as cases in which the non-expert made the same assessment as the expert.
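The idea of adjusting for clustering can be illustrated with a ratio-estimator variance for a clustered proportion, which is one common cluster-robust form. It is shown here as an assumption and is not necessarily the exact estimator produced by the R packages cited above. Patients are the clusters, the expert's reading is taken as the reference, and all function names are hypothetical.

```python
import math


def clustered_proportion(successes, totals, z=1.959964):
    """Proportion with a cluster-robust (ratio-estimator) Wald interval.

    successes[i] and totals[i] are the counts contributed by cluster i.
    Returns (estimate, lower, upper), with the interval clipped to [0, 1].
    """
    # Clusters without any relevant observation carry no information.
    pairs = [(x, n) for x, n in zip(successes, totals) if n > 0]
    m = len(pairs)
    if m < 2:
        raise ValueError("at least two informative clusters are required")

    total_n = sum(n for _, n in pairs)
    p = sum(x for x, _ in pairs) / total_n

    # Between-cluster variability of the residuals x_i - p * n_i.
    variance = (m / (m - 1)) * sum((x - p * n) ** 2 for x, n in pairs) / total_n ** 2
    half_width = z * math.sqrt(variance)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def sensitivity_specificity(expert, non_expert):
    """Per-patient 0/1 vectors in, clustered sensitivity and specificity out."""
    tp, pos, tn, neg = [], [], [], []
    for e_row, r_row in zip(expert, non_expert):
        pos.append(sum(e_row))
        neg.append(len(e_row) - sum(e_row))
        tp.append(sum(1 for e, r in zip(e_row, r_row) if e == 1 and r == 1))
        tn.append(sum(1 for e, r in zip(e_row, r_row) if e == 0 and r == 0))
    return clustered_proportion(tp, pos), clustered_proportion(tn, neg)


# Toy data: three patients, four locations each (1 = HPG reported).
expert = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 1]]
non_expert = [[1, 0, 0, 0], [0, 0, 0, 0], [1, 0, 1, 1]]
sensitivity, specificity = sensitivity_specificity(expert, non_expert)
```

With independent observations the variance would reduce to roughly p(1 - p)/N; the cluster-based form is typically wider when readings within the same patient are correlated.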
To visualise the degree of both inter- and intra-observer agreement and disagreement, we created Bangdiwala’s Observer Agreement Charts (Bangdiwala 2017). Unlike the single estimate given by the kappa analysis, the agreement chart allows a visual estimate of possible bias amongst observers by comparing the row and column totals (marginal totals) of the contingency table, which are drawn as grey-scaled/shaded rectangles. In cases of perfect agreement, the rectangles are depicted as black squares lying exactly on the diagonal line. Disagreements are shown as white rectangles. Bias is assessed by the deviation of the rectangles away from the red diagonal line. The further away the ‘path’ of the rectangles is from the diagonal line, the larger the possible bias. For all reported confidence intervals, a level of 95% was chosen, and all reported p values are exact. Average confidence scores were calculated as the arithmetic mean of the coded responses (where low confidence = 1, moderate confidence = 2 and high confidence = 3).
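The quantities behind an agreement chart can be sketched numerically. The charts in this study were drawn in R; the Python sketch below only computes the marginal totals that define the rectangles and Bangdiwala's B statistic, the ratio of the summed areas of the agreement squares to the summed areas of the marginal rectangles. The B statistic is not reported in the text above and is included only as the numerical summary that usually accompanies the chart. The function name and the toy table are hypothetical.

```python
def agreement_chart_summary(table):
    """Marginal totals and Bangdiwala's B for a square contingency table.

    table[i][j] is the number of units placed in category i by observer A
    and in category j by observer B.
    """
    k = len(table)
    if any(len(row) != k for row in table):
        raise ValueError("the contingency table must be square")

    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]

    # Each category contributes a rectangle of size row total x column total,
    # containing an agreement square with side equal to the diagonal count.
    square_area = sum(table[i][i] ** 2 for i in range(k))
    rectangle_area = sum(row_totals[i] * col_totals[i] for i in range(k))
    if rectangle_area == 0:
        raise ValueError("B is undefined for an empty table")

    return {
        "row_totals": row_totals,
        "col_totals": col_totals,
        "B": square_area / rectangle_area,
    }


# Toy 2 x 2 table (HPG reported / not reported) for two observers.
summary = agreement_chart_summary([[20, 5], [10, 45]])
```

Unequal row and column totals for the same category (here 25 versus 30) are what shift the rectangles away from the diagonal in the chart.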