Inter-observer reliability in ultrasound measurement of benign thyroid nodules in the follow-up of radiofrequency ablation: a retrospective study

Abstract Purpose To investigate the inter-observer reliability of ultrasound (US) measurement in volume and volume reduction rate (VRR) of benign thyroid nodules during the follow-up of radiofrequency ablation (RFA). Materials This retrospective study evaluated 76 patients with 88 benign thyroid nodules who underwent RFA. Two independent observers performed the US measurements during the same follow-up visit. The Bland–Altman analysis was used to evaluate the inter-observer reliability of volume and VRR (logarithmic transformation). The mean difference and 95% limits-of-agreement (LOA) were obtained after antilogarithm. Results No significant differences were found in the volume and VRR measurements during a mean follow-up time of 35.52 ± 12.82 months. After antilogarithms, the mean difference of volume was 0.9961, 0.9987, 1.0016, 0.9972, 0.9977 and 0.9969 at 1, 3, 6, 12, 24 and 36 months, respectively. The 95% LOA of volume became wider over the follow-up period, and the largest one was between 0.8471 and 1.1733 at 36 months. The 95% LOA of VRR became narrower over the follow-up period, and the largest one was between 0.9541 and 1.0469 at 1 month. The incidence of regrowth was 20.45% and the largest 95% LOA of regrowth nodules was between 0.9028 and 1.284 at 12 months. At the same follow-up period, VRR had a narrower 95% LOA than volume. Compared with volume ≥10 ml, nodules <10 ml revealed a larger 95% LOA in both the volume and VRR. Conclusion The inter-observer reliability of the US measurements of benign thyroid nodules during the follow-up period of RFA was acceptable. The variation by different observers could not affect the evaluation of efficacy.


Introduction
Thyroid nodules are a common disease in the general population, affecting 20-70% of individuals [1]. Although most nodules are incidental findings of asymptomatic and benign nodules that require observation only, 20% of nodules require treatment because of progressive growth and cosmetic or symptomatic problems [2]. Surgery is the standard treatment for thyroid nodules. However, it is associated with general anesthesia, complications and scar formation. Levothyroxine therapy is another option. Unfortunately, its efficacy is still debatable [3]. Thus, minimally invasive thermal ablation techniques, including radiofrequency ablation (RFA), microwave ablation and laser ablation have been widely used with excellent results [4][5][6][7][8].
RFA has been recommended as a safe and effective alternative to surgery for benign thyroid nodules [9][10][11][12]. It can selectively destroy the targeted nodule and induce irreversible cell injury [13]. After ablation, large amounts of nodule debris are gradually reabsorbed, which usually takes months to years [14]. Therefore, changes in volume are important indices for evaluating the efficacy of ablation. Moreover, the parameters used to evaluate efficacy, such as volume reduction rate (VRR), technique success and nodule regrowth [15], are based on the volume measurement at each follow-up period. As changes in nodule volume directly reflect treatment efficacy and indicate the need for nodule management during the follow-up [16], accurate and reliable nodule volume measurement is essential. Although thyroid ultrasound (US) is easily accessible, noninvasive, cost-effective and the most common method for nodule measurement [17], its observer dependence is the main disadvantage and may result in variability [17,18]. A few studies have demonstrated that the inter-observer reliability of nodule volume measurement by US ranged from ±13.1% to ±48.6% [16,18,19]. However, to the best of our knowledge, no study has reported the inter-observer reliability of nodule volume or VRR during the follow-up period of RFA.
Therefore, the purpose of this study was to investigate the inter-observer reliability of US measurement in the volume and VRR of thyroid nodules during the follow-up period of RFA.

Materials and methods
This retrospective study was approved by the Institutional Review Board of Chinese PLA General Hospital (approval number: S2019-211-01). Written informed consent was obtained from all the patients prior to RFA.

Patients
All the enrolled patients fulfilled the following inclusion criteria: (1) confirmation of benign nodule status on two separate fine-needle aspiration (FNA) or core-needle biopsy (CNB) procedures; (2) no suspicious malignant features on US examination; (3) solid (≤10% of fluid component) or predominantly solid nodules (11-50% of fluid component) [15]; (4) report of cosmetic and/or symptomatic problems, concern about rapid nodule growth or concern about malignant transformation; (5) serum thyroid hormone and thyrotropin levels within normal ranges; (6) refusal of or ineligibility for surgery; (7) follow-up time ≥24 months; (8) acceptance of two complete new sets of US scans by two observers at each follow-up period. Exclusion criteria were: (1) malignant findings or follicular neoplasm on FNA or CNB; (2) nodules with a benign result on FNA or CNB but with suspicion of malignancy on US; (3) follow-up time <24 months; (4) refusal of two complete new sets of US scans by two observers at each follow-up period.
From August 2014 to March 2018, a total of 137 patients with benign solid/predominantly solid thyroid nodules accepted two complete new sets of US scans by two observers at each follow-up period. Among them, patients with a follow-up time of less than 24 months (N = 61) were excluded. Finally, 76 patients with 88 benign thyroid nodules were evaluated in this study.
Pre-ablation assessment
US before and after RFA, as well as during follow-up, was performed using a Philips iU22 Ultrasound System (Philips Healthcare, Bothell, WA) with an L12-5 linear array transducer or a Mindray M9 Ultrasound System (Mindray, Shenzhen, China) with an L12-4 linear array transducer. All RFA procedures were performed using a Siemens Acuson Sequoia 512 Ultrasound System (Siemens, Mountain View, CA) with a 6L3 linear array transducer. Contrast-enhanced US (CEUS) was used to evaluate the ablated zone of the nodule immediately after RFA. Sulfur hexafluoride (SonoVue®, Bracco International, Milan, Italy) was used as the US contrast agent. CEUS was performed after a bolus injection of SonoVue (2.4 ml), followed by a 5 ml normal saline flush.
Before the RFA procedure, the thyroid nodule was measured by an experienced US physician with more than 20 years of experience in thyroid US and interventional US (Y.K.L.), who also performed the RFA procedure. The initial volume was then calculated using the ellipsoid formula: V = πabc/6 (V is the volume, a is the largest diameter, and b and c are the other two perpendicular diameters).
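The ellipsoid formula above can be written as a short function. This is a minimal sketch for illustration only: the function name and the example diameters are hypothetical, and it assumes diameters are recorded in centimetres so that the result is in millilitres (1 cm³ = 1 ml), which the text does not state explicitly.

```python
import math


def ellipsoid_volume_ml(a_cm, b_cm, c_cm):
    """Ellipsoid volume V = pi * a * b * c / 6.

    a_cm is the largest diameter; b_cm and c_cm are the other two
    perpendicular diameters. Centimetre input (an assumption here)
    yields a volume in millilitres.
    """
    return math.pi * a_cm * b_cm * c_cm / 6.0


# Hypothetical nodule measuring 3.0 x 2.5 x 2.0 cm
example_volume = ellipsoid_volume_ml(3.0, 2.5, 2.0)
print(f"{example_volume:.2f} ml")
```

Because the three diameters are multiplied together, a small relative error in each diameter compounds in the volume, which is one reason volume agreement between observers is looser than diameter agreement.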

Ablation procedure
A bipolar RFA generator (CelonLabPOWER, Olympus Surgical Technologies Europe, Hamburg, Germany) and an 18-gauge bipolar RF electrode with a 0.9 cm active tip (CelonProSurge micro 100-T09, Olympus Surgical Technologies Europe, Hamburg, Germany) were used in this study.
Patients lay on an operating table in the supine position with the neck extended. Local anesthesia with 1% lidocaine was administered. RFA was performed using the trans-isthmic approach, hydrodissection technique and moving-shot technique. CEUS was performed immediately after the RFA procedure to evaluate the ablation area. If any enhancement remained, a complementary ablation could be performed. Each patient was observed for 1-2 h in the hospital, and any adverse events, including complications and side effects, occurring during and immediately after ablation were carefully evaluated according to the clinical signs and symptoms [15].

Post-ablation evaluation
Two physicians (Observer A, Y.L., with more than 10 years of experience in thyroid US; Observer B, X.J., with 3 years of experience in thyroid US) performed all the measurements of ablated nodules at each follow-up period. Before this study, to obtain objective measurements, the two observers standardized the measurement method. The anteroposterior and transverse diameters of the nodule were measured on the transverse US image with the largest dimensions, and the longitudinal diameter was measured on the longitudinal US image with the largest dimensions. Nodules were measured with the calipers placed outside of any visible halo [20].
Patients were scanned consecutively by the observers during the same visit. Only one observer was present in the US room at any time. As the volume measurements were performed as a part of clinical follow-up evaluation of RFA, the previous US finding of ablated nodule was known to the observers. For each patient, each observer performed a complete new set of US scans, without knowledge of the other observer's results. When the margin of ablated nodule was detected, the measurements of three diameters were repeated twice to calculate a mean volume of each observer. Then, VRR was calculated based on the mean volume of each observer. Thus, a total of 6 volume measurements and one VRR were obtained for each nodule at each follow-up period.
After RFA, patients were followed up at 1, 3, 6 and 12 months and every 12 months thereafter by US and clinical evaluation. The volume reduction was calculated as follows: VRR = ([initial volume − final volume] × 100%)/initial volume. Technique success was defined as a >50% volume reduction at the last follow-up point [15]. Regrowth was defined as an increase in total volume 50% over the previously recorded smallest volume [15].
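The three efficacy parameters defined above can be sketched in a few lines. The function names and example volumes below are hypothetical. Two choices are assumptions rather than statements from the text: the regrowth check compares each follow-up volume with the smallest earlier follow-up volume, and it treats an increase of exactly 50% as regrowth.

```python
def vrr_percent(initial_ml, final_ml):
    """VRR = ([initial volume - final volume] x 100%) / initial volume."""
    return (initial_ml - final_ml) * 100.0 / initial_ml


def is_technique_success(initial_ml, last_ml):
    """Technique success: more than 50% volume reduction at last follow-up."""
    return vrr_percent(initial_ml, last_ml) > 50.0


def is_regrowth(followup_volumes_ml, increase=0.5):
    """Regrowth: a follow-up volume 50% over the previously recorded smallest.

    followup_volumes_ml must be in chronological order. Using >= for the
    comparison is an assumption of this sketch.
    """
    smallest = followup_volumes_ml[0]
    for volume in followup_volumes_ml[1:]:
        if volume >= smallest * (1.0 + increase):
            return True
        smallest = min(smallest, volume)
    return False


# Hypothetical nodule: 10 ml before RFA, follow-up volumes in ml
followups = [5.0, 2.0, 1.0, 1.6]
print(vrr_percent(10.0, followups[-1]))
print(is_technique_success(10.0, followups[-1]))
print(is_regrowth(followups))
```

In this hypothetical series the nodule still counts as a technique success at the last visit while also meeting the regrowth definition, which shows why both parameters depend on a reliable volume at every follow-up point.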

Statistical analysis
Statistical analysis was performed using the SPSS statistical software (Version 25.0, SPSS Inc., Chicago, IL) and GraphPad Prism (Version 8.0.0, La Jolla, CA) software. Continuous data were expressed as mean ± SD (range). Wilcoxon's signed rank tests were used for pairwise comparisons. A difference with p < .05 was considered as statistically significant.
The inter-observer reliability of nodule volume and VRR was assessed using the Bland-Altman analysis. The mean difference and 95% limits-of-agreement (LOA, mean difference ± 1.96SD) were calculated. The mean difference, also called bias, was the tendency for one modality to underestimate or overestimate the measurement relative to the other [21]. The LOA was the range within which 95% of the differences between measurements by the two observers would lie [22] and was expressed as the absolute magnitude of the reliability between the two observers. The width of the LOA varied with the precision of the measurements: the LOA was wider when measurements were imprecise and narrower when they were precise [23]. Before the Bland-Altman analysis, the Kolmogorov-Smirnov test was used to assess the normality of the distribution. If a non-normal distribution was shown, a logarithmic transformation was performed, and the Bland-Altman analysis was applied to the transformed data. The antilogarithm was then taken to obtain values expressed as ratios of the measurements by the two observers, which makes the LOA and the results of the Bland-Altman analysis easier to interpret [22,23]. Moreover, the conclusion on reliability should be made based on the width of the LOA in comparison with a priori defined clinical criteria [23,24]. The clinical criteria for thyroid nodule volume using the ellipsoid formula were reported to be between ±13.1% and ±48.6% [16,18,19]. Therefore, the acceptable clinical criterion for volume in this study was an LOA ranging from 0.5 to 1.5. The inter-observer reliability was determined based on the total number of nodules and then on the subgroups defined by the initial volume before RFA, namely <10 ml and ≥10 ml.
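The log-transformed Bland-Altman procedure described above can be sketched as follows. This is an illustrative outline, not the SPSS or GraphPad Prism workflow used in the study: the function names and paired values are hypothetical, natural logarithms are assumed (any base gives the same ratios after back-transformation), and the sample standard deviation is assumed for SD.

```python
import math
import statistics


def bland_altman_ratio(observer_a, observer_b):
    """Bland-Altman analysis on log-transformed paired measurements.

    Returns (mean ratio, lower 95% LOA, upper 95% LOA) after taking the
    antilogarithm, so each value is a ratio of observer A to observer B.
    """
    log_diffs = [math.log(a) - math.log(b)
                 for a, b in zip(observer_a, observer_b)]
    bias = statistics.mean(log_diffs)
    sd = statistics.stdev(log_diffs)  # sample SD (n - 1), an assumption
    return (math.exp(bias),
            math.exp(bias - 1.96 * sd),
            math.exp(bias + 1.96 * sd))


def within_clinical_criteria(lower, upper, low=0.5, high=1.5):
    """Check the LOA against the 0.5 to 1.5 criterion stated in the text."""
    return low <= lower and upper <= high


# Hypothetical paired volumes (ml) for three nodules
volumes_a = [1.0, 2.0, 4.0]
volumes_b = [1.1, 1.9, 4.2]
mean_ratio, lower_loa, upper_loa = bland_altman_ratio(volumes_a, volumes_b)
print(mean_ratio, lower_loa, upper_loa)
print(within_clinical_criteria(lower_loa, upper_loa))
```

A mean ratio close to 1 indicates no systematic bias between observers, and the back-transformed limits read directly as "observer A measures between lower and upper times observer B" in about 95% of cases.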

Results
The clinical characteristics of patients are presented in Table 1. A total of 76 patients (67 females, nine males) with 88 benign thyroid nodules were evaluated in this study. The mean age was 46.53 ± 10.21 years and the initial volume was 10.49 ± 14.87 ml. The number of nodules <10 ml was 59 and ≥10 ml was 29.
During the RFA, the mean power was 5.67 ± 1.78 W. The mean RFA time was 376.37 ± 264.32 s and the mean energy was 2110.67 ± 1865.17 J.
The overall incidence of nodule regrowth was 20.45% (18/88). All the regrowth nodules were confirmed benign by CNB. Regrowth developed at the following time points: 1 at 6 months, 5 at 12 months, 7 at 24 months, 3 at 36 months and 2 at 48 months. The mean time to nodule regrowth was 24.33 ± 12.42 months.

Inter-observer reliability
The volume and VRR measured by the two observers at each follow-up period are summarized in Table 2. No significant differences in volume and VRR were found (all p > .05). The inter-observer reliability of volume and VRR by the two observers is presented in Table 3. The Bland-Altman analysis showed that after antilogarithms, the mean differences of volume and VRR were approximately 1 during the follow-up period. The 95% LOA of volume became wider over the follow-up period (Figure 2), whereas the 95% LOA of VRR became narrower over the follow-up period (Figure 3). After the antilogarithm, the largest 95% LOA of volume was from 0.8471 to 1.1733 at 36 months, which meant that in approximately 95% of cases, the volume measured by observer A was between 0.8471 and 1.1733 times the volume measured by observer B. The largest 95% LOA of VRR was from 0.9541 to 1.0469 at 1 month, which meant that the VRR measured by observer A was between 0.9541 and 1.0469 times the VRR measured by observer B. The same interpretation applies to all the LOA reported hereinafter. At the same follow-up period, VRR had a narrower 95% LOA than volume. Moreover, compared with an initial volume ≥10 ml at the same follow-up period, an initial volume <10 ml revealed a larger 95% LOA in both the volume and VRR.
The inter-observer reliability of volume in the non-regrowth and regrowth groups is presented in Table 4. The mean differences in both groups were also approximately 1. The 95% LOA in the non-regrowth group became larger over the follow-up, and the largest 95% LOA was from 0.8264 to 1.2248 at 36 months. By contrast, in the regrowth group, the 95% LOA increased in the first 12 months and then decreased slightly. The largest 95% LOA was from 0.9038 to 1.1284 at 12 months (Figure 4).

Discussion
The inter-observer reliability of thyroid nodule measurement during the follow-up of thermal ablation is unclear. This study showed that the 95% LOA of volume became wider over the follow-up period, whereas the 95% LOA of VRR became narrower over the follow-up period. The largest 95% LOA of volume was within the clinical criteria. At the same follow-up period, VRR had a narrower 95% LOA than volume. Compared with volume ≥10 ml, nodules <10 ml revealed a larger 95% LOA in both the volume and VRR.
US, as a safe, cost-effective and noninvasive measurement modality, is the most common method to evaluate thyroid nodules, both in the pretreatment and posttreatment setting [12]. After ablation, changes in nodule volume directly reflected treatment efficacy and indicated the need for nodule management during the follow-up period [16]. Other parameters for efficacy, such as VRR, technique success and nodule regrowth, were all based on the volume measurement. Therefore, there is a definite requirement for accurate and reliable volume measurement on US. However, US has a major drawback, which is its observer dependence [18]. Brauer et al. [19] found that the inter-observer reliability of thyroid volume was 48.96% by a logarithmic method and suggested that volume changes of at least 49% could be interpreted as nodule reduction, growth or therapy effects, which was also recommended by the 2015 American Thyroid Association Guidelines as the cutoff value of volume change [9]. Recently, two studies used the Bland-Altman analysis to evaluate the inter-observer reliability of nodule volume measurement and found that the 95% LOA was ±13.1% and ±48.6%, respectively [16,18]. However, to the best of our knowledge, no study has reported the inter-observer reliability of nodule measurement after ablation. This study found that as the nodule volume decreased over the follow-up period, its 95% LOA became wider. The largest 95% LOA of volume was between 0.8471 and 1.1733 at 36 months, which was still within the clinical criteria. This indicated that the volume evaluation during the follow-up by different observers was acceptable. Only reliable post-ablation evaluation can help better understand the efficacy of RFA, which reduces the volume of nodules and relieves the related symptoms and cosmetic problems, making RFA a reliable option for patients with symptomatic benign nodules.
Moreover, this study also found that compared with an initial volume of ≥10 ml at the same follow-up period, an initial volume <10 ml revealed a larger 95% LOA. This was consistent with a previous study by Choi et al. [16], which found that the inter-observer variability in volume measurement was greater among the thyroid nodules with a maximum diameter of <2 cm than among nodules with a maximum diameter of ≥2 cm. These results indicated that the inter-observer variability of volume measurement might be greater in patients with small nodules, and thus small nodules should be measured with particular caution.
VRR is another important index for post-ablation evaluation. VRR ≥50% is defined as technique success, which was also used to evaluate the efficacy of ablation [15]. A meta-analysis showed that the volume reduction after RFA was significant, and the VRR of benign nodules was 87% at 24 months after RFA [25]. A similar VRR was observed in this study, which was 88% at 24 months. After a mean follow-up time of 35.52 ± 12.82 months, the VRR was 94%. Furthermore, we also investigated the 95% LOA of VRR at each follow-up period. In this study, the 95% LOA of VRR became narrower over the follow-up period, and the largest one was between 0.9541 and 1.0469 at 1 month. Compared with the volume at each follow-up period, VRR had a much narrower 95% LOA, suggesting that the inter-observer reliability of VRR was much better than that of volume. The explanation might be associated with its calculation formula, which could also be converted into VRR = [1 − (final volume/initial volume)] × 100%. As the initial volume was determined before ablation in the routine clinical procedure, its inter-observer reliability was not available. Therefore, as the 95% LOA of volume became wider over the follow-up period, the 95% LOA of VRR became narrower.
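The explanation offered above can be illustrated numerically. The sketch below is only an illustration of that reasoning, not an analysis from the study. It assumes both observers share one initial volume (as in the routine procedure described), and the function name and input values are hypothetical, chosen to resemble the reported scale of results (a volume ratio near 1.17 and a residual volume of 6% of the initial volume, i.e. a VRR of 94%).

```python
def vrr_ratio(volume_ratio, residual_fraction):
    """Ratio of observer A's VRR to observer B's VRR.

    volume_ratio: observer A's final volume divided by observer B's.
    residual_fraction: observer B's final volume divided by the shared
    initial volume, so that VRR_B = 1 - residual_fraction.
    """
    vrr_b = 1.0 - residual_fraction
    vrr_a = 1.0 - volume_ratio * residual_fraction
    return vrr_a / vrr_b


# Same 17% disagreement in final volume at two stages of shrinkage
early = vrr_ratio(1.17, 0.50)  # nodule at half its initial volume
late = vrr_ratio(1.17, 0.06)   # nodule at 6% of its initial volume
print(early, late)
```

Under these assumptions, the same relative disagreement in final volume moves the VRR ratio much less once the nodule has shrunk, because the final volume then contributes only a small term to 1 − (final volume/initial volume).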
RFA and other thermal ablation techniques have been considered safe and effective alternatives to surgery for benign thyroid nodules [4,5,9,10,11,12], and the long-term efficacy has attracted research attention. Several studies reported that 2-3 years after ablation, 4.1-37.5% of the treated nodules developed regrowth [26][27][28][29][30][31][32]. Nodule regrowth may be associated with incompletely resolved symptoms [33], or may even be a potential sign of malignancy [12]; thus, most patients needed additional treatment. Sim et al. [27] reported that the nodule volume consisted of two parts after RFA, namely the ablated volume and the vital volume. They found that tracing the vital volume, calculated as the total volume minus the ablated volume, could indicate regrowth earlier than tracing the total volume. Schiaffino et al. [34] recently reported that the reliability of ablated volume measurement was higher with CEUS than with US, and thus CEUS should be used for the early prediction of regrowth. However, the inter-observer reliability of the total volume measurement was not reported. Considering that the definition of regrowth was an increase in total volume 50% over the previously recorded smallest volume [15], it was very important to obtain a reliable volume measurement at each follow-up period. This study showed that in the regrowth group, each follow-up presented mean differences of approximately 1. The 95% LOA became wider in the first 12 months and then decreased slightly. The largest 95% LOA was at 12 months and within the clinical criteria. These results showed that the evaluation of regrowth nodules was not affected by the volume measured by different observers during the follow-up period.
US measurements were based on imaging plane acquisition and manipulation of calipers on the screen [18]. The variation in measurements depended on several factors, including transducer location, angulation, pressure and the differences in the manipulation of the calipers [18]. In the clinical routine procedure of post-ablation assessment, it was almost impossible to have the nodule measured by the same observer at each follow-up period. It was also difficult to reproduce the same two-dimensional image plane at each follow-up period even for the same observer. Although this study showed that the 95% LOA of volume was within the clinical criteria, the post-ablation measurement still needed to be carefully performed. The measurement methods of the nodule after ablation should be standardized. The definition of the three diameters needed to be clarified, and the measurements could be repeated twice to obtain the mean volume. Physicians should also review the previous US images and check the ablated nodule location before measurement. When the ablated nodule decreased and was reabsorbed during the follow-up, it might be difficult to differentiate it from other small nodules, particularly in patients with multiple nodules.
There were some limitations to this study. First, it was a single-center study. Second, the sample size was relatively small. Given the number of nodules in each subgroup, the nodules in this study were not divided into the three subgroups recommended by the recent reporting criteria for thyroid ablation [15]. Third, the follow-up time was relatively short. Longer-term inter-observer reliability of US measurements needs further investigation. Moreover, Sim et al. [27] found that there were two peaks of regrowth after ablation. The first peak began at 12 months and tended to be prominent at 2-4 years of follow-up, and the second peak appeared 5 years later. As the follow-up time was 35.52 ± 12.82 months in this study, the inter-observer reliability of the US measurements of regrowth nodules was focused on the first peak of regrowth.
Table 4. Inter-observer reliability of volume in the non-regrowth group and regrowth group during the follow-up by the Bland-Altman analysis.
In conclusion, the inter-observer reliability in the US measurements of benign thyroid nodules during the follow-up period of RFA was acceptable. In clinical post-ablation follow-up, the variation by different observers could not affect the evaluation of efficacy.

Disclosure statement
The authors declare that there is no conflict of interest that could be perceived as prejudicing the impartiality of the research reported.