# An artificial intelligence algorithm is highly accurate in detecting endoscopic features of eosinophilic esophagitis

This was a three-phase study in which an AI model was trained to detect EoE on white light endoscopic images. In the first phase, the AI model was trained and validated on an internal dataset (InD). In the second phase, its performance was tested on an external dataset (ExD) from a separate hospital; this phase also examined the value of incorporating EREFS scores into the AI model. In the third phase, the performance of the AI model was compared with that of human endoscopists with different levels of experience.

### Data and image acquisition

Pathology reports archived in the laboratory information system (Nexus, Frankfurt am Main, Germany) of the Institute for Pathology and Molecular Diagnostics at the University Hospital of Augsburg, Germany, were searched for the German terms “Ösophagus” and “Eosinophile Ösophagitis”. For patients identified over the 10-year period between 06/2010 and 05/2020, the corresponding endoscopic reports and white light images were extracted from the endoscopy database (Viewpoint 5, GE Healthcare Systems, Germany) of the University Hospital of Augsburg, Germany, by two board-certified gastroenterologists. Endoscopic images were selected for AI training based on the following criteria:

(1) Inclusion criteria:

• Images of patients with active EoE (≥ 15 eosinophils/HPF) who were diagnosed according to consensus guidelines [5]

• Images of patients with an endoscopically normal esophagus who also had normal esophageal biopsies

(2) Exclusion Criteria:

• Images with other visible pathology, such as reflux esophagitis, candida esophagitis, mass, or other findings

• Images with visible stricture formation or stenosis

• Poor-quality images (blurred, out of focus, or with excessive bubbles, blood, or mucus covering the mucosa)

All InD images were taken with an Olympus gastroscope (GIF-HQ190, GIF-HQ-180; Olympus Medical Systems, Tokyo, Japan) at University Hospital Augsburg, Germany.

### EREFS

The images were evaluated for EREFS by two board-certified gastroenterologists. EREFS were reported using the standard scoring system: edema (0–1 points), rings (0–3 points), exudates (0–2 points), and furrows (0–2 points) [7, 10]. Images with obvious strictures were excluded (total score range, 0 to 8) because it was assumed that the additional benefit of AI in patients with stricture formation or stenosis is limited, and that the real challenge lies in identifying EoE patients with more subtle endoscopic features, who are probably in an earlier phase of the disease.
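The score range quoted above follows directly from the per-component maxima (1 + 3 + 2 + 2 = 8). The following is a minimal sketch of that bookkeeping, not the authors' tooling; the function name, dictionary layout, and example scores are hypothetical.

```python
# Maximum points per EREFS component as listed in the text
# (the stricture component is left out, as in the study).
EREFS_MAX = {"edema": 1, "rings": 3, "exudates": 2, "furrows": 2}


def erefs_total(scores: dict) -> int:
    """Validate per-component scores and return their sum (0 to 8)."""
    total = 0
    for name, max_points in EREFS_MAX.items():
        value = scores[name]
        if not isinstance(value, int) or not 0 <= value <= max_points:
            raise ValueError(f"{name} must be an integer in 0..{max_points}, got {value!r}")
        total += value
    return total


# Hypothetical scores for a single image, for illustration only.
example = {"edema": 1, "rings": 2, "exudates": 0, "furrows": 1}
print(erefs_total(example))  # 4
```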

In addition to the main binary classification branch (EoE vs normal), a specific auxiliary branch for each of the EREFS categories was included in the training phase of the AI system. In other words, two AI models were trained, one with (AI-EoE-EREFS) and a second without the EREFS auxiliary categories (AI-EoE).

### Building and training AI models

The training of both AI models was based on a CNN with a ResNet architecture [22]. The models were pre-trained on a non-medical dataset (ImageNet [23]) to learn basic abstract visual features. The final classification layer of the neural network was then adjusted to allow binary classification (EoE vs. normal). The probability threshold was set at 0.5. Prior to training, InD images were cropped to exclude black borders and resized to ensure consistency across the dataset, after which data augmentation, including image scaling and shifting, was applied. The data augmentation was intended to make the algorithm more robust to slight variations in the input images. During training, model parameters were optimized to minimize a cross-entropy loss with label smoothing, so as to produce the global binary prediction and to classify the individual EREFS features correctly. The models were trained for 6000 iterations with a batch size of 48 and a sampling strategy that represented both classes equally in each batch. The initial learning rate and weight decay for the stochastic gradient descent algorithm were set to 0.01 and 5e−4, respectively. During training, the learning rate was decreased following a cosine annealing schedule. All models were implemented in the PyTorch deep-learning framework.
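Two ingredients of this training recipe can be written out in a few lines of plain Python: the cosine annealing of the learning rate and the cross-entropy against a label-smoothed target. This is an illustrative sketch only, not the authors' PyTorch code. The final learning rate of 0 and the smoothing factor `epsilon = 0.1` are assumptions, since the text reports neither.

```python
import math

BASE_LR = 0.01       # initial learning rate, from the text
TOTAL_ITERS = 6000   # number of training iterations, from the text


def cosine_lr(iteration: int, base_lr: float = BASE_LR,
              total: int = TOTAL_ITERS, min_lr: float = 0.0) -> float:
    """Cosine annealing from base_lr down to min_lr (min_lr = 0 is an assumption)."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * iteration / total))


def smoothed_cross_entropy(logits, target: int, epsilon: float = 0.1) -> float:
    """Cross-entropy of softmax(logits) against a label-smoothed target.

    epsilon is a hypothetical smoothing factor; the text does not report it.
    """
    n = len(logits)
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    log_probs = [z - log_norm for z in logits]
    # Smoothed target: (1 - epsilon) on the true class plus epsilon spread uniformly.
    soft = [(1.0 - epsilon) * (1.0 if i == target else 0.0) + epsilon / n for i in range(n)]
    return -sum(q * lp for q, lp in zip(soft, log_probs))


for it in (0, 3000, 6000):
    print(it, cosine_lr(it))
```

Under these assumptions the learning rate starts at 0.01, passes through roughly half that value at iteration 3000, and reaches approximately zero at iteration 6000.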

### Internal validation

To internally validate the models, we performed five repeated rounds of five-fold cross-validation. In five-fold cross-validation, the dataset is divided into five disjoint subsets (folds). Four of the five folds are used as training data for the algorithm, and the remaining fold is held out as the validation set. The procedure is repeated until each fold has served as the validation set once. We did not perform hyperparameter optimization or early stopping on the validation set, but trained our algorithms for a fixed number of iterations. The cross-validation scheme was repeated five times with different random fold compositions, using seeds 0 to 4 for the random number generators.
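The splitting scheme described here can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a plain random split of image indices, with no stratification by class or grouping by patient, and the function name and sample size are hypothetical.

```python
import random


def repeated_kfold(n_samples: int, n_folds: int = 5, n_repeats: int = 5):
    """Yield (repeat, fold, train_idx, val_idx) for repeated k-fold cross-validation.

    Repeat r shuffles the indices with seed r, mirroring the seeds 0 to 4 in the text.
    """
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        random.Random(repeat).shuffle(indices)
        # Deal indices round-robin so that fold sizes differ by at most one.
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for fold, val_idx in enumerate(folds):
            train_idx = [i for j, f in enumerate(folds) if j != fold for i in f]
            yield repeat, fold, train_idx, val_idx


splits = list(repeated_kfold(n_samples=23))  # 23 is an arbitrary illustrative size
print(len(splits))  # 25 train/validation splits (5 repeats x 5 folds)
```

Each of the 25 splits would train one model for the fixed number of iterations, with the held-out fold used only for scoring.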

### Test set with external data

After building the AI models, we evaluated their performance on an independent, externally acquired test set (ExD). The ExD comprised 200 white light (WL) images: 100 images of EoE patients with active disease (≥ 15 eos/HPF) diagnosed according to consensus guidelines, and 100 images of normal esophagus from patients without any visible, histological, or known esophageal pathology. The test set was provided by the University of North Carolina, Chapel Hill (UNC), from patients who underwent endoscopy between August 2020 and January 2021. Neither AI algorithm had seen the ExD images before evaluation. Evaluation and analysis of these images were performed blinded; the labels (EoE vs. normal) were revealed only after the results of AI-EoE and AI-EoE-EREFS had been finalized and transmitted to UNC. Example images are shown in Figs. 1 and 2. For the external evaluation, an ensemble of the five individual models from the first round of cross-validation was used.
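The text does not state how the outputs of the five cross-validation models were combined for the external evaluation. One common choice, shown here purely as an assumption, is to average the predicted EoE probabilities and apply the 0.5 threshold to the mean. The function name and the probabilities below are hypothetical.

```python
def ensemble_predict(model_probs, threshold: float = 0.5):
    """Average per-model EoE probabilities and apply the decision threshold.

    model_probs: one list of probabilities per model, all of equal length.
    Returns (mean_probabilities, labels) with label 1 = EoE and 0 = normal.
    Averaging is an assumed combination rule, not one confirmed by the text.
    """
    n_models = len(model_probs)
    n_images = len(model_probs[0])
    if any(len(p) != n_images for p in model_probs):
        raise ValueError("every model must score the same images")
    mean_probs = [sum(p[i] for p in model_probs) / n_models for i in range(n_images)]
    labels = [1 if p >= threshold else 0 for p in mean_probs]
    return mean_probs, labels


# Hypothetical outputs of five models on three images, for illustration only.
probs = [
    [0.9, 0.2, 0.6],
    [0.8, 0.1, 0.5],
    [0.7, 0.3, 0.7],
    [0.9, 0.2, 0.5],
    [0.7, 0.2, 0.7],
]
mean_probs, labels = ensemble_predict(probs)
print(labels)  # [1, 0, 1]
```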

### Image evaluation by endoscopists

To better understand the performance of AI-EoE and the impact of EREFS on diagnostic accuracy, ExD images were evaluated by six endoscopists, who were categorized according to their level of experience:

1. Beginners in endoscopy (n = 2)
2. Senior fellows (n = 2)
3. Consultant endoscopists (n = 2)

Endoscopists were asked to assess the images for the presence of EoE according to the following process:

### Group 1

Evaluation of all 200 ExD images (1–200) according to the endoscopist’s clinical impression after viewing the images without explicit use of EREFS.

### Group 2

Evaluation of the first 100 ExD images (1–100) according to the clinical impression of the endoscopist. The endoscopists were then asked to review the original description of the EREFS criteria by Hirano et al. [10]; they were also shown 30 representative white light endoscopic images of EoE with the corresponding EREFS scores. Following this training phase, the second 100 images (101–200) were evaluated using the EREFS score. The evaluation of the first 100 images served to adjust for each endoscopist’s individual performance. The evaluation of the second 100 images was performed to quantify the improvement in diagnosis with EREFS explicitly in mind. Each group included one endoscopist from each level of experience.

### Statistical analysis and outcome measures

Sensitivity, specificity, accuracy, area under the ROC curve (AUC), and the harmonic mean (F1) of sensitivity and precision on the ExD images were used to measure the performance of the models AI-EoE and AI-EoE-EREFS, trained without and with the additional EREFS branches, respectively. These statistics are calculated from the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) produced by the algorithm.

$$\text{Harmonic mean (F1)} = \frac{2\,\text{TP}}{2\,\text{TP} + \text{FP} + \text{FN}}$$

$$\text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

$$\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}}$$

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}.$$
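The four formulas above translate directly into a small helper. This is a minimal sketch rather than the authors' evaluation code; the function name and the counts in the usage example are hypothetical and are not study results. The ratio (TP + TN) / (TP + TN + FP + FN) is returned under the key `accuracy`, its conventional name.

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity, specificity, accuracy and F1 from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts for a 200-image test set (100 EoE, 100 normal).
metrics = confusion_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The helper does not guard against empty classes: a test set without positives or without negatives would raise `ZeroDivisionError`.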

Statistical significance between groups was determined with McNemar’s test.
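McNemar’s test compares two raters on the same images using only the discordant pairs, that is, the images that exactly one of the two classified correctly. The text does not say which variant was used; the sketch below assumes the exact (binomial) two-sided version and uses hypothetical names and data.

```python
from math import comb


def mcnemar_exact(a_correct, b_correct) -> float:
    """Two-sided exact McNemar p-value for two raters scored on the same images.

    a_correct / b_correct: sequences of booleans, True where that rater was right.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both raters must be scored on the same images")
    # Discordant pairs: exactly one of the two raters classified the image correctly.
    only_a = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    only_b = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, hence no evidence of a difference
    k = min(only_a, only_b)
    # Under the null hypothesis the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical example: rater A is right on 8 images that rater B gets wrong,
# and both are right on 5 further images.
rater_a = [True] * 8 + [True] * 5
rater_b = [False] * 8 + [True] * 5
print(mcnemar_exact(rater_a, rater_b))  # 0.0078125
```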

By testing several models, we investigated whether the inclusion of EREFS criteria leads to improved performance of AI-EoE.

The performance of the human endoscopists on the same dataset (ExD) was also evaluated using the metrics described above.

### Ethics

Ethical approval was granted by the institutional review board of Augsburg University Hospital (BKF Nr. CCE03022021_0002, date: 04/07/2020), as well as the institutional review board of UNC (number 20-3655; date of initial approval: January 28, 2021). All methods used in this study were performed in accordance with the Declaration of Helsinki and in accordance with relevant guidelines and regulations. All images used in this study were obtained from endoscopic procedures for which patients had given informed consent. For patients under the age of 16, parents or legal representatives provided informed consent.