Kvasir is a multi-class dataset from Bærum Hospital of Vestre Viken Health Trust (Norway), collected from 2010 to 201424. Kvasir (v2) contains 8000 endoscopic images labeled with eight distinct classes, with approximately 1000 images per class, including ulcerative colitis. Images are assigned image-level labels only, provided by at least one experienced endoscopist as well as medical trainees (minimum of three reviewers per label). The images are independent, with only one image per patient.
Standard endoscopy equipment was used. HyperKvasir is an extension of the Kvasir dataset, collected at the same Bærum Hospital from 2008 to 2016; it contains 110,079 images, of which 10,662 are labeled with 23 classes of findings25. Pathologic findings account for 12 of the 23 classes, which are aggregated and summarized in Table 1. They can be grouped into Barrett's esophagus and esophagitis in the upper gastrointestinal tract, and polyps, ulcerative colitis, and hemorrhoids in the lower gastrointestinal tract.
Importantly, the dataset includes 851 ulcerative colitis images that are labeled and graded using the Mayo Endoscopic Subscore26,27 by at least one board-certified gastroenterologist and one or more junior doctors or doctoral students (a total of three reviewers per image). The images are in JPEG format with various resolutions, the most common being 576×768, 576×720, and 1072×1920. Table 2 shows the number of images available for each Mayo grade.
The HyperKvasir study, including the HyperKvasir dataset available through the Center for Open Science that we use here, was approved by the Norwegian Personal Data Protection Authority and exempted from patient consent because the data was completely anonymized. All metadata was removed and all files were renamed with randomly generated filenames before Bærum Hospital's internal IT department exported the files from a central server. The study was exempted from approval by the Regional Ethics Committee for Medical and Health Research in South East Norway because the data collection did not interfere with patient care. Since the data is anonymous, the dataset can be shared publicly and complies with Norwegian law and the General Data Protection Regulation (GDPR). Beyond this, the data has not been pre-processed or augmented in any way.
Two binary classification tasks were formulated from the dataset:
Diagnosis: All pathologic findings for ulcerative colitis were contrasted with all other classes of pathologic findings in the dataset (Fig. 1a). The problem was formulated as a binary classification task to distinguish UC from non-UC pathology on endoscopic still images.
Grading: Assessment of disease severity using endoscopic images of UC pathology. Mayo-graded image labels were grouped into grades 0–1 and 2–3 (Fig. 1b). This grouping has been used in previous machine learning studies and for clinical trial endpoints19. The task was therefore to distinguish inactive/mild UC from moderate/severe UC.
A filter was designed to remove the green image overlay showing the position of the endoscope. The filter applied a uniform crop to all images, filling the removed pixels with 0 values, i.e., making them black.
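A minimal sketch of such a filter in NumPy, assuming a fixed-width overlay region (the study's exact crop geometry is not specified here, so `overlay_width` is a hypothetical value):

```python
import numpy as np

def remove_overlay(image, overlay_width=80):
    """Blank out a fixed-width strip (e.g. a scope-position overlay)
    by filling it with 0 values (black), keeping the original shape.

    overlay_width is a hypothetical illustration value, not the
    geometry actually used in the study.
    """
    out = image.copy()
    out[:, :overlay_width, :] = 0  # removed pixels become black
    return out

# tiny synthetic 4x4 RGB image of all ones
img = np.ones((4, 4, 3), dtype=np.uint8)
filtered = remove_overlay(img, overlay_width=2)
```

Zero-filling rather than cropping to a smaller frame keeps all images at a uniform size for the downstream resize step.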
The source images were then normalized to [−1, 1] and downscaled to 299×299 resolution using bilinear resampling. The images underwent random rotation, zoom, shear, and vertical and horizontal flip transformations, using a fixed seed. Image augmentation was applied only to images in the training set (not the validation or test sets), within each fold of the fivefold cross-validation.
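The normalization step maps 8-bit pixel values into [−1, 1]; a sketch in NumPy (the resize itself would use a bilinear routine such as `tf.image.resize`):

```python
import numpy as np

def normalize(image_uint8):
    """Map pixel values from [0, 255] to [-1, 1] before feeding
    them to the network."""
    return image_uint8.astype(np.float32) / 127.5 - 1.0

img = np.array([[0, 128, 255]], dtype=np.uint8)
norm = normalize(img)  # endpoints 0 and 255 map to -1.0 and +1.0
```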
There is a growing variety of machine learning frameworks that could form the basis of our study. Our choices here acknowledge the current dominance of deep neural network methods, despite emerging challenges of explainability (explainable artificial intelligence, XAI) and of confidence in practical clinical implementation41. All of our selections use the most popular method of image classification, convolutional neural networks (CNNs); the main differences between them lie in their depth (19 to 159 layers) and in the dimensionality of the feature vectors extracted from the images (up to 2048).
The following four different CNN architectures were tested on the Kvasir dataset:
Pre-trained InceptionV3, a 159-layer CNN. The output of InceptionV3 in this configuration is a 2048-dimensional feature vector28.
Pre-trained ResNet50, a Keras implementation of a 50-layer CNN that uses residual functions referencing inputs from the previous layer29.
Pre-trained VGG19, a Keras implementation of VGG, a 19-layer CNN developed by the Visual Geometry Group30.
Pre-trained DenseNet121, a Keras implementation of DenseNet with 121 layers31.
All pre-trained models were TensorFlow implementations initialized with ImageNet weights32. Training was done end-to-end with no freezing of layers. All models performed the final classification step via a one-node dense layer with sigmoid activation, trained with binary cross-entropy as the loss function.
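The sigmoid head and binary cross-entropy loss can be illustrated in plain NumPy terms (a sketch of the underlying math, not the authors' Keras code):

```python
import numpy as np

def sigmoid(z):
    """Activation of the final one-node dense layer: maps a logit
    to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Loss between true labels {0, 1} and sigmoid outputs."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_prob)
                    + (1.0 - y_true) * np.log(1.0 - y_prob))

# a confident correct prediction yields a near-zero loss
y_true = np.array([1.0, 0.0])
y_prob = sigmoid(np.array([10.0, -10.0]))  # close to 1.0 and 0.0
loss = binary_cross_entropy(y_true, y_prob)
```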
For both classification tasks, the final dataset was randomly shuffled and split into training and test sets in a 4:1 ratio: 80% of the images were used for fivefold cross-validation and the remaining 20% were held out as unseen images to evaluate model performance. The best model from each fold was combined into the final model used for prediction on the test set.
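The split and the per-fold ensemble can be sketched as follows; averaging the sigmoid outputs of the five best per-fold models is an assumption about how they were combined, and the random numbers stand in for real model predictions:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for the shuffle

n_images = 100
indices = rng.permutation(n_images)

# 4:1 split: 20% held-out test set, 80% for fivefold cross-validation
n_test = n_images // 5
test_idx, cv_idx = indices[:n_test], indices[n_test:]
folds = np.array_split(cv_idx, 5)  # five equal folds of the CV portion

# ensemble: average the outputs of the best model from each fold
fold_preds = np.stack([rng.random(n_test) for _ in range(5)])
ensemble = fold_preds.mean(axis=0)
```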
The hyperparameters were tuned using grid search over the following space: optimizer (Adam, stochastic gradient descent (SGD)); learning rate (0.01, 0.001, 0.0001); and momentum, for SGD only (0, 0.5, 0.9, 0.99). For all models, training ran for 20 epochs with a batch size of 32.
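Enumerating that search space (momentum applies only to SGD) yields 15 candidate configurations:

```python
from itertools import product

optimizers = ["adam", "sgd"]
learning_rates = [0.01, 0.001, 0.0001]
momenta = [0.0, 0.5, 0.9, 0.99]  # only meaningful for SGD

grid = []
for opt, lr in product(optimizers, learning_rates):
    if opt == "sgd":
        grid += [{"optimizer": opt, "lr": lr, "momentum": m}
                 for m in momenta]
    else:
        grid.append({"optimizer": opt, "lr": lr})

# 3 Adam settings + 3 * 4 SGD settings = 15 configurations to train
```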
Models were assessed using accuracy, recall, precision, and F1 score. As both tasks are binary classification problems, confusion matrices and ROC curves were used to visualize model performance.
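All four metrics derive from the cells of the binary confusion matrix; a minimal pure-Python sketch:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# toy example: one each of TP, FN, FP, TN
m = binary_metrics([1, 1, 0, 0], [1, 0, 1, 0])
```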
Explainability Analysis (XAI)
To provide a visual explanation of what the models learn, we chose the Gradient-weighted Class Activation Mapping (Grad-CAM) technique33. Grad-CAM produces a heatmap for each model output, showing which part(s) of the image the model relies on to make its prediction (i.e., which regions produce the strongest activation). The heatmap is a coarse localization map computed from the gradient information flowing into the final convolutional layer, which assigns an importance value to each neuron.
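The core Grad-CAM computation can be sketched on synthetic arrays; in a real pipeline the activations and gradients would come from the trained network (e.g. via `tf.GradientTape`), not be constructed by hand:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Coarse localization map from the final conv layer.

    activations: (H, W, C) feature maps of the final conv layer.
    gradients:   (H, W, C) gradients of the class score with
                 respect to those feature maps.
    """
    # importance weight per channel: global average of its gradients
    weights = gradients.mean(axis=(0, 1))                      # (C,)
    cam = np.tensordot(activations, weights, axes=([2], [0]))  # (H, W)
    cam = np.maximum(cam, 0.0)   # ReLU: keep positive evidence only
    if cam.max() > 0:
        cam /= cam.max()         # normalize heatmap to [0, 1]
    return cam

# synthetic stand-ins for the network's tensors
acts = np.ones((2, 2, 3))
grads = np.ones((2, 2, 3))
heatmap = grad_cam(acts, grads)
```

The heatmap is then upsampled to the input resolution and overlaid on the endoscopic image for inspection.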
We also asked an experienced gastroenterologist (DCB) to annotate and highlight regions of interest in representative images to provide comparison with regions of interest generated by heatmaps.
Model construction and figure generation were carried out using the TensorFlow and Keras packages32 in Python 3.6.9, running in Google Colab notebooks (https://research.google.com/colaboratory/).