AIMS: A Strong New Way to Pick Key Data for Radiology
The adoption of AI in healthcare faces several significant challenges despite its vast potential benefits. The first is patient privacy. Privacy regulations (such as HIPAA in the U.S. and GDPR in Europe) impose strict controls on the sharing of healthcare data, which limits the diversity of the datasets that AI developers can access. This limitation introduces bias, since the data available for model training may not fully represent the broader patient population. When datasets are biased or non-diverse, AI models risk delivering inaccurate or inequitable healthcare predictions. This data bottleneck makes it challenging to build robust AI solutions that generalize well across different populations. The second challenge is the heavy dependence on clinical supervision during development. Such supervision is crucial for model reliability, yet it is also a major bottleneck to scaling AI solutions in healthcare, since it drives up both costs and time requirements.
Both of these factors contribute to the slow adoption of AI in healthcare. A balance needs to be struck between maintaining patient privacy and enabling the use of diverse datasets to improve AI accuracy and performance. Additionally, finding innovative ways to reduce the need for costly clinical supervision in data labeling could help accelerate AI implementation in the medical field.
Coreset selection is a powerful technique in machine learning and active learning for reducing data size while maintaining representativeness. By selecting a small subset of the data (the "coreset"), we can make training faster and less resource-intensive while still achieving good model performance.
In the context of our study on radiology, we focus on the AIMS coreset selection method and evaluate its effectiveness against random sampling. The main advantage of coreset selection is that it reduces data volume and minimizes biases, which is particularly useful when training models on noisy or unbalanced datasets, both common in medical imaging.
Because we focus here on a single iteration (without an active learning loop), we are primarily analyzing the quality of the initial subset chosen by the coreset method. This makes it easier to compare directly against random sampling and to measure the robustness of the models to noise using the F1 score.
Using the F1 score as the primary evaluation metric reflects our concern with both precision and recall, crucial factors in medical applications like radiology, where both false positives and false negatives can have serious consequences:
F1 = 2 * Precision * Recall / (Precision + Recall).
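As a quick illustration, scikit-learn computes this directly; the labels below are made up for the example, not taken from our experiments:

```python
from sklearn.metrics import f1_score

# Toy labels for the four classes (0 = glioma, 1 = meningioma,
# 2 = pituitary, 3 = notumor); illustrative values only.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 3, 3, 3]

print(f1_score(y_true, y_pred, average="macro"))  # single aggregate F1
print(f1_score(y_true, y_pred, average=None))     # one F1 per class
```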
Comparing a coreset-selected model with a randomly sub-sampled model is an especially critical study in the healthcare context, with its focus on trust, robustness, and generalization. In healthcare, AI models must demonstrate robustness, meaning they should perform reliably across different scenarios and datasets, even when encountering new or noisy data. A robust AI model should generalize well, avoiding overfitting to the training data, so that it accurately predicts outcomes on unseen data from diverse patient populations. Model robustness also includes resilience against adversarial attacks, outlier conditions, and biased data distributions that can mislead predictions. To mimic real-world scenarios, we add noise to the testing dataset to simulate the output of different devices. This approach helps evaluate how well the model generalizes to variations in input data, reflecting its performance in diverse clinical settings.
As a sample dataset we use the Brain Tumor MRI dataset.
It contains 7023 human brain MRI images in 4 classes: glioma, meningioma, pituitary, and no tumor. It combines data from different sources: figshare and Br35H.
Exploratory data analysis (EDA) is a crucial but often overlooked step in machine learning. It involves collecting, cleaning, and analyzing data to identify patterns, remove outliers, and prepare features. That is where we started. A first simple analysis shows that the dataset contains 806 identical images. The cleaned dataset, available here, contains 6217 images. It is a fairly balanced dataset (a sketch of the duplicate detection follows the class counts below):
- glioma - 1620
- meningioma - 1526
- pituitary - 1738
- notumor - 1334
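A minimal sketch of how such exact duplicates can be detected by hashing raw file bytes; the folder layout and the .jpg extension are assumptions about how the dataset is stored:

```python
import hashlib
from pathlib import Path

def find_exact_duplicates(root: str) -> list[Path]:
    """Return paths of images whose bytes match an earlier file."""
    seen: dict[str, Path] = {}
    duplicates: list[Path] = []
    for path in sorted(Path(root).rglob("*.jpg")):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        if digest in seen:
            duplicates.append(path)   # byte-identical to seen[digest]
        else:
            seen[digest] = path
    return duplicates

# Hypothetical layout: dataset/<class_name>/<image>.jpg
print(len(find_exact_duplicates("dataset")))
```

Note that byte-level hashing only catches exact copies; near-duplicates (rescaled or re-encoded images) would require perceptual hashing instead.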
Once the dataset is cleaned, we can proceed to coreset selection and then training.
Coreset Selection: A Brief Overview.
Coreset selection aims to extract a small subset of data points from a larger dataset while preserving its essential characteristics. This subset, known as a coreset, can be used to train machine learning models with minimal loss in performance compared to training on the entire dataset. In general, the effectiveness of a coreset selection method depends on its ability to capture the underlying structure and diversity of the original data.
AIMS AI: A Novel Coreset Selection Approach.
AIMS AI is a coreset selection method proposed by FiveBrane that stands out for its robustness to noise and its ability to capture the intrinsic geometry of the data. It employs a two-step process (a generic sketch follows the list below):
- Feature Selection: AIMS AI identifies the most informative features by analyzing their contribution to the overall variance of the data. This step helps to reduce the dimensionality of the data and mitigate the impact of noise.
- Coreset Construction: Using the selected features, AIMS AI constructs a coreset by iteratively selecting data points that maximize the diversity and representativeness of the original dataset.
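FiveBrane has not published the internals of AIMS, so the following is only a generic sketch of this two-step pattern, with PCA standing in for the variance-based feature selection and a standard k-center greedy rule standing in for the diversity-maximizing construction:

```python
import numpy as np
from sklearn.decomposition import PCA

def greedy_coreset(features: np.ndarray, k: int) -> list[int]:
    """k-center greedy: repeatedly add the point farthest from the coreset."""
    selected = [0]  # seed with an arbitrary first point
    dists = np.linalg.norm(features - features[0], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dists))  # most poorly covered point
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(features - features[idx], axis=1))
    return selected

# Step 1: variance-driven dimensionality reduction (placeholder embeddings).
embeddings = np.random.rand(6217, 2048)
reduced = PCA(n_components=64).fit_transform(embeddings)

# Step 2: diversity-driven coreset construction, here roughly 1% of the data.
coreset_indices = greedy_coreset(reduced, k=62)
```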
Experimental Evaluation.
To evaluate the performance of AIMS against random sampling, we conducted experiments on the brain MRI dataset. The benchmarking methodology is as follows (a runnable sketch of the data split appears after the list):
- Select 1% of the data and split it into training and validation sets; use the remaining 99% as the testing subset.
- Train the model on this data.
- Save the best model based on validation accuracy and validation loss.
- Record the F1 score on the testing dataset.
- Apply noise to the testing dataset.
- Record the F1 score on the noisy testing dataset.
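A condensed, runnable sketch of the data split; the 80/20 training/validation ratio is our assumption here, and the model training itself is shown in the next snippet:

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42                        # fixed seed for reproducibility
rng = np.random.default_rng(SEED)

n_images = 6217                  # size of the cleaned dataset
indices = np.arange(n_images)

# Random baseline: draw 1% of the data; AIMS would supply its own subset here.
subset = rng.choice(indices, size=int(0.01 * n_images), replace=False)
test_idx = np.setdiff1d(indices, subset)   # remaining 99% is the testing subset

# Split the 1% into training and validation sets (assumed 80/20 ratio).
train_idx, val_idx = train_test_split(subset, test_size=0.2, random_state=SEED)
```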
We introduced varying levels of noise to the dataset to assess the robustness of both methods. We fine-tuned InceptionV3 with Focal Loss and an SGD optimizer with a cosine annealing learning rate; see details in [1], [2].
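A minimal PyTorch version of this training setup; the hyperparameter values (gamma, learning rate, momentum, T_max) are placeholders rather than the exact ones we used:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class FocalLoss(nn.Module):
    """Focal loss [2]: scales cross-entropy by (1 - p_t)^gamma to
    down-weight easy examples and focus training on hard ones."""
    def __init__(self, gamma: float = 2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits, targets):
        ce = F.cross_entropy(logits, targets, reduction="none")
        p_t = torch.exp(-ce)                  # probability of the true class
        return ((1 - p_t) ** self.gamma * ce).mean()

# InceptionV3 fine-tuning: swap both classifier heads for 4-class ones
# (glioma, meningioma, pituitary, notumor).
model = models.inception_v3(weights=models.Inception_V3_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 4)
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, 4)

criterion = FocalLoss(gamma=2.0)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
```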
Results
| Subsampling method | Glioma | Meningioma | Pituitary | Notumor |
|---|---|---|---|---|
| AIMS | 51 from 1555 | 37 from 1476 | 61 from 1662 | 51 from 1274 |
| Random | 42 from 1570 | 46 from 1467 | 50 from 1665 | 62 from 1265 |
Representation of each class in the training/testing datasets.
| Subsampling method | F1 | F1-Glioma | F1-Meningioma | F1-Pituitary | F1-Notumor |
|---|---|---|---|---|---|
| AIMS | 0.888 | 0.880 | 0.797 | 0.930 | 0.930 |
| Random | 0.889 | 0.890 | 0.800 | 0.920 | 0.945 |
F1 score for each subsampling method and class.
Random sampling and AIMS produce similar results on the original testing data.
Reasonably good performance from random sampling is expected for statistical reasons; however, we want to ensure that the model also performs well on data from a different source. For that reason we tested the models after adding Gaussian and salt-and-pepper noise to the testing datasets. The results of these experiments demonstrate the superiority of AIMS over random sampling. To keep the results reproducible, the random seed was fixed.
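For reference, both kinds of noise can be applied with a few lines of NumPy; the standard deviation and corruption fraction below are illustrative defaults, not the exact levels we tested:

```python
import numpy as np

def add_gaussian_noise(image: np.ndarray, std: float = 0.1) -> np.ndarray:
    """Additive Gaussian noise for an image scaled to [0, 1]."""
    noisy = image + np.random.normal(0.0, std, image.shape)
    return np.clip(noisy, 0.0, 1.0)

def add_salt_and_pepper(image: np.ndarray, amount: float = 0.05) -> np.ndarray:
    """Set a random fraction of pixels to 0 (pepper) or 1 (salt)."""
    noisy = image.copy()
    mask = np.random.random(image.shape)
    noisy[mask < amount / 2] = 0.0            # pepper
    noisy[mask > 1 - amount / 2] = 1.0        # salt
    return noisy
```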
Gaussian noise
| Subsampling method | F1 | Accuracy | F1-Glioma | F1-Meningioma | F1-Pituitary | F1-Notumor |
|---|---|---|---|---|---|---|
| AIMS | 0.572 | 56% | 0.311 | 0.423 | 0.623 | 0.748 |
| Random | 0.455 | 45.6% | 0.380 | 0.300 | 0.539 | 0.486 |
F1 scores when Gaussian noise is applied to the testing dataset.
Salt-and-pepper noise
| Subsampling method | Accuracy | F1-Glioma | F1-Meningioma | F1-Pituitary | F1-Notumor |
|---|---|---|---|---|---|
| AIMS | 57.66% | 0.67 | 0.31 | 0.70 | 0.56 |
| Random | 46% | 0.68 | 0.04 | 0.09 | 0.49 |
F1 scores when salt-and-pepper noise is applied to the testing dataset.
The results obtained by AIMS AI highlight its effectiveness as a coreset selection method for radiology applications. Its robustness to noise, its ability to achieve high F1 scores from small coresets, and its reduction of the computational burden make AIMS a promising solution for efficiently managing large medical imaging datasets. This could significantly accelerate the development and deployment of AI-powered solutions in radiology, making it a valuable tool for clinical use and research.
[1] Pranav Singh, Elena Sizikova, and Jacopo Cirrone. CASS: Cross Architectural Self-Supervision for Medical Image Analysis. 3rd Workshop on Self-Supervised Learning at NeurIPS 2022 (arXiv).
[2] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal Loss for Dense Object Detection. 2017 IEEE International Conference on Computer Vision (ICCV), pages 2999-3007, 2017.