Note:
This is a copy of the notebook I uploaded to Kaggle. You can find it here.
This notebook’s goal is to preprocess and upload the dataset Chest X-Ray Images to the Hugging Face Hub.
There will be two versions of this dataset. The first is a raw version of the images as provided by the Guangzhou Women and Children's Medical Center and the University of California San Diego. The second will be a preprocessed version of the dataset.
You can also find this dataset available on Kaggle.
Let’s first import the libraries we will use:
import os
import re
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from PIL import Image

# TensorFlow
import tensorflow as tf
from tensorflow import keras
from keras import utils

# Hugging Face
import datasets
import huggingface_hub
Local paths that store the files:
RAW_DATA_PATH = "../data/raw/pneumonia_xray/"
DATA_PATH = "../data/processed/"
TFRECORDS_PATH = "../data/processed/"
NAME_RAW_DATASET = "mmenendezg/raw_pneumonia_x_ray"
NAME_DATASET = "mmenendezg/pneumonia_x_ray"
AUTOTUNE = tf.data.AUTOTUNE
IMG_SIZE = (500, 500)
CLASSES = ["Normal", "Pneumonia"]
The dataset contains chest X-ray images from independent patients. The images are classified into two classes:
- 0: Normal
- 1: Pneumonia
The shape, aspect ratio, and size of the images vary. Some images have 3 color channels (i.e., RGB), while others have no explicit channel dimension (i.e., grayscale, where the single channel is implicit). It is important to take this into consideration when preprocessing the dataset.
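As a quick sanity check, we can inspect a few files with PIL to confirm the varying modes and sizes (a minimal sketch; it assumes the train split has already been copied into RAW_DATA_PATH):

sample_dir = os.path.join(RAW_DATA_PATH, "train", "normal")
for filename in sorted(os.listdir(sample_dir))[:5]:
    with Image.open(os.path.join(sample_dir, filename)) as image:
        # image.mode is "L" for grayscale and "RGB" for 3-channel images
        print(filename, image.mode, image.size)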
The structure of the folders of the original dataset is the following:
|-- pneumonia_x_ray
    |-- train
        |-- normal
        |-- pneumonia
    |-- test
        |-- normal
        |-- pneumonia
This folder will be copied to the RAW_DATA_PATH folder (see above).
The first version of the dataset keeps the images raw. This provides more flexibility to preprocess the images according to the needs of each project.
It is important to log in to Hugging Face to upload the dataset. In the code below, replace [TOKEN] with the one provided by Hugging Face.
!huggingface-cli login --token [TOKEN]
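Alternatively, since huggingface_hub is already imported, the login can be done programmatically (a sketch; it assumes the token is stored in an environment variable named HF_TOKEN rather than pasted into the notebook):

huggingface_hub.login(token=os.environ["HF_TOKEN"])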
The load_dataset() method allows us to download datasets stored in the Hugging Face Hub, or to load data stored locally. To load images from a local folder, it is necessary to pass "imagefolder" as the first argument, and the path of the folder containing the images as the data_dir argument.
This will automatically identify the structure of the folders (see above), create a DatasetDict containing the train and test splits, and assign the two labels in each split.
raw_dataset = datasets.load_dataset("imagefolder", data_dir=RAW_DATA_PATH)
raw_dataset
Once we have loaded the data, we can push the dataset to the hub:
raw_dataset.push_to_hub(NAME_RAW_DATASET)
Once the dataset has been successfully uploaded to the Hub, we can download the data using the same name set when pushing the dataset:
pneumonia_x_ray = datasets.load_dataset(NAME_RAW_DATASET)
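A quick way to confirm the round trip worked is to inspect one example (a sketch; the imagefolder builder exposes image and label columns, with images decoded lazily as PIL objects):

example = pneumonia_x_ray["train"][0]
# Prints the decoded PIL image type and the integer label (0 or 1)
print(type(example["image"]), example["label"])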
The raw images give us a lot of flexibility to work with the data, but this creates some challenges when preprocessing the images to train a model.
The second version of the dataset contains preprocessed images that solve the two main challenges of the raw data:
- It converts all images to RGB (i.e., all images have 3 channels)
- It resizes them to a fixed size and, therefore, a fixed aspect ratio (1:1 in this case)
It is necessary to preprocess the images. TensorFlow offers a wide variety of methods to load and preprocess images. The method tf.keras.utils.image_dataset_from_directory() provides all the preprocessing we need: it resizes the images and converts them to RGB. Additionally, it infers the labels of the images from the structure of the folders.
When resizing the images, it can use different interpolation methods. For this dataset we will use the nearest option, which performs nearest-neighbor interpolation: each output pixel takes the value of the closest input pixel. As a result, the pixel values remain integers, and we avoid having to handle float values greater than 1; many image utilities assume that if the dtype of a tensor is float, the values lie between 0 and 1.
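To illustrate why nearest matters here (a minimal sketch with a synthetic image; bilinear is the default method of tf.image.resize):

image = tf.constant(np.random.randint(0, 256, (4, 4, 1)), dtype=tf.uint8)
nearest = tf.image.resize(image, (8, 8), method="nearest")
bilinear = tf.image.resize(image, (8, 8), method="bilinear")
# Nearest-neighbor resizing copies existing pixels, so the integer dtype is kept;
# bilinear interpolation averages neighboring pixels and returns floats
print(nearest.dtype, bilinear.dtype)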
def load_dataset(path: str, shuffle: bool = False) -> tf.data.Dataset:
    """Loads a dataset of images from a directory.

    Args:
        path: The path to the directory containing the images.
        shuffle: Whether to shuffle the dataset.

    Returns:
        A `tf.data.Dataset` of images and labels.
    """
    dataset = utils.image_dataset_from_directory(
        path,
        interpolation="nearest",
        image_size=IMG_SIZE,
        label_mode="int",
        color_mode="rgb",
        batch_size=None,
        shuffle=shuffle,
        class_names=["normal", "pneumonia"],
    )
    return dataset
def save_image(image_array: np.ndarray, filepath: str):
    """Saves an image to a file.

    Args:
        image_array: The image array to save.
        filepath: The path to the file to save the image to.

    Returns:
        None.
    """
    image = Image.fromarray(image_array)
    image.save(filepath)
def save_images(dataset: tf.data.Dataset, set_type: str = "train"):
    """Saves images from a dataset to a directory.

    Args:
        dataset: A `tf.data.Dataset` of images and labels.
        set_type: The type of dataset, either `"train"` or `"test"`.

    Returns:
        None.
    """
    id_images = [0, 0]
    classes = ["normal", "pneumonia"]
    paths = [
        f"../data/processed/{set_type}/{classes[0]}",
        f"../data/processed/{set_type}/{classes[1]}",
    ]
    for path in paths:
        tf.io.gfile.makedirs(path)
    for image, label in dataset:
        id_image = id_images[label.numpy()]
        image_path = tf.io.gfile.join(
            paths[label.numpy()], f"{set_type}-{id_image}.jpeg"
        )
        save_image(image.numpy(), image_path)
        id_images[label.numpy()] += 1
Let’s load the images from the raw data folder:
train_path = tf.io.gfile.join(RAW_DATA_PATH, "train")
test_path = tf.io.gfile.join(RAW_DATA_PATH, "test")
train_ds = load_dataset(train_path, shuffle=True)
test_ds = load_dataset(test_path)
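As a quick check of what the pipeline produces (element_spec is standard tf.data API; as noted above, with nearest interpolation the image values stay integer-valued):

# Each element is an (image, label) pair with images of shape IMG_SIZE + (3,)
print(train_ds.element_spec)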
Once the images have been processed, we need to convert them to datasets.Dataset (the Hugging Face dataset object). The datasets library does not provide a specific method to create a dataset from a tf.data.Dataset. There are several ways to achieve this: using a generator, creating a dictionary, converting the dataset to a pandas DataFrame, etc. In this case we will save the dataset as images in a local folder, and then load it using the datasets.load_dataset() method as above.
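For reference, the generator route would look roughly like this (a sketch we do not use here; datasets.Dataset.from_generator is a real API, but the feature definitions shown are assumptions for this dataset):

def train_image_generator():
    # Yield one example at a time from the TensorFlow dataset
    for image, label in train_ds:
        yield {"image": Image.fromarray(image.numpy()), "label": int(label.numpy())}

features = datasets.Features({
    "image": datasets.Image(),
    "label": datasets.ClassLabel(names=["normal", "pneumonia"]),
})
hf_train = datasets.Dataset.from_generator(train_image_generator, features=features)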
save_images(train_ds, "train")
save_images(test_ds, "test")
It is possible to load the whole dataset in a single line of code, but for a reason I was not able to determine, this duplicates the images. One alternative is to load the splits separately and then create a datasets.DatasetDict object containing both sets.
def create_and_upload_dataset_hf():
    """Creates and uploads a dataset to the Hugging Face Hub.

    This function creates a dataset from two directories, one for training and one for testing.
    The dataset is then uploaded to the Hugging Face Hub.
    """
    train_path = tf.io.gfile.join(DATA_PATH, "train")
    test_path = tf.io.gfile.join(DATA_PATH, "test")
    train_dataset = datasets.load_dataset("imagefolder", data_dir=train_path, split="train")
    test_dataset = datasets.load_dataset("imagefolder", data_dir=test_path, split="test")
    dataset = datasets.DatasetDict({
        "train": train_dataset,
        "test": test_dataset,
    })
    dataset.push_to_hub(NAME_DATASET)
create_and_upload_dataset_hf()
After running, the dataset will be successfully uploaded to the Hub.
Finally, let's see whether the classes are balanced or not.
def plot_examples_per_class():
    """Plots the distribution of examples per class in the train and test datasets.

    The function counts the examples of each class in the train and test datasets,
    and then visualizes the distribution using a bar plot.
    """
    count_train = [0, 0]
    count_test = [0, 0]
    for _, label in train_ds:
        count_train[label.numpy()] += 1
    for _, label in test_ds:
        count_test[label.numpy()] += 1
    counts = {
        "Classes": ["Normal", "Pneumonia", "Normal", "Pneumonia"],
        "Examples": count_train + count_test,
        "Set": ["Train", "Train", "Test", "Test"],
    }
    data = pd.DataFrame.from_dict(counts)
    ax = sns.barplot(data, x="Examples", y="Set", hue="Classes", palette="magma")
    ax.bar_label(ax.containers[0])
    ax.bar_label(ax.containers[1])
    plt.show()
plot_examples_per_class()
The classes are imbalanced, especially in the training set. Using the accuracy metric to evaluate a model trained on this dataset may not be the best option; better choices would be the recall, precision, or F1-score metrics.
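As a sketch of how those metrics could be computed with Keras (keras.metrics.Precision and keras.metrics.Recall are built in; the labels below are hypothetical, and the F1-score is derived manually from the two results):

precision = keras.metrics.Precision()
recall = keras.metrics.Recall()

# Hypothetical labels and predictions, for illustration only
y_true = [0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 0]
precision.update_state(y_true, y_pred)
recall.update_state(y_true, y_pred)

p = precision.result().numpy()
r = recall.result().numpy()
f1 = 2 * p * r / (p + r)
print(f"Precision: {p:.2f}, Recall: {r:.2f}, F1-Score: {f1:.2f}")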
- Kermany, Daniel; Zhang, Kang; Goldbaum, Michael (2018), “Large Dataset of Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images”, Mendeley Data, V3, doi: 10.17632/rscbjbr9sj.3