Introduction to Radiomics and Its Application to Medical Images

Hugo Bottois
Jun 4 · 9 min read


Photo by National Cancer Institute on Unsplash

Medical imaging is arguably one of the most valuable fields in medicine. It involves techniques used to view the interior of the body to diagnose, treat, and monitor medical conditions. This allows us to visualize organs and internal structures beneath the skin and tissues without the need for invasive procedures.

“To see inside the human body without causing harm”

Hippocrates (Kos, 460 BC – Thessaly, 377 BC)

Medical imaging is used to detect abnormalities in the microvascular system, identify fractures and neuronal diseases, and detect tumors in various body areas. Each use case helps physicians to better assess a patient’s medical condition, the evolution of diseases, and determine the most suitable medical procedure for treatment.

This complex field intersects with physics, chemistry, and biology/anatomy. In recent years, a growing emphasis has been placed on the utilization of machine learning to assist clinicians in analyzing medical images.

Machine learning in medical imaging

Machine Learning (ML) is a powerful tool that assists radiologists in focusing on elements within images that are more likely to be relevant. It can emphasize details that might be overlooked due to the sheer number of images and a heavy workload. For instance, ML can be used to detect fractures in images, segment them, and classify tumors into malignant or benign types.

There are several approaches to apply ML to medical imaging data, including computer vision and radiomics. Computer vision is a field of machine learning that uses deep learning algorithms to solve classification or segmentation problems in images. These models utilize multiple “hidden layers” to capture patterns in images and continually learn to improve their performance. Deep learning models are often very performant for specific tasks, but they require a substantial amount of annotated data to be efficient.

Data acquisition is a significant limitation in obtaining a substantial database, especially in the healthcare field where data are confidential, and data annotation needs to be performed by qualified medical experts or through extra exams (histology, blood analysis).

Another growing approach to medical image analysis is radiomics. Unlike ML models, it does not provide direct predictions; instead, it allows the extraction of mineable data from medical images. Radiomics is mainly applied in the oncology field, as it provides mineable data to characterize tumors.

Radiomics data include descriptive variables such as the size, shape, and texture of tumors, which can make sense clinically (especially the volume of tumors when working with 3D data). Although radiomics data are mainly mathematical descriptors from the region of interest (typically tumors), such as interactions between pixels in 2D or voxels in 3D, and grayscale intensity, they can also be used to train machine learning models.
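To make this concrete, here is a minimal sketch (with a toy NumPy mask and an assumed 1×1 mm pixel spacing, not pyradiomics' actual implementation) of the kind of shape descriptors radiomics formalizes:

```python
import numpy as np

def simple_shape_features(mask, spacing=(1.0, 1.0)):
    """Compute basic shape descriptors from a 2D binary mask.

    mask: 2D array of 0/1; spacing: pixel size in mm (row, col).
    Returns the area in mm^2 and the bounding-box height/width in mm.
    """
    rows, cols = np.nonzero(mask)
    pixel_area = spacing[0] * spacing[1]
    area = mask.sum() * pixel_area                       # tumor area
    height = (rows.max() - rows.min() + 1) * spacing[0]  # bounding-box height
    width = (cols.max() - cols.min() + 1) * spacing[1]   # bounding-box width
    return area, height, width

# toy 'tumor': a 3x4 rectangle of ones inside a 10x10 image
mask = np.zeros((10, 10), dtype=int)
mask[2:5, 3:7] = 1
print(simple_shape_features(mask))  # (12.0, 3.0, 4.0)
```

Real radiomics libraries compute dozens of such descriptors, including texture features derived from gray-level co-occurrence between pixels or voxels.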

There are many possible ways to extract radiomic data, including software (ImageJ) and in-house algorithms developed by laboratories. This heterogeneity can cause difficulties in standardizing and comparing radiomic studies.

Pyradiomics is an open-source Python package for extracting radiomic data from medical images (link Here). This initiative aims to provide a reference standard for radiomics studies, increase awareness of the potential of radiomics, and engage the community. It can perform image loading, preprocessing, and feature extraction on regions of interest in 2D and 3D images.

Today, we are going to use pyradiomics to extract radiomic features from an open-source brain tumor dataset. This dataset contains 3064 T1-weighted contrast-enhanced images from 233 patients suffering from three kinds of brain tumors: meningiomas (708 slices), gliomas (1426 slices), and pituitary tumors (930 slices).

(1) bar plot and (2) pie chart of disease distribution across the 233 patients

Data are split into several .zip folders, each containing MATLAB .mat files that contain:

  1. cjdata.label: 1 for meningioma, 2 for glioma, 3 for pituitary tumor
  2. cjdata.PID: patient ID
  3. cjdata.image: image data
  4. cjdata.tumorBorder: a vector storing the coordinates of discrete points along the tumor border, e.g., [x1, y1, x2, y2, …], where (x1, y1) are the planar coordinates of one border point. It was generated by manually delineating the tumor border, so it can be used to generate a binary tumor mask.
  5. cjdata.tumorMask: a binary image with 1s indicating tumor region

The data, and how to extract them, are elegantly described in this repository Here.
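As an illustration, here is a hedged sketch of reading such a cjdata struct with scipy.io, using a synthetic file written to a temporary path. Note that the actual dataset files may be stored in MATLAB v7.3 format, which scipy cannot read; those require an HDF5 reader such as h5py instead.

```python
import os
import tempfile
import numpy as np
from scipy.io import savemat, loadmat

# build a synthetic cjdata entry mimicking the fields described above
cjdata = {
    'label': np.array([[1]]),                     # 1 = meningioma
    'PID': '100001',                              # patient ID
    'image': np.zeros((512, 512), dtype=np.int16),
    'tumorMask': np.zeros((512, 512), dtype=np.uint8),
}
path = os.path.join(tempfile.gettempdir(), 'example_cjdata.mat')
savemat(path, {'cjdata': cjdata})

# read it back and unpack the struct fields
mat = loadmat(path)
entry = mat['cjdata'][0, 0]
label = int(entry['label'][0, 0])
image = entry['image']
tumor_mask = entry['tumorMask']
print(label, image.shape, tumor_mask.shape)  # 1 (512, 512) (512, 512)
```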

To extract radiomics, we need both the image data and the tumor mask, i.e., the segmented tumor in the image (as below).

(1) full MRI slice (2) zoom on the tumor zone (3) tumor mask

This first segmentation step can be performed by experienced radiologists (usually a very time-consuming operation) or by a segmentation model.

import numpy as np

labels = np.load(base_url + 'labels.npy')
images = np.load(base_url + 'images.npy')
masks = np.load(base_url + 'masks.npy')
patient_ids = np.load(base_url + 'patient_ids.npy')

print(images.shape)
print(masks.shape)
print(patient_ids.shape)

unique, counts = np.unique(patient_ids, return_counts=True)

____________ output
(3064, 512, 512)
(3064, 512, 512)
(3064, 1, 1)

So we have the images, the masks, the labels (disease names), and the patient IDs: 3064 images from 233 patients in total.

We want to work in 3D. To do so, we need to build a 3D image for each patient (stacking all slices that share the same patient ID).

Let's gather all slices corresponding to each patient ID:

patients_slices = {}

# gather all slice indices for each patient in a dict called patients_slices
# (key: patient ID, value: array of slice indices)
for patient in patient_ids:
    patient = patient[0][0]
    patients_slices[patient] = np.asarray(np.where(patient_ids == patient))[0]

# sanity check: every slice index stored for a patient belongs to that patient
for patient, slice_ids in patients_slices.items():
    for slice_id in slice_ids:
        if patient_ids[slice_id][0][0] != patient:
            print(f"error for {patient}: slice {slice_id}")


Let's reconstruct each patient's slices into a 3D SimpleITK.Image:

import SimpleITK as sitk

# set spacing between voxels (x, y, z) in mm
data_spacing = [1, 1, 1]

images_radiomic = []
masks_radiomic = []
patients = []

for patient in patients_slices:

    # images: stack all slices of this patient into one 3D image
    img = images[patients_slices[patient]]
    sitk_img = sitk.GetImageFromArray(img)
    sitk_img.SetSpacing((float(data_spacing[0]), float(data_spacing[1]), float(data_spacing[2])))

    # mask
    mask_slices = masks[patients_slices[patient]].astype(int)
    sitk_mask = sitk.GetImageFromArray(mask_slices)
    sitk_mask.SetSpacing((float(data_spacing[0]), float(data_spacing[1]), float(data_spacing[2])))

    # patient: store the reconstructed volume, mask, and ID
    images_radiomic.append(sitk_img)
    masks_radiomic.append(sitk_mask)
    patients.append(patient)

Let's gather the total number of images per patient:

import pandas as pd

nbslice = []

for (img, mask) in zip(images_radiomic, masks_radiomic):

    # convert SimpleITK images back to numpy arrays (slices on axis 0)
    image_3d = sitk.GetArrayFromImage(img)
    mask_3d = sitk.GetArrayFromImage(mask)

    # check that image and mask have the same in-plane size
    if image_3d.shape[1] != mask_3d.shape[1] or image_3d.shape[2] != mask_3d.shape[2]:
        print("size different between mask and image")
    # check that image and mask have the same number of slices
    elif image_3d.shape[0] != mask_3d.shape[0]:
        print("slice number different between mask and image")

    # gather the number of slices for this patient
    nbslice.append(image_3d.shape[0])

patients_slice_df = pd.DataFrame({'patients': pd.Series(patients), 'nbrslide': pd.Series(nbslice)})


patients nbrslide
0 98472 18
1 MR048994B 17
2 112074 14
3 90284 19
4 MR029209I 26
... ... ...
228 53669 1
229 95969 1
230 111696 1
231 88510 1
232 112650 2
print('Min slice : {}'.format(patients_slice_df['nbrslide'].min()))
print('Max slice : {}'.format(patients_slice_df['nbrslide'].max()))
print('Mean slice : {:.0f}'.format(patients_slice_df['nbrslide'].mean()))



Min slice : 1
Max slice : 38
Mean slice : 13

Let's plot these data!

Number of slices per patient

Good. We can see that patients have varying numbers of slices.

This might cause a problem if we choose to work in 3D, as it requires analyzing multiple slices (or cross-sectional images) to accurately assess the volume of a tumor.

Let's see the result for one patient:

# Visualize one patient 
id = 30

image_radiomic = sitk.GetArrayFromImage(images_radiomic[id])
mask_radiomic = sitk.GetArrayFromImage(masks_radiomic[id])
image_radiomic = np.transpose(image_radiomic)
mask_radiomic = np.transpose(mask_radiomic)

We use a function from this repository Here to visualize a 3D image as a succession of 2D images selected to cover the overall shape of the tumor (disclaimer: I am the author of this repo!).

# to visualize the contour
PlotImage(image_radiomic, masks=mask_radiomic, show_tumor_only = True, dislay_mode='contour')
Images from a patient brain MRI
# to visualize square around the tumor
PlotImage(image_radiomic, masks=mask_radiomic, show_tumor_only = True, dislay_mode='frame')
Images from a patient brain MRI

Unfortunately, we notice that the slices do not seem to be in the right order. More importantly, we have different anatomical planes mixed together, including sagittal, coronal, and transverse.

Well, visual examination of the images definitely matters…

We need a consistent plane and successive slices to work in 3D, for instance to compute voxel-based measures such as tumor volume.

So, we are going to work with individual 2D images.

Without further ado, let's dig into the radiomics extraction part.

from radiomics import featureextractor

# Define settings for signature calculation
# These are currently set equal to the respective default values
settings = {}

# Normalize the image by centering it at the mean with standard deviation.
# Normalization is based on all gray values in the image, not just those
# inside the segmentation.
settings['normalize'] = True
settings['binWidth'] = 32
settings['resampledPixelSpacing'] = None  # e.g. [3, 3, 3] to resample to 3x3x3 mm voxels

# instantiation of the feature extractor
extractor = featureextractor.RadiomicsFeatureExtractor(**settings)

We chose the configuration used in papers (1) and (2), as they seem to be a similar use case, but more customization is possible (documentation Here).
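For intuition, the binWidth setting controls how gray values are grouped into discrete levels before texture features are computed. Here is a simplified sketch of fixed-bin-width discretization (pyradiomics' actual edge handling differs slightly):

```python
import numpy as np

def discretize_fixed_bin_width(values, bin_width):
    """Map gray values to integer bin numbers of fixed width.

    Values in [k*bin_width, (k+1)*bin_width) share one gray level,
    re-indexed so the lowest occupied bin becomes 1.
    """
    bins = np.floor(values / bin_width).astype(int)
    return bins - bins.min() + 1

intensities = np.array([0, 10, 31, 32, 64, 100, 127])
print(discretize_fixed_bin_width(intensities, 32))  # [1 1 1 2 3 4 4]
```

A larger binWidth yields fewer gray levels, which makes texture features more robust to noise but less sensitive to subtle intensity variations.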

We now extract the radiomic data for the entire dataset:

data_spacing = [1, 1, 1]
featureVector_list = []

for (img, mask, label, patient) in zip(images, masks, labels, patient_ids):

    # convert numpy arrays to SimpleITK images
    img_slide = sitk.GetImageFromArray(img)
    mask_slide = sitk.GetImageFromArray(mask.astype(int))

    # radiomic extraction
    order_dict = extractor.execute(img_slide, mask_slide)
    order_dict['patient'] = patient
    order_dict['disease'] = label
    featureVector_list.append(order_dict)

# gather all radiomics data into a DataFrame
radiomics = pd.DataFrame()

for featureVector in featureVector_list:
    new_df = pd.DataFrame.from_dict(data=featureVector, orient='index')
    radiomics = pd.concat([radiomics, new_df.T])


And finally, we save the dataset:

radiomics.to_csv("radiomic_variable.csv", index=False)

The first 22 columns contain metadata that you can drop if you only want the radiomic features and the tumor-type annotation.

We now have a clean dataset of radiomics data describing 3 types of tumors for 233 patients. A lot of ML applications could be developed, including both unsupervised learning (clustering radiomic features with respect to tumor type) and supervised learning (classification of tumor type).
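As a sketch of the supervised route, assuming scikit-learn is available and using synthetic features in place of the extracted CSV:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# synthetic stand-in for the radiomics table: 300 samples x 20 features,
# 3 tumor classes with slightly shifted feature means
n_per_class, n_features = 100, 20
X = np.vstack([rng.normal(loc=shift, size=(n_per_class, n_features))
               for shift in (0.0, 0.5, 1.0)])
y = np.repeat([1, 2, 3], n_per_class)  # 1 meningioma, 2 glioma, 3 pituitary

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

With real radiomic features, you would load radiomic_variable.csv into the X matrix and use the disease column as y, keeping all slices of a given patient on the same side of the train/test split.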

But now that I think of it, why do all that when we could work directly on the images with a deep learning (DL) approach?

Well, there are several reasons:

DL is generally very performant as long as enough data are available. This is not always the case in healthcare, where medical data are sensitive and under strict regulation, and therefore harder to gather. More conventional ML models, such as random forests or linear regression, usually reach good performance with less data.

Some rare cancers with a low number of patients will not have enough data to train a DL model.

DL on images also requires a lot of data cleaning, especially if the data come from a multicentric cohort. Since images can be acquired in different hospitals, with different settings, machines, and static field strengths, images of the same patient acquired under different conditions will not be identical. This is what we call the batch effect.

So, for images, we need to clean the data such as:

  • Standardize the image ratio/size without distorting tumor shape
  • Harmonize the intensity scale
  • Correct intensity inhomogeneity with algorithms such as N4ITK

However, it is still more difficult to work on a stack of images than on a single vector of radiomic data.

Batch correction algorithms for radiomics data, such as ComBat, seem to be very efficient at correcting batch effects, removing technical bias while preserving the biological characteristics of the tumor.
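For intuition, here is a deliberately simplified location/scale correction on synthetic data; real ComBat additionally applies empirical Bayes shrinkage to the batch parameters:

```python
import numpy as np

def center_scale_per_batch(features, batches):
    """Simplified batch correction: standardize each feature within each
    batch, then restore the global mean/std. Not full ComBat, which also
    shrinks the per-batch estimates via empirical Bayes."""
    corrected = features.astype(float).copy()
    g_mean = features.mean(axis=0)
    g_std = features.std(axis=0)
    for b in np.unique(batches):
        idx = batches == b
        b_mean = features[idx].mean(axis=0)
        b_std = features[idx].std(axis=0)
        corrected[idx] = (features[idx] - b_mean) / b_std * g_std + g_mean
    return corrected

rng = np.random.default_rng(1)
# two 'scanners' with shifted and rescaled feature distributions
batch_a = rng.normal(0.0, 1.0, size=(50, 4))
batch_b = rng.normal(2.0, 1.5, size=(50, 4))
features = np.vstack([batch_a, batch_b])
batches = np.repeat(['A', 'B'], 50)

corrected = center_scale_per_batch(features, batches)
# after correction, both batches share (approximately) the same mean
print(np.abs(corrected[:50].mean(0) - corrected[50:].mean(0)).max() < 1e-9)
```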

Lastly, DL is harder to train, as it requires intensive computing resources.

But as always, it is highly dependent on the quality and availability of the data, the experimentation, and the complexity of the problem…

Final note


There are myriad ways to harness the potential of data. The efficacy of these methods, however, is largely dependent upon the quality and availability of data, the level of experimentation, and the complexity of the problem at hand.

Radiomics stands out as an innovative approach for extracting insightful data from images. Initiatives like the open-source project Pyradiomics exemplify how collaboration among researchers is crucial and can lead to remarkable solutions that help solve problems and benefit patients.

Thanks for reading!

Feedback is welcome and will be appreciated.

The GitHub repository can be found Here.



Hugo Bottois

I'm a life scientist 👨‍🔬🔬 passionate about the intersection of healthcare, data science, and programming 💻👨‍💻📊.