Indiana University — Chest X-Rays Automated Report Generation

An attempt to solve an image captioning problem on the Indiana University — Chest X-Rays dataset, as part of a self-case study.

Rohan kumar soni
14 min read · Jan 8, 2021

Table of Contents

  1. Problem Description
  2. Application of Deep learning Algorithms to our problem
  3. Source of Data
  4. Model Performance Evaluation Metric
  5. Extracting data from XML files
  6. Data preparation for Modelling
  7. Pre-processing text data & EDA
  8. Deep Learning Model
  9. Final pipeline
  10. Future work
  11. LinkedIn and GitHub Repository
  12. Reference

1. Problem Description

Chest X-rays produce images of our heart, lungs, blood vessels, airways, and the bones of our chest and spine. These images help doctors determine whether a patient has heart problems, a collapsed lung, pneumonia, broken ribs, emphysema, cancer or any of several other conditions. A chest X-ray is often among the first procedures a patient undergoes when a doctor suspects heart or lung disease.

A chest X-ray can reveal many things inside your body, including:

● The condition of your lungs.

● Heart-related lung problems.

● The size and outline of your heart.

● Blood vessels.

● Calcium deposits.

● Fractures.

Manually examining X-ray images for a large number of patients is very time consuming, causes delays, and leaves a high chance of human error. It would be wonderful if we could generate accurate reports from the X-ray images automatically, so that reports are provided with minimal delay and with fewer errors. If the model finds a serious health issue, experts can then be involved, reducing the load on the healthcare sector.

2. Application of Deep learning Algorithms to our problem

● Develop a deep neural network that generates a report given a radiography (X-ray) image.

3. Source of Data

Open-i hosts a collection of chest X-ray images from the Indiana University hospital network. The data contains two folders, one for the X-ray images and the other for the XML radiography reports. Each report can have multiple images.

These files can be downloaded from the links below:

https://openi.nlm.nih.gov/imgs/collections/NLMCXR_png.tgz

https://openi.nlm.nih.gov/imgs/collections/NLMCXR_reports.tgz

4. Model Performance Evaluation Metric:

BLEU (Bilingual Evaluation Understudy) score :

It is a metric for comparing a generated sentence to a reference sentence. BLEU is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine’s output and that of a human: “the closer a machine translation is to a professional human translation, the better it is” — this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.

Scores are calculated for individual translated segments — generally sentences — by comparing them with a set of good quality reference translations. Those scores are then averaged over the whole corpus to reach an estimate of the translation’s overall quality. Intelligibility or grammatical correctness are not taken into account.

BLEU’s output is always a number between 0 and 1. This value indicates how similar the candidate text is to the reference texts, with values closer to 1 representing more similar texts. Few human translations will attain a score of 1, since this would indicate that the candidate is identical to one of the reference translations. For this reason, it is not necessary to attain a score of 1. Because there are more opportunities to match, adding additional reference translations will increase the BLEU score.

Click here to know more.

Benefits of using BLEU score :

○ quick and inexpensive to calculate.

○ easy to understand.

○ language independent.

○ correlates highly with human evaluation
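As an illustration, here is how a single generated report could be scored against a reference with NLTK; the two sentences below are made-up examples, not taken from the dataset.

```python
# Hypothetical example of a sentence-level BLEU score with NLTK
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the heart size is normal and the lungs are clear".split()
candidate = "heart size is normal lungs are clear".split()

# smoothing helps because short reports often have no 4-gram overlap
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
```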

5. Extracting data from XML files :

The dataset contains XML files that hold the image IDs along with the actual report for each patient. We need to extract this information into a data frame to proceed with the project.
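A minimal sketch of this step is shown below. It assumes the Open-i report layout with <AbstractText Label="..."> nodes for the report sections and <parentImage id="..."> nodes for the associated images; the folder name is a placeholder for wherever the reports were extracted.

```python
# Sketch: parse every report XML into one row of a pandas data frame
import glob
import xml.etree.ElementTree as ET
import pandas as pd

rows = []
for xml_path in glob.glob("NLMCXR_reports/*.xml"):      # placeholder folder
    root = ET.parse(xml_path).getroot()
    sections = {node.get("Label"): (node.text or "")
                for node in root.iter("AbstractText")}
    image_ids = [node.get("id") for node in root.iter("parentImage")]
    rows.append({"xml_file": xml_path,
                 "image_ids": image_ids,
                 "findings": sections.get("FINDINGS", ""),
                 "impression": sections.get("IMPRESSION", "")})

df = pd.DataFrame(rows)
```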

The code above reads each XML file one by one and extracts the information into a data frame. Below is how the resulting dataset should look.

We also need to create one more field containing the absolute path of each image.

Now we have all the data we need for this project. Let's look at the distribution of image heights and widths.

plot distribution of image height and width
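A small sketch of how these dimensions might be collected and plotted (the image folder is a placeholder):

```python
# Sketch: gather (width, height) of every PNG and plot the two distributions
import glob
import matplotlib.pyplot as plt
from PIL import Image

sizes = [Image.open(p).size for p in glob.glob("NLMCXR_png/*.png")]  # placeholder folder
widths, heights = zip(*sizes)

plt.hist(widths, bins=30, alpha=0.6, label="width")
plt.hist(heights, bins=30, alpha=0.6, label="height")
plt.xlabel("pixels")
plt.legend()
plt.show()
```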

We observe that most images are around 600 × 510 pixels.

Now let's load an image along with its report to see what our data looks like.

Ohh!! It is clearly evident that we need some data cleaning steps. No problem, we will take care of that in the pre-processing section.

One more thing: we observed that some patients have multiple X-ray images. We build a dictionary with the number of images as the key and the corresponding reports as the value.
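Continuing the extraction sketch above, the dictionary could be built like this (the column names are the ones assumed earlier):

```python
# Sketch: key = number of images per report, value = list of those reports
from collections import defaultdict

images_per_report = defaultdict(list)
for _, row in df.iterrows():
    images_per_report[len(row["image_ids"])].append(row["findings"])

print({k: len(v) for k, v in images_per_report.items()})
```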

Below is one example from this dictionary: the key is 2 and the value is the corresponding report.

From the bar plot below we observe that about 2500 reports (the majority) have two images associated with them, around 490 have only one image, 100 have three images and 10 have four.

Since most reports have two images, we will prepare our dataset accordingly and keep exactly two images per patient. For patients with only one image, we will replicate it.

Now let's split our data frame into train, test and cross-validation sets.
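A simple sketch of this split with scikit-learn (the 80/10/10 proportions are an assumption):

```python
# Sketch: split the data frame into train, cross-validation and test sets
from sklearn.model_selection import train_test_split

train_df, rest_df = train_test_split(df, test_size=0.2, random_state=42)
cv_df, test_df = train_test_split(rest_df, test_size=0.5, random_state=42)
print(len(train_df), len(cv_df), len(test_df))
```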

We create a list that stores all the images per patient.

From the example below, we see two images for a single patient ID.

One more thing: let's plot the images for a patient who had four images and see what they look like.

Sample chest scans of one person (4 images): two side-view and two front-view images for the same ID. We will select only two, one side view and one front view.

6. Data preparation for Modelling

We now have all the information about our rough dataset. We need to transform it in such a way that we can use it for our model development. We also concluded that we are going to keep only two images per patient for the report generation task.

So our final dataset should have the following fields (a sketch of this preparation step follows the list):

1. Person_id: unique ID for each patient

2. Image 1: absolute path of the first image used for training

3. Image 2: absolute path of the second image used for training

4. Report: the original report for each patient ID, on which the model will be trained.
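The sketch below assumes the data frame from the earlier steps now carries a person ID, a list of absolute image paths and a report column; those column names are assumptions for illustration.

```python
# Sketch: keep exactly two image paths per patient, replicating a lone image
def pick_two(paths):
    if len(paths) == 1:
        return paths[0], paths[0]        # only one view: replicate it
    return paths[0], paths[1]            # otherwise take the first two views

final_df = pd.DataFrame({
    "person_id": df["person_id"],
    "image_1":   [pick_two(p)[0] for p in df["image_paths"]],
    "image_2":   [pick_two(p)[1] for p in df["image_paths"]],
    "report":    df["report"],
})
```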

This is how our final dataset should look.

Wait!! We are not fully ready yet! As discussed, our text data needs some cleaning and pre-processing.

7. Pre-processing text data and EDA

We need to remove all numbers and stop words from the text data. We will also convert all text to lower case and expand contractions (decontraction) in each report.
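A minimal cleaning sketch along these lines (the contraction list is only a small illustrative subset):

```python
# Sketch: lowercase, expand a few contractions, drop digits and punctuation
import re

CONTRACTIONS = {"won't": "will not", "can't": "can not", "n't": " not"}

def clean_report(text):
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and spaces only
    return re.sub(r"\s+", " ", text).strip()

final_df["report"] = final_df["report"].apply(clean_report)
```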

After all the data cleaning and pre-processing steps, this is how our final data looks.

Let's check the report that contains the maximum number of words.

It looks like almost all reports contain fewer than 60 words; 153 is the maximum word count.

We create a word cloud to visualize the most frequently used words in our corpus.

We also plot a bar graph to see how many documents in our corpus share the same word count.

For the NLP task, we must add ‘startseq’ and ‘endseq’ to each document in the corpus, so that our model understands where a sentence begins and where it ends.
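Adding the two tokens is a one-liner on the cleaned reports:

```python
# Sketch: wrap every report with explicit start and end tokens
final_df["report"] = "startseq " + final_df["report"] + " endseq"
```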

After all these steps, our dataset is ready for model development.

8. Deep Learning Models

In this project we are going to implement a simple encoder-decoder model and generate predicted reports using two popular decoding algorithms:

  1. Greedy Search
  2. Beam Search

8.1. Simple Encoder Decoder

A sequence-to-sequence model is a deep learning model that takes a sequence of items (in our case, features of an image) and outputs another sequence of items (reports).

The encoder processes each item in the input sequence, it compiles the information it captures into a vector called the context. After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item.

The encoder in our case is a CNN which produces a context vector by taking in our image features. The decoder is a Recurrent Neural Network.

8.2 Obtaining Image Features

Images along with partial reports are the inputs to our model. We need to convert every image into a fixed sized vector which can then be fed as input to the model. We will use transfer learning for this purpose.

Within the scope of this article, we will use the pretrained CheXNet model to extract features from the images. CheXNet is a 121-layer convolutional neural network trained on ChestX-ray14, currently the largest publicly available chest X-ray dataset, containing over 100,000 frontal-view X-ray images labelled with 14 diseases. However, our purpose here is not to classify the images but just to obtain the bottleneck features for each image, so the last classification layer of this network is not needed.

We can download the pretrained weights from this link.

The output shape of our feature vector should look like the one below.

Below is the code to get to this result.
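A sketch of such a feature extractor is shown below. It builds a DenseNet-121 with a 14-class head, loads the downloaded CheXNet checkpoint (the file name is a placeholder) and keeps the 1024-dimensional pooled output instead of the classification layer.

```python
# Sketch: CheXNet-style feature extractor from a DenseNet-121 backbone
import tensorflow as tf

base = tf.keras.applications.DenseNet121(weights=None, classes=14,
                                         input_shape=(224, 224, 3))
base.load_weights("chexnet_weights.h5")          # placeholder checkpoint name

# drop the 14-class head; keep the global-average-pooled 1024-d feature
chexnet = tf.keras.Model(inputs=base.input, outputs=base.layers[-2].output)
```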

Once we have loaded the pretrained weights into our CheXNet model, we extract features for each image in our dataset.

Each image is resized to (224, 224, 3) and passed through CheXNet, giving a 1024-length feature vector. The two feature vectors are then concatenated to obtain a 2048-length feature vector. If you notice, we have added an average pooling layer as the last layer, and there is a specific reason for this. Since we are concatenating the features of both images, the model might learn some order of concatenation, for example that image 1 always comes before image 2 or vice versa, but that is not the case here; we do not keep any order while concatenating them. This problem is addressed by pooling, which creates location invariance.

Save these features for each image in a pickle file for future use.
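Continuing the sketch, each pair of images can be pushed through the extractor and cached like this (paths and column names follow the assumptions above):

```python
# Sketch: extract a 2048-d feature per patient (two 1024-d vectors concatenated)
import pickle
import numpy as np

def load_image(path):
    img = tf.keras.preprocessing.image.load_img(path, target_size=(224, 224))
    return np.expand_dims(tf.keras.preprocessing.image.img_to_array(img) / 255.0, 0)

features = {}
for _, row in final_df.iterrows():
    f1 = chexnet.predict(load_image(row["image_1"]), verbose=0)
    f2 = chexnet.predict(load_image(row["image_2"]), verbose=0)
    features[row["person_id"]] = np.concatenate([f1, f2], axis=1)   # shape (1, 2048)

with open("image_features.pkl", "wb") as f:
    pickle.dump(features, f)
```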

8.3 Preparing text data

Since we cannot pass raw text into our model, we need to convert it into vectors.

First, we segregate the dataset into train, test and cv sets.

Now we tokenize our text data and create an embedding matrix.
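A sketch of the tokenisation and embedding-matrix step; it assumes train_df is the training split from the segregation above, and the pretrained word-vector lookup is left empty as a placeholder.

```python
# Sketch: fit a tokenizer on the training reports and build an embedding matrix
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(filters="", oov_token="<unk>")
tokenizer.fit_on_texts(train_df["report"])
vocab_size = len(tokenizer.word_index) + 1

embedding_dim = 300
pretrained_vectors = {}        # e.g. a GloVe word -> vector dict loaded beforehand
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in tokenizer.word_index.items():
    if word in pretrained_vectors:
        embedding_matrix[idx] = pretrained_vectors[word]
```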

8.4 Create Dataset generator

We need to pass the image features for both images as well as the actual report to our model during training, so we create a data generator function for this purpose.

The generator provides data as bytes, so we create a function that converts them back to strings for manipulation. We also create a function that takes a batch of data and converts it into a new data frame.
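One possible generator, continuing the names defined above: it pairs each patient's cached 2048-d image feature with the report sequence for teacher forcing (the maximum length and batch size are assumptions).

```python
# Sketch: yield ((image_features, decoder_input), decoder_target) batches
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 60        # assumed maximum report length in tokens

def data_generator(dataframe, batch_size=32):
    while True:
        batch = dataframe.sample(batch_size)
        seqs = tokenizer.texts_to_sequences(batch["report"])
        seqs = pad_sequences(seqs, maxlen=MAX_LEN, padding="post")
        img = np.vstack([features[pid] for pid in batch["person_id"]])
        # decoder input is the sequence, target is the same sequence shifted by one
        yield (img, seqs[:, :-1]), seqs[:, 1:]
```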

8.5 Defining our final model

Now that we are equipped with all the necessary utility functions and image features, we finally define our sequence-to-sequence model.
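An illustrative encoder-decoder definition is shown below; the layer sizes are assumptions and will not reproduce the exact parameter count reported in the summary that follows.

```python
# Sketch: image features condition an LSTM decoder over the partial report
from tensorflow.keras import layers, Model

image_input = layers.Input(shape=(2048,))
report_input = layers.Input(shape=(MAX_LEN - 1,))

img_state = layers.Dense(256, activation="relu")(image_input)   # encoder projection

embedded = layers.Embedding(vocab_size, embedding_dim,
                            weights=[embedding_matrix],
                            trainable=False)(report_input)
decoded = layers.LSTM(256, return_sequences=True)(
    embedded, initial_state=[img_state, img_state])
output = layers.Dense(vocab_size, activation="softmax")(decoded)

model = Model(inputs=[image_input, report_input], outputs=output)
model.summary()
```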

From the model summary we see 2,055,325 trainable parameters. Training the model on a high-end system should not take much time; I am using Google Colab for this project.

We can also plot our full model architecture for better understanding of the shapes at each step.

8.6 Defining loss function

We define a masked loss function for this problem. For example, suppose we have a sequence of tokens [3], [10], [7], [0], [0], [0], [0], [0]. There are only 3 words in this sequence; the zeros correspond to padding, which is not actually part of the report. But the model will think the zeros are part of the sequence and will start learning them. When the model starts to correctly predict the zeros, the loss will decrease, because from the model's point of view it is learning correctly. For us, however, the loss should only decrease when the model predicts the actual words (non-zeros) correctly.

Therefore we mask the zeros in the sequence so that the model does not pay attention to them and only learns the actual words in the report.
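A minimal sketch of such a masked loss (padding index 0 is excluded from the average):

```python
# Sketch: sparse categorical cross-entropy that ignores padded positions
def masked_loss(y_true, y_pred):
    loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
    mask = tf.cast(tf.not_equal(y_true, 0), loss.dtype)   # 0 = padding token
    loss = loss * mask
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)
```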


8.7 Model Training

Now it's time to compile and train our model on the dataset.

We are going to use Adam Optimizer for faster convergence. The Adam optimization algorithm is an extension to stochastic gradient descent that has recently seen broader adoption for deep learning applications in computer vision and natural language processing.

Adam can be seen as combining the advantages of two other extensions of stochastic gradient descent. Specifically:

  • Adaptive Gradient Algorithm (AdaGrad) that maintains a per-parameter learning rate that improves performance on problems with sparse gradients (e.g. natural language and computer vision problems).
  • Root Mean Square Propagation (RMSProp) that also maintains per-parameter learning rates that are adapted based on the average of recent magnitudes of the gradients for the weight (e.g. how quickly it is changing). This means the algorithm does well on online and non-stationary problems (e.g. noisy).

Adam realizes the benefits of both AdaGrad and RMSProp.

Instead of adapting the parameter learning rates based on the average first moment (the mean) as in RMSProp, Adam also makes use of the average of the second moments of the gradients (the uncentered variance).

To read more about Adam optimiser click here
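A compile-and-fit sketch under the assumptions above; the learning rate, batch size, steps and epoch count are placeholders.

```python
# Sketch: compile with Adam and train from the generator
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=masked_loss)

model.fit(data_generator(train_df),
          steps_per_epoch=len(train_df) // 32,
          validation_data=data_generator(cv_df),
          validation_steps=len(cv_df) // 32,
          epochs=20)
```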

8.8 Inference

Now that we have trained our model, it’s time to prepare our model to predict reports. For this purpose we have to make some adjustments in our model. This will save us some time during testing. First we will separate the encoder and decoder part from our model. The features predicted by the encoder will be used as the input to our decoder along with the partial reports.

To predict a report from our model given the images, we will use the greedy search and beam search algorithms.

8.9 Greedy search

Greedy is an algorithmic paradigm that builds up a solution piece by piece, always choosing the next piece that offers the most obvious and immediate benefit. Problems where a locally optimal choice also leads to a globally optimal solution are the best fit for greedy methods.

A simple approximation is to use a greedy search that selects the most likely word at each step in the output sequence. This approach has the benefit that it is very fast, but the quality of the final output sequences may be far from optimal.

Let's define a function that takes the index of an image as input and returns both the actual and the predicted report using greedy search.
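A sketch of such a function; for simplicity it decodes with the full training model rather than the separated encoder and decoder described above, and reuses the names defined in the earlier sketches.

```python
# Sketch: greedy decoding - always take the most probable next word
def greedy_report(person_id):
    words = ["startseq"]
    img = features[person_id]
    for _ in range(MAX_LEN - 1):
        seq = tokenizer.texts_to_sequences([" ".join(words)])
        seq = pad_sequences(seq, maxlen=MAX_LEN - 1, padding="post")
        preds = model.predict((img, seq), verbose=0)          # (1, MAX_LEN-1, vocab)
        next_id = int(np.argmax(preds[0, len(words) - 1]))
        next_word = tokenizer.index_word.get(next_id, "")
        if next_word in ("", "endseq"):
            break
        words.append(next_word)
    return " ".join(words[1:])
```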

Now let's print some examples:

The model is performing decently on short reports. Let's see what our average BLEU score is on the test data.

8.10 Beam search

Beam search expands upon greedy search and returns a list of the most likely output sequences.

Instead of greedily choosing the most likely next step as the sequence is constructed, the beam search expands all possible next steps and keeps the k most likely, where k is a user-specified parameter and controls the number of beams or parallel searches through the sequence of probabilities.

We do not need to start with random states; instead, we start with the k most likely words as the first step in the sequence. Common beam width values are 1 for a greedy search and values of 5 or 10 for common benchmark problems in machine translation. Larger beam widths result in better performance of a model, as the multiple candidate sequences increase the likelihood of better matching a target sequence, but this increased performance comes at the cost of decoding speed.

To read more about greedy search and beam search click here.
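A beam-search sketch in the same style, keeping the k best partial reports by summed log-probability (k = 5 by default):

```python
# Sketch: beam-search decoding with beam width k
def beam_report(person_id, k=5):
    img = features[person_id]
    beams = [(["startseq"], 0.0)]                 # (partial report, log-probability)
    for _ in range(MAX_LEN - 1):
        candidates = []
        for words, score in beams:
            if words[-1] == "endseq":
                candidates.append((words, score))
                continue
            seq = tokenizer.texts_to_sequences([" ".join(words)])
            seq = pad_sequences(seq, maxlen=MAX_LEN - 1, padding="post")
            preds = model.predict((img, seq), verbose=0)[0, len(words) - 1]
            for idx in np.argsort(preds)[-k:]:
                word = tokenizer.index_word.get(int(idx), "")
                if word:
                    candidates.append((words + [word],
                                       score + np.log(preds[idx] + 1e-12)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    best_words = beams[0][0]
    return " ".join(w for w in best_words if w not in ("startseq", "endseq"))
```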


We observe that with beam search the BLEU score has increased, indicating that beam search performs better than greedy search.

We also tried experimenting with other values of the beam width and concluded that a beam width of 5 returns the best results.

9. Final Pipeline

So far we have built our project piece by piece. Now we create a function that takes the index of an image and the algorithm type (either greedy or beam) as input and returns the final predicted report along with the X-ray images.
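A small wrapper along those lines, reusing the two decoding sketches above (displaying the images themselves is omitted here):

```python
# Sketch: end-to-end report generation for one patient
def generate_report(person_id, algorithm="greedy"):
    if algorithm == "beam":
        return beam_report(person_id, k=5)
    return greedy_report(person_id)
```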

Example 1:

Example 2:

10. Future Work

  1. We can use a pretrained EfficientNetB7 model to extract features from the images and then build our model on top of it.
  2. Collecting more data samples to make this model more powerful is highly desirable; our present model is not very accurate because of the small dataset at hand.
  3. No major hyperparameter tuning was done for any of the models, so better hyperparameter tuning might produce better results.
  4. More advanced techniques like Transformers or BERT could be implemented, which might give better results.

11. LinkedIn and GitHub Repository

GitHub : https://github.com/rohansoni634/Indiana-University-Chest-X-Rays-Automated-Report-Generation

LinkedIn: https://bit.ly/3aSDxPn

12. Reference

  1. https://machinelearningmastery.com/beam-search-decoder-natural-language-processing/
  2. https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/
  3. https://www.analyticsvidhya.com/blog/2018/04/solving-an-image-captioning-task-using-deep-learning/
  4. https://medium.com/@ahmdtaha/show-attend-and-tell-neural-image-caption-generation-with-visual-attention-9772ca582be5
  5. https://towardsdatascience.com/image-captioning-using-deep-learning-fe0d929cf337
  6. https://appliedaicourse.com
