Cassava Leaf Disease Classification: Part 2

The Team

Recap from Part 1

In our previous post, we introduced our problem and dataset: a multi-class classification problem in which each cassava leaf image must be assigned to one of 5 classes (four diseases plus healthy). We conducted EDA on our dataset and discovered that the majority class (Cassava Mosaic Disease) occupied over 60% of the dataset. Though this disease makes up a significant portion of our dataset, one might argue that the dataset can’t truly be considered “imbalanced”, unlike anomaly detection datasets, where a 90-to-10 imbalance is common.

After EDA, we fit a logistic regression model to our dataset to serve as a baseline. We chose logistic regression because it is widely regarded as one of the simplest classification algorithms in machine learning. Our baseline reached 50% accuracy.

So, what now?

We sought to build upon our baseline model by implementing more complex algorithms using a host of deep learning methods. Since we’re classifying images, we opted to use different pre-existing convolutional neural network (CNN) architectures to fulfill our goal of classifying cassava leaf disease images.

Our data was also available in multiple forms. We were given the raw image files along with a csv file that mapped each image file to its corresponding label. We were also provided the image files in TFRecord format. Our group experimented with different ways of loading and importing the data, while evaluating their effects on overall training time and performance. Shuya and Alex worked with the csv file and raw images, while I (Kevin) loaded and imported the data using the TFRecord files in a separate notebook.

Team “CSV + Raw Images” — Shuya & Alex

As previously mentioned, Shuya and Alex imported the raw images and matched them to the file-label mappings in the csv file. They noticed that the directory with the raw images contained extra images. That is, there were extraneous images that were not mapped to any label. They removed these images from the dataset.

To load the dataset in a form our deep learning models could consume, they created two separate lists, one for images and one for labels. They parsed the images in our image directory and converted each one into a NumPy array. The labels were taken from the dataframe created by reading in the file-to-label mappings. Once all images and labels were collected, both lists were converted into NumPy arrays, and the images were rescaled so that their pixel values fell between 0 and 1. Labels were then one-hot encoded, the data was partitioned 85/15, and the images were augmented. Due to limited RAM, only 1000–2000 samples could be used for training at a time. Any more, and Colab would crash.
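The steps above can be sketched in plain NumPy. This is only an illustration under stated assumptions, not Shuya and Alex's actual code: the images are random stand-ins, the function names are made up for this example, and the 224 x 224 x 3 size used in the memory estimate is the input shape mentioned later in this post. The memory helper also hints at why holding every image in one array overwhelms Colab.

```python
import numpy as np

NUM_CLASSES = 5

def prepare_arrays(images, labels, holdout_fraction=0.15, seed=0):
    """Stack, rescale, one-hot encode, and split image/label lists (hypothetical helper)."""
    # float32 halves the footprint compared with NumPy's default float64.
    x = np.asarray(images, dtype=np.float32) / 255.0
    # Row i of the identity matrix is the one-hot vector for class i.
    y = np.eye(NUM_CLASSES, dtype=np.float32)[np.asarray(labels)]

    # Shuffle once, then set aside 15% of the samples.
    order = np.random.default_rng(seed).permutation(len(x))
    n_holdout = int(round(len(x) * holdout_fraction))
    held_idx, train_idx = order[:n_holdout], order[n_holdout:]
    return x[train_idx], y[train_idx], x[held_idx], y[held_idx]

def dataset_gigabytes(n_images, height=224, width=224, channels=3, bytes_per_value=4):
    """Approximate RAM needed to hold every image in one dense array."""
    return n_images * height * width * channels * bytes_per_value / 1e9

# Tiny synthetic stand-in for the real images (random 8-bit pixels).
rng = np.random.default_rng(0)
fake_images = [rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8) for _ in range(20)]
fake_labels = [i % NUM_CLASSES for i in range(20)]
x_train, y_train, x_held, y_held = prepare_arrays(fake_images, fake_labels)

print(x_train.shape, x_held.shape)
print(round(dataset_gigabytes(2000), 1), round(dataset_gigabytes(21397), 1))
```

Under these assumptions, 2000 float32 images need roughly 1.2 GB, while all 21,397 need roughly 12.9 GB before any augmentation or model memory is counted. Dividing a uint8 array by 255 without an explicit dtype yields float64, which would double both figures.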

Team “TFRecord” — Kevin

I imported and loaded the data using TFRecord, a TensorFlow format that stores serialized structured data in binary files. TFRecord advertises faster training times, more efficient RAM usage, and in some cases, improved performance. With these considerations in mind, I tried my hand at TFRecord.

Source: Raghav Sharma

I first connected to the available TFRecord directories. TFRecord files can be parsed according to different features, depending on the data type stored in the file. Inspecting the files showed that our TFRecords contained three parsable features: “image”, “image_name”, and “target”. Next, I defined a few helper functions that shared the responsibility of parsing the TFRecord files, preprocessing the images, and consolidating these images into a TFRecord dataset. Instead of being stored in Python lists, my parsed data lived in a TFRecord dataset of (image, label) tuples. The last step before training was shuffling the TFRecord files and dividing them into training, validation, and test sets. To do this, I implemented a helper function that performs a 60/20/20 train/validation/test split.
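A file-level 60/20/20 split like the one described above might look like the following. This is a sketch rather than my original helper: the function name, the seed, and the shard file names are all hypothetical.

```python
import random

def split_files(filenames, train=0.6, val=0.2, seed=42):
    """Shuffle TFRecord file names and split them into train/val/test lists."""
    files = list(filenames)
    random.Random(seed).shuffle(files)   # seeded so the split is reproducible
    n_train = round(len(files) * train)
    n_val = round(len(files) * val)
    return (
        files[:n_train],
        files[n_train:n_train + n_val],
        files[n_train + n_val:],         # whatever remains becomes the test set
    )

# Hypothetical shard names; the real file names and count may differ.
shards = [f"shard_{i:02d}.tfrec" for i in range(10)]
train_files, val_files, test_files = split_files(shards)
print(len(train_files), len(val_files), len(test_files))
```

Each resulting list can then be passed to `tf.data.TFRecordDataset`, which accepts a list of file names. Note that splitting by file only gives an exact 60/20/20 split of images if every file holds the same number of records.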

Unlike Team CSV + raw images, I was able to train models on all ~21,000 images, rather than 1000 to 2000. Shoutout Colab Pro.

Now, let’s take a look at our models’ performance on the data and diagnose the behaviors that underlie it.

Surveying the Performance of Various CNN Architectures

We applied transfer learning on four different CNN architectures: ResNet50, VGG16, VGG19, and DenseNet169. All models were pre-loaded with weights. These weights were then frozen and rendered untrainable. Data augmentation was applied for each architecture, as well as various regularization techniques. These include EarlyStopping callbacks, learning rate schedulers, and Dropout layers. Since we froze the default weights, we implemented fully-connected layers with trainable parameters that connect the base model to the output. Because we have 5 classes, our output layer was a Dense layer with activation.
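For readers who want to see the overall shape of this setup, here is a minimal Keras sketch using ResNet50 as the base. Several details are assumptions rather than our exact configuration: the hidden layer sizes, the dropout rate, global average pooling, and the softmax output (the usual choice when each image has exactly one of 5 labels).

```python
import tensorflow as tf

NUM_CLASSES = 5
INPUT_SHAPE = (224, 224, 3)

def build_frozen_resnet50(weights="imagenet", hidden_units=(256, 128, 64), dropout=0.3):
    """ResNet50 feature extractor with a small trainable classification head."""
    base = tf.keras.applications.ResNet50(
        include_top=False,       # drop the original 1000-class ImageNet classifier
        weights=weights,
        input_shape=INPUT_SHAPE,
        pooling="avg",           # global average pooling: one feature vector per image
    )
    base.trainable = False       # freeze every pre-trained weight

    layers = [base]
    for units in hidden_units:   # assumed sizes, for illustration only
        layers.append(tf.keras.layers.Dense(units, activation="relu"))
        layers.append(tf.keras.layers.Dropout(dropout))
    # Assumption: softmax output over the 5 classes.
    layers.append(tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"))

    model = tf.keras.Sequential(layers)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="categorical_crossentropy",   # expects one-hot encoded labels
        metrics=["accuracy"],
    )
    return model
```

Calling `build_frozen_resnet50()` downloads the ImageNet weights on first use, and passing `weights=None` builds the same architecture with random weights. With integer labels instead of one-hot vectors, `sparse_categorical_crossentropy` would be the matching loss.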


ResNet50 is a CNN architecture that uses residual blocks. The main premise of ResNet, according to its authors, is that “every additional layer should more easily contain the identity function as one of its elements”. To simplify this premise, ResNet learns residual mappings ( f(x)-x ) rather than f(x) alone. Below, we compare the differences between a regular convolutional block and a residual block.
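A toy NumPy version makes the identity argument concrete. It is a deliberately simplified sketch: real ResNet blocks use convolutions, batch normalization, and a ReLU after the addition, all of which are left out here.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def plain_block(x, w1, w2):
    """Two stacked layers that must learn the full mapping f(x) on their own."""
    return w2 @ relu(w1 @ x)

def residual_block(x, w1, w2):
    """Same layers, but they only have to model the residual f(x) - x."""
    return plain_block(x, w1, w2) + x   # the skip connection adds the input back

x = np.array([1.0, -2.0, 3.0])
zeros = np.zeros((3, 3))

print(plain_block(x, zeros, zeros))      # all zeros: the input signal is lost
print(residual_block(x, zeros, zeros))   # equals x: the identity comes for free
```

With all weights at zero, the plain block wipes out its input, while the residual block passes it through unchanged. That is the sense in which an extra residual layer can "more easily contain the identity function".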

Source: Ch. 7.6

Alex and Shuya trained ResNet50 on 2000 of the 21397 available images. As mentioned earlier, this was the largest number of images the available RAM could tolerate. Anything higher, and we’d run out. After freezing ResNet50’s base model, they added three fully-connected layers with three Dropout layers that connected to the output layer. They trained the model with a batch size of 16 and learning rate of 0.001 for 30 epochs.

Below, we observe that training loss sharply decreases after 2–3 epochs, while the validation loss decreases less sharply. Both losses then fluctuate minimally around 1.25 for the rest of the 30 epochs. Interestingly, we observe a slight decrease in both training and validation accuracy before they both rebound and flatten. In fact, the validation accuracy stays at exactly 0.6 from epochs 3 to 30, whereas the training accuracy makes incredibly minuscule fluctuations. This behavior is bizarre; however, it is what we commonly saw in our training runs. We’ll dig deeper into the causes of this in a bit.

ResNet50 loss/accuracy history

Kevin trained ResNet50 using the same output architecture as Shuya and Alex and received similarly bizarre results. This time, the training accuracy appears to fluctuate around the validation accuracy.

Kevin’s ResNet50 loss/accuracy history


Next, we trained both VGG16 and VGG19, which are variants of the VGG architecture developed by the Visual Geometry Group at Oxford University. VGG’s predecessors were built as one long sequential chain of basic components: a convolutional layer, a nonlinearity, and a pooling layer. VGGNet grouped these basic elements into repeated blocks and helped move network design from thinking about individual layers to thinking about entire blocks.

Shuya and Alex trained both VGG16 and VGG19 with a learning rate of 0.001 and batch size of 16 for 30 epochs. Similarly to ResNet50, they attached three fully-connected layers between the base model and output layers.

We observe VGG19 to be more numerically stable in training, as opposed to ResNet50. However, we don’t observe much learning in the network. Training accuracy marginally improves, and validation accuracy is largely inconsistent, teetering around 0.61 once again. Both training and validation loss gradually decrease.

Shuya & Alex’s VGG19 loss/accuracy history

My VGG19 showed some unusual behavior. While the training and validation loss show a healthy, gradual decrease, the training and validation accuracy behave peculiarly. For one, the training accuracy is almost always lower than the validation accuracy. Second, both the training and validation accuracy started off quite low, around 12%. After 2 epochs, the validation accuracy quickly shot up to 61% and remained at that exact value for the rest of training.

Kevin’s VGG19 loss/accuracy history

Alex and Shuya’s VGG16 behaved more normally than the VGG19 models above. However, we still notice fluctuations that we believe can be smoothed out. This model was actually our best-performing model, with a maximum validation accuracy of 74%.

Shuya & Alex’s VGG16 loss/accuracy history


DenseNet can be considered an extension of ResNet. While ResNet decomposes a function into a simple linear term (the input itself) and a more complex nonlinear term that is added to it, DenseNet concatenates each layer’s output with its input rather than adding them. This means that the deeper we go into the network, the more complex the functions applied to the input become, and the more channels accumulate. DenseNet is built from dense blocks and transition layers, which control how inputs and outputs are concatenated and how many channels are kept, respectively. Within a dense block, each convolution block uses the same number of output channels, and the resulting channel growth is kept in check by transition layers that use 1x1 convolutions.
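The channel bookkeeping is easier to follow with shapes in hand. The NumPy sketch below is a simplification under stated assumptions: each "convolution" is a per-pixel channel mix with random weights standing in for the real 3x3 convolutions, and the channel counts (64 inputs, growth of 32, 96 after the transition) are chosen only for illustration.

```python
import numpy as np

def conv_block(x, growth_rate, rng):
    """Stand-in for a conv block: maps all input channels to `growth_rate` new ones."""
    weights = rng.normal(size=(x.shape[-1], growth_rate))
    return np.maximum(x @ weights, 0.0)

def dense_block(x, num_convs, growth_rate, rng):
    """Each conv block sees the concatenation of everything produced before it."""
    for _ in range(num_convs):
        new_features = conv_block(x, growth_rate, rng)
        x = np.concatenate([x, new_features], axis=-1)   # concatenate, do not add
    return x

def transition_layer(x, out_channels, rng):
    """1x1 convolution (per-pixel channel mixing) followed by 2x2 average pooling."""
    weights = rng.normal(size=(x.shape[-1], out_channels))
    x = x @ weights
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(8, 8, 64))   # height, width, channels
after_block = dense_block(feature_map, num_convs=4, growth_rate=32, rng=rng)
after_transition = transition_layer(after_block, out_channels=96, rng=rng)

print(after_block.shape)        # (8, 8, 192)
print(after_transition.shape)   # (4, 4, 96)
```

Four conv blocks that each add 32 channels take the map from 64 to 192 channels, and the transition layer then cuts both the channel count and the spatial size.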

I fit DenseNet169 on a 70/30 train-validation split of all samples. I used a batch size of 16 and a learning rate of 0.0001, with sparse categorical crossentropy as my loss function. Training lasted 20 epochs.

Performance was, you guessed it, bizarre. Training loss and accuracy fluctuate around high and low values, respectively, and never improve as training progresses. Meanwhile, validation loss improves very slightly over training, while the validation accuracy remains constant at 61.8%.
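A validation accuracy that never moves is what we would see if a model assigned every image to the same class. One quick way to check is to count the arg-max predictions per class. The sketch below uses made-up model outputs and labels, and the helper names are hypothetical.

```python
import numpy as np

def prediction_counts(probabilities, num_classes=5):
    """Count how often each class is the arg-max prediction."""
    predicted = np.argmax(probabilities, axis=1)
    return np.bincount(predicted, minlength=num_classes)

def is_collapsed(probabilities, num_classes=5):
    """True when every sample is assigned to the same class."""
    return np.count_nonzero(prediction_counts(probabilities, num_classes)) == 1

# Hypothetical model output: class 3 always wins, whatever the input.
collapsed_output = np.tile([0.05, 0.05, 0.1, 0.7, 0.1], (100, 1))
# Hypothetical labels in which class 3 makes up 61% of the samples.
labels = np.array([3] * 61 + [0] * 10 + [1] * 10 + [2] * 10 + [4] * 9)

accuracy = np.mean(np.argmax(collapsed_output, axis=1) == labels)
print(prediction_counts(collapsed_output))   # class 3 receives all 100 predictions
print(is_collapsed(collapsed_output), accuracy)
```

In this made-up case the accuracy equals the share of class 3 in the labels, 0.61, even though the model has learned nothing about the other four classes.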

Kevin’s DenseNet169 loss/accuracy history

Why are our models acting like this?

It’s definitely odd to see such similar behavior across every model we’ve trained so far, rather than in one isolated case. However, there are several possible explanations, along with proposed solutions to improve our models.

Data Imbalance & Shuffling

Shuffling and class imbalance could both play a role in our models’ behavior, especially where the validation accuracy remains constant. Notably, most of our validation accuracies converge around 61%, which coincidentally (or not) is the fraction of the majority class in the dataset. A model that predicts the majority class for every image would score exactly that, so our networks may have collapsed to always predicting Cassava Mosaic Disease. While it’s tough to call our data heavily imbalanced, the imbalance that does exist can also explain the mismatch between our training and validation accuracies, since either set could have been unrepresentative of the dataset’s class distribution. This can be addressed by ensuring that every split preserves the class distribution of the full dataset (a stratified split).
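One way to keep each split representative is to split every class separately and then recombine. Here is a standard-library sketch of that idea; the function name and the label list are hypothetical, and the 80/20 ratio is only for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_indices, val_indices) with matching class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)   # same fraction from every class
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    rng.shuffle(train_idx)   # avoid feeding the model one class at a time
    rng.shuffle(val_idx)
    return train_idx, val_idx

# Hypothetical labels: a 61% majority class, the rest spread over four classes.
labels = [3] * 610 + [0] * 100 + [1] * 100 + [2] * 100 + [4] * 90
train_idx, val_idx = stratified_split(labels)

val_counts = Counter(labels[i] for i in val_idx)
print(val_counts[3] / len(val_idx))   # 0.61
```

For data already held in arrays, scikit-learn's `train_test_split` offers the same behavior through its `stratify` argument.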

Learning Rate

Our choice of learning rate could also have played a role. When the learning rate is too small, each weight update is tiny, so the model barely moves from its starting point and appears to learn nothing, which would explain the lack of learning during training. That said, Adam is known to work well with fairly low learning rates. While we can manually test an array of learning rates, we can also use a learning rate scheduler that adjusts the learning rate as training progresses. Besides stabilizing training, a schedule can help the optimizer escape poor local minima, which is precisely our next issue. Learning rate schedules can also help in cases where the loss fluctuates wildly.
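As one concrete example of a schedule, the function below halves the learning rate every five epochs. The drop factor and interval are arbitrary values picked for illustration, not settings we tuned.

```python
def step_decay(epoch, lr, drop=0.5, every=5):
    """Multiply the learning rate by `drop` every `every` epochs, else keep it."""
    if epoch > 0 and epoch % every == 0:
        return lr * drop
    return lr

# Simulate 15 epochs starting from the 0.001 learning rate used above.
lr = 1e-3
history = []
for epoch in range(15):
    lr = step_decay(epoch, lr)
    history.append(lr)

print(history[0], history[5], history[10])   # 0.001 0.0005 0.00025
```

A function with this `(epoch, lr)` signature can be passed to `tf.keras.callbacks.LearningRateScheduler`. Keras also provides `ReduceLROnPlateau`, which lowers the rate only when a monitored metric stops improving.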


Local Minima

When our loss and accuracy converge to a value and stay there for the remainder of training, it is reasonable to suspect that the optimizer has settled into a poor local minimum of the loss function. Again, we can tackle this by including a learning rate schedule, but we can also experiment with different optimizers and loss functions to see if any of them avoid such minima.

Approach to Transfer Learning

As you may know, we applied transfer learning to our models, and in doing so, we imported and froze pre-trained weights for different architectures. These weights were trained on ImageNet, a completely different dataset from ours, so it could well be that they aren’t the most compatible with our images. We can try unfreezing some of the base model’s top layers, check whether the model improves, and repeat this process gradually to avoid overfitting. Also, different CNN architectures perform best with different input dimensions. We set our input dimension to 224 x 224 x 3, which is the default input shape for ResNet and DenseNet. However, we might see improvements by adjusting the input shape to suit the architecture at hand.
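Gradual unfreezing can be sketched with a small helper. This is an illustration we have not yet run on our models: the helper name is hypothetical, and the demo uses a stand-in object with `trainable` and `layers` attributes so the logic can be checked without loading a real network.

```python
from types import SimpleNamespace

def unfreeze_top_layers(base_model, n_unfrozen):
    """Make only the last `n_unfrozen` layers of `base_model` trainable."""
    base_model.trainable = True   # the parent flag must be on for any child to train
    cutoff = len(base_model.layers) - n_unfrozen
    for index, layer in enumerate(base_model.layers):
        layer.trainable = index >= cutoff   # freeze everything below the cutoff

# Stand-in with the same attributes; in practice a Keras base model is passed in.
fake_base = SimpleNamespace(
    trainable=False,
    layers=[SimpleNamespace(name=f"layer_{i}", trainable=False) for i in range(10)],
)
unfreeze_top_layers(fake_base, n_unfrozen=3)
print([layer.name for layer in fake_base.layers if layer.trainable])
```

With a real Keras model, the model needs to be compiled again after changing `trainable` for the change to take effect, and fine-tuning is usually done with a lower learning rate than the one used for the frozen phase.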

That was a lot. I hope you enjoyed our journey and struggles training our networks. There’s definitely lots of room to improve, and we hope to share those improvements with you on our next edition of Cassava Leaf Disease Classification! If you have any other ideas or corrections, let me know in the comments. Peace!

