Cassava Leaf Disease Classification: Part 2

The Team

Recap from Part 1

In our previous post, we introduced our problem and dataset: a multi-class classification problem in which each cassava leaf image must be assigned to one of 5 classes. We conducted EDA on our dataset and discovered that the majority class (Cassava Mosaic Disease) makes up over 60% of the dataset. Though this disease accounts for a significant portion of our dataset, one might argue that the dataset can’t truly be considered “imbalanced” in the way anomaly detection datasets are, where a 90-to-10 split is common.

So, what now?

We sought to build upon our baseline model by implementing more complex deep learning methods. Since we’re classifying images, we opted to use several pre-existing convolutional neural network (CNN) architectures.

Team “CSV + Raw Images” — Shuya & Alex

As previously mentioned, Shuya and Alex imported the raw images and matched them to the file-label mappings in a CSV file. They noticed that the directory with the raw images contained extra images, that is, files that were not mapped to any label. They then removed these images from the dataset.
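The cleanup step can be sketched as a simple set comparison. This is only an illustration, not Shuya and Alex’s actual code: the function name and the file names below are hypothetical stand-ins for the image directory listing and the CSV’s file column.

```python
def find_unlabeled_images(image_names, labeled_names):
    """Return image file names that have no entry in the label mapping."""
    labeled = set(labeled_names)
    return sorted(name for name in image_names if name not in labeled)


# Hypothetical file names standing in for the image directory and the CSV.
images_on_disk = ["1000015157.jpg", "1000201771.jpg", "extra_0001.jpg"]
names_in_csv = ["1000015157.jpg", "1000201771.jpg"]

unlabeled = find_unlabeled_images(images_on_disk, names_in_csv)
kept = [name for name in images_on_disk if name not in set(unlabeled)]
print(unlabeled)  # ['extra_0001.jpg']
```

In practice the two lists would come from listing the image directory and reading the CSV, and only the `kept` files would be passed on to training.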

Team “TFRecord” — Kevin

I imported and loaded the data using TFRecord, a TensorFlow format that serializes structured data into binary files. TFRecord advertises faster training times, more efficient RAM usage, and in some cases, improved performance. With these considerations in mind, I tried my hand at TFRecord.

Source: Raghav Sharma

Surveying the Performance of Various CNN Architectures

We applied transfer learning on four different CNN architectures: ResNet50, VGG16, VGG19, and DenseNet169. All models were pre-loaded with ImageNet weights, which we then froze so they would not be updated during training. We applied data augmentation for each architecture, along with several techniques to curb overfitting: EarlyStopping callbacks, learning rate schedulers, and Dropout layers. Since we froze the pre-trained weights, we added fully-connected layers with trainable parameters that connect the base model to the output. Because we have 5 classes, our output layer was a Dense layer with 5 units and softmax activation.
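To illustrate what the output layer computes, here is a minimal NumPy sketch of a trainable Dense layer followed by softmax over 5 classes. It is not our Keras code: the random `features` array is only a stand-in for the frozen base model’s outputs, and the feature size of 16 is an arbitrary assumption.

```python
import numpy as np

NUM_CLASSES = 5


def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def head_forward(features, weights, bias):
    """Trainable Dense layer followed by softmax over the classes."""
    return softmax(features @ weights + bias)


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 16))  # stand-in for frozen base-model outputs
weights = rng.normal(scale=0.1, size=(16, NUM_CLASSES))  # trainable
bias = np.zeros(NUM_CLASSES)  # trainable

probs = head_forward(features, weights, bias)  # one probability row per image
predicted = probs.argmax(axis=1)  # predicted class index per image
```

Each row of `probs` sums to 1, so the largest entry in a row can be read as the model’s predicted class for that image.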


ResNet50 is a CNN architecture that uses residual blocks. The main premise of ResNet is that “every additional layer should more easily contain the identity function as one of its elements”. Put simply, a residual block learns the residual mapping f(x) - x rather than f(x) directly, and the input x is added back through a shortcut connection. Below, we compare the differences between a regular convolutional block and a residual block.

Source: Ch. 7.6
ResNet50 loss/accuracy history
Kevin’s ResNet50 loss/accuracy history
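To make the residual idea concrete, here is a minimal NumPy sketch that contrasts a regular block with a residual block. It uses small dense layers in place of convolutions, which is a simplifying assumption, and the function names are ours for illustration only.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def plain_block(x, w1, w2):
    """Regular block: the layers must learn the full mapping f(x)."""
    return relu(relu(x @ w1) @ w2)


def residual_block(x, w1, w2):
    """Residual block: the layers learn f(x) - x, and x is added back."""
    return relu(relu(x @ w1) @ w2 + x)


x = np.array([[1.0, 2.0, 3.0]])  # non-negative, so the final ReLU leaves it unchanged
zeros = np.zeros((3, 3))

plain_out = plain_block(x, zeros, zeros)        # all zeros: the input is lost
residual_out = residual_block(x, zeros, zeros)  # equals x: the identity comes for free
```

With all weights at zero, the regular block wipes out its input while the residual block passes it through unchanged, which is the sense in which an added layer can “more easily contain the identity function”.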


Next, we trained both VGG16 and VGG19, which are variants of the VGG architecture developed by the Visual Geometry Group at Oxford University. Predecessors of VGGNet were built as a long sequential chain of components: a convolutional layer, a nonlinearity, and a pooling layer. VGGNet grouped these basic elements into repeated blocks and helped shift network design from thinking in terms of individual layers to thinking in terms of entire blocks.

Shuya & Alex’s VGG19 loss/accuracy history
Kevin’s VGG19 loss/accuracy history
Shuya & Alex’s VGG16 loss/accuracy history


DenseNet can be considered an extension of ResNet. While ResNet decomposes a function into a simple linear term and a more complex nonlinear term that are added together, DenseNet concatenates outputs rather than adding them. This means that the deeper we go into the network, the more complex the functions applied to the input become. DenseNet employs dense blocks and transition layers, which control the concatenation of inputs and outputs and the number of channels, respectively. Each convolution block inside a dense block uses the same number of output channels, so the channel count grows with every concatenation; transition layers use 1x1 convolutions to bring it back down.
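The channel bookkeeping can be sketched in a few lines of Python. The numbers below are illustrative only, not the exact DenseNet169 configuration, and the 0.5 reduction factor in the transition layer is an assumption.

```python
def dense_block_channels(in_channels, num_convs, growth_rate):
    """Channels after a dense block: each conv block's output is concatenated."""
    channels = in_channels
    for _ in range(num_convs):
        channels += growth_rate  # concatenation adds growth_rate channels
    return channels


def transition_channels(in_channels, reduction=0.5):
    """Channels after a transition layer's 1x1 convolution."""
    return int(in_channels * reduction)


# Illustrative numbers, not the exact DenseNet169 configuration.
after_dense = dense_block_channels(in_channels=64, num_convs=4, growth_rate=32)
after_transition = transition_channels(after_dense)
print(after_dense, after_transition)  # 192 96
```

Had the outputs been added as in ResNet, the channel count would have stayed at 64; concatenation is what makes the transition layers necessary.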

Kevin’s DenseNet169 loss/accuracy history

Why are our models acting like this?

It’s definitely odd to observe similarly puzzling behavior, not just in an isolated case, but across all the models we’ve trained thus far. However, there are several possible explanations for this behavior, as well as proposed solutions to further improve our models.

Data Imbalance & Shuffling

Shuffling could play a role in our models’ behavior, especially in situations where the validation accuracy remains constant. Notably, most of our validation accuracies converge around 61%, which coincidentally (or not) is the fraction of the majority class in the dataset. That is roughly the accuracy a model would reach by predicting the majority class for every image. While we hesitated to call our data imbalanced, the skew toward one class can still explain the mismatch between our training and validation accuracies, as the training and validation sets could have been unrepresentative of the dataset’s class distribution. This can be addressed by stratifying the split so that every set reflects the dataset’s class distribution.
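One way to keep the class distribution representative is a stratified split. Below is a minimal pure-Python sketch; the labels are made up (class 3 plays the role of the majority class at 60%), and the function is a hypothetical helper rather than the split we actually used.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Split sample indices so each class keeps roughly the same share in both sets."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class before splitting
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


# Hypothetical labels: class 3 is the majority at 60%, the rest share 10% each.
labels = [3] * 600 + [0] * 100 + [1] * 100 + [2] * 100 + [4] * 100
train_idx, val_idx = stratified_split(labels)

val_counts = Counter(labels[i] for i in val_idx)
majority_share = val_counts[3] / len(val_idx)
print(majority_share)  # 0.6
```

The printed share is also the validation accuracy of a model that always predicts the majority class, which is why a validation accuracy stuck near the majority fraction is a warning sign.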

Learning Rate

Our choice of learning rate could have played a role in our models’ behavior. When the learning rate is too small, each weight update is tiny, so the model barely changes from one epoch to the next, which can explain our lack of learning during training. However, Adam is known to work well with fairly low learning rates. While we can manually experiment by testing an array of learning rates, we can also address this issue by using a learning rate scheduler that adjusts the learning rate as training progresses. Beyond stabilizing training, a schedule can also help us avoid poor local minima, which is precisely our next issue. Learning rate schedules can also address cases where the loss is wildly fluctuating.
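As a rough illustration of what a schedule does, here is a step-decay function in plain Python. The starting rate, drop factor, and interval are arbitrary assumptions, not the values we trained with.

```python
def step_decay(epoch, initial_lr=1e-3, drop=0.5, epochs_per_drop=5):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)


# Learning rate for each of the first 12 epochs (epoch indices start at 0).
schedule = [step_decay(epoch) for epoch in range(12)]
```

Here the rate stays at 1e-3 for epochs 0 to 4, halves for epochs 5 to 9, and halves again from epoch 10. In Keras, a function like this could be wrapped as `lambda epoch, lr: step_decay(epoch)` and passed to the `LearningRateScheduler` callback, which calls the schedule with the epoch index and the current learning rate.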


When our loss and accuracy settle at a value and stay there for the remainder of training, it wouldn’t be unreasonable to suspect that our optimizer has converged to a poor local minimum of the loss function. Again, we can tackle this by including a learning rate schedule, but we can also experiment with different optimizers and loss functions to see if any are able to avoid poor local minima.

Approach to Transfer Learning

As you may know, we applied transfer learning to our models, and in doing so, we imported and froze pre-trained weights for different architectures. These weights were trained on a completely different dataset from ours, so it could well be that the ImageNet weights aren’t the most compatible with our dataset. We can try unfreezing some of the base model’s top layers, see if our model improves, and repeat this process gradually to avoid overfitting. Also, different CNN architectures may perform best with different input dimensions. We set our input dimension to 224 x 224 x 3, which is the default ImageNet input shape for these architectures in Keras. However, we may still see improvements by tweaking our input shapes so they better suit both our images and the architecture at hand.
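The gradual unfreezing idea can be sketched as follows. The layer objects here are simple stand-ins with `name` and `trainable` attributes, and `unfreeze_top_layers` is a hypothetical helper; a Keras model exposes a comparable list as `model.layers`, where each layer has a settable `trainable` flag.

```python
from types import SimpleNamespace


def unfreeze_top_layers(layers, num_layers):
    """Mark the last `num_layers` layers trainable and keep the rest frozen."""
    cutoff = max(len(layers) - num_layers, 0)
    for index, layer in enumerate(layers):
        layer.trainable = index >= cutoff
    return [layer.name for layer in layers if layer.trainable]


# Stand-in layer objects; a Keras model exposes a similar list as `model.layers`.
base_layers = [SimpleNamespace(name=f"block{i}", trainable=False) for i in range(1, 6)]

trainable_names = unfreeze_top_layers(base_layers, num_layers=2)
print(trainable_names)  # ['block4', 'block5']
```

Starting with a small `num_layers` and increasing it only when validation metrics improve keeps most of the pre-trained features intact. With a real Keras model, the model generally needs to be recompiled after changing `trainable` flags, and a lower learning rate is commonly used for this fine-tuning stage.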