Traceability of Deep Learning: Cooking With ML and GitOps

In this blog post, I’ll share how we at Primaa cook up deep learning models, from experimentation to production, all while ensuring complete traceability of the process. We also provide our customers with custom models that are fine-tuned to their specific data.

Sébastien Lefort

Cleo, Our Lovely and Challenging Digital Pathology Product

Primaa’s main product, Cleo, is designed to assist pathologists in diagnosing cancer. In a nutshell, Cleo connects to slide scanners, runs analyses based on numerous ML models, and presents the results in its own user interface.

At its core, the ML component takes a Whole Slide Image (WSI) as input, which is a very high-resolution image of a histological section, and outputs a variety of visual information, such as heatmaps, polygons, or regions. Additionally, several cancer biomarkers are detected during the analysis, including invasive carcinoma and mitoses.

We faced five major challenges while developing Cleo’s ML components:

  1. WSIs are very large files (from 1 to 4 GB, depending on the resolution), with dimensions reaching several hundred thousand pixels.
  2. A dozen deep learning models are involved in each analysis.
  3. The detection of some biomarkers depends on others, for instance, mitoses can only be found where invasive carcinoma is detected.
  4. Each pathology laboratory has its own histological section-making process and scanner configuration, leading to variability in input data.
  5. Although Cleo can easily be deployed in a cloud environment, all of our customers, due to the sensitive nature of the data, prefer to install it on their own infrastructure.

Therefore, the constraints we defined to design the ML components were:

  1. The ML component cannot be further split into sub-components, e.g., one per model. All models and logic must be embedded in a single Docker image. This is partly because it simplifies on-premise deployments, but mostly because the WSI must be divided into a large number (tens of thousands) of small tiles that are passed to the DL models; keeping the models in separate components would incur enormous network overhead (see the quick sizing sketch after this list).
  2. The component must be fully configurable (model weights and other parameters) since the models are fine-tuned on each customer’s data to account for variability in slide preparation.
  3. Traceability is essential for regulatory purposes. It is crucial to be able to track and retrieve every model and configuration we cook, along with the data they were trained on.
  4. Training models for a new customer should not take too much time for our AI team.

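To give a sense of scale for constraint 1, here is a quick back-of-the-envelope computation. The tile size and WSI dimensions below are illustrative assumptions, not Cleo’s actual parameters:

# Rough tile count for a single WSI, assuming 512x512-pixel tiles
wsi_width, wsi_height = 100_000, 80_000  # illustrative WSI dimensions, in pixels
tile_size = 512                          # assumed tile edge length

tiles_x = -(-wsi_width // tile_size)     # ceiling division: 196 columns
tiles_y = -(-wsi_height // tile_size)    # 157 rows
print(tiles_x * tiles_y)                 # 30,772 tiles for this single slide
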
In the next sections, I’ll present the three main stages of our journey from experiments to production for our ML component.

Experimenting

The Rise of PrimaaML: Principles

The first step in the ML journey is experimentation. To help our AI team quickly test new ideas, we developed an internal framework, uncreatively called PrimaaML. Initially, this framework aimed to provide tools that would prevent the AI team from writing the same code repeatedly: model pre- and post-processing, WSI splitting, filtering, etc. In other words, all the components involved in an analysis before and after the model training stage.

Later, we added abstraction to the tools to create clean and clear APIs for each component. This way, data scientists only had to change a component without worrying about the rest of their code: components became plug-ins.

As the team grew and the number of experiments increased, so did the number of trained models. Keeping track of all these models was necessary to meet regulatory requirements. Consequently, we took a significant step further to:

  • Track all of our experiments
  • Standardize the way models were trained and saved

We chose to track and schedule our experiments using ClearML (https://clear.ml/) for three main reasons:

  • We didn’t want to rely on any specific cloud provider for our core business. Moreover, given cloud GPU prices, most of our experiments run on GPUs we own.
  • It has a clean and user-friendly UI to track and compare experiments.
  • It is easy to deploy and maintain.

PrimaaML evolved to become a complete ML tool, with the following main features:

  • Single model training, saving and testing
  • Full analytical testing of analysis results, with multiple models and algorithms
  • Full end-to-end testing of analysis results
  • Dataset management, from WSI to tiles and annotations
  • Support for custom tasks
  • Plug-in oriented: PrimaaML is an ML workflow execution engine. Every stage is defined by APIs so that plug-ins (image preprocessing, augmentation, post-processing, training callback, losses, metrics, data loaders, dataset transformations) can be quickly tested without pushing any code into its repository.
  • Declarative & GitOps philosophy: any experiment is configured in a YAML file that PrimaaML takes as input to run, which is stored in a git repository. This way, every aspect of the experiment, from configuration to source code, is tracked in ClearML, making it fully reproducible.

PrimaaML in Practice

Nice, but how does it work in practice? Let’s say we have the following basic workflow:

Simple training workflow. In blue, components that are plug-ins.

Now, let’s say a data scientist wants to test a new preprocessing algorithm. For the sake of clarity, let’s assume they haven’t configured a project. In reality, the new preprocessing would likely be tested in an existing project.

First, they need to create a new git repository, where the ML task configurations and experiment plug-ins will be stored. Then, they can start working on their brilliant new preprocessing idea. Of course, PrimaaML is already installed :)

The preprocessing is defined in PrimaaML as:

import numpy.typing as npt

class Preprocessing:
    def __call__(self, image: npt.NDArray) -> npt.NDArray:
        pass

Our clever data scientists just discovered normalization, so they write the following preprocessing plug-in in their experiment repository:

from primaaml.image import Preprocessing
import numpy.typing as npt

class Normalization(Preprocessing):

    def __init__(self, factor: float):
        self.factor = factor

    def __call__(self, image: npt.NDArray) -> npt.NDArray:
        return image / self.factor

Now, to test it, they declare their experiment in a YAML file:

version: 1
project:
  project_name: NormalizationTest
  sources_path:
    # Paths where plugins source code are looked for
    - ./

models:
  # Models configuration
  my_class_model:
    architecture:
      cls: MyArchitecture
      params:
        n_layers: 3
    loss:
      cls: BinaryFocal
    metrics:
      - cls: accuracy
    preprocessing:
      cls: Normalization
      params:
        factor: 255.0

data:
  loaders:
    disk_loader:
      cls: DiskLoader
      params:
        wsi_path: /home/bibi/wsi/
        labels_path: /home/bibi/labels/

tasks:
  simple_train:
    name: SimpleTrain
    tags:
      - demo
    description: "Simple task for demo"
    type: training
    params:
      model: my_class_model
      n_epochs: 50
      train_loader: disk_loader
      validation_split: 0.2
      callbacks:
        - cls: MetricsLogger

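Under the hood, an engine like this has to turn each cls/params pair into a live object. Here is a minimal sketch of how such plug-in resolution might work; the resolve_plugin and instantiate helpers are hypothetical illustrations, not PrimaaML’s actual API:

import importlib
from typing import Any

def resolve_plugin(cls_name: str, module_names: list[str]) -> type:
    # Look the class up by name in each candidate module, in order
    for module_name in module_names:
        module = importlib.import_module(module_name)
        if hasattr(module, cls_name):
            return getattr(module, cls_name)
    raise LookupError(f"Plug-in class not found: {cls_name}")

def instantiate(spec: dict[str, Any], module_names: list[str]) -> Any:
    # Build a plug-in instance from a {'cls': ..., 'params': ...} spec
    plugin_cls = resolve_plugin(spec["cls"], module_names)
    return plugin_cls(**spec.get("params", {}))

# e.g., instantiate({"cls": "Normalization", "params": {"factor": 255.0}}, ["plugins"])
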
Just a quick push to git, and everything is ready for the real thing. The data scientists open a terminal in their repository folder and enter the magical command:

primaaml run simple_train --remote gpu-queue

The task is added to the gpu-queue and will be executed as soon as a GPU worker is available. The results of the experiment, as well as its configuration and git diff, will be stored in ClearML.
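
For reference, this is roughly what enqueuing a task looks like with ClearML’s Python SDK. This is a simplified sketch of the mechanism, not PrimaaML’s actual implementation:

from clearml import Task

# Register the experiment so that code, git diff, and configuration are captured
task = Task.init(project_name="NormalizationTest", task_name="SimpleTrain")
task.connect_configuration("experiment.yaml")  # attach the YAML experiment file

# Stop local execution and enqueue the task for a remote GPU worker
task.execute_remotely(queue_name="gpu-queue", exit_process=True)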

From the Lab to Production

The AI team is quite proud of their high-performing models, but a lingering question remains: “where did I save this fantastic model?” And they’re still arguing with the regulatory team, who insists that code should not change. What? How can we be innovative if the code can’t change???

Let’s take a step back. How can we:

  1. Allow the AI team to innovate without losing control of what is covered by the marking clearance.
  2. Ensure that we ship the right model to the right customer.

Production Source Code

Because Cleo is considered an in vitro medical device, it must be marked before being sold, and that marking requires the covered code to remain unchanged over time, except, of course, for bug fixes. For an AI product, this implies that model architectures must not change either.

Consequently, one of our most important requirements is to keep the production code stable and safe. At the same time, however, the AI team must be able to use the production code, to ensure that models fine-tuned for new customers will perform exactly as they do in production.

To address these requirements, we externalized the production features from PrimaaML, resulting in the following source code architecture:

ML code architecture

Where:

  • MLCore implements plug-in interfaces, analysis workflows, WSI helpers, etc.
  • CleoML wraps MLCore and integrates it into the Cleo product environment
  • PrimaaML is where our latest innovations reside, waiting to go into production in the next marked release

Production Models

Models Publishing

The source code was the easy part of the production process. The models were a bit trickier, because they cannot easily be stored in a git repository.

The first thing we did was to mark the models we wanted in production as such and store them in a safe place where spring cleaning won’t delete them. To achieve this, we added a command to the PrimaaML CLI to publish models, which involves:

  1. Retrieving the ClearML task the model was trained in
  2. Retrieving from ClearML the path where the model is stored
  3. Marking the model as published in ClearML, preventing users from deleting the experiment it was trained in
  4. Copying the model to a dedicated S3 bucket, versioned according to the ClearML task id

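A condensed sketch of what such a publish command might do, using the ClearML SDK and boto3. The bucket name and object layout are illustrative assumptions:

import boto3
from clearml import Task

def publish_model(task_id: str, bucket: str = "prod-models") -> None:
    task = Task.get_task(task_id=task_id)  # 1. retrieve the training task
    model = task.models["output"][-1]      # 2. its latest output model
    model.publish()                        # 3. mark as published in ClearML
    weights_path = model.get_local_copy()  # download the weights locally
    s3 = boto3.client("s3")
    # 4. copy to the production bucket, versioned by ClearML task id
    s3.upload_file(weights_path, bucket, f"{task_id}/{model.name}")
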
Production models are now ready to be packaged; we just need to know which ones to choose :)

A Note on CleoML

CleoML implements the ML component of Cleo, wrapping the core features of MLCore. It essentially:

  1. Loads the analysis configuration file
  2. Loads the required models
  3. Instantiates the analysis from MLCore
  4. Listens to a messaging server such as RabbitMQ
  5. Runs the analysis

Just like PrimaaML tasks, CleoML configuration is stored as a YAML file.
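
A minimal sketch of that startup sequence, using pika as the RabbitMQ client. The mlcore imports and the message format are hypothetical placeholders, not CleoML’s actual code:

import pika
import yaml

from mlcore import build_analysis, load_models  # hypothetical MLCore API

with open("analysis.yaml") as f:                # 1. load the analysis configuration
    config = yaml.safe_load(f)["analyses"][0]

models = load_models(config["models"])          # 2. load the required models
analysis = build_analysis(config, models)       # 3. instantiate the analysis

connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq"))
channel = connection.channel()                  # 4. listen to the messaging server
channel.queue_declare(queue="analysis-requests", durable=True)

def on_message(ch, method, properties, body):
    analysis.run(wsi_path=body.decode())        # 5. run the analysis
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="analysis-requests", on_message_callback=on_message)
channel.start_consuming()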

Packaging CleoML Docker Image

The CleoML Docker image is built in CI/CD pipelines. We developed a set of tools to ensure all the necessary resources are included, especially the models. The following chart illustrates the Docker image-building process:

CleoML building process

In #1, the analysis configuration is parsed to determine which models are needed in the Docker image. The configuration YAML file looks like this (I removed all configuration unrelated to models):

analyses:
  - name: breast
    models:
      carcinoma_model:
        sourcing: auto
        reference: a957f623-e6d3-4159-a07a-ec1b8323eb54
      mitosis_model:
        sourcing: auto
        reference: 3fcf116c-ae4f-461a-85be-a1fd4913993f
    features:
      - cls: Carcinoma
        models:
          classification_model: carcinoma_model
      - cls: Mitosis
        models:
          segmentation_model: mitosis_model

In #2, the model configurations are fetched from ClearML based on their id. These configurations include, for instance, preprocessing, postprocessing, and feature-specific parameters.
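
In code, step #2 could be as simple as the following sketch; the configuration object name is an assumption:

from clearml import Task

def fetch_model_config(task_id: str) -> str:
    # Retrieve the configuration attached to the training task;
    # the object name "model_config" is illustrative
    task = Task.get_task(task_id=task_id)
    return task.get_configuration_object("model_config")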

In #3, the initial configuration is translated using the data fetched from ClearML, allowing CleoML to load the proper resources and configure analysis at startup. This leads to a more verbose YAML file, such as:

analyses:
  - name: breast
    models:
      carcinoma_model:
        sourcing: manual
        reference: models/carcinoma/
        preprocessing:
          - cls: Normalization
            params:
              factor: 255.0
        postprocessing:
          - cls: BinaryClassification
            params:
              threshold: 0.5
      mitosis_model:
        sourcing: manual
        ...
    features:
      - cls: Carcinoma
        models:
          classification_model: carcinoma_model
      - cls: Mitosis
        models:
          segmentation_model: mitosis_model

In #4, the models needed by the analysis are downloaded from the production S3 bucket and copied into the expected folders.
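
Step #4 boils down to downloading every object stored under each model’s task-id prefix, along these lines (bucket name and layout are the same illustrative assumptions as in the publish sketch above):

import boto3

def fetch_model(task_id: str, target_dir: str, bucket: str = "prod-models") -> None:
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    # Download everything stored under the model's ClearML task id
    for page in paginator.paginate(Bucket=bucket, Prefix=f"{task_id}/"):
        for obj in page.get("Contents", []):
            filename = obj["Key"].rsplit("/", 1)[-1]
            s3.download_file(bucket, obj["Key"], f"{target_dir}/{filename}")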

Finally, in #5, the Docker image is built with all of the resources for CleoML.

Handling the Custom Models Curse with GitOps

Everything would be great if we only needed one Docker image to deploy for every customer. However, due to the variability in the slide acquisition process, most of our customers need fine-tuned models, necessitating dedicated Docker images while maintaining traceability at every stage of the process, from data selection to fine-tuning. This is the custom model curse: having to manage a large number of models while ensuring that every customer runs the expected version of the code and their models.

To address this curse, we have set up the following GitOps process:

Custom Docker images process

Firstly, a base Docker image for CleoML is built in the CI/CD pipeline and pushed to our private Docker registry.

Every customer has their own git repository where the analysis configuration file is stored. This way, any change in the configuration is tracked, and we can retrieve the exact configuration used for each version, which is mandatory for regulatory purposes. In addition to the analysis configuration, the customer YAML file defines the customer name, used to tag the resulting image, and the version of CleoML to use as the base image.
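
For illustration, such a customer file could look like this; all field names and values here are hypothetical:

customer: some-lab                       # used to tag the resulting Docker image
cleoml_version: 2.3.1                    # CleoML release used as the base image
analysis_configuration: ./analysis.yaml  # the tracked analysis configuration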

Finally, when a stable release of CleoML occurs, its CI/CD triggers stable releases of all customer images, saving time on the operations side.

Final Thoughts

Ensuring full traceability of every aspect of deep learning in production, from source code to training data and model weights, is a hot and challenging topic. In this blog post, I presented how we manage to do so by leveraging the GitOps methodology, while preserving data scientists’ freedom.

In a future post, I’ll show you how we manage, in a cost-effective way, the dozens of terabytes of data we store and use daily to annotate WSIs and train our models. Stay tuned!

About Primaa

Primaa is an innovative MedTech startup based in Paris, leading the way in digital pathology. Our AI-based diagnostic solutions help pathologists deliver rapid, accurate diagnoses to improve patient outcomes and quality of life. By using advanced technologies like computer vision, we uncover hidden patterns and offer valuable clinical insights from whole-slide images and electronic medical records. Our goal is to provide our customers with cutting-edge technology and services that optimize decision-making and streamline lab workflows. We’re proud to be pioneers in digital transformation, improving clinical care and transforming the future of pathology.
