Dr. Marco Maier

Evaluation-Driven Machine Learning

published: 2021-12-28 | updated: 2021-12-31

Over the last few years, I've had the chance to work with a lot of talented people on various machine learning and deep learning projects. One observation I've made is that people tend to jump directly into model training (after all, it's the most fun). Usually, one then realizes very quickly that a lot of work also has to go into data collection and preparation. But what often gets left out of this thought process is how to actually evaluate what one is doing.

While the initial rush of getting a first version of a super cool machine learning model running is definitely exciting and a source of motivation, I'd argue that without a more structured process, we might not learn much from all the tinkering and experimenting. Most of the time, the just-do-it approach yields a better intuition for the problem space at hand, and this can be very valuable, but it rarely delivers clear-cut answers to the questions we are actually trying to answer. This is especially true if you're developing machine learning solutions for use in real applications (instead of for scientific work).

A typical example I encounter regularly, especially in my course at the university, is students telling me excitedly about their model reaching an accuracy of 85%. 85% is a lot, isn't it? The thing is: Without more context, we do not know. At all. 85% actually might be really bad. Totally useless. It might also be the new state of the art. And still totally useless.

There is a better way. A way that requires some work upfront but results in less work overall, because we set ourselves up to directly get the insights we're looking for. To be clear: By no means have I invented this process. It's just a collection of best practices. However, I haven't found a short guide that summarizes all the ideas and techniques, which is why I've written this article.

If you have a software engineering background, you are probably aware of a method called test-driven development (TDD). Without going into much detail, TDD is the idea of defining the desired behavior of a program by writing (unit) tests before the actual implementation. On the one hand, this forces you to think about all the details the program has to take care of (all the edge cases, etc.). On the other hand, once the test suite exists, you have an automatic way to track the status and progress of the implementation. In the beginning, your program passes 0% of the test cases. Step by step, this number increases. When you reach 100%, you're done (and you know it).

While TDD is definitely not the be-all and end-all of software development, it can still serve as an example for how we might want to approach the development of machine learning models. In analogy to TDD, I call this set of guidelines evaluation-driven machine learning. The idea is to not treat the evaluation as an afterthought, but to make it the first thing we create. To do so, we need to consider a set of questions and tasks, which are laid out as a step-by-step guide in this article.

Step 1: Define the evaluation and success criteria

Before we jump right into our editor to start implementing, we should make sure that we are clear about where we want to end up. This mostly comes down to the following points:

What kind of task/problem are you dealing with and what's a suitable evaluation metric?

When starting out with machine learning, you'll probably try to do some kind of classification, e.g., of digits, flowers, or cats and dogs. But classification is only one type of machine learning task. You will also encounter regression, clustering, contrastive learning, generative models, and many others. For each of these tasks we need different kinds of evaluation criteria and metrics. And even for a single kind of task, there are different options.

For simplicity, let's look at how to evaluate classification tasks. The first thought might be to calculate the accuracy of your model on the evaluation data set. However, it might also be important to know the precision and recall of your model. Maybe some combination of the two, e.g., the F1 score, is even more informative. Do we actually have a binary task or do we have more categories? If the latter, a confusion matrix may give us important information about which classes get mixed up with others. It could also be the case that this is not a simple multi-class task but a multi-label task (i.e., an input can be assigned to more than one class, or none at all).
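To make these metrics concrete, here is a minimal sketch in plain Python. The labels are made up for illustration, and the function names are my own choice rather than any library's API; in a real project you would probably use an existing metrics library instead.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs: a sparse confusion matrix."""
    return Counter(zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 with one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up labels for illustration only.
y_true = ["cat", "cat", "dog", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "bird"]

print(accuracy(y_true, y_pred))
print(precision_recall_f1(y_true, y_pred, positive="dog"))
print(confusion_counts(y_true, y_pred))
```

The confusion counts show which classes get mixed up (here, cats and dogs in both directions), which a single accuracy number hides.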

To further complicate things, you probably will have additional constraints such as how resource intensive a solution is allowed to be. Thus, it might make sense to collect not only your primary evaluation metric but also additional stats like how long it takes to predict on your validation set. This can later result in insights such as that a bigger model improves accuracy by 3% but takes twice the time to run, which might or might not be a reasonable trade-off.
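Collecting such additional stats can be as simple as wrapping the prediction step in a timer. The following is only a sketch: `evaluate_with_timing` and `sign_model` are hypothetical names, and the "model" is a trivial stand-in for a trained one.

```python
import time


def evaluate_with_timing(predict, inputs, labels):
    """Return accuracy together with the wall-clock time spent on prediction."""
    start = time.perf_counter()
    predictions = [predict(x) for x in inputs]
    elapsed = time.perf_counter() - start
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {
        "accuracy": correct / len(labels),
        "seconds_total": elapsed,
        "seconds_per_sample": elapsed / len(inputs),
    }


# Hypothetical stand-in for a trained model: classifies numbers by sign.
def sign_model(x):
    return "positive" if x > 0 else "non-positive"


inputs = [-2, -1, 0, 1, 2]
labels = ["non-positive", "non-positive", "non-positive", "positive", "positive"]
stats = evaluate_with_timing(sign_model, inputs, labels)
print(stats)
```

Storing these numbers next to the primary metric for every model makes the accuracy-versus-runtime trade-off visible later on.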

What performance is required for the solution to be actually usable?

If you're doing scientific research, the definition of your goal will often be simple: Be better than the current state of the art according to the same metrics the respective work uses.

In practice, things get more complicated. You can build the best model of all time for your given task and it may still not be usable for the intended application. Therefore, it makes sense to think about the success criteria right from the start. What performance level does your solution need to be viable for its intended use?

Another important question: Are different types of errors of equal weight or do they differ in their severity regarding the actual use? A typical example is predicting whether a person has a certain disease or not. In binary tasks like this, the outcomes can be categorized into true positives, true negatives, false positives and false negatives. A false negative means that your model predicts the person to be healthy although they have the disease. A false positive means that your model predicts the person to have the disease, but the person actually does not. Are these two classes of errors equally problematic? Often, you might want to be "better safe than sorry", i.e., accepting a higher number of false positives in order to catch as many cases as possible. On the other hand, pushing the balance too far in this direction has consequences, too. Getting a positive test result can have psychological consequences for the patient, and it wastes resources because of additional testing and examinations.
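One way to make such unequal error weights explicit is to compute a weighted cost instead of (or in addition to) plain accuracy. The sketch below assumes, purely for illustration, that a missed case is five times as costly as a false alarm; the weights and predictions are invented and would have to come from the actual application.

```python
def error_counts(y_true, y_pred):
    """Count false positives and false negatives for binary labels (1 = has disease)."""
    pairs = list(zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return fp, fn


def weighted_cost(y_true, y_pred, cost_fp=1.0, cost_fn=5.0):
    """Total cost under the assumption that a missed case is worse than a false alarm."""
    fp, fn = error_counts(y_true, y_pred)
    return cost_fp * fp + cost_fn * fn


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
cautious = [1, 1, 1, 1, 1, 0, 0, 0]  # no missed cases, two false alarms
strict = [1, 1, 0, 0, 0, 0, 0, 0]    # one missed case, no false alarms

print(weighted_cost(y_true, cautious))
print(weighted_cost(y_true, strict))
```

With these assumed weights, the cautious model has the lower cost even though it makes more errors in total (two instead of one), so accuracy alone would have picked the other model.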


Make sure to choose the right metrics for evaluating your model and be clear about the success criteria. The latter is especially important if different types of errors carry different weights.

Step 2: Decide on the evaluation data sets

In many tutorials and scientific scenarios, you only have one data set which you split into subsets for training, validating and testing. In practice, things might be different again, mainly because creating data sets is expensive.

Let's take an example: We might want to classify a set of hand gestures (things like thumbs up, thumbs down, clapping, etc.) from videos taken of people sitting in a car. To train a model, we would ideally want videos of thousands or tens of thousands of different individuals performing the gestures in all kinds of scenarios relevant to our use case. Creating such a video data set requires a lot of effort for obtaining the videos, annotating them, etc., because we would literally have to put thousands of people in a car, drive around and have them perform the gestures while recording. It is a lot easier to have a panel of participants who perform the gestures at home in front of their laptop or smartphone, recorded by the integrated camera. The former approach could be prohibitively expensive, while the latter might indeed be feasible. However, the actual field of application is the in-car setting. To make sure that our model also achieves the required performance in the real setting, we will probably have to create a data set in exactly the setting we want to apply it to. But since this data set is only needed for evaluating the model, we can often live with a significantly smaller one. Other examples of this situation are using synthetic data for training, training on images but applying the model to videos, inferring actual labels from related annotations, etc.


You might be forced to find creative ways to create (large) training data sets and to do the training, but when it comes to the evaluation, you should limit creativity and stick as closely to the actual use case as possible. This means you have to decide which data sets are appropriate for evaluating your solution for your use case (and you may have to create these data sets specifically).

Step 3: Create and fix data set splits

A typical setup for creating machine learning models comprises three types of data subsets: training, validation and testing. They serve different purposes:

  • The train set is used to train the model, i.e., to learn the model's parameters with the goal of making accurate predictions.

  • The validation set is used to optimize the model's hyperparameters and/or to compare different models/architectures with each other.

  • The test set is used to assess the final model's performance on unseen data, i.e., it is only used after models have been trained and the best one has been selected, and in the best case, it also is only used once.

Most of the time, you have an initial data set and you create the mentioned subsets from it. One approach is to first split off the test set from the initial data set (e.g., 10 to 20% of the data). This test set is then kept aside until the very end of the process. We want to prevent any information from the test set from leaking into our model development, whether directly or indirectly. As we've seen in the last section, we might also have several additional test sets which do not come from the main data set. These, too, stay untouched until the very end.

The remaining part of the main data set is now split into the training set and the validation set. The size of the validation set again depends on your situation, but you don't want it to be too small because that makes your decisions noisier. A rule of thumb, again, is to put 20% of the data in the validation set. Removing the test and validation partitions from the initial data set greatly reduces the amount of data left for the actual model training. However, especially for deep learning, we want the training data set to be large. An at least partial solution to this problem is cross validation. For example, with 5-fold cross validation, you perform the train-validation split 5 times, each time with different data ending up in the validation set. This allows you to split off smaller validation sets (keeping more training data) at the cost of having to repeat the training several times.
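As a rough sketch of how k-fold cross validation partitions the data, consider the following plain-Python example. It assumes the test set has already been split off, and `k_fold_splits` as well as the sample names are hypothetical; libraries such as scikit-learn provide ready-made equivalents.

```python
import random


def k_fold_splits(items, k, seed):
    """Yield (train, validation) pairs; each item lands in validation exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # dedicated RNG, fixed seed
    folds = [shuffled[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation


# Hypothetical sample identifiers (the data left after removing the test set).
samples = [f"sample_{i:02d}" for i in range(10)]

for fold_index, (train, validation) in enumerate(k_fold_splits(samples, k=5, seed=42)):
    print(fold_index, len(train), len(validation))
```

Each of the five runs trains on eight samples and validates on the remaining two, and every sample is used for validation exactly once.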

Creating these data set splits correctly requires attention to many details in order to not leak information from one set into another. I'll take a deeper look at this topic in another article.

What's important at this step is the following: Creating these data set splits usually involves random sampling from the initial data set. With cross validation involved, we end up with tens of random subsets of the data used for different purposes at different points in time. This can make it hard to reproduce results. We'll see later that we want to follow certain steps to "control" the randomness of the process, but with regard to data sets, I have one clear recommendation: Create the splits once and save them to disk. Since copying the actual files (e.g., images, videos, etc.) would be very resource-intensive, you can simply store the filenames in CSV files. These files should also be committed to your code repository so that you always have a clear connection between your model code and the data used. During the actual model training, do not create the splits anew but just load them from the saved files. This removes most of the uncertainty that might arise from differences in handling the data.
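A minimal sketch of this "create once, save, then only load" idea could look as follows. The function names, the file names and the single-column CSV layout are assumptions for illustration, and a temporary directory stands in for the code repository.

```python
import csv
import random
import tempfile
from pathlib import Path


def create_splits(filenames, validation_fraction, seed):
    """Shuffle once with a fixed seed and cut into train and validation lists."""
    shuffled = sorted(filenames)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return {"validation": shuffled[:n_validation], "train": shuffled[n_validation:]}


def save_split(path, filenames):
    """Write one filename per row so the split can be committed to the repository."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename"])
        writer.writerows([name] for name in filenames)


def load_split(path):
    """Read a split back; training code calls this instead of re-sampling."""
    with open(path, newline="") as f:
        return [row["filename"] for row in csv.DictReader(f)]


# Hypothetical file names; a temporary directory stands in for the repository.
files = [f"img_{i:03d}.jpg" for i in range(10)]
splits = create_splits(files, validation_fraction=0.2, seed=0)

with tempfile.TemporaryDirectory() as tmp:
    for name, members in splits.items():
        save_split(Path(tmp) / f"{name}.csv", members)
    reloaded = {name: load_split(Path(tmp) / f"{name}.csv") for name in splits}

print(reloaded == splits)
```

In a real project, `create_splits` and `save_split` would run exactly once, and every training run afterwards would only call `load_split`.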


In order to make your results reproducible, set up the train, validation and test data sets (and all the randomly produced data set splits) once and save all of them to disk. For the actual model training, always operate from these saved splits. Don't create new ones.

Step 4: Define baseline models

Before we start creating our new billion dollar model, we have to think about what the baseline actually is. As we've seen in step 1, especially in practice, there will be some success criteria which result from the final application. But aside from these criteria, there are some reasonable baseline models we might want to look at as a reference for our own developments. The baseline models fall into three main categories, which are discussed below.

"Dumb" baseline models

These models are not based on any actual learning, but follow some simple rules in producing predictions. Which types of models are suited to a given scenario depends on the learning task. If we again use classification as an example, there are two quite helpful baseline models:

  • The random model, which predicts one of the possible classes at random.

  • The most frequent model, which always predicts the class that occurs most often in the training data.

The most frequent model is especially helpful when you're dealing with imbalanced data sets. If your data set contains three classes of objects, but 85% of the data belongs to class A, you can reach 85% accuracy simply by always predicting class A (remember the example from the beginning?). Obviously, our billion dollar model should be better than this baseline to provide additional value.
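Both baselines fit in a few lines. The following sketch uses hypothetical class names and a `fit` / `predict` interface of my own choosing; the toy data reproduces the 85% example and, for brevity, evaluates on the same labels the baseline was fitted on.

```python
import random
from collections import Counter


class RandomBaseline:
    """Predicts one of the classes seen during training, uniformly at random."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.classes = []

    def fit(self, inputs, labels):
        self.classes = sorted(set(labels))
        return self

    def predict(self, inputs):
        return [self.rng.choice(self.classes) for _ in inputs]


class MostFrequentBaseline:
    """Always predicts the most common training label."""

    def __init__(self):
        self.label = None

    def fit(self, inputs, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, inputs):
        return [self.label for _ in inputs]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Imbalanced toy data: 85% of the samples belong to class "A".
labels = ["A"] * 85 + ["B"] * 10 + ["C"] * 5
inputs = list(range(len(labels)))  # both baselines ignore the inputs

most_frequent = MostFrequentBaseline().fit(inputs, labels)
random_model = RandomBaseline(seed=0).fit(inputs, labels)

print(accuracy(labels, most_frequent.predict(inputs)))  # 0.85
print(accuracy(labels, random_model.predict(inputs)))
```

If you use scikit-learn anyway, its `DummyClassifier` offers comparable strategies (e.g., `most_frequent` and `uniform`), so you may not need to write these classes yourself.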

We could also get the class distribution by examining the data set manually, but simply including these two baseline models in our pipeline gives us one less thing to think about.

Standard baseline models

Many of us seem to have a tendency towards architecting our own custom solutions, often forgetting simpler approaches in the process. In order to put things in perspective, it often makes sense to have one or more standard models which are suited to the task at hand. Let's take an example: If we want to build a model for image classification, it's reasonable and easy to have a default DenseNet or MobileNet or VGG, etc. in the pool of models. Deep learning frameworks like TensorFlow offer these architectures by default. It takes just a few lines of code to include them, and they give you a first baseline result from a model that actually learns.

State-of-the-art models

Finally, it is very likely that you are not the first to build a model for a certain use case. If you're doing scientific research, it's obligatory to take into account related work, but also in more practical scenarios, it might make sense to implement an existing SOTA model first. If you train such a model in the same pipeline as your other models, you have a better understanding of the third party model's results compared to just taking the evaluation data from a paper.


In order to understand the performance of your model, you need reference values. I recommend always including some dumb models (random or those based on simple rules) which uncover imbalances in your data set. Additionally, adding some default models quickly gives you a reasonable starting point for further refinement. Especially in research, it's also desirable to have existing SOTA models implemented directly in your pipeline.

Step 5: Implement pipeline

Steps 1 to 4 so far were mostly conceptual (and data-related) preparations for the actual implementation of our training and evaluation pipeline. I've mentioned the word pipeline already several times throughout the article, so it's time to define what that actually means. The pipeline comprises all the steps from loading the data (organized in the previously prepared data set splits) to training different models to evaluating the trained models on the validation data set.

The pipeline should be completely separated from the model implementations. I even recommend that in a team, the person implementing the pipeline should be different from the person(s) implementing the models. To create the pipeline, I recommend using a dumb model, e.g., a model just randomly predicting one of the possible values. This allows us to see if the pipeline successfully runs through all the steps, stores the relevant results, etc. After implementing the pipeline, we simply define a list of models (all adhering to a certain interface so that the pipeline knows how to interact with the models) which run through the pipeline, getting trained and evaluated one after another on exactly the same data.
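To illustrate the separation between pipeline and models, here is a deliberately small sketch. The `Model` interface, `RandomModel` and `run_pipeline` are hypothetical names, accuracy is used as the only metric, and tiny in-memory lists stand in for the saved data set splits.

```python
import random
from typing import Protocol, Sequence


class Model(Protocol):
    """Hypothetical interface every model must satisfy to run in the pipeline."""

    name: str

    def fit(self, inputs: Sequence, labels: Sequence) -> None: ...

    def predict(self, inputs: Sequence) -> list: ...


class RandomModel:
    """Dumb model used only to check that the pipeline runs end to end."""

    name = "random"

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.classes: list = []

    def fit(self, inputs: Sequence, labels: Sequence) -> None:
        self.classes = sorted(set(labels))

    def predict(self, inputs: Sequence) -> list:
        return [self.rng.choice(self.classes) for _ in inputs]


def run_pipeline(models, train, validation):
    """Train and evaluate every model on exactly the same data."""
    train_inputs, train_labels = train
    val_inputs, val_labels = validation
    results = {}
    for model in models:
        model.fit(train_inputs, train_labels)
        predictions = model.predict(val_inputs)
        correct = sum(p == y for p, y in zip(predictions, val_labels))
        results[model.name] = correct / len(val_labels)
    return results


# Toy data standing in for the saved splits.
train = ([0, 1, 2, 3], ["a", "b", "a", "b"])
validation = ([4, 5], ["a", "b"])
print(run_pipeline([RandomModel(seed=1)], train, validation))
```

Adding a new model then means writing one class with `name`, `fit` and `predict` and appending it to the list; the pipeline code itself stays untouched.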

If we go back to our inspiration from the test-driven development approach, we can regard the pipeline with the included evaluation as the test suite a model has to pass. By implementing the pipeline first, we define the framework including all the important decisions (which evaluation metrics to use, how to split the data, etc.) in the beginning. The pipeline describes what the end result should be, just as the unit tests do in TDD. With a given pipeline, another person can simply implement a new model and immediately check how it compares to the other models and whether it meets the expected success criteria.

One desirable property of our setup is that it should produce reproducible results, i.e., when we run the pipeline with the same settings (same data, model, parameters, etc.), we should end up with the same numbers as in previous runs. However, most of the time, there will be some parts of the pipeline which rely on one or several random number generators (RNGs), e.g., to shuffle the data set during training. RNGs usually can be initialized with a seed value, which then results in getting the same "random" sequence every time. Again, outlining all the details and potential pitfalls in setting up the RNGs correctly might require a dedicated article. For now, I just would like to make you aware of the following:

  • Often, there will be different modules / libraries with their own RNG. For example, if you're using the typical Python libraries, you might need to seed Python's built-in random module, NumPy, and TensorFlow / PyTorch, and pass a fixed random_state to the scikit-learn functions that accept one. So make sure you're not missing any component.

  • You should think about where best to initialize a certain RNG. For example, if you're training several models, it is usually not advisable to share an RNG between them. If you do, your pipeline will become sensitive to, e.g., the order in which the models are trained.
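The second point can be sketched with Python's standard library alone. In this example, every model gets its own RNG whose seed is derived from a base seed and the model's name; `model_seed`, `shuffled_order` and the model names are hypothetical, and a real pipeline would additionally have to seed NumPy and the deep learning framework in use.

```python
import hashlib
import random


def model_seed(base_seed, model_name):
    """Derive a stable per-model seed from a base seed and the model's name."""
    # hashlib is used instead of hash(), which is randomized per process for strings.
    digest = hashlib.sha256(f"{base_seed}:{model_name}".encode()).hexdigest()
    return int(digest[:8], 16)


def shuffled_order(n_samples, rng):
    """Stand-in for any random step during training, e.g. shuffling the data."""
    order = list(range(n_samples))
    rng.shuffle(order)
    return order


def run(model_names, base_seed=1234):
    """Give every model its own RNG so results do not depend on training order."""
    return {
        name: shuffled_order(5, random.Random(model_seed(base_seed, name)))
        for name in model_names
    }


first = run(["model_a", "model_b"])
second = run(["model_b", "model_a"])  # same models, different order
print(first["model_a"] == second["model_a"])
```

Had both models drawn from one shared RNG, the shuffle each model sees would depend on which model ran first.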


Your training and evaluation pipeline is the framework within which you build your model. Create the pipeline first and independently from the model. Make sure to control the randomness of the pipeline to make results reproducible.

Step 6: Finally, build your model

Now that we have the pipeline in place we can finally build our new model(s). If we created the pipeline correctly, we should be able to focus on the details of the model and then automatically get the evaluation results. Running the pipeline several times should produce the same results. In a team, it should be easy to allow several people to come up with new models and have them evaluated in the same manner.


In this article, I tried to describe the idea of evaluation-driven machine learning. By starting with the evaluation, we set ourselves up to not just "play around" but efficiently produce reliable insights.

The described steps can only serve as a rough guideline. Actually implementing such a pipeline can get very complex with huge datasets, complicated tasks and many models to evaluate. Obviously, at scale, things always get more difficult. Nevertheless, the idea of evaluation-driven machine learning primarily is a way of thinking about machine learning projects, not necessarily about the individual steps.

While writing this article, I realized how many questions might arise from it and how many important details are not covered. If you're interested in more thorough coverage of the different aspects, please let me know.

