Statistical practice
Any experiment that involves a source of randomness needs to follow correct statistical practice. Without it, your conclusions are simply invalid.
Intended for: BSc, MSc, PhD
Experiments & Noise
A key insight in science is that most experiments will not give the exact same outcome when you repeat them (which is the basis for the entire replication crisis). There are various reasons why the outcome of an experiment may change over repetitions:
Finite sample size noise: You use a real-world dataset, but of course only have access to a finite number of measurements. With a different dataset for the same problem, your outcome could have been different. This is a prime reason we need statistical analysis: to quantify what our conclusions are worth beyond this specific dataset.
Note that this problem is further aggravated by environment noise (inherent stochasticity in the world) and measurement noise: they only make the variation between datasets you could observe larger.
Algorithm (repetition) noise: Second, your algorithm execution may also have inherent noise. Maybe you stochastically initialize the parameters of a neural network that you will train (which is necessary to break the symmetry between parameters). Or you use a search algorithm that samples noise to explore. Each new run of your experiment on the same environment/dataset will then still give a different outcome.
The key message of this page is that you always need to think about these possible sources of noise in your experiment, and how you will handle them, to make your interpretations valid and useful.
Finite sample size noise
Finite sample size noise is a key issue in nearly all empirical research. How we deal with the problem depends on our underlying question, which brings us to the major distinction between statistics and machine learning:
Statistics
Main interest: interpretation. We want to estimate the effect of a certain input variable on the outcome, in a way that we can explain.
Challenge: Our finite sample size will only give us a single estimate. We want to quantify its uncertainty.
Solution: Use an appropriate parametric or non-parametric statistical test to assess whether the parameter differs from a baseline hypothesis (usually 'no effect'). Collecting extra data should make your uncertainty smaller.
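As a minimal illustration of this 'test against a baseline of no effect' idea, the sketch below fits a simple linear regression with SciPy and reports the p-value for the null hypothesis that the slope is zero. The data and effect size are made up purely for illustration.

```python
# Minimal sketch: test whether an estimated effect (regression slope)
# differs from the baseline hypothesis of 'no effect' (slope = 0).
# The data below is synthetic and purely illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)                       # input variable
y = 0.3 * x + rng.normal(scale=1.0, size=100)  # outcome with a true effect of 0.3

result = stats.linregress(x, y)
print(f"estimated effect: {result.slope:.3f} +/- {result.stderr:.3f}")
print(f"p-value (H0: no effect): {result.pvalue:.4f}")
```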
Machine learning / supervised learning
Main interest: prediction. We are less interested in any interpretable effect of the input on the output, and mainly want to make accurate predictions at the output level.
Challenge: Our finite sample gives us only one set of data, while we want to measure generalization to new (unseen) data.
Solution: Use a train/validate/test split, and measure generalization on the test set. Use n-fold cross-validation to get n generalization measures out of a single dataset. Collecting extra data should make your generalization error smaller.
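As a rough sketch of this workflow (assuming scikit-learn, with a built-in dataset and a generic model standing in for your own), a train/test split plus 5-fold cross-validation could look like this:

```python
# Sketch: train/test split plus 5-fold cross-validation with scikit-learn.
# The dataset and model here are placeholders; substitute your own.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = Ridge()
cv_scores = cross_val_score(model, X_train, y_train, cv=5)  # 5 generalization estimates
print("cross-validation scores:", cv_scores)

model.fit(X_train, y_train)
print("held-out test score:", model.score(X_test, y_test))
```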
Reinforcement learning
In reinforcement learning we do not have a finite dataset, but access to an environment/simulator from which we can in theory sample an infinite amount of data. As a result, the train/test split usually disappears: at test time, we simply sample new data from the environment/simulator.
If the environment is big enough and/or has enough stochasticity, then this is perfectly valid (the test traces will not overlap much with the train samples). In Chess, for example, you can always use the game simulator for both training and testing, since the game is so large.
If your environment is small, then train and test data may overlap. This could be fine, but note that you are then only assessing the ability to solve that specific problem instance, not any form of generalization.
If you are in the offline RL case, where we only have access to a finite RL dataset, then all train/test considerations return.
Algorithm (repetition) noise
Statistics
Amount of noise: Given a certain dataset and model specification, statistical methods usually estimate the same solution (the global optimum). Variation between runs is therefore almost absent.
Number of repetitions: A single run of your algorithm is usually fine.
Supervised learning
Amount of noise: Given a certain dataset and model specification, the performance of supervised learning methods may still vary between runs. The stochastic initialization of neural network parameters is a good example of a factor that will influence performance.
Number of repetitions: You may therefore want to try multiple runs (repetitions) per dataset and hyperparameter setting.
Reporting repetitions: It's usually best to report average performance (this is what someone can expect if they run your experiment once). In this case you could also argue for reporting the best run, since the neural network parameter initialization is really a hyperparameter that you can control yourself. However, do make sure that you give each method an equal hyperparameter tuning budget.
Reinforcement learning
Amount of noise: In reinforcement learning, the variation between runs is usually very high. The issue is that we need to collect our own data, and with a slight change in noise from the environment or our own policy, we may see different data and push our solution in a totally different direction (local optimum).
Note that this is not the same problem as 'finite sample size' (for which we use the train/test split). We can still sample as much train or test data from the environment as we want, but our entire run can simply be lucky or unlucky from the start.
This is the major evaluation difference between reinforcement learning and supervised learning: its huge repetition variance.
Number of repetitions: You therefore definitely need to run multiple experiments per environment and hyperparameter setting.
Reporting repetitions: You now need to report average performance (with standard deviation/standard error, see below), especially when the environment is stochastic. We don't want to see your lucky or unlucky single run, but how well your algorithm does on average, i.e., what to expect in the next run.
Running repetitions
Some advice for running experiment repetitions (for reinforcement learning and supervised learning):
1. Run enough repetitions (as many as is computationally feasible): Aim for at least 3 to 5.
2. Make repetitions truly independent: Make sure that every repetition is a completely new run. Reinitialize your network parameters, reinitialize your environment, etc. Make sure that the comparison is fair: each run should be a genuinely new problem instance that another researcher (who wants to replicate/use your work) would also face. A sketch of such a setup follows after this list.
3. Never tune the seed: You perform repetitions to get an estimate of how your method does on average. Therefore, your repetitions need to be randomly drawn from the space of possible experiments.
Therefore, never optimize the seed for performance. This makes the comparison unfair, since you then report the best possible (lucky) performance of your method, not the average. In real applications the seed is not tunable, and your method will suddenly perform much worse.
You can include the seed as a hyperparameter to make your experiments reproducible, but if you do so, set the seed only once (and don't touch it anymore!). For example, set 'seed=12345' and then leave it like that. If you start adjusting it along the way (depending on observed results), then your comparison becomes unfair.
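The sketch below illustrates this setup, assuming a hypothetical run_experiment function that builds a fresh model/environment from a given seed. The point is that a single fixed master seed deterministically generates one independent seed per repetition, and none of them are ever tuned.

```python
# Sketch: run truly independent repetitions from one fixed master seed.
# run_experiment is a hypothetical placeholder for your own training run.
import numpy as np

MASTER_SEED = 12345  # set once, never adjusted based on results


def run_experiment(seed: int) -> float:
    """Placeholder: reinitialize environment/model with this seed, train,
    and return final performance."""
    rng = np.random.default_rng(seed)
    return float(rng.normal(loc=1.0, scale=0.5))  # stand-in for a real result


# Derive one independent seed per repetition from the master seed.
rep_seeds = np.random.SeedSequence(MASTER_SEED).generate_state(5)
results = [run_experiment(int(s)) for s in rep_seeds]
print("per-repetition results:", results)
```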
Reporting results
Finally, you need to decide how you report on the noise in your experiments:
1. Statistical measure of interest: This is usually the mean, but other statistics such as the median or max may also make sense, as long as you can argue why.
2. Final performance versus learning curves: Determine whether you only want to show the final performance, or whether you want to show the entire learning curve (with some measure of execution time on the x-axis, and performance on the y-axis).
Learning curves can give more insight into the overall process, and are especially helpful for noisy processes (for which final performance is very unstable); they are very common in reinforcement learning. Don't forget to average your learning curves over your repetitions, and apply smoothing if necessary to make them interpretable (a sketch follows below). See Figures and Tables as well.
Remember, don't feel restricted in any way when it comes to visualization: any figure or table that's useful to you is likely useful to the reader as well.
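As a rough illustration (with synthetic curves standing in for real training runs), averaging learning curves over repetitions and applying a simple moving-average smoother could look like this:

```python
# Sketch: average learning curves over repetitions and apply light smoothing.
# 'curves' stands in for a (n_repetitions, n_steps) array of performance values.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
curves = np.cumsum(rng.normal(0.1, 1.0, size=(5, 200)), axis=1)  # fake noisy curves

mean_curve = curves.mean(axis=0)  # average over the 5 repetitions
window = 10
smoothed = np.convolve(mean_curve, np.ones(window) / window, mode="valid")

plt.plot(smoothed)
plt.xlabel("training steps")
plt.ylabel("performance (smoothed mean over 5 repetitions)")
plt.show()
```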
3. Standard deviation versus standard error: You then usually want to report on the amount of noise, either in your parameter estimates (for finite sample size noise) or over your repetitions (for repetition noise). You have two choices, which are both relevant:
Standard Deviation (SD):
It estimates the amount of variation between runs.
This quantity stays stable with additional data.
Standard Error of the Mean (SE):
It estimates the amount of variation/uncertainty in your current estimate of the mean.
This quantity will decrease with additional data, scaling as SD/sqrt(n) for n repetitions of your experiment. With more data, you become more and more certain about the true mean performance.
Make sure you understand this difference, think about what you need to report, and how you can interpret it. Both are relevant measures, but they give different information. The first indicates what your algorithm may do at best or worst (in an extreme case), while the second is useful to determine how a method does on average (which is an intermediate step towards statistical hypothesis testing, see below).
It's good to add one of the above quantities to a learning curve as well, for example as a shaded band around the mean. Packages like Weights and Biases let you plot this automatically.
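As a small numeric sketch (with made-up per-repetition scores), the two quantities are computed as follows:

```python
# Sketch: standard deviation vs. standard error over repetition results.
# The scores below are made up purely for illustration.
import numpy as np

results = np.array([0.82, 0.75, 0.91, 0.78, 0.86])  # one score per repetition
n = len(results)

mean = results.mean()
sd = results.std(ddof=1)  # variation between runs (stays stable with more runs)
se = sd / np.sqrt(n)      # uncertainty in the estimated mean (shrinks with more runs)

print(f"mean = {mean:.3f}, SD = {sd:.3f}, SE = {se:.3f}")
```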
4. Statistical testing: When you formally want to assess whether one method is better than another, you should use a statistical hypothesis test.
In statistics, this is always necessary.
In machine learning, it is useful, although researchers sometimes only report mean estimates with SD/SE, and leave it up to the reader to judge whether they find the difference relevant (which is actually bad practice).
In reinforcement learning, the situation is even worse, and statistical tests are almost never used. The problem is that the learning curves are usually quite noisy, and outcomes may therefore change a lot depending on the timepoint at which you measure. Therefore, researchers often report learning curves with SD/SE confidence bands, and leave it up to the reader to judge whether one method is better than another (also bad practice).
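For completeness, a minimal sketch of such a test (Welch's t-test on the final performance of two methods over their repetitions, with made-up numbers) could look like this; a non-parametric alternative such as the Mann-Whitney U test works similarly:

```python
# Sketch: Welch's t-test on per-repetition final performance of two methods.
# The scores below are made up purely for illustration.
from scipy import stats

method_a = [210.3, 195.7, 225.1, 201.8, 218.4]  # final performance per repetition
method_b = [188.9, 176.2, 199.5, 181.0, 192.3]

t_stat, p_value = stats.ttest_ind(method_a, method_b, equal_var=False)  # Welch's t-test
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```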