Experimental design
Good experiments are built upon good experimental design. Before you start experimenting, walk through the topics below.
Intended for: BSc, MSc, PhD
1. Choose test environments (unless your work is fully theoretical).
Take this step seriously, because once you start working on an environment it will take quite some time to get results, and this time should be worth it. In general, you want the environment to be:
Representative of the challenge you investigate. (However, be careful not to design your test environment too narrowly around your own problem.)
Computationally feasible. Very important; see full details in Practical Considerations. The general rule: start with a small problem in which your debug loop is quick, and only gradually increase the dimensionality of your tasks once your method works on the smaller ones (see the sketch after this list).
Interesting. Of course, you also want to draw your readers in, and you have to enjoy working on the problem yourself. Look for a set of tasks that you genuinely consider interesting.
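As a concrete illustration of 'start small', the sketch below instantiates a cheap environment for fast debugging before moving to a heavier benchmark. It assumes the Gymnasium library is installed; the specific environment names are only examples.

```python
import gymnasium as gym

# Start with a cheap environment, so one full debug run takes minutes rather than hours.
debug_env = gym.make("CartPole-v1")

# Only once the method works there, scale up to a heavier benchmark, e.g.:
# full_env = gym.make("ALE/Breakout-v5")  # needs the Atari extras and is much slower
```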
2. Choose baselines.
When you experiment with your new method, you want to find out whether it works better than the available alternative approaches (the 'baselines').
Finding baselines: See Finding Related Work. Make sure to include the method that currently holds the best performance on your problem (the SOTA, 'state of the art').
Implementation: Coding the baseline method yourself could take a lot of time. However, this may not be necessary.
Official code: Often, the authors of a paper publish official code along with their paper, which you can simply rerun. Otherwise, email the authors, explain that you want to use their method as a baseline, and ask whether they could share their code so you can run it yourself.
Third-party implementation: You may also search for another public implementation of the algorithm. However, be careful to check whether the implementation is correct.
Own implementation: If both of the above options fail, you will have to implement the baseline yourself.
3. Choose your evaluation criteria.
Although the evaluation criterion is often straightforward, do not forget to think about it in advance (or your experiments will be worthless).
Supervised learning:
Criterion: The criterion depends on the task, but may involve 'accuracy' (for classification), a 'loss' (e.g., MSE for regression), or 'log likelihood' (for probabilistic models; in generative modelling this is commonly reported in nats or bits per dimension), etc.
Train/validation/test: In supervised learning, always split your data into 1) a training set, 2) a validation set, and 3) a test set. You train on the training set, monitor performance during training on the validation set (to determine when to stop), and then report results on the test set.
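A minimal sketch of such a three-way split, assuming scikit-learn is available; the toy data and the 80/10/10 ratio are purely illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, just to make the example runnable.
X, y = np.random.randn(1000, 8), np.random.randint(0, 2, size=1000)

# First split off a held-out test set, then carve a validation set from the rest.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=1/9, random_state=0)

# Train on the training set, decide when to stop on the validation set,
# and touch the test set only once, for the final reported numbers.
```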
Reinforcement learning:
Criterion: In reinforcement learning, the most common evaluation criterion is the 'cumulative reward' (return). In goal-based RL, you could also consider alternatives such as the '% of episodes in which the goal is reached'.
No train/validation/test: There is usually no train/validation/test split in RL, since there is no fixed dataset (the environment can provide an essentially unlimited number of samples). Instead, the challenge is usually to sample the right data ('exploration'), since we need to collect the data ourselves.
Evaluation episodes: However, proper evaluation in RL problems may require separate evaluation episodes. During training, you often have some exploration switched on, which means your agent may act suboptimally. Therefore, instead of logging the performance of training episodes, you may want to run separate evaluation episodes at fixed intervals, in which you switch exploration (almost) completely off (a minimal sketch follows below this list).
(Note: picking fair evaluation criteria is actually not easy, and entire papers have been written about it.)
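As mentioned under 'Evaluation episodes' above, a common pattern is to interleave greedy evaluation episodes with training. The sketch below assumes a hypothetical `env` and `agent` with `reset`/`step` and `act` methods, so treat the interface as illustrative only.

```python
def evaluate(env, agent, n_episodes=10):
    """Run evaluation episodes with exploration switched off and return the mean return."""
    returns = []
    for _ in range(n_episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            action = agent.act(state, explore=False)  # greedy: no exploration noise
            state, reward, done = env.step(action)
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)

# During training, call evaluate() at fixed intervals and log its output,
# rather than logging the (noisy, exploratory) training returns.
```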
4. Keep comparisons fair.
Make sure to keep the comparison between methods fair. This is very important, and (unfortunately) very often ignored.
Fix the seed (only once): Fix your random seed at the beginning, and never keep trying new seeds until your method happens to beat the baselines. See Statistical Practice for more details (a minimal seeding sketch follows after this list).
Hyperparameter tuning: In principle, you should invest the same amount of hyperparameter tuning effort into each baseline as you do into your own method.
It is, for example, unfair to extensively tune your own method on a given task, and then only run the baseline algorithm with its default parameters.
When you study the exact same task as a baseline paper, you can simply copy their results (without rerunning).
See Hyperparameter Tuning for advice on how to tune hyperparameters.
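For the 'fix the seed' point above, a minimal sketch of seeding the common random number generators once at the start of a run; the function name is just an example, and the torch lines are only relevant if you use PyTorch.

```python
import random
import numpy as np

def set_global_seed(seed: int = 0) -> None:
    """Seed the standard RNGs once at startup; never tune this value."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)  # only if PyTorch is used
    except ImportError:
        pass

set_global_seed(0)
```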
5. Decide how to challenge your method (optional).
Apart from comparing your method to other approaches (the baselines), you also want to figure out how your method works. Two types of questions/experiments are common:
Sensitivity analysis: A typical experiment is to vary the setting of a certain hyperparameter of your algorithm and investigate the effect on performance. Does your method only work with exactly the right exploration parameter, or is its performance more robust? How does performance change when you increase the computational budget per timestep? Think in advance about which variations might be interesting (a minimal sweep sketch follows after this list).
Ablations: A second typical type of experiment is the 'ablation', where you completely switch off a part of your algorithm. Maybe your approach combines different ideas, and you want to find out what the contribution of each element is. What happens if you entirely switch off the search module, or eliminate a certain term in your loss function? How much performance do you lose (or even gain)?
TO DO: Type of experiment: proof of concept, insight, comparison.
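A sensitivity analysis, as described above, often boils down to a simple sweep. The sketch below uses a placeholder `run_experiment` (hypothetical; returning a noisy dummy score so the example runs) that you would replace with a full training run.

```python
import numpy as np

def run_experiment(exploration_eps: float, seed: int) -> float:
    """Placeholder for a full training run; replace with your own code.

    Here it just returns a noisy dummy score so the sweep below is runnable.
    """
    rng = np.random.default_rng(seed)
    return 1.0 - abs(exploration_eps - 0.1) + 0.05 * rng.standard_normal()

eps_values = [0.01, 0.05, 0.1, 0.2, 0.5]   # example sweep over an exploration parameter
n_seeds = 5

for eps in eps_values:
    scores = [run_experiment(eps, seed) for seed in range(n_seeds)]
    print(f"eps={eps}: {np.mean(scores):.2f} +/- {np.std(scores):.2f}")
```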
6. Choose your visualizations.
A final crucial consideration is how you will visualize your experiments. Think about this in advance, because you will have to log the right data! (Although often new ideas about visualization come up during your experimentation cycles.)
Learning curves: A common element in (reinforcement) learning papers is the learning curve, which displays timesteps on the x-axis and cumulative reward on the y-axis. Make sure to compare the different methods in a single plot (i.e., don't generate separate plots for every method/setting); a plotting sketch follows at the end of this section.
Tables: Another common element in learning experiments is a table that displays some (final) performance measure, such as 'accuracy' (for classification), 'log likelihood' (probabilistic fit of the test set under the model), or 'average return over the last 10% of training episodes' (for reinforcement learning).
Insight plots: These plots are often problem-specific. Can you visualize a characteristic example of your method that illustrates its strengths and/or weaknesses? Can you visualize what happens inside your algorithm during execution? And so on.
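For the learning curves above, a minimal matplotlib sketch that puts all methods in a single plot; the curve data here is synthetic, purely to make the example self-contained.

```python
import numpy as np
import matplotlib.pyplot as plt

timesteps = np.arange(0, 100_000, 1_000)
# Synthetic curves standing in for the logged evaluation returns of each method.
curves = {
    "our method": 100 * (1 - np.exp(-timesteps / 30_000)),
    "baseline":   80 * (1 - np.exp(-timesteps / 40_000)),
}

fig, ax = plt.subplots()
for name, returns in curves.items():
    ax.plot(timesteps, returns, label=name)  # one line per method, all in the same axes
ax.set_xlabel("timesteps")
ax.set_ylabel("cumulative reward per evaluation episode")
ax.legend()
plt.show()
```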