From backtest result to statistically expectable PNL

Interpreting the result of a backtest correctly is complicated. Many strategy developers never master even the basic rules of statistics and end up interpreting numbers they do not understand. They take, for example, the result of a backtest, the PNL, as a number they can realistically expect. It is not.

The PNL has a problem – it may simply be luck

Robert Pardo’s "The Evaluation and Optimization of Trading Strategies" is a great book that has had a big influence on how we develop trading strategies. We recommend it to anyone in the strategy development business.

Robert Pardo’s book covers a significant number of important aspects. While this is not a review (we already published one on this blog, check the tags), we are going to have a look here at one aspect of the book that has made us re-evaluate our approach to backtest analysis.

One important question when analysing backtest information is always whether the results are representative. A lot of programming goes into Monte Carlo simulations, which simulate alternative results by recombining the trades of a backtest in a large number of random combinations. The reason is that the result of a single backtest has a critical statistical problem: the significance of the result. If I have 200 trades but only 5 losing ones, are the losses really representative, or are they a random element that, by luck, is better than would be normal? The fewer samples I have, the less statistically stable the result is.

What does this mean? Let us take a perfect die. On a roll it returns a number from 1 to 6, each with the same chance. If I make a test of 12 rolls, though, I will not normally get a result that shows 2 x the 1, 2 x the 2 and so forth. That would be a perfect result, and it is rare. I can easily get an average below 3 or above 4 (the statistical average is 3.5), and that with a perfectly good die. Over a large sample size, this variation decreases. But if a strategy has a high percentage of profitable trades, the small number of losing trades may hide a much worse loss. The backtest may simply have been lucky to avoid the larger losses.
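The die example is easy to try out. The following is a minimal sketch, not part of Pardo's method: it repeats the experiment many times for different numbers of rolls and reports the lowest and highest average seen. The function names, the number of experiments and the seed are our own choices for illustration.

```python
import random
import statistics


def average_of_rolls(num_rolls, rng):
    """Roll a fair six-sided die num_rolls times and return the average."""
    return statistics.mean(rng.randint(1, 6) for _ in range(num_rolls))


def spread_of_averages(num_rolls, num_experiments=10_000, seed=1):
    """Repeat the experiment and return the (lowest, highest) average seen."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    averages = [average_of_rolls(num_rolls, rng) for _ in range(num_experiments)]
    return min(averages), max(averages)


if __name__ == "__main__":
    for n in (12, 120, 1200):
        low, high = spread_of_averages(n)
        print(f"{n:5d} rolls: averages range from {low:.2f} to {high:.2f}")
```

The range of averages should narrow noticeably as the number of rolls grows, which is the same effect that makes a handful of losing trades an unreliable basis for judging losses.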

To get a PNL one can expect, the number must be corrected

Robert Pardo has introduced the concept of a statistically corrected PNL. It takes both the profit and the loss and corrects each by its statistical error. This means that a profit side based on a large number of stable profits gets only a small downward correction, while a loss side based on a small number of possibly widely varying losses gets a significant increase in loss. The result is an “expectancy profit”, a profit that I can statistically expect. It will always be worse than the original result it is based upon, but that is not necessarily a bad thing. It is always better to make more money than expected than to lose when one expects a profit. All this is described in his book “The Evaluation and Optimization of Trading Strategies” (a core book for our work), from page 127 onward.

The statistically corrected PNL is based on core statistical principles

In order to calculate the correct expected PNL – which will always be lower than the result of the backtest – we calculate statistically pessimistic profit and loss numbers, separately. Every trading strategy will have losses – otherwise it is ridiculously curve fitted or the sample size is too small (say, only 3 or 4 trades). By calculating a pessimistic Profit (lower than the one in the backtest) and a pessimistic loss (higher than the one in the backtest) we can get an expected total profit (also known as PNL).

To calculate those numbers, we need a few basic inputs. The same formula is used for both profit and loss, so we describe it only once.

What we need is:

  • The number of trades (separate for profit and loss, obviously)
  • The average value (separate for profit and loss, again)
  • The Standard Deviation (separate for profit and loss)

I can assume that anyone who even dabbles in automated trading can calculate an average: it is the sum divided by the number of elements. The standard deviation is more complex, but it is a basic staple of statistical calculations. Any trader who does not know how to calculate it should please acquire some basic statistics knowledge, or head over to Wikipedia: Standard Deviation, or, with less maths, to the simple 30 second Standard Deviation tutorial. The standard deviation is calculated by taking the difference of every sample from the average of the samples, squaring each difference, summing the squares, dividing that sum by the number of samples (or by the number of samples minus one when estimating from a sample), and then taking the square root of the result. The standard deviation is a measure of uncertainty. Assuming the data is normally distributed, around 68% of the sample values will fall within 1 standard deviation of the average (i.e. between the average -1 standard deviation and the average +1 standard deviation).

It is noteworthy that the standard deviation is smaller when results are close together, obviously. This means that a large standard deviation is the result of unstable numbers. If 100% of my profitable trades are 1 tick (because I scalp for a 1 tick target) and I either make this 1 tick or have a loss (stopped out), then this is a perfectly aligned profit (always 1 tick) without any deviation. If profits range from 1 to 1000 ticks with an average of 400 ticks, this will result in a quite large standard deviation, unless the 1 and 1000 tick results are real outliers and make up, for example, only 3 of 400 trades. Stability in both profits and losses results in a small standard deviation for each.
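For readers who prefer code to formulas, here is a small sketch of the average and the standard deviation. It assumes the population form (dividing by the number of samples); Python's statistics.stdev would divide by the number of samples minus one instead. The trade values are made up for illustration.

```python
import math
import statistics


def average(values):
    """Sum divided by the number of elements."""
    return sum(values) / len(values)


def standard_deviation(values):
    """Population standard deviation: square root of the mean squared
    difference between each value and the average."""
    mean = average(values)
    squared_diffs = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared_diffs) / len(values))


if __name__ == "__main__":
    scalper_wins = [1, 1, 1, 1, 1]         # always exactly 1 tick
    swing_wins = [1, 250, 400, 550, 1000]  # made-up, widely spread wins
    print("scalper:", standard_deviation(scalper_wins))
    print("swing:  ", standard_deviation(swing_wins))
    # Cross-check against the standard library's population std dev
    print("pstdev: ", statistics.pstdev(swing_wins))
```

The scalper's wins have no deviation at all, while the widely spread wins produce a large one.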

From Standard Deviation we need to calculate the expected Profit and Loss

Now that we have the standard deviation and the other core numbers for profit and loss, we need to calculate a corrected profit and loss number. For this we calculate the standard error. This is done by taking the standard deviation and dividing it by the square root of the number of trades. This standard error is the amount by which we have to correct every single trade (separately for profit and loss). This means that the standard error is multiplied by the number of trades (in profit or in loss) and then subtracted from the total profit or added to the total loss. Put differently: corrected profit = number of winning trades x (average win - standard error), and corrected loss = number of losing trades x (average loss + standard error). The result is a higher loss and a lower profit, and thus a much lower total profit. It is a much more realistic expectation, though.
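The calculation described above can be sketched in a few lines. This is our reading of the steps in this article, not code from Pardo's book. The function names are hypothetical, the population standard deviation is an assumption, and the trade list is invented.

```python
import math
import statistics


def pessimistic_total(amounts, worse_is_higher):
    """Shift every trade by one standard error in the unfavourable direction.

    amounts: trade sizes as positive numbers (wins, or losses as positive values).
    worse_is_higher: True for losses (bigger is worse), False for profits.
    """
    n = len(amounts)
    if n == 0:
        return 0.0
    std_dev = statistics.pstdev(amounts)  # assumption: population std dev
    std_error = std_dev / math.sqrt(n)
    correction = std_error * n            # one standard error per trade
    total = sum(amounts)
    return total + correction if worse_is_higher else total - correction


def expected_pnl(trade_results):
    """Statistically corrected PNL from a list of gross per-trade results."""
    wins = [r for r in trade_results if r > 0]
    losses = [-r for r in trade_results if r < 0]
    pessimistic_profit = pessimistic_total(wins, worse_is_higher=False)
    pessimistic_loss = pessimistic_total(losses, worse_is_higher=True)
    return pessimistic_profit - pessimistic_loss


if __name__ == "__main__":
    trades = [150, 250, 150, 250, -100, -300]  # invented gross results
    print("backtest PNL:", sum(trades))
    print("expected PNL:", round(expected_pnl(trades), 2))
```

Note how the two losing trades, being few and far apart, receive a much larger correction than the four winning trades.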

Obviously, this should be done on gross profit and loss, and fees should be applied separately. The mathematics behind it is nothing a computer cannot do in very little time, which is a significant advantage of this approach compared to a Monte Carlo simulation. It does not replace it, but it does give a fast expectancy that is both realistic and easy to compute. The difference between the backtest PNL and the statistically expected PNL can be seen as an indication of the robustness of the backtest.

The statistically expected PNL will be more robust with more samples

European Copyright Laws allow us to quote Robert Pardo here. This is from page 128 of this excellent book (the example uses an average winning trade of $200):

To get an idea of how this plays out with different sample sizes, consider three examples of standard error based on different trade sample sizes of 10, 30, and 100. We will assume a standard deviation for our winning trades of $100. When our number of wins is 10, the standard error is:

Standard Error = 100 / SqRt(10)
Standard Error = 100 / 3.16
Standard Error = 31.65

With a sample of 10 trades, the standard error is 31.65 rounded to $32. Plugging this value into our formula, the range of wins is $200 +/– $32 or $168 to $232. With a sample of 30 winning trades, the standard error is $18. ($18.25 rounded). The expected range then of wins is $200 +/– $18 or $182 to $218. Finally, with a sample of 100 winning trades, the standard error is $10. The expected range then of wins is $200 +/– $10 or $190 to $210. From these examples, it is clear that the larger the trade sample size, the lower the standard error or variance of winning trades. Whereas we selected the average winning trade for our analysis, this relationship of larger sample size to smaller standard error will hold true for all performance statistics produced by a historical simulation. The larger the trade sample, the smaller the standard error.
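The arithmetic in the quoted example can be reproduced with a few lines of code. This is a sketch with names of our own choosing. Because it does not round the square root first, the standard error for 10 wins comes out at about 31.62 rather than the quoted 31.65; the rounded dollar figures are unaffected.

```python
import math


def standard_error(std_dev, num_trades):
    """Standard deviation divided by the square root of the sample size."""
    return std_dev / math.sqrt(num_trades)


if __name__ == "__main__":
    average_win = 200.0  # average winning trade used in the quoted example
    std_dev = 100.0      # assumed standard deviation of the winning trades
    for n in (10, 30, 100):
        se = standard_error(std_dev, n)
        print(f"{n:3d} wins: standard error {se:6.2f}, "
              f"range {average_win - se:.0f} to {average_win + se:.0f}")
```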

This is a very important aspect. I have often seen backtest results with a large number of profits and a small number of losses. The problem with a backtest that has 220 trades of which 20 are losses is that these 20 trades may be pure luck. Especially if the individual losses vary widely, this may have a significant impact on the result.

Statistically Expected PNL vs. Monte Carlo Simulations

We have already said that the statistically expected PNL approach described here is not a replacement for a Monte Carlo simulation. What is interesting, though, is that it actually has some advantages. It is not only faster to compute, it can also produce results that are more in line with what is really to be expected. The weak point of the Monte Carlo simulation is that it is based ONLY on the samples. It combines the samples in thousands of combinations, but it never questions the distribution of the samples per se. 20 widely distributed losses may turn into 100 losses in a larger sample, some of them larger than anything seen so far: very unlikely, but still statistically sound. To go back to the die example from the beginning, I may run a Monte Carlo simulation based on a sample of 30 die rolls, and by pure chance these samples may not include a single “1”. This means that in all the combinations the Monte Carlo simulation then creates from this sample, there will never be a die roll of 1.
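The limitation is easy to demonstrate. The sketch below assumes one common form of Monte Carlo simulation, redrawing from the observed sample with replacement; other variants only reshuffle the order, which has the same blind spot. The 30 observed rolls are invented and deliberately contain no 1.

```python
import random


def resample_runs(sample, num_runs=1000, seed=7):
    """Build simulated runs by drawing from the observed sample with replacement."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        [rng.choice(sample) for _ in range(len(sample))]
        for _ in range(num_runs)
    ]


if __name__ == "__main__":
    # 30 invented rolls of a fair die that, by chance, contain no 1
    observed = [2, 3, 4, 5, 6, 3, 5, 2, 6, 4] * 3
    runs = resample_runs(observed)
    ones_seen = sum(roll == 1 for run in runs for roll in run)
    print("rolls of 1 across all simulated runs:", ones_seen)
```

No matter how many runs are generated, the simulation can only ever return values that were already in the sample.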

The statistically expected PNL calculation avoids this by using standard, widely accepted statistical formulas that are built around uncertainty. I can never be sure that a sample, especially a small one, contains all possible outcomes.

As such, both the Monte Carlo simulation and the statistically expected PNL are good tools for evaluating what to expect from a strategy.

Calculating the Statistically Expected PNL is not complicated

Those using a standard software package first have the problem of exporting the data, for example into Excel. Those working with a flexible and expandable infrastructure or their own framework, as we do with the Reflexo Trading Framework we use at NetTecture, can simply build the calculation into the core of their infrastructure.