How to handle backtest data - 3000GB in our case
Backtests and optimizations produce tons of data that needs to be analysed. Much of it is useful, and a lot of it is not something you throw away immediately. Yes, some of it is: a backtest run only to check whether a newly programmed code path works is not useful long term and is discarded right away. But once you start the real optimization and a strategy is somewhat stable, the data should be preserved. Automated trading is likely one of the most data-intensive things even a single person can attempt.
We develop strategies, and we have quite a lot of them. This means we have a lot of data. After our last upgrade our storage capacity is now 3000GB, and we already plan an upgrade to 6000GB. This is a lot of data that we keep in a SQL Server. Most database developers do not really work with that much data (though some work with a lot more). NetTecture (that is us, the company behind Trade-Robots) has its origins in IT consulting, which we still do, and we are specialized in large data applications, among other things. As such, we do know what it takes to manage large amounts of backtest data.
Let us start with the basics ;)
You need a backup plan
No, this is not a joke. If you want to work with large amounts of data, make sure you know how to keep a copy. We keep one on a separate, isolated backup system that gets updated every 15 minutes. It is isolated in that it even sits behind a separate UPS. We plan to add tape backups to this mix this year, but we are not there yet. Backups are important: if you have months of results stored, a loss would mean months of running the simulations again.
You need a decent server
No, this is also not a joke. Forget a small cheap VM. Forget cloud hosting: moving the data in and out of the cloud is too slow, and the price for storage at the level needed is sadly a lot higher than buying it. We run our database on our oldest server in a Hyper-V virtual machine, but this VM "owns" the server. Most of the memory goes to it, CPU priority goes to it, and pretty much all discs are pass-through. Yes, we sometimes stretch the CPU a little, but it is doable. Obviously a 10G network is preferred for the complete backups that will regularly be taken.
Fast large data analysis? Go SSD
This may sound redundant with "you need a decent server", but it is such an often-ignored point that it is necessary to elaborate on it. This is our "secret": SSDs. Lots of them. Database servers live and die with their storage performance. Right now we can retrieve 1GB/second from our storage; the 6000GB configuration will allow us to get 2GB/second. Random access, not just sequential. Analysis runs over a lot of data, and it is not possible to keep it all in memory (the price of 1TB of memory is ridiculous), so retrieval speed is critical. Yes, you can put in a number of 2TB SATA discs, but you will not be happy once you start doing portfolio-level analysis over gigabytes of data. And you will need this when you try strategy combinations to reduce your drawdown.
Our original configuration had 800GB of space, using a RAID 10 of 4 SAS discs, 450GB each, at 10k RPM. We got 60MB/second out of that "on good days" (that is, without too much interruption from uploading data).
Latencies of our storage volumes were in the 50ms range most of the time. The new all-SSD setup gives us 600MB/second for a single SSD with far lower latency. The upgrade alone made a visible and measurable performance difference for us.
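To put those figures in perspective, here is a quick back-of-the-envelope sketch in Python using the throughput numbers quoted above. Treat the results as rough upper bounds: real queries rarely scan the entire data set sequentially, and effective throughput varies with the access pattern.

```python
def full_scan_seconds(data_gb: float, throughput_mb_s: float) -> float:
    """Seconds to read data_gb gigabytes at throughput_mb_s MB/second."""
    return data_gb * 1024 / throughput_mb_s

# Old setup: 800GB volume on SAS RAID 10 at ~60MB/s (figures from the text).
old_hours = full_scan_seconds(800, 60) / 3600
# Current setup: 3000GB on SSD at ~1GB/s.
new_minutes = full_scan_seconds(3000, 1024) / 60

print(f"full scan, old SAS setup: {old_hours:.1f} hours")
print(f"full scan, SSD setup:     {new_minutes:.0f} minutes")
```

Almost four hours versus under an hour, over nearly four times as much data. That is the difference between an analysis you run interactively and one you schedule overnight.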
Right now we use six SSDs (750GB each) in two RAID 5 groups for the data, plus two smaller SSDs in a RAID 0 as temporary space.
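As a sanity check on the capacity, here is the RAID 5 arithmetic. The even split of the six data SSDs into two three-disk groups is our assumption for illustration; the text only states two groups.

```python
def raid5_usable_gb(disks_per_group: int, disk_gb: int) -> int:
    # RAID 5 sacrifices one disk's worth of capacity per group for parity.
    return (disks_per_group - 1) * disk_gb

per_group = raid5_usable_gb(3, 750)   # GB per three-disk group
total = 2 * per_group                 # two groups
print(total)  # 3000
```

The result matches the 3000GB figure quoted earlier.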
And what about wearing them out? An SSD can only handle a limited number of writes. Well, in our case the Samsung 843T 960GB, reconfigured to around 750GB of used space, is rated for 5 complete rewrites per day for 5 years. Way more than we will need.
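The endurance claim is easy to check with the numbers given. This is a rough sketch; real wear also depends on write amplification, which it ignores.

```python
used_gb = 750          # usable space after over-provisioning
rewrites_per_day = 5   # rated drive writes per day (figure from the text)
years = 5

daily_budget_gb = used_gb * rewrites_per_day          # GB of writes per day
lifetime_tb = daily_budget_gb * 365 * years / 1024    # total write budget

print(f"~{daily_budget_gb} GB/day, ~{lifetime_tb:.0f} TB over {years} years")
```

A budget of several thousand gigabytes of writes per day is far beyond what even a busy backtest database sees.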
You will need to know SQL
And this means more than the simple basics; the margin of error is small when the data gets big. If all the data one has fits into memory, it is easy to get away with really sloppy usage. Once you reach large data volumes, a review of all running SQL statements is a must. As is knowing more than the first chapter of "SQL for Dummies": indices, partitioning. Financial analysis can lead to multi-page SQL statements.
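To illustrate why indices are non-negotiable at this scale, here is a small self-contained sketch. It uses SQLite (Python's standard-library sqlite3) rather than SQL Server, and the table and column names are invented for illustration, but the principle carries over: without an index the engine scans every row; with one it seeks directly to the matches.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trades (symbol TEXT, ts INTEGER, pnl REAL)")
con.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [("EURUSD" if i % 2 else "GBPUSD", i, 0.1 * i) for i in range(10_000)],
)

def plan(sql: str) -> str:
    # EXPLAIN QUERY PLAN reports how SQLite will resolve the query.
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT SUM(pnl) FROM trades WHERE symbol = 'EURUSD'"
before = plan(query)   # a full table scan (exact wording varies by version)
con.execute("CREATE INDEX ix_trades_symbol ON trades (symbol)")
after = plan(query)    # now a seek via ix_trades_symbol
print(before)
print(after)
```

On a 10,000-row toy table the difference is invisible; on a multi-gigabyte results table it is the difference between milliseconds and minutes. Checking the execution plan before a statement goes near the big tables is exactly the kind of review discipline described above.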
This is 2014…
…and if you know what you are doing, 3000GB of data is not something to be afraid of. It is quite easy to deal with if you get professional hardware. Throw proper hardware at it (not even high-end hardware) and it will be a pleasure to work with this amount of data.