How to efficiently store and read tick data

Dealing with tick data is a problem everyone doing strategy development has. The more professional an outfit is, the more tick data it has. This is not only about years of history – it is also about the number of instruments and whether one stores the full order book or not. For a good simulator, one can never have enough data (as long as the data is good), so a full order book is an advantage. And futures are a comparatively small problem – US stocks produce far more data: the OPRA feed from Nanex, which is highly compressed, is about 25GB per trading day, and that is with the quite terrific Nanex compression.

At NetTecture – the company behind Trade-Robots – we are lucky. Our Trade-Robots data archive for the CME alone is now around 1.4TB. We use Nanex as data provider, which delivers a file of about 2GB for every trading day, containing every symbol traded. Those files are neither easy nor efficient to work with – they contain so much data that playing them back takes a lot of processing power, while a single-instrument playback needs a lot less data. For that, we use our own binary storage architecture. Because we currently do not deal with options, we ignore most instruments – still, our current data volume is significant. So we extract the data into a more efficient format: a command-line export routine pulls the symbols we are interested in out of Nanex and into our own file-based data storage – an approach that also allows us to write importers for other sources, which we currently do to get the DukasCopy forex data archive into our Reflexo installation. A small archive, compared to the 100TB Nanex archive one company we deal with has to store – because they handle US stocks and the very big OPRA feed.

When we started with the Reflexo framework, the original prototype-level storage was a daily text file per instrument. Since then we have improved significantly, both in storage efficiency and in processing efficiency. The original file format was typical of what you get when you download tick data in text form and contained a textual timestamp. Such files are easy to read, and they compress well, so they are not really inefficient in size. They are likely what every developer who starts his own infrastructure considers first. But they are slow. A test with a profiler easily showed us that we were spending 95% of the processing time parsing that timestamp string – the calendar calculations needed to get it into the internal representation are very heavy. This is typical for any text format: every parse costs time, and should be avoided.

We moved away from that very fast. The first step was a fixed-size binary data format. Within days that was replaced with delta-driven storage. Because we found an error in our code there, and because we wanted additional capabilities down the road, we have now, after more than 3 years, replaced that format with a new one that is 20% more efficient than the old delta format. It is also completely written in C# – the old decoder was written in C#, but the encoder was embedded in our C++/CLR Nanex extraction code, which made it hard to use for other purposes. In the new format, we use a packet-organized data structure with variable-length fields to efficiently code delta information for an instrument. There are some facts we know about price information that we can exploit to keep data sizes small. For example, the timestamp only moves forward, normally in very small increments, and the increments are static in most data feeds (Rithmic being one of the notable exceptions). The price does not jump wildly under normal circumstances.

The same holds for prices. They are not really floats – the term “tick” gives the minimum change in a price. It is easy to divide the price by the tick size and get an integer number that represents the price, and this is what we do. We encode our prices in ticks, and a tick size that is set once (or when the tick representation changes, which rarely happens anyway, depending on the data feed) says what a tick is – 0.25, 0.125, 1. ES, for example, has a marker saying a tick is 0.25 points.
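The conversion is trivial but worth spelling out (a minimal Python sketch – our production code is C#, and these function names are illustrative, not from our codebase):

```python
def price_to_ticks(price: float, tick_size: float) -> int:
    """Convert a price to an integer number of ticks.
    Rounding guards against float representation noise."""
    return round(price / tick_size)

def ticks_to_price(ticks: int, tick_size: float) -> float:
    """Recover the price from its tick representation."""
    return ticks * tick_size

# ES trades in 0.25-point ticks: 1234.75 is exactly 4939 ticks.
assert price_to_ticks(1234.75, 0.25) == 4939
assert ticks_to_price(4939, 0.25) == 1234.75
```

The integer representation is what makes the delta coding described below cheap: differences between integers are themselves small integers.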

Similarly with time. We record time in time ticks – the Windows-native way of dealing with time, where one tick is 100 nanoseconds. Obviously we do not have data that fine-grained, which is why we apply a “slice factor” to every tick: if the time moves forward one time slice, that is X time ticks – and this is coded in the stream with a “TimeConfig” marker (which contains the number of time ticks per slice).
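As a sketch of the arithmetic (Python for brevity; the 25ms slice matches the Nanex granularity mentioned below, and the constant names are illustrative):

```python
TICKS_PER_SECOND = 10_000_000   # Windows/.NET time ticks are 100 ns each

# Hypothetical TimeConfig value for a feed with 25 ms granularity:
# 25 ms = 250,000 ticks of 100 ns.
TICKS_PER_SLICE = 250_000

def time_ticks_to_slices(time_ticks: int) -> int:
    """How many whole slices a tick count covers."""
    return time_ticks // TICKS_PER_SLICE

def slices_to_time_ticks(slices: int) -> int:
    """Expand a slice count back into 100 ns time ticks."""
    return slices * TICKS_PER_SLICE

# One second is 40 slices at 25 ms per slice.
assert time_ticks_to_slices(TICKS_PER_SECOND) == 40
```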

Then we go with deltas. A delta is a difference – for example the difference in slices from the last trade, or the difference in price ticks. A trade/bid/ask can be a minimum of 1 byte, indicating “Trade” (which is one of the message types coded in 4 bits of the header) plus header information that says “I have no options” (which means: same time, same price, same size as last time – which happens. Nanex delivers time data with 25ms granularity, so on active symbols quite a lot of updates go into one time slice).

If the header indicates that there are options, then the next byte codes four possible option fields, each in 2 bits indicating the length of the corresponding data. Those options are:

  • Instrument number. This is not used at the moment but will later allow us to store multiple instruments in one file (obviously an instrument will be announced with a packet saying what it is).
  • Time delta, which is 1-4 bytes indicating how many ticks to add to the slice. Most of the time this is just 1 byte.
  • Price delta, which indicates how many ticks the price differs from the last price.
  • Volume delta, which indicates how much the volume differs from the last volume.

So, if trades happen at the same time with the same price in volumes 2, 4, 2, we would code that as volume delta +2 (from 0), +2 (for the 4, which is 2 higher than the last volume) and -2 (for the next volume being 2 lower than the 4).
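The round trip of that example can be sketched in a few lines (Python here for brevity; the real encoder is the C# code shown further down):

```python
def encode_deltas(values, start=0):
    """Turn a sequence of absolute values into differences from the previous one."""
    deltas, last = [], start
    for v in values:
        deltas.append(v - last)
        last = v
    return deltas

def decode_deltas(deltas, start=0):
    """Reverse the encoding: running sum recovers the absolute values."""
    values, last = [], start
    for d in deltas:
        last += d
        values.append(last)
    return values

# The volumes 2, 4, 2 from the example above:
assert encode_deltas([2, 4, 2]) == [2, 2, -2]
assert decode_deltas([2, 2, -2]) == [2, 4, 2]
```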

The same goes for price. Prices most of the time do not change a lot – 1 byte is enough to handle the usual updates across the whole order book while the price (in ticks) jumps up and down around the last trade.

To be even more efficient, one additional bit in the first byte (next to the bit that says whether there is an option byte) can indicate a wipe. This is an update with volume 0 – typical for clearing a quote from the order book. Wipes do not code the volume delta and do not change the volume base for the next update. A small optimization.
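Putting the header byte together might look like this (a sketch only – the concrete flag values here are hypothetical, inferred from the flag names in the C# code below, not our actual on-disk constants):

```python
# Hypothetical layout: low 4 bits = message type, high bits = flags.
TYPE_TRADE    = 0x01
UPDATE_OPTION = 0x10  # an option byte follows
UPDATE_WIPE   = 0x20  # volume-0 update, no volume delta coded
UPDATE_START  = 0x40  # first update of a stream segment

def make_header(tick_type: int, has_option: bool, is_wipe: bool, is_start: bool) -> int:
    """Combine message type and flag bits into a single header byte."""
    h = tick_type
    if has_option:
        h |= UPDATE_OPTION
    if is_wipe:
        h |= UPDATE_WIPE
    if is_start:
        h |= UPDATE_START
    return h

# A wipe at the same time and price needs exactly one byte, no option byte:
assert make_header(TYPE_TRADE, False, True, False) == 0x21
```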

Here, for example, is the source code responsible for writing a price update.

    internal void Write(BinaryWriter writer)
    {

        var isStart = flags.HasFlag(FileTickFlags.UpdateStart);

        var valueTime = timeDelta;
        var codeTime = StreamTools.GetValueSizeCode(valueTime);

        var valueTick = tickDelta;
        var codeTick = StreamTools.GetValueSizeCode(valueTick);

        var valueQty = qtyDelta;
        var codeQty = StreamTools.GetValueSizeCode(valueQty);
        if (IsWipe) {
            codeQty = 0;
        }

        var option = codeTime;
        option = option << 2 | codeTick;
        option = option << 2 | codeQty;

        StreamTickType valuetickType = StreamTickType.Unknown;
        switch (tickType) {
            case FileTickType.Trade:
                valuetickType = StreamTickType.Trade;
                break;
            case FileTickType.Bid:
                valuetickType = StreamTickType.Bid;
                break;
            case FileTickType.Ask:
                valuetickType = StreamTickType.Ask;
                break;
        }
        if (IsWipe) {
            valuetickType = valuetickType | StreamTickType.UpdateWipe;
        }
        if (option != 0) {
            valuetickType = valuetickType | StreamTickType.UpdateOption;
        }
        if (isStart) {
            valuetickType = valuetickType | StreamTickType.UpdateStart;
        }
        StreamTools.WriteValue(writer, 1, (byte) valuetickType);
        if (option != 0) {
            StreamTools.WriteValue(writer, 1, option);
        }
        StreamTools.WriteValueCoded(writer, codeTime, valueTime);
        StreamTools.WriteValueCoded(writer, codeTick, valueTick);
        StreamTools.WriteValueCoded(writer, codeQty, valueQty); // 0 if wipe...
    }

We use specialized methods to determine the size code needed for a value (GetValueSizeCode, which returns a number 0-3 for 0, 1, 2 and 4 bytes) and to write the data. Not all variables are visible because this is an instance method on a TickUpdate structure; any missing variable is defined in the structure.
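The size-code mechanics can be sketched like this (Python, unsigned values only for brevity – the function and constant names are illustrative stand-ins, not our C# helpers):

```python
def get_value_size_code(value: int) -> int:
    """Size code 0-3 for 0, 1, 2 and 4 bytes (a real implementation
    would also map negative deltas, e.g. via zig-zag encoding)."""
    if value == 0:
        return 0
    if value < 0x100:
        return 1
    if value < 0x10000:
        return 2
    return 3

CODE_TO_BYTES = {0: 0, 1: 1, 2: 2, 3: 4}

def write_value_coded(buf: bytearray, code: int, value: int) -> None:
    """Append the value using exactly the number of bytes the code allows."""
    buf += value.to_bytes(CODE_TO_BYTES[code], "little")

# A delta of 0 costs no bytes at all; 300 needs two bytes.
assert get_value_size_code(0) == 0
assert get_value_size_code(300) == 2
buf = bytearray()
write_value_coded(buf, get_value_size_code(300), 300)
assert buf == b"\x2c\x01"

# Packing three size codes into the option byte, mirroring the Write method:
code_time, code_tick, code_qty = 1, 1, 0
option = (code_time << 4) | (code_tick << 2) | code_qty
assert option == 0b010100
```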

Reading the data is done similarly, though the method is static this time:

    internal static TickUpdate Read(BinaryReader reader, StreamTickType tickType)
    {
        var inputFlags = tickType & StreamTickType.MaskTickOptions;
        var isStart = inputFlags.HasFlag(StreamTickType.UpdateStart);
        var isWipe = inputFlags.HasFlag(StreamTickType.UpdateWipe);
        var isOption = inputFlags.HasFlag(StreamTickType.UpdateOption);

        FileTickType retTickType = FileTickType.Unknown;
        switch (tickType & StreamTickType.MaskTickType) {
            case StreamTickType.Trade:
                retTickType = FileTickType.Trade;
                break;
            case StreamTickType.Bid:
                retTickType = FileTickType.Bid;
                break;
            case StreamTickType.Ask:
                retTickType = FileTickType.Ask;
                break;
        }

        int valueTime = 0;
        int codeTime = 0;

        int valueTick = 0;
        int codeTick = 0;

        int valueQty = 0;
        int codeQty = 0;

        if (isOption)
        {
            int options = (int) StreamTools.ReadValueCoded(reader, 1);

            codeQty = options & 0x03;
            codeTick = (options >> 2) & 0x03;
            codeTime = (options >> 4) & 0x03;

            valueTime = (int) StreamTools.ReadValueCoded(reader, codeTime);
            valueTick = (int)StreamTools.ReadValueCoded(reader, codeTick);
            valueQty = (int)StreamTools.ReadValueCoded(reader, codeQty);
        }
        FileTickFlags flags = FileTickFlags.None;
        if (isStart)
        {
            flags = flags | FileTickFlags.UpdateStart;
        }
        var retval = new TickUpdate() {
            tickType = retTickType,
            qtyDelta = valueQty, tickDelta = valueTick, timeDelta = valueTime,
            flags = flags,
            IsWipe = isWipe,
        };
        return retval;
    }

Obviously this is a quite complicated coding mechanism – not really complicated mathematically, but very prone to programming errors. There are quite a number of packet types that have to work together without any error. To avoid errors, a special version of the file-writing mechanism automatically plays back recorded data and checks for differences: the (slow) debug writing routine gets a tick, writes all the information to the file, then reads the tick back from the file and compares it to the written data – and indicates an error on any discrepancy. Running this slow debug build on a week’s data after any code change makes quite sure there is no error in recovering data (which, incidentally, is what kept us busy last week).
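The shape of that verification harness is simple (a Python sketch with a toy fixed-size codec standing in for the real encoder/decoder – the names are illustrative):

```python
import io
import struct

def check_round_trip(updates, write, read):
    """Write each update, immediately read it back from the same position,
    and raise on any discrepancy. `write`/`read` are the codec under test."""
    buf = io.BytesIO()
    for update in updates:
        start = buf.tell()
        write(buf, update)
        buf.seek(start)
        recovered = read(buf)  # leaves the position at the end of the record
        if recovered != update:
            raise ValueError(f"round-trip mismatch: {update!r} != {recovered!r}")

# Toy codec for a (time_delta, price_delta) pair, two little-endian int32s:
def toy_write(buf, u):
    buf.write(struct.pack("<ii", *u))

def toy_read(buf):
    return struct.unpack("<ii", buf.read(8))

check_round_trip([(1, 0), (0, -2), (3, 1)], toy_write, toy_read)  # no exception
```

Checking immediately after each write, rather than once at the end of the file, means the first broken packet is reported with its exact position.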

But we are not ending here. We have plans to improve the system:

  • Altogether this is a very efficient way to deal with data. Our problem right now is that we store a lot of files – often with little data, for symbols that are not the front month. So we are considering getting more coarse-grained: storing possibly one week in a file (Sunday to Saturday) and storing all months of a future in the same file. This is where the multi-instrument coding will come in handy.
  • We are also considering applying compression to the stream. This would cost a little performance when decoding, but could save significantly on file size. A small test with a Windows compressed folder indicates a 50% compression ratio… not bad – and the Windows UI does not do high-quality compression.
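A quick illustration of why delta streams compress so well – many identical small values – using stock zlib on a synthetic repetitive byte stream (the data here is made up for the demonstration, not real tick output):

```python
import zlib

# A repetitive byte stream, standing in for a delta-coded tick stream
# in which the same small deltas recur over and over.
stream = bytes([1, 0, 2, 1, 0, 2] * 1000)

compressed = zlib.compress(stream, level=6)
assert len(compressed) < len(stream) // 2   # well beyond 50% on this sample
assert zlib.decompress(compressed) == stream  # lossless round trip
```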
  • We also consider adding an index to the time. This would replace the time delta in a tick (there is a separate packet that can update the time) and would index all symbol updates in a data stream. The goal? Multi-instrument playback. This trick would allow us to combine data from multiple files while preserving the original order of events as we saw them, even if updates have the same timestamp.
  • Our packet-oriented structure can also store rarer information, like executions. While that is not usable for a backtest, it makes the format suitable for the communication between the server connecting to the broker and the strategies (which are hosted in separate AppDomains – something like a lightweight .NET-internal process) in order to be fully isolated. We could possibly even expand the format to be usable over the network.

The advantage of binary coding? We save a lot of processing and storage time. When our simulation infrastructure is doing a backtest (not an optimization), we deliver up to 6 gigabits of data at the moment. We are network limited – and more compression would definitely mean we could run backtests faster. On the other hand, an optimization has a much higher processing overhead (it uses up a lot more CPU for the calculations, because we calculate up to 128 parameter variants at the same time), so we do not deliver that much data during those – but then we want to be efficient in CPU.

We will see what the next, third version of this format brings – and we are considering making the coding and decoding source code public domain, if anyone is interested.

And that concludes the first real in-depth technical post. I know a lot of people struggle with storing their tick data. A blog post is limited in length, but I hope I could provide enough information on how we solve this problem to make reading this worthwhile.