Repeating groups and their implementation in the FIX parser


Last week the FIX parser gained the ability to parse repeating groups. This was the single most requested feature, and some asked why it hadn’t been implemented sooner. Here’s why.

Parsing FIX messages without repeating groups can be fairly simple. If very low latency is not a requirement, the parsing logic can even be as simple as:

fix_dict = dict(tag_val.split('=', 1) for tag_val in raw_fix_string.split('\x01') if tag_val)
parsed = fix_dict

The line above is Python code that splits the text on the SOH character (0x01), then splits each of those chunks on ‘=’ to get tag/value pairs. All of this is turned into a dictionary.

The actual implementation behind the FIX parser is a bit more complicated, to handle the kind of text users actually paste (full of noise and errors). However, the basic idea is simple.

Adding repeating groups complicates this:

  1. a dictionary can only contain unique keys, while a repeating group repeats its tags. The existing functionality assumes the availability of a simple dictionary data structure.
  2. repeating groups require knowledge of which tags mark the start of a repeating group (the NoXXX count tags). This means adding a huge amount of bookkeeping!
  3. repeating groups can be arbitrarily nested, which means simple iteration or regular expressions can’t be used for parsing. This dramatically increases complexity.
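To make point 1 concrete, here is a minimal sketch using a made-up market-data-style fragment (the tag numbers and values are illustrative, not taken from the real parser) showing how a dictionary silently drops repeated tags:

```python
# Hypothetical fragment: the repeating-group tags 269/270 appear twice.
raw = "268=2\x01269=0\x01270=1.25\x01269=1\x01270=1.26\x01"
pairs = [tv.split('=', 1) for tv in raw.split('\x01') if tv]
fix_dict = dict(pairs)

print(len([t for t, v in pairs if t == '269']))  # 2 occurrences in the raw message
print(fix_dict['269'])                           # but the dict keeps only the last: '1'
```

The first occurrence of tag 269 is overwritten without any error, which is exactly why a dictionary alone cannot represent a repeating group.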

Last week I finally pushed out an update which added repeating-groups functionality to the FIX parser. It worked beautifully for most cases. The new logic avoided needing knowledge of NoXXX tags by analyzing duplicated tags. Once a duplicate was found, the parser would rewind and guess the starting and ending locations of a repeating-group block. This meant that the parser remained free of dependence on specific FIX specs.

However, I started receiving bug reports that some tags were getting eaten by the parser. It turns out that there are production FIX implementations which use the same tag multiple times in the same message, outside of the repeating group framework. My parser also stumbled when the number of elements in a repeating group varied. I’m sure nested repeating groups would have been even more complex and bug prone.

The solution to all these problems was, as is often the case, to simplify and generalize my algorithm. Recall how I was parsing a FIX message into a dictionary of tag/values. The nature of that data-structure meant that my system couldn’t handle duplicate tags. I simply changed the main data-structure from a dictionary to an array of tag/value pairs (with the dictionary now used only to calculate certain features):

list_of_pairs = [tag_val.split('=', 1) for tag_val in raw_fix_string.split('\x01') if tag_val]
fix_dict = dict(list_of_pairs)
parsed = {'list': list_of_pairs, 'dictionary': fix_dict}

Iterating over all the keys in the dictionary meant a tag could only show up once. Iterating over a list means that a tag can appear as many times as someone’s system desires.
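As a sketch of the difference, here is a made-up two-entry repeating group (illustrative tag numbers and values, not real parser output) parsed into a list of pairs:

```python
# Hypothetical two-entry repeating group: tags 269/270 repeat.
raw = "268=2\x01269=0\x01270=1.25\x01269=1\x01270=1.26\x01"
pairs = [tv.split('=', 1) for tv in raw.split('\x01') if tv]

# The list preserves every occurrence, in message order.
entries_269 = [v for t, v in pairs if t == '269']
print(entries_269)  # ['0', '1'] -- both group entries survive
```

Because the list keeps message order, consumers can still recover group boundaries themselves if they need to, without the parser knowing anything about NoXXX tags.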

By converting to lists/arrays as my main data-structure, repeating groups didn’t require any additional logic. I actually ended up removing all the code I had added a week earlier! All of us could have been parsing groups, and who knows what else, for years if I had thought about this problem more thoroughly (or at least made this mistake much earlier).
