Parsing TFRecords inside the TensorFlow Graph
Wednesday April 26, 2017
You can parse TFRecords using the standard protocol buffer .FromString method, but you can also parse them inside the TensorFlow graph.
The examples here assume you have in memory the serialized Example my_example_str and SequenceExample my_seq_ex_str from TFRecords for Humans. You could create them, or read them from my_example.tfrecords and my_seq_ex.tfrecords. That loading could be via tf.python_io.tf_record_iterator, or via tf.TFRecordReader following the pattern shown in Reading from Disk inside the TensorFlow Graph.
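If those strings aren't already in memory, here's one way to build them directly with the tf.train message classes, matching the values used throughout this post:

```python
import tensorflow as tf

# An Example with the values used throughout this post:
# my_ints=[5, 6], my_float=2.7, my_bytes='data'
my_example = tf.train.Example(features=tf.train.Features(feature={
    'my_ints': tf.train.Feature(
        int64_list=tf.train.Int64List(value=[5, 6])),
    'my_float': tf.train.Feature(
        float_list=tf.train.FloatList(value=[2.7])),
    'my_bytes': tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b'data']))}))
my_example_str = my_example.SerializeToString()

# A SequenceExample with the same context bytes and a two-step
# my_ints sequence whose steps have different lengths
my_seq_ex = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        'my_bytes': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[b'data']))}),
    feature_lists=tf.train.FeatureLists(feature_list={
        'my_ints': tf.train.FeatureList(feature=[
            tf.train.Feature(int64_list=tf.train.Int64List(value=[5, 6])),
            tf.train.Feature(int64_list=tf.train.Int64List(value=[7, 8, 9]))])}))
my_seq_ex_str = my_seq_ex.SerializeToString()
```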
The tf.parse_single_example decoder works like tf.decode_csv: it takes a string of raw data and turns it into structured data, based on the options it's created with. The structured data it produces is not a protocol buffer message object, but a dictionary that is hopefully easier to work with.
import tensorflow as tf
serialized = tf.placeholder(tf.string)
my_example_features = {'my_ints': tf.FixedLenFeature(shape=[2], dtype=tf.int64),
                       'my_float': tf.FixedLenFeature(shape=[], dtype=tf.float32),
                       'my_bytes': tf.FixedLenFeature(shape=[1], dtype=tf.string)}
my_example = tf.parse_single_example(serialized, features=my_example_features)
session = tf.Session()
session.run(my_example, feed_dict={serialized: my_example_str})
## {'my_ints': array([5, 6]),
##  'my_float': 2.7,
##  'my_bytes': array(['data'], dtype=object)}
The shape parameter is part of the schema we're defining. A shape of [] means a single element, so the result won't be wrapped in an array, as for my_float. A shape of [1] means an array containing one element, as for my_bytes. Within a Feature, values are always stored as a list, so the shape argument is what decides whether a single value comes back bare or in a length-one array. A shape of [2] means a list of two elements, naturally enough, and there's no alternative.
The dtype=object is how NumPy works with strings.
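As a small aside, here's how those object arrays behave; the bytes values can be decoded to Python strings as needed:

```python
import numpy as np

# TensorFlow string tensors come back as NumPy object arrays
# holding raw bytes values
arr = np.array([b'data'], dtype=object)
print(arr[0])           # b'data' in Python 3
print(arr[0].decode())  # 'data'
```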
When some feature might have differing numbers of values across records, they can all be read with tf.VarLenFeature. This distinction is made only when parsing. Records are made with however many values you put in; you don't specify FixedLen or VarLen when you're making an Example. So the my_ints feature just parsed as FixedLen can also be parsed as VarLen.
my_example_features = {'my_ints': tf.VarLenFeature(dtype=tf.int64),
                       'my_float': tf.FixedLenFeature(shape=[], dtype=tf.float32),
                       'my_bytes': tf.FixedLenFeature(shape=[1], dtype=tf.string)}
my_example = tf.parse_single_example(serialized, features=my_example_features)
session.run(my_example, feed_dict={serialized: my_example_str})
## {'my_ints': SparseTensorValue(indices=array([[0], [1]]),
##                               values=array([5, 6]),
##                               dense_shape=array([2])),
##  'my_float': 2.7,
##  'my_bytes': array(['data'], dtype=object)}
When parsing as a VarLenFeature, the result is a sparse representation. This can seem a little silly, because features here will always be dense from left to right. Early versions of TensorFlow didn't have the current behavior. But this sparseness is a mechanism by which TensorFlow can support non-rectangular data, for example when forming batches from multiple variable-length features, or as seen next with a SequenceExample:
my_context_features = {'my_bytes': tf.FixedLenFeature(shape=[1], dtype=tf.string)}
my_sequence_features = {'my_ints': tf.VarLenFeature(dtype=tf.int64)}
my_seq_ex = tf.parse_single_sequence_example(
    serialized,
    context_features=my_context_features,
    sequence_features=my_sequence_features)
result = session.run(my_seq_ex, feed_dict={serialized: my_seq_ex_str})
result
## ({'my_bytes': array(['data'], dtype=object)},
##  {'my_ints': SparseTensorValue(
##      indices=array([[0, 0], [0, 1], [1, 0], [1, 1], [1, 2]]),
##      values=array([5, 6, 7, 8, 9]),
##      dense_shape=array([2, 3]))})
The result is a tuple of two dicts: the context data and the sequence data.
Since the my_ints sequence feature is parsed as a VarLenFeature, it's returned as a sparse tensor. This example has to be parsed as a VarLenFeature, because the two entries in my_ints are of different lengths ([5, 6] and [7, 8, 9]).
The way the my_ints values get combined into one sparse tensor is the same as the way it would be done when making a batch from multiple records each containing a VarLenFeature.
To make it clearer what's going on, we can look at the sparse tensor in dense form:
session.run(tf.sparse_tensor_to_dense(result[1]['my_ints']))
## array([[5, 6, 0],
##        [7, 8, 9]])
The other option for parsing sequence features is tf.FixedLenSequenceFeature, which will work if every entry of the sequence feature is the same length. The result is then a dense tensor.
To parse multiple Example records in one op, there's tf.parse_example. This returns a dict with the same keys you'd get from parsing a single Example, with the values combining the values from all the parsed examples, in a batch-like fashion. There isn't a corresponding op for SequenceExample records.
More could be said about sparse tensors and TFRecords. The tf.sparse_merge op is one way to combine sparse tensors, similar to the combination that happened for my_ints in the SequenceExample above. And there's tf.SparseFeature for parsing out general sparse features directly from TFRecords (better documentation in source).
I'm working on Building TensorFlow systems from components, a workshop at OSCON 2017.