# Sparse Tensors and TFRecords

Thursday April 27, 2017

When a matrix, array, or tensor has lots of values that are zero, it can be called sparse. You might want to represent the zeros implicitly with a sparse representation. TensorFlow has support for this, and the support extends to its TFRecords `Example` format.

Here is a sparse one-dimensional tensor:

``[0, 7, 0, 0, 0, 8, 0, 0, 0]``

The tensor is sparse, in that it has a lot of zeros, but the representation is dense, in that all those zeros are represented explicitly.

A sparse representation of the same tensor will focus only on the non-zero values.

``values = [7, 8]``

We also have to remember where those values occur, by their indices:

``indices = [1, 5]``

For this one-dimensional example, the flat form of `indices` will work with some methods. In general, though, each index needs one entry per dimension, so it is more consistent (and works everywhere) to represent `indices` like this:

``indices = [[1], [5]]``

With `values` and `indices`, we don't have quite enough information yet. How many zeros are there to the right of the last value? We have to represent the dense shape of the tensor.

``dense_shape = [9]``

These three things together, `values`, `indices`, and `dense_shape`, are a sparse representation of the tensor.
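To see how the three pieces fit together, here is a minimal plain-Python sketch (no TensorFlow involved) that rebuilds a dense list from `values`, `indices`, and `dense_shape`. The helper name `to_dense_1d` is made up for illustration and only handles the one-dimensional case.

```python
def to_dense_1d(indices, values, dense_shape):
    """Rebuild a dense 1-D list from a sparse representation.

    `to_dense_1d` is a hypothetical helper for illustration,
    not a TensorFlow function.
    """
    # Start with all zeros; dense_shape says how many there are.
    dense = [0] * dense_shape[0]
    # Each index is a one-element list like [1], matching the general form.
    for (position,), value in zip(indices, values):
        dense[position] = value
    return dense


dense = to_dense_1d(indices=[[1], [5]], values=[7, 8], dense_shape=[9])
print(dense)
## [0, 7, 0, 0, 0, 8, 0, 0, 0]
```

Without `dense_shape`, the loop would have no way of knowing how many trailing zeros to produce.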

TensorFlow accepts lists of values and NumPy arrays to define dense tensors, and it returns NumPy arrays when dense tensors are evaluated. But what about sparse tensors? SciPy has several sparse matrix representations, but none is a good match for TensorFlow's general sparse tensor form. So for sparse tensors, instead of reusing an existing Python class, TensorFlow provides `tf.SparseTensorValue`. These values exist outside the TensorFlow graph, so they can be made without a `tf.Session`, for example.

```python
import tensorflow as tf

tf.SparseTensorValue(values=values, indices=indices, dense_shape=dense_shape)
## SparseTensorValue(indices=[[1], [5]], values=[7, 8], dense_shape=[9])
```

Using `tf.SparseTensor` puts that in the TensorFlow graph.

```python
tf.SparseTensor(values=values, indices=indices, dense_shape=dense_shape)
## <tensorflow.python.framework.sparse_tensor.SparseTensor at 0x11a4e0c10>
```

That `tf.SparseTensor` will be constant, since we specified all the pieces of it, and if you run it in a session, you'll get back the equivalent `tf.SparseTensorValue`.

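TensorFlow has operations specifically for working with sparse tensors, such as `tf.sparse_tensor_dense_matmul`. (Despite its name, `tf.sparse_matmul` takes ordinary dense tensors that happen to contain many zeros, not `tf.SparseTensor` objects.) And you can change a sparse tensor to a dense one with `tf.sparse_tensor_to_dense`. These operations live in the graph, so they have to be run to see a result.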

```python
# A session for running graph operations (TensorFlow 1.x API).
session = tf.Session()

sparse = tf.SparseTensor(values=values, indices=indices, dense_shape=dense_shape)
dense = tf.sparse_tensor_to_dense(sparse)
session.run(dense)
## array([0, 7, 0, 0, 0, 8, 0, 0, 0], dtype=int32)
```

Going from dense to sparse seems a little less straightforward at the moment, so let's continue assuming we already have the components of our sparse representation.
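Outside the graph, at least, the conversion is easy to sketch in plain Python. The following is an illustration only: `dense_to_sparse` is a made-up helper name, it assumes a rectangular nested list, and it is not a TensorFlow API.

```python
def dense_to_sparse(dense):
    """Return (indices, values, dense_shape) for a rectangular nested list.

    Hypothetical helper for illustration; not part of TensorFlow.
    """
    # Work out the shape by walking down the first element at each level.
    dense_shape = []
    level = dense
    while isinstance(level, list):
        dense_shape.append(len(level))
        level = level[0] if level else None

    indices, values = [], []

    def walk(node, prefix):
        # Recurse through nested lists, tracking the index path so far.
        if isinstance(node, list):
            for i, child in enumerate(node):
                walk(child, prefix + [i])
        elif node != 0:
            # Only non-zero leaves are recorded.
            indices.append(prefix)
            values.append(node)

    walk(dense, [])
    return indices, values, dense_shape


print(dense_to_sparse([0, 7, 0, 0, 0, 8, 0, 0, 0]))
## ([[1], [5]], [7, 8], [9])
```

The indices come out in row-major order, which matches the ordering used in the examples in this post.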

Going to more dimensions is quite natural. Here's a two-dimensional tensor with three non-zero values:

```
[[0, 0, 0, 0, 0, 7],
 [0, 5, 0, 0, 0, 0],
 [0, 0, 0, 0, 9, 0],
 [0, 0, 0, 0, 0, 0]]
```

This can be represented in sparse form as:

```python
indices = [[0, 5],
           [1, 1],
           [2, 4]]

values = [7, 5, 9]

dense_shape = [4, 6]

tf.SparseTensorValue(values=values, indices=indices, dense_shape=dense_shape)
## SparseTensorValue(indices=[[0, 5], [1, 1], [2, 4]], values=[7, 5, 9], dense_shape=[4, 6])
```

Representing this in a TFRecords `Example` requires a little bit of transformation. TFRecords only support lists of integers, floats, and byte strings. The values are easily represented in one `Feature`, but to represent the `indices`, each dimension will need its own `Feature` in the `Example`. The `dense_shape` isn't represented at all; that's left to be specified at parsing.

```python
my_example = tf.train.Example(features=tf.train.Features(feature={
    'index_0': tf.train.Feature(int64_list=tf.train.Int64List(value=[0, 1, 2])),
    'index_1': tf.train.Feature(int64_list=tf.train.Int64List(value=[5, 1, 4])),
    'values': tf.train.Feature(int64_list=tf.train.Int64List(value=[7, 5, 9]))
}))
my_example_str = my_example.SerializeToString()
```
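The per-dimension index lists are just the columns of `indices`. Here is a small plain-Python sketch of that transposition. The helper name `index_columns` is hypothetical, and the `index_0`, `index_1` key names simply follow the naming used in the `Example` above.

```python
def index_columns(indices):
    """Split sparse indices into one list per dimension.

    Hypothetical helper for illustration; the 'index_N' key names
    just follow the naming used in this post.
    """
    # zip(*indices) transposes rows like [0, 5] into per-dimension columns.
    columns = zip(*indices)
    return {'index_%d' % dim: list(column) for dim, column in enumerate(columns)}


columns = index_columns([[0, 5], [1, 1], [2, 4]])
print(columns)
## {'index_0': [0, 1, 2], 'index_1': [5, 1, 4]}
```

Each resulting list can then be wrapped in its own `tf.train.Int64List`.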

This TFRecord sparse representation can then be parsed inside the graph using a `tf.SparseFeature`, which is also where the dense shape is supplied, as `size`.

```python
my_example_features = {'sparse': tf.SparseFeature(index_key=['index_0', 'index_1'],
                                                  value_key='values',
                                                  dtype=tf.int64,
                                                  size=[4, 6])}
serialized = tf.placeholder(tf.string)
parsed = tf.parse_single_example(serialized, features=my_example_features)
session.run(parsed, feed_dict={serialized: my_example_str})
## {'sparse': SparseTensorValue(indices=array([[0, 5], [1, 1], [2, 4]]),
##                              values=array([7, 5, 9]),
##                              dense_shape=array([4, 6]))}
```

Support for multi-dimensional sparse features seems to be new in TensorFlow 1.1, and TensorFlow gives this warning when you use `SparseFeature`:

```
WARNING:tensorflow:SparseFeature is a complicated feature config
and should only be used after careful consideration
of VarLenFeature.
```

`VarLenFeature` doesn't support real sparsity or multi-dimensionality though; it only supports "ragged edges", as when one example has three elements and the next has seven.
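To make "ragged edges" concrete: as I understand it, when a batch of variable-length examples is parsed, a `VarLenFeature` comes back as a sparse tensor whose indices are (example, position) pairs and whose dense shape is the batch size by the longest length. The plain-Python sketch below mimics that packing; `ragged_to_sparse` is a made-up helper name, not a TensorFlow function.

```python
def ragged_to_sparse(rows):
    """Pack variable-length rows into (indices, values, dense_shape).

    Hypothetical helper for illustration; not a TensorFlow function.
    """
    indices, values = [], []
    for row_number, row in enumerate(rows):
        for position, value in enumerate(row):
            # Every element is kept, zero or not; only the padding is implicit.
            indices.append([row_number, position])
            values.append(value)
    longest = max((len(row) for row in rows), default=0)
    return indices, values, [len(rows), longest]


indices, values, dense_shape = ragged_to_sparse([[1, 2, 3], [4, 5, 6, 7, 8, 9, 10]])
print(dense_shape)
## [2, 7]
```

Nothing inside a row can be skipped here, which is the sense in which this is raggedness rather than real sparsity.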

It is a little awkward to put together a sparse representation for TFRecords, but it does give you a lot of flexibility. To put a fine point on it, I don't know what you can do with a `SequenceExample` that you can't do with a regular `Example` using all of `FixedLenFeature`, `VarLenFeature`, and `SparseFeature`.

I'm working on Building TensorFlow systems from components, a workshop at OSCON 2017.