Use only what you need from TensorFlow

Sunday March 12, 2017

Using TensorFlow isn't a single yes-or-no decision; you also have to decide which pieces of TensorFlow you're going to use.

I've thought about whether TensorFlow suffers from the second-system effect, and my conclusion is that while TensorFlow has a huge abundance of features, it can't really be said to "suffer" from this. The engineering is solid and it supports very interesting applications.

For the individual user of TensorFlow, however, over-engineering is a relative term, and it's relative to the problem at hand.

One relevant dimension to consider is where you'll be running your code, which is related to the size of your data.

mac mini

If your data fits in memory and you're running on one local machine, a lot of simple approaches will work for you.

data center

If you're using massive quantities of data on multitudes of machines in a data center, you have a lot of concerns that you wouldn't have on one machine.

TensorFlow can run well in large distributed settings. But if you're not going to run that way, you may not need to use all that functionality.

TensorFlow data pipeline

For example, the bulk of the TensorFlow documentation on reading data focuses on a data pipeline that involves multiple processes and multiple coordinating queues. For reading large datasets from network storage (in HDFS, say) this might make a lot of sense.

the Titanic

On the other hand, you may be trying out the Titanic machine learning challenge on Kaggle, where the entire training set is a single CSV file with 892 lines.

the Titanic sinking

If you decide you need to use the full TensorFlow data pipeline instead of something like pandas.read_csv, you're doing a lot of unnecessary work before you've even started working on the substance of your problem.
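To make the simple route concrete, here is a minimal sketch of reading a small CSV with pandas.read_csv. The inline rows and the choice of feature columns are illustrative assumptions, not the real Kaggle file; with the actual download you would pass the file's path (something like "train.csv") instead of the in-memory buffer.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle training file: a header plus a few
# illustrative rows. With the real data, pass the CSV path to pd.read_csv.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22\n"
    "2,1,1,female,38\n"
    "3,1,3,female,26\n"
)

# One call loads the whole dataset into memory as a DataFrame.
train = pd.read_csv(sample_csv)

# Since everything fits in memory, features and labels are plain arrays.
# The two feature columns here are an arbitrary choice for illustration.
features = train[["Pclass", "Age"]].to_numpy(dtype="float32")
labels = train["Survived"].to_numpy()

print(features.shape, labels.shape)
```

Arrays like these can be handed straight to whatever model code you're using, with no queues or reader threads involved.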

There's something to be said for trying out functionality you don't really need, just for the sake of learning it for possible future application. And if you're working on a toy problem like Titanic, maybe that is exactly what you want to do.

But for most work, if you aren't gonna need it, don't use it, even if it's available in TensorFlow.