Simple Regression with a TensorFlow Estimator
Saturday May 6, 2017
With TensorFlow 1.1, the Estimator API is now at tf.estimator. A number of "canned estimators" are at tf.contrib.learn. This higher-level API bakes in some best practices and makes it much easier to do a lot quickly with TensorFlow, similar to using APIs available in other languages.
Data
This example will use the very simple US Presidential Party and GDP Growth dataset: president_gdp.csv.
The regression problem will be to predict annualized percentage GDP growth from presidential party.
R
R is made for problems such as this, with an API that makes it quite easy:
> data = read.csv('president_gdp.csv')
> model = lm(growth ~ party, data)
> predict(model, data.frame(party=c('R', 'D')))
## 1 2
## 2.544444 4.332857
The dataset is very small, and we won't introduce a train/test split. Linear regression is just a way of calculating means: we expect our model to predict the mean GDP growth conditional on party. Annual GDP growth during Republican presidents has been about 2.5%, and during Democratic presidents about 4.3%.
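The claim that linear regression on a single binary regressor just recovers conditional means is easy to verify directly. Here's a quick NumPy check using made-up growth numbers (not the real dataset): the intercept of the least-squares fit equals the mean for the 0 group, and intercept plus slope equals the mean for the 1 group.

```python
import numpy as np

# Hypothetical values for illustration only (not the real dataset)
party = np.array([0, 0, 1, 1, 1], dtype=float)   # 0 = R, 1 = D
growth = np.array([2.0, 3.0, 4.0, 4.5, 4.1])

# Least-squares fit of growth ~ 1 + party
X = np.column_stack([np.ones_like(party), party])
intercept, slope = np.linalg.lstsq(X, growth, rcond=None)[0]

# Intercept is the mean where party == 0; intercept + slope is the mean where party == 1
print(intercept, intercept + slope)                              # 2.5 4.2
print(growth[party == 0].mean(), growth[party == 1].mean())      # 2.5 4.2
```

This is exactly what the R predictions above are: the two party-conditional means.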
sklearn
Moving into Python, let's first read in the data and get it ready, using NumPy and Pandas.
import numpy as np
import pandas as pd
data = pd.read_csv('president_gdp.csv')
party = data.party == 'D'
party = np.expand_dims(party, axis=1)
growth = data.growth
With R, we relied on automatic handling of categorical variables. Here we explicitly change the strings 'R' and 'D' to be usable in a model: Boolean values will become zeros and ones. We also adjust the party data shape to be one row per observation.
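To make the shape adjustment concrete, here's a toy stand-in for the CSV (column names assumed to match the real file) showing how np.expand_dims turns the one-dimensional Boolean Series into a column matrix with one row per observation:

```python
import numpy as np
import pandas as pd

# Toy stand-in for president_gdp.csv (column names assumed)
data = pd.DataFrame({'party': ['R', 'D', 'D'],
                     'growth': [2.5, 4.3, 4.4]})

party = data.party == 'D'               # Boolean Series, shape (3,)
party = np.expand_dims(party, axis=1)   # column matrix, shape (3, 1)

print(party.shape)                 # (3, 1)
print(party.astype(int).ravel())   # [0 1 1]
```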
Tracking TensorFlow Python APIs, the Estimator API comes from TF Learn, which is inspired by scikit-learn. Here's the regression with scikit:
import sklearn.linear_model
model = sklearn.linear_model.LinearRegression()
model.fit(X=party, y=growth)
model.predict([[0], [1]])
## array([ 2.54444444, 4.33285714])
TensorFlow
This will abuse the API a little to maximize comparability to the examples above; you'll see warnings when you run the code, which will be addressed in the next section.
import tensorflow as tf
party_col = tf.contrib.layers.real_valued_column(column_name='')
model = tf.contrib.learn.LinearRegressor(feature_columns=[party_col])
Unlike with scikit, we need to specify the structure of our regressors when we instantiate the model object. This is done with FeatureColumns. There are several options; real_valued_column is probably the simplest, but others are useful for general categorical data, etc.
We're providing the data as a simple matrix, so it's important that we use the empty string '' for column_name. If we provide a non-empty column_name, we'll have to provide data in dictionaries with column names as keys.
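For comparison, here's a sketch of what the named-column version might look like against the TF 1.1 contrib API (untested here; the column name 'party' is our choice, and party and growth are the arrays prepared earlier):

```python
import tensorflow as tf

# Sketch: with a real column name, features are passed as a
# dict keyed by that name rather than as a bare matrix.
party_col = tf.contrib.layers.real_valued_column(column_name='party')
model = tf.contrib.learn.LinearRegressor(feature_columns=[party_col])
model.fit(x={'party': party}, y=growth, steps=1000)
```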
model.fit(x=party, y=growth, steps=1000)
list(model.predict(np.array([[0], [1]])))
## [2.5422058, 4.3341689]
Without additional configuration, TensorFlow needs to be told how many steps of gradient descent to run, or it will keep going indefinitely. A thousand iterations gets very close to the results achieved with R and with scikit.
There are a lot of things that LinearRegressor takes care of. In this code, we did not have to explicitly:
- Create any TensorFlow variables.
- Create any TensorFlow ops.
- Choose an optimizer or learning rate.
- Create a TensorFlow session.
- Run ops in a session.
This API also does a lot more than the R or scikit examples above, and allows for even more extensions.
TensorFlow Extensions
The Estimator API does a lot by default, and allows for a lot more optionally.
First, there is a model_dir. Above, TensorFlow automatically used a temporary directory. It's nicer to explicitly choose a model_dir.
model = tf.contrib.learn.LinearRegressor(feature_columns=[party_col],
model_dir='tflinreg')
The model_dir is used for two main purposes:
- Saving TensorBoard summaries (log info)
- Saving model checkpoints
Automatic TensorBoard
Like an input_producer, an Estimator automatically writes information for TensorBoard. To check it out, point TensorBoard at the model_dir and browse to localhost:6006.
$ tensorboard --logdir tflinreg
For the example above, we get the model graph and two scalar summaries.
Here's what was constructed in the TensorFlow graph for our LinearRegressor:
In the scalar summaries, we get a measure of how fast the training process was running, in global steps per second:
The variation in speed shown here is not particularly meaningful.
And we get the training loss:
We didn't really need to train for a full thousand steps.
By default, summaries are generated every 100 steps, but this can be set via save_summary_steps in a RunConfig, along with several other settings.
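As a sketch against the TF 1.1 contrib API, a RunConfig with a custom save_summary_steps would be passed to the estimator's config argument (party_col as defined earlier; the value 10 is just an example):

```python
import tensorflow as tf

# Write summaries every 10 steps instead of the default 100
config = tf.contrib.learn.RunConfig(save_summary_steps=10)

party_col = tf.contrib.layers.real_valued_column(column_name='')
model = tf.contrib.learn.LinearRegressor(feature_columns=[party_col],
                                         model_dir='tflinreg',
                                         config=config)
```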
Further customization, with support for additional metrics, validation on separate data, and even automatic early stopping, is available with ValidationMonitor.
Automatic Model Save/Restore
After training for 1,000 steps above, TensorFlow saved the model to the model_dir. If we point to the same model_dir again in a new Python session, the model will be automatically restored from that checkpoint.
import numpy as np
import tensorflow as tf
party_col = tf.contrib.layers.real_valued_column(column_name='')
model = tf.contrib.learn.LinearRegressor(feature_columns=[party_col],
model_dir='tflinreg')
list(model.predict(np.array([[0], [1]])))
## [2.5422058, 4.3341689]
For more control over how often and when checkpoints are saved, see RunConfig.
Using input functions
Above, training data was provided via x and y arguments, which is like how scikit-learn works, but not really what TensorFlow Estimators should use. The appropriate mechanism is to make an input function that returns the equivalents of x and y when called. The function is passed as the input_fn argument to model.fit(), for example.
This approach is flexible and makes it easy to avoid, for example, keeping track of separate data structures for data and labels.
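A minimal sketch of this pattern against the TF 1.1 contrib API (untested here, with toy values standing in for the party and growth arrays) might look like:

```python
import tensorflow as tf

def input_fn():
    # Build and return the feature and label tensors together,
    # so there are no separate data structures to keep in sync
    x = tf.constant([[0.], [0.], [1.], [1.]])   # party indicator (toy values)
    y = tf.constant([2.0, 3.0, 4.0, 4.5])       # growth (toy values)
    return x, y

party_col = tf.contrib.layers.real_valued_column(column_name='')
model = tf.contrib.learn.LinearRegressor(feature_columns=[party_col])
model.fit(input_fn=input_fn, steps=1000)
```

Input functions also open the door to TensorFlow's queue- and reader-based input pipelines, which can't be expressed as fixed x and y arrays.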
Distributed Training
Among the tf.contrib.learn goodies is tf.contrib.learn.Experiment, which works with an Estimator to help do distributed training. It looks like this one is still settling down, with a lot of deprecated bits at the moment. I'm interested to see more about this. For now, you could check out a Google Cloud ML example that works with learn_runner.
I'm working on Building TensorFlow systems from components, a workshop at OSCON 2017.