We will now use a more realistic regression example and introduce tf.estimator.
It will take place in the following steps:
- Collect Data & Create Variables
  - X_data = all the values to be used for calculations
  - y_labels = the corresponding answers (targets) for those values
- Split X_data into X_train and X_test
  - X_train will be used as the values to train your model. Afterwards, the model will be evaluated with X_test, and you will get a measure of accuracy in terms of how close its predictions for X_test were to the corresponding y_labels.
- Scale the Feature Data
  - Use sklearn preprocessing to create a MinMaxScaler for the feature data. Fit this scaler only to the training data, then use it to transform both X_train and X_test. Then use the scaled X_train and X_test along with pd.DataFrame to re-create two DataFrames of scaled data. (See the sketch after this list.)
  - **Remember!!: DO NOT FIT THE SCALER ON THE X_test data.** When you test your model, you don't want information from the test set to have leaked into training.
- Create Placeholders
- Define operations in your Graph (set up the operations being performed)
- Define error or loss function
- Setup Trainer
- Initialize global variables
- If big dataset:
  - Create batches
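Here is a minimal sketch of the splitting and scaling steps above, assuming hypothetical X_data (a DataFrame of features) and y_labels (targets). It is illustrative only and is not the dataset we build below.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
# Hypothetical X_data (DataFrame of features) and y_labels (targets).
X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_labels, test_size=0.3, random_state=101)
# Fit the scaler ONLY on the training data so nothing about the
# test set leaks into training, then transform both sets.
scaler = MinMaxScaler()
scaler.fit(X_train)
# Re-create DataFrames of scaled data.
X_train = pd.DataFrame(scaler.transform(X_train),
                       columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test),
                      columns=X_test.columns, index=X_test.index)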
Let's start with our imports
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline
Sweet. Let's create a "HUGE" dataset. Huge is relative; this one will have 1M data points.
x_data = np.linspace(0.0,10.0,1000000)
Now let's create noise with the same number of points as our dataset (1M).
noise = np.random.randn(len(x_data))
Our line will be modeled as y = mx + b. We will use m = 0.5, set b = 5, and add noise.
y_true = (0.5 * x_data) + 5 + noise
The noise is added so our true data isn't just a perfectly fitted line. The 5 is the intercept b that we want.
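As an optional sanity check (purely illustrative), you can confirm the noise is roughly standard normal and compare against the noiseless line:
y_perfect = (0.5 * x_data) + 5    # the same line without noise
print(noise.mean(), noise.std())  # roughly 0 and 1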
Let's use pandas to concatenate our data.
Create our x_dataFrame
x_df = pd.DataFrame(data=x_data, columns=['X Data'])
Create our y_dataFrame
y_df = pd.DataFrame(data=y_true, columns=['Y'])
Now let's check the first five entries of x_df
x_df.head()
X Data
0 0.00000
1 0.00001
2 0.00002
3 0.00003
4 0.00004
Now let's concat them both to have one frame!
my_data = pd.concat([x_df,y_df], axis=1)
Output of my_data.head()
X Data Y
0 0.00000 5.718261
1 0.00001 5.000671
2 0.00002 5.544956
3 0.00003 5.070396
4 0.00004 4.691148
We add the axis=1 so that the data isn't stacked like a pancake!
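To see the difference, here's a minimal sketch (using x_df and y_df from above) comparing the two axes:
# axis=0 stacks the frames on top of each other (2,000,000 rows, NaNs where
# columns don't overlap); axis=1 places them side by side (1,000,000 rows).
print(pd.concat([x_df, y_df], axis=0).shape)   # (2000000, 2)
print(pd.concat([x_df, y_df], axis=1).shape)   # (1000000, 2)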
Now let's say we want to graph this. Unfortunately, graphing a million points would take a while.
Let's first look at a small random sample of 10 rows.
my_data.sample(n=10)
Output
index X Data Y
140413 1.404131 5.637388
763166 7.631668 9.500907
210459 2.104592 5.871338
238403 2.384032 6.298123
74718 0.747181 5.466532
68900 0.689001 4.138463
691081 6.910817 9.494513
797712 7.977128 7.670427
517287 5.172875 6.555583
63754 0.637541 6.250593
Create a scatter plot w/ 250 values
my_data.sample(n=250).plot(kind='scatter', x='X Data', y='Y')
Now let's use TensorFlow to train this model. We can't run 1M points at a time, so we have to create batches of data. There's no true right or wrong answer for batch sizes; it depends on your data.
batch_size = 8
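For a sense of scale, here's a quick back-of-the-envelope check of what this batch size means for 1M points (just illustrative arithmetic):
print(len(x_data) // batch_size)  # 125000 batches for one full pass over the data
print(1000 * batch_size)          # only ~8000 points are seen in the 1000 batches we run below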
Create our slope (m) and intercept (b) variables. They start with arbitrary values.
m = tf.Variable(0.3)
b = tf.Variable(0.11)
Create our placeholders. 1 for x and 1 for y.
Don't forget to set the data type and the size of the batch.
xph = tf.placeholder(tf.float32, [batch_size])
yph = tf.placeholder(tf.float32, [batch_size])
Remember, yph holds the true answers.
Define our model
y_model = m*xph + b
Great! Now let's create our error.
error = tf.reduce_sum(tf.square(yph - y_model))
Now create our Gradient Descent Optimizer, and a train operation that minimizes the error.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
train = optimizer.minimize(error)
Almost done! Just need to initialize our global variables, then run our analysis!
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    batches = 1000
    for i in range(batches):
        # Grab 8 random indices and feed the matching x and y_true values.
        rand_ind = np.random.randint(len(x_data), size=batch_size)
        feed = {xph: x_data[rand_ind], yph: y_true[rand_ind]}
        sess.run(train, feed_dict=feed)
    model_m, model_b = sess.run([m, b])
Alright, so what's happening here?
We're grabbing 8 random data points each iteration. rand_ind picks random indices into our data, and the corresponding x and y_true values become the feed dictionary mapped to xph and yph. We then run train with that feed_dict to adjust m and b based on the error. Run this; if it takes too long, lower the number of batches.
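As a sanity check, here's a minimal sketch of what one of those random batches looks like (inspecting shapes outside the session):
rand_ind = np.random.randint(len(x_data), size=batch_size)
print(rand_ind)                # 8 random indices into the 1M points
print(x_data[rand_ind].shape)  # (8,) -- matches the placeholder shape
print(y_true[rand_ind].shape)  # (8,)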
model_m
0.52406013
model_b
4.9413366
Set y_hat and graph!
y_hat = x_data*model_m + model_b
my_data.sample(250).plot(kind='scatter', x='X Data', y='Y')
plt.plot(x_data,y_hat, 'r')
Now we will solve the regression task using the Estimator API.
There are a lot of other higher-level APIs (Keras, Layers, etc.), but we will cover those later on in the Miscellaneous Section.
The tf.estimator has many different options/types
- tf.estimator.LinearClassifier
- Constructs a linear classification model
- tf.estimator.LinearRegressor
- Constructs a linear regression model
- tf.estimator.DNNClassifier
- Constructs a neural network classification model
- tf.estimator.DNNRegressor
- Constructs a neural network regression model
- tf.estimator.DNNLinearCombinedClassifier
- Constructs a combined neural network and linear classification model
- tf.estimator.DNNLinearCombinedRegressor
- Constructs a combined neural network and linear regression model
To use the Estimator API:
- Define a list of feature columns
- Create the Estimator Model
- Create a Data Input Function
- Call train(), evaluate(), and predict() on the estimator object.
feat_cols = [ tf.feature_column.numeric_column('x', shape=[1]) ]
Now we set up our estimator. This is the main part of the API. We will use a LinearRegressor and point it to the feature columns. We will see more complex examples with multiple features later.
estimator = tf.estimator.LinearRegressor(feature_columns=feat_cols)
There will be an output, but it's just default configuration stuff.
We are splitting up the data into a training set and an evaluation set. We set test_size to 0.3, i.e. 30% of the data goes to evaluation and 70% to training.
from sklearn.model_selection import train_test_split
x_train, x_eval, y_train, y_eval = train_test_split(x_data, y_true, test_size=0.3, random_state=101)
Let's see if we got what we wanted
print(x_train.shape)
print(x_eval.shape)
(700000,)
(300000,)
70% of 1M is 700,000, so it has appeared to work.
You need an input function that acts like your feed dictionary and batch-size indicator all at once. We will be inputting from a numpy array (you can also send in a pandas DataFrame). We define a dictionary mapping the 'x' key to the x_train values, and pass y_train as the labels.
input_func = tf.estimator.inputs.numpy_input_fn({'x':x_train}, y_train, batch_size=8, num_epochs=None, shuffle=True)
Let's copy and paste this to get two more input functions, train_input_func and eval_input_func. These point at the training and evaluation splits respectively, with shuffle turned off.
train_input_func = tf.estimator.inputs.numpy_input_fn({'x':x_train}, y_train, batch_size=8, num_epochs=None, shuffle=False)
eval_input_func = tf.estimator.inputs.numpy_input_fn({'x':x_eval}, y_eval, batch_size=8, num_epochs=None, shuffle=False)
Time to train. Let's give it 1000 steps and see how it does.
estimator.train(input_fn=input_func, steps=1000)
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /var/folders/3v/vpv_7q_95dj_87nc7vkrf88h0000gn/T/tmppdgsysdb/model.ckpt.
INFO:tensorflow:loss = 543.0, step = 1
INFO:tensorflow:global_step/sec: 563.558
INFO:tensorflow:loss = 26.5623, step = 101 (0.178 sec)
INFO:tensorflow:global_step/sec: 572.734
INFO:tensorflow:loss = 49.0421, step = 201 (0.174 sec)
INFO:tensorflow:global_step/sec: 476.958
INFO:tensorflow:loss = 22.2036, step = 301 (0.210 sec)
INFO:tensorflow:global_step/sec: 556.155
INFO:tensorflow:loss = 13.5479, step = 401 (0.181 sec)
INFO:tensorflow:global_step/sec: 470.728
INFO:tensorflow:loss = 3.91283, step = 501 (0.212 sec)
INFO:tensorflow:global_step/sec: 594.255
INFO:tensorflow:loss = 12.6644, step = 601 (0.170 sec)
INFO:tensorflow:global_step/sec: 445.313
INFO:tensorflow:loss = 7.96772, step = 701 (0.224 sec)
INFO:tensorflow:global_step/sec: 440.612
INFO:tensorflow:loss = 19.8273, step = 801 (0.227 sec)
INFO:tensorflow:global_step/sec: 428.152
INFO:tensorflow:loss = 9.37158, step = 901 (0.236 sec)
INFO:tensorflow:Saving checkpoints for 1000 into /var/folders/3v/vpv_7q_95dj_87nc7vkrf88h0000gn/T/tmppdgsysdb/model.ckpt.
INFO:tensorflow:Loss for final step: 12.2294.
We use train_input_func as the input_fn here because we want metrics on the training split, and we don't want the data shuffled during evaluation. Recall we turned shuffle off for it.
train_metrics = estimator.evaluate(input_fn=train_input_func, steps=1000)
INFO:tensorflow:Starting evaluation at 2018-01-29-00:00:22
INFO:tensorflow:Restoring parameters from /var/folders/3v/vpv_7q_95dj_87nc7vkrf88h0000gn/T/tmppdgsysdb/model.ckpt-1000
INFO:tensorflow:Evaluation [1/1000]
INFO:tensorflow:Evaluation [2/1000]
INFO:tensorflow:Evaluation [3/1000]
---snip---
INFO:tensorflow:Evaluation [997/1000]
INFO:tensorflow:Evaluation [998/1000]
INFO:tensorflow:Evaluation [999/1000]
INFO:tensorflow:Evaluation [1000/1000]
INFO:tensorflow:Finished evaluation at 2018-01-29-00:00:27
INFO:tensorflow:Saving dict for global step 1000: average_loss = 1.0769, global_step = 1000, loss = 8.61518
eval_metrics = estimator.evaluate(input_fn=eval_input_func, steps=1000)
INFO:tensorflow:Starting evaluation at 2018-01-29-00:01:29
INFO:tensorflow:Restoring parameters from /var/folders/3v/vpv_7q_95dj_87nc7vkrf88h0000gn/T/tmppdgsysdb/model.ckpt-1000
INFO:tensorflow:Evaluation [1/1000]
INFO:tensorflow:Evaluation [2/1000]
INFO:tensorflow:Evaluation [3/1000]
---snip---
INFO:tensorflow:Evaluation [997/1000]
INFO:tensorflow:Evaluation [998/1000]
INFO:tensorflow:Evaluation [999/1000]
INFO:tensorflow:Evaluation [1000/1000]
INFO:tensorflow:Finished evaluation at 2018-01-29-00:01:34
INFO:tensorflow:Saving dict for global step 1000: average_loss = 1.06595, global_step = 1000, loss = 8.52763
Now let's print both metrics and see if they're close. This will tell us whether we are overfitting to our data. A good indicator of overfitting is a very low loss on the training metrics but a high loss on the eval metrics. We want the two to be close to each other; the eval set will usually perform a bit worse than the training set, but if the training loss is far lower than the eval loss, you're overfitting.
Overfitting: VERY low loss train_metrics && VERY HIGH loss of eval_metrics
print('TRAINING DATA METRICS')
print(train_metrics)
TRAINING DATA METRICS
{'average_loss': 1.0768981, 'global_step': 1000, 'loss': 8.6151848}
print('EVAL METRICS')
print(eval_metrics)
EVAL METRICS
{'average_loss': 1.0659537, 'global_step': 1000, 'loss': 8.5276299}
Let's say we have some new data of X. These are 10 points that the model has never seen before.
brand_new_data = np.linspace(0,10,10)
Now let's create a new input function that predicts y, given the new data, using the TensorFlow Estimator.
input_fn_predict = tf.estimator.inputs.numpy_input_fn({'x':brand_new_data}, shuffle=False)
Another way to do this, if your new feature values live in a pandas DataFrame (X_test here stands in for such a DataFrame), is:
predict_input_func = tf.estimator.inputs.pandas_input_fn(
x=X_test,
batch_size=10,
num_epochs=1,
shuffle=False)
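pandas_input_fn expects a DataFrame, and X_test above just stands in for one. As a hedged sketch, here is how our brand_new_data could be wrapped in a DataFrame and fed through pandas_input_fn (we keep using the numpy-based input_fn_predict below):
# Hypothetical alternative: the column name must match the feature column key ('x').
new_df = pd.DataFrame({'x': brand_new_data})
input_fn_predict_pd = tf.estimator.inputs.pandas_input_fn(
    x=new_df,
    batch_size=10,
    num_epochs=1,
    shuffle=False)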
To actually view the results, cast it as type list.
list(estimator.predict(input_fn=input_fn_predict))
[{'predictions': array([ 4.43565083], dtype=float32)},
{'predictions': array([ 5.09498167], dtype=float32)},
{'predictions': array([ 5.75431252], dtype=float32)},
{'predictions': array([ 6.41364384], dtype=float32)},
{'predictions': array([ 7.07297468], dtype=float32)},
{'predictions': array([ 7.73230553], dtype=float32)},
{'predictions': array([ 8.39163589], dtype=float32)},
{'predictions': array([ 9.05096722], dtype=float32)},
{'predictions': array([ 9.71029854], dtype=float32)},
{'predictions': array([ 10.36962891], dtype=float32)}]
Use the estimator to predict the outcome for each of the new X values, saving the results in a list called predictions.
predictions = []
for pred in estimator.predict(input_fn=input_fn_predict):
    predictions.append(pred['predictions'])
predictions
[array([ 4.43565083], dtype=float32),
array([ 5.09498167], dtype=float32),
array([ 5.75431252], dtype=float32),
array([ 6.41364384], dtype=float32),
array([ 7.07297468], dtype=float32),
array([ 7.73230553], dtype=float32),
array([ 8.39163589], dtype=float32),
array([ 9.05096722], dtype=float32),
array([ 9.71029854], dtype=float32),
array([ 10.36962891], dtype=float32)]
Plot data sample of 250 items
my_data.sample(n=250).plot(kind='scatter', x='X Data', y='Y')
plt.plot(brand_new_data, predictions, 'r*')
The predictions fall on a straight line through the sampled data.
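If you want to compare the estimator's learned slope and intercept with model_m and model_b from the manual Session run, you can inspect its variables. The exact variable names depend on your TensorFlow version, so this is only a sketch; list the names first and adjust as needed.
# Variable names vary between TF versions, so print them before looking one up.
print(estimator.get_variable_names())
# Typical names for a LinearRegressor (may differ on your version):
slope = estimator.get_variable_value('linear/linear_model/x/weights')
intercept = estimator.get_variable_value('linear/linear_model/bias_weights')
print(slope, intercept)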