homework_answers.py
#!/usr/bin/env python
# coding: utf-8
# In[3]:
get_ipython().system('pip install scikit-learn==1.0.2')
# In[11]:
import pickle
import pandas as pd
import numpy as np
# In[6]:
with open('model.bin', 'rb') as f_in:
    dv, lr = pickle.load(f_in)
# In[7]:
categorical = ['PUlocationID', 'DOlocationID']
def read_data(filename):
    df = pd.read_parquet(filename)
    df['duration'] = df.dropOff_datetime - df.pickup_datetime
    df['duration'] = df.duration.dt.total_seconds() / 60
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()
    df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')
    return df
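# The steps inside read_data can be checked on a small synthetic frame.
# This is only a sketch: the three rows and the name `toy` are made up
# for illustration and do not come from the FHV parquet file.

```python
import pandas as pd

categorical = ['PUlocationID', 'DOlocationID']

# Hypothetical three-row frame standing in for the parquet file.
toy = pd.DataFrame({
    'pickup_datetime': pd.to_datetime(['2021-02-01 00:00:00'] * 3),
    'dropOff_datetime': pd.to_datetime([
        '2021-02-01 00:00:30',  # 0.5 minutes: below the 1-minute cutoff
        '2021-02-01 00:10:00',  # 10 minutes: inside the 1..60 range
        '2021-02-01 02:00:00',  # 120 minutes: above the 60-minute cutoff
    ]),
    'PUlocationID': [1.0, None, 3.0],
    'DOlocationID': [4.0, 5.0, None],
})

# Same transformations as read_data, applied to the toy frame.
toy['duration'] = (toy.dropOff_datetime - toy.pickup_datetime).dt.total_seconds() / 60
toy = toy[(toy.duration >= 1) & (toy.duration <= 60)].copy()
toy[categorical] = toy[categorical].fillna(-1).astype('int').astype('str')

print(toy[categorical + ['duration']])
```

# Only the 10-minute trip should survive the filter, and its missing
# pickup location should appear as the string '-1'.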
# In[8]:
df = read_data('../data/fhv_tripdata_2021-02.parquet')
# In[9]:
dicts = df[categorical].to_dict(orient='records')
X_val = dv.transform(dicts)
y_pred = lr.predict(X_val)
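# The transform-then-predict step can be sketched on toy data. This assumes
# that `dv` is a scikit-learn DictVectorizer and `lr` a linear regression,
# which the variable names suggest but the pickle itself does not confirm.
# The names toy_dv, toy_lr and the target values are hypothetical.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

# Two made-up training records with string-valued categories.
toy_train = [
    {'PUlocationID': '1', 'DOlocationID': '2'},
    {'PUlocationID': '3', 'DOlocationID': '2'},
]

toy_dv = DictVectorizer()
toy_X = toy_dv.fit_transform(toy_train)          # one-hot encodes each key=value pair
toy_lr = LinearRegression().fit(toy_X, [10.0, 20.0])

# A category not seen during fitting ('99') is silently ignored by transform.
toy_new = toy_dv.transform([{'PUlocationID': '99', 'DOlocationID': '2'}])
toy_pred = toy_lr.predict(toy_new)

print(toy_dv.feature_names_)
print(toy_new.toarray())
```

# The point of the sketch: transform reuses the vocabulary learned at fit
# time, so unseen location IDs contribute no feature at all.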
# ## Q1. Notebook
# Run this notebook for the February 2021 FHV data.
#
# What's the mean predicted duration for this dataset?
#
# - 11.19
# - 16.19
# - 21.19
# - 26.19
# In[13]:
print(f"The mean predicted duration for this dataset is {np.mean(y_pred)}")
# ## Q2. Preparing the output
#
# Like in the course videos, we want to prepare the dataframe with the output.
#
# First, let's create an artificial ride_id column:
# In[17]:
year = 2021
month = 2
# In[18]:
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
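# The ride_id format can be illustrated without pandas. A minimal sketch,
# assuming a few hypothetical index labels (row_labels is not from the data):

```python
year = 2021
month = 2

# Hypothetical index labels; the real ones come from df.index.
row_labels = [0, 1, 15]

# Zero-padded year and month, an underscore, then the row label.
ride_ids = [f'{year:04d}/{month:02d}_{label}' for label in row_labels]
print(ride_ids)  # ['2021/02_0', '2021/02_1', '2021/02_15']
```

# Because read_data filters rows without resetting the index, the labels
# in the real frame may have gaps, but each ride_id stays unique.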
# Next, write the ride id and the predictions to a dataframe with results.
# In[20]:
df_result = df.copy(deep=True)
# In[31]:
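# Keep only ride_id; the positional axis argument is deprecated, so name the columns explicitly.
df_result.drop(columns=df_result.columns.difference(['ride_id']), inplace=True)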
# In[36]:
df_result["predictions"] = y_pred
# In[37]:
df_result.to_parquet(
    "../data/results.parquet",
    engine='pyarrow',
    compression=None,
    index=False
)
# ## Q3. Creating the scoring script
#
# Now let's turn the notebook into a script.
#
# Which command do you need to execute for that?
# **Answer**:
# ```
# jupyter nbconvert --to script homework_answers.ipynb
# ```
# In[ ]: