assignment2.py
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
from sklearn.cluster import KMeans
matplotlib.style.use('ggplot') # Look Pretty
def showandtell(title=None):
    # Optionally save the current figure, then display it and stop the script
    if title is not None:
        plt.savefig(title + ".png", bbox_inches='tight', dpi=300)
    plt.show()
    exit()
#
# INFO: This dataset has call records for 10 users tracked over the course of 3 years.
# Your job is to find out where the users likely live and work!
# A raw string keeps the backslashes in the Windows path from being read as escapes
df = pd.read_csv(r'D:\learning\DAT210x-master\Module5\Datasets\cdr.csv')
#
# TODO: Load up the dataset and take a peek at its head
# Convert the date using pd.to_datetime, and the time using pd.to_timedelta
#
# .. your code here ..
df.CallDate=pd.to_datetime(df.CallDate, errors='coerce')
df.Duration=pd.to_timedelta(df.Duration, errors='coerce')
#
# TODO: Get a distinct list of "In" phone numbers (users) and store the values in a
# regular python list.
# Hint: https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.tolist.html
#
# .. your code here ..
ab=df.In.unique()
cd=ab.tolist()
#
# TODO: Create a slice called user1 that filters to only include dataset records where the
# "In" feature (user phone number) is equal to the first number on your unique list above
#
# .. your code here ..
dfinal = pd.DataFrame()  # collects one estimated home location (centroid) per user
user1 = df[df.In == cd[0]]  # records for the first user on the unique list
# INFO: Plot all the call locations
# user1.plot.scatter(x='TowerLon', y='TowerLat', c='gray', alpha=0.1, title='Call Locations')
#showandtell() # Comment this line out when you're ready to proceed
#
# INFO: The locations map above should be too "busy" to really wrap your head around. This
# is where domain expertise comes into play. Your intuition tells you that people are likely
# to behave differently on weekends:
#
# On Weekends:
# 1. People probably don't go into work
# 2. They probably sleep in late on Saturday
# 3. They probably run a bunch of random errands, since they couldn't during the week
# 4. They should be home, at least during the very late hours, e.g. 1-4 AM
#
# On Weekdays:
# 1. People probably are at work during normal working hours
# 2. They probably are at home in the early morning and during the late night
# 3. They probably spend time commuting between work and home every day
#
# TODO: Add more filters to the user1 slice you created. Add bitwise logic so that you're
# only examining records that came in on weekends (sat/sun).
#
# .. your code here ..
user1w = user1[(user1.DOW == 'Sat') | (user1.DOW == 'Sun')]
print(user1w.head())
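# A minimal sketch of the weekend filter above, offered as an alternative spelling
# rather than the assignment's intended solution. `weekend_records` is a hypothetical
# helper; it assumes the DOW column holds three-letter day names such as 'Sat' and
# 'Sun', as the comparison above does.
import pandas as pd


def weekend_records(records):
    # Series.isin builds the same boolean mask as chaining == comparisons with |
    return records[records['DOW'].isin(['Sat', 'Sun'])]

# Usage: weekend_records(user1) selects the same rows as user1w above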
#
# TODO: Further filter it down to calls that came in either before 6AM OR after 10pm (22:00:00).
# You can use < and > to compare the string times, just make sure you code them as military time
# strings, eg: "06:00:00", "22:00:00": https://en.wikipedia.org/wiki/24-hour_clock
#
# You might also want to review the Data Manipulation section for this. Once you have your filtered
# slice, print out its length:
#
# .. your code here ..
userwt1 = user1w[(user1w.CallTime < "06:00:00") | (user1w.CallTime > "22:00:00")]
print(len(userwt1))
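# Why comparing the time strings works: zero-padded 24-hour "HH:MM:SS" strings sort
# in the same order as the times they represent, so < and > behave chronologically.
# A small sketch; `is_night_call` is a hypothetical helper, and it assumes CallTime
# values are zero-padded 24-hour strings.
def is_night_call(call_time):
    # True for calls before 06:00:00 or after 22:00:00
    return call_time < "06:00:00" or call_time > "22:00:00"

# Usage: is_night_call("05:59:59") is True, while is_night_call("12:00:00") is False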
#
# INFO: Visualize the dataframe with a scatter plot as a sanity check. Since you're familiar
# with maps, you know well that your X-Coordinate should be Longitude, and your Y coordinate
# should be the tower Latitude. Check the dataset headers for proper column feature names.
# https://en.wikipedia.org/wiki/Geographic_coordinate_system#Geographic_latitude_and_longitude
#
# At this point, you don't yet know exactly where the user is located based only on the cell
# phone tower position data; but considering the records below are for calls that arrived in the
# twilight hours of weekends, wherever they are bunched up is probably near the caller's
# residence:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(userwt1.TowerLon, userwt1.TowerLat, c='g', marker='o', alpha=0.2)
ax.set_title('Weekend Calls (<6am or >10p)')
#showandtell() # TODO: Comment this line out when you're ready to proceed
GH=userwt1[['TowerLon','TowerLat']]
#
# TODO: Run K-Means with K=1. There really should only be a single area of concentration. If you
# notice multiple areas that are "hot" (multiple areas the user spends a lot of time at that are FAR
# apart from one another), then increase to K=2, with the goal being that one of the centroids will
# sweep up the annoying outliers, and the other will zero in on the user's approximate home location.
# Or rather, the location of the cell tower closest to their home.
#
# Be sure to only feed in Lat and Lon coordinates to the KMeans algo, since none of the other
# data is suitable for your purposes. Since both Lat and Lon are (approximately) on the same scale,
# no feature scaling is required. Print out the centroid locations and add them onto your scatter
# plot. Use a distinguishable marker and color.
#
# Hint: Make sure you graph the CORRECT coordinates. This is part of your domain expertise.
#
# .. your code here ..
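# With K=1 the K-Means centroid is simply the column-wise mean of the points, which
# gives a quick sanity check on the fitted result. A sketch under assumptions:
# `single_centroid` is a hypothetical helper, and `coords` is any two-column set of
# (longitude, latitude) values.
import numpy as np
from sklearn.cluster import KMeans


def single_centroid(coords):
    model = KMeans(n_clusters=1, n_init=10).fit(coords)
    centroid = model.cluster_centers_[0]
    # The one-cluster centroid should match the mean of each column
    assert np.allclose(centroid, np.asarray(coords).mean(axis=0))
    return centroid

# Usage: single_centroid(GH) once GH holds the TowerLon and TowerLat columns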
kmeans = KMeans(n_clusters=1, n_init=10)
kmeans.fit(GH)
#labels = kmeans.predict(df)
centroids = kmeans.cluster_centers_
print(centroids)
# GH holds (TowerLon, TowerLat) in that order, so the centroid columns follow the same order
T = pd.DataFrame(centroids, columns=['TowerLon', 'TowerLat'])
dfinal = pd.concat([dfinal, T], ignore_index=True)
# Longitude belongs on the x-axis and latitude on the y-axis
T.plot.scatter(x='TowerLon', y='TowerLat', ax=ax, marker='o', c='r', alpha=0.5, linewidths=3, s=169)
showandtell() # TODO: Comment this line out when you're ready to proceed
#
# TODO: Repeat the above steps for all 10 individuals, being sure to record their approximate home
# locations. You might want to use a for-loop, unless you enjoy typing.
#
# .. your code here ..
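# One way to approach the TODO above, as a sketch rather than a reference solution.
# `estimate_home` and `estimate_all_homes` are hypothetical helpers; they assume the
# In, DOW, CallTime, TowerLon and TowerLat columns used earlier in this file. Note
# that showandtell() calls exit(), so the call above must be commented out before
# anything below it can run.
import pandas as pd
from sklearn.cluster import KMeans


def estimate_home(records):
    # Keep weekend calls before 6 AM or after 10 PM, as in the single-user steps above
    weekend = records[records['DOW'].isin(['Sat', 'Sun'])]
    night = weekend[(weekend['CallTime'] < "06:00:00") | (weekend['CallTime'] > "22:00:00")]
    # KMeans raises a ValueError if a user has no matching records
    model = KMeans(n_clusters=1, n_init=10).fit(night[['TowerLon', 'TowerLat']])
    lon, lat = model.cluster_centers_[0]
    return lon, lat


def estimate_all_homes(records, users):
    # Build one row per user holding that user's estimated home coordinates
    rows = []
    for user in users:
        lon, lat = estimate_home(records[records['In'] == user])
        rows.append({'In': user, 'TowerLon': lon, 'TowerLat': lat})
    return pd.DataFrame(rows)

# Usage: print(estimate_all_homes(df, cd))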