remove_http (text):
This function removes every 'https://t.co/' link, i.e. that prefix plus the
10 characters that follow it, from a string.
parameter:
text (string)
return:
text (string)
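
A minimal sketch of how remove_http could work, not the code from wordCloud_FN.py. It assumes the reading given in the processFunction step list: a link is 'https://t.co/' followed by exactly 10 non-whitespace characters. The regular expression is an assumption of this sketch.

```python
import re

# Assumption: a link is 'https://t.co/' plus exactly 10 non-whitespace characters.
TCO_LINK = re.compile(r"https://t\.co/\S{10}")

def remove_http(text):
    """Return text with every 'https://t.co/' + 10-character link removed."""
    return TCO_LINK.sub("", text)

example = remove_http("new post https://t.co/AbCdEf1234 out now")
print(repr(example))
```

The whitespace around a removed link is left in place, so a later tokenization step is expected to absorb the extra space.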
*********************************************************************************************************
df_remove_http (dataframe, inputName, newColName):
This function applies remove_http to each row of a dataframe column.
parameter:
dataframe(dataframe)
inputName (string)
newColName (string)
return:
dataframe (dataframe)
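
The dataframe helpers in this file share the (dataframe, inputName, newColName) signature. A minimal sketch of that pattern follows, assuming the dataframe is a pandas DataFrame and that the helper writes its result to a new column. Copying the input and the regular expression are choices of this sketch, not confirmed behaviour of wordCloud_FN.py.

```python
import re

import pandas as pd

def remove_http(text):
    # Assumption: a link is 'https://t.co/' plus exactly 10 non-whitespace characters.
    return re.sub(r"https://t\.co/\S{10}", "", text)

def df_remove_http(dataframe, inputName, newColName):
    """Apply remove_http to column inputName and store the result in newColName."""
    result = dataframe.copy()  # leave the caller's dataframe unchanged
    result[newColName] = result[inputName].apply(remove_http)
    return result

tweets = pd.DataFrame({"text": ["hello https://t.co/AbCdEf1234", "no link here"]})
cleaned = df_remove_http(tweets, "text", "text_clean")
print(cleaned["text_clean"].tolist())
```

The other dataframe helpers (tokenization, stopword removal, lemmatization) could follow the same shape by swapping the function passed to apply.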
*********************************************************************************************************
remove_punctuation ( ) -> remove all punctuation listed in string.punctuation
parameters:
text (string)
return:
list of string without any punctuation
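
A minimal sketch of punctuation removal with string.punctuation. It assumes the result is returned as a single string; the original function may return a list instead, as its return description suggests.

```python
import string

# Translation table that deletes every character in string.punctuation.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation(text):
    """Return text with all string.punctuation characters deleted."""
    return text.translate(PUNCTUATION_TABLE)

print(remove_punctuation("Hello, world! #nlp"))
```

Note that string.punctuation covers ASCII punctuation only, so characters such as curly quotes pass through unchanged.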
*********************************************************************************************************
remove_punctuation ( , , ) -> remove all punctuation in a column of a dataframe
parameters:
dataframe (dataframe)
inputName (string)
newColName (string)
return:
single column dataframe
*********************************************************************************************************
lowerCase ( , , ) -> change all capital letters to lower case
parameters:
dataframe (dataframe)
inputName (string)
newColName (string)
return:
single column dataframe
*********************************************************************************************************
tokenization ( ) -> split a string into a list of tokens
parameters:
text (string)
return:
list of string
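
A minimal sketch of tokenization, assuming tokens are separated by whitespace. The original may split on a different rule (for example a regular expression), so treat the splitting rule as an assumption.

```python
def tokenization(text):
    """Split a string into a list of tokens on runs of whitespace."""
    return text.split()

print(tokenization("word clouds  are fun"))
```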
*********************************************************************************************************
df_tokenization ( , , ) -> tokenize the string in each row into a list of tokens
parameters:
dataframe (dataframe)
inputName (string)
newColName (string)
return:
single column dataframe
*********************************************************************************************************
remove_stopwords ( ) -> take a list of strings and remove all English stopwords
parameters:
text (list)
return:
output (list)
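
A minimal sketch of stopword filtering on a token list. ENGLISH_STOPWORDS here is a tiny hypothetical stand-in; the original presumably uses a full English stopword list from a library, which this file does not name.

```python
# Hypothetical stand-in for a full English stopword list.
ENGLISH_STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and"}

def remove_stopwords(text):
    """Return the tokens of text that are not English stopwords."""
    return [word for word in text if word not in ENGLISH_STOPWORDS]

print(remove_stopwords(["the", "cat", "is", "on", "a", "mat"]))
```

The comparison is case-sensitive, which is why lower-casing comes before this step in the processFunction pipeline.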
*********************************************************************************************************
df_remove_stopwords ( , , ) -> remove stopwords from each row of a dataframe column
parameters:
dataframe (dataframe)
inputName (string)
newColName (string)
return:
single column dataframe
*********************************************************************************************************
remove_another_stopwords ( ) -> take a token list of words
and remove the second set of stopwords, "ANOTHER_STOPWORDS"
parameters:
text (list) -> tokenized list
return:
output (list) -> tokenized list
*********************************************************************************************************
lemmatizer ( ) -> lemmatize the words in the list
parameters:
text (list)
return:
output (list)
*********************************************************************************************************
df_lemmatizer ( , , ) -> lemmatize the words in each row of a dataframe column
parameters:
dataframe (dataframe)
inputName (string)
newColName (string)
return:
single column dataframe
*********************************************************************************************************
remove_emoji ( ) -> remove all emoji from the string
parameters:
string (string)
return:
output (string)
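
A minimal sketch of emoji removal with a regular expression over Unicode ranges. The ranges below are an assumption of this sketch and cover common emoji only; the original may use a different or longer list.

```python
import re

# Assumption: these ranges are "emoji" for the purpose of this sketch.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator letters used for flags
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector that follows some emoji
    "]+"
)

def remove_emoji(string):
    """Return string with all characters in the emoji ranges removed."""
    return EMOJI_PATTERN.sub("", string)

print(repr(remove_emoji("good morning \U0001F600\u2600")))
```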
*********************************************************************************************************
processFunction (rows, nrows):
This function executes the steps that prepare
the data for the wordcloud.
parameter:
rows (int) -> number of rows to skip in the csv file (rows 1 to "rows" are NOT loaded)
nrows (int) -> number of rows to load into a dataframe
return:
text (string) -> input for a wordcloud function.
The data preparation is applied in the following steps:
Step 1: remove all 'https://t.co/' + 10 following characters
Step 2: remove all emoji
Step 3: lower case the text
Step 4: remove first set of tailored stopwords
Step 5: remove the punctuation
Step 6: tokenization
Step 7: remove English stopwords
Step 8: lemmatization
Step 9: remove second set of tailored stopwords
Step 10: join all the words together, separated by " ", within a row
Step 11: join all rows into a string
(These functions are written in wordCloud_FN.py;
please read the docstring "printDoc.py" for details.)
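
The preparation steps listed above can be sketched as one function over a list of row strings. This is an illustration under stated assumptions, not the code in wordCloud_FN.py: csv loading is left out, the three stopword sets are tiny hypothetical stand-ins, the emoji ranges are a guess, and lemmatization is a pass-through placeholder because it needs an external library. The names prepare_text and identity_lemmatizer are hypothetical.

```python
import re
import string

# Hypothetical stand-ins for the stopword lists used by the real pipeline.
TAILORED_STOPWORDS = {"rt"}
ENGLISH_STOPWORDS = {"the", "a", "is", "are", "and", "of", "to"}
ANOTHER_STOPWORDS = {"amp"}

LINK = re.compile(r"https://t\.co/\S{10}")          # assumed link shape
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # assumed emoji ranges
PUNCTUATION = str.maketrans("", "", string.punctuation)

def identity_lemmatizer(tokens):
    """Placeholder: a real lemmatizer would map each token to its lemma."""
    return tokens

def prepare_text(rows_of_text, lemmatize=identity_lemmatizer):
    """Run the preparation steps on each row and join the rows into one string."""
    prepared = []
    for text in rows_of_text:
        text = LINK.sub("", text)                    # remove t.co links
        text = EMOJI.sub("", text)                   # remove emoji
        text = text.lower()                          # lower case
        text = " ".join(w for w in text.split()      # first tailored stopword set
                        if w not in TAILORED_STOPWORDS)
        text = text.translate(PUNCTUATION)           # remove punctuation
        tokens = text.split()                        # tokenization
        tokens = [t for t in tokens if t not in ENGLISH_STOPWORDS]
        tokens = lemmatize(tokens)                   # lemmatization (placeholder)
        tokens = [t for t in tokens if t not in ANOTHER_STOPWORDS]
        prepared.append(" ".join(tokens))            # join words within a row
    return " ".join(prepared)                        # join all rows

sample = ["RT The cats are here https://t.co/AbCdEf1234 \U0001F600", "Cats &amp; dogs!"]
result = prepare_text(sample)
print(result)
```

Passing the lemmatizer as an argument keeps the sketch runnable with the standard library alone while leaving room to plug in a real one.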
*********************************************************************************************************