version_history
Bash - 11/18
parallelization by writing URLs to disk, then opening subshells on them
multiple keywords
spoof user agent
open in browser
prevent duplicate and already-checked pages
1.0 - 4/19
switch from Bash to Python
multiprocessing
write to error log
domain limiter
create DBs
1.1 - 9/19
separate CLI and GUI versions
proper HTML parsing with BeautifulSoup
search broader <p> tag for jbws
decompose hidden elements
separate high conf jbw lists
always include pagination links
Sel to confirm all keyword matches
2.0 - 1/20
website on local machine
save HTML to text files
Splash for dynamic webpages
reorganize errorlog by URL first
save CML and errorlog for resumption
update DBs
2.1 - 9/20
website live on remote server
empty visible HTML text detection
domain limiter
percent encode URLs
dedicated email
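The "percent encode URLs" item could look roughly like the sketch below. It is only an illustration using the standard library: the function name `percent_encode_url` and the choice of which characters to leave unescaped are assumptions, not the project's actual code.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_url(url: str) -> str:
    """Percent-encode unsafe characters in the path and query of a URL.

    Hypothetical helper: existing "%" escapes are left alone so that an
    already-encoded URL is not encoded a second time.
    """
    parts = urlsplit(url)
    # Keep "/" as the path separator and "%" so existing escapes survive.
    path = quote(parts.path, safe="/%")
    # Keep the usual query delimiters intact.
    query = quote(parts.query, safe="=&%+")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))
```

Treating "%" as safe makes the helper idempotent, at the cost of not escaping a literal percent sign that was never meant as an escape.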
2.2 - 3/21
Sel and static requesters as fallback
move portals out of scraper file
fuzzy matching keywords
new errorlog using nested lists
dedicated error parser
move zip code coords out of results.py
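One way "fuzzy matching keywords" might be done is with `difflib` from the standard library, as sketched below. The function name `fuzzy_contains`, the sliding-window approach, and the 0.85 threshold are all assumptions for illustration; the changelog does not say which method the scraper uses.

```python
from difflib import SequenceMatcher


def fuzzy_contains(text: str, keyword: str, threshold: float = 0.85) -> bool:
    """Return True if some run of words in text is close to keyword.

    Hypothetical helper: slides a window the same number of words as the
    keyword across the text and compares each window to the keyword.
    """
    words = text.lower().split()
    kw_words = keyword.lower().split()
    if not kw_words:
        return False
    target = " ".join(kw_words)
    size = len(kw_words)
    for i in range(len(words) - size + 1):
        window = " ".join(words[i:i + size])
        # ratio() is 1.0 for identical strings and falls toward 0.0.
        if SequenceMatcher(None, window, target).ratio() >= threshold:
            return True
    return False
```

A lower threshold catches more misspellings but also more false positives, such as "library" matching "librarian".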
2.3 - 6/21
multiple Splash instances to alleviate the plague
allow sharing of results between multiple URLs of an org
move logging out of start_scraper.sh to try to prevent the broken pipe plague error
dedicated processes for manager tasks
Splash soft 404s now detected as jj_error 4
high / low jbw bug fixed. would allow links with high or low jbws
domain limiter bug fixed. would only work on URLs with a query
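For the domain limiter fix, a check that compares only hostnames behaves the same whether or not a URL carries a query. The sketch below is an assumption about how such a limiter could work, not the project's code; `same_domain` and `_strip_www` are hypothetical names.

```python
from urllib.parse import urlsplit


def _strip_www(host: str) -> str:
    """Drop a leading "www." so www.example.org and example.org compare equal."""
    return host[4:] if host.startswith("www.") else host


def same_domain(base_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url is on the base domain or a subdomain of it."""
    base_host = _strip_www((urlsplit(base_url).hostname or "").lower())
    cand_host = _strip_www((urlsplit(candidate_url).hostname or "").lower())
    if not base_host or not cand_host:
        return False
    # The leading dot stops "notexample.org" from matching "example.org".
    return cand_host == base_host or cand_host.endswith("." + base_host)
```

Because `urlsplit` separates the host from the path and query, the result does not depend on whether a `?` is present.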
3.0 - 11/21
replaced Splash with Playwright
replaced multiproc with asyncio
replaced Requests with aiohttp
removed Selenium
use list of em urls in DBs (no?)
3.1 - 8/22
replaced workinglist with class obj
switched to pickle and json dump/load
async subprocesses
new NIC reset (fixed broken pipe plague)
auto blacklist
respect robots
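"Respect robots" can be sketched with the standard library's `urllib.robotparser`. This is a minimal illustration that assumes the robots.txt text has already been downloaded; `build_robots_checker` and the user agent string in the usage example are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str):
    """Return a function that reports whether user_agent may fetch a URL.

    Hypothetical helper: takes the robots.txt content as a string so the
    download step stays with whatever requester the scraper already has.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed
```

Usage, with made-up rules:

```python
allowed = build_robots_checker("User-agent: *\nDisallow: /private/\n", "example-bot")
allowed("https://example.org/jobs")        # True
allowed("https://example.org/private/x")   # False
```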
3.2 - ?
replace prant with logging module
major refactor with smaller funcs, including server-side code
polymorphic requester classes. requests, error handling, ping
one Playwright browser. removed unnecessary browser restart behavior
progress bar
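Moving from a print-style helper to the `logging` module might start from something like the sketch below. It assumes `prant` is the project's own print wrapper; `make_logger`, the logger name, and the format string are placeholders rather than the project's actual setup.

```python
import logging


def make_logger(name: str = "scraper", level: int = logging.INFO) -> logging.Logger:
    """Return a named logger with a single stream handler attached.

    Hypothetical helper: the handler check keeps repeated calls from
    attaching duplicate handlers and printing each message twice.
    """
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calls such as `logger.info("checked %s", url)` or `logger.error(...)` would then take the place of the print helper, with the level deciding what gets shown.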