Commit 4039d5b

Author: Yaroslav Kargin

* Since each URL from the website must be retrieved only once, it's easier to use a set() to store them.
* Added filtering for '#' in URLs.

Signed-off-by: Yaroslav Kargin <ykargin@outlook.com>

1 parent: e88cfbc

File tree: 1 file changed, +7 -0 lines changed


wbot.py (+7)
@@ -42,11 +42,18 @@ def retrieve(drv, domain, url):
         logging.warning(f'could not load links for {url}')
         return
 
+    scopeset = set()
+
     for l in links:
         u = l.get_attribute('href')
         if not u:
             continue
+        elif '#' in u:
+            u = u.split('#')[0]
+        scopeset.add(u)
+
 
+    for u in scopeset:
         # Get the status with requests library and then retrieve the URL again
         # recursively with Selenium driver. We need the double requests for now
         # because it's not easy to get response status from selenium.
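The change above can be sketched in isolation. The snippet below mirrors the loop added in this commit; the helper name collect_urls and the sample inputs are illustrative and do not appear in wbot.py. It shows why the set matters: an href with a '#' fragment collapses to the same entry as its fragment-free form, so each page is fetched only once.

```python
def collect_urls(hrefs):
    """Deduplicate hrefs, dropping empty values and '#' fragments."""
    scopeset = set()
    for u in hrefs:
        if not u:
            # Skip None / empty hrefs, as the committed code does.
            continue
        elif '#' in u:
            # Keep only the part before the fragment identifier.
            u = u.split('#')[0]
        scopeset.add(u)
    return scopeset

urls = collect_urls([
    'https://example.com/a',
    'https://example.com/a#top',   # same page, different fragment
    None,                          # element without an href
    'https://example.com/b',
])
# 'https://example.com/a' and its '#top' variant collapse to one entry.
```

For reference, the standard library offers urllib.parse.urldefrag, which splits off the fragment in one call; the commit's split('#')[0] achieves the same effect for this purpose.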
