Simple spider to check for bad links (404s etc.)
See this article for more information.
debug_print("Links found -->")
for link in links:
    debug_print("-- DEBUG --> {link}".format(link=link))
debug_print("--------")
import re

regex = r'<a[\s\S]*?href=["\'](\S*?)["\']>'
m = re.findall(regex, r.text, re.MULTILINE)
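As a quick sanity check, the same pattern can be exercised against a small HTML snippet (the sample markup below is made up for illustration):

```python
import re

# Same pattern as above: capture the href value of each anchor tag.
regex = r'<a[\s\S]*?href=["\'](\S*?)["\']>'

sample = '<a href="/about">About</a> <a class="ext" href=\'https://example.com\'>Ext</a>'
links = re.findall(regex, sample, re.MULTILINE)
# links is now ['/about', 'https://example.com']
```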
- Check for missing images
- Fix the stack overflow caused by too many levels of recursion when spidering NDP
- Try running it against a local copy of the site (e.g. https://localhost:3000) - this would be a lot faster
- Use an HTML parser or similar to extract links (my regex is probably not robust)
- Add checks for links that point offsite, but do not crawl them
- Add exception handling, because some sites raise an exception rather than returning an error code - for example if the domain doesn't exist.
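For the "use an HTML parser" item, a minimal sketch using the standard library's html.parser would look like this (the LinkExtractor / extract_links names are my own, not part of the existing script):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

Unlike the regex, this handles attributes in any order, unquoted values, and whitespace inside the tag, e.g. `extract_links('<a href="/about">About</a>')` returns `['/about']`.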
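The recursion overflow and the offsite-link item could both be handled by an iterative breadth-first crawl with a visited set. A sketch under the assumption that page fetching/parsing is factored into a get_links callable (here fed by an in-memory dict for illustration; in the real spider it would fetch and parse a page):

```python
from collections import deque
from urllib.parse import urlparse

def crawl(start, get_links):
    """Breadth-first crawl without recursion.

    get_links(url) -> iterable of absolute URLs found on that page.
    Offsite links are recorded but never crawled; the visited set
    prevents revisiting pages (and the old recursion blow-up).
    """
    home = urlparse(start).netloc
    visited, offsite = set(), set()
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in get_links(url):
            if urlparse(link).netloc != home:
                offsite.add(link)      # check later, don't crawl into it
            elif link not in visited:
                queue.append(link)
    return visited, offsite

# Illustration only: a fake two-page site with one offsite link.
site = {
    'https://localhost:3000/': ['https://localhost:3000/a', 'https://example.com/x'],
    'https://localhost:3000/a': ['https://localhost:3000/'],
}
visited, offsite = crawl('https://localhost:3000/', lambda u: site.get(u, []))
```

The explicit queue means crawl depth is bounded by memory rather than the interpreter's recursion limit.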
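For the exception-handling item, one shape this could take with the standard library's urllib (the existing snippet's r.text suggests requests is in use; with requests you would catch requests.exceptions.RequestException instead; check_url is a hypothetical helper name):

```python
import urllib.request
import urllib.error

def check_url(url, timeout=10):
    """Return the HTTP status code, or an error string if the
    request itself fails (e.g. the domain doesn't exist)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code                  # 404 etc. still yield a code
    except (urllib.error.URLError, OSError) as e:
        return "error: {}".format(e)   # DNS failure, refused connection, ...
```

HTTPError is caught first because it subclasses URLError; that way a 404 is reported as a status code, while a dead domain comes back as an error string instead of crashing the spider.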