Change to https #873
Conversation
The tests fail due to the obsolete url mentioned in #870. I can't find a replacement url. Let me know if there is more I should do. Thanks.
I would just add the invalid URL to the skip list. Alternatively, you could update the URL to the new site I mentioned in issue #870, but then I would at least want to verify that it is actually possible to get the data on the new site. So there are two ways forward: one is easy and one is hard. I will leave it up to you which one to choose.
I would prefer to have a valid url from the new site, but I tried to find it without luck. As I don't have the original context, I would not be able to tell whether I had found the right url. I don't even know how Intake works or what the purpose of the metadata is, and for now I don't need to learn. Can someone else help find the correct url?
I think this is probably the best link: https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2019
Alternatively, there is also: https://crime-data-explorer.fr.cloud.gov/pages/explorer/crime/crime-trend
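To check whether those candidates actually respond, here is a minimal sketch using requests (the timeout and redirect settings are just illustrative choices):

import requests

# Candidate replacement urls from the comments above.
CANDIDATES = [
    "https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2019",
    "https://crime-data-explorer.fr.cloud.gov/pages/explorer/crime/crime-trend",
]

for url in CANDIDATES:
    # A HEAD request is enough to see the status code without downloading the page.
    response = requests.head(url, timeout=5, allow_redirects=True)
    print(url, response.status_code)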
I would change test_links.py to the following. Even though the fixtures are nice, I think we should check all URLs in one go. That way duplicate URLs across different files are only checked once, and we avoid the overhead of starting up the ThreadPoolExecutor repeatedly. If a URL fails, it should be easy to find the file by searching for the URL. I also changed it to a simpler regex and to using requests.head to check the URLs.

"""Urls in docstrings etc. should be valid and secure, i.e.
- exist, i.e. provide a 200 response
- use https:// instead of http:// unless
    - https:// is not supported by the web site
    - https:// cannot be used, for example in SVGs.
"""
import pathlib
import re
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import requests
URL_REGEX = re.compile(r"(https?:\/\/?[\w/\-?=%.]+\.[\w/\-&?=%.]+)")
ROOT = pathlib.Path(__file__).parents[2]
POST_FIXES = ["*.py", "*.ipynb", "*.md", "*.yaml", "*.yml"]
SKIP_URLS = [
"https://www.ucrdatatool.gov/Search/Crime/State/StatebyState.cfm", # Not available anymore
"http://weegee.vision.ucmerced.edu/datasets/landuse.html", # Not available anymore
"https://github.com/holoviz/hvplot", # Alot of issues/PRs
"http://octopart.com", # Located in Notebook output, not relevant
"https://jsperf.com", # Located in Notebook output, not relevant
"https://web.archive.org", # Can be slow and does not work with the regex.
"https://anaconda.org/anaconda/hvplot", # Give a status code of 400?
"https://anaconda.org/conda-forge/hvplot", # Give a status code of 400?
"https://anaconda.org/pyviz/hvplot", # Give a status code of 400?
"https://bugs.chromium.org/p/chromium/issues", # Give status code 405 (needs login)
"https://mybinder.org", # Give status code 405
]
SKIP_FILES = [
".ipynb_checkpoints", # Can give problems when running test local
]
def _get_files_to_check():
    for post_fix in POST_FIXES:
        for file in ROOT.rglob(post_fix):
            if any(skip in str(file) for skip in SKIP_FILES):
                continue
            yield file


def _find_urls(text):
    urls = set()
    for url in set(re.findall(URL_REGEX, text)):
        if url.endswith("."):
            # The regex keeps a trailing dot, e.g. when the url ends a sentence
            url = url[:-1]
        if any(url.startswith(skip) for skip in SKIP_URLS):
            continue
        urls.add(url)
    return urls


def _verify_urls(urls):
    # Check all urls concurrently with HEAD requests
    func = partial(requests.head, timeout=5)
    with ThreadPoolExecutor() as executor:
        responses = executor.map(func, urls)
        for url, response in zip(urls, responses):
            if not response.ok:
                raise ValueError(
                    f"The url {url} responded with status {response.status_code}."
                )


def test_urls():
    # Collect the unique urls across all files, then verify them in one go
    urls = set()
    for file in _get_files_to_check():
        text = file.read_text()
        urls |= _find_urls(text)
    urls = sorted(urls)
    _verify_urls(urls)
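If useful, the check can also be run on its own; a minimal sketch, assuming the file ends up at hvplot/tests/test_links.py (the exact path is a guess):

import pytest

# Run only the url check, verbosely; pytest.main returns an exit code.
exit_code = pytest.main(["-v", "-k", "test_urls", "hvplot/tests/test_links.py"])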
I replaced the broken link with an internet archive link. I'd be happy if one day we could replace this crime dataset with another one; I find it a little creepy. Thanks @MarcSkovMadsen!!!
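For context, Wayback Machine links simply wrap the original address with a snapshot timestamp; a hypothetical example of the shape (the timestamp below is a placeholder, not the snapshot used in this PR):

# Hypothetical Internet Archive (Wayback Machine) link; the timestamp
# 20190101000000 is a placeholder, not the snapshot actually used here.
ARCHIVED_URL = (
    "https://web.archive.org/web/20190101000000/"
    "https://www.ucrdatatool.gov/Search/Crime/State/StatebyState.cfm"
)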
Fixes #862
Notes
Changed the test_links.py file to test all files and urls. Some domains refuse to respond to requests made with the requests library, so I exempted those.