catalog-bsp site URL is indexed by search engines #1491
Comments
@FuhuXia do you have ideas on the best way to fix this? Since we don't actually want to serve anything at catalog-bsp.data.gov and essentially only use it as a CNAME for the BSP origin, I'm thinking we should just have CKAN restrict Host to catalog.data.gov via the Apache config. That way CKAN will return 400s for any requests with any other Host header. Then there is a one-time cleanup to tell search engines to purge their indexes.
Yes, that will work. Still, the host http://localhost should be allowed so we don't have to spoof an IP for local testing. So the change can happen on the port 443 conf only, or we allow anything except catalog-bsp.data.gov.
I prefer whitelists over blacklists for this kind of thing; it's usually more secure. So only allow localhost and catalog.data.gov. Otherwise folks can get creative and use our app to serve arbitrary hosts, which is problematic. For port 443, I wouldn't tackle that yet. There's an outstanding issue to move CKAN instances to serve exclusively on HTTPS, but that involves some BSP configuration #570
Cool. Whitelisting localhost and catalog.data.gov it is.
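For illustration, here is a minimal sketch of that whitelist, assuming mod_rewrite is enabled and the change lands in the port 80 vhost (the vhost layout and the surrounding CKAN directives are assumptions, not the actual config):

    <VirtualHost *:80>
        ServerName catalog.data.gov

        RewriteEngine On
        # Allow only the canonical hostname plus localhost for local testing;
        # any other Host header (including catalog-bsp.data.gov) gets a 400.
        RewriteCond %{HTTP_HOST} !^(catalog\.data\.gov|localhost)(:[0-9]+)?$ [NC]
        RewriteRule ^ - [R=400]

        # ... existing CKAN proxy/WSGI configuration ...
    </VirtualHost>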
Just for reference, we already implemented this for nginx hosts, with the whitelist defined in a single variable.
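For comparison, the nginx pattern referenced above usually boils down to something like the following sketch (the real setup keeps the allowed hostnames in a single variable in the templates; the hostnames and port here are assumptions):

    # Catch-all server: close the connection for any Host not explicitly listed.
    server {
        listen      80 default_server;
        server_name _;
        return      444;
    }

    # Only whitelisted hostnames are served.
    server {
        listen      80;
        server_name catalog.data.gov localhost;
        # ... proxy to the application ...
    }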
A short-term fix would be to add a robots.txt that is served only for catalog-bsp.data.gov, telling search engines not to index the site.
Looks like we'd have to add a noindex directive as well.
Using robots.txt or a meta tag would require some changes to CKAN. Setting the Host whitelist in Apache is theoretically a one-line change to the Apache site config.
We can't set up a directive in Apache to serve it?
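As a sketch of what that could look like (assuming mod_alias and mod_headers are available; the vhost and file path are hypothetical), Apache could serve a robots.txt and a noindex header only for the catalog-bsp.data.gov hostname. One caveat: a blanket Disallow in robots.txt stops crawlers from fetching pages at all, so they would never see the noindex header; the two shouldn't be combined blindly.

    <VirtualHost *:80>
        ServerName catalog-bsp.data.gov

        # Hypothetical path; this robots.txt is served only for this hostname.
        # It would contain:
        #   User-agent: *
        #   Disallow: /
        Alias /robots.txt /var/www/catalog-bsp/robots.txt

        # Ask search engines to drop these pages from their indexes.
        Header set X-Robots-Tag "noindex, nofollow"
    </VirtualHost>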
I think configuring Apache to restrict the hostname to catalog.data.gov is about the same amount of work as serving a robots.txt.
That's fair, it's the same amount of work to update the Apache config. There is no content hosted at catalog-bsp.data.gov, so having a robots.txt when there's no content seems unnecessary, unless it helps search engines remove content; I believe returning a 400 will have the same effect, though. Even if we add robots.txt, we still want to whitelist the domains, since it's also a potential security risk to accept arbitrary Hosts.
catalog-bsp.data.gov is not supposed to be a public site. However, we have started to see Google crawling the catalog site at this URL.
How to reproduce
Go to https://www.google.com/search?q=site%3Acatalog-bsp.data.gov
Expected behavior
No results should be returned
Actual behavior
About 42,600 results are returned.