
catalog-bsp site url is indexed by search engine #1491

Closed
FuhuXia opened this issue Mar 23, 2020 · 11 comments
Assignees
Labels
bug Software defect or bug component/catalog Related to catalog component playbooks/roles O&M Operations and maintenance tasks for the Data.gov platform

Comments

@FuhuXia
Member

FuhuXia commented Mar 23, 2020

catalog-bsp.data.gov is not supposed to be a public site. However, we have started to see Google crawling the catalog site at this URL.

How to reproduce

Go to https://www.google.com/search?q=site%3Acatalog-bsp.data.gov

Expected behavior

No results should be returned

Actual behavior

About 42,600 results returned.

@adborden
Contributor

@FuhuXia do you have ideas on the best way to fix? Since we don't actually want to serve anything at catalog-bsp.data.gov and essentially only use it as a CNAME for the BSP origin, I'm thinking we should just have CKAN restrict Host to catalog.data.gov via Apache config. That way CKAN will return 400s for any requests with Host: catalog-bsp.data.gov, but any valid requests (Host: catalog.data.gov) will respond 200 as usual.

Then there is a one-time cleanup to tell search engines to purge their indexes.
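To illustrate the proposal, here is a minimal sketch of what the Host restriction could look like in the Apache site config. This is an assumption about the vhost layout, not the actual config; using mod_rewrite to return 400 for non-whitelisted Hosts is one way to do it:

```apache
# Hypothetical sketch of the CKAN vhost, not the actual Data.gov config.
<VirtualHost *:80>
    ServerName catalog.data.gov

    # Return 400 for any Host header that isn't whitelisted
    RewriteEngine On
    RewriteCond %{HTTP_HOST} !^catalog\.data\.gov$ [NC]
    RewriteCond %{HTTP_HOST} !^localhost(:[0-9]+)?$ [NC]
    RewriteRule ^ - [R=400,L]

    # ... existing CKAN proxy/WSGI configuration ...
</VirtualHost>
```

With this in place, requests with `Host: catalog-bsp.data.gov` get a 400, while `Host: catalog.data.gov` (and `localhost` for local testing) pass through unchanged.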

@FuhuXia
Member Author

FuhuXia commented Mar 23, 2020

Yes, that will work. Still, the host http://localhost should be allowed so we don't have to spoof IPs for local testing. So the change could happen on the port 443 conf only, or we could allow anything except catalog-bsp.data.gov instead of blocking everything except catalog.data.gov.

@adborden
Contributor

I prefer whitelists over blacklists for this kind of thing; it's usually more secure. So only allow localhost and catalog.data.gov. Otherwise folks can get creative in using our app to serve arbitrary hosts, which is problematic.

For port 443, I wouldn't tackle that yet. There's an outstanding issue to move CKAN instances to serve exclusively on HTTPS, but that involves some BSP configuration #570

@FuhuXia
Member Author

FuhuXia commented Mar 23, 2020

Cool. Whitelisting catalog.data.gov sounds good to me as long as localhost is allowed.

@adborden adborden added bug Software defect or bug component/catalog Related to catalog component playbooks/roles labels Mar 23, 2020
@adborden
Contributor

Just for reference, we already implemented this for nginx hosts and define the whitelist in a single variable.
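For context, the nginx pattern referenced here typically looks something like the sketch below. The variable name and hosts are assumptions for illustration; the idea is a default server that rejects any Host not in the whitelist:

```nginx
# Hypothetical sketch of the nginx host whitelist pattern (not the actual playbook).
# Unlisted Hosts hit the default server and the connection is closed.
server {
    listen 80 default_server;
    return 444;  # nginx-specific: close the connection without a response
}

server {
    listen 80;
    server_name catalog.data.gov localhost;  # the whitelist variable would expand here
    # ... proxy configuration ...
}
```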

@mogul
Contributor

mogul commented Mar 24, 2020

A short-term fix would be to add a robots.txt that only appears for catalog-bsp.data.gov to tell search engines not to index.
https://support.google.com/webmasters/answer/6062608?hl=en
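For reference, a minimal robots.txt that asks all crawlers to stay out of the entire site would be:

```
User-agent: *
Disallow: /
```

The catch, as noted below, is serving this only for the catalog-bsp.data.gov hostname, and robots.txt alone tells crawlers not to crawl; it does not necessarily remove already-indexed pages.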

@mogul
Contributor

mogul commented Mar 24, 2020

Looks like we'd have to add a noindex directive as well.
https://developers.google.com/search/reference/robots_meta_tag
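Rather than editing CKAN's templates for a meta tag, a noindex directive can also be sent as an HTTP header. A sketch, assuming Apache 2.4 with mod_headers enabled (hostnames and placement are assumptions):

```apache
# Hypothetical: send noindex only when the request arrives via catalog-bsp.data.gov
<If "%{HTTP_HOST} == 'catalog-bsp.data.gov'">
    Header set X-Robots-Tag "noindex, nofollow"
</If>
```

The `X-Robots-Tag` response header is honored by Google the same way as the robots meta tag, without touching page markup.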

@adborden
Contributor

Using robots.txt or a meta tag would require some changes to CKAN. Setting the Host whitelist in apache is theoretically a one-line change to the apache site config.

@mogul
Contributor

mogul commented Mar 24, 2020

We can't set up a directive in Apache to serve it?

@FuhuXia
Member Author

FuhuXia commented Mar 24, 2020

I think configuring apache to restrict the hostname to catalog.data.gov is a good solution and low effort to implement. Using robots.txt or a noindex directive, either in a meta tag or in apache, is the same amount of work, if not more. Also, there is a risk that some crawlers might not follow the directives.

@adborden
Contributor

That's fair; it's the same amount of work to update the apache config. There is no content hosted at catalog-bsp.data.gov, so having a robots.txt when there's no content seems unnecessary unless it helps search engines remove content. I believe that returning a 400 will have the same effect, though.

Even if we add robots.txt, we still want to whitelist the domains, since it's also a potential security risk to accept arbitrary Hosts.

@mogul mogul added the O&M Operations and maintenance tasks for the Data.gov platform label Apr 23, 2020
@FuhuXia FuhuXia self-assigned this Sep 29, 2020
@mogul mogul added this to the Sprint 20201015 milestone Oct 15, 2020
@mogul mogul closed this as completed Oct 15, 2020