EES-5003 - Sitemap and robots not dynamically generated #4959
Conversation
@@ -0,0 +1 @@
PUBLIC_URL=https://explore-education-statistics.service.gov.uk/
Think we'll need to update the environment variables in the Playwright tests too? The PUBLIC_URL variable will be set to the actual environment URL, so imagining we'll need a PROD_PUBLIC_URL variable or similar.
Hrm yes good point, but actually I wonder if we should make these tests prod only?
So with these changes, if the frontend is run locally in dev mode we should still get a localhost robots file, so the tests should pass as they are. If the frontend is run in prod mode then they'll fail.
If we make the tests always use the live site prod URL, then at each stage of the pipeline we'll effectively be running these tests against prod. If this got merged and deployed and those tests then failed for any reason, they'd fail on every environment, which is obviously a bit misleading and makes it hard to get a fix through the pipeline steps.
As a side note, in my local branch for improving how we test new redirects I've introduced tags to the Playwright tests so I can filter which ones are run. The intention is to replicate the way Robot tests can be run on selected environments. We could do something similar here?
Or, just skip these tests for now for the sake of getting a fix out, and do it in another PR?
Making them prod only might be a good idea. I'm just a little concerned about spending too much time improving the Playwright tests given we're not committing to them for the time being.
Imagine we'll need to have some more conversations about the future of them soon, but my preference would be to have any important suites moved back into the Robot tests to keep everything cohesive.
Also, just realised we're not actually running the Playwright or Robot tests for non-dev environments in the pipeline anymore. This makes me think that adding tagging functionality is probably overkill.
Regardless, we might not be on the same page about the PROD_PUBLIC_URL variable approach. I was thinking of having this alongside PUBLIC_URL (it wouldn't be a replacement), and it would be used for the robots.txt assertions like:
test('The robots.txt file points to the sitemap', async ({ page }) => {
await page.goto(`${PUBLIC_URL}/robots.txt`);
await expect(page).toHaveURL(`${PUBLIC_URL}/robots.txt`);
await expect(
page.getByText(`Sitemap: ${PROD_PUBLIC_URL}/sitemap.xml`),
).toHaveCount(1);
await expect(
page.getByText(`Sitemap: ${PROD_PUBLIC_URL}/server-sitemap.xml`),
).toHaveCount(1);
});
This way, the tests are running against their respective environment (and not only prod).
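Something like this for wiring up the two variables in the test setup (just a sketch - the PROD_PUBLIC_URL name and the fallback values are my suggestion, not anything that exists in the repo yet):

// Sketch only: PROD_PUBLIC_URL is a proposed variable and the fallbacks are assumptions
const PUBLIC_URL = process.env.PUBLIC_URL ?? 'http://localhost:3000';
const PROD_PUBLIC_URL =
  process.env.PROD_PUBLIC_URL ??
  'https://explore-education-statistics.service.gov.uk';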
Hrm yeah fair enough, very happy to do that
(& agree on not spending more time on Playwright - I'm only modifying existing tests / reducing the faff involved in running them, certainly not intending to add more, and not doing anything that would take me longer than 15 mins or so)
policies: [
  {
    userAgent: '*',
    allow: '/',
    disallow: [
      '/data-tables/fast-track/',
      '/data-tables/permalink/',
      '/subscriptions/',
    ],
  },
],
Following our discussion yesterday, had a little think and this will probably be fine as bots can't really crawl the non-prod environments because our basic auth will prevent general access.
Does make me think that the meta tags we added were probably overkill as well 🤔 guess they don't hurt to have anyway!
Yeah exactly - I had the same thoughts when we put them in tbh, but I guess it's a good regression safeguard in case we ever remove that auth 😄
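For context, these policies live in the next-sitemap config, roughly along these lines (a sketch - the siteUrl and additionalSitemaps values here are assumptions rather than the exact file contents):

/** @type {import('next-sitemap').IConfig} */
module.exports = {
  // Assumption: siteUrl comes from the PUBLIC_URL env variable this PR adds
  siteUrl: process.env.PUBLIC_URL,
  generateRobotsTxt: true,
  robotsTxtOptions: {
    policies: [
      {
        userAgent: '*',
        allow: '/',
        disallow: [
          '/data-tables/fast-track/',
          '/data-tables/permalink/',
          '/subscriptions/',
        ],
      },
    ],
    // Assumption: the server-generated sitemap is referenced separately
    additionalSitemaps: [`${process.env.PUBLIC_URL}server-sitemap.xml`],
  },
};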
Next Sitemap is generating the robots.txt and sitemap.xml files at build time, and not re-generating them dynamically on each page load. Because we only build the frontend once - without any environment information - we cannot distinguish at build time what the value of PUBLIC_URL should be.
There are a few approaches to how we could tackle this, but the fastest and simplest right now is to create a .env.production config file which will be prioritized over .env during next build.
Also, I missed some require changes when converting from sitemap-summaries to sitemap-items, so this adds those in too 🙈
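To illustrate the priority (a sketch - only the .env.production value reflects this PR; the localhost value is an assumption for local dev):

# .env - picked up by default (e.g. local dev)
PUBLIC_URL=http://localhost:3000/

# .env.production - takes priority over .env when NODE_ENV=production,
# i.e. during `next build`, so the generated robots.txt and sitemap.xml get the live URL
PUBLIC_URL=https://explore-education-statistics.service.gov.uk/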