Static caching and query parameters #3111

Closed
jameswtc opened this issue Jan 16, 2021 · 15 comments

@jameswtc

Bug Description

When full static caching is enabled, Statamic generates a new cached HTML file in the public/static folder for every distinct query string. This is not ideal, since a lot of inbound marketing traffic carries additional parameters, which would create thousands of copies of the same document under different names.

Some parameters, such as marketing parameters, do not affect the rendered output; in that case one version of the cached HTML should be used for all of them without re-rendering.
If a parameter does affect the output, such as pagination, the page should be rendered and cached.

We need some way to configure this, both in the CMS and in the Apache .htaccess.

How to Reproduce

Simply enable full caching and add parameters to the URL query.

Configure the .htaccess:

RewriteCond %{DOCUMENT_ROOT}/statics/%{REQUEST_URI}_%{QUERY_STRING}\.html -s
RewriteCond %{REQUEST_METHOD} GET
RewriteRule .* statics/%{REQUEST_URI}_%{QUERY_STRING}\.html [L,T=text/html]

Note: I have changed the static folder to "statics" due to a naming conflict.

Extra Detail

Request the homepage with some random parameter:

image

Environment

Mac OS, test environment with MAMP PRO.

Statamic version: 3.0.38

PHP version: 7.4

@jelleroorda
Contributor

Maybe it could be an idea to write your own custom cacher or an adapter that filters out the marketing parameters? That way you can enable query parameters like page.

@jasonvarga
Member

I implemented this in #3075 but looks like I haven't documented it. 😬

Try it out:

Add 'ignore_query_strings' => true to config/statamic/static_caching.php

and remove the %{QUERY_STRING} from the 2 lines in your .htaccess. Leave the underscore in there.
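
As a rough sketch of that suggestion, assuming the stock rules with the default `static` folder (the rules in the original post use `statics`, so adjust the folder name as needed), the .htaccess lines with `%{QUERY_STRING}` removed and the underscore kept would look something like this:

```apache
# Sketch only: same three rules as in the report, minus %{QUERY_STRING}
RewriteCond %{DOCUMENT_ROOT}/static/%{REQUEST_URI}_\.html -s
RewriteCond %{REQUEST_METHOD} GET
RewriteRule .* static/%{REQUEST_URI}_\.html [L,T=text/html]
```

With these rules every request for a path is answered from the same cached file regardless of its query string, so this does not cover parameters such as page that need their own cached version.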

@jelleroorda
Contributor

@jasonvarga I think OP wants only specific query strings to impact the static cache. I.e. it should generate a new page for https://some-url.com/?page=2 but not for https://some-url.com/?source=email. Now that I think about it again, I'm not sure whether that is possible, since you would need to strip some query string parameters at the NGINX level as well.

@jameswtc
Author

jameswtc commented Jan 19, 2021

> @jasonvarga I think OP wants only specific query strings to impact the static cache. I.e. it should generate a new page for https://some-url.com/?page=2 but not for https://some-url.com/?source=email. Now that I think about it again, I'm not sure whether that is possible, since you would need to strip some query string parameters at the NGINX level as well.

Yes, exactly.

This also applies to cases where a unique tracking code, based on a user hash, is added as a parameter.

@jasonvarga
Member

Ah yeah my mistake.

I don't think there's a way we can achieve it using the "full" file based driver.

But we could on the half one.

@wanze
Contributor

wanze commented Jan 21, 2021

This feature request looks similar to what's implemented in the Super Static Cache addon for Statamic 2. This addon allows you to whitelist query strings from which the cacher service should create static HTML files.

@jelleroorda
Contributor

Seeing your CacheExclusionChecker, it doesn't do what OP requests. In that case the whole page will not get cached at all if there is at least one query parameter that is not whitelisted.

What OP wants is the following:

Some parameters, such as marketing parameters, do not affect the rendered output; in that case one version of the cached HTML should be used for all of them without re-rendering.
If a parameter does affect the output, such as pagination, the page should be rendered and cached.

So basically he wants to have a single HTML page cached on the server for multiple URLs.

Example of URLs that could potentially use the same HTML file:

Example of URLs that should NOT use the same HTML file:

I'm not sure whether the above functionality is possible, since you'd need to have a list of query parameters you would like to ignore in your nginx configuration. It would definitely be interesting though, so once I find time I'll try to research it.

Basically there are two things that need to happen for this to work:

  • Custom static file cacher that removes the excluded query parameters from the file name when creating the static HTML file.
  • The nginx configuration should also be changed so that it first tries to find an HTML file whose name omits the excluded query parameters; if there is none, it should fall back to the usual behaviour.

@jasonvarga
Member

Generating an HTML file that's aware of excluded query params would be simple enough. It's the htaccess/nginx rules that are the hard part. I'm definitely no expert, but I think you just get "the query params" and not a way to filter them.

If you can come up with something, that'd be awesome.

Again though, this would not be too difficult with the half measure driver. But everyone loves the full measure driver. 😄

@jelleroorda
Contributor

So I found some time tonight, and it's actually possible to do. The only thing is that it's not really pretty at the nginx configuration level, but oh well...

Creating the custom static file cacher

Like Jason already mentioned, this is by far the easiest part.

  1. Create the custom static cacher. I've created mine like the following:
<?php

declare(strict_types=1);

namespace App\Caching;

use Statamic\StaticCaching\Cacher;
use Statamic\StaticCaching\Cachers\FileCacher;

class StaticCacher extends FileCacher implements Cacher
{
    /**
     * Generate the file path for this url
     *
     * @param $url
     * @return string
     */
    public function getFilePath($url)
    {
        $parts = parse_url($url);

        return sprintf('%s%s_%s.html',
            $this->getCachePath(),
            $parts['path'],
            $this->getFilteredQuery(array_get($parts, 'query'))
        );
    }

    /**
     * Filters out the GET variables that should be ignored
     *
     * @param $query
     * @return string
     */
    private function getFilteredQuery($query): string
    {
        if (!$query) {
            return '';
        }

        parse_str($query, $variables);

        $filteredQuery = array_filter($variables, function($name) {
            return !in_array($name, $this->config('ignore_query_params', []));
        }, ARRAY_FILTER_USE_KEY);

        return http_build_query($filteredQuery);
    }
}
  2. Then we need to register this new cache driver with the cache manager. I've registered it under the driver name 'static':
<?php

namespace App\Providers;

use App\Caching\StaticCacher;
use Illuminate\Cache\Repository;
use Illuminate\Support\ServiceProvider;
use Statamic\StaticCaching\Cachers\Writer;
use Statamic\StaticCaching\StaticCacheManager;

class AppServiceProvider extends ServiceProvider
{
    /**
     * Register any application services.
     *
     * @return void
     */
    public function register()
    {
        $this->app->booting(function() {
            $manager = $this->app->make(StaticCacheManager::class);

            $manager->extend('static', function($app, $config) {
                return new StaticCacher(new Writer(), $app[Repository::class], $config);
            });
        });
    }
}
  3. Next we need to use this new driver, and add the ignored query parameters to the config as well:
    // Inside config/statamic/static_caching.php
    'strategies' => [
        'full' => [
            'driver' => 'static',
            'path' => public_path('static'),
            'lock_hold_length' => 0,
            'ignore_query_params' => [
                'utm_source',
                'utm_medium',
            ]
        ],
    ]

That's it, that was not that hard. You can now test your new driver and check whether visiting your-app.com?utm_source=newsletter creates a new static file in public/static with the utm_source parameter in its name. It shouldn't; if it does, something is wrong.
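
To predict which file name to look for during that test, the filtering idea can be tried outside Statamic. The following is only a sketch: `expectedStaticFile` is a hypothetical standalone helper (not part of Statamic), it takes the ignore list as an argument instead of reading it from the config, and it returns the path relative to the static folder.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: same filtering idea as getFilteredQuery() above,
 * but with the ignore list passed in explicitly.
 */
function expectedStaticFile(string $url, array $ignored): string
{
    $parts = parse_url($url);
    $path = $parts['path'] ?? '/';   // assumption: a bare host maps to "/"

    // Split the query string into name => value pairs.
    parse_str($parts['query'] ?? '', $variables);

    // Keep only the parameters that are not on the ignore list.
    $kept = array_filter(
        $variables,
        fn ($name) => !in_array((string) $name, $ignored, true),
        ARRAY_FILTER_USE_KEY
    );

    // Same "<path>_<query>.html" naming scheme as the cacher above.
    return sprintf('%s_%s.html', $path, http_build_query($kept));
}
```

With `['utm_source', 'utm_medium']` as the ignore list, `https://your-app.com/?utm_source=newsletter` maps to `/_.html` and `https://your-app.com/blog?page=2&utm_medium=email` maps to `/blog_page=2.html`. One thing to keep in mind: `http_build_query` re-encodes values (a space becomes `+`, for instance), while the web server matches against the raw query string, so values with special characters may end up with a different file name than the server looks for.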

Setting up the nginx configuration

This is where it gets kind of ugly since, as far as I could tell, nginx has no loops or arrays like PHP does. So it really is an iterative process: remove the parameters you don't want (the ones you set up in ignore_query_params), clean up the query string, and then try to match the result against the static cache. I got most of this from a mailing list post from 2010 (thanks Ole!).

So to do this I did the following things:

  1. Copy the default parameters;
  2. Remove the parameters we want to ignore from this copy;
  3. Match with this modified copy of the parameters against the static cache.

The default static caching with Statamic looks like this:

    location / {
        try_files /static${uri}_${args}.html $uri /index.php?$args;
    }

That piece will need to be updated to something like this:

    # Create copy of query
    set $filtered_args $args;
    
    # Remove GET parameters
    if ($filtered_args ~ (.*)utm_source=[^&]*(.*)) {
        set $filtered_args $1$2;
    }
    if ($filtered_args ~ (.*)utm_medium=[^&]*(.*)) {
        set $filtered_args $1$2;
    }
    # ... add all the get parameters you have ignored in Statamic
    
    # Cleanup any repeated & introduced
    if ($filtered_args ~ (.*)&&+(.*)) {
        set $filtered_args $1&$2;
    }
    
    # Cleanup leading &
    if ($filtered_args ~ ^&(.*)) {
        set $filtered_args $1;
    }
    
    # Cleanup ending &
    if ($filtered_args ~ (.*)&$) {
        set $filtered_args $1;
    }

    location / {
        try_files /static${uri}_${filtered_args}.html $uri /index.php?$args;
    }

After updating and testing your nginx configuration (and restarting nginx), you can test the new setup. For example, visit your-app.com, edit public/static/_.html so that it contains something like 'awesome', and then visit your-app.com?utm_source=newsletter. The response should contain the content you added. Also double-check your logs: you shouldn't see a message like "Static cache loaded [https://your-app.com/?utm_source=newsletter] If you are seeing this, your server rewrite rules have not been set up correctly." If you do see it, the rewrite rules are not set up correctly :).
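
To check what `$filtered_args` ends up as for a given query string without reloading nginx, the `if`/`set` steps can be mirrored in a few lines of PHP. This is a sketch under the assumption that each `if` block runs exactly once, in the order shown above; `simulateFilteredArgs` is a hypothetical helper, not something nginx or Statamic provides.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: replays the nginx `if`/`set` steps on a raw query
 * string so the resulting $filtered_args value can be inspected.
 */
function simulateFilteredArgs(string $args, array $ignored): string
{
    $filtered = $args;

    // One `if` block per ignored parameter; each runs at most once.
    foreach ($ignored as $name) {
        $pattern = '/(.*)' . preg_quote($name, '/') . '=[^&]*(.*)/';
        if (preg_match($pattern, $filtered, $m)) {
            $filtered = $m[1] . $m[2];
        }
    }

    // Cleanup of a repeated "&" (a single pass, like the nginx `if`).
    if (preg_match('/(.*)&&+(.*)/', $filtered, $m)) {
        $filtered = $m[1] . '&' . $m[2];
    }

    // Cleanup of a leading "&".
    if (preg_match('/^&(.*)/', $filtered, $m)) {
        $filtered = $m[1];
    }

    // Cleanup of a trailing "&".
    if (preg_match('/(.*)&$/', $filtered, $m)) {
        $filtered = $m[1];
    }

    return $filtered;
}
```

With `['utm_source', 'utm_medium']`, the query `page=2&utm_source=newsletter` reduces to `page=2`, as intended. Tracing the same steps for `a=1&utm_source=x&utm_medium=y&b=2` gives `a=1&&b=2`: two ignored parameters next to each other in the middle of the query leave three `&` in a row, and the single greedy cleanup pass only removes one of them. In that case nginx would miss the cached file and fall through to PHP, so it seems worth testing the parameter combinations you expect in practice.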

All in all it was some fun research, but I have no need to use it since I don't use any marketing tools ;). I hope it'll be useful to some people, feel free to yank that code.

@jelleroorda
Contributor

Just noticed that OP used Apache in his post, I suppose you could do something similar to nginx, but since I'm not familiar with Apache you'd have to code that up yourself ¯_(ツ)_/¯

@jasonvarga
Member

This is awesome, thanks 🎉

@jameswtc
Author

Nice work. Is this going to turn into a feature?

@benlilley

This would be very handy to have built in and documented. I've just noticed that we have been serving very slow, uncached pages on all ad clicks, for example, because a new page is built each time:

Jan 10 21:58 'my-page_gclid=Cj0KCQiAtvSdBhD0ARIsAPf8oNlkb1leT1fYZiNMS_OJ07_4EuRb3-ixRhwIO9zINufHGxLihKmZzy0aAnP0EALw_wcB.html'
Jan 10 21:26 'my-page_gclid=Cj0KCQiAtvSdBhD0ARIsAPf8oNnOAtUtmKOmrcdyKzrTJSEe3Q03Pj78skAgAUFqf-q59q8lnkQp5xMaAvWwEALw_wcB.html'
Jan 10 19:02  my-page_.html

@christophstockinger
Contributor

We have a project where we actually have to exclude certain query params in static caching.
The background is that certain campaigns of the customer had very long loading times and thus a high bounce rate. The cause was that one parameter was always unique, so a new version of the page was rendered and cached on every request.

We now have a solution that allows us to define, via the .env, query params that are then ignored when caching. The whole thing is based on the suggestion of @jelleroorda.
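
The thread does not show how that .env-based setup is implemented. One way it might be sketched, assuming a comma-separated environment variable (the name `STATIC_CACHE_IGNORE_QUERY_PARAMS` and the helper `parseIgnoredParams` are both hypothetical), is to turn the raw value into the array that the `ignore_query_params` config key from the custom driver expects:

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: converts a comma-separated string such as
 * "utm_source, utm_medium,gclid" into a clean list of parameter names.
 */
function parseIgnoredParams(?string $raw): array
{
    if ($raw === null || trim($raw) === '') {
        return [];
    }

    // Trim whitespace around each name and drop empty entries.
    $names = array_map('trim', explode(',', $raw));

    return array_values(array_filter($names, fn ($name) => $name !== ''));
}

// Possible use inside config/statamic/static_caching.php (assumption):
// 'ignore_query_params' => parseIgnoredParams(env('STATIC_CACHE_IGNORE_QUERY_PARAMS')),
```

The web server rules do not read the .env file, so the same parameter names would still have to be listed in the nginx or Apache configuration.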

I think this could be integrated directly into the core, so that nobody has to write a custom static cacher. Is something like this already planned, or are you open to a contribution? @jasonvarga @jackmcdade

@jasonvarga
Member

We'd be more than happy to review a PR but have no immediate plans for this, especially since Jelle has shown it's doable using a custom driver.
