Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache strategy for channels in ChannelService #988

Closed
karel-advantitge opened this issue Jul 14, 2021 · 19 comments
Closed

Cache strategy for channels in ChannelService #988

karel-advantitge opened this issue Jul 14, 2021 · 19 comments
Labels
design 📐 This issue deals with high-level design of a feature @vendure/core

Comments

@karel-advantitge
Copy link
Contributor

Is your feature request related to a problem? Please describe.
The ChannelService as-is holds an array of all channels in memory, which is updated whenever a channel is created/updated. This makes the vendure instance stateful and can lead to invalid/out-of-sync data if a project uses multiple vendure instances.

Describe the solution you'd like
Querying from the database on every call would solve this, but since the data is used on every call (specifically getChannelFromToken) this would become a bottleneck.
Querying the session data presented a similar bottleneck, so I think we need a similar solution.
Similar to the SessionCacheStrategy config, I'd propose a ChannelCacheStrategy

@michaelbromley
Copy link
Member

michaelbromley commented Jul 14, 2021

Yes this needs to be solved. Let's consider the approaches:

  1. Cache nothing, fetch from DB when needed
    ✔ Data always correct
    ✔ Simple, no extra config.
    ❌ Performance impact. It would not be as bad as the session request, since there are only 2 top-level joins. The session query joins 3 levels deep over a much larger data set.
  2. Use a cache with TTL
    Use the TtlCache class to cache the channels for a limited period of time, e.g. 30 seconds.
    ✔ Performance would be similar to current solution
    ✔ No extra config to add to the API
    ❌ Still possibility of getting stale data up to seconds after a change.
  3. Configurable cache strategy
    ✔ Can externalize the cache store and share between instances
    ✔ Full control over caching
    ❌ Increases the config API surface

I personally like the simplicity of option 2, but in some circumstances I can imagine even a short amount of inconsistency might not be acceptable.

Investigating further, it can be seen that a similar in-memory caching technique is used in PromotionService.activePromotions, and ZoneService.updateZonesCache() which has the exact same issues as with the ChannelService.

Therefore it might make more sense to define a generalized caching strategy, which would encompass Sessions, Channels, Promotions, Zones and anything else needed in the future. This could be a single config field, which would deprecate and succeed the SessionCacheStrategy.

I'm not exactly sure of what the API of a general cache strategy would look like, but to me that's quite an elegant solution.

Thoughts?

@hendrik-advantitge
Copy link
Contributor

Hi Michael, sorry to get back to you on this so late.
Our preference is certainly your last proposal, a single CacheStrategy that works for all current and future entities.
The amount of config is not increased, and it seems unlikely that one would prefer different caching strategies for these entities.

@michaelbromley michaelbromley added @vendure/core design 📐 This issue deals with high-level design of a feature labels Aug 2, 2021
@hendrik-advantitge
Copy link
Contributor

Hi Michael. If at all possible I would like to bump the priority on this issue as it is really causing us trouble in our serverless setup. Multiple instances of Vendure exist in parallel, but they need to be restarted every time a Channel is added to or removed from the system.

@michaelbromley
Copy link
Member

michaelbromley commented Sep 22, 2021

Yep, let's get it in the next minor 👍

@michaelbromley
Copy link
Member

Also handle this line as part of the solution:

// TODO: Use caching (RequestContextCacheService) to reduce DB calls
const { outOfStockThreshold, trackInventory } = await this.globalSettingsService.getSettings(ctx);

This is making a db query for every product variant where the stock level is selected.

@michaelbromley
Copy link
Member

I've been doing some exploratory work on this issue and here's a problem with the configurable cache strategy option:

Several of the internal caches are storing deeply-nested class instances. For example, ChannelService.allChannels stores reference to an array of Channel instances, each of which has nested Zone instances.

A generic caching mechanism would need to be able to serialize this in order to store it in eg a database or Redis. The problem comes when we deserialize these data structures, the resulting objects will be POJOs and no longer class instances. Thus any downstream code which either relies on an instanceof check or attempts to access instance methods or getters will fail.

One solution to this would be to build-in some logic which allows us to reconstruct object instances based on the deserialized data. This rehydration process will however add time to the get() process, and may even nullify the gain of caching. It also adds considerable complexity to the solution.

I'm going to do a bit more exploration here and it might even be that with more use of the RequestContextCacheService (which is currently under-utilized), plus perhaps a short TTL cache, we might be able to solve this in a simpler way.

@hendrik-advantitge
Copy link
Contributor

Hi Michael, this is a very valid point you are making here. We have been using class-transformer https://github.com/typestack/class-transformer in the past to solve this exact problem, but the delays introduced by doing this have to be considered. Maybe some simple tests might be in order to determine the introduced delay?

A cache with a (configurable) TTL could make sense as well, although this might give very unexpected results in multi-server environment. Multiple parallel requests by a client are not necessarily processed by the same server instance. As a consequence, multiple requests might give different results.

@michaelbromley
Copy link
Member

Yes I looked at class-transformer, which would solve the serialization issue but the computation time is the unknown, plus I'm not keen on adding another layer of annotations onto the TypeORM entities.

I'm doing some benchmarking as I explore this, and will posts update this thread with results. To summarize, here is a list of all the points where data is currently cached in-memory:

Property What is cached? Access pattern
ChannelService.allChannels All Channels + related tax & shipping Zone channelService.getChannelFromToken() called on every request when building the RequestContext object.
ZoneService.zones All Zones + nested members (Countries) ProductVariantService.applyChannelPriceAndTax() - every ProductVariant view. OrderService.applyPriceAdjustments - every modification to Order.
PromotionService.activePromotions All Promotions where enabled: true Not actually used anywhere! Can be removed

@hendrik-advantitge
Copy link
Contributor

You might have overlooked ShippingMethodService.activeShippingMethods

@michaelbromley
Copy link
Member

@hendrik-advantitge thanks, that's right I missed that one.

So here is the direction I am going in so far:

I have decided to go with the second option mentioned in the above comment, namely keep caching in memory, but add a TTL to expire the cache, mainly due to the issues mentioned above with deserializing class instances.

I have created a function which allows the affected services to still work in the same way as presently, but adds a TTL to the caching. I found that it was important to keep the cache.get operation synchronous. I first of all attempted to make them async, but the infectious nature of async functions quickly started requiring a lot more changes than I wanted in the call stacks depending on these cached values.

So I am working on a design which will allow the cache.get operation to always be sync, while still triggering refreshes of the value async in the background:

/**
 * @description
 * A cache which always returns a value, and automatically refreshes itself if the value is found to be stale.
 */
export interface SelfRefreshingCache<V> {
    /**
     * @description
     * The current value of the cache. If the value is stale, the stale value is still returned,
     * but then will be asynchronously refreshed in the background.
     */
    value: V;

    /**
     * @description
     * Force a refresh of the value, e.g. when it is known that the value has changed such as after
     * an update operation to the source data in the database.
     */
    refresh(): Promise<void>;
}

export interface SelfRefreshingCacheConfig<V> {
    name: string;
    ttl: number;
    refreshFn: () => Promise<V>;
}

/**
 * @description
 * Creates a {@link SelfRefreshingCache} object, which is used to cache a single frequently-accessed value. In this type
 * of cache, the function used to populate the value (`refreshFn`) is defined during the creation of the cache, and
 * it is immediately used to populate the initial value.
 *
 * From there, when the `.value` property is accessed, it will _always_ return a value synchronously. If the
 * value has expired, it will still be returned and the `refreshFn` will be triggered to update the value in the
 * background.
 */
export async function createSelfRefreshingCache<V>(
    config: SelfRefreshingCacheConfig<V>,
): Promise<SelfRefreshingCache<V>> {
    const { ttl, name, refreshFn } = config;
    const initialValue = await refreshFn();
    let value = initialValue;
    let expires = new Date().getTime() + ttl;
    let refreshing = false;
    const refreshValue = () => {
        if (refreshing) {
            return Promise.resolve();
        }
        refreshing = true;
        Logger.debug(`Refreshing the SelfRefreshingCache "${name}"`);
        return refreshFn()
            .then(newValue => {
                value = newValue;
                expires = new Date().getTime() + ttl;
            })
            .catch(err => {
                Logger.error(
                    `Failed to update SelfRefreshingCache "${name}": ${err.message}`,
                    undefined,
                    err.stack,
                );
            })
            .finally(() => (refreshing = false));
    };
    return {
        get value() {
            const now = new Date().getTime();
            if (expires < now) {
                refreshValue();
            }
            return value;
        },
        refresh: refreshValue,
    };
}

In the example of the ChannelService, it would be used like this:

@Injectable()
export class ChannelService {

    private allChannels: SelfRefreshingCache<Channel[]>;

    /**
     * When the app is bootstrapped, ensure a default Channel exists and populate the
     * channel lookup array.
     */
    async initChannels() {
        await this.ensureDefaultChannelExists();
        this.allChannels = await createSelfRefreshingCache({
            name: 'ChannelService.allChannels',
            ttl: 10000,
            refreshFn: () => this.findAll(RequestContext.empty()),
        });
    }

   /**
     * Returns the default Channel.
     */
    getDefaultChannel(): Channel {
        const defaultChannel = this.allChannels.value.find(channel => channel.code === DEFAULT_CHANNEL_CODE);

        if (!defaultChannel) {
            throw new InternalServerError(`error.default-channel-not-found`);
        }
        return defaultChannel;
    }

   async update(
        ctx: RequestContext,
        input: UpdateChannelInput,
    ): Promise<ErrorResultUnion<UpdateChannelResult, Channel>> {
        // omitted the usual update logic

        // here we refresh the cached Channels array
        await this.allChannels.refresh();

        return assertFound(this.findOne(ctx, channel.id));
    }

In that example, the TTL is set to 10 seconds. This kind of timeframe seems reasonable, but we could also add some new config options to VendureConfig to make this configurable. I'd rather not have to do that, and if possible just find a value for each use-case which strikes a practical balance between avoiding staleness and avoiding hundreds of unneccesarry DB calls.

@hendrik-advantitge
Copy link
Contributor

Hi Michael
Using an in memory cache with TTL seems indeed to be the better option when we want to avoid creating classes constantly. I reread the code dealing with caching Sessions, and it seems the trick to avoid this problem there is the creation of a new entity CachedSession. This is something we could apply here as well, but resulting in many breaking changes throughout the code (especially for Channel, as this object is literally passed around to every function).

TTL avoids all this mess and keeps things much simpler. The idea of returning a value even when it is stale seems unintuitive at first sight. I totally understand the reasoning though and doubt it will give any issues in practice. But in theory, with a TTL of 10 seconds, you could still retrieve a value that is hours old if the value has not been queried in that time.

It not being an issue only holds if we stick to the entities that are currently cached. If we decide to keep Promotions cached for some reason, it might turn out to be an issue. Depending on the deployment setup, an administrator could disable a promotion while a customer still gets the promotion applied (only once though).

What are your thoughts?

@michaelbromley
Copy link
Member

But in theory, with a TTL of 10 seconds, you could still retrieve a value that is hours old if the value has not been queried in that time.

Does it happen that with a multi-instance setup you could have once instance that does not get traffic for such a long time? I would have thought that e.g. with serverless the instance might be cleaned up, and with a long-running instance with load balancer in front, each instance would get a similar frequency of requests. But I have not practical experience here so I'm interested to know if you ran into something like this so far.

@hendrik-advantitge
Copy link
Contributor

In a serverless environment, unused instances indeed do get cleaned up after some time under normal circumstances. However Google, and possibly other providers as well, have introduced features that can keep a number instances 'warm'. In this scenario, the old cache is used when spinning up the instance. Another scenario, common for us, is that the API instance is kept running by an importer service that is updating products. This is a long running process, that will not trigger updates of the shippingMethod cache for example.

@michaelbromley
Copy link
Member

OK, that makes sense. I took another go at making it async and I think it is doable. Rather get it correct than hope that edge cases won't appear - they always do!

@hendrik-advantitge
Copy link
Contributor

100% agree on this!
Any reason why you would prefer not making the TTL configurable?

@michaelbromley
Copy link
Member

Any reason why you would prefer not making the TTL configurable?

Yes, I have a bias against adding config if at all possible. Don't want to turn VendureConfig into webpack! I'm not dogmatic about it, it is just my default stance until convinced otherwise.

@hendrik-advantitge
Copy link
Contributor

I think there is nothing wrong with a lot of config options, as long as there are good default values. It gives options to the advanced users without getting in the way of new users.

michaelbromley added a commit that referenced this issue Sep 28, 2021
Relates to #988. This change resolves stale data issues when running multiple Vendure instances in
parallel, e.g. as serverless functions.
michaelbromley added a commit that referenced this issue Sep 28, 2021
Relates to #988. Zone members (countries) were being newly translated for _every_ call to
ZoneService.findAll(). In practice this was resulting in thousands of calls to `translateDeep`.
This optimization cuts the calls down by a factor of 5-10x in various scenarios.
@ashitikov
Copy link
Contributor

ashitikov commented Jan 23, 2022

Hey!
Looking at current implementation of ChannelService I found SelfRefreshingCache

private allChannels: SelfRefreshingCache<Channel[], [RequestContext]>;
it not suitable for my setup.
I built a multi-tenant solution for vendure using row level security (RLS) in postgres, so the design is to have on each vendure entity 'tenantId' property, which identifies a tenant. So in my environment ChannelService caches wrong, because on refresh, it executes findAll() which will find all channels from current request's tenant. So if ChannelService will receive multiple requests from a few tenants, cache can return channels from the last request instead on your own.

I understand that vendure recommends to use channels as tenants, but here my thoughts:

  1. If you have a 1000 tenants, you have 1000 channels. each time SelfRefreshingCache refreshes, it will hold-on in memory 1000 channels, and to getChannelFromToken you have complexity O(1000). Thats not so bad, but using this cache on a big environments become a bottleneck.
  2. Configurable TTL gives you a little control of caching behavior, but introduces a consistency problem for multi-instance deployment, which you point on.
  3. I'd like to implement my own cache to have full control on channel caching behavior. So in my setup, I'd like to have SelfRefreshingCache for each tenant, with flexible TTL setting:
    tenant1 -> cache1 -> [channel1, channel2, channel3] (TTL = 10 sec)
    tenant2 -> cache2 -> [channel4, channel5] (TTL = 5 sec)
    So TTL can be controlled by configurable cache strategy.

Currently there is no way to control caching behavior. I think 3rd option "Configurable cache strategy" would be better for custom deployments.

I think I can contribute this feature.

@michaelbromley What do you think?

@michaelbromley
Copy link
Member

Hi @ashitikov, thank you for your input on this.

  1. The O(n) lookup problem could be solved by using a Map rather than a Array.find() lookup. That would give O(1) regardless of number of Channels. As for memory usage, I'm not sure whether the impact is significant, it would need to be measured.
  2. Yes, that's the trade off with the current design. The reasoning being that Channels are one thing that very rarely changes under normal usage.
  3. I would be happy to explore designs that would allow this in a non-breaking way. Open up a new issue & PR with your ideas!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design 📐 This issue deals with high-level design of a feature @vendure/core
Projects
None yet
Development

No branches or pull requests

4 participants