Lower memory requirements for cached pressed pages

ETag Caching was introduced for Facia `PressedPage` JSON downloading with #26338 in order to improve scalability and address #26335, but a limiting factor was the number of `PressedPage` objects that could be stored in the cache. With a max `PressedPage` size of 22MB and a memory budget of 4GB, a cautious max cache size limit of only 180 `PressedPage` objects was set. As a result, the cache hit rate was relatively low, and we saw elevated GC, probably because of object continually being evicted out of the small cache: #26338 (comment) The change in this new commit dramatically reduces the combined size of the `PressedPage` objects held in memory, taking the average retained size per `PressedPage` from 4MB to 0.5MB (based on a sample of 125 `PressedPage` objects held in memory at the same time). It does this by deduplicating the `Tag` objects held by the `PressedPage`s. Previously, as the `Tag`s for different `PressedPage`s were deserialised from JSON, many identical tags would created over and over again, and held in memory. After dedeuplication, those different `PressedPage`s will all reference the _same_ `Tag` object for a given tag. The deduplication is done as the `Tag`s are deserialised - a new cache (gotta love caches!) holds `Tag`s keyed by their hashcode and tag id, and if a new `Tag` is created with a matching key, it's thrown away, and the old one is used instead. Thus we end up with just one instance of that `Tag`, instead of many duplicated ones. See also: * https://en.wikipedia.org/wiki/String_interning - a similar technique used by Java for Strings: https://www.geeksforgeeks.org/interning-of-string/
guardian · Sep 5, 2023 · 984f9f8 · 984f9f8
1 parent f43cb34
commit 984f9f8
Show file tree

Hide file tree

Showing 3 changed files with 49 additions and 2 deletions.
diff --git a/common/app/json/ObjectDeduplication.scala b/common/app/json/ObjectDeduplication.scala
@@ -0,0 +1,38 @@
+package json
+
+import com.github.blemale.scaffeine.Scaffeine
+import com.gu.etagcaching.ConfigCache
+import play.api.libs.json.Format
+
+/**
+  * This class is a memory optimisation for when many identical objects (eg `Tag`s) are being
+  * read from JSON, and then held in memory. If the objects are immutable & identical,
+  * we don't want to hold multiple copies of them - we can just hold one in a cache, and have all
+  * references to the object point to that same object.
+  *
+  * In practice, this means when we deserialize Json to some Scala object, we _do_ create a
+  * new instance of that Scala object, but we immediately throw it away if we find we already
+  * have a matching one in our cache - we use the cached one in preference, and we get the desired
+  * situation where all the instances of that unique value in memory are represented by one single
+  * instance.
+  *
+  * We key our cached instances by hashcode (which is mostly unique, apart from
+  * https://en.wikipedia.org/wiki/Birthday_problem collisions) combined with a user-defined id function
+  * (which is a only precaution, to eliminate Birthday Problem collisions). So in our cache, we
+  * might have several instances of the Tag `sport/cycling`, if the tag has changed - but the hashcode
+  * ensures that we do not substitute old data in place of new data. Old data can be configured to
+  * eventually expire out of the cache (eg `.expireAfterAccess(1.hour)`).
+  */
+object ObjectDeduplication {
+
+  /**
+    * @param id as precaution, this function provides an object distinguisher beyond the object hashcode. For a Tag,
+    *           you might just provide the tag id.
+    */
+  def deduplicate[A](f: Format[A], id: A => Any, configCache: ConfigCache): Format[A] = {
+    val cache = configCache(Scaffeine()).build[(Int, Any), A]().asMap()
+    def key(a: A): (Int, Any) = (a.hashCode(), id(a))
+
+    f.bimap(a => cache.getOrElseUpdate(key(a), a), identity)
+  }
+}
diff --git a/common/app/model/Formats.scala b/common/app/model/Formats.scala
@@ -2,12 +2,15 @@ package model
 
 import com.gu.contentapi.client.utils.DesignType
 import common.Pagination
+import json.ObjectDeduplication.deduplicate
 import model.content._
 import model.facia.PressedCollection
 import model.pressed._
 import play.api.libs.json._
 import play.api.libs.json.JodaReads._
 
+import scala.concurrent.duration.DurationInt
+
 object GenericThriftAtomFormat extends Format[com.gu.contentatom.thrift.Atom] {
   def reads(json: JsValue): JsError = JsError("Converting from Json is not supported by intent!")
   def writes(atom: com.gu.contentatom.thrift.Atom): JsObject = JsObject(Seq.empty)
@@ -216,7 +219,13 @@ object PressedContentFormat {
   implicit val podcastFormat = Json.format[Podcast]
   implicit val referenceFormat = Json.format[Reference]
   implicit val tagPropertiesFormat = Json.format[TagProperties]
-  implicit val tagFormat = Json.format[Tag]
+  implicit val tagFormat: Format[Tag] =
+    deduplicate(
+      Json.format[Tag],
+      _.id,
+      _.maximumSize(20000) // average Tag retains ~8KB in memory, so 20000 cached Tags retain only 160MB
+        .expireAfterAccess(1.hour),
+    )
   implicit val tagsFormat = Json.format[Tags]
   implicit val elementPropertiesFormat = Json.format[ElementProperties]
   implicit val imageAssetFormat = Json.format[ImageAsset]

diff --git a/common/app/services/fronts/FrontJsonFapi.scala b/common/app/services/fronts/FrontJsonFapi.scala
@@ -44,7 +44,7 @@ trait FrontJsonFapi extends GuLogging {
         }
       },
     AlwaysWaitForRefreshedValue,
-    _.maximumSize(180).expireAfterAccess(1.hour),
+    _.maximumSize(2000).expireAfterAccess(1.hour),
   )
 
   def get(path: String, pageType: PressedPageType): Future[Option[PressedPage]] =