Layout corruption with php-fpm (a Magento 1 problem returns) #6942

maderlock · 2016-10-10T10:58:44Z

Preconditions

Run Magento 2 (any version) under php-fpm. Any valid php version - tested with php 5.6 to 7.0.11

Steps to reproduce

Hit it with heavy traffic including xmlrpc calls
Visit a page on frontend or backend

Expected result

Actual result

Page displays without head tag (sometimes - intermittent)

This is fixed by clearing the layout cache. This displays in the same way as an issue in Magento 1 - see https://www.c3media.co.uk/blog/c3-news/security-fix-might-take-site

The fix in Magento 1 was to add the following to bootstrap.php:

if (function_exists('libxml_disable_entity_loader')) {
    libxml_disable_entity_loader(false);
}

Why is this needed? Because there are a number of places in Magento 2 where $oldval = libxml_disable_entity_loader(true) is called and then later libxml_disable_entity_loader($oldval). In theory this should be fine, but the value is not thread-safe, so when running under php-fpm it only requires a call to occur in between these statements for the entire thread to end up set to TRUE. Why this then corrupts to cache it was never clear in Magento 1, but that was the upshot.

The above fix was added in Magento 1.9.2.0, but was not included in Magento 2. In one file (vendor/magento/framework/Xml/Security.php) there is an acknowledgment of the issue:

        /**
         * If running with PHP-FPM we perform an heuristic scan
         * We cannot use libxml_disable_entity_loader because of this bug
         * @see https://bugs.php.net/bug.php?id=64938
         */

...but sadly it still calls libxml_disable_entity_loader despite the warning. This is also called in some of the Zend and Symfony components, so it's not really reasonable to handle all these situations. Thus the need for the suggested fix.

Not entirely sure how to fix this in the short-term, as unlike Magento 2 we cannot just edit the bootstrap.php file in Magento 2.1.0. Any suggestions?

The text was updated successfully, but these errors were encountered:

maderlock · 2016-10-12T12:06:18Z

I've tested the fix and unfortunately the cache corruption has occurred again, so it may be that the cause of this problem is different, or the fix needs to be different for Magento 2.

convenient · 2016-10-18T13:51:10Z

Can you please post a screenshot of the error when it happens? I'm trying to replicate on vanilla need to find something to hook into my bash script so that my scripts can identify a failure programmatically.

For now i'll focus on the page missing a tag in its entirety.

convenient · 2016-10-21T15:15:53Z

Theoretically if your instance was vanilla and your website suffered this error, would the output look similar to this?

convenient · 2016-10-22T18:42:52Z

I've spent an hour or so playing and have a few thoughts, although sadly no solution. I hope I actually get my hands on a demonstrable case of this bug at some point in the future.

M1 and configuration corruption

There were multiple errors that could cause config corruption in M1, aside from the cache locking race condition (https://github.com/convenient/magento-ce-ee-config-corruption-bug#update-good-news-a-patch-from-magento) the most likely culprit would be running PHP-FPM and the non thread-safe function libxml_disable_entity_loader.

https://bugs.php.net/bug.php?id=64938
https://bugs.php.net/bug.php?id=73328
https://bugs.php.net/bug.php?id=62577

`libxml_disable_entity_loader` breaks XML file loading

TL;DR

Breaks simplexml_load_file
Breaks XMLReader
But not simplexml_load_string

The summary of the above is the status of libxml_disable_entity_loader is erroneously shared between PHP-FPM threads, causing failures to load otherwise valid xml files. These failed loads do not cause the system to throw an exception, and the erroneously generated content is cached for the next user meaning an admin has to flush the relevant cache for the system to become active again.

The calls that trigger libxml_disable_entity_loader(true) will likely be completely unrelated to config or layout generation, but due to the threaded nature it simply needs to be called during the same time period.

My readings suggest that the libxml_disable_entity_loader functionality only affects the simplexml_load_file method and the XMLReader object, a combination of file_get_contents and simplexml_load_string seems to be safe.

See gist here for a quick confirmation of these thoughts https://gist.github.com/convenient/5af567eae7090ed333c6cd366e77d357.

The XML file loading loading in M2 looks safer than M1

Whereas M1 made extensive use of simplexml_load_file, M2 seems to have refactored to use the safer pairing of simplexml_load_string(file_get_contents($path).

This can be seen in Magento\Framework\View\Model\Layout\Merge::_loadFileLayoutUpdatesXml, which uses Magento\Framework\Filesystem\Driver\File::fileGetContents to load the layout xml files and then uses Magento\Framework\View\Model\Layout\Merge::_loadXmlString to actually turn the string into a usable object.

This is also consistent with your findings that smashing libxml_disable_entity_loader(false); into your index.php does not solve the issue. It doesn't fix the issue as Magento is not using simplexml_load_file nor XMLReader to load the file, it is using safe functionality which does not care about the status of libxml_disable_entity_loader.

A quick sanity check

I did a quick check of the system for any other bits of code that might be loading XML in a non-thread-safe-manner, however it only turned up tests/dev tools and the magento setup modules. None of these things should be kicking off in sites that are in production mode. Unless I'm making a massive misunderstanding, which very possible! Perhaps it's worthwhile running a similar check on your code base to ensure that no modules are using the affected functions.

grep -r "simplexml_load_file" . | grep -v "/dev/tests/"

./app/code/Magento/Dhl/Test/Unit/Model/CarrierTest.php:        $xml = simplexml_load_file(
./app/code/Magento/Dhl/Test/Unit/Model/CarrierTest.php:        $empty = $billingNumberOnly = $outputImageOnly = simplexml_load_file(
./lib/internal/Magento/Framework/Simplexml/Test/Unit/ElementTest.php:        $xml = simplexml_load_file($xmlData[0], $xmlData[1]);
./lib/internal/Magento/Framework/Simplexml/Test/Unit/ElementTest.php:        $xml = simplexml_load_file($xmlData[0], $xmlData[1]);
./pub/errors/processor.php:        return ($configPath) ? simplexml_load_file($configPath) : null;
./setup/src/Magento/Setup/Module/Dependency/Parser/Config/Xml.php:        return \simplexml_load_file($file)->xpath('/config/module')[0];
./setup/src/Magento/Setup/Module/I18n/Parser/Adapter/Xml.php:        $xml = simplexml_load_file($file);
./vendor/magento/magento-composer-installer/src/MagentoHackathon/Composer/Magento/PackageXmlParser.php:        $package = simplexml_load_file($this->getFile()->getPathname());
./vendor/magento/magento-composer-installer/src/MagentoHackathon/Composer/Magento/PackageXmlParser.php:            $targets = simplexml_load_file(__DIR__ . '/../../../../res/target.xml');
./vendor/magento/zendframework1/library/Zend/Config.php:     * Handle any errors from simplexml_load_file or parse_ini_file

Similarly I couldn't find any usages of XMLReader in the vanilla source code, no luck there.

I wonder if there are other xml reading functions that are affected by libxml_disable_entity_loader that I may have missed.

Perhaps logging the exception thats suppressed in Magento\Framework\Filesystem\Glob::glob would fully rule out the glob aspect of the layout generation as a candidate.

maderlock · 2016-10-31T17:31:12Z

Thanks for all that work! I will try and go through it soon. For now, I'll just note that I thought there were unsafe calls in symfony and zend components that are manually turning on libxml_disable_entity_loader. Thanks again,

eldonbite · 2017-01-05T09:22:22Z

One of our projects was experiencing this issue as well, and as a workaround, we configured a cron task for clearing the layout cache every 3 hours just to avoid broken pages from being shown on the site.

We recently upgraded to Magento 2.1.3, and along with it, disabled the cron task for cleaning the layout cache. So far there are no reports of this issue occurring again. Anybody else experienced the same?

convenient · 2017-01-05T09:23:37Z

@eldonbite What version did you upgrade from? Do you have any other debug info to add?

eldonbite · 2017-01-05T09:24:46Z

Hi @convenient,

We were previously running 2.1.2.

mrIntegrator · 2017-07-28T00:18:25Z

@maderlock @convenient @eldonbite Are you using Redis for cache storage?

eldonbite · 2017-07-28T14:51:20Z

Hi @mrIntegrator, yes we do use Redis for both cache and session storage, and Varnish for FPC. Just recently we encountered this same issue in the Admin (instead of the front-end), and clearing just the layout cache fixed the broken Admin pages.

mrIntegrator · 2018-03-16T03:41:33Z

Sorry, that it's been a while. This bug came back to mind when I noticed one of sites still had layout cache disabled in production as a precaution. Lo and behold, the bug still exists in the latest version of Magento (currently 2.2.3). So here's what I've discovered.

Steps to Reproduce:

Install Magento with Redis cache (I used db 0 for sessions, 1 for cache, 2 for FPC).
Navigate to a page not cached by FPC (I used customer/account/login).
Delete the page layout cache key in Redis. (Found by doing a var_dump of $cacheIdPageLayout in Magento\Framework\View\Model\Layout\Merge::load).
Refresh the page and behold the headlessness of your broken page.

Now this is just how to quickly reproduce what happens under heavy traffic as mentioned in the initial bug report. We have only experienced this happening under heavy traffic as well. The problem is in the logic of this function:

// Magento\Framework\View\Model\Layout\Merge
public function load($handles = [])
...
$cacheId = $this->getCacheId();
$cacheIdPageLayout = $cacheId . '_' . self::PAGE_LAYOUT_CACHE_SUFFIX;
$result = $this->_loadCache($cacheId);
if ($result) {
    $this->addUpdate($result);
    $this->pageLayout = $this->_loadCache($cacheIdPageLayout);
    return $this;
}
...

We can see here that there is a check as to whether the layout updates were loaded from cache, but if they were, it's just assumed that page layout will be loaded from cache as well. The problem is that when Redis reaches its maxmemory setting (as it might under heavy traffic), it will begin evicting keys (see https://redis.io/topics/lru-cache). So if the page layout key is evicted while the layout key is not, the bug occurs. To fix this bug, we have to check for a successful load of both layout and page layout from the cache.

I will create a PR with a fix for this. Hopefully Magento can get this fixed quickly and we'll all be able to turn layout cache back on!

andrey-legayev · 2019-05-07T12:38:28Z

The real production case for which I've spent 40 hours debugging:

Symptoms
Page with empty HTML. Literally any page can be empty.
(we catched it on production on checkout/success page)

Full cache flush helps temporary.

Pre-requisites

AWS Infra:
- front load balancer with two Varnish+nginx servers
- internal load balancer with several Apache+mod_php servers
- redis master + redis replica
Magento cache configuration:
- master node + replica for read requests
- write_control option is by default (Zend_Cache_Core feature)
High load during promo campaign

How It Happens

customer sends request when Varnish and Redis cache are empty
Magento builds layout and saves it in 3 cache keys: layout XML, page layout (for example: "1column"), JSON with layout structure
after every cache save Zend_Cache_Core checks written value by reading it again and comparing to original value
- if values do not match it deletes created cache key
under high load and in case of Redis master+slave configuration Redis doesn't have enough time to replicate changes
Zend_Cache_Core decides that value was not written correctly and tries to delete it (deletion happens on Master node of redis)
sometimes it happens that only 1st item from layout cache is saved (XML), but 2nd (1column) and 3rd (JSON structure) are not saved to cache
on next page load of the same URL customer gets to 2nd Varnish node which doesn't have cache
Magento starts to build layout from partially cached data, but fails to do it correctly
- Magento reads 1st item (XML) from cache
- then it tries to read 2nd item ("1column") but do not check returned value (!)
- then it builds layout structure JSON and saves broken value to cache ("root", "head" and many other blocks/containers are absent, because of pageLayout==false)
- and finally caches empty HTML in Varnish

How to easily reproduce

Open redis-cli and find cache keys of your page:
KEYS *LAYOUT*
You may need to filter by layouts hash.
Delete *_PAGE_LAYOUT cache keys (redis command "DEL")
Delete STRUCTURE_LAYOUT cache keys (redis command "DEL")
If you have Varnish - flush it. If you have full page cache - delete those items from redis manually.
Refresh page

Actual result: Empty page, no HTML code at all.
Error in logs "before.body.end" doesn't exists.

Root cause
Magento does not check return of cache item with page layout value (for example: "1column", "3columns")

Permanent fix
Store Layout XML and Page Layout items in one cache item (*_PAGE_CACHE_MERGED)
It will eliminate the case when one item is lost and also it should give very tiny performance improvement, because 1 item is saved/loaded instead of two.

Pull requests

Other action items

Check how magento behaves on Redis cache with write_control=false

andrey-legayev · 2019-06-27T14:17:48Z

As I see the fix has been included into v2.2.9 and v2.3.2, but pull requests weren't actually merged.

onepack · 2020-09-09T07:29:42Z

Actually. I ran into this old issue again in version 2.4. The site was running great in the months before and I have been working around the clock to get this fixed. I noticed there was a correlation between the load and these broken pages and I noticed that clearing the cache fixed things for a while (until a new load peak).
I initially thought it was due to a lack of available RAM as Redis uses a lot. (SSH root vps: top command to check).
But after I added 2gb of RAM, took the database away from the VPS and eliminated some php errors the issue still persisted with 2.5GB of free RAM.
It's less frequent but it still occurs randomly by going through categories or use the search.

onepack · 2020-09-09T08:34:34Z

I can hit 15 categories every second without a problem but then one hangs for a while and comes up broken.
It really seems a process is stuck.
However al lot of logs are not noticing anything.
VPS messages log: no error
PHP log: nothing
Magento logs: nothing
Redis log: empty
But then httpd/domains/domain.error.log I get these notices all related to the cache and memory consumption.
PHP fatal error allowed memory size of 1gb exhausted (tried to allocate 20480 bytes) in:
public_html/vendor/collinmollenhour/cache-backend-redis/Cm/Cache/Backend/Redis.php on line 1254.
PHP fatal error allowed memory size of 1gb exhausted (tried to allocate 20480 bytes) in:
public_html/vendor/magento/framework/Serialize/Serializer/Json.php on line 37.

I will raise the one GB again to two GB now the Database is moved away from the VPS but still that is a lot of GB for processing pages. I'll come back with the outcome of this change.

onepack · 2020-09-09T09:07:34Z

Setting it to 1.5GB didn't help.
Raising it to 2GB now on production against Magento 2 recommendation to keep production on 768mb.

Even with 2GB I get the error that 28480 bytes could not be allocated series on a page where a customer tried to set the product limit from 40 to 80.

ihor-sviziev · 2020-09-09T11:17:56Z

@onepack according to the logs you shown above - seems like you got the out of memory issue, please check if your increased memory is being used, the best option will checking the memory_limit parameter in the output of the phpinfo() function on the web request

onepack · 2020-09-09T11:42:44Z

@ihor-sviziev
Thank you for reaching out.
That was also my first idea but the same set of error messages keep on coming op.
First we raised to 1.5GB: same messages.
Then 2GB: Same messages
Allowed memory size of 2147483648 bytes exhaused, tried to allocate 33554432 bytes in the module-csp/Model/BlockCache.php line 78
Also the same in public_html/vendor/magento/framework/Serialize/Serializer/Json.php on line 37.

I was just thinking about the system running for months without a restart before we upgraded running on less RAM than we do now to keep things from crashing.
Checking the SSH TOP command shows huge RAM spikes by the PHP-fpm from 1GB to 1.8 or even 2.1GB.

Also I noticed other users with the same issue:
https://magento.stackexchange.com/questions/320037/magento-2-4-httpd-memory-leak-allowed-memory-size-of-792723456-bytes-exhausted

Checking above shows the same files causing memory peaks:
[Mon Aug 17 19:40:48.077047 2020] [php7:error] [pid 16939:tid 140670127806208] [client 127.0.0.1:40644] PHP Fatal error: Allowed memory size of 792723456 bytes exhausted (tried to allocate 20480 bytes) in /var/www/html/vendor/magento/module-csp/Model/BlockCache.php on line 108
[Mon Aug 17 19:40:48.125051 2020] [php7:error] [pid 16939:tid 140670127806208] [client 127.0.0.1:40644] PHP Fatal error: Allowed memory size of 792723456 bytes exhausted (tried to allocate 40960 bytes) in Unknown on line 0
[Mon Aug 17 19:40:50.052935 2020] [php7:error] [pid 16932:tid 140669993522944] [client 127.0.0.1:40654] PHP Fatal error: Allowed memory size of 792723456 bytes exhausted (tried to allocate 48238592 bytes) in /var/www/html/vendor/magento/framework/Serialize/Serializer/Json.php on line 24

ihor-sviziev · 2020-09-09T13:03:51Z

@onepack This is definitely not related to listed above issue.
I believe you have some custom extension or theme that causing such issue, I would recommend you to debug on which exact request this issue appears (usually there is 500 error), maybe add some additional logs if neede dand then try to reproduce it yourself. If you'll find the root cause - please report separate issue.
If you still have any questions - feel free to contact me in Slack.

onepack · 2020-09-09T13:21:39Z

I did some elastic searches and got a result on the result page but it also gave an error in the RAM.
Allowed memory size of 2147483648 bytes exhausted, tried to allocate 33554432 bytes in the module-csp/Model/BlockCache.php line 78
Again the cache files.

As a visitor on the frontend you don't notice this but the next search or going to a category around the same time will result in a broken layout or a blank page 500 error page.
At the time of searching I saw the PHP jobs peak around 1.9 and 1.7GB etc in "top" (ssh).

This is done via the search but the same happens when going through categories without the search.
I will now upgrade the RAM to 2.5GB for PHP and test again.

onepack · 2020-09-09T13:54:46Z

@ihor-sviziev
I have been working on this issue for some days now and believe me, the first thing to do is:
deactivate the theme.
deactivate all third party modules
recompile the whole thing.
Test.
I will create a new bug report as the outcome of the pages being broken seems the same but the source of the problem seems different as there seems to be a memory leak causing this.

onepack · 2020-09-09T15:12:00Z

@ihor-sviziev
for your information.
vendor/magento/module-csp/Model/BlockCache.php is the root of all evil.
The public function load($identifier) to be precise.
I will create a new issue to point this out.
Disabling the module fixed the issue for us.

andrey-legayev · 2020-09-09T15:22:34Z

@onepack Your case looks to me different than the issue described by me above.
Do you have one Redis node or master-slave replication?

onepack · 2020-09-09T16:01:54Z

@andrey-legayev
Thanks for replying.
It was not a Redis issue after all but the broken page results would look the same as in the old M1 days.
Please check the new case:
#29964

veloraven added Component: Framework/Cache Performance bug report labels Oct 10, 2016

KrystynaKabannyk assigned choukalos Feb 2, 2017

bka mentioned this issue Feb 14, 2017

PageCache: async rendering of blocks can corrupt layout cache #8554

Closed

choukalos removed their assignment Feb 21, 2017

magento-engcom-team added 2.0.x bug report Component: Framework/Cache Issue: Format is not valid Gate 1 Failed. Automatic verification of issue format is failed labels Sep 11, 2017

okorshenko removed the 2.0.x label Dec 13, 2017

magento-engcom-team removed 2.1.x labels Dec 14, 2017

ghost assigned andrey-legayev and unassigned andrey-legayev May 7, 2019

This was referenced May 7, 2019

Fix layout xml and page layout caching issue on redis cluster under high load #22765

Closed

Fix layout xml and page layout caching issue on redis cluster under high load #22766

Closed

ghost assigned andrey-legayev and unassigned andrey-legayev May 7, 2019

ihor-sviziev removed the Fixed in 2.3.x The issue has been fixed in 2.3 release line label May 7, 2019

ihor-sviziev mentioned this issue May 24, 2019

Corrupt layout cache causes white page on customer account login #22976

Closed

ghost assigned andrey-legayev and unassigned andrey-legayev Jun 27, 2019

andrey-legayev closed this as completed Jun 27, 2019

ihor-sviziev added Fixed in 2.2.x The issue has been fixed in 2.2 release line Fixed in 2.3.x The issue has been fixed in 2.3 release line labels Jun 27, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Layout corruption with php-fpm (a Magento 1 problem returns) #6942

Layout corruption with php-fpm (a Magento 1 problem returns) #6942

maderlock commented Oct 10, 2016 •

edited by magento-engcom-team

Loading

maderlock commented Oct 12, 2016

convenient commented Oct 18, 2016

convenient commented Oct 21, 2016

convenient commented Oct 22, 2016 •

edited

Loading

maderlock commented Oct 31, 2016

eldonbite commented Jan 5, 2017

convenient commented Jan 5, 2017 •

edited

Loading

eldonbite commented Jan 5, 2017

mrIntegrator commented Jul 28, 2017

eldonbite commented Jul 28, 2017

mrIntegrator commented Mar 16, 2018 •

edited

Loading

andrey-legayev commented May 7, 2019 •

edited

Loading

andrey-legayev commented Jun 27, 2019

onepack commented Sep 9, 2020 •

edited

Loading

onepack commented Sep 9, 2020 •

edited

Loading

onepack commented Sep 9, 2020 •

edited

Loading

ihor-sviziev commented Sep 9, 2020

onepack commented Sep 9, 2020 •

edited

Loading

ihor-sviziev commented Sep 9, 2020 •

edited

Loading

onepack commented Sep 9, 2020

onepack commented Sep 9, 2020

onepack commented Sep 9, 2020

andrey-legayev commented Sep 9, 2020

onepack commented Sep 9, 2020

Layout corruption with php-fpm (a Magento 1 problem returns) #6942

Layout corruption with php-fpm (a Magento 1 problem returns) #6942

Comments

maderlock commented Oct 10, 2016 • edited by magento-engcom-team Loading

Preconditions

Steps to reproduce

Expected result

Actual result

maderlock commented Oct 12, 2016

convenient commented Oct 18, 2016

convenient commented Oct 21, 2016

convenient commented Oct 22, 2016 • edited Loading

M1 and configuration corruption

libxml_disable_entity_loader breaks XML file loading

The XML file loading loading in M2 looks safer than M1

A quick sanity check

maderlock commented Oct 31, 2016

eldonbite commented Jan 5, 2017

convenient commented Jan 5, 2017 • edited Loading

eldonbite commented Jan 5, 2017

mrIntegrator commented Jul 28, 2017

eldonbite commented Jul 28, 2017

mrIntegrator commented Mar 16, 2018 • edited Loading

andrey-legayev commented May 7, 2019 • edited Loading

andrey-legayev commented Jun 27, 2019

onepack commented Sep 9, 2020 • edited Loading

onepack commented Sep 9, 2020 • edited Loading

onepack commented Sep 9, 2020 • edited Loading

ihor-sviziev commented Sep 9, 2020

onepack commented Sep 9, 2020 • edited Loading

ihor-sviziev commented Sep 9, 2020 • edited Loading

onepack commented Sep 9, 2020

onepack commented Sep 9, 2020

onepack commented Sep 9, 2020

andrey-legayev commented Sep 9, 2020

onepack commented Sep 9, 2020

maderlock commented Oct 10, 2016 •

edited by magento-engcom-team

Loading

convenient commented Oct 22, 2016 •

edited

Loading

`libxml_disable_entity_loader` breaks XML file loading

convenient commented Jan 5, 2017 •

edited

Loading

mrIntegrator commented Mar 16, 2018 •

edited

Loading

andrey-legayev commented May 7, 2019 •

edited

Loading

onepack commented Sep 9, 2020 •

edited

Loading

onepack commented Sep 9, 2020 •

edited

Loading

onepack commented Sep 9, 2020 •

edited

Loading

onepack commented Sep 9, 2020 •

edited

Loading

ihor-sviziev commented Sep 9, 2020 •

edited

Loading