Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Layout corruption with php-fpm (a Magento 1 problem returns) #6942

Closed
maderlock opened this issue Oct 10, 2016 · 32 comments
Closed

Layout corruption with php-fpm (a Magento 1 problem returns) #6942

maderlock opened this issue Oct 10, 2016 · 32 comments
Labels
Component: Framework/Cache Fixed in 2.2.x The issue has been fixed in 2.2 release line Fixed in 2.3.x The issue has been fixed in 2.3 release line Issue: Cannot Reproduce Cannot reproduce the issue on the latest `2.4-develop` branch Issue: Clear Description Gate 2 Passed. Manual verification of the issue description passed Issue: Format is valid Gate 1 Passed. Automatic verification of issue format passed

Comments

@maderlock
Copy link

maderlock commented Oct 10, 2016

Preconditions

  1. Run Magento 2 (any version) under php-fpm. Any valid php version - tested with php 5.6 to 7.0.11

Steps to reproduce

  1. Hit it with heavy traffic including xmlrpc calls
  2. Visit a page on frontend or backend

Expected result

Actual result

  1. Page displays without head tag (sometimes - intermittent)

This is fixed by clearing the layout cache. This displays in the same way as an issue in Magento 1 - see https://www.c3media.co.uk/blog/c3-news/security-fix-might-take-site

The fix in Magento 1 was to add the following to bootstrap.php:

if (function_exists('libxml_disable_entity_loader')) {
    libxml_disable_entity_loader(false);
}

Why is this needed? Because there are a number of places in Magento 2 where $oldval = libxml_disable_entity_loader(true) is called and then later libxml_disable_entity_loader($oldval). In theory this should be fine, but the value is not thread-safe, so when running under php-fpm it only requires a call to occur in between these statements for the entire thread to end up set to TRUE. Why this then corrupts to cache it was never clear in Magento 1, but that was the upshot.

The above fix was added in Magento 1.9.2.0, but was not included in Magento 2. In one file (vendor/magento/framework/Xml/Security.php) there is an acknowledgment of the issue:

        /**
         * If running with PHP-FPM we perform an heuristic scan
         * We cannot use libxml_disable_entity_loader because of this bug
         * @see https://bugs.php.net/bug.php?id=64938
         */

...but sadly it still calls libxml_disable_entity_loader despite the warning. This is also called in some of the Zend and Symfony components, so it's not really reasonable to handle all these situations. Thus the need for the suggested fix.

Not entirely sure how to fix this in the short-term, as unlike Magento 2 we cannot just edit the bootstrap.php file in Magento 2.1.0. Any suggestions?

@maderlock
Copy link
Author

I've tested the fix and unfortunately the cache corruption has occurred again, so it may be that the cause of this problem is different, or the fix needs to be different for Magento 2.

@convenient
Copy link
Contributor

Can you please post a screenshot of the error when it happens? I'm trying to replicate on vanilla need to find something to hook into my bash script so that my scripts can identify a failure programmatically.

For now i'll focus on the page missing a tag in its entirety.

@convenient
Copy link
Contributor

Theoretically if your instance was vanilla and your website suffered this error, would the output look similar to this?

image

@convenient
Copy link
Contributor

convenient commented Oct 22, 2016

I've spent an hour or so playing and have a few thoughts, although sadly no solution. I hope I actually get my hands on a demonstrable case of this bug at some point in the future.

M1 and configuration corruption

There were multiple errors that could cause config corruption in M1, aside from the cache locking race condition (https://github.com/convenient/magento-ce-ee-config-corruption-bug#update-good-news-a-patch-from-magento) the most likely culprit would be running PHP-FPM and the non thread-safe function libxml_disable_entity_loader.

https://bugs.php.net/bug.php?id=64938
https://bugs.php.net/bug.php?id=73328
https://bugs.php.net/bug.php?id=62577

libxml_disable_entity_loader breaks XML file loading

TL;DR

  • Breaks simplexml_load_file
  • Breaks XMLReader
  • But not simplexml_load_string

The summary of the above is the status of libxml_disable_entity_loader is erroneously shared between PHP-FPM threads, causing failures to load otherwise valid xml files. These failed loads do not cause the system to throw an exception, and the erroneously generated content is cached for the next user meaning an admin has to flush the relevant cache for the system to become active again.

The calls that trigger libxml_disable_entity_loader(true) will likely be completely unrelated to config or layout generation, but due to the threaded nature it simply needs to be called during the same time period.

My readings suggest that the libxml_disable_entity_loader functionality only affects the simplexml_load_file method and the XMLReader object, a combination of file_get_contents and simplexml_load_string seems to be safe.

See gist here for a quick confirmation of these thoughts https://gist.github.com/convenient/5af567eae7090ed333c6cd366e77d357.

The XML file loading loading in M2 looks safer than M1

Whereas M1 made extensive use of simplexml_load_file, M2 seems to have refactored to use the safer pairing of simplexml_load_string(file_get_contents($path).

This can be seen in Magento\Framework\View\Model\Layout\Merge::_loadFileLayoutUpdatesXml, which uses Magento\Framework\Filesystem\Driver\File::fileGetContents to load the layout xml files and then uses Magento\Framework\View\Model\Layout\Merge::_loadXmlString to actually turn the string into a usable object.

This is also consistent with your findings that smashing libxml_disable_entity_loader(false); into your index.php does not solve the issue. It doesn't fix the issue as Magento is not using simplexml_load_file nor XMLReader to load the file, it is using safe functionality which does not care about the status of libxml_disable_entity_loader.

A quick sanity check

I did a quick check of the system for any other bits of code that might be loading XML in a non-thread-safe-manner, however it only turned up tests/dev tools and the magento setup modules. None of these things should be kicking off in sites that are in production mode. Unless I'm making a massive misunderstanding, which very possible! Perhaps it's worthwhile running a similar check on your code base to ensure that no modules are using the affected functions.

grep -r "simplexml_load_file" . | grep -v "/dev/tests/"

./app/code/Magento/Dhl/Test/Unit/Model/CarrierTest.php:        $xml = simplexml_load_file(
./app/code/Magento/Dhl/Test/Unit/Model/CarrierTest.php:        $empty = $billingNumberOnly = $outputImageOnly = simplexml_load_file(
./lib/internal/Magento/Framework/Simplexml/Test/Unit/ElementTest.php:        $xml = simplexml_load_file($xmlData[0], $xmlData[1]);
./lib/internal/Magento/Framework/Simplexml/Test/Unit/ElementTest.php:        $xml = simplexml_load_file($xmlData[0], $xmlData[1]);
./pub/errors/processor.php:        return ($configPath) ? simplexml_load_file($configPath) : null;
./setup/src/Magento/Setup/Module/Dependency/Parser/Config/Xml.php:        return \simplexml_load_file($file)->xpath('/config/module')[0];
./setup/src/Magento/Setup/Module/I18n/Parser/Adapter/Xml.php:        $xml = simplexml_load_file($file);
./vendor/magento/magento-composer-installer/src/MagentoHackathon/Composer/Magento/PackageXmlParser.php:        $package = simplexml_load_file($this->getFile()->getPathname());
./vendor/magento/magento-composer-installer/src/MagentoHackathon/Composer/Magento/PackageXmlParser.php:            $targets = simplexml_load_file(__DIR__ . '/../../../../res/target.xml');
./vendor/magento/zendframework1/library/Zend/Config.php:     * Handle any errors from simplexml_load_file or parse_ini_file

Similarly I couldn't find any usages of XMLReader in the vanilla source code, no luck there.

I wonder if there are other xml reading functions that are affected by libxml_disable_entity_loader that I may have missed.

Perhaps logging the exception thats suppressed in Magento\Framework\Filesystem\Glob::glob would fully rule out the glob aspect of the layout generation as a candidate.

@maderlock
Copy link
Author

Thanks for all that work! I will try and go through it soon. For now, I'll just note that I thought there were unsafe calls in symfony and zend components that are manually turning on libxml_disable_entity_loader. Thanks again,

@eldonbite
Copy link

One of our projects was experiencing this issue as well, and as a workaround, we configured a cron task for clearing the layout cache every 3 hours just to avoid broken pages from being shown on the site.

We recently upgraded to Magento 2.1.3, and along with it, disabled the cron task for cleaning the layout cache. So far there are no reports of this issue occurring again. Anybody else experienced the same?

@convenient
Copy link
Contributor

convenient commented Jan 5, 2017

@eldonbite What version did you upgrade from? Do you have any other debug info to add?

@eldonbite
Copy link

Hi @convenient,

We were previously running 2.1.2.

@mrIntegrator
Copy link

@maderlock @convenient @eldonbite Are you using Redis for cache storage?

@eldonbite
Copy link

Hi @mrIntegrator, yes we do use Redis for both cache and session storage, and Varnish for FPC. Just recently we encountered this same issue in the Admin (instead of the front-end), and clearing just the layout cache fixed the broken Admin pages.

@mrIntegrator
Copy link

mrIntegrator commented Mar 16, 2018

Sorry, that it's been a while. This bug came back to mind when I noticed one of sites still had layout cache disabled in production as a precaution. Lo and behold, the bug still exists in the latest version of Magento (currently 2.2.3). So here's what I've discovered.

Steps to Reproduce:

  1. Install Magento with Redis cache (I used db 0 for sessions, 1 for cache, 2 for FPC).

  2. Navigate to a page not cached by FPC (I used customer/account/login).

  3. Delete the page layout cache key in Redis. (Found by doing a var_dump of $cacheIdPageLayout in Magento\Framework\View\Model\Layout\Merge::load).

  4. Refresh the page and behold the headlessness of your broken page.

Now this is just how to quickly reproduce what happens under heavy traffic as mentioned in the initial bug report. We have only experienced this happening under heavy traffic as well. The problem is in the logic of this function:

// Magento\Framework\View\Model\Layout\Merge
public function load($handles = [])
...
$cacheId = $this->getCacheId();
$cacheIdPageLayout = $cacheId . '_' . self::PAGE_LAYOUT_CACHE_SUFFIX;
$result = $this->_loadCache($cacheId);
if ($result) {
    $this->addUpdate($result);
    $this->pageLayout = $this->_loadCache($cacheIdPageLayout);
    return $this;
}
...

We can see here that there is a check as to whether the layout updates were loaded from cache, but if they were, it's just assumed that page layout will be loaded from cache as well. The problem is that when Redis reaches its maxmemory setting (as it might under heavy traffic), it will begin evicting keys (see https://redis.io/topics/lru-cache). So if the page layout key is evicted while the layout key is not, the bug occurs. To fix this bug, we have to check for a successful load of both layout and page layout from the cache.

I will create a PR with a fix for this. Hopefully Magento can get this fixed quickly and we'll all be able to turn layout cache back on!

@andrey-legayev
Copy link
Contributor

andrey-legayev commented May 7, 2019

The real production case for which I've spent 40 hours debugging:

Symptoms
Page with empty HTML. Literally any page can be empty.
(we catched it on production on checkout/success page)

Full cache flush helps temporary.

Pre-requisites

  • AWS Infra:
    • front load balancer with two Varnish+nginx servers
    • internal load balancer with several Apache+mod_php servers
    • redis master + redis replica
  • Magento cache configuration:
    • master node + replica for read requests
    • write_control option is by default (Zend_Cache_Core feature)
  • High load during promo campaign

How It Happens

  • customer sends request when Varnish and Redis cache are empty
  • Magento builds layout and saves it in 3 cache keys: layout XML, page layout (for example: "1column"), JSON with layout structure
  • after every cache save Zend_Cache_Core checks written value by reading it again and comparing to original value
    • if values do not match it deletes created cache key
  • under high load and in case of Redis master+slave configuration Redis doesn't have enough time to replicate changes
  • Zend_Cache_Core decides that value was not written correctly and tries to delete it (deletion happens on Master node of redis)
  • sometimes it happens that only 1st item from layout cache is saved (XML), but 2nd (1column) and 3rd (JSON structure) are not saved to cache
  • on next page load of the same URL customer gets to 2nd Varnish node which doesn't have cache
  • Magento starts to build layout from partially cached data, but fails to do it correctly
    • Magento reads 1st item (XML) from cache
    • then it tries to read 2nd item ("1column") but do not check returned value (!)
    • then it builds layout structure JSON and saves broken value to cache ("root", "head" and many other blocks/containers are absent, because of pageLayout==false)
    • and finally caches empty HTML in Varnish

How to easily reproduce

  1. Open redis-cli and find cache keys of your page:
    KEYS *LAYOUT*
    You may need to filter by layouts hash.
  2. Delete *_PAGE_LAYOUT cache keys (redis command "DEL")
  3. Delete STRUCTURE_LAYOUT cache keys (redis command "DEL")
  4. If you have Varnish - flush it. If you have full page cache - delete those items from redis manually.
  5. Refresh page

Actual result: Empty page, no HTML code at all.
Error in logs "before.body.end" doesn't exists.

Root cause
Magento does not check return of cache item with page layout value (for example: "1column", "3columns")

Permanent fix
Store Layout XML and Page Layout items in one cache item (*_PAGE_CACHE_MERGED)
It will eliminate the case when one item is lost and also it should give very tiny performance improvement, because 1 item is saved/loaded instead of two.

Pull requests

Other action items

  • Check how magento behaves on Redis cache with write_control=false

@ihor-sviziev ihor-sviziev removed the Fixed in 2.3.x The issue has been fixed in 2.3 release line label May 7, 2019
@andrey-legayev
Copy link
Contributor

As I see the fix has been included into v2.2.9 and v2.3.2, but pull requests weren't actually merged.

@ghost ghost assigned andrey-legayev and unassigned andrey-legayev Jun 27, 2019
@ihor-sviziev ihor-sviziev added Fixed in 2.2.x The issue has been fixed in 2.2 release line Fixed in 2.3.x The issue has been fixed in 2.3 release line labels Jun 27, 2019
@onepack
Copy link

onepack commented Sep 9, 2020

Actually. I ran into this old issue again in version 2.4. The site was running great in the months before and I have been working around the clock to get this fixed. I noticed there was a correlation between the load and these broken pages and I noticed that clearing the cache fixed things for a while (until a new load peak).
I initially thought it was due to a lack of available RAM as Redis uses a lot. (SSH root vps: top command to check).
But after I added 2gb of RAM, took the database away from the VPS and eliminated some php errors the issue still persisted with 2.5GB of free RAM.
It's less frequent but it still occurs randomly by going through categories or use the search.

@onepack
Copy link

onepack commented Sep 9, 2020

I can hit 15 categories every second without a problem but then one hangs for a while and comes up broken.
It really seems a process is stuck.
However al lot of logs are not noticing anything.
VPS messages log: no error
PHP log: nothing
Magento logs: nothing
Redis log: empty
But then httpd/domains/domain.error.log I get these notices all related to the cache and memory consumption.
PHP fatal error allowed memory size of 1gb exhausted (tried to allocate 20480 bytes) in:
public_html/vendor/collinmollenhour/cache-backend-redis/Cm/Cache/Backend/Redis.php on line 1254.
PHP fatal error allowed memory size of 1gb exhausted (tried to allocate 20480 bytes) in:
public_html/vendor/magento/framework/Serialize/Serializer/Json.php on line 37.

I will raise the one GB again to two GB now the Database is moved away from the VPS but still that is a lot of GB for processing pages. I'll come back with the outcome of this change.

@onepack
Copy link

onepack commented Sep 9, 2020

Setting it to 1.5GB didn't help.
Raising it to 2GB now on production against Magento 2 recommendation to keep production on 768mb.

Even with 2GB I get the error that 28480 bytes could not be allocated series on a page where a customer tried to set the product limit from 40 to 80.

@ihor-sviziev
Copy link
Contributor

@onepack according to the logs you shown above - seems like you got the out of memory issue, please check if your increased memory is being used, the best option will checking the memory_limit parameter in the output of the phpinfo() function on the web request

@onepack
Copy link

onepack commented Sep 9, 2020

@ihor-sviziev
Thank you for reaching out.
That was also my first idea but the same set of error messages keep on coming op.
First we raised to 1.5GB: same messages.
Then 2GB: Same messages
Allowed memory size of 2147483648 bytes exhaused, tried to allocate 33554432 bytes in the module-csp/Model/BlockCache.php line 78
Also the same in public_html/vendor/magento/framework/Serialize/Serializer/Json.php on line 37.

I was just thinking about the system running for months without a restart before we upgraded running on less RAM than we do now to keep things from crashing.
Checking the SSH TOP command shows huge RAM spikes by the PHP-fpm from 1GB to 1.8 or even 2.1GB.

Also I noticed other users with the same issue:
https://magento.stackexchange.com/questions/320037/magento-2-4-httpd-memory-leak-allowed-memory-size-of-792723456-bytes-exhausted

Checking above shows the same files causing memory peaks:
[Mon Aug 17 19:40:48.077047 2020] [php7:error] [pid 16939:tid 140670127806208] [client 127.0.0.1:40644] PHP Fatal error: Allowed memory size of 792723456 bytes exhausted (tried to allocate 20480 bytes) in /var/www/html/vendor/magento/module-csp/Model/BlockCache.php on line 108
[Mon Aug 17 19:40:48.125051 2020] [php7:error] [pid 16939:tid 140670127806208] [client 127.0.0.1:40644] PHP Fatal error: Allowed memory size of 792723456 bytes exhausted (tried to allocate 40960 bytes) in Unknown on line 0
[Mon Aug 17 19:40:50.052935 2020] [php7:error] [pid 16932:tid 140669993522944] [client 127.0.0.1:40654] PHP Fatal error: Allowed memory size of 792723456 bytes exhausted (tried to allocate 48238592 bytes) in /var/www/html/vendor/magento/framework/Serialize/Serializer/Json.php on line 24

@ihor-sviziev
Copy link
Contributor

ihor-sviziev commented Sep 9, 2020

@onepack This is definitely not related to listed above issue.
I believe you have some custom extension or theme that causing such issue, I would recommend you to debug on which exact request this issue appears (usually there is 500 error), maybe add some additional logs if neede dand then try to reproduce it yourself. If you'll find the root cause - please report separate issue.
If you still have any questions - feel free to contact me in Slack.

@onepack
Copy link

onepack commented Sep 9, 2020

I did some elastic searches and got a result on the result page but it also gave an error in the RAM.
Allowed memory size of 2147483648 bytes exhausted, tried to allocate 33554432 bytes in the module-csp/Model/BlockCache.php line 78
Again the cache files.

As a visitor on the frontend you don't notice this but the next search or going to a category around the same time will result in a broken layout or a blank page 500 error page.
At the time of searching I saw the PHP jobs peak around 1.9 and 1.7GB etc in "top" (ssh).

This is done via the search but the same happens when going through categories without the search.
I will now upgrade the RAM to 2.5GB for PHP and test again.

@onepack
Copy link

onepack commented Sep 9, 2020

@ihor-sviziev
I have been working on this issue for some days now and believe me, the first thing to do is:
deactivate the theme.
deactivate all third party modules
recompile the whole thing.
Test.
I will create a new bug report as the outcome of the pages being broken seems the same but the source of the problem seems different as there seems to be a memory leak causing this.

@onepack
Copy link

onepack commented Sep 9, 2020

@ihor-sviziev
for your information.
vendor/magento/module-csp/Model/BlockCache.php is the root of all evil.
The public function load($identifier) to be precise.
I will create a new issue to point this out.
Disabling the module fixed the issue for us.

@andrey-legayev
Copy link
Contributor

@onepack Your case looks to me different than the issue described by me above.
Do you have one Redis node or master-slave replication?

@onepack
Copy link

onepack commented Sep 9, 2020

@andrey-legayev
Thanks for replying.
It was not a Redis issue after all but the broken page results would look the same as in the old M1 days.
Please check the new case:
#29964

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Framework/Cache Fixed in 2.2.x The issue has been fixed in 2.2 release line Fixed in 2.3.x The issue has been fixed in 2.3 release line Issue: Cannot Reproduce Cannot reproduce the issue on the latest `2.4-develop` branch Issue: Clear Description Gate 2 Passed. Manual verification of the issue description passed Issue: Format is valid Gate 1 Passed. Automatic verification of issue format passed
Projects
None yet
Development

No branches or pull requests