Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix serialization issues #11

Merged
merged 5 commits into from
Sep 30, 2014

Conversation

jrfnl
Copy link
Contributor

@jrfnl jrfnl commented Sep 2, 2014

Fix PHP Notice: unserialize(): Error at offset ... errors.

These kind of errors are caused most of the time by:

  • unescaped quotes/slashes etc
  • miscounted unicode characters

The cache files will now no longer be human readable. To still be able to identify them, the new cache filenames now include the package slug.

This also takes care of compatibility issues when upgrading as the new version will look for the cache files with the new filename pattern and will disregard the old cache files.

These kind of errors are caused most of the time by:
- unescaped quotes/slashes etc
- miscounted unicode characters

The change in this commit fixes this, though it does mean the cache directory has to be cleared completely (once - on upgrade to the version which includes this fix) for the fix to take effect.
…e was no way to determine which package the cache referred to which makes selective (manual) deletion difficult.
@YahnisElsts
Copy link
Owner

I don't think I've encountered that notice before. Can you provide steps to reproduce it?

In any case, patching over the problem with base64-encoding is probably not the best solution. First, lets figure out what caused the error, and then we can discuss how to fix it. Otherwise there's no way to know if the supposed fix actually works.

@jrfnl
Copy link
Contributor Author

jrfnl commented Sep 3, 2014

I can mail you a file which will give the error if that helps.

In any case, patching over the problem with base64-encoding is probably not the best solution. First, lets figure out what caused the error, and then we can discuss how to fix it. Otherwise there's no way to know if the supposed fix actually works.

I did ( research the cause of the error) and found that this was the most recommended way to fix it ;-)

Oh and my tests show that it works too ;-)

@YahnisElsts
Copy link
Owner

Sure, an example file would help. My email is whiteshadow@w-shadow.com

My gut feeling is that the error could be caused by a race condition: if one instance of the script starts writing the cache file and then another one overwrites it before the first one is finished, the file will end up in a corrupted state.

@jrfnl
Copy link
Contributor Author

jrfnl commented Sep 23, 2014

@YahnisElsts Sorry for my slow response, but I've send you some files now.

@YahnisElsts
Copy link
Owner

I've received the files. These are the original, unedited files, right? Depending on your text editor, even something as trivial as copying text from one file to another could affect whitespace and special characters.

It looks like the real problem is line endings. The files contain strings with CRLF line endings (Windows style), but the serialized string lengths are set as if they had LF-only line endings (Linux style). This means there's an extra CR character for each line break, so the real byte count doesn't match the stored length. As a result, unserialization fails.

It could be some kind of a PHP bug or config issue, but try as I might, I can't get my local version of PHP to produce a broken file like that. Maybe you could try the following code on your system? It basically runs a couple of sanity checks to make sure reading/writing files works as expected when the input contains mixed line endings.

var_dump(ini_get('auto_detect_line_endings'));
$repro = array(
    'a' => "first line\nsecond line\r\n\"third\": 'line';\n",
);
$serialized = serialize($repro);

file_put_contents('serialized.txt', $serialized);
$fromFile = file_get_contents('serialized.txt');
assert($serialized === $fromFile);

$unserialized = unserialize($serialized);
var_dump($unserialized);
assert($repro['a'] === $unserialized['a']);

@jrfnl
Copy link
Contributor Author

jrfnl commented Sep 23, 2014

@YahnisElsts I can run the check locally, but the files were produced on someone else's machine/server which I don't have access to. All the same, I see the same error popping up now and again erratically on my local test system.

Most of the time, things are fine, but I do see the error in my error logs some of the time. Although there is a backtrace, the error doesn't state which file it relates to and as I have set up a regular automated cache refresh, the files at the time I see the error may not be the same as when it was generated.

The package files however are the same (unchanged), so it really shouldn't happen at all.

Maybe a str_replace( array( "\r\n", "\r" ), "\n", $string ) just before the serialization would do the trick ?
All the same, I've used the base_encode for a while now as well and haven't seen any problems with that in place (nor any real slow down).

@YahnisElsts
Copy link
Owner

Maybe a str_replace( array( "\r\n", "\r" ), "\n", $string ) just before the serialization would do the trick?

It might, but all else being equal, a caching system shouldn't silently modify your data.

Anyway, given that this appears to be some kind of an obscure PHP bug and not something that could be reliably fixed in application code, base64-encoding the serialized string is probably an acceptable solution. Just two small suggestions:

  • Add comments explaining why data is encoded. This way, when someone stumbles upon this project in the future, they will be less tempted to throw out that seemingly-useless piece of code.
  • For cache file names, consider using metadata-b64-md5 instead of metadata-slug-md5.

@jrfnl
Copy link
Contributor Author

jrfnl commented Sep 28, 2014

@YahnisElsts I've adjusted the PR based on your suggestions with one minor difference:
I've added the 'b64' indicator to the cache filename, but also left the 'slug' part in.

That way if you'd have lots of packages and you replace one, you can still selectively (manually) delete just the one cache file to force it to update instead of having to clear the complete cache directory.
Without the slug component in the cache file name, there would be no way to identify individual cache files.

YahnisElsts added a commit that referenced this pull request Sep 30, 2014
Base64-encode cached data to fix serialization issues
@YahnisElsts YahnisElsts merged commit 338f373 into YahnisElsts:master Sep 30, 2014
@jrfnl jrfnl deleted the Fix-serialization-issues branch October 1, 2014 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants