Use heuristic to choose the most likely UTF-16 decoded string #1381
Conversation
I'm surprised that there isn't some library bytes decoder we can throw this stuff at. Do we have to do this because we're extracting arbitrary chunks of bytes for decoding?
Yeah, I think so, especially when searching in binary blobs. I didn't search too hard for a library, so if you have one in mind we can check it out. I'm still not sure if we should adopt this heuristic or not, though. Detecting UTF-16 in binaries is fiddly.
Maybe it would be helpful if our sliding-window-with-overlap chunker could detect the beginning of the file and suggest which decoders are valid for the chunks it emits.
pkg/decoders/utf16.go
Outdated
if i+1 >= len(b)-1 {
	continue
}
// Same check but offset by one.
Is the offset-by-one check in case the chunk was cut off in the middle of a character? If so, we should also invert the result of the endianness check.
For the DLL case, the UTF-16 string was embedded within other binary data, so the guesser was guessing incorrectly.
I was thinking there were four ways to decode (endianness and even/odd starting index): (LE, even), (LE, odd), (BE, even), (BE, odd).
I'm not entirely convinced that the even/odd plays a role and it's really just BE vs LE though.
What do you think of this https://github.com/trufflesecurity/trufflehog/compare/utf16-decoder-alt ?
It includes all valid BE or LE UTF-16 since both can exist together (like in the DLL). I could very well be wrong, but I think we just need to worry about BE vs. LE. The chunker will always cut off on an even byte, so we shouldn't get an off-by-one situation.
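The compare branch itself isn't reproduced here, but the both-endianness idea can be sketched roughly like this (hypothetical helper names; `unicode.IsPrint` stands in for whatever validity filter the real code applies):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode"
	"unicode/utf16"
)

// decode interprets b as UTF-16 code units in the given byte order.
func decode(b []byte, order binary.ByteOrder) []rune {
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		units = append(units, order.Uint16(b[i:i+2]))
	}
	return utf16.Decode(units)
}

// printable keeps only printable runes — a stand-in for a real
// validity filter.
func printable(rs []rune) []byte {
	var out []byte
	for _, r := range rs {
		if unicode.IsPrint(r) {
			out = append(out, []byte(string(r))...)
		}
	}
	return out
}

// decodeBoth appends the filtered LE and BE decodings, since a binary
// blob such as a DLL can contain strings in either byte order.
func decodeBoth(b []byte) []byte {
	return append(printable(decode(b, binary.LittleEndian)),
		printable(decode(b, binary.BigEndian))...)
}

func main() {
	// "key" in UTF-16LE followed by "pwd" in UTF-16BE.
	b := []byte{'k', 0, 'e', 0, 'y', 0, 0, 'p', 0, 'w', 0, 'd'}
	fmt.Println(string(decodeBoth(b))) // contains both "key" and "pwd"
}
```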
Yeah, I think that makes sense to do. I was leaning that way myself but wasn't sure about appending the two decoded strings together... it should be fine, right?
The main issue I see with it is that line numbers would be wrong, but that's out the window anyway when we're decoding partial files.
pkg/decoders/utf16.go
Outdated
// Guard against index out of bounds for the next check.
if i+1 >= len(b)-1 {
	continue
}
This can be removed since it's at the end of the for loop body.
pkg/decoders/utf16.go
Outdated
	}
}

-	return buf.Bytes(), nil
+	return append(bufLE.Bytes(), bufBE.Bytes()...), nil
}

func guessUTF16Endianness(b []byte) (binary.ByteOrder, error) {
Echoing the linter here: guessUTF16Endianness is unused and can be deleted.
Since I'm the PR author I can't approve the PR, but it looks good to me, aside from the failing performance test 😕
I'm going to see what I can do about the performance.
The slowdown seems to be due to the extra chunk data going through the detectors. We could speed it up if we only allow ASCII, but I'm not sure that's a safe assumption. I think if we want to cover all cases, we're going to get a bit of a slowdown.
Thank you both for cleaning up my mess 🙇