
Use heuristic to choose the most likely UTF-16 decoded string #1381

Merged · 4 commits · Jun 14, 2023

Conversation

@mcastorina (Collaborator)

No description provided.

@mcastorina mcastorina requested a review from a team as a code owner June 5, 2023 13:52

@rosecodym rosecodym left a comment


I'm surprised that there isn't some library bytes decoder we can throw this stuff at. Do we have to do this because we're extracting arbitrary chunks of bytes for decoding?

@mcastorina (Collaborator, Author)

> I'm surprised that there isn't some library bytes decoder we can throw this stuff at. Do we have to do this because we're extracting arbitrary chunks of bytes for decoding?

Yeah, I think so, especially when searching in binary blobs. I didn't search very hard for a library, so if you have one in mind we can check it out.

I'm still not sure if we should adopt this heuristic or not though. Detecting UTF-16 in binaries is fiddly.
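For context, the kind of heuristic under discussion can be sketched as follows. This is hypothetical illustration code, not the PR's implementation: decode the chunk under both byte orders and keep whichever result looks more like printable ASCII, on the assumption that the strings worth scanning are overwhelmingly ASCII.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf16"
)

// decodeUTF16 decodes b as UTF-16 in the given byte order.
func decodeUTF16(b []byte, bigEndian bool) string {
	u := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		if bigEndian {
			u = append(u, uint16(b[i])<<8|uint16(b[i+1]))
		} else {
			u = append(u, uint16(b[i+1])<<8|uint16(b[i]))
		}
	}
	return string(utf16.Decode(u))
}

// asciiScore returns the fraction of runes that are printable ASCII or
// whitespace. A crude but plausible scoring signal for this use case.
func asciiScore(s string) float64 {
	printable, total := 0, 0
	for _, r := range s {
		total++
		if r < 0x80 && (unicode.IsPrint(r) || unicode.IsSpace(r)) {
			printable++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(printable) / float64(total)
}

// bestUTF16 decodes b both ways and keeps the higher-scoring candidate.
func bestUTF16(b []byte) string {
	le, be := decodeUTF16(b, false), decodeUTF16(b, true)
	if asciiScore(le) >= asciiScore(be) {
		return le
	}
	return be
}

func main() {
	fmt.Println(bestUTF16([]byte{0x68, 0x00, 0x69, 0x00})) // "hi" as UTF-16LE
	fmt.Println(bestUTF16([]byte{0x00, 0x68, 0x00, 0x69})) // "hi" as UTF-16BE
}
```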

@dustin-decker (Contributor)

Maybe it would be helpful if our sliding-window-with-overlap chunker could detect the beginning of the file and suggest what decoders are valid for the chunks it emits.

```go
if i+1 >= len(b)-1 {
	continue
}
// Same check but offset by one.
```
Collaborator:


Is the offset by one check in case the chunk was cut off in the middle of a character? If so, we should also invert the result of the endianness check.

Collaborator (Author):


For the DLL case, the UTF-16 string was embedded within other binary data, so the guesser was guessing incorrectly.

I was thinking there were four ways to decode (endianness and even/odd starting index): (LE, even), (LE, odd), (BE, even), (BE, odd).

I'm not entirely convinced that the even/odd plays a role and it's really just BE vs LE though.
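The four decode variants mentioned above can be enumerated directly (illustrative sketch only; `candidates` is a hypothetical helper, not code from the PR). Note how, for this input, (BE, even) and (LE, odd) produce the same string, which is why a one-byte offset effectively inverts the endianness check:

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// candidates returns the four possible UTF-16 interpretations of b:
// (LE, even), (BE, even), (LE, odd), (BE, odd).
func candidates(b []byte) []string {
	var out []string
	for _, start := range []int{0, 1} { // even vs. odd starting index
		for _, bigEndian := range []bool{false, true} {
			var u []uint16
			for i := start; i+1 < len(b); i += 2 {
				if bigEndian {
					u = append(u, uint16(b[i])<<8|uint16(b[i+1]))
				} else {
					u = append(u, uint16(b[i+1])<<8|uint16(b[i]))
				}
			}
			out = append(out, string(utf16.Decode(u)))
		}
	}
	return out
}

func main() {
	// A leading 0x00 shifts "hi" (UTF-16LE) to an odd starting offset,
	// which also makes it decode cleanly as big-endian from offset zero.
	for _, s := range candidates([]byte{0x00, 0x68, 0x00, 0x69, 0x00}) {
		fmt.Printf("%q\n", s)
	}
}
```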

Collaborator:


What do you think of this https://github.com/trufflesecurity/trufflehog/compare/utf16-decoder-alt ?

It includes all valid BE and LE UTF-16, since both can exist together (like in the DLL). I could very well be wrong, but I think we only need to worry about BE vs. LE. The chunker will always cut on an even byte, so we shouldn't hit an off-by-one situation.
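A minimal sketch of the append-both-orders idea (simplified: this version keeps only printable-ASCII code units as a stand-in for the real decoder's validity rules, and the function names are hypothetical):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// extractASCIIUTF16 appends to buf every UTF-16 code unit of b (read in
// the given byte order) that falls in the printable ASCII range.
func extractASCIIUTF16(b []byte, order binary.ByteOrder, buf *bytes.Buffer) {
	for i := 0; i+1 < len(b); i += 2 {
		if u := order.Uint16(b[i : i+2]); u >= 0x20 && u < 0x7f {
			buf.WriteByte(byte(u))
		}
	}
}

// decodeBoth runs both byte orders over the blob and appends the results,
// since BE and LE strings can coexist in one file (as in the DLL case).
func decodeBoth(b []byte) []byte {
	var bufLE, bufBE bytes.Buffer
	extractASCIIUTF16(b, binary.LittleEndian, &bufLE)
	extractASCIIUTF16(b, binary.BigEndian, &bufBE)
	return append(bufLE.Bytes(), bufBE.Bytes()...)
}

func main() {
	// "hi" in UTF-16LE followed by "ok" in UTF-16BE.
	b := []byte{0x68, 0x00, 0x69, 0x00, 0x00, 0x6F, 0x00, 0x6B}
	fmt.Printf("%s\n", decodeBoth(b)) // hiok
}
```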

Collaborator (Author):


Yeah, I think that makes sense to do. I was leaning that way myself but wasn't sure about appending the two decoded strings together... it should be fine, right?

Collaborator:


The main issue I see with it is that line numbers would be wrong, but that's out the window anyway when we're decoding partial files.

Comment on lines 37 to 40
```go
// Guard against index out of bounds for the next check.
if i+1 >= len(b)-1 {
	continue
}
```
Collaborator (Author):


This can be removed since it's at the end of the for loop body.

```diff
	}
}

-	return buf.Bytes(), nil
+	return append(bufLE.Bytes(), bufBE.Bytes()...), nil
}

func guessUTF16Endianness(b []byte) (binary.ByteOrder, error) {
```
Collaborator (Author):


Echoing the linter here: guessUTF16Endianness is unused and can be deleted.


@mcastorina mcastorina left a comment


Since I'm the PR author I can't approve it, but it looks good to me, aside from the failing performance test 😕

@bill-rich (Collaborator)

I'm going to see what I can do about the performance.

@bill-rich (Collaborator)

The slowdown seems to be due to the extra chunk data going through the detectors. We could speed it up if we only allowed ASCII, but I'm not sure that's a safe assumption. If we want to cover all cases, I think we have to accept a bit of a slowdown.
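One way the ASCII-only fast path could look (hypothetical sketch; not the check the PR adopted): ASCII text encoded as UTF-16 has a zero high byte in every code unit, so a single cheap pass can rule chunks in or out before any decoding happens.

```go
package main

import "fmt"

// looksLikeASCIIUTF16 reports whether b matches the cheap ASCII-only
// UTF-16 pattern: the high byte of every code unit is zero, on either
// the even side (BE) or the odd side (LE).
func looksLikeASCIIUTF16(b []byte) bool {
	if len(b) < 2 || len(b)%2 != 0 {
		return false
	}
	zerosEven, zerosOdd := true, true
	for i := 0; i+1 < len(b); i += 2 {
		if b[i] != 0 {
			zerosEven = false
		}
		if b[i+1] != 0 {
			zerosOdd = false
		}
	}
	return zerosEven || zerosOdd
}

func main() {
	fmt.Println(looksLikeASCIIUTF16([]byte{0x68, 0x00, 0x69, 0x00})) // UTF-16LE "hi": true
	fmt.Println(looksLikeASCIIUTF16([]byte{0x68, 0x69}))             // plain bytes: false
}
```

Note the trade-off raised above still applies: this rejects any chunk containing non-ASCII UTF-16 text, which may not be a safe assumption.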


@ahrav ahrav left a comment


Thank you both for cleaning up my mess 🙇

@bill-rich bill-rich merged commit fb76eaf into main Jun 14, 2023
@bill-rich bill-rich deleted the utf16-decoder branch June 14, 2023 00:00