
Use heuristic to choose the most likely UTF-16 decoded string #1381

Merged · 4 commits · Jun 14, 2023

Conversation

@mcastorina (Collaborator)

No description provided.

@mcastorina mcastorina requested a review from a team as a code owner June 5, 2023 13:52

@rosecodym rosecodym left a comment


I'm surprised that there isn't some library bytes decoder we can throw this stuff at. Do we have to do this because we're extracting arbitrary chunks of bytes for decoding?

@mcastorina (Collaborator, Author)

> I'm surprised that there isn't some library bytes decoder we can throw this stuff at. Do we have to do this because we're extracting arbitrary chunks of bytes for decoding?

Yeah, I think so, especially when searching in binary blobs. I didn't search very hard for a library, so if you have one in mind we can check it out.

I'm still not sure if we should adopt this heuristic or not though. Detecting UTF-16 in binaries is fiddly.
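For context, the kind of heuristic under discussion can be sketched as follows. This is hypothetical illustration code, not the PR's implementation: decode the chunk under both byte orders and keep whichever result looks more like printable ASCII, on the assumption that the strings worth scanning are overwhelmingly ASCII.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf16"
)

// decodeUTF16 decodes b as UTF-16 in the given byte order.
func decodeUTF16(b []byte, bigEndian bool) string {
	u := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		if bigEndian {
			u = append(u, uint16(b[i])<<8|uint16(b[i+1]))
		} else {
			u = append(u, uint16(b[i+1])<<8|uint16(b[i]))
		}
	}
	return string(utf16.Decode(u))
}

// asciiScore returns the fraction of runes that are printable ASCII or
// whitespace. A crude but plausible scoring signal for this use case.
func asciiScore(s string) float64 {
	printable, total := 0, 0
	for _, r := range s {
		total++
		if r < 0x80 && (unicode.IsPrint(r) || unicode.IsSpace(r)) {
			printable++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(printable) / float64(total)
}

// bestUTF16 decodes b both ways and keeps the higher-scoring candidate.
func bestUTF16(b []byte) string {
	le, be := decodeUTF16(b, false), decodeUTF16(b, true)
	if asciiScore(le) >= asciiScore(be) {
		return le
	}
	return be
}

func main() {
	fmt.Println(bestUTF16([]byte{0x68, 0x00, 0x69, 0x00})) // "hi" as UTF-16LE
	fmt.Println(bestUTF16([]byte{0x00, 0x68, 0x00, 0x69})) // "hi" as UTF-16BE
}
```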

@dustin-decker (Contributor)

Maybe it would be helpful if our sliding-window-with-overlap chunker could detect the beginning of the file and suggest what decoders are valid for the chunks it emits.

```go
if i+1 >= len(b)-1 {
	continue
}
// Same check but offset by one.
```
Collaborator:


Is the offset by one check in case the chunk was cut off in the middle of a character? If so, we should also invert the result of the endianness check.

Collaborator (Author):


For the DLL case, the UTF-16 string was embedded within other binary data, so the guesser was guessing incorrectly.

I was thinking there were four ways to decode (endianness and even/odd starting index): (LE, even), (LE, odd), (BE, even), (BE, odd).

I'm not entirely convinced that the even/odd plays a role and it's really just BE vs LE though.
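The four decode variants mentioned above can be enumerated directly (illustrative sketch only; `candidates` is a hypothetical helper, not code from the PR). Note how, for this input, (BE, even) and (LE, odd) produce the same string, which is why a one-byte offset effectively inverts the endianness check:

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// candidates returns the four possible UTF-16 interpretations of b:
// (LE, even), (BE, even), (LE, odd), (BE, odd).
func candidates(b []byte) []string {
	var out []string
	for _, start := range []int{0, 1} { // even vs. odd starting index
		for _, bigEndian := range []bool{false, true} {
			var u []uint16
			for i := start; i+1 < len(b); i += 2 {
				if bigEndian {
					u = append(u, uint16(b[i])<<8|uint16(b[i+1]))
				} else {
					u = append(u, uint16(b[i+1])<<8|uint16(b[i]))
				}
			}
			out = append(out, string(utf16.Decode(u)))
		}
	}
	return out
}

func main() {
	// A leading 0x00 shifts "hi" (UTF-16LE) to an odd starting offset,
	// which also makes it decode cleanly as big-endian from offset zero.
	for _, s := range candidates([]byte{0x00, 0x68, 0x00, 0x69, 0x00}) {
		fmt.Printf("%q\n", s)
	}
}
```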

Collaborator:


What do you think of this https://github.com/trufflesecurity/trufflehog/compare/utf16-decoder-alt ?

It includes all valid BE and LE UTF-16, since both can exist together (like in the DLL). I could very well be wrong, but I think we only need to worry about BE vs. LE. The chunker will always cut on an even byte, so we shouldn't hit an off-by-one situation.
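A minimal sketch of the append-both-orders idea (simplified: this version keeps only printable-ASCII code units as a stand-in for the real decoder's validity rules, and the function names are hypothetical):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// extractASCIIUTF16 appends to buf every UTF-16 code unit of b (read in
// the given byte order) that falls in the printable ASCII range.
func extractASCIIUTF16(b []byte, order binary.ByteOrder, buf *bytes.Buffer) {
	for i := 0; i+1 < len(b); i += 2 {
		if u := order.Uint16(b[i : i+2]); u >= 0x20 && u < 0x7f {
			buf.WriteByte(byte(u))
		}
	}
}

// decodeBoth runs both byte orders over the blob and appends the results,
// since BE and LE strings can coexist in one file (as in the DLL case).
func decodeBoth(b []byte) []byte {
	var bufLE, bufBE bytes.Buffer
	extractASCIIUTF16(b, binary.LittleEndian, &bufLE)
	extractASCIIUTF16(b, binary.BigEndian, &bufBE)
	return append(bufLE.Bytes(), bufBE.Bytes()...)
}

func main() {
	// "hi" in UTF-16LE followed by "ok" in UTF-16BE.
	b := []byte{0x68, 0x00, 0x69, 0x00, 0x00, 0x6F, 0x00, 0x6B}
	fmt.Printf("%s\n", decodeBoth(b)) // hiok
}
```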

Collaborator (Author):


Yeah, I think that makes sense to do. I was leaning that way myself but wasn't sure about appending the two decoded strings together... it should be fine, right?

Collaborator:


The main issue I see with it is that line numbers would be wrong, but that's out the window anyway when we're decoding partial files.

Comment on lines 37 to 40
```go
// Guard against index out of bounds for the next check.
if i+1 >= len(b)-1 {
	continue
}
```
Collaborator (Author):


This can be removed since it's at the end of the for loop body.

```diff
	}
}

-	return buf.Bytes(), nil
+	return append(bufLE.Bytes(), bufBE.Bytes()...), nil
}

func guessUTF16Endianness(b []byte) (binary.ByteOrder, error) {
```
Collaborator (Author):


Echoing the linter here: guessUTF16Endianness is unused and can be deleted.


@mcastorina mcastorina left a comment


Since I'm the PR author I can't approve it, but it looks good to me, aside from the failing performance test 😕

@bill-rich (Collaborator)

I'm going to see what I can do about the performance.

@bill-rich (Collaborator)

The slowdown seems to be due to the extra chunk data going through the detectors. We could speed it up if we only allowed ASCII, but I'm not sure that's a safe assumption. If we want to cover all cases, I think we have to accept a bit of a slowdown.
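One way the ASCII-only fast path could look (hypothetical sketch; not the check the PR adopted): ASCII text encoded as UTF-16 has a zero high byte in every code unit, so a single cheap pass can rule chunks in or out before any decoding happens.

```go
package main

import "fmt"

// looksLikeASCIIUTF16 reports whether b matches the cheap ASCII-only
// UTF-16 pattern: the high byte of every code unit is zero, on either
// the even side (BE) or the odd side (LE).
func looksLikeASCIIUTF16(b []byte) bool {
	if len(b) < 2 || len(b)%2 != 0 {
		return false
	}
	zerosEven, zerosOdd := true, true
	for i := 0; i+1 < len(b); i += 2 {
		if b[i] != 0 {
			zerosEven = false
		}
		if b[i+1] != 0 {
			zerosOdd = false
		}
	}
	return zerosEven || zerosOdd
}

func main() {
	fmt.Println(looksLikeASCIIUTF16([]byte{0x68, 0x00, 0x69, 0x00})) // UTF-16LE "hi": true
	fmt.Println(looksLikeASCIIUTF16([]byte{0x68, 0x69}))             // plain bytes: false
}
```

Note the trade-off raised above still applies: this rejects any chunk containing non-ASCII UTF-16 text, which may not be a safe assumption.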


@ahrav ahrav left a comment


Thank you both for cleaning up my mess 🙇

@bill-rich bill-rich merged commit fb76eaf into main Jun 14, 2023
@bill-rich bill-rich deleted the utf16-decoder branch June 14, 2023 00:00