transport: partial/interrupted Read()'s cause incorrect packet length decoding #34
Comments
I don't understand the issue you are having. What happens when a packet is received that is larger than 1024*10? Also, what are your strongswan and vici package versions?
My strongswan version is 5.8.2 and the vici package version is v0.2.2. The goroutine dump I posted begins with "goroutine 44 [running]:" (the rest of the trace is truncated here). Usually, when this happens, there is an error log: "vici header read error: Connection reset by peer".
I still don't understand the issue exactly. Could you please do the following:
I understand that your modified code is printing that function stack, but I don't understand why you're modifying the code to begin with.
I have upgraded to v0.5.0 now. My call is:

ms, err := s.StreamedCommandRequest("initiate", "control-log", m)

When I call StreamedCommandRequest to do "initiate", or to load other config, the function sometimes allocates a very large amount of memory (about 110 MB, sometimes more). I debugged StreamedCommandRequest and found that it calls "func (t *transport) recv() (*packet, error)". In recv, at "pl := binary.BigEndian.Uint32(buf)", pl is a very big number (118186863). I think a number that big means the message is abnormal, so I suggest checking the validity of the buffer length pl and returning an error if it is too large. In fact I don't know what exactly happens to vici; charon is alive and I just call s.StreamedCommandRequest as usual. Most of the time everything goes well, and I don't know when or how this happens.
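A minimal sketch of the kind of length check being suggested here. The maxPacketLength constant and the function around it are illustrative assumptions, not govici's actual transport code:

package vicisketch

import "fmt"

// maxPacketLength is an assumed sanity limit; charon's own limit
// (VICI_MESSAGE_SIZE_MAX) is fixed at compile time, commonly 512 KiB.
const maxPacketLength = 512 * 1024

// allocPayload validates a decoded packet length before allocating,
// so a bogus header cannot trigger a 100+ MB allocation.
func allocPayload(pl uint32) ([]byte, error) {
	if pl > maxPacketLength {
		return nil, fmt.Errorf("implausible vici packet length %d", pl)
	}
	return make([]byte, pl), nil
}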
Okay, I'm understanding more now, thanks. To clarify: are there any indicators of the issue besides the large memory allocation? I will try to reproduce this soon.
When pl is very big, StreamedCommandRequest does not return an error, and there is no corruption either. Yesterday I found another problem that may be related to this issue. This is my code:

func CheckStatus() int {
	sess, err := vici.NewSession()
	if err != nil {
		return 0
	}

	ikeUpdown := "ike-updown"
	childUpdown := "child-updown"
	if err := sess.Subscribe(ikeUpdown, childUpdown); err != nil {
		sess.Close()
		return 0
	}

	for {
		_, err = sess.NextEvent(context.Background())
		if err != nil {
			sess.Close() // <-- it hangs here
			return 0
		} else {
			// do something
		}
	}
}

I find that sometimes the code still hangs in sess.Close() (I have upgraded to v0.5.0). I debugged it and found it hangs in "func (el *eventListener) unregisterEvents(events []string, all bool) error": that function cannot acquire the lock at "el.emux.Lock()". To get more information, I added debugging to "func (el *eventListener) listen()" and found that the lock is held here:

func (el *eventListener) listen() {
	// ...
	for {
		select {
		// ...
			// point in sending errors on the event channel. The error
			// must be for a event registration.
			el.emux.Lock()
			cnt := len(el.events)
			if len(el.events) > 0 {
				el.ec <- Event{err: err} // <-- the code hangs here
			}
			el.emux.Unlock()
			if cnt > 0 {
				fmt.Fprintf(os.Stderr, "emux.Unlock \n")
			}
		// ...
		}
	}
}

What surprises me is that right before the code hangs, a very big packet is received.
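For illustration only, here is a distilled, self-contained version of the deadlock pattern described above (this is not govici's code): one goroutine holds a mutex while blocking on an unbuffered channel send that has no receiver, so a second goroutine, analogous to unregisterEvents() called from Close(), can never acquire the mutex.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	events := make(chan string) // unbuffered, like an event channel nobody is reading

	// Analogous to listen(): send an error event while holding the lock.
	go func() {
		mu.Lock()
		events <- "error event" // blocks forever: there is no receiver
		mu.Unlock()
	}()

	time.Sleep(100 * time.Millisecond)

	// Analogous to Close() -> unregisterEvents(): try to take the same lock.
	done := make(chan struct{})
	go func() {
		mu.Lock()
		mu.Unlock()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("acquired the lock")
	case <-time.After(time.Second):
		fmt.Println("still blocked: the lock is held by the goroutine stuck on the channel send")
	}
}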
Any updates on this issue? I see the same issue in my setup too. I am running strongswan 5.9.0 on some ARM nodes and use govici to talk to charon and set up tunnels. I subscribe to the child-rekey and updown events. My test involves constant tunnel setup and teardown every minute, and somewhere in this loop govici tries to allocate a buf of 700MB and my process runs out of memory. I tracked it down to the same line as reported in this bug. For now I have added checks to limit pl to 500KB; any insights as to how to debug this more or why this happens will be appreciated. My client only subscribes to the events and calls NextEvent(). There is no call to StreamedCommandRequest().
@bhatnitish thanks for the additional information. I believe the solution to this is to just read the packet in chunks if the packet length is larger than some limit. That should keep the buffer allocation reasonable. I can post a patch within the next week or so. I am curious, however, what exactly is causing a packet of 700MB to be generated from charon.
I may have spoken too soon; parsing all of that data into a single message would still be a problem. @bhatnitish, are you able to verify that charon is actually sending you a packet of 700MB? When you limit pl to 500KB, I assume that means you get incomplete messages, correct? IIRC, charon is supposed to limit packet lengths to VICI_MESSAGE_SIZE_MAX (set at compile time, defaults to 512 * 1024). However, looking at this function, it does not appear that the length is checked on write; it does appear to be checked on read.
I have added some logs for when I see messages with abnormal header lengths, and I also log the first 500KB of data. This issue is hard to reproduce; I usually see it after 7-8 hours of continuous event activity. Also, I don't receive some events when a rekey happens or a child goes down due to DPD. I'm not sure if this is related to this issue, i.e. some sort of malformed packets leading to events being dropped. I'll update the thread when I see the malformed header. Maybe I could log these packets in do_write whenever the length goes over the 500KB limit? Any ideas on how I could debug this further?
Understood, please post any relevant logs when the issue does pop up again.
If you are able to confirm that charon is generating these events, and govici is not receiving them for some reason, feel free to open another issue. If charon is not generating the events, that would be outside the scope of this repository.
Thanks. If you are able to re-compile the daemon for your environment, I think the best thing to do is add logging at the places events/messages are sent from the daemon. We need to check if the messages received by govici are consistent with what the daemon is actually generating. So, yes, logging large message lengths in do_write is a good idea. If things are consistent between govici and the daemon, we may need to patch charon itself. But, if things are not consistent, we need to look at govici's transport and packet parsing code.
I added some logs to govici to nail down this issue, and this is what I see:

2021-06-03 23:39:44 --> Header length: 902 buf: 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

2021-06-03 23:39:44 --> Header - [7 12 99 104] data - 696c642d7570646f776e030275700003796573010f506172656e742d32332e302e342e330308756e697175656964000437393434030776657273696f6e00013203057374617465000b45535441424c4953484544030a6c6f63616c2d686f7374000932302e302e312e3131030a6c6f63616c2d706f7274000335303003086c6f63616c2d69640008434e3d6361707269030b72656d6f74652d686f7374000832332e302e342e33030b72656d6f74652d706f72740003353030030972656d6f74652d6964000f434e3d353463313633393831313...

So it looks like the header was read fine, but no data was read (all 0s). I'll add a print to see if any error was returned here. After this, the next header read basically read the data that was supposed to be read in the previous iteration: [7 12 99 104] points to a child-updown event.
Could this mean there was a delay between when the header and the data were written to the stream by charon?
@bhatnitish Thanks for the update. Can you please post the diff showing your added log statements so that I can better understand the output?
Hi @enr0n, I was caught up with some work and just got time to get back to this. Regarding my previous comment, I added a log to dump every header (in hex), the decoded header length (int(pl) in the code), and the data read (in hex). What I noticed was that the data arrived slightly later than the 500ms read deadline. When this happens, we read the first 4 bytes of the data as the header in the next call to recv() and decode a very large message size. I guess that 500ms deadline covers reading the header, decoding it, allocating a new buf, and reading the data. For now, I have bumped the deadline up to 10s in my private repo and changed the Read() to a ReadFull(). I will run my tests and let you know if this helps. However, theoretically we could still end up with cases where the data arrives later than whatever deadline we program. Do you have suggestions on how we can handle these situations cleanly?
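As a rough sketch of the Read() vs. ReadFull() distinction discussed here (the function and names are illustrative, not govici's exact transport code):

package vicisketch

import (
	"encoding/binary"
	"io"
	"net"
)

// recvPacket reads one length-prefixed packet. io.ReadFull keeps reading
// until the requested number of bytes has arrived (or an error occurs),
// so a slow peer cannot leave the stream misaligned the way a single
// short conn.Read() can.
func recvPacket(conn net.Conn) ([]byte, error) {
	hdr := make([]byte, 4)
	if _, err := io.ReadFull(conn, hdr); err != nil {
		return nil, err
	}

	pl := binary.BigEndian.Uint32(hdr)

	payload := make([]byte, pl)
	if _, err := io.ReadFull(conn, payload); err != nil {
		// io.ErrUnexpectedEOF here means the connection closed mid-packet.
		return nil, err
	}

	return payload, nil
}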
@bhatnitish I came to the same conclusion about how the packet sizes were getting so large. It is clearly wrong for recv() not to make sure the correct number of bytes is being read. I think using io.ReadFull is a good fix for that piece. As for the read deadline, I am looking back through git history to figure out exactly why it is there. From an early look, I know it was originally added to keep the event listener loop moving. However, that code has been re-factored a couple of times, and it may be safe for the event loop to block on that recv() call now.
Currently, recv() does not check the number of bytes read into the buffer during its Read() calls. This means that in some cases a partial read may occur, which will cause the next recv() call to read the packet length incorrectly. In particular, this may be a huge number and cause a huge buffer allocation. To fix this, simply use io.ReadFull to make sure we do not get partial reads. Related: #34 Signed-off-by: Nick Rosbrook <nr@enr0n.net>
In 6dc3797, a read deadline was added so that listen() would not block forever on recv() calls when the listener was closed. This was necessary because the event listener was Close()'d by cancelling a context. If recv() was blocking, the loop would not ever notice the closed context. In the current logic, Close() will call el.conn.Close(), which will cause any blocking recv() call to return io.EOF, thus closing the event loop as expected. Therefore, the read deadline is no longer necessary. Related: #34 Signed-off-by: Nick Rosbrook <nr@enr0n.net>
@bhatnitish Have you had success with the extended deadline? If you are able, please try this branch: https://github.com/strongswan/govici/tree/fix-issue-34. If that works for you, I will do a patch release with those commits.
In 6dc3797, a read deadline was added so that listen() would not block forever on recv() calls when the listener was closed. This was necessary because the event listener was Close()'d by cancelling a context. If recv() was blocking, the loop would not ever notice the closed context. In the current logic, Close() will call el.conn.Close() which is all we really need to stop the event loop. To do this, however, just return on any error from recv(). With this, strip out the perr logic as there is no longer a need for it. And, strip out the ctx and cancel logic because as noted, closing the connection and returning on error from recv() is sufficient. Overall, the reduction of complexity should make this code easier to maintain going forward. Related: #34 Signed-off-by: Nick Rosbrook <nr@enr0n.net>
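An illustrative reduction of the pattern that commit message describes (the type and helpers here are assumptions for the sketch, not govici's actual eventListener):

package vicisketch

import (
	"encoding/binary"
	"io"
	"net"
)

// eventListener is a stand-in for the real listener type in this sketch.
type eventListener struct {
	conn net.Conn
	ec   chan []byte
}

// Close closes the underlying connection. A read blocked inside listen()
// then returns an error, which by itself is enough to stop the loop; no
// read deadline, context, or pending-error bookkeeping is needed.
func (el *eventListener) Close() error {
	return el.conn.Close()
}

func (el *eventListener) listen() {
	defer close(el.ec)
	for {
		hdr := make([]byte, 4)
		if _, err := io.ReadFull(el.conn, hdr); err != nil {
			return // io.EOF, or a "use of closed connection" error after Close()
		}
		payload := make([]byte, binary.BigEndian.Uint32(hdr))
		if _, err := io.ReadFull(el.conn, payload); err != nil {
			return
		}
		el.ec <- payload
	}
}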
The fix for this is included in the latest tag: https://github.com/strongswan/govici/releases/tag/v0.5.2.
I have another issue; I don't know how this happens. This is the code:
Best,