
transport: partial/interrupted Read()'s cause incorrect packet length decoding #34

Closed
weixiaolan2019 opened this issue Sep 21, 2020 · 18 comments
Labels: bug (Something isn't working)

@weixiaolan2019

weixiaolan2019 commented Sep 21, 2020

I have another issue. I don't know how this happens; this is the code:

func (t *transport) recv() (*packet, error) {
	buf := make([]byte, headerLength)
	t.conn.SetReadDeadline(time.Now().Add(time.Second * 20))
	_, err := t.conn.Read(buf)
	if err != nil {
		if ne, ok := err.(net.Error); ok && ne.Timeout() {
			return nil, ne
		}
		return nil, fmt.Errorf("%v: %v", errTransport, err)
	}

	pl := binary.BigEndian.Uint32(buf)

	// ===== added by netbank.cn =====
	// We need to check the validity of the decoded length, because
	// sometimes it is a very, very big number. I don't know why.
	if pl > 1024*10 {
		fmt.Fprintf(os.Stderr, "vici recv len %v\n, %s", pl, debug.Stack())
		return nil, fmt.Errorf("vici error: recv wrong data len %v", pl)
	}
	// ===== added by netbank.cn =====

	buf = make([]byte, int(pl))
	_, err = t.conn.Read(buf)
	if err != nil {
		return nil, fmt.Errorf("%v: %v", errTransport, err)
	}

	p := &packet{}
	err = p.parse(buf)
	if err != nil {
		return nil, err
	}

	return p, nil
}

Best,

@enr0n
Collaborator

enr0n commented Sep 21, 2020

I don't understand the issue you are having.

What happens when a packet is received that is larger than 1024*10?

Also, what are your strongswan and vici package versions?

@weixiaolan2019
Author

My strongswan version is 5.8.2, and the vici version is v0.2.2.
In fact, I don't know what happens when I receive such a large packet; I just load the config into charon. This is the function stack:

goroutine 44 [running]:
runtime/debug.Stack(0x2bbe1f0, 0x2a10670, 0x4)
/home/autobuild/go1.14.3/go/src/runtime/debug/stack.go:24 +0x78
owrt/vendor/github.com/govici/vici.(*transport).recv(0x2bbe1f8, 0x28c1c80, 0x0, 0x0)
/home/autobuild/project/cpe/linksdwan/workspace/src/owrt/vendor/github.com/govici/vici/transport.go:97 +0x268
owrt/vendor/github.com/govici/vici.(*Session).cmdTransportCommunicate(0x29022a0, 0x28c1c80, 0x0, 0x5cca70, 0x2897e90)
/home/autobuild/project/cpe/linksdwan/workspace/src/owrt/vendor/github.com/govici/vici/requests.go:132 +0x48
owrt/vendor/github.com/govici/vici.(*Session).streamEventRegisterUnregister(0x29022a0, 0x653bfc, 0xb, 0x2a02f00, 0x1, 0x2bbecb8)
/home/autobuild/project/cpe/linksdwan/workspace/src/owrt/vendor/github.com/govici/vici/requests.go:107 +0x6c
owrt/vendor/github.com/govici/vici.(*Session).handleStreamedRequest(0x29022a0, 0x651515, 0x8, 0x653bfc, 0xb, 0x2906530, 0x0, 0x7468d0, 0x2a02fa8)
/home/autobuild/project/cpe/linksdwan/workspace/src/owrt/vendor/github.com/govici/vici/requests.go:80 +0x3f4
owrt/vendor/github.com/govici/vici.(*Session).sendStreamedRequest(0x29022a0, 0x651515, 0x8, 0x653bfc, 0xb, 0x2906530, 0x0, 0x0, 0x0)
/home/autobuild/project/cpe/linksdwan/workspace/src/owrt/vendor/github.c

Usually, when this happens, there is an error log "vici header read error: Connection reset by peer".

@enr0n
Collaborator

enr0n commented Sep 22, 2020

I still don't understand the issue exactly. Could you please do the following:

  1. If possible, upgrade the package to the latest version (v0.5.0 as of now).
  2. Post a code snippet of what you are trying to do.
  3. Explain what you expected to happen.
  4. Explain what is actually happening.

I understand that your modified code is printing that function stack, but I don't understand why you're modifying the code to begin with.

@weixiaolan2019
Author

weixiaolan2019 commented Sep 23, 2020

I have upgraded the version to v0.5.0 now.

ms, err := s.StreamedCommandRequest("initiate", "control-log", m)

When I call StreamedCommandRequest to do "initiate", or to load other config, sometimes the function allocates a very large amount of memory (about 110M, sometimes more). I debugged StreamedCommandRequest and found that it calls "func (t *transport) recv() (*packet, error)". In recv, at "pl := binary.BigEndian.Uint32(buf)", I found that pl is a very big number (118186863). If the number is that big, the message must be abnormal, so I suggest checking the validity of the decoded buf length "pl"; if pl is too large, we should return an error.

In fact, I don't know what exactly happens to vici. charon is alive, and I just call s.StreamedCommandRequest as usual. Most of the time everything goes well; I don't know when or how this happens.

@enr0n
Collaborator

enr0n commented Sep 23, 2020

Okay, I'm understanding more now. Thanks.

To clarify, StreamedCommandRequest does not return an error in this case, right? And are the messages in the MessageStream noticeably corrupt in any way?

Are there any indicators of the issue besides the vici header read error: Connection reset by peer log message, and pl being very large?

I will try to reproduce this soon.

@weixiaolan2019
Author

weixiaolan2019 commented Sep 24, 2020

When pl is very big, StreamedCommandRequest does not return an error, and there is no corruption either.

Yesterday I found another problem that may be related to this issue. This is my code:

func CheckStatus() int {
	sess, err := vici.NewSession()
	if err != nil {
		return 0
	}
	ikeUpdown := "ike-updown"
	childUpdown := "child-updown"
	if err := sess.Subscribe(ikeUpdown, childUpdown); err != nil {
		sess.Close()
		return 0
	}
	for {
		_, err = sess.NextEvent(context.Background())
		if err != nil {
			sess.Close() // <--------- It's hung here.
			return 0
		}
		// do something
	}
}

I find that sometimes the code still hangs in sess.Close() (I have upgraded to v0.5.0). I debugged it and found that it hangs in "func (el *eventListener) unregisterEvents(events []string, all bool) error"; the function cannot acquire the lock at "el.emux.Lock()". To get more info, I added debugging in "func (el *eventListener) listen()", and found that the lock is held here:

func (el *eventListener) listen() {
	// .....
	for {
		select {
		// .....
			// point in sending errors on the event channel. The error
			// must be for an event registration.
			el.emux.Lock()
			cnt := len(el.events)
			if len(el.events) > 0 {
				el.ec <- Event{err: err} // <--------- the code is hung here
			}
			el.emux.Unlock()
			if cnt > 0 {
				fmt.Fprintf(os.Stderr, "emux.Unlock \n")
			}
		// .....
		}
	}
}

What surprises me is that right before the code hangs, a very big packet is received.

@bhatnitish

bhatnitish commented May 20, 2021

Any updates on this issue? I see the same issue in my setup too. I am running strongswan 5.9.0 on some arm nodes and use govici to talk to charon and set up tunnels. I subscribe to the child-rekey and updown events. My test involves constant tunnel setup and teardown every minute, and somewhere in this loop govici tries to allocate a buf of 700MB, and my process runs out of memory. I tracked it down to the same line as reported in this bug. For now I have added checks to limit pl to 500KB; any insights as to how to debug this more, or why this happens, would be appreciated.

My client only subscribes to the events and calls NextEvent(). There's no call to StreamedCommandRequest().

@enr0n
Collaborator

enr0n commented May 20, 2021

@bhatnitish thanks for the additional information.

I believe the solution to this is to just read the packet in chunks if the packet length is larger than some limit. That should keep the buffer allocation reasonable. I can post a patch within the next week or so.

I am curious, however, what exactly is causing a packet of 700MB to be generated from charon.

@enr0n enr0n self-assigned this May 20, 2021
@enr0n enr0n added the bug Something isn't working label May 20, 2021
@enr0n
Collaborator

enr0n commented May 20, 2021

I may have spoken too soon; parsing all of that data into a single packet struct will still cause allocation issues.

@bhatnitish are you able to verify that charon is actually sending you a packet of 700MB? When you limit pl to 500KB, I assume that means you get incomplete messages, correct?

IIRC, charon is supposed to limit packet lengths to VICI_MESSAGE_SIZE_MAX (set at compile time, defaults to 512 * 1024). However, looking at this function, it does not appear that the length is checked on write. It is checked in do_read, however.

@bhatnitish
Copy link

I have added some logs for when I see messages with abnormal header lengths, and I also log the first 500KB of data. This issue is hard to reproduce; I usually see it after 7-8 hours of continuous events activity. Also, I don't receive some events when a rekey happens or a child goes down due to DPD. I'm not sure if this is related to this issue, perhaps some sort of malformed packets leading to events being dropped. I'll update the thread if I see the malformed header. Maybe I could log these packets in do_write whenever the lengths go over the 500KB limit? Any ideas how I could debug this further?

@enr0n
Collaborator

enr0n commented May 21, 2021

I have added some logs when I see messages with abnormal header lengths, I also log the first 500KB of data. This issue is hard to reproduce, I usually see it after 7-8 hours of continuous events activity.

Understood, please post any relevant logs when the issue does pop up again.

Also, I don't receive some events when rekey happens or a child goes down due to DPD. I'm not sure if this is related to this issue - some sort of malformed packets leading to events being dropped.

If you are able to confirm that charon is generating these events, and govici is not receiving them for some reason, feel free to open another issue. If charon is not generating the events, that would be outside the scope of this repository.

I'll update the thread if I see the malformed header. Maybe I could log these packets in do_write whenever the lengths go more than the 500KB limit? Any ideas how I could debug this further?

Thanks. If you are able to re-compile the daemon for your environment, I think the best thing to do is add logging at the places events/messages are sent from the daemon. We need to check if the messages received by govici are consistent with what the daemon is actually generating. So, yes, logging large message lengths in do_write is a good idea.

If things are consistent between govici and the daemon, we may need to patch charon itself. But, if things are not consistent, we need to look at govici's transport and packet parsing code.

@bhatnitish

bhatnitish commented Jun 4, 2021

I added some logs to govici to nail this issue and this is what I see -

2021-06-03 23:39:44 --> Header length: 902 buf: 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

2021-06-03 23:39:44 --> Header - [7 12 99 104] data - 696c642d7570646f776e030275700003796573010f506172656e742d32332e302e342e330308756e697175656964000437393434030776657273696f6e00013203057374617465000b45535441424c4953484544030a6c6f63616c2d686f7374000932302e302e312e3131030a6c6f63616c2d706f7274000335303003086c6f63616c2d69640008434e3d6361707269030b72656d6f74652d686f7374000832332e302e342e33030b72656d6f74652d706f72740003353030030972656d6f74652d6964000f434e3d353463313633393831313...

So it looks like the header was read fine, but no data was read (all 0s). I'll add a print to see if any error was returned here. After this, the next header read basically read the data that was supposed to be read in the previous iteration: [7 12 99 104] points into a child-updown event.

@bhatnitish

Could this mean there was a delay between when the header and the data were written to the stream by charon?

@enr0n
Collaborator

enr0n commented Jun 4, 2021

@bhatnitish Thanks for the update. Can you please post the diff showing your added log statements so that I can better understand the output?

@bhatnitish

bhatnitish commented Jun 10, 2021

Hi @enr0n, I was caught up with some work and just got time to get back to this. Regarding my previous comment, I added a log to dump every header (in hex), the decoded header length (int(pl) in the code), and the data read (in hex). What I noticed was that the data arrived slightly later than the 500ms read deadline. When this happens, the next call to recv() reads the first 4 bytes of the data as the header and decodes a very large message size. I guess that 500ms deadline covers reading the header, decoding it, allocating a new buf, and reading the data.

For now, I have bumped the deadline up to 10s in my private repo and changed the Read() to a ReadFull(). I will run my tests and let you know if this helps. However, theoretically we could still end up with cases where the data arrives later than the deadline we program. Do you have suggestions on how we can handle these situations cleanly?

@enr0n
Collaborator

enr0n commented Jun 10, 2021

@bhatnitish I came to the same conclusion about how the packet sizes were getting so large. It is clearly wrong for recv() to not make sure the correct number of bytes are being read. I think using io.ReadFull is a good fix for that piece.

As far as the read deadline, I am looking back through git history to figure out exactly why it is there. From an early look, I know it was originally added to keep the event listener loop moving. However, that code has been re-factored a couple times, and it may be safe for the event loop to block on that recv() call now.

@enr0n enr0n changed the title We need check the buf length in function recv transport: partial/interrupted Read()'s cause incorrect packet length decoding Jun 10, 2021
enr0n added a commit that referenced this issue Jun 10, 2021
Currently, recv() does not check the number of bytes read into the
buffer during its Read() calls. This means that in some cases a partial
read may occur, which will cause the next recv() call to read the packet
length incorrectly. In particular, this may be a huge number and cause a
huge buffer allocation. To fix this, simply use io.ReadFull to make sure
we do not get partial reads.

Related: #34

Signed-off-by: Nick Rosbrook <nr@enr0n.net>
enr0n added a commit that referenced this issue Jun 10, 2021
In 6dc3797, a read deadline was added
so that listen() would not block forever on recv() calls when the
listener was closed. This was necessary because the event listener was
Close()'d by cancelling a context. If recv() was blocking, the loop
would not ever notice the closed context.

In the current logic, Close() will call el.conn.Close(), which will
cause any blocking recv() call to return io.EOF, thus closing the event
loop as expected. Therefore, the read deadline is no longer necessary.

Related: #34

Signed-off-by: Nick Rosbrook <nr@enr0n.net>
@enr0n
Collaborator

enr0n commented Jun 29, 2021

@bhatnitish Have you had success with the extended deadline? If you are able, please try this branch: https://github.com/strongswan/govici/tree/fix-issue-34. If that works for you I will do a patch release with those commits.

enr0n added a commit that referenced this issue Aug 22, 2021
In 6dc3797, a read deadline was added
so that listen() would not block forever on recv() calls when the
listener was closed. This was necessary because the event listener was
Close()'d by cancelling a context. If recv() was blocking, the loop
would not ever notice the closed context.

In the current logic, Close() will call el.conn.Close() which is all we
really need to stop the event loop. To do this, however, just return on
any error from recv(). With this, strip out the perr logic as there is
no longer a need for it. And, strip out the ctx and cancel logic because
as noted, closing the connection and returning on error from recv() is
sufficient.

Overall, the reduction of complexity should make this code easier to
maintain going forward.

Related: #34

Signed-off-by: Nick Rosbrook <nr@enr0n.net>
@enr0n
Collaborator

enr0n commented Aug 25, 2021

The fix for this is included in the latest tag: https://github.com/strongswan/govici/releases/tag/v0.5.2.

@enr0n enr0n closed this as completed Aug 25, 2021