transport: partial/interrupted Read()'s cause incorrect packet length decoding #34
Comments
I don't understand the issue you are having. What happens when a packet is received that is larger than 1024*10? Also, what are your strongswan and vici package versions?
My strongswan version is 5.8.2 and the vici package version is v0.2.2. The goroutine dump I posted begins with "goroutine 44 [running]:" (the rest of the trace is truncated here). Usually, when this happens, there is an error log: "vici header read error: Connection reset by peer".
I still don't understand the issue exactly. Could you please do the following:
I understand that your modified code is printing that function stack, but I don't understand why you're modifying the code to begin with.
I have upgraded to v0.5.0 now. My call is:

ms, err := s.StreamedCommandRequest("initiate", "control-log", m)

When I call StreamedCommandRequest to do "initiate", or to load other config, the function sometimes allocates a very large amount of memory (about 110 MB, sometimes more). I debugged StreamedCommandRequest and found that it calls "func (t *transport) recv() (*packet, error)". In recv, at "pl := binary.BigEndian.Uint32(buf)", pl is a very big number (118186863). I think a number that big means the message is abnormal, so I suggest checking the validity of the buffer length pl and returning an error if it is too large. In fact I don't know what exactly happens to vici; charon is alive and I just call s.StreamedCommandRequest as usual. Most of the time everything goes well, and I don't know when or how this happens.
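A minimal sketch of the kind of length check being suggested here. The maxPacketLength constant and the function around it are illustrative assumptions, not govici's actual transport code:

package vicisketch

import "fmt"

// maxPacketLength is an assumed sanity limit; charon's own limit
// (VICI_MESSAGE_SIZE_MAX) is fixed at compile time, commonly 512 KiB.
const maxPacketLength = 512 * 1024

// allocPayload validates a decoded packet length before allocating,
// so a bogus header cannot trigger a 100+ MB allocation.
func allocPayload(pl uint32) ([]byte, error) {
	if pl > maxPacketLength {
		return nil, fmt.Errorf("implausible vici packet length %d", pl)
	}
	return make([]byte, pl), nil
}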
Okay, I'm understanding more now, thanks. To clarify: are there any indicators of the issue besides the large memory allocation? I will try to reproduce this soon.
When pl is very big, StreamedCommandRequest does not return an error, and there is no corruption either. Yesterday I found another problem that may be related to this issue. This is my code:

func CheckStatus() int {
	sess, err := vici.NewSession()
	if err != nil {
		return 0
	}

	ikeUpdown := "ike-updown"
	childUpdown := "child-updown"
	if err := sess.Subscribe(ikeUpdown, childUpdown); err != nil {
		sess.Close()
		return 0
	}

	for {
		_, err = sess.NextEvent(context.Background())
		if err != nil {
			sess.Close() // <-- it hangs here
			return 0
		} else {
			// do something
		}
	}
}

I find that sometimes the code still hangs in sess.Close() (I have upgraded to v0.5.0). I debugged it and found it hangs in "func (el *eventListener) unregisterEvents(events []string, all bool) error": that function cannot acquire the lock at "el.emux.Lock()". To get more information, I added debugging to "func (el *eventListener) listen()" and found that the lock is held here:

func (el *eventListener) listen() {
	// ...
	for {
		select {
		// ...
			// point in sending errors on the event channel. The error
			// must be for a event registration.
			el.emux.Lock()
			cnt := len(el.events)
			if len(el.events) > 0 {
				el.ec <- Event{err: err} // <-- the code hangs here
			}
			el.emux.Unlock()
			if cnt > 0 {
				fmt.Fprintf(os.Stderr, "emux.Unlock \n")
			}
		// ...
		}
	}
}

What surprises me is that right before the code hangs, a very big packet is received.
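For illustration only, here is a distilled, self-contained version of the deadlock pattern described above (this is not govici's code): one goroutine holds a mutex while blocking on an unbuffered channel send that has no receiver, so a second goroutine, analogous to unregisterEvents() called from Close(), can never acquire the mutex.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	events := make(chan string) // unbuffered, like an event channel nobody is reading

	// Analogous to listen(): send an error event while holding the lock.
	go func() {
		mu.Lock()
		events <- "error event" // blocks forever: there is no receiver
		mu.Unlock()
	}()

	time.Sleep(100 * time.Millisecond)

	// Analogous to Close() -> unregisterEvents(): try to take the same lock.
	done := make(chan struct{})
	go func() {
		mu.Lock()
		mu.Unlock()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("acquired the lock")
	case <-time.After(time.Second):
		fmt.Println("still blocked: the lock is held by the goroutine stuck on the channel send")
	}
}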
Any updates on this issue? I see the same issue in my setup too. I am running strongswan 5.9.0 on some ARM nodes and use govici to talk to charon and set up tunnels. I subscribe to the child-rekey and updown events. My test involves constant tunnel setup and teardown every minute, and somewhere in this loop govici tries to allocate a buf of 700MB and my process runs out of memory. I tracked it down to the same line as reported in this bug. For now I have added checks to limit pl to 500KB; any insights as to how to debug this more or why this happens will be appreciated. My client only subscribes to the events and calls NextEvent(). There is no call to StreamedCommandRequest().
@bhatnitish thanks for the additional information. I believe the solution to this is to just read the packet in chunks if the packet length is larger than some limit. That should keep the buffer allocation reasonable. I can post a patch within the next week or so. I am curious, however, what exactly is causing a packet of 700MB to be generated from charon.
I may have spoken too soon; parsing all of that data into a single message would still be a problem. @bhatnitish, are you able to verify that charon is actually sending you a packet of 700MB? When you limit pl to 500KB, I assume that means you get incomplete messages, correct? IIRC, charon is supposed to limit packet lengths to VICI_MESSAGE_SIZE_MAX (set at compile time, defaults to 512 * 1024). However, looking at this function, it does not appear that the length is checked on write; it does appear to be checked on read.
I have added some logs for when I see messages with abnormal header lengths, and I also log the first 500KB of data. This issue is hard to reproduce; I usually see it after 7-8 hours of continuous event activity. Also, I don't receive some events when a rekey happens or a child goes down due to DPD. I'm not sure if this is related to this issue, i.e. some sort of malformed packets leading to events being dropped. I'll update the thread when I see the malformed header. Maybe I could log these packets in do_write whenever the length goes over the 500KB limit? Any ideas on how I could debug this further?
Understood, please post any relevant logs when the issue does pop up again.
If you are able to confirm that charon is generating these events, and govici is not receiving them for some reason, feel free to open another issue. If charon is not generating the events, that would be outside the scope of this repository.
Thanks. If you are able to re-compile the daemon for your environment, I think the best thing to do is add logging at the places events/messages are sent from the daemon. We need to check if the messages received by govici are consistent with what the daemon is actually generating. So, yes, logging large message lengths in do_write is a good idea. If things are consistent between govici and the daemon, we may need to patch charon itself. But, if things are not consistent, we need to look at govici's transport and packet parsing code.
I added some logs to govici to nail down this issue, and this is what I see:

2021-06-03 23:39:44 --> Header length: 902 buf: 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

2021-06-03 23:39:44 --> Header - [7 12 99 104] data - 696c642d7570646f776e030275700003796573010f506172656e742d32332e302e342e330308756e697175656964000437393434030776657273696f6e00013203057374617465000b45535441424c4953484544030a6c6f63616c2d686f7374000932302e302e312e3131030a6c6f63616c2d706f7274000335303003086c6f63616c2d69640008434e3d6361707269030b72656d6f74652d686f7374000832332e302e342e33030b72656d6f74652d706f72740003353030030972656d6f74652d6964000f434e3d353463313633393831313...

So it looks like the header was read fine, but no data was read (all 0s). I'll add a print to see if any error was returned here. After this, the next header read basically read the data that was supposed to be read in the previous iteration: [7 12 99 104] points to a child-updown event.
Could this mean there was a delay between when the header and the data were written to the stream by charon?
@bhatnitish Thanks for the update. Can you please post the diff showing your added log statements so that I can better understand the output?
Hi @enr0n, I was caught up with some work and just got time to get back to this. Regarding my previous comment, I added a log to dump every header (in hex), the decoded header length (int(pl) in the code), and the data read (in hex). What I noticed was that the data arrived slightly later than the 500ms read deadline. When this happens, we read the first 4 bytes of the data as the header in the next call to recv() and decode a very large message size. I guess that 500ms deadline covers reading the header, decoding it, allocating a new buf, and reading the data. For now, I have bumped the deadline up to 10s in my private repo and changed the Read() to a ReadFull(). I will run my tests and let you know if this helps. However, theoretically we could still end up with cases where the data arrives later than whatever deadline we program. Do you have suggestions on how we can handle these situations cleanly?
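As a rough sketch of the Read() vs. ReadFull() distinction discussed here (the function and names are illustrative, not govici's exact transport code):

package vicisketch

import (
	"encoding/binary"
	"io"
	"net"
)

// recvPacket reads one length-prefixed packet. io.ReadFull keeps reading
// until the requested number of bytes has arrived (or an error occurs),
// so a slow peer cannot leave the stream misaligned the way a single
// short conn.Read() can.
func recvPacket(conn net.Conn) ([]byte, error) {
	hdr := make([]byte, 4)
	if _, err := io.ReadFull(conn, hdr); err != nil {
		return nil, err
	}

	pl := binary.BigEndian.Uint32(hdr)

	payload := make([]byte, pl)
	if _, err := io.ReadFull(conn, payload); err != nil {
		// io.ErrUnexpectedEOF here means the connection closed mid-packet.
		return nil, err
	}

	return payload, nil
}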
@bhatnitish I came to the same conclusion about how the packet sizes were getting so large. It is clearly wrong for recv() not to make sure the correct number of bytes is being read. I think using io.ReadFull is a good fix for that piece. As for the read deadline, I am looking back through git history to figure out exactly why it is there. From an early look, I know it was originally added to keep the event listener loop moving. However, that code has been re-factored a couple of times, and it may be safe for the event loop to block on that recv() call now.
Currently, recv() does not check the number of bytes read into the buffer during its Read() calls. This means that in some cases a partial read may occur, which will cause the next recv() call to read the packet length incorrectly. In particular, this may be a huge number and cause a huge buffer allocation. To fix this, simply use io.ReadFull to make sure we do not get partial reads. Related: #34 Signed-off-by: Nick Rosbrook <nr@enr0n.net>
In 6dc3797, a read deadline was added so that listen() would not block forever on recv() calls when the listener was closed. This was necessary because the event listener was Close()'d by cancelling a context. If recv() was blocking, the loop would not ever notice the closed context. In the current logic, Close() will call el.conn.Close(), which will cause any blocking recv() call to return io.EOF, thus closing the event loop as expected. Therefore, the read deadline is no longer necessary. Related: #34 Signed-off-by: Nick Rosbrook <nr@enr0n.net>
@bhatnitish Have you had success with the extended deadline? If you are able, please try this branch: https://github.com/strongswan/govici/tree/fix-issue-34. If that works for you, I will do a patch release with those commits.
In 6dc3797, a read deadline was added so that listen() would not block forever on recv() calls when the listener was closed. This was necessary because the event listener was Close()'d by cancelling a context. If recv() was blocking, the loop would not ever notice the closed context. In the current logic, Close() will call el.conn.Close() which is all we really need to stop the event loop. To do this, however, just return on any error from recv(). With this, strip out the perr logic as there is no longer a need for it. And, strip out the ctx and cancel logic because as noted, closing the connection and returning on error from recv() is sufficient. Overall, the reduction of complexity should make this code easier to maintain going forward. Related: #34 Signed-off-by: Nick Rosbrook <nr@enr0n.net>
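An illustrative reduction of the pattern that commit message describes (the type and helpers here are assumptions for the sketch, not govici's actual eventListener):

package vicisketch

import (
	"encoding/binary"
	"io"
	"net"
)

// eventListener is a stand-in for the real listener type in this sketch.
type eventListener struct {
	conn net.Conn
	ec   chan []byte
}

// Close closes the underlying connection. A read blocked inside listen()
// then returns an error, which by itself is enough to stop the loop; no
// read deadline, context, or pending-error bookkeeping is needed.
func (el *eventListener) Close() error {
	return el.conn.Close()
}

func (el *eventListener) listen() {
	defer close(el.ec)
	for {
		hdr := make([]byte, 4)
		if _, err := io.ReadFull(el.conn, hdr); err != nil {
			return // io.EOF, or a "use of closed connection" error after Close()
		}
		payload := make([]byte, binary.BigEndian.Uint32(hdr))
		if _, err := io.ReadFull(el.conn, payload); err != nil {
			return
		}
		el.ec <- payload
	}
}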
The fix for this is included in the latest tag: https://github.com/strongswan/govici/releases/tag/v0.5.2.
I have another issue; I don't know how this happens. This is the code:
Best,