
writeQueue full regression #4179

Open
krusadellc opened this issue Jan 19, 2025 · 9 comments
Labels
bug, general

Comments

@krusadellc

Which version are you using?

1.11.1

Which operating system are you using?

Linux amd64 standard

Describe how to replicate the issue

  1. Publish an RTSP stream from a video file using FFmpeg (example commands below)
  2. Connect multiple read clients to this stream
  3. Keep it running for a few hours
  4. The server starts to log writeQueue full messages
  5. The clients pause indefinitely, with no data available
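For reference, commands along these lines reproduce the setup (the file name and stream path are placeholders, and mediamtx is assumed to be on its default RTSP port 8554):

Publisher, looping a local file:

ffmpeg -re -stream_loop -1 -i input.mp4 -c copy -f rtsp rtsp://localhost:8554/mystream

Reader (one per client):

ffmpeg -rtsp_transport tcp -i rtsp://localhost:8554/mystream -c copy -f null -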

We are using mediamtx in one of our products that does RTSP streaming.

With around 16 clients connected, streaming worked fine for days. We recently updated mediamtx to the latest version, and we now see writeQueue full errors after about 4 hours.

We went back to the previous version from December and the errors disappeared.
So it seems a regression introduced between the December and January releases is causing this issue.

We went through the other issues related to writeQueue full and tried workarounds like increasing the buffer size, but that did not seem to help.

Any help would be appreciated. Thank you!

Server logs

No response

Network dump

No response

@krusadellc
Author

We are in the process of binary searching for the change that caused this issue in our application.
We have rolled back to one commit before gortsplib was updated (#4097).

We will keep the application running for a couple of hours to see if that helps.

We will then roll forward to the latest version, where gortsplib was updated again (#4181).

Please let us know if you think a known bug in gortsplib might be causing this issue, so we can just deploy the latest version with the fix.

@krusadellc
Author

Commit 57addb1#diff-33ef32bf6c23acb95f5902d7097b7a1d5128ca061167ec0716715b0b9eeaa5f6 updated gortsplib from v4.11.2 to v4.12.1.

The commit before this change works fine.

I tried the latest code and it fails the same way: writeQueue full messages after about 4-5 hours.

Can you please help with this issue? @aler9

@aler9
Member

aler9 commented Jan 24, 2025

@krusadellc thanks for the feedback, but of course I cannot roll back all the changes made inside gortsplib that allowed implementing the new statistics system.
We have to find the root cause and develop a patch that does not undo the new features.

aler9 added the bug and general labels on Jan 24, 2025
@krusadellc
Author

I do not expect a rollback :)
We definitely have to find the root cause, but it's tricky to reproduce, since it only happens after 4-5 hours of thousands of clients connecting and disconnecting every few seconds.

Please let me know if there's anything I can do to help root cause it.

@aler9
Member

aler9 commented Jan 24, 2025

Does the freeze involve a single client at a time, or all the clients together?
I'm asking because if the freeze involves a single client, then it's a deadlock inside that client's routine; if it involves all the clients, then it's a deadlock inside the routine in charge of distributing packets to all connected clients.

Furthermore, version v1.11.2 came out today with an additional safety check on the function in charge of sending packets to clients. You can test it and check whether the bug persists.

@krusadellc
Author

The freeze is not limited to a single client; the whole mediamtx process freezes.

Even the API route on :9997 is inaccessible. It doesn't return a 404, which means the server is still running; it just never sends a response (the browser keeps spinning).
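For example, even a request as simple as the following would be expected to hang (assuming the default v3 API prefix):

curl http://localhost:9997/v3/paths/list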

Connecting a new RTSP client shows the same behavior: the ffmpeg playback command just hangs forever without receiving any data.

I already tried commit 0a76806, but it was still failing the same way, so I don't think v1.11.2 would be any different.

@aler9
Member

aler9 commented Jan 25, 2025

If you want to debug further, in cases like this you can use pprof, a feature that lets you retrieve the list of active goroutines (it also covers heap and memory profiles, but that is unrelated here).

You can enable it by setting pprof: true in the configuration file.

When the issue occurs, download the list of all active goroutines by using:

go tool pprof -text http://localhost:9999/debug/pprof/goroutine
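
If the aggregated view is not detailed enough, the same endpoint should also be able to return the full stack trace of every goroutine, since /debug/pprof/goroutine is the standard net/http/pprof handler (the debug=2 query parameter selects the human-readable full dump):

curl -o goroutines.txt "http://localhost:9999/debug/pprof/goroutine?debug=2"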

@krusadellc
Author

I will try that out, but I am afraid the pprof endpoint may not respond once mediamtx freezes after the issue occurs.

@krusadellc
Author

Here's what pprof shows while mediamtx is in the frozen state:

Showing nodes accounting for 100714, 100% of 100721 total
Dropped 98 nodes (cum <= 503)
      flat  flat%   sum%        cum   cum%
    100714   100%   100%     100714   100%  runtime.gopark
         0     0%   100%      99121 98.41%  github.com/bluenviron/gortsplib/v4.(*ServerConn).run
         0     0%   100%       1000  0.99%  github.com/bluenviron/mediamtx/internal/metrics.(*Metrics).onMetrics
         0     0%   100%       1000  0.99%  github.com/bluenviron/mediamtx/internal/metrics.(*Metrics).onMetrics.func1
         0     0%   100%       1276  1.27%  github.com/bluenviron/mediamtx/internal/protocols/httpp.(*handlerExitOnPanic).ServeHTTP
         0     0%   100%       1276  1.27%  github.com/bluenviron/mediamtx/internal/protocols/httpp.(*handlerFilterRequests).ServeHTTP
         0     0%   100%       1276  1.27%  github.com/bluenviron/mediamtx/internal/protocols/httpp.(*handlerLogger).ServeHTTP
         0     0%   100%       1276  1.27%  github.com/bluenviron/mediamtx/internal/protocols/httpp.(*handlerServerHeader).ServeHTTP
         0     0%   100%       1000  0.99%  github.com/bluenviron/mediamtx/internal/servers/rtsp.(*Server).APIConnsList
         0     0%   100%      99062 98.35%  github.com/bluenviron/mediamtx/internal/servers/rtsp.(*Server).OnConnOpen
         0     0%   100%       1276  1.27%  github.com/gin-gonic/gin.(*Context).Next (inline)
         0     0%   100%       1276  1.27%  github.com/gin-gonic/gin.(*Engine).ServeHTTP
         0     0%   100%       1276  1.27%  github.com/gin-gonic/gin.(*Engine).handleHTTPRequest
         0     0%   100%       1276  1.27%  net/http.(*conn).serve
         0     0%   100%       1276  1.27%  net/http.serverHandler.ServeHTTP
         0     0%   100%     100575 99.86%  runtime.goparkunlock (inline)
         0     0%   100%     100386 99.67%  runtime.semacquire1
         0     0%   100%      99109 98.40%  sync.(*Mutex).Lock (inline)
         0     0%   100%      99109 98.40%  sync.(*Mutex).lockSlow
         0     0%   100%      99110 98.40%  sync.(*RWMutex).Lock
         0     0%   100%       1275  1.27%  sync.(*RWMutex).RLock (inline)
         0     0%   100%      99109 98.40%  sync.runtime_SemacquireMutex
         0     0%   100%       1275  1.27%  sync.runtime_SemacquireRWMutexR

pprof.mediamtx.goroutine.001.pb.gz
