Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Group::recv()->srt::CUDT::releaseSynch() caused deadlock #2138

Closed
gou4shi1 opened this issue Sep 27, 2021 · 3 comments · Fixed by #2218
Closed

[BUG] Group::recv()->srt::CUDT::releaseSynch() caused deadlock #2138

gou4shi1 opened this issue Sep 27, 2021 · 3 comments · Fixed by #2218
Labels
[core] Area: Changes in SRT library core Type: Bug Indicates an unexpected problem or unintended behavior
Milestone

Comments

@gou4shi1
Copy link
Contributor

gou4shi1 commented Sep 27, 2021

Describe the bug
There are two problems:

  1. pkt seq overflow from 2147483647 to 0 caused Group::recv() to close all members, you can found that by the SEQUENCE DISCREPANCY log below
14:18:28.781262/SRT:TsbPd D:SRT.ts: @950316545:NEXT PKT CHECK: rdy=false passack=false skipto=%-1 current=%2147483643 buf-base=%2147483646
14:18:28.811693/SRT:TsbPd D:SRT.ts: @950316545:NEXT PKT CHECK: rdy=true passack=false skipto=%-1 current=%2147483643 buf-base=%2147483646
14:18:28.885825/SRT:TsbPd D:SRT.ts: @950316545:NEXT PKT CHECK: rdy=false passack=true skipto=%-1 current=%2147483647 buf-base=%0
14:18:28.895965/SRT:TsbPd D:SRT.ts: @950316545:NEXT PKT CHECK: rdy=false passack=false skipto=%-1 current=%0 buf-base=%2
14:18:28.905664/SRT:TsbPd D:SRT.ts: @950316545:NEXT PKT CHECK: rdy=true passack=false skipto=%-1 current=%0 buf-base=%2
14:18:28.905881/SrtL:Recv:LIVE*E:SRT.gr: group/recv: @950316545: SEQUENCE DISCREPANCY: base=%2147483647 vs pkt=%0, setting ESECFAIL
14:18:28.916839/SRT:TsbPd D:SRT.ts: @950316545:NEXT PKT CHECK: rdy=true passack=false skipto=%-1 current=%1 buf-base=%3
14:18:28.927405/SRT:TsbPd D:SRT.ts: @950316545:NEXT PKT CHECK: rdy=true passack=false skipto=%-1 current=%1 buf-base=%4

Should we add seqoff() into the abs() below (in group.cpp)?

                if (m_RcvBaseSeqNo != SRT_SEQNO_NONE && abs(m_RcvBaseSeqNo - mctrl.pktseq) > CSeqNo::m_iSeqNoTH)
                {
                    // This error should be returned if the link turns out
                    // to be the only one, or set to the group data.
                    // err = SRT_ESECFAIL;
                    LOGC(grlog.Error,
                         log << "group/recv: @" << id << ": SEQUENCE DISCREPANCY: base=%" << m_RcvBaseSeqNo
                             << " vs pkt=%" << mctrl.pktseq << ", setting ESECFAIL");
                    broken.insert(ps);
                    break;
                }
  1. then deadlock
(gdb) where
#0  0x00007f15a2b9e98d in pthread_join () from target:/home/jingchi/remote_server/lib/libpthread.so.0
#1  0x00007f15a1479d63 in std::thread::join() () from target:/home/jingchi/remote_server/lib/libstdc++.so.6
#2  0x00007f15a3519c4d in srt::CUDT::releaseSynch() () from target:/home/jingchi/remote_server/lib/libsrt.so.1.4
#3  0x00007f15a3532e88 in srt::CUDT::closeInternal() () from target:/home/jingchi/remote_server/lib/libsrt.so.1.4
#4  0x00007f15a34f71ef in srt::CUDTUnited::close(srt::CUDTSocket*) () from target:/home/jingchi/remote_server/lib/libsrt.so.1.4
#5  0x00007f15a358038c in srt::CUDTGroup::recv(char*, int, SRT_MsgCtrl_&) () from target:/home/jingchi/remote_server/lib/libsrt.so.1.4
#6  0x00007f15a34ebb17 in srt::CUDT::recvmsg2(int, char*, int, SRT_MsgCtrl_&) () from target:/home/jingchi/remote_server/lib/libsrt.so.1.4
#7  0x00007f15a357135e in srt_recvmsg2 () from target:/home/jingchi/remote_server/lib/libsrt.so.1.4

To Reproduce
SRT broadcast video to server in live mode.

Expected behavior

  1. The normal pkt seq overflow from 2147483647 to 0 should not cause close().
  2. close() should not cause deadlock

Desktop (please provide the following information):

  • OS: Linux
  • SRT Version / commit ID: v1.4.4-rc.1
@gou4shi1 gou4shi1 added the Type: Bug Indicates an unexpected problem or unintended behavior label Sep 27, 2021
@maxsharabayko
Copy link
Collaborator

Should we add seqoff() into the abs() below (in group.cpp)?

I would say seqlen(..) should be used instead, but after it is adjusted to work with 64-bit integers instead of 32-bit to not return negative numbers.

CSeqNo::seqlen(2147483647, 0) == 2;
CSeqNo::seqlen(0, 2147483647) == -2147483648;

@gou4shi1
Copy link
Contributor Author

gou4shi1 commented Sep 30, 2021

        // Force closing
        {
            InvertedLock ung (m_GroupLock);
            for (set<CUDTSocket*>::iterator b = broken.begin(); b != broken.end(); ++b)
            {
                CUDT::s_UDTUnited.close(*b);
            }
        }

When Group::recv() call close(), it should not hold any lock, but from the log and backtrace, it was holding a lock needed by tsbpd(), can we just report error here, and let applications to close the broken sockets?

@maxsharabayko
Copy link
Collaborator

@gou4shi1
So far I think It would probably be better to just mark a socket as "broken", thus it would be eventually closed in a background GC thread.

PR #2142 addresses the first issue described here.
Thus the send issue (hang up) is less likely to happen and can be postponed to v1.4.5 to reduce the number of changes in SRT RC.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[core] Area: Changes in SRT library core Type: Bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants