
[core] Fixed too early closed caller socket in background. #1750

Merged: 3 commits, Jan 25, 2021

Conversation

@ethouris ethouris (Collaborator) commented Jan 19, 2021

Proper fix for #1736.

This reverts a change that was made in order to properly update groups when a caller socket fails to connect. That change added a request to perform close-like actions in the background (setting flags and releasing thread-related resources), which caused the socket to actually be closed prematurely. As a result, the closure information could not be properly delivered to the caller callback, and the GC could run on the socket too early, deleting it before it was properly reported in the receiver queue thread.

This PR also reverts the changes to the tests that were failing because of this problem.

The changes from #1736 could also be incorporated here as a sanity check.
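
For context, a minimal sketch (mine, not part of this PR) of the caller-callback mechanism the description refers to, using the public srt_connect_callback() API; the handler name and the printed message are illustrative only:

    #include <srt/srt.h>
    #include <cstdio>

    // Illustrative handler: SRT invokes it when an asynchronous connection
    // attempt completes, successfully or not. The premature background close
    // fixed by this PR could delete the socket before this notification was
    // properly delivered.
    static void on_connect(void* /*opaque*/, SRTSOCKET ns, int errorcode,
                           const struct sockaddr* /*peeraddr*/, int /*token*/)
    {
        if (errorcode != SRT_SUCCESS)
            std::printf("socket @%d failed to connect: %s\n",
                        int(ns), srt_strerror(errorcode, 0));
    }

    // Registered before calling srt_connect():
    //   srt_connect_callback(sock, &on_connect, nullptr);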

@ethouris ethouris added the labels "[core] Area: Changes in SRT library core" and "Type: Maintenance (work required to maintain or clean up the code)" on Jan 19, 2021
@ethouris ethouris added this to the v1.4.3 milestone on Jan 19, 2021
@maxsharabayko (Collaborator)

This PR changes the behavior of a socket that failed to connect due to a connection timeout or a rejected connection.
Previously such a socket was marked as "broken" and thus scheduled to be closed by the GC thread.
With this PR such a socket remains unbroken; the failure of the connection attempt is only signaled via epoll as an error.
Therefore the socket will not be closed by the GC thread. The user can make another connection attempt or close it themselves.
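
To illustrate, a minimal sketch (mine, not from this PR) of a caller detecting the failed non-blocking connect via the epoll error flag and closing the socket itself; the address, port, and timeout are placeholders:

    #include <srt/srt.h>
    #include <arpa/inet.h>
    #include <cstring>
    #include <cstdio>

    int main()
    {
        srt_startup();
        SRTSOCKET s = srt_create_socket();

        // Put the socket in non-blocking ("async") mode.
        bool no = false;
        srt_setsockopt(s, 0, SRTO_RCVSYN, &no, sizeof no);
        srt_setsockopt(s, 0, SRTO_SNDSYN, &no, sizeof no);

        sockaddr_in sa;
        std::memset(&sa, 0, sizeof sa);
        sa.sin_family = AF_INET;
        sa.sin_port = htons(4200);                     // placeholder port
        inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr); // placeholder address
        srt_connect(s, (sockaddr*)&sa, sizeof sa);

        // Subscribe to write-readiness and errors.
        int eid = srt_epoll_create();
        int events = SRT_EPOLL_OUT | SRT_EPOLL_ERR;
        srt_epoll_add_usock(eid, s, &events);

        SRT_EPOLL_EVENT ev;
        if (srt_epoll_uwait(eid, &ev, 1, 5000 /*ms*/) > 0
                && (ev.events & SRT_EPOLL_ERR))
        {
            // After this PR the socket is NOT scheduled for GC closing:
            // the application decides whether to retry srt_connect()
            // or to close the socket itself.
            std::printf("connect failed, state=%d\n", (int)srt_getsockstate(s));
            srt_close(s);
        }

        srt_epoll_release(eid);
        srt_cleanup();
        return 0;
    }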

@maxsharabayko (Collaborator)

The behaviour described above should be mentioned somewhere in the documentation. Possible places:

@ethouris (Collaborator, Author)

Let me clarify one thing. This doesn't "change behavior" relative to SRT version 1.4.1; it restores the old behavior that was broken in the latest version. Up to and including version 1.4.1, a socket that failed to connect remained in the CONNECTING state, and the only sign of the error was the epoll ERR flag. While fixing problems found on Makito, a PR was added that handled a connection timeout report, a connection rejection, and a broken connection (including on reception of SHUTDOWN) consistently the same way. That, however, changed the behavior of an asynchronous socket connection.

Comment on lines 1100 to +1108
i->u->m_bConnecting = false;
i->u->updateBrokenConnection();

// DO NOT close the socket here because in this case it might be
// unable to get status from at the right moment. Also only member
// sockets should be taken care of internally - single sockets should
// be normally closed by the application, after it is done with them.

// app can call any UDT API to learn the connection_broken error
CUDT::s_UDTUnited.m_EPoll.update_events(i->u->m_SocketID, i->u->m_sPollID, SRT_EPOLL_IN | SRT_EPOLL_OUT | SRT_EPOLL_ERR, true);
Collaborator

... a socket that failed to connect remained in the CONNECTING state, and the only sign of the error was the epoll ERR flag.

In the current version and in this PR it does not remain in the connecting state.
Does it change the behavior relative to 1.4.1 then?

i->u->m_bConnecting = false;

@ethouris ethouris (Collaborator, Author) commented Jan 21, 2021

This is the code that existed in 1.4.1. This had to be removed in order to move the closing activities outside the lock.

        if (CTimer::getTime() >= i->m_ullTTL)
        {
            HLOGC(mglog.Debug,
                  log << "RendezvousQueue: EXPIRED (" << (i->m_ullTTL ? "enforced on FAILURE" : "passed TTL")
                      << ". removing from queue");
            // connection timer expired, acknowledge app via epoll
            i->m_pUDT->m_bConnecting = false;
            CUDT::s_UDTUnited.m_EPoll.update_events(i->m_iID, i->m_pUDT->m_sPollID, UDT_EPOLL_ERR, true);
            /*
             * Setting m_bConnecting to false but keeping socket in rendezvous queue is not a good idea.
             * Next CUDT::close will not remove it from rendezvous queue (because !m_bConnecting)
             * and may crash here on next pass.
             */
            if (AF_INET == i->m_iIPversion)
                delete (sockaddr_in *)i->m_pPeerAddr;
            else
                delete (sockaddr_in6 *)i->m_pPeerAddr;

            // i_next was preincremented, but this is guaranteed to point to
            // the element next to erased one.
            i_next = m_lRendezvousID.erase(i);
            continue;
        }

Collaborator

Then it does not remain in the CONNECTING state, does it?

Collaborator (Author)

Changing these flags has nothing to do with the "socket state". The socket state of a connection-timed-out socket is still SRTS_CONNECTING.

Collaborator

So internally it will have m_Status == SRTS_CONNECTING, but the status reported through the API is "BROKEN" when the socket is neither connecting nor connected.
A bit confusing, but the motivation seems to be to have a socket in a BROKEN state that is not scheduled for closing by the GC.

SRT_SOCKSTATUS CUDTSocket::getStatus()
{
    // TTL in CRendezvousQueue::updateConnStatus() will set m_bConnecting to false.
    // Although m_Status is still SRTS_CONNECTING, the connection is in fact to be closed due to TTL expiry.
    // In this case m_bConnected is also false. Both checks are required to avoid hitting
    // a regular state transition from CONNECTING to CONNECTED.

    if (m_pUDT->m_bBroken)
        return SRTS_BROKEN;

    // Connecting timed out
    if ((m_Status == SRTS_CONNECTING) && !m_pUDT->m_bConnecting && !m_pUDT->m_bConnected)
        return SRTS_BROKEN;

    return m_Status;
}
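
In application terms (my illustration, not from the thread): after a connection timeout, srt_getsockstate() on such a socket reports SRTS_BROKEN even though internally m_Status is still SRTS_CONNECTING:

    // 's' is a caller socket whose asynchronous connect timed out.
    SRT_SOCKSTATUS st = srt_getsockstate(s);
    if (st == SRTS_BROKEN)
    {
        // Not scheduled for closing by GC (after this PR); the
        // application may retry the connection or close the socket.
    }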

@maxsharabayko (Collaborator)

The TestEnforcedEncryption.CASE_B_3_Blocking_Enforced_On_Off_Pwd_Set_None test is failing in Travis.

@jeandube (Collaborator) commented Jan 22, 2021

I am investigating a change of behavior with EnforcedEncryption on Makito (ref: NKE-1042).
Test case: caller/sender (EnforcedEncryption=default (on), no passphrase) -> listener/receiver (EnforcedEncryption=off, passphrase set).
Result: pre-1.4.2: STREAMING; 1.4.2: not STREAMING unless the passphrases match or the caller/sender has EnforcedEncryption forced off.
I wonder if this is, or was, a bad assumption about the defaults, or a change in the lib.
I found that my test case is NonBlocking B.4 in the Travis tests (if my assumption about the EnforcedEncryption default is right).
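
A minimal sketch (my reading of that test case, not code from the PR) of the option combination, using the public SRTO_ENFORCEDENCRYPTION and SRTO_PASSPHRASE options; the passphrase is a placeholder:

    // Caller/sender: EnforcedEncryption left at its default (true),
    // no passphrase set.
    SRTSOCKET caller = srt_create_socket();

    // Listener/receiver: EnforcedEncryption off, passphrase set.
    SRTSOCKET listener = srt_create_socket();
    bool off = false;
    srt_setsockopt(listener, 0, SRTO_ENFORCEDENCRYPTION, &off, sizeof off);
    const char pwd[] = "0123456789";               // placeholder (10..79 chars)
    srt_setsockopt(listener, 0, SRTO_PASSPHRASE, pwd, sizeof pwd - 1);

    // Pre-1.4.2 this pair connected and streamed (unencrypted);
    // in 1.4.2 the handshake is rejected unless the passphrases match
    // or the caller also has EnforcedEncryption set to false.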

@maxsharabayko (Collaborator)

The test matrix can be found here.
Which one are you investigating?

@jeandube (Collaborator) commented Jan 22, 2021

The test matrix can be found here.
Which one are you investigating?

Found it while you replied, and I have updated my previous comment: it is NonBlocking B.4.

@maxsharabayko (Collaborator)

Test SyncEvent.WaitFor failed in GitHub CI Windows. Restarting.

 149: [ RUN      ] SyncEvent.WaitFor
149: SyncEvent::wait_for(50us) took 796us
149: SyncEvent::wait_for(100us) took 1920us
149: SyncEvent::wait_for(500us) took 989us
149: SyncEvent::wait_for(1 ms) took 1.954 ms
149: SyncEvent::wait_for(101 ms) took 101.023 ms
149: D:\a\srt\srt\test\test_sync.cpp(317): error: Expected: (waittime_us) >= (timeout_us), actual: 1000979 vs 1001000
149: SyncEvent::wait_for(1001 ms) took 1000.98 ms
149: [  FAILED  ] SyncEvent.WaitFor (1109 ms)
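
For reference, a minimal sketch (my guess at the shape of the check, not the actual test_sync.cpp code) of why such an assertion can flake: the measured duration of a timed wait can undershoot the requested timeout by a few microseconds due to clock granularity, as in the 1000979 vs 1001000 above:

    #include <chrono>
    #include <condition_variable>
    #include <mutex>

    // Measures how long a timed wait actually blocks when nobody signals.
    // A check of the form EXPECT_GE(waittime_us, timeout_us) can then fail
    // by a handful of microseconds on coarse-grained clocks.
    long long measure_wait_us(std::chrono::microseconds timeout)
    {
        std::mutex m;
        std::condition_variable cv;
        std::unique_lock<std::mutex> lk(m);
        const auto start = std::chrono::steady_clock::now();
        cv.wait_for(lk, timeout); // no notifier: returns on timeout
        const auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    }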

@maxsharabayko maxsharabayko merged commit 0f8623e into Haivision:master Jan 25, 2021