Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added changes to handle dependency check in FdbSyncd and FpmSyncd for warm-boot #1556

Merged
merged 14 commits into from
Mar 3, 2021
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
65 changes: 63 additions & 2 deletions fdbsyncd/fdbsync.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,62 @@ FdbSync::~FdbSync()
}
}


// Check if interface entries are restored in kernel
bool FdbSync::isIntfRestoreDone()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FdbSync::isIntfRestoreDone() , FdbSync::isReadyToReconcile() and RouteSync::isReadyToReconcile() are doing similar tasks, it seems we could make them into library calls with different input parameters. Later PR is probably fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure .. will consider in future updates to the code

{
vector<string> required_modules = {
"vxlanmgrd",
"intfmgrd",
"vlanmgrd",
"vrfmgrd"
};

for (string& module : required_modules)
{
WarmStart::WarmStartState state;

WarmStart::getWarmStartState(module, state);
if (state == WarmStart::REPLAYED || state == WarmStart::RECONCILED)
{
SWSS_LOG_INFO("Module %s Replayed or Reconciled %d",module.c_str(), (int) state);
}
else
{
SWSS_LOG_INFO("Module %s NOT Replayed or Reconciled %d",module.c_str(), (int) state);
return false;
}
}

return true;
}

// Check if vxlan entries are re-conciled to kernel
bool FdbSync::isReadyToReconcile()
{
vector<string> required_modules = {
"orchagent",
};

for (string& module : required_modules)
{
WarmStart::WarmStartState state;

WarmStart::getWarmStartState(module, state);
if (state == WarmStart::RECONCILED)
{
SWSS_LOG_INFO("Module %s Reconciled %d",module.c_str(), (int) state);
}
else
{
SWSS_LOG_INFO("Module %s NOT Reconciled %d",module.c_str(), (int) state);
return false;
}
}

return true;
}

void FdbSync::processCfgEvpnNvo()
{
std::deque<KeyOpFieldsValuesTuple> entries;
Expand Down Expand Up @@ -447,14 +503,17 @@ void FdbSync::macDelVxlanDB(string key)
fvVector.push_back(t);
fvVector.push_back(v);

SWSS_LOG_NOTICE("%sVXLAN_FDB_TABLE: DEL_KEY %s vtep:%s type:%s",
m_AppRestartAssist->isWarmStartInProgress() ? "WARM-RESTART:" : "" ,
key.c_str(), vtep.c_str(), type.c_str());

// If warmstart is in progress, we take all netlink changes into the cache map
if (m_AppRestartAssist->isWarmStartInProgress())
{
m_AppRestartAssist->insertToMap(APP_VXLAN_FDB_TABLE_NAME, key, fvVector, true);
return;
}

SWSS_LOG_INFO("VXLAN_FDB_TABLE: DEL_KEY %s vtep:%s type:%s", key.c_str(), vtep.c_str(), type.c_str());
m_fdbTable.del(key);
return;

Expand All @@ -476,14 +535,16 @@ void FdbSync::macAddVxlan(string key, struct in_addr vtep, string type, uint32_t
fvVector.push_back(t);
fvVector.push_back(v);

SWSS_LOG_INFO("%sVXLAN_FDB_TABLE: ADD_KEY %s vtep:%s type:%s",
m_AppRestartAssist->isWarmStartInProgress() ? "WARM-RESTART:" : "" ,
key.c_str(), svtep.c_str(), type.c_str());
// If warmstart is in progress, we take all netlink changes into the cache map
if (m_AppRestartAssist->isWarmStartInProgress())
{
m_AppRestartAssist->insertToMap(APP_VXLAN_FDB_TABLE_NAME, key, fvVector, false);
return;
}

SWSS_LOG_INFO("VXLAN_FDB_TABLE: ADD_KEY %s vtep:%s type:%s", key.c_str(), svtep.c_str(), type.c_str());
m_fdbTable.set(key, fvVector);

return;
Expand Down
23 changes: 20 additions & 3 deletions fdbsyncd/fdbsync.h
Original file line number Diff line number Diff line change
Expand Up @@ -9,8 +9,24 @@
#include "netmsg.h"
#include "warmRestartAssist.h"

// The timeout value (in seconds) for fdbsyncd reconcilation logic
#define DEFAULT_FDBSYNC_WARMSTART_TIMER 30
/*
* The timeout value (in seconds) for fdbsyncd reconcilation logic
* This timers avoids indefinite wait for orchagent reconcillation
*/
#define DEFAULT_FDBSYNC_WARMSTART_TIMER 600
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of increasing this timer. @zhenggen-xu can you also review?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timer seems to be big, so we need 10 minutes for FDB to sync? how many entries and how do we get to that number?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the max timer to wait for orchagent reconciliation. This will not be hit in normal circumstances. This is just to avoid indefinite wait on orchagent to reconcile and hence a large value. The code will not wait for this timer once the orchagent had reconciled.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like there were many timers, could you make a inline comments in the code to describe the relationships between different timers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reconciliation and restoration timers are not actually related, except that the reconciliation starts after restoration ends. And the warm-start timer is like fallback timeout to avoid indefinite wait for reconciliation. Will try to include some more description there

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, we are mixing things here. FDBSYNC_RECON_WAIT_TIME in your code is the timer to wait for orchagent to be RECONSILED, that does not depending on other application timers. I don't see why it should be 120 based on fpmsyncd. Do we have a test log that showing the events in sequence?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FDBSYNC_RECON_WAIT_TIME is not the timer to wait for orchagent to reconcile. FDBSYNC_RECON_WAIT_TIME is the time for control plane ( BGP) to re-establish and reconcile its data ( routes) which are then programmed to fpmsyncd/fdbsyncd and hence the time given is same as fpmsyncd.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See your fdbsyncd.cpp in your PR:
line 110: reconCheckTimer.setInterval(timespec{FDBSYNC_RECON_WAIT_TIME, 0});
line 124: if (sync.isReadyToReconcile())

and https://github.com/Azure/sonic-swss/blob/5384ace1440cb97cd4829e416e26b06f0361c9f8/orchagent/orchdaemon.cpp#L671

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not understand the comment completely. Can you please elaborate a little ? FDBSYNC_RECON_WAIT_TIME is the time we are waiting for the control plane ( in this case BGP ) to converge after restart and after this we also need to check if orchagent has reconciled after warm-reboot. Hence when this timer FDBSYNC_RECON_WAIT_TIME expires, we check if orchagent is reconciled and hereon we keep checking for orchagent to be reconciled so that we can mark fdbsyncd as reconciled and it can start updating the APP-DB.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code or even comments in the code does not specify the relationship between this FDBSYNC_RECON_WAIT_TIME timer and BGP timer. If this process has dependency on BGP reconciliation, we should make it explicit, e,g, fpmsyncd could write a flag to somewhere (e,g stateDB) and this process can pick it up. Having a hardcoded timer here to assume the BGP has the same timer (BGP timer, which is configurable) does not seem to be a good idea.


/* Time to wait to run fdb reconcillation after fdbsyncd replay.
* fdbsyncd will wait for this time and for orchagnet to reconcile
* before reconciling itself
*/
#define FDBSYNC_RECON_WAIT_TIME 120

/*
* This is the MAX time in seconds, fdbsyncd will wait after warm-reboot
* for the interface entries to be recreated in kernel before attempting to
* write the FDB data to kernel
*/
#define INTF_RESTORE_MAX_WAIT_TIME 180

namespace swss {

Expand Down Expand Up @@ -43,7 +59,8 @@ class FdbSync : public NetMsg

virtual void onMsg(int nlmsg_type, struct nl_object *obj);

bool isFdbRestoreDone();
bool isIntfRestoreDone();
bool isReadyToReconcile();

AppRestartAssist *getRestartAssist()
{
Expand Down
65 changes: 64 additions & 1 deletion fdbsyncd/fdbsyncd.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,8 @@ int main(int argc, char **argv)
Selectable *temps;
int ret;
Select s;
SelectableTimer replayCheckTimer(timespec{0, 0});
SelectableTimer reconCheckTimer(timespec{0, 0});

using namespace std::chrono;

Expand All @@ -45,7 +47,31 @@ int main(int argc, char **argv)
if (sync.getRestartAssist()->isWarmStartInProgress())
{
sync.getRestartAssist()->readTablesToMap();

steady_clock::time_point starttime = steady_clock::now();
while (!sync.isIntfRestoreDone())
Copy link
Contributor

@qiluo-msft qiluo-msft Dec 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isIntfRestoreDone [](start = 29, length = 17)

CPU is wasted on waiting. Could you subscribe Redis? #WontFix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is not a continuous busy wait ( sleep is present), this should not cause the cpu to be continuously busy. Also there is nothing for fdbsyncd to do until the interface info is populated to kernel after system warm-reboot, hence it needs to wait till such time.

{
duration<double> time_span =
duration_cast<duration<double>>(steady_clock::now() - starttime);
int pasttime = int(time_span.count());

if (pasttime > INTF_RESTORE_MAX_WAIT_TIME)
{
SWSS_LOG_INFO("timed-out before all interface data was replayed to kernel!!!");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if intf is not restored after max_wait_time? Shouldn't we abort to avoid more issues?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

System will proceed further. Some mac programming to kernel might fail because underlying interface is not yet created.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we could not restore interface, why we should proceed further and get into some limbo state that may or may not have critical issues. I would suggest we abort to bring user's attention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interfaces will be eventually restored. The only impact will be that warm-reboot might not be hitless and there will be traffic loss seen. Not sure if we need to go for full abort and impact everything and all traffic. Requesting @prsunny to comment on this too

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Some mac programming to kernel might fail because underlying interface is not yet created." so this condition will be recovered by someone later? Again, if it is a critical condition, we should raise/abort so we don't get into limbo state.

break;
}
sleep(1);
}
SWSS_LOG_NOTICE("Starting ReconcileTimer");
sync.getRestartAssist()->startReconcileTimer(s);
replayCheckTimer.setInterval(timespec{1, 0});
replayCheckTimer.start();
s.addSelectable(&replayCheckTimer);
}
Copy link
Contributor

@qiluo-msft qiluo-msft Dec 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove extra blank line #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

else
{
sync.getRestartAssist()->warmStartDisabled();
sync.m_reconcileDone = true;
}

netlink.registerGroup(RTNLGRP_LINK);
Expand All @@ -67,14 +93,51 @@ int main(int argc, char **argv)
{
s.select(&temps);

if(temps == (Selectable *)sync.getFdbStateTable())
if (temps == (Selectable *)sync.getFdbStateTable())
{
sync.processStateFdb();
}
else if (temps == (Selectable *)sync.getCfgEvpnNvoTable())
{
sync.processCfgEvpnNvo();
}
else if (temps == &replayCheckTimer)
{
if (sync.getFdbStateTable()->empty() && sync.getCfgEvpnNvoTable()->empty())
{
sync.getRestartAssist()->appDataReplayed();
SWSS_LOG_NOTICE("FDB Replay Complete, Start Reconcile Check");
reconCheckTimer.setInterval(timespec{FDBSYNC_RECON_WAIT_TIME, 0});
reconCheckTimer.start();
s.addSelectable(&reconCheckTimer);
}
else
{
replayCheckTimer.setInterval(timespec{1, 0});
// re-start replay check timer
replayCheckTimer.start();
}

}
else if (temps == &reconCheckTimer)
{
if (sync.isReadyToReconcile())
{
if (sync.getRestartAssist()->isWarmStartInProgress())
{
sync.m_reconcileDone = true;
sync.getRestartAssist()->stopReconcileTimer(s);
sync.getRestartAssist()->reconcile();
SWSS_LOG_NOTICE("VXLAN FDB VNI Reconcillation Complete (Event)");
}
}
else
{
reconCheckTimer.setInterval(timespec{1, 0});
// Restart and check again in 1 sec
reconCheckTimer.start();
}
}
else
{
/*
Expand Down
57 changes: 52 additions & 5 deletions fpmsyncd/fpmsyncd.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,9 @@ using namespace swss;
* Default warm-restart timer interval for routing-stack app. To be used only if
* no explicit value has been defined in configuration.
*/
const uint32_t DEFAULT_ROUTING_RESTART_INTERVAL = 120;
const uint32_t DEFAULT_ROUTING_RESTART_INTERVAL = 600;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of increasing this timeout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This timer is now similar to the fallback timer in fdbsyncd. This will take effect only to avoid indefinite wait on orchagent getting reconciled. The default reconcile timer is still 120 seconds as defined by the other macro
const uint32_t DEFAULT_ROUTING_RECON_CHECK_INTERVAL = 120;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, we should well define these two timers: DEFAULT_ROUTING_RESTART_INTERVAL and DEFAULT_ROUTING_RECON_CHECK_INTERVAL in document/code-comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure will add comment in the code


const uint32_t DEFAULT_ROUTING_RECON_CHECK_INTERVAL = 120;

// Wait 3 seconds after detecting EOIU reached state
// TODO: support eoiu hold interval config
Expand Down Expand Up @@ -67,6 +69,10 @@ int main(int argc, char **argv)
SelectableTimer eoiuCheckTimer(timespec{0, 0});
// After eoiu flags are detected, start a hold timer before starting reconciliation.
SelectableTimer eoiuHoldTimer(timespec{0, 0});

// Timers to wait for dependent reconcillation
SelectableTimer reconcileCheckTimer(timespec{0, 0});
SelectableTimer reconcileHoldTimer(timespec{0, 0});
/*
* Pipeline should be flushed right away to deal with state pending
* from previous try/catch iterations.
Expand All @@ -85,13 +91,14 @@ int main(int argc, char **argv)
{
/* Obtain warm-restart timer defined for routing application */
time_t warmRestartIval = sync.m_warmStartHelper.getRestartTimer();
warmStartTimer.setInterval(timespec{DEFAULT_ROUTING_RESTART_INTERVAL, 0});
if (!warmRestartIval)
{
warmStartTimer.setInterval(timespec{DEFAULT_ROUTING_RESTART_INTERVAL, 0});
reconcileCheckTimer.setInterval(timespec{DEFAULT_ROUTING_RECON_CHECK_INTERVAL, 0});
}
else
{
warmStartTimer.setInterval(timespec{warmRestartIval, 0});
reconcileCheckTimer.setInterval(timespec{warmRestartIval, 0});
}

/* Execute restoration instruction and kick off warm-restart timer */
Expand All @@ -100,6 +107,9 @@ int main(int argc, char **argv)
warmStartTimer.start();
s.addSelectable(&warmStartTimer);
SWSS_LOG_NOTICE("Warm-Restart timer started.");
reconcileCheckTimer.start();
s.addSelectable(&reconcileCheckTimer);
SWSS_LOG_NOTICE("reconcileCheckTimer timer started ");
}

// Also start periodic eoiu check timer, first wait 5 seconds, then check every 1 second
Expand All @@ -108,6 +118,10 @@ int main(int argc, char **argv)
s.addSelectable(&eoiuCheckTimer);
SWSS_LOG_NOTICE("Warm-Restart eoiuCheckTimer timer started.");
}
else
{
sync.m_warmStartHelper.setState(WarmStart::WSDISABLED);
}

while (true)
{
Expand All @@ -122,17 +136,26 @@ int main(int argc, char **argv)
* select() loop.
* Note: route reconciliation always succeeds, it will not be done twice.
*/
if (temps == &warmStartTimer || temps == &eoiuHoldTimer)
if (temps == &warmStartTimer || temps == &eoiuHoldTimer || temps == &reconcileHoldTimer)
{
bool readyToReconcile = false;
if (temps == &warmStartTimer)
{
readyToReconcile = true;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of orchagent still not ready, should we just abort?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reconcile should take care of cleaning up stale entries. So not sure abort is necessary.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there was a reason we wait for "sync.isReadyToReconcile", I assume it was a must condition for reconcile to be working as expected. Again, if that condition is broken, we should make it visible to user. This probably won't happen in normal case, if it does, we should have information for user to debug, so raise or abort could help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again same as earlier, orchagent will eventually reconcile. If it does not, it would have its own mechanism to recover/abort. Here we are making effort to reconcile the system to pre-warm-reboot state. Hence if we continue, we would reconcile prematurely and hence warm-reboot may not be hitless. However if we abort, we would impact everything and the full traffic will be hit.

SWSS_LOG_NOTICE("Warm-Restart timer expired.");
}
else if (temps == &reconcileHoldTimer)
{
readyToReconcile = true;
SWSS_LOG_NOTICE("Warm-Restart reconcile hold timer expired.");
}
else
{
readyToReconcile = sync.isReadyToReconcile();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will basically invalidate the eoiu design for fast reconciliation. Understood we probably have to rely on orchagent reconciliation status, then the DEFAULT_ROUTING_RECON_CHECK_INTERVAL should be reduced to way smaller to take advantage of eoui.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DEFAULT_ROUTING_RECON_CHECK_INTERVAL timer is only fallback timer. Is will only come into action when the replay/reconcile does not happen in given time. Hence it should not affect eoui handling.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eoiuHoldTimer is one-shot timer as is in the code, it could be triggered very early like after a few seconds, at that time readyToReconcile could be false (likely) , then the code will fall back to the reconcileCheckTimer (if timer not configured, then DEFAULT_ROUTING_RECON_CHECK_INTERVAL =120 seconds). This is what I meant invalidate the eoiu design. Example: eoui takes 10 seconds to finish, the orchagent takes 15 seconds to get readyToReconcile state, we still need wait at least 120 seconds to reconcile.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got your point. When eoiuHoldTimer expires, that means control plane ( BGP ) has converged. So now we just need to wait for orchangent reconcile and proceed once that is done. To implement this, will restart reconCheck timer with 2 seconds when eoiuHoldTimer expires, to check if orchagent has reconciled. Will implement this change.

SWSS_LOG_NOTICE("Warm-Restart EOIU hold timer expired.");
}
if (sync.m_warmStartHelper.inProgress())

if (sync.m_warmStartHelper.inProgress() && readyToReconcile)
{
sync.m_warmStartHelper.reconcile();
SWSS_LOG_NOTICE("Warm-Restart reconciliation processed.");
Expand Down Expand Up @@ -174,6 +197,30 @@ int main(int argc, char **argv)
{
s.removeSelectable(&eoiuCheckTimer);
}
}
else if (temps == &reconcileCheckTimer)
{
SWSS_LOG_NOTICE("reconcile Check timer Expired ");
if (sync.m_warmStartHelper.inProgress())
{
if (sync.isReadyToReconcile())
{
reconcileHoldTimer.setInterval(timespec{2, 0});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the hold-timer for? Why do we need wait another 2 seconds before reconcile happens?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once fpmsyncd reconcile is done, this timer checks if orchagent is reconciled and we are ready to start updating the reconciled fpmsyncd entries into the APP-DB. Since did not want to check too frequently, kept the timer at 2 seconds.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reconcileHoldTimer is one-shot timer in the code, it just means we wait one time 2 seconds then do reconcile. Question was, why need wait another 2 seconds? what did we wait for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reconcileHoldTimer is one shot timer which is fired once the control plane reconciliation is done. At this time if orchagnet is reconciled, we continue with fpnsyncd reconcillation. However when reconcileHoldTimer expires if orchangent is still not reconciled, we fire reconcileHoldTimer for another 2 seconds and re-check if the orchagent is reconciled yet. This continues till it finds orchagent has reconciled. The idea is it wait for minimum of control plane ( BGP ) reconciliation time ( default 120 seconds) and also check if orchagent is reconciled before reconciling fpmsyncd. Both conditions should be met. Orchagent reconcile can occur earlier or later based on system scale, however we need to wait for minimum control plane ( BGP ) reconcillation time, before reconciling fpmsyncd.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some correction in the above explanation, reconcileCheckTimer ( not reconcileHoldTimer) is fired again for 2 seconds to re-check if orchagent has reconciled. Hence reconcileHoldTimer does not need 2 seconds. Will change it to 1 seconds like eoiu hold timer.

reconcileHoldTimer.start();
s.addSelectable(&reconcileHoldTimer);
SWSS_LOG_NOTICE("Warm-Restart started reconcile hold timer ");
s.removeSelectable(&reconcileCheckTimer);
continue;
}
// re-start check timer for every 2 seconds now on
reconcileCheckTimer.setInterval(timespec{2, 0});
reconcileCheckTimer.start();
SWSS_LOG_NOTICE("Warm-Restart reconcileCheckTimer restarted");
}
else
{
s.removeSelectable(&reconcileCheckTimer);
}
}
else if (!warmStartEnabled || sync.m_warmStartHelper.isReconciled())
{
Expand Down
26 changes: 26 additions & 0 deletions fpmsyncd/routesync.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -920,3 +920,29 @@ string RouteSync::getNextHopIf(struct rtnl_route *route_obj)

return result;
}

// Check if ependent entries are re-conciled
bool RouteSync::isReadyToReconcile()
{
vector<string> required_modules = {
"orchagent",
};

for (string& module : required_modules)
{
WarmStart::WarmStartState state;

WarmStart::getWarmStartState(module, state);
if (state == WarmStart::RECONCILED || state == WarmStart::WSDISABLED)
{
SWSS_LOG_INFO("Module %s Reconciled %d",module.c_str(), (int) state);
}
else
{
SWSS_LOG_INFO("Module %s NOT Reconciled %d",module.c_str(), (int) state);
return false;
}
}

return true;
}
3 changes: 3 additions & 0 deletions fpmsyncd/routesync.h
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,9 @@ class RouteSync : public NetMsg
virtual void onMsg(int nlmsg_type, struct nl_object *obj);

virtual void onMsgRaw(struct nlmsghdr *obj);

bool isReadyToReconcile();

WarmStartHelper m_warmStartHelper;

private:
Expand Down
Loading