paddle_pserver2 Connection reset #8876

Closed
adrianhust opened this issue Mar 8, 2018 · 2 comments
Labels
User (marks user questions)

Comments

@adrianhust

Hi, the Paddle parameter server (paddle_pserver2) fails with the error "Connection reset by peer":

Thu Mar 8 16:54:29 2018[1,77]:+ ./paddle_pserver2 --num_gradient_servers=100 --nics=xgbe0 --port=7165 --ports_num=1 --ports_num_for_sparse=1 --rdma_tcp=tcp --comment=paddle_cluster_job
Thu Mar 8 17:17:08 2018[1,69]:F0308 17:17:08.399897 46342 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,73]:F0308 17:17:08.404299 36768 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,70]:F0308 17:17:08.399475 8739 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,72]:F0308 17:17:08.403358 10075 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,71]:F0308 17:17:08.402631 7730 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,76]:F0308 17:17:08.409656 16576 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,83]:F0308 17:17:08.398970 10447 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,87]:F0308 17:17:08.399874 1616 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,88]:F0308 17:17:08.406509 16449 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,86]:F0308 17:17:08.401118 27746 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,75]:F0308 17:17:08.399859 11851 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,91]:F0308 17:17:08.401727 18870 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,84]:F0308 17:17:08.396749 11437 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,80]:F0308 17:17:08.401507 32596 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,90]:F0308 17:17:08.405221 27163 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,99]:F0308 17:17:08.402868 16889 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,79]:F0308 17:17:08.400950 35561 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,85]:F0308 17:17:08.403419 46104 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,74]:F0308 17:17:08.402575 24915 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,93]:F0308 17:17:08.401983 8600 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,89]:F0308 17:17:08.401999 36127 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,82]:F0308 17:17:08.402421 14384 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,94]:F0308 17:17:08.400952 3120 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,97]:F0308 17:17:08.401190 17058 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,95]:F0308 17:17:08.404366 37113 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,81]:F0308 17:17:08.403667 41279 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,77]:F0308 17:17:08.408119 22789 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,78]:F0308 17:17:08.403462 47509 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,98]:F0308 17:17:08.403429 12925 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,96]:F0308 17:17:08.402117 17866 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,73]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,73]:F0308 17:17:08.411231 41691 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,73]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,76]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,76]:F0308 17:17:08.416602 21110 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,76]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,72]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,72]:F0308 17:17:08.410295 16444 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,72]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,99]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,99]:F0308 17:17:08.409775 18855 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,99]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,84]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,84]:F0308 17:17:08.403645 18169 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,84]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,80]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,80]:F0308 17:17:08.408404 37644 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,80]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,71]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,71]:F0308 17:17:08.409564 15520 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,71]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,82]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,82]:F0308 17:17:08.409303 18967 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,82]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,73]: @ 0x8d19fd google::LogMessage::Fail()
Thu Mar 8 17:17:08 2018[1,73]: @ 0x8d19fd google::LogMessage::Fail()
Thu Mar 8 17:17:08 2018[1,88]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,88]:F0308 17:17:08.413429 24016 SocketChannel.cpp:54] Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]
Thu Mar 8 17:17:08 2018[1,88]:*** Check failure stack trace: ***
Thu Mar 8 17:17:08 2018[1,73]: @ 0x8d54ac google::LogMessage::SendToLog()
Thu Mar 8 17:17:08 2018[1,73]: @ 0x8d54ac google::LogMessage::SendToLog()
Thu Mar 8 17:17:08 2018[1,24]:./start_server.sh: line 33: 39014 Killed GLOG_logtostderr=0 GLOG_log_dir="./log" ./paddle_pserver2 --num_gradient_servers=${OMPI_COMM_WORLD_SIZE} --nics=${nics} ${server_arg} --rdma_tcp=${rdma_tcp} --comment=$comment
Thu Mar 8 17:17:08 2018[1,24]:+ check_return 'paddle_pserver2 failed'
Thu Mar 8 17:17:08 2018[1,24]:+ '[' 137 -ne 0 ']'
Thu Mar 8 17:17:08 2018[1,72]: @ 0x8d19fd google::LogMessage::Fail()
Thu Mar 8 17:17:08 2018[1,72]: @ 0x8d19fd google::LogMessage::Fail()
Thu Mar 8 17:17:08 2018[1,24]:+ echo '[./start_server.sh : 34] [main]'
Thu Mar 8 17:17:08 2018[1,24]:[./start_server.sh : 34] [main]
Thu Mar 8 17:17:08 2018[1,24]:+ echo '[FATAL]: paddle_pserver2 failed'
Thu Mar 8 17:17:08 2018[1,24]:[FATAL]: paddle_pserver2 failed
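With this many interleaved ranks, it can help to summarize the log before reading it line by line. The following is a minimal sketch, not part of PaddlePaddle: the regular expression is inferred only from the line format shown above, and the names `ranks_by_peer` and `describe_exit_status` are hypothetical. It groups the fatal `Check failed` lines by peer address and decodes a shell exit status such as the `137` seen on rank 24, assuming the usual convention that a status above 128 means "terminated by signal `status - 128`".

```python
import re
import signal
from collections import defaultdict

# Assumed line shape, taken from the log above:
# <date>[1,<rank>]:F<mmdd> <time> <tid> SocketChannel.cpp:54] Check failed: len >= 0 peer=<ip>: <message> [<errno>]
FATAL_RE = re.compile(
    r"\[1,(?P<rank>\d+)\]:F\d{4} .*Check failed: .* peer=(?P<peer>[\d.]+): "
    r"(?P<msg>.+) \[(?P<errno>\d+)\]"
)


def ranks_by_peer(lines):
    """Map each peer IP to the set of ranks that logged a fatal check against it."""
    peers = defaultdict(set)
    for line in lines:
        match = FATAL_RE.search(line)
        if match:
            peers[match.group("peer")].add(int(match.group("rank")))
    return dict(peers)


def describe_exit_status(status):
    """Decode a shell exit status using the 128 + signal-number convention."""
    if status > 128:
        return "killed by " + signal.Signals(status - 128).name
    return "exited with code %d" % status


if __name__ == "__main__":
    sample = [
        "Thu Mar 8 17:17:08 2018[1,69]:F0308 17:17:08.399897 46342 SocketChannel.cpp:54] "
        "Check failed: len >= 0 peer=10.89.199.23: Connection reset by peer [104]",
        "Thu Mar 8 17:17:08 2018[1,73]:F0308 17:17:08.411231 41691 SocketChannel.cpp:54] "
        "Check failed: len >= 0 peer=10.89.196.16: Connection reset by peer [104]",
    ]
    for peer, ranks in sorted(ranks_by_peer(sample).items()):
        print(peer, sorted(ranks))
    print(describe_exit_status(137))
```

On Linux, status 137 decodes to SIGKILL, which matches the `Killed` message from `start_server.sh`. The decoding does not say who sent the signal, so the cause still has to be confirmed separately on the affected host.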

[Screenshot: mem2018-03-08 17-37-10]

@ranqiu92 added the User (marks user questions) label Mar 8, 2018
@wangkuiyi
Collaborator

wangkuiyi commented Mar 9, 2018

@adrianhust Thank you for your feedback!

Could you please paste the steps you used to start the experiment? If you were following a tutorial or similar guide, please at least paste a link to it. Thanks.

@shanyi15
Collaborator

Hello, this issue has not been updated in the past month, so we will close it today for the sake of other users' experience. If you still need to follow up after it is closed, please feel free to reopen it, and we will get back to you within 24 hours. We apologize for any inconvenience caused by the closure, and thank you for your support of PaddlePaddle!
