Export env to python #7792

lixinqi · 2022-03-14T09:41:39Z

OneFlow's env should be an object held by the oneflow Python module, so that its lifetime lasts until the Python interpreter exits.

lixinqi · 2022-03-14T09:43:08Z

oneflow/core/vm/virtual_machine.cpp

@@ -199,6 +202,15 @@ Maybe<void> VirtualMachine::Receive(vm::InstructionMsgList* instr_list) {
      // `ComputeInFuseMode` will be replaced by `Compute` soon.
      instr_msg->mut_instr_type_id()->instruction_type().ComputeInFuseMode(instr_msg);
    }
+  } else if (IsShuttingDown()) {


During the shutting-down phase, instructions are processed directly on the main thread.

lixinqi · 2022-03-14T09:43:45Z

python/oneflow/__init__.py

@@ -210,7 +210,7 @@ def is_deprecated(func_or_class):

 if not env_util.HasAllMultiClientEnvVars():
    env_util.SetDefaultMultiClientEnvVars()
-env_util.api_env_init()
+_oneflow_global_unique_env_ = env_util.create_env()


This is the core change.

lixinqi · 2022-03-14T09:44:08Z

python/oneflow/env.py

@@ -13,17 +13,6 @@
 See the License for the specific language governing permissions and
 limitations under the License.
 """
-from oneflow.framework.env_util import api_all_device_placement as all_device_placement


These interfaces are all deprecated.

chengtbf (Mar 22, 2022): The all_device_placement interface is not deprecated and must not be removed.

lixinqi · 2022-03-14T09:50:58Z

python/oneflow/test/modules/test_shutting_down.py

+class TestCallWhenShuttingDown:
+    def __init__(self):
+        tensor = oneflow.ones((2, 2))
+        print(tensor)


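If this line is commented out, the behavior diverges from PyTorch: PyTorch succeeds, while OneFlow raises an error.
This issue is certainly unrelated to this PR, though, so we will open a new issue to discuss it.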

lixinqi · 2022-03-14T11:16:15Z

Finding the rules for running torch code during the shutting-down phase

First, consider PyTorch.

# script0
import torch

device_type = "cpu"

class Foo:
    def __init__(self):
        pass

    def __del__(self):
        tensor = torch.ones((8, 8), device=torch.device(device_type))
        print(tensor)

foo = Foo()

The example above works correctly. The output is:

tensor([[1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

If device_type is changed to the GPU, so that the example becomes:

# script1
import torch

device_type = "cuda"

class Foo:
    def __init__(self):
        pass

    def __del__(self):
        tensor = torch.ones((8, 8), device=torch.device(device_type))
        print(tensor)

foo = Foo()

then it no longer works. The output is:

Exception ignored in: <bound method Foo.__del__ of <__main__.Foo object at 0x7f49326f8048>>
Traceback (most recent call last):
  File "a.py", line 11, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down

But if we first run torch.ones once in the outer scope, as in:

# script2
import torch

device_type = "cuda"
torch.ones((32, 32), device=torch.device(device_type))

class Foo:
    def __init__(self):
        pass

    def __del__(self):
        tensor = torch.ones((8, 8), device=torch.device(device_type))
        print(tensor)

foo = Foo()

then it works again, and the tensor is printed normally, as with script0.

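Hypothesized cause

My guess is that the underlying rule is very simple: during the shutting-down phase, no new module can actually be imported. That would explain the observations above:

script0 works because a CPU op may need no module other than the torch module.
script1 fails because a CUDA op, besides the torch module, may need a dedicated module for handling CUDA.
script2 works because the CUDA op has already been executed at top-level scope, so the relevant CUDA Python module has already been imported and cached.

The definitive cause can only be established by checking Python's documentation or source code.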
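
If the guess above holds, one way to make a finalizer robust is to resolve everything it needs while the interpreter is still fully alive, and to attempt lazy imports only when the interpreter is not finalizing. The sketch below illustrates that under stated assumptions: `Foo`, `log`, `import_if_possible`, and the use of `json` as the late dependency are hypothetical, while `sys.is_finalizing()` and `importlib.import_module()` are standard-library calls.

```python
import importlib
import json  # imported eagerly, so __del__ never triggers a fresh import
import sys

log = []  # hypothetical sink so the effect of __del__ can be observed


class Foo:
    def __init__(self, payload):
        self.payload = payload
        # Bind the dependency now, while imports are still possible.
        self._dumps = json.dumps

    def __del__(self):
        # Uses only objects that were resolved before shutdown began.
        log.append(self._dumps(self.payload))


def import_if_possible(name):
    """Import `name` unless the interpreter is finalizing; return None then."""
    if sys.is_finalizing():
        return None
    return importlib.import_module(name)
```

This mirrors what script2 does implicitly: running the CUDA op once at top level has the side effect of loading whatever the finalizer later needs.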

python/oneflow/framework/unittest.py

lixinqi · 2022-03-15T03:38:26Z

On calling py::gil_scoped_acquire from a non-main thread

My standalone test shows that py::gil_scoped_acquire can be safely called from a non-main thread while the Python interpreter is exiting.

// example.cpp
#include <pybind11/pybind11.h>
#include <pybind11/functional.h>
#include <thread>
#include <iostream>
#include <condition_variable>
#include <mutex>
#include <chrono>

void TestGILInNonMainThread() {
  std::mutex mutex;
  std::condition_variable cond;
  std::thread thread([&]{
    std::unique_lock<std::mutex> lock(mutex);
    cond.wait(lock, []{ return true; });
    std::cerr << "before_gil_scoped_acquire" << " ... ";
    std::this_thread::sleep_for(std::chrono::milliseconds(2000));
    pybind11::gil_scoped_acquire lock_gil{};
    std::cerr << "after_gil_scoped_acquire" << std::endl;
  });
  cond.notify_one();
  pybind11::gil_scoped_release unlock_gil{};
  thread.join();
}

PYBIND11_MODULE(example, m) {
    m.def("TestGILInNonMainThread", &TestGILInNonMainThread);
}

g++ -O3 -Wall -shared -std=c++11 -fPIC $(python3 -m pybind11 --includes) example.cpp -o example$(python3-config --extension-suffix)

# a.py
import example

class Foo:
    def __init__(self):
        pass

    def __del__(self):
        example.TestGILInNonMainThread()

foo = Foo()

The output shows that py::gil_scoped_acquire works correctly.

$ python3 a.py
before_gil_scoped_acquire ... after_gil_scoped_acquire

lixinqi · 2022-03-15T04:02:14Z

Removed virtual_machine.cpp's dependency on shutting down and reverted it to master's logic.

lixinqi · 2022-03-15T04:15:09Z

python/oneflow/framework/unittest.py

-            env_util.api_env_init()
-            _unittest_env_initilized = True
-
+TestCase = unittest.TestCase


Our TestCase currently needs no extra operations beyond the base class unittest.TestCase, so we export it directly.

@strint @caishenghang

lixinqi · 2022-03-16T13:41:03Z

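Getting to the bottom of py::gil_scoped_acquire being called from a non-main thread during Python finalization

The pybind11 issue pybind/pybind11#3274 already explains the problem completely, and it matches our observations exactly.

The remaining question is why the example in the earlier comment on this PR (#7792 (comment)) inexplicably worked. The reason is that the code above was run with Python 3.6, whereas the failing Python version is 3.8. If the code above is built into a Python extension as follows:

g++ -O3 -Wall -shared -std=c++11 -fPIC $(python3.8 -m pybind11 --includes) example.cpp -o example$(python3.8-config --extension-suffix)

then running python3.8 a.py reproduces this bug in Python itself.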

$ python3.8 a.py
before_gil_scoped_acquire ...

strint · 2022-03-16T13:55:10Z

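Getting to the bottom of py::gil_scoped_acquire being called from a non-main thread during Python finalization

The pybind11 issue pybind/pybind11#3274 already explains the problem completely, and it matches our observations exactly.

This bug comes from Python itself: https://bugs.python.org/issue42969 . The associated PR, python/cpython#28525, has not been merged yet, and it looks like they have not agreed on a fix. So even upgrading to Python 3.11 would not solve the problem.

We also need to stay compatible with Python 3.6, 3.7, 3.8, 3.9, and 3.10, so using atexit to avoid this Python bug is a long-term measure.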

env_api.h is deleted by master

lixinqi · 2022-03-18T08:12:21Z

oneflow/api/python/env/env.cpp

+  if (is_normal_exit) {
+    JUST(vm::ClusterSync());
+    auto* vm = JUST(GlobalMaybe<VirtualMachine>());
+    JUST(vm->CloseVMThreads());
+  }
+  JUST(env->init_is_normal_exit(is_normal_exit));
+  SetShuttingDown(true);
+  return Maybe<void>::Ok();


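The old logic lived in the Python layer: if the process exited abnormally, DeleteEnv was not executed at all. To match that behavior, the EnvGlobalObjectsScope destructor skips the series of Global::Delete() calls when !is_normal_exit.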
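
The rule described here can be sketched in Python as a scope object with an optional exit flag, where teardown is skipped only when an abnormal exit was explicitly recorded. This is a simplified illustration; `EnvScope`, `close`, and `torn_down` are hypothetical names, and treating an unset flag as "tear down as usual" is an assumption based on the destructor check in this PR.

```python
from typing import Optional

torn_down = []  # hypothetical record of which scopes ran their teardown


class EnvScope:
    """Illustrative stand-in for a scope object that owns global resources."""

    def __init__(self, name: str):
        self.name = name
        self._is_normal_exit: Optional[bool] = None  # unset until shutdown

    def init_is_normal_exit(self, is_normal_exit: bool) -> None:
        # The flag is meant to be set exactly once.
        if self._is_normal_exit is not None:
            raise RuntimeError("is_normal_exit was already set")
        self._is_normal_exit = is_normal_exit

    def close(self) -> None:
        # Skip teardown only when an abnormal exit was explicitly recorded;
        # an unset flag still tears down.
        if self._is_normal_exit is False:
            return
        torn_down.append(self.name)
```

Skipping teardown on an abnormal exit trades leaked resources for not touching state that may already be inconsistent.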

lixinqi · 2022-03-18T08:13:44Z

oneflow/core/job/env_global_objects_scope.cpp

@@ -229,6 +229,7 @@ Maybe<void> EnvGlobalObjectsScope::Init(const EnvProto& env_proto) {
 }

 EnvGlobalObjectsScope::~EnvGlobalObjectsScope() {
+  if (is_normal_exit_.has_value() && !CHECK_JUST(is_normal_exit_)) { return; }


Perhaps this should be named is_abnormal_exit.

lixinqi · 2022-03-18T08:17:13Z

oneflow/core/vm/thread_ctx.h

+  std::mutex pending_instruction_mutex_;
+  PendingInstructionMutexedList pending_instruction_list_;
+  Notifier notifier_;


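Remove the channel entirely and use a list + notifier instead. The reason is that during finalization we stop the worker threads and let instructions run on the main thread. The channel is too tightly coupled to its thread: the channel must be closed before the worker thread can exit, but once it is closed, other threads can no longer send instructions through it. The list + notifier pair splits the channel's functionality apart: closing the notifier lets the thread exit, and the list remains usable afterwards.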
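
The list + notifier split can be sketched in Python as below. This is a simplified model, not the C++ code in this PR: `Notifier`, `ThreadCtx`, `worker_loop`, and their method names are hypothetical, and the choice to let a notification that arrived before `close()` still be consumed is an assumption of this sketch.

```python
import threading
from collections import deque


class Notifier:
    """Counting notifier: waiting blocks until notified or closed."""

    def __init__(self):
        self._cond = threading.Condition()
        self._count = 0
        self._closed = False

    def notify(self):
        with self._cond:
            if self._closed:
                return False
            self._count += 1
            self._cond.notify()
            return True

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def wait_and_clear(self):
        """Return True after a notification, False once closed and drained."""
        with self._cond:
            while self._count == 0 and not self._closed:
                self._cond.wait()
            if self._count == 0:
                return False
            self._count = 0
            return True


class ThreadCtx:
    def __init__(self):
        self._mutex = threading.Lock()
        self._pending = deque()  # the list outlives the notifier
        self.notifier = Notifier()

    def enqueue(self, task):
        with self._mutex:
            self._pending.append(task)

    def try_receive_and_run(self):
        """Run everything currently pending; return how many tasks ran."""
        with self._mutex:
            tasks = list(self._pending)
            self._pending.clear()
        for task in tasks:
            task()
        return len(tasks)


def worker_loop(ctx):
    # Exits when the notifier is closed; the pending list stays usable.
    while ctx.notifier.wait_and_clear():
        while ctx.try_receive_and_run() > 0:
            pass
```

After the notifier is closed and the worker has been joined, any thread can still call `enqueue` followed by `try_receive_and_run` to execute tasks itself, which is the property a closed channel cannot offer.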

lixinqi · 2022-03-18T08:17:42Z

oneflow/core/vm/virtual_machine.cpp

+  while (thread_ctx->mut_notifier()->WaitAndClearNotifiedCnt() == kNotifierStatusSuccess) {
+    while (thread_ctx->TryReceiveAndRun()) {}
+  }


The logic here closely resembles how the scheduler thread and the callback thread are handled.

lixinqi · 2022-03-18T08:18:04Z

oneflow/core/vm/virtual_machine.cpp

@@ -115,7 +118,7 @@ VirtualMachine::VirtualMachine(const Resource& resource, int64_t this_machine_id
  // In order to notify threads in VirtualMachineEngine, a notify callback lambda should be take as
  // an argument for VirtualMachineEngine's constructor.
  vm_ = intrusive::make_shared<vm::VirtualMachineEngine>(
-      vm::MakeVmDesc(resource, this_machine_id).Get(), [this]() { callback_notifier_.Notify(); });


callback_notifier_ has been replaced by ScheduleCtx.

lixinqi · 2022-03-18T08:19:18Z

oneflow/core/vm/virtual_machine.cpp

+Maybe<void> VirtualMachine::CloseVMThreads() {
+  CHECK_OR_RETURN(!vm_threads_closed_);
  ControlSync();
  pending_notifier_.Close();
  schedule_thread_.join();
-  CHECK(!vm_);
+  vm_threads_closed_ = true;
+  return Maybe<void>::Ok();


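Closes the VM threads. From this point on, the VM executes in single-threaded mode.
This functionality is extracted from the VirtualMachine destructor so that Python's atexit can call it.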

lixinqi · 2022-03-18T08:19:56Z

oneflow/core/vm/virtual_machine.cpp

@@ -199,6 +212,8 @@ Maybe<void> VirtualMachine::Receive(vm::InstructionMsgList* instr_list) {
      // `ComputeInFuseMode` will be replaced by `Compute` soon.
      instr_msg->mut_instr_type_id()->instruction_type().ComputeInFuseMode(instr_msg);
    }
+  } else if (unlikely(vm_threads_closed_)) {


vm_threads_closed_ is set to true in CloseVMThreads.
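
The dispatch rule implied by this flag can be sketched as follows: before the threads are closed, work is handed to a worker thread, and afterwards it runs in the caller's thread. `MiniVM` and its methods are hypothetical, and for brevity the worker here uses `queue.Queue` with a sentinel, which is a simplification and not the list + notifier structure used in this PR.

```python
import queue
import threading


class MiniVM:
    """Toy model of a VM that falls back to inline execution after closing."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._threads_closed = False
        self._thread = threading.Thread(target=self._loop)
        self._thread.start()

    def _loop(self):
        while True:
            task = self._tasks.get()
            if task is None:  # sentinel: stop the worker
                return
            task()

    def close_vm_threads(self):
        # Closing twice is treated as a programming error.
        if self._threads_closed:
            raise RuntimeError("VM threads are already closed")
        self._tasks.put(None)
        self._thread.join()
        self._threads_closed = True

    def receive(self, task):
        if self._threads_closed:
            task()  # no worker left: run in the caller's thread
        else:
            self._tasks.put(task)
```

The flag check sits on the rarely taken branch, which matches the `unlikely(...)` hint in the diff above.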

lixinqi · 2022-03-18T08:21:49Z

oneflow/core/vm/virtual_machine.cpp

+  void OnGarbageMsgPending() const override { vm_->Callback(); }
+  void OnWorkerLoadPending(vm::ThreadCtx* thread_ctx) const override {
+    while (thread_ctx->TryReceiveAndRun() > 0) {}
+  }


Once a task is received, it is always executed in place.

lixinqi · 2022-03-18T08:23:41Z

python/oneflow/__init__.py

-    if hook.is_normal_exit():
-        oneflow._oneflow_internal.DestroyEnv()
-    oneflow._oneflow_internal.SetShuttingDown()
+    _oneflow_global_unique_env_.SwitchToShuttingDownPhase(hook.is_normal_exit())


The logic deleted above has been moved into the SwitchToShuttingDownPhase function.

… into export_env_to_python

github-actions · 2022-03-31T19:46:39Z

Static analysis with clang failed. PR label automerge has been removed

github-actions · 2022-04-01T12:37:33Z

Speed stats:

GPU Name: GeForce GTX 1080 

✔️ OneFlow resnet50 time: 128.3ms (= 12833.6ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 140.6ms (= 14059.1ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.10 (= 140.6ms / 128.3ms)

✔️ OneFlow resnet50 time: 77.9ms (= 7794.3ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 86.2ms (= 8623.0ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.11 (= 86.2ms / 77.9ms)

OneFlow resnet50 time: 53.9ms (= 10770.8ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 57.1ms (= 11412.6ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.06 (= 57.1ms / 53.9ms)

OneFlow resnet50 time: 43.6ms (= 8729.5ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 50.7ms (= 10148.5ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.16 (= 50.7ms / 43.6ms)

OneFlow resnet50 time: 39.2ms (= 7839.0ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 38.4ms (= 7687.0ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 0.98 (= 38.4ms / 39.2ms)

OneFlow swin dataloader time: 0.245s (= 49.085s / 200, num_workers=1)
PyTorch swin dataloader time: 0.253s (= 50.666s / 200, num_workers=1)
✔️ Relative speed: 1.032 (= 0.253s / 0.245s)

OneFlow swin dataloader time: 0.067s (= 13.317s / 200, num_workers=4)
PyTorch swin dataloader time: 0.070s (= 14.092s / 200, num_workers=4)
✔️ Relative speed: 1.058 (= 0.070s / 0.067s)

OneFlow swin dataloader time: 0.036s (= 7.150s / 200, num_workers=8)
PyTorch swin dataloader time: 0.037s (= 7.474s / 200, num_workers=8)
✔️ Relative speed: 1.045 (= 0.037s / 0.036s)

✔️ OneFlow resnet50 time: 135.7ms (= 13574.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 157.3ms (= 15733.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 157.3ms / 135.7ms)

OneFlow resnet50 time: 89.5ms (= 8945.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.9ms (= 10285.7ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 102.9ms / 89.5ms)

OneFlow resnet50 time: 62.3ms (= 12462.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 76.8ms (= 15368.7ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.23 (= 76.8ms / 62.3ms)

OneFlow resnet50 time: 53.9ms (= 10771.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.9ms (= 13183.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.22 (= 65.9ms / 53.9ms)

OneFlow resnet50 time: 49.0ms (= 9807.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 61.1ms (= 12213.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.25 (= 61.1ms / 49.0ms)

github-actions · 2022-04-01T21:34:09Z

Speed stats:

GPU Name: GeForce GTX 1080 

✔️ OneFlow resnet50 time: 128.3ms (= 12834.0ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 141.6ms (= 14159.5ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.10 (= 141.6ms / 128.3ms)

✔️ OneFlow resnet50 time: 77.5ms (= 7753.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.2ms (= 8417.0ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.09 (= 84.2ms / 77.5ms)

OneFlow resnet50 time: 53.5ms (= 10690.0ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 62.8ms (= 12557.9ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.17 (= 62.8ms / 53.5ms)

OneFlow resnet50 time: 44.5ms (= 8890.2ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 47.4ms (= 9472.0ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.07 (= 47.4ms / 44.5ms)

OneFlow resnet50 time: 40.4ms (= 8075.8ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 38.3ms (= 7658.7ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 0.95 (= 38.3ms / 40.4ms)

OneFlow swin dataloader time: 0.248s (= 49.546s / 200, num_workers=1)
PyTorch swin dataloader time: 0.249s (= 49.739s / 200, num_workers=1)
✔️ Relative speed: 1.004 (= 0.249s / 0.248s)

OneFlow swin dataloader time: 0.066s (= 13.169s / 200, num_workers=4)
PyTorch swin dataloader time: 0.068s (= 13.566s / 200, num_workers=4)
✔️ Relative speed: 1.030 (= 0.068s / 0.066s)

OneFlow swin dataloader time: 0.036s (= 7.226s / 200, num_workers=8)
PyTorch swin dataloader time: 0.036s (= 7.297s / 200, num_workers=8)
✔️ Relative speed: 1.010 (= 0.036s / 0.036s)

✔️ OneFlow resnet50 time: 135.7ms (= 13565.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 156.0ms (= 15603.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 156.0ms / 135.7ms)

OneFlow resnet50 time: 89.0ms (= 8902.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 105.0ms (= 10501.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 105.0ms / 89.0ms)

OneFlow resnet50 time: 61.3ms (= 12255.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 76.7ms (= 15336.7ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.25 (= 76.7ms / 61.3ms)

OneFlow resnet50 time: 52.6ms (= 10520.5ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 66.5ms (= 13296.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.26 (= 66.5ms / 52.6ms)

OneFlow resnet50 time: 51.0ms (= 10200.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 71.6ms (= 14315.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.40 (= 71.6ms / 51.0ms)

github-actions · 2022-04-01T21:35:25Z

CI failed when running job: cuda-speed-test. PR label automerge has been removed

github-actions · 2022-04-02T05:43:52Z

Speed stats:

GPU Name: GeForce GTX 1080 

✔️ OneFlow resnet50 time: 128.7ms (= 12871.4ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 140.5ms (= 14045.9ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.09 (= 140.5ms / 128.7ms)

✔️ OneFlow resnet50 time: 78.5ms (= 7848.3ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 88.5ms (= 8846.8ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.13 (= 88.5ms / 78.5ms)

OneFlow resnet50 time: 53.6ms (= 10718.6ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 59.0ms (= 11806.8ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.10 (= 59.0ms / 53.6ms)

OneFlow resnet50 time: 44.7ms (= 8940.1ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 52.3ms (= 10450.5ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.17 (= 52.3ms / 44.7ms)

OneFlow resnet50 time: 40.5ms (= 8101.2ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 43.1ms (= 8615.1ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.06 (= 43.1ms / 40.5ms)

OneFlow swin dataloader time: 0.252s (= 50.364s / 200, num_workers=1)
PyTorch swin dataloader time: 0.251s (= 50.187s / 200, num_workers=1)
✔️ Relative speed: 0.996 (= 0.251s / 0.252s)

OneFlow swin dataloader time: 0.069s (= 13.853s / 200, num_workers=4)
PyTorch swin dataloader time: 0.069s (= 13.782s / 200, num_workers=4)
✔️ Relative speed: 0.995 (= 0.069s / 0.069s)

OneFlow swin dataloader time: 0.036s (= 7.273s / 200, num_workers=8)
PyTorch swin dataloader time: 0.038s (= 7.681s / 200, num_workers=8)
✔️ Relative speed: 1.056 (= 0.038s / 0.036s)

✔️ OneFlow resnet50 time: 135.7ms (= 13573.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 158.4ms (= 15837.3ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.17 (= 158.4ms / 135.7ms)

OneFlow resnet50 time: 91.6ms (= 9158.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 103.6ms (= 10363.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.13 (= 103.6ms / 91.6ms)

OneFlow resnet50 time: 61.2ms (= 12249.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.0ms (= 15409.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.26 (= 77.0ms / 61.2ms)

OneFlow resnet50 time: 52.9ms (= 10578.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.6ms (= 13513.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.28 (= 67.6ms / 52.9ms)

OneFlow resnet50 time: 47.3ms (= 9458.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 61.2ms (= 12247.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.29 (= 61.2ms / 47.3ms)

github-actions · 2022-04-02T05:49:33Z

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/7792/

lixinqi added 2 commits March 14, 2022 00:11

the Env is never destroyed.

29ccd30

export Env into python

8509484

lixinqi requested review from chengtbf, strint, BBuf, daquexian and jackalcooper as code owners March 14, 2022 09:41

more unittests

434b7af

lixinqi commented Mar 14, 2022

strint approved these changes Mar 14, 2022

strint reviewed Mar 14, 2022

lixinqi commented Mar 15, 2022

export unittest.TestCase in framework/unittest.py

86296cb

lixinqi force-pushed the export_env_to_python branch from c072305 to 86296cb Compare March 16, 2022 14:21

SwitchToShuttingDownPhase

454f5e7

lixinqi force-pushed the export_env_to_python branch from 1873471 to 454f5e7 Compare March 16, 2022 16:02

lixinqi and others added 4 commits March 17, 2022 00:07

optional is_normal_exit

d1d9ad7

VirtualMachine::CloseVMThreads

a58348d

merge master

8bb83a1

Delete env_api.h

fe64379

env_api.h is deleted by master

lixinqi commented Mar 18, 2022

lixinqi requested a review from oneflow-ci-bot March 18, 2022 08:24

lixinqi added automerge bug system labels Mar 18, 2022

Merge branch 'master' into export_env_to_python

b67a60e

chengtbf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot March 31, 2022 02:17

lixinqi and others added 9 commits March 31, 2022 10:52

Merge branch 'master' into export_env_to_python

fb8b9fa

Merge branch 'export_env_to_python' of github.com:Oneflow-Inc/oneflow…

ec2c402

… into export_env_to_python

Merge branch 'master' into export_env_to_python

45fd613

Merge branch 'master' into export_env_to_python

b730ef0

Merge branch 'master' into export_env_to_python

fbd921c

Merge branch 'master' into export_env_to_python

a365516

Merge branch 'master' into export_env_to_python

244ee42

Merge branch 'master' into export_env_to_python

e99bac0

Merge branch 'master' into export_env_to_python

595d13f

github-actions bot removed the automerge label Mar 31, 2022

Merge branch 'master' into export_env_to_python

340f6f9

strint added the automerge label Apr 1, 2022

strint requested review from oneflow-ci-bot and removed request for oneflow-ci-bot April 1, 2022 13:33

mergify bot added 2 commits April 1, 2022 17:23

Merge branch 'master' into export_env_to_python

9a078cd

Merge branch 'master' into export_env_to_python

61cc655

github-actions bot removed the automerge label Apr 1, 2022

Merge branch 'master' into export_env_to_python

e07072c

strint added the automerge label Apr 2, 2022

mergify bot merged commit a632a2e into master Apr 2, 2022

mergify bot deleted the export_env_to_python branch April 2, 2022 06:06
