A Quick Look: An Operator's Journey Through a Deep Learning Framework

Source: CSDN Blog | 2022-06-17 08:47:33

Written by Zhao Luyang

An operator (Operator, abbreviated here as op) is a basic operation of deep learning. Any deep learning framework contains hundreds of ops, used for all kinds of numerical and tensor computation.

In deep learning, we assemble networks out of building blocks like nn.Module, while ops are the more fundamental recipes and raw materials those blocks are made from.

Take the following demo network as an example:

import oneflow as torch

class TinyModel(torch.nn.Module):
    def __init__(self):
        super(TinyModel, self).__init__()
        self.linear1 = torch.nn.Linear(100, 200)
        self.activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(200, 10)
        self.softmax = torch.nn.Softmax()

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.softmax(x)
        return x

tinymodel = TinyModel()
print("The model:")
print(tinymodel)

Structurally, this network is assembled from nn.Module blocks such as Linear, ReLU and Softmax; but at bottom, those nn.Module blocks are themselves stitched together from individual basic ops, among them Matmul, Relu and Softmax. So, for an existing op in OneFlow, how does the call travel from the Python layer to the C++ layer, get routed, and finally execute? Taking

output = flow.relu(input)

as the example, this article walks through an op's complete Python -> C++ execution path.

First, a schematic of the overall flow:

Below, each stage is traced in detail at the source-code level.

          1

          Binding

Here, binding refers to binding Python and C++ code together. We usually build networks and train models in Python, calling functions to perform all kinds of operations. In reality, those functions are often just thin wrappers at the Python layer; the actual implementation lives in C++. So how does the Python -> C++ call work? Through Python/C++ binding.

In a deep learning framework's implementation, functions can be bound either with the native Python C API or with pybind11; OneFlow uses both. For example, the tensor.xxx methods involved in

oneflow/api/python/framework/tensor.cpp

oneflow/api/python/framework/tensor_functions.cpp

are bound via the Python C API, while the many flow.xxx methods defined in

oneflow/core/functional/functional_api.yaml

are bound via pybind11. We won't cover the Python C API or pybind11 in depth here; see their documentation for details:

          https://docs.python.org/zh-cn/3.8/c-api/index.html

          https://pybind11.readthedocs.io/en/stable/index.html
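To make the binding mechanism concrete, here is a minimal, self-contained pybind11 sketch. It is not OneFlow code; the module name toy_ops and the function are invented purely to illustrate how a C++ function becomes callable from Python:

#include <pybind11/pybind11.h>

namespace py = pybind11;

// A toy C++ function we want to expose to Python.
double relu_scalar(double x) { return x > 0.0 ? x : 0.0; }

// Defines a Python extension module named "toy_ops". After building it,
// Python can run: import toy_ops; toy_ops.relu_scalar(-1.5)  # -> 0.0
PYBIND11_MODULE(toy_ops, m) {
  m.def("relu_scalar", &relu_scalar, "ReLU applied to a single scalar", py::arg("x"));
}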

Now back to flow.relu. The flow.relu we call at the Python layer actually invokes oneflow._C.relu, defined in

python/oneflow/__init__.py

The _C prefix indicates that the implementation lives in C++. Similar to PyTorch, OneFlow defines a set of .yaml-based rules for exporting interfaces and generating code. For instance, in functional_api.yaml we can find the exported function signature for Relu:

          - name: "relu" signature: "Tensor (Tensor x, Bool inplace=False) => Relu" bind_python: True

From this yaml definition we can see that flow._C.relu takes two parameters, a tensor and a bool, is bound to the C++ Relu method, and returns a tensor. When OneFlow is built, the script

tools/functional/generate_functional_api.py

parses functional_api.yaml and generates the corresponding C++ .h and .cpp files:

          build/oneflow/core/functional/functional_api.yaml.h

          build/oneflow/core/functional/functional_api.yaml.cpp
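As a rough illustration, the generated header ends up declaring a C++ function that mirrors the yaml signature. This is only a sketch of the idea; the real generated file contains more machinery and its exact form may differ across versions:

// Sketch of the kind of declaration code generation produces for "relu".
namespace oneflow {
namespace one {
namespace functional {

Maybe<Tensor> Relu(const std::shared_ptr<Tensor>& x, bool inplace);

}  // namespace functional
}  // namespace one
}  // namespace oneflow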

The generated .cpp file then calls the corresponding functor to complete the call at the C++ level. Staying with flow._C.relu, its functor is defined in oneflow/core/functional/impl/activation_functor.cpp:

class ReluFunctor {
 public:
  ReluFunctor() { op_ = CHECK_JUST(one::OpBuilder("relu").Input("x", 1).Output("y", 1).Build()); }
  Maybe<Tensor> operator()(const std::shared_ptr<Tensor>& x, bool inplace) const { ... }

 private:
  std::shared_ptr<OpExpr> op_;
};

ReluFunctor is registered via

ONEFLOW_FUNCTION_LIBRARY(m) {
  m.add_functor<impl::ReluFunctor>("Relu");
  ...
}

Once registered as a functional interface, flow._C.relu at the Python layer is bound to "Relu". The same function can also be called directly from C++ as functional::Relu.
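A hedged sketch of such a direct C++ call (the include path follows the conventions above; details may differ across versions):

// Hedged sketch: calling the functional interface directly from C++.
#include "oneflow/core/functional/functional.h"

Maybe<Tensor> ApplyRelu(const std::shared_ptr<Tensor>& x) {
  // The C++ equivalent of Python's flow._C.relu(x).
  return functional::Relu(x, /*inplace=*/false);
}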

          2

          Functor

The functor layer is not only the core of the Python -> C++ interaction; it is also the first stop for op invocation and for inferring and checking input arguments. Typically, at the functor layer each op validates its input tensors' shape, dtype, rank, element count and so on, and parses and handles any op-specific logic. The Relu functor looks like this:

class ReluFunctor {
 public:
  ReluFunctor() { op_ = CHECK_JUST(one::OpBuilder("relu").Input("x", 1).Output("y", 1).Build()); }
  Maybe<Tensor> operator()(const std::shared_ptr<Tensor>& x, bool inplace) const {
    if (inplace) {
      JUST(CheckInplaceValid(x));
      std::shared_ptr<TensorTuple> outputs = std::make_shared<TensorTuple>(1);
      outputs->at(0) = x;
      JUST(OpInterpUtil::Dispatch(*op_, {x}, outputs.get(), AttrMap{}));
      return outputs->at(0);
    } else {
      return OpInterpUtil::Dispatch<Tensor>(*op_, {x});
    }
  }

 private:
  std::shared_ptr<OpExpr> op_;
};

As you can see, ReluFunctor is fairly simple. It holds a private member

std::shared_ptr<OpExpr> op_;

which is the Relu op to be executed, built via OpBuilder. Inside the functor's operator(), execution takes one of two branches depending on inplace, and in both cases OpInterpUtil::Dispatch() ultimately hands the op, its input tensors and its arguments off to the Interpreter.

          3

          Dispatch

After finishing their checks and logic in the functor, most ops are dispatched via OpInterpUtil::Dispatch(); the destination is the Interpreter, where the op receives further processing. In oneflow/core/framework/op_interpreter/op_interpreter_util.h we can see several overloaded Dispatch templates:

class OpInterpUtil {
 public:
  template<typename T>
  static Maybe<T> Dispatch(const OpExpr& op_expr, const TensorTuple& inputs,
                           const AttrMap& attrs) {
    return Dispatch<T>(op_expr, inputs, OpExprInterpContext(attrs));
  }

  template<typename T>
  static Maybe<T> Dispatch(const OpExpr& op_expr, const TensorTuple& inputs) {
    return Dispatch<T>(op_expr, inputs, OpExprInterpContext(AttrMap{}));
  }

  template<typename T>
  static Maybe<T> Dispatch(const OpExpr& op_expr, const TensorTuple& inputs,
                           const OpExprInterpContext& ctx);

  static Maybe<void> Dispatch(const OpExpr& op_expr, const TensorTuple& inputs,
                              TensorTuple* outputs, const AttrMap& attrs) {
    return Dispatch(op_expr, inputs, outputs, OpExprInterpContext(attrs));
  }

  static Maybe<void> Dispatch(const OpExpr& op_expr, const TensorTuple& inputs,
                              TensorTuple* outputs) {
    return Dispatch(op_expr, inputs, outputs, OpExprInterpContext(AttrMap{}));
  }

  static Maybe<void> Dispatch(const OpExpr& op_expr, const TensorTuple& inputs,
                              TensorTuple* outputs, const OpExprInterpContext& ctx);
};

These overloads exist to handle different combinations of inputs, outputs and OpExprInterpContext. The OpExprInterpContext is the context the op needs inside the Interpreter; it may carry the attributes the op's computation requires (e.g. kernel_size and padding for a conv2d op), along with device, sbp and parallel descriptions. All of these overloaded Dispatch calls eventually funnel into:

/* static */ Maybe<void> OpInterpUtil::Dispatch(const OpExpr& op_expr, const TensorTuple& inputs,
                                                TensorTuple* outputs,
                                                const OpExprInterpContext& ctx) {
  return JUST(GetInterpreter(inputs, ctx, op_expr))->Apply(op_expr, inputs, outputs, ctx);
}

Once Dispatch has done its job, the rest is handed over to the Interpreter.
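Before moving on, it helps to see how a functor whose op carries attributes would use these overloads. Below is a hedged sketch following the ReluFunctor pattern: the op name my_leaky_relu and its alpha attribute are invented for illustration (real attribute names come from the op's definition), and the MutableAttrMap usage mirrors the pattern found in OneFlow's functors:

// Hedged sketch: dispatching an op together with an attribute map.
class MyLeakyReluFunctor {
 public:
  MyLeakyReluFunctor() {
    op_ = CHECK_JUST(one::OpBuilder("my_leaky_relu").Input("x", 1).Output("y", 1).Build());
  }
  Maybe<Tensor> operator()(const std::shared_ptr<Tensor>& x, float alpha) const {
    MutableAttrMap attrs;
    JUST(attrs.SetAttr<float>("alpha", alpha));  // wrapped into an OpExprInterpContext by Dispatch
    return OpInterpUtil::Dispatch<Tensor>(*op_, {x}, attrs);
  }

 private:
  std::shared_ptr<OpExpr> op_;
};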

          4

          Interpreter

          Get Interpreter

Let's first look at GetInterpreter, which simply fetches the Interpreter that will take charge of the op's subsequent execution. With the check-related logic omitted, the main code is as follows (oneflow/core/framework/op_interpreter/op_interpreter_util.cpp):

Maybe<AutogradInterpreter> GetInterpreter(const TensorTuple& inputs,
                                          const OpExprInterpContext& ctx, const OpExpr& op_expr) {
  static const auto& g_lazy_interpreter = BuildLazyInterpreter();
  static const auto& g_eager_consistent_interpreter = BuildEagerInterpreter(/*is_mirrored=*/false);
  static const auto& g_eager_mirrored_interpreter = BuildEagerInterpreter(/*is_mirrored=*/true);
  if (!LazyMode::is_enabled()) {
    if (inputs.empty()) {
      if (ctx.parallel_desc.has_value()) {
        JUST(ctx.nd_sbp);
        CHECK_OR_RETURN(!ctx.device.has_value());
        return g_eager_consistent_interpreter;
      } else {
        CHECK_OR_RETURN(!ctx.nd_sbp.has_value());
        return g_eager_mirrored_interpreter;
      }
    } else {
      if (inputs.at(0)->is_consistent()) {
        ...
        return g_eager_consistent_interpreter;
      } else {
        ...
        return g_eager_mirrored_interpreter;
      }
    }
    UNIMPLEMENTED_THEN_RETURN();
  }
  return g_lazy_interpreter;
}

The logic above shows that Interpreters fall broadly into an Eager Interpreter and a Lazy Interpreter, with the Eager Interpreter further split between Eager Mirrored and Eager Consistent. Concretely, there are three subclass implementations:

          EagerMirroredInterpreter

          EagerConsistentInterpreter

          LazyInterpreter

Ordinary eager mode (whether single-device or DDP) goes through EagerMirroredInterpreter; if, beyond plain eager mode, sbp and placement have been set on the input tensors, execution enters EagerConsistentInterpreter; and in lazy mode (when using nn.Graph), it enters LazyInterpreter.

Next, let's look at how these three Interpreters are constructed:

std::shared_ptr<AutogradInterpreter> BuildEagerInterpreter(const bool& is_mirrored) {
  std::shared_ptr<OpExprInterpreter> internal;
  if (is_mirrored) {
    internal = std::make_shared<EagerMirroredInterpreter>();
  } else {
    internal = std::make_shared<EagerConsistentInterpreter>();
  }
  return std::make_shared<AutogradInterpreter>(internal);
}

std::shared_ptr<AutogradInterpreter> BuildLazyInterpreter() {
  auto internal = std::make_shared<LazyInterpreter>();
  return std::make_shared<AutogradInterpreter>(internal);
}

As shown, after any of the three Interpreters is constructed, it becomes the private member internal of an AutogradInterpreter, and that AutogradInterpreter is what is ultimately returned.

class AutogradInterpreter {
 public:
  AutogradInterpreter() = delete;
  AutogradInterpreter(const std::shared_ptr<OpExprInterpreter>& internal) : internal_(internal) {}
  virtual ~AutogradInterpreter() = default;

  Maybe<void> Apply(const OpExpr& op_expr, const TensorTuple& inputs, TensorTuple* outputs,
                    const AttrMap& attrs) const {
    return Apply(op_expr, inputs, outputs, OpExprInterpContext(attrs));
  }

  Maybe<void> Apply(const OpExpr& op_expr, const TensorTuple& inputs, TensorTuple* outputs) const {
    return Apply(op_expr, inputs, outputs, OpExprInterpContext(AttrMap{}));
  }

  Maybe<void> Apply(const OpExpr& op_expr, const TensorTuple& inputs, TensorTuple* outputs,
                    const OpExprInterpContext& ctx) const;

 private:
  std::shared_ptr<OpExprInterpreter> internal_;
};
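Structurally this is the decorator pattern: AutogradInterpreter holds an inner interpreter and adds autograd bookkeeping around whatever that inner interpreter does. A minimal, generic sketch of the pattern (not OneFlow code):

#include <iostream>
#include <memory>

// The inner interface: one of several interchangeable implementations.
struct Inner {
  virtual ~Inner() = default;
  virtual void Apply() const = 0;
};

struct EagerInner : Inner {
  void Apply() const override { std::cout << "run op\n"; }
};

// The outer decorator: wraps any Inner and adds behavior around the call.
class Outer {
 public:
  explicit Outer(std::shared_ptr<Inner> internal) : internal_(std::move(internal)) {}
  void Apply() const {
    // ... pre-processing (e.g. decide requires_grad) ...
    internal_->Apply();
    // ... post-processing (e.g. insert backward nodes) ...
  }

 private:
  std::shared_ptr<Inner> internal_;
};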

          Apply()

From the above we know that EagerMirroredInterpreter, EagerConsistentInterpreter and LazyInterpreter are each wrapped in an AutogradInterpreter shell, and the Apply call is triggered through AutogradInterpreter. As the name suggests, AutogradInterpreter's job is autograd-related: in eager mode it inserts, for each forward op node, the corresponding node used to compute gradients in the backward pass.

Let's look at this code; the role of each key part is explained in the comments:

Maybe<void> AutogradInterpreter::Apply(const OpExpr& op_expr, const TensorTuple& inputs,
                                       TensorTuple* outputs, const OpExprInterpContext& ctx) const {
  // Decide whether gradients must be computed: if we are inside a GradMode scope
  // and the op was not registered with gradients disabled, requires_grad is derived
  // from the inputs: if any input tensor has requires_grad == true, grads are needed.
  bool requires_grad = false;
  if (autograd::GradMode::is_enabled() && !JUST(op_expr.IsGradDisabled())) {
    requires_grad =
        std::any_of(inputs.begin(), inputs.end(),
                    [](const std::shared_ptr<Tensor>& tensor) { return tensor->requires_grad(); });
  }
// This chunk is admittedly ugly: OneFlow recently gained stride & view support,
// but most ops have not yet registered stride inference and cannot take
// non-contiguous input tensors, so such inputs are forcibly converted here.
// NOTE: if this op does not support stride, tensor->contiguous() is needed.
#define HANDLE_NON_CONTIGUOUS_INPUT(tensor_tuple_ptr)                                       \
  TensorTuple tmp_inputs;                                                                   \
  if (!LazyMode::is_enabled() && !JUST(op_expr.SupportNonContiguous())) {                   \
    tmp_inputs.resize(inputs.size());                                                       \
    for (size_t i = 0; i < inputs.size(); i++) { tmp_inputs[i] = inputs[i]->contiguous(); } \
    tensor_tuple_ptr = &tmp_inputs;                                                         \
  }
  const TensorTuple* inputs_ptr = &inputs;
  HANDLE_NON_CONTIGUOUS_INPUT(inputs_ptr);

  // This is where the actual Interpreter execution happens.
  {
    autograd::AutoGradMode mode(false);
    JUST(internal_->Apply(op_expr, *inputs_ptr, outputs, ctx));
  }

  // In eager mode, for ops with requires_grad == true, insert a backward node
  // (via AddNode) for autograd; the node carries the backward function (backward_fn).
  // Lazy mode will construct backward compute graph in passes, so disable autograd if lazy mode.
  std::shared_ptr<OpExprGradClosure> grad_closure(nullptr);
  if (requires_grad && !LazyMode::is_enabled()) {
    grad_closure = JUST(op_expr.GetOrCreateOpGradClosure());
    auto backward_fn = std::make_shared<BackwardFunction>();
    backward_fn->body = [=](const TensorTuple& out_grads, TensorTuple* in_grads,
                            bool create_graph) -> Maybe<void> {
      autograd::AutoGradMode mode(create_graph);
      JUST(grad_closure->Apply(out_grads, in_grads));
      return Maybe<void>::Ok();
    };
    backward_fn->status = [=]() { return grad_closure->state()->SavedTensors().size() > 0; };
    JUST(GetThreadLocalAutogradEngine()->AddNode(op_expr.op_type_name() + "_backward", backward_fn,
                                                 *inputs_ptr, outputs));
  }
  // Update outputs autograd meta.
  // Note: if requires_grad is true, we will create a new autograd meta for each output
  // in `AddBackwardFuncPtr` to support inplace operations, so the update must come after
  // `AddBackwardFuncPtr`.
  for (auto& output : *outputs) {
    output->set_is_leaf(inputs_ptr->size() == 0 || !requires_grad);
    ...
    if (!output->requires_grad()) {
      JUST(output->set_requires_grad(
          requires_grad && IsSupportRequireGradDataType(output->dtype()->data_type())));
    }
  }
  // Capture the forward inputs and outputs; the backward pass may need them.
  if (requires_grad && !LazyMode::is_enabled()) {
    // Capture inputs and outputs after `AddBackwardFuncPtr`, because the grad
    // function node has been attached to them.
    JUST(grad_closure->Capture(*inputs_ptr, *outputs, ctx));
  }
  return Maybe<void>::Ok();
}

That is a lot of logic at once. For a simple op like Relu, the only part we need to focus on is:

// This is where the actual Interpreter execution happens.
{
  autograd::AutoGradMode mode(false);
  JUST(internal_->Apply(op_expr, *inputs_ptr, outputs, ctx));
}
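Note the autograd::AutoGradMode mode(false) guard: it temporarily disables grad mode while the inner interpreter runs, so that whatever executes inside does not itself register backward nodes. Conceptually this is an RAII flag guard along the following lines (a generic sketch, not OneFlow's actual implementation):

// Generic RAII sketch of a grad-mode guard: flip a thread-local flag on
// construction and restore the previous value on destruction.
class GradModeGuard {
 public:
  explicit GradModeGuard(bool enabled) : prev_(flag()) { flag() = enabled; }
  ~GradModeGuard() { flag() = prev_; }

  static bool is_enabled() { return flag(); }

 private:
  static bool& flag() {
    static thread_local bool enabled = true;
    return enabled;
  }
  bool prev_;
};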

Continuing with the flow.relu example: since this is plain eager mode, execution actually lands in EagerInterpreter's Apply method:

Maybe<void> EagerInterpreter::Apply(const OpExpr& op_expr, const TensorTuple& inputs,
                                    TensorTuple* outputs, const OpExprInterpContext& ctx) const {
#define APPLY_IF(op_type)                                              \
  if (const auto* op = dynamic_cast<const op_type##Expr*>(&op_expr)) { \
    return ApplyImpl(*op, inputs, outputs, ctx);                       \
  }

  APPLY_IF(UserOp);
  APPLY_IF(VariableOp);
  APPLY_IF(CastToMirroredOp);
  APPLY_IF(CastFromMirroredOp);
  APPLY_IF(ConsistentToConsistentOp);
  APPLY_IF(CastToConsistentOp);
  APPLY_IF(CastFromConsistentOp);
  APPLY_IF(DistributeSplitOp);
  APPLY_IF(DistributeCloneOp);
  APPLY_IF(DistributeConcatOp);
  APPLY_IF(DistributeAddOp);
  APPLY_IF(FunctionOp);
  APPLY_IF(SelectTopNOp)
#undef APPLY_IF

  OF_UNIMPLEMENTED() << "The type " << op_expr.op_type_name()
                     << " has not been supported in EagerInterpreter::Apply.";
}

The APPLY_IF macro adds a branch for each op type. For most users, the ops involved are of type UserOp, so execution actually takes this branch:

if (const auto* op = dynamic_cast<const UserOpExpr*>(&op_expr)) {
  return ApplyImpl(*op, inputs, outputs, ctx);
}

Next, look at EagerMirroredInterpreter::ApplyImpl, located in

oneflow/core/framework/op_interpreter/eager_mirrored_op_interpreter.cpp

Maybe<void> EagerMirroredInterpreter::ApplyImpl(const UserOpExpr& op_expr,
                                                const TensorTuple& inputs, TensorTuple* outputs,
                                                const OpExprInterpContext& ctx) const {
  return NaiveInterpret(op_expr, inputs, outputs, ctx);
}

Its real work is done by NaiveInterpret.

          NaiveInterpret

In short, NaiveInterpret mainly does the following:

checks that the input tensors are all on the same device

creates the output tensors

infers and checks shape/stride/dtype for the output tensors

builds the op execution instruction and dispatches it to the vm

A simplified version of the code:

Maybe<void> NaiveInterpret(const UserOpExpr& user_op_expr, const TensorTuple& inputs,
                           const Symbol<Device>& default_device, TensorTuple* outputs,
                           const OpExprInterpContext& ctx) {
  const auto& attrs = ctx.attrs;
  std::shared_ptr<EagerBlobObjectList> input_eager_blob_objects =
      std::make_shared<EagerBlobObjectList>(inputs.size());
  // check devices
  for (int i = 0; i < inputs.size(); i++) {
    const auto& input_device = JUST(inputs.at(i)->device());
    if (i > 0) {
      CHECK_OR_RETURN(*default_device == *input_device)
          << Error::RuntimeError()
          << "Expected all tensors to be on the same device, but found at least two devices, "
          << default_device->ToString() << " (positional 0) and " << input_device->ToString()
          << " (positional " << i << ")!";
    }
    input_eager_blob_objects->at(i) = JUST(inputs.at(i)->eager_blob_object());
  }
  // make output tensors
  std::shared_ptr<EagerBlobObjectList> output_eager_blob_objects =
      std::make_shared<EagerBlobObjectList>(outputs->size());
  auto* output_tensor_metas = ThreadLocalDefaultOutputMutTensorMetas(outputs->size());
  for (int i = 0; i < outputs->size(); i++) {
    if (!outputs->at(i)) {
      const auto& tensor_impl = std::make_shared<EagerMirroredTensorImpl>();
      outputs->at(i) = std::make_shared<MirroredTensor>(tensor_impl);
      output_tensor_metas->at(i) = tensor_impl->mut_tensor_meta();
    } else {
      bool has_eager_blob_object = JUST(outputs->at(i)->has_eager_blob_object());
      CHECK_OR_RETURN(has_eager_blob_object);
      output_eager_blob_objects->at(i) = JUST(outputs->at(i)->eager_blob_object());
    }
  }
  Symbol<Stream> stream;
  bool need_check_mem_case = true;

  // Infer devices
  ...

  // Infer shapes, strides, dtype
  ...

  // Build the op execution instruction and dispatch it to the vm
  JUST(PhysicalRun([&](InstructionsBuilder* builder) -> Maybe<void> {
    return builder->LocalCallOpKernel(kernel, input_eager_blob_objects, output_eager_blob_objects,
                                      ctx, stream);
  }));
  return Maybe<void>::Ok();
}

The Interpreter's terminus is the virtual machine (vm). The vm is one of OneFlow's more distinctive designs and a large topic in its own right, so we won't expand on it here :) Roughly speaking, once dispatched to the vm, the op joins a task-execution queue, where it waits for the vm to schedule and execute it.
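As a mental model only (OneFlow's vm is far more sophisticated, dealing with devices, streams and instruction dependencies), the "dispatch an instruction, let a scheduler thread execute it later" pattern looks roughly like this:

// Toy sketch of an instruction queue plus a worker thread; purely a mental
// model for "dispatch to the vm and wait to be scheduled", not OneFlow's vm.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class ToyVM {
 public:
  ToyVM() : worker_([this] { Run(); }) {}
  ~ToyVM() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  // Called by the "interpreter": enqueue an instruction and return immediately.
  void Dispatch(std::function<void()> instruction) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      queue_.push(std::move(instruction));
    }
    cv_.notify_one();
  }

 private:
  // The "scheduler" loop: pop instructions and execute them in order.
  void Run() {
    for (;;) {
      std::function<void()> inst;
      {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
        if (queue_.empty()) { return; }  // done_ was set and the queue is drained
        inst = std::move(queue_.front());
        queue_.pop();
      }
      inst();  // run outside the lock
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> queue_;
  bool done_ = false;
  std::thread worker_;  // declared last so the other members are ready first
};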

          5

          Compute

After the Interpreter dispatches the op execution instruction to the vm, and once the scheduling logic has processed it, execution is triggered in

oneflow/core/eager/opkernel_instruction_type.cpp

The core code is:

static inline void OpKernelCompute(LocalCallOpKernelPhyInstrOperand* operand,
                                   DeviceCtx* device_ctx, user_op::OpKernelState* state,
                                   const user_op::OpKernelCache* cache) {
  auto* opkernel = operand->mut_opkernel();
  auto* compute_ctx =
      opkernel->UpdateComputeContext(operand->inputs().get(), operand->outputs().get(),
                                     operand->consistent_tensor_infer_result().get(), device_ctx);
  ...
  operand->user_opkernel()->Compute(compute_ctx, state, cache);
  opkernel->UpdateComputeContext(nullptr, nullptr, nullptr, nullptr);
}

Here,

operand->user_opkernel()->Compute(compute_ctx, state, cache);

triggers the actual execution of the op kernel. Generally, an op's kernel implementation dispatches to different code depending on the device, and usually lives in:

oneflow/user/kernels/xxx_kernel.cpp

oneflow/user/kernels/xxx_kernel.cu

The Relu op is a bit special: it is implemented with a primitive (primitives are another distinctive OneFlow design, with good abstraction and composability). Concretely, the UnaryPrimitive here is the combination of an elementwise-unary template and a UnaryFunctor. Its call chain is as follows:

          UnaryPrimitiveKernel

class UnaryPrimitiveKernel final : public user_op::OpKernel, public user_op::CudaGraphSupport {
 public:
  OF_DISALLOW_COPY_AND_MOVE(UnaryPrimitiveKernel);
  UnaryPrimitiveKernel() = default;
  ~UnaryPrimitiveKernel() = default;

  using PrimitiveFactoryFuncType = std::function<std::unique_ptr<ep::primitive::ElementwiseUnary>(
      user_op::KernelComputeContext*)>;

  UnaryPrimitiveKernel(const std::string& output_name, const std::string& input_name,
                       PrimitiveFactoryFuncType fn)
      : output_name_(output_name),
        input_name_(input_name),
        primitive_factory_func_(std::move(fn)) {}

 private:
  using user_op::OpKernel::Compute;
  void Compute(user_op::KernelComputeContext* ctx) const override {
    auto primitive = primitive_factory_func_(ctx);
    CHECK(primitive);

    const user_op::Tensor* input_tensor = ctx->Tensor4ArgNameAndIndex(input_name_, 0);
    ...
    const int64_t elem_cnt = input_shape.elem_cnt();

    if (elem_cnt != 0) {
      primitive->Launch(ctx->stream(), input_tensor->dptr(), output_tensor->mut_dptr(), elem_cnt);
    }
  }
  bool AlwaysComputeWhenAllOutputsEmpty() const override { return false; }

  std::string output_name_;
  std::string input_name_;
  PrimitiveFactoryFuncType primitive_factory_func_;
};


          ep::primitive::ElementwiseUnary

template<UnaryOp unary_op, typename Src, typename Dst>
class ElementwiseUnaryImpl : public ElementwiseUnary {
 public:
  OF_DISALLOW_COPY_AND_MOVE(ElementwiseUnaryImpl);
  ElementwiseUnaryImpl(Scalar attr0, Scalar attr1) : attr0(attr0), attr1(attr1) {}
  ~ElementwiseUnaryImpl() override = default;

  void Launch(Stream* stream, const void* src_ptr, void* dst_ptr, size_t count) override {
    CpuStream* cpu_stream = stream->As<CpuStream>();

    Dst* dst = reinterpret_cast<Dst*>(dst_ptr);
    const Src* src = reinterpret_cast<const Src*>(src_ptr);
    auto functor = UnaryFunctor<DeviceType::kCPU, unary_op, Dst, Src>(attr0, attr1);
    cpu_stream->ParallelFor(0, count, [functor, src, dst](int64_t begin, int64_t end) {
      for (int64_t i = begin; i < end; i++) { dst[i] = functor(src[i]); }
    });
  }

 protected:
  Scalar attr0, attr1;
};
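The CPU path above leans on cpu_stream->ParallelFor to split the range [0, count) into chunks handled by worker threads. As a generic illustration of that pattern (a sketch, not OneFlow's implementation, which uses its own thread pool):

// Generic ParallelFor-style sketch: split [begin, end) into contiguous
// chunks and hand each chunk to its own thread.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

void ParallelForSketch(int64_t begin, int64_t end,
                       const std::function<void(int64_t, int64_t)>& body,
                       int64_t num_threads = 4) {
  const int64_t total = end - begin;
  if (total <= 0) { return; }
  const int64_t chunk = (total + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (int64_t s = begin; s < end; s += chunk) {
    const int64_t e = std::min(s + chunk, end);
    workers.emplace_back([&body, s, e] { body(s, e); });
  }
  for (auto& t : workers) { t.join(); }
}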

          UnaryFunctor

UnaryFunctor is specialized into a concrete functor implementation for each unary op type. For the Relu op, the implementation lives in

          oneflow/core/ep/common/primitive/unary_functor.h:

template<DeviceType device, typename Dst, typename Src>
struct UnaryFunctor<device, UnaryOp::kRelu, Dst, Src> {
  UnaryFunctor(Scalar attr0, Scalar attr1) {}

  OF_DEVICE_FUNC Dst operator()(Src src) const {
    const Src zero_val = static_cast<Src>(0.0);
    if (src <= zero_val) {
      return static_cast<Dst>(zero_val);
    } else {
      return static_cast<Dst>(src);
    }
  }
};

With that, we have completed one op's journey from Python to C++. Viewed up close it is fairly involved, but the overall flow is simple: leaving aside the binding and vm-scheduling details, the main path consists of just four stages: Functor -> Dispatch -> Interpreter -> Kernel Compute.

To implement or add an op, you normally don't need to touch Dispatch or the Interpreter in between; the parts to focus on are those strongly tied to the op itself: parameter and logic checks at the Functor level, and the actual computation in Kernel Compute, as the sketch below illustrates.
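Below is a hedged sketch of the two pieces you would typically write for a hypothetical my_square op, following the patterns shown in this article. All names are invented, the op/kernel registration macros are omitted, and exact signatures should be checked against the OneFlow source:

// 1) Functor layer: input checks + dispatch, following the ReluFunctor pattern.
class MySquareFunctor {
 public:
  MySquareFunctor() {
    op_ = CHECK_JUST(one::OpBuilder("my_square").Input("x", 1).Output("y", 1).Build());
  }
  Maybe<Tensor> operator()(const std::shared_ptr<Tensor>& x) const {
    // Shape/dtype checks specific to this op would go here.
    return OpInterpUtil::Dispatch<Tensor>(*op_, {x});
  }

 private:
  std::shared_ptr<OpExpr> op_;
};

// 2) Kernel layer: the actual computation (a CPU, float-only sketch).
class MySquareCpuKernel final : public user_op::OpKernel {
 private:
  void Compute(user_op::KernelComputeContext* ctx) const override {
    const user_op::Tensor* x = ctx->Tensor4ArgNameAndIndex("x", 0);
    user_op::Tensor* y = ctx->Tensor4ArgNameAndIndex("y", 0);
    const float* x_ptr = x->dptr<float>();
    float* y_ptr = y->mut_dptr<float>();
    const int64_t n = x->shape_view().elem_cnt();  // shape accessor names vary across versions
    for (int64_t i = 0; i < n; ++i) { y_ptr[i] = x_ptr[i] * x_ptr[i]; }
  }
  bool AlwaysComputeWhenAllOutputsEmpty() const override { return false; }
};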

(Reference code:

          https://github.com/Oneflow-Inc/oneflow/commit/1dbdf8faed988fa7fd1a9034a4d79d5caf18512d)


To download and try OneFlow v0.7.0, visit: https://github.com/Oneflow-Inc/oneflow/
