NCCL Communication Primitives

1 Broadcast

Broadcasts data from the root rank to all devices.

[Figure: NCCL Broadcast]

Interface:

ncclResult_t ncclBroadcast(const void* sendbuff, void* recvbuff,
                            size_t count, ncclDataType_t datatype,
                            int root, ncclComm_t comm, cudaStream_t stream);
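To make the data movement concrete, here is a minimal host-only sketch of what Broadcast does. It does not call NCCL. The function name `sim_broadcast` and the flat layout (rank r's buffer starts at `bufs[r * count]`) are assumptions made for illustration, and the copy is done in place rather than through separate send and receive buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-only model of Broadcast (hypothetical helper, not part of NCCL).
 * bufs holds nranks * count floats; rank r owns bufs[r * count ...].
 * After the call, every rank's slice is a copy of the root's slice. */
void sim_broadcast(float *bufs, size_t count, int root, int nranks) {
    assert(root >= 0 && root < nranks);
    for (int r = 0; r < nranks; ++r) {
        if (r == root) {
            continue; /* the root already holds the data */
        }
        memcpy(bufs + (size_t)r * count,
               bufs + (size_t)root * count,
               count * sizeof(float));
    }
}
```

With three ranks and `root = 0`, the two non-root slices are overwritten with the root's values and the root's slice is left unchanged.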

2 Reduce

Performs a reduction (e.g., max, min, sum) and writes the result to the specified root rank.

[Figure: NCCL Reduce]

ncclResult_t ncclReduce(const void* sendbuff, void* recvbuff,
                        size_t count, ncclDataType_t datatype, ncclRedOp_t op,
                        int root, ncclComm_t comm, cudaStream_t stream);
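The semantics of Reduce can be sketched on the host as follows. This is a model, not NCCL code: `sim_reduce_sum` is a hypothetical name, only the sum operator is shown, and all ranks' buffers are assumed to be laid out back to back in one flat array.

```c
#include <assert.h>
#include <stddef.h>

/* Host-only model of Reduce with a sum operator (hypothetical helper).
 * send and recv each hold nranks * count floats; rank r owns the slice
 * starting at index r * count. Only the root's recv slice is written. */
void sim_reduce_sum(const float *send, float *recv,
                    size_t count, int root, int nranks) {
    assert(root >= 0 && root < nranks);
    for (size_t i = 0; i < count; ++i) {
        float acc = 0.0f;
        for (int r = 0; r < nranks; ++r) {
            acc += send[(size_t)r * count + i]; /* element i from every rank */
        }
        recv[(size_t)root * count + i] = acc;
    }
}
```

Element i of the root's output is the sum of element i across all ranks; the receive slices of the other ranks are not touched.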

3 ReduceScatter

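Performs a reduction, then scatters the result across the ranks in equal-sized blocks, so that each rank receives the block selected by its rank index.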


ncclResult_t ncclReduceScatter(const void* sendbuff,
                                void* recvbuff, size_t recvcount, ncclDataType_t datatype,
                                ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream);
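A host-only sketch of ReduceScatter, assuming a sum operator and a flat layout for all ranks' buffers. The name `sim_reduce_scatter_sum` is hypothetical. Note that each rank contributes `nranks * recvcount` elements but receives only `recvcount` of them.

```c
#include <assert.h>
#include <stddef.h>

/* Host-only model of ReduceScatter with a sum operator (hypothetical helper).
 * send: nranks slices of nranks * recvcount floats each (one per rank).
 * recv: nranks slices of recvcount floats each.
 * Rank r receives block r of the element-wise sum. */
void sim_reduce_scatter_sum(const float *send, float *recv,
                            size_t recvcount, int nranks) {
    assert(nranks > 0);
    size_t sendcount = recvcount * (size_t)nranks;
    for (int r = 0; r < nranks; ++r) {
        for (size_t i = 0; i < recvcount; ++i) {
            size_t j = (size_t)r * recvcount + i; /* index into the full vector */
            float acc = 0.0f;
            for (int s = 0; s < nranks; ++s) {
                acc += send[(size_t)s * sendcount + j];
            }
            recv[(size_t)r * recvcount + i] = acc;
        }
    }
}
```

For two ranks sending {1, 2, 3, 4} and {10, 20, 30, 40}, the reduced vector is {11, 22, 33, 44}; rank 0 keeps the first half and rank 1 the second.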

4 AllGather

Gathers $N$ values from each of the $k$ ranks into an output of size $k \times N$, ordered by rank index, and delivers that output to every rank.


ncclResult_t ncclAllGather(const void* sendbuff,
                            void* recvbuff, size_t sendcount, ncclDataType_t datatype,
                            ncclComm_t comm, cudaStream_t stream);
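AllGather can be modeled on the host in a few lines. As before, this is only a sketch of the semantics: `sim_all_gather` is a hypothetical name, and the flat layout of the per-rank buffers is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-only model of AllGather (hypothetical helper, not part of NCCL).
 * send: nranks slices of sendcount floats each (one per rank).
 * recv: nranks slices of nranks * sendcount floats each.
 * Every rank ends up with all slices concatenated in rank order. */
void sim_all_gather(const float *send, float *recv,
                    size_t sendcount, int nranks) {
    assert(nranks > 0);
    size_t recvcount = sendcount * (size_t)nranks;
    for (int r = 0; r < nranks; ++r) {
        /* The send array is already the rank-ordered concatenation. */
        memcpy(recv + (size_t)r * recvcount, send, recvcount * sizeof(float));
    }
}
```

With $k = 2$ ranks and $N = 2$ values each, both ranks receive the same 4-element output.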

5 AllReduce

Equivalent to Reduce followed by Broadcast (or ReduceScatter followed by AllGather): the data is reduced, and the result ends up on every rank.


ncclResult_t ncclAllReduce(const void* sendbuff,
                            void* recvbuff, size_t count, ncclDataType_t datatype,
                            ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream);
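The ReduceScatter + AllGather decomposition can be illustrated with a host-only sketch. It assumes a sum operator, an in-place flat layout, and a `count` that is divisible by the number of ranks. The name `sim_all_reduce_sum` is hypothetical, and this says nothing about how NCCL schedules the work internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-only model of AllReduce (sum) as ReduceScatter followed by AllGather.
 * Hypothetical helper, not part of NCCL.
 * bufs holds nranks * count floats; rank r owns bufs[r * count ...].
 * Assumes count is divisible by nranks. */
void sim_all_reduce_sum(float *bufs, size_t count, int nranks) {
    assert(nranks > 0 && count % (size_t)nranks == 0);
    size_t n = (size_t)nranks;
    size_t chunk = count / n;

    /* Phase 1 (ReduceScatter): rank r ends up owning reduced chunk r. */
    for (size_t r = 0; r < n; ++r) {
        for (size_t i = r * chunk; i < (r + 1) * chunk; ++i) {
            float acc = 0.0f;
            for (size_t s = 0; s < n; ++s) {
                acc += bufs[s * count + i];
            }
            bufs[r * count + i] = acc;
        }
    }

    /* Phase 2 (AllGather): every rank copies chunk s from its owner, rank s. */
    for (size_t r = 0; r < n; ++r) {
        for (size_t s = 0; s < n; ++s) {
            if (s == r) {
                continue; /* rank r already holds its own reduced chunk */
            }
            memcpy(bufs + r * count + s * chunk,
                   bufs + s * count + s * chunk,
                   chunk * sizeof(float));
        }
    }
}
```

After both phases, every rank's slice holds the full element-wise sum, which is the same result a Reduce followed by a Broadcast would give.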
