```bash
# Set the frequency scaling governor to performance
~$ sudo cpupower frequency-set -g performance

# Disable Turbo Boost
~$ echo "1" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
1
```
```python
import torch
import deepspeed
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "bigscience/bloomz-3b"
payload = "Explain in a sentence in English what is backpropagation in neural networks."

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

ds_model = deepspeed.init_inference(
    model=model,
    mp_size=2,
    dtype=torch.float16,
    replace_method="auto",
    replace_with_kernel_inject=True,
)
```
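To actually run the prompt, a minimal sketch (not taken from the original post) is shown below. It assumes the script is launched with the DeepSpeed launcher so each rank owns a GPU, and that the wrapped Hugging Face module still exposes its usual `generate()` API, which `init_inference` leaves intact on `ds_model.module`:

```python
# Sketch only: tokenize the payload and generate with the DeepSpeed-wrapped model.
# Assumes the deepspeed launcher has set the current CUDA device for this rank.
inputs = tokenizer(payload, return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = ds_model.module.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```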
```python
# We only support three modes: 1) user specified policy for tensor-parallelism,
# 2) kernel injection (replace_with_kernel_inject), and
# 3) automatic tensor parallelism if tp_size > 1.
if self.injection_dict:
    # 1. User specified Tensor Parallelism
    assert not config.replace_with_kernel_inject, "Cannot use both user specified injection policy and kernel injection"
    for client_module, injection_policy in self.injection_dict.items():
        assert issubclass(client_module,
                          torch.nn.Module), f"{client_module} is not a subclass of torch.nn.Module"

        # construct the tuple and pass that instead of a string or dict.
        if isinstance(injection_policy, str):
            config.injection_policy_tuple = (injection_policy, )
        else:
            config.injection_policy_tuple = injection_policy

        layer_names = [name for name, _ in self.module.named_modules()]
        for policy in config.injection_policy_tuple:
            if not any(name.endswith(policy) for name in layer_names):
                raise ValueError(f"Injection policy layer '{policy}' not valid.")

        self._apply_injection_policy(config, client_module)
else:
    if config.replace_with_kernel_inject:
        # 2. DeepSpeed Kernel Injection
        self._apply_injection_policy(config)
    elif config.tensor_parallel.tp_size > 1:
        # 3. Automatic Tensor Parallelism
        parser_dict = AutoTP.tp_parser(model)
        print("AutoTP: ", parser_dict)
        for client_module, injection_policy in parser_dict:
            if isinstance(injection_policy, str):
                config.injection_policy_tuple = (injection_policy, )
            else:
                config.injection_policy_tuple = injection_policy
            self._apply_injection_policy(config, client_module)
```
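The first mode is driven entirely from user code through the `injection_policy` argument of `deepspeed.init_inference`, which becomes `self.injection_dict` above. The sketch below is hedged: it assumes a T5-style model, and the tuple lists the linear submodules whose outputs must be all-reduced; each name is matched against `named_modules()` suffixes by the validation loop in the snippet above.

```python
# Sketch of mode 1 (user-specified tensor-parallel policy); the model choice and
# submodule names are illustrative and depend on the architecture.
import torch
import deepspeed
from transformers import AutoModelForSeq2SeqLM
from transformers.models.t5.modeling_t5 import T5Block

model = AutoModelForSeq2SeqLM.from_pretrained("t5-large", torch_dtype=torch.float16)
engine = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.float16,
    injection_policy={T5Block: ("SelfAttention.o", "EncDecAttention.o", "DenseReluDense.wo")},
)
```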
```python
def _apply_injection_policy(self, config, client_module=None):
    # client_module is only passed when using the injection_dict method.
    checkpoint_dir = config.checkpoint
    checkpoint = SDLoaderFactory.get_sd_loader_json(checkpoint_dir,
                                                    self.checkpoint_engine) if checkpoint_dir is not None else None

    generic_injection(self.module, dtype=config.dtype, enable_cuda_graph=config.enable_cuda_graph)

    if isinstance(self.module, torch.nn.Module):
        # config is our DeepSpeedInferenceConfig and self.config is the HF model config
        replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
```
```python
# defining globals as internally defined functions inherit these everywhere
quantize = (config.dtype == torch.int8)
# todo: Refactor later. In future, let's minimize the style used above and use config.** instead
```
```python
linear_layer_setting = None
'''
    linear_layer_setting (tuple of modules) [Optional]: shows which two classes are used for linear layers and embedding layers
'''
micro_batch_size = -1
seed = -1
local_rank = -1
```
```python
def replace_module(model, orig_class, replace_fn, _replace_policy, checkpoint=None):
    """ Scan the model for instances of ``orig_class`` to replace using ``replace_fn``.
    Arguments:
        model (torch.nn.Module): the model to augment
        orig_class (torch.nn.Module): the module to search for
        replace_fn (method): a method to convert instances of ``orig_class`` to the
                             desired type and return a new instance.
    Returns:
        A modified ``model``.
    """
    sd = None
    if checkpoint is not None:
        sd = torch.load(checkpoint, map_location='cpu')

    policy = {}
    if orig_class is not None:
        policy.update({orig_class: (replace_fn, _replace_policy)})
    else:
        for plcy in replace_policies:
            # instantiate a throw-away policy in order to populate the _orig_layer_class
            _ = plcy(None)
            if isinstance(plcy._orig_layer_class, list):
                for orig_layer_class in plcy._orig_layer_class:
                    policy.update({orig_layer_class: (replace_fn, plcy)})
            elif plcy._orig_layer_class is not None:
                policy.update({plcy._orig_layer_class: (replace_fn, plcy)})
    assert len(policy.items()) > 0,\
        "No default policy found! Please specify your policy injection_policy (like {BertLayer: HFBertLayerPolicy})." +\
        "You can find some samples here: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/module_inject/replace_policy.py"
```
```python
from .containers import HFGPT2LayerPolicy
from .containers import HFBertLayerPolicy
from .containers import BLOOMLayerPolicy
from .containers import HFGPTJLayerPolicy
from .containers import HFGPTNEOLayerPolicy
from .containers import GPTNEOXLayerPolicy
from .containers import HFOPTLayerPolicy
from .containers import MegatronLayerPolicy
from .containers import HFDistilBertLayerPolicy
from .containers import HFCLIPLayerPolicy
from .containers import LLAMALayerPolicy
from .containers import UNetPolicy
from .containers import VAEPolicy
from .containers import LLAMA2LayerPolicy
from .containers import InternLMLayerPolicy
```
```python
def replace_fn(child, _policy, layer_id=0, prefix="", state_dict=None):
    training = False  # todo: refactor this part to go in the config
    if training:
        # copy relevant state from child -> new module
        new_module = replace_with_policy(child, _policy, config.triangular_masking)
    else:
        # copy relevant state from child -> new module
        if config.replace_with_kernel_inject:
            new_module = replace_with_policy(child,
                                             _policy,
                                             config.triangular_masking,
                                             inference=True,
                                             layer_id=layer_id)
        else:
            new_module = replace_wo_policy(child, _policy, prefix=prefix, state_dict=state_dict)

    return new_module
```
```python
def _replace_module(model, policies, prefix='', layer_id=0, level_id=0, state_dict=None):
    """ Traverse model's children recursively and apply any transformations in ``policies``.
    Arguments:
        model (torch.nn.Module): model to augment
        policies (dict): Mapping of source class to replacement function.
    Returns:
        Modified ``model``.
    """
    for name, child in model.named_children():
        if child.__class__ in policies:
            replaced_module = policies[child.__class__][0](child,
                                                           policies[child.__class__][-1],
                                                           layer_id,
                                                           prefix=prefix + name,
                                                           state_dict=state_dict)
            setattr(model, name, replaced_module)
            if isinstance(model, PipelineModule):
                assert hasattr(model, 'forward_funcs'),\
                    "we require pipe-module to have the list of fwd_functions"
                model.forward_funcs[model.fwd_map[name]] = replaced_module
            layer_id += 1
        else:
            checking_key = prefix + name + '.'
            if Loading.is_load_module(child) and state_dict is not None:
                if any(checking_key in item for item in state_dict):
                    Loading.load(
                        child,
                        state_dict,
                        checking_key,
                    )
                else:
                    continue
            if len(child._buffers) != 0 and state_dict is not None:
                Loading.load_buffer(child, state_dict, checking_key)
            _, layer_id = _replace_module(child,
                                          policies,
                                          prefix if level_id == 0 and skip_level_0_prefix(model, state_dict) else \
                                          prefix + name + '.',
                                          layer_id=layer_id,
                                          level_id=level_id + 1,
                                          state_dict=state_dict)

    # Add the reset_cache func to the model, so that it can be called in the beginning of text-generation.
    model.reset_cache = transformer_inference.DeepSpeedTransformerInference.reset_cache
    return model, layer_id
```
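Stripped of checkpoint loading, pipeline-module handling, and layer-id bookkeeping, the traversal pattern `_replace_module` relies on is a recursive walk over `named_children()` plus `setattr`. Here is a self-contained toy (not DeepSpeed code) showing just that skeleton:

```python
# Toy illustration of the recursive-replacement pattern: swap matching classes
# in place via setattr and recurse into everything else.
import torch.nn as nn

def swap_modules(model: nn.Module, policies: dict) -> nn.Module:
    for name, child in model.named_children():
        if child.__class__ in policies:
            setattr(model, name, policies[child.__class__](child))
        else:
            swap_modules(child, policies)
    return model

# Example: replace every nn.ReLU with nn.GELU.
net = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Sequential(nn.Linear(4, 4), nn.ReLU()))
swap_modules(net, {nn.ReLU: lambda old: nn.GELU()})
print(net)
```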
```python
def replace_with_policy(child, policy_cls, triangular_masking, inference=False, layer_id=0):
    policy = policy_cls(child, inference=inference)
    if not policy.cuda_graph_supported:
        # policy says cuda graph is not supported raise an error if set
        assert not config.enable_cuda_graph, "cuda graph is not supported with this model, please disable"
```
```python
# 4. deal with data types -- needs refactor to use dtype instead of fp16
if config.dtype in [torch.float16, torch.bfloat16, torch.int8]:
    _container.convert_to_required_dtype()

# 5. Set the quantization config
quantizer = GroupQuantizer(q_int8=quantize)
_container.set_quantization_config(quantizer)

# 6. create a DS Inference config object
_container.create_ds_model_config()

# 7. use the config and create the module
_container.create_module()

# 8. transpose the weights and bias if needed
_container.transpose()

# 9. deal with tensor parallelism.
_container.apply_tensor_parallelism(mp_replace)

# 10. copy the tensors from the model-specific container to the new module
_container.copy_data_to_new_module()

# 11. set global for generic checkpoint loading
global container_g
```
```python
# helper function to map between DS policies and DS containers
def policy_to_ds_container(**kwargs):

    from .containers import HFGPT2LayerPolicy, DS_GPT2Container
    from .containers import HFBertLayerPolicy, DS_BERTContainer
    from .containers import BLOOMLayerPolicy, DS_BloomContainer
    from .containers import HFGPTJLayerPolicy, DS_GPTJContainer
    from .containers import HFGPTNEOLayerPolicy, DS_GPTNEOContainer
    from .containers import GPTNEOXLayerPolicy, DS_GPTNEOXContainer
    from .containers import HFOPTLayerPolicy, DS_OPTContainer
    from .containers import MegatronLayerPolicy, DS_MegatronGPTContainer
    from .containers import HFDistilBertLayerPolicy, DS_DistilBERTContainer
    from .containers import LLAMALayerPolicy, DS_LLAMAContainer
    from .containers import LLAMA2LayerPolicy, DS_LLAMA2Container
    from .containers import InternLMLayerPolicy, DS_InternLMContainer
```
```python
    if policy_type not in policy_to_container:
        log_dist(f"Policy type {policy_type} not supported", [0])
    else:
        container = policy_to_container[policy_type](**kwargs)
```
```python
def set_lora_params(self):
    """
    Necessary to implement for `HybridEngineContainer`
    """
    self.lora_params = [
        maybe_get_lora(p) for p in [
            self.policy.client_module.mlp.dense_h_to_4h,
            self.policy.client_module.mlp.dense_4h_to_h,
            self.policy.client_module.self_attention.query_key_value,
            self.policy.client_module.self_attention.dense,
        ]
    ]
```
```python
def initialize_tensors(self, enable_training=False):
    # Set the tensors from policy (user module) to container (DS module)
    self.set_attention(*self.policy.attention(enable_training=enable_training))
    self.set_mlp(*self.policy.mlp(enable_training=enable_training))
    self.set_layernorm(*self.policy.layernorm())
    # self.check_meta_tensor_support()
```
```python
def convert_to_required_dtype(self):
    # Note: converting tensors to fp16 requires that we do it in-place using self.__dict__ and not make a list/dict copy
    if self.dtype in [torch.half, torch.bfloat16]:
        for k, v in self.__dict__.items():
            # The list comprehension is used for MoE tensor lists
            if isinstance(v, list) and all((isinstance(tensor, torch.Tensor) \
                    or isinstance(tensor, torch.nn.Parameter)) for tensor in v):
                self.__dict__[k] = [moe_tensor.to(self.dtype) for moe_tensor in v]
```
```python
def create_ds_model_config(self):
    self.set_hidden_heads(*self.policy.get_hidden_heads())
    assert self.num_attention_heads % self.mp_size == 0,\
        "To run the model parallel across the GPUs, the attention_heads require to be divisible by the world_size!" +\
        "This is because the attention computation is partitioned evenly among the parallel GPUs."
```
```python
if self.use_triton and deepspeed.HAS_TRITON:
    from .bert import DS_BERTContainer
    if not isinstance(self, DS_BERTContainer):
        raise NotImplementedError("Triton kernels are only for BERT-like models yet")
```
```python
if not self.config.triton_autotune:
    from deepspeed.ops.transformer.inference.triton.matmul_ext import fp16_matmul
    fp16_matmul.skip_autotune()
```
```python
class DeepSpeedTransformerInference(nn.Module):
    """Initialize the DeepSpeed Transformer Layer.

    Arguments:
        layer_id: The layer index starting from 0, e.g. if model has 24 transformer layers,
            layer_id will be 0,1,2...23 when each layer object is instantiated
        config: An object of DeepSpeedInferenceConfig
        mp_group: Model parallelism group initialized on the modeling side.
        quantize_scales: This argument groups all the layers' scales used for quantization
        quantize_groups: Number of groups used for quantizing the model
        merge_count: Shows the number of model-parallel checkpoints merged before running inference.
            We use this argument to control the quantization scale for the model parameters
            if a bigger quantize-grouping than 1 is used.
        mlp_extra_grouping: This flag is used to show a 2x higher number of groups used for the MLP part
            of a Transformer layer. We use this feature for quantization to reduce the convergence impact
            for specific downstream tasks.
    """
    layer_id = 0
```
```python
def apply_tensor_parallelism(self, mp_replace, reversed_dim=False):
    """
    Add support for reversed dim in tensor parallelism. If necessary, override
    the called methods to handle partitioned weights (i.e. if qkv is split, override the
    `attention_qkv_mp` method). If the model component is not split, it should
    be safe to use the default implementation.
    """
    # Setup the new Attention module
    self.attention_qkv_mp(mp_replace, reversed_dim=reversed_dim)
    self.attention_o_mp(mp_replace, reversed_dim=reversed_dim)

    # Setup the new MLP module
    self.mlp_inter_mp(mp_replace, reversed_dim=reversed_dim)
    self.mlp_output_mp(mp_replace, reversed_dim=reversed_dim)
```
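The `*_mp` calls above apply the usual Megatron-style partitioning: the fused QKV and the first MLP projection are split column-wise, while the attention output and second MLP projection are split row-wise so their partial results can be all-reduced. A self-contained toy (not DeepSpeed code) illustrating the weight slicing for one rank:

```python
# Toy Megatron-style partitioning of a fused QKV weight (column-parallel) and the
# attention output projection (row-parallel) for a given rank.
import torch

hidden, world_size, rank = 16, 2, 0
qkv_w = torch.randn(3 * hidden, hidden)   # fused [q; k; v] projection weight
out_w = torch.randn(hidden, hidden)       # attention output projection weight

def shard_rows(w):
    # Keep this rank's contiguous slice along the output dimension.
    return w.chunk(world_size, dim=0)[rank]

# Column-parallel: slice q, k and v separately so whole heads stay on one rank.
q, k, v = qkv_w.chunk(3, dim=0)
qkv_shard = torch.cat([shard_rows(q), shard_rows(k), shard_rows(v)], dim=0)

# Row-parallel: slice the input dimension; partial outputs are summed via all-reduce.
out_shard = out_w.chunk(world_size, dim=1)[rank]

print(qkv_shard.shape, out_shard.shape)  # (24, 16) and (16, 8) for this toy setup
```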