MLNLP社区是国内外知名的机器学习与自然语言处理社区，受众覆盖国内外NLP硕博生、高校老师以及企业研究人员。

社区的愿景是促进国内外自然语言处理，机器学习学术界、产业界和广大爱好者之间的交流和进步，特别是初学者同学们的进步。

转载自 | 机器之心

编辑 | 蛋酱

十天前的 Meta Connect 2024 大会上，开源领域迎来了可在边缘和移动设备上的运行的轻量级模型 Llama 3.2 1B 和 3B。两个版本都是纯文本模型，但也具备多语言文本生成和工具调用能力。Meta 表示，这些模型可让开发者构建个性化的、在设备本地上运行的通用应用 —— 这类应用将具备很强的隐私性，因为数据无需离开设备。

近日，机器学习研究员 Sebastian Raschka 光速发布长篇教程《Converting Llama 2 to Llama 3.2 From Scratch》。

博文链接：https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/07_gpt_to_llama/converting-llama2-to-llama3.ipynb

本文是《 Converting a From-Scratch GPT Architecture to Llama 2》的后续，更新的内容是如何将 Meta 的 Llama 2 架构模型逐步转换为 Llama 3、Llama 3.1 和 Llama 3.2。为了避免不必要的冗长，本文特意将解释部分缩至最短，并将重点放在主代码上。

机器之心对文章内容进行了不改变原意的编译：

1 逐步转换 Llama 模型实现

如果你是初次实施 LLM 架构，建议从《Build a Large Language Model From Scratch》（https://github.com/rasbt/LLMs-from-scratch/blob/0972ded5309c25dc5eecc98b62897d677c6c36c4/ch04/01_main-chapter-code/ch04.ipynb）的第 4 章开始，那部分内容将逐步指导你实施原始 GPT 架构。

然后可参考《Converting a From-Scratch GPT Architecture to Llama 2》（https://github.com/rasbt/LLMs-from-scratch/blob/0972ded5309c25dc5eecc98b62897d677c6c36c4/ch05/07_gpt_to_llama/converting-gpt-to-llama2.ipynb），将实现 Llama 特有的组件，如 RMSNorm 层、SiLU 和 SwiGLU 激活、RoPE（旋转位置嵌入）和 SentencePiece tokenizer。

本笔记本采用 Llama 2 架构，并通过以下方式将其转换为 Llama 3 架构：

修改旋转嵌入
实现分组查询注意力
使用定制版的 GPT-4 tokenizer

随后，我们将 Meta 共享的原始 Llama 3 权重加载到架构中：

1.1 复用 Llama 2 的组件

Llama 2 实际上与 Llama 3 非常相似，如上文所述和本文开头的图片所示。

这意味着我们可以使用以下代码从 Llama 2 笔记本中导入多个构建模块：

import osimport sysimport ioimport nbformatimport typesdef import_from_notebook():def import_definitions_from_notebook(fullname, names):current_dir = os.getcwd()path = os.path.join(current_dir, fullname + ".ipynb")path = os.path.normpath(path)# Load the notebookif not os.path.exists(path):raise FileNotFoundError(f"Notebook file not found at: {path}")with io.open(path, "r", encoding="utf-8") as f:nb = nbformat.read(f, as_version=4)# Create a module to store the imported functions and classesmod = types.ModuleType(fullname)sys.modules[fullname] = mod# Go through the notebook cells and only execute function or class definitionsfor cell in nb.cells:if cell.cell_type == "code":cell_code = cell.sourcefor name in names:# Check for function or class definitionsif f"def {name}" in cell_code or f"class {name}" in cell_code:exec(cell_code, mod.__dict__)return modfullname = "converting-gpt-to-llama2"names = ["precompute_rope_params", "compute_rope", "SiLU", "FeedForward", "RMSNorm", "MultiHeadAttention"]return import_definitions_from_notebook(fullname, names)

imported_module = import_from_notebook()# We need to redefine precompute_rope_params# precompute_rope_params = getattr(imported_module, "precompute_rope_params", None)compute_rope = getattr(imported_module, "compute_rope", None)SiLU = getattr(imported_module, "SiLU", None)FeedForward = getattr(imported_module, "FeedForward", None)RMSNorm = getattr(imported_module, "RMSNorm", None)# MultiHeadAttention only for comparison purposesMultiHeadAttention = getattr(imported_module, "MultiHeadAttention", None)

1.2 修改后的 RoPE

Llama 3 使用的 RoPE 与 Llama 2 相似，可参阅 RoPE 论文（https://arxiv.org/abs/2104.09864）。

不过，二者 RoPE 设置有一些细微差别。Llama 3 现在支持多达 8192 个 token，是 Llama 2（4096）的两倍。

RoPE 的基础值（见下文公式），从 10000（Llama 2）增加到 50000（Llama 3），公式如下（改编自 RoPE 论文）：

这些值是一组预定义的参数，用于确定旋转矩阵中的旋转角度，其中的维数是嵌入空间的维数。

将基数从 10000 增加到 50000，频率（或旋转角度）在各维度上的衰减速度会更慢，这意味着维度越高，角度越大（本质上，这是对频率的解压缩）。

此外，我们还在下面的代码中引入了一个 freq_config 部分，用于调整频率；不过，在 Llama 3（只有 Llama 3.1 和 Llama 3.2）中并不需要它，所以稍后会重新访问这个 freq_config（默认设置为「无」并被忽略）。

import torchdef precompute_rope_params(head_dim, theta_base=10000, context_length=4096, freq_config=None):assert head_dim % 2 == 0, "Embedding dimension must be even"# Compute the inverse frequenciesinv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim // 2)))################################ NEW ################################################ Frequency adjustmentsif freq_config is not None:low_freq_wavelen = freq_config["original_context_length"] / freq_config["low_freq_factor"]high_freq_wavelen = freq_config["original_context_length"] / freq_config["high_freq_factor"]wavelen = 2 * torch.pi / inv_freqinv_freq_llama = torch.where(wavelen > low_freq_wavelen, inv_freq / freq_config["factor"], inv_freq)smooth_factor = (freq_config["original_context_length"] / wavelen - freq_config["low_freq_factor"]) / (freq_config["high_freq_factor"] - freq_config["low_freq_factor"])smoothed_inv_freq = ((1 - smooth_factor) * (inv_freq / freq_config["factor"]) + smooth_factor * inv_freq)is_medium_freq = (wavelen <= low_freq_wavelen) & (wavelen >= high_freq_wavelen)inv_freq_llama = torch.where(is_medium_freq, smoothed_inv_freq, inv_freq_llama)inv_freq = inv_freq_llama##################################################################################### Generate position indicespositions = torch.arange(context_length)# Compute the anglesangles = positions[:, None] * inv_freq[None, :]  # Shape: (context_length, head_dim // 2)# Expand angles to match the head_dimangles = torch.cat([angles, angles], dim=1)  # Shape: (context_length, head_dim)# Precompute sine and cosinecos = torch.cos(angles)sin = torch.sin(angles)return cos, sin

总之，与 Llama 2 相比，Llama 3 的新功能是「上下文长度」和 theta 基底参数：

# Instantiate RoPE parametersllama_2_context_len = 4096llama_3_context_len = 8192llama_2_theta_base = 10_000llama_3_theta_base = 50_000

在 Llama 2 中，用法与以前相同：

# Settingsbatch_size = 2num_heads = 4head_dim = 16# Instantiate RoPE parameterscos, sin = precompute_rope_params(head_dim=head_dim,theta_base=llama_3_theta_base,context_length=llama_3_context_len)# Dummy query and key tensorstorch.manual_seed(123)queries = torch.randn(batch_size, llama_3_context_len, num_heads, head_dim)keys = torch.randn(batch_size, llama_3_context_len, num_heads, head_dim)# Apply rotary position embeddingsqueries_rot = compute_rope(queries, cos, sin)keys_rot = compute_rope(keys, cos, sin)

1.3 分组查询注意力

本节将用一种名为分组查询注意力（GQA）的替代机制来取代多头注意力（MHA）。简而言之，可以将 GQA 视为计算和参数效率更高的 MHA 版本。

在 GQA 中，通过在多个注意力头之间共享来减少键和值投影的数量，每个注意力头仍有其独特的查询，但这些查询关注同一组键和值。

下面是具有 2 个 key-value 组的 GQA 示例：

GQA 的主要思想是减少与键值对相关的唯一查询组的数量，从而在不显著降低建模性能的情况下，减少 MHA 中某些矩阵乘法的大小和参数的数量。

简而言之，GQA 的主要变化是每个查询组都需要重复，以匹配与之相关的头数量，具体实现如下：

import torch.nn as nn
class GroupedQueryAttention(nn.Module):    def __init__(            self, d_in, d_out, context_length, num_heads,            num_kv_groups,       # NEW            rope_base=10_000,    # NEW            rope_config=None,    # NEW            dtype=None        ):        super().__init__()        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"        assert num_heads % num_kv_groups == 0, "num_heads must be divisible by num_kv_groups"
        self.d_out = d_out        self.num_heads = num_heads        self.head_dim = d_out // num_heads
        ############################# NEW  #############################        # self.W_key = nn.Linear(d_in, d_out, bias=False, dtype=dtype)        # self.W_value = nn.Linear(d_in, d_out, bias=False, dtype=dtype)        self.W_key = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)        self.W_value = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)        self.num_kv_groups = num_kv_groups        self.group_size = num_heads // num_kv_groups        ################################################################
        self.W_query = nn.Linear(d_in, d_out, bias=False, dtype=dtype)        self.out_proj = nn.Linear(d_out, d_out, bias=False, dtype=dtype)
        self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))        cos, sin = precompute_rope_params(            head_dim=self.head_dim,            theta_base=rope_base,      # NEW            freq_config=rope_config,   # NEW            context_length=8192        )        self.register_buffer("cos", cos)        self.register_buffer("sin", sin)
    def forward(self, x):        b, num_tokens, d_in = x.shape
        queries = self.W_query(x)  # Shape: (b, num_tokens, d_out)        keys = self.W_key(x)  # Shape: (b, num_tokens, num_kv_groups * head_dim)        values = self.W_value(x)  # Shape: (b, num_tokens, num_kv_groups * head_dim)
        # Reshape queries, keys, and values        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
        ##################### NEW  #####################        # keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)        # values = values.view(b, num_tokens, self.num_heads, self.head_dim)        keys = keys.view(b, num_tokens, self.num_kv_groups, self.head_dim)        values = values.view(b, num_tokens, self.num_kv_groups, self.head_dim)        ################################################
        # Transpose keys, values, and queries        keys = keys.transpose(1, 2)  # Shape: (b, num_heads, num_tokens, head_dim)        values = values.transpose(1, 2)  # Shape: (b, num_heads, num_tokens, head_dim)        queries = queries.transpose(1, 2)  # Shape: (b, num_query_groups, num_tokens, head_dim)
        # Apply RoPE        keys = compute_rope(keys, self.cos, self.sin)        queries = compute_rope(queries, self.cos, self.sin)
        ##################### NEW  #####################        # Expand keys and values to match the number of heads        # Shape: (b, num_heads, num_tokens, head_dim)
        keys = keys.repeat_interleave(self.group_size, dim=1)  # Shape: (b, num_heads, num_tokens, head_dim)        values = values.repeat_interleave(self.group_size, dim=1)  # Shape: (b, num_heads, num_tokens, head_dim)        # For example, before repeat_interleave along dim=1 (query groups):        #   [K1, K2]        # After repeat_interleave (each query group is repeated group_size times):        #   [K1, K1, K2, K2]        # If we used regular repeat instead of repeat_interleave, we'd get:        #   [K1, K2, K1, K2]        ################################################
        # Compute scaled dot-product attention (aka self-attention) with a causal mask        # Shape: (b, num_heads, num_tokens, num_tokens)        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head
        # Original mask truncated to the number of tokens and converted to boolean        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        # Use the mask to fill attention scores        attn_scores.masked_fill_(mask_bool, -torch.inf)
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)        assert keys.shape[-1] == self.head_dim
        # Shape: (b, num_tokens, num_heads, head_dim)        context_vec = (attn_weights @ values).transpose(1, 2)
        # Combine heads, where self.d_out = self.num_heads * self.head_dim        context_vec = context_vec.reshape(b, num_tokens, self.d_out)        context_vec = self.out_proj(context_vec)  # optional projection
        return context_vec

参数节省的情况，请参考以下来自 GPT 和 Llama 2 代码的多头注意力示例：

# Settingsbatch_size = 1context_len = 3000max_context_len = 8192embed_dim = 4096num_heads = 32example_batch = torch.randn((batch_size, context_len, embed_dim))mha = MultiHeadAttention(d_in=embed_dim,d_out=embed_dim,context_length=max_context_len,num_heads=num_heads)mha(example_batch)print("W_key:", mha.W_key.weight.shape)print("W_value:", mha.W_value.weight.shape)print("W_query:", mha.W_query.weight.shape)

W_key: torch.Size([4096, 4096])W_value: torch.Size([4096, 4096])W_query: torch.Size([4096, 4096])

现在，如果改用分组查询注意力，并使用 8 个 kv 组（Llama 3 8B 使用了 8 个 kv 组），可以看到 key 和 value 矩阵的行数减少了 4 倍（因为 32 个注意力头除以 8 个 kv 组就是 4）：

gqa = GroupedQueryAttention(d_in=embed_dim,d_out=embed_dim,context_length=max_context_len,num_heads=num_heads,num_kv_groups=8,rope_base=llama_3_theta_base)gqa(example_batch)print("W_key:", gqa.W_key.weight.shape)print("W_value:", gqa.W_value.weight.shape)print("W_query:", gqa.W_query.weight.shape)

W_key: torch.Size([1024, 4096])W_value: torch.Size([1024, 4096])W_query: torch.Size([4096, 4096])

顺便提一下，为了使分组查询注意力等同于标准的多头注意力，可以将查询组的数量（num_kv_groups）设置为与头的数量（num_heads）相等。

最后，比较一下下面的参数数量：

print("Total number of parameters:")mha_total_params = sum(p.numel() for p in mha.parameters())print(f"MHA: {mha_total_params:,}")gqa_total_params = sum(p.numel() for p in gqa.parameters())print(f"GQA: {gqa_total_params:,}")

Total number of parameters:MHA: 67,108,864GQA: 41,943,040

# Free up memory:del mhadel gqa

1.4 更新 TransformerBlock 模块

接下来，更新 Transformer 块。在这里，只需将 MultiHeadAttention 与 GroupedQueryAttention 互换，并添加新的 RoPE 设置：

class TransformerBlock(nn.Module):    def __init__(self, cfg):        super().__init__()        self.att =  GroupedQueryAttention(  # MultiHeadAttention(            d_in=cfg["emb_dim"],            d_out=cfg["emb_dim"],            context_length=cfg["context_length"],            num_heads=cfg["n_heads"],            num_kv_groups=cfg["n_kv_groups"],  # NEW            rope_base=cfg["rope_base"],        # NEW            rope_config=cfg["rope_freq"],      # NEW            dtype=cfg["dtype"]        )        self.ff = FeedForward(cfg)        self.norm1 = RMSNorm(cfg["emb_dim"], eps=1e-5)        self.norm2 = RMSNorm(cfg["emb_dim"], eps=1e-5)
    def forward(self, x):        # Shortcut connection for attention block        shortcut = x        x = self.norm1(x)        x = self.att(x.to(torch.bfloat16))   # Shape [batch_size, num_tokens, emb_size]        x = x + shortcut  # Add the original input back
        # Shortcut connection for feed-forward block        shortcut = x        x = self.norm2(x)        x = self.ff(x.to(torch.bfloat16))        x = x + shortcut  # Add the original input back
        return x

1.5 定义模型类

幸运的是，在设置模型类时，我们不需要做太多，只需将名称更新为 Llama3Model

# class Llama2Model(nn.Module):class Llama3Model(nn.Module):    def __init__(self, cfg):        super().__init__()        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"], dtype=cfg["dtype"])
        self.trf_blocks = nn.Sequential(            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        self.final_norm = RMSNorm(cfg["emb_dim"], eps=1e-5)        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False, dtype=cfg["dtype"])
    def forward(self, in_idx):        batch_size, seq_len = in_idx.shape        tok_embeds = self.tok_emb(in_idx)        x = tok_embeds        x = self.trf_blocks(x)        x = self.final_norm(x)        logits = self.out_head(x.to(torch.bfloat16))        return logits

2 初始化模型

现在，我们可以定义一个 Llama 3 配置文件（为便于比较，显示的是 Llama 2 配置文件）：

LLAMA2_CONFIG_7B = {"vocab_size": 32_000,    # Vocabulary size"context_length": 4096,  # Context length"emb_dim": 4096,         # Embedding dimension"n_heads": 32,           # Number of attention heads"n_layers": 32,          # Number of layers"hidden_dim": 11_008,    # Size of the intermediate dimension in FeedForward"dtype": torch.bfloat16  # Lower-precision dtype to save memory}

LLAMA3_CONFIG_8B = {"vocab_size": 128_256,   # NEW: Larger vocabulary size"context_length": 8192,  # NEW: Larger context length"emb_dim": 4096,         # Embedding dimension"n_heads": 32,           # Number of attention heads"n_layers": 32,          # Number of layers"hidden_dim": 14_336,    # NEW: Larger size of the intermediate dimension in FeedForward"n_kv_groups": 8,        # NEW: Key-Value groups for grouped-query attention"rope_base": 50_000,     # NEW: The base in RoPE's "theta" was increased to 50_000"rope_freq": None,       # NEW: Additional configuration for adjusting the RoPE frequencies"dtype": torch.bfloat16  # Lower-precision dtype to save memory}

使用这些设置，我们现在可以初始化 Llama 3 8B 模型。

请注意，这需要约 34 GB 内存（作为对比，Llama 2 7B 需要约 26 GB 内存）

model = Llama3Model(LLAMA3_CONFIG_8B)

total_params = sum(p.numel() for p in model.parameters())print(f"Total number of parameters: {total_params:,}")

Total number of parameters: 8,030,261,248

如上图所示，模型包含 80 亿个参数。此外，我们还可以使用下面的代码计算该模型的内存需求：

def model_memory_size(model, input_dtype=torch.float32):    total_params = 0    total_grads = 0    for param in model.parameters():        # Calculate total number of elements per parameter        param_size = param.numel()        total_params += param_size        # Check if gradients are stored for this parameter        if param.requires_grad:            total_grads += param_size
    # Calculate buffer size (non-parameters that require memory)    total_buffers = sum(buf.numel() for buf in model.buffers())
    # Size in bytes = (Number of elements) * (Size of each element in bytes)    # We assume parameters and gradients are stored in the same type as input dtype    element_size = torch.tensor(0, dtype=input_dtype).element_size()    total_memory_bytes = (total_params + total_grads + total_buffers) * element_size
    # Convert bytes to gigabytes    total_memory_gb = total_memory_bytes / (1024**3)
    return total_memory_gb
print(f"float32 (PyTorch default): {model_memory_size(model, input_dtype=torch.float32):.2f} GB")print(f"bfloat16: {model_memory_size(model, input_dtype=torch.bfloat16):.2f} GB")float32 (PyTorch default): 68.08 GBbfloat16: 34.04 GB

最后，如果适用，我们还可以将模型转移到 NVIDIA 或 Apple Silicon GPU 上：

if torch.cuda.is_available():    device = torch.device("cuda")elif torch.backends.mps.is_available():    device = torch.device("mps")else:    device = torch.device("cpu")
model.to(device);

3 加载 tokenizer

在本节中，我们将为模型加载 tokenizer。

Llama 2 使用了谷歌的 SentencePiece tokenizer ，而不是 OpenAI 基于 Tiktoken 库的 BPE tokenizer 。然而，Llama 3 恢复使用 Tiktoken 的 BPE tokenizer；具体来说，它使用的是具有扩展词汇的 GPT-4 tokenizer。我们可以在 Meta AI 的官方 Llama 3 存储库中找到最初的 Tiktoken 适配程序。

下面重写了 tokenizer 的代码，使其更易读，更适合本笔记本使用（但表现应该是相似的）：

import osfrom pathlib import Path
import tiktokenfrom tiktoken.load import load_tiktoken_bpe

class Tokenizer:    def __init__(self, model_path):        assert os.path.isfile(model_path), f"Model file {model_path} not found"        mergeable_ranks = load_tiktoken_bpe(model_path)        num_base_tokens = len(mergeable_ranks)
        self.special_tokens = {            "<|begin_of_text|>": 128000,            "<|end_of_text|>": 128001,            "<|start_header_id|>": 128006,            "<|end_header_id|>": 128007,            "<|eot_id|>": 128009,        }        self.special_tokens.update({            f"<|reserved_{i}|>": 128002 + i for i in range(256) if (128002 + i) not in self.special_tokens.values()        })
        self.model = tiktoken.Encoding(            name=Path(model_path).name,            pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",            mergeable_ranks=mergeable_ranks,            special_tokens=self.special_tokens        )

    def encode(self, text, bos=False, eos=False, allowed_special=set(), disallowed_special=()):        if bos:            tokens = [self.special_tokens["<|begin_of_text|>"]]        else:            tokens = []
        tokens += self.model.encode(text, allowed_special=allowed_special, disallowed_special=disallowed_special)
        if eos:            tokens.append(self.special_tokens["<|end_of_text|>"])        return tokens
    def decode(self, tokens):        return self.model.decode(tokens)

Meta AI 在 Hugging Face Hub 上共享了 Llama 3 模型的原始权重和 tokenizer 词库。

我们将首先从 Hub 下载 tokenizer 词库，并将其加载到上述代码中。请注意，Meta AI 要求你在下载文件前接受 Llama 3 许可条款；为此必须创建一个 Hugging Face Hub 账户，并访问 meta-llama/Meta-Llama-3-8B 存储库以接受条款。

接下来，需要创建一个访问 token；要生成一个具有「读取」权限的访问 token，请点击右上角的个人资料图片，然后点击「设置」。

然后，创建并复制访问 token，以便复制并粘贴到下一个代码单元中：

from huggingface_hub import loginimport jsonwith open("config.json", "r") as config_file:config = json.load(config_file)access_token = config["HF_ACCESS_TOKEN"]login(token=access_token)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.Token is valid (permission: read).Your token has been saved to /root/.cache/huggingface/tokenLogin successful

通过访问 token 登录（这是验证我们是否接受 Llama 3 许可条款所必需的）后，就可以下载 tokenizer 词库了：

from huggingface_hub import hf_hub_downloadtokenizer_file_path = hf_hub_download(repo_id="meta-llama/Meta-Llama-3-8B",filename="original/tokenizer.model",local_dir="llama3-files")

请注意，在使用 Llama 3 文件时，我们可能需要 blobfile 软件包，它用于处理存储在云存储解决方案（如 Google Cloud Storage (GCS)、Azure Blob Storage 或 Amazon S3）中的数据集或模型。

可以通过取消注释并执行下面的 pip 命令来安装此依赖包：

# pip install blobfile

tokenizer = Tokenizer(tokenizer_file_path)

现在，我们可以使用生成函数让 Llama 3 模型生成新文本：

from previous_chapters import generate, text_to_token_ids, token_ids_to_texttorch.manual_seed(123)token_ids = generate(model=model,idx=text_to_token_ids("Every effort", tokenizer).to(device),max_new_tokens=30,context_size=LLAMA3_CONFIG_8B["context_length"],top_k=1,temperature=0.)print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text: Every effort_dead aeros Ingredients başında.extension clangmissions.esp 사진 Ek Pars til DoctorsDaoеньostivan normal Ekized � Ekized � Ek rdr tık%,orgen>',

当然，正如我们在上面看到的，这段文字是毫无意义的，因为我们还没有训练过 Llama 3 模型。在下一节中，我们将从 Meta AI 中加载预训练的权重，而不是自己训练模型，因为这将花费数万至数十万美元。

4 加载预训练权重

我们将加载下面的「meta-llama/Meta-Llama-3-8B 」base 模型，它是微调前的简单文本补全模型。

或者，你也可以加载经过指令微调和对齐的「meta-llama/Meta-Llama-3-8B-Instruct」模型，方法是相应修改下一个代码单元中的字符串。加起来，权重文件大约有 16 GB 大。

from safetensors.torch import load_file
combined_weights = {}
for i in range(1, 5):    weights_file = hf_hub_download(        repo_id="meta-llama/Meta-Llama-3-8B",        filename=f"model-0000{i}-of-00004.safetensors",        local_dir="llama3-files"    )    current_weights = load_file(weights_file)    combined_weights.update(current_weights)model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

权重包含以下张量（为简单起见，只显示前 15 个张量）：

list(combined_weights.keys())[:15]

['model.embed_tokens.weight', 'model.layers.0.input_layernorm.weight', 'model.layers.0.mlp.down_proj.weight', 'model.layers.0.mlp.gate_proj.weight', 'model.layers.0.mlp.up_proj.weight', 'model.layers.0.post_attention_layernorm.weight', 'model.layers.0.self_attn.k_proj.weight', 'model.layers.0.self_attn.o_proj.weight', 'model.layers.0.self_attn.q_proj.weight', 'model.layers.0.self_attn.v_proj.weight', 'model.layers.1.input_layernorm.weight', 'model.layers.1.mlp.down_proj.weight', 'model.layers.1.mlp.gate_proj.weight', 'model.layers.1.mlp.up_proj.weight', 'model.layers.1.post_attention_layernorm.weight']

下面的函数仿照《Build a Large Language Model From Scratch》第 5 章（https://github.com/rasbt/LLMs-from-scratch/blob/0972ded5309c25dc5eecc98b62897d677c6c36c4/ch05/01_main-chapter-code/ch05.ipynb）中的 load_weights_into_gpt 函数，将预训练好的权重加载到 Llama 3 模型中：

def assign(left, right, tensor_name="unknown"):    if left.shape != right.shape:        raise ValueError(f"Shape mismatch in tensor '{tensor_name}'. Left: {left.shape}, Right: {right.shape}")
    if isinstance(right, torch.Tensor):        return torch.nn.Parameter(right.clone().detach())    else:        return torch.nn.Parameter(torch.tensor(right))

def load_weights_into_llama(model, param_config, params):    model.tok_emb.weight = assign(model.tok_emb.weight, params["model.embed_tokens.weight"], "model.embed_tokens.weight")
    for l in range(param_config["n_layers"]):
        # Load attention weights        model.trf_blocks[l].att.W_query.weight = assign(            model.trf_blocks[l].att.W_query.weight,            params[f"model.layers.{l}.self_attn.q_proj.weight"],            f"model.layers.{l}.self_attn.q_proj.weight"        )        model.trf_blocks[l].att.W_key.weight = assign(            model.trf_blocks[l].att.W_key.weight,            params[f"model.layers.{l}.self_attn.k_proj.weight"],            f"model.layers.{l}.self_attn.k_proj.weight"        )        model.trf_blocks[l].att.W_value.weight = assign(            model.trf_blocks[l].att.W_value.weight,            params[f"model.layers.{l}.self_attn.v_proj.weight"],            f"model.layers.{l}.self_attn.v_proj.weight"        )        model.trf_blocks[l].att.out_proj.weight = assign(            model.trf_blocks[l].att.out_proj.weight,            params[f"model.layers.{l}.self_attn.o_proj.weight"],            f"model.layers.{l}.self_attn.o_proj.weight"        )        model.trf_blocks[l].norm1.weight = assign(            model.trf_blocks[l].norm1.weight,            params[f"model.layers.{l}.input_layernorm.weight"],            f"model.layers.{l}.input_layernorm.weight"        )
        # Load FeedForward weights        model.trf_blocks[l].ff.fc1.weight = assign(            model.trf_blocks[l].ff.fc1.weight,            params[f"model.layers.{l}.mlp.gate_proj.weight"],            f"model.layers.{l}.mlp.gate_proj.weight"        )        model.trf_blocks[l].ff.fc2.weight = assign(            model.trf_blocks[l].ff.fc2.weight,            params[f"model.layers.{l}.mlp.up_proj.weight"],            f"model.layers.{l}.mlp.up_proj.weight"        )        model.trf_blocks[l].ff.fc3.weight = assign(            model.trf_blocks[l].ff.fc3.weight,            params[f"model.layers.{l}.mlp.down_proj.weight"],            f"model.layers.{l}.mlp.down_proj.weight"        )        model.trf_blocks[l].norm2.weight = assign(            model.trf_blocks[l].norm2.weight,            params[f"model.layers.{l}.post_attention_layernorm.weight"],            f"model.layers.{l}.post_attention_layernorm.weight"        )
    # Load output layer weights    model.final_norm.weight = assign(model.final_norm.weight, params["model.norm.weight"], "model.norm.weight")
    if "lm_head.weight" in params.keys():        model.out_head.weight = assign(model.out_head.weight, params["lm_head.weight"], "lm_head.weight")    else:        model.out_head.weight = assign(model.out_head.weight, params["model.embed_tokens.weight"], "model.embed_tokens.weight")        print("Model uses weight tying.")

load_weights_into_llama(model, LLAMA3_CONFIG_8B, combined_weights)model.to(device);del combined_weights  # free up memory

接下来，我们就可以使用该模型生成文本了：

torch.manual_seed(123)
token_ids = generate(    model=model,    idx=text_to_token_ids("Every effort", tokenizer).to(device),    max_new_tokens=25,    context_size=LLAMA3_CONFIG_8B["context_length"],    top_k=1,    temperature=0.)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))Output text: Every effort has been made to trace copyright holders and to obtain their permission for the use of copyright material. The publisher apologizes for any

5 使用指令微调模型

上面我们使用的是经过预训练的基础模型，如果你想使用一个能够遵循指令的模型，请使用「meta-llama/Llama-3-8b-Instruct」模型，如下所示：

# to free up memoryimport gcdel modelgc.collect()  # Run Python garbage collectorif torch.cuda.is_available():torch.cuda.empty_cache()

combined_weights = {}for i in range(1, 5):weights_file = hf_hub_download(repo_id="meta-llama/Meta-Llama-3-8B-Instruct",filename=f"model-0000{i}-of-00004.safetensors",local_dir="llama3-files")current_weights = load_file(weights_file)combined_weights.update(current_weights)model = Llama3Model(LLAMA3_CONFIG_8B)load_weights_into_llama(model, LLAMA3_CONFIG_8B, combined_weights)model.to(device)del combined_weights  # free up memory

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

请注意，Llama 3 模型最好与微调时使用的正确提示模板一起使用。

下面是一个基于 Meta AI 的 Llama 3 专用 ChatFormat 代码的 tokenizer wrapper 类，用于构建提示模板：

class ChatFormat:    def __init__(self, tokenizer):        self.tokenizer = tokenizer
    def encode_header(self, message):        tokens = []        tokens.append(self.tokenizer.special_tokens["<|start_header_id|>"])        tokens.extend(self.tokenizer.encode(message["role"], bos=False, eos=False))        tokens.append(self.tokenizer.special_tokens["<|end_header_id|>"])        tokens.extend(self.tokenizer.encode("\n\n", bos=False, eos=False))        return tokens
    def encode(self, text):        message = {            "role": "user",            "content": text        }
        tokens = self.encode_header(message)        tokens.extend(            self.tokenizer.encode(message["content"].strip(), bos=False, eos=False)        )        tokens.append(self.tokenizer.special_tokens["<|eot_id|>"])        return tokens
    def decode(self, token_ids):        return self.tokenizer.decode(token_ids)

chat_tokenizer = ChatFormat(tokenizer)

用法如下：

token_ids = chat_tokenizer.encode("Hello World!")print(token_ids)

[128006, 882, 128007, 271, 9906, 4435, 0, 128009]

tokenizer.decode(token_ids)

'<|start_header_id|>user<|end_header_id|>\n\nHello World!<|eot_id|>'

现在，让我们来看看 Llama 3 教学模式的实际应用：

import retorch.manual_seed(123)token_ids = generate(model=model,idx=text_to_token_ids("What do llamas eat?", chat_tokenizer).to(device),max_new_tokens=150,context_size=LLAMA3_CONFIG_8B["context_length"],top_k=1,temperature=0.)output_text = token_ids_to_text(token_ids, tokenizer)def clean_text(text, header_end="assistant<|end_header_id|>\n\n"):# Find the index of the first occurrence of "<|end_header_id|>"index = text.find(header_end)if index != -1:# Return the substring starting after "<|end_header_id|>"return text[index + len(header_end):].strip()  # Strip removes leading/trailing whitespaceelse:# If the token is not found, return the original textreturn textprint("Output text:\n", clean_text(output_text))

Output text: Llamas are herbivores, which means they primarily eat plants and plant-based foods. Here are some of the things llamas like to eat:
1. Grass: Llamas love to graze on grass, especially in the spring and summer months.2. Hay: Hay is a staple in a llama's diet. They like to eat timothy hay, alfalfa hay, and other types of hay.3. Grains: Llamas may also be fed grains like oats, barley, and corn. However, grains should not make up more than 10% of a llama's diet.4. Fruits and vegetables: Llamas may enjoy fruits and vegetables as treats, such as apples,

Llama 3.1 8B

在 Llama 3 发布几个月后，Meta AI 又推出了 Llama 3.1 模型套件（详见 Llama 3.1 官方介绍）。

方便的是，我们可以重复使用之前的 Llama 3 代码来实现 Llama 3.1 8B：

结构完全相同，唯一的变化是重新调整了 RoPE 频率，如下配置文件所示：

LLAMA3_CONFIG_8B = {"vocab_size": 128_256,   # Vocabulary size"context_length": 8192,  # Context length"emb_dim": 4096,         # Embedding dimension"n_heads": 32,           # Number of attention heads"n_layers": 32,          # Number of layers"hidden_dim": 14_336,    # Size of the intermediate dimension in FeedForward"n_kv_groups": 8,        # Key-Value groups for grouped-query attention"rope_base": 50_000,     # The base in RoPE's "theta""rope_freq": None,       # Additional configuration for adjusting the RoPE frequencies"dtype": torch.bfloat16  # Lower-precision dtype to save memory}LLAMA31_CONFIG_8B = {"vocab_size": 128_256,    # Vocabulary size"context_length": 8192,   # Context length"emb_dim": 4096,          # Embedding dimension"n_heads": 32,            # Number of attention heads"n_layers": 32,           # Number of layers"hidden_dim": 14_336,     # Size of the intermediate dimension in FeedForward"n_kv_groups": 8,         # Key-Value groups for grouped-query attention"rope_base": 50_000,      # The base in RoPE's "theta""dtype": torch.bfloat16,  # Lower-precision dtype to save memory"rope_freq": {            # NEW: RoPE frequency scaling"factor": 8.0,"low_freq_factor": 1.0,"high_freq_factor": 4.0,"original_context_length": 8192,}}

正如我们之前在代码中看到的，RoPE 方法使用正弦函数（正弦和余弦）将位置信息直接嵌入注意力机制中。

在 Llama 3.1 中，通过附加配置，我们对反向频率计算进行了额外调整。这些调整会影响不同频率成分对位置嵌入的贡献。

让我们在实践中试试 Llama 3.1 模型；首先，我们清除旧模型，以释放一些 GPU 内存：

# free up memorydel model
gc.collect()  # Run Python garbage collector
if java基础 传智播客毕向东 torch.cuda.is_available():    torch.cuda.empty_cache()

接下来，我们下载 tokenizer。

请注意，由于 Llama 3.1 系列与 Llama 3 系列不同，、必须访问 meta-llama/Llama-3.1-8Brepository，并确认许可条款，这样 Hugging Face 访问 token 才能在下载时起作用。

简单起见，我们在下面只加载 base 模型，但也有一个经过指令微调的版本，你可以将「meta-llama/Llama-3.1-8B」替换为「meta-llama/Llama-3.1-8B-Instruct」。

tokenizer_file_path = hf_hub_download(    repo_id="meta-llama/Llama-3.1-8B",    filename="original/tokenizer.model",    local_dir="llama3-files")
tokenizer = Tokenizer(tokenizer_file_path)model = Llama3Model(LLAMA31_CONFIG_8B)
total_params = sum(p.numel() for p in model.parameters())print(f"Total number of parameters: {total_params:,}")Total number of parameters: 8,030,261,248combined_weights = {}
for i in range(1, 5):    weights_file = hf_hub_download(        repo_id="meta-llama/Llama-3.1-8B",        filename=f"model-0000{i}-of-00004.safetensors",        local_dir="llama3-files"    )    current_weights = load_file(weights_file)    combined_weights.update(current_weights)
load_weights_into_llama(model, LLAMA31_CONFIG_8B, combined_weights)model.to(device);model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]torch.manual_seed(123)
token_ids = generate(    model=model,    idx=text_to_token_ids("Every effort", tokenizer).to(device),    max_new_tokens=25,    context_size=LLAMA31_CONFIG_8B["context_length"],    top_k=1,    temperature=0.)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))Output text: Every effort has been made to trace copyright holders and to obtain their permission for the use of copyright material. The publisher apologizes for any

Llama 3.2 1B

截至本文撰写之时，Meta AI 的最新模型是此处公布的 Llama 3.2 模型。

Llama 3.2 文本模型的代码与 Llama 3.1 相似，只是缩小了模型的大小（有 1B 和 3B 版本）。

另一个效率上的调整是，他们又增加了权重绑定（GPT-2 架构中最初使用的概念）；在这里，他们在输入（token）嵌入层和输出层中重复使用相同的权重参数值。

Llama 3.2 1B 的模型体积小，甚至可以在许多移动设备上运行，因此非常方便。

Llama 3.1 8B 和 Llama 3.2 1B 在结构上的差异如下图所示：

从上图可以看出，Llama 3.1 8B 和 Llama 3.2 1B 架构的主要区别在于各自的尺寸。

一个小的额外变化是增加了 RoPE rescaling 系数，这反映在下面的配置文件中：

LLAMA31_CONFIG_8B = {"vocab_size": 128_256,    # Vocabulary size"context_length": 8192,   # Context length"emb_dim": 4096,          # Embedding dimension"n_heads": 32,            # Number of attention heads"n_layers": 32,           # Number of layers"hidden_dim": 14_336,     # Size of the intermediate dimension in FeedForward"n_kv_groups": 8,         # Key-Value groups for grouped-query attention"rope_base": 50_000,      # The base in RoPE's "theta""dtype": torch.bfloat16,  # Lower-precision dtype to save memory"rope_freq": {          # RoPE frequency scaling"factor": 8.0,"low_freq_factor": 1.0,"high_freq_factor": 4.0,"original_context_length": 8192,}}LLAMA32_CONFIG_1B = {"vocab_size": 128_256,    # Vocabulary size"context_length": 8192,   # Context length"emb_dim": 2048,          # NEW: Half the embedding dimension"n_heads": 32,            # Number of attention heads"n_layers": 16,           # NEW: Half the number of layers"hidden_dim": 8192,      # NEW: Almopst half the size of the intermediate dimension in FeedForward"n_kv_groups": 8,         # Key-Value groups for grouped-query attention"rope_base": 50_000,      # The base in RoPE's "theta""dtype": torch.bfloat16,  # Lower-precision dtype to save memory"rope_freq": {            # RoPE frequency scaling"factor": 32.0,       # NEW: Adjustment of the rescaling factor"low_freq_factor": 1.0,"high_freq_factor": 4.0,"original_context_length": 8192,}}

下面，我们可以重复使用 Llama 3.1 8B 部分的代码来加载 Llama 3.2 1B 模型。

同样，由于 Llama 3.2 系列有别于 Llama 3.1 系列，因此必须访问 meta-llama/Llama-3.2-1B 软件源并确认许可条款。

简单起见，我们只在下面加载基本模型，但也有一个经过指令微调的版本，可以用「meta-llama/Llama-3.2-1B-Instruct」替换「meta-llama/Llama-3.2-1B」。

# free up memorydel modelgc.collect()  # Run Python garbage collectorif torch.cuda.is_available():torch.cuda.empty_cache()

tokenizer_file_path = hf_hub_download(repo_id="meta-llama/Llama-3.2-1B",filename="original/tokenizer.model",local_dir="llama32-files")tokenizer = Tokenizer(tokenizer_file_path)

model = Llama3Model(LLAMA32_CONFIG_1B)total_params = sum(p.numel() for p in model.parameters())print(f"Total number of parameters: {total_params:,}")# Account for weight tyingtotal_params_normalized = total_params - model.tok_emb.weight.numel()print(f"\nTotal number of unique parameters: {total_params_normalized:,}")

Total number of parameters: 1,498,482,688
Total number of unique parameters: 1,235,814,400

weights_file = hf_hub_download(repo_id="meta-llama/Llama-3.2-1B",filename=f"model.safetensors",local_dir="llama32-files")current_weights = load_file(weights_file)load_weights_into_llama(model, LLAMA32_CONFIG_1B, current_weights)model.to(device);

Model uses weight tying.

print("Weight tying:", torch.equal(model.tok_emb.weight, model.out_head.weight))

Weight tying: True

torch.manual_seed(123)token_ids = generate(model=model,idx=text_to_token_ids("Every effort", tokenizer).to(device),max_new_tokens=25,context_size=LLAMA32_CONFIG_1B["context_length"],top_k=1,temperature=0.)print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text: Every effort is made to ensure that the information on this website is accurate. However, we cannot guarantee that the information is accurate, complete

原文链接：

https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/07_gpt_to_llama/converting-llama2-to-llama3.ipynb

技术交流群邀请函

△长按添加小助手

扫描二维码添加小助手微信

请备注：姓名-学校/公司-研究方向（如：小张-哈工大-对话系统）即可申请加入自然语言处理/Pytorch等技术交流群

关于我们

MLNLP 社区是由国内外机器学习与自然语言处理学者联合构建的民间学术社区，目前已经发展为国内外知名的机器学习与自然语言处理社区，旨在促进机器学习，自然语言处理学术界、产业界和广大爱好者之间的进步。

社区可以为相关从业者的深造、就业及研究等方面提供开放交流平台。欢迎大家关注和加入我们。

上一篇： Java基础数组写售票减少

下一篇：有java基础学小程序

版权声明：
本文来源网络，所有图片文章版权属于原作者，如有侵权，联系删除。

本文网址：https://www.bianchenghao6.com/h6javajc/26047.html

java基础传智播客毕向东