Deploying Llama 2 7B/13B/70B on Amazon SageMaker

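Llama 2 is the successor to LLaMA. It was trained on more data (2T tokens) and supports a context window of up to 4K tokens. Meta fine-tuned the chat models with reinforcement learning from human feedback, using more than 1 million human annotations.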

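In this blog post, I document how to deploy Llama 2 models to Amazon SageMaker. I use the Hugging Face LLM DLC, a new purpose-built inference container that makes it easy to deploy LLMs in a secure and managed environment. The DLC is powered by Text Generation Inference (TGI), a scalable, optimized solution for deploying and serving large language models (LLMs). The post also covers the hardware requirements for the different model sizes.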

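The post walks through deploying Llama 2 in the following steps:

1. Set up the development environment
2. Retrieve the new Hugging Face LLM DLC
3. Hardware requirements
4. Deploy Llama 2 to Amazon SageMaker
5. Run inference and chat with the model
6. Clean up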

1. Set up the development environment

I use the sagemaker Python SDK to deploy Llama 2 to Amazon SageMaker. Make sure you have an AWS account configured and the sagemaker Python SDK installed. If you do not have an AWS account yet, sign up for one first.

!pip install "sagemaker>=2.175.0" --upgrade --quiet

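If you are using SageMaker from a local environment, you need access to an IAM role with the permissions SageMaker requires. The AWS documentation on SageMaker roles has more details.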

import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

2. Retrieve the new Hugging Face LLM DLC

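Compared to deploying regular Hugging Face models, you first need to retrieve the container URI and pass it to the HuggingFaceModel model class, with image_uri pointing to the image. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, use the get_huggingface_llm_image_uri method provided by the sagemaker SDK. This method returns the URI of the desired Hugging Face LLM DLC based on the specified backend, session, region, and version. The available versions are listed in the sagemaker SDK repository.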

from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.9.3"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

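3. Hardware requirements

Llama 2 comes in 3 sizes: 7B, 13B, and 70B parameters. The hardware requirements depend on the size of the model you deploy to SageMaker. Below are the minimum requirements I tested for each model size.

Note: I have not tested GPTQ models yet.

Model	Instance type	Quantization	Number of GPUs per replica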
Llama 7B	(ml.)g5.2xlarge	-	1
Llama 13B	(ml.)g5.12xlarge	-	4
Llama 70B	(ml.)g5.48xlarge	bitsandbytes	8
Llama 70B	(ml.)p4d.24xlarge	-	8

Note: Amazon SageMaker currently does not support instance slicing. For Llama 70B, for example, you cannot run multiple replicas on a single instance.

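These are the minimum setups I have validated for the 7B, 13B, and 70B Llama 2 models on SageMaker. Over the coming weeks, I plan to run detailed benchmarks covering latency and throughput for different hardware configurations. Deploying Llama 70B to g5.48xlarge instances is currently not recommended, because SageMaker's 60-second request timeout causes long requests to time out. Use p4d instances to deploy Llama 70B.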
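To make the 60-second limit concrete, here is a back-of-the-envelope sketch. It ignores prompt processing and queueing time, and the 5 tokens per second used in the second call is a hypothetical throughput, not a measurement. The function names are illustrative.

```python
def min_tokens_per_second(max_new_tokens, timeout_seconds=60.0):
    """Throughput needed to finish a generation before the request times out."""
    return max_new_tokens / timeout_seconds


def max_tokens_within_timeout(tokens_per_second, timeout_seconds=60.0):
    """Largest number of new tokens that fits into the timeout at a given throughput."""
    return int(tokens_per_second * timeout_seconds)


# A full 512-token generation needs at least this many tokens per second.
print(f"{min_tokens_per_second(512):.2f} tokens/s")

# Hypothetical endpoint that sustains 5 tokens/s for a single request.
print(max_tokens_within_timeout(5.0), "tokens")
```

If an endpoint cannot sustain the required throughput, lowering max_new_tokens in the request keeps it under the timeout.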

By reducing the MAX_TOTAL_TOKENS and MAX_BATCH_TOTAL_TOKENS parameters, it may be possible to run Llama 70B on a g5.48xlarge instance without quantization. I have not tested this yet.
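The table can be sanity-checked with a rough estimate of the memory taken by the weights alone. The sketch below assumes 2 bytes per parameter for fp16 weights and 1 byte per parameter for 8-bit bitsandbytes quantization, and uses 24 GB per A10G GPU (g5) and 40 GB per A100 GPU (p4d). It ignores the KV cache and activations, so real usage is higher. All names are illustrative.

```python
# Total GPU memory per instance in GB (GPU count x memory per GPU).
INSTANCE_GPU_MEMORY_GB = {
    "ml.g5.2xlarge": 1 * 24,
    "ml.g5.12xlarge": 4 * 24,
    "ml.g5.48xlarge": 8 * 24,
    "ml.p4d.24xlarge": 8 * 40,
}


def weight_memory_gb(params_billion, bytes_per_param=2):
    """Approximate memory for the weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


# (model, parameters in billions, bytes per parameter, instance) from the table above.
setups = [
    ("Llama 7B", 7, 2, "ml.g5.2xlarge"),
    ("Llama 13B", 13, 2, "ml.g5.12xlarge"),
    ("Llama 70B (8-bit)", 70, 1, "ml.g5.48xlarge"),
    ("Llama 70B", 70, 2, "ml.p4d.24xlarge"),
]

for model, params, bytes_per_param, instance in setups:
    weights = weight_memory_gb(params, bytes_per_param)
    total = INSTANCE_GPU_MEMORY_GB[instance]
    print(f"{model}: ~{weights} GB of weights on {instance} ({total} GB total)")
```

By this estimate, the 13B weights (about 26 GB) do not fit on a single 24 GB GPU, which matches the multi-GPU instance in the table. Unquantized 70B weights (about 140 GB) fit into the 192 GB of a g5.48xlarge but leave limited room for the KV cache, which is consistent with the suggestion to lower MAX_TOTAL_TOKENS and MAX_BATCH_TOTAL_TOKENS.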

4. Deploy Llama 2 to Amazon SageMaker

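To deploy a Llama 2 chat model to Amazon SageMaker, create a HuggingFaceModel model class and define the endpoint configuration, including HF_MODEL_ID, instance_type, and so on. The code below deploys meta-llama/Llama-2-70b-chat-hf on an ml.p4d.24xlarge instance, which has 8 NVIDIA A100 GPUs and 320 GB of GPU memory. For meta-llama/Llama-2-13b-chat-hf, a g5.12xlarge instance with 4 NVIDIA A10G GPUs and 96 GB of GPU memory is sufficient; adjust HF_MODEL_ID, instance_type, and number_of_gpu accordingly.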

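Note: Llama 2 access on Hugging Face is enabled through the access request form on the model page, after you have been granted access by Meta. Before submitting the form, visit the Meta website and accept the license terms and the acceptable use policy. Requests are processed within 1-2 days.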

import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.p4d.24xlarge"
number_of_gpu = 8
health_check_timeout = 300

# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "meta-llama/Llama-2-70b-chat-hf", # model_id from hf.co/models
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(2048),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(4096),  # Max length of the generation (including input text)
  'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),  # Limits the number of tokens that can be processed in parallel during the generation
  'HUGGING_FACE_HUB_TOKEN': "<REPLACE WITH YOUR TOKEN>"
  # ,'HF_MODEL_QUANTIZE': "bitsandbytes", # uncomment to quantize
}

# check if token is set
assert config['HUGGING_FACE_HUB_TOKEN'] != "<REPLACE WITH YOUR TOKEN>", "Please set your Hugging Face Hub token"

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)
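The three token limits in the config depend on each other. As far as I understand the TGI launcher, MAX_INPUT_LENGTH has to be smaller than MAX_TOTAL_TOKENS, and MAX_TOTAL_TOKENS must not exceed MAX_BATCH_TOTAL_TOKENS. Treat these rules as an assumption and check them against the TGI version you deploy. The helper check_tgi_limits below is a hypothetical pre-flight check and is not part of the sagemaker SDK.

```python
import json


def check_tgi_limits(env):
    """Return a list of problems found in the TGI token-limit settings."""
    # The env values are JSON-encoded strings, as produced by json.dumps above.
    max_input = int(json.loads(env["MAX_INPUT_LENGTH"]))
    max_total = int(json.loads(env["MAX_TOTAL_TOKENS"]))
    max_batch = int(json.loads(env["MAX_BATCH_TOTAL_TOKENS"]))

    problems = []
    if max_input >= max_total:
        problems.append("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    if max_total > max_batch:
        problems.append("MAX_TOTAL_TOKENS must not exceed MAX_BATCH_TOTAL_TOKENS")
    return problems


example_env = {
    "MAX_INPUT_LENGTH": json.dumps(2048),
    "MAX_TOTAL_TOKENS": json.dumps(4096),
    "MAX_BATCH_TOTAL_TOKENS": json.dumps(8192),
}

print(check_tgi_limits(example_env))  # → []
```

With these values, a request that uses the full 2048-token input still has 4096 - 2048 = 2048 tokens left for generation.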

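After creating the HuggingFaceModel, deploy it to Amazon SageMaker with the deploy method. The model is deployed on the instance type set in instance_type above (ml.p4d.24xlarge). TGI automatically distributes and shards the model across all GPUs.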

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout, # seconds allowed for loading the model (300 s = 5 minutes)
)

SageMaker now creates the endpoint and deploys the model to it. This can take 10-15 minutes.

5. Run inference and chat with the model

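Once the endpoint is deployed, you can run inference on it with the predictor's predict method. You can also pass different parameters to influence the generation. Parameters are defined in the parameters attribute of the payload. As of today, TGI supports the following parameters: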

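temperature: Controls randomness in the model. Lower values make the model more deterministic, higher values make it more random. Defaults to 1.0.
max_new_tokens: The maximum number of tokens to generate. Defaults to 20; the maximum is 512.
repetition_penalty: Controls the likelihood of repetition. Defaults to null.
seed: The seed used for random generation. Defaults to null.
stop: A list of tokens that stop the generation. Generation stops as soon as one of these tokens is generated.
top_k: The number of highest-probability vocabulary tokens to keep for top-k filtering. Defaults to null, which disables top-k filtering.
top_p: The cumulative probability of the highest-probability vocabulary tokens to keep for nucleus sampling. Defaults to null.
do_sample: Whether to use sampling; otherwise greedy decoding is used. Defaults to false.
best_of: Generate best_of sequences and return the one with the highest token logprobs. Defaults to null.
details: Whether to return details about the generation. Defaults to false.
return_full_text: Whether to return the full text or only the generated part. Defaults to false.
truncate: Whether to truncate the input to the model's maximum length. Defaults to true.
typical_p: The typical probability of a token. Defaults to null.
watermark: The watermark to use during generation. Defaults to false.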

The OpenAPI specification of TGI is available in its swagger documentation.
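To build some intuition for temperature and top_p, the following sketch applies both to a toy list of three logits. It is a simplified illustration in plain Python, not TGI's actual implementation, and the function names are my own.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Indices of the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.0]
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

base = softmax_with_temperature(logits)
print(top_p_filter(base, 0.6))  # → [0]
print(top_p_filter(base, 0.9))  # → [0, 1]
```

At temperature 0.5 most of the probability mass sits on the first token, while at 2.0 it is spread more evenly. A smaller top_p restricts sampling to fewer candidate tokens.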

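The Llama 2 chat models, such as the meta-llama/Llama-2-70b-chat-hf model deployed above, are conversational models, which means you can chat with them using the following prompt format: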

<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_msg_1 }} [/INST] {{ model_answer_1 }} </s><s>[INST] {{ user_msg_2 }} [/INST]

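I created a small helper method, build_llama2_prompt, which converts a list of "messages" into the prompt format. I also define a system message that starts the conversation. I will then ask the model for some cool things to do in the summer.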

def build_llama2_prompt(messages):
    startPrompt = "<s>[INST] "
    endPrompt = " [/INST]"
    conversation = []
    for index, message in enumerate(messages):
        if message["role"] == "system" and index == 0:
            conversation.append(f"<<SYS>>\n{message['content']}\n<</SYS>>\n\n")
        elif message["role"] == "user":
            conversation.append(message["content"].strip())
        else:
            conversation.append(f" [/INST] {message['content'].strip()}</s><s>[INST] ")

    return startPrompt + "".join(conversation) + endPrompt

messages = [
  { "role": "system","content": "You are a friendly and knowledgeable vacation planning assistant named Clara. Your goal is to have natural conversations with users to help them plan their perfect vacation. "}
]
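To see what the helper produces for a multi-turn conversation, the sketch below repeats its logic (so the snippet runs on its own) and renders a short made-up dialogue. The example messages are illustrative only.

```python
def build_llama2_prompt(messages):
    # Same logic as the helper above, repeated so this snippet is self-contained.
    start_prompt = "<s>[INST] "
    end_prompt = " [/INST]"
    conversation = []
    for index, message in enumerate(messages):
        if message["role"] == "system" and index == 0:
            conversation.append(f"<<SYS>>\n{message['content']}\n<</SYS>>\n\n")
        elif message["role"] == "user":
            conversation.append(message["content"].strip())
        else:
            # Assistant turns close the current instruction and open the next one.
            conversation.append(f" [/INST] {message['content'].strip()}</s><s>[INST] ")
    return start_prompt + "".join(conversation) + end_prompt


example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello, how can I help?"},
    {"role": "user", "content": "Suggest a summer activity."},
]

example_prompt = build_llama2_prompt(example_messages)
print(example_prompt)
```

The printed prompt starts with the system block, followed by "Hi! [/INST] Hello, how can I help?</s><s>[INST] Suggest a summer activity. [/INST]", so the model continues with the answer to the last user message.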

Let's see if the model can come up with some cool ideas for the summer.

# define question and add to messages
instruction = "What are some cool ideas to do in the summer?"
messages.append({"role": "user", "content": instruction})
prompt = build_llama2_prompt(messages)

chat = llm.predict({"inputs":prompt})

print(chat[0]["generated_text"][len(prompt):])

Now run inference with different parameters to influence the generation. The parameters are defined in the parameters attribute of the payload.

# hyperparameters for llm
payload = {
  "inputs":  prompt,
  "parameters": {
    "do_sample": True,
    "top_p": 0.6,
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 512,
    "repetition_penalty": 1.03,
    "stop": ["</s>"]
  }
}

# send request to endpoint
response = llm.predict(payload)

print(response[0]["generated_text"][len(prompt):])
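To keep a conversation going, each model reply has to be appended to the message list before the next prompt is built. The sketch below shows one way to do that. chat_turn and extract_reply are hypothetical helpers, and the two fake_* functions are stand-ins so the snippet runs without an endpoint.

```python
def extract_reply(response, prompt):
    """Return only the newly generated text from a TGI-style response."""
    text = response[0]["generated_text"]
    # Strip the prompt only if the endpoint echoed it back.
    return text[len(prompt):] if text.startswith(prompt) else text


def chat_turn(predict, build_prompt, messages, user_message, parameters=None):
    """Send one user message and record both sides of the turn in messages."""
    messages.append({"role": "user", "content": user_message})
    prompt = build_prompt(messages)
    payload = {"inputs": prompt}
    if parameters:
        payload["parameters"] = parameters
    reply = extract_reply(predict(payload), prompt).strip()
    messages.append({"role": "assistant", "content": reply})
    return reply


# Stand-ins for llm.predict and build_llama2_prompt, for demonstration only.
def fake_predict(payload):
    return [{"generated_text": payload["inputs"] + " Go hiking."}]


def fake_build_prompt(messages):
    return " | ".join(m["content"] for m in messages)


history = []
print(chat_turn(fake_predict, fake_build_prompt, history, "Any summer ideas?"))
print(len(history))  # → 2
```

Against the deployed endpoint, the call would look like chat_turn(llm.predict, build_llama2_prompt, messages, "And for the winter?", payload["parameters"]).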

6. Clean up

To clean up, delete the model and the endpoint.

llm.delete_model()
llm.delete_endpoint()

Conclusion

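Deploying Llama 2 on Amazon SageMaker provides a scalable and secure way to use LLMs. With just a few lines of code, the Hugging Face LLM DLC makes it easy for anyone to integrate powerful LLMs into their applications.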