Articles tagged Llama2 - 沉默的博客


Chen'mo


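Deploy Llama 2 7B/13B/70B on Amazon SageMaker

Llama 2 is the next version of LLaMA. It was trained on more data (2T tokens) and supports a context window of up to 4K tokens. Meta fine-tuned the chat models on over 1 million human annotations using reinforcement learning from human feedback (RLHF). In this blog post, I document how to deploy Llama 2 models to Amazon SageMaker.

The deployment uses the Hugging Face LLM DLC, a new purpose-built inference container that makes it easy to deploy LLMs in a secure, managed environment. The DLC is powered by Text Generation Inference (TGI), a scalable, optimized solution for deploying and serving large language models (LLMs). The post also covers the hardware requirements for the different model sizes.

The post walks through the following steps:

1. Set up the development environment
2. Retrieve the new Hugging Face LLM DLC
3. Hardware requirements
4. Deploy Llama 2 to Amazon SageMaker
5. Run inference and chat with the model
6. Clean up

1. Set up the development environment

I use the sagemaker Python SDK to deploy Llama 2 to Amazon SageMaker. Make sure you have an AWS account configured and the sagemaker Python SDK installed. If you do not have an account yet, you can sign up for AWS first.

    !pip install "sagemaker>=2.175.0" --upgrade --quiet

If you use SageMaker from a local environment, you need access to an IAM role with the permissions SageMaker requires. More information is available in the AWS documentation on SageMaker roles.

    import sagemaker
    import boto3

    sess = sagemaker.Session()
    # SageMaker session bucket: used for uploading data, models and logs.
    # SageMaker creates this bucket automatically if it does not exist.
    sagemaker_session_bucket = None
    if sagemaker_session_bucket is None and sess is not None:
        # fall back to the default bucket if no bucket name is given
        sagemaker_session_bucket = sess.default_bucket()

    try:
        role = sagemaker.get_execution_role()
    except ValueError:
        # outside SageMaker (e.g. on a local machine): look up the role by name
        iam = boto3.client('iam')
        role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

    sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

    print(f"sagemaker role arn: {role}")
    print(f"sagemaker session region: {sess.boto_region_name}")

2. Retrieve the new Hugging Face LLM DLC

Compared with deploying regular Hugging Face models, you first need to retrieve the container URI and pass it to the HuggingFaceModel model class, where image_uri points to the image. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, use the get_huggingface_llm_image_uri method provided by the sagemaker SDK. This method returns the URI of the desired Hugging Face LLM DLC based on the specified backend, session, region, and version. The list of available versions can be found in the sagemaker SDK repository.

    from sagemaker.huggingface import get_huggingface_llm_image_uri

    # retrieve the llm image uri
    llm_image = get_huggingface_llm_image_uri(
        "huggingface",
        version="0.9.3"
    )

    # print ecr image uri
    print(f"llm image uri: {llm_image}")

3. Hardware requirements

Llama 2 comes in 3 different sizes: 7B, 13B, and 70B parameters. The hardware requirements vary with the size of the model deployed to SageMaker. Below are the minimum requirements I tested for each model size.

Note: I have not tested GPTQ models yet.

| Model     | Instance type      | Quantization | Number of GPUs per replica |
|-----------|--------------------|--------------|----------------------------|
| Llama 7B  | (ml.)g5.2xlarge    | -            | 1                          |
| Llama 13B | (ml.)g5.12xlarge   | -            | 4                          |
| Llama 70B | (ml.)g5.48xlarge   | bitsandbytes | 8                          |
| Llama 70B | (ml.)p4d.24xlarge  | -            | 8                          |

Note: Amazon SageMaker currently does not support instance slicing, which means that, e.g. for Llama 70B, you cannot run multiple replicas on a single instance.

These are the minimum setups I have verified to work on SageMaker for the 7B, 13B, and 70B Llama 2 models. Over the coming weeks, I plan to run detailed benchmarks covering latency and throughput for different hardware configurations. For now, I do not recommend deploying Llama 70B to g5.48xlarge instances, because long requests run into SageMaker's 60-second request timeout limit. Use p4d instances to deploy Llama 70B.

It may be possible to run Llama 70B on g5.48xlarge instances without quantization by reducing the MAX_TOTAL_TOKENS and MAX_BATCH_TOTAL_TOKENS parameters. I have not tested this yet.

4. Deploy Llama 2 to Amazon SageMaker

To deploy meta-llama/Llama-2-13b-chat-hf to Amazon SageMaker, create a HuggingFaceModel model class and define the endpoint configuration, including the hf_model_id, instance_type, and so on. I use a g5.12xlarge instance type because it has 4 NVIDIA A10G GPUs and 96 GB of GPU memory.

Note: Llama 2 is a gated model. The access form on Hugging Face enables Llama 2 access there once Meta has granted you access. Before submitting the form, visit the Meta website and accept the license terms and the acceptable use policy. Requests are processed within 1-2 days.

    import json
    from sagemaker.huggingface import HuggingFaceModel

    # sagemaker config
    # (for Llama 70B, use "ml.p4d.24xlarge" with 8 GPUs and the 70b model id instead, see section 3)
    instance_type = "ml.g5.12xlarge"
    number_of_gpu = 4
    health_check_timeout = 300  # seconds

    # Define Model and Endpoint configuration parameter
    config = {
      'HF_MODEL_ID': "meta-llama/Llama-2-13b-chat-hf", # model_id from hf.co/models
      'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
      'MAX_INPUT_LENGTH': json.dumps(2048),  # Max length of input text
      'MAX_TOTAL_TOKENS': json.dumps(4096),  # Max length of the generation (including input text)
      'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),  # Limits the number of tokens that can be processed in parallel during the generation
      'HUGGING_FACE_HUB_TOKEN': "<REPLACE WITH YOUR TOKEN>"
      # ,'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize
    }

    # check if token is set
    assert config['HUGGING_FACE_HUB_TOKEN'] != "<REPLACE WITH YOUR TOKEN>", "Please set your Hugging Face Hub token"

    # create HuggingFaceModel with the image uri
    llm_model = HuggingFaceModel(
      role=role,
      image_uri=llm_image,
      env=config
    )

After creating the HuggingFaceModel, deploy it to Amazon SageMaker with the deploy method. The model is deployed on the ml.g5.12xlarge instance type. TGI automatically distributes and shards the model across all GPUs.

    # Deploy model to an endpoint
    # https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
    llm = llm_model.deploy(
      initial_instance_count=1,
      instance_type=instance_type,
      container_startup_health_check_timeout=health_check_timeout, # 300 s (5 minutes) for the container to load the model
    )

SageMaker now creates the endpoint and deploys the model to it. This can take 10-15 minutes.

5. Run inference and chat with the model

Once the endpoint is deployed, you can run inference on it with the predict method of the predictor. You can also pass different parameters to influence the generation. Parameters are defined in the parameters attribute of the payload. At the time of writing, TGI supports the following parameters:

- temperature: controls randomness in the model. Lower values make the model more deterministic and higher values make it more random. Default is 1.0.
- max_new_tokens: the maximum number of tokens to generate. Default is 20, maximum is 512.
- repetition_penalty: controls the likelihood of repetition. Default is null.
- seed: the seed to use for random generation. Default is null.
- stop: a list of tokens that stop the generation. Generation stops as soon as one of these tokens is produced.
- top_k: the number of highest-probability vocabulary tokens to keep for top-k filtering. Default is null, which disables top-k filtering.
- top_p: the cumulative probability of the highest-probability vocabulary tokens to keep for nucleus sampling. Default is null.
- do_sample: whether to use sampling; otherwise greedy decoding is used. Default is false.
- best_of: generate best_of sequences and return the one with the highest token logprobs. Default is null.
- details: whether to return detailed information about the generation. Default is false.
- return_full_text: whether to return the full text or only the generated part. Default is false.
- truncate: whether to truncate the input to the maximum length of the model. Default is true.
- typical_p: the typical probability of a token. Default is null.
- watermark: the watermark to use for generation. Default is false.

The OpenAPI specification of TGI is available in its Swagger documentation.

meta-llama/Llama-2-13b-chat-hf is a conversational chat model, which means you can chat with it using the following prompt format:

    <s>[INST] <<SYS>>
    {{ system_prompt }}
    <</SYS>>

    {{ user_msg_1 }} [/INST] {{ model_answer_1 }} </s><s>[INST] {{ user_msg_2 }} [/INST]

I created a small helper method, build_llama2_prompt, which converts a list of "messages" into the prompt format. I also define a system prompt that is used to start the conversation. We will use it to ask the model for some cool ideas for things to do in the summer.

    def build_llama2_prompt(messages):
        startPrompt = "<s>[INST] "
        endPrompt = " [/INST]"
        conversation = []
        for index, message in enumerate(messages):
            if message["role"] == "system" and index == 0:
                # the system prompt is only valid as the very first message
                conversation.append(f"<<SYS>>\n{message['content']}\n<</SYS>>\n\n")
            elif message["role"] == "user":
                conversation.append(message["content"].strip())
            else:
                # assistant turn: close the instruction and open the next one
                conversation.append(f" [/INST] {message['content'].strip()}</s><s>[INST] ")

        return startPrompt + "".join(conversation) + endPrompt

    messages = [
      { "role": "system","content": "You are a friendly and knowledgeable vacation planning assistant named Clara. Your goal is to have natural conversations with users to help them plan their perfect vacation. "}
    ]

Let's see whether the model can come up with some cool ideas for the summer.

    # define question and add to messages
    instruction = "What are some cool ideas to do in the summer?"
    messages.append({"role": "user", "content": instruction})
    prompt = build_llama2_prompt(messages)

    chat = llm.predict({"inputs":prompt})

    print(chat[0]["generated_text"][len(prompt):])

Now run inference with different parameters to influence the generation. Parameters are defined in the parameters attribute of the payload.

    # hyperparameters for llm
    payload = {
      "inputs":  prompt,
      "parameters": {
        "do_sample": True,
        "top_p": 0.6,
        "temperature": 0.9,
        "top_k": 50,
        "max_new_tokens": 512,
        "repetition_penalty": 1.03,
        "stop": ["</s>"]
      }
    }

    # send request to endpoint
    response = llm.predict(payload)

    print(response[0]["generated_text"][len(prompt):])

6. Clean up

To clean up, delete the model and the endpoint.

    llm.delete_model()
    llm.delete_endpoint()

Conclusion

Deploying Llama 2 on Amazon SageMaker provides a scalable and secure way to make use of LLMs. With just a few lines of code, the Hugging Face Inference DLC makes it easy for everyone to integrate powerful LLMs into their applications.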
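As a rough sanity check on the hardware requirements in section 3, the sketch below estimates how much GPU memory the model weights alone occupy on each instance type. It is a back-of-the-envelope calculation, not a measurement: the 2 bytes per parameter for fp16, the roughly 1 byte per parameter for 8-bit bitsandbytes quantization, the per-instance GPU memory sizes, and the helper names weight_memory_gb and headroom_gb are all assumptions introduced for illustration. KV cache and activation memory are ignored.

```python
# Rough estimate: do the model weights fit into the GPU memory of an instance?
# Assumptions (not from the original post): fp16 weights take 2 bytes per
# parameter, 8-bit quantized weights roughly 1 byte per parameter, and the
# GPU memory sizes below. KV cache and activations are NOT included.

GPU_MEMORY_GB = {
    "ml.g5.2xlarge": 1 * 24,    # 1x NVIDIA A10G
    "ml.g5.12xlarge": 4 * 24,   # 4x NVIDIA A10G (96 GB, as stated in the post)
    "ml.g5.48xlarge": 8 * 24,   # 8x NVIDIA A10G
    "ml.p4d.24xlarge": 8 * 40,  # 8x NVIDIA A100 40 GB
}


def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory needed for the weights alone, in GB."""
    return params_billion * bytes_per_param


def headroom_gb(instance_type: str, params_billion: float,
                bytes_per_param: float = 2.0) -> float:
    """GPU memory left for KV cache and activations after loading the weights."""
    return GPU_MEMORY_GB[instance_type] - weight_memory_gb(params_billion, bytes_per_param)


if __name__ == "__main__":
    setups = [
        ("Llama 7B", 7, "ml.g5.2xlarge", 2.0),
        ("Llama 13B", 13, "ml.g5.12xlarge", 2.0),
        ("Llama 70B (fp16)", 70, "ml.g5.48xlarge", 2.0),
        ("Llama 70B (8-bit)", 70, "ml.g5.48xlarge", 1.0),
        ("Llama 70B (fp16)", 70, "ml.p4d.24xlarge", 2.0),
    ]
    for name, size, instance, nbytes in setups:
        print(f"{name} on {instance}: "
              f"weights ~{weight_memory_gb(size, nbytes):.0f} GB, "
              f"headroom ~{headroom_gb(instance, size, nbytes):.0f} GB")
```

Under these assumptions, the fp16 weights of Llama 70B leave comparatively little headroom on a g5.48xlarge, which is consistent with the post's suggestion to either quantize or reduce MAX_TOTAL_TOKENS and MAX_BATCH_TOTAL_TOKENS on that instance type.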
- December 26, 2023