Llama2-13B-GPTQ seq-scheduler rollingbatch deployment guide¶
In this tutorial, you will deploy an LMI (Large Model Inference) container from the DLC (Deep Learning Containers) to SageMaker and run inference with it.
Please make sure the following permission is granted before running the notebook:
- SageMaker access
Step 1: Let's bump up SageMaker and import stuff¶
%pip install sagemaker --upgrade --quiet
import sagemaker
from sagemaker.djl_inference.model import DJLModel
role = sagemaker.get_execution_role() # execution role for the endpoint
session = sagemaker.session.Session() # sagemaker session for interacting with different AWS APIs
Step 2: Start building SageMaker endpoint¶
In this step, we will build a SageMaker endpoint from scratch.
Getting the container image URI (optional)¶
Check out available images: Large Model Inference available DLC
# Choose a specific version of LMI image directly:
# image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124"
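The pinned URI in the comment above follows the usual ECR pattern of account, region, repository, and tag. As a minimal sketch, assuming you only want to swap the region or tag, the pieces can be assembled as below. The helper name `lmi_image_uri` is hypothetical, and the default account ID and repository simply mirror the commented example; check the image list to confirm which account and tags are valid for your region.

```python
def lmi_image_uri(region, tag, account="763104351884", repository="djl-inference"):
    """Assemble an ECR image URI in the same shape as the commented example.

    The defaults are copied from that example and are an assumption for
    any region other than us-west-2.
    """
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Reproduces the URI from the comment above.
image_uri = lmi_image_uri("us-west-2", "0.28.0-lmi10.0.0-cu124")
print(image_uri)
```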
Create SageMaker model¶
Here we use the LMI PySDK to create the model.
Check out more configuration options.
model_id = "TheBloke/Llama-2-13B-GPTQ" # model will be downloaded from the Hugging Face Hub
env = {
"TENSOR_PARALLEL_DEGREE": "1", # use 1 GPU, set to "max" to use all GPUs on the instance
"OPTION_ROLLING_BATCH": "auto", # optional, enabled by default
"OPTION_TRUST_REMOTE_CODE": "true",
}
model = DJLModel(
model_id=model_id,
env=env,
role=role)
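Every value in `env` is written as a string because SageMaker passes the container environment as a string-to-string map. If you prefer to keep settings as native Python values, a small conversion step can produce the same dictionary. This is only a sketch; the helper name `normalize_env` is hypothetical and is not part of the SageMaker SDK.

```python
def normalize_env(settings):
    """Convert native Python values into the string form used in env."""
    env = {}
    for key, value in settings.items():
        if isinstance(value, bool):
            # Booleans become lowercase "true"/"false", matching the cell above.
            env[key] = "true" if value else "false"
        else:
            env[key] = str(value)
    return env


# Produces the same dictionary as the hand-written env above.
env = normalize_env({
    "TENSOR_PARALLEL_DEGREE": 1,
    "OPTION_ROLLING_BATCH": "auto",
    "OPTION_TRUST_REMOTE_CODE": True,
})
print(env)
```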
Create SageMaker endpoint¶
You need to specify the instance type to use and the endpoint name.
instance_type = "ml.g5.2xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model")
predictor = model.deploy(initial_instance_count=1,
instance_type=instance_type,
endpoint_name=endpoint_name,
# container_startup_health_check_timeout=3600,
)
Step 3: Run inference¶
predictor.predict(
{"inputs": "def hello_world():", "parameters": {"max_new_tokens":128, "do_sample":"true"}}
)
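The request is a plain JSON document with an `inputs` string and a `parameters` object, so the same body can be prepared for any client that posts `application/json`. The sketch below only builds and serializes the body; the helper name `build_payload` is hypothetical. Note that it passes `do_sample` as a JSON boolean, as the benchmark request later in this notebook does, rather than the string `"true"` used in the cell above.

```python
import json


def build_payload(prompt, max_new_tokens=128, do_sample=True):
    """Build a request body with the same shape as the predict() call above."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "do_sample": do_sample,
        },
    }


# Serialize the body exactly as it would be sent over HTTP.
body = json.dumps(build_payload("def hello_world():"))
print(body)
```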
Benchmark¶
This step can also be run outside the notebook, in a bash terminal. awscurl is the benchmark tool used here; it connects to the server through the SageMaker endpoint URL and can be downloaded with:
curl -O https://publish.djl.ai/awscurl/0.28.0/awscurl && chmod +x awscurl
%%sh
curl -O https://publish.djl.ai/awscurl/awscurl && chmod +x awscurl
endpoint_url = f"https://runtime.sagemaker.{session.boto_region_name}.amazonaws.com/endpoints/{endpoint_name}/invocations"
endpoint_url
!TOKENIZER=codellama/CodeLlama-34b-hf ./awscurl -c 4 -N 10 -n sagemaker {endpoint_url} \
-H "Content-type: application/json" \
-d '{{"inputs":"The new movie that got Oscar this year","parameters":{{"max_new_tokens":256, "do_sample":true, "temperature":0.8, "top_k":5}}}}' \
-t
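The doubled braces in the `-d` argument are there because the notebook expands `{endpoint_url}` in `!` commands, so literal JSON braces have to be escaped as `{{` and `}}`. Python's `str.format` follows the same escaping convention, which makes it a convenient way to check what the shell will receive. This is a sketch by analogy rather than the notebook's actual expansion code, and the endpoint name in the URL is a made-up placeholder.

```python
import json

# Hypothetical URL for illustration; the notebook builds the real one above.
endpoint_url = "https://runtime.sagemaker.us-west-2.amazonaws.com/endpoints/lmi-model-example/invocations"

# Same escaping as the notebook cell: {name} is substituted, {{ and }} become literal braces.
template = (
    "./awscurl -c 4 -N 10 -n sagemaker {endpoint_url} "
    "-d '{{\"inputs\":\"The new movie that got Oscar this year\","
    "\"parameters\":{{\"max_new_tokens\":256, \"do_sample\":true, "
    "\"temperature\":0.8, \"top_k\":5}}}}'"
)
command = template.format(endpoint_url=endpoint_url)

# Pull the request body back out and confirm it is valid JSON after expansion.
request_body = command.split("-d ", 1)[1].strip("'")
parsed_body = json.loads(request_body)
print(parsed_body)
```

If the braces were left single, the expansion step would try to treat the JSON keys as variable names instead of passing them through.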
Clean up the environment¶
session.delete_endpoint(endpoint_name)
session.delete_endpoint_config(endpoint_name)
model.delete_model()