%pip install sagemaker boto3 awscli --upgrade --quiet
import sagemaker
from sagemaker.djl_inference.model import DJLModel
role = sagemaker.get_execution_role() # execution role for the endpoint
session = sagemaker.session.Session() # sagemaker session for interacting with different AWS APIs
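If you run this notebook outside of SageMaker (for example, on a local machine), get_execution_role() raises an error; in that case, set role to the ARN of an IAM role with SageMaker permissions (the ARN below is a placeholder):

# role = "arn:aws:iam::111122223333:role/MySageMakerExecutionRole"  # placeholder ARN, replace with your own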
Step 2: Start building the SageMaker endpoint
In this step, we will build a SageMaker endpoint from scratch.
Getting the container image URI (optional)
Check out the available images: Large Model Inference available DLC
# Choose a specific version of LMI image directly:
image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.28.0-neuronx-sdk2.18.2"
Create the SageMaker model
Here we use the LMI PySDK (the DJLModel class from the SageMaker Python SDK) to create the model.
Check out more configuration options.
# model_id = "s3://YOUR_BUCKET"  # or download the model from your own S3 bucket
model_id = "TheBloke/Llama-2-70B-Chat-fp16"  # the model will be downloaded from the Hugging Face Hub
draft_model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
env = {
    "TENSOR_PARALLEL_DEGREE": "24",  # shard across all 24 NeuronCores on an ml.inf2.48xlarge
    "OPTION_ROLLING_BATCH": "auto",  # enable rolling (continuous) batching
    "OPTION_MODEL_LOADING_TIMEOUT": "3600",  # allow up to an hour for model download/compilation
    "OPTION_SPECULATIVE_DRAFT_MODEL": draft_model_id,  # draft model used for speculative decoding
    "OPTION_N_POSITIONS": "1024",  # maximum sequence length (prompt + generated tokens)
    "OPTION_ROLLING_BATCH_SIZE": "1",  # number of concurrent requests in the rolling batch
    "OPTION_ENTRYPOINT": "djl_python.transformers_neuronx",  # transformers-neuronx handler
}
model = DJLModel(
    model_id=model_id,
    image_uri=image_uri,  # choose a specific version of the LMI DLC image
    env=env,
    role=role,
)
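The same configuration can also be shipped as a serving.properties file alongside your model artifacts instead of environment variables; each OPTION_* environment variable corresponds to a lower-cased option.* key. A minimal sketch of the equivalent file (not needed when you pass env as above):

engine=Python
option.entryPoint=djl_python.transformers_neuronx
option.tensor_parallel_degree=24
option.rolling_batch=auto
option.model_loading_timeout=3600
option.speculative_draft_model=TinyLlama/TinyLlama-1.1B-Chat-v1.0
option.n_positions=1024
option.rolling_batch_size=1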
Create the SageMaker endpoint
You need to specify the instance type to use and an endpoint name.
instance_type = "ml.inf2.48xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model")
predictor = model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=3600,
    volume_size=512,
)
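deploy() blocks until the endpoint is up by default; you can still confirm its status explicitly through the SageMaker client:

# Optional: verify the endpoint reached the InService state
status = session.sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
print(status)  # expect "InService"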
Step 3: Test and benchmark the inference
%%timeit -n3 -r1
predictor.predict(
    {"inputs": "Large model inference is", "parameters": {}}
)
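You can also invoke the endpoint without the predictor object, for example from a separate client process, using the low-level boto3 runtime API. A minimal sketch; max_new_tokens is assumed to be one of the generation parameters the container accepts:

import json
import boto3

# Invoke the endpoint through the SageMaker runtime client
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps({"inputs": "Large model inference is",
                     "parameters": {"max_new_tokens": 128}}),
)
print(response["Body"].read().decode("utf-8"))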
Clean up the environment
session.delete_endpoint(endpoint_name)
session.delete_endpoint_config(endpoint_name)
model.delete_model()