## Table of Contents
- Overview - Large Model Inference (LMI) Containers
- QuickStart
- Sample Notebooks
- Starting Guide
- Advanced Deployment Guide
- Supported LMI Inference Libraries
## Overview - Large Model Inference (LMI) Containers
LMI containers are a set of high-performance Docker containers purpose-built for large language model (LLM) inference. With these containers, you can leverage high-performance open-source inference libraries like vLLM, TensorRT-LLM, DeepSpeed, and Transformers NeuronX to deploy LLMs on AWS SageMaker endpoints. These containers bundle a model server together with open-source inference libraries to deliver an all-in-one LLM serving solution. We provide quick start notebooks that get you deploying popular open-source models in minutes, and advanced guides to help you maximize the performance of your endpoint.
LMI containers provide many features, including:
- Optimized inference performance for popular model architectures like Llama, Bloom, Falcon, T5, Mixtral, and more
- Integration with open source inference libraries like vLLM, TensorRT-LLM, DeepSpeed, and Transformers NeuronX
- Continuous Batching for maximizing throughput at high concurrency
- Token Streaming
- Quantization through AWQ, GPTQ, and SmoothQuant
- Multi GPU inference using Tensor Parallelism
- Serving LoRA fine-tuned models
LMI containers provide these features through integrations with popular inference libraries.
A unified configuration format enables users to easily leverage the latest optimizations and technologies across libraries.
We will refer to each of these libraries as a backend throughout the documentation.
The term backend refers to the combination of an Engine (LMI uses the Python and MPI Engines) and an inference library.
You can learn more about the components of LMI here.
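As a concrete illustration, backends are typically selected through a `serving.properties` file placed alongside the model artifacts. The keys below follow LMI's documented `option.*` convention, but treat the specific values (model id, parallelism degree) as an example sketch rather than a canonical configuration:

```properties
# Example serving.properties for an LMI deployment (values are illustrative)
engine=Python
option.model_id=mistralai/Mistral-7B-v0.1
option.rolling_batch=vllm
option.tensor_parallel_degree=4
```

Here `option.rolling_batch` picks the backend used for continuous batching, and `option.tensor_parallel_degree` shards the model across GPUs.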
## QuickStart
Our recommended progression for the LMI documentation is as follows:
- Sample Notebooks: We provide sample notebooks for popular models that can be run as-is. This is the quickest way to start deploying models with LMI.
- Starting Guide: The starter guide describes a simplified UX for configuring LMI containers. This UX is applicable across all LMI containers, and focuses on the most important configurations available for tuning performance.
- Advanced Deployment Guide: The deployment guide is an advanced guide for users who want to squeeze the most performance out of LMI. It is intended for users aiming to deploy LLMs in a production setting, using a specific backend.
## Sample Notebooks
The following table provides notebooks that demonstrate how to deploy popular open-source LLMs using LMI containers on SageMaker. If this is your first time using LMI, or you want a starting point for deploying a specific model, we recommend following the notebooks below. All of the samples below are hosted in the SageMaker Generative AI Hosting Examples repository, which is continuously updated with new examples.
Model | Instance Type | Sample Notebook
---|---|---
Llama-2-7b | ml.g5.2xlarge | notebook
Llama-2-13b | ml.g5.12xlarge | notebook
Llama-2-70b | ml.p4d.24xlarge | notebook
Mistral-7b | ml.g5.2xlarge | notebook
Mixtral-8x7b | ml.p4d.24xlarge | notebook
Flan-T5-XXL | ml.g5.12xlarge | notebook
CodeLlama-13b | ml.g5.48xlarge | notebook
Falcon-7b | ml.g5.2xlarge | notebook
Falcon-40b | ml.g5.48xlarge | notebook
Note: Some models in the table above are available from multiple providers. We link to the specific model we tested with, but we expect the same model from a different provider (or a fine-tuned variant) to work as well.
## Starting Guide
The starting guide is our recommended introduction for all users. This guide provides a simplified UX through a reduced set of configurations that are applicable to all LMI containers starting from v0.27.0.
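To make the simplified UX concrete, the sketch below builds the environment-variable form of an LMI configuration, which can then be passed as the `environment` of a SageMaker model. The variable names (`HF_MODEL_ID`, `TENSOR_PARALLEL_DEGREE`, `OPTION_ROLLING_BATCH`) follow LMI's conventions, but treat the exact set of keys and the helper function itself as assumptions for illustration, not an authoritative reference:

```python
# Hedged sketch: assemble container environment variables for an LMI
# deployment. Variable names follow LMI conventions but are assumptions here.
def lmi_environment(model_id: str,
                    tensor_parallel_degree: int = 1,
                    rolling_batch: str = "vllm") -> dict:
    """Build the environment dict for a SageMaker LMI container."""
    return {
        "HF_MODEL_ID": model_id,                                # model to download and serve
        "TENSOR_PARALLEL_DEGREE": str(tensor_parallel_degree),  # shard across this many GPUs
        "OPTION_ROLLING_BATCH": rolling_batch,                  # continuous-batching backend
    }

# Environment variable values must be strings, hence the str() conversion above.
env = lmi_environment("mistralai/Mistral-7B-v0.1", tensor_parallel_degree=4)
```

In a deployment script, a dict like `env` would typically be supplied when constructing the SageMaker model object for the LMI image.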
## Advanced Deployment Guide
We have put together a comprehensive deployment guide that takes you through the steps needed to deploy a model using LMI containers on SageMaker. The document covers the phases from storing your model artifacts through benchmarking your SageMaker endpoint. It is intended for users moving towards deploying LLMs in production settings.
## Supported LMI Inference Libraries
LMI Containers provide integration with multiple inference libraries. You can learn more about their integration with LMI from the respective user guides:
- DeepSpeed - User Guide
- vLLM - User Guide
- LMI-Dist - User Guide
- TensorRT-LLM - User Guide
- Transformers NeuronX - User Guide
- HuggingFace Accelerate - User Guide
LMI provides access to multiple libraries so that users can find the best stack for their model and use case. Each inference library provides a unique set of features and optimizations that can be tuned for your model and use case. With LMI's built-in inference handlers and unified configuration, experimenting with different stacks is as simple as changing a few configurations. Refer to the backend-specific user guides, and the LMI deployment guide, to learn more. An overview of the different LMI components is also provided in the deployment guide.
The following table shows which SageMaker DLC (deep learning container) to use for each backend. This information is also available on the SageMaker DLC GitHub repository.
Backend | SageMaker DLC | Example URI
---|---|---
vLLM | djl-deepspeed | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121
lmi-dist | djl-deepspeed | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121
hf-accelerate | djl-deepspeed | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121
deepspeed | djl-deepspeed | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121
tensorrt-llm | djl-tensorrtllm | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-tensorrtllm0.8.0-cu122
transformers-neuronx | djl-neuronx | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-neuronx-sdk2.18.0
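When wiring these image URIs into deployment scripts, a small lookup table mirroring the mapping above can catch backend-name typos early. The URIs are transcribed verbatim from the table; the helper function itself is a convenience sketch, and the authoritative list remains the SageMaker DLC repository:

```python
# Backend -> example DLC image URI, transcribed from the table above.
# The first four backends all ship in the same djl-deepspeed container.
_DEEPSPEED_DLC = ("763104351884.dkr.ecr.us-east-1.amazonaws.com/"
                  "djl-inference:0.27.0-deepspeed0.12.6-cu121")

DLC_IMAGE_URIS = {
    "vllm": _DEEPSPEED_DLC,
    "lmi-dist": _DEEPSPEED_DLC,
    "hf-accelerate": _DEEPSPEED_DLC,
    "deepspeed": _DEEPSPEED_DLC,
    "tensorrt-llm": ("763104351884.dkr.ecr.us-east-1.amazonaws.com/"
                     "djl-inference:0.27.0-tensorrtllm0.8.0-cu122"),
    "transformers-neuronx": ("763104351884.dkr.ecr.us-east-1.amazonaws.com/"
                             "djl-inference:0.27.0-neuronx-sdk2.18.0"),
}

def image_uri(backend: str) -> str:
    """Return the example DLC image URI for a backend, failing loudly on typos."""
    try:
        return DLC_IMAGE_URIS[backend]
    except KeyError:
        raise ValueError(f"unknown backend {backend!r}; "
                         f"expected one of {sorted(DLC_IMAGE_URIS)}") from None
```

Note that `vllm`, `lmi-dist`, `hf-accelerate`, and `deepspeed` intentionally resolve to the same image: selecting among them is done through configuration (e.g. `option.rolling_batch`), not by choosing a different container.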