Deploy a model to an endpoint

Before you can get online inferences from a trained model, you must deploy the model to an endpoint. This can be done by using the Google Cloud console, the Google Cloud CLI, or the Vertex AI API.

This document describes the process for deploying models to endpoints.

What happens when you deploy a model

Deploying a model associates physical resources with the model so that it can serve online inferences with low latency.

You can deploy multiple models to an endpoint, or you can deploy the same model to multiple endpoints. For more information, see Reasons to deploy more than one model to the same endpoint.

Prepare to deploy a model to an endpoint

During model deployment, you make the following important decisions about how to run online inference:

Resource created    Setting specified at resource creation
Endpoint            Location in which to run inferences
Model               Container to use (ModelContainerSpec)
DeployedModel       Compute resources to use for online inference

After the model is deployed to the endpoint, these deployment settings can't be changed. To change them, you must redeploy your model.

The first step in the deployment process is to decide which type of endpoint to use. For more information, see Choose an endpoint type.

Next, make sure that the model is visible in Vertex AI Model Registry. This is required for the model to be deployable. For information about Model Registry, including how to import model artifacts or create them directly in Model Registry, see Introduction to Vertex AI Model Registry.

The next decision to make is which compute resources to use for serving the model. The model's training type (AutoML or custom) and, for AutoML models, its data type determine the kinds of physical resources available to the model. After model deployment, you can mutate some of those resources without creating a new deployment.

The endpoint resource provides the service endpoint (URL) you use to request the inference. For example:

   https://us-central1-aiplatform.googleapis.com/v1/projects/{project}/locations/{location}/endpoints/{endpoint}:predict
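As a minimal sketch, the URL pattern in the example above can be assembled from its three parts. The helper name predict_url and the sample values are hypothetical, and the sketch assumes the regional host name follows the LOCATION-aiplatform.googleapis.com pattern used elsewhere in this document; other endpoint types may use a different host.

```python
def predict_url(project: str, location: str, endpoint: str) -> str:
    """Build a :predict URL following the pattern shown in the example above."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{location}/endpoints/{endpoint}:predict"
    )


# Hypothetical project and endpoint IDs, for illustration only.
print(predict_url("my-project", "us-central1", "1234567890"))
```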

Deploy a model to an endpoint

You can deploy a model to an endpoint by using the Google Cloud console or by using the gcloud CLI or Vertex AI API.

Deploy a model to a public endpoint by using the Google Cloud console

In the Google Cloud console, you can deploy a model to an existing dedicated or shared public endpoint, or you can create a new endpoint during the deployment process. For details, see Deploy a model by using the Google Cloud console.

Deploy a model to a public endpoint by using the gcloud CLI or Vertex AI API

When you deploy a model by using the gcloud CLI or Vertex AI API, you must first create a dedicated or shared endpoint and then deploy the model to it. For details, see:

  1. Create a dedicated or shared public endpoint
  2. Deploy a model by using the gcloud CLI or Vertex AI API

Deploy a model to a Private Service Connect endpoint

For details, see Use Private Service Connect endpoints for online inference.

Use a rolling deployment to update a deployed model

You can use a rolling deployment to replace a deployed model with a new version of the same model. The new model reuses the compute resources from the previous one. For details, see Use a rolling deployment to replace a deployed model.

Undeploy a model and delete the endpoint

You can undeploy a model and delete the endpoint. For details, see Undeploy a model and delete the endpoint.

Reasons to deploy more than one model to the same endpoint

Deploying two models to the same endpoint lets you gradually replace one model with the other. For example, suppose you are using a model, and find a way to increase the accuracy of that model with new training data. However, you don't want to update your application to point to a new endpoint URL, and you don't want to create sudden changes in your application. You can add the new model to the same endpoint, serving a small percentage of traffic, and gradually increase the traffic split for the new model until it is serving 100% of the traffic.
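The gradual rollout described above can be sketched as a sequence of traffic splits. This is only an illustration: it assumes a traffic split is expressed as a mapping from deployed model ID to an integer percentage that sums to 100, and the function name, model IDs, and step values are hypothetical.

```python
from typing import Dict, List


def traffic_split_schedule(
    old_model_id: str, new_model_id: str, new_model_percentages: List[int]
) -> List[Dict[str, int]]:
    """Return one traffic split per rollout step; each split sums to 100."""
    schedule = []
    for new_pct in new_model_percentages:
        if not 0 <= new_pct <= 100:
            raise ValueError(f"percentage out of range: {new_pct}")
        schedule.append({old_model_id: 100 - new_pct, new_model_id: new_pct})
    return schedule


# Hypothetical rollout: 10% -> 50% -> 100% of traffic to the new model.
for split in traffic_split_schedule("old-model", "new-model", [10, 50, 100]):
    print(split)
```

Each step would be applied to the endpoint only after the previous one has been observed to behave as expected.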

Because the resources are associated with the model rather than the endpoint, you could deploy models of different types to the same endpoint. However, the best practice is to deploy models of a specific type (for example, AutoML tabular or custom-trained) to an endpoint. This configuration is easier to manage.

Reasons to deploy a model to more than one endpoint

You might want to deploy your models with different resources for different application environments, such as testing and production. You might also want to support different SLOs for your inference requests. Perhaps one of your applications has much higher performance needs than the others. In this case, you can deploy that model to a higher-performance endpoint with more machine resources. To optimize costs, you can also deploy the model to a lower-performance endpoint with fewer machine resources.

Scaling behavior

When you deploy a model for online inference as a DeployedModel, you can configure inference nodes to automatically scale. To do this, set dedicatedResources.maxReplicaCount to a greater value than dedicatedResources.minReplicaCount.

When you configure a DeployedModel, you must set dedicatedResources.minReplicaCount to at least 1. In other words, you cannot configure the DeployedModel to scale to 0 inference nodes when it is unused.

By default, the deployment operation is only considered successful if the number of inference nodes reaches dedicatedResources.minReplicaCount before the deployment request timeout value. Otherwise, the deployment is marked as failed, and the underlying resources are released.

Partially successful deployment and mutation

You can modify the default deployment behavior by setting dedicatedResources.requiredReplicaCount to a value that is less than dedicatedResources.minReplicaCount. In this case, when the number of inference nodes reaches dedicatedResources.requiredReplicaCount, the deployment operation is marked as successful, even though it is not yet complete. Deployment continues until dedicatedResources.minReplicaCount is reached. If dedicatedResources.minReplicaCount isn't reached before the deployment request timeout, the operation is still successful, but an error message for the failed replicas is returned in DeployedModel.status.message.
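The replica-count rules described so far can be summarized in a short sketch. The function names are hypothetical, and the code only restates the conditions given in the text; it does not call Vertex AI.

```python
from typing import Optional


def autoscaling_enabled(min_replica_count: int, max_replica_count: int) -> bool:
    """Autoscaling is on when maxReplicaCount is greater than minReplicaCount."""
    if min_replica_count < 1:
        # The text above states that minReplicaCount must be at least 1.
        raise ValueError("minReplicaCount must be at least 1")
    return max_replica_count > min_replica_count


def replicas_needed_for_success(
    min_replica_count: int, required_replica_count: Optional[int] = None
) -> int:
    """Number of running nodes at which the deploy operation is marked successful."""
    if required_replica_count is not None and required_replica_count < min_replica_count:
        return required_replica_count
    return min_replica_count


print(autoscaling_enabled(1, 3))          # True
print(replicas_needed_for_success(4))     # 4
print(replicas_needed_for_success(4, 2))  # 2
```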

Quota for Custom model serving is calculated based on your deployed model's real-time usage of compute resources. If the sum of maxReplicaCount for all the deployments in your project is more than your project's quota, some deployments may fail to autoscale due to quota being exhausted.

Endpoints are scaled up and down per machine, but quota is calculated per CPU or GPU. For example, if your model is deployed to an a2-highgpu-2g machine type, each active replica counts as 24 CPUs and 2 GPUs against your project's quota. For more information, see Quota and limits.
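To make the arithmetic concrete, the following sketch estimates the quota a deployment would consume if it scaled all the way to maxReplicaCount. The function name is hypothetical; the per-replica figures for a2-highgpu-2g (24 CPUs and 2 GPUs) are the ones quoted above.

```python
from typing import Dict


def quota_at_max_scale(
    max_replica_count: int, cpus_per_replica: int, gpus_per_replica: int = 0
) -> Dict[str, int]:
    """Quota consumed when every replica up to maxReplicaCount is active."""
    return {
        "cpus": max_replica_count * cpus_per_replica,
        "gpus": max_replica_count * gpus_per_replica,
    }


# Five a2-highgpu-2g replicas: 5 * 24 CPUs and 5 * 2 GPUs.
print(quota_at_max_scale(5, cpus_per_replica=24, gpus_per_replica=2))
# {'cpus': 120, 'gpus': 10}
```

Summing this result across all deployments in a project and comparing it with the project's quota shows whether some deployments could fail to autoscale.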

The inference nodes for batch inference don't automatically scale. Vertex AI uses BatchDedicatedResources.startingReplicaCount and ignores BatchDedicatedResources.maxReplicaCount.

Target utilization and configuration

By default, if you deploy a model without dedicated GPU resources, Vertex AI automatically scales the number of replicas up or down so that CPU usage matches the default 60% target value.

By default, if you deploy a model with dedicated GPU resources (if machineSpec.accelerator_count is greater than 0), Vertex AI automatically scales the number of replicas up or down so that the CPU or GPU usage, whichever is higher, matches the default 60% target value. Therefore, if your inference throughput causes high GPU usage but not high CPU usage, Vertex AI scales up, and the CPU utilization will be very low, which is visible in monitoring. Conversely, if your custom container underutilizes the GPU but has an unrelated process that raises CPU utilization above 60%, Vertex AI scales up, even if this isn't needed to achieve your QPS and latency targets.

You can override the default threshold metric and target by specifying autoscalingMetricSpecs. Note that if your deployment is configured to scale based only on CPU usage, it won't scale up even if GPU usage is high.

The following autoscaling metrics are supported:

  • CPU utilization (aiplatform.googleapis.com/prediction/online/cpu/utilization): Scales based on CPU usage. Its unit is CPU utilization per replica. The target value is a percentage (0-100). The default target value is 60%.
  • GPU utilization (aiplatform.googleapis.com/prediction/online/accelerator/duty_cycle): Scales based on GPU usage. Its unit is GPU utilization per replica. The target value is a percentage (0-100). The default target value is 60%.
  • Request count (aiplatform.googleapis.com/prediction/online/request_count): Scales based on the number of requests. Its unit is requests per minute per replica. The target value is an integer. This metric is disabled by default.

When configuring autoscaling, use METRIC_NAME for the metric identifier and TARGET_THRESHOLD for the target value.

Configure autoscaling during deployment

To configure autoscaling when deploying a model, use one of the following interfaces:

gcloud

To configure autoscaling when deploying a model using the gcloud CLI, use the gcloud ai endpoints deploy-model command.

Note that for gcloud, the metric keyword is slightly different. Use the following:

  • cpu-usage
  • gpu-duty-cycle
  • request-counts-per-minute

Before using any of the command data, make the following replacements:

  • ENDPOINT_ID: The ID of your endpoint.
  • PROJECT_ID: Your project ID.
  • LOCATION: The region of your endpoint.
  • MODEL_ID: The ID of the model to deploy.
  • DEPLOYED_MODEL_DISPLAY_NAME: A display name for the deployed model.
  • MACHINE_TYPE: The machine type for the deployed model (e.g., n1-standard-4).
  • ACCELERATOR_TYPE: Optional. The type of GPU accelerator to attach (e.g., NVIDIA_L4).
  • ACCELERATOR_COUNT: Optional. The number of accelerators to attach to each machine.
  • MIN_REPLICA_COUNT: The minimum number of replicas for autoscaling.
  • MAX_REPLICA_COUNT: The maximum number of replicas for autoscaling.
  • METRIC_NAME_GCLOUD: The gcloud keyword for the autoscaling metric (one of the keywords listed above).
  • TARGET_THRESHOLD: The target value for the specified metric.

gcloud ai endpoints deploy-model ENDPOINT_ID \
    --project=PROJECT_ID \
    --region=LOCATION \
    --model=MODEL_ID \
    --display-name=DEPLOYED_MODEL_DISPLAY_NAME \
    --machine-type=MACHINE_TYPE \
    --accelerator-type=ACCELERATOR_TYPE \
    --accelerator-count=ACCELERATOR_COUNT \
    --min-replica-count=MIN_REPLICA_COUNT \
    --max-replica-count=MAX_REPLICA_COUNT \
    --autoscaling-metric-specs=METRIC_NAME_GCLOUD=TARGET_THRESHOLD

REST

To configure autoscaling when deploying a model using the REST API, use the projects.locations.endpoints.deployModel method.

Before using any of the request data, make the following replacements:

  • ENDPOINT_ID: The ID of your endpoint.
  • PROJECT_ID: Your project ID.
  • LOCATION: The region of your endpoint.
  • MODEL_ID: The ID of the model to deploy.
  • DEPLOYED_MODEL_DISPLAY_NAME: A display name for the deployed model.
  • MACHINE_TYPE: The machine type for the deployed model (e.g., n1-standard-4).
  • ACCELERATOR_TYPE: Optional. The type of GPU accelerator to attach (e.g., NVIDIA_L4).
  • ACCELERATOR_COUNT: Optional. The number of accelerators to attach to each machine.
  • MIN_REPLICA_COUNT: The minimum number of replicas for autoscaling.
  • MAX_REPLICA_COUNT: The maximum number of replicas for autoscaling.
  • METRIC_NAME: The identifier of the autoscaling metric.
  • TARGET_THRESHOLD: The target value for the specified metric.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID:deployModel

Request JSON body:

{
  "deployedModel": {
    "model": "projects/PROJECT_ID/locations/LOCATION/models/MODEL_ID",
    "displayName": "DEPLOYED_MODEL_DISPLAY_NAME",
    "dedicatedResources": {
      "machineSpec": {
        "machineType": "MACHINE_TYPE",
        "acceleratorType": "ACCELERATOR_TYPE",
        "acceleratorCount": ACCELERATOR_COUNT
      },
      "minReplicaCount": MIN_REPLICA_COUNT,
      "maxReplicaCount": MAX_REPLICA_COUNT,
      "autoscalingMetricSpecs": [
        {
          "metricName": "METRIC_NAME",
          "target": TARGET_THRESHOLD
        }
      ]
    }
  }
}
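As an illustration, the request body above can be built programmatically before it is sent. This sketch only constructs and prints the JSON; it does not send the request. The function name and sample values are hypothetical, and omitting the accelerator fields when no accelerator is requested is an assumption based on those fields being marked optional.

```python
import json
from typing import Optional


def build_deploy_request(
    project_id: str,
    location: str,
    model_id: str,
    display_name: str,
    machine_type: str,
    min_replica_count: int,
    max_replica_count: int,
    metric_name: str,
    target: int,
    accelerator_type: Optional[str] = None,
    accelerator_count: int = 0,
) -> dict:
    """Return a deployModel request body shaped like the JSON above."""
    machine_spec = {"machineType": machine_type}
    # Accelerator fields are optional, so include them only when requested.
    if accelerator_type and accelerator_count > 0:
        machine_spec["acceleratorType"] = accelerator_type
        machine_spec["acceleratorCount"] = accelerator_count
    return {
        "deployedModel": {
            "model": f"projects/{project_id}/locations/{location}/models/{model_id}",
            "displayName": display_name,
            "dedicatedResources": {
                "machineSpec": machine_spec,
                "minReplicaCount": min_replica_count,
                "maxReplicaCount": max_replica_count,
                "autoscalingMetricSpecs": [
                    {"metricName": metric_name, "target": target}
                ],
            },
        }
    }


# Hypothetical values, for illustration only.
body = build_deploy_request(
    project_id="my-project",
    location="us-central1",
    model_id="1234567890",
    display_name="my-deployed-model",
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=3,
    metric_name="aiplatform.googleapis.com/prediction/online/cpu/utilization",
    target=60,
)
print(json.dumps(body, indent=2))
```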

Python

In the Vertex AI SDK for Python, autoscaling is configured through parameters of the deploy() function call. The sample uses request-count-based autoscaling as an example. The configurable autoscaling parameters are:

  • autoscaling_target_cpu_utilization
  • autoscaling_target_accelerator_duty_cycle
  • autoscaling_target_request_count_per_minute

To configure autoscaling when deploying a model using the Vertex AI SDK for Python:

Before running the code, make the following replacements:

  • PROJECT_ID: Your project ID.
  • LOCATION: The region of your endpoint.
  • ENDPOINT_ID: The ID of your endpoint.
  • MODEL_ID: The ID of the model to deploy.
  • DEPLOYED_MODEL_DISPLAY_NAME: A display name for the deployed model.
  • MACHINE_TYPE: The machine type for the deployed model (e.g., n1-standard-4).
  • ACCELERATOR_TYPE: Optional. The type of GPU accelerator to attach (e.g., NVIDIA_L4).
  • ACCELERATOR_COUNT: Optional. The number of accelerators to attach to each machine.
  • MIN_REPLICA_COUNT: The minimum number of replicas for autoscaling.
  • MAX_REPLICA_COUNT: The maximum number of replicas for autoscaling.
  • TARGET_THRESHOLD: The target value for the autoscaling metric. The metric itself is selected by the parameter name, so there is no separate metric identifier.

from google.cloud import aiplatform

# Initialize Vertex AI
aiplatform.init(project="PROJECT_ID", location="LOCATION")

# Get the model from Model Registry
model = aiplatform.Model("MODEL_ID")

# Get the endpoint
endpoint = aiplatform.Endpoint("ENDPOINT_ID")

# Deploy the model to the endpoint
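model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="DEPLOYED_MODEL_DISPLAY_NAME",
    machine_type="MACHINE_TYPE",
    accelerator_type="ACCELERATOR_TYPE",
    accelerator_count=ACCELERATOR_COUNT,
    min_replica_count=MIN_REPLICA_COUNT,
    max_replica_count=MAX_REPLICA_COUNT,
    # Scale on request count. To scale on CPU or GPU usage instead, use
    # autoscaling_target_cpu_utilization or
    # autoscaling_target_accelerator_duty_cycle.
    autoscaling_target_request_count_per_minute=TARGET_THRESHOLD,
)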
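Because the three interfaces above name the same metrics differently, a small lookup table can help keep them straight. The pairing below is inferred from the order and wording of the lists in this document, and the short keys ("cpu", "gpu", "request_count") are hypothetical labels used only for this sketch.

```python
# Identifiers as listed in this document for each interface.
AUTOSCALING_METRICS = {
    "cpu": {
        "rest": "aiplatform.googleapis.com/prediction/online/cpu/utilization",
        "gcloud": "cpu-usage",
        "python": "autoscaling_target_cpu_utilization",
    },
    "gpu": {
        "rest": "aiplatform.googleapis.com/prediction/online/accelerator/duty_cycle",
        "gcloud": "gpu-duty-cycle",
        "python": "autoscaling_target_accelerator_duty_cycle",
    },
    "request_count": {
        "rest": "aiplatform.googleapis.com/prediction/online/request_count",
        "gcloud": "request-counts-per-minute",
        "python": "autoscaling_target_request_count_per_minute",
    },
}


def metric_identifier(metric: str, interface: str) -> str:
    """Look up the identifier for a metric ('cpu', 'gpu', 'request_count')
    in a given interface ('rest', 'gcloud', 'python')."""
    return AUTOSCALING_METRICS[metric][interface]


print(metric_identifier("gpu", "gcloud"))  # gpu-duty-cycle
```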

Update autoscaling configuration

To update an existing autoscaling configuration, use one of the following interfaces:

REST

To update the autoscaling configuration of a deployed model using the REST API, use the projects.locations.endpoints.mutateDeployedModel method.

Before using any of the request data, make the following replacements:

  • ENDPOINT_ID: The ID of your endpoint.
  • PROJECT_ID: Your project ID.
  • LOCATION: The region of your endpoint.
  • DEPLOYED_MODEL_ID: The ID of the deployed model to update.
  • MIN_REPLICA_COUNT: The new minimum number of replicas for autoscaling.
  • MAX_REPLICA_COUNT: The new maximum number of replicas for autoscaling.
  • METRIC_NAME: The identifier of the autoscaling metric.
  • TARGET_THRESHOLD: The target value for the specified metric.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID:mutateDeployedModel

Request JSON body:

{
  "deployedModel": {
    "id": "DEPLOYED_MODEL_ID",
    "dedicatedResources": {
      "minReplicaCount": MIN_REPLICA_COUNT,
      "maxReplicaCount": MAX_REPLICA_COUNT,
      "autoscalingMetricSpecs": [
        {
          "metricName": "METRIC_NAME",
          "target": TARGET_THRESHOLD
        }
      ]
    }
  },
  "updateMask": {
    "paths": [
      "dedicated_resources.min_replica_count",
      "dedicated_resources.max_replica_count",
      "dedicated_resources.autoscaling_metric_specs"
    ]
  }
}

Manage resource usage

You can monitor your endpoint to track metrics like CPU and Accelerator usage, number of requests, latency, and the current and target number of replicas. This information can help you understand your endpoint's resource usage and scaling behavior.

Keep in mind that each replica runs only a single container. If an inference container can't fully use the selected compute resource (for example, single-threaded code on a multi-core machine, or a custom model that calls another service as part of making the inference), utilization may stay below the target and your nodes may not scale up.

For example, if you are using FastAPI, or any model server that has a configurable number of workers or threads, running more than one worker can often increase resource utilization, which improves the service's ability to automatically scale the number of replicas.

We generally recommend starting with one worker or thread per core. If you notice that CPU utilization is low, especially under high load, or your model isn't scaling up because CPU utilization is low, then increase the number of workers. On the other hand, if you notice that utilization is too high and your latencies increase more than expected under load, try using fewer workers. If you are already using only a single worker, try using a smaller machine type.
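A minimal sketch of the starting point recommended above, one worker per core, follows. The function name is hypothetical, and os.cpu_count() reports the cores visible to the Python process, which may differ from the CPU share that the container is actually allotted, so treat the result as an initial value to tune under load.

```python
import os
from typing import Optional


def initial_worker_count(cores: Optional[int] = None) -> int:
    """Starting point from the guidance above: one worker (or thread) per core."""
    if cores is None:
        # os.cpu_count() can return None when the count can't be determined.
        cores = os.cpu_count() or 1
    return max(1, cores)


print(initial_worker_count())
```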

Scaling behavior and lag

Vertex AI adjusts the number of replicas every 15 seconds using data from the previous 5-minute window. For each 15-second cycle, the system measures the server utilization and generates a target number of replicas based on the following formula:

target # of replicas = Ceil(current # of replicas * (current utilization / target utilization))

For example, if you have two replicas that are being utilized at 100%, the target is 4:

4 = Ceil(3.33) = Ceil(2 * (100% / 60%))

As another example, if you have 10 replicas and utilization drops to 1%, the target is 1:

1 = Ceil(0.167) = Ceil(10 * (1% / 60%))

At the end of each 15-second cycle, the system adjusts the number of replicas to match the highest target value from the previous 5-minute window. Because the highest target value is chosen, your endpoint won't scale down if there is a spike in utilization during that 5-minute window, even if overall utilization is very low. On the other hand, if the system needs to scale up, it does so within 15 seconds, because the highest target value is chosen instead of the average.
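The formula and the 5-minute window can be explored with a small simulation. This is a simplified sketch, not the service's implementation: it assumes one utilization reading per 15-second cycle, clamps the result to hypothetical min_replicas and max_replicas bounds, and ignores the time replicas take to start up or shut down.

```python
import math
from collections import deque
from typing import List

CYCLE_SECONDS = 15
WINDOW_SECONDS = 5 * 60
WINDOW_CYCLES = WINDOW_SECONDS // CYCLE_SECONDS  # 20 cycles per window


def target_replicas(
    current_replicas: int, utilization_pct: float, target_pct: float = 60.0
) -> int:
    """Ceil(current # of replicas * (current utilization / target utilization))."""
    return math.ceil(current_replicas * (utilization_pct / target_pct))


def simulate(
    utilizations: List[float],
    start_replicas: int,
    min_replicas: int,
    max_replicas: int,
    target_pct: float = 60.0,
) -> List[int]:
    """Return the replica count chosen at the end of each 15-second cycle."""
    window = deque(maxlen=WINDOW_CYCLES)  # targets from the last 5 minutes
    replicas = start_replicas
    history = []
    for utilization in utilizations:
        window.append(target_replicas(replicas, utilization, target_pct))
        # The highest target in the window wins, clamped to the configured bounds.
        replicas = max(min_replicas, min(max_replicas, max(window)))
        history.append(replicas)
    return history


print(target_replicas(2, 100))  # 4
print(target_replicas(10, 1))   # 1

# One spike to 100% followed by 25 quiet cycles at 1% utilization.
history = simulate([100] + [1] * 25, start_replicas=2, min_replicas=1, max_replicas=10)
print(history)
```

In this run the spike raises the count to 4 in the first cycle, and the count stays at 4 until the spike leaves the 20-cycle window, after which it drops to 1.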

Keep in mind that even after Vertex AI adjusts the number of replicas, it takes time to start up or turn down the replicas. Thus there is an additional delay before the endpoint can adjust to the traffic. The main factors that contribute to this time include the following:

  • The time to provision and start the Compute Engine VMs
  • The time to download the container from the registry
  • The time to load the model from storage

The best way to understand the real-world scaling behavior of your model is to run a load test and optimize the characteristics that matter for your model and your use case. If the autoscaler isn't scaling up fast enough for your application, set dedicatedResources.minReplicaCount high enough to handle your expected baseline traffic.

Update the scaling configuration

If you specified either DedicatedResources or AutomaticResources when you deployed the model, you can update the scaling configuration without redeploying the model by calling mutateDeployedModel.

For example, the following request updates max_replica_count and autoscaling_metric_specs, and disables container logging.

{
  "deployedModel": {
    "id": "2464520679043629056",
    "dedicatedResources": {
      "maxReplicaCount": 9,
      "autoscalingMetricSpecs": [
        {
          "metricName": "aiplatform.googleapis.com/prediction/online/cpu/utilization",
          "target": 50
        }
      ]
    },
    "disableContainerLogging": true
  },
  "updateMask": {
    "paths": [
      "dedicated_resources.max_replica_count",
      "dedicated_resources.autoscaling_metric_specs",
      "disable_container_logging"
    ]
  }
}

Usage notes:

  • You can't change the machine type or switch from DedicatedResources to AutomaticResources or the other way around. The only scaling configuration fields you can change are min_replica_count, max_replica_count, required_replica_count, and autoscaling_metric_specs (DedicatedResources only).
  • You must list every field you need to update in updateMask. Unlisted fields are ignored.
  • The DeployedModel must be in a DEPLOYED state. There can be at most one active mutate operation per deployed model.
  • mutateDeployedModel also lets you enable or disable container logging. For more information, see Online inference logging.
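Because unlisted fields are ignored, it can help to build the request body and the update mask together so that they can't drift apart. The following sketch does that for the fields used in the example request above. The function name is hypothetical, and the code only builds the JSON structure; it does not send the request.

```python
import json
from typing import Optional


def build_mutate_request(
    deployed_model_id: str,
    min_replica_count: Optional[int] = None,
    max_replica_count: Optional[int] = None,
    metric_name: Optional[str] = None,
    target: Optional[int] = None,
    disable_container_logging: Optional[bool] = None,
) -> dict:
    """Build a mutateDeployedModel body whose mask lists exactly the fields set."""
    if (metric_name is None) != (target is None):
        raise ValueError("metric_name and target must be provided together")

    dedicated = {}
    paths = []
    if min_replica_count is not None:
        dedicated["minReplicaCount"] = min_replica_count
        paths.append("dedicated_resources.min_replica_count")
    if max_replica_count is not None:
        dedicated["maxReplicaCount"] = max_replica_count
        paths.append("dedicated_resources.max_replica_count")
    if metric_name is not None:
        dedicated["autoscalingMetricSpecs"] = [
            {"metricName": metric_name, "target": target}
        ]
        paths.append("dedicated_resources.autoscaling_metric_specs")

    deployed_model = {"id": deployed_model_id}
    if dedicated:
        deployed_model["dedicatedResources"] = dedicated
    if disable_container_logging is not None:
        deployed_model["disableContainerLogging"] = disable_container_logging
        paths.append("disable_container_logging")

    return {"deployedModel": deployed_model, "updateMask": {"paths": paths}}


# Reproduces the example request shown above.
request = build_mutate_request(
    "2464520679043629056",
    max_replica_count=9,
    metric_name="aiplatform.googleapis.com/prediction/online/cpu/utilization",
    target=50,
    disable_container_logging=True,
)
print(json.dumps(request, indent=2))
```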

What's next