This page shows you how to perform a model-based evaluation with the Gen AI evaluation service using the Agent Platform SDK for Python.
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Roles required to select or create a project: to create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission.
Verify that billing is enabled for your Google Cloud project.
Install the Agent Platform SDK for Python with the Gen AI evaluation service dependency:
!pip install google-cloud-aiplatform[evaluation]
Set up your credentials. If you are running this quickstart in Colaboratory, run the following:
from google.colab import auth
auth.authenticate_user()
For other environments, refer to Authenticate to Agent Platform.
Import libraries
Import your libraries and set up your project and location.
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, PointwiseMetric, PointwiseMetricPromptTemplate
from google.cloud import aiplatform
PROJECT_ID = "PROJECT_ID"
LOCATION = "LOCATION"
EXPERIMENT_NAME = "EXPERIMENT_NAME"
vertexai.init(
    project=PROJECT_ID,
    location=LOCATION,
)
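A local check of EXPERIMENT_NAME can catch naming mistakes before any evaluation runs. The following is a minimal sketch that encodes only the rule stated on this page (lowercase alphanumeric characters and hyphens, at most 127 characters). The helper name is_valid_experiment_name is hypothetical and not part of the SDK, and the service may enforce further constraints that this pattern does not capture.

```python
import re

# Rule as stated on this page: lowercase letters, digits, and hyphens,
# between 1 and 127 characters.
# Assumption: the service may apply additional rules not covered here.
_EXPERIMENT_NAME_PATTERN = re.compile(r"[a-z0-9-]{1,127}")


def is_valid_experiment_name(name: str) -> bool:
    """Return True if the name satisfies the naming rule stated on this page."""
    return _EXPERIMENT_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_experiment_name("my-eval-experiment"))  # True
print(is_valid_experiment_name("My_Eval_Experiment"))  # False
```

Running a check like this before vertexai.init keeps a typo in the name from surfacing only after the evaluation has started.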
Note that EXPERIMENT_NAME can only contain lowercase alphanumeric characters and hyphens, up to a maximum of 127 characters.
Set up evaluation metrics based on your criteria
The following metric definition evaluates the quality of text generated by a large language model based on two criteria: Fluency and Entertaining. The code defines a metric called custom_text_quality using those two criteria:
custom_text_quality = PointwiseMetric(
    metric="custom_text_quality",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "fluency": (
                "Sentences flow smoothly and are easy to read, avoiding awkward"
                " phrasing or run-on sentences. Ideas and sentences connect"
                " logically, using transitions effectively where needed."
            ),
            "entertaining": (
                "Short, amusing text that incorporates emojis, exclamations and"
                " questions to convey quick and spontaneous communication and"
                " diversion."
            ),
        },
        rating_rubric={
            "1": "The response performs well on both criteria.",
            "0": "The response is somewhat aligned with both criteria",
            "-1": "The response falls short on both criteria",
        },
    ),
)
Prepare your dataset
Add the following code to prepare your dataset:
responses = [
    # An example of good custom_text_quality
    "Life is a rollercoaster, full of ups and downs, but it's the thrill that keeps us coming back for more!",
    # An example of medium custom_text_quality
    "The weather is nice today, not too hot, not too cold.",
    # An example of poor custom_text_quality
    "The weather is, you know, whatever.",
]
eval_dataset = pd.DataFrame({
    "response": responses,
})
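Before building the DataFrame, it may be worth confirming that every response is a non-empty string, on the assumption that a blank response gives the evaluation nothing meaningful to rate. The following is a minimal sketch; the helper find_blank_responses is hypothetical and not part of the SDK.

```python
def find_blank_responses(responses: list[str]) -> list[int]:
    """Return the indices of entries that are not non-empty strings."""
    return [
        index
        for index, response in enumerate(responses)
        if not isinstance(response, str) or not response.strip()
    ]


# Illustrative input only: the second entry is whitespace.
sample_responses = [
    "The weather is nice today, not too hot, not too cold.",
    "   ",
    "The weather is, you know, whatever.",
]
print(find_blank_responses(sample_responses))  # [1]
```

An empty result means every entry is usable as a response.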
Run evaluation with your dataset
Run the evaluation:
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[custom_text_quality],
    experiment=EXPERIMENT_NAME,
)
pointwise_result = eval_task.evaluate()
View the evaluation results for each response in the metrics_table pandas DataFrame:
pointwise_result.metrics_table
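To read the scores alongside the rubric, each numeric score can be mapped back to its rubric description. The following sketch rests on several assumptions. It works on plain dictionaries, which could be obtained with pointwise_result.metrics_table.to_dict("records"). It assumes the score column is named custom_text_quality/score. The helper describe_score is hypothetical and not part of the SDK. Check the column names of your own metrics_table before relying on it.

```python
# Rubric copied from the metric definition on this page.
RATING_RUBRIC = {
    "1": "The response performs well on both criteria.",
    "0": "The response is somewhat aligned with both criteria",
    "-1": "The response falls short on both criteria",
}

# Assumption: the score column follows the "<metric name>/score" pattern.
SCORE_COLUMN = "custom_text_quality/score"


def describe_score(score) -> str:
    """Map a numeric score to its rubric description, or report it as unknown."""
    try:
        key = str(int(score))
    except (TypeError, ValueError, OverflowError):
        # Covers missing or non-numeric scores such as None or NaN.
        return f"No rubric entry for score {score!r}"
    return RATING_RUBRIC.get(key, f"No rubric entry for score {score!r}")


# Stand-in rows with placeholder scores; real rows come from the evaluation result.
example_rows = [
    {"response": "first response", SCORE_COLUMN: 1.0},
    {"response": "second response", SCORE_COLUMN: -1.0},
]

for row in example_rows:
    print(row["response"], "->", describe_score(row[SCORE_COLUMN]))
```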
Clean up
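To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.
Delete the ExperimentRun created by the evaluation:
aiplatform.ExperimentRun(
    run_name=pointwise_result.metadata["experiment_run"],
    experiment=pointwise_result.metadata["experiment"],
).delete()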
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2026-06-10 UTC.