Introduction

Amazon SageMaker is a fully managed machine learning service. With Amazon SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment.

SageMaker provides:

  • An integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so you do not have to manage servers.
  • Common machine learning algorithms that are optimized to run efficiently against extremely large datasets in a distributed environment.

With native support for bring-your-own-algorithms and frameworks, Amazon SageMaker offers flexible distributed training options that adjust to your specific workflows. Deploy a model into a secure and scalable environment by launching it with a single click from the Amazon SageMaker console. Training and hosting are billed by minutes of usage, with no minimum fees and no upfront commitments.

Amazon SageMaker Ground Truth

Amazon SageMaker Ground Truth helps you build high-quality training datasets by combining human workers with machine learning to create labeled datasets.

Amazon SageMaker Training

An Amazon SageMaker training job is an iterative process that teaches a model to make predictions by presenting examples from a training dataset. Typically, a training algorithm computes several metrics, such as training error and prediction accuracy. These metrics help diagnose whether the model is learning well and will generalize well when making predictions on unseen data. The training algorithm writes the values of these metrics to logs, which Amazon SageMaker monitors and sends to Amazon CloudWatch in real time.
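For custom algorithms, SageMaker picks metric values out of the training log with regular expressions supplied as metric definitions (a metric name plus a regex whose first capture group is the value). The sketch below only mimics that idea locally with Python's `re` module to show how such a pattern extracts values from log lines; it is not SageMaker code, and the metric names, regexes, and log format (`train_error=`, `val_accuracy=`) are hypothetical.

```python
import re

# Hypothetical metric definitions in the Name/Regex shape used for
# SageMaker metric definitions; the first capture group holds the value.
METRIC_DEFINITIONS = [
    {"Name": "train:error", "Regex": r"train_error=([0-9.]+)"},
    {"Name": "validation:accuracy", "Regex": r"val_accuracy=([0-9.]+)"},
]


def extract_metrics(log_lines, definitions=METRIC_DEFINITIONS):
    """Return {metric_name: [values...]} found in the given log lines."""
    compiled = [(d["Name"], re.compile(d["Regex"])) for d in definitions]
    found = {name: [] for name, _ in compiled}
    for line in log_lines:
        for name, pattern in compiled:
            match = pattern.search(line)
            if match:
                found[name].append(float(match.group(1)))
    return found


sample_log = [
    "epoch 1 train_error=0.42 val_accuracy=0.71",
    "epoch 2 train_error=0.31 val_accuracy=0.78",
]
print(extract_metrics(sample_log))
# {'train:error': [0.42, 0.31], 'validation:accuracy': [0.71, 0.78]}
```

Lines that match no pattern are simply skipped, which is why a metric can have fewer values than there are log lines.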

Amazon SageMaker Endpoint

An endpoint is created from the endpoint configuration specified in the CreateEndpoint request. Amazon SageMaker uses the endpoint to provision resources and deploy models.
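The two-step flow described above (an endpoint configuration first, then an endpoint that references it by name) can be sketched as follows. The function takes any object that behaves like `boto3.client("sagemaker")`, so it can be exercised without AWS access; the function name, the `AllTraffic` variant name, and the default instance settings are illustrative assumptions, and the model must already exist.

```python
def deploy_endpoint(client, endpoint_name, config_name, model_name,
                    instance_type="ml.m5.large", instance_count=1):
    """Create an endpoint configuration, then an endpoint that uses it.

    `client` is assumed to behave like boto3.client("sagemaker").
    Returns the ARN of the new endpoint.
    """
    # Step 1: the configuration says which model runs on which instances.
    client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InitialInstanceCount": instance_count,
            "InstanceType": instance_type,
        }],
    )
    # Step 2: the endpoint only references the configuration by name;
    # SageMaker provisions the resources asynchronously.
    response = client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=config_name,
    )
    return response["EndpointArn"]
```

With real credentials this would be called as `deploy_endpoint(boto3.client("sagemaker"), "my-endpoint", "my-config", "my-model")`. Because `create_endpoint` returns while the endpoint is still being created, a caller that needs it ready can block on the boto3 waiter obtained with `client.get_waiter("endpoint_in_service")`.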

Amazon SageMaker Transform Job

Use batch transform when you need to do the following:

  • Preprocess datasets to remove noise or bias that interferes with training or inference from your dataset.
  • Get inferences from large datasets.
  • Run inference when you do not need a persistent endpoint.
  • Associate input records with inferences to assist the interpretation of results.
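As a rough illustration of the last two use cases, the sketch below builds keyword arguments for the SageMaker `CreateTransformJob` API: the job runs without a persistent endpoint, and `DataProcessing.JoinSource = "Input"` asks SageMaker to attach each inference to its input record in the output. The helper name, the CSV line-split input format, and the default instance settings are assumptions; check the CreateTransformJob reference for the content-type constraints that apply when joining.

```python
def transform_job_request(job_name, model_name, input_s3_uri, output_s3_uri,
                          instance_type="ml.m5.large", instance_count=1):
    """Build kwargs for CreateTransformJob (hypothetical helper).

    Assumes CSV input split by line; JoinSource="Input" joins each
    inference with the input record it was produced from.
    """
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "text/csv",
            "SplitType": "Line",  # one record per line
        },
        "TransformOutput": {
            "S3OutputPath": output_s3_uri,
            "AssembleWith": "Line",  # write one output record per line
        },
        "TransformResources": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
        "DataProcessing": {"JoinSource": "Input"},
    }
```

The resulting dictionary would be passed as `boto3.client("sagemaker").create_transform_job(**transform_job_request(...))`; keeping request construction separate from the API call makes it easy to inspect before submitting.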

Setup

To set up the integration:

  1. Select SageMaker GroundTruth in the AWS Integration Discovery Profile to discover Amazon SageMaker Ground Truth.
  2. Select SageMaker Training in the AWS Integration Discovery Profile to discover Amazon SageMaker training jobs.
  3. Select SageMaker EndPoint in the AWS Integration Discovery Profile to discover Amazon SageMaker endpoints.
  4. Select SageMaker Transform Job in the AWS Integration Discovery Profile to discover Amazon SageMaker transform jobs.

Metrics

GroundTruth metrics

OpsRamp Metric | Metric Display Name | Unit | Aggregation Type | Description
aws_sagemaker_labelingjobs_ActiveWorkers | ActiveWorkers | Count | Sum | Number of workers on a private work team performing a labeling job.
aws_sagemaker_labelingjobs_JobsSucceeded | JobsSucceeded | None | Sum | Number of labeling jobs that succeeded. To get the total number of labeling jobs that succeeded, use the Sum statistic.
aws_sagemaker_labelingjobs_DatasetObjectsAutoAnnotated | DatasetObjectsAutoAnnotated | Count | Sum | Number of dataset objects auto-annotated in a labeling job.
aws_sagemaker_labelingjobs_DatasetObjectsHumanAnnotated | DatasetObjectsHumanAnnotated | Count | Sum | Number of dataset objects annotated by a human in a labeling job.
aws_sagemaker_labelingjobs_DatasetObjectsLabelingFailed | DatasetObjectsLabelingFailed | Count | Sum | Number of dataset objects that failed labeling in a labeling job.
aws_sagemaker_labelingjobs_TotalDatasetObjectsLabeled | TotalDatasetObjectsLabeled | Count | Sum | Number of dataset objects labeled successfully in a labeling job.
aws_sagemaker_labelingjobs_JobsStopped | JobsStopped | Count | Sum | Number of labeling jobs that were stopped.
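The OpsRamp metric names in this table appear to be the CloudWatch metric name with an `aws_sagemaker_labelingjobs_` prefix. Assuming that pattern holds, the sketch below maps an OpsRamp name back to the CloudWatch `AWS/SageMaker/LabelingJobs` namespace, the metric name, and the Sum statistic listed in the table, which can help when cross-checking values in the CloudWatch console. The helper and constant names are hypothetical.

```python
# Assumption: OpsRamp names are "aws_sagemaker_labelingjobs_" + CloudWatch name,
# as the table above suggests.
OPSRAMP_PREFIX = "aws_sagemaker_labelingjobs_"
CLOUDWATCH_NAMESPACE = "AWS/SageMaker/LabelingJobs"

# Metric names taken from the table above.
LABELING_METRICS = {
    "ActiveWorkers", "JobsSucceeded", "DatasetObjectsAutoAnnotated",
    "DatasetObjectsHumanAnnotated", "DatasetObjectsLabelingFailed",
    "TotalDatasetObjectsLabeled", "JobsStopped",
}


def to_cloudwatch(opsramp_metric):
    """Return (namespace, metric_name, statistic) for an OpsRamp labeling metric."""
    if not opsramp_metric.startswith(OPSRAMP_PREFIX):
        raise ValueError(f"not a labeling-job metric: {opsramp_metric}")
    name = opsramp_metric[len(OPSRAMP_PREFIX):]
    if name not in LABELING_METRICS:
        raise ValueError(f"unknown labeling-job metric: {name}")
    # Every metric in the table is aggregated with Sum.
    return CLOUDWATCH_NAMESPACE, name, "Sum"


print(to_cloudwatch("aws_sagemaker_labelingjobs_JobsSucceeded"))
# ('AWS/SageMaker/LabelingJobs', 'JobsSucceeded', 'Sum')
```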

Training metrics

OpsRamp Metric | Metric Display Name | Unit | Aggregation Type | Description
aws_sagemaker_trainingjobs_CPUUtilization | CPUUtilization | Percent | Average | Percentage of CPU units used by the containers on an instance.
aws_sagemaker_trainingjobs_MemoryUtilization | MemoryUtilization | Percent | Average | Percentage of memory used by the containers on an instance.
aws_sagemaker_trainingjobs_GPUUtilization | GPUUtilization | Percent | Average | Percentage of GPU units used by the containers on an instance.
aws_sagemaker_trainingjobs_GPUMemoryUtilization | GPUMemoryUtilization | Percent | Average | Percentage of GPU memory used by the containers on an instance.
aws_sagemaker_trainingjobs_DiskUtilization | DiskUtilization | Percent | Average | Percentage of disk space used by the containers on an instance.
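These instance metrics are published per training host in the CloudWatch `/aws/sagemaker/TrainingJobs` namespace, with a `Host` dimension of the form `<training-job-name>/algo-<instance-number>`. The sketch below builds keyword arguments for CloudWatch `GetMetricStatistics` using the Average statistic from the table; the helper name and the default time window and period are assumptions, and `now` can be passed in to make the result reproducible.

```python
from datetime import datetime, timedelta, timezone

# Metric names taken from the table above.
TRAINING_METRICS = {"CPUUtilization", "MemoryUtilization", "GPUUtilization",
                    "GPUMemoryUtilization", "DiskUtilization"}


def training_metric_query(job_name, metric, host_index=1,
                          minutes=60, period=60, now=None):
    """Build kwargs for CloudWatch GetMetricStatistics for one training host."""
    if metric not in TRAINING_METRICS:
        raise ValueError(f"unsupported metric: {metric}")
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "/aws/sagemaker/TrainingJobs",
        "MetricName": metric,
        # Instance metrics are reported per host, named <job>/algo-<n>.
        "Dimensions": [{"Name": "Host", "Value": f"{job_name}/algo-{host_index}"}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,  # seconds per datapoint
        "Statistics": ["Average"],
    }
```

With credentials, the result would be passed as `boto3.client("cloudwatch").get_metric_statistics(**training_metric_query("my-job", "CPUUtilization"))`. For a distributed job, call it once per `host_index` to see each instance separately.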

Event support

CloudTrail event support

  • Supported (SageMaker Ground Truth, Training, Endpoint, Transform Job).
  • Configurable in the OpsRamp AWS Integration Discovery Profile.
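CloudTrail records each SageMaker API call as an event whose `eventSource` is `sagemaker.amazonaws.com` and whose `eventName` is the API operation. The sketch below filters a CloudTrail log file (a JSON object with a `Records` array) down to SageMaker calls for the four resource types listed above. It is only an illustration of what those events look like; it does not describe how OpsRamp collects them, and the selected event names and sample records are hypothetical.

```python
import json

# Hypothetical selection of SageMaker API calls related to the four
# supported resource types; adjust to the events you actually care about.
SAGEMAKER_EVENTS = {
    "CreateLabelingJob", "StopLabelingJob",
    "CreateTrainingJob", "StopTrainingJob",
    "CreateEndpoint", "UpdateEndpoint", "DeleteEndpoint",
    "CreateTransformJob", "StopTransformJob",
}


def sagemaker_events(records, names=SAGEMAKER_EVENTS):
    """Yield (eventTime, eventName) for SageMaker records in a CloudTrail log."""
    for record in records:
        if (record.get("eventSource") == "sagemaker.amazonaws.com"
                and record.get("eventName") in names):
            yield record.get("eventTime"), record["eventName"]


# Made-up sample log with one SageMaker call and one unrelated S3 call.
sample = json.loads("""{"Records": [
  {"eventSource": "sagemaker.amazonaws.com", "eventName": "CreateEndpoint",
   "eventTime": "2024-01-01T12:00:00Z"},
  {"eventSource": "s3.amazonaws.com", "eventName": "PutObject",
   "eventTime": "2024-01-01T12:01:00Z"}
]}""")
print(list(sagemaker_events(sample["Records"])))
# [('2024-01-01T12:00:00Z', 'CreateEndpoint')]
```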

CloudWatch alarm support

  • Not Supported

External reference