Metrics & Instrumentation with Prometheus: An Ops Perspective

I am a huge proponent of Prometheus for metrics-based monitoring and code instrumentation. It is easy to set up and has a decent story around scaling. There is a large amount of community support and information available to help you get going. I really appreciate that the project publishes an opinionated Best Practices section, which I have successfully used in multiple positions to introduce a standardized approach to instrumentation.

As an exercise in Golang development and production service support, I wrote and operate https://isicircleciup.net. It’s a simple service that scrapes CircleCI’s status page, responds with a corresponding interpretation, and increments one of two counters: circleci_outage_scrape_count or circleci_success_scrape_count.

Looking back, I should have leveraged labels to indicate CircleCI’s status rather than two separate metrics. Admittedly I was distracted by getting this service running on Nomad, implementing centralized logging, and getting Prometheus and Grafana running on the cluster as well.

A sampling of 7 days of circleci_outage_scrape_count

In this implementation the single metric circleci_outage_scrape_count has three labels: host, instance, and job. I am unsure how helpful the “job” label is, since that information is already captured as the prefix of the metric name itself.

Setting up the metric is pretty straightforward using the Prometheus Golang client.

// Counter of scrapes that indicate a CircleCI outage, partitioned by host.
outageCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "circleci_outage_scrape_count",
		Help: "Count of page scrapes that yield a negative outlook",
	},
	[]string{"host"},
)

This counter is incremented by a single line.

outageCounter.With(prometheus.Labels{"host": hostname}).Inc()
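
For context, here is a minimal, self-contained sketch of how a counter like this can be registered and exposed with the Prometheus Golang client. This is not the actual wiring of isicircleciup.net; the port and hostname handling are illustrative.

package main

import (
	"log"
	"net/http"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var outageCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "circleci_outage_scrape_count",
		Help: "Count of page scrapes that yield a negative outlook",
	},
	[]string{"host"},
)

func main() {
	// Register the counter so it is included in the /metrics output.
	prometheus.MustRegister(outageCounter)

	// Increment using the machine's hostname as the label value.
	hostname, _ := os.Hostname()
	outageCounter.With(prometheus.Labels{"host": hostname}).Inc()

	// Expose the standard Prometheus scrape endpoint.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil)) // port chosen for illustration
}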

Within Grafana, the following two PromQL queries create the graph below.

sum(rate(circleci_outage_scrape_count[30m]))
sum(rate(circleci_success_scrape_count[30m]))

A timeseries representation of CircleCI’s status page

In this graph, the sum() aggregation throws away the host, instance, and job labels entirely. So maybe those labels aren’t necessary?
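
That said, the labels earn their keep as soon as I group by them. A hypothetical per-host variant of the same query (not one from my actual dashboard) would be:

sum by (host) (rate(circleci_outage_scrape_count[30m]))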

Lessons Learned

Each unique combination of label values creates its own time series, and every series consumes resources in Prometheus, so while labels are important and encouraged, we must be thoughtful in our use of them.

Following best practices, I should have a single metric called circleci_scrape_count (a sketch follows the list below). I’d label the metric along the following dimensions:

  • host: records the container ID the service is running within.
  • instance: records the instance the container is running on. This is a built-in label that Prometheus adds for us.
  • status: “success” or “failure” to record the state of CircleCI’s operations.
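
Here is a rough sketch of that consolidated counter with the Prometheus Golang client. The metric and label names follow the list above; hostname is the same value used in the earlier snippet, and the status value is set on each scrape.

// One counter, labeled by host and by scrape outcome.
scrapeCounter := prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "circleci_scrape_count",
		Help: "Count of status-page scrapes, labeled by host and outcome",
	},
	[]string{"host", "status"},
)
prometheus.MustRegister(scrapeCounter)

// After each scrape, record the outcome under the appropriate status value.
scrapeCounter.With(prometheus.Labels{"host": hostname, "status": "success"}).Inc()

// The instance label is attached by Prometheus at scrape time, so it never
// appears in application code.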

I chose these labels to facilitate troubleshooting when the service starts to misbehave. My logging implementation captures the container ID and the host, and I may be able to develop alerting across hostnames using application-specific metrics. Perhaps a single container is misbehaving during a rolling deployment? I may be able to identify that a newer container ID is problematic and roll back to stabilize the service.
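
As a hypothetical example of what that alerting could look like with the consolidated metric above, this expression flags any host whose scrape loop has stopped making progress (the 15m window is illustrative):

sum by (host) (rate(circleci_scrape_count[15m])) == 0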

Status is the “product” in the case of this site, so it is essential that I capture it.

Continued Reading

I encourage you to keep reading up on Prometheus. It’s a very compelling option for quickly rolling out instrumentation. In a past position, I rolled out the node_exporter on all EC2 instances in the organization and had machine metrics overnight. Combined with Consul, auto-discovery is very pleasant.

Happy instrumentation!
