I am a huge proponent of Prometheus for metrics-based monitoring and code instrumentation. It is easy to set up and has a decent story around scaling. There is a large amount of community support and information available to get going. I really enjoy that the project publishes an opinionated Best Practices section, which I have successfully used in multiple positions to introduce a standardized approach to instrumentation.
As an exercise in Golang development and production service support, I wrote and operate https://isicircleciup.net. It’s a simple service that scrapes CircleCI’s status page, responds with a corresponding interpretation, and increments one of two counters: circleci_outage_scrape_count or circleci_success_scrape_count.
Looking back, I should have leveraged labels to indicate CircleCI’s status rather than two separate metrics. Admittedly I was distracted by getting this service running on Nomad, implementing centralized logging, and getting Prometheus and Grafana running on the cluster as well.

circleci_outage_scrape_count
In this implementation, the single metric circleci_outage_scrape_count has three labels: host, instance, and job. I am unsure how helpful the “job” label is, since it is already captured as the prefix of the “outage_scrape_count” metric name.
Setting up the metric is pretty straightforward using the Prometheus Golang client.
outageCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "circleci_outage_scrape_count",
		Help: "Count of page scrapes that yield a negative outlook",
	},
	[]string{"host"},
)
This counter is incremented by a single line.
outageCounter.With(prometheus.Labels{"host": hostname}).Inc()
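Putting those pieces together, here is a minimal, self-contained sketch of how such a counter is typically registered and exposed with the Prometheus Go client and promhttp. The package layout, port, and the single increment in main are assumptions for illustration, not the actual implementation of isicircleciup.net.

package main

import (
	"log"
	"net/http"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var outageCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "circleci_outage_scrape_count",
		Help: "Count of page scrapes that yield a negative outlook",
	},
	[]string{"host"},
)

func main() {
	// Register the counter so it appears on the /metrics endpoint.
	prometheus.MustRegister(outageCounter)

	// In the real service this increment happens on each scrape of the
	// status page; here it runs once for illustration.
	hostname, _ := os.Hostname()
	outageCounter.With(prometheus.Labels{"host": hostname}).Inc()

	// Expose the metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}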
Within Grafana, the following two PromQL queries create the graph below.
sum(rate(circleci_outage_scrape_count[30m]))
sum(rate(circleci_success_scrape_count[30m]))

In this graph, the sum() aggregation throws away the host, instance, and job labels. So maybe those labels aren’t necessary?
Lessons learned
Each unique combination of label values creates its own time series, which consumes machine resources in Prometheus, so while labels are important and encouraged, we must be thoughtful in our use of them.
Following best practices, I should have a single metric called circleci_scrape_count. I’d label the metric along the following dimensions, sketched in code below:
- host: records the container ID the service is running within.
- instance: records the instance the container is running on. This is a built-in label that Prometheus adds for us.
- status: “success” or “failure” to record the state of CircleCI’s operations.
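Here is a rough sketch of what that revised instrumentation could look like with the Go client; the variable and function names are assumptions rather than production code, and the instance label is omitted because Prometheus attaches it at scrape time.

package main

import "github.com/prometheus/client_golang/prometheus"

// A single counter vector replaces circleci_outage_scrape_count and
// circleci_success_scrape_count; the outcome moves into the "status" label.
var scrapeCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "circleci_scrape_count",
		Help: "Count of CircleCI status page scrapes, partitioned by outcome",
	},
	[]string{"host", "status"},
)

func init() {
	prometheus.MustRegister(scrapeCounter)
}

// recordScrape increments the counter for one scrape; status is either
// "success" or "failure".
func recordScrape(hostname, status string) {
	scrapeCounter.With(prometheus.Labels{"host": hostname, "status": status}).Inc()
}

With the status label in place, a single Grafana query such as sum by (status) (rate(circleci_scrape_count[30m])) would replace the two separate queries shown earlier.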
I chose these labels to facilitate troubleshooting when the service starts to misbehave. My logging implementation captures the container ID and the host, and I may be able to develop alerting across hostnames using application-specific metrics. Perhaps a single container is misbehaving during a rolling deployment? With these labels I may be able to identify that a newer container ID is problematic and roll back to stabilize the service.
Status is the “product” in the case of this site, so it is essential that I capture it.
Continued Reading
I encourage you to keep reading up on Prometheus. It’s a very compelling option for quickly rolling out instrumentation. In a past position, I rolled out the node_exporter on all EC2 instances in the organization and had machine metrics overnight. Combined with Consul, auto-discovery is very pleasant.
- https://prometheus.io/docs/practices/instrumentation/
- https://github.com/samber/awesome-prometheus-alerts
- https://github.com/roaldnefs/awesome-prometheus
- https://www.prometheus.io/docs/visualization/grafana/
- https://prometheus.io/docs/prometheus/latest/configuration/configuration/#consul_sd_config
Happy instrumentation!