Using Vault to Log into Nomad

Nomad should be protected before being used in any serious capacity. It has the concept of ACL policies and roles, which are associated with a token.

When you bootstrap the cluster (something we’ll discuss across later posts), a management token is issued called the Bootstrap Token. You can think of the Bootstrap Token as root. In my Nomad cluster deployments, I use the Bootstrap Token to create another management token that I give to Vault to manage access to my Nomad cluster from that point forward.
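
A rough sketch of that handoff, assuming the Nomad secrets engine is mounted at nomad/ and using a placeholder cluster address (the token name is arbitrary):

# Run with the Bootstrap Token: create a second management token for Vault to use
nomad acl token create -name="vault-broker" -type="management"

# Enable Vault's Nomad secrets engine and hand it that management token
vault secrets enable nomad
vault write nomad/config/access \
    address=https://nomad.example.com:4646 \
    token=<Secret ID of the token created above>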

Vault creates short-lived Nomad tokens leveraging its Nomad Secret Backend. The illustration below is my attempt at describing the flow, but I’ll also show you what the process looks like.

Vault can be configured to broker Nomad tokens, controlling their lifecycle for enhanced security posture.

This diagram is confusing if you’re not familiar with the plumbing and nuances of Nomad ACLs and Vault policies, roles, and backends.

I think it makes sense to walk through this process to wrap our heads around the flow. It’s not as bad as one would think. We’ll use the Web UI to make things easier, although this can be done over the CLI as well.

Walk-thru

The first step is to log into Vault.

The Vault login screen. I have an OAuth2 Single Sign-On integration on my cluster that uses GSuite as the identity provider. TL;DR: I log in with my GSuite account.

Next we will want to expose the command prompt within the UI.

Clicking the little terminal icon in the upper-right corner will expose the command prompt.

Using the command prompt, we ask Vault to broker a Nomad token on our behalf. My cluster is set up with a policy called "devops", which gives me most of the access I need without being "root". Your policy names will depend on your organization. I enter the following command into the command prompt:

vault read nomad/creds/devops

Again, you would replace “devops” with the appropriate Nomad policy name your account has access to. If you’re reading this post and you belong to the organization I work for in my 9-5, try “developer”.
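
For context, nomad/creds/devops works because a Vault role named devops has been mapped to the Nomad ACL policy of the same name. A minimal sketch of that mapping:

# Map a Vault role to an existing Nomad ACL policy called "devops"
vault write nomad/role/devops policies=devops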

We use the command prompt to give us a Nomad Token.

If the command succeeds, the output will contain a secret_id field. This is your Nomad token. Copy it to your clipboard.

Copy the secret_id into your clipboard. This is your Nomad Token. We will use it to log into Nomad.

Navigate to your Nomad cluster. If ACLs are enabled, you should see ACL Tokens in the upper-right corner.

We need to go to the ACL page to enter our Nomad token. Click the link in the upper-right corner.

You will now be prompted to enter your Secret ID into a text field on the page. If a token is already populated in the box, click the "Clear Token" button.

Enter your Nomad Token (secret_id from the previous step) into the text input field.

If your Nomad token is valid, Nomad will greet you.

If your Nomad token is valid, Nomad will let you know. You can now do everything permitted by your ACL policy.

And that’s it! You’re in Nomad and free to roam around.
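
If you prefer the CLI to the Web UI, the same brokered token works there too. A quick sketch:

# Use the secret_id from the Vault output as your Nomad token
export NOMAD_TOKEN=<secret_id>
nomad status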

In posts to come, we'll explore how to use Nomad and how I've configured the plumbing of ACLs, roles, permissions, Single Sign-On, and all the nitty-gritty details.

Happy Nomading!

I confidently use real token information from my home lab in my tutorials. The time-to-live (TTL) on tokens brokered by Vault is set to 1 hour, so by the time this article is live, the token will have ceased to exist! This is the beauty of Vault: you get powerful security measures in place for a small investment of time.
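
For reference, that TTL is controlled on the secrets engine itself, roughly like this (the max_ttl value here is just an illustration):

# Default and maximum TTL for tokens issued by the Nomad secrets engine
vault write nomad/config/lease ttl=1h max_ttl=2h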

– kw
Metrics & Instrumentation with Prometheus: An Ops Perspective

I am a huge proponent of Prometheus for metrics-based monitoring and code instrumentation. It is easy to set up and has a decent story around scaling. There is a large amount of community support and information available to get going. I really enjoy that they publish an opinionated Best Practices section, which I have successfully used in multiple positions to introduce a standardized approach to instrumentation.

As an exercise in Golang development and production service support, I wrote and operate https://isicircleciup.net. It's a simple service that scrapes CircleCI's status page, responds with a corresponding interpretation, and increments one of two counters: circleci_outage_scrape_count or circleci_success_scrape_count.

Looking back, I should have leveraged labels to indicate CircleCI’s status rather than two separate metrics. Admittedly I was distracted by getting this service running on Nomad, implementing centralized logging, and getting Prometheus and Grafana running on the cluster as well.

A sampling of 7 days of circleci_outage_scrape_count

In this implementation the single metric circleci_outage_scrape_count has three labels: host, instance, and job. I am unsure how helpful the "job" label is, since the job name is already captured as the prefix of the metric name.

Setting up the metric is pretty straightforward using the Prometheus Golang client.

outageCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "circleci_outage_scrape_count",
		Help: "Count of page scrapes that yield a negative outlook",
	},
	[]string{"host"},
)

This counter is incremented by a single line.

outageCounter.With(prometheus.Labels{"host": hostname}).Inc()

Within Grafana, the following two PromQL queries create the graph below.

sum(rate(circleci_outage_scrape_count[30m]))
sum(rate(circleci_success_scrape_count[30m]))
A timeseries representation of CircleCI's status page

In this graph, I throw away the host, instance, and job labels. So maybe those labels aren’t necessary?

Lessons learned

Each unique combination of label values creates a separate time series, each of which consumes resources in Prometheus; for example, 10 hosts times 2 statuses already yields 20 distinct series for a single metric. So while labels are important and encouraged, we must be thoughtful in our use of them.

Following best practices, I should have a single metric called circleci_scrape_count. I'd label the metric along the following dimensions (a sketch follows the list):

  • host: records the container ID the service is running within.
  • instance: records the instance the container is running on. This is a built-in label that Prometheus adds for us.
  • status: “success” or “failure” to record the state of CircleCI’s operations.
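
A minimal sketch of what that consolidated counter could look like, assuming the same Prometheus Go client used above and a hostname variable the service already resolves:

scrapeCounter := prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "circleci_scrape_count",
		Help: "Count of CircleCI status page scrapes, labeled by outcome",
	},
	[]string{"host", "status"}, // "instance" is attached by Prometheus at scrape time
)
prometheus.MustRegister(scrapeCounter)

// On each scrape, select the series by its label values and increment it.
scrapeCounter.With(prometheus.Labels{"host": hostname, "status": "success"}).Inc()

The two Grafana queries above would then collapse into a single expression such as sum by (status) (rate(circleci_scrape_count[30m])).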

I chose these labels to facilitate troubleshooting efforts when the service starts to misbehave. My logging implementation captures the container ID and the host. I may be able to develop some alerting across hostnames using application-specific metrics. Perhaps a single container is misbehaving during a rolling deployment? I may be able to identify that a newer container ID is problematic and roll back to stabilize the service.

Status is the "product" in the case of this site, so it is essential that I capture it.

Continued Reading

I encourage you to keep reading up on Prometheus. It’s a very compelling option for quickly rolling out instrumentation. In a past position, I rolled out the node_exporter on all EC2 instances in the organization and had machine metrics overnight. Combined with Consul, auto-discovery is very pleasant.
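
As a flavor of what that looks like, here is a rough Prometheus scrape configuration sketch; the Consul server address and the node-exporter service name are assumptions you would adjust for your environment:

scrape_configs:
  - job_name: node
    consul_sd_configs:
      - server: localhost:8500        # assumption: wherever your Consul agent listens
        services: [node-exporter]     # assumption: how node_exporter is registered in Consul
    relabel_configs:
      - source_labels: [__meta_consul_node]
        target_label: instance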

Happy instrumentation!

Centralized Logging on Nomad

In this guide, I share how I implemented centralized logging on my Nomad cluster. One important thing to note about my configuration is that most of my workload runs in Docker managed by Nomad. I will share some of my configurations, but you may need to adapt them to your specific needs.

I still need to provide full working Nomad job specifications; I suppose I'll put them on GitHub/GitLab once I've cleaned them up.

Loki is a logging solution produced by Grafana Labs. I have chosen to use Loki as the log interface for my home lab Nomad cluster due to its simplicity and my desire to build out a single pane of glass for all of my cluster’s metrics.

There are three main components in the Loki stack:

  1. A log forwarder (I use two: promtail and fluent-bit; we'll discuss this in a bit.)
  2. Loki – the log aggregator
  3. Grafana – the graphical interface used to query logs.

There are a few things that I like about Loki. First, the architecture is fairly simple to rationalize. In my implementation, I run fluent-bit as a logging driver on each of the Nomad hosts. Subsequently, I run Loki to accept the forwarded logs from fluent-bit. A simple query in Grafana and I have logs from all of my containers available to me with enough metadata to get an idea of what’s happening with the application.

One of my criticisms is that the documentation Grafana offers is pretty scattered, and you're left to figure out the missing pieces yourself.

Configuration

It took me a little bit of tweaking to figure out how to produce the output that works best for me. I will express most of my configuration through Nomad job specifications; I like the idea of having all of the necessary bits to run an application in one file. I'll walk through my Nomad Job Specification files and make them available in full later (TBD).

Loki

My configuration isn't necessarily what you'd want for a production environment, but it is fairly solid for a development environment.

task "loki" {
  driver = "docker"
  config {
    image = "grafana/loki:latest"
    port_map {
      http = 3100
    }
    args = ["-config.file=/local/loki-config.yaml"]
  }

Grafana offers a Docker image, grafana/loki, with the necessary components. The HTTP interface runs on port 3100. I then tell Loki to use a templated configuration file that we'll review later on in the file.

resources {
  cpu    = 500 # 500 MHz
  memory = 256 # 256MB
  network {
    mbits = 10
    port  "http"  { static = 4444 }
  }
}

At this time, I have allocated meager resources for the single Loki service running on Nomad. I set a static port so I'm able to easily leverage internal networking. While this isn't necessary, it makes integration into Grafana easier. I chose port 4444, but you are free to set any non-conflicting port that works for your setup. A sketch of registering the service in Consul follows; after that comes the configuration.
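
To make that static port reachable by a stable name, the job can also register itself in Consul. This is a hedged sketch rather than my exact job file: the service name "loki" and the /ready health check are assumptions based on the Consul address used later (loki.service.dc1.kwojo) and Loki's readiness endpoint.

service {
  name = "loki"
  port = "http"

  # Loki exposes a readiness endpoint we can use as a health check (assumption)
  check {
    type     = "http"
    path     = "/ready"
    interval = "10s"
    timeout  = "2s"
  }
}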

template {
        data        = <<EOH
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_transfer_retries: 0

schema_config:
  configs:
    - from: 2018-04-15
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h

storage_config:
  boltdb:
    directory: /tmp/loki/index

  filesystem:
    directory: /tmp/loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
EOH
        destination = "local/loki-config.yaml"
        env         = false
      }

The configuration is pretty much the example that Grafana provides in its Docker Compose setup for local development. It uses in-container storage mechanisms and has defaults set up for the ingester. Admittedly, I haven't played around with the configuration too much yet, but I'm excited to start digging into some of the tunables.

Future work

Some things I'd like to look into: TBD.

Now that we have a destination for logs, let’s connect it to Grafana so we can see logs come into the system.

Grafana

Grafana 7.1 comes with the Loki data source available by default, and Explore is enabled so you can start querying logs right away.

Setup is fairly straightforward. I use Consul and the static port we allocated in the Loki configuration section to connect over the private network. I really wanted to run this through my load balancer to have SSL termination, but the LB is public and I haven't implemented authentication yet to protect Loki from the public Internet.
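
If you'd rather not click through the UI, Grafana's data source provisioning can set this up from a file. Here's a minimal sketch using the Consul name and static port from the Loki section; the file path is simply Grafana's conventional provisioning location, not something specific to my cluster.

# /etc/grafana/provisioning/datasources/loki.yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki.service.dc1.kwojo:4444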

Now that we have a way to verify logs are flowing into Loki, we can start sending some logs into the system.

Fluent-bit: Docker Logs

Admittedly, this was a little tricky to figure out and I am still tweaking my configuration, but I have something workable. I run Fluent Bit on each of my Nomad client nodes as a Nomad job.

Nomad Job Specification

This configuration was a bit tricky to figure out and requires knowing a couple of tidbits about how Fluent Bit works. The generic Fluent Bit, from the upstream Fluent project, has a plugin system that allows developers to write input and output plugins. Fluent Bit is a compiled executable, as it is written in C, so in order to load an external plugin you must include its shared library object.

Grafana has written a Loki plugin for Fluent Bit and bundles it in their grafana/fluent-bit-plugin-loki Docker image.

In order to get Fluent Bit to read the templated configuration, you must pass the -c flag. To include the proper shared library, you must supply the -e flag.

You can see the “magic” in Grafana’s Dockerfile for Fluent Bit.

I elected to run this as a "system" job, meaning that Nomad will place it on every client in the cluster, which is what we'll want later when I tell each container to send its logs to <HOST IP>:24224.

job "fluent-bit" {
  type = "system"
    task "fluent-bit" {
      driver = "docker"
      config {
        image = "grafana/fluent-bit-plugin-loki:latest"
        port_map {
          fluentd = 24224
        }
        command = "/fluent-bit/bin/fluent-bit"
        args = ["-c", "/local/fluent-bit.conf", "-e", "/fluent-bit/bin/out_loki.so"]
      }
      ...

The actual configuration file is pretty terse. In this case, Fluent Bit listens for logs on 0.0.0.0:24224 and forwards them on to Loki. I strip out "source" and "container_id" from the Docker log JSON payload, create the labels "job" and "hostname" to use as the base Loki log structure, and then instruct Fluent Bit to yank the "container_name" field from the Docker log and inject it into the Loki log structure.

See the documentation for each Loki plugin option.

template {
        destination = "/local/fluent-bit.conf"
        data = <<EOH
[INPUT]
    Name        forward
    Listen      0.0.0.0
    Port        24224
[Output]
    Name loki
    Match *
    Url http://loki.service.dc1.kwojo:4444/loki/api/v1/push
    RemoveKeys source,container_id
    Labels {job="fluent-bit", hostname="{{env "attr.unique.hostname" }}"}
    LabelKeys container_name
    BatchWait 1
    BatchSize 1001024
    LineFormat json
    LogLevel info
EOH
      }

For my current setup, I've provisioned pretty conservative resources while I gain more familiarity, test, and tweak. The most important part is to set a static port binding, in this case 24224.

resources {
  cpu    = 500 # 500 MHz
  memory = 256 # 256MB
  network {
    mode = "host"
    mbits = 10
    port  "fluentd"  { static = 24224 }
  }
}

Sending Docker Logs to Loki

Docker allows you to specify various log drivers, and Nomad facilitates this with the logging stanza. For services that I want to log to Loki, I copy/paste the logging stanza below.

The magic is done through environment variables managed by Nomad, namely ${attr.unique.network.ip-address}, which evaluates to the IP address of the Nomad client where the container is placed. Since we have Fluent Bit listening on every client host as a Nomad system job, we get all logs from all containers.

task "wiki-js" {
  driver = "docker"
  config {
    image = "requarks/wiki:2"
    port_map {
      http = 3000
    }
    logging {
      type = "fluentd"
      config {
        fluentd-address = "${attr.unique.network.ip-address}:24224"
      }
    }
  }
}

Syslogs

TBD

Querying Logs

It takes a little while to get used to how Loki handles searching and filtering, but once I got the hang of it, it became a really fast and powerful way to gain insight into my workloads.

In the query bar, {job="fluent-bit"} will return all logs that Fluent Bit is sending into Loki. Because of our configuration, each log stream carries the labels job, hostname, and container_name on top of the JSON-formatted Docker log line.
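
A couple of example queries against that label set; the hostname value below is hypothetical, and the container name is just the wiki-js job from earlier:

# All logs forwarded by Fluent Bit
{job="fluent-bit"}

# One container on one host, filtered to lines containing "error"
{job="fluent-bit", hostname="nomad-client-1", container_name="wiki-js"} |= "error"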
