Feature Flagging

As the lazy proactive engineer that I am, I tend to advocate for self-service tooling and the empowerment of decision-makers. The idea of continuous integration and continuous deployment (CI/CD) can be scary for some shops.

Usually the main concern is code that “goes for a ride” with a deployment before anyone intends to turn it on. What happens when code isn’t quite ready for prime time? Maybe product and marketing haven’t announced a new feature, or maybe a dependency isn’t quite ready yet? Part of the solution involves Feature Flagging: the idea that code can be deployed, then activated at a later point in time. This decouples deployment from functionality, making the change in behavior a configuration item.

In the past, I have seen a lot of home-grown solutions that tried to solve this problem. Today, it appears there is a growing number of solutions both commercial and open-source that fill this space.

My main recommendation is to put the tooling in the hands of the decision-maker, perhaps a product owner, and free up your engineering team to spend their time on more productive activities.

A follow-up recommendation is to consider the cost of a feature flag. Generally speaking, a flag’s state is stored in a datastore of some kind. Is your code optimized to check the feature flag only when necessary, or does a single page load trigger many external calls? Is the datastore right-sized? Does the network have enough throughput? These are implementation details that should be considered before jumping head-first into feature flags.
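To make that cost concrete, here is a minimal sketch in Go of caching a flag’s state for a short TTL so that a burst of page loads does not turn into a burst of external calls. The fetchFlagFromStore function is a hypothetical stand-in for whatever datastore or flag-service lookup you actually use:

package flags

import (
	"sync"
	"time"
)

// fetchFlagFromStore is a placeholder for the real lookup
// (a database query, an HTTP call to a flag service, etc.).
func fetchFlagFromStore(name string) bool {
	return false
}

type cachedFlag struct {
	value     bool
	fetchedAt time.Time
}

type FlagCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	flags map[string]cachedFlag
}

func NewFlagCache(ttl time.Duration) *FlagCache {
	return &FlagCache{ttl: ttl, flags: make(map[string]cachedFlag)}
}

// Enabled returns the cached state, refreshing from the store only
// when the cached value is older than the TTL.
func (c *FlagCache) Enabled(name string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if f, ok := c.flags[name]; ok && time.Since(f.fetchedAt) < c.ttl {
		return f.value
	}
	v := fetchFlagFromStore(name)
	c.flags[name] = cachedFlag{value: v, fetchedAt: time.Now()}
	return v
}

With a one-minute TTL, checking ten flags on every page load costs at most ten external lookups per minute per process, no matter how much traffic you take.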

During my morning scroll, I stumbled upon an open-source offering called Flagr. I like that there is an HTTP API for integration into a diverse set of tooling. I like that there is a GUI that can enable non-programmers to take feature roll-out into their own hands.

I have not used Flagr yet, so I do not fully understand its shortcomings. I do encourage you, before you start writing your own flagging system, to survey the existing offerings; they can guide you toward a pattern that you don’t have to dream up yourself.

Happy development! Cheers.


.bashrc aliases for the HashiCorp CLIs

I pretty much live in a vim, tmux, and bash world for most of my interaction outside of the web browser. Here is a snippet of my .bashrc file that saves me some time.

export VAULT_ADDR="https://your-vault-server-domain"
export NOMAD_ADDR="https://your-nomad-server-domain"
export CONSUL_HTTP_ADDR="https://your-consul-server-domain"

alias vl='vault login -method=oidc'
alias nl='export NOMAD_TOKEN=$(vault read -field=secret_id nomad/creds/developer)'
alias cl='export CONSUL_HTTP_TOKEN=$(vault read -field=token consul/creds/developer)'

My workflow is:

  1. vl – Log into Vault
  2. nl – Retrieve and set Nomad credentials
  3. cl – Retrieve and set Consul credentials

I decided to keep these as separate aliases to only retrieve the tokens I need. I generally use Vault and Nomad the most. My tokens expire after 1 hour, so this saves me quite a bit of typing.
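If you prefer a single command, the same three steps can be wrapped in one small function; this is just the aliases above chained together:

# Log into Vault, then pull fresh Nomad and Consul tokens in one shot.
hashi-login() {
  vault login -method=oidc || return
  export NOMAD_TOKEN=$(vault read -field=secret_id nomad/creds/developer)
  export CONSUL_HTTP_TOKEN=$(vault read -field=token consul/creds/developer)
}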


Forcefully terminating a user session in MVS

Login Prompt of Hercules TK4- MVS distribution
The TK4- login screen

My latest project involves running an emulated IBM mainframe on a virtual machine scheduled by Nomad. I came across a situation where my TSO session became unresponsive and I was unable to break out of the error by pressing F13 or issuing LOGOFF from my 3270 terminal.

If you have access to the operator’s console, you can CANCEL the session by issuing /C U=<USER_ID>. This will terminate the session and allow you to log in again through your terminal.
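For example, if the hung session belongs to the user HERC01 (one of the stock TK4- user IDs; substitute the TSO user ID that is actually stuck), the operator command would be:

/C U=HERC01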

Happy mainframing!


Run your KVMs as Nomad Jobs using QEMU and Cloud Init

When you download a VM image straight from a repository, it is pretty vanilla – there isn’t much configured outside of the defaults.

This guide will create a slim shim that runs when your VM starts up to inject your SSH keys. When you start your VM, you will specify the base image as a disk AND the preseed, which Cloud-Init consumes at startup.

I am running on an Ubuntu 18.04 workstation, but the process should be the same in newer versions and other distributions.

Install Required Packages

We will use a utility called cloud-localds that is included in the cloud-image-utils package and QEMU to run our VM.

sudo apt-get install -y cloud-image-utils qemu

Download a Base Image

curl https://cloud.centos.org/centos/8/x86_64/images/CentOS-8-GenericCloud-8.1.1911-20200113.3.x86_64.qcow2 --output CentOS-8-GenericCloud-8.1.1911-20200113.3.x86_64.qcow2

We will use a CentOS 8 image on an Ubuntu host, which makes it easy to keep the host and the guest straight. Plus, CentOS is a fine server image!

Create your pre-seed

Create a file called cloud-init.cfg and copy the following contents into the file:

users:
  - name: centos
    sudo: ALL=(ALL) NOPASSWD:ALL
    groups: users, admin
    home: /home/centos
    shell: /bin/bash
    lock_passwd: false
    ssh-authorized-keys:
      - <your ssh public key>
ssh_pwauth: false
disable_root: false
chpasswd:
  list: |
     centos:linux
  expire: False
packages:
  - qemu-guest-agent
# written to /var/log/cloud-init-output.log
final_message: "The system is finally up, after $UPTIME seconds"

Hint: Remember the “final_message” statement.

Replace <your ssh public key> with the contents of your public key. This is likely in ~/.ssh/id_rsa.pub. It should be a single line. Just paste it right on the line. It should look like this:

ssh-authorized-keys:
      - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCx/2nHdD7R+cn4He0Dq9uOGKGf2H9jgP+fj8SyVIZLfoVoAOOuHQNv7iXck9aF1anz6jx+LWnosPwdpg5F0LahSmxLWP6Kw8NTw46URYCtgrzg442J2mDsVd38eQI6JkAHh1GY60Klm/yCkRZNHheMBEeVSFDJKBLF4ZMJuWLKDH4po+8ExfS1IrsrgJh0h84CYL+HygI8QdFaqnTV12DCzU4ej7U36rsHu22yy6xDZJ1VC0mbf+sjF7hAfx8smF+Hg5IoCQDHf3enJdDTBUR40Fh96K80CQ2cT+teTdnvMYhI8vZa5h843ynm9Afy5p3xeXcAacYc6c0panKhLacL kevinwojkovich@Kevins-MacBook-Pro.local


Save this file and then execute the following command.

cloud-localds cloud-init.img cloud-init.cfg

This creates a disk image that contains cloud-init data that will configure your virtual machine upon boot.
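If you want a quick sanity check that the preseed image was actually written, qemu-img (installed alongside the qemu package above) can describe it:

qemu-img info cloud-init.img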

Store the disk images somewhere!

In order for your VM to run, it must have access to the disk images we just downloaded to our workstation.

Since we don’t know where Nomad will place the job within the cluster, we need to put these images into storage that is accessible by all hosts.

In my lab, I run Minio, an S3-compatible object store. For your purposes you may consider S3, HTTP, or even Git. We will take advantage of Nomad’s artifact stanza to download the image to the appropriate spot on the host system for the virtual machine.

For the sake of this post, assume that you have the files accessible over HTTP.

In my own use, I used a Presigned URL that allowed Nomad to retrieve the objects out of Minio.

I uploaded cloud-init.img to object storage as well as CentOS-8-GenericCloud-8.1.1911-20200113.3.x86_64.qcow2.
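If you happen to use Minio too, the mc client can handle both the upload and the presigned URLs. A rough sketch, assuming an mc alias named myminio and a bucket named vm-images (both hypothetical names):

mc cp cloud-init.img myminio/vm-images/
mc cp CentOS-8-GenericCloud-8.1.1911-20200113.3.x86_64.qcow2 myminio/vm-images/
mc share download myminio/vm-images/cloud-init.img

mc share download prints a time-limited presigned URL that you can drop into the artifact stanzas below.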

In the example below, replace the artifact source with a real destination.

Create your Nomad Job

Copy and paste this into a file called blog-test-centos8-vm.nomad. Feel free to adjust CPU, memory, or other settings if you know what you’re doing.

Please remember that the artifact sources need to be real locations that your Nomad workers have access to.

job "blog-test-centos8" {
  datacenters = ["dc1"]
  type = "service"
  group "blog-test-centos8" {
    count = 1
   
    task "blog-test-centos8" {
      driver = "qemu"
      config {
        image_path = "local/CentOS-8-GenericCloud-8.1.1911-20200113.3.x86_64.qcow2"
        args = ["-drive", "file=local/cloud-init.img"]
        accelerator = "kvm"
        port_map {
          ssh = 22
        }
      }

      artifact {
        source = "https://<webhost>/directory/cloud-init.img" 
      }

      artifact {
        source = "https://<webhost>/directory/CentOS-8-GenericCloud-8.1.1911-20200113.3.x86_64.qcow2"
      }

      resources {
        cpu    = 500
        memory = 1024

        network {
          mbits = 10
          port  "ssh" {}
        }
      }
    }
  }
}

This job specification is pretty straight-forward, but I’ll call out the points of interest.

  • task.config.image_path: This should be the base image for your virtual machine.
  • task.config.args: We leverage the arguments to specify a second image, our Cloud Init preseed. When the VM starts, it will run cloud-init using the data on this image.
  • task.config.port_map: Without setting up consoles, the only way to access the virtual machine will be through SSH. We will expose this port.
  • resources.network.port.ssh: Since we are running this on bare metal, the host already allocated port 22 to its SSH daemon. We will let Nomad handle the allocation of an ephemeral port for SSH access to the guest VM. Don’t worry, we can find this port later.
  • artifact(s): there are two artifact stanzas that source the VM images: the preseed and the base image. These are downloaded before the VM runs.

Run the Nomad Job

Alright, now that we have the job specification written, you can plan and run it. The plan output prints the check-index value to pass to nomad job run.

nomad plan blog-test-centos8-vm.nomad
nomad job run -check-index 290127 blog-test-centos8-vm.nomad

If all is successful, the allocation should start. You can hop into the Nomad UI, click on the allocated task, and view the log output.

Nomad allocation log output of cloud-init.
System log output highlighting the final_message stanza from our preseed cloud-init.img.

Under the hood, Nomad executes the following command on the worker. Since Nomad already adds the base image, the trick is to use the args to specify the Cloud Init preseed image.

qemu-system-x86_64 -machine type=pc,accel=kvm -name test-vm -m 1024M -drive file=Base-CentOS-8.1.1911.qcow2 -nographic -netdev user,id=user.0,hostfwd=tcp::12345-:22 -device virtio-net,netdev=user.0 -enable-kvm -cpu host -drive file=cloud-init.img

Using SSH to Connect to Your Instance

Alright, now that the instance is running, it’s sitting there waiting for us to log into it. If you were to SSH to the IP address directly, you’d be logging into the HOST and not the guest VM. We told Nomad to allocate an ephemeral port. To find out which port to connect to, run the following commands.

$ nomad status blog-test-centos8
ID            = blog-test-centos8
Name          = blog-test-centos8
Submit Date   = 2020-10-27T11:22:29-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost
blog-test-centos8  0       0         1        1       1         0

Allocations
ID        Node ID   Task Group     Version  Desired  Status   Created    Modified
a6587dec  6ea587a2  blog-test-centos8  1        run      running  2h51m ago  2h49m ago

Then, to see the specific allocation information, use your own allocation ID from the output above.

$ nomad status a6587dec 
ID                  = a6587dec-f180-8839-c09f-a15a2ce28ce6
Eval ID             = 508e0ae2
Name                = blog-test-centos8.blog-test-centos8[0]
Node ID             = 6ea587a2
Node Name           = nuc2
Job ID              = blog-test-centos8
Job Version         = 1
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 2h59m ago
Modified            = 2h57m ago

Task "tk4-mainframe" is "running"
Task Resources
CPU         Memory           Disk     Addresses
78/500 MHz  830 MiB/1.0 GiB  1.0 GiB  ssh: 192.168.100.38:20176
                                                         

Task Events:
Started At     = 2020-10-27T18:14:35Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type                   Description
2020-10-27T13:14:35-05:00  Started                Task started by client
2020-10-27T13:14:27-05:00  Downloading Artifacts  Client is downloading artifacts
2020-10-27T13:14:27-05:00  Task Setup             Building Task Directory
2020-10-27T13:12:58-05:00  Received               Task received by client

In this particular case we have ssh: 192.168.100.38:20176, meaning we can SSH using something like this:

ssh centos@192.168.100.38 -p 20176
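Once you are in, you can confirm the preseed actually ran. The cloud-init CLI inside the guest has a status subcommand, and the log referenced in our config is worth a peek (run these inside the guest, not on the Nomad host):

cloud-init status
sudo tail /var/log/cloud-init-output.log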

Next Steps

You have used some tooling built around Cloud-Init to generate a preseed image that can bootstrap your virtual machine image. You used Nomad to create a VM from those images. This is a pretty rudimentary setup, and you can do a lot more.

Head on over to the Cloud Init documentation! You can do a ton with Cloud Init to have your virtual machine come online with the correct configuration.


Removing more than 1 Nomad job at a time

Sometimes when you’re hacking around (possibly with Hashicorp Waypoint) you will find yourself with a large number of dead jobs. Here’s a small snippet to clean those jobs up. It relies on pattern matching, so in this particular example the script would clean up the following jobs:

Input

example-nodejs-01emsh754y7q217j536dk1bnp8
example-nodejs-01emshatc21ns5apgf9zfycfcw
example-nodejs-01emshfvg26rxn8km4amzn6dzf
example-nodejs-01emshjxs6639raf8kftmzqsby
example-nodejs-01emshmtvbd0xn4m0hjae9gafq
example-nodejs-01emshzq9ksn378f9z10t52bj9
example-nodejs-01emsj176yyjj3vpth2h2dgra7
example-nodejs-01emsj1ts3j9qpmdrtb1byd45p

Solution

JOB_NAME=example-nodejs
nomad status | grep "$JOB_NAME" | awk '{print $1}' | xargs -n1 -I{} nomad stop -purge {}

Output

==> Monitoring evaluation "ece9cdb4"
    Evaluation triggered by job "example-nodejs-01emsh754y7q217j536dk1bnp8"
    Evaluation within deployment: "5e08325c"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "ece9cdb4" finished with status "complete"
==> Monitoring evaluation "ef965120"
    Evaluation triggered by job "example-nodejs-01emshatc21ns5apgf9zfycfcw"
    Evaluation within deployment: "0b874c27"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "ef965120" finished with status "complete"
==> Monitoring evaluation "6f2ce59a"
    Evaluation triggered by job "example-nodejs-01emshfvg26rxn8km4amzn6dzf"
    Evaluation within deployment: "d78592aa"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "6f2ce59a" finished with status "complete"
==> Monitoring evaluation "4bea6e99"
    Evaluation triggered by job "example-nodejs-01emshjxs6639raf8kftmzqsby"
    Evaluation within deployment: "e72743df"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "4bea6e99" finished with status "complete"
==> Monitoring evaluation "cd338600"
    Evaluation triggered by job "example-nodejs-01emshmtvbd0xn4m0hjae9gafq"
    Evaluation within deployment: "75e88990"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "cd338600" finished with status "complete"
==> Monitoring evaluation "c30377d7"
    Evaluation triggered by job "example-nodejs-01emshzq9ksn378f9z10t52bj9"
    Evaluation within deployment: "3bb3da50"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "c30377d7" finished with status "complete"
==> Monitoring evaluation "03ff33ff"
    Evaluation triggered by job "example-nodejs-01emsj176yyjj3vpth2h2dgra7"
    Evaluation within deployment: "a16f38af"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "03ff33ff" finished with status "complete"
==> Monitoring evaluation "f37e556e"
    Evaluation triggered by job "example-nodejs-01emsj1ts3j9qpmdrtb1byd45p"
    Evaluation within deployment: "f35c0e58"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "f37e556e" finished with status "complete"

Using Vault to Log into Nomad

Nomad should be protected before being used in any serious capacity. Nomad has the concept of ACLs and Roles that are associated with a token.

When you bootstrap the cluster (something we’ll discuss in later posts), a management token is issued called the Bootstrap Token. You can think of the Bootstrap Token as root. In my Nomad cluster deployments, I use the Bootstrap Token to create another management token that I give to Vault to manage access to my Nomad cluster from that point forward.

Vault creates short-lived Nomad tokens leveraging its Nomad Secret Backend. The illustration below is my attempt at describing the flow, but I’ll also show you what the process looks like.

Vault can be configured to broker Nomad tokens, controlling their lifecycle for enhanced security posture.

This diagram is confusing if you’re not familiar with the plumbing and nuances of Nomad ACLs and Vault policies, roles, and backends.

I think it makes sense to walk through this process to wrap our heads around the flow. It’s not as bad as one would think. We’ll use the Web UI to make things easier, although this can be done over the CLI as well.

Walk-thru

The first step is to log into Vault.

The Vault login screen. I have an OAuth2 Single Sign On integration on my cluster that uses GSuite as the identity provider. TL;DR: I log in with my GSuite account.

Next we will want to expose the command prompt within the UI.

Clicking the little terminal icon in the upper-right corner will expose the command prompt.

Using the command prompt, we ask Vault to broker a Nomad token on our behalf. My cluster is set up with a policy called “devops”, which gives me most of the access I need without being “root”. Your implementation may vary depending on your organization. I enter the following command into the command prompt:

vault read nomad/creds/devops

Again, you would replace “devops” with the appropriate Nomad policy name your account has access to. If you’re reading this post and you belong to the organization I work for in my 9-5, try “developer”.

We use the command prompt to give us a Nomad Token.

If everything was successful, you should have output from the command that contains secret_id. This is your Nomad Token. Copy it into your clipboard.

Copy the secret_id into your clipboard. This is your Nomad Token. We will use it to log into Nomad.
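If you would rather stay in the terminal, the same read can export the token directly, mirroring the aliases from the .bashrc post earlier (replace devops with your policy name):

export NOMAD_TOKEN=$(vault read -field=secret_id nomad/creds/devops)
nomad status

The Nomad CLI and the UI both accept the same secret_id.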

Navigate to your Nomad cluster. If ACLs are enabled, you should see ACL Tokens in the upper-right corner.

We need to go to the ACL page to enter our Nomad token. Click the link in the upper-right hand corner.

You will now be prompted to enter your Secret ID into a text field on the page. If there already is a token populated in the box, click the “Clear Token” button.

Enter your Nomad Token (secret_id from the previous step) into the text input field.

If your Nomad token is valid, Nomad will greet you.

If your Nomad token is valid, Nomad will let you know. You are now allowed to do everything permitted by your ACL.

And that’s it! You’re in Nomad and free to roam around.

In posts to come, we’ll explore how to use Nomad and how I’ve configured the plumbing of ACLs, roles, permissions, Single Sign On, and all the nitty-gritty details.

Happy Nomading!

I confidently use real token information in my tutorials from my home lab. The time-to-live (TTL) on my tokens brokered by Vault is set to 1 hour. By the time this article is live, the token will have ceased to exist! This is the beauty of Vault: powerful security measures in place for a small investment of time.

– kw

Metrics & Instrumentation with Prometheus: An Ops Perspective

I am a huge proponent of Prometheus for metrics-based monitoring and code instrumentation. It is easy to set up and has a decent story around scaling. There is a large amount of community support and information available to get going. I really enjoy that they publish an opinionated Best Practices section, which I have successfully used in multiple positions to introduce a standardized approach to instrumentation.

As an exercise in Golang Development and production service support, I wrote and operate https://isicircleciup.net. It’s a simple service that scrapes CircleCI’s status page, responds with a corresponding interpretation, and increases one of two counters: circleci_outage_scrape_count or circleci_success_scrape_count.

Looking back, I should have leveraged labels to indicate CircleCI’s status rather than two separate metrics. Admittedly I was distracted by getting this service running on Nomad, implementing centralized logging, and getting Prometheus and Grafana running on the cluster as well.

A sampling of 7 days of circleci_outage_scrape_count

In this implementation the single metric circleci_outage_scrape_count has three labels: host, instance, and job. I am unsure how helpful the “job” label is, since the same information is already captured in the metric name’s prefix.

Setting up the metric is pretty straightforward using the Prometheus Golang client.

outageCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "circleci_outage_scrape_count",
		Help: "Count of page scrapes that yield a negative outlook",
	},[]string{"host"})

This counter is incremented by a single line.

outageCounter.With(prometheus.Labels{"host": hostname}).Inc()

Within Grafana, the following two PromQL queries create the graph below.

sum(rate(circleci_outage_scrape_count[30m]))
sum(rate(circleci_success_scrape_count[30m]))
A timeseries representation of CircleCI's status page

In this graph, I throw away the host, instance, and job labels. So maybe those labels aren’t necessary?

Lessons learned

Each label-value pair creates a unique time series which requires machine resources from Prometheus, so while labels are important and encouraged, we must be thoughtful in our use of them.

Following best practices, I should have a single metric called circleci_scrape_count (sketched after the list below). I’d label the metric along the following dimensions:

  • host: records the container ID the service is running within.
  • instance: records the instance the container is running on. This is a built-in label that Prometheus adds for us.
  • status: “success” or “failure” to record the state of CircleCI’s operations.
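Putting that together, a sketch of what the consolidated counter might look like with the same Go client used above (the status value gets set to “success” or “failure” wherever the scrape result is handled):

scrapeCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "circleci_scrape_count",
		Help: "Count of CircleCI status page scrapes, labeled by outcome",
	},[]string{"host", "status"})

scrapeCounter.With(prometheus.Labels{"host": hostname, "status": "success"}).Inc()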

I chose these labels to facilitate troubleshooting efforts when the service starts to misbehave. My logging implementation captures the container ID and the host. I may be able to develop some alerting across hostnames using application-specific metrics. Perhaps a single container is misbehaving during a rolling deployment? I may be able to identify that a newer container ID is problematic and roll back to stabilize the service.

Status is the “product” in the case of this site, so it is essential that I capture it.

Continued Reading

I encourage you to keep reading up on Prometheus. It’s a very compelling option for quickly rolling out instrumentation. In a past position, I rolled out the node_exporter on all EC2 instances in the organization and had machine metrics overnight. Combined with Consul, auto-discovery is very pleasant.

Happy instrumentation!


Centralized Logging on Nomad

In this guide, I share how I implemented centralized logging on my Nomad cluster. One important thing to note about my configuration is that most of my workload runs in Docker managed by Nomad. I will share some of my configurations, but you may need to adapt for your specific needs.

I still need to provide full working Nomad specifications; I suppose I’ll put them on GitHub/GitLab once I’ve cleaned them up.

Loki is a logging solution produced by Grafana Labs. I have chosen to use Loki as the log interface for my home lab Nomad cluster due to its simplicity and my desire to build out a single pane of glass for all of my cluster’s metrics.

There are three main components in the Loki stack:

  1. A log forwarder (I use two: Promtail and Fluent Bit; we’ll discuss this in a bit.)
  2. Loki – the log aggregator
  3. Grafana – the graphical interface used to query logs.

There are a few things that I like about Loki. First, the architecture is fairly simple to rationalize. In my implementation, I run fluent-bit as a logging driver on each of the Nomad hosts. Subsequently, I run Loki to accept the forwarded logs from fluent-bit. A simple query in Grafana and I have logs from all of my containers available to me with enough metadata to get an idea of what’s happening with the application.

One of my criticisms is that the documentation that Grafana offers up is pretty scattered and you’re left to kind of figure out the missing pieces.

Configuration

It took me a little bit of tweaking to figure out how to produce the output that works best for me. I will express most of my configuration through Nomad job specifications; I like the idea of having all of the necessary bits to run an application in one file. I’ll walk through my Nomad job specification files and make them available in full. TBD.

Loki

My configuration isn’t necessarily what you’d want for a production environment, but it is fairly solid for a development environment.

task "loki" {
  driver = "docker"
  config {
    image = "grafana/loki:latest"
    port_map {
      http = 3100
    }
    args = ["-config.file=/local/loki-config.yaml"]
  }

Grafana offers a Docker image, grafana/loki, that contains the necessary components. The HTTP interface runs on port 3100. I then tell Loki to use a templated configuration file that we’ll review later on in the file.

resources {
  cpu    = 500 # 500 MHz
  memory = 256 # 256MB
  network {
    mbits = 10
    port  "http"  { static = "4444" }
  }
}

At this time, I have allocated meager resources for the single Loki service running on Nomad. I set a static port so I’m able to easily leverage internal networking. While this isn’t necessary, it makes integration into Grafana easier. I chose port 4444, but you are free to set a non-conflicting port that works for your setup. The next portion is the configuration.

template {
        data        = <<EOH
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_transfer_retries: 0

schema_config:
  configs:
    - from: 2018-04-15
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h

storage_config:
  boltdb:
    directory: /tmp/loki/index

  filesystem:
    directory: /tmp/loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
EOH
        destination = "local/loki-config.yaml"
        env         = false
      }

The configuration is pretty much the example from the Docker Compose example that Grafana provides for local development. It uses in-container storage mechanisms and has defaults set up for the ingester. Admittedly I haven’t played around too much with the configuration yet, but I’m excited to start digging into some of the tunables.

Future work

Some things I’d like to look into:

Now that we have a destination for logs, let’s connect it to Grafana so we can see logs come into the system.

Grafana

Grafana 7.1 comes with Loki Data Sources available by default and the Explorer is enabled so you can start querying logs right away.

Setup is fairly straight-forward. I use Consul and the static port we allocated in the Loki Configuration section to connect over the private network. I really wanted to run this through my load balancer to have SSL termination, but the LB is public and I haven’t implemented authentication yet to protect Loki from the public Internet.
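For reference, pointing Grafana at Loki can also be done with a provisioned data source instead of clicking through the UI. A sketch of the provisioning file, using the Consul service name and static port from the configuration in this post (your Consul domain and file path will differ):

# e.g. /etc/grafana/provisioning/datasources/loki.yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki.service.dc1.kwojo:4444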

Now that we have a way to verify logs are flowing into Loki, we can start sending some logs into the system.

Fluent-bit: Docker Logs

Admittedly, this was a little tricky to figure out; I am still tweaking my configuration, but I have something workable. I run Fluent Bit on each of my Nomad client nodes as a Nomad job.

Nomad Job Specification

This configuration was a bit tricky to figure out and requires knowledge of a couple of tidbits about how Fluent Bit works. The generic Fluent Bit, from the upstream Fluentd project, has a plugin system that allows developers to write input or output plugins. Fluent Bit is a compiled executable, as it is written in C. In order to load a plugin, you must include its shared library object.

Grafana has written a Loki plugin for Fluent Bit and bundles it in their grafana/fluent-bit-plugin-loki Docker image.

In order to get Fluent Bit to read the templated configuration, you must pass the -c flag. To include the proper shared library, you must supply the -e flag.

You can see the “magic” in Grafana’s Dockerfile for Fluent Bit.

I elected to run this as a “system” job, meaning that Nomad will place it on each client in the cluster; this is what we’ll want later when I tell each container to send its logs to <HOST IP>:24224.

job "fluent-bit" {
  type = "system"
    task "fluent-bit" {
      driver = "docker"
      config {
        image = "grafana/fluent-bit-plugin-loki:latest"
        port_map {
          fluentd = 24224
        }
        command = "/fluent-bit/bin/fluent-bit"
        args = ["-c", "/local/fluent-bit.conf", "-e", "/fluent-bit/bin/out_loki.so"]
      }
      ...


The actual configuration file is pretty terse. In this case, Fluent Bit is listening for logs on 0.0.0.0:24224 and will forward the logs onto Loki. I strip out “source” and “container_id” from the Docker log JSON payload. We create labels “job” and “hostname” and use them as the base Loki log structure. We then instruct Fluent Bit to yank the “container_name” field from the Docker log and inject it into the Loki log structure.

See the documentation for each Loki plugin option.

template {
        destination = "/local/fluent-bit.conf"
        data = <<EOH
[INPUT]
    Name        forward
    Listen      0.0.0.0
    Port        24224
[Output]
    Name loki
    Match *
    Url http://loki.service.dc1.kwojo:4444/loki/api/v1/push
    RemoveKeys source,container_id
    Labels {job="fluent-bit", hostname="{{env "attr.unique.hostname" }}"}
    LabelKeys container_name
    BatchWait 1
    BatchSize 1001024
    LineFormat json
    LogLevel info
EOH
      }


For my current setup, I’ve provisioned pretty conservative resources while I gain more familiarity, test, and tweak. The most important part is to set a static port binding, in this case 24224.

resources {
  cpu    = 500 # 500 MHz
  memory = 256 # 256MB
  network {
    mode = "host"
    mbits = 10
    port  "fluentd"  { static = 24224 }
  }
}

Sending Docker Logs to Loki

Docker allows you to specify various log drivers. Nomad facilitates this with the logging stanza. For services that I want to log to Loki, I copy/paste the logging stanza below.

The magic is done through variables interpolated by Nomad, namely ${attr.unique.network.ip-address}, which evaluates to the IP address of the Nomad client where the container is placed. Since we have Fluent Bit listening on all client hosts as a Nomad system job, we get all logs from all containers.

task "wiki-js" {
  driver = "docker"
  config {
    image = "requarks/wiki:2"
    port_map {
      http = 3000
    }
    logging {
      type = "fluentd"
      config {
        fluentd-address = "${attr.unique.network.ip-address}:24224"
      }
    }
  }

Syslogs

TBD

Querying Logs

It takes a little while to get used to how Loki handles searching and filtering, but once I got the hang of it, it became a fast and powerful way to gain insight into my workloads.

In the query bar, {job="fluent-bit"} will return all logs that Fluent Bit is sending into Loki. With the configuration above, each log line carries the job and hostname labels plus the container_name pulled from the Docker log.

Examples
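A few queries I find myself reaching for, given the labels set in the Fluent Bit configuration above (the container_name values depend on what your tasks are called; wiki-js is the example from earlier):

  • {job="fluent-bit", hostname="nuc2"} – only the logs forwarded from a single Nomad client.
  • {job="fluent-bit", container_name=~"wiki-js.*"} – one service, regardless of which host or allocation it landed on.
  • {job="fluent-bit", container_name=~"wiki-js.*"} |= "error" – the same stream, filtered down to lines containing “error”.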
