Running JVMs in Kubernetes

UPDATED 7 Dec 2017 to reflect the experimental cgroup compliance flag available in JDK 9 and later builds of JDK 8.

Java + K8S

Containers are all the rage, and there’s good reason for that: they’re lightweight, they’re portable, and they offer a lot of environmental consistency. They can be tricky, though: many a developer has mistaken containers for virtual machines. While containers can look and feel every bit like a full VM, they absolutely, positively, definitely are not.

In actuality, all of the containers running on a host are managed by the host OS, which creates an illusion of separation using the isolation features of the Linux kernel. Namespaces limit what a containerized process can see, and cgroups limit the resources (CPU, memory, disk I/O, network, etc.) it can use. While this illusion is pretty good, it’s not perfect: some notable applications, including common tools such as top, free, and ps, were created before cgroups were implemented, and mostly don’t respect the limits imposed on them.
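If you want to see the gap for yourself, here’s a rough sketch that prints the host’s view of memory next to the cgroup’s. It assumes a Linux box with cgroup v1 mounted in the usual place (/sys/fs/cgroup/memory/memory.limit_in_bytes); on cgroup v2 the file is /sys/fs/cgroup/memory.max instead. The function names are mine, purely for illustration.

```shell
#!/usr/bin/env sh

# Host view: total physical memory in MB, parsed from a meminfo-style file.
# $1 is optional and defaults to /proc/meminfo.
host_mem_mb() {
    meminfo_file="${1:-/proc/meminfo}"
    [ -r "$meminfo_file" ] || { echo "unknown"; return 0; }
    # MemTotal is reported in kB, so divide by 1024 to get MB.
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' "$meminfo_file"
}

# Container view: the cgroup memory limit in MB.
# $1 is optional and defaults to the usual cgroup v1 location (an assumption).
cgroup_limit_mb() {
    limit_file="${1:-/sys/fs/cgroup/memory/memory.limit_in_bytes}"
    [ -r "$limit_file" ] || { echo "unknown"; return 0; }
    # cgroup v2 writes the literal word "max" when no limit is set.
    awk '{ if ($1 == "max") print "unlimited"; else printf "%d\n", $1 / 1048576 }' "$limit_file"
}

echo "host memory (MB):  $(host_mem_mb)"
echo "cgroup limit (MB): $(cgroup_limit_mb)"
```

Run inside a container that has a memory limit, the first number will typically be the whole host’s memory and the second the limit you actually set.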

By default, the Java Virtual Machine (JVM) is one of these applications. Developers running more than one containerized JVM on a host — especially under an orchestration framework like Kubernetes — should be aware that the default garbage collector, heap size, and runtime compiler work quite differently from what you’d expect.

Memory allocation in the Java Virtual Machine

JVM memory management is a complex subject. Too complicated to be dealt with in much detail here, so I’ll do my best to hit the bare basics.

The JVM is an abstract computing machine that enables a computer to run a Java program, which is a fancy way of saying that it’s a program that runs a program. Objects in the JVM reside in an area of memory called the heap, which has a fixed minimum and maximum size, set in Java by using the -Xms and -Xmx flags. When the JVM starts up, the heap is created with an initial size equal to its minimum, and can increase or decrease in size as the application runs. When the heap nears its maximum size, unused objects and data are reclaimed by the garbage collector to recover space.

Default heap sizes

So, what if you don’t set the heap size? Well, then the defaults kick in, and that’s where things get interesting.

The default heap size varies a bit depending on whether the JVM is running in “server mode” or “client mode”. The good news is that by default, the JVM will run in “server mode”, which is optimized for long-running processes and is actually what you want for services that’ll be up for some time. In server mode on HotSpot (the JVM that ships with OpenJDK and the Oracle JDK), the initial heap size defaults to 1/64 of the physical memory in the system.

The maximum, on the other hand, defaults to 1/4 of physical memory. Now, you would think that the JVM would define “system memory” as the memory limit imposed by the container’s cgroup, but that isn’t the case. Instead the JVM queries the kernel directly to gauge its memory capacity, ignoring cgroups (and container memory limits) entirely. That’s not so bad if you only want to run one or two JVMs, but here’s the sticky bit: if you start a few JVMs on a host, they’ll all start happily with minimal initial heap sizes. The problem is that each of them now has a maximum heap size of a quarter of the host’s memory, whatever its own container limit says. So if your applications begin to grow, eventually one will try to resize, find that the memory isn’t available because it’s all been taken up by other JVMs, and crash. No warning, no ceremony, it just shits the bed.

In Kubernetes, this manifests as JVM pods just seeming to die randomly, leaving no useful logs or output messages.
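If you’re curious what your own JVM actually picked, HotSpot will tell you. The sketch below assumes a HotSpot-based java on the PATH and uses its -XX:+PrintFlagsFinal option; flag_value is just a helper name I made up to pull a single value out of that output.

```shell
#!/usr/bin/env sh

# Print the value of one flag from -XX:+PrintFlagsFinal output read on stdin.
# Each line of that output looks like: <type> <name> <= or :=> <value> {kind}
flag_value() {
    awk -v flag="$1" '$2 == flag { print $4 }'
}

if command -v java >/dev/null 2>&1; then
    flags=$(java -XX:+PrintFlagsFinal -version 2>/dev/null)
    max_bytes=$(printf '%s\n' "$flags" | flag_value MaxHeapSize)
    # Only report if the JVM actually printed the flag.
    if [ -n "$max_bytes" ]; then
        echo "default max heap: $((max_bytes / 1048576)) MB"
    fi
fi
```

Try it on the host and then inside a memory-limited container: on a JVM that ignores cgroups, you should see the same number both times.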

Figuring out what you need

Alright, we’ve established that the default JVM doesn’t respect container memory limits, so how do we deal with this? Well, that depends on what version of the JVM we’re using. Fortunately, if you’re using JDK 8u131 or later, or JDK 9, there’s a magical “respect cgroup memory limits” flag, helpfully named -XX:+UseCGroupMemoryLimitForHeap. You lucky folks can choose to skip ahead to the section “Set reasonable pod requests and limits”, but keep in mind that the heap isn’t the only memory the JVM uses, so the container limit still has to leave room for the rest. If you’re stuck on an older JDK, however, you’ll have to set your heap size manually.

You can guess how much memory you need, of course, but I suggest that you take the time to measure the actual memory footprint of all of your JVM containers with some of your favorite OS tools. The top and ps shell commands are a solid choice, and the Task Manager in Windows works fine. To see how the memory usage of a JVM process is distributed, you can use jcmd on HotSpot (with Native Memory Tracking enabled) or jrcmd on JRockit. Ideally, the heap should be large enough that the JVM spends far less time garbage collecting than running code.

Regardless of which version of the JVM you’re using, if you really want to get fancy, you can consider going the extra mile and tuning your garbage collector, compaction, and object allocation behavior, but as long as you’re running in server mode — which is the default in most modern JVM implementations — you should be fine. For more about how to do this, take a look here.

An important note: the heap isn’t the entire story of the JVM’s memory footprint. Methods, thread stacks, native handles, and JVM internal data structures are all stored in memory allocated entirely separately from the heap. The exact amount of non-heap memory can vary widely, but a safe bet if you’re doing resource planning is that the heap is about 80% of the JVM’s total memory. So if you set the maximum heap to 1000 MB, you can expect that the whole JVM might need around 1250 MB (1000 / 0.8).
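That rule of thumb is easy to turn into a one-liner. This is only a sketch of the arithmetic: total_from_heap_mb is a name I invented, and the 80% default is the rough planning figure from above, not a measured value.

```shell
#!/usr/bin/env sh

# Estimate the JVM's total memory (MB) from its maximum heap (MB),
# assuming the heap is heap_percent of the total (default: 80).
total_from_heap_mb() {
    heap_mb=$1
    heap_percent=${2:-80}
    # Integer division, rounded up so the estimate never undershoots.
    echo $(( (heap_mb * 100 + heap_percent - 1) / heap_percent ))
}

total_from_heap_mb 1000    # prints 1250
total_from_heap_mb 500     # prints 625
```

If your own measurements show a different heap-to-total ratio, pass it as the second argument.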

Setting the heap size in Java

In Java, the heap size is set with the command line options -Xms (initial heap size) and -Xmx (maximum heap size); for example:

java -Xms500m -Xmx500m myApplication

or for Java 8u131 and Java 9:

java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap myApplication

Note that to use the experimental flag, you have to unlock it first.

Setting the heap size in Scala

In Scala using sbt, there are actually a few ways to tweak the heap size, but the recommended way appears to be to set it on the command line using the -J prefix for the usual Java options:

sbt run -J-Xms500m -J-Xmx500m

or if you’re using JDK 8u131 or Java 9:

sbt run -J-XX:+UnlockExperimentalVMOptions -J-XX:+UseCGroupMemoryLimitForHeap

I’m not a Scala expert though, so if there are better ways, feel free to let me know!

Setting the heap size in other JVM languages

If you’re using Clojure, Groovy, Kotlin, or one of the many other JVM-based languages available, let me know what the appropriate flags are, and I’ll add them here, too.

Set reasonable pod requests and limits

So we’re all good to run in Kubernetes, right? Nope.

Because even if you know how much memory your process needs, Kubernetes doesn’t, and without some guidance, your pod can be scheduled onto a node without enough memory, particularly on congested clusters. If the node doesn’t have the memory necessary to meet the JVM’s minimum requirements, the JVM will just die immediately; if there isn’t enough for the heap to grow, the JVM will run for a while, but die quietly when it fails to allocate more heap. What’s more, Kubernetes will helpfully try to restart the crashed pod on the same node, resulting in a lovely crash loop.

This is where Kubernetes resource requests and limits come in. This is actually a pretty complicated (and interesting!) topic. From the Kubernetes Resource Quality of Service:

For each resource, containers specify a request, which is the amount of [memory] that the system will guarantee to the container, and a limit which is the maximum amount that the system will allow the container to use. When request == limit, the resources are guaranteed.

For our use-case, we’ll probably want to set the request and limit the same. Other options are available, but you’ll want to read the Kubernetes Resource QoS to understand those.

For example, imagine a Java container with a 1GB minimum and maximum heap size. Keeping in mind that heap memory is only (very roughly) about 80% of the total memory footprint, the Deployment manifest would include the following:

containers:
- name: my-java-container
  resources:
    limits:
      memory: 1250M
    requests:
      memory: 1250M

If you don’t mind manually setting your Kubernetes resource and JVM heap values or other flags — and making sure they stay consistent — this will get it done. However, it’s ugly and error-prone and you’re inevitably going to change one value and forget to change the other.

Is there a better way? You bet.

The Kubernetes downward API

The Downward API is a little-known functionality of Kubernetes that allows you to expose Pod and Container fields to a process running in a container. With this, we can create a container that inspects its own resource requests and automatically sets its heap size appropriately.

Using the Downward API to inject resource fields is actually incredibly easy:

containers:
- name: my-java-container
  resources:
    limits:
      memory: 1000M
    requests:
      memory: 1000M
  env:
  - name: MEM_TOTAL_MB
    valueFrom:
      resourceFieldRef:
        resource: limits.memory
        divisor: 1M

Magically, the running container now has a MEM_TOTAL_MB environment variable equal to 1000 (the divisor of 1M is what converts the limit from bytes to megabytes).
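One detail worth checking: resourceFieldRef takes an optional divisor field, which defaults to 1, meaning the value is exposed in bytes. Here’s a back-of-the-envelope sketch of the arithmetic, assuming (as I understand it) that Kubernetes divides the limit by the divisor and rounds up. exposed_value is a made-up helper, not part of any Kubernetes tooling.

```shell
#!/usr/bin/env sh

# Approximate the value a resourceFieldRef exposes:
# the limit divided by the divisor, rounded up.
# $1: limit in bytes, $2: divisor in bytes.
exposed_value() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

exposed_value 1000000000 1          # divisor 1 (the default): prints 1000000000
exposed_value 1000000000 1000000    # divisor 1M:  prints 1000
exposed_value 1000000000 1048576    # divisor 1Mi: prints 954
```

The last line matters because the JVM reads the m in -Xmx800m as mebibytes, so a 1Mi divisor arguably lines up better with what the heap flag actually means.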

Containerizing it

Now we get a little tricksy: the java, sbt, or whatever tool you use needs to know how to access that value. In order to do that, we’ll need to bake that logic into the image by modifying the entrypoint. To do that for a Java image, we include a script like the following:

#!/usr/bin/env sh

set -e

if [ "$1" = 'java' ]; then
    shift

    DEFAULT_MEM_JAVA_PERCENT=80

    if [ -z "$MEM_JAVA_PERCENT" ]; then
        MEM_JAVA_PERCENT=$DEFAULT_MEM_JAVA_PERCENT
    fi

    # If MEM_TOTAL_MB is set, the heap is set to a percent of that
    # value equal to MEM_JAVA_PERCENT; otherwise it uses the default
    # memory settings.
    if [ ! -z "$MEM_TOTAL_MB" ]; then
        MEM_JAVA_MB=$(($MEM_TOTAL_MB * $MEM_JAVA_PERCENT / 100))
        MEM_JAVA_ARGS="-Xmx${MEM_JAVA_MB}m"
    else
        MEM_JAVA_ARGS=""
    fi

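    # exec replaces the shell, so java receives signals (such as the
    # SIGTERM Kubernetes sends on shutdown) directly. MEM_JAVA_ARGS is
    # left unquoted on purpose so an empty value adds no argument.
    exec java $MEM_JAVA_ARGS "$@"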
else
    exec "$@"
fi
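To sanity-check the numbers before baking the script into an image, here’s the same heap calculation pulled out into a standalone function. heap_flag is a hypothetical name for illustration; the 80% default mirrors DEFAULT_MEM_JAVA_PERCENT above.

```shell
#!/usr/bin/env sh

# Print the -Xmx flag the entrypoint would pass to java.
# $1: total container memory in MB (may be empty), $2: heap percent (default 80).
heap_flag() {
    total_mb=$1
    percent=${2:-80}
    # With no total, print nothing and let the JVM use its defaults.
    if [ -n "$total_mb" ]; then
        echo "-Xmx$(( total_mb * percent / 100 ))m"
    fi
}

heap_flag 1000       # prints -Xmx800m
heap_flag 1000 75    # prints -Xmx750m
heap_flag ""         # prints nothing
```

So a pod with a 1000M limit ends up with an 800 MB heap, leaving roughly 200 MB for everything else the JVM needs.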

Of course, if all you want to do is automatically set the experimental -XX:+UseCGroupMemoryLimitForHeap flag, your container entrypoint gets a bit simpler:

#!/usr/bin/env sh

set -e

if [ "$1" = 'java' ]; then
    shift
    exec java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap "$@"
else
    exec "$@"
fi

Now, your Dockerfile would simply look something like this:

FROM openjdk:8-jre
ADD entrypoint.sh /tmp/entrypoint.sh
ENTRYPOINT ["/tmp/entrypoint.sh"]

With this image in place, all you need to do is define your resource requests and downward API call in your manifests, and your containers will automatically resize themselves accordingly.

Making it easy

Now, you could grab that snippet and roll your own images if you want, but if you’re lazy (like me) I’ve provided a couple pre-made images just for you. Just Java and Scala for now, though, I’m afraid. Sorry!

Java: https://hub.docker.com/r/clockworksoul/java-for-k8s
Scala: https://hub.docker.com/r/clockworksoul/scala-for-k8s (Coming Soon!)
Clojure: TBD
Kotlin: TBD

Last modified on 2017-12-05