Kubernetes Observability

Health Probes & Container Monitoring

What are Kubernetes Probes?

Probing the Container

  • The kubelet checks containers periodically using probes
  • Probes determine the health and readiness of containers
  • Three types of probes: Startup, Readiness, and Liveness
  • Each probe can use different action types to check health
  • Essential for maintaining application reliability

Diagram: Kubelet -> Probe -> Container -> Response

Startup Probe

To know when a container has started

Readiness Probe

To know when a container is ready to accept traffic

Liveness Probe

To know whether a container is still healthy or needs to be restarted

Probe Types Explained

Startup Probe

Used for slow-starting containers to determine when the application has successfully started.

  • Disables liveness and readiness checks until it succeeds
  • Useful for legacy applications with long startup times
  • Prevents killing containers during initialization
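
As a rough sketch of how a startup probe shields a slow-starting container, the manifest below pairs one with a liveness probe on the same endpoint. The pod name, image, port, and /healthz path are illustrative assumptions, not taken from a real application.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slow-starter                    # hypothetical name
spec:
  containers:
  - name: app
    image: example.com/legacy-app:1.0   # hypothetical image
    ports:
    - containerPort: 8080
    # Startup budget: failureThreshold x periodSeconds = 30 x 10s = 300s.
    # The liveness probe below does not run until this probe has succeeded.
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30
      periodSeconds: 10
    # Takes over once startup has succeeded; 3 consecutive failures
    # (about 30s) lead to a restart.
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 3
      periodSeconds: 10
```

Keeping the liveness thresholds tight while giving the startup probe a generous budget means a hung application is still detected quickly after it has started.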

Readiness Probe

Determines if a container is ready to serve requests.

  • A failing readiness probe removes the pod from Service endpoints, so it stops receiving traffic
  • The container keeps running and is not restarted
  • Essential for rolling updates and load balancing
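
To illustrate the link between readiness and rolling updates, here is a minimal Deployment sketch. The names, image, and /ready path are assumptions made for the example. With maxUnavailable set to 0, the rollout only removes an old pod after a new one has passed its readiness probe.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                       # hypothetical name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0           # never drop below the desired replica count
      maxSurge: 1                 # add one new pod at a time
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: example.com/web:2.0   # hypothetical image
        ports:
        - containerPort: 8080
        # New pods receive traffic only after this probe succeeds.
        readinessProbe:
          httpGet:
            path: /ready             # assumed endpoint
            port: 8080
          periodSeconds: 5
```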

Liveness Probe

Determines if the container is running properly.

  • A failing liveness probe makes the kubelet kill the container, which is then restarted according to the pod's restartPolicy
  • Detects deadlocks and hung applications
  • Ensures application remains responsive

Important: A failing readiness probe will stop the application from receiving traffic. A failing liveness probe will restart the container.

Probe Action Types

ExecAction

Executes a command inside the container; exit code 0 counts as success

exec:
  command:
  - cat
  - /app/healthy

TCPSocketAction

Checks whether a TCP connection can be opened to the given port

tcpSocket:
  port: 8080
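
Probes can also refer to a container port by name instead of by number, which keeps the probe in sync if the port number changes. A small sketch, where the pod name and the port name app-port are made up for illustration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: named-port-demo        # hypothetical name
spec:
  containers:
  - name: goproxy
    image: registry.k8s.io/goproxy:0.1
    ports:
    - name: app-port           # hypothetical port name
      containerPort: 8080
    readinessProbe:
      tcpSocket:
        port: app-port         # resolved to containerPort 8080
      periodSeconds: 10
```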

HTTPGetAction

Performs an HTTP GET against a specific port and path; any status code from 200 to 399 counts as success

httpGet:
  path: /healthz
  port: 8080

HTTPGet Additional Options

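httpGet:
  path: /health
  port: 8080
  # host defaults to the pod IP. The kubelet probes from the node, so
  # 127.0.0.1 is the node's loopback and only reaches a hostNetwork pod.
  host: 127.0.0.1
  # scheme defaults to HTTP; with HTTPS, certificate verification is skipped.
  scheme: HTTPS
  httpHeaders:
  - name: Custom-Header
    value: Awesome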

Probe Configuration Parameters

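  • initialDelaySeconds: Seconds to wait after the container starts before the first probe (default 0)
  • periodSeconds: How often to probe, in seconds (default 10)
  • timeoutSeconds: Seconds after which a probe attempt times out (default 1)
  • successThreshold: Consecutive successes needed after a failure for the probe to pass again (default 1; must be 1 for liveness and startup probes)
  • failureThreshold: Consecutive failures after which the probe is considered failed (default 3)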
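
The sketch below sets all five parameters on one readiness probe and spells out the resulting timing in comments. The pod name, image, and /ready path are assumptions for the example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-timing-demo       # hypothetical name
spec:
  containers:
  - name: app
    image: example.com/app:1.0  # hypothetical image
    ports:
    - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /ready            # assumed endpoint
        port: 8080
      initialDelaySeconds: 5    # first attempt about 5s after container start
      periodSeconds: 10         # then one attempt every 10s
      timeoutSeconds: 2         # an attempt fails if there is no answer within 2s
      successThreshold: 2       # 2 consecutive successes to become ready again
      failureThreshold: 3       # 3 consecutive failures (about 30s) to become not ready
```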

Complete Probes Example

apiVersion: v1
kind: Pod
metadata:
  name: goproxy
  labels:
    app: goproxy
spec:
  containers:
  - name: goproxy
    image: registry.k8s.io/goproxy:0.1
    ports:
    - containerPort: 8080
    
    # Startup Probe - for slow starting containers
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 3
      periodSeconds: 10
    
    # Readiness Probe - when container is ready for traffic
    readinessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    
    # Liveness Probe - if container is running properly
    livenessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20

Startup Probe

  • HTTP GET to /healthz on port 8080
  • Checks every 10 seconds
  • Allows 3 consecutive failures (about 30 seconds) before the container is killed and restarted
  • Disables other probes until successful

Readiness Probe

  • TCP socket check on port 8080
  • First check at least 5 seconds after the container starts, and only once the startup probe has succeeded
  • Checks every 10 seconds
  • If it fails, traffic to the pod stops

Liveness Probe

  • TCP socket check on port 8080
  • First check at least 15 seconds after the container starts, and only once the startup probe has succeeded
  • Checks every 20 seconds
  • If it fails, the container is restarted

Advanced Probe Examples

Exec Action Example

apiVersion: v1
kind: Pod
metadata:
  name: postgres-db
spec:
  containers:
  - name: postgres
    image: postgres:13
    env:
    - name: POSTGRES_PASSWORD
      value: "secret"
    livenessProbe:
      exec:
        command:
        - sh
        - -c
        - exec pg_isready -U postgres
      initialDelaySeconds: 30
      periodSeconds: 10
    readinessProbe:
      exec:
        command:
        - sh
        - -c
        - exec pg_isready -U postgres
      initialDelaySeconds: 5
      periodSeconds: 5
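
The example above keeps the password inline for brevity. A possible variant, sketched here with a hypothetical Secret named postgres-credentials, reads the password from a Secret and runs pg_isready directly rather than through a shell:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: postgres-credentials    # hypothetical name
type: Opaque
stringData:
  password: "change-me"         # placeholder value
---
apiVersion: v1
kind: Pod
metadata:
  name: postgres-db
spec:
  containers:
  - name: postgres
    image: postgres:13
    env:
    - name: POSTGRES_PASSWORD
      valueFrom:
        secretKeyRef:
          name: postgres-credentials
          key: password
    readinessProbe:
      exec:
        # No shell wrapper needed; exit code 0 means the server accepts connections.
        command: ["pg_isready", "-U", "postgres"]
      initialDelaySeconds: 5
      periodSeconds: 5
```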

HTTP Get with Headers Example

apiVersion: v1
kind: Pod
metadata:
  name: web-application
spec:
  containers:
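  - name: webapp
    # Stock nginx does not serve /health or /ready (it answers 404, which
    # fails the probes); the image must be configured to serve these paths.
    image: nginx:latest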
    livenessProbe:
      httpGet:
        path: /health
        port: 80
        httpHeaders:
        - name: X-Custom-Auth
          value: "Bearer token123"
        - name: Accept
          value: application/json
      initialDelaySeconds: 10
      periodSeconds: 5
      timeoutSeconds: 2
    readinessProbe:
      httpGet:
        path: /ready
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 2
      failureThreshold: 3
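
For these probes to pass, nginx has to answer on /health and /ready. One possible way, sketched under the assumption that a ConfigMap (hypothetically named webapp-health-conf) replaces the image's /etc/nginx/conf.d directory:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: webapp-health-conf      # hypothetical name
data:
  default.conf: |
    server {
      listen 80;
      # Fixed 200 responses for the probe paths
      location = /health { return 200 'ok'; }
      location = /ready  { return 200 'ready'; }
      location / {
        root /usr/share/nginx/html;
        index index.html;
      }
    }
---
apiVersion: v1
kind: Pod
metadata:
  name: web-application
spec:
  containers:
  - name: webapp
    image: nginx:latest
    volumeMounts:
    - name: health-conf
      mountPath: /etc/nginx/conf.d   # hides the stock conf.d contents
    readinessProbe:
      httpGet:
        path: /ready
        port: 80
      periodSeconds: 5
  volumes:
  - name: health-conf
    configMap:
      name: webapp-health-conf
```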

Database with All Three Probes

apiVersion: v1
kind: Pod
metadata:
  name: mysql-database
spec:
  containers:
  - name: mysql
    image: mysql:8.0
    env:
    - name: MYSQL_ROOT_PASSWORD
      value: "secret123"
    startupProbe:
      exec:
        command:
        - sh
        - -c
        - mysqladmin ping -h localhost -uroot -p"${MYSQL_ROOT_PASSWORD}"
      failureThreshold: 30
      periodSeconds: 10
    readinessProbe:
      exec:
        command:
        - sh
        - -c
        - mysql -e 'SELECT 1' -uroot -p"${MYSQL_ROOT_PASSWORD}"
      initialDelaySeconds: 5
      periodSeconds: 5
    livenessProbe:
      tcpSocket:
        port: 3306
      initialDelaySeconds: 30
      periodSeconds: 10

Probes Best Practices

Configuration Guidelines

  • Set appropriate initial delays: Allow applications to start properly
  • Use conservative timeouts: Avoid spurious failures caused by slow responses
  • Configure realistic periods: Balance between responsiveness and load
  • Use startup probes for slow applications: Prevent unnecessary restarts
  • Make readiness probes lightweight: They run for the container's entire lifetime, not only at startup
  • Test failure scenarios: Ensure probes work as expected

Application Design

  • Implement proper health endpoints: /healthz, /readyz, /livez
  • Make liveness checks independent: Don't depend on external services
  • Include dependency checks in readiness: Database, cache, etc.
  • Keep liveness checks simple: Basic "can the process still respond" check
  • Use different endpoints: Separate health, readiness, and liveness
  • Log probe failures: Help with debugging issues
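
A minimal sketch of wiring separate endpoints to the three probes. It assumes a hypothetical application that serves /healthz once initialization is done, /readyz including its dependency checks (database, cache), and /livez with no external dependencies.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server              # hypothetical name
spec:
  containers:
  - name: api
    image: example.com/api:1.0  # hypothetical image
    ports:
    - containerPort: 8080
    startupProbe:
      httpGet:
        path: /healthz          # assumed: succeeds once initialization is done
        port: 8080
      failureThreshold: 30
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /readyz           # assumed: also checks database and cache
        port: 8080
      periodSeconds: 5
      timeoutSeconds: 2
    livenessProbe:
      httpGet:
        path: /livez            # assumed: checks only the process itself
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
```

With this split, a database outage makes /readyz fail and takes the pod out of rotation, while /livez keeps passing, so the container is not restarted.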

Probe Configuration Recommendations

Startup Probe

  • failureThreshold: 30
  • periodSeconds: 10
  • Allows up to 5 minutes (30 × 10s) for startup; raise failureThreshold for slower apps

Readiness Probe

  • periodSeconds: 5-10
  • timeoutSeconds: 1-3
  • failureThreshold: 3

Liveness Probe

  • periodSeconds: 10-30
  • initialDelaySeconds: 15-30
  • failureThreshold: 3

Warning: Avoid making liveness probes dependent on external services. If an external service fails, every container that checks it can be restarted at once, turning one outage into a cascading failure.