By Sébastien Baillet From
Containers must not run as root, so that a process escaping the container does not end up as root on the host.
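In Kubernetes, this rule can be expressed in the security context. A minimal sketch, assuming a hypothetical image `demo` whose process can run under the arbitrarily chosen UID 1000:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: non-root-demo              # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true             # the container is not started if it would run as UID 0
    runAsUser: 1000                # assumed UID, adjust to your image
  containers:
    - name: demo
      image: demo                  # assumed image
      securityContext:
        allowPrivilegeEscalation: false   # the process cannot gain more privileges than its parent
```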
A container should be started with only the minimum capabilities it needs to run.
Each privileged operation is associated with a capability.
Docker Engine starts containers with a default set of capabilities, described in the Docker documentation.
This default list may be overridden in the Docker daemon configuration or at container start.
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: demo
      image: demo
      securityContext:
        capabilities:
          drop:
            - NET_BIND_SERVICE
            - SETUID
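The stricter variant of the same idea is to drop every capability and add back only what the process needs. A minimal Docker Compose sketch of overriding the defaults at container start; the service name, the image, and the assumption that the application only needs to bind a port below 1024 are illustrative:

```yaml
services:
  web:                      # hypothetical service name
    image: demo             # assumed image
    cap_drop:
      - ALL                 # start from zero capabilities
    cap_add:
      - NET_BIND_SERVICE    # assumed to be the only capability the application needs
```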
The complete list of capabilities, with explanations of how each one can be exploited, is available on HackTricks.
The base image you build your image from should be trustworthy.
Images should contain only the strict minimum needed for the application to run.
For example:
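# Build stage: full Maven/JDK toolchain, never shipped
FROM maven:3.6.1-jdk-11-slim AS builder
WORKDIR /app
COPY . /app/
RUN mvn -T4 package -DskipTests

# Runtime stage: distroless image, no shell and no package manager
FROM gcr.io/distroless/java:11
COPY --from=builder /app/target/app.jar /app.jar
# Run as an unprivileged user
USER 1000
ENTRYPOINT ["java", "-server", "-jar", "/app.jar"]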
Containers should run with a read-only root filesystem.
Only the read-write volumes that are strictly necessary should be writable: ideally none!
The Kubernetes equivalent of Docker's --read-only flag is readOnlyRootFilesystem, for example:
apiVersion: v1
kind: Pod
metadata:
  name: volume-test
spec:
  containers:
    - name: container-test
      image: busybox
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: all-in-one
          mountPath: "/projected-volume"
          readOnly: true
  # the "all-in-one" volume must also be declared under spec.volumes (omitted here)
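On the Docker side, the same setup can be sketched with Docker Compose. The service name is hypothetical, and the writable in-memory `/tmp` is an assumption about what the application needs:

```yaml
services:
  app:                  # hypothetical service name
    image: busybox
    read_only: true     # root filesystem is mounted read-only
    tmpfs:
      - /tmp            # assumed scratch space, writable and kept in memory only
```

If the application writes nowhere, the `tmpfs` entry can be removed entirely.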
No secrets in images!
Never.
No exceptions.
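One common alternative is to inject secrets at runtime. A minimal sketch using a Kubernetes Secret exposed as an environment variable; the Secret name `demo-credentials`, its key `password`, and the variable name `DB_PASSWORD` are all hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secret-demo                  # hypothetical name
spec:
  containers:
    - name: demo
      image: demo                    # assumed image, built without any secret inside
      env:
        - name: DB_PASSWORD          # hypothetical variable read by the application
          valueFrom:
            secretKeyRef:
              name: demo-credentials # hypothetical Secret, created separately in the cluster
              key: password
```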
Use tools such as Gatekeeper to enforce policies in your clusters.
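As an illustration, a Gatekeeper constraint could require a read-only root filesystem on every pod. This sketch assumes that the `K8sPSPReadOnlyRootFilesystem` ConstraintTemplate from the Gatekeeper policy library is already installed in the cluster; the constraint name is hypothetical:

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPReadOnlyRootFilesystem     # assumes the matching ConstraintTemplate is installed
metadata:
  name: require-readonly-rootfs        # hypothetical name
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]                 # apply the policy to all pods
```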
Containers are processes executed by the host kernel, which uses chroot, namespaces, and cgroups to provide isolation and resource limits.
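The resource-limit part is what Kubernetes exposes as `resources.limits`, which the container runtime enforces through cgroups. A minimal sketch with arbitrary, illustrative values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: limits-demo          # hypothetical name
spec:
  containers:
    - name: demo
      image: demo            # assumed image
      resources:
        requests:
          cpu: 250m          # illustrative values, tune for your workload
          memory: 128Mi
        limits:
          cpu: 500m          # CPU usage above this is throttled
          memory: 256Mi      # exceeding this gets the container OOM-killed
```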
Namespaces are a feature of the Linux kernel that partitions kernel resources. Processes that share a namespace can see each other's resources of the kind that namespace covers.
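Namespace kinds provided by the Linux kernel: mount (mnt), process ID (pid), network (net), inter-process communication (ipc), UTS (hostname), user, cgroup, and time.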
By default, a Docker container starts with its own namespaces, which isolates it from the host.
Docker can also make a container join the namespaces of another container or of the host.
Let's join another container's network and PID namespaces:
version: '3.7'
services:
  oauth2-proxy:
    image: quay.io/oauth2-proxy/oauth2-proxy:v7.7.0
    ports:
      - "3000:3000"
    volumes:
      - ./oauth2-proxy-keycloak.cfg:/oauth2-proxy.cfg
  myapp:
    image: containous/whoami
    network_mode: service:oauth2-proxy
    pid: service:oauth2-proxy
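Joining the host namespaces uses the same keys. A minimal sketch with a hypothetical service; note that this removes the network and PID isolation from the host, so it should stay limited to debugging or monitoring use cases:

```yaml
services:
  debug-tools:                   # hypothetical service name
    image: busybox
    command: ["sleep", "3600"]   # keep the container alive so it can be inspected
    network_mode: host           # share the host network namespace
    pid: host                    # share the host PID namespace: host processes become visible
```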
Pods (the atomic unit of deployment in Kubernetes) can contain one or more containers.
You can share the PID namespace between the containers of a pod:
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  shareProcessNamespace: true
  containers:
    - name: my-app
      image: my-app-image:12
    - name: my-other-app
      image: my-other-app-image:14
Ephemeral containers: a stable feature since Kubernetes 1.25.
kubectl debug -it --image my-debugging-toolbox \
my-pod-XXXXXXXXX [--target my-container]
Demo: Playing with pid & network namespace
...that can help you out
80% of your code is not written by your team. Use a scanner to detect known vulnerabilities in your dependencies.
For example: Dependabot, Checkmarx SCA, Trivy, JFrog Xray, Renovate...
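As an illustration, Dependabot is enabled through a `.github/dependabot.yml` file in the repository. This sketch assumes a Maven project with a Dockerfile at the repository root; the weekly interval is an arbitrary choice:

```yaml
version: 2
updates:
  - package-ecosystem: "maven"    # assumed build tool
    directory: "/"                # location of pom.xml
    schedule:
      interval: "weekly"
  - package-ecosystem: "docker"   # also track base image updates in the Dockerfile
    directory: "/"
    schedule:
      interval: "weekly"
```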
Directly patch container image vulnerabilities.
Useful for unmaintained images.
By default, there is no network isolation at all between pods!
Use network policies:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: my-network-policy
  namespace: my-namespace
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: grafana-agent-ns
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-controller-ns
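A common starting point is a default-deny policy per namespace, on top of which allow rules such as the one above are added. A minimal sketch, reusing the `my-namespace` name from the example; the policy name is hypothetical:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress   # hypothetical name
  namespace: my-namespace
spec:
  podSelector: {}              # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress                  # no ingress rule listed, so all incoming traffic is denied
```

Network policies are only enforced when the cluster's network plugin supports them.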