By default Docker allows all of their containers to run with the CAP_NET_RAW capability, I believe to easily support ICMP health checks when needed. Supporting ping makes sense but this post will go through why CAP_NET_RAW is an unnecessary risk and how you can still send pings without needing CAP_NET_RAW.
CAP_NET_RAW controls a processes ability to build any types of packets that you want. TCP, UDP, ARP, ICMP, etc. For example, you may have noticed when running
nmap --priviliged as a normal user, the output would look something like this:
Starting Nmap 7.60 ( https://nmap.org ) at 2019-01-10 16:31 EST Couldn't open a raw socket. Error: Operation not permitted (1)
sudo setcap cap_net_raw+ep $(which nmap) and re-running the same command will give you a different result:
./nmap --privileged -sP ioioio.cz Starting Nmap 7.60 ( https://nmap.org ) at 2019-01-10 16:29 EST Nmap scan report for ioioio.cz (188.8.131.52) Host is up (0.028s latency). ...
setcap command explicitly adds the CAP_NET_RAW capability as an attribute of the
nmap binary on your disk so anyone that executes that binary will be able to forge its own packets including ICMP and UDP.
Hopefully you understood that last paragraph and are going back to undo what you just did. :) To do that it’s
sudo setcap -r $(which nmap).
Back to containers: Docker’s default containers are still running with CAP_NET_RAW enabled by default which means in theory the container can forge their own packets, send ICMP, maybe even do something like ARP poisoning in the right situation. For those that are trying to harden their containers, one of the most basic steps is to drop unnecessary capabilities, so why can’t we (almost) always drop CAP_NET_RAW?
Docker and other engines support a
--sysctl argument per container allowing you to set the
net.ipv4.ping_group_range value which lets you drop the CAP_NET_RAW requirement to send a ping. On a normal Linux box it would look like this:
sysctl -w net.ipv4.ping_group_range=0 65535
It means that any user between root (UID 0) and UID 65535 will be able to use the ping command. (NOTE: I’ve seen a much higher number than 65535 as well depending on your max UID.)
Lets do this in a container.
> docker run --rm -it willfarrell/ping ping 517d215a8fe2 every 300 sec PING 517d215a8fe2 (172.17.0.3): 56 data bytes 64 bytes from 172.17.0.3: seq=0 ttl=64 time=0.038 ms --- 517d215a8fe2 ping statistics --- 1 packets transmitted, 1 packets received, 0% packet loss round-trip min/avg/max = 0.038/0.038/0.038 ms
Ping works because the container has CAP_NET_RAW
Same image but dropping cap_net_raw:
> docker run --rm -it \ --cap-drop=net_raw \ willfarrell/ping ping 15a94d4af148 every 300 sec PING 15a94d4af148 (172.17.0.3): 56 data bytes ping: permission denied (are you root?)
Ping doesn’t work because I dropped CAP_NET_RAW.
Same image, still dropping net_raw, but using a custom sysctl argument:
> docker run --rm -it \ --cap-drop=net_raw \ --sysctl "net.ipv4.ping_group_range=0 1000000000" \ willfarrell/ping ping 0bedc92d1c86 every 300 sec PING 0bedc92d1c86 (172.17.0.3): 56 data bytes 64 bytes from 172.17.0.3: seq=0 ttl=42 time=0.036 ms --- 0bedc92d1c86 ping statistics --- 1 packets transmitted, 1 packets received, 0% packet loss round-trip min/avg/max = 0.036/0.036/0.036 ms
Ping works again even though I drop CAP_NET_RAW.
The point is, until someone tells me otherwise, I don’t think we need CAP_NET_RAW in most scenarios.
So, OK, can we just get rid of CAP_NET_RAW now? Probably not. There are probably some edge scenarios that Docker wants to officially support so they’ll keep it in and the goal of Docker is to be usable first and allow you to configure it in a secure way second. But that’s not going to stop you from dropping it from all of your containers. There are already some other container engines out there that leverage this method by default.