Troubleshooting the “Out of socket memory” error

If the following error message occasionally gets written to the /var/log/messages file:

[root@host1 ~]# tail -f /var/log/messages
Jan 22 15:05:39 ztm-n08 kernel: [12624150.315458] Out of socket memory

It usually means one of two things:

  1. The server is running out of TCP memory
  2. There are too many orphaned sockets on the system

To see how much memory the kernel is configured to dedicate to TCP, run:

[root@host1 ~]# cat /proc/sys/net/ipv4/tcp_mem
3480768 4641024 6961536


tcp_mem is a vector of three integers: min, pressure and max.

  • min: below this number of pages, TCP is not concerned about its memory consumption.
  • pressure: when the amount of memory allocated to TCP by the kernel exceeds this number of pages, the kernel starts to moderate memory consumption. It exits this mode when consumption falls back under min.
  • max: the maximum number of pages allowed for queuing by all TCP sockets. When the system goes above this threshold, the kernel starts logging the “Out of socket memory” error.
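Since tcp_mem is measured in pages rather than bytes, it can be handy to convert the ‘max’ value. A small sketch, assuming the common 4 KB page size (verify with `getconf PAGESIZE` on your system) and the example values above:

```shell
# Convert the tcp_mem 'max' value from pages to bytes.
# Assumes a 4096-byte page size; check with: getconf PAGESIZE
max_pages=6961536                      # third field of /proc/sys/net/ipv4/tcp_mem above
page_size=4096
echo $(( max_pages * page_size ))      # prints 28514451456 (~26.5 GiB)
```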

Now let’s compare the ‘max’ number with how much of that memory TCP actually uses:

[root@host1 ~]# cat /proc/net/sockstat
sockets: used 48476
TCP: inuse 174950 orphan 126800 tw 153787 alloc 174954 mem 102910
UDP: inuse 34 mem 3
UDPLITE: inuse 0
RAW: inuse 1
FRAG: inuse 3 memory 4968

The last value on line 3 (mem 102910) is the number of pages currently allocated to TCP. In this example it is far below the maximum number of pages the kernel is willing to give to TCP – the ‘max’ value described above – so we can rule out TCP memory exhaustion as the cause of the error.
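This comparison is easy to script. The helper below is a hypothetical sketch (the function name is my own); it takes the ‘mem’ field from /proc/net/sockstat and the tcp_mem ‘max’ value, and reports how much of the allowance is in use:

```shell
# Percentage of the tcp_mem 'max' page allowance currently used by TCP.
tcp_mem_pct() {
  # $1 = pages allocated to TCP ('mem' field of the TCP line in /proc/net/sockstat)
  # $2 = third ('max') field of /proc/sys/net/ipv4/tcp_mem
  echo $(( $1 * 100 / $2 ))
}

# With the example values from above:
tcp_mem_pct 102910 6961536             # prints 1 (about 1% of the maximum)
```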
To check whether the server has too many orphan sockets, run the following:

[root@host1 ~]# cat /proc/sys/net/ipv4/tcp_max_orphans

An orphan socket is a socket that is no longer associated with a file descriptor: the application has called close() and no file descriptor references it any more, but the socket still exists in memory until TCP is done with it. The tcp_max_orphans file shows the maximum number of such orphaned TCP sockets – those not attached to any user file handle – that the kernel will hold. If this number is exceeded, orphaned connections are reset immediately and a warning is printed. This limit exists only to prevent simple DoS attacks. Each orphan socket eats up to 64 KB of unswappable memory.
Now that we know what the limit of orphaned sockets on a system can be, let’s see the current number of orphaned sockets:

[root@host1 ~]# cat /proc/net/sockstat
sockets: used 48476
TCP: inuse 174950 orphan 126800 tw 153787 alloc 174954 mem 102910
UDP: inuse 34 mem 3
UDPLITE: inuse 0
RAW: inuse 1
FRAG: inuse 3 memory 4968

In this case the ‘orphan 126800’ on line 3 is the field we are interested in. If this number is bigger than the one from tcp_max_orphans, it can be the reason for the “Out of socket memory” error. Fixing this is a matter of increasing the limit in tcp_max_orphans:

[root@host1 ~]# echo 400000 > /proc/sys/net/ipv4/tcp_max_orphans

One thing worth mentioning is that in certain cases the kernel may penalize some sockets more, counting an orphan as 2x or 4x to artificially increase the “score” of a misbehaving socket.

To account for that, take the number of orphaned sockets during peak server utilization and multiply it by 4 to be safe. That should be the value you set in tcp_max_orphans.
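As a sketch, using the peak orphan count from the example output above:

```shell
# Derive a safe tcp_max_orphans from the orphan count observed at peak load.
peak_orphans=126800                    # 'orphan' field from /proc/net/sockstat at peak
safe_limit=$(( peak_orphans * 4 ))     # 4x headroom for the kernel's 2x/4x penalty
echo "$safe_limit"                     # prints 507200
# Then apply it (as root):
#   echo "$safe_limit" > /proc/sys/net/ipv4/tcp_max_orphans
```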

In some cases, if there are many short-lived TCP connections on the system, the number of sockets in states such as TIME_WAIT will be quite large. To fix this situation you might need to decrease the TIME_WAIT timeout (2MSL) and experiment with the tcp_tw_reuse / tcp_tw_recycle kernel tunables, as described in the Linux TCP Tuning section below.

Linux TCP Tuning

The aim of this post is to point out kernel tunables that might improve network performance in certain scenarios. As with any tuning, make sure you test before and after each adjustment so that you have a measurable, quantitative result. For the most part the kernel is smart enough to detect and adjust certain TCP options after boot, or even dynamically, e.g. the sliding window size.

With that in mind, here’s a quick overview of the steps taken during data transmission and reception:

1. The application writes data to a socket, which is placed in the transmit buffer.
2. The kernel encapsulates the data into a PDU – protocol data unit.
3. The PDU is then moved onto the per-device transmit queue.
4. The NIC driver then pops the PDU from the transmit queue and copies it to the NIC.
5. The NIC sends the data and raises a hardware interrupt.
6. On the other end of the communication channel the NIC receives the frame, copies it into the receive buffer and raises a hardware interrupt.
7. The kernel in turn handles the interrupt and raises a soft interrupt to process the packet.
8. Finally the kernel handles the soft interrupt and moves the packet up the TCP/IP stack for decapsulation, placing it in a receive buffer for a process to read from.

To make persistent changes to the kernel settings described below, add the entries to the /etc/sysctl.conf file and then run “sysctl -p” to apply.

As on most operating systems, the default maximum TCP buffer sizes on Linux are far too small for high-bandwidth paths. I suggest changing them to the following settings.

To increase the maximum TCP buffer sizes settable via setsockopt():

net.core.rmem_max = 33554432
net.core.wmem_max = 33554432

A good starting point is the BDP (bandwidth-delay product) based on a measured delay: multiply the bandwidth of the link by the average round-trip time to some host.
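For example, here is the BDP arithmetic for an assumed 1 Gbit/s link with a 50 ms average round-trip time (substitute your own measured figures):

```shell
# Bandwidth-delay product: bandwidth (bits/s) / 8 * RTT (s) = bytes in flight.
bandwidth_bps=1000000000               # 1 Gbit/s (assumed example figure)
rtt_ms=50                              # measured average round-trip time
bdp_bytes=$(( bandwidth_bps / 8 * rtt_ms / 1000 ))
echo "$bdp_bytes"                      # prints 6250000 (~6 MB buffer to fill the pipe)
```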

To increase the Linux autotuning TCP buffer limits (min, default, and max number of bytes to use), set the max to 16 MB for 1GE paths, and 32 MB or 64 MB for 10GE paths:

net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
You should also verify that the following are all set to the default value of 1:

sysctl net.ipv4.tcp_window_scaling
sysctl net.ipv4.tcp_timestamps
sysctl net.ipv4.tcp_sack

Note: you should leave tcp_mem alone; the defaults are fine.
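A quick loop to check all three at once (a convenience sketch; it just reads the corresponding /proc entries):

```shell
# Print the current value of each TCP option; all three should be 1.
for opt in tcp_window_scaling tcp_timestamps tcp_sack; do
  val=$(cat /proc/sys/net/ipv4/$opt 2>/dev/null || echo "?")
  echo "$opt = $val"
done
```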
Another thing you can do to help increase TCP throughput with 1 Gb NICs is to increase the size of the interface transmit queue. For paths with more than 50 ms RTT, a value of 5000-10000 is recommended. To increase txqueuelen, do the following:

[root@server1 ~]# ifconfig eth0 txqueuelen 5000

This can yield bandwidth increases of up to 10x on some long, fast paths. It is only a good idea for Gigabit Ethernet connected hosts, and it may have side effects such as uneven sharing between multiple streams.

Other kernel settings that help with the overall server performance when it comes to network traffic are the following:

TCP_FIN_TIMEOUT – This setting determines the time that must elapse before TCP/IP can release a closed connection and reuse its resources. During this TIME_WAIT state, reopening the connection to the client costs less than establishing a new connection. By reducing the value of this entry, TCP/IP can release closed connections faster, making more resources available for new connections. Adjust this in the presence of many connections sitting in the TIME_WAIT state:

[root@server:~]# echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout

TCP_KEEPALIVE_INTERVAL – This determines the wait time between individual keepalive probes. To set:

[root@server:~]# echo 30 > /proc/sys/net/ipv4/tcp_keepalive_intvl

TCP_KEEPALIVE_PROBES – This determines the number of probes before timing out. To set:

[root@server:~]# echo 5 > /proc/sys/net/ipv4/tcp_keepalive_probes
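With these two settings, plus the idle time before probing starts (tcp_keepalive_time, commonly 7200 seconds by default – an assumed figure here), the worst-case time to detect a dead peer works out as follows:

```shell
# Approximate dead-peer detection time = idle time + interval * probes.
idle=7200                              # tcp_keepalive_time (common default, assumed)
intvl=30                               # tcp_keepalive_intvl as set above
probes=5                               # tcp_keepalive_probes as set above
total=$(( idle + intvl * probes ))
echo "$total"                          # prints 7350 (seconds)
```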

TCP_TW_RECYCLE – This enables fast recycling of TIME_WAIT sockets. The default value is 0 (disabled). Use it with great caution: it is known to break connections from clients behind NAT devices and load balancers, and the option was removed from the kernel entirely in Linux 4.12.

[root@server:~]# echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle

TCP_TW_REUSE – This allows reusing sockets in the TIME_WAIT state for new connections when it is safe from the protocol viewpoint. The default value is 0 (disabled). It is generally a safer alternative to tcp_tw_recycle.

[root@server:~]# echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse

Note: the tcp_tw_reuse setting is particularly useful in environments where numerous short connections are opened and left in the TIME_WAIT state, such as web servers and load balancers. Reusing the sockets can be very effective in reducing server load.

Starting in Linux 2.6.7 (and back-ported to 2.4.27), Linux includes alternative congestion control algorithms besides the traditional ‘reno’ algorithm. These are designed to recover quickly from packet loss on high-speed WANs.
There are a couple of additional sysctl settings for kernels 2.6 and newer.

To avoid caching ssthresh from a previous connection:

net.ipv4.tcp_no_metrics_save = 1

To increase the input queue length for 10G NICs:

net.core.netdev_max_backlog = 30000
Starting with version 2.6.13, Linux supports pluggable congestion control algorithms. The congestion control algorithm used is set via the sysctl variable net.ipv4.tcp_congestion_control, which defaults to bic, cubic, or reno, depending on which version of the 2.6 kernel you are using.
To get a list of congestion control algorithms that are available in your kernel (if you are running 2.6.20 or higher), run:
[root@server1 ~] # sysctl net.ipv4.tcp_available_congestion_control
The choice of congestion control options is selected when you build the kernel. The following are some of the options available in the 2.6.23 kernel:

* reno: traditional TCP used by almost all other OSes (default)
* cubic: CUBIC-TCP (note: there is a CUBIC bug in the Linux 2.6.18 kernel used by Red Hat Enterprise Linux 5.3 and Scientific Linux 5.3; use a later kernel if possible!)
* bic: BIC-TCP
* htcp: Hamilton TCP
* vegas: TCP Vegas
* westwood: optimized for lossy networks
If cubic and/or htcp are not listed when you do ‘sysctl net.ipv4.tcp_available_congestion_control’, try the following, as most distributions include them as loadable kernel modules:
[root@server1 ~] # /sbin/modprobe tcp_htcp
[root@server1 ~] # /sbin/modprobe tcp_cubic
For long fast paths, I highly recommend using cubic or htcp. Cubic is the default on a number of Linux distributions, but if it is not the default on your system, you can do the following:
[root@server1 ~] # sysctl -w net.ipv4.tcp_congestion_control=cubic
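To confirm the change took effect, read the setting back (a simple sanity check):

```shell
# Read back the active congestion control algorithm.
cc=$(cat /proc/sys/net/ipv4/tcp_congestion_control 2>/dev/null || echo "unknown")
echo "current congestion control: $cc"
```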
On systems supporting RPMs, you can also try the ktune RPM, which sets many of these parameters as well.
If you have a loaded server with many connections in the TIME_WAIT state, decrease the TIME_WAIT interval that determines the time that must elapse before TCP/IP can release a closed connection and reuse its resources. This interval between closure and release is known as the TIME_WAIT state, or twice the maximum segment lifetime (2MSL). During this time, reopening the connection costs the client and server less than establishing a new connection. By reducing the value of this entry, TCP/IP can release closed connections faster, providing more resources for new connections. Adjust this parameter if the running application requires rapid release or creation of new connections, or suffers low throughput due to many connections sitting in the TIME_WAIT state:

[root@host1 ~]# echo 5 > /proc/sys/net/ipv4/tcp_fin_timeout

If you are often dealing with SYN floods, the following tuning can be helpful:

[root@host1 ~]# sysctl -w net.ipv4.tcp_max_syn_backlog=16384
[root@host1 ~]# sysctl -w net.ipv4.tcp_synack_retries=1
[root@host1 ~]# sysctl -w net.ipv4.tcp_max_orphans=400000

The parameter on line 1 is the maximum number of remembered connection requests that have not yet received an acknowledgment from the connecting client.
The parameter on line 2 determines the number of SYN+ACK packets sent before the kernel gives up on the connection. To open its side of the connection, the kernel sends a SYN with a piggybacked ACK to acknowledge the earlier received SYN; this is part two of the three-way handshake.
And lastly, line 3 sets the maximum number of TCP sockets not attached to any user file handle held by the system. If this number is exceeded, orphaned connections are reset immediately and a warning is printed. This limit exists only to prevent simple DoS attacks: you must not rely on it or lower the limit artificially, but rather increase it (probably after increasing installed memory) if network conditions require more than the default, and tune your network services to linger and kill such states more aggressively.
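To make these three settings survive a reboot, the equivalent /etc/sysctl.conf entries (applied with “sysctl -p”, as described earlier) would be:

```
net.ipv4.tcp_max_syn_backlog = 16384
net.ipv4.tcp_synack_retries = 1
net.ipv4.tcp_max_orphans = 400000
```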

More information on tuning parameters and defaults for Linux 2.6 are available in the file ip-sysctl.txt, which is part of the 2.6 source distribution.

Warning on large MTUs: if you have configured your Linux host to use 9K MTUs but the connection is using 1500-byte packets, then you actually need 9/1.5 = 6 times more buffer space in order to fill the pipe. In fact some device drivers only allocate memory in power-of-two sizes, so you may even need 16/1.5 = 11 times more buffer space!
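The arithmetic from the warning, spelled out (the 16K figure assumes a driver that rounds the 9K allocation up to the next power of two):

```shell
# 9K MTU host talking over a 1500-byte path: how much extra buffer is needed?
echo $(( 9000 / 1500 ))                          # prints 6  (6x more buffer space)
# With power-of-two driver allocations, 9K rounds up to 16K:
awk 'BEGIN { printf "%.0f\n", 16 / 1.5 }'        # prints 11 (roughly 11x more)
```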
And finally a warning for both 2.4 and 2.6: for very large BDP paths where the TCP window is > 20 MB, you are likely to hit the Linux SACK implementation problem. If Linux has too many packets in flight when it gets a SACK event, it takes too long to locate the SACKed packet, you get a TCP timeout, and CWND goes back to 1 packet. Restricting the TCP buffer size to about 12 MB seems to avoid this problem, but it clearly limits your total throughput. Another solution is to disable SACK.
Starting with Linux 2.4, Linux implemented a sender-side autotuning mechanism, so that setting the optimal buffer size on the sender is not needed. This assumes you have set large buffers on the receive side, as the sending buffer will not grow beyond the size of the receive buffer.
However, Linux 2.4 has some other strange behavior to be aware of. For example, the value of ssthresh for a given path is cached in the routing table. This means that if a connection has had a retransmission and reduced its window, all connections to that host for the next 10 minutes will use a reduced window size without even trying to increase it. The only way to disable this behavior is to do the following before making new connections (you must be root):
[root@server1 ~] # sysctl -w net.ipv4.route.flush=1

Lastly I would like to point out how important it is to have a sufficient number of available file descriptors, since pretty much everything on Linux is a file.
To check your current max and availability run the following:

[root@host1 ~]# sysctl fs.file-nr
fs.file-nr = 197600 0 3624009

The first value (197600) is the number of allocated file handles.
The second value (0) is the number of unused but allocated file handles. And the third value (3624009) is the system-wide maximum number of file handles. It can be increased by tuning the following kernel parameter:

[root@host1 ~]# echo 10000000 > /proc/sys/fs/file-max

To see how many file descriptors are being used by a process you can use one of the following:

[root@host1 ~]# lsof -a -p 28290
[root@host1 ~]# ls -l /proc/28290/fd | wc -l

The 28290 number is the process id.
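A small helper along the same lines (the function name is my own invention); it counts the entries in /proc/<pid>/fd for any PID you pass it:

```shell
# Count open file descriptors for a given PID.
count_fds() { ls /proc/"$1"/fd 2>/dev/null | wc -l; }

count_fds $$                           # descriptor count for the current shell
```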
