Splunk uLimits and You

Most folks are familiar with the concept of file descriptors in Unix/Linux. They get mentioned in the Splunk docs for system requirements, under the section “Considerations regarding file descriptor limits (FDs) on *nix systems”, and again in the troubleshooting docs.

I run a very high volume indexer cluster on a daily basis, complete with Splunk Enterprise Security. One thing I have seen is that if your timestamps are off, you can get a VERY LARGE number of buckets for a low overall raw data size. If you see nearly 10000 buckets for only several hundred GB of data, then you have that problem. Keep in mind that is a lot of file descriptors potentially in use. You should check your incoming logs, and you will likely find some nasty multiline log file with a line breaking issue, where some large integer is getting parsed as an epoch time and causing buckets with timestamps way back in time.

It got me thinking about the number of open files, though, especially with all the buckets that data model accelerations have to build to support the Enterprise Security application. Maybe FD limits had been interfering with my data model acceleration bucket builds.

Then we had a couple of indexers spontaneously crash their splunkd processes, with an error indicating file descriptor limit problems.

I discussed it with my main Splunk partner in crime, Duane Waddle. He explained that if a process starts on its own without a user session, Linux might not honor the ulimits from limits.conf. So even though we had done the right things accounting for ulimits, Transparent Huge Pages, etc., we were still likely getting hosed.

For example, you might have a section in /etc/security/limits.conf that sets the limits for a high volume indexer in a cluster.
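A section of that kind might look something like the sketch below. This is not the original example from our servers: the account name `splunk` and the numeric values are assumptions for illustration, so substitute the user Splunk runs as and the limits you actually want.

```
# /etc/security/limits.conf (illustrative values, not a recommendation)
# <domain>  <type>  <item>   <value>
splunk      soft    nofile   65536
splunk      hard    nofile   65536
splunk      soft    nproc    16384
splunk      hard    nproc    16384
```

Each line follows the standard limits.conf layout of domain, type, item and value. `nofile` is the open file descriptor limit and `nproc` is the process limit.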

You might still be getting the 4096 default if Splunk is kicking off via the enable boot-start option.

You can test this by logging into your server and looking at the limits of the running splunkd process.
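One way to do that check is sketched here. It assumes a Linux host with `pgrep` available and that you can read the process's entry under /proc, which may require root. The function name `show_process_limits` is made up for this example.

```shell
# Print every limit the kernel is enforcing for a given PID.
show_process_limits() {
    cat "/proc/$1/limits"
}

# Oldest process named exactly "splunkd" (normally the parent daemon).
splunkd_pid=$(pgrep -o -x splunkd || true)

if [ -n "$splunkd_pid" ]; then
    show_process_limits "$splunkd_pid"
else
    echo "splunkd is not running" >&2
fi
```

The output lists one limit per row with its soft and hard values.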

Check the results looking for the Max open files.

Duane suggested editing the Splunk init file. My coworker Matt Uebel ran with that and came up with some quick commands to make that edit. Substitute your desired limits values when you run them.
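The sketch below shows one way to make that kind of edit. It is not Matt's original commands. It assumes GNU sed, a SysV-style init script such as the /etc/init.d/splunk file that enable boot-start creates, and a hypothetical helper name `add_ulimits_to_init`. The 65536 in the usage note is only a placeholder.

```shell
# Insert ulimit lines right after the shebang of an init script.
# $1 = path to the init script, $2 = desired open files limit
add_ulimits_to_init() {
    init_file=$1
    nofile=$2

    # Do nothing if the lines are already there, so re-running is harmless.
    if grep -q '^ulimit -[HS]n ' "$init_file"; then
        return 0
    fi

    # Raise the hard limit first, then the soft limit (GNU sed one-line "a").
    sed -i "1a ulimit -Hn ${nofile}" "$init_file"
    sed -i "2a ulimit -Sn ${nofile}" "$init_file"
}
```

Run as root, a call would look like `add_ulimits_to_init /etc/init.d/splunk 65536`. Take a copy of the init file first so you can roll back if needed.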

Now when your system fully reboots and Splunk starts via enable boot-start without a user session, you should still get the desired ulimits values.