Splunk Importance of Indexes

I see a lot of folks new to Splunk having to rework their deployments as they mature because they did not tackle indexes early on. Indexes are how you control access to data and its retention period.

Consider a "traditional" starter Splunk deployment by a security group. You get the IT group to install the universal forwarder and send you logs. Up front they aren't interested in more than making you go away so they can work the next support ticket. Later, they find out how much access to their own logs in Splunk can help operations succeed. By then everything is mixed together: your IDS, mail logs and web logs. Maybe a lot of it is data they don't need to see.

Splunk will put data into the index named "main" by default. Everyone with a login to Splunk can see this index. Once data is in an index, there is no simple move command to shift it into a new one.

It gets to be a bigger mess when you start installing apps. Some, like the *nix app, put everything into an index called "os".

Naming Convention

You should set up different indexes as early as possible in a new deployment. Above all, use a naming convention. Sticking with the default retention period is OK. It's six years, so you have time to shrink it later.

I follow this naming convention.
* os_windows_groupname
* os_linux_groupname
* os_windows_groupname_secondgroupname

  1. I use underscores in index names.
  2. This type of index is for OS related logs so it starts with os_.
  3. The first and often only groupname is the IT or organizational group that owns the systems and provides the logs.
  4. Optionally, what if you have a system where developers and IT admins need to share log access? That is where I add _secondgroupname to the index name and send events for just those systems to this index.
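As a rough sketch of how events for a given system end up in one of these indexes, the index can be set per input in inputs.conf on the forwarder. The group names (webteam, devteam), the monitored path and the sourcetype below are made up for illustration, and the index has to exist on the indexer already.

```ini
# inputs.conf on the universal forwarder (hypothetical example)
# Send this host's auth log to the shared Linux index for two groups
# instead of letting it fall into "main".
[monitor:///var/log/secure]
index = os_linux_webteam_devteam
sourcetype = linux_secure
```

Any input stanza without an index setting still lands in the default index, so it pays to set it on every input.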

Why do I follow this convention?

As mentioned, indexes in Splunk are the mechanism for access control and data retention. Access is granted per index through user roles, and the retention period is set per index as well.
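Here is a minimal sketch of both controls, assuming a hypothetical group called webteam. The index and role names are invented for the example, and the one year retention is just an illustration of shrinking the default.

```ini
# indexes.conf on the indexer (hypothetical index name)
[os_linux_webteam]
homePath   = $SPLUNK_DB/os_linux_webteam/db
coldPath   = $SPLUNK_DB/os_linux_webteam/colddb
thawedPath = $SPLUNK_DB/os_linux_webteam/thaweddb
# Retention: events older than this are frozen (deleted unless archived).
# 31536000 seconds = 365 days; the default is 188697600 (about six years).
frozenTimePeriodInSecs = 31536000

# authorize.conf on the search head (hypothetical role name)
[role_webteam]
# Indexes this role may search, semicolon separated
srchIndexesAllowed = os_linux_webteam;os_windows_webteam
# Indexes searched when a search does not name an index
srchIndexesDefault = os_linux_webteam;os_windows_webteam
```

Keep in mind that a role also picks up the index access of any roles it inherits from, so check those before assuming a group is limited to its own indexes.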

Searching with wildcards. Using this scheme you can set up a dashboard that leverages searches like

index=os_linux_* sudo

If you save that search or build it into a dashboard, then a group with access to the dashboard sees only its own logs that match. The next group sees only theirs with the same dashboard. As security staff with permissions to all the indexes, you get to see ALL events. This also works well for eventtyping. Since eventtypes are defined by searches, you can ensure an eventtype for certain Windows events runs only across the Windows indexes, but across ALL of them via the wildcard.
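For example, an eventtype built on the wildcard search could look something like this. The stanza name linux_sudo is just one I picked for the sketch.

```ini
# eventtypes.conf (hypothetical eventtype name)
# Matches sudo events in every os_linux_* index the searching user can access.
[linux_sudo]
search = index=os_linux_* sudo
```

A search for eventtype=linux_sudo then returns whatever slice of those indexes the user's role allows.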

The downside shows up when you are not using the default index and you are new to Splunk. There is a tendency to install some given Splunk app and expect it to just show data. Often these apps are coded to search just the default indexes or their own. You will have to dig into their code and find where to replace the app searches and so on with your wildcard naming scheme to get it wired up. It is still worth the effort and saves you a lot of pain as your deployment matures.

For more about indexing be sure to read through the Splunk manual on Managing Indexes and Clusters.

  • automine

    Hey George,

    Great post, and some very good points. Much of this also carries over to sourcetype naming conventions (I tend to use “vendor:product:log_type”). Two comments:

    - Another reason (other than retention and access requirements) to have another index is for performance. This is usually true if you have a sourcetype that dwarfs others, in terms of volume. For example, web proxy data should usually be in its own index, as it tends to be high volume. While the proxy data being in its own index won’t speed up searches for that sourcetype, by limiting the amount of data in other indexes, you should speed up the searches for the other sourcetypes that may have been crowded.

    - The point about having problems with apps that are looking for a specific index name is a good one. Many of the new apps are getting away from this, and now have a search macro that is used as a placeholder for where the data for that app is located. Usually the macro is as simple as “index=proxy”. That said, not all apps are doing that, and it does require some configuration.

    Thanks again for the great post!