Splunk Getting Extreme Part Six

Welcome to part six of my series on Splunk Extreme Search. I am dedicating this to my best buddy of 18 years, Rylie. He is my Miniature Pinscher whom I need to let rest come December 29th. He has been awesome all these years and had me thinking about Time. So let’s talk Time and Extreme Search.

We saw in part five that we do not have to always bucket data and stats across time. Still, it is the most common thing to do in Extreme Search. How you handle time is important.

Saving Time

There are two main things you can do to make your searches time efficient.

  1. Use well defined searches. The more precise you can make the up front restrictions like action=success src_ip= the better the job the indexers can do. This also applies when using tstats by placing these up front restrictions in the where statement.
  2. Use accelerated data and tstats. Use the Common Information Model data models where you can. Accelerate them over the time ranges you need. Remember, you can also make custom data models and accelerate those even if your data does not map well to the common ones.

Accelerated Data Models

Seriously. I hot linked the title of this section. Go read it! And remember to hug a Splunk Docs team member at conf. They do an amazing job putting all that in there for us to read.

You choose how much data to accelerate by time. Splunk takes the DMs that are to be accelerated and launches special hidden searches behind the scenes. These acceleration jobs consume memory and CPU core resources like all the other searches. If you do not have enough resources you may start running into warning about hitting the maximum number of searches and your accelerations may be skipped. Think about that. If you are running ES Notables that use summariesonly=true you will miss matching data. This is because the correlation search runs over a time range and it finds no matching accelerated data. Woo! It is quiet and no notables are popping up. Maybe that isn’t so great… uh oh…

A second way you can have data model acceleration disruption is by having low memory on your indexers. This one is easier to spot. If you check the Data Model audit in Enterprise Security and see in the last error message column references to oomkiller you have less ram than you need. When that dispatched acceleration job gets killed, Splunk has to toss the acceleration job and dispatch it again on the next run. The data models will never get caught up if the jobs keep getting disrupted.

Acceleration getting behind can happen another way. Index Clustering. Acceleration builds tsidx files with the primary searchable bucket on an indexer. Index clustering exists to replicates data buckets to reduce the chance of data loss or availability if you lose one or more indexers. Prior to Splunk 6.4 there was no replication of the accelerated buckets. Just the data buckets. That was bad news when you had an indexer go down or had a rolling restart across your cluster. It would take time for the accelerations to roll back through, find that the primary bucket is now assigned as primary on a different indexer than where it was earlier. You guessed it. It has to rebuild the acceleration bucket on the indexer that now had the primary flag for that bucket. This is why if you check Data Model Audit in Enterprise Security you will see percentages complete drop most times after restarts of the indexing layer. You can turn on accelerated bucket replication in 6.4, at the cost of storage of course. Are you on version before 6.4 and using Index Clustering with Enterprise Security? You better plan that upgrade.

How far back in time your accelerations are relative to percentages complete is different between environments. Imagine the network traffic data model is behind at 96%. It sounds pretty good, but in large volume environments it could means the latest events in acceleration are 6 hours ago. What does that mean if your threat matching correlation searches only range over the past two hours and use summariesonly? It means no notables firing and you think things are quiet and great. The same thing applies to XS Context Gens. If you use summariesonly and are building averages and other statistics, those numbers are thrown off from what they should be.

If your data is pretty constant, like in high volume environments this is a down and dirty search to gauge latest event time compared to now.

| tstats summariesonly=true min(_time) as earliestTime, max(_time) as latestTime from datamodel=Authentication | eval lagHours=(now()-latestTime)/3600 | convert ctime(*Time)

The message is be a good Splunk Admin. Check your data model accelerations daily in your operations review process. Make sure you are adding enough indexers for your data load so DM accelerations can build quickly and stay caught up. You can increase the acceleration.max_concurrent for a given datamodel if you have the CPU resources on both Search Heads and Indexers. Use accelerated bucket replication if you can afford it.

One way you can spot acceleration jobs using search is something like the following. You may have to mess with the splunk_server field to match your search head pattern if you are on search clustering.

| rest splunk_server=local /servicesNS/-/-/search/jobs | regex label="ACCELERATE" | fields label, dispatchState ,id, latestTime, runDuration

There is another option to help accelerations stay caught up for your critical searches. The gui doesn’t show it but there is a setting called acceleration.backfill_time from datamodels.conf. You can say accelerate the Web data model for 30 days of data, but only backfill 7 days. This means if data is not accelerated, such as by an index cluster rolling restart, Splunk will only go back 7 days to catch up accelerations. That can address short run correlation searches for ES. It still creates gaps when using summariesonly for context generation searches that trend over the full 30 days. That brings you back to acceleration replication as the solution.

Oh, one other little item about data models. A data model acceleration is tied to a search head guid. If you are using search head clustering, it will use a guid for the whole cluster. ONLY searches with the matching GUID can access the accelerated data. No sharing accelerations across search heads not in the same cluster. This is why most of us cringe when Splunk customers ask about running multiple Enterprise Security instances against the same indexers. It requires data model acceleration builds for each GUID. You can imagine how resource hungry that can be at all levels.

Context Gens and Correlation Searches

Search Scheduling

A context is about defining ways to ask if something is normal, high, extreme etc. Most of our context gens run across buckets of time for time ranges like 24 hours, 30 days and so on.

Scenario. Lets say we have a context gen that is anomalous login successes by source, app and user. This should let me catch use of credentials from sources not normally seen or at a volume off of normal. If I refresh that context hourly but also run my detection search that uses xswhere hourly; I run the risk of a race condition. I could normalize in the new bad or unexpected source into the context BEFORE I detect it. I would probably schedule my context gen nightly so during the day before it refreshes I get every chance to have ES Notables trigger before the data is normalized into our context. So be sure to compare how often you refresh your context compared to when you use the context.

Time Windows

Check your context generation time range lines up with how far back you accelerate the models. It is easy to say run over 90 days then find out you only accelerated 30 days.

Check the run duration of your searches. Validate your search is not taking longer to run than the scheduled interval of the context gen or correlation search. That always leads to tears. Your search will run on its schedule. Take longer to run and get scheduled for its next run. It will actually start to “time slide” as the next run time gets farther and farther behind compared to the real time the search job finished. I saw this happen with a threat gen search for a threat intel product app once. It was painful. Job/Activity inspector is your friend on checking run durations. Also check the scheduled search panel now and then.

Look back at the previous posts. We make contexts over time buckets and we make sure to run a search that leverages it over the same bucket width of time. Do trending over a day? Make sure you run your matching correlation search over a day’s worth of time to get numbers on the same scale. Same goes for by hour. Normally you would not make a context by day and search by hour. The scales are different. Mixing scales can lead to odd results.

Embracing the odd:

One thing you should get from this series. It is all about The Question. Imagine we trend event count, or data volume per day for a source. Would it ever make sense to use that context over only an hour’s worth of data? Sure. You would get the real low end of the terms like minimal, low, maybe medium. If you saw hits matching “is extreme” you know that you have a bad situation going on. After all, you are seeing a days worth of whatever in only an hour window. Sometimes you break the “rules” because that is the question you want to ask.

I probably would not do that with the Anomaly Driven contexts. After all, you want anomalous deviation off normal.