Splunk Getting Extreme Part Eight

Extreme Search includes some other commands as well. One of them, xsGetDistance, implements the haversine formula for calculating physical distance. We can couple that with the Splunk iplocation command to find user login attempts spread across distances too great for realistic travel in the time between them.

Context Gen:

Class: default

First we will create a default context with a maximum speed of 500mph. Note how we do not specify the class argument.

| xsCreateUDContext name=speed container=travel app=search scope=app terms="normal,fast,improbable,ludicrous" type=domain min=0 max=500 count=4 uom=mph

Class: all

Second we will create a context for the class all with the same maximum speed of 500mph. We could use a different maximum if we wanted here.

| xsCreateUDContext name=speed container=travel app=search scope=app terms="normal,fast,improbable,ludicrous" type=domain min=0 max=500 count=4 uom=mph class=all

Class: foot

Last we will create a context for the class foot with a maximum speed of 27.8mph. This is approximately the maximum foot speed of a human. This could be useful if measuring speed across a place like a college campus.

| xsCreateUDContext name=speed container=travel app=search scope=app terms="normal,fast,improbable,ludicrous" type=domain min=0 max=27.8 count=4 uom=mph class=foot

Search:

We will pretend my ssh authentication failures are actually successes. This is just because it is the data I have easily available.

Class: all

tag=authentication action=failure user=* src_ip=* app=sshd | iplocation prefix=src_ src_ip | sort + _time | streamstats current=t window=2 earliest(src_lat) as prev_lat, earliest(src_lon) as prev_lon, earliest(_time) as prev_time, earliest(src_City) as prev_city, earliest(src_Country) as prev_country, earliest(src_Region) as prev_region, earliest(src) as prev_src by user | eval timeDiff=(_time - prev_time) | xsGetDistance from prev_lat prev_lon to src_lat src_lon | eval speed=round((distance/(timeDiff/3600)),2) | table user, src, prev_src, src_Country, src_Region, src_City, prev_country, prev_region, prev_city, speed | eval travel_method="all" | xswhere speed from speed by travel_method in travel is above improbable | convert ctime(prev_time)

Class: foot

tag=authentication action=failure user=* src_ip=* app=sshd | iplocation prefix=src_ src_ip | sort + _time | streamstats current=t window=2 earliest(src_lat) as prev_lat, earliest(src_lon) as prev_lon, earliest(_time) as prev_time, earliest(src_City) as prev_city, earliest(src_Country) as prev_country, earliest(src_Region) as prev_region, earliest(src) as prev_src by user | eval timeDiff=(_time - prev_time) | xsGetDistance from prev_lat prev_lon to src_lat src_lon | eval speed=round((distance/(timeDiff/3600)),2) | table user, src, prev_src, src_Country, src_Region, src_City, prev_country, prev_region, prev_city, speed | eval travel_method="foot" | xswhere speed from speed by travel_method in travel is above improbable | convert ctime(prev_time)

Summary:

We combined a User Driven context with another XS command to provide ourselves an interesting tool. We also saw how we could use different classes within that UD context to answer the question on a different scale. Try adding another class like automobile with a max=100 to find speeds that are beyond safe local travel speeds.
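
For example, a sketch of that automobile class, following the same xsCreateUDContext pattern used above (max=100 is just the suggested value):

| xsCreateUDContext name=speed container=travel app=search scope=app terms="normal,fast,improbable,ludicrous" type=domain min=0 max=100 count=4 uom=mph class=automobile

The detection search would then set travel_method="automobile" with eval before the xswhere, just like the foot example.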

This would be really fun when checking webmail logs to find compromised user accounts, especially if you combine it with Levenshtein matching on look alike domains sent to users to build the list of accounts to check.


Splunk Getting Extreme Part Seven

Welcome to part seven where we will try a User Driven context for Extreme Search.

Our use case is to find domain names in the from email address that are look alikes of our own domain. We need Levenshtein distance to do this. There is a Splunk app for it on Splunkbase. The app does have some issues that need to be fixed. I also recommend renaming the returned fields to levenshtein_distance and levenshtein_ratio.

Test Data:

I took the new Top 1 Million Sites list from Cisco Umbrella as a source of random domain names. Then I matched it with usernames from a random name list. I needed some test data to pretend I had good email server logs, since I do not have those kinds of logs at home. The below data is MOCK data. Any resemblance to real email addresses is accidental.

source="testdata.txt" sourcetype="demo"

Context Gen:

This time we do not want to make a context based on data. We need to create a mapping of terms to values that we define regardless of the data. Technically we could just use traditional SPL to filter based on Levenshtein distance values, but what fun would that be for this series? We also want to demonstrate a User Driven context. Levenshtein distance is the number of single character edits needed to turn one string into the other, aka the distance. A distance of zero means the strings match. I arbitrarily picked a max value of 15. Pretty much anything 10 or more characters different is so far out we could never care about it. I then picked terms I wanted to call the distance ranges. The closer to zero, the more likely it is a look alike domain. “Uhoh” is generally going to be a distance of 0-2, and we go up from there. You could play with the max to get different value ranges mapped to the terms. It depends on your needs.
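
As a sketch, the context gen for this User Driven context might look like the following. The container name, context name, and the fourth term are my assumptions; the post only names uhoh, maybe, and interesting:

| xsCreateUDContext name=lookalike container=email app=search scope=app terms="uhoh,maybe,interesting,ignore" type=domain min=0 max=15 count=4 uom=characters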

We can use the Extreme Search Visualization app to examine our context curves and values.

Exploring the Data:

We can try a typical stats count and wildcard search to see what domains might resemble our own domain of “georgestarcher.com”.
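
A sketch of that kind of search, where the from_domain field name and the exact wildcards are my assumptions about the mock data:

source="testdata.txt" sourcetype="demo" | stats count by from_domain | search from_domain="*starcher*" OR from_domain="*george*"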

It gets close, but it also matches domains clearly not even close to our good one. Here is the list from my test data generation script.

georgeDomain = ['georgestarcher.com','ge0rgestarcher.com', 'g5orgestarhcer.net', 'georgestarcher.au', 'georgeestarcher.com']

We can see we didn’t find the domain starting with g5. Trying to define a regex to find odd combinations of our domain would be very difficult. So we will start testing our Levenshtein context.

Let’s try xsGetWhereCIX and sort on the distance.

Next let’s try using xsFindBestConcept to see what terms match the domains we are interested in compared to their distances.
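
As a sketch, assuming the Levenshtein app command has already added a levenshtein_distance field (the command name and syntax vary with the app, so that step is elided with … here), and using the from_domain field and context names from the earlier sketches, the two exploration searches might look like:

source="testdata.txt" sourcetype="demo" | … | xsgetwherecix levenshtein_distance from lookalike in email is uhoh | sort + levenshtein_distance

source="testdata.txt" sourcetype="demo" | … | xsFindBestConcept levenshtein_distance from lookalike in email | table from_domain, levenshtein_distance, BestConcept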

Using our Context:

We have an idea what we need to try based on our exploring the data. Still we will try a few different terms with xswhere to see what we get.

Using: “is interesting”

We can see we miss the closest matches this way and get more matches that clearly are not look alikes to our domain.

Using: “is near interesting”

Adding the hedge term “near”, we extend the match for interesting just a little into the adjacent concept terms. We find all our look alike domains, even the closest ones. The problem is we also extended up into the higher distances too.

Using: “is near uhoh”

Again, we use near to extend up from uhoh, but we find it does not reach far enough to catch the domain “g5orgestarhcer.net”.

Using: “is very below maybe”

This time we have some fun with the hedge terms and say very to pull in the edges and below to go downward from the maybe concept. This gives us exactly the domains we are trying to find. You may have noticed we dropped results where the distance was zero in our searches. That is because a distance of zero is our own legitimate domain name, which we don’t care about.
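
A sketch of that final filter, again assuming the levenshtein_distance field is already present and using the context names from my earlier sketch:

source="testdata.txt" sourcetype="demo" | … | where levenshtein_distance>0 | xswhere levenshtein_distance from lookalike in email is very below maybe | stats count by from_domain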

Last Comments:

Levenshtein can be really hard to use on shorter domain names. It becomes too easy for completely legitimate other domain names to fall within a small distance of your own. If you try to use this for making notables you might want to incorporate a lookup table to drop known good domains that are not look alikes. Here is the same search that worked well for my domain but for google.com. You can see it matches way too much stuff, though it does still capture interesting near domain names.

Example: google.com


Splunk app for ES and Alexa Top Sites

Alexa recently decided to restrict the downloads of the top one million sites list. Splunk Enterprise Security has this as one of the initial and default intel sources. Honestly the docs for ES do not make it clear how ES uses it. But maybe you just want to be sure it works. Or maybe you do something like apply the list as a filter on DNS data.

The awesome Cisco Umbrella team has made a replacement list. It is the same format as the Alexa file, so you can quickly swap it out in ES.

  • Disable the existing Alexa threat download entry.

  • Clone it and make a new one for cisco_top_one_million_sites.

  • Make sure per the screen shot above that you leave the “type” as “alexa”. That is tied to hard coded logic in the ES application. We are just fooling ES into using data from an identically formatted list.

  • Save it and you are done.


Splunk Getting Extreme Part Six

Welcome to part six of my series on Splunk Extreme Search. I am dedicating this to my best buddy of 18 years, Rylie. He is my Miniature Pinscher whom I need to let rest come December 29th. He has been awesome all these years and had me thinking about Time. So let’s talk Time and Extreme Search.

We saw in part five that we do not have to always bucket data and stats across time. Still, it is the most common thing to do in Extreme Search. How you handle time is important.

Saving Time

There are two main things you can do to make your searches time efficient.

  1. Use well defined searches. The more precise you can make the up front restrictions, like action=success src_ip=10.0.0.0/8, the better job the indexers can do. This also applies when using tstats by placing these up front restrictions in the where statement (see the sketch after this list).

  2. Use accelerated data and tstats. Use the Common Information Model data models where you can. Accelerate them over the time ranges you need. Remember, you can also make custom data models and accelerate those even if your data does not map well to the common ones.
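
As a minimal sketch of those two points together, a tstats context gen base search against the Authentication data model might start like this; the node name and src restriction are just illustrative:

| tstats summariesonly=true count from datamodel=Authentication where nodename=Authentication.Successful_Authentication Authentication.src="10.*" by _time, Authentication.src span=1h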

Accelerated Data Models

Seriously. I hot linked the title of this section. Go read it! And remember to hug a Splunk Docs team member at conf. They do an amazing job putting all that in there for us to read.

You choose how much data to accelerate by time. Splunk takes the DMs that are to be accelerated and launches special hidden searches behind the scenes. These acceleration jobs consume memory and CPU core resources like all the other searches. If you do not have enough resources you may start running into warnings about hitting the maximum number of searches, and your accelerations may be skipped. Think about that. If you are running ES Notables that use summariesonly=true you will miss matching data, because the correlation search runs over a time range and finds no matching accelerated data. Woo! It is quiet and no notables are popping up. Maybe that isn’t so great… uh oh…

A second way you can have data model acceleration disruption is by having low memory on your indexers. This one is easier to spot. If you check the Data Model audit in Enterprise Security and see references to oomkiller in the last error message column, you have less RAM than you need. When that dispatched acceleration job gets killed, Splunk has to toss the acceleration job and dispatch it again on the next run. The data models will never get caught up if the jobs keep getting disrupted.

Acceleration getting behind can happen another way: Index Clustering. Acceleration builds tsidx files alongside the primary searchable bucket on an indexer. Index clustering exists to replicate data buckets to reduce the chance of data loss or loss of availability if you lose one or more indexers. Prior to Splunk 6.4 there was no replication of the accelerated buckets, just the data buckets. That was bad news when you had an indexer go down or had a rolling restart across your cluster. It takes time for the accelerations to work back through and find that a bucket is now assigned as primary on a different indexer than where it was earlier. You guessed it: the acceleration bucket has to be rebuilt on the indexer that now holds the primary flag for that bucket. This is why, if you check Data Model Audit in Enterprise Security, you will see percentages complete drop most times after restarts of the indexing layer. You can turn on accelerated bucket replication in 6.4, at the cost of storage of course. Are you on a version before 6.4 and using Index Clustering with Enterprise Security? You better plan that upgrade.

How far back in time your accelerations are, relative to percentages complete, is different between environments. Imagine the network traffic data model is behind at 96%. It sounds pretty good, but in large volume environments it could mean the latest events in acceleration are from 6 hours ago. What does that mean if your threat matching correlation searches only range over the past two hours and use summariesonly? It means no notables firing and you think things are quiet and great. The same thing applies to XS Context Gens. If you use summariesonly and are building averages and other statistics, those numbers are thrown off from what they should be.

If your data is pretty constant, as in high volume environments, this is a down and dirty search to gauge the latest accelerated event time compared to now.

| tstats summariesonly=true min(_time) as earliestTime, max(_time) as latestTime from datamodel=Authentication | eval lagHours=(now()-latestTime)/3600 | convert ctime(*Time)

The message is: be a good Splunk admin. Check your data model accelerations daily in your operations review process. Make sure you are adding enough indexers for your data load so DM accelerations can build quickly and stay caught up. You can increase acceleration.max_concurrent for a given data model if you have the CPU resources on both Search Heads and Indexers. Use accelerated bucket replication if you can afford it.

One way you can spot acceleration jobs using search is something like the following. You may have to mess with the splunk_server field to match your search head pattern if you are on search clustering.

| rest splunk_server=local /servicesNS/-/-/search/jobs | regex label="ACCELERATE" | fields label, dispatchState, id, latestTime, runDuration

There is another option to help accelerations stay caught up for your critical searches. The GUI doesn’t show it, but there is a setting called acceleration.backfill_time in datamodels.conf. You can say accelerate the Web data model for 30 days of data, but only backfill 7 days. This means if data is not accelerated, such as after an index cluster rolling restart, Splunk will only go back 7 days to catch up accelerations. That can address short run correlation searches for ES. It still creates gaps when using summariesonly for context generation searches that trend over the full 30 days. That brings you back to acceleration replication as the solution.
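
As a sketch, the datamodels.conf side of that Web example might look like this (the stanza is the data model name; max_concurrent is optional and only worth raising if you have the resources):

[Web]
acceleration = 1
acceleration.earliest_time = -30d
acceleration.backfill_time = -7d
acceleration.max_concurrent = 2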

Oh, one other little item about data models. A data model acceleration is tied to a search head guid. If you are using search head clustering, it will use a guid for the whole cluster. ONLY searches with the matching GUID can access the accelerated data. No sharing accelerations across search heads not in the same cluster. This is why most of us cringe when Splunk customers ask about running multiple Enterprise Security instances against the same indexers. It requires data model acceleration builds for each GUID. You can imagine how resource hungry that can be at all levels.

Context Gens and Correlation Searches

Search Scheduling

A context is about defining ways to ask if something is normal, high, extreme etc. Most of our context gens run across buckets of time for time ranges like 24 hours, 30 days and so on.

Scenario: let’s say we have a context gen for anomalous login successes by source, app and user. This should let me catch use of credentials from sources not normally seen, or at a volume off of normal. If I refresh that context hourly but also run my detection search that uses xswhere hourly, I run the risk of a race condition. I could normalize the new bad or unexpected source into the context BEFORE I detect it. I would probably schedule my context gen nightly, so during the day before it refreshes I get every chance to have ES Notables trigger before the data is normalized into our context. So be sure to compare how often you refresh your context to when you use the context.
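
A sketch of that scheduling split in savedsearches.conf, with the context gen nightly and the detection hourly (the search names here are hypothetical):

[XS Context Gen - Auth Success by Src App User]
enableSched = 1
cron_schedule = 30 2 * * *

[Threat - Anomalous Auth Success Source - Rule]
enableSched = 1
cron_schedule = 15 * * * *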

Time Windows

Check that your context generation time range lines up with how far back you accelerate the models. It is easy to say run over 90 days, then find out you only accelerated 30 days.

Check the run duration of your searches. Validate your search is not taking longer to run than the scheduled interval of the context gen or correlation search. That always leads to tears. Your search will run on its schedule, take longer than the interval, and then get scheduled for its next run. It will actually start to “time slide” as the next run time gets farther and farther behind the real time the search job finished. I saw this happen with a threat gen search for a threat intel product app once. It was painful. The Job/Activity inspector is your friend for checking run durations. Also check the scheduled search panel now and then.
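
One way to check run durations in bulk is to search the scheduler logs; a quick sketch:

index=_internal sourcetype=scheduler run_time=* | stats avg(run_time) as avgRunSeconds, max(run_time) as maxRunSeconds by savedsearch_name | sort - maxRunSeconds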

Look back at the previous posts. We make contexts over time buckets and we make sure to run a search that leverages it over the same bucket width of time. Do trending over a day? Make sure you run your matching correlation search over a day’s worth of time to get numbers on the same scale. Same goes for by hour. Normally you would not make a context by day and search by hour. The scales are different. Mixing scales can lead to odd results.

Embracing the odd:

One thing you should get from this series: it is all about The Question. Imagine we trend event count, or data volume per day for a source. Would it ever make sense to use that context over only an hour’s worth of data? Sure. You would get the real low end of the terms like minimal, low, maybe medium. If you saw hits matching “is extreme” you know that you have a bad situation going on. After all, you are seeing a day’s worth of whatever in only an hour window. Sometimes you break the “rules” because that is the question you want to ask.

I probably would not do that with the Anomaly Driven contexts. After all, you want anomalous deviation off normal.


Splunk Getting Extreme Part Five

Part five brings another use case. We will use values from the raw data, not a calculated value, to make our context and then match against the raw events without bucketing them.

First, we have glossed over an important question. What does data match when you use xsWhere and there is no matching class in the context? It uses the “default” class. Default is the weighted average of all the existing classes within the context. If you look within the csv for the container you will find lines for your context where the class is an empty string “”. That is default. Default is also what is made for the class when no class is specified.
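
You can peek at those rows with inputlookup; here is a sketch against the container used later in this post (the exact column names can vary by XS version, so treat the class filter as an assumption):

| inputlookup web_stats.context.csv | where class=""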

You get a message like the following when a class value is not in the context you are trying to use.

xsWhere-I-111: There is no context 'urllen_by_src' with class 'Not Found' from container '120.43.17.24' in scope 'none', using default context urllen_by_src

Use Case: Finding long urls of interest

Just the longer URLs:

Let’s try just creating a context of all our url_length data from the Web Data Model. This version of the search will not break this up by class. We will just see if we can find “extreme” length urls in our logs based on the log data itself.

Context Gen:

| tstats dc(Web.url_length) as count, avg(Web.url_length) as average, min(Web.url_length) as min, max(Web.url_length) as max from datamodel=Web where Web.src!="unknown" | rename Web.* as * | xsCreateDDContext name=urllen container=web_stats app=search scope=app type=domain terms="minimal,low,medium,high,extreme" notes="urllen" uom="length"

The table that is displayed when the xsCreateDDContext finishes is interesting. Below we sort for extreme and see the urllen value is 678. This tells us that in my data the url_length high end is around 678 characters. If we search the logs using this context we find that our results are not a magic all bad “is extreme” situation. All the interesting URLs are down in the low/medium ranges with all the good urls. You have to come up with another way to slice data when the signal and noise are so close to each other. This approach might work for some other use case, but not for this particular data set.

Searches:

index=weblogs | xswhere url_length from urllen in web_stats is low | stats count by url

We get an overwhelming number of matches since most of our URLs are in the low range.

index=weblogs | xswhere url_length from urllen in web_stats is extreme | stats count by url

We get a manageable 5 events from extreme but they are not interesting URLs.

URL Length by Src:

We get a different url_length distribution if we break it out by src. Remember, default is the weighted average of all the classes in the context if you use classes. The table we see when our context gen finishes is that default.

Context Gen:

Notice in our by src version our urllen for extreme in the default context is around 133. That is going to come from the weighted average of the per source classes.

| tstats avg(Web.url_length) as average, min(Web.url_length) as min, max(Web.url_length) as max from datamodel=Web where Web.src!="unknown" by _time, Web.src span=1m | rename Web.* as * | stats count, min(min) as min, max(max) as max, avg(average) as average by src | eval max=if(min=max, max+average, max) | eval max=if(max-min < average , max+average, max) | xsCreateDDContext name=urllen_by_src container=web_stats app=search scope=app type=domain terms="minimal,low,medium,high,extreme" notes="url len by src" uom="length" class="src"

Search:

index=weblogs | xsWhere url_length from urllen_by_src in web_stats by src is very very extreme | stats values(src) AS sources, dc(src) as sourceCount by url, status_description

Even using is very very extreme we get a lot of results. However the urls are much more interesting. Granted, none of the searches here in Part Five are super awesome. They do show a workable example of using an XS context directly against raw events. We also get a good comparison of a classless context, which does what it is supposed to, versus one with a class that helps draw out more interesting events. Formulating your XS context and your search questions is very important, so you really have to think about what question you are trying to answer and experiment with variations against your own data.

In my data I find interesting URLs trying to redirect through my site but they land on a useless wordpress page.


Splunk Getting Extreme Part Four

Let’s revisit our EPS Splunk Metrics. This time we are going to use type=domain and do something else a little different. We are going to make a non classed context and apply it directly to the raw event data.

The Question:

The question we want to ask is: what systems are generating metrics events well above low, AND what concept term do they fall in?

We also want to get the original raw events, precise in time. That is technically a different question than we asked in part one of this blog series. There we made more of a canary that asked when a given host went over normal for its activity levels, with no relation to the whole environment, in a particular bucket of time.

Context Gen:

We want to make a context that is not set up for a class. Note we don’t even use a time bucketing step. The search is just set to run across the previous 30 days, which is typically the retention period of Splunk’s index=_internal logs.

The reason we are doing it this way is that we want to find events that are something like high, extreme, etc. for our entire environment. We don’t care about trending per source system (series). We get count as the distinct count of source systems (series), then the min and max EPS values across all sources.

index=_internal source=*metrics.log earliest=-30d latest=now group=per_host_thruput | stats dc(series) as count, min(eps) as min, max(eps) as max | xsCreateDDContext name=eps container=splunk_metrics app=search scope=app type=domain terms="minimal,low,medium,high,extreme" notes="events per second" uom="eps"

Search:

First we see if we have any extreme events in the past 30 days.

index=_internal source=*metrics.log group=per_host_thruput | xswhere eps from eps in splunk_metrics is extreme

I get one event, the largest catch up of web log imports.

11-11-2016 11:11:54.003 -0600 INFO Metrics - group=per_host_thruput, series="www.georgestarcher.com", kbps=2641.248883, eps=7144.764212, kb=81644.455078, ev=220854, avg_age=1172745.126151, max_age=2283054

Next let’s get fancier. We want to find events very above low and have XS tell us what concept term those events best fit. This is a handy way to get the word for the term each event fits.

index=_internal source=*metrics.log group=per_host_thruput | xswhere eps from eps in splunk_metrics is very above low | xsFindBestConcept eps from eps in splunk_metrics | table _time, series, eps, kbps, BestConcept

Summary

The point is that you can use XS to build a context profile for raw data values and then apply it back to the raw events. Raw events, if you can keep the number of matches low, make great ES notable events because they retain the most of the original data. Using stats and tstats boils down the fields. That requires us to pass through values, as we saw in Part Three, to make the results more robust.


Splunk Getting Extreme Part Three

We covered an example of an Anomalous Driven (AD) context in part one and how to use tstats in part two. Now we will cover a traditional Domain type context example using Authentication data and tstats.

In XS commands DD means Data Driven context. Here we will cover a use case using xsCreateDDContext of the type Domain. Using type=domain means we are going to need a count, max, and min. The terms we will use are minimal, low, medium, high, and extreme. This will let us find certain levels of activity without worrying about what “normal” is vs “anomalous” as we saw in part one.

Extreme Search Commands:

xsCreate

The Create method tells extreme search to create the container and populate or update all the classes if the container already exists. You have to use this if the container does not already exist.

xsUpdate

This functions exactly like xsCreate except that it will NOT work if the container does not exist. It will return an error and stop.

xsDeleteContext

This will delete a SPECIFIC class, or “all” if no class is specified, from a context in a container. There is no XS command to actually remove the contents from the container. Deleting against a context/container without a class leaves all the class data, but searching against the context will act as if it does not exist. The deletion without a class removes the default class lines. From there XS commands act as if the context is gone, even though most of the class data remains. This means the file exists with most of its file size intact. There is not even an XS command to remove an entire container. We can still cheat from within Splunk. Normally, you should NEVER touch the context files via the outputlookup command, as it will often corrupt the file contents. If we want to empty a container file we can just overwrite the csv file with empty contents. The CSV file name will be in the format: containername.context.csv

If we had made a context with:
| xsupdateddcontext name=mytest container=mytestContainer app=search scope=app class=src terms="minimal,low,medium,high,extreme"

We can nuke the contents of the file using the search:
| makeresults | outputlookup mytestContainer.context.csv

We can now populate that container with either xsCreate or xsUpdate. xsUpdate will work since the container file exists. This trick can be handy to reset a container and cull out accumulated data because the file has grown very large over time with use or if you accidentally fed too much data into it.

Let’s talk about that for a minute. What is too large? XS has to read the entire CSV into memory when it uses it. That has the obvious implications. A data set of 10 rows with the normal 5 domain terms of "minimal,low,medium,high,extreme" gives us 56 lines in the csv: 10 data items plus a default data item = 11, times 5 terms = 55, plus a header row = 56. Generally, if you are going to have 10K data items going into a context I would make one container for it and not share that container with any other contexts. That way you are not reading a lot of data into memory that you are not using with your XS commands, such as xswhere filtering.

One other thing to consider. The data size of this file is important in the Splunk data bundle replication. It is a csv file in the lookups folder and gets distributed with all the other data. If you made a context so large the CSV was 1.5GB in size you could negatively impact your search bundle replication and be in for the fun that brings.

xsFindBestConcept

This command comes from the Extreme Search Visualization app. It lets you run data against your context and have it tell you what concept terms best match each result. This command has to work pretty hard, so if your data going in is large it may take a few minutes to come back.

| tstats summariesonly=true dc(Authentication.user) as userCount from datamodel=Authentication where (nodename=Authentication.Failed_Authentication sourcetype=linux_secure) by _time, Authentication.src, Authentication.app span=1d | rename Authentication.* AS * | xsFindBestConcept userCount FROM users_by_src_1d IN auth_failures BY "src,app"

xsGetWhereCIX

This command acts like xswhere but does not actually filter results. It just displays ALL results that went in and what their CIX compatibility value is for the statement you used.

| tstats summariesonly=true count as failures, dc(Authentication.user) AS userCount from datamodel=Authentication where nodename=Authentication.Failed_Authentication by _time Authentication.src, Authentication.app span=1d | eval avgFailures=failures/userCount | rename Authentication.* AS * | xsgetwherecix avgFailures from failures_by_src_1d by "src,app" in auth_failures is extreme

Min and Max:

XS for type=domain needs a count and min/max values with depth, meaning min and max can never be equal. The fun part is that HOW you get a min and max is up to you. You will see examples that just use the min() and max() functions. Other examples will get min() and make max the median()*someValue. You often have to experiment to find what fits your data and gives you an acceptable result. We touched on this value spreading in part one of Getting Extreme.

Here are a couple of different patterns, though you can do it any way you like.

  1. stats min(count) as min, max(count) as max … | eval max=if(min=max,min+5,max) | eval max=if(max-min<5,min+5,max)

  2. stats min(count) as min, median(count) as median, avg(count) as average … | eval median=if(average-median<5,median+5,average) | eval max=median*2

If you don’t get min/max spread out you will see a message like the following when trying to generate your context.

xsCreateDDContext-W-121: For a domain context failures_by_src_1d with class 103.207.36.133:sshd, min must be less than max, skipping

Use Case: Authentication Abusive Source IPs

Question: We will define our question as, what are the source IPs that are abusing our system via authentication failures, by src and application type? We want to know by average failures per number of user accounts tried per day. We also want to know if there is simply an extreme number of user accounts failed regardless of the number of failures per day. Yeah, normally I would do this by hour or a shorter period. The test data I have is from a Raspberry Pi exposed to the Internet. The RPi is sending to Splunk using the UF for Raspberry Pi. That RPi is also running fail2ban, so it limits the number of failures a source can cause before it is banned for a while. This means we will work with a scale that typically maxes out at 6 tries.

Avg Failures/userCount by src by day

Here we divide the number of failures by the number of users. This gives us a ballpark number of failures per user account from a given source. We could put user into the class, but that would make our trend too specific by tying it to a distinct src, app, and user. We want more of a threshold of failures per user per source in a day.

Context Gen:

| tstats summariesonly=true count as failures, dc(Authentication.user) AS userCount from datamodel=Authentication where nodename=Authentication.Failed_Authentication by _time Authentication.src, Authentication.app span=1d | eval avgFailures=failures/userCount | stats count, avg(avgFailures) as average, min(avgFailures) as min, max(avgFailures) as max by Authentication.src, Authentication.app | rename Authentication.* AS * | eval max=if(min=max,min+5,max) | xsCreateDDContext name=failures_by_src_1d app=search container=auth_failures scope=app type=domain terms="minimal,low,medium,high,extreme" notes="login failures by src by day" uom="failures" class="src,app"

Search:

Here we use the context to filter our data and find the extreme sources.

| tstats summariesonly=true count as failures, dc(Authentication.user) AS userCount from datamodel=Authentication where nodename=Authentication.Failed_Authentication by _time Authentication.src, Authentication.app span=1d | eval avgFailures=failures/userCount | rename Authentication.* AS * | xswhere avgFailures from failures_by_src_1d by "src,app" in auth_failures is extreme | iplocation prefix=src_ src | rename src_City AS src_city, src_Country AS src_country, src_Region as src_region, src_lon AS src_long | lookup dnslookup clientip AS src OUTPUT clienthost AS src_dns

Distinct User Count by src by day

Here we are going to trend the distinct number of users tried per source without regard of the number of actual failures.

Context Gen:

| tstats summariesonly=true dc(Authentication.user) as userCount from datamodel=Authentication where (nodename=Authentication.Failed_Authentication sourcetype=linux_secure) by _time, Authentication.src, Authentication.app span=1d | stats min(userCount) as min, max(userCount) as max, count by Authentication.src, Authentication.app | rename Authentication.* as * | eval max=if(min=max,min+5,max) | xsCreateDDContext name=users_by_src_1d app=search container=auth_failures scope=app type=domain terms="minimal,low,medium,high,extreme" notes="user count failures by src by day" uom="users" class="src,app"

Search:

Here we use the context to filter our data and find the sources with user counts above medium.

| tstats summariesonly=true dc(Authentication.user) as userCount from datamodel=Authentication where (nodename=Authentication.Failed_Authentication sourcetype=linux_secure) by _time, Authentication.src, Authentication.app span=1d | rename Authentication.* AS * | xswhere userCount from users_by_src_1d in auth_failures by "src,app" is above medium

Merge to get the most abusive sources by app

We can actually merge both of these searches together. This lets us run one search over a given time period, reducing our Splunk resource usage and giving us results that match either or both of our conditions.

Combined Search:

This search buckets the time range it runs across into days and then compares to our contexts, which were generated with a day period as their target. Normally, for an ES notable search, you would not bucket time with the “by” and “span” portions, as you would only be running the search over something like the previous day each day.

| tstats summariesonly=true count AS failures, dc(Authentication.user) as userCount, values(Authentication.user) as targetedUsers, values(Authentication.tag) as tag, values(sourcetype) as orig_sourcetype, values(source) as source, values(host) as host from datamodel=Authentication where (nodename=Authentication.Failed_Authentication sourcetype=linux_secure) by _time, Authentication.src, Authentication.app span=1d | eval avgFailures=failures/userCount | rename Authentication.* AS * | xswhere avgFailures from failures_by_src_1d by "src,app" in auth_failures is extreme OR userCount from users_by_src_1d in auth_failures by "src,app" is above medium | iplocation prefix=src_ src | rename src_City AS src_city, src_Country AS src_country, src_Region as src_region, src_lon AS src_long | lookup dnslookup clientip AS src OUTPUT clienthost AS src_dns

The thing to note about the CIX value is that anything greater than 0.5 matched both our contexts to some degree. A 1.0 matched them both solidly. If the CIX is 0.5 or less it means it matched only one of the contexts to some degree. Notice I used “is extreme” on one test and “is above medium” on the other. You can adjust the statements to fit your use case and data.

Bonus Comments:

You will notice in the searches above I added some iplocation and dnslookup commands. I also used the values() function and extra eval functions to add to the field content of the results. This is something you want to do when making Enterprise Security notables. It helps give your security analysts data rich notables that they may be able to triage without ever drilling down into the original event data.


Splunk Getting Extreme Part Two

Part one gave us a walk through of a simple anomalous search. Now we need to go over foundational knowledge about search construction when building extreme search contexts.

Comparing Search Methods

Traditional Search

This is what we did in part one. We ran a normal SPL search across regular events then used a bucket by _time and stats combination to get our statistics trend over time. This is handy when your event data is not tied to an accelerated Data Model.

Context Gen Search Pattern:

search events action=failure | bucket _time span=1h | stats count by _time, src | stats min, max etc | XS Create/Update

Search Speed:

tag=authentication action=failure

“This search has completed and has returned 8,348 results by scanning 14,842 events in 7.181 seconds”

tstats Search

Splunk is great at the “dynamic schema”, aka search time extractions. This flexibility comes at the cost of speed when searching. An Accelerated Data Model is a method to get a step up in performance by building an indexed map of a limited set of fields based on that data. This is much faster to search, at the trade off of only being able to use fields that are mapped in the Data Model. Tstats means tsidx stats. It functions on the tsidx index files of the raw data, plus it runs the equivalent of “| datamodel X | stats Z” to catch data that is not accelerated already. This is a middle ground between searching only accelerated data and searching only non accelerated data.

Context Gen Search Pattern:

| tstats count from datamodel=…. by _time… span=1h | stats min, max etc | XS Create/Update

Search Speed:

| tstats count from datamodel=Authentication where nodename=Authentication.Failed_Authentication

“This search has completed and has returned 1 results by scanning 12,698 events in 1.331 seconds”

tstats summariesonly=true Search

Using summariesonly=true with tstats tells Splunk to search ONLY the data buckets whose Data Model acceleration build has completed. It skips the attempt to even check for non accelerated data to return. This does mean you can miss data that has not yet been accelerated, or data whose acceleration has to be rebuilt for some reason. This often happens in an index cluster after a rolling restart.

Ballpark, the accelerated data copy is going to consume extra storage of roughly 3.4x the size of the indexed data it covers. We are trading that storage for search speed. So keep that in mind when you decide how much data to accelerate.

Context Gen Search Pattern:

| tstats summariesonly=true count from datamodel=…. by _time… span=1h | stats min, max etc | XS Create/Update

Search Speed:

| tstats summariesonly=true count from datamodel=Authentication where nodename=Authentication.Failed_Authentication

“This search has completed and has returned 1 results by scanning 10,081 events in 0.394 seconds”

Summary:

We can see significant speed increases in the progression across how we constructed the searches.

  1. Traditional Search took 7.2 seconds

  2. tstats took 1.3 seconds

  3. tstats summariesonly=true took 0.4 seconds.

This tells us that when we want to generate stats trends for Extreme Search contexts over large data sets we should use tstats, and with summariesonly=true where we can. That often makes it trivial even in multi TB/day deployments to generate and update our XS search contexts quickly, even over months of data. That is handy when you are trying to “define normal” based on the existing data. All the above speeds are just using Splunk on my late 2012 MacBook Pro. Real indexers etc will perform even better. The point is to show you the gains between the base search methods when building your XS contexts.

The next posts in our series will focus on actual search use cases and the different XS context types.
