Splunk New Technology Add-ons: SyncKVStore and SendToHEC

I recently updated and replaced older repositories from my GitHub account that were hand made modular alerts to send search results to other Splunk instances. The first one sends the search results to a Splunk HTTP Event Collector receiver. The second one came from our Splunk 2016 .conf talk on KVStore. It was useful for sending search results (typically an inputlookup of a table) to a remote KVStore lookup table.

TA-Send_to_HEC

You can find the updated Send To HEC TA on Splunkbase: TA-Send_to_HEC or in my GitHub repository: TA-Send_to_HEC.

This is useful for taking search results and sending to another Splunk instance using HEC. If you chose JSON mode it will send the results as a JSON payload of all the fields after stripping any hidden fields. Hidden fields start with an underscore. RAW mode is a new option which takes the _raw field and sends ONLY that field to the remote HEC receiver.

TA-SyncKVStore

This has been completely redone. I have submitted it to Splunkbase, but for the moment you can get it from my GitHub repository: TA-SyncKVStore

Originally it only sent search results to a remote KVStore. Now it also has two modular inputs. The first pulls a remote KVStore collection (table) and puts it into a local KVStore collection. The second pulls the remote KVStore collection but indexes it locally in JSON format. It will strip the hidden fields before forming the JSON payload to index. You are responsible for making sure all the appropriate and matching KVStore collections exist.

If you look in the code you will notice an unusual hybrid of the Splunk SDK for Python to handle KVStore actions and my own python class for batch saving the data to the collection. I could not get the batch_save method from the SDK to work at all. My own class already existed and was threaded for performance from my old version of the modular input so I just used the SDK to clear data if you wanted a replace option and then my own code for saving the new or updated data.

I rebuilt both of these TAs using the awesome Splunk Add-on Builder. This makes it easy in the SyncKVStore TA to store the credentials in the internal Splunk encrypted storage. One comment to update on the previous post on credential storage. The Add-on Builder was recently updated and now gives much better multiple credential management with a “global account” pull down selector you can use in your inputs and alert actions.

Share

Splunk Stored Encrypted Credentials

I wrote about automating control of other systems from Splunk back in 2014. Things are very different now in what support Splunk provides for framework and SDKs. I have been looking to update some of the existing stuff in my git repo and using the Splunk Add-on builder. It handles a lot of the work for you when integrating to Splunk.

We now have modular alerts which is the evolution of the alert script stuff we were doing in 2014. Splunk also now has modular inputs, the old style custom search commands, and the new style custom search commands. In all cases, you could want to use credentials for a system that you do not want hard coded or left unencrypted in the code.

The Storage Passwords REST Endpoint

You will typically find two blog posts when you look into storing passwords in Splunk. Mine from 2014 and the one from Splunk in 2011 which I referenced in my details post with code. Both posts mention a critical point. The access permissions to make it work.

Knowledge objects in Splunk run as the user that owns them. I am talking the Splunk application user context. Not the OS system account you start Splunk under. If I run a search and save it as an alert then attach an alert action the code that executes in the alert action has Splunk user permissions as me. The owner of the search that triggered it at the time.

This is a critical point because you had to have a user capability known as ‘admin_all_objects’. Yes that is as godlike as it sounds. It normally is assigned to the admin user role. That has changed recently with Splunk 6.5.0. There is a new capability you can assign to a Splunk user role called ‘list_storage_passwords’. This lets your user account fetch from the storage passwords endpoint without being full admin over Splunk. It still suffers one downside. It is still an all or nothing access. If you have this permission you can pull ALL encrypted stored passwords. Still it is an improvement. Yes, it can be misused by Splunk users with the permission if they go figure out how to directly pull the entire storage. You have to decide whom your adversary is. The known Splunk user whom could pull it out, or an attacker or red team person whom finds credentials stored in scripts either directly on the system or in a code repository. I vote for using the storage as the better of the two choices.

Stored Credentials:

Where are they actually stored? On that point I am not going to bother with old versions of Splunk. You should be life cycle maintaining your deployment so I am going to just talk about 6.5.0+.

You need to have a username, the password, a realm and which app context you want to put it in. Realm? Yeah that is a fancy name for what is this credential for because you might actually have five different accounts named admin. How do you know which is the admin you want for a given use? Let’s say I have the username gstarcher on the service adafruit.io. I want to store that credential so I can send IOT data to my account there. I also have an account named gstarcher on another service and I want Splunk to be able to talk to both services using different alerts or inputs or whatever. So I use the realm to say adafruitio, gstarcher, password to define that credential. I might have the other be like ifttt, gstarcher, apikey. I can tell them apart because of the realm.

Wait, what about app context? If you have been around Splunk long you know that all configurations and knowledge objects exist within “applications” aka their app context. If you make a new credential via the API and do not tell the command what application you want it stored under then it will use the one your user defaults to. That is most often the Searching and Reporting app, aka search. That means if you look in $SPLUNK_HOME$/etc/apps/search/local/passwords.conf you will find the credentials you stored.

Example passwords.conf entry:

Do you notice it is encrypted? Yeah, it will be encrypted ONLY if you add the password using the API calls. If you do it by hand in the .conf text file then it will remain unencrypted. Even after a Splunk restart. This is odd behavior considering it uses splunk.secret to auto encrypt passwords in files like server.conf on a restart. So don’t do that.

How is it encrypted? It is encrypted using the splunk.secret private key for the Splunk install itself on that particular system. You can find that in $SPLUNK_HOME/etc/auth. That is why you tightly control whom has access to your Splunk system at the OS level. Audit it, make alerts on SSH into it etc. This file is needed as the software must have a way to know it’s own private key to decrypt things. Duane and I once wrote something in 30 minutes on a Saturday to decrypt passwords if you have the splunk.secret and conf files with encrypted passwords. So protect the private key.

Let me say this again. The app context ONLY matters in where the password lands for a passwords.conf perspective. The actual storage_passwords rest endpoint has no care in the world about app permissions for the user. It only checks if you have the capability list_storage_passwords. It will happily return every stored password to a get call. It will ONLY filter results if you set the app name when you make the API connection back to the Splunk REST interface. If you don’t specify the app as a filter it will return ALL credentials stored. Other than that, it is up to you to use username and realm to grab just the credential you need in your code. Don’t like that? Then please, please log a Splunk support ticket of type Enhancement Request against Core Splunk product asking for it to be updated to be more granular and respect app context permissions. Be sure to give a nice paragraph of your particular use case. That helps their developer stories.

Splunk Add-on Builder:

There are two ways the Splunk Add-on Builder handles “password” fields. First, if you place a password field in the Alert Actions Inputs panel for your alert, the Splunk GUI will obscure the password. The problem is that it is NOT encrypted. Let’s say you made this alert action. You attach your new alert action to a search. The password gets stored unencrypted in savedsearches.conf of the app where the search is saved.

The Add-on Builder provides an alternative solution that does encrypt credentials. You have to use the Add-on Setup Parameters panel and check the Add Account box. This lets you build a setup page you can enter credentials in for the TA. Those credentials will be stored in passwords.conf for the TA’s app context. There is one other issue. Currently the app builder internal libraries hard code realm to the be the app name. That is not great if you are making an Adaptive Response for Splunk Enterprise Security and want to reference credentials stored using the ES Credential Manager GUI. If you are making a TA that will never have multiple credentials that share the same username then this is still ok.

Patterns for Retrieval:

This is where everyone has the hardest time. Finding code examples on actually getting your credential back out. And it varies based on what you are making. So I am going to show an example for each type. Adapting it is up to you.

Splunklib Python SDK:

You will need to include the splunklib folder from the Splunk Python SDK in your App’s bin folder for the newer non InterSplunk style patterns. Yeah I know, why should you have to keep putting copies of the SDK in an app on a full install of Splunk that already should have it? Well there are reasons. I don’t get them all, but has to with design decisions and issues on paths, static vs dynamic linking concepts etc. All best left to the Splunk dev teams. Splunk admins hate the result of larger application bundles, but it is what it is.

Adding a Cred Script:

This is just a quick script that assumes it is in a folder and the splunklib is a folder level up which is why the sys.path.append is what it is for this example. This is handy if you are a business with a central password control system. You could use this as a template on how to reach into Splunk to keep credentials Splunk needs in sync with the centrally managed credential.

Modular Alert: Manual Style

The trick is always how do you get the session_key to work with. Traditional modular alerts send the information into the executes script via stdin. So here we grab stdin, parse it to JSON and pull off our session_key. Using that we can call a simple connect back to Splunk using the session_key and fetch the realm/username that are assumed to be setup in the modular alert configuration which is sent also in that payload of information.

Add-on Builder: Alert Action: Fetch realm other than app

Again it comes down to how do you obtain the session_key of the user that fires the knowledge object. The app builder has this great helper object and session_key is just a method hanging off it. We do not even have to grab stdin and parse it.

Add-on builder: Alert Action: App as realm

Just call their existing method you only specify the username because it is hardcoded to the app name for the realm.

Custom Search Command: Old InterSplunk Style

In an old style custom search command the easiest pattern to leverage the Intersplunk library to grab the sent “settings” which includes the sessionKey field. After we have that we are back to our normal Splunk SDK client pattern. You can see we are just returning all credential. You could use arguments on your custom search command to pass in the desired realm and username and borrow the credential for if pattern from the modular alert above.

Custom Search Command: New v2 Chunked Protocol Style

The new v2 chunked style of search command gives us an already authenticated session connection via the self object. Here we don’t even need to find and handle the session_key and just call the self. service.storage_passwords method to get all the credentials and leverage our usual SDK pattern to get the credential we want. The below pattern does not show it but you could pass realm and username in via arguments on your custom search command. You could then use the credential for if pattern from the modular alert example up above to grab just the desired credential.

Modular Input: Manual Style

I honestly recommend using the Add-on Builder these days. But if you want to use credentials with a manually built input Splunk has documentation here http://dev.splunk.com/view/SP-CAAAE9B#creds . Keep in mind you have to setup what username to send a session_key for by specifying the name in passAuth in the inputs.conf definition.

Modular Input: Add-on Builder

This works the same as our alert actions because of the helper object and the wrapping App Builder does for us. See Above on the other Add-on Builder examples. It is much easier to use and could be made to use the gui and named user creds.

Share

Splunk Getting Extreme Part Eight

Extreme Search has some other commands included with it. They have included the Haversine equation for calculating physical distance in the xsGetDistance command. We can couple that with the Splunk iplocation command to find user login attempts across distances too fast for realistic travel.

Context Gen:

Class: default

First we will create a default context with a maximum speed of 500mph. Note how we do not specify the class argument.

| xsCreateUDContext name=speed container=travel app=search scope=app terms="normal,fast,improbable,ludicrous” type=domain min=0 max=500 count=4 uom=mph

Class: all

Second we will create a context for the class all with the same maximum speed of 500mph. We could use a different maximum if we wanted here.

| xsCreateUDContext name=speed container=travel app=search scope=app terms="normal,fast,improbable,ludicrous” type=domain min=0 max=500 count=4 uom=mph class=all

Class: foot

Last we will create a context for the class foot with a maximum speed of 27.8mph. This is approximately the maximum foot speed of a human. This could be useful if measuring speed across a place like a college campus.

| xsCreateUDContext name=speed container=travel app=search scope=app terms="normal,fast,improbable,ludicrous” type=domain min=0 max=27.8 count=4 uom=mph class=foot

Search:

We will pretend my ssh authentication failures are actually successes. This is just because it is the data I have easily available.

Class: all

tag=authentication action=failure user=* src_ip=* user=* app=sshd | iplocation prefix=src_ src_ip | sort + _time | streamstats current=t window=2 earliest(src_lat) as prev_lat, earliest(src_lon) as prev_lon, earliest(_time) as prev_time, earliest(src_City) as prev_city, earliest(src_Country) as prev_country, earliest(src_Region) as prev_region, earliest(src) as prev_src, by user | eval timeDiff=(_time - prev_time) | xsGetDistance from prev_lat prev_lon to src_lat src_lon | eval speed=round((distance/(timeDiff/3600)),2) | table user, src, prev_src, src_Country, src_Region, src_City, prev_country, prev_region, prev_city, speed | eval travel_method="all" | xswhere speed from speed by travel_method in travel is above improbable | convert ctime(prev_time)

Class: foot

tag=authentication action=failure user=* src_ip=* user=* app=sshd | iplocation prefix=src_ src_ip | sort + _time | streamstats current=t window=2 earliest(src_lat) as prev_lat, earliest(src_lon) as prev_lon, earliest(_time) as prev_time, earliest(src_City) as prev_city, earliest(src_Country) as prev_country, earliest(src_Region) as prev_region, earliest(src) as prev_src, by user | eval timeDiff=(_time - prev_time) | xsGetDistance from prev_lat prev_lon to src_lat src_lon | eval speed=round((distance/(timeDiff/3600)),2) | table user, src, prev_src, src_Country, src_Region, src_City, prev_country, prev_region, prev_city, speed | eval travel_method="foot" | xswhere speed from speed by travel_method in travel is above improbable | convert ctime(prev_time)

Summary:

We combined a User Driven context with another XS command to provide ourselves an interesting tool. We also saw how we could use different classes within that UD context to answer the question on a different scale. Try adding another class like automobile with a max=100 to find speeds that are beyond safe local travel speeds.

This would be real fun when checking webmail logs to find compromised user accounts. Especially if you combine with Levenshtein for look alike domains sent to users to build the list of whom to check.

Share

Splunk Getting Extreme Part Seven

Welcome to part seven where we will try a User Driven context for Extreme Search.

Our use case is to find domain names in a from email address that are look alike domains to our own. We need to use Levenshtein to do this. There is a Splunk app for it on splunkbase. The app does have some issues and needs to be fixed. I also recommend editing the returned fields to be levenshtein_distance and levenshtein_ratio.

Test Data:

I took the new Top 1 Million Sites list from Cisco Umbrella as a source of random domain names. Then I matched it with usernames from a random name list. I needed some test data to pretend I had good email server logs. I do not have those kind of logs at home. The below data is MOCK data. Any resemblance to real email addresses is accidental.

source="testdata.txt" sourcetype="demo"

Context Gen:

This time we do not want to make a context based on data. We need to create a mapping of terms to values that we define regardless of the data. Technically we could just use traditional SPL to filter based on Levenshtein distance values. What fun would that be for this series? We also want to demonstrate a User Driven context. Levenshtein is the number of characters difference between two strings, aka the distance. A distance of zero means the strings match. I arbitrarily picked a max value of 15. Pretty much anything 10 or more characters different are so far out we could never care about them. I then picked terms I wanted to call the distance ranges. The closer to zero the more likely it is a look alike domain. “Uhoh” is generally going to be a distance of 0-2 then we go up from there. You could play with the max to get different value ranges mapped to the terms. It depends on your needs.

We can use the Extreme Search Visualization app to examine our context curves and values.

Exploring the Data:

We can try a typical stats count and wildcard search to see what domains might resemble ours of “georgestarcher.com”

It gets close but matches domains clearly not even close to our good one. Here is the list from my test data generation script.

georgeDomain = ['georgestarcher.com','ge0rgestarcher.com', 'g5orgestarhcer.net', 'georgestarcher.au', 'georgeestarcher.com']

We can see we didn’t find the domain staring with g5. Trying to define a regex to find odd combinations of our domain would be very difficult. So we will start testing our Levenshtein context.

Let’s try a getwhereCIX and sort on the distance.

Next let’s try using xsFindBestConcept to see what terms match the domains we are interested in compared to their distances.

Using our Context:

We have an idea what we need to try based on our exploring the data. Still we will try a few different terms with xswhere to see what we get.

Using: ”is interesting”

We can see we miss the closest matches this way and get more matches that clearly are not look alikes to our domain.

Using: “is near interesting”

Adding the hedge term “near” we can extend matching interesting into just a little into adjacent concept terms. We find all our terms even the closest ones. The problem is we also extended up into the higher distances too.

Using: “is near uhoh”

Again, we use near to extend up from uhoh but we find it is not far enough to find the domain “g5orgestarhcer.net”

Using: “is very below maybe”

This time we have some fun with the hedge terms and say very to pull in the edges and below to go downward from the maybe concept. This gives us the domains we are exactly trying to find. You may have noticed we dropped where the distance was zero in our searches. That is because we don’t care where it is from our own legitimate domain name.

Last Comments:

Levenshtein can be real hard to use on shorter domain names. It becomes too easy to match full legitimate other domain names compared to small distances of your own. If you try and use this in making notables you might want to incorporate a lookup table to drop known good domains that are not look alike domains. Here is the same search that worked well for my domain but for google.com. You can see it matches way too much stuff, though it does still capture interesting near domain names.

Example: google.com

Share

Splunk Getting Extreme Part Six

Welcome to part six of my series on Splunk Extreme Search. I am dedicating this to my best buddy of 18 years, Rylie. He is my Miniature Pinscher whom I need to let rest come December 29th. He has been awesome all these years and had me thinking about Time. So let’s talk Time and Extreme Search.

We saw in part five that we do not have to always bucket data and stats across time. Still, it is the most common thing to do in Extreme Search. How you handle time is important.

Saving Time

There are two main things you can do to make your searches time efficient.

  1. Use well defined searches. The more precise you can make the up front restrictions like action=success src_ip=10.0.0.0/8 the better the job the indexers can do. This also applies when using tstats by placing these up front restrictions in the where statement.

  2. Use accelerated data and tstats. Use the Common Information Model data models where you can. Accelerate them over the time ranges you need. Remember, you can also make custom data models and accelerate those even if your data does not map well to the common ones.

Accelerated Data Models

Seriously. I hot linked the title of this section. Go read it! And remember to hug a Splunk Docs team member at conf. They do an amazing job putting all that in there for us to read.

You choose how much data to accelerate by time. Splunk takes the DMs that are to be accelerated and launches special hidden searches behind the scenes. These acceleration jobs consume memory and CPU core resources like all the other searches. If you do not have enough resources you may start running into warning about hitting the maximum number of searches and your accelerations may be skipped. Think about that. If you are running ES Notables that use summariesonly=true you will miss matching data. This is because the correlation search runs over a time range and it finds no matching accelerated data. Woo! It is quiet and no notables are popping up. Maybe that isn’t so great… uh oh…

A second way you can have data model acceleration disruption is by having low memory on your indexers. This one is easier to spot. If you check the Data Model audit in Enterprise Security and see in the last error message column references to oomkiller you have less ram than you need. When that dispatched acceleration job gets killed, Splunk has to toss the acceleration job and dispatch it again on the next run. The data models will never get caught up if the jobs keep getting disrupted.

Acceleration getting behind can happen another way. Index Clustering. Acceleration builds tsidx files with the primary searchable bucket on an indexer. Index clustering exists to replicates data buckets to reduce the chance of data loss or availability if you lose one or more indexers. Prior to Splunk 6.4 there was no replication of the accelerated buckets. Just the data buckets. That was bad news when you had an indexer go down or had a rolling restart across your cluster. It would take time for the accelerations to roll back through, find that the primary bucket is now assigned as primary on a different indexer than where it was earlier. You guessed it. It has to rebuild the acceleration bucket on the indexer that now had the primary flag for that bucket. This is why if you check Data Model Audit in Enterprise Security you will see percentages complete drop most times after restarts of the indexing layer. You can turn on accelerated bucket replication in 6.4, at the cost of storage of course. Are you on version before 6.4 and using Index Clustering with Enterprise Security? You better plan that upgrade.

How far back in time your accelerations are relative to percentages complete is different between environments. Imagine the network traffic data model is behind at 96%. It sounds pretty good, but in large volume environments it could means the latest events in acceleration are 6 hours ago. What does that mean if your threat matching correlation searches only range over the past two hours and use summariesonly? It means no notables firing and you think things are quiet and great. The same thing applies to XS Context Gens. If you use summariesonly and are building averages and other statistics, those numbers are thrown off from what they should be.

If your data is pretty constant, like in high volume environments this is a down and dirty search to gauge latest event time compared to now.

| tstats summariesonly=true min(_time) as earliestTime, max(_time) as latestTime from datamodel=Authentication | eval lagHours=(now()-latestTime)/3600 | convert ctime(*Time)

The message is be a good Splunk Admin. Check your data model accelerations daily in your operations review process. Make sure you are adding enough indexers for your data load so DM accelerations can build quickly and stay caught up. You can increase the acceleration.max_concurrent for a given datamodel if you have the CPU resources on both Search Heads and Indexers. Use accelerated bucket replication if you can afford it.

One way you can spot acceleration jobs using search is something like the following. You may have to mess with the splunk_server field to match your search head pattern if you are on search clustering.

| rest splunk_server=local /servicesNS/-/-/search/jobs | regex label="ACCELERATE" | fields label, dispatchState ,id, latestTime, runDuration

There is another option to help accelerations stay caught up for your critical searches. The gui doesn’t show it but there is a setting called acceleration.backfill_time from datamodels.conf. You can say accelerate the Web data model for 30 days of data, but only backfill 7 days. This means if data is not accelerated, such as by an index cluster rolling restart, Splunk will only go back 7 days to catch up accelerations. That can address short run correlation searches for ES. It still creates gaps when using summariesonly for context generation searches that trend over the full 30 days. That brings you back to acceleration replication as the solution.

Oh, one other little item about data models. A data model acceleration is tied to a search head guid. If you are using search head clustering, it will use a guid for the whole cluster. ONLY searches with the matching GUID can access the accelerated data. No sharing accelerations across search heads not in the same cluster. This is why most of us cringe when Splunk customers ask about running multiple Enterprise Security instances against the same indexers. It requires data model acceleration builds for each GUID. You can imagine how resource hungry that can be at all levels.

Context Gens and Correlation Searches

Search Scheduling

A context is about defining ways to ask if something is normal, high, extreme etc. Most of our context gens run across buckets of time for time ranges like 24 hours, 30 days and so on.

Scenario. Lets say we have a context gen that is anomalous login successes by source, app and user. This should let me catch use of credentials from sources not normally seen or at a volume off of normal. If I refresh that context hourly but also run my detection search that uses xswhere hourly; I run the risk of a race condition. I could normalize in the new bad or unexpected source into the context BEFORE I detect it. I would probably schedule my context gen nightly so during the day before it refreshes I get every chance to have ES Notables trigger before the data is normalized into our context. So be sure to compare how often you refresh your context compared to when you use the context.

Time Windows

Check your context generation time range lines up with how far back you accelerate the models. It is easy to say run over 90 days then find out you only accelerated 30 days.

Check the run duration of your searches. Validate your search is not taking longer to run than the scheduled interval of the context gen or correlation search. That always leads to tears. Your search will run on it’s schedule. Take longer to run and get scheduled for it’s next run. It will actually start to “time slide” as the next run time gets farther and farther behind compared to the real time the search job finished. I saw this happen with a threat gen search for a threat intel product app once. It was painful. Job/Activity inspector is your friend on checking run durations. Also check the scheduled search panel now and then.

Look back at the previous posts. We make contexts over time buckets and we make sure to run a search that leverages it over the same bucket width of time. Do trending over a day? Make sure you run your matching correlation search over a day’s worth of time to get numbers on the same scale. Same goes for by hour. Normally you would not make a context by day and search by hour. The scales are different. Mixing scales can lead to odd results.

Embracing the odd:

One thing you should get from this series. It is all about The Question. Imagine we trend event count, or data volume per day for a source. Would it ever make sense to use that context over only an hour’s worth of data? Sure. You would get the real low end of the terms like minimal, low, maybe medium. If you saw hits matching “is extreme” you know that you have a bad situation going on. After all, you are seeing a days worth of whatever in only an hour window. Sometimes you break the “rules” because that is the question you want to ask.

I probably would not do that with the Anomaly Driven contexts. After all, you want anomalous deviation off normal.

Share

Splunk Getting Extreme Part Five

Part five brings another use case. We will use values in raw data not a calculated value to make our context and then match against the raw events without bucketing them.

First, we have glossed over an important question. What does data match when you use xsWhere and there is no matching class in the context? It uses the “default” class. Default is the weight average of all the existing classes within the context. If you look within the csv for the container you will find lines for your context where the class is an empty string “”. That is default. Default is also what is made for the class when no class is specified.

You get a message like the following when a class value is not in the context you are trying to use.

xsWhere-I-111: There is no context 'urllen_by_src' with class 'Not Found' from container '120.43.17.24' in scope 'none', using default context urllen_by_src

Use Case: Finding long urls of interest

Just the longer URLs:

Let’s try just creating a context of all our url_length data from the Web Data Model. This version of the search will not break this up by class. We will just see if we can find “extreme” length urls in our logs based on the log data itself.

Context Gen:

| tstats dc(Web.url_length) as count, avg(Web.url_length) as average, min(Web.url_length) as min, max(Web.url_length) as max from datamodel=Web where Web.src!="unknown" | rename Web.* as * | xsCreateDDContext name=urllen container=web_stats app=search scope=app type=domain terms="minimal,low,medium,high,extreme" notes="urllen" uom="length"

The table that is displayed when the xsCreateDDContext finishes is interesting. Below we sort for extreme and see the urllen value is 678. This tells us in my data the url_length value high end is around 678 characters. If we search the logs using this context we find that our results are not a magic all bad “is extreme” situation. All the interesting URLs are down in the low/medium ranges with all the good urls. You have to come up with a another way to slice data when the signal and noise are so close to each other. This approach might work for some other use case, but not for this particular data set.

Searches:

index=weblogs | xswhere url_length from urllen in web_stats is low | stats count by url

We get an overwhelming number of matches since most of our URLs are in the low range.

index=weblogs | xswhere url_length from urllen in web_stats is extreme | stats count by url

We get a manageable 5 events from extreme but they are not interesting URLs.

URL Length by Src:

We get a different url_length distribution if we break it out by src. Remember, default is the weighted average of all the classes in the context if you use classes. The table we see when our context gen finishes is that default.

Context Gen:

Notice in our by src version our urllen for extreme in the default context is around 133. That is going to come from the weighted average of the per source classes.

| tstats avg(Web.url_length) as average, min(Web.url_length) as min, max(Web.url_length) as max from datamodel=Web where Web.src!="unknown" by _time, Web.src span=1m | rename Web.* as * | stats count, min(min) as min, max(max) as max, avg(average) as average by src | eval max=if(min=max, max+average, max) | eval max=if(max-min < average , max+average, max) | xsCreateDDContext name=urllen_by_src container=web_stats app=search scope=app type=domain terms="minimal,low,medium,high,extreme" notes="url len by src" uom="length" class="src"

Search:

index=weblogs | xsWhere url_length from urllen_by_src in web_stats by src is very very extreme | stats values(src) AS sources, dc(src) as sourceCount by url, status_description

Even using is very very extreme we get a lot of results. However the urls are much more interesting. Granted none of the searches here in Part Five are super awesome. They do show a workable example of using and XS context directly against raw events. We also get a good comparison of a classless context which does what it is supposed to vs with a class that helps draw out more interesting events. Formulating your XS context and your search questions are very important so you really have to think about what question you are trying to answer and experiment with variations against your own data.

In my data I find interesting URLs trying to redirect through my site but they land on a useless wordpress page.

Share

Splunk Getting Extreme Part Four

Let’s revisit our EPS Splunk Metrics. This time we are going to use type=domain and do something else a little different. We are going to make a non classed context and apply it directly to the raw event data.

The Question:

The question we want is, what systems are generating metrics events well above low AND we want to know what concept term they fall in?

We also want get the original raw events precise in time. That is technically a different question than we asked in part one of this blog series. There we made more of a canary that asked when did a given host go over normal for it’s activity levels with no relation to the whole environment in a particular bucket of time.

Context Gen:

We want to make a context that is not setup for a class. Note we don’t even use a time bucketing step. The search just is set to run across the previous 30 days which is typically the retention period of Splunk index=_internal logs.

The reason we are doing it this way is we are wanting to find events that are something like high, extreme etc for our entire environment. We don’t care about trending per source system (series). We get count as the distinct count of source systems (series), then the min and max values for EPS for all sources.

index= _internal source=*metrics.log earliest=-30d latest=now group=per_host_thruput | stats dc(series) as count, min(eps) as min, max(eps) as max | xscreateddcontext name=eps container=splunk_metrics app=search scope=app type=domain terms="minimal,low,medium,high,extreme" notes="events per second" uom=“eps”

Search:

First we see if we have any extreme events in the past 30 days.

index= _internal source=*metrics.log group=per_host_thruput | xswhere eps from eps in splunk_metrics is extreme

I get one event, the largest catch up of web log imports.

11-11-2016 11:11:54.003 -0600 INFO Metrics - group=per_host_thruput, series="www.georgestarcher.com", kbps=2641.248883, eps=7144.764212, kb=81644.455078, ev=220854, avg_age=1172745.126151, max_age=2283054

Next let’s get fancier. We want to know events very above low and have XS tell us what concept term those events best fit. This is a handy way to get the word for the term it fits

index= _internal source=*metrics.log group=per_host_thruput | xswhere eps from eps in splunk_metrics is very above low | xsFindBestConcept eps from eps in splunk_metrics | table _time, series, eps, kbps, BestConcept

Summary

The point is that you can use XS to build a context profile for raw data values then apply them back to the raw events. Raw events, if you can keep the number of matches low, make great ES notable events because they have the most of the original data. Using stats and tstats boils down the fields. That requires us to pass through values as we we saw in Part Three to make the results more robust.

Share

Splunk Getting Extreme Part Three

We covered an example of an Anomalous Driven (AD) context in part one and how to use tstats in part two. We will cover the a traditional Domain type context example using Authentication data and tstats.

In XS commands DD mean Data Driven context. Here we will cover a use case using xsCreateDDContext of the type Domain. Using type=domain means we are going to need a count, max, mix. The terms we will use are minimal, low, medium, high, and extreme. This will let us find certain levels of activity without worrying about what “normal” is vs “anomalous” as we saw in part one.

Extreme Search Commands:

xsCreate

The Create method tells extreme search to create the container and populate or update all the classes if the container already exists. You have to use this if the container does not already exist.

xsUpdate

This functions exactly as the xsCreate except that it will NOT work if the container does not exist. It will return an error and stop.

xsDeleteContext

This will delete a SPECIFIC class or “all” if no class is specified from a context in a container. There is no XS command to actually remove the contents from the container. Deleting against a context/container without a class leaves all the class data but searching against the context will act as if it does not exist. The deletion without a class removed the default class lines. From there XS commands act as if the context is gone though most of the class data remains. This means the file exists with most of it’s file size intact. There is not even an XS command to remove an entire container. We can still cheat from within Splunk. Normally, you should NEVER touch the context files via the outputlookup command as it will often corrupt the file contents. If we want to empty a container file we can just overwrite the csv file with empty contents. The CSV file name will be in the format: containername.context.csv

If we had made a context with:
| xsupdateddcontext name=mytest container=mytestContainer app=search scope=app class=src terms=terms="minimal,low,medium,high,extreme"

We can nuke the contents of the file using the search:
makeresults | outputlookup mytestContainer.context.csv

We can now populate that container with either xsCreate or xsUpdate. xsUpdate will work since the container file exists. This trick can be handy to reset a container and cull out accumulated data because the file has grown very large over time with use or if you accidentally fed too much data into it.

Let’s talk about that for a minute. What is too large? XS has to read in the entire CSV into memory when it uses it. That has the obvious implications. A data set of 10 rows with the normal 5 domain terms of “minimal,low,medium,high,extreme” gives us 56 lines in the csv. 10 data items plus a default data item = 11 * 5 = 55 plus a header row = 56. Generally, if you are going to have 10K data items going into a context I would make one container for it and not share that container with any other contexts. That way you are not reading in a lot of large data into memory you are not using with your XS commands like for xswhere filtering.

One other thing to consider. The data size of this file is important in the Splunk data bundle replication. It is a csv file in the lookups folder and gets distributed with all the other data. If you made a context so large the CSV was 1.5GB in size you could negatively impact your search bundle replication and be in for the fun that brings.

xsFindBestConcept

This command comes from the Extreme Search Visualization app. It lets you run data against your context and have it tell you what concept terms best match the each result. This command has to work pretty hard so if your data going in is large it may take a few minutes to come back.

| tstats summariesonly=true dc(Authentication.user) as userCount from datamodel=Authentication where (nodename=Authentication.Failed_Authentication sourcetype=linux_secure) by _time, Authentication.src, Authentication.app span=1d | rename Authentication.* AS * | xsFindBestConcept userCount FROM users_by_src_1d IN auth_failures BY "src,app"

xsGetWhereCIX

This command comes acts like xswhere but does not actually filter results. It just displays ALL results that went in and what their CIX compatibility value is for the statement you used.

| tstats summariesonly=true count as failures, dc(Authentication.user) AS userCount from datamodel=Authentication where nodename=Authentication.Failed_Authentication by _time Authentication.src, Authentication.app span=1d | eval avgFailures=failures/userCount | rename Authentication.* AS * | xsgetwherecix avgFailures from failures_by_src_1d by "src,app" in auth_failures is extreme

Min and Max:

XS for the type=domain needs count, and min/max values with depth. This means where min/max are never equal. The fun part is HOW you get a min and max is up to you. You will see examples that just use the min() and max() functions. Other examples will get min() and make max the median()*someValue. You often have to experiment for what fits your data and gives you an acceptable result. We touched on this value spreading in part one of Getting Extreme.

Here a couple of different patterns though you can do it any way you like.

  1. stats min(count) as min, max(count) as max … | eval max=if(min=max,min+5,max) | eval max=if(max-min<5,min+5,max)

  2. stats min(count) as min, median(count) as median, average(count) as average … | eval median=if(average-median<5,median+5,average) | eval max=median*2

If you don’t get min/max spread out you will see a message like the following when trying to generate your context.

xsCreateDDContext-W-121: For a domain context failures_by_src_1d with class 103.207.36.133:sshd, min must be less than max, skipping

Use Case: Authentication Abusive Source IPs

Question: We will define our question as, what are the source IPs that are abusing our system via authentication failures by src and application type. We want to know by average failures/number of user accounts tried per day. We also want to know if it is simply an extreme number of user accounts failed regardless of the number of failures per day. Yeah, normally I would do by hour or shorter period. The test data I have is from a Raspberry Pi exposed to the Internet. The RPi is sending to Splunk using the UF for Raspberry PI. That RPi is also running fail2ban, so it limits the number of failures a source can cause before it is banned for a while. This means we will work with a scale that typically maxes out at 6 tries.

Avg Failures/userCount by src by day

Here we divide the number of failures by the number of users. This gives us a ball park number of failures for a user account from a given source. We could put user into the class but that would then make our trend too specific of being tied to a distinct src, app,user. We want more a threshold of failures per user per source in a day.

Context Gen:

| tstats summariesonly=true count as failures, dc(Authentication.user) AS userCount from datamodel=Authentication where nodename=Authentication.Failed_Authentication by _time Authentication.src, Authentication.app span=1d | eval avgFailures=failures/userCount | stats count, avg(avgFailures) as average, min(avgFailures) as min, max(avgFailures) as max by Authentication.src, Authentication.app | rename Authentication.* AS * | eval max=if(min=max,min+5,max) | xsCreateDDContext name=failures_by_src_1d app=search container=auth_failures scope=app type=domain terms="minimal,low,medium,high,extreme" notes="login failures by src by day" uom="failures" class="src,app"

Search:

Here we use the context to filter our data and find the extreme sources.

| tstats summariesonly=true count as failures, dc(Authentication.user) AS userCount from datamodel=Authentication where nodename=Authentication.Failed_Authentication by time Authentication.src, Authentication.app span=1d | eval avgFailures=failures/userCount | rename Authentication.* AS * | xswhere avgFailures from failures_by_src_1d by "src,app" in auth_failures is extreme | iplocation prefix=src src | rename src_City AS src_city, src_Country AS src_country, src_Region as src_region, src_lon AS src_long | lookup dnslookup clientip AS src OUTPUT clienthost AS src_dns

Distinct User Count by src by day

Here we are going to trend the distinct number of users tried per source without regard of the number of actual failures.

Context Gen:

| tstats summariesonly=true dc(Authentication.user) as userCount from datamodel=Authentication where (nodename=Authentication.Failed_Authentication sourcetype=linux_secure) by _time, Authentication.src, Authentication.app span=1d | stats min(userCount) as min, max(userCount) as max, count by Authentication.src, Authentication.app | rename Authentication.* as * | eval max=if(min=max,min+5,max) | xsCreateDDContext name=users_by_src_1d app=search container=auth_failures scope=app type=domain terms="minimal,low,medium,high,extreme" notes="user count failures by src by day" uom="users" class="src,app"

Search:

Here we use the context to filter our data and find the sources with user counts above medium.

| tstats summariesonly=true dc(Authentication.user) as userCount from datamodel=Authentication where (nodename=Authentication.Failed_Authentication sourcetype=linux_secure) by _time, Authentication.src, Authentication.app span=1d | rename Authentication.* AS * | xswhere userCount from users_by_src_1d in auth_failures by "src,app" is above medium

Merge to get the most abusive sources by app

We can actually merge both of these searches together. This lets us run one search over a give time period reducing our Splunk resource usage and giving us results that match either or both of our conditions.

Combined Search:

This search is bucketing the time range it runs across into days then compares to our contexts that were generated with day period as a target. Normally for an ES notable search you would not bucket time with the “by” and “span” portions as you would be only running the search over something like the previous day each day.

| tstats summariesonly=true count AS failures, dc(Authentication.user) as userCount, values(Authentication.user) as targetedUsers, values(Authentication.tag) as tag, values(sourcetype) as orig_sourcetype, values(source) as source, values(host) as host from datamodel=Authentication where (nodename=Authentication.Failed_Authentication sourcetype=linux_secure) by time, Authentication.src, Authentication.app span=1d | eval avgFailures=failures/userCount | rename Authentication.* AS * | xswhere avgFailures from failures_by_src_1d by "src,app" in auth_failures is extreme OR userCount from users_by_src_1d in auth_failures by "src,app" is above medium | iplocation prefix=src src | rename src_City AS src_city, src_Country AS src_country, src_Region as src_region, src_lon AS src_long | lookup dnslookup clientip AS src OUTPUT clienthost AS src_dns

The thing to note about the CIX value is anything that is greater than 0.5 means it matched both our contexts to some degree. The 1.0 matched them both solidly. If the CIX is 0.5 or less it means it matches only one of the contexts to some degree. Notice, I used “is extreme” on one test and “is above medium” on the other. You can adjust the statements to fit your use case and data.

Bonus Comments:

You will notice in the searches above I added some iplocation and dnslookup commands. I also used the values and extra eval functions to add to the field value content of the results. This is something you want to do when making Enterprise Security notables. This helps give your security analysts data robust notables that they might can triage without ever drilling down into the original event data.

Share