Splunk Getting Extreme Part Seven

Welcome to part seven where we will try a User Driven context for Extreme Search.

Our use case is to find domain names in a from email address that are look alike domains to our own. We need to use Levenshtein to do this. There is a Splunk app for it on splunkbase. The app does have some issues and needs to be fixed. I also recommend editing the returned fields to be levenshtein_distance and levenshtein_ratio.

Test Data:

I took the new Top 1 Million Sites list from Cisco Umbrella as a source of random domain names. Then I matched it with usernames from a random name list. I needed some test data to pretend I had good email server logs. I do not have those kind of logs at home. The below data is MOCK data. Any resemblance to real email addresses is accidental.

source="testdata.txt" sourcetype="demo"

Context Gen:

This time we do not want to make a context based on data. We need to create a mapping of terms to values that we define regardless of the data. Technically we could just use traditional SPL to filter based on Levenshtein distance values. What fun would that be for this series? We also want to demonstrate a User Driven context. Levenshtein is the number of characters difference between two strings, aka the distance. A distance of zero means the strings match. I arbitrarily picked a max value of 15. Pretty much anything 10 or more characters different are so far out we could never care about them. I then picked terms I wanted to call the distance ranges. The closer to zero the more likely it is a look alike domain. “Uhoh” is generally going to be a distance of 0-2 then we go up from there. You could play with the max to get different value ranges mapped to the terms. It depends on your needs.

We can use the Extreme Search Visualization app to examine our context curves and values.

Exploring the Data:

We can try a typical stats count and wildcard search to see what domains might resemble ours of “georgestarcher.com”

It gets close but matches domains clearly not even close to our good one. Here is the list from my test data generation script.

georgeDomain = ['georgestarcher.com','ge0rgestarcher.com', 'g5orgestarhcer.net', 'georgestarcher.au', 'georgeestarcher.com']

We can see we didn’t find the domain staring with g5. Trying to define a regex to find odd combinations of our domain would be very difficult. So we will start testing our Levenshtein context.

Let’s try a getwhereCIX and sort on the distance.

Next let’s try using xsFindBestConcept to see what terms match the domains we are interested in compared to their distances.

Using our Context:

We have an idea what we need to try based on our exploring the data. Still we will try a few different terms with xswhere to see what we get.

Using: ”is interesting”

We can see we miss the closest matches this way and get more matches that clearly are not look alikes to our domain.

Using: “is near interesting”

Adding the hedge term “near” we can extend matching interesting into just a little into adjacent concept terms. We find all our terms even the closest ones. The problem is we also extended up into the higher distances too.

Using: “is near uhoh”

Again, we use near to extend up from uhoh but we find it is not far enough to find the domain “g5orgestarhcer.net”

Using: “is very below maybe”

This time we have some fun with the hedge terms and say very to pull in the edges and below to go downward from the maybe concept. This gives us the domains we are exactly trying to find. You may have noticed we dropped where the distance was zero in our searches. That is because we don’t care where it is from our own legitimate domain name.

Last Comments:

Levenshtein can be real hard to use on shorter domain names. It becomes too easy to match full legitimate other domain names compared to small distances of your own. If you try and use this in making notables you might want to incorporate a lookup table to drop known good domains that are not look alike domains. Here is the same search that worked well for my domain but for google.com. You can see it matches way too much stuff, though it does still capture interesting near domain names.

Example: google.com