Deep Reinforcement Learning for Security: Toward an Autonomous Pentesting Agent

AI
security
RL
Author

Shane Caldwell

Published

April 28, 2020

Introduction

I’ve found myself very interested in reinforcement learning recently. As you do deep learning work, you can sometimes feel limited in the problems you can solve by the paradigms you have available. To paraphrase Andrej Karpathy, the APIs to deep learning can seem constraining, despite their power. We start with a fixed-size input and a fixed-size output for problems like classification, routinely solved by CNNs. To deal with text, we have RNNs and the more intricate LSTM models that can deal intelligently with long sequences using a kind of memory. There’s an incredible array of problems that can be formulated to be solved by those approaches. We’ve seen artwork generated with GANs, object detectors used for medical diagnostics, and CNNs applied to sound classification. It will be a long time before we’re out of runway applying these techniques, with novel variations, to different fields with a lot of success. There are careers to be made for clever folks who use domain knowledge in a subject to reformulate their problem into one of these “solved problems”.

When I started studying machine learning, I actually had a specific domain in mind I wanted to apply it to. I’d been a penetration tester for almost two years and had recently earned my OSCP when I was offered a position in a Masters in Data Science program. Pentesting was super fun, but I found myself daydreaming about whether it was possible to develop intelligent tools to aid in penetration testing. What would a tool like that be like? Specifically, I wanted to know whether it was possible to create an autonomous pentesting agent, like the kind of sentient hacking AI that populates the endlessly readable William Gibson novels.

It was also partially born out of a desire to make a useful tool in a competitive field. There are really wonderful tools out there for the would-be attacker. For web application pentesting, Burp Suite is an incredibly comprehensive exploitation tool. It’s a proxy that sits in between the HTTP requests coming from your client browser and the server, allowing you to freely edit the content going to the server. Through this, all sorts of interesting attacks are possible. Using the tool is easy, as well! After you browse the site normally for a while, it logs all the routes you can send requests to and all the types of requests you’ve sent and received. From there, you can run a scan. The scan can reliably find everything from cross-site scripting to SQL injection, mostly with the power of regular expressions and a handy list of strings that are commonly used in these sorts of attacks.

From the network side of things, Metasploit is even more compelling. It’s a tool and framework all in one. From within the Metasploit console you can keep track of almost everything you need to run a penetration test successfully. You can run scans, store information about target hosts, customize and launch exploits, and select payloads, all from within that tool. Even more incredible: it’s open source! Once a proof of concept for an exploit has been discovered, there’s an easy-to-use API that allows you to write a little Ruby and produce your own exploit module that you can share with others.

Those tools are remarkably solid and are produced by a community of talented security professionals. Better yet, they’re frameworks that allow a developer to add new functionality for anything they find lacking and share it with the world. Still, I couldn’t help but think it should be possible to perform the work automatically. I don’t mean ‘script recurring tasks’ automatic, I mean ‘set it, perform the pentest, let me know how to patch the holes you found’ automatic. That’s not to say I want the work to go away. The most exciting aspects of the work are the rare 15% of it that requires an insane amount of creativity and knowledge. You can read writeups from folks who have found seemingly invisible bugs that you would think don’t have any impact at all, and used them to completely compromise applications and plunder their databases. If you don’t believe me, the popularization of bug bounties has made it incredibly easy to see what kinds of hacks are out there in the wild. Bug bounty programs allow hackers to make money for security bugs found in an organization’s applications or networks, and many organizations running these programs allow writeups to be published after the fact. It’s humbling to read them.

That other 85% or so can be a bit of a slog, though. There are several well known security issues that crop up time and time again. Finding them is always exciting in the way that all hacking is - you broke a thing that’s not supposed to break! You have access to stuff you’re not supposed to have! But it’s not challenging or engaging, really. Is it possible to build tools that make all of security the fun part? And of course, the holy grail - is it possible to make an agent even better at penetration testing than humans?

But before we plot the future, let’s see where we stand. How is ML being applied to security today?

The state of ML in Defense

Most machine learning naturally lends itself to defense more than attack. There’s actually been a pretty good amount of defensive tooling developed. And why not? The paradigms fit like a glove. As a defender, your biggest problem is probably that you have too much information. Networks are just happening all the time, generating all sorts of traffic on all sorts of services. You’re a human being with two eyes and a limited amount of caffeine to throw at the problem of perceiving incredibly granular logs. If you knew something bad was happening, you’re probably educated enough to take an action, but how can you know? Frequently some scripted logic and a regular expression list can alert you to some well-described dangers - imagine your database administrator logging in from an IP belonging to a country they don’t live in and then changing their password - but not all dangerous situations are that well described. What about stuff that’s just weird?

These fall under the general bucket of anomaly detection as a problem. First, you gather a lot of data and group it into some sort of observation at a fidelity a model can interpret. Then, you run the observation through the model and get a boolean output. Either it’s bad, and you alert a person, or it’s good, and nothing happens. Think about it as a “weird/not weird” classifier. The intuition behind the perceptual task is stored within the dataset, and the algorithm turns it into something that augments a human’s capabilities by taking cognitive load off of them.
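
As a rough illustration (and not any particular vendor’s approach), here’s a minimal sketch of a “weird/not weird” detector using scikit-learn’s IsolationForest. The connection features are made up for the example:

import numpy as np
from sklearn.ensemble import IsolationForest

# Pretend each row is (bytes sent, bytes received, duration in seconds)
# for one connection observed on the network. These features are invented.
normal_traffic = np.random.normal(loc=[500, 2000, 30], scale=[50, 200, 5], size=(1000, 3))

detector = IsolationForest(contamination=0.01, random_state=0)
detector.fit(normal_traffic)

# A new observation: a huge outbound transfer with a very long duration.
suspicious = np.array([[50000, 100, 600]])
print(detector.predict(suspicious))  # -1 means "weird", 1 means "not weird"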

If you’re looking for something with a similar principle but more automated, all sorts of “smart firewalls” can be made this way. You learn what looks normal, train a network to recognize normal, and then if you’re not normal you’re an anomaly. The upside is big - if you detect an attack, you can take an action. The downside of a false alarm can be bad depending on the tooling, but as long as you’re not overwhelmed with anomalies to look at, a false positive is fine. At least in theory, whatever you’re looking at should be anomalous and therefore interesting.

In practice, this is challenging to pull off. What’s normal for a network is a living, breathing thing. New people come in, they leave. New servers come on site. If configured poorly, all of these things can be anomalous. Training a network in a custom way is also challenging - you want to learn a good distribution of normal, but for that to be legitimate you would need to know beyond a shadow of a doubt that your network is not currently compromised as you’re training. Obviously, you have no idea whether that’s the case and there’s really no way to prove otherwise. So you have this sort of ontological problem for these types of detectors that’s challenging to solve, at least at the network level.

Cylance claims to do this at the endpoint level, using AI to find malware processes on desktops and phones. There’s not really a clear whitepaper that breaks down how, but it sounds pretty cool. The approach for an endpoint anomaly detector seems as sound as the others in the anomaly detection paradigm - in each, you find a distribution of process behavior that’s normal or acceptable, and if something falls outside of that you can flag it and allow a user to make the call to override the detection if it’s a false positive.

You couldn’t really call any of these tools autonomous defenders, though. You don’t have agents in the environment watching network traffic and taking actions in response to it. You might automatically put someone on a block list, or filter bad traffic (I too have scraped websites aggressively enough that I was hit with a captcha), but none of those tools are giving the Security Operations Center the day off to play golf. We don’t have ourselves an “autonomous defender”, we have a fire alarm.

The state of ML in Offense

The state of things over on the offensive side is actually starting to catch up to defense, at least over the last couple of years. Attackers do a lot of enumerating resources, which is its own form of data collection (though it pales in comparison to the sheer volume of the defensive side).

They follow a very similar paradigm as well, actually. Except now anomaly means something different. On the offensive side it’s “Hey bud, that’s a whole lotta attack surface to look at there. Want me to check it out and see if any tires seem worth kicking?”

Bishop Fox’s Eyeballer is actually a really cool example of one of these. Many security tools sniff out the HTTP endpoints of a target and screenshot them for you to review. Eyeballer goes that extra step forward and lets you apply classification to the problem: run the screenshots through the classifier to find out if they’re login pages, whether they look like old custom code, and so on. It’s a great example of taking a domain-specific pentesting problem and making it fit into the classification paradigm.

There’s been similar work done with text. I even found a language model used to do reconnaissance on a target’s Twitter and then customize messages with phishing links catered to them - this is a BlackHat talk from ZeroFox. As you might’ve noticed, there are a lot of foxes in security consulting. But also, this is very much in line with what I was thinking of: an automated, intelligent tool to assist with security testing.

For the record, I think all of the tools I’ve listed above are insanely cool and I would’ve been proud to have worked on any of them. It’s not a critique to say that none of them fit the paradigm I’m looking for: how would you go about developing an agent that could act autonomously? To be specific, the ‘hello world’ of such an agent might look as follows:

How could you develop a system that has never seen Metasploitable or similar vulnerable-by-design hosts, yet when placed on the same network as one can automatically enumerate information about it, exploit it, and extract data from it? If such a system were robust enough to handle many different intentionally vulnerable systems, it would be an autonomous pentesting agent.

Reinforcement Learning

If you’re interested in AI, you’ve probably heard of reinforcement learning. Even if you haven’t heard it by that name, it’s definitely been in the news. It’s the paradigm that made AlphaGo possible, and the same paradigm DeepMind used to crush Atari scores game after game. It’s also made a bot that can play Smash Bros pretty dang well. But what is it? And how might it help us develop a system that can hack autonomously?

Broadly, reinforcement learning is the study of agents that learn by trial and error. Agents learn policies that direct them to take actions, then observe the change in the environment and the reward they receive to inform their next action.

Multi-Armed Bandits

The classical non-deep example, the one a reader is most likely to have come across in the past, is the multi-armed bandit. The problem is a simple one: you find yourself in a casino. You stand in front of a slot machine with three arms. You’re told that each of the arms has a different probability of success - some are luckier than others. Your goal is to find the best strategy to achieve the highest reward you can in a given number of arm pulls.

A naive approach might be to play each arm many times. In fact, play each arm so many rounds that you can eventually estimate its true probability of reward once the law of large numbers kicks in. Once you’ve done this for each arm, you merely need to hang out on the arm that ended up with the highest reward probability, right? Easy peasy.

Those of you who have gone to a casino would surely retort that this is an inefficient and expensive strategy. Fine, then: let’s introduce some definitions and try to use math to be a little more than lucky.

We have \(n\) arms on the machine, and \(t\) time steps to play the game. Each arm represents an action \(a\) we can take. Our goal is to approximate the true success probability of each of the arms, or \(q(a)\), and then exploit that knowledge to maximize reward.

We’ve established we can’t know the true reward, so we’ll call our approximation \(Q(a)\). Because this is an approximation based on our current understanding of the environment, and we’re an intelligent agent that updates our beliefs based on our observations, it makes the most sense to think about \(Q_t(a)\), our estimated value of a given action at a given time step \(t\).

First, we know nothing about the environment, so we pull an arm at random. Let’s say it gives us a reward! For one pull of the arm you’ve gotten exactly one reward. What do you think that arm’s odds of success are now?

Well, it makes the most sense to basically just keep a running count of how many times we’ve tried the action and what our total reward has been with it. That’s our estimated probability. Something like:

\[ Q_t(a) = \frac{R_1 + R_2 + \cdots + R_{N_t(a)}}{N_t(a)} \]

With this, we could keep a running best guess of the reward for each action.

But that’s a lot of information to record. For a computer program, that means the memory needed scales linearly with the number of time steps considered. In practice, we use something called a Q table and an incremental update to keep the memory constant. I won’t go into it too much here, but you’ll see it below in my Python implementation. The idea is the same: update \(Q_t(a)\) at each time step, allowing it to become slowly more accurate.
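
Concretely, the incremental form of that running average (which is exactly what the code below does, with step size \(\frac{1}{N_t(a)}\)) is:

\[ Q_{t+1}(a) = Q_t(a) + \frac{1}{N_t(a)} \left( R_t - Q_t(a) \right) \]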

So what is our strategy? A greedy strategy is just to read the action from the Q table that maximizes your reward:

\[ A_t = \underset{a}{\operatorname{argmax}}\; Q_t(a) \]

Remember, we already pulled a lever once and it yielded a reward. So that action is the only one in the Q table with a value over 0.0. Does that just mean we select that action over and over again, without ever trying the other arms? How do we know the other actions wouldn’t give us even greater rewards?

This is the essence of the multi-armed bandit problem: do we exploit our current knowledge of the environment to the best of our ability, or explore to learn more about an action we don’t currently understand very well?

To do this, we introduce \(\epsilon\). With probability \(\epsilon\), we choose a random action instead of the action we currently believe will yield us the most gain, observe our success or failure, and update our \(Q_t(a)\) for that action.

Given a reasonable choice of \(\epsilon\) and enough time steps, this allows us to converge on the best solution, even if our initial solution is not optimal.

We can examine this in code, as below:


import numpy as np

class Environment:
    def __init__(self, p):
        '''
        p is the probability of success for each 
        casino arm
        '''
        self.p = p
    
    def step(self, action):
        '''
        The agent pulls an arm and selects an action.
        The reward is stochastic - you only get anything 
        with the probability given in self.p for a given arm.
        
        action - the index of the arm you choose to pull
        '''
        result_prob = np.random.random() # Samples from a continuous uniform distribution on [0, 1)
        if result_prob < self.p[action]:
            return 1
        else:
            return 0

class Agent:
    def __init__(self, actions, eps):
        '''
        actions - The number of actions (arms to pull)
        
        eps - The frequency with which the agent will explore,
              rather than selecting the highest reward action
        '''
        self.eps = eps
        self.num_acts = actions
        self.actions_count = [0 for action in range(actions)]
        self.Q = [0 for action in range(actions)]
    
    def act(self):
        if np.random.random() < self.eps:
            #we explore
            action = np.random.randint(self.num_acts)
        else:
            #we exploit
            action = np.argmax(self.Q)
        return action
    
    def update_q_table(self, action, reward):
        self.actions_count[action] += 1
        step_size = 1.0 / self.actions_count[action]
        self.Q[action] = (1 - step_size) * self.Q[action] + step_size * reward

def experiment(p, time_steps, eps):
    '''
    p is probabilities of success for arms
    time_steps - number of time steps to run experiment for
    eps - the epsilon value to use for the agent
    '''
    env = Environment(p)
    agent = Agent(len(p), eps)
    for time_step in range(time_steps):
        action = agent.act() # get action from agent
        reward = env.step(action) # take action in env
        agent.update_q_table(action, reward) #update with reward
    return agent.Q

q_table = experiment([0.24, 0.33, 0.41], 1_000_000, 0.1)

The final q_table appears as [0.2397833283177857, 0.3332216502695646, 0.41020130865076515], indicating we were pretty successful in estimating \(q(a)\) with \(Q_t(a)\).

It’s a simplistic example, but it illustrates the power of reinforcement learning. Unlike a supervised learning example, we never told the system that the right answer was the third arm, with \(q(a_3) = 0.41\). We enabled the agent to observe the effects of its actions, update its policy, and change its behavior.

If you want to read more about classic reinforcement learning, I highly recommend the extremely pleasant to read and extremely free Reinforcement Learning: An Introduction. Hopefully this gentle introduction has convinced you there’s an interesting power here, different from the supervised or unsupervised learning methods you may have known in the past.

The Successes (and Caveats) of Deep Reinforcement Learning

Reinforcement learning allows for self-directed optimization. Deep learning allows for function approximation. By combining the two we’re able to map environment state and action pairs into expected rewards.
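
To make that concrete, here’s a minimal sketch (assuming PyTorch) of the kind of function approximator involved: a small network mapping an environment state to an estimated value for each discrete action. The sizes are arbitrary placeholders, not any particular published architecture.

import torch
import torch.nn as nn

state_size = 16   # however we choose to encode the environment's state
num_actions = 4   # the discrete actions available to the agent

# Replaces the Q table: a function approximator from state to action values.
q_network = nn.Sequential(
    nn.Linear(state_size, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, num_actions),  # one estimated value per action
)

state = torch.randn(1, state_size)    # a dummy observation
q_values = q_network(state)           # estimated value of each action
best_action = q_values.argmax(dim=1)  # greedy action selection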

Successes

I won’t go too long here, because there’s already plenty of hype. AlphaZero can play Go better than anyone has ever played Go, and through self-play it eventually invented novel openings that human beings are now studying. It’s hard to overstate how mindblowing that is. I think this was a pretty epoch-defining event for anyone interested in AI, in any field.

Caveats

Before I get into the weeds of the challenges deep reinforcement learning faces as a field, I’d be remiss to not advise anyone interested to read Alex Irpan’s Deep Reinforcement Learning Doesn’t Work Yet. I’ll be summarizing some of these points below, but the whole article is a sobering but ultimately optimistic read for those looking to cut their teeth on deep RL.

I’ll be looking at each of these as challenges to be overcome for my own research: developing an autonomous pentesting agent.

Sample Inefficiency

One of the key problems in deep RL is sample inefficiency: that is, you need a whole lot of data to get good performance. The ratio of environment complexity to data required for strong performance can seem frighteningly high. For many environments, particularly real-life ones, you’re almost out of luck.

Even in my multi-armed bandit scenario, I ran 1,000,000 time steps, and that was a pretty simple environment to learn from. Imagine training an agent against Metasploitable. You allow the agent to take actions until the completion of the episode. Then you restart the virtual machine in a clean state, and begin again. Parallelizing this requires multiple virtual machines, and the time between episodes is as long as it takes to load up a fresh disk image - and that’s for a single host! Full environments representing entire networks would be even harder to generate adequate experience for. Think about how long it takes you to spin up a fleet of boxes in Amazon, much less configure all the network policies. Brutal. For a single host, resetting Metasploitable to a clean state would take, optimistically, two minutes a pop. Doing that one million times? That would take about 4 years.

So even if the method could work in principle, generating the experience needed to overcome sample inefficiency is going to be tough.

Reward Function Design is Tough

Designing reward for Go is kinda easy. Collecting territory and winning? These things are good. Giving up territory and losing the game? This is very bad. Atari is pretty straightforward as well. Each of these games provide a score - if you make the score go up, you’re doing well! If the score goes down, or you die, you’re doing poorly.

Expressing those sorts of reward functions in simple environments mathematically is not extraordinarily difficult.

How about more subtle goals though? Take our goal of pentesting:

How do you define good pentesting? To do that, you’d need to ask a good pentester what their goals are on an assessment. Since I don’t have one on hand, my personal experience will have to suffice: good pentesting is about careful thoroughness.

For a real life attacker, your only goal is to find a single exploitable hole good enough to weasel your way into the network, find high-value information, and take off with it. Ideally without letting anyone know you were there. Sort of a depth-first search kinda deal.

Pentesting needs to be wide and deep. You want to present the client with evidence that you looked over their network to the best of your ability and found as many chinks in their armor as possible, at every level of access you were able to achieve. And while doing this, you’re under certain constraints. You can’t break their network to discover a high-value target. Some things are off limits, also known as out of scope. You also have a fixed amount of time, so you can’t explore everything. You have to provide breadth, and use your intuition to decide where going deep will provide the biggest bang for the client’s buck. That’s good pentesting.

There are two kinds of rewards we might try. Sparse rewards only provide reward at the end of the episode if the policy resulted in a ‘success’ - the agent “won” the game. We’re having a hard time defining success for pentesting if we use the above definition, but even if the answer was just ‘got root access on a specific machine’, that likely wouldn’t be enough. With so little to go off of, you can imagine a pentesting agent firing off some random scans, maybe trying some random exploits against random machines, and never receiving even a drop of reward for its trouble. The policy network has no valuable information to backprop on, and you’re essentially dead stuck unless by some miracle the network chooses random actions that lead to success. As a former pentester, I can attest that I have tried that strategy and been very disappointed in it.

In this case, we need something more complicated. Shaped reward provides increasing rewards for states as they get closer to the end goal, rewarding actions that are useful along the way. This sounds like a better fit for our problem. For example, scanning a potential target is not getting root on a high-value target, but it’s a useful step on the way, so we should give some reward there.

How would you express that as a reward function? Exploits are good! Discovering hosts, and information about hosts, is also good. But we want to ensure we’re not just brute-force throwing exploits at hosts to see if they work, so maybe we impose a noisiness cost per action to encourage strategic exploits and scanning. How do we weigh the reward of exploiting versus scanning? When it comes to information exfiltration, how do we teach an agent to understand what high-value versus low-value information is? We want the agent to understand which high-value targets deserve more intensive study, but how do we communicate that? In fact, we don’t want to tell it at all - we want the agent to discover that. Now how do you say that with math? When you try to piece these ideas into a single reward function, it gets hard quickly.
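
Just to make the difficulty concrete, here’s what a naive shaped reward might look like as code. Every event name and weight below is an assumption I’m inventing for illustration; picking them well is exactly the hard part.

# A naive shaped reward sketch. The event names and weights are invented
# placeholders, not recommendations.
NOISE_COST = 0.05  # flat penalty per action to discourage spray-and-pray

REWARD_WEIGHTS = {
    "host_discovered": 1.0,      # found a new machine on the network
    "service_enumerated": 0.5,   # learned about an open service
    "exploit_succeeded": 10.0,   # gained access to a host
    "loot_exfiltrated": 25.0,    # pulled data off a compromised host
}

def shaped_reward(events):
    """Sum the reward for everything that happened this time step,
    minus a flat cost for having taken an action at all."""
    return sum(REWARD_WEIGHTS.get(event, 0.0) for event in events) - NOISE_COST

Even this toy version leaves all of the questions above unanswered - it just makes them explicit.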

Reward Functions like to Blow Up in Your Face

Agents do not care about your problems. They only care about the reward their actions can give them. Despite the elegant expressiveness of mathematics and your best personal efforts, there will probably be a gap between your intentions and the reward you actually specified. In that gap, the agent will find whatever action in the environment gives it the quick fix of reward without all the challenge of doing what you actually wanted.

OpenAI provides an infamous example in one of their experiments: in a boat racing game, they used a shaped reward. The agent got the most reward for winning, but they got partial reward for picking up powerups (useful for winning!) and passing checkpoints.

The agent quickly discovers that it can get the most reward by just collecting the powerups, since they regenerate quickly. It ends up stuck in an endless loop, doing donuts to farm powerups as its opponents whiz by. The agent will never win the race this way, and it still gets an incredible amount of reward. This is called reward hacking.

Think about our previously proposed hodge-podge of actions that would give our hypothetical agent reward. It’s easy to imagine an agent that had not yet penetrated the network finding a successful exploit that got it access to another machine. Great place to farm! The agent would likely just fire that off again and again, and each success would give it more reward. The same could be said about a scan enumerating a host, or any number of activities. Without a carefully crafted reward, our proposed shaped reward could easily be “hacked”, with plenty of reward gained and our task left undone.

The Environment Challenge

State Space

Another thing deep reinforcement learning requires is an environment. For a game like chess or shogi, this is just the board. It’s pretty easy to gracefully represent as a matrix.

Defining a board for pentesting is kind of hard. You start in a fog-of-war situation where you know about the perimeter of a network early on, but you really don’t know the full size of the environment, in terms of number of hosts, until you start finding them. So it’s an environment that starts small and gets bigger over time, with each new host found having different properties.

Most game environments are pretty fixed, so that’s tough. It could be seen as a blessing, though. You’re encouraged to overfit like crazy in reinforcement learning when generating experience in a game, and often those learned skills don’t transfer to a new environment. For penetration testing, each “game” starts on a new network, a differently sized “board”. There’s a general pattern of penetration testing that should stay consistent, but the shape of the network and the hosts on it will define what your optimal actions are. Hopefully that keeps overfitting to a minimum.

Action Space

Your action space, the set of actions available to the agent, also needs to be provided. In chess, for example, this would be the legal moves your agent can make from any input board state.

There are continuous and discrete action spaces. A discrete action space basically just means a countable number of actions; the chess example applies here. A continuous action space might show up when you’re using RL to set the specific value of a sensor, for example, where the value can be any real number between a lower and upper bound. To be honest, I haven’t totally wrapped my head around methods for continuous action spaces, but I have seen a lot of clever problem formulation to make the action space discrete instead.

For example, take that sensor problem - pretty continuous. But what if we assume there’s a minimum amount you can tune the sensor up or down that’s meaningful? Call it \(x\). Now, after taking an observation from our environment, let’s say we only have two options - up or down by \(x\). Well golly gee, sir, up or down? I ain’t no mathematician, but that’s a pretty discrete space if I do say so myself.
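
As a tiny sketch of that reformulation (with a made-up sensor and step size), the continuous ‘set the sensor to any real value’ action collapses into two discrete actions:

# A toy discretization sketch. The sensor and the step size are made up.
STEP = 0.5  # the minimum meaningful adjustment, the "x" above

ACTIONS = {
    0: -STEP,  # turn the sensor down by x
    1: +STEP,  # turn the sensor up by x
}

def apply_action(sensor_value, action):
    """Map a discrete action index onto the underlying continuous control."""
    return sensor_value + ACTIONS[action]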

This sort of judo is on display whenever the problem allows for it. When OpenAI tackled Dota 2, they easily could have considered the action space continuous - but they didn’t. They discretized the action space on a per-hero basis, arriving at a model choosing among 8,000 to 80,000 discrete actions depending on the hero. A discrete action space will be pried from their cold, dead hands.

That’s a lot of moves. OpenAI had access to the game engine’s API, so these actions were probably read rather than hand-coded. For our pentesting problem, how do we handle that? You’re sitting in front of a terminal where you can enter any text. A very minuscule part of the distribution of all text you can type into a terminal is going to be valuable for accessing your hacking tools, and within those tools there’s very specific syntax that will be valuable. That’s a pretty big action space, and I’m not sure we can specify a reward, even a shaped one, that makes it tractable. So what’s the play?

Metasploit API: The ‘game engine’ of pentesting

I puzzled over this for a long time before I did some literature review and found Jonathon Schwartz’s thesis Autonomous Penetration Testing using Reinforcement Learning. In it, he creates a pretty convincing partially observable Markov decision process to model penetration testing. It’s one of the few real attempts I’ve seen to tackle the formulation of the problem. One line in particular really inspired me to take a serious look at the problem again. While justifying some simplifications to his network model, Jonathon says:

The specific details of performing each action, for example which port to communicate with, are details that can be handled by application specific implementations when moving towards higher fidelity systems. Penetration testing is already moving in this direction with frameworks such as metasploit which abstract away exactly how an exploit is performed and simply provide a way to find if the exploit is applicable and launch it, taking care of all the lower level details of the exploit

First, this struck me as an oversimplification. How many times had I loaded up an exploit in Metasploit only to have it not work? Then I had to dig into the specifics of the Ruby code and twiddle with things. Many exploits also have a pretty large number of required arguments to set that demand some domain- or target-specific knowledge. Then I decided this was totally genius. That insanely large action space of the open terminal now starts to look more like a game board. Metasploit stores information about the hosts it knows about, their open services, and distribution information. Exploits apply to specific distributions and services. Metasploit even provides tools for information gathering once you’ve compromised a host. It’s not always enough - often you need to break out of its laundry list of commands and use an honest-to-god terminal. But there’s a lot you can do restricting the action space to the Metasploit level. I haven’t done the back-of-the-envelope math, but that feels like a Dota 2-sized action space to me, maybe smaller.

The actions you can take with Metasploit, and the information it chooses to store, reduce the complications in considering both the action space and the state space of penetration testing.
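
As a rough sketch of what I mean, here’s one way that restricted action space could be enumerated. The module names are the sort you’d fire at Metasploitable, but the wrapper itself is hypothetical - just a way of showing how framework-level actions start to look like discrete game moves:

# A hypothetical sketch of a Metasploit-level action space. The Action
# dataclass and enumeration scheme are my own invention for illustration.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Action:
    kind: str    # "scan" or "exploit"
    module: str  # the Metasploit module (or scan type) to run
    target: int  # index of the host in our current state table

SCANS = ["port_scan", "service_version_scan"]
EXPLOITS = [
    "exploit/unix/ftp/vsftpd_234_backdoor",
    "exploit/multi/samba/usermap_script",
]

def enumerate_actions(known_hosts):
    """Every (module, target) combination against the hosts we know about."""
    actions = [Action("scan", s, t) for s, t in product(SCANS, range(len(known_hosts)))]
    actions += [Action("exploit", e, t) for e, t in product(EXPLOITS, range(len(known_hosts)))]
    return actions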

Simulation as a path forward

If you’ve read this far, you might be under the impression that I have a pretty negative view of the odds of solving penetration testing with RL. Nothing could be further from the truth! I’m just being honest about the many, potentially very thorny, sub-problems on the way to solving it.

To me, the immediate work to be done is in the simulation space. One has to choose a subset of Metasploit actions directly from its API and map them to actions an agent can take.

There’s still the problem of sample inefficiency - how do you generate enough experience?

The answer has to be simulation. Instead of interacting with a full virtual machine environment, you need a simulated environment that makes it easy for an agent to quickly test a policy. The way the network is composed needs to be, to my mind, similar to a rogue-like game. We want procedurally generated vulnerable networks at just realistic enough fidelity that the policies learned apply to a real network. These could be spun up and down quickly and easily parallelized to achieve the kind of massive experience generation OpenAI achieved with Dota 2.
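
To gesture at what that rogue-like generation might look like, here’s a toy sketch. The host fields, services, and vulnerabilities are placeholders for whatever the simulator actually needs to track:

# A toy procedural network generator. The services, vulnerabilities, and
# host fields are invented placeholders, not a simulator design.
import random

SERVICES = ["ftp", "ssh", "smb", "http", "mysql"]
VULNS = {"ftp": ["vsftpd_backdoor"], "smb": ["usermap_script"], "http": ["weak_admin_creds"]}

def generate_network(num_hosts=5, seed=None):
    rng = random.Random(seed)
    hosts = []
    for i in range(num_hosts):
        services = rng.sample(SERVICES, k=rng.randint(1, 3))
        vulns = [v for s in services for v in VULNS.get(s, []) if rng.random() < 0.5]
        hosts.append({
            "address": f"10.0.0.{i + 1}",
            "services": services,
            "vulns": vulns,
            "value": rng.choice(["low", "low", "high"]),  # most hosts are low value
        })
    return hosts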

The aforementioned Jonathon Schwartz has already developed a simulator that I believe steps in that direction, and extending it would certainly make a good environment for the Metasploit-driven agent I’m picturing.

For now, I need to consider the design of the subset of Metasploit actions that would make an acceptable action space for solving non-trivial vulnerable networks. Achieving an acceptable fidelity for the simulation is also key - but to me, it’s just the minimum viable environment that allows the Metasploit action APIs to be meaningful.

In a future post, I’ll take my first steps using the OpenAI Gym framework to develop a simple environment I can train one of their prewritten models on. Whatever the final shape of the simulator, I believe making sure it fits within the OpenAI Gym framework, popularized by researchers at the forefront of RL, is the best way to get new eyes onto the project. It’s also a good way for me to get some experience with DRL tooling.
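
For a taste of what that will involve, here’s a bare-bones skeleton of the Gym interface such an environment has to implement (using Gym’s API as of this writing). The observation encoding, action count, and empty dynamics are placeholders; the real design work is deciding what they should be.

# A bare-bones OpenAI Gym environment skeleton. The observation encoding,
# action count, and step dynamics are placeholders for the real simulator.
import gym
import numpy as np
from gym import spaces

class PentestSimEnv(gym.Env):
    def __init__(self, num_hosts=5, num_actions=10):
        super().__init__()
        # A few features per potential host, flattened into one vector.
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(num_hosts * 4,), dtype=np.float32)
        # One entry per (module, target) combination we expose to the agent.
        self.action_space = spaces.Discrete(num_actions)
        self.state = np.zeros(self.observation_space.shape, dtype=np.float32)

    def reset(self):
        self.state = np.zeros(self.observation_space.shape, dtype=np.float32)
        return self.state

    def step(self, action):
        # Placeholder dynamics: a real version would run the simulated
        # scan or exploit here and update the known-host table.
        reward, done, info = 0.0, False, {}
        return self.state, reward, done, info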