Here at Cosive, we’ve both used and written a fair number of integrations and transformers for MISP events and data. A classic problem is MISP data processing scripts which end up falling over or taking forever to run because they didn’t necessarily expect as much data as they ended up receiving.
In an all-too-painful-but-familiar example, what if you have your MISP processing script executing every 15 minutes but it actually takes 41 minutes to run each time, resulting in multiple scripts executing simultaneously? Eventually, everything may get confused or just fall over after running out of system resources.
A MISP event is a single bundle of JSON data representing some set of cyber threat intelligence concepts. It can describe an incident, a piece of malware or malware family, a threat actor, or just be a big uncategorised bucket of IoCs like IP addresses and hashes gathered over the course of a day. The one in the below screenshot is a continually updated list of Tor exit nodes.
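To make that concrete, here’s a heavily trimmed sketch of the JSON shape of a MISP event, written out as a Python dict (illustrative values only, not the complete schema):

event = {
    "Event": {
        "info": "Tor exit nodes",   # human-readable summary of the event
        "date": "2024-01-01",
        "Attribute": [
            {"type": "ip-dst", "value": "198.51.100.7", "category": "Network activity"},
            # ...potentially thousands more attributes...
        ],
    }
}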
So, when it comes to MISP events, exactly how big are we talking? Based purely on gut feel, is an average of 50 attributes reasonable? Maybe that’s how I make my events, but there are few limits in MISP on how much data another sharing organisation can cram into a single event. If you’ve got good eyes, you’ll see the Tor exit event above has 2,127 attributes at present.
As well as appreciating the average size, an important thing for our integration needs is knowing the outliers. Sure, our script handles 99% of regularly-sized events at a canter, but what about those mega-events which cause our automated process to choke or even fail completely?
The good news is that, MISP being the sharing platform that it is, we don’t have to guess - we just have to do some gathering and then lean on some nice data processing and visualisation libraries to help us answer these questions.
Always after an excuse to try out new toys, I’ve used the Marimo Python notebook, a Jupyter notebook alternative. I’ll sing the praises of it in full another time, but just know that it’s great at Pandas dataframes and data visualisation using the Altair library.
An important disclaimer: to satisfy our initial curiosity, we haven’t gone out of our way to perform a comprehensive worldwide census of all MISP datasets available globally. We’ve only analysed the feeds which are already available to us and which have data representing things we care about. So, treat these results as an indication rather than anything which is going to tell you the absolute limits of MISP data volumes.
We’ve anonymised the feed names here since we’re interested in the scale of data in aggregate - we’re not trying to call out any particular feed. Some of them have perfectly good reasons for publishing the volume of events that they do, such as the type of data, how events are grouped, or just the length of time they've been operating.
A simple place to start is just to count the events in each feed:
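As a rough sketch, here’s one way you might gather those counts with Pandas, assuming each feed publishes the standard manifest.json that maps event UUIDs to their metadata (the feed URLs below are hypothetical):

import pandas as pd
import requests

# Hypothetical feed URLs - the real feed names are anonymised in this post.
feeds = {
    "Alpha": "https://feeds.example.org/alpha",
    "Hotel": "https://feeds.example.org/hotel",
}

counts = []
for name, base_url in feeds.items():
    # A feed's manifest.json maps event UUIDs to metadata, so the
    # number of keys is the number of events the feed serves.
    manifest = requests.get(f"{base_url}/manifest.json", timeout=30).json()
    counts.append({"feed": name, "events": len(manifest)})

df = pd.DataFrame(counts).sort_values("events", ascending=False)
print(df)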
Our biggest feed here is Hotel, which serves 1,600 events. That isn’t actually too bad, all things considered, when it comes to data processing.
But hold on: not shown on this initial graph is a monster of a feed, which we’ll call Omega. It’s missing from the first graph because the visualisation library started to have issues rendering it, at least without some data massaging. The number of events in Omega was 115,644, which rather changes our graph:
That handsomely beats our second place Hotel feed by a factor of 72 and is ever-growing. It’s big, but perhaps manageable with the right data processing approach. With some numbers in hand, we can begin to work out our likely processing times were we to run some operation on that feed.
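For example, a quick back-of-envelope calculation (the 200 ms per-event cost here is purely an assumed figure - measure your own script):

events = 115_644
seconds_per_event = 0.2   # assumed processing cost per event
hours = events * seconds_per_event / 3600
print(f"{hours:.1f} hours per full pass")   # ~6.4 hours at 200 ms/event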
How far back do these events go?
Hotel and Golf go back a long way - back to 2012 and 2015 respectively. You may or may not want such old data in your MISP instance.
In fact, here’s a good tip for setting up a MISP feed where we don’t ingest everything since the beginning of time. If you go to your feed configs at Sync Actions -> Feeds -> click the “Edit” icon next to your feed of interest -> Filter rules Modify button, you’ll see this screen:
You can similarly do this for remote MISP server sync configuration, but you add this rule to “Pull rules” rather than “Filter rules”.
We can set the bottom filter in our feed pull rule to only ingest feed data after a timestamp, done in UTC epoch seconds:
{"timestamp": "<epoch seconds>"}
You can use a site like https://www.epochconverter.com/ to work out the number of seconds to achieve your desired date cutoff.
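Or a couple of lines of Python will do the same job (the cutoff date below is just an example):

from datetime import datetime, timezone

cutoff = datetime(2023, 1, 1, tzinfo=timezone.utc)   # example cutoff date
print(int(cutoff.timestamp()))   # 1672531200 - use this value in the "timestamp" rule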
Okay, so we know we can easily hit thousands or a hundred thousand events and a single feed can go back for years. Next, if we’re writing a data processing script, how big can we expect a single event to be?
If we’re looking for outliers which may cause us issues, this sounds like a job for the old box and whisker plot (aka box plot, but we’re certainly not going to walk past the chance to say “whiskers”). Box and whisker plots help us understand where the bulk of the data lies - mostly within the box, with the whiskers showing the tails to either side. Outliers beyond those tails are shown as circles.
Since we’re particularly interested in outliers with lots of data, we’re looking for the top-most circle in each feed’s data distribution to find the highest value. Note we also use a log scale here to avoid the classic data visualisation problem of “Help, all my data is squashed at the bottom of the graph”.
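For reference, here’s roughly how a chart like this can be drawn with Altair. The dataframe below is a tiny synthetic stand-in purely to illustrate the chart code; in the real notebook there’s one row per event with its feed name and JSON size:

import altair as alt
import pandas as pd

# Synthetic stand-in data for illustration only.
df = pd.DataFrame({
    "feed": ["Alpha", "Alpha", "Bravo", "Bravo", "Hotel"],
    "size_kb": [12.0, 180.0, 2700.0, 45_000.0, 418_000.0],
})

alt.Chart(df).mark_boxplot(extent=1.5).encode(   # whiskers at 1.5 * IQR, outliers drawn as points
    x=alt.X("feed:N", title="Feed"),
    y=alt.Y("size_kb:Q", title="Event size (KB)", scale=alt.Scale(type="log")),
)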
We see that for most of our feeds, the median event size is somewhere between 0 and 200 KB, though Bravo’s is up around 2.7 MB.
What we’re really interested in are the biggest files our processing script might encounter, since those are the ones that may cause bottlenecks if not handled well. Bravo and Foxtrot each have events in the 40-50 MB range, while Hotel’s biggest event clocks in at a whopping 418 MB! This may have a bearing on memory and database usage.
One other thing we’ve observed is problems when you have network services or data pipeline components inline with your MISP's connectivity. Would a service upstream from MISP be able to handle a 400 MB file, or would it silently discard it?
As we can see, such a big file isn’t typical, but we want to avoid the classic problem of a processing script which has run out of memory or disk storage and quietly fallen over for months on end without anyone noticing. Monitoring can help here too, but it’d be best if we design our MISP data handling script to handle these kinds of sizes gracefully.
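If events of that size are on the cards, one option is to stream-parse the event JSON rather than reading the whole file into memory at once. A minimal sketch using the third-party ijson library (the file path is hypothetical):

import ijson

# Iterate over attributes one at a time instead of json.load()-ing a 400 MB file.
count = 0
with open("huge_event.json", "rb") as f:   # hypothetical file path
    for attribute in ijson.items(f, "Event.Attribute.item"):
        count += 1                         # do your per-attribute work here instead
print(count)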
So that’s event file sizes, but how many attributes (e.g. IP addresses, hashes, domain names) per MISP event are we talking about? Once again, note it’s a log scale graph to visualise the big range of values in a useful way.
As we see here, all feeds have events with at least 2,500 attributes, and the biggest event on the graph comes from Hotel with a healthy 169,329 attributes! That might be many more attributes than a processing script counted on, and may once again have an impact on storage and memory for our script, as well as processing time if we happen to be doing enrichment or lookups on every single attribute.
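Again as a rough, assumption-laden illustration, enriching every attribute of an event that size adds up fast (the 50 ms per lookup below is an assumed figure):

attributes = 169_329
seconds_per_lookup = 0.05   # assumed cost of one enrichment/lookup call
hours = attributes * seconds_per_lookup / 3600
print(f"{hours:.1f} hours for a single event")   # ~2.4 hours at 50 ms/lookup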
So there we have it - some quantification of what we’re up against when processing MISP data. Naturally your mileage may vary depending on where your feeds come from, but we’ve established that, worst case, a single feed can contain well over 100,000 events and stretch back more than a decade, a single event can weigh in at over 400 MB, and a single event can pack in more than 169,000 attributes.
Cosive does a lot in the MISP and CTI space, including:
Get in touch with us for more information!