DocIntel & MISP - Threat Intelligence Without Boiling the Ocean

February 21, 2024

I recently presented on this topic at AISA CyberCon in Canberra, which ran in March 2023. The purpose of my talk was to share some practical ideas for handling large amounts of open source intelligence, and for extracting and storing the relevant information.

I decided to create a series of blog posts and short videos to follow up on this talk. This topic will be split into three parts:

Part 1 - Threat intelligence & DocIntel
Part 2 - Use cases for DocIntel
Part 3 - Integrating DocIntel into MISP

Part 1 - Threat intelligence & DocIntel

Introduction

“When looking for a solution to a problem, don't take on too much all at once or make the scope so big that it becomes difficult or even impossible to find an answer.” (source)

Whilst no one will ever tell you that threat intelligence is easy, setting out on the path to use it in smart ways doesn't need to be overwhelming. The volume of open source intelligence feeds, documents, blog posts and information shared in the community can quickly become unmanageable, but there are some ways we can tackle this without jumping in the deep end, so to speak.

Problem #1

There are so many different forms of "threat intelligence" that it can be overwhelming to know where to start: paid or open source, which tools are needed, and which skills and people. Even if you are looking purely at open source options, there is a plethora of threat feeds, blogs, Twitter feeds and hashtags that can easily overwhelm, and much of it is a beast you don't need to tame.

I often talk to people who have paid for multiple threat feeds and have a Threat Intelligence Platform (TIP), and they still struggle to distil the information and gain the insights that they need. Combining shared threat information from different sources and industries makes the relevant intelligence hard to find and makes it difficult to generate value.

We need to begin by defining threat intelligence and what it means to your specific organisation or use case.

What is threat intelligence?

If you're new to threat intelligence I recommend checking out these two articles we’ve previously published on the subject, Cyber Threat Intelligence (CTI) Crash Course and Establishing a Threat Intel Program.

WHERE ARE YOU GOING TO GET THE THREAT INTELLIGENCE FROM?

This is a much harder question to answer. In the context of these articles, I am going to assume that the information and intelligence data will come from blog posts, PDFs and reports, both internal and external, and perhaps both paid and open source. You will need to research where your data comes from to ensure that you are getting information relevant to your requirements and needs.

HOW DO YOU DECIDE WHAT YOU NEED AND WHERE TO GET IT?

You must know your threat intelligence requirements and we cover this in more depth in our post called Establishing a Threat Intel Program. Some of the things that you need to think about when planning out your threat intelligence journey are:

What questions is your organisation trying to answer with threat intelligence?
Are your requirements clear enough to prevent your organisation from aimlessly collecting data?
Good starting questions include: Which threat groups are most likely to target our organisation? What tactics, techniques and procedures (TTPs) are they most likely to use against us?

Problem #2

Threat intelligence all comes in the same format, right?

Wrong.

In an ideal world every vendor would develop their tools to ingest and export event data in the same format, but unfortunately we do not live in an ideal world and have to deal with threat intelligence in many different formats.

Let's consider JSON as an example. Both MISP and STIX have their own JSON formats, and these do not overlap or map directly onto each other. Blog posts might have tables with limited context and no relation to a TTP. A CSV can have multiple columns, which may differ for each vendor or team. PDFs need to be scraped, and this can’t always be done successfully. Add in chat rooms, Telegram, social media and Twitter feeds, credential information … the list of ways that this data is presented to us is almost endless.
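To make the mismatch concrete, here is a minimal Python sketch that expresses one IP address first as a MISP-style attribute and then as a STIX 2.1-style indicator. The field values, the documentation-range IP address, the tiny mapping table and the helper names are all my own illustrative assumptions, not an official conversion. Notice that the "destination" meaning carried by the MISP type `ip-dst` is lost in the simple STIX pattern, which is exactly the kind of gap you run into when moving between formats.

```python
import uuid
from datetime import datetime, timezone

# A MISP-style attribute: a flat type/value pair with a category.
misp_attribute = {
    "type": "ip-dst",
    "category": "Network activity",
    "value": "198.51.100.7",  # documentation-range address, not a real IOC
    "to_ids": True,
}

# Hypothetical, deliberately tiny mapping from MISP attribute types to
# STIX object paths. Real mappings are larger and not always one-to-one.
MISP_TO_STIX_PATH = {
    "ip-dst": "ipv4-addr:value",
    "domain": "domain-name:value",
    "sha256": "file:hashes.'SHA-256'",
}


def misp_attribute_to_stix_pattern(attribute: dict) -> str:
    """Build a STIX pattern string for a supported MISP attribute type."""
    path = MISP_TO_STIX_PATH[attribute["type"]]  # KeyError if unmapped
    return f"[{path} = '{attribute['value']}']"


def to_stix_indicator(attribute: dict) -> dict:
    """Wrap the pattern in the fields a STIX 2.1 indicator carries."""
    now = (
        datetime.now(timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z")
    )
    return {
        "type": "indicator",
        "spec_version": "2.1",
        "id": f"indicator--{uuid.uuid4()}",
        "created": now,
        "modified": now,
        "pattern": misp_attribute_to_stix_pattern(attribute),
        "pattern_type": "stix",
        "valid_from": now,
    }


indicator = to_stix_indicator(misp_attribute)
print(indicator["pattern"])
```

Even for this single value, the two structures share almost no field names, which is why a tool that normalises sources into one place is so useful.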

DocIntel

I was drawn to DocIntel after hearing about it while watching the replays from the 2022 CTI Summit held in Luxembourg. A newer version of this presentation was recently delivered by Antoine Cailliau at the SANS Cyber Threat Intelligence Summit 2023 (you can watch that here).

I was interested immediately, as DocIntel gave me a way to build a repository for the threat feeds, blogs and PDFs that I have collected. It also allows me to pull out the IOCs and search on them, or feed them into another TIP, such as MISP.

In part 2 I’ll cover a little bit more about the potential use cases that you could consider setting DocIntel up for.

WHAT PROBLEM DOES DOCINTEL SOLVE?

I've suggested two problems in this article and those are the problems I'll be looking to solve with this technical setup. DocIntel is helping me solve the problem of having multiple different sources of threat intelligence and pulling them all into one place.

DocIntel processes the sources (documents and RSS feeds of blog posts) and parses the information for me quietly in the background ready for review. This also means that I have my data in one place while also maintaining the context of these IOCs, for example, the name of the threat report, the date of publication, and the confidence level we have for that information.

*Screenshot from DocIntel showing extracted observables from a threat intel report.*
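As a rough illustration of why keeping that context matters, here is a small Python sketch of an observable stored together with the report it came from. This is not DocIntel's data model; the `Observable` class, its field names and the sample values are assumptions I've made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observable:
    """One extracted value plus the context of the report it came from."""
    value: str
    kind: str          # e.g. "ipv4", "sha256", "domain"
    report_title: str
    published: date
    confidence: str    # free-text rating; the scale is an assumption


def index_by_value(observables):
    """Group observables by value so every sighting keeps its context."""
    index = defaultdict(list)
    for obs in observables:
        index[obs.value].append(obs)
    return dict(index)


# Placeholder data: documentation-range IP and example domains only.
sample = [
    Observable("198.51.100.7", "ipv4", "Example report A", date(2023, 3, 1), "high"),
    Observable("198.51.100.7", "ipv4", "Example report B", date(2023, 5, 9), "medium"),
    Observable("example.org", "domain", "Example report B", date(2023, 5, 9), "medium"),
]

index = index_by_value(sample)
for obs in index["198.51.100.7"]:
    print(obs.report_title, obs.published.isoformat(), obs.confidence)
```

With the context attached, a search for one IP address answers not just "have we seen this?" but also "where, when, and how much do we trust it?".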

INSTALLING DOCINTEL

Please note that all the steps and configuration I am running for this tutorial are designed for demonstration purposes only. If you intend to run this in production ensure that you choose suitable passwords and configurations for your production environment.

DocIntel runs on Docker. For a simple test environment, I created an Ubuntu server virtual machine and installed Docker. The DocIntel installation is incredibly easy and the steps and script are provided on GitHub.

*Download the installation script and run.*

Simply grab the installation script and run.

*The installation script will do the rest for you with some input.*

You’ll be asked where to store the data and to set up passwords. For a test environment I leave everything as default, but if you are using this for any real data ensure that you change the passwords to something secure.

The script will then pull all the required Docker images for you without any further interaction. Once that has finished, you can start the docker containers by running the command:

sudo docker compose -f docker-compose.yml -p docintel-dev up -d

*You should see all the docker containers come up as “Started”*

I then use the command ‘sudo docker container list -a’ to check that the containers are all up. Usually a few fail on the first go, so I rerun the ‘docker compose’ command from above and make sure they are all marked as ‘Up’.

*Check that all containers are up before proceeding.*

The screenshot above shows a few containers that failed to start the first time. Rerunning docker compose should start them all.
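If you would rather not eyeball the list, the check can be scripted. The Python sketch below is one way to do it, assuming your user can run `docker` without `sudo`: it asks `docker ps -a` for each container's name and status, then reports any whose status does not start with "Up". The sample container names other than `docintel-dev-webapp` are made up so the parsing can be tried without Docker.

```python
import subprocess

# Go template understood by `docker ps --format`: name, a tab, then status.
PS_FORMAT = "{{.Names}}\t{{.Status}}"


def docker_ps_output() -> str:
    """Return name/status lines for all containers (requires Docker)."""
    result = subprocess.run(
        ["docker", "ps", "-a", "--format", PS_FORMAT],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def containers_not_up(ps_output: str) -> list[str]:
    """Return names of containers whose status does not start with 'Up'."""
    not_up = []
    for line in ps_output.splitlines():
        if not line.strip():
            continue
        name, _, status = line.partition("\t")
        if not status.startswith("Up"):
            not_up.append(name)
    return not_up


# Sample output with made-up container names, so the parser can be
# tried without Docker installed.
sample = (
    "docintel-dev-webapp\tUp 2 minutes\n"
    "example-service-a\tExited (1) 2 minutes ago\n"
    "example-service-b\tCreated\n"
)
print(containers_not_up(sample))
```

On the DocIntel host you would call `containers_not_up(docker_ps_output())` instead; if the list is not empty, rerun the docker compose command.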

CREATE THE ADMIN USER

Once you have DocIntel up and running, the first thing you will need to do is create your first administrative account. This is done via the command line, and the instructions are provided on GitHub:

docker exec -it docintel-dev-webapp \
  dotnet /cli/DocIntel.AdminConsole.dll \
  user add --username admin

docker exec -it docintel-dev-webapp \
  dotnet /cli/DocIntel.AdminConsole.dll \
  user role --username admin --role administrator

Running these commands should give you the output as shown in the screenshot below.

*Set up the admin account in the administrator group*

Once you have your admin account set up, you are good to log in via the web GUI on http://localhost:5005/.

CREATE YOUR USER ACCOUNT

Once you have logged in with your newly created admin account, the first thing you will want to do is create yourself an everyday account and give it the administrator role. To do that, use the left side menu and choose Users. Create the user and save.

*Managing users from the web interface.*

Then go back to the left hand menu and select Roles. Click on Administrator and you should see your newly created user account, which you can then add to the group with Add user.

*Give your user account the administrator role.*

Log out of DocIntel and log back in with your user account.

Configuring DocIntel for sources

There are a few things that I suggest thinking about and setting up before adding your sources. Think about which sources you want to add and why. What information are you looking to scrape from them? We’ll take a look at the different configuration items and then I'll walk through an example source and how I’d go about setting it up before actually adding the source itself.

FACETS & TAGS

Before I add any sources I’m going to set up my facets and tags. This will take some forethought as I suggested above.

Facets are essentially buckets for your tags. You can use facets to set up a bucket for different threat actor naming conventions, for example. If you want to get more specific tags later, you can add those to facets as well. You can either set up the facets and tags manually, or you can configure these to be automatically extracted during the pre-processing stage.

AUTOMATED EXTRACTION

There are two ways for the automated extraction of tags:

Using keywords in tags. If auto extract is selected, DocIntel will look for an exact match for the label and add the tag.
Using a regular expression. You can add a regular expression to a facet so that matching tags are extracted automatically. There are some provided in the documentation and I’ll be using those here. They are a good start, depending on what sources you are adding and what information is interesting and relevant to you.

Facet Name	Description	Regular Expression
vulnerabilities	Matches CVE numbers	CVE-\d{4}-\d{4,7}
actor.mandiant	Matches Mandiant group names	(APT|FIN|UNC)\d{1,4}
tlp	Matches common TLP notations	TLP[\s:_\-](red|amber|white|clear|green|amber+strict)
attack.groups	Mitre ATT&CK threat actor group IDs	G\d{4}
attack.techniques	Mitre ATT&CK techniques	(T[0-9]{4}(\.[0-9]{3})?)
malware	A facet to add common malware names as tags
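If you want to check what these expressions will pick up before putting them into DocIntel, you can try them against some sample text. The Python sketch below is adapted from the table, with a few assumptions of my own: it uses Python's `re` module (DocIntel's own regex engine may behave differently), the sample sentence and CVE number are placeholders, and the TLP pattern is adjusted. An unescaped `+` is a quantifier in a regular expression, so `amber+strict` as written would not match the literal text "amber+strict"; here the `+` is escaped, the longer alternative is tried before plain `amber`, and matching ignores case.

```python
import re

# Patterns adapted from the facet table. The TLP pattern is adjusted:
# the "+" in "amber+strict" is escaped and tried before plain "amber",
# and matching is case-insensitive. Those adjustments are assumptions.
FACET_PATTERNS = {
    "vulnerabilities": re.compile(r"CVE-\d{4}-\d{4,7}"),
    "actor.mandiant": re.compile(r"(?:APT|FIN|UNC)\d{1,4}"),
    "tlp": re.compile(
        r"TLP[\s:_\-](?:amber\+strict|red|amber|white|clear|green)",
        re.IGNORECASE,
    ),
    "attack.groups": re.compile(r"G\d{4}"),
    "attack.techniques": re.compile(r"T[0-9]{4}(?:\.[0-9]{3})?"),
}


def extract_tags(text: str) -> dict[str, list[str]]:
    """Return sorted, upper-cased, de-duplicated matches per facet."""
    tags = {}
    for facet, pattern in FACET_PATTERNS.items():
        matches = {m.group(0).upper() for m in pattern.finditer(text)}
        if matches:
            tags[facet] = sorted(matches)
    return tags


# Placeholder sentence; the CVE number is not meant to be a real one.
sample = (
    "TLP:CLEAR. APT29 (G0016) exploited CVE-2023-12345 and used "
    "T1059.001 alongside T1566."
)
for facet, values in extract_tags(sample).items():
    print(facet, values)
```

One thing this makes obvious is that short patterns such as `G\d{4}` have no word boundaries, so they can also match inside longer strings. It is worth testing against a real report or two from your chosen source before relying on them.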

CREATE A NEW FACET WITH AUTOMATED EXTRACTION

On the left hand menu select Tags and then Create facet.

1. Give the facet a title.
2. Give the facet a prefix. This needs to be unique and is shown in the document listing.

*See the facet names before the extracted tags.*

3. Add a description as you see fit.
4. Select “Automated Extraction”, make sure it turns green, and add your regex under Extraction Regex.
5. I generally select the label Normalisation as all capitals just for consistency.
6. Then click Add and your facet is created. Now when you add a source or document, tags will be created based on the regex that you have provided and be listed under this facet.

*Set up a facet with automated extraction to tag on CVE information.*

*View of the facets page with the created tags.*

TAGGING MALWARE NAMES

Because there are so many different malware families, and tagging on all caps isn't reliable (especially for government organisations that frequently use all caps in documents), for now I am creating a facet called malware and adding tags under it for common malware types or malware that is of particular interest to me. You can see in the screenshot above that Cobalt Strike and Emotet have been tagged in documents.
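To show the idea behind keyword tagging, here is a short Python sketch that looks for a hand-picked list of malware names in a piece of text. It is only an illustration of the approach, not how DocIntel implements it; in particular, matching whole phrases while ignoring case is my own assumption, and the function name and sample sentences are made up.

```python
import re

# Hand-picked malware names acting as tags under a "malware" facet.
MALWARE_TAGS = ["Cobalt Strike", "Emotet"]


def find_malware_tags(text: str, tags=MALWARE_TAGS) -> list[str]:
    """Return the tags whose label appears in the text as a whole phrase."""
    found = []
    for label in tags:
        # re.escape keeps any punctuation in a label literal; the
        # lookarounds stop "Emotet" matching inside a longer word.
        pattern = rf"(?<!\w){re.escape(label)}(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(label)
    return found


print(find_malware_tags(
    "The actor deployed COBALT STRIKE beacons after an Emotet infection."
))
print(find_malware_tags("No known families were named."))
```

A fixed list like this only finds what you already know to look for, which is why I keep it to families that are common or of particular interest.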

Note: If you are wondering, and you will wonder this, the colour selection at present is limited and also in some cases wrong. There is a current enhancement issue on GitHub to get this fixed.

Classifications

DocIntel comes configured with only one default classification of “Unclassified”. How you want to set up your classification scheme is obviously dependent on your organisation and how you intend to use DocIntel. A simple classification system could include: Restricted, Confidential, Internal, Public, as an example.

To configure additional classifications (which you may want to do before you configure your importers and scrapers), once again go to the left hand menu, go to Administration and Classifications. Here you can add the ones that you like and save.

*Add in your organisation’s classification titles.*

Importers

RSS SOURCE IMPORTER

Importers allow you to automate the submission of URLs to the scrapers. For example, they allow you to automatically import data from RSS feeds and send the URLs to be scraped for observables and tags.

I’ll be installing the RSS Source Importer as it is the only one that I am currently using.

To do this, go to the left hand menu, go to Administration and select Importers. Click on Install and select RSS Source Importer from the dropdown menu, and ensure you choose enabled under status. Collection delay is how often the importer will go and check the feed. By default it's set to 30 seconds, but in reality you likely only need it to check once or twice a day.

You can leave everything else as default. However, if you want to override the default classification for what is imported by the RSS Source Importer, perhaps to always mark it as public, you can configure that here.

You can also configure DocIntel so that only particular groups can access the documents imported via this method. I can’t imagine that I would need this with the RSS importer, but with others this might be relevant. For example, if you are scraping a mailbox you might want to set the classification as confidential and for a specific analyst group's eyes only.

That's it, no further configuration needed.
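To give a feel for what an RSS importer is doing on your behalf, here is a rough Python sketch that pulls the title, link and publication date out of each item in a feed. It is not DocIntel's implementation. To keep it runnable offline, it parses a small made-up RSS document rather than fetching a real feed, and the function name `feed_items` is hypothetical.

```python
import xml.etree.ElementTree as ET

# A tiny RSS 2.0 document standing in for a real feed; titles and URLs
# are placeholders.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Security Blog</title>
    <item>
      <title>Analysis of a phishing campaign</title>
      <link>https://blog.example.com/phishing-campaign</link>
      <pubDate>Mon, 06 Mar 2023 09:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Quarterly product update</title>
      <link>https://blog.example.com/product-update</link>
      <pubDate>Tue, 07 Mar 2023 09:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>
"""


def feed_items(rss_text: str) -> list[dict]:
    """Return title, link and publication date for each <item> in a feed."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return items


parsed_items = feed_items(SAMPLE_RSS)
for entry in parsed_items:
    print(entry["published"], entry["link"])
```

Each link found this way is what gets handed on to a scraper, which is covered next.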

Scrapers

Scrapers are automated programs that collect information from the web, including from private websites or APIs. Once the importer has gone off and fetched our blog post from the RSS feed, or we have uploaded a PDF document, the scraper extracts the data.

PDF SCRAPER

To scrape PDF documents you need to configure the PDF scraper.

READABILITY

To scrape web pages, such as blog posts, you need to configure the readability scraper.

Adding the scrapers is very similar to adding an importer so I won't cover this again but you can watch the video above to see the process.

*Readability and PDF Scraper installed.*

Example source preparation

I want to look at adding the RSS feed CISA cybersecurity alerts & advisories as it's a feed that I find useful. It contains IOCs as well as TTPs and vulnerabilities and is a reliable source of threat intelligence. These are the things to consider when adding a source, so that you have the relevant facets and tags set up beforehand:

Do you have the facets you need for automated extraction already set up?
How can you identify the threat actor names? For example, do you need to add an additional facet for a specific vendor such as CrowdStrike?
What is the reliability of the source in context of the information that you are getting and how relevant it is to you?
Are there any particular keywords that you might want to filter your incoming documents with? For example, with CISA maybe you are only interested in the Analysis Reports and want to filter for only those to be imported.

I suggest doing this before adding the source because, once the source is added and the documents are ingested, there is currently no way to retrospectively add tags to documents.
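The keyword filtering mentioned in the last question can be pictured with a few lines of Python. This sketch keeps only feed items whose title mentions a keyword, ignoring case. The item titles, links and function names are placeholders I've invented; it shows the idea rather than DocIntel's own filtering logic.

```python
def matches_keywords(title: str, keywords: list[str]) -> bool:
    """True if any keyword appears in the title, ignoring case."""
    lowered = title.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def filter_items(items: list[dict], keywords: list[str]) -> list[dict]:
    """Keep only feed items whose title mentions a keyword."""
    return [item for item in items if matches_keywords(item["title"], keywords)]


# Placeholder titles loosely modelled on advisory feeds.
items = [
    {"title": "Analysis Report: Example Malware Loader", "link": "https://example.com/ar-1"},
    {"title": "Vendor Releases Security Updates", "link": "https://example.com/update-1"},
    {"title": "Malware analysis report for Example RAT", "link": "https://example.com/ar-2"},
]

kept = filter_items(items, ["analysis report"])
for item in kept:
    print(item["title"])
```

Trying your keywords against a handful of recent titles from the feed is a quick way to see whether the filter is too broad or too narrow.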

ADDING SOURCES FOR AUTOMATED IMPORT

Which sources you add is up to you and depends on what you have decided will provide the most relevant information. If you are a bit like me, though, and are just playing around, I suggest adding the “Kaspersky Securelist” RSS feed, as I've tested it and it works well and consistently. If you come up against any issues, you might want to check GitHub. In my testing I've had issues with the Mandiant and CISA feeds, and this is a known issue.

*Importer errors on Mandiant and CISA RSS feeds.*

For testing I’ll add the Kaspersky RSS Feed.

Title will be simply Kaspersky Securelist.
You can add a description as to why you added this particular source and what you are looking to extract.
Add any keywords that you want extracted as tags, for example the specific malware types this source writes about.
Choose your reliability.
Add any external links if you like, or not.
Under syndication is where you add the RSS URL - https://securelist.com/feed/.
Check Scrape RSS feed and Save.

Now, depending on how often you set your importer to pull, you should see the documents start to flow in within a few minutes (if set to 30 seconds). Once you start reviewing and registering the articles you will see those come up in the source view.

ADDING A DOCUMENT OR URL MANUALLY

There are two ways that you can add documents or blog posts to DocIntel manually, either by uploading the PDF directly (has to be a PDF) or by submitting URL(s).

How to do either of these should be obvious.

Documents workflow

Once the document has uploaded, or the URL has been parsed, or the importer has grabbed the RSS feed, the incoming document will live under Documents > View pending. This is where you do your reviews and analysis of the documents and decide whether they will be registered or discarded.
View All under Documents is where you will see all your approved and registered documents.

REGISTER A NEW DOCUMENT

When you want to register the pending document, click on the title and you will see a new screen. There is a bit here to consider:

The title should automatically be made pretty, but you can always change the title as you need.
The analyst summary is auto extracted from the feed or document, but you might want to edit this.
Document Date: Ensure this is correct, as it will default to the day it was ingested.
Tags: the automated tags will show here but you can add others as you like.
Classification can be adjusted here too, and the source and source URL can be added.

REVIEW OBSERVABLES

The big green button is where you can see what observables have been scraped from a document. These include IP addresses, URLs and hash sums. Review these and ensure you aren't adding URLs such as google.com that may have been scraped from the article.
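If you end up doing this review often, the same check can be sketched in Python: compare each scraped value against a short list of domains you consider benign and flag the matches. The allowlist, the sample observables and the function names below are my own assumptions for illustration, and DocIntel does not do this for you.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of benign domains that often appear in articles.
BENIGN_DOMAINS = {"google.com", "twitter.com", "github.com"}


def host_of(observable: str) -> str:
    """Return the lower-cased hostname of a URL or bare domain."""
    candidate = observable if "://" in observable else f"//{observable}"
    return (urlsplit(candidate).hostname or "").lower()


def is_benign(observable: str, benign=BENIGN_DOMAINS) -> bool:
    """True if the host is an allowlisted domain or one of its subdomains."""
    host = host_of(observable)
    return any(host == d or host.endswith("." + d) for d in benign)


# Placeholder values; the IP is from a documentation range.
scraped = [
    "https://www.google.com/search?q=example",
    "google.com",
    "http://malicious.example.net/payload.bin",
    "198.51.100.7",
    "notgoogle.com",
]
to_review = [obs for obs in scraped if not is_benign(obs)]
print(to_review)
```

Legitimate platforms can host malicious content too, so a list like this is better used to flag values for a second look than to discard them silently.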

I have submitted an enhancement to be able to identify the observable easily within the document when selected so you can see the context around it. The other enhancement to this page I've asked for is the ability to add observables from this page as you are reviewing.

Once you have reviewed the observables here you can register the document. If you want to add additional observables you can then open the document and edit and add them manually.

EXPORTING OBSERVABLES

Once you have registered the document you can export the observables by scrolling down to the bottom of the document page and selecting PDF, Excel, CSV or copy.
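Once you have a CSV export, a few lines of Python are enough to tidy it up before using it elsewhere. The sketch below groups observables by type and removes duplicates. I'm assuming the export has columns named `type` and `value`, which may not match what DocIntel actually produces, so check the header row of a real export and adjust the column names. The sample data is made up.

```python
import csv
import io
from collections import defaultdict

# Stand-in for an exported file. The column names "type" and "value"
# are assumptions; the hash is just a placeholder 64-character hex string.
SAMPLE_EXPORT = """type,value
ipv4,198.51.100.7
domain,malicious.example.net
ipv4,198.51.100.7
sha256,9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
"""


def group_observables(csv_file, type_column="type", value_column="value"):
    """Return {type: sorted unique values} from an observables CSV."""
    grouped = defaultdict(set)
    for row in csv.DictReader(csv_file):
        value = (row.get(value_column) or "").strip()
        if value:
            kind = (row.get(type_column) or "unknown").strip()
            grouped[kind].add(value)
    return {kind: sorted(values) for kind, values in grouped.items()}


grouped = group_observables(io.StringIO(SAMPLE_EXPORT))
for kind, values in grouped.items():
    print(kind, values)
```

For a real file you would pass an open file handle instead, for example `open("observables.csv", newline="")`, where the filename is whatever you saved the export as.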

Resources, contributing and support

At this point DocIntel is released as an open source platform and is maintained primarily by one person who also has a day job. You can support the project in many ways: testing and raising issues, helping to fix code, writing documentation, or joining the community and helping others get themselves set up.

Website: https://docintel.org
Docker images: https://hub.docker.com/u/docintelapp
GitHub: https://github.com/docintelapp/DocIntel
Slack: https://docintelapp.slack.com/
Documentation: https://docs.docintel.org/
Documentation Repository: https://github.com/docintelapp/docs

Part 2

Stay tuned for part 2 where I’ll dive into ideas and use cases for DocIntel to showcase its potential. Finally, in part 3 I’ll show you how to integrate DocIntel with MISP.
