I recently presented on this topic at the AISA CyberCon in Canberra, which ran in March 2023. The purpose of my talk was to provide some practical ideas for handling large amounts of open source intelligence, and for extracting and storing the relevant information.
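I decided to create a series of blog posts and short videos to follow up on this talk. This topic will be split into three parts: this post, which introduces the problem and walks through setting up DocIntel; part 2, which covers ideas and use cases for DocIntel; and part 3, which shows how to integrate DocIntel with MISP.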
“When looking for a solution to a problem, don't take on too much all at once or make the scope so big that it becomes difficult or even impossible to find an answer.” (source)
Whilst no one will ever tell you that threat intelligence is easy, setting out on the path to use it in smart ways doesn't need to be overwhelming. The number of open source intelligence feeds, documents, blog posts and other information shared in the community can quickly become unmanageable, but there are ways we can tackle this without jumping in at the deep end, so to speak.
There are so many different forms of "threat intelligence" that it can be hard to know where to start: paid or open source, which tools are needed, and which skills and people. Even if you are looking purely at open source options, there is a plethora of threat feeds, blogs, Twitter feeds and hashtags that can easily overwhelm, and trying to tame all of it is largely unnecessary.
I often talk to people who have paid for multiple threat feeds and have a Threat Intelligence Platform (TIP), and they still struggle to distil the information and gain the insights that they need. Combining shared threat information from different sources and industries makes the relevant intelligence hard to find and makes it difficult to generate value.
We need to begin by defining threat intelligence and what it means to your specific organisation or use case.
If you're new to threat intelligence I recommend checking out these two articles we’ve previously published on the subject, Cyber Threat Intelligence (CTI) Crash Course and Establishing a Threat Intel Program.
This is a much harder question to answer. In the context of these articles, I am going to assume that the information and intelligence data will come from blog posts, PDFs and reports, both internal and external, and maybe both paid and open source. You will need to research where your data comes from to ensure that you are getting information that is relevant to your requirements and needs.
You must know your threat intelligence requirements and we cover this in more depth in our post called Establishing a Threat Intel Program. Some of the things that you need to think about when planning out your threat intelligence journey are:
Threat intelligence all comes in the same format, right?
Wrong.
In an ideal world every vendor would develop their tools to ingest and export event data in the same format, but unfortunately we do not live in an ideal world and have to deal with threat intelligence in many different formats.
Let's consider JSON as an example. Both MISP and STIX have their own JSON formats and taxonomies, with no direct overlap or mapping between them. Blog posts might have tables with limited context and no relation to a TTP. A CSV can have multiple columns, which may differ for each vendor or team. PDFs need to be scraped, and this can’t always be done successfully. Add in chat rooms, Telegram, social media and Twitter feeds, credential information … the list of ways that this data is presented to us is almost endless.
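To make the format problem a little more concrete, here is a minimal Python sketch of mapping two differently shaped CSV exports onto one common record. The vendor names, column names and sample rows are all made up for illustration; real feeds will need their own mappings.

```python
import csv
import io

# Hypothetical column mappings: each vendor names the same fields differently.
COLUMN_MAPS = {
    "vendor_a": {"value": "indicator", "type": "indicator_type", "seen": "first_seen"},
    "vendor_b": {"value": "ioc", "type": "kind", "seen": "date"},
}


def normalise(vendor: str, csv_text: str) -> list[dict]:
    """Convert one vendor's CSV export into a list of common records."""
    mapping = COLUMN_MAPS[vendor]
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        records.append({
            "value": row[mapping["value"]].strip(),
            "type": row[mapping["type"]].strip().lower(),
            "first_seen": row[mapping["seen"]].strip(),
            "source": vendor,
        })
    return records


# Made-up sample data (documentation IP range and an example.com domain).
VENDOR_A = "indicator,indicator_type,first_seen\n203.0.113.7,IP,2023-03-01\n"
VENDOR_B = "date,kind,ioc\n2023-03-02,domain,bad.example.com\n"

combined = normalise("vendor_a", VENDOR_A) + normalise("vendor_b", VENDOR_B)
for record in combined:
    print(record)
```

Every new source needs its own mapping entry, which is exactly the kind of overhead that grows quickly as you add feeds.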
I was drawn to DocIntel after hearing about it while watching the replays from the 2022 CTI Summit held in Luxembourg. A newer version of this presentation was recently delivered by Antoine Cailliau at the SANS Cyber Threat Intelligence Summit 2023 (you can watch that here).
I was interested immediately, as DocIntel gave me a way to build a repository for the threat feeds, blogs and PDFs that I have collected. It allows me to pull out the IOCs and search on them, or feed them into another TIP, such as MISP.
In part 2 I’ll cover a little bit more about the potential use cases that you could consider setting DocIntel up for.
I've suggested two problems in this article and those are the problems I'll be looking to solve with this technical setup. DocIntel is helping me solve the problem of having multiple different sources of threat intelligence and pulling them all into one place.
DocIntel processes the sources (documents and RSS feeds of blog posts) and parses the information for me quietly in the background ready for review. This also means that I have my data in one place while also maintaining the context of these IOCs, for example, the name of the threat report, the date of publication, and the confidence level we have for that information.
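As a rough illustration of why keeping that context matters, the sketch below stores each observable together with the report it came from. This is not DocIntel's data model; the class, field names, confidence scale and sample values are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ObservableRecord:
    """One sighting of an observable, with the context it was found in."""
    value: str
    kind: str
    report_title: str
    published: date
    confidence: str  # assumed scale: "low", "medium" or "high"


def context_for(value: str, records: list[ObservableRecord]) -> list[ObservableRecord]:
    """Return every sighting of an observable, oldest report first."""
    matches = [r for r in records if r.value == value]
    return sorted(matches, key=lambda r: r.published)


# Made-up records; the IP addresses come from documentation ranges.
records = [
    ObservableRecord("203.0.113.7", "ipv4", "Example Loader Report", date(2023, 3, 1), "high"),
    ObservableRecord("198.51.100.9", "ipv4", "Example Loader Report", date(2023, 3, 1), "medium"),
    ObservableRecord("203.0.113.7", "ipv4", "Vendor Blog Post", date(2023, 2, 14), "low"),
]

for sighting in context_for("203.0.113.7", records):
    print(sighting.published, sighting.report_title, sighting.confidence)
```

With the context attached, a single IP address can be traced back to every report that mentioned it, rather than sitting in a flat list with no history.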
Please note that all the steps and configuration I am running for this tutorial are designed for demonstration purposes only. If you intend to run this in production ensure that you choose suitable passwords and configurations for your production environment.
DocIntel runs on Docker. For a simple test environment, I created an Ubuntu server virtual machine and installed Docker. The DocIntel installation is incredibly easy and the steps and script are provided on GitHub.
Simply grab the installation script and run it.
You’ll be asked where to store the data and to set up passwords. For a test environment I leave everything as default, but if you are using this for any real data, ensure that you change the passwords to something secure.
The script will then pull all the required Docker images for you without any further interaction. Once that has finished, you can start the docker containers by running the command:
I then use the command ‘sudo docker container list -a’ to check if the containers are all up. Usually a few fail on the first go, so I rerun the ‘docker compose’ command from above again and make sure they are all marked as ‘Up’.
The screenshot above shows a few containers that failed to start the first time. Rerunning docker compose should start them all.
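If you would rather not read the container list by eye, the check can be scripted. The sketch below is my own helper, not part of DocIntel: it asks `docker ps -a` for a name and status per container and reports any whose status does not start with "Up". The container names in the sample are hypothetical.

```python
import subprocess


def docker_ps_output() -> str:
    """Return one 'name<TAB>status' line per container, running or not."""
    result = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def not_running(ps_output: str) -> list[str]:
    """Return the names of containers whose status does not start with 'Up'."""
    failed = []
    for line in ps_output.splitlines():
        if not line.strip():
            continue
        name, _, status = line.partition("\t")
        if not status.startswith("Up"):
            failed.append(name)
    return failed


# Made-up sample output; the container names are hypothetical.
SAMPLE = "web\tUp 2 minutes\nworker\tExited (1) 30 seconds ago\ndb\tUp 2 minutes\n"
print(not_running(SAMPLE))
```

On the Docker host itself you would call `not_running(docker_ps_output())` instead of using the sample text, running the script with the same privileges you use for the `docker` command.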
Once you have DocIntel up and running, the first thing that you will need to do is create your first administrative account. This is done via the command line. The instructions are provided on GitHub for you:
Running these commands should give you the output as shown in the screenshot below.
Once you have your admin account set up, you are good to log in via the web GUI on http://localhost:5005/.
Once you have logged in with your newly created admin account the first thing you will want to do is create yourself an everyday account and give yourself the administrator role. To do that you use the left side menu and choose Users. Create the user and save.
Then go back to the left hand menu and select Roles. Click on Administrator and you should see your newly created user account, which you can then add to the group with Add user.
Log out of DocIntel and log back in with your user account.
There are a few things that I suggest thinking about and setting up before adding your sources. Think about which sources you want to add and why. What information are you looking to scrape from them? We’ll take a look at the different configuration items, and then I'll walk through an example source and how I’d go about setting it up before actually adding the source itself.
Before I add any sources I’m going to set up my facets and tags. This will take some forethought as I suggested above.
Facets are essentially buckets for your tags. You can use facets to set up a bucket for different threat actor naming conventions, for example. If you want to get more specific tags later, you can add those to facets as well. You can either set up the facets and tags manually, or you can configure these to be automatically extracted during the pre-processing stage.
There are two ways for the automated extraction of tags:
On the left hand menu select Tags and then Create facet.
1. Give the facet a title.
2. Give the facet a prefix. This needs to be unique and is shown in the document listing.
3. Add a description as you see fit.
4. Select “Automated Extraction” and make sure it turns green, then add your regex under Extraction Regex.
5. I generally set the label normalisation to all capitals, just for consistency.
6. Then click Add and your facet is created. Now when you add a source or document, tags will be created based on the regex that you have provided and be listed under this facet.
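To show the idea behind regex-based extraction with upper-case normalisation, here is a small Python sketch. The pattern and the sample text are hypothetical, and it uses Python's `re` module, so treat it as a way to prototype a pattern rather than a guarantee of how DocIntel's own regex engine will behave.

```python
import re

# Hypothetical extraction regex for a threat actor facet:
# "APT" or "FIN" followed by one to three digits, e.g. APT29 or FIN7.
ACTOR_REGEX = re.compile(r"\b(?:APT|FIN)\d{1,3}\b", re.IGNORECASE)


def extract_tags(text: str) -> list[str]:
    """Find all matches, normalise them to upper case and de-duplicate."""
    tags = {match.upper() for match in ACTOR_REGEX.findall(text)}
    return sorted(tags)


sample = "Activity attributed to apt29 overlaps with APT29 and Fin7 tooling."
print(extract_tags(sample))
```

Normalising to capitals means that "apt29" and "APT29" end up as one tag rather than two, which is the consistency benefit mentioned in step 5.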
Because there are so many different malware families, and tagging on all caps isn't reliable (especially for government organisations that frequently use all caps in documents), for now I am creating a facet called malware and adding tags under it for common malware types or malware that is of particular interest to me. You can see in the screenshot above that the malware Cobalt Strike and Emotet have been tagged in documents.
Note: If you are wondering, and you will wonder this, the colour selection at present is limited and also in some cases wrong. There is a current enhancement issue on GitHub to get this fixed.
DocIntel comes configured with only one default classification, “Unclassified”. How you want to set up your classification scheme depends on your organisation and how you intend to use DocIntel. A simple classification scheme could include, for example, Restricted, Confidential, Internal and Public.
To configure additional classifications (which you may want to do before you configure your importers and scrapers), once again go to the left hand menu, go to Administration and Classifications. Here you can add the ones that you like and save.
Importers allow you to automate the submission of URLs to the scrapers. For example, they allow you to automatically import data from RSS feeds and send the URLs to be scraped for observables and tags.
I’ll be installing the RSS Source Importer as it is the only one that I am currently using.
To do this, go to the left hand menu, go to Administration and select Importers. Click on Install, select RSS Source Importer from the dropdown menu, and ensure you choose Enabled under Status. Collection delay is how often the importer will go and check the feed. By default it's set to 30 seconds, but in reality you likely only need it to check once or twice a day.
You can leave everything else as default. However, if you want to override the default classification for what is imported by the RSS Source Importer, perhaps to always mark it as public, you can configure that here.
You can also configure DocIntel so that only particular groups can access the documents imported via this method. I can’t imagine that I would need this with the RSS importer, but with others this might be relevant. For example, if you are scraping a mailbox you might want to set the classification as confidential and for a specific analyst group's eyes only.
That's it, no further configuration needed.
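For a feel of what an RSS importer does on each collection run, the sketch below parses a feed and returns only the links it has not seen before. It is a simplified stand-in written for this post, not DocIntel's implementation, and the feed content and URLs are made up.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 document standing in for a real feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example threat blog</title>
    <item>
      <title>Report one</title>
      <link>https://blog.example.com/report-one</link>
    </item>
    <item>
      <title>Report two</title>
      <link>https://blog.example.com/report-two</link>
    </item>
  </channel>
</rss>
"""


def new_links(feed_xml: str, seen: set[str]) -> list[str]:
    """Return item links not seen on earlier polls, and remember them."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.findall("./channel/item"):
        link = (item.findtext("link") or "").strip()
        if link and link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh


seen_links: set[str] = set()
print(new_links(SAMPLE_FEED, seen_links))  # first poll: both links are new
print(new_links(SAMPLE_FEED, seen_links))  # second poll: nothing new
```

Most polls of a blog feed look like the second call, with nothing new to hand to the scrapers, which is why a collection delay of hours rather than seconds is usually enough.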
Scrapers are automated programs that collect information from the web, including from private websites or APIs. Once the importer has gone off and fetched our blog post from the RSS feed, or we have uploaded a PDF document, the scraper extracts the data.
To scrape PDF documents you need to configure the PDF scraper.
To scrape web pages, such as blog posts, you need to configure the readability scraper.
Adding the scrapers is very similar to adding an importer so I won't cover this again but you can watch the video above to see the process.
I want to look at adding the CISA Cybersecurity Alerts & Advisories RSS feed, as it's a feed that I find useful. It contains IOCs as well as TTPs and vulnerabilities, and is a reliable source of threat intelligence. These are the things to consider when adding a source, so that you have the relevant facets and tags set up before adding it.
I suggest doing this before adding the source as once the source is added and the documents are ingested, there is not currently any way to retrospectively add tags to documents.
Which sources you add is up to you and what you have decided will provide the most relevant information. If you are a bit like me, though, and are just playing around, I suggest adding the “Kaspersky Securelist” RSS feed, as I've tested it and it works well and consistently. If you come up against any issues, you might want to check GitHub. I have tested the Mandiant and CISA feeds and currently have problems with both, which is a known issue.
For testing I’ll add the Kaspersky RSS Feed.
Depending on how often you set your importer to pull, you should see the documents start to flow in within a few minutes (if the collection delay is set to 30 seconds). Once you start reviewing and registering the articles, you will see them come up in the source view.
There are two ways that you can add documents or blog posts to DocIntel manually, either by uploading the PDF directly (has to be a PDF) or by submitting URL(s).
How to do this should be self-explanatory.
When you want to register the pending document, click on the title and you will see a new screen. There is a bit here to consider:
The big green button is where you can see what observables have been scraped from a document. These include IP addresses, URLs and hash sums. Review these and ensure you aren't adding URLs such as google.com that may have been scraped from the article.
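The review step can be pictured as extraction followed by filtering. The Python sketch below pulls IPv4 addresses, SHA-256 hashes and URLs out of some text and drops URLs on well-known benign domains. The patterns are deliberately simplified, the allowlist is hypothetical, and none of this reflects how DocIntel's scrapers work internally.

```python
import re
from urllib.parse import urlparse

# Simplified patterns for illustration; real extractors handle many more cases.
PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "url": re.compile(r"https?://[^\s\"'<>)]+"),
}

# Hypothetical allowlist of domains that are rarely useful as indicators.
BENIGN_DOMAINS = {"google.com", "github.com"}


def is_benign(url: str) -> bool:
    """True if the URL's host is an allowlisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in BENIGN_DOMAINS)


def extract_observables(text: str) -> dict[str, list[str]]:
    """Extract observables by type, trimming trailing punctuation and duplicates."""
    found = {}
    for kind, pattern in PATTERNS.items():
        values = sorted({v.rstrip(".,;") for v in pattern.findall(text)})
        if kind == "url":
            values = [v for v in values if not is_benign(v)]
        found[kind] = values
    return found


# Made-up sample text using documentation IP ranges and example domains.
sample = (
    "The loader contacted 198.51.100.23 and fetched "
    "http://malicious.example.net/stage2.bin. "
    "See https://www.google.com/search?q=loader for background. "
    "SHA256: " + "a" * 64
)
for kind, values in extract_observables(sample).items():
    print(kind, values)
```

An allowlist like this only removes the obvious noise; anything it leaves behind still deserves a manual look before you register the document.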
I have submitted an enhancement to be able to identify the observable easily within the document when selected so you can see the context around it. The other enhancement to this page I've asked for is the ability to add observables from this page as you are reviewing.
Once you have reviewed the observables here you can register the document. If you want to add additional observables you can then open the document and edit and add them manually.
Once you have registered the document, you can export the observables by scrolling down to the bottom of the document page and selecting PDF, Excel, CSV or copy.
At this point DocIntel is released as an Open Source platform and is maintained primarily by one person who also has a day job. Supporting this project can be done in many ways from testing and adding issues, helping to fix code and issues, writing documentation or joining the community and helping others get themselves set up.
Stay tuned for part 2 where I’ll dive into ideas and use cases for DocIntel to showcase its potential. Finally, in part 3 I’ll show you how to integrate DocIntel with MISP.