Detecting Canary Tokens in Microsoft Office Documents

Today, let’s explore the Canary Token Scanner, a nifty Python script designed to detect canary tokens hidden in Microsoft Office (docx, xlsx, pptx) and Adobe Acrobat PDF documents. These days, more and more companies set up honeypots in their networks and run red team exercises meant to test how well they can defend against simulated cyber attacks. It’s like a practice run: a way to see how the internal security team, often called the blue team, would react to a real hacking attempt.

In such exercises, red teams – the good guys pretending to be hackers – often find files that are left as bait. These files might look normal, like any other Word or Excel document, but they can be traps. Opening them carelessly might set off alarms, blowing the cover of the red team and ending the exercise. That’s why being cautious is key. You need to check what’s inside these files without actually running them, and that's where the Canary Token Scanner comes in.

The goal of the Canary Token Scanner is to carefully look inside these Microsoft Office files. It pulls out and shows you all the web links (URLs) hidden in them. The clever part? It does this without triggering any secret alarms (the canary tokens) in the files. It’s like tiptoeing through a room full of sensors without setting any off – you get to see where the traps are without getting caught!

Let's have a closer look at the script.

# Initial Setup
import os
import zipfile
import re
import shutil
import sys
# Check for the correct usage of the script
if len(sys.argv) != 2:
    print(f"Usage: python {sys.argv[0]} FILE_OR_DIRECTORY_PATH")
    sys.exit(1)

This initial segment sets up the stage. It’s like gearing up before a mission, ensuring all the necessary tools (libraries) are in place and verifying the input (file or directory path) for the mission ahead.
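Since the usage message accepts either a file or a directory, the script has to turn that single argument into a list of files to scan. A minimal sketch of that step might look like the following. The helper name `collect_targets` is hypothetical and is not taken from the original script.

```python
import os

def collect_targets(path):
    """Return the files to scan for a file-or-directory argument.

    Hypothetical helper for illustration; the original script may
    organise this step differently.
    """
    if os.path.isfile(path):
        return [path]
    if os.path.isdir(path):
        targets = []
        # Walk the directory tree and collect every file beneath it
        for root, _dirs, files in os.walk(path):
            for name in sorted(files):
                targets.append(os.path.join(root, name))
        return targets
    # Neither a file nor a directory: nothing to scan
    return []
```

Each returned path could then be handed to the scanning function one at a time.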

def decompress_and_scan(file_path):
    is_suspicious = False
    temp_dir = "temp_extracted"
    os.makedirs(temp_dir, exist_ok=True)
    try:
        # If it's a zip file or an Office file (which is technically a zip), extract it
        if file_path.endswith('.zip') or file_path.endswith(('.docx', '.xlsx', '.pptx')):
            with zipfile.ZipFile(file_path, 'r') as zip_ref:
                zip_ref.extractall(temp_dir)
        # Match anything that looks like an http(s) link
        url_pattern = re.compile(r'https?://\S+')
        # Domains to ignore (names omitted here); every entry must be a
        # non-empty string, because '' is a substring of every URL
        ignored_domains = ['', '', '', '']
        # Read every extracted file as text and collect the URLs it contains
        for root, dirs, files in os.walk(temp_dir):
            for file_name in files:
                extracted_file_path = os.path.join(root, file_name)
                with open(extracted_file_path, 'r', errors='ignore') as extracted_file:
                    contents = extracted_file.read()
                    urls = url_pattern.findall(contents)
                    for url in urls:
                        if not any(domain in url for domain in ignored_domains):
                            print(f"URL Found: {url}")
                            is_suspicious = True
    except Exception as e:
        print(f"Error processing file {file_path}: {e}")
    finally:
        # Always remove the temporary extraction directory
        shutil.rmtree(temp_dir, ignore_errors=True)
    return is_suspicious

Here, the decompress_and_scan function is the heart of the operation. It carefully opens up the Office files (which are really just fancy zip files) and starts looking for web links. Think of it as a detective searching a room with a flashlight, looking for clues without touching anything.
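As a variation on the same idea, the archive members can also be read straight from the zip without writing anything to a temporary directory. The sketch below is an alternative, not the script's own approach. The function name `scan_archive_in_memory` and the two ignored domains are assumptions chosen for illustration; those hosts appear as XML namespace identifiers in ordinary Office files.

```python
import re
import zipfile

# Same pattern as in decompress_and_scan
URL_PATTERN = re.compile(r'https?://\S+')

# Assumed ignore list for illustration: namespace hosts that show up
# in ordinary Office documents
ASSUMED_IGNORED_DOMAINS = ('schemas.openxmlformats.org', 'schemas.microsoft.com')

def scan_archive_in_memory(file_path, ignored_domains=ASSUMED_IGNORED_DOMAINS):
    """Return the non-ignored URLs found inside a zip-based file."""
    found = []
    with zipfile.ZipFile(file_path, 'r') as archive:
        for member in archive.namelist():
            # read() returns the decompressed bytes of a single member
            text = archive.read(member).decode('utf-8', errors='ignore')
            for url in URL_PATTERN.findall(text):
                if not any(domain in url for domain in ignored_domains):
                    found.append(url)
    return found
```

Returning the list instead of printing makes the result easier to reuse: a non-empty list corresponds to the suspicious case.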

Practical example

$ python3 "../feauhr3n9iu2wowh1eluts7b0.docx"
 The file ../feauhr3n9iu2wowh1eluts7b0.docx seems normal.
$ python3 "../arh7zm0m6gffxilvkio3tmbjl.docx"
 URL Found:"
 URL Found:"
 The file ../arh7zm0m6gffxilvkio3tmbjl.docx is suspicious.

In the shell session above, we see a tale of two documents. The first Word file, as the script indicates, appears normal with no hidden URLs. However, it's a different story with the second file. The Canary Token Scanner uncovers canary token links lurking within – a clear signal of something more than meets the eye.

This example illustrates the script’s ability to differentiate between a regular document and one embedded with canary tokens. It's this simple yet effective functionality that could be the difference between a successful red team operation and one compromised by detection.
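One detail worth knowing when reading such output: the `https?://\S+` pattern used in `decompress_and_scan` matches any run of non-whitespace characters, so inside XML it can also pick up a closing quote or tag delimiter that follows the link. A slightly tighter pattern, sketched below as an optional refinement rather than part of the original script, stops at quotes and angle brackets. The names `TIGHT_URL_PATTERN` and `extract_urls`, and the sample string, are made up for illustration.

```python
import re

# Stop at whitespace, quotes, and angle brackets so that XML attribute
# delimiters are not captured as part of the URL
TIGHT_URL_PATTERN = re.compile(r'https?://[^\s"\'<>]+')

def extract_urls(text):
    """Return all http(s) URLs in text, without trailing XML delimiters."""
    return TIGHT_URL_PATTERN.findall(text)

# Illustrative XML fragment with an external link in an attribute
sample = '<Relationship Target="http://example.com/token.png" TargetMode="External"/>'
print(extract_urls(sample))  # ['http://example.com/token.png']
```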

For those who are intrigued and want to dive deeper or even use this script in their own operations, the full script is available on our GitHub profile.

It's a straightforward script for anyone involved in red team exercises, and it helps you stay one step ahead in the game of cat and mouse!


[Update 2/4/2024]

If you're interested in trying out the script or simply exploring what it can do, we have introduced our honey tokens service. This service is available for free and lets you create various types of documents, including Word, PowerPoint, Excel, CSV, SVG images, QR codes, or PDF files, and embed them with honey tokens. What sets our honey tokens apart is the use of AI to fill these documents with content that's designed to be deceptive in a subtle way. By creating your own documents with embedded tokens, you can test the effectiveness of the Canary Token Scanner and understand more about the concept of digital deception and the role of canary tokens in it.


Feel free to try creating your own documents at:

Happy trapping! 🍯🐤