paw: Parse and Anonymize Whole-slide images

Lametti, André

doi:10.59350/cedew-n6930

Published March 23, 2025 | https://doi.org/10.59350/cedew-n6930

Disclosure: First and foremost, this entire project is built around a tool developed by others at EMPAIA: wsi-anon. Please take a look at the tool itself and the accompanying publication [1] by Bisson, Franz, Dogan O, Romberg, Jansen, and other collaborators, which goes into much more detail and offers a far better explanation than I could hope to offer here.

In other words, I am a script kiddie. If you find the contents of this post useful, you should consider implementing a similar solution yourself!

Background

Patients expect that their privacy will be treated with the utmost care when they generously consent to the use of their medical data in research. This is particularly intuitive for image data, including photographs of faces and external anatomy, but other medical imaging data including microscopic images need to be handled carefully as well.

Sometimes, these images contain metadata such as names, places, and dates; other times, their contents alone can be sufficiently identifying to pose a privacy risk, for example in the case of famous people with very rare diagnoses. Research use of these materials implies the need to remove identifying data and metadata as much as possible. How far one must go is not always clear, but some in-depth guidance is available [2].

Even with metadata-free digital pathology whole-slide images (WSIs), there is a theoretical risk of re-identification using inherent similarities between WSIs that would allow a link to be established between WSIs, and so on up the chain all the way to specific institutions or patients [3]. Many jurisdictions rightfully treat WSIs as essentially equivalent to any other information contained in a patient chart for the purposes of data protection. Despite this, robustly scrubbing digital slides to a metadata-free state is not always easily accomplished or routinely done in practice.

When performing research, especially with artificial intelligence (AI), we commit to stringent privacy protection measures, and the research ethics/institutional review boards of our institutions expect us to follow through on such promises. I admit that I haven't had the easiest time doing this correctly, which was a pressing reason for me to look into ways of doing it with the resources available to me (which are not very many!).

Tools for removing WSI metadata

There are many ways to remove these metadata from digital slides, but most of them have at least one limitation that ruled them out for my potential use cases. Obviously, since I don't have a slush fund to play around with software, anything neither open source nor free (in the "free speech" sense [4]) wouldn't cut it. Converting slides to open TIFF-like formats such as .ome.tif does work, but it takes a long time even on powerful workstations.

Figure: Some problems I encountered in my search for an anonymization solution adapted to my needs, and possible solutions.

One excellent available solution is the wsi-anon library from the EMPAIA consortium, which is fast, has few dependencies, and is extensively compatible with slide scanner vendors [1]. EMPAIA distributes it according to the MIT License on its GitLab repository, and it is still (as of this post's publication in March 2025) actively developed and maintained.

This library is particularly fast because it directly overwrites the (relatively small) metadata blocks in the binary digital slide files without performing operations on the image data. Obviously, each vendor's own file format needs to be supported separately, but there are few enough on the market for this to remain manageable, and most of them are glorified TIFF files with a few idiosyncrasies. Until everyone (eventually?) switches to DICOM, a piecemeal approach is the only way to go. Despite these theoretical challenges, I haven't had any issues with wsi-anon in practice besides weird edge cases like file names with special characters or metadata previously corrupted by other anonymization tools.
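To make the idea of overwriting metadata in place more concrete, here is a minimal toy sketch in Python. It is not how wsi-anon is implemented and it does not parse any real slide format: the file layout, the field contents, and the function names are all made up for illustration. It only shows why this approach is fast, since a handful of bytes are rewritten and the (fake) image data is never touched.

```python
import os
import tempfile


def overwrite_bytes_in_place(path: str, offset: int, length: int, filler: bytes = b"X") -> None:
    """Overwrite `length` bytes starting at `offset`, leaving the rest of the file untouched."""
    if len(filler) != 1:
        raise ValueError("filler must be a single byte")
    with open(path, "r+b") as handle:  # read/write mode, does not truncate the file
        handle.seek(0, os.SEEK_END)
        if offset < 0 or length < 0 or offset + length > handle.tell():
            raise ValueError("range falls outside the file")
        handle.seek(offset)
        handle.write(filler * length)


def demo() -> bytes:
    """Build a toy file (fake metadata field + fake pixel data) and scrub the field."""
    secret = b"DOE^JANE"  # hypothetical identifying value
    payload = b"HEADER|Name=" + secret + b"|" + bytes(range(16))
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "toy_slide.bin")
        with open(path, "wb") as handle:
            handle.write(payload)
        # A real tool would locate the field by parsing the vendor format;
        # here we simply search for the known value.
        overwrite_bytes_in_place(path, payload.index(secret), len(secret))
        with open(path, "rb") as handle:
            return handle.read()
```

Because the replacement has exactly the same length as the original field, the file size and every byte after the field stay the same, which is what lets the real tool skip the image data entirely.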

This is all well and good for someone like me who is comfortable calling an executable from the command line, but I quickly started getting requests from others less familiar with it to help them anonymize batches of digital slides. I spent a fair amount of time looking for an existing tool that would do the trick, but after coming up short, I gave up and decided to give it a go myself.

Designing a solution

Skipping the command line is only one part of the problem (albeit an essential one). Since there is often one person handling slide scanning/digitization for many others at the same time, there needs to be a way to preserve some information about anonymized slides (so they can be returned to the correct requestor). Sometimes, when the slides from an entire case undergo anonymization, it is still worth it to preserve specimen, block, slide and stain identifiers, which usually do not constitute identifying metadata but are very important to match slides to findings in the pathology reports.

End users in my laboratory also work with a variety of operating systems (Windows, Mac and Linux-based). This includes dedicated devices like commercial network-attached storage that are Linux-based but not fully-featured. Ideally, whatever I did would have to work on most platforms.

A cross-platform framework I am somewhat acquainted with is containerization via Docker. Conveniently, wsi-anon already provides a Dockerfile for testing purposes, which was easy to adapt to build and package the executable into a light Alpine Linux image with minimal dependencies.

To track which files have been anonymized, a good option was PostgreSQL, which is widely available as a Docker image, and for which there is an existing ecosystem of web interfaces and command line tools. Another option was simply to use plain-text comma-separated value files, which are easy to manipulate with ubiquitous graphical user interface tools if necessary. I chose to allow for either option, depending on user preference.
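As an illustration of the plain-text option, the following Python sketch appends one row per processed slide to a CSV file. This is only a sketch under assumptions: paw itself relies on shell scripts, and the column names, file names, and function names below are hypothetical rather than taken from the actual project.

```python
import csv
import os
import tempfile
from datetime import datetime, timezone

# Hypothetical column layout; the real log may record different fields.
LOG_COLUMNS = ["processed_at", "original_name", "anonymized_name"]


def append_log_row(log_path: str, original_name: str, anonymized_name: str) -> None:
    """Append one record, writing the header first if the log is new or empty."""
    needs_header = not os.path.exists(log_path) or os.path.getsize(log_path) == 0
    with open(log_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=LOG_COLUMNS)
        if needs_header:
            writer.writeheader()
        writer.writerow({
            "processed_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "original_name": original_name,
            "anonymized_name": anonymized_name,
        })


def demo() -> list[dict[str, str]]:
    """Log two made-up slides and read the log back as a list of rows."""
    with tempfile.TemporaryDirectory() as folder:
        log_path = os.path.join(folder, "anonymized_slides.csv")
        append_log_row(log_path, "case1_A1_HE.svs", "anon_0001.svs")
        append_log_row(log_path, "case1_A2_HE.svs", "anon_0002.svs")
        with open(log_path, newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
```

A log like this opens directly in any spreadsheet program, which is the main appeal of the CSV option over a database for users who never touch the command line.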

Most applications handle file selection through an import menu, but this would require a lot more work to implement. Instead, as Docker supports mounting host directories as volumes, a more straightforward solution is simply to benefit from the user's desktop environment, which provides graphical file operation tools, usually by dragging and dropping or copying and pasting files from one folder to another. In essence, as long as the user knows into which folder they should put digital slides in need of anonymization, this functions as a sort of native import dialog.

At that point, all that remained were a few utilities to handle calling wsi-anon and logging. Even though they are neither the easiest nor the safest tools, POSIX utilities like the shell, cron, and sed can handle periodic parsing, file operations, and calls to the executable without requiring additional dependencies in Alpine. They are also well documented for use by a non-programmer like myself! So I wrote a few scripts and packaged them in the container with the wsi-anon executable to handle "watching" the digital slide source directory.
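The "watching" step amounts to a periodic pass over the source directory. The sketch below shows that logic in Python purely for illustration; paw itself does this with shell scripts and cron. The extension list, the function names, and the `anonymize` callback are assumptions. In a real setup the callback would invoke the wsi-anon executable, which is deliberately left out here because its exact command-line options are not covered in this post.

```python
import os
import tempfile
from typing import Callable, Iterable

# Illustrative extensions only; not necessarily the list paw checks for.
SLIDE_EXTENSIONS = (".svs", ".ndpi", ".mrxs", ".tif", ".tiff")


def find_new_slides(source_dir: str, already_seen: set[str],
                    extensions: Iterable[str] = SLIDE_EXTENSIONS) -> list[str]:
    """Return slide file names in source_dir that have not been handled yet, sorted."""
    suffixes = tuple(ext.lower() for ext in extensions)
    found = []
    for entry in sorted(os.listdir(source_dir)):
        path = os.path.join(source_dir, entry)
        if os.path.isfile(path) and entry.lower().endswith(suffixes) and entry not in already_seen:
            found.append(entry)
    return found


def run_one_pass(source_dir: str, already_seen: set[str],
                 anonymize: Callable[[str], bool]) -> list[str]:
    """One scheduled pass: hand each new slide to `anonymize` and remember the successes."""
    done = []
    for name in find_new_slides(source_dir, already_seen):
        if anonymize(os.path.join(source_dir, name)):
            already_seen.add(name)
            done.append(name)
    return done


def demo() -> tuple[list[str], list[str]]:
    """Run two passes over a folder holding two fake slides and one unrelated file."""
    with tempfile.TemporaryDirectory() as folder:
        for name in ("b.svs", "a.ndpi", "notes.txt"):
            open(os.path.join(folder, name), "wb").close()
        seen: set[str] = set()
        first = run_one_pass(folder, seen, anonymize=lambda path: True)
        second = run_one_pass(folder, seen, anonymize=lambda path: True)
        return first, second
```

Running the same pass on a schedule is what makes the drop folder behave like an import dialog: files that were already handled are skipped, and anything that failed is simply retried on the next pass.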

Here is a summary of my solution:

Figure: Schematic representation of the default Docker compose stack of paw, with a PostgreSQL database.

For lack of a better name, I ended up code-naming this prototype paw, for "Parse and anonymize WSIs" (WSI is a commonly-used acronym for digital slides, meaning whole-slide image).

A real-life use case

For a few months now, this setup has been in production use on one of our Synology NAS devices, which is directly networked with two of our slide scanners. A shared folder on the NAS is mapped as the source directory for paw, which runs as a Docker compose project.

This is convenient, because when our slide scanning technician is digitizing a mix of slides for clinical and research use, all that is needed is to set the output folder for the research slides to the paw source directory, and all of them end up being anonymized on-the-fly. The handy log of everything that has been processed is great for building the research project's dataset afterwards. And throughout all this, the operator of the scanner doesn't need to know anything about paw, wsi-anon, or using the command line — but more importantly, neither does the person who requested the digital slides in the first place!

A colleague of mine has used this to scan more than a thousand slides intermittently over a few days for a research project, as has another who is prospectively enrolling patients and their pathology materials into a biobank. Both of these projects committed to best practices in privacy protection in their ethics approvals, and I am proud that paw has made it easier for them to follow through with that promise.

Code

The public repository for paw is hosted on Codeberg at bertogatti/paw. The version that was presented at USCAP 2025 is archived at Zenodo [5], as is the poster describing it [6].

References

[1] T. Bisson et al., "Anonymization of whole slide images in histopathology for research and education," Digital Health, vol. 9, p. 20552076231171475, Jan. 2023, doi: 10.1177/20552076231171475

[2] D. A. Clunie et al., "Report of the Medical Image De-Identification (MIDI) Task Group – Best Practices and Recommendations." arXiv, Apr. 2023. doi: 10.48550/arXiv.2303.10473. Available: https://arxiv.org/abs/2303.10473. [Accessed: Mar. 02, 2025]

[3] P. Holub et al., "Privacy risks of whole-slide image sharing in digital pathology," Nature Communications, vol. 14, no. 1, p. 2577, May 2023, doi: 10.1038/s41467-023-37991-y

[4] R. Stallman, "Why Open Source Misses the Point of Free Software - GNU Project," GNU.org. https://www.gnu.org/philosophy/open-source-misses-the-point.html, 2007.

[5] A. Lametti, "Paw: Open-Source, Cross-Platform Drag and Drop Pipeline for Real-Time Whole-Slide Image Anonymization." Zenodo, Mar. 2025. doi: 10.5281/zenodo.14994533

[6] A. Lametti, "Open-Source, Cross-Platform Drag and Drop Pipeline for Real-Time Whole-Slide Image Anonymization," in United States and Canadian Academy of Pathology, Boston, Massachusetts, United States of America: Zenodo, Mar. 2025. doi: 10.5281/zenodo.15061681

Additional details

UUID: 6f6c7ae3-d112-45ef-bcb6-b19c8497b1f3
GUID: https://doi.org/10.59350/cedew-n6930
URL: https://justapa.thologi.st/posts/paw-digital-slide-anonymization/

Issued: 2025-03-23T01:00:00
Updated: 2025-03-23T01:00:00
