paw: Parse and Anonymize Whole-slide images
Disclosure: First and foremost, this entire project is built around a tool developed by others at EMPAIA: wsi-anon. Please take a look at the tool itself and the accompanying publication [1] by Bisson, Franz, Dogan O, Romberg, Jansen and other collaborators, which goes into much more detail and offers a far better explanation than I could hope to offer here. In other words, I am a script kiddie. If you find the contents of this post useful, you should consider implementing a similar solution yourself!
Background
Patients expect that their privacy will be treated with the utmost care when they generously consent to the use of their medical data in research. This is particularly intuitive for image data, including photographs of faces and external anatomy, but other medical imaging data including microscopic images need to be handled carefully as well.
Sometimes, these images contain metadata such as names, places, and dates; other times, the image contents themselves can be identifying enough to pose a privacy risk, for example in the case of famous people with very rare diagnoses. Research use of these materials implies the need to remove identifying data and metadata as much as possible. The extent to which one must go is not always clear, but some in-depth guidance is available [2].
Even with metadata-free digital pathology whole-slide images (WSIs), there is a theoretical risk of re-identification using inherent similarities between WSIs that would allow a link to be established between WSIs, and so on up the chain all the way to specific institutions or patients [3]. Many jurisdictions rightfully treat WSIs as essentially equivalent to any other information contained in a patient chart for the purposes of data protection. Despite this, robustly scrubbing digital slides to a metadata-free state is not always easily accomplished or routinely done in practice.
When performing research, especially with artificial intelligence (AI), we commit to stringent privacy protection measures, and the research ethics/institutional review boards of our institutions expect us to follow through with such promises. I admit that I haven't had the easiest time doing this correctly, which was a pressing reason for me to look into ways that it could be done with the resources available to me (which are not very many!).
Tools for removing WSI metadata
There are many ways to remove metadata from digital slides, but
most of them have at least one limitation that ruled them out for my
potential use cases. Obviously, since I don't have a slush fund to play
around with software, anything neither open source nor free (in the
"free speech" sense [4]) wouldn't cut it.
Converting slides to open TIFF-like formats like .ome.tif does work,
but it takes a long time even on powerful workstations.
Some problems I encountered in my search for an anonymization solution adapted to my needs, and possible solutions.
One excellent available solution is the wsi-anon library from the
EMPAIA consortium, which is fast, has few dependencies, and is
compatible with file formats from many slide scanner vendors [1].
EMPAIA distributes it under the MIT License on its GitLab repository,
and it is still (as of this post's publication in March 2025) actively
developed and maintained.
This library is particularly fast because it directly overwrites the
(relatively small) metadata blocks in the binary digital slide files
without performing operations on the image data. Obviously, each
vendor's own file format needs to be separately supported, but there
aren't too many on the market for it to be unmanageable, and most of
them are glorified TIFF files with a few idiosyncrasies. Until everyone
(eventually?) switches to DICOM, a piecemeal approach is the only way to
go. Despite these theoretical challenges, I haven't had any issues with
wsi-anon in practice besides weird edge cases like file names with
special characters or previously corrupted metadata from other
anonymization tools.
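To illustrate why overwriting metadata in place is so much faster than converting a slide, here is a minimal Python sketch. It is not wsi-anon's code and does not handle any real vendor format: the 64-byte text header, the `PATIENT` field, and the helper names `scrub_field` and `make_toy_slide` are all assumptions made up for this example. The only point is that a same-length overwrite leaves the file size and the image bytes untouched.

```python
import os
import tempfile

# Hypothetical toy layout (not a real WSI format):
# a fixed-size text header followed by raw "pixel" bytes.
HEADER_SIZE = 64


def scrub_field(path: str, field: bytes, filler: bytes = b"X") -> bool:
    """Overwrite the value of `field` in the header, in place.

    Returns True if the field was found and overwritten, False otherwise.
    """
    with open(path, "r+b") as fh:
        header = fh.read(HEADER_SIZE)
        start = header.find(field + b"=")
        if start == -1:
            return False
        value_start = start + len(field) + 1
        value_end = header.find(b";", value_start)
        if value_end == -1:
            value_end = HEADER_SIZE
        # Same-length replacement: no byte after the value has to move,
        # so the image data is never read or rewritten.
        fh.seek(value_start)
        fh.write(filler * (value_end - value_start))
    return True


def make_toy_slide(path: str) -> bytes:
    """Create a toy 'slide' file and return its pixel bytes."""
    header = b"PATIENT=Jane Doe;DATE=2025-01-01;".ljust(HEADER_SIZE, b"\0")
    pixels = bytes(range(256)) * 4
    with open(path, "wb") as fh:
        fh.write(header + pixels)
    return pixels
```

Calling `scrub_field(path, b"PATIENT")` on a file made by `make_toy_slide` replaces only the eight bytes of the name. The cost depends on the size of the header, not on the size of the image, which is the property that makes this approach attractive for multi-gigabyte slides.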
Calling wsi-anon directly is all well and good for someone like me who is comfortable with the command line, but I quickly started getting requests from others less comfortable with it to help them anonymize batches of digital slides. I spent a fair amount of time looking for something that would do the trick but, after coming up short, I decided to give it a go myself.
Designing a solution
Skipping the command line is only one part of the problem (albeit an essential one). Since there is often one person handling slide scanning/digitization for many others at the same time, there needs to be a way to preserve some information about anonymized slides (so they can be returned to the correct requestor). Sometimes, when the slides from an entire case undergo anonymization, it is still worth it to preserve specimen, block, slide and stain identifiers, which usually do not constitute identifying metadata but are very important to match slides to findings in the pathology reports.
End users in my laboratory also work with a variety of operating systems (Windows, Mac and Linux-based). This includes dedicated devices like commercial network-attached storage that are Linux-based but not fully-featured. Ideally, whatever I did would have to work on most platforms.
A cross-platform compatibility framework I am somewhat acquainted with
is containerization via Docker. Conveniently, wsi-anon already
provides a Dockerfile for testing purposes, which was easy to adapt to
build and package the executable into a light Alpine Linux image with
minimum dependencies.
To track which files have been anonymized, a good option was PostgreSQL, which is widely available as a Docker image, and for which there is an existing ecosystem of web interfaces and command line tools. Another option was simply to use plain-text comma-separated value files, which are easy to manipulate with ubiquitous graphical user interface tools if necessary. I chose to allow for either option, depending on user preference.
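As an illustration of the plain-text option, here is a minimal Python sketch of an append-only CSV log. The column names and the functions `log_slide` and `read_log` are hypothetical and chosen for this example; paw's actual log layout may differ.

```python
import csv
import os
from datetime import datetime, timezone

# Hypothetical column names; the real log layout may differ.
FIELDS = ["processed_at", "original_name", "anonymized_name"]


def log_slide(log_path: str, original_name: str, anonymized_name: str) -> None:
    """Append one row to the CSV log, writing the header on first use."""
    is_new = not os.path.exists(log_path) or os.path.getsize(log_path) == 0
    with open(log_path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "processed_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "original_name": original_name,
            "anonymized_name": anonymized_name,
        })


def read_log(log_path: str) -> list[dict]:
    """Read the whole log back as a list of dictionaries."""
    with open(log_path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))
```

A file like this opens directly in any spreadsheet program, which is the main appeal of CSV here. The trade-off is that, unlike a database, it offers no protection if two processes write to it at the same time.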
Most applications handle file selection through an import menu, but this would require a lot more work to implement. Instead, as Docker supports mounting host directories as volumes, a more straightforward solution is simply to benefit from the user's desktop environment, which provides graphical file operation tools, usually by dragging and dropping or copying and pasting files from one folder to another. In essence, as long as the user knows into which folder they should put digital slides in need of anonymization, this functions as a sort of native import dialog.
At that point, all that remained were a few utilities to handle calling
wsi-anon and logging. Even though they are neither the easiest nor the
safest tools, POSIX utilities like the shell, cron, and sed can handle
periodic parsing, file operation, and executable calling tasks without
requiring additional dependencies in Alpine. They are also
well-documented for use by a non-programmer like myself! So I wrote and
packaged a few scripts in the container with the wsi-anon executable
to handle "watching" the digital slide source directory.
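The "watching" logic can be sketched as a single pass over the source directory that a scheduler runs periodically. This sketch is written in Python rather than the POSIX shell scripts described above, and everything in it is an assumption for illustration: the `scan_once` function, the extension list, and the `handle` callback, which stands in for calling the anonymizer.

```python
import os
import tempfile
from typing import Callable

# Hypothetical list of slide file extensions to pick up.
SLIDE_EXTENSIONS = (".svs", ".ndpi", ".tif", ".tiff")


def scan_once(
    source_dir: str,
    seen: set[str],
    handle: Callable[[str], None],
) -> list[str]:
    """Call `handle` on each slide file not seen in a previous pass.

    Returns the names of the files handled during this pass.
    """
    processed = []
    for name in sorted(os.listdir(source_dir)):
        path = os.path.join(source_dir, name)
        if name in seen or not os.path.isfile(path):
            continue
        if not name.lower().endswith(SLIDE_EXTENSIONS):
            continue
        handle(path)  # stand-in for invoking the anonymizer on this file
        seen.add(name)
        processed.append(name)
    return processed
```

In a real setup, `handle` would launch the anonymizer executable (its command-line arguments are deliberately not shown here) and write a log entry. A production watcher would also need to skip files that the scanner is still writing, which this sketch does not attempt.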
Here is a summary of my solution:
Schematic representation of the default Docker compose stack of paw, with a PostgreSQL database.
For lack of a better name, I ended up code-naming this prototype paw,
for "Parse and anonymize WSIs" (WSI is a commonly-used acronym for
digital slides, meaning whole-slide image).
A real-life use case
For a few months now, this setup has been in production use on one of
our Synology NAS devices, which is directly networked with two of our
slide scanners. A shared folder on the NAS is mapped as the source
directory for paw, which runs as a Docker compose project.
This is convenient, because when our slide scanning technician is
digitizing a mix of slides for clinical and research use, all that is
needed is to set the output folder for the research slides to the paw
source directory, and all of them end up being anonymized on the fly.
The handy log of everything that has been processed is great for
building the research project's dataset afterwards. And throughout all
this, the operator of the scanner doesn't need to know anything about
paw, wsi-anon, or using the command line — but more importantly,
neither does the person who requested the digital slides in the first
place!
A colleague of mine has used this to scan more than a thousand slides
intermittently over a few days for a research project, as has another
who is prospectively enrolling patients and their pathology materials
into a biobank. Both of these projects committed to best practices in
privacy protection in their ethics approvals, and I am proud that paw
has made it easier for them to follow through with that promise.
Code
The public repository for paw is hosted on Codeberg at
bertogatti/paw. The version that was presented at USCAP 2025 is archived
at Zenodo [5], as is the poster describing it [6].
References
[1] T. Bisson et al., "Anonymization of whole slide images in histopathology for research and education," Digital Health, vol. 9, p. 20552076231171475, Jan. 2023, doi: 10.1177/20552076231171475
[2] D. A. Clunie et al., "Report of the Medical Image De-Identification (MIDI) Task Group – Best Practices and Recommendations." arXiv, Apr. 2023. doi: 10.48550/arXiv.2303.10473. Available: https://arxiv.org/abs/2303.10473. [Accessed: Mar. 02, 2025]
[3] P. Holub et al., "Privacy risks of whole-slide image sharing in digital pathology," Nature Communications, vol. 14, no. 1, p. 2577, May 2023, doi: 10.1038/s41467-023-37991-y
[4] R. Stallman, "Why Open Source Misses the Point of Free Software," GNU Project, 2007. Available: https://www.gnu.org/philosophy/open-source-misses-the-point.html
[5] A. Lametti, "Paw: Open-Source, Cross-Platform Drag and Drop Pipeline for Real-Time Whole-Slide Image Anonymization." Zenodo, Mar. 2025. doi: 10.5281/zenodo.14994533
[6] A. Lametti, "Open-Source, Cross-Platform Drag and Drop Pipeline for Real-Time Whole-Slide Image Anonymization," in United States and Canadian Academy of Pathology, Boston, Massachusetts, United States of America: Zenodo, Mar. 2025. doi: 10.5281/zenodo.15061681
Additional details
Identifiers
- UUID: 6f6c7ae3-d112-45ef-bcb6-b19c8497b1f3
- GUID: https://doi.org/10.59350/cedew-n6930
- URL: https://justapa.thologi.st/posts/paw-digital-slide-anonymization/
Dates
- Issued: 2025-03-23T01:00:00
- Updated: 2025-03-23T01:00:00