Slacker II: archiving information from a Slack workspace
I wrote previously about how to archive a Slack workspace.
Well, now Slack has announced a new policy, effective from August 26th, 2024. Under the new policy, messages and files older than one year in workspaces on the free tier will be deleted. This is in addition to the 90-day limit for viewing content in the workspace.
This causes a problem with the solution I outlined previously. Briefly, the viewable archive generated by the excellent slack-export-viewer is served up using a Flask application. However, all the content (images and links to files) is hosted by Slack. This means that everything will disappear when Slack deletes any files older than one year.
Time for action!
tl;dr I have written a solution, which is available here.
For anyone who is interested in the details, read on.
More info
The problem is really confined to content hosted by Slack; every other external link should be OK (well, they're susceptible to link rot, but let's assume we can live with that). There are three types of files: avatars (nice to have), previews of the files (thumbnails), and the files themselves. If we assume all three types will disappear, we need to rescue them!
The avatars are hosted like this:
https://avatars.slack-edge.com/YYYY-MM-DD/13NumericChars_20HexChars_res.jpg
There are also gravatars for people who don’t have a hosted avatar. These are hosted at secure.gravatar.com
Files are hosted like this:
https://files.slack.com/files-pri/teamid-userid/filename?t=secret_token
The thumbnails are hosted in a similar way in a directory called files-tmb rather than files-pri and with the size of the thumbnail appended to the filename.
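The three URL patterns above can be told apart programmatically. Here is a minimal sketch (the function name is my own, not part of the script below) that classifies a URL by hostname and path prefix, assuming thumbnails always live under `files-tmb` and full files under `files-pri` as described:

```python
from urllib.parse import urlparse

# Hypothetical helper: classify a URL into one of the Slack-hosted
# content types described above, or "external" for everything else.
def classify_slack_url(url):
    parsed = urlparse(url)
    if parsed.netloc in ("avatars.slack-edge.com", "secure.gravatar.com"):
        return "avatar"
    if parsed.netloc == "files.slack.com":
        if parsed.path.startswith("/files-tmb/"):
            return "thumbnail"
        if parsed.path.startswith("/files-pri/"):
            return "file"
    return "external"  # leave anything else alone (link rot accepted)
```

Note that `urlparse` keeps the `?t=secret_token` query string out of `path`, so the classification ignores the token.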
What is surprising is that you can access all your old files (as long as they weren't manually deleted on Slack) going back to the beginning of the workspace, even if you are on the free tier and the files are older than 90 days. Note that this is currently the case; obviously, once Slack starts to delete stuff you won't have access at all! This is all possible via a secret token, which is essential for viewing the content and for downloading the files. If you exported an archive some time ago, it is possible that the token will have expired and you will lose access to your files.
You can see all of this by looking at the JSON files in an exported archive. What slack-export-viewer does is build all of that into a viewable archive. I was able to quickly write something that downloaded all the content, but I couldn't figure out how to get it to display in the Flask application…
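To illustrate what the exported JSON looks like, here is a rough sketch of collecting Slack-hosted URLs from an unzipped export. It assumes the standard export layout (one folder per channel, one JSON file per day) and the `files` / `url_private` / `thumb_360` field names used in Slack exports; the function itself is hypothetical, not part of the script below:

```python
import glob
import json
import os

# Hypothetical sketch: walk an unzipped Slack export and collect every
# Slack-hosted URL found in the message JSON.
def collect_file_urls(export_dir):
    urls = []
    # Export layout: <export_dir>/<channel>/<YYYY-MM-DD>.json
    for path in glob.glob(os.path.join(export_dir, "*", "*.json")):
        with open(path, encoding="utf-8") as fh:
            messages = json.load(fh)
        for msg in messages:
            for f in msg.get("files", []):
                for key in ("url_private", "thumb_360"):
                    if key in f:
                        urls.append(f[key])
    return urls
```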
The nice thing is that slack-export-viewer can also generate a local static html instance of the archive using:
slack-export-viewer -z path/to/archive.zip --html-only
OK, now we can work with this static HTML instance. We just need to download all of the linked content, store it locally in a way that can be accessed easily by the instance, and then change all the links to point to the local copies. This is what the Python script does.
It would be really cool to add this option into slack-export-viewer itself, but it is beyond me to do that and generate a PR.
Here’s the code as it stands. I’m not a Pythonista, so if anyone has any suggestions for improvement, hmu. An up-to-date version is on GitHub here.
#!/usr/bin/env python3
import sys
import os
import requests
from urllib.parse import urlparse
from bs4 import BeautifulSoup


def download_file(url, root_dir, html_file_path):
    """Download a file from a URL into the specified directory structure if it matches allowed domains."""
    allowed_domains = ['files.slack.com', 'avatars.slack-edge.com', 'secure.gravatar.com']
    parsed_url = urlparse(url)
    # Check if the URL's domain is in the list of allowed domains
    if parsed_url.netloc not in allowed_domains:
        return url  # Return the original URL if it's not from an allowed domain
    # Create a path based on the URL structure
    file_path = os.path.join(root_dir, parsed_url.netloc, parsed_url.path.strip("/"))
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    # Create the relative path from the HTML file to the local copy
    file_path_relative = os.path.relpath(file_path, os.path.dirname(html_file_path))
    # Skip downloading if the file already exists
    if os.path.exists(file_path):
        return file_path_relative
    # Download and save the file
    response = requests.get(url)
    if response.status_code != 200:
        return url  # Fall back to the remote URL if the download fails
    with open(file_path, 'wb') as file:
        file.write(response.content)
    return file_path_relative


def update_html_file(html_file_path, root_dir):
    """Update HTML file to download and reference local copies of external resources."""
    with open(html_file_path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'html.parser')
    # Handle img tags with src attributes
    for img_tag in soup.find_all('img'):
        original_url = img_tag.get('src')
        if original_url:
            new_path = download_file(original_url, root_dir, html_file_path)
            img_tag['src'] = new_path
    # Handle tags with href attributes (e.g. a, link)
    for href_tag in soup.find_all(href=True):
        original_url = href_tag.get('href')
        if original_url:
            new_path = download_file(original_url, root_dir, html_file_path)
            href_tag['href'] = new_path
    # Write the updated HTML back to the file
    with open(html_file_path, 'w', encoding='utf-8') as file:
        file.write(str(soup))


def scan_and_update_html_files(channel_dir, root_dir):
    """Scan each folder in the channel directory and update the index.html file found within."""
    for subdir, dirs, files in os.walk(channel_dir):
        # Print message to indicate the current directory being processed
        print(f"Processing directory: {subdir}")
        for file in files:
            if file == 'index.html':
                html_file_path = os.path.join(subdir, file)
                update_html_file(html_file_path, root_dir)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python slack-export-local.py <root_directory_path>")
        sys.exit(1)
    root_directory_path = sys.argv[1]
    channel_dir = os.path.join(root_directory_path, "channel")
    root_dir = os.path.join(root_directory_path, "attachments")
    scan_and_update_html_files(channel_dir, root_dir)
—
The post title comes from "Slacker" by Wall of Sleep off their eponymous album. So tempted to name this post after the famous Superchunk song, but this is a family-friendly website.