Safely serving reader-submitted media files for news orgs


#1

Hey folks. One of the projects I'm involved in aims to better connect news orgs and their readers. To that end, we're building a couple of tools that will allow users to submit files to news orgs: a form builder (like Google Forms, but accepting more types of files, such as images and videos) and a commenting platform that will also allow users to upload files. With that in mind, I want to curb the potential for sharing malware embedded in those media files.

Questions for security friends: Do you have any thoughts on the best approaches for serving reader-submitted media files safely? How would you tackle this problem?


#2

What type of files are they, and what sort of preprocessing can you afford to do on them? In general, I think you're going to find yourself trying to build a solution for each filetype individually.

So of course you can do anti-virus scanning on files, but that won't catch someone who doesn't want to be caught. Browser MIME sniffing might enable an attacker to achieve XSS or account hijacking, so serving the files from a separate top-level domain (that is, a different registered domain rather than a subdomain of your main site) is a good idea too.
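As a rough illustration of the MIME sniffing point, here is a minimal Python sketch of the response headers an upload-serving host might send. The function name `upload_response_headers` and the `ALLOWED_TYPES` table are made up for this example; the headers themselves (`X-Content-Type-Options: nosniff`, `Content-Disposition: attachment`) are standard HTTP headers. It assumes files are stored under names whose extension you chose yourself after validating the upload.

```python
import os

# Hypothetical allowlist: extension -> the only Content-Type we will ever send.
ALLOWED_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".mp4": "video/mp4",
    ".mp3": "audio/mpeg",
    ".pdf": "application/pdf",
}


def upload_response_headers(stored_name: str) -> dict:
    """Build response headers for a reader-submitted file."""
    ext = os.path.splitext(stored_name)[1].lower()
    # Anything not on the allowlist is served as an opaque download.
    content_type = ALLOWED_TYPES.get(ext, "application/octet-stream")
    headers = {
        "Content-Type": content_type,
        # Ask the browser not to guess a different type from the bytes.
        "X-Content-Type-Options": "nosniff",
    }
    # Only plain media is allowed to render inline; everything else downloads.
    if not content_type.startswith(("image/", "video/", "audio/")):
        headers["Content-Disposition"] = "attachment"
    return headers
```

This complements the separate domain rather than replacing it: the headers reduce the chance that an upload is interpreted as HTML, and the separate domain limits the damage if one is.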

But I think the next step is to build per-filetype solutions.

  • I have to assume there are tools or libraries that can look inside Microsoft Office files (docx, xlsx) and look for macros.
  • I wonder if there are image filetype 'linters' that try to clean PNG and JPG files. At the very least you could pass them through an optimizer, resizer, or filter to muck with them and potentially break an exploit.
  • Similar story for video files, just more work/cpu.
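For the Office macro case, one cheap first check is possible with the Python standard library alone. Modern Office files (docx, xlsx, and their macro-enabled variants) are ZIP archives, and VBA macros are stored in a part named `vbaProject.bin`. The helper name `has_vba_macros` below is invented for this sketch, and the check only covers VBA macros, not other kinds of active content.

```python
import zipfile


def has_vba_macros(fileobj) -> bool:
    """Return True if an OOXML file (docx, xlsx, ...) contains a VBA project.

    Raises zipfile.BadZipFile for anything that is not a ZIP archive,
    which includes the legacy binary .doc / .xls formats.
    """
    with zipfile.ZipFile(fileobj) as archive:
        # The part normally lives at word/vbaProject.bin or xl/vbaProject.bin,
        # so match on the file name rather than the full path.
        return any(
            name.lower().endswith("vbaproject.bin")
            for name in archive.namelist()
        )
```

A reasonable policy would be to reject the upload both when this returns True and when it raises, since a file that claims to be a docx but is not a valid archive is suspicious in its own right.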

#3

Appreciate it, these are helpful thoughts.

Right now we're still discussing the specific file formats we want to support. I imagine we'll want to serve a small number of standard image file formats (e.g., .jpeg, .png, .gif), videos (e.g., mp4), possibly audio (e.g., mp3), and possibly documents (e.g., .docx .pdf). With that in mind, I understand there's a lot of overhead associated with some of these formats, particularly office docs and PDFs. We may not want to support some of these altogether.

Besides javascript launchers in PDFs or macros in office docs, are there any other common attack vectors news orgs should know about when opening docs? Can anyone here foresee other potential problems with some of the formats I've described?


#4

I would first introduce one more question / concern: metadata. Even if a reader is not a confidential source, they might subject themselves to harassment if they uploaded an image file that was tagged with location coordinates in EXIF image metadata.

Here are some further comments on particular file types:

Document files:
Consider rendering the files to images or "clean" PDFs of images with no scripts. Micah Lee at The Intercept put some code out for redacting and cleaning PDFs that might be helpful. https://firstlook.org/code/project/pdf-redact-tools/
You might also be able to do some scripted conversion of doc(x) / ppt(x) files to "clean" PDFs using LibreOffice and its libraries. I recommend that you avoid publishing Office documents because it's a terrible reader experience and very un-web friendly. Not everyone has Office / LibreOffice, nor do they want to wait for it to load.

Office documents also carry metadata like authors, modifications, computer name, and sometimes change history. The submitter may be sharing more information than they realize.

You should also look at whether DocumentCloud or something similar should be part of your flow.

Image files:
Stripping metadata is important. There have been exploits against image processing engines, so some kind of minor tweak (as @tomrittervg suggests) might break exploit code. The exploit might then run on your submission-processing server, but it is less likely to make it to readers.
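To make the metadata point concrete, here is a small standard-library Python sketch that drops the JPEG segments where metadata usually lives: APP1 (EXIF and XMP), APP13 (Photoshop/IPTC), and COM (comments). The function name `strip_jpeg_metadata` is made up for this example. It does not re-encode the pixels, so it addresses metadata only and should not be expected to break an exploit hidden in the image data.

```python
import struct

# Segment markers that commonly carry metadata:
# 0xE1 = APP1 (EXIF, XMP), 0xED = APP13 (Photoshop/IPTC), 0xFE = COM.
METADATA_MARKERS = {0xE1, 0xED, 0xFE}


def strip_jpeg_metadata(data: bytes) -> bytes:
    """Return a copy of a JPEG with APP1, APP13 and COM segments removed.

    Fails closed: anything that does not parse as a simple sequence of
    length-prefixed segments followed by a scan raises ValueError.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    out = bytearray(b"\xff\xd8")
    pos = 2
    while pos < len(data):
        if pos + 4 > len(data) or data[pos] != 0xFF:
            raise ValueError("malformed JPEG segment header")
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of scan: everything from here on is copied verbatim,
            # including any bytes that follow the end-of-image marker.
            out += data[pos:]
            return bytes(out)
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2 or pos + 2 + length > len(data):
            raise ValueError("segment length out of range")
        if marker not in METADATA_MARKERS:
            out += data[pos:pos + 2 + length]
        pos += 2 + length
    raise ValueError("no start-of-scan marker found")
```

Because the scan data and anything after it are copied unchanged, this is a weaker guarantee than decoding the image and writing a fresh file.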

To give you an idea of what you might need to worry about on your server: there was a recent series of exploits found against ImageMagick, which is often used for image processing. https://imagetragick.com/

Video & Audio files:
I'm less knowledgeable about issues with video and audio files, but you should seek to strip metadata at a minimum. Transcoding video and audio might be necessary for practical (storage & bandwidth) reasons. This would generally rearrange the data and likely break any exploits. The audio / video processing software could trigger exploits, but this would again likely prevent readers from being exposed.

I suggest that you do the processing of submissions on a separate server / virtual machine, and periodically destroy the VM and replace it. This will help wipe out anyone that might have broken in through a file exploit, if that exploit was subsequently patched. You should also be able to monitor it decently from a security standpoint, since it does not need to do anything other than process files. If you use a DevOps automation tool like Ansible or Puppet, the work involved in creating new VMs will be minimal.


#5

One more note on metadata - Metadata Anonymisation Toolkit (MAT) will strip metadata from many file formats. I would still recommend the other transformations above to break potential exploits.

MAT is part of the Tails project, but there's also a Debian package and git repo that should work on most Linux systems.
https://mat.boum.org/


#6

Definitely strip metadata; this is the prime concern. There are many tools out there to do this (thank you other posters! thank you developers) but if you absolutely need to get foolproof, I recommend two very simple solutions:

1) Print it out, then scan it.
2) Open the file on your computer, then take a screenshot.

Both of these work on the principle that you are making a visual copy, so what you see is guaranteed to be exactly what you get.

Then read the documents and think about all the other problems that are not metadata... if you were an investigator who wanted to find the source, is there some way this material would narrow down the suspect list?

  • Jonathan

#7

Smart stuff, thanks Jason.


#8

Has anyone tried this USB sanitizer?

CIRCLean is an independent hardware solution to clean documents from untrusted (obtained) USB keys / USB sticks. The device automatically converts untrusted documents into a readable format and stores these clean files on a trusted (user owned) USB key/stick.

https://www.circl.lu/projects/CIRCLean/


#9

The general approach I'd recommend is preserving the good rather than removing the bad. In other words, extract the content from the user-submitted file and use it to create a new file. This is easier and more reliable than trying to remove malware, metadata, etc, from the original file, because you know what the content's supposed to look like, and you can be sure that your newly created file contains only what you've put into it.

For example, convert an uploaded image to an uncompressed bitmap (BMP). If your images need to be served in a compressed format like JPEG or PNG, you can convert the clean bitmap into that format. Likewise, convert an uploaded MP3 file to an uncompressed WAV. Then you can recompress to MP3 if necessary, knowing that any extra data hidden in the original MP3 has been removed.
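Here is a dependency-free Python sketch of the "extract the content, build a new file" idea. Decoding MP3 needs an external tool, so this example assumes the upload is already an uncompressed PCM WAV; the function name `rebuild_wav` is invented for illustration. It reads only the format parameters and audio frames and writes them into a brand-new file, so extra chunks in the original (such as embedded tags) are simply never copied.

```python
import io
import wave


def rebuild_wav(data: bytes) -> bytes:
    """Create a fresh WAV holding only the audio frames from `data`.

    wave.open raises wave.Error for input it cannot parse as PCM WAV,
    which is a reasonable signal to reject the upload.
    """
    with wave.open(io.BytesIO(data), "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())

    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        # Carry over only what is needed to play the audio back.
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(params.framerate)
        dst.writeframes(frames)
    return out.getvalue()
```

The same shape applies to images: decode to raw pixels, then encode a new file from those pixels and nothing else.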

To convert a PDF to a folder full of bitmaps and then back to a PDF, you can use something like the following (on Linux with ImageMagick):


convert -density 300 dirty.pdf page-%02d.bmp
convert -density 300 page-*.bmp -compress lzw clean.pdf

LibreOffice can convert DOCX and other supported formats to PDF on the command line using something like the following:

loffice --headless --convert-to pdf dirty.docx

I second @jason_nstar's point that conversion should be done inside some kind of sandbox, otherwise malware in the original files may infect the system doing the conversion. Recent vulnerabilities in ImageMagick are very relevant here.
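As a small complement to the sandboxing advice, here is a hedged Python sketch of running a converter as a child process with a wall-clock timeout, CPU and output-size limits, a scrubbed environment, and a dedicated working directory. The function name `run_confined` and the default limits are assumptions for this example, and it is Unix-only because it uses the `resource` module. It is not a sandbox by itself and does not replace a disposable VM or container; it only bounds how long a conversion can run and how much it can write.

```python
import resource
import subprocess


def _cap(kind: int, value: int) -> None:
    """Lower a resource limit, never asking for more than the hard limit."""
    _soft, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def run_confined(argv, workdir, cpu_seconds=60,
                 max_file_bytes=512 * 1024 * 1024, wall_seconds=120):
    """Run `argv` inside `workdir` with basic resource limits applied.

    Raises subprocess.TimeoutExpired if the wall-clock limit is hit.
    """
    def apply_limits():
        # Runs in the child just before exec.
        _cap(resource.RLIMIT_CPU, cpu_seconds)
        _cap(resource.RLIMIT_FSIZE, max_file_bytes)

    return subprocess.run(
        argv,
        cwd=workdir,
        # Minimal environment; HOME points at the scratch directory so the
        # converter does not read or write the real user profile.
        env={"PATH": "/usr/bin:/bin", "HOME": workdir},
        preexec_fn=apply_limits,
        capture_output=True,
        timeout=wall_seconds,
        check=False,
    )
```

With the upload copied into a fresh scratch directory as `dirty.docx`, the LibreOffice command above could be run as `run_confined(["loffice", "--headless", "--convert-to", "pdf", "dirty.docx"], scratch_dir)`, after which only the resulting PDF is copied out and the directory is deleted.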