How to Implement Multi-file Downloads in Ruby Web Apps

Streaming tar and zip file downloads on-the-fly

Simon Johnson
Better Programming


Image credit: Author

One of my first challenges for Weill Cornell Medicine was to develop a file distribution portal that allowed genomic sequencing labs to distribute big data files to collaborators over the web. A core requirement was a pick-list feature so users could create sets of files to be downloaded all at once from a browser.

The ability to download multiple files from a single browser click has long been missing from HTTP, but it is a requisite feature for most web-based file services. The limitation comes from the request/response model, in which each response carries a single resource, so a browser download is always one file or stream. HTTP/3 promises more flexible multiplexing because it sits on top of the QUIC protocol rather than TCP and so avoids TCP's head-of-line blocking, but it will still be some time before there is widespread adoption and new standards for browser file downloads.

Providing lists of file URLs to browser extensions and download manager software is always an option, but it’s a hard sell and especially inconvenient for one-off visitors who don’t want to install new software just to download a single set of data. The best workaround we have is to archive (e.g. tar or zip) the files into a single file for download, and this is still what’s used by Google Drive, MS SharePoint, and all the big names.

The question then becomes: archive first and download after or archive on the fly with a streaming download? Archiving to a temp file and then serving the static file from disk is appealing because it’s familiar and simple to implement. The real challenge, however, is implementing everything after the archive has been created, for example: How long are the users going to wait and how are you going to notify them when the download is ready? How long will you keep the file available for download? What if files change after archives are created but before they’re completely downloaded? It is only after all of these considerations that the streaming alternative becomes the clear winner.

Streaming Tar Files with Ruby Rack Socket Hijacking

Downloading tar files requires slightly more technical competence from your users than zip files. Tar extraction has been natively supported by macOS for some time but only more recently by Windows, and for the average web user tar files are not as ubiquitous or familiar as their zip counterparts. If your users are more technical, as in my case with the bioinformatics data distribution portal, tar is often the preferred format.

The main difference between the tar and zip formats is that tar is not compressed, making the structure of a tar file relatively simple, as outlined below. Each entry in a tar file has a header containing the metadata, followed by one or more content blocks of file data. This uncompressed, sequential structure simplifies streaming because each file can be read off disk and appended as a new entry without any special processing.

© The Apache Software Foundation from Jackrabbit Oak docs
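
To make that layout concrete, here is a minimal sketch that writes tar entries by hand using only Ruby's standard library. It assumes the common ustar header layout with 512-byte blocks; the method names (tar_header, write_tar_entry, finish_tar) are my own, and long file names and files of 8 GiB or more are deliberately left out.

```ruby
require 'stringio'

# Builds one 512-byte ustar header for a regular file.
# Sketch limits: names up to 100 bytes, sizes below 8 GiB (11 octal digits).
def tar_header(name, size, mtime: 0, mode: 0o644)
  fields = [
    name,                     # file name (100 bytes)
    format('%07o', mode),     # permissions as octal text
    format('%07o', 0),        # owner uid
    format('%07o', 0),        # owner gid
    format('%011o', size),    # content size in bytes as octal text
    format('%011o', mtime),   # modification time, seconds since the epoch
    ' ' * 8,                  # checksum placeholder (spaces while summing)
    '0',                      # type flag: regular file
    '',                       # link name
    'ustar',                  # magic, NUL-padded to 6 bytes by pack
    '00',                     # version
    '', '', '', '', '', ''    # uname, gname, devmajor, devminor, prefix, padding
  ]
  header = fields.pack('a100a8a8a8a12a12a8a1a100a6a2a32a32a8a8a155a12')
  # The checksum is the sum of all header bytes, stored at offset 148.
  header[148, 8] = format('%06o', header.bytes.sum) + "\0 "
  header
end

# Appends one entry: header, raw content, then NUL padding to a full block.
def write_tar_entry(out, name, source, size)
  out.write(tar_header(name, size))
  while (chunk = source.read(64 * 1024))
    out.write(chunk)
  end
  out.write("\0" * ((512 - size % 512) % 512))
end

# A tar archive ends with two empty 512-byte blocks.
def finish_tar(out)
  out.write("\0" * 1024)
end
```

Because nothing is ever rewritten, `out` can be anything that responds to `write`, including a network socket.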

When working with big data in bioinformatics, for example, any (text-based) files that can be significantly reduced in size by compression (e.g. gzip) are typically already compressed and travel around as compressed files. For this reason, the gains in speed from extracting large uncompressed tar files far outweigh any slight gain in size reduction from compressing already compressed files.

When it came to selecting a stack, Ruby was my first choice only because the other legacy apps we had to integrate with were written in Rails. Thanks to the uncompressed, sequential structure of tar files, streaming turns out to be ridiculously straightforward.

Rack is the underlying interface behind Rails that provides the basic, bare-bones interaction with HTTP. A little-known feature of Rack is the socket hijacking API that allows direct writing to the socket from a Ruby IO stream. Now by adding the few lines of code below to the config.ru we have a tar streaming solution.
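
As a rough idea of what such a config.ru can look like, here is a minimal sketch using Rack's partial hijacking, where the app returns a `rack.hijack` response header containing a callable that receives the client socket. It is not the original code from our portal: the method name tar_stream_app and the file name in the download header are my own, I assume something upstream has already placed a validated path in env['tar_path'], and I use tar's explicit `-f -` form to send the archive to standard output.

```ruby
# A Rack app is any object that responds to call(env); a lambda is enough.
def tar_stream_app
  lambda do |env|
    # Not every app server supports hijacking, so check before relying on it.
    unless env['rack.hijack?']
      return [501, { 'content-type' => 'text/plain' }, ['Hijacking not supported']]
    end

    path = env.fetch('tar_path') # assumed to be set and validated upstream

    # Partial hijack: the server sends the status and headers, then hands
    # the raw client socket to this callable instead of iterating a body.
    stream = lambda do |socket|
      begin
        # The array form of popen bypasses the shell, so the path is never
        # interpreted as shell syntax.
        IO.popen(['tar', '-c', '-f', '-', path], 'rb') do |tar|
          IO.copy_stream(tar, socket)
        end
      ensure
        socket.close
      end
    end

    headers = {
      'content-type' => 'application/x-tar',
      'content-disposition' => 'attachment; filename="download.tar"',
      'rack.hijack' => stream
    }
    [200, headers, []]
  end
end

# In config.ru (not executed here): run tar_stream_app
```

The tar process writes straight to the socket through IO.copy_stream, so the Ruby process never holds more than a buffer's worth of the archive in memory.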

Depending on your use case you may need to spend some time fine-tuning the app server worker memory and Nginx/Apache buffers if you’re using a reverse proxy. In our case, we were streaming very large files (100GB+) and had some success using the Linux Pipe Viewer utility to limit the tar output speed and provide more bandwidth for concurrent downloads. Our tar command looked more like this:

tar --to-stdout -c #{env['tar_path']} | pv -q -L #{env['pv_limit']}
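
Since the command above interpolates values into a shell string, it is worth escaping them if they can ever be influenced by a request. A minimal sketch, assuming the same `pv -q -L` flags as above and a hypothetical helper name tar_pipeline of my own:

```ruby
require 'shellwords'

# Builds the tar | pv pipeline as a shell string with both values escaped.
# rate_limit is passed to pv -L as bytes per second.
def tar_pipeline(path, rate_limit)
  "tar -c -f - #{Shellwords.escape(path)} | pv -q -L #{Shellwords.escape(rate_limit.to_s)}"
end
```

For example, `tar_pipeline('/data/my file', 10_000_000)` escapes the space in the path so the shell still sees it as a single argument.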
Source: Sinatra

Streaming Zip Files with Ruby Sinatra

TL;DR: Zip streaming is only slightly trickier. Check out the zip_tricks library examples or our FileSlide Streamer microservice.

Recently I was developing a more generic, larger-scale data distribution SaaS and once again we needed a multiple file download feature but this time because our users were less technical we had to deliver zip files rather than tar files.

My first thought was there must surely be some kind of service out there that can handle this for us whereby we pass a few URIs and it sends back a zip. A full morning searching the web did not bear fruit — I found plenty of file uploading and transforming/transcoding services such as FileStack that could zip files after they’ve been uploaded but they charged by bandwidth so this was like cracking a nut with a sledgehammer.

I also thought maybe there was some way to use a cloud backup service such as Backblaze to back up a selection of files into a zip and extract the links, but from what I could gather only entire images could be zipped rather than an arbitrary selection of files.

Giving up on finding a drop-in service, my next plan of attack was to try and work out what stacks and libraries the big players use and hope that one of them is open source. Without searching for Ruby in particular, I quickly came across a fantastic presentation by Julik Tarkhanov, who implemented the zip streaming solution for WeTransfer that zips millions of files per day.

WeTransfer has a neat open source library called zip_tricks that handles all of their zipping-on-the-fly and conveniently for me it is written in Ruby.

With Julik’s presentation, the zip_tricks library, and my experience with Rack tar streaming I set out to build a zip streaming microservice. I was planning for a limited feature set — parse requests, validate parameters, return useful errors, fetch files and call webhooks — enough that I didn’t want to be doing it all in Rack but not really enough to justify a full-blown Rails app. This is where Ruby Sinatra chimes (or sings) in as a nice in-between.

After a couple of days working with Sinatra examples and the zip_tricks documentation I just couldn’t get my code to deliver the zipped byte stream — it kept returning either empty files or never-ending responses. Deadlines were fast approaching so I called on the expertise of Wander Hillen for help and he ended up doing all of the heavy lifting for our FileSlide Streamer app.

Before looking at the code it’s worthwhile understanding the basic structure of a zip file. Once again we have a sequential set of entries, each with a header and content data but in addition we have a Central Directory at the end of the file which is scanned by apps to quickly display the contents of the archive without having to read the whole file.

© John Yamich from WikiMedia Commons

Because the Central Directory is at the end of the file, we can still stream the data sequentially, keep track of the metadata for each file, and write the Central Directory last. The other major difference from tar, however, is that each local file header contains a couple of fields that must be written before the file data but are tricky to calculate up-front. Firstly, the CRC-32 field is a checksum of the uncompressed file. The only issue here is that it can take some time to compute for large files, and we don’t want the stream waiting too long before it starts. Secondly, the Compressed size field records the size of the compressed file, and the challenge is that we need to know this before we write the file’s data to the zip stream.

In a typical zip creation scenario, the library would write the compressed file, calculate the size, rewind and update the header before closing the file. Any kind of rewinding obviously isn’t going to work for a streaming application and this is where zip_tricks does some very clever processing by writing out a fake archive to estimate sizes and assemble the stream without rewinding.
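
To make the bookkeeping concrete, here is a minimal sketch of a sequential zip writer using only Ruby's standard library. It is not how zip_tricks or our app works internally: instead of computing values up-front, it relies on the zip format's data descriptor record (general purpose flag bit 3), which lets the CRC-32 and sizes follow each file's data. The method name stream_zip is my own, entries are stored without compression, timestamps are fixed, and Zip64 is left out, so every file and the whole archive must stay under 4 GiB.

```ruby
require 'zlib'
require 'stringio'

# Writes files (a hash of name => IO-like source) to out as a zip archive,
# strictly front to back, so out only needs to respond to write.
def stream_zip(out, files, chunk_size: 64 * 1024)
  offset = 0          # bytes written so far
  directory = []      # metadata remembered for the Central Directory
  flags = 0x0008      # bit 3: CRC and sizes follow the data
  dos_time = 0
  dos_date = 0x21     # 1980-01-01, a fixed placeholder date

  files.each do |name, source|
    name = name.b
    # Local file header: CRC-32 and both sizes are left as zero for now.
    local = [0x04034b50, 20, flags, 0, dos_time, dos_date,
             0, 0, 0, name.bytesize, 0].pack('VvvvvvVVVvv') + name
    out.write(local)

    crc = 0
    size = 0
    while (chunk = source.read(chunk_size))
      crc = Zlib.crc32(chunk, crc) # running checksum, updated per chunk
      size += chunk.bytesize
      out.write(chunk)
    end

    # Data descriptor: the values we could not know before streaming.
    out.write([0x08074b50, crc, size, size].pack('VVVV'))
    directory << { name: name, crc: crc, size: size, offset: offset }
    offset += local.bytesize + size + 16
  end

  # Central Directory: one record per entry, pointing back at its header.
  cd_start = offset
  cd_size = 0
  directory.each do |e|
    record = [0x02014b50, 20, 20, flags, 0, dos_time, dos_date,
              e[:crc], e[:size], e[:size], e[:name].bytesize,
              0, 0, 0, 0, 0, e[:offset]].pack('VvvvvvvVVVvvvvvVV') + e[:name]
    out.write(record)
    cd_size += record.bytesize
  end

  # End of Central Directory record closes the archive.
  out.write([0x06054b50, 0, 0, directory.size, directory.size,
             cd_size, cd_start, 0].pack('VvvvvVVv'))
end
```

Compression, Zip64 and everything else needed for real-world archives is where a battle-tested library earns its keep.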

So with zip_tricks and just a few more lines of code outlined below, we can get a basic zip streaming app up and running in Sinatra.

For a more in-depth explanation on how the libraries interact with Rack take a look at Wander’s post.

Range Requests and Resuming Zip Streams

Most modern browsers support download resuming by sending an HTTP Range request header, for example Range: bytes=256-1023. When serving static files from disk, this simply involves opening the file and reading the corresponding byte range. When streaming files, however, we need to start the stream from the beginning, discard everything up to the start of the range (while keeping the client on hold), and then serve the remaining length.
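
The discard-then-serve approach described above can be sketched in a few lines. This is only an illustration, not the code from our microservice: parse_byte_range and range_window are names of my own, and only the simple single-range forms bytes=START-END and bytes=START- are handled.

```ruby
require 'stringio'

# Parses a single-range header value. Returns [first, last], where last is
# nil for an open-ended range, or nil if the header is missing or unsupported.
def parse_byte_range(header)
  match = /\Abytes=(\d+)-(\d*)\z/.match(header.to_s)
  return nil unless match

  first = match[1].to_i
  last = match[2].empty? ? nil : match[2].to_i
  return nil if last && last < first

  [first, last]
end

# Returns a callable that accepts the archive chunk by chunk, from the very
# beginning, and forwards only the bytes inside [first, last] to out.
def range_window(out, first, last)
  position = 0 # offset of the next chunk within the full stream
  lambda do |chunk|
    chunk_start = position
    position += chunk.bytesize
    from = [first - chunk_start, 0].max
    to = last ? [last - chunk_start, chunk.bytesize - 1].min : chunk.bytesize - 1
    out.write(chunk.byteslice(from..to)) if from <= to
  end
end
```

A real response would also need a 206 Partial Content status and a Content-Range header, which means knowing the total size of the archive before streaming starts.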

With the prevalence of high-speed Internet, the only real use case for resuming streams is when the zip is very large (gigabytes) and therefore composed of either a few large files or a large number of smaller files. Either way, reconstituting a big zip file only to discard some portion of it is inefficient and can cause the browser to time out while waiting for the specific range. In addition to this, our microservice does not have the files sitting around on disk but has to fetch them from remote servers, making the wait time issue even more critical.

To work around this, Wander implemented a method for simulating files with placeholder objects to get the stream moving as quickly as possible. This example highlights the beauty of Sinatra: you have direct access to the requests and responses to process and tune to your own use case.

Whether zipping or tarring, file archiving is still the best option for allowing users to download multiple files with a single click, and for now it looks like it’s here to stay. Ruby Rack is well set up for streaming directly from the OS, and open source, battle-tested libraries like zip_tricks make streaming zip files a cinch. To see it in the wild, check out our FileSlide Streamer microservice on GitHub.
