Unarchiving S3 ZIP Files --- Part 3: The Pragmatic Solution

TL;DR

Streaming ZIP files from S3 is usually the right approach. It’s
efficient, scalable, and widely used. But ZIP archives sometimes require
random access to extract reliably. When extraction cannot fail, the
safest solution is to provide stable random access, either through a
full download or through S3 Range requests. In our case, serving byte
ranges to a random-access ZIP reader (yauzl) restored reliability
completely.


Streaming Works. Until It Doesn’t.

Streaming was the initial approach because it minimized disk and memory usage and scaled well with large files. The ZIP was read directly from S3 and passed to the unzip library.

However, certain ZIP files failed mid-extraction. Once the failure occurred, extraction stopped and no subsequent entries could be processed.

Since extraction could not be allowed to fail, partial results were unacceptable. A more reliable access method was required to guarantee complete extraction.


What Is S3 Random Access?

Most developers think of S3 as a simple object store: you request a
file, and it streams from beginning to end. But S3 supports something
much more powerful — byte-range access.

Using the HTTP Range header, you can request specific portions of a
file instead of downloading the entire object.

For example:

Range: bytes=0-1023

or

Range: bytes=1048576-2097151

You can even request the end of the file directly:

Range: bytes=-65536
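
To make the semantics concrete, here is a tiny hypothetical helper (not
part of any SDK) that builds these header values. One detail worth
remembering: unlike most slicing APIs, HTTP ranges are inclusive on
both ends.

```typescript
// Build an HTTP Range header value for an inclusive byte range.
// Note: both `start` and `end` are inclusive, so bytes=0-1023
// requests exactly 1024 bytes.
function rangeHeader(start: number, end: number): string {
  return `bytes=${start}-${end}`;
}

// Suffix form: request only the last `length` bytes of the object.
function suffixRangeHeader(length: number): string {
  return `bytes=-${length}`;
}

console.log(rangeHeader(0, 1023));     // → "bytes=0-1023"
console.log(suffixRangeHeader(65536)); // → "bytes=-65536"
```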

Conceptually, the two access patterns compare like this:

Sequential stream:
S3 → byte 0 → byte 1 → byte 2 → ... → byte N

Random access:
S3 → byte 9,000,000 → byte 1,200 → byte 400,000 → byte 800

This matters because ZIP archives store their index — the central
directory — at the end of the file. With range requests, a library can
read that index first, then jump directly to the correct file offsets.
This enables reliable extraction without downloading everything upfront.
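
As a simplified sketch of what such a library does under the hood
(real readers also handle ZIP64 and variable-length archive comments),
the End of Central Directory record can be located by scanning the
file’s tail backwards for its signature:

```typescript
import { Buffer } from "node:buffer";

// The End of Central Directory (EOCD) record starts with the
// little-endian signature 0x06054b50 ("PK\x05\x06") and sits at the
// end of the archive, possibly followed by a comment.
const EOCD_SIGNATURE = 0x06054b50;

// Scan backwards through the tail of the archive. The fixed part of
// the EOCD is 22 bytes, so that is the last possible start offset.
function findEocdOffset(tail: Buffer): number {
  for (let i = tail.length - 22; i >= 0; i--) {
    if (tail.readUInt32LE(i) === EOCD_SIGNATURE) {
      return i;
    }
  }
  return -1; // not a valid ZIP tail
}

// Demo: a completely empty ZIP archive is just a 22-byte EOCD record.
const emptyZip = Buffer.alloc(22);
emptyZip.writeUInt32LE(EOCD_SIGNATURE, 0);

console.log(findEocdOffset(emptyZip)); // → 0
```

This is exactly why a suffix range like `bytes=-65536` is useful: it
fetches enough of the tail to find the EOCD without touching the rest
of the object.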

In other words, S3 can behave less like a stream and more like a
filesystem — if the library supports it.


Libraries That Support This (yauzl)

Some ZIP libraries are designed with random access in mind. One example
is yauzl.

Instead of assuming a forward-only stream, yauzl reads the central
directory first and uses that information to locate files precisely. It
can work with custom readers, including readers backed by S3 Range
requests.

This allows:

  • Reliable extraction
  • No full in-memory buffering
  • No dependency on sequential streaming
  • Proper handling of complex ZIP structures

This approach preserves many benefits of streaming while restoring
structural reliability.

It does require more implementation effort, since you must provide a
reader capable of fetching byte ranges from S3. But architecturally, it
aligns much better with how ZIP files are designed.


The Solution We Chose

In our case, reliability mattered more than preserving pure streaming semantics.

Instead of treating the ZIP as a forward-only stream, we switched to a random-access approach using yauzl.

Conceptually:

S3 → Range Requests → yauzl (RandomAccessReader) → Extract

Rather than downloading the entire archive upfront, we allowed the library to request specific byte ranges from S3 as needed. This enabled it to read the central directory first, resolve offsets correctly, and access entries deterministically.

Once extraction was aligned with the ZIP format’s expectations, the failures stopped.

Nothing about the ZIP file changed. Only the access pattern did.

    import {
        S3Client,
        GetObjectCommand,
    } from "@aws-sdk/client-s3";
    import yauzl from "yauzl";
    import { PassThrough, Readable } from "stream";

    const s3 = new S3Client({ region: "ap-south-1" });

    class S3RandomAccessReader extends yauzl.RandomAccessReader {
        constructor(
            private bucket: string,
            private key: string,
            private size: number
        ) {
            super();
        }

        // yauzl passes an exclusive `end`, while the HTTP Range header
        // is inclusive on both sides, hence `end - 1`.
        _readStreamForRange(start: number, end: number): Readable {
            const passthrough = new PassThrough();

            const command = new GetObjectCommand({
                Bucket: this.bucket,
                Key: this.key,
                Range: `bytes=${start}-${end - 1}`,
            });

            // yauzl expects a readable stream synchronously, so return
            // a PassThrough now and pipe the S3 body into it once the
            // response arrives.
            s3.send(command)
                .then((res) => {
                    if (!res.Body) {
                        throw new Error("Empty S3 body");
                    }
                    (res.Body as Readable).pipe(passthrough);
                })
                .catch((err) => passthrough.destroy(err));

            return passthrough;
        }
    }

    export function extractFromS3RandomAccess(
        bucket: string,
        key: string,
        size: number
    ): Promise<void> {
        const reader = new S3RandomAccessReader(bucket, key, size);

        return new Promise((resolve, reject) => {
            yauzl.fromRandomAccessReader(
                reader,
                size,
                { lazyEntries: true },
                (err, zipfile) => {
                    if (err || !zipfile) return reject(err);

                    zipfile.on("error", reject);
                    zipfile.on("end", () => resolve());

                    zipfile.on("entry", (entry) => {
                        console.log("File:", entry.fileName);

                        zipfile.openReadStream(entry, (err, stream) => {
                            if (err || !stream) return reject(err);

                            // Read the next entry once this one drains.
                            stream.on("end", () => zipfile.readEntry());

                            // Process the file stream here
                            stream.resume();
                        });
                    });

                    zipfile.readEntry();
                }
            );
        });
    }

Why This Was the Right Choice for Our Case

This approach worked well for our constraints because:

  • Extraction reliability was mandatory

    • Partial extraction or mid-archive failures were not acceptable.
  • It aligned with ZIP’s structural requirements

    • yauzl could read the central directory first and resolve correct offsets.
  • It enabled true random access using S3 Range requests

    • The library fetched only the required byte ranges instead of relying on forward-only streaming.
  • It avoided full archive downloads

    • No need to materialize the entire ZIP on disk before extraction.
  • It preserved scalability characteristics of streaming

    • Disk usage remained minimal, and large archives could be processed efficiently.
  • It provided deterministic and repeatable extraction behavior

    • The same archive that failed under streaming extracted consistently with random access.

This was not a one-size-fits-all solution, but it was the most reliable and efficient approach for our requirements.


Streaming Is Still the Preferred Default

Streaming remains the best default in most systems. It is an
industry-standard approach that minimizes disk usage, scales naturally,
and keeps pipelines efficient.

This situation was an edge case created by the interaction between ZIP
structure and access patterns. It doesn’t invalidate streaming as a
strategy. It simply highlights that some formats expect capabilities
beyond forward-only reads.

Given the choice, streaming is still preferable.


Author’s Note: If You Can Choose the Format, Choose Streaming-Friendly Ones

Some archive and data formats are naturally compatible with streaming:

  • gzip
  • tar.gz
  • tar
  • newline-delimited JSON (ndjson)

These formats are designed for sequential processing. They don’t rely on
centralized metadata located at the end of the file.
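
To illustrate the contrast, gzip can be decompressed in a single
forward pass using nothing but Node’s built-in zlib. A minimal sketch
(the demo round-trips an in-memory buffer rather than an S3 object):

```typescript
import { gzipSync, createGunzip } from "node:zlib";
import { Readable } from "node:stream";

// Decompress a gzip stream in one forward pass. No seeking is ever
// required, which is why gzip streams cleanly straight from S3.
async function gunzipStream(source: Readable): Promise<Buffer> {
  const chunks: Buffer[] = [];
  for await (const chunk of source.pipe(createGunzip())) {
    chunks.push(chunk as Buffer);
  }
  return Buffer.concat(chunks);
}

// Demo: compress in memory, then decompress as a stream.
const original = Buffer.from("hello, streaming world");
const compressed = gzipSync(original);

gunzipStream(Readable.from([compressed])).then((out) => {
  console.log(out.toString()); // → "hello, streaming world"
});
```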

ZIP, while extremely common, was originally designed around filesystem
access. It works best when random access is available.

If your system depends heavily on streaming, format choice matters more
than it initially appears.


Closing

The ZIP file wasn’t broken. S3 wasn’t broken. Streaming wasn’t broken.

But the combination exposed a structural mismatch between the file
format and the access pattern.

Once access aligned with the format’s expectations, extraction became
completely reliable again.

Sometimes the solution isn’t changing the file — it’s changing how you
read it.


:speech_balloon: Feedback

If you have suggestions, insights, or alternative approaches, feel free to leave a comment.

:books: Series: Unarchiving S3 ZIP Files

:folded_hands: Credits

  • Proof Read Credits: Swasthik
  • Thank you for helping me come up with the solution: Swasthik, Sonal, Ashwath, Sandhyashri