Unarchiving S3 ZIP Files: It Worked, Then Disaster Struck - Part 1 🪣

TL;DR

Extracting ZIP files from S3 isn’t just an unzip operation — it’s an architectural choice.

There are three ways to do it: streaming, buffering in memory, or writing to disk. Each shifts the burden between memory usage, performance, and reliability.

Streaming feels like the most efficient option — especially for large archives — but ZIP files are structured in ways that often assume random access. That mismatch can cause unexpected failures, cryptic errors, and fragile behavior depending on the unzip library.

The key lesson: your extraction strategy and library must align with how ZIP files actually work, not just what looks optimal on paper.

What worked perfectly at first can fail under real-world conditions — and understanding why requires looking beyond S3 and into the ZIP format itself.


Here’s what happened — and why the obvious solution wasn’t actually the safest one.

Unarchiving ZIP files from S3 started out as one of those tasks you barely think about. Pull the object, unzip it, move on. I wired everything up, hit deploy, and watched files stream out exactly as expected. Green logs. Clean runs.

Then, without warning, disaster struck.

A job that “worked” suddenly didn’t. A cryptic error replaced progress. And what looked like a solved problem turned into a deep dive through ZIP internals, streaming assumptions, and infrastructure trade-offs I hadn’t planned on learning that week.

It wasn’t just the ZIP file that mattered — it was how I was trying to read it. Where the bytes flowed, how much of the file was available at any given moment, and what the unzip library silently expected all started to matter.

That’s when it became clear: extracting a ZIP file from S3 isn’t just about unzipping — it’s about choosing the right access strategy.


Why Streaming Seemed Like the Perfect Choice

In my case, the choice felt obvious. I was running on ECS, had stable networking, and was dealing with large ZIP archives — often exceeding 10 GB in compressed size — where buffering everything into memory felt wasteful and disk I/O felt unnecessary.

I wanted something quick, clean, and reasonably optimized.

Streaming fit perfectly: use unzipper, pipe the S3 object directly into the unzip logic, and process files as they arrive. No intermediate storage. No large memory allocations. No unnecessary disk writes.

On paper, it checked every box.

That assumption — that streaming was the safest and simplest option in this setup — is what made the failure so surprising.


The Three Ways to Unzip an S3 Object

When a ZIP file lives in S3, you effectively have three extraction strategies, each optimized for a different axis: memory usage, performance, or reliability.

Understanding these trade-offs is critical, because the extraction method directly affects scalability, stability, and error resilience.

  1. Stream directly from S3
  2. Buffer the entire archive in memory
  3. Write the file to disk, then extract

They may produce the same output, but they behave very differently under pressure.


:one: Streaming Directly from S3

S3 Object → Network Stream → Decompression → Output

This is usually the first approach developers reach for — and the one most tutorials demonstrate.

Example:

  import {
    S3Client,
    GetObjectCommand,
  } from "@aws-sdk/client-s3";
  import { Upload } from "@aws-sdk/lib-storage";
  import unzipper from "unzipper";

  const s3 = new S3Client({ region: "ap-south-1" });

  export async function extractAndUploadStreaming(
    sourceBucket: string,
    sourceKey: string,
    targetBucket: string,
    targetPrefix: string
  ): Promise<void> {
    const response = await s3.send(
      new GetObjectCommand({
        Bucket: sourceBucket,
        Key: sourceKey,
      })
    );

    if (!response.Body) {
      throw new Error("Empty S3 response");
    }

    const zipStream = response.Body as NodeJS.ReadableStream;

    await new Promise<void>((resolve, reject) => {
      zipStream
        .pipe(unzipper.Parse())
        .on("entry", async (entry) => {
          const fileName = entry.path;

          if (entry.type === "Directory") {
            entry.autodrain();
            return;
          }

          try {
            // PutObject requires a known Content-Length, so use the
            // multipart Upload helper for entry streams of unknown size.
            await new Upload({
              client: s3,
              params: {
                Bucket: targetBucket,
                Key: `${targetPrefix}/${fileName}`,
                Body: entry, // stream directly to S3
              },
            }).done();
          } catch (err) {
            entry.autodrain();
            reject(err);
          }
        })
        .on("close", resolve)
        .on("error", reject);
    });
  }

:white_check_mark: Pros

  • Minimal memory footprint
    Only small chunks are held in memory, making this suitable for very large ZIP files.

  • Early data availability
    Extraction can begin as soon as bytes arrive, reducing perceived latency.

  • No disk I/O
    Avoids EBS or /tmp writes, which can become bottlenecks at scale.

  • Cost-efficient for large files
    No need to overprovision memory or storage just to unzip.

  • Pipeline-friendly
    Works well when extracted files are immediately streamed elsewhere (another S3 bucket, message queue, etc.).

:cross_mark: Cons

  • ZIP format friction
    A ZIP's authoritative metadata lives in the central directory at the end of the archive. Local headers make forward extraction possible, but many libraries still depend on central-directory metadata for validation, offsets, and consistency checks, none of which a forward-only stream can reach until the very end.

  • Library fragility
    Many unzip libraries claim streaming support but still assume:

    • random seeks
    • known file offsets
    • pre-read metadata
  • Cryptic failure modes
    Errors like Z_BUF_ERROR or invalid distance code often appear far from the real cause.

  • No mid-file retries
    A network hiccup usually means restarting from byte zero.

  • Extraction order is fixed
    You process files in ZIP order, not business priority order.

  • Backpressure sensitivity
    Slow consumers downstream can stall decompression and amplify memory spikes.


Best fit:
Large archives, stable networking, ECS/EC2 environments, and teams comfortable with ZIP internals.
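To make the central-directory friction concrete, here is a small sketch (mine, not from the setup above) of parsing the End of Central Directory (EOCD) record. With S3 you would fetch only the object's tail via a ranged `GetObjectCommand` (e.g. `Range: "bytes=-65536"`), which is exactly the random access a pure forward stream cannot give you.

```typescript
// Sketch: parse the End of Central Directory (EOCD) record from the tail
// bytes of a ZIP. In practice the tail would come from a ranged S3 GET
// (Range: "bytes=-65536"), since the EOCD comment can be up to 64 KB.
export interface EocdInfo {
  entryCount: number;       // total entries in the central directory
  centralDirSize: number;   // size of the central directory in bytes
  centralDirOffset: number; // offset of the central directory in the file
}

export function parseEocd(tail: Buffer): EocdInfo {
  // Scan backwards for the EOCD signature 0x06054b50 ("PK\x05\x06");
  // the fixed part of the record is 22 bytes long.
  for (let i = tail.length - 22; i >= 0; i--) {
    if (tail.readUInt32LE(i) === 0x06054b50) {
      return {
        entryCount: tail.readUInt16LE(i + 10),
        centralDirSize: tail.readUInt32LE(i + 12),
        centralDirOffset: tail.readUInt32LE(i + 16),
      };
    }
  }
  throw new Error("EOCD record not found: truncated or not a ZIP");
}
```

A streaming reader only sees these fields after consuming the entire archive, which is why libraries that insist on reading them first cannot run over a forward-only stream.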


:two: Buffering the Entire ZIP in Memory

S3 Object → Memory Buffer → Decompression

This approach downloads the full archive into memory before extraction begins.

Example:

  import {
    S3Client,
    GetObjectCommand,
    PutObjectCommand,
  } from "@aws-sdk/client-s3";
  import AdmZip from "adm-zip";
  import { Readable } from "stream";

  const s3 = new S3Client({ region: "ap-south-1" });

  async function streamToBuffer(stream: Readable): Promise<Buffer> {
    const chunks: Buffer[] = [];
    for await (const chunk of stream) {
      chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
    }
    return Buffer.concat(chunks);
  }

  export async function extractAndUploadBuffer(
    sourceBucket: string,
    sourceKey: string,
    targetBucket: string,
    targetPrefix: string
  ): Promise<void> {
    const response = await s3.send(
      new GetObjectCommand({
        Bucket: sourceBucket,
        Key: sourceKey,
      })
    );

    if (!response.Body) throw new Error("Empty S3 response");

    const zipBuffer = await streamToBuffer(response.Body as Readable);
    const zip = new AdmZip(zipBuffer);

    for (const entry of zip.getEntries()) {
      if (entry.isDirectory) continue;

      const fileBuffer = entry.getData();

      await s3.send(
        new PutObjectCommand({
          Bucket: targetBucket,
          Key: `${targetPrefix}/${entry.entryName}`,
          Body: fileBuffer,
        })
      );
    }
  }

:white_check_mark: Pros

  • Immediate access to all metadata
    Central directory, offsets, compression flags — everything is available upfront.

  • Predictable behavior
    Entire classes of streaming-related ZIP errors simply disappear.

  • Fast random access
    Jumping between entries is cheap once everything lives in RAM.

  • Clearer failure semantics
    Corrupt archives usually fail early and loudly.

:cross_mark: Cons

  • High memory usage
    Memory scales linearly with archive size.

  • OOM risk
    One unexpectedly large ZIP can crash the process.

  • Poor concurrency scaling
    Multiple parallel extractions multiply memory pressure quickly.

  • Lambda cost penalties
    Large memory allocations increase cold-start time and billing.

  • Wasteful for partial reads
    Even if you only need one file, you pay to buffer the entire archive.

Best fit:
Small-to-medium ZIP files, low concurrency workloads, and situations where simplicity beats scalability.
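A common mitigation for the OOM risk above is to check the object's size before committing to buffering. A sketch, using only a pure decision function (the 512 MB cap is an assumed threshold you would tune, and the size would come from something like `HeadObjectCommand`'s `ContentLength`):

```typescript
// Sketch: pick buffering only when the archive size is known and small.
// MAX_BUFFERED_ZIP_BYTES is an assumed threshold, not a recommendation.
const MAX_BUFFERED_ZIP_BYTES = 512 * 1024 * 1024; // 512 MB

export function chooseStrategy(
  contentLength: number | undefined // e.g. HeadObjectCommand's ContentLength
): "buffer" | "disk" {
  if (contentLength === undefined) return "disk"; // unknown size: don't risk OOM
  return contentLength <= MAX_BUFFERED_ZIP_BYTES ? "buffer" : "disk";
}
```

One HEAD request per archive is cheap insurance against the "one unexpectedly large ZIP crashes the process" failure mode.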


:three: Writing to Disk Before Unzipping

S3 Object → Disk → Decompression → Cleanup

The most traditional approach — and still the most predictable.

Example:

  import {
    S3Client,
    GetObjectCommand,
    PutObjectCommand,
  } from "@aws-sdk/client-s3";
  import yauzl from "yauzl";
  import fs from "fs";
  import path from "path";
  import { pipeline } from "stream/promises";

  const s3 = new S3Client({ region: "ap-south-1" });

  export async function extractAndUploadDisk(
    sourceBucket: string,
    sourceKey: string,
    tempZipPath: string,
    targetBucket: string,
    targetPrefix: string
  ): Promise<void> {
    // Step 1 — Download ZIP to disk
    const response = await s3.send(
      new GetObjectCommand({
        Bucket: sourceBucket,
        Key: sourceKey,
      })
    );

    if (!response.Body) throw new Error("Empty S3 response");

    await pipeline(
      response.Body as NodeJS.ReadableStream,
      fs.createWriteStream(tempZipPath)
    );

    // Step 2 — Extract using yauzl
    await new Promise<void>((resolve, reject) => {
      yauzl.open(tempZipPath, { lazyEntries: true }, (err, zipfile) => {
        if (err || !zipfile) return reject(err);

        zipfile.readEntry();

        zipfile.on("entry", (entry) => {
          if (/\/$/.test(entry.fileName)) {
            zipfile.readEntry();
            return;
          }

          zipfile.openReadStream(entry, async (err, readStream) => {
            if (err || !readStream) return reject(err);

            try {
            await s3.send(
                new PutObjectCommand({
                  Bucket: targetBucket,
                  Key: `${targetPrefix}/${entry.fileName}`,
                  Body: readStream, // stream directly to S3
                  // yauzl knows each entry's size, so PutObject gets the
                  // Content-Length it needs for a streaming body
                  ContentLength: entry.uncompressedSize,
                })
              );

              zipfile.readEntry();
            } catch (uploadErr) {
              reject(uploadErr);
            }
          });
        });

        zipfile.on("end", resolve);
        zipfile.on("error", reject);
      });
    }).finally(() => {
      // Step 3 — Clean up the temp archive, whether extraction succeeded or failed
      fs.promises.unlink(tempZipPath).catch(() => {});
    });
  }

:white_check_mark: Pros

  • Maximum ZIP compatibility
    ZIP tooling was designed for files, not streams.

  • Excellent debuggability
    Archives can be inspected, retried, or manually tested.

  • Stable memory usage
    Disk absorbs the data footprint.

  • Partial recovery possible
    If extraction fails, the archive is still available for retry.

  • Works everywhere
    Lambda (/tmp), ECS, EC2 — as long as storage limits are respected.

:cross_mark: Cons

  • Higher end-to-end latency
    Download and extraction are separate phases.

  • Extra I/O cost
    Disk writes and reads add overhead at scale.

  • Storage constraints

    • Lambda /tmp limits
    • EBS volume sizing and cleanup
  • Operational overhead

    • Temp file lifecycle management
    • Disk monitoring
    • Cleanup on failure paths
  • Less streaming-friendly
    You lose the ability to process data as it arrives.

Best fit:
Reliability-first systems, legacy tooling, or environments where disk is cheap and predictable.
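Much of the operational overhead above is temp-file hygiene. One way to contain it is a wrapper that guarantees cleanup on both success and failure (a sketch using only Node built-ins; `withTempDir` is my naming, not from the post):

```typescript
import { mkdtemp, rm } from "fs/promises";
import { tmpdir } from "os";
import path from "path";

// Sketch: run `fn` with a fresh temp directory and always clean it up,
// even when the download or extraction inside `fn` throws.
export async function withTempDir<T>(
  fn: (dir: string) => Promise<T>
): Promise<T> {
  const dir = await mkdtemp(path.join(tmpdir(), "s3zip-"));
  try {
    return await fn(dir);
  } finally {
    await rm(dir, { recursive: true, force: true });
  }
}
```

The download-then-extract flow would then run inside the callback, with the temp ZIP path derived from the directory (e.g. `path.join(dir, "archive.zip")`), so a failed extraction can never leak files onto `/tmp` or an EBS volume.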


Platform Considerations at a Glance

| Platform | Streaming | Buffering | Disk |
| --- | --- | --- | --- |
| Lambda | :warning: ZIP quirks | :warning: Memory-bound | :warning: /tmp limits |
| ECS | :white_check_mark: Strong fit | :warning: Memory-bound | :white_check_mark: Reliable |
| EC2 | :white_check_mark: Best fit | :warning: Memory-bound | :white_check_mark: Most flexible |

Library Behavior Matters As Much As the Strategy

Your unzip library often matters more than the architecture itself. Two systems can use the same S3 object and the same extraction strategy — and behave completely differently — purely because the libraries make different assumptions about how ZIP files should be read.

Some libraries are stream-optimistic. Others quietly assume random access.
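One cheap sanity check before committing bytes to any of them (an illustrative sketch, not something these libraries do for you): a ZIP that can even begin forward parsing starts with a local file header signature, the bytes `PK\x03\x04`. This cannot prove an archive is safely streamable, but it rules out obviously non-ZIP input up front.

```typescript
// Sketch: check whether a buffer begins with a ZIP local file header
// (signature 0x04034b50, i.e. the bytes "PK\x03\x04"). This catches
// non-ZIP input early, but cannot detect central-directory reliance.
export function startsWithLocalHeader(firstBytes: Buffer): boolean {
  return firstBytes.length >= 4 && firstBytes.readUInt32LE(0) === 0x04034b50;
}
```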

Common ZIP Libraries and Their Real Behavior

  • unzipper — Stream-friendly API, but optimistic about ZIP structure; can break on archives that rely heavily on central-directory metadata.

  • yauzl — Designed around random access; reads the central directory via file seeks, making it highly reliable for complex ZIPs but unsuitable for pure forward-only streams without range support.

  • archiver — Excellent for creating ZIP files; not designed for robust extraction of unknown archives.

  • adm-zip — Buffer-based and simple; reliable for small ZIPs but memory-heavy and unsuitable for large archives.

  • node-stream-zip — Central-directory-first design; efficient for large ZIPs when paired with files or range reads.

  • jszip — In-memory only; great for client-side or small payloads, dangerous at scale.

  • 7zip / p7zip bindings — Very robust format support, but assumes filesystem access and adds operational overhead.


Hidden Axis: Operational Risk

Beyond performance and cost, each approach carries different failure blast radii:

| Failure | Stream | Buffer | Disk |
| --- | --- | --- | --- |
| Network hiccup | Restart stream | Restart download | Resume possible with implementation |
| Corrupt ZIP | Late failure | Early failure | Inspectable |
| Large file surprise | Safe | :collision: OOM | Disk pressure |
| Concurrency spike | Stable | :collision: Memory | Disk contention |

This is why “it worked locally” so often collapses in production.

The Core Decision

Choosing how to unzip from S3 isn’t about correctness: with the right library, all three approaches produce the same output.

It’s about which cost you’re willing to pay:

  • Streams optimize for throughput and scale
  • Buffers optimize for simplicity and correctness
  • Disk optimizes for compatibility and recoverability

Every optimization shifts pressure somewhere else — memory, I/O, latency, or operational complexity.

In the next part, we’ll look at why streaming ZIP extraction sometimes explodes with baffling zlib errors — and how the ZIP central directory quietly orchestrates the whole mess.

The failure wasn’t random — it was structural.


:speech_balloon: Feedback

If you have suggestions, insights, or alternative approaches, feel free to leave a comment.

:books: Series: Unarchiving S3 ZIP Files

:folded_hands: Credits

  • Proofreading credits: Swasthik
  • Thank you for helping me come up with the solution: Swasthik, Sonal, Ashwath, Sandhyashri