Unarchiving S3 ZIP Files: It Worked, Then Disaster Struck - Part 1 🪣

TL;DR

Extracting ZIP files from S3 isn’t just an unzip operation — it’s an architectural choice.

There are three ways to do it: streaming, buffering in memory, or writing to disk. Each shifts the burden between memory usage, performance, and reliability.

Streaming feels like the most efficient option — especially for large archives — but ZIP files are structured in ways that often assume random access. That mismatch can cause unexpected failures, cryptic errors, and fragile behavior depending on the unzip library.

The key lesson: your extraction strategy and library must align with how ZIP files actually work, not just what looks optimal on paper.

What worked perfectly at first can fail under real-world conditions — and understanding why requires looking beyond S3 and into the ZIP format itself.


Here’s what happened — and why the obvious solution wasn’t actually the safest one.

Unarchiving ZIP files from S3 started out as one of those tasks you barely think about. Pull the object, unzip it, move on. I wired everything up, hit deploy, and watched files stream out exactly as expected. Green logs. Clean runs.

Then, without warning, disaster struck.

A job that “worked” suddenly didn’t. A cryptic error replaced progress. And what looked like a solved problem turned into a deep dive through ZIP internals, streaming assumptions, and infrastructure trade-offs I hadn’t planned on learning that week.

It wasn’t just the ZIP file that mattered — it was how I was trying to read it. Where the bytes flowed, how much of the file was available at any given moment, and what the unzip library silently expected all started to matter.

That’s when it became clear: extracting a ZIP file from S3 isn’t just about unzipping — it’s about choosing the right access strategy.


Why Streaming Seemed Like the Perfect Choice

In my case, the choice felt obvious. I was running on ECS, had stable networking, and was dealing with large ZIP archives — often exceeding 10 GB in compressed size — where buffering everything into memory felt wasteful and disk I/O felt unnecessary.

I wanted something quick, clean, and reasonably optimized.

Streaming fit perfectly: use unzipper, pipe the S3 object directly into the unzip logic, and process files as they arrive. No intermediate storage. No large memory allocations. No unnecessary disk writes.

On paper, it checked every box.

That assumption — that streaming was the safest and simplest option in this setup — is what made the failure so surprising.


The Three Ways to Unzip an S3 Object

When a ZIP file lives in S3, you effectively have three extraction strategies, each optimized for a different axis: memory usage, performance, or reliability.

Understanding these trade-offs is critical, because the extraction method directly affects scalability, stability, and error resilience.

  1. Stream directly from S3
  2. Buffer the entire archive in memory
  3. Write the file to disk, then extract

They may produce the same output, but they behave very differently under pressure.


:one: Streaming Directly from S3

S3 Object → Network Stream → Decompression → Output

This is usually the first approach developers reach for — and the one most tutorials demonstrate.

Example:

  import {
    S3Client,
    GetObjectCommand,
  } from "@aws-sdk/client-s3";
  import { Upload } from "@aws-sdk/lib-storage";
  import unzipper from "unzipper";

  const s3 = new S3Client({ region: "ap-south-1" });

  export async function extractAndUploadStreaming(
    sourceBucket: string,
    sourceKey: string,
    targetBucket: string,
    targetPrefix: string
  ): Promise<void> {
    const response = await s3.send(
      new GetObjectCommand({
        Bucket: sourceBucket,
        Key: sourceKey,
      })
    );

    if (!response.Body) {
      throw new Error("Empty S3 response");
    }

    const zipStream = response.Body as NodeJS.ReadableStream;

    await new Promise<void>((resolve, reject) => {
      zipStream
        .pipe(unzipper.Parse())
        .on("entry", async (entry) => {
          const fileName = entry.path;

          if (entry.type === "Directory") {
            entry.autodrain();
            return;
          }

          try {
            // PutObject requires a known Content-Length, so use the
            // multipart Upload helper for entry streams of unknown size.
            await new Upload({
              client: s3,
              params: {
                Bucket: targetBucket,
                Key: `${targetPrefix}/${fileName}`,
                Body: entry, // stream directly to S3
              },
            }).done();
          } catch (err) {
            entry.autodrain();
            reject(err);
          }
        })
        .on("close", resolve)
        .on("error", reject);
    });
  }

:white_check_mark: Pros

  • Minimal memory footprint
    Only small chunks are held in memory, making this suitable for very large ZIP files.

  • Early data availability
    Extraction can begin as soon as bytes arrive, reducing perceived latency.

  • No disk I/O
    Avoids EBS or /tmp writes, which can become bottlenecks at scale.

  • Cost-efficient for large files
    No need to overprovision memory or storage just to unzip.

  • Pipeline-friendly
    Works well when extracted files are immediately streamed elsewhere (another S3 bucket, message queue, etc.).

:cross_mark: Cons

  • ZIP format friction
    A ZIP's authoritative metadata lives in the central directory at the end of the archive. Local headers make forward extraction possible, but many libraries still depend on central-directory metadata for validation, offsets, and consistency checks, none of which a forward-only stream can reach until the very end.

  • Library fragility
    Many unzip libraries claim streaming support but still assume:

    • random seeks
    • known file offsets
    • pre-read metadata
  • Cryptic failure modes
    Errors like Z_BUF_ERROR or invalid distance code often appear far from the real cause.

  • No mid-file retries
    A network hiccup usually means restarting from byte zero.

  • Extraction order is fixed
    You process files in ZIP order, not business priority order.

  • Backpressure sensitivity
    Slow consumers downstream can stall decompression and amplify memory spikes.


Best fit:
Large archives, stable networking, ECS/EC2 environments, and teams comfortable with ZIP internals.
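To make the central-directory friction concrete, here is a small sketch (mine, not from the setup above) of parsing the End of Central Directory (EOCD) record. With S3 you would fetch only the object's tail via a ranged `GetObjectCommand` (e.g. `Range: "bytes=-65536"`), which is exactly the random access a pure forward stream cannot give you.

```typescript
// Sketch: parse the End of Central Directory (EOCD) record from the tail
// bytes of a ZIP. In practice the tail would come from a ranged S3 GET
// (Range: "bytes=-65536"), since the EOCD comment can be up to 64 KB.
export interface EocdInfo {
  entryCount: number;       // total entries in the central directory
  centralDirSize: number;   // size of the central directory in bytes
  centralDirOffset: number; // offset of the central directory in the file
}

export function parseEocd(tail: Buffer): EocdInfo {
  // Scan backwards for the EOCD signature 0x06054b50 ("PK\x05\x06");
  // the fixed part of the record is 22 bytes long.
  for (let i = tail.length - 22; i >= 0; i--) {
    if (tail.readUInt32LE(i) === 0x06054b50) {
      return {
        entryCount: tail.readUInt16LE(i + 10),
        centralDirSize: tail.readUInt32LE(i + 12),
        centralDirOffset: tail.readUInt32LE(i + 16),
      };
    }
  }
  throw new Error("EOCD record not found: truncated or not a ZIP");
}
```

A streaming reader only sees these fields after consuming the entire archive, which is why libraries that insist on reading them first cannot run over a forward-only stream.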


:two: Buffering the Entire ZIP in Memory

S3 Object → Memory Buffer → Decompression

This approach downloads the full archive into memory before extraction begins.

Example:

  import {
    S3Client,
    GetObjectCommand,
    PutObjectCommand,
  } from "@aws-sdk/client-s3";
  import AdmZip from "adm-zip";
  import { Readable } from "stream";

  const s3 = new S3Client({ region: "ap-south-1" });

  async function streamToBuffer(stream: Readable): Promise<Buffer> {
    const chunks: Buffer[] = [];
    for await (const chunk of stream) {
      chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
    }
    return Buffer.concat(chunks);
  }

  export async function extractAndUploadBuffer(
    sourceBucket: string,
    sourceKey: string,
    targetBucket: string,
    targetPrefix: string
  ): Promise<void> {
    const response = await s3.send(
      new GetObjectCommand({
        Bucket: sourceBucket,
        Key: sourceKey,
      })
    );

    if (!response.Body) throw new Error("Empty S3 response");

    const zipBuffer = await streamToBuffer(response.Body as Readable);
    const zip = new AdmZip(zipBuffer);

    for (const entry of zip.getEntries()) {
      if (entry.isDirectory) continue;

      const fileBuffer = entry.getData();

      await s3.send(
        new PutObjectCommand({
          Bucket: targetBucket,
          Key: `${targetPrefix}/${entry.entryName}`,
          Body: fileBuffer,
        })
      );
    }
  }

:white_check_mark: Pros

  • Immediate access to all metadata
    Central directory, offsets, compression flags — everything is available upfront.

  • Predictable behavior
    Entire classes of streaming-related ZIP errors simply disappear.

  • Fast random access
    Jumping between entries is cheap once everything lives in RAM.

  • Clearer failure semantics
    Corrupt archives usually fail early and loudly.

:cross_mark: Cons

  • High memory usage
    Memory scales linearly with archive size.

  • OOM risk
    One unexpectedly large ZIP can crash the process.

  • Poor concurrency scaling
    Multiple parallel extractions multiply memory pressure quickly.

  • Lambda cost penalties
    Large memory allocations increase cold-start time and billing.

  • Wasteful for partial reads
    Even if you only need one file, you pay to buffer the entire archive.

Best fit:
Small-to-medium ZIP files, low concurrency workloads, and situations where simplicity beats scalability.
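A common mitigation for the OOM risk above is to check the object's size before committing to buffering. A sketch, using only a pure decision function (the 512 MB cap is an assumed threshold you would tune, and the size would come from something like `HeadObjectCommand`'s `ContentLength`):

```typescript
// Sketch: pick buffering only when the archive size is known and small.
// MAX_BUFFERED_ZIP_BYTES is an assumed threshold, not a recommendation.
const MAX_BUFFERED_ZIP_BYTES = 512 * 1024 * 1024; // 512 MB

export function chooseStrategy(
  contentLength: number | undefined // e.g. HeadObjectCommand's ContentLength
): "buffer" | "disk" {
  if (contentLength === undefined) return "disk"; // unknown size: don't risk OOM
  return contentLength <= MAX_BUFFERED_ZIP_BYTES ? "buffer" : "disk";
}
```

One HEAD request per archive is cheap insurance against the "one unexpectedly large ZIP crashes the process" failure mode.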


:three: Writing to Disk Before Unzipping

S3 Object → Disk → Decompression → Cleanup

The most traditional approach — and still the most predictable.

Example:

  import {
    S3Client,
    GetObjectCommand,
    PutObjectCommand,
  } from "@aws-sdk/client-s3";
  import yauzl from "yauzl";
  import fs from "fs";
  import path from "path";
  import { pipeline } from "stream/promises";

  const s3 = new S3Client({ region: "ap-south-1" });

  export async function extractAndUploadDisk(
    sourceBucket: string,
    sourceKey: string,
    tempZipPath: string,
    targetBucket: string,
    targetPrefix: string
  ): Promise<void> {
    // Step 1 — Download ZIP to disk
    const response = await s3.send(
      new GetObjectCommand({
        Bucket: sourceBucket,
        Key: sourceKey,
      })
    );

    if (!response.Body) throw new Error("Empty S3 response");

    await pipeline(
      response.Body as NodeJS.ReadableStream,
      fs.createWriteStream(tempZipPath)
    );

    // Step 2 — Extract using yauzl
    await new Promise<void>((resolve, reject) => {
      yauzl.open(tempZipPath, { lazyEntries: true }, (err, zipfile) => {
        if (err || !zipfile) return reject(err);

        zipfile.readEntry();

        zipfile.on("entry", (entry) => {
          if (/\/$/.test(entry.fileName)) {
            zipfile.readEntry();
            return;
          }

          zipfile.openReadStream(entry, async (err, readStream) => {
            if (err || !readStream) return reject(err);

            try {
            await s3.send(
                new PutObjectCommand({
                  Bucket: targetBucket,
                  Key: `${targetPrefix}/${entry.fileName}`,
                  Body: readStream, // stream directly to S3
                  // yauzl knows each entry's size, so PutObject gets the
                  // Content-Length it needs for a streaming body
                  ContentLength: entry.uncompressedSize,
                })
              );

              zipfile.readEntry();
            } catch (uploadErr) {
              reject(uploadErr);
            }
          });
        });

        zipfile.on("end", resolve);
        zipfile.on("error", reject);
      });
    }).finally(() => {
      // Step 3 — Clean up the temp archive, whether extraction succeeded or failed
      fs.promises.unlink(tempZipPath).catch(() => {});
    });
  }

:white_check_mark: Pros

  • Maximum ZIP compatibility
    ZIP tooling was designed for files, not streams.

  • Excellent debuggability
    Archives can be inspected, retried, or manually tested.

  • Stable memory usage
    Disk absorbs the data footprint.

  • Partial recovery possible
    If extraction fails, the archive is still available for retry.

  • Works everywhere
    Lambda (/tmp), ECS, EC2 — as long as storage limits are respected.

:cross_mark: Cons

  • Higher end-to-end latency
    Download and extraction are separate phases.

  • Extra I/O cost
    Disk writes and reads add overhead at scale.

  • Storage constraints

    • Lambda /tmp limits
    • EBS volume sizing and cleanup
  • Operational overhead

    • Temp file lifecycle management
    • Disk monitoring
    • Cleanup on failure paths
  • Less streaming-friendly
    You lose the ability to process data as it arrives.

Best fit:
Reliability-first systems, legacy tooling, or environments where disk is cheap and predictable.
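Much of the operational overhead above is temp-file hygiene. One way to contain it is a wrapper that guarantees cleanup on both success and failure (a sketch using only Node built-ins; `withTempDir` is my naming, not from the post):

```typescript
import { mkdtemp, rm } from "fs/promises";
import { tmpdir } from "os";
import path from "path";

// Sketch: run `fn` with a fresh temp directory and always clean it up,
// even when the download or extraction inside `fn` throws.
export async function withTempDir<T>(
  fn: (dir: string) => Promise<T>
): Promise<T> {
  const dir = await mkdtemp(path.join(tmpdir(), "s3zip-"));
  try {
    return await fn(dir);
  } finally {
    await rm(dir, { recursive: true, force: true });
  }
}
```

The download-then-extract flow would then run inside the callback, with the temp ZIP path derived from the directory (e.g. `path.join(dir, "archive.zip")`), so a failed extraction can never leak files onto `/tmp` or an EBS volume.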


Platform Considerations at a Glance

| Platform | Streaming | Buffering | Disk |
| --- | --- | --- | --- |
| Lambda | :warning: ZIP quirks | :warning: Memory-bound | :warning: /tmp limits |
| ECS | :white_check_mark: Strong fit | :warning: Memory-bound | :white_check_mark: Reliable |
| EC2 | :white_check_mark: Best fit | :warning: Memory-bound | :white_check_mark: Most flexible |

Library Behavior Matters As Much As the Strategy

Your unzip library often matters more than the architecture itself. Two systems can use the same S3 object and the same extraction strategy — and behave completely differently — purely because the libraries make different assumptions about how ZIP files should be read.

Some libraries are stream-optimistic. Others quietly assume random access.
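One cheap sanity check before committing bytes to any of them (an illustrative sketch, not something these libraries do for you): a ZIP that can even begin forward parsing starts with a local file header signature, the bytes `PK\x03\x04`. This cannot prove an archive is safely streamable, but it rules out obviously non-ZIP input up front.

```typescript
// Sketch: check whether a buffer begins with a ZIP local file header
// (signature 0x04034b50, i.e. the bytes "PK\x03\x04"). This catches
// non-ZIP input early, but cannot detect central-directory reliance.
export function startsWithLocalHeader(firstBytes: Buffer): boolean {
  return firstBytes.length >= 4 && firstBytes.readUInt32LE(0) === 0x04034b50;
}
```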

Common ZIP Libraries and Their Real Behavior

  • unzipper — Stream-friendly API, but optimistic about ZIP structure; can break on archives that rely heavily on central-directory metadata.

  • yauzl — Designed around random access; reads the central directory via file seeks, making it highly reliable for complex ZIPs but unsuitable for pure forward-only streams without range support.

  • archiver — Excellent for creating ZIP files; not designed for robust extraction of unknown archives.

  • adm-zip — Buffer-based and simple; reliable for small ZIPs but memory-heavy and unsuitable for large archives.

  • node-stream-zip — Central-directory-first design; efficient for large ZIPs when paired with files or range reads.

  • jszip — In-memory only; great for client-side or small payloads, dangerous at scale.

  • 7zip / p7zip bindings — Very robust format support, but assumes filesystem access and adds operational overhead.


Hidden Axis: Operational Risk

Beyond performance and cost, each approach carries different failure blast radii:

| Failure | Stream | Buffer | Disk |
| --- | --- | --- | --- |
| Network hiccup | Restart stream | Restart download | Resume possible with implementation |
| Corrupt ZIP | Late failure | Early failure | Inspectable |
| Large file surprise | Safe | :collision: OOM | Disk pressure |
| Concurrency spike | Stable | :collision: Memory | Disk contention |

This is why “it worked locally” so often collapses in production.

The Core Decision

Choosing how to unzip from S3 isn’t about correctness: with the right library, all three approaches produce the same output.

It’s about which cost you’re willing to pay:

  • Streams optimize for throughput and scale
  • Buffers optimize for simplicity and correctness
  • Disk optimizes for compatibility and recoverability

Every optimization shifts pressure somewhere else — memory, I/O, latency, or operational complexity.

In the next part, we’ll look at why streaming ZIP extraction sometimes explodes with baffling zlib errors — and how the ZIP central directory quietly orchestrates the whole mess.

The failure wasn’t random — it was structural.


:speech_balloon: Feedback

If you have suggestions, insights, or alternative approaches, feel free to leave a comment.

:books: Series: Unarchiving S3 ZIP Files

:folded_hands: Credits

  • Proofreading credits: Swasthik
  • Thank you for helping me come up with the solution: Swasthik, Sonal, Ashwath, Sandhyashri