Unarchiving S3 ZIP Files: The zlib Error That Made No Sense - Part 2 🔎

TL;DR

The ZIP file wasn’t failing randomly. It was failing structurally.

Streaming extraction relied only on local file headers, but the ZIP
format stores its authoritative metadata in the central directory at the
end of the archive.

Without that information, the unzip library misjudged file boundaries.
zlib expected more data than existed — and threw Z_BUF_ERROR.

Buffering and disk extraction worked locally because they allowed random
access to the full ZIP structure. But those approaches didn’t translate
cleanly to ECS, where memory and disk behavior made them unreliable or
impractical at scale.

Streaming didn’t fail because of missing bytes.

It failed because it couldn’t see the full structure.


The Error That Shouldn’t Have Happened

The first failure didn’t look dramatic. It looked ordinary. A job
started normally. The stream connected. Extraction began. Files flowed
out. Logs printed exactly what I expected. Then, midway through
extraction, everything stopped.

Error: unexpected end of file
code: 'Z_BUF_ERROR'

No retry. No partial recovery. No continuation. Just a hard stop.

What made it unsettling wasn’t the error itself. It was the context
around it. The S3 object was fully present. Its size matched
expectations. The stream didn’t drop. Other ZIP files processed
successfully in the same pipeline. Nothing obvious had changed. And yet,
this one archive consistently failed — not at the beginning, but
somewhere in the middle.

That made it feel less like an infrastructure
problem and more like something hidden inside the file itself.


The Investigation Started With the Most Obvious Suspect: Corruption

When a decompression error appears mid-stream, corruption is the natural
first suspect. So the debugging process turned into an archaeological
dig through the ZIP file.

Instead of trusting the stream, I pulled the archive down locally and
started dissecting it. Extract individual folders. Remove sections.
Retry extraction. Repeat.

Eventually, a pattern emerged. There was one particular folder. Whenever
that folder existed, streaming extraction failed. When it was removed,
everything worked. That seemed definitive. The folder was corrupt. Or at
least, that was the simplest explanation.

But this ZIP file wasn’t arbitrary input. It wasn’t something uploaded
manually or generated by an unknown system. It was a Smart TV
configuration archive — a structured bundle containing settings,
configuration, and runtime data. That exact same ZIP file worked
perfectly in its original environment. The Smart TV extracted it using
its Java-based pipeline without issue.

That meant something critical: the ZIP file was valid enough for its
primary consumer. The source of truth was correct. The failure wasn’t
about whether the ZIP could be extracted. It was about how it was being
extracted.


The Failure Was Absolute — Not Partial

One detail made the situation worse. The failure didn’t just affect that
folder. It stopped everything.

Streaming extraction is sequential by nature. Each file depends on
correctly finishing the previous one. When the extractor encountered the
structural mismatch, it lost its ability to safely continue. It couldn’t
skip the entry. It couldn’t jump ahead. It couldn’t recover. It simply
stopped.

Files after the failure point were never processed. Skipping wasn’t a
safe option, because without structural certainty, the extractor
couldn’t reliably locate the next valid file boundary. The pipeline
didn’t degrade. It halted.


What Z_BUF_ERROR Was Really Saying

At first glance, the error message seemed misleading. Z_BUF_ERROR
sounds like a buffering or memory issue. It isn’t.

It means the decompressor reached the end of available compressed data
while still expecting more. From zlib’s perspective, the file ended
prematurely — not at the archive level, but at the individual entry
level. The decompressor believed it was still inside a file, but the
stream had already moved on.

This wasn’t a transport problem. It was a structural interpretation
problem. Which raised a deeper question: how did the extractor decide
how much data belonged to that file? The answer lives inside the ZIP
format itself.


ZIP Files Aren’t Just Sequential — They’re Indexed

At first glance, a ZIP file feels like a simple sequence of compressed files. One entry ends, the next begins, and extraction moves forward.

But internally, a ZIP file is structured more like an indexed container than a pure stream.

Here’s what the actual layout looks like:

+------------------------------------------------------------+
| Local File Header #1                                       |
+------------------------------------------------------------+
| Compressed Data #1                                         |
+------------------------------------------------------------+
| Local File Header #2                                       |
+------------------------------------------------------------+
| Compressed Data #2                                         |
+------------------------------------------------------------+
| Local File Header #3                                       |
+------------------------------------------------------------+
| Compressed Data #3                                         |
+------------------------------------------------------------+
| ...                                                        |
+------------------------------------------------------------+
| Central Directory Entry #1                                 |
|  -> points to Local File Header #1                         |
+------------------------------------------------------------+
| Central Directory Entry #2                                 |
|  -> points to Local File Header #2                         |
+------------------------------------------------------------+
| Central Directory Entry #3                                 |
|  -> points to Local File Header #3                         |
+------------------------------------------------------------+
| ...                                                        |
+------------------------------------------------------------+
| End of Central Directory Record (EOCD)                     |
|  -> points to start of Central Directory                   |
+------------------------------------------------------------+

Each file appears twice in the archive — once as a local entry, and once in the central directory.

These two parts serve very different purposes.

The local file headers appear before each file’s compressed data. They allow extraction to begin immediately, which is why streaming extraction works at all. They contain basic information like the filename, compression method, and sometimes size information.

[Local Header]
[File Data]

[Local Header]
[File Data]
...

But they aren’t always complete.

Some ZIP creation tools defer important metadata until after compression finishes. In those cases, the local header acts more like a placeholder than a definitive record.

The central directory, located at the end of the archive, is the authoritative index. It contains the exact offsets, exact compressed sizes, exact uncompressed sizes, and the precise structural map of the entire ZIP file.

[Central Directory]
[End of Central Directory]

Think of it like this:

  • Local headers are scattered sticky notes placed before each chapter
  • The central directory is the table of contents at the back of the book

Streaming extraction reads the sticky notes and hopes they’re accurate.

Random-access extraction reads the table of contents first and knows exactly where everything is.

Most of the time, both approaches work.

But when the sticky notes are incomplete — and the table of contents isn’t available yet — the extractor is forced to make assumptions.

And that’s where things begin to break down.

Why That Folder Triggered the Failure

That folder wasn’t corrupt. It was structurally dependent on metadata
confirmed by the central directory.

The streaming extractor read the local header and began decompressing.
Based on incomplete metadata, it expected more compressed bytes. When
the data ended — correctly — the extractor still believed the file
wasn’t finished. zlib raised Z_BUF_ERROR, not because the archive was
invalid, but because the extractor didn’t yet have the complete
structural map.


Why Buffering and Disk Extraction Solved It Locally — But Not in ECS

This was one of the most revealing parts of the investigation. When I
switched to buffering or disk extraction locally, the problem
disappeared instantly. The archive extracted cleanly. No errors. No
interruptions.

Local extraction allowed random access. The extractor could read the
central directory first and knew the exact boundaries of every entry
before decompressing. Extraction became deterministic instead of
interpretive.

But when applying the same approach in ECS, things changed. Buffering
large archives into memory created unacceptable memory pressure,
especially with concurrent jobs. Disk extraction introduced its own
constraints — storage limits, performance variability, and operational
overhead managing temporary files inside containers.

What worked perfectly as a debugging tool locally didn’t translate into
a reliable production strategy. It solved the structural visibility
problem, but introduced scaling and infrastructure problems. It wasn’t a
real solution. It was a diagnostic confirmation.


The Core Realization

The failure wasn’t caused by bad data. It was caused by incomplete
structural visibility during streaming extraction.

ZIP files are optimized for random access. Streaming removes that
capability. As long as local headers contain sufficient metadata,
streaming works. When they don’t, extraction becomes fragile. And once
structural interpretation fails, extraction cannot safely continue.


Lessons From the Failure

This experience reshaped how I thought about ZIP extraction entirely —
not as a simple decompression step, but as an interaction between file
format structure and access strategy.

Key lessons emerged:

  • ZIP archives rely heavily on metadata stored at the end of the file
  • Streaming extraction operates without access to that metadata
  • Some ZIP files stream cleanly; others don’t, depending on structure
  • Streaming failures halt extraction completely, not just individual files
  • Local buffering and disk extraction restore structural visibility
  • But those approaches don’t scale cleanly in ECS environments
  • Extraction reliability depends on aligning access strategy with file
    structure

The Missing Capability Was Random Access

Streaming itself wasn’t the problem. The real limitation was the
inability to seek.

ZIP extraction becomes reliable the moment the extractor can access the
central directory. Which raises an obvious question: what if streaming
didn’t have to be strictly forward-only? What if the extractor could
access the index first — without downloading the entire archive?

That’s when the real solution emerged.

Because S3 already supports exactly that capability.

And it changes everything.


💬 Feedback

If you have suggestions, insights, or alternative approaches, feel free to leave a comment.

📚 Series: Unarchiving S3 ZIP Files

🙏 Credits

  • Proofreading credits: Swasthik
  • Thank you for helping me arrive at the solution: Swasthik, Sonal, Ashwath, Sandhyashri