There are only two file formats: zip and txt

JSON? Text.

EPUB? Zip.

CSV? Text.

.docx? Zip.

SVG? Text.

.jar? Zip.

YAML? Text.

.apk? Zip.

Strip the branding off a file and what you have left is usually one of two things: a text file you can open in any editor, or a ZIP archive full of smaller files. And those smaller files are very often text.

Where the line comes from

This is old programmer folklore. The usual phrasing is "there are only two file formats worth using: text files and zips of text files," and it has floated around forums for years.1 Every so often someone reframes it as a punchy list and it goes viral again,2 usually with a Calvin and Hobbes panel attached.3

The text half

A lot of what we treat as separate formats is plain text with rules bolted on.

JSON is defined by its spec as "a text format for the serialization of structured data."4

CSV is tabular data stored "in plain text," with its own registered text/csv media type.5

SVG is an XML vocabulary, so it is an image file you can read line by line.6

Add YAML, TOML, INI, Markdown, HTML, and every source file you have ever written.

Some of these hide better than others. A Jupyter notebook (.ipynb) is JSON. So are GeoJSON maps and glTF 3D scenes.
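As a rough illustration that these formats are all text with rules on top, the sketch below feeds three small in-memory strings to Python's standard json, csv, and xml.etree.ElementTree parsers. The sample contents are made up for the example, and the notebook-like string is a simplified stand-in, not a complete .ipynb file.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Made-up sample contents; each one is an ordinary Python string.
notebook_like = '{"cells": [{"cell_type": "markdown", "source": ["# Hello"]}], "nbformat": 4}'
table = "name,kind\nreport.docx,zip\nnotes.md,text\n"
drawing = '<svg xmlns="http://www.w3.org/2000/svg"><circle r="5"/></svg>'

# Three different "formats", three text parsers, no binary handling anywhere.
doc = json.loads(notebook_like)
rows = list(csv.DictReader(io.StringIO(table)))
root = ET.fromstring(drawing)

print(doc["cells"][0]["cell_type"])
print(rows[0]["kind"])
print(root.tag)
```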

Surprisingly, also ZIP

A pile of formats that look proprietary are ZIP archives wearing a different extension. You can change the extension to .zip, unzip the file, and read what is inside. A Word document (.docx) and an EPUB e-book (.epub) are two real examples.


The same is true of NuGet packages, VS Code extensions, Windows .appx installers, Apple iWork documents, Google Earth .kmz files, comic book .cbz archives, and 3D-printing .3mf models.
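You do not even need to rename anything to look inside. Here is a minimal sketch using Python's standard zipfile module; the archive it builds is a toy stand-in for a .docx (real documents contain more parts), and list_entries is a name chosen for this example.

```python
import io
import zipfile

def list_entries(data: bytes) -> list[str]:
    """Return the member names of a ZIP-based file, given its raw bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()

# Toy stand-in for a .docx, built in memory so the example is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document/>")

names = list_entries(buffer.getvalue())
print(names)
```

For a file on disk, zipfile.ZipFile accepts a path directly, whatever the extension says.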

PK, as in Phil Katz

Open a .docx, an .epub, or a .jar in a text editor and the first two characters are PK. Those are the initials of Phil Katz, who created the ZIP format at PKWARE and released it in 1989.15 He died in 2000, after years of struggle with alcoholism.16 His initials now sit at the front of a large share of the files on every computer, phone, and e-reader on the planet.

There is a structural reason ZIP became the default container. A ZIP's index (the central directory) lives at the end of the file, so a reader starts from the back and follows offsets to each entry. Nothing has to sit in front of the first entry, which lets a format pin one small uncompressed file at the very front for fast identification. EPUB and OpenDocument both use this trick: the first entry is an uncompressed mimetype file, so a reader can tell what a document is without unpacking the whole archive.8,9
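A small sketch of that layout, assuming Python's standard zipfile module: it writes an EPUB-style container in memory with a stored (uncompressed) mimetype entry first, then checks where things land. The second entry's content is a placeholder, not a valid EPUB container file.

```python
import io
import zipfile

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    # First entry: stored, not compressed, so its bytes stay readable.
    archive.writestr("mimetype", "application/epub+zip",
                     compress_type=zipfile.ZIP_STORED)
    # Later entries can be compressed as usual (placeholder content).
    archive.writestr("META-INF/container.xml", "<container/>" * 50,
                     compress_type=zipfile.ZIP_DEFLATED)

data = buffer.getvalue()

# A local file header is 30 bytes, followed by the entry name, then its data.
name = data[30:38]
payload = data[38:58]

# The end-of-central-directory record starts with PK\x05\x06 near the end.
eocd_offset = data.rfind(b"PK\x05\x06")

print(name, payload)
print("index record starts", len(data) - eocd_offset, "bytes before the end")
```

The identifying string is at a fixed offset from the start, while the index that describes every entry is found by searching backwards from the end.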

What does that look like?

You can watch both ideas at once in the first bytes of an EPUB. The PK signature is the first thing in the file, and because the mimetype entry is stored uncompressed, its name and contents follow in plain text a few bytes later:

$ xxd book.epub | head -3
00000000: 504b 0304 0a00 0000 0000 8861 d55c 6f61  PK.........a.\oa
00000010: ab2c 1400 0000 1400 0000 0800 0000 6d69  .,............mi
00000020: 6d65 7479 7065 6170 706c 6963 6174 696f  metypeapplicatio

Read the right-hand column: PK, then mimetype, then applicatio(n/epub+zip).
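The bytes between PK and mimetype are not noise. As a sketch, the code below copies the hex columns from the dump above and unpacks them with Python's struct module, following the 30-byte local file header layout of the ZIP format. The variable names are labels chosen for this example.

```python
import struct

# The 48 bytes shown in the dump above, copied from its hex columns.
dump = bytes.fromhex(
    "504b 0304 0a00 0000 0000 8861 d55c 6f61"
    "ab2c 1400 0000 1400 0000 0800 0000 6d69"
    "6d65 7479 7065 6170 706c 6963 6174 696f"
)

# Local file header: all fields little-endian, 30 bytes in total.
(signature, version, flags, method, mod_time, mod_date,
 crc32, compressed_size, uncompressed_size,
 name_len, extra_len) = struct.unpack("<4sHHHHHIIIHH", dump[:30])

name = dump[30:30 + name_len]
data_start = 30 + name_len + extra_len

print(signature)                  # b'PK\x03\x04'
print(method)                     # 0, meaning stored (no compression)
print(compressed_size, uncompressed_size)   # 20 20
print(name)                       # b'mimetype'
print(dump[data_start:])          # b'applicatio'
```

The two size fields are equal because nothing was compressed, and 20 is exactly the length of application/epub+zip.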

Not all files

Plenty of formats are genuinely neither: PNG, JPEG, and GIF images; MP3, MP4, and WebM media; SQLite databases; Protocol Buffers; Parquet; WebAssembly modules; fonts. These are binary formats with their own layouts, and no amount of renaming turns them into text or a ZIP.
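Most of these binary formats announce themselves in their first few bytes, much as ZIP does with PK. The sketch below checks a handful of well-known leading signatures; the table is deliberately incomplete, and sniff_binary is a hypothetical helper name, not part of any library.

```python
from typing import Optional

# Leading-byte signatures for a few of the binary formats named above.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"\xff\xd8\xff": "JPEG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"SQLite format 3\x00": "SQLite",
    b"\x00asm": "WebAssembly",
    b"PAR1": "Parquet",
}

def sniff_binary(head: bytes) -> Optional[str]:
    """Return a format label if the leading bytes match a known signature."""
    for magic, label in SIGNATURES.items():
        if head.startswith(magic):
            return label
    return None

print(sniff_binary(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
print(sniff_binary(b"GIF89a\x01\x00\x01\x00"))
print(sniff_binary(b"PK\x03\x04"))
```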

A few sit in the cracks. An .exe is normally a compiled binary, but a self-extracting archive is a valid .exe and a valid .zip at the same time: because a ZIP's index lives at the end of the file, the same bytes can satisfy both readers.15,17 PDF is the mirror image. It is a mostly text-based structure whose embedded streams are usually compressed with DEFLATE, the same algorithm ZIP uses.18 PDF is the rare case of a text file with zip-style compressed data inside it, rather than the other way round.
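The self-extracting trick is easy to imitate. In this sketch, some placeholder bytes are glued in front of a small archive and Python's standard zipfile module still reads it. The fake_stub bytes only stand in for an executable; they are not a runnable program.

```python
import io
import zipfile

# Build a small archive in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("hello.txt", "hi from inside the archive")

# Placeholder bytes standing in for an executable stub (NOT a real program).
fake_stub = b"MZ" + b"\x00" * 200
combined = fake_stub + buffer.getvalue()

# The file no longer starts with PK, yet a ZIP reader is still satisfied,
# because it locates the index by searching from the end of the file.
print(combined[:2])
print(zipfile.is_zipfile(io.BytesIO(combined)))
with zipfile.ZipFile(io.BytesIO(combined)) as archive:
    contents = archive.read("hello.txt")
print(contents)
```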

But for the formats most of us create and parse on a normal day, the aphorism holds up better than it has any right to. It is text, or it is a zip of text. Next time you hit a file extension you do not recognize, try two things before you go looking for a special tool: open it in a text editor, and try to unzip it. One of them works more often than you would guess.
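The same two checks fit in a few lines of Python. This is a heuristic sketch, not a real file identifier: classify is a hypothetical helper, it only recognizes UTF-8 text, and it treats anything else as "other".

```python
import io
import zipfile

def classify(data: bytes) -> str:
    """Guess whether raw file contents are a ZIP container, text, or neither."""
    if zipfile.is_zipfile(io.BytesIO(data)):
        return "zip"
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "other"
    # NUL bytes decode as valid UTF-8 but almost never occur in real text files.
    return "other" if "\x00" in text else "text"

# A tiny archive built in memory, so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
sample_zip = buffer.getvalue()

print(classify(sample_zip))
print(classify(b'{"looks": "like JSON"}'))
print(classify(b"\x89PNG\r\n\x1a\n"))
```

To try it on a real file, read the bytes first, for example with pathlib.Path(...).read_bytes(), and pass them in.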


  1. The saying circulates without a clear original source; it appears in programmer forums and comment threads such as The Daily WTF.

  2. It resurfaces regularly across social media; one of many recent examples is this Bluesky post and its long reply thread of people adding formats.

  3. "The two types of file format are txt and zip," ProgrammerHumor.io.

  4. "The JavaScript Object Notation (JSON) Data Interchange Format," RFC 8259: "JSON is a text format for the serialization of structured data."

  5. "Common Format and MIME Type for Comma-Separated Values (CSV) Files," RFC 4180.

  6. "Scalable Vector Graphics (SVG) 2," W3C. SVG is an XML language.

  7. "Open Packaging Conventions," used by ECMA-376 Office Open XML; see Microsoft's OPC overview and the Library of Congress format record.

  8. "Open Document Format for Office Applications (OpenDocument) v1.3, Part 3: Packages," OASIS.

  9. "EPUB 3.3," W3C, and the Open Container Format, which defines the ZIP container and the uncompressed mimetype rule.

  10. "JAR File Specification," Oracle.

  11. "Android Package (APK)," Library of Congress.

  12. ".ipa," Wikipedia (note: since 2017 Apple may use LZFSE compression inside the archive).

  13. "The Wheel Binary Package Format 1.0," PEP 427.

  14. "Usdz File Format Specification," OpenUSD: a "zero compression, unencrypted zip archive."

  15. "ZIP (file format)," Wikipedia; PKWARE's original APPNOTE.TXT.

  16. His own story was bleaker than his legacy. See the 2000 profile "The short, tormented life of computer genius Phil Katz," widely discussed on Hacker News.

  17. One file can conform to several formats at once. See "How to Create HTML/ZIP/PNG Polyglot Files" on Hacker News, and the PoC||GTFO polyglot files.

  18. PDF stores its objects in a text-based syntax and compresses embedded streams, most commonly with the DEFLATE-based FlateDecode filter. "PDF," Wikipedia.