whatwg / mimesniff

MIME Sniffing Standard
https://mimesniff.spec.whatwg.org/

7.1. Identify unknown MIME type with sniff-scriptable unset is inconsistent for PDFs #68

Open dd8 opened 6 years ago

dd8 commented 6 years ago

https://mimesniff.spec.whatwg.org/#identifying-a-resource-with-an-unknown-mime-type

When sniff-scriptable is unset and a PDF resource header is processed, the algorithm falls through to:

  1. If resource’s resource header contains no binary data bytes, return "text/plain".
  2. Return "application/octet-stream".

Some PDFs have no binary data in the first 1500 bytes, so they are sniffed as "text/plain"; others with binary data in the first 1500 bytes are sniffed as "application/octet-stream". It would be desirable to consistently return one or the other.

PDFs are a mix of 7-bit ASCII with some binary sections (compressed sections, images, embedded fonts). Binary sections may (or may not) appear within the first 1500 characters of the PDF, and not all PDFs contain binary sections.
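The fall-through can be sketched in Python as follows. The set of binary data bytes follows the standard's definition (0x00-0x08, 0x0B, 0x0E-0x1A, 0x1C-0x1F). The helper name `fallback_type` and both sample byte strings are made up for illustration, and the sketch covers only the last two steps quoted above, not the full algorithm.

```python
# Binary data bytes as defined by the MIME Sniffing Standard:
# 0x00-0x08, 0x0B, 0x0E-0x1A, 0x1C-0x1F.
BINARY_DATA_BYTES = frozenset(
    list(range(0x00, 0x09))
    + [0x0B]
    + list(range(0x0E, 0x1B))
    + list(range(0x1C, 0x20))
)


def fallback_type(resource_header: bytes) -> str:
    """Final two steps only (hypothetical helper, not the whole algorithm)."""
    if not any(b in BINARY_DATA_BYTES for b in resource_header):
        return "text/plain"
    return "application/octet-stream"


# Illustrative inputs, not complete PDFs.
text_only_pdf = b"%PDF-1.1\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
pdf_with_stream = b"%PDF-1.4\n4 0 obj\nstream\n\x00\x01\x02\nendstream\n"

print(fallback_type(text_only_pdf))    # text/plain
print(fallback_type(pdf_with_stream))  # application/octet-stream
```

Both inputs start with "%PDF-", yet they land on different MIME types depending only on whether a control byte happens to appear in the sniffed prefix.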

minimal.pdf is a valid text-only PDF (sourced from https://brendanzagaeski.appspot.com/0004.html) containing this text:

%PDF-1.1
%¥±ë

1 0 obj << /Type /Catalog /Pages 2 0 R

endobj

2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 /MediaBox [0 0 300 144]

endobj

3 0 obj << /Type /Page /Parent 2 0 R /Resources << /Font << /F1 << /Type /Font /Subtype /Type1 /BaseFont /Times-Roman

/Contents 4 0 R

endobj

4 0 obj << /Length 55 >> stream BT /F1 18 Tf 0 0 Td (Hello World) Tj ET endstream endobj

xref 0 5 0000000000 65535 f 0000000018 00000 n 0000000077 00000 n 0000000178 00000 n 0000000457 00000 n trailer << /Root 1 0 R /Size 5

startxref 565 %%EOF

domenic commented 6 years ago

Do you have a proposed solution here? As noted elsewhere, this standard is called "MIME sniffing," not "infallible MIME type determination," so it's not surprising to me that you can create cases that are not sniffed correctly.

dd8 commented 6 years ago

My motivation for highlighting this case is that PDF is potentially dangerous, so it's better to make the behaviour predictable.

The solution is probably easy: in "7.1. Identifying a resource with an unknown MIME type", add a new row to the pattern table for step 2 that is identical to the last row in the pattern table for step 1, but with a different MIME type:

Byte Pattern | Pattern Mask | Leading Bytes to Be Ignored | Computed MIME Type | Note
25 50 44 46 2D | FF FF FF FF FF | None. | application/octet-stream | The string "%PDF-"

This would also need a warning that the MIME type is meant to be different.
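As a rough illustration of how such a row would be evaluated, here is a simplified Python sketch of masked pattern matching. The function name `matches_pattern`, the tuple layout of `PROPOSED_ROW`, and the extra bounds check inside the loop are assumptions made for this example; the standard's own pattern matching algorithm is the authoritative description.

```python
def matches_pattern(data: bytes, pattern: bytes, mask: bytes,
                    ignored: bytes = b"") -> bool:
    """Simplified masked prefix match (illustrative, not the spec text)."""
    if len(pattern) != len(mask):
        raise ValueError("pattern and mask must have the same length")
    if len(data) < len(pattern):
        return False
    # Skip any leading bytes that the row says to ignore.
    s = 0
    while s < len(data) and data[s] in ignored:
        s += 1
    # Compare each masked input byte against the pattern byte.
    for p in range(len(pattern)):
        if s >= len(data) or (data[s] & mask[p]) != pattern[p]:
            return False
        s += 1
    return True


# The proposed row: byte pattern, pattern mask, ignored bytes, computed type.
PROPOSED_ROW = (
    bytes.fromhex("25 50 44 46 2D"),   # "%PDF-"
    bytes.fromhex("FF FF FF FF FF"),
    b"",                               # "None."
    "application/octet-stream",
)

pattern, mask, ignored, mime_type = PROPOSED_ROW
if matches_pattern(b"%PDF-1.1\n1 0 obj\n", pattern, mask, ignored):
    print(mime_type)
```

With a row like this in the step 2 table, any resource starting with "%PDF-" would get the same computed type regardless of what follows in the header.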

Some debate over the best/safest MIME type would be useful. Should it be application/octet-stream or text/plain? Both are returned at the moment.

application/octet-stream is probably more useful (the PDF is downloaded) and is likely to be sniffed more often by the current algorithm, but is potentially more dangerous in a social engineering attack.

text/plain is probably safer, but text/plain PDFs saved using the browser Save command will probably be unreadable in a PDF viewer if any line ending characters are changed (e.g. translating \r\n to \n). Such a change would break the PDF xref table: each xref record must end with \r\n, and the file offsets recorded in the table would no longer point at the right bytes.
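The offset problem is easy to reproduce on a toy example. The byte string below is a made-up fragment, not a valid PDF; it only shows how translating \r\n to \n moves an object away from the byte offset that an xref table would have recorded for it.

```python
# Made-up fragment with CRLF line endings (not a complete PDF).
original = (
    b"%PDF-1.1\r\n"
    b"1 0 obj\r\n<< >>\r\nendobj\r\n"
    b"2 0 obj\r\n<< >>\r\nendobj\r\n"
)

# What a "save as text" step might do to the line endings.
translated = original.replace(b"\r\n", b"\n")

marker = b"2 0 obj"
offset_before = original.find(marker)    # offset an xref entry would store
offset_after = translated.find(marker)   # where the object actually ends up

print(offset_before, offset_after)  # 34 30
```

Every \r\n before the object shifts it one byte earlier, so the stored offset drifts further from the real position the deeper into the file the object sits.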

domenic commented 6 years ago

OK. Can you test what browsers currently do, i.e. whether they use the current algorithm or your proposed one, using a test such as https://github.com/whatwg/mimesniff/issues/69#issuecomment-380525200 ? That will determine whether we can change this or not.

GPHemsley commented 3 years ago

This page (quoting from the PDF spec) states:

The second line comment contains 6 high bit characters (displayed as 3 characters in UTF-8 encoding), as required by the "File Header" subsection of the specification (Section 7.5.2):

If a PDF file contains binary data, as most do (see 7.2, "Lexical Conventions"), the header line shall be immediately followed by a comment line containing at least four binary characters—that is, characters whose codes are 128 or greater. This ensures proper behaviour of file transfer applications that inspect data near the beginning of a file to determine whether to treat the file’s contents as text or as binary.

So it seems the problem stems from our definition of binary data byte, which is limited to control characters and does not include bytes above 0x7F. This allows non-ASCII characters to be detected as UTF-8, to the detriment of situations like this which are actively trying to signal the presence of binary data.
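The mismatch can be checked directly on the header from minimal.pdf. This is a small Python sketch; `is_binary_data_byte` is a hypothetical helper that encodes the standard's binary data byte ranges, and the header is assumed to be UTF-8 encoded as described in the quoted page.

```python
def is_binary_data_byte(b: int) -> bool:
    """Binary data byte per the MIME Sniffing Standard (hypothetical helper)."""
    return (
        0x00 <= b <= 0x08
        or b == 0x0B
        or 0x0E <= b <= 0x1A
        or 0x1C <= b <= 0x1F
    )


# First two lines of minimal.pdf, assumed to be UTF-8 encoded.
header = "%PDF-1.1\n%¥±ë\n".encode("utf-8")

high_bit_bytes = [b for b in header if b >= 0x80]
binary_data_bytes = [b for b in header if is_binary_data_byte(b)]

print(len(high_bit_bytes))     # 6
print(len(binary_data_bytes))  # 0
```

The comment line carries six bytes at or above 0x80, which satisfies the PDF specification's signal, yet none of them counts as a binary data byte under the standard's definition.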

Both Firefox and Chrome behave identically here, per spec:

Case | Firefox | Chrome
without nosniff | application/pdf | application/pdf
with nosniff | text/plain | text/plain