wjdp / htmltest

Test generated HTML for problems
MIT License

mailto with plus sign incorrectly marked as invalid #182

Open theory opened 2 years ago

theory commented 2 years ago

Describe the bug

On this page, I have a mailto: link like this:

<a href="mailto:sqitch-users+subscribe@googlegroups.com">subscribe by email</a>

Running htmltest (just installed via go install) it reports:

  invalid email address (invalid format): 'sqitch-users subscribe@googlegroups.com' --- 2013/06/sqitch-list/index.html --> mailto:sqitch-users+subscribe@googlegroups.com

I think this is incorrect: isn't the plus sign valid there, rather than representing a space? I tried pasting it into mailtolinkgenerator.com, and it also output the address with a plus. RFC 6068 gives this grammar:

      mailtoURI    = "mailto:" [ to ] [ hfields ]
      to           = addr-spec *("," addr-spec )
      hfields      = "?" hfield *( "&" hfield )
      hfield       = hfname "=" hfvalue
      hfname       = *qchar
      hfvalue      = *qchar
      addr-spec    = local-part "@" domain
      local-part   = dot-atom-text / quoted-string
      domain       = dot-atom-text / "[" *dtext-no-obs "]"
      dtext-no-obs = %d33-90 / ; Printable US-ASCII
                     %d94-126  ; characters not including
                               ; "[", "]", or "\"
      qchar        = unreserved / pct-encoded / some-delims
      some-delims  = "!" / "$" / "'" / "(" / ")" / "*"
                   / "+" / "," / ";" / ":" / "@"

If I'm reading it right, the dot-atom-text rule (documented in RFC 5322) appears to allow + signs in the local-part:

   atext           =   ALPHA / DIGIT /    ; Printable US-ASCII
                       "!" / "#" /        ;  characters not including
                       "$" / "%" /        ;  specials.  Used for atoms.
                       "&" / "'" /
                       "*" / "+" /
                       "-" / "/" /
                       "=" / "?" /
                       "^" / "_" /
                       "`" / "{" /
                       "|" / "}" /
                       "~"

   atom            =   [CFWS] 1*atext [CFWS]

   dot-atom-text   =   1*atext *("." 1*atext)
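
As a rough check of that reading, the quoted atext and dot-atom-text rules can be transcribed by hand. This is only a sketch: isAtext and isDotAtomText are hypothetical helper names, not part of htmltest or of any library, and the quoted-string alternative for local-part is not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// isAtext reports whether c belongs to the atext set quoted above:
// ALPHA / DIGIT plus the listed printable specials.
func isAtext(c byte) bool {
	switch {
	case c >= 'a' && c <= 'z', c >= 'A' && c <= 'Z', c >= '0' && c <= '9':
		return true
	}
	return strings.IndexByte("!#$%&'*+-/=?^_`{|}~", c) >= 0
}

// isDotAtomText reports whether s matches 1*atext *("." 1*atext).
// Hypothetical helper for illustration only.
func isDotAtomText(s string) bool {
	for _, part := range strings.Split(s, ".") {
		if part == "" {
			return false // empty string, leading/trailing dot, or ".."
		}
		for i := 0; i < len(part); i++ {
			if !isAtext(part[i]) {
				return false
			}
		}
	}
	return true
}

func main() {
	fmt.Println(isDotAtomText("sqitch-users+subscribe")) // true: "+" is atext
	fmt.Println(isDotAtomText("sqitch-users subscribe")) // false: space is not atext
}
```

Under this transcription the local part with a + passes, while the version with a space (as shown in the error message) does not.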

To Reproduce

Steps to reproduce the behaviour:

  1. Create a file with a mailto: anchor with a + sign in the local part
  2. Scan it with htmltest
  3. See error

.htmltest.yml

DirectoryPath: public

Source files

https://justatheory.com/2013/06/sqitch-list/

Expected behaviour

A mailto: address with a + in the local part should be valid.

Actual behaviour

htmltest finds it invalid with this message:

  invalid email address (invalid format): 'sqitch-users subscribe@googlegroups.com' --- 2013/06/sqitch-list/index.html --> mailto:sqitch-users+subscribe@googlegroups.com

Versions

Additional context

Thanks!

theory commented 2 years ago

I also added this to my config:

IgnoreURLs:
  - mailto:sqitch-users+subscribe@googlegroups.com

htmltest still reports it, though. HTTP URLs on the list are properly ignored.

wjdp commented 1 year ago

This is a problem because I'm currently using github.com/badoux/checkmail to validate emails, and its regex fails the address above. I likely need to remove it and replace it with a much more forgiving one.

Qup42 commented 5 months ago

There is actually another problem at play here: the mail address with the + is first URL-decoded with net/url.QueryUnescape, which converts the + to a space. Notice that the + has become a space in the address quoted at the start of the error message. This is also the reason why your IgnoreURLs entry does not work. A workaround is to encode the + as %2B. A better solution would be to use PathUnescape instead (at least for checking mailto links), since it does not unescape + to a space. Decoding + as a space is common but controversial, and it is strongly discouraged for the mailto scheme.
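
The difference between the two standard-library functions can be seen in isolation. The snippet below is a minimal sketch and not htmltest's code; decodeBoth is a hypothetical helper name, and only url.QueryUnescape and url.PathUnescape come from net/url.

```go
package main

import (
	"fmt"
	"net/url"
)

// decodeBoth runs the same input through both unescape functions.
// Hypothetical helper for illustration only.
func decodeBoth(s string) (query, path string, err error) {
	if query, err = url.QueryUnescape(s); err != nil {
		return "", "", err
	}
	if path, err = url.PathUnescape(s); err != nil {
		return "", "", err
	}
	return query, path, nil
}

func main() {
	inputs := []string{
		"sqitch-users+subscribe@googlegroups.com",   // literal "+"
		"sqitch-users%2Bsubscribe@googlegroups.com", // "+" percent-encoded
	}
	for _, in := range inputs {
		q, p, err := decodeBoth(in)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Printf("in=%q\n  QueryUnescape=%q\n  PathUnescape =%q\n", in, q, p)
	}
}
```

For the literal + input, QueryUnescape yields "sqitch-users subscribe@googlegroups.com" while PathUnescape leaves the + alone. For the %2B input, both yield the address with a +, which is why the workaround holds under either function.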

Qup42 commented 5 months ago

I dug a bit deeper. The actual mail validation is not the problem. Mails with + are accepted just fine. Just use %2B as a workaround. The decoding of + to space by QueryUnescape is really the only problem here.

This would only concern a small part of the mailto URI handling. RFC 6068 states that spaces should be percent-encoded and advises against encoding spaces as +. But I also wouldn't want to break something just to move closer to the standard. What are your thoughts on switching from QueryUnescape to PathUnescape?
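
One way to limit the blast radius would be to switch decoders only for mailto links and leave everything else on QueryUnescape. The sketch below illustrates that idea under assumptions: unescapeHref is a hypothetical function, not the one htmltest uses, and the example.com addresses are placeholders.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// unescapeHref decodes percent-escapes in href. For mailto: links it uses
// PathUnescape so a literal "+" survives; every other link keeps the
// QueryUnescape behaviour described above. Hypothetical helper, not htmltest code.
func unescapeHref(href string) (string, error) {
	if strings.HasPrefix(strings.ToLower(href), "mailto:") {
		return url.PathUnescape(href)
	}
	return url.QueryUnescape(href)
}

func main() {
	for _, href := range []string{
		"mailto:list+subscribe@example.com",             // "+" kept
		"mailto:list%2Bsubscribe@example.com",           // %2B becomes "+"
		"mailto:list@example.com?subject=hello%20world", // %20 becomes a space
		"search?q=a+b",                                  // non-mailto: "+" still becomes a space
	} {
		out, err := unescapeHref(href)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Printf("%q -> %q\n", href, out)
	}
}
```

With this split, mailto links that already use %2B or %20 decode exactly as before, and only a literal + inside a mailto link changes meaning.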