weppos / publicsuffix-ruby

Domain name parser for Ruby based on the Public Suffix List.
https://simonecarletti.com/code/publicsuffix
MIT License

Always read data/list.txt as UTF-8 to avoid "ArgumentError: invalid byte sequence in US-ASCII" when parsing it #118

Open dentarg opened 7 years ago

dentarg commented 7 years ago

If your environment fails to specify UTF-8, Ruby defaults to US-ASCII, and when public_suffix tries to parse the list data, it fails:

$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix' ; list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH) ; PublicSuffix::List.parse(list_data, private_domains: false) ; nil
ArgumentError: invalid byte sequence in US-ASCII
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:89:in `strip!'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:89:in `block (2 levels) in parse'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:88:in `each_line'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:88:in `block in parse'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:128:in `initialize'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:87:in `new'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:87:in `parse'
    from (irb):1
    from /Users/dentarg/.rubies/ruby-2.2.5/bin/irb:11:in `<main>'
irb(main):002:0> Encoding.default_external
=> #<Encoding:US-ASCII>
irb(main):003:0> RUBY_VERSION
=> "2.2.5"
irb(main):004:0>

Passing encoding: Encoding::UTF_8 to File.read makes it work, even if the default encoding isn't UTF-8:

$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix' ; list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH, encoding: Encoding::UTF_8) ; PublicSuffix::List.parse(list_data, private_domains: false) ; nil
=> nil
irb(main):002:0> RUBY_VERSION
=> "2.2.5"
irb(main):003:0> Encoding.default_external
=> #<Encoding:US-ASCII>
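
To see the difference in isolation, here is a minimal sketch that does not use public_suffix at all. It writes a hypothetical two-rule sample (`SAMPLE_LIST`, not the real data/list.txt) to a temporary file, forces the default external encoding to US-ASCII to mimic the empty-locale shell above, and reads the file with and without `encoding:`. The helper name `read_under_ascii_default` is made up for illustration.

```ruby
require 'tmpdir'

# Hypothetical stand-in for data/list.txt: one ASCII rule and one IDN rule.
SAMPLE_LIST = "com\n\u516C\u53F8.cn\n"

# Reads +path+ twice while the default external encoding is US-ASCII,
# mimicking a process started with empty LANG/LC_ALL. The previous
# default is restored afterwards.
def read_under_ascii_default(path)
  saved = Encoding.default_external
  Encoding.default_external = Encoding::US_ASCII
  implicit = File.read(path)                             # tagged with the default (US-ASCII)
  explicit = File.read(path, encoding: Encoding::UTF_8)  # tagged UTF-8 regardless of locale
  [implicit, explicit]
ensure
  Encoding.default_external = saved
end

implicit, explicit = Dir.mktmpdir do |dir|
  path = File.join(dir, 'list.txt')
  File.binwrite(path, SAMPLE_LIST) # write the raw UTF-8 bytes
  read_under_ascii_default(path)
end

puts implicit.encoding         # US-ASCII
puts implicit.valid_encoding?  # false
puts explicit.encoding         # UTF-8
puts explicit.valid_encoding?  # true
```

The bytes are identical in both strings; only the encoding tag differs, which is why string methods that need to decode characters can fail on the first one.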

Related to https://github.com/weppos/publicsuffix-ruby/issues/94 (maybe the list data has changed since?)

weppos commented 7 years ago

Thanks @dentarg, I'll investigate. Are you able to tell me which line in the definition file is causing the issue?

dentarg commented 7 years ago

@weppos I hope this helps (I'm in a hurry now, so I haven't checked this too closely)

$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix' ; list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH) ; nil
=> nil
irb(main):002:0> list_data.class
=> String
irb(main):007:0> ctr = 0 ; outside_line = "" ; list_data.each_line { |line| ctr += 1 ; outside_line = line ; line.strip! } ; nil
ArgumentError: invalid byte sequence in US-ASCII
    from (irb):7:in `strip!'
    from (irb):7:in `block in irb_binding'
    from (irb):7:in `each_line'
    from (irb):7
    from /Users/dentarg/.rubies/ruby-2.2.5/bin/irb:11:in `<main>'
irb(main):008:0> ctr
=> 610
irb(main):009:0> outside_line
=> "\xE5\x85\xAC\xE5\x8F\xB8.cn\n"

dentarg commented 7 years ago

This was with 2.0.3:

irb(main):010:0> PublicSuffix::List::DEFAULT_LIST_PATH
=> "/Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.3/lib/public_suffix/../../data/list.txt"

dentarg commented 7 years ago

Hmm... maybe I was naive to believe that everything would be good by File.read with encoding: Encoding::UTF_8 just because it doesn't raise any exception. Seems like "网络.cn\n" is read as "\u7F51\u7EDC.cn\n". This is on OS X 10.11.6, Ruby 2.2.5, zsh 5.0.8, publicsuffix-2.0.3. I don't think I fully understand all the LANG, LANGUAGE, `LC*` business.

$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix'
=> true
irb(main):002:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH, encoding: Encoding::UTF_8).each_line.to_a[610]
=> "\u7F51\u7EDC.cn\n"
irb(main):003:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH, encoding: Encoding::UTF_8).each_line.to_a[610].strip!
=> "\u7F51\u7EDC.cn"
irb(main):004:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610]
=> "\xE7\xBD\x91\xE7\xBB\x9C.cn\n"
irb(main):005:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610].strip!
ArgumentError: invalid byte sequence in US-ASCII
    from (irb):5:in `strip!'
    from (irb):5
    from /Users/dentarg/.rubies/ruby-2.2.5/bin/irb:11:in `<main>'
irb(main):006:0> %w(LANG LANGUAGE LC_ALL LC_CTYPE).map { |v| ENV[v] }
=> ["", "", "", ""]
$ irb
irb(main):001:0> require 'public_suffix'
=> true
irb(main):002:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610]
=> "网络.cn\n"
irb(main):003:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610].strip!
=> "网络.cn"
irb(main):004:0> %w(LANG LANGUAGE LC_ALL LC_CTYPE).map { |v| ENV[v] }
=> ["en_US.UTF-8", "en_US.UTF-8", "en_US.UTF-8", "en_US.UTF-8"]
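
The `\u7F51\u7EDC` form appears to be only how irb displays a valid UTF-8 string when the default external encoding is US-ASCII, so the explicit-encoding read is likely fine. A small sketch to check this, using only the bytes shown in the session above (public_suffix is not involved, and the variable names are illustrative):

```ruby
# Raw bytes of the line, exactly as printed by the US-ASCII read above.
raw = "\xE7\xBD\x91\xE7\xBB\x9C.cn\n".b

as_ascii = raw.dup.force_encoding(Encoding::US_ASCII)
as_utf8  = raw.dup.force_encoding(Encoding::UTF_8)

puts as_ascii.valid_encoding?          # false
puts as_utf8.valid_encoding?           # true
puts as_utf8 == "\u7F51\u7EDC.cn\n"    # true
puts as_utf8.strip.length              # 5
```

The same bytes are invalid as US-ASCII and valid as UTF-8, and the UTF-8 string compares equal to the escaped form irb printed.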

tamoyal commented 6 years ago

I'm having this problem with version 3.0.3

SeanDunford commented 5 years ago

Bump. Is this project dead? Does anyone have a fork or alternate project where this is working? @weppos

weppos commented 5 years ago

> Bump. Is this project dead? Does anyone have a fork or alternate project where this is working? @weppos

It is not dead. If your operating environment is set with the correct UTF-8 language value, the library will work perfectly.

aleksandrs-ledovskis commented 5 years ago

FWIW, it would seem correct for the gem not to depend on (i.e. to be agnostic of) any environment setup for nominal operation.

weppos commented 5 years ago

@SeanDunford @aleksandrs-ledovskis feel free to provide a patch and I will review it. So far, the only one who provided practical help was @dentarg, but even he admitted the problem may not be that easy to solve.

Frankly, I am reluctant to put any effort into trying to make UTF-8 work because the real solution is to pre-process the list and have it stored in Punycode as this is how names should be managed and compared.

It's just not at the top of my priorities right now. PRs are always welcome.

alexef commented 3 years ago

This is still broken in 4.0.3 on ruby:2.4-slim-buster docker image.

A workaround is setting: LANG=en_US.UTF-8 LANGUAGE=en_US.UTF-8 LC_ALL=en_US.UTF-8 before calling ruby.

dentarg commented 3 years ago

Looks like LANG=C.UTF-8 is enough; the Docker images for Ruby >= 2.5 set that

```
$ docker run --rm ruby:2.4-slim-buster env
PATH=/usr/local/bundle/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=2ea0e1a03e36
RUBY_MAJOR=2.4
RUBY_VERSION=2.4.10
RUBY_DOWNLOAD_SHA256=d5668ed11544db034f70aec37d11e157538d639ed0d0a968e2f587191fc530df
RUBYGEMS_VERSION=3.0.3
GEM_HOME=/usr/local/bundle
BUNDLE_SILENCE_ROOT_WARNING=1
BUNDLE_APP_CONFIG=/usr/local/bundle
HOME=/root
```

vs

```
$ docker run --rm ruby:2.5-slim-buster env
PATH=/usr/local/bundle/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=7d11ed52a0af
LANG=C.UTF-8
RUBY_MAJOR=2.5
RUBY_VERSION=2.5.8
RUBY_DOWNLOAD_SHA256=0391b2ffad3133e274469f9953ebfd0c9f7c186238968cbdeeb0651aa02a4d6d
RUBYGEMS_VERSION=3.0.3
GEM_HOME=/usr/local/bundle
BUNDLE_SILENCE_ROOT_WARNING=1
BUNDLE_APP_CONFIG=/usr/local/bundle
HOME=/root
```

Running my initial example

```ruby
# publicsuffix.rb
require 'bundler/inline'
gemfile do
  source 'https://rubygems.org'
  gem 'public_suffix'
end
puts RUBY_VERSION
puts PublicSuffix::List::DEFAULT_LIST_PATH
list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH)
PublicSuffix::List.parse(list_data, private_domains: false)
```

In `ruby:2.4-slim-buster`

```shell
$ docker run --rm -it -v $(pwd):/app -w /app ruby:2.4-slim-buster bash
root@aa7eb67dce29:/app# gem install bundler
Fetching bundler-2.2.8.gem
Successfully installed bundler-2.2.8
1 gem installed
root@aa7eb67dce29:/app# ruby publicsuffix.rb
2.4.10
/usr/local/bundle/gems/public_suffix-4.0.6/data/list.txt
/usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `count': invalid byte sequence in US-ASCII (ArgumentError)
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `initialize'
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `new'
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `build'
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:334:in `factory'
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:94:in `block (2 levels) in parse'
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `each_line'
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `block in parse'
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:108:in `initialize'
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `new'
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `parse'
    from publicsuffix.rb:9:in `<main>'
root@aa7eb67dce29:/app# LANG=C.UTF-8 ruby publicsuffix.rb
2.4.10
/usr/local/bundle/gems/public_suffix-4.0.6/data/list.txt
```

In `ruby:2.5-slim-buster`

```shell
$ docker run --rm -it -v $(pwd):/app -w /app ruby:2.5-slim-buster bash
root@b87a1b578bbf:/app# ruby publicsuffix.rb
2.5.8
/usr/local/bundle/gems/public_suffix-4.0.6/data/list.txt
```

The problematic code in public_suffix is PublicSuffix::List.default

https://github.com/weppos/publicsuffix-ruby/blob/c4c301231549f98b53bd987c9398b3a366aad815/lib/public_suffix/list.rb#L44-L52

$ docker run --rm -it ruby:2.4-slim-buster bash
root@31cd6631fcaa:/# gem install public_suffix
Fetching public_suffix-4.0.6.gem
Successfully installed public_suffix-4.0.6
1 gem installed
root@31cd6631fcaa:/# ruby -rpublic_suffix -e 'PublicSuffix::List.default'
/usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `count': invalid byte sequence in US-ASCII (ArgumentError)
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `initialize'
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `new'
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `build'
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:334:in `factory'
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:94:in `block (2 levels) in parse'
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `each_line'
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `block in parse'
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:108:in `initialize'
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `new'
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `parse'
    from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:51:in `default'
    from -e:1:in `<main>'
root@31cd6631fcaa:/# LANG=C.UTF-8 ruby -rpublic_suffix -e 'PublicSuffix::List.default'

zavan commented 3 years ago

I'm encountering an error that is probably related to this:

domain = PublicSuffix.domain(request.host)
Tenant.find_by!(domain: domain)

Raises: ArgumentError (Cannot transliterate strings with ASCII-8BIT encoding)

Forcing UTF-8 works:

domain = PublicSuffix.domain(host).to_s.force_encoding('UTF-8')

Ruby: 3.0.0 Rails: 6.1.3 Gem: 4.0.6
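
The workaround above relies on `force_encoding` changing only the encoding tag, not the bytes. A minimal sketch of that behavior, assuming the value comes back tagged as ASCII-8BIT (the helper name `utf8_tagged` is hypothetical and is not part of public_suffix or Rails):

```ruby
# Hypothetical helper: returns a UTF-8-tagged copy of +str+ without touching its bytes.
# Raises if the bytes are not valid UTF-8, so bad input is not silently accepted.
def utf8_tagged(str)
  copy = str.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, 'not valid UTF-8' unless copy.valid_encoding?
  copy
end

binary = 'example.com'.b   # simulate a domain string tagged ASCII-8BIT
tagged = utf8_tagged(binary)

puts binary.encoding == Encoding::ASCII_8BIT  # true
puts tagged.encoding == Encoding::UTF_8       # true
puts tagged.bytes == binary.bytes             # true
```

Working on a copy avoids mutating the original string, which `force_encoding` would otherwise do in place.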

mcarpenter commented 4 months ago

Two workarounds below.

  1. Set the encoding using the Ruby interpreter's -E flag:

    ruby -E utf-8 ./foo.rb

  2. Set the external encoding programmatically:

    require 'public_suffix'

    Encoding.default_external = 'utf-8'
    puts PublicSuffix.parse('example.com').inspect