Open dentarg opened 7 years ago
Thanks @dentarg, I'll investigate. Are you able to tell me which line in the definition file is causing the issue?
@weppos I hope this helps (I'm in a hurry now, so I haven't checked this too closely)
$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix' ; list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH) ; nil
=> nil
irb(main):002:0> list_data.class
=> String
irb(main):007:0> ctr = 0 ; outside_line = "" ; list_data.each_line { |line| ctr += 1 ; outside_line = line ; line.strip! } ; nil
ArgumentError: invalid byte sequence in US-ASCII
from (irb):7:in `strip!'
from (irb):7:in `block in irb_binding'
from (irb):7:in `each_line'
from (irb):7
from /Users/dentarg/.rubies/ruby-2.2.5/bin/irb:11:in `<main>'
irb(main):008:0> ctr
=> 610
irb(main):009:0> outside_line
=> "\xE5\x85\xAC\xE5\x8F\xB8.cn\n"
This was with 2.0.3:
irb(main):010:0> PublicSuffix::List::DEFAULT_LIST_PATH
=> "/Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.3/lib/public_suffix/../../data/list.txt"
Hmm... maybe I was naive to believe that everything would be fine with File.read and encoding: Encoding::UTF_8 just because it doesn't raise any exception. It seems like "网络.cn\n" is read as "\u7F51\u7EDC.cn\n". This is on OS X 10.11.6, Ruby 2.2.5, zsh 5.0.8, public_suffix 2.0.3. I don't think I fully understand all the LANG, LANGUAGE, `LC*` business.
$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix'
=> true
irb(main):002:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH, encoding: Encoding::UTF_8).each_line.to_a[610]
=> "\u7F51\u7EDC.cn\n"
irb(main):003:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH, encoding: Encoding::UTF_8).each_line.to_a[610].strip!
=> "\u7F51\u7EDC.cn"
irb(main):004:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610]
=> "\xE7\xBD\x91\xE7\xBB\x9C.cn\n"
irb(main):005:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610].strip!
ArgumentError: invalid byte sequence in US-ASCII
from (irb):5:in `strip!'
from (irb):5
from /Users/dentarg/.rubies/ruby-2.2.5/bin/irb:11:in `<main>'
irb(main):006:0> %w(LANG LANGUAGE LC_ALL LC_CTYPE).map { |v| ENV[v] }
=> ["", "", "", ""]
$ irb
irb(main):001:0> require 'public_suffix'
=> true
irb(main):002:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610]
=> "网络.cn\n"
irb(main):003:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610].strip!
=> "网络.cn"
irb(main):004:0> %w(LANG LANGUAGE LC_ALL LC_CTYPE).map { |v| ENV[v] }
=> ["en_US.UTF-8", "en_US.UTF-8", "en_US.UTF-8", "en_US.UTF-8"]
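The sessions above can be reproduced without touching the shell locale. The following is only a sketch: the helper names `read_both_ways` and `demo` are made up for illustration, and the US-ASCII case is simulated by retagging the raw bytes rather than by clearing LANG. It reads the same bytes twice and shows that only the encoding tag differs. The `\u7F51\u7EDC` in the irb output above appears to be just how `inspect` renders non-ASCII characters when the external encoding is US-ASCII, not a sign that the data was read incorrectly.

```ruby
require "tempfile"

# Hypothetical helper: read the same file twice.
# The first read simulates what File.read returns under an empty locale
# (raw bytes tagged as US-ASCII); the second passes the encoding explicitly.
def read_both_ways(path)
  ascii_tagged = File.binread(path).force_encoding(Encoding::US_ASCII)
  utf8_tagged  = File.read(path, encoding: Encoding::UTF_8)
  [ascii_tagged, utf8_tagged]
end

# Hypothetical demo: write one IDN rule to a temp file and read it back.
def demo
  Tempfile.create("list") do |file|
    file.binmode
    file.write("\u7F51\u7EDC.cn\n") # the rule from index 610 in the sessions above
    file.flush
    read_both_ways(file.path)
  end
end

ascii_tagged, utf8_tagged = demo
puts ascii_tagged.encoding        # US-ASCII
puts ascii_tagged.valid_encoding? # false, so string methods may raise
puts utf8_tagged.encoding         # UTF-8
puts utf8_tagged.valid_encoding?  # true
```

The bytes are identical in both strings. Only the UTF-8 tagged one is a valid string, which is why `strip!` succeeds on it.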
I'm having this problem with version 3.0.3
Bump. Is this project dead? Does anyone have a fork or alternate project where this is working? @weppos
It is not dead. If your operating environment is set with the correct UTF8 language value, the library will work perfectly.
FWIW, it would seem correct for the gem not to depend on any particular environment setup for normal operation.
@SeanDunford @aleksandrs-ledovskis feel free to provide a patch and I will review it. So far, the only one who has provided practical help is @dentarg, but even he admitted the problem may not be that easy to solve.
Frankly, I am reluctant to put any effort into trying to make UTF-8 work because the real solution is to pre-process the list and have it stored in Punycode as this is how names should be managed and compared.
It's just not at the top of my priorities right now. PRs are always welcome.
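To make the Punycode idea concrete, here is a rough sketch of the encoding step from RFC 3492, applied label by label. It is not taken from public_suffix and the gem does not do this today. `PunycodeSketch` and its methods are hypothetical names, and the sketch skips the IDNA mapping and normalization steps that a real pre-processor would need.

```ruby
# Hypothetical module, not part of public_suffix.
module PunycodeSketch
  BASE = 36
  TMIN = 1
  TMAX = 26
  SKEW = 38
  DAMP = 700
  INITIAL_BIAS = 72
  INITIAL_N = 128
  DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"

  module_function

  # Bias adaptation (RFC 3492, section 6.1).
  def adapt(delta, num_points, first_time)
    delta = first_time ? delta / DAMP : delta / 2
    delta += delta / num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) / 2
      delta /= (BASE - TMIN)
      k += BASE
    end
    k + ((BASE - TMIN + 1) * delta) / (delta + SKEW)
  end

  # Encode a single label, without the "xn--" prefix (RFC 3492, section 6.3).
  def encode_label(label)
    code_points = label.codepoints
    basic = code_points.select { |c| c < INITIAL_N }
    out = basic.map { |c| c.chr(Encoding::UTF_8) }
    out << "-" unless basic.empty?

    n = INITIAL_N
    delta = 0
    bias = INITIAL_BIAS
    handled = basic.length

    while handled < code_points.length
      # Smallest code point not yet handled.
      m = code_points.select { |c| c >= n }.min
      delta += (m - n) * (handled + 1)
      n = m
      code_points.each do |c|
        delta += 1 if c < n
        next unless c == n

        # Emit delta as a variable-length integer.
        q = delta
        k = BASE
        loop do
          t = (k - bias).clamp(TMIN, TMAX)
          break if q < t
          out << DIGITS[t + (q - t) % (BASE - t)]
          q = (q - t) / (BASE - t)
          k += BASE
        end
        out << DIGITS[q]
        bias = adapt(delta, handled + 1, handled == basic.length)
        delta = 0
        handled += 1
      end
      delta += 1
      n += 1
    end
    out.join
  end

  # Convert every non-ASCII label of a rule or domain to its "xn--" form.
  def to_ascii(domain)
    domain.split(".").map { |label|
      label.ascii_only? ? label : "xn--" + encode_label(label)
    }.join(".")
  end
end

puts PunycodeSketch.to_ascii("\u7F51\u7EDC.cn") # the rule 网络.cn from above
```

With the list stored this way, every rule would be plain ASCII and parsing would no longer depend on the default external encoding.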
This is still broken in 4.0.3 on the ruby:2.4-slim-buster Docker image.
A workaround is setting LANG=en_US.UTF-8 LANGUAGE=en_US.UTF-8 LC_ALL=en_US.UTF-8 before calling ruby.
LANG=C.UTF-8 is enough; the Docker images for Ruby >= 2.5 set that.
The problematic code in public_suffix is PublicSuffix::List.default
$ docker run --rm -it ruby:2.4-slim-buster bash
root@31cd6631fcaa:/# gem install public_suffix
Fetching public_suffix-4.0.6.gem
Successfully installed public_suffix-4.0.6
1 gem installed
root@31cd6631fcaa:/# ruby -rpublic_suffix -e 'PublicSuffix::List.default'
/usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `count': invalid byte sequence in US-ASCII (ArgumentError)
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `initialize'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `new'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `build'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:334:in `factory'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:94:in `block (2 levels) in parse'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `each_line'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `block in parse'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:108:in `initialize'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `new'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `parse'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:51:in `default'
from -e:1:in `<main>'
root@31cd6631fcaa:/# LANG=C.UTF-8 ruby -rpublic_suffix -e 'PublicSuffix::List.default'
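When checking whether a given container or shell is affected, it can help to print what Ruby derived from the locale. A small sketch follows; the method name `encoding_report` is made up, and the values will differ per environment.

```ruby
# Hypothetical diagnostic helper: collect the locale variables and the
# encodings Ruby derived from them.
def encoding_report
  {
    "LANG"             => ENV["LANG"],
    "LC_ALL"           => ENV["LC_ALL"],
    "LC_CTYPE"         => ENV["LC_CTYPE"],
    "locale_charmap"   => Encoding.locale_charmap,
    "default_external" => Encoding.default_external.name,
  }
end

encoding_report.each { |key, value| puts "#{key}: #{value.inspect}" }
```

If `default_external` is reported as US-ASCII, the environment matches the failing case shown above.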
I'm encountering an error that is probably related to this:
domain = PublicSuffix.domain(request.host)
Tenant.find_by!(domain: domain)
Raises:
ArgumentError (Cannot transliterate strings with ASCII-8BIT encoding)
Forcing UTF-8 works:
domain = PublicSuffix.domain(host).to_s.force_encoding('UTF-8')
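For context on why this works: `force_encoding` only changes the encoding tag and leaves the bytes alone, so it is worth checking that the result is actually valid UTF-8 before handing it on. A minimal sketch, where `as_utf8` is a hypothetical helper and not part of public_suffix or Rails:

```ruby
# Hypothetical helper: retag a string as UTF-8 and verify the bytes are valid.
def as_utf8(value)
  # dup so a frozen or shared string is never modified in place
  str = value.to_s.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "not valid UTF-8: #{str.inspect}" unless str.valid_encoding?
  str
end

binary = "example.com".b          # same bytes, tagged ASCII-8BIT
puts binary.encoding              # ASCII-8BIT
puts as_utf8(binary).encoding     # UTF-8
```

In the snippet above this would be called as `as_utf8(PublicSuffix.domain(host))`.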
Ruby: 3.0.0 Rails: 6.1.3 Gem: 4.0.6
Two workarounds below.
Set the encoding using the Ruby interpreter's -E flag:
ruby -E utf-8 ./foo.rb
Set the external encoding programmatically:
require 'public_suffix'
Encoding.default_external = 'utf-8'
puts PublicSuffix.parse('example.com').inspect
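If changing the default for the whole program feels too broad, one option is to change it only around the call that loads the list and restore it afterwards. This is a sketch under the assumption that nothing else reads files concurrently, since `Encoding.default_external` is process-wide; `with_default_external` is a hypothetical helper.

```ruby
# Hypothetical helper: temporarily switch the default external encoding.
# Note: the setting is global to the process, so this is not thread-safe.
def with_default_external(encoding)
  previous = Encoding.default_external
  Encoding.default_external = encoding
  yield
ensure
  Encoding.default_external = previous
end

inside = with_default_external(Encoding::UTF_8) { Encoding.default_external }
puts inside # UTF-8

# Possible use with the gem (assumes public_suffix is installed):
#   require 'public_suffix'
#   with_default_external(Encoding::UTF_8) { PublicSuffix::List.default }
```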
If your environment fails to specify UTF-8, Ruby defaults to US-ASCII, and when public_suffix tries to parse the list data, it fails.
Passing encoding: Encoding::UTF_8 to File.read makes it work, even if the default encoding isn't UTF-8.
Related to https://github.com/weppos/publicsuffix-ruby/issues/94 (maybe the list data has changed since?)