weppos / publicsuffix-ruby

Domain name parser for Ruby based on the Public Suffix List.
https://simonecarletti.com/code/publicsuffix
MIT License
617 stars 109 forks source link

Any suggestions to speed up list parsing? #78

Closed benbalter closed 9 years ago

benbalter commented 9 years ago

I use PublicSuffix as part of Gman.

In addition to using PublicSuffix's native valid? check, Gman also contains it's own public-suffix formatted list of government domains which it then checks using PublicSuffix's rule logic. The relevant code is:

   # check using public suffix's standard logic
    rule = Gman.list.find domain

    # domain is on the domain list and
    # domain is not explicitly blacklisted and
    # domain matches a standard public suffix list rule
    !rule.nil? && rule.type != :exception && rule.allow?(".#{domain}")

Profiling against 10,000 random (valid) domains, which took about 250 seconds, here's the breakdown:

 %self      total      self      wait     child     calls  name
 23.59     78.365    63.150     0.000    15.215 32708764   PublicSuffix::Rule::Base#odiff
 18.37    232.715    49.154     0.000   183.560 32708764   PublicSuffix::Rule::Base#match?
 16.08    105.479    43.029     0.000    62.451 32777587   <Class::PublicSuffix::Domain>#domain_to_labels
 14.47     38.736    38.736     0.000     0.000 32827949   String#split
  8.26    254.831    22.116     0.000   232.715    49998   Array#select
  5.68     15.215    15.215     0.000     0.000 32708764   Array#[]
  5.10     13.648    13.648     0.000     0.000 32787586   Array#reverse
  3.84     10.270    10.270     0.000     0.000 33055272   String#to_s
  0.79      2.507     2.106     0.000     0.400    79975   PublicSuffix::Rule::Normal#decompose
  0.40      1.066     1.066     0.000     0.000    49998   Array#values_at
  0.18      3.915     0.477     0.000     3.438    30001   <Class::Addressable::URI>#parse
  0.15    256.623     0.414     0.000   256.209    49998   PublicSuffix::List#select

It appears the bottleneck is in PublicSuffix's matching. I see https://github.com/weppos/publicsuffix-ruby/pull/2, but do you have any suggestions how to speed up PublicSuffix's parsing, both for it's own native list, and for Gman's vendor list?

weppos commented 9 years ago

Hi Ben,

When I originally conceived the list, I honestly didn't care too much about performance. There are several possible optimization that could be introduced.

A simple one can be to parse the list into a balanced BST which can be queried in O(log n). Right now, if I recall correctly, the lookup is O(n).

This is the quickest idea that come to my mind. I should spend some time investigating the code and measuring performance. However, I can't promise I'll be able to dig into this issue in the near future.

benbalter commented 9 years ago

Ahh awesome. I wasn't sure if it was my implementation or within PublicSuffix itself. Actually was able to speed things up quite a bit by passing the PublicSuffix::Domain object around internally, rather than the domain string, and memoizing domain.to_s, which brought things down from about 2:30 to parse 10,000 domains to about 20 seconds. Thanks for the awesome Gem. :smile_cat:

weppos commented 9 years ago

Wow, that's a great result!

On Friday, July 10, 2015, Ben Balter notifications@github.com wrote:

Ahh awesome. I wasn't sure if it was my implementation or within PublicSuffix itself. Actually was able to speed things up quite a bit by passing the PublicSuffix::Domain object around internally, rather than the domain string, and memoizing domain.to_s, which brought things down from about 2:30 to parse 10,000 domains to about 20 seconds. Thanks for the awesome Gem. [image: :smile_cat:]

— Reply to this email directly or view it on GitHub https://github.com/weppos/publicsuffix-ruby/issues/78#issuecomment-120457293 .

Simone Carletti Passionate programmer and dive instructor

http://simonecarletti.com/ Twitter: @weppos https://twitter.com/weppos

weppos commented 8 years ago

Ben, FYI I made quite significant performance improvements this week. You can check them out at #92. I'll be happy to resurrect this issue if you need some help.