weppos / publicsuffix-ruby

Domain name parser for Ruby based on the Public Suffix List.
https://simonecarletti.com/code/publicsuffix
MIT License

Optimise select() for long subdomains #268

Open elliotwutingfeng opened 9 months ago

elliotwutingfeng commented 9 months ago

The current implementation of select() searches for the longest matching TLDs by walking the hostname from its right end all the way to its left end.

This approach is necessary to handle edge cases like example.s3.cn-north-1.amazonaws.com.cn.

However, it penalises hostnames with long subdomains like a.very.long.subdomain.example.co.uk, because every part is checked even when no rule can be that long.

We can terminate the search early by limiting the search size to [parts.size, @max_rule_size].min, where parts.size is the number of parts in the hostname and @max_rule_size is the number of parts in the largest rule in @rules.
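
The bounded search can be illustrated with a minimal sketch. This is not the library's code: RULES, MAX_RULE_SIZE and select_bounded are hypothetical names, the rule table is a small hand-picked sample, and wildcard, exception and private-rule handling are left out.

```ruby
# Toy rule table: a hypothetical stand-in for the library's @rules index.
RULES = {
  "uk" => true,
  "co.uk" => true,
  "cn" => true,
  "com.cn" => true,
  "s3.cn-north-1.amazonaws.com.cn" => true,
}.freeze

# Number of parts in the largest rule (assumed to be computed once,
# when the rules are loaded).
MAX_RULE_SIZE = RULES.keys.map { |rule| rule.count(".") + 1 }.max

# Returns every rule that matches a right-hand suffix of +name+,
# checking at most MAX_RULE_SIZE parts instead of all of them.
def select_bounded(name)
  parts = name.split(".").reverse
  limit = [parts.size, MAX_RULE_SIZE].min
  matches = []
  query = nil
  index = 0

  while index < limit
    # Grow the candidate suffix one part to the left on each pass.
    query = query.nil? ? parts[index] : "#{parts[index]}.#{query}"
    matches << query if RULES.key?(query)
    index += 1
  end

  matches
end

p select_bounded("a.very.long.subdomain.example.co.uk")
p select_bounded("example.s3.cn-north-1.amazonaws.com.cn")
```

With this sample table the largest rule has five parts, so the seven-part hostname stops after five lookups, while the five-part amazonaws rule is still reachable.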

I also replaced the Kernel#loop call with a faster bounded while loop, since the current break condition can be expressed as a loop condition.
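
As a rough illustration of that conversion, the sketch below builds the same list of candidate suffixes twice, once with Kernel#loop plus break and once with a while loop whose condition carries the bound. The method names are hypothetical and the bodies are simplified; they only mirror the shape of the loop described above.

```ruby
# Variant 1: Kernel#loop with an explicit break.
# Each iteration is a block invocation.
def suffixes_with_loop(parts)
  index = 0
  query = parts[index]
  out = []

  loop do
    out << query
    index += 1
    break if index >= parts.size

    query = "#{parts[index]}.#{query}"
  end

  out
end

# Variant 2: the break condition moved into the while condition.
# For non-empty input it yields the same result as variant 1.
def suffixes_with_while(parts)
  out = []
  query = nil
  index = 0

  while index < parts.size
    query = query.nil? ? parts[index] : "#{parts[index]}.#{query}"
    out << query
    index += 1
  end

  out
end

parts = "example.co.uk".split(".").reverse
p suffixes_with_loop(parts)
p suffixes_with_while(parts)
```

The while form avoids a block call per iteration, which is the usual reason it benchmarks faster than Kernel#loop in tight loops; the actual gain depends on the Ruby version and workload.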

Before

$ ruby test/benchmarks/bm_find_all.rb 1000000
Rehearsal -------------------------------------------------------------
NAME_SHORT                  2.348576   0.000000   2.348576 (  2.350146)
NAME_SHORT (noprivate)      2.444302   0.000000   2.444302 (  2.445995)
NAME_MEDIUM                 2.890648   0.000000   2.890648 (  2.892380)
NAME_MEDIUM (noprivate)     3.014823   0.000000   3.014823 (  3.017137)
NAME_LONG                   3.705042   0.002693   3.707735 (  3.710142)
NAME_LONG (noprivate)       3.727960   0.000000   3.727960 (  3.730321)
NAME_WILD                   3.657520   0.000000   3.657520 (  3.659759)
NAME_WILD (noprivate)       3.815247   0.000000   3.815247 (  3.817492)
NAME_EXCP                   4.420996   0.000000   4.420996 (  4.423570)
NAME_EXCP (noprivate)       4.408350   0.000000   4.408350 (  4.411540)
IAAA                        2.604410   0.000000   2.604410 (  2.605894)
IAAA (noprivate)            2.688674   0.000000   2.688674 (  2.690398)
IZZZ                        2.605931   0.000000   2.605931 (  2.607543)
IZZZ (noprivate)            2.679484   0.000000   2.679484 (  2.681334)
PAAA                        4.506107   0.000000   4.506107 (  4.509242)
PAAA (noprivate)            4.174697   0.000000   4.174697 (  4.177737)
PZZZ                        4.618712   0.000000   4.618712 (  4.622306)
PZZZ (noprivate)            4.323496   0.000000   4.323496 (  4.327372)
JP                          4.151477   0.000000   4.151477 (  4.154904)
JP (noprivate)              4.230317   0.000000   4.230317 (  4.234143)
IT                          2.645423   0.000000   2.645423 (  2.647490)
IT (noprivate)              2.731147   0.000000   2.731147 (  2.733281)
COM                         2.672895   0.000000   2.672895 (  2.675236)
COM (noprivate)             2.796167   0.000000   2.796167 (  2.798951)
--------------------------------------------------- total: 81.865094sec

                                user     system      total        real
NAME_SHORT                  2.455661   0.000000   2.455661 (  2.458051)
NAME_SHORT (noprivate)      2.465275   0.000000   2.465275 (  2.468431)
NAME_MEDIUM                 2.946424   0.000000   2.946424 (  2.949358)
NAME_MEDIUM (noprivate)     3.023296   0.000000   3.023296 (  3.025300)
NAME_LONG                   3.770850   0.000000   3.770850 (  3.773397)
NAME_LONG (noprivate)       3.828416   0.000000   3.828416 (  3.830904)
NAME_WILD                   3.749617   0.000000   3.749617 (  3.752038)
NAME_WILD (noprivate)       3.827687   0.000000   3.827687 (  3.830190)
NAME_EXCP                   4.418445   0.000000   4.418445 (  4.421315)
NAME_EXCP (noprivate)       4.531002   0.000000   4.531002 (  4.535273)
IAAA                        2.699374   0.000000   2.699374 (  2.700931)
IAAA (noprivate)            2.768779   0.000000   2.768779 (  2.771347)
IZZZ                        2.699160   0.000000   2.699160 (  2.702339)
IZZZ (noprivate)            2.766278   0.000000   2.766278 (  2.769706)
PAAA                        4.706753   0.000000   4.706753 (  4.711835)
PAAA (noprivate)            4.363877   0.000000   4.363877 (  4.367030)
PZZZ                        4.716710   0.000000   4.716710 (  4.722447)
PZZZ (noprivate)            4.109007   0.000000   4.109007 (  4.111433)
JP                          3.937950   0.000000   3.937950 (  3.941688)
JP (noprivate)              4.065472   0.000000   4.065472 (  4.070663)
IT                          2.628695   0.000000   2.628695 (  2.630612)
IT (noprivate)              2.718972   0.000000   2.718972 (  2.721554)
COM                         2.647181   0.000000   2.647181 (  2.649369)
COM (noprivate)             2.714115   0.000000   2.714115 (  2.715725)

After

$ ruby test/benchmarks/bm_find_all.rb 1000000
Rehearsal -------------------------------------------------------------
NAME_SHORT                  2.237599   0.000000   2.237599 (  2.239443)
NAME_SHORT (noprivate)      2.336548   0.000000   2.336548 (  2.338574)
NAME_MEDIUM                 2.713107   0.000000   2.713107 (  2.714795)
NAME_MEDIUM (noprivate)     2.830825   0.000000   2.830825 (  2.832685)
NAME_LONG                   3.042471   0.000000   3.042471 (  3.044456)
NAME_LONG (noprivate)       3.019529   0.003196   3.022725 (  3.024463)
NAME_WILD                   2.978485   0.000000   2.978485 (  2.980252)
NAME_WILD (noprivate)       3.088728   0.000000   3.088728 (  3.090743)
NAME_EXCP                   3.682105   0.000000   3.682105 (  3.684332)
NAME_EXCP (noprivate)       3.815742   0.000000   3.815742 (  3.818032)
IAAA                        2.458039   0.000000   2.458039 (  2.459425)
IAAA (noprivate)            2.496389   0.000000   2.496389 (  2.497893)
IZZZ                        2.404844   0.000000   2.404844 (  2.406255)
IZZZ (noprivate)            2.463744   0.000000   2.463744 (  2.465130)
PAAA                        3.515573   0.000000   3.515573 (  3.517585)
PAAA (noprivate)            3.193961   0.000000   3.193961 (  3.195845)
PZZZ                        3.587199   0.000000   3.587199 (  3.589388)
PZZZ (noprivate)            3.254129   0.000000   3.254129 (  3.256092)
JP                          3.783495   0.000000   3.783495 (  3.785693)
JP (noprivate)              3.885775   0.003331   3.889106 (  3.891664)
IT                          2.513112   0.000000   2.513112 (  2.514673)
IT (noprivate)              2.599210   0.000000   2.599210 (  2.600769)
COM                         2.539283   0.000000   2.539283 (  2.540692)
COM (noprivate)             2.485424   0.000000   2.485424 (  2.486922)
--------------------------------------------------- total: 70.931843sec

                                user     system      total        real
NAME_SHORT                  2.218905   0.000000   2.218905 (  2.220197)
NAME_SHORT (noprivate)      2.282971   0.000000   2.282971 (  2.284161)
NAME_MEDIUM                 2.707217   0.000000   2.707217 (  2.708815)
NAME_MEDIUM (noprivate)     2.781946   0.000000   2.781946 (  2.783615)
NAME_LONG                   3.018843   0.000000   3.018843 (  3.020559)
NAME_LONG (noprivate)       3.079345   0.000000   3.079345 (  3.081143)
NAME_WILD                   3.041727   0.000000   3.041727 (  3.043618)
NAME_WILD (noprivate)       3.079496   0.000000   3.079496 (  3.081228)
NAME_EXCP                   3.655873   0.000000   3.655873 (  3.658370)
NAME_EXCP (noprivate)       3.754648   0.000000   3.754648 (  3.756916)
IAAA                        2.507284   0.000000   2.507284 (  2.509283)
IAAA (noprivate)            2.540126   0.000000   2.540126 (  2.541872)
IZZZ                        2.466202   0.000000   2.466202 (  2.467584)
IZZZ (noprivate)            2.544616   0.000000   2.544616 (  2.546141)
PAAA                        3.622206   0.000000   3.622206 (  3.624447)
PAAA (noprivate)            3.272909   0.000000   3.272909 (  3.274831)
PZZZ                        3.675658   0.000000   3.675658 (  3.677843)
PZZZ (noprivate)            3.318359   0.000000   3.318359 (  3.320537)
JP                          3.882480   0.000000   3.882480 (  3.885434)
JP (noprivate)              3.971438   0.000000   3.971438 (  3.974437)
IT                          2.548282   0.000000   2.548282 (  2.549875)
IT (noprivate)              2.609304   0.000000   2.609304 (  2.610879)
COM                         2.569648   0.000000   2.569648 (  2.571186)
COM (noprivate)             2.497100   0.000000   2.497100 (  2.498543)

weppos commented 9 months ago

Thanks for your contribution @elliotwutingfeng. I need some time to review the changes.