Closed mscrivo closed 1 year ago
That is interesting. Unfortunately, I was unable to reproduce the problem using your sample code in my environment with Ruby 3.2 (and Python 3.6.10 installed via pyenv) It seems more likely to be a problem on the PyCall side, but I will take another look anyway. Thanks!
@mscrivo, I have modified the code and released version 0.1.5.0. Could you try it and check if the problem still occurs?
will do! I'll get back to you early next week to let you know if it worked or not .. thank you!
deployed this version to our test environment and it seems we're still seeing roughly the same volume of errors as before :(
TypeError: class or module required
/home/affinity/vendor/bundle/ruby/3.2.0/gems/pycall-1.4.2/lib/pycall.rb:47:in `exec': class or module required (TypeError)
from /home/vendor/bundle/ruby/3.2.0/gems/pycall-1.4.2/lib/pycall.rb:47:in `exec'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.0/lib/ruby-spacy.rb:216:in `initialize'
from /home/lib/nlp.rb:27:in `new'
from /home/lib/nlp.rb:27:in `initialize'
Hmmm... As a workaround, I added a retry mechanism to the constructor of the Spacy::Language
class, update to version 0.1.5.1 and check it out.
The maximum number of retries (every 0.5 seconds) is 20 by default, but you can set it in the second argument of the constructor:
@nlp = Spacy::Language.new('en_core_web_sm', 30)
I hope this workaround is effective.
That didn't seem to help unfortunately. Looking at your change, it seems like you put the retry in a different initialize? We're hitting the error here: https://github.com/yohasebe/ruby-spacy/blob/490c685ff3f93056e54f5f39e7540a7dfa6c61dd/lib/ruby-spacy.rb#L64
You are correct, I have added a retry to the Doc constructor and released version 0.1.5.2. If this does not work, please provide some more details about your use case.
This is what we're seeing now:
RuntimeError: Error: Failed to construct a Doc object
/home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:68:in `rescue in initialize': Error: Failed to construct a Doc object (RuntimeError)
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:71:in `rescue in initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:62:in `initialize'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:236:in `new'
from /home/vendor/bundle/ruby/3.2.0/gems/ruby-spacy-0.1.5.2/lib/ruby-spacy.rb:236:in `read'
from /home/lib/nlp.rb:49:in `segment_sentences'
and it's taking 10s now, which is expected because of the 20 retries * 0.5s, but it's causing increased latency. What I'm noticing is that the errors don't start right away, but after the service has been running for a little while, and it gradually gets worse until it's restarted. I'll have to do some more investigating into why initialize is even being called so much. I would have thought with a singleton and a single thread, it wouldn't be initialized once on startup and that's it.
I would suggest to remove the retries as they clearly aren't the right solution. I will spend some time investigating this on my end and see if I can figure out what's going on and get back to you. Thank you for all your help so far!
Sounds good. I would appreciate it if you could let me know if you find out anything about the cause. Thanks!
Ok I was able to repro, one time at least, using the following code:
require 'benchmark'
require 'ruby-spacy'
require 'faker'
class NLP
include Singleton
def initialize
@nlp = Spacy::Language.new('en_core_web_sm')
end
def segment_sentences(text)
return [] if text.blank?
@nlp.read(text).sents.map(&:text)
end
end
n = 5000
i = 0
Benchmark.bm do |x|
x.report do
n.times do
i = i+1
if i % 10 == 0 then
puts "GC start/compact ..."
GC.start
GC.compact
end
NLP.instance.segment_sentences(Faker::Lorem.paragraph(sentence_count: 500))
end
end
end
With that, I was able to see the following error on my machine:
x.report { n.times {NLP.instance.segment_sentences(Faker::Lorem.paragraph(sentence_count: 500))}}
user system total real
GC start/compact ...
NoMethodError: undefined method `each' for [Ratione odit atque., Nihil voluptatem autem...<full contents of string was here, but removed to make it shorter>]:Object
from /Users/michaelscrivo/.rbenv/versions/3.2.0/lib/ruby/gems/3.2.0/gems/pycall-1.4.2/lib/pycall/pyobject_wrapper.rb:52:in `method_missing'
Then if I ran it again in the same process, I got:
x.report { n.times {NLP.instance.segment_sentences(Faker::Lorem.paragraph(sentence_count: 500))}}
user system total real
GC start/compact ...
TypeError: class or module required
from /Users/michaelscrivo/.rbenv/versions/3.2.0/lib/ruby/gems/3.2.0/gems/pycall-1.4.2/lib/pycall/pyobject_wrapper.rb:93:in `call_object'
which mimics exactly what we're seeing in production. Once that happens, you cannot instantiate the object again in the same process, you have to restart. I even got it to crash and do a core dump once. So something is definitely interacting badly with Ruby's GC, and it looks like whatever changes they made in 3.2 made it much worse.
Here is the first part of the dump when it crashed:
/Users/michaelscrivo/.rbenv/versions/3.2.0/lib/ruby/gems/3.2.0/gems/pycall-1.4.2/lib/pycall.rb:45: [BUG] Segmentation fault at 0x0000000000000000
ruby 3.2.0 (2022-12-25 revision a528908271) [arm64-darwin22]
-- Crash Report log information --------------------------------------------
See Crash Report log file in one of the following locations:
* ~/Library/Logs/DiagnosticReports
* /Library/Logs/DiagnosticReports
for more details.
Don't forget to include the above Crash Report log file in bug reports.
-- Control frame information -----------------------------------------------
c:0056 p:---- s:0304 e:000303 CFUNC :__dict__
c:0055 p:0011 s:0300 e:000299 METHOD /Users/michaelscrivo/.rbenv/versions/3.2.0/lib/ruby/gems/3.2.0/gems/pycall-1.4.2/lib/pycall.rb:45
c:0054 p:0044 s:0292 e:000291 METHOD /Users/michaelscrivo/.rbenv/versions/3.2.0/lib/ruby/gems/3.2.0/gems/ruby-spacy-0.1.4/lib/ruby-spacy.rb:221 [FINISH]
c:0053 p:---- s:0287 e:000286 CFUNC :new
c:0052 p:0006 s:0282 e:000281 METHOD (pry):50 [FINISH]
c:0051 p:---- s:0278 e:000277 CFUNC :new
c:0050 p:0015 s:0274 e:000273 BLOCK /Users/michaelscrivo/.rbenv/versions/3.2.0/lib/ruby/3.2.0/singleton.rb:127 [FINISH]
I've narrowed it down to the GC.compact
call. If I have that in there, then at least 1 time out of 10 when running the code, it will exhibit the above problem. If I don't have the compact call, then I can run it dozens of times without failure.
Yep, definitely the GC.compact
calls. We have a background thread in our app that calls GC.compact every 30 mins that we've had for a while now that has worked well without causing any problems up till now, so I totally forgot about it. After removing that from this particular deployment that uses ruby-spacy, the issue has disappeared.
Excellent! I'm glad you finally found a way around the problem. That said, I am not sure if the problem is caused by Ruby's GC alone. I tried out your code on different environments:
It took a while, but everything finished without error.
One thing I noticed was that your test code uses String#blank?
provided by ActiveRecord. Also it uses Singleton
class without an explicit require "singleton"
. Is there any chance that some rubygems loaded elsewhere interfere with the GC?
One thing I noticed was that your test code uses String#blank? provided by ActiveRecord. Also it uses Singleton class without an explicit require "singleton". Is there any chance that some rubygems loaded elsewhere interfere with the GC
Highly doubtful. We use compact in all our deployments and have been for several years and it's never been an issue. This problem only manifests with pycall it seems and you can see that in the stack dump above. It's specifically the nlp instance that no longer points to anything. I will try to whittle down the repro to just something using Pycall and if I can do that, I'll file a bug over there.
My naive guess is that Pycall does something clever with its FFI and relies on a specific memory mapping to the python objects it's holding on to and GC.compact messes that up, or it's some very bizarre timing issue with GC.compact. But it's beyond my depth of expertise.
Seems like we might be missing a rb_gc_mark somewhere
I forgot about a potentially relevant feature of PyCall:
Releasing the RubyVM GVL during Python function calls https://github.com/mrkn/pycall.rb#releasing-the-rubyvm-gvl-during-python-function-calls
Can you run your code inside a PyCall.without_gvl
block to see if it helps to release Ruby's global VM lock?
I forgot about a potentially relevant feature of PyCall:
Releasing the RubyVM GVL during Python function calls mrkn/pycall.rb#releasing-the-rubyvm-gvl-during-python-function-calls
Can you run your code inside a
PyCall.without_gvl
block to see if it helps to release Ruby's global VM lock?
Tried with:
require 'benchmark'
require 'ruby-spacy'
require 'faker'
n = 5000
i = 0
Benchmark.bm do |x|
x.report do
n.times do
i = i+1
#if i % 2 == 0 then
puts "GC start/compact ..."
GC.compact
#end
PyCall.without_gvl do
Affinity::NLP.instance.segment_sentences(Faker::Lorem.paragraph(sentence_count: 500))
end
end
end
end
And the error was reproduced on the second try. This is not an issue of locking via the global lock, but rather PyCall losing it's memory mapping for the object when GC.compact
is called.
I see. Could you submit a report to PyCall? Hopefully, they can find something relevant to this issue.
yeah will do .. I'll close this issue for now. Thanks again for walking through that with me!
Hi @yohasebe, first off, thank you for this wonderful gem, it's been a very useful tool for us! We are having a bit of an issue with it lately that I'm stumped on and hoping you might be able to shed some light on. Perhaps the issue is completely on pycall and should be file over there, let me know. Anyhow, we have a singleton where we instantiate an instance of the lib and do some basic sentence segmenting. Code looks like this:
segment_sentences
is called by a single threaded process many times and what we're seeing quite often is many errors that look like this:What's interesting is that we had been getting this error periodically for quite some time, but since upgrading to Ruby 3.2, the number of them has exploded.
Do you have any idea what's wrong or how we can possibly mitigate the errors?
Thank you