Closed mzorz closed 4 years ago
Great work in this investigation @mzorz :tada:
Nice work on logging those details! Very clever way of "peeking under the hood". I found it useful to redirect stderr
as well:
adb -s emulator-5554 shell monkey -p org.wordpress.aztec -v 10000 | monkey-log.txt 2>&1
Also, at one point, I had closed the emulator while the monkey was running, forgetting that the monkey process is actually running on the emulator, not the host. I was reminded of this the next time I launched that particular image. :laughing:
I left a few comments in the code. I had a difficult time reproducing the specific IOB crash on Android 9.0, but the overall concept makes sense to me, as well as the workaround. I only wonder why the current behavior doesn't wreak more havoc in other scenarios. I.e. what is it about this app that surfaces this underlying Android behavior?
As you mentioned earlier, the seed does not guarantee a deterministic outcome. I had some crashes both with and without this PR applied on API 27, but I could not reproduce any of mentioned crashes with or without the PR on API 29 ⃰.
E.g. this was Pixel 2 emulator with API 27, seed 24004:
// CRASH: org.wordpress.aztec (pid 2020)
// Short Msg: java.lang.ArrayIndexOutOfBoundsException
// Long Msg: java.lang.ArrayIndexOutOfBoundsException: length=19; index=-1
// Build Label: google/sdk_gphone_x86/generic_x86:8.1.0/OSM1.180201.007/4586646:user/release-keys
// Build Changelist: 4586646
// Build Time: 1517878968000
// java.lang.ArrayIndexOutOfBoundsException: length=19; index=-1
// at android.text.DynamicLayout.getBlockIndex(DynamicLayout.java:648)
// at android.widget.Editor.drawHardwareAccelerated(Editor.java:1703)
// at android.widget.Editor.onDraw(Editor.java:1672)
// at android.widget.TextView.onDraw(TextView.java:6882)
// at android.view.View.draw(View.java:19192)
// at android.view.View.updateDisplayListIfDirty(View.java:18142)
// at android.view.View.draw(View.java:18920)
// at android.view.ViewGroup.drawChild(ViewGroup.java:4236)
// at android.view.ViewGroup.dispatchDraw(ViewGroup.java:4022)
// at android.view.View.updateDisplayListIfDirty(View.java:18133)
// at android.view.View.draw(View.java:18920)
// at android.view.ViewGroup.drawChild(ViewGroup.java:4236)
// at android.view.ViewGroup.dispatchDraw(ViewGroup.java:4022)
// at android.view.View.draw(View.java:19195)
// at android.widget.ScrollView.draw(ScrollView.java:1739)
// at android.view.View.updateDisplayListIfDirty(View.java:18142)
// at android.view.View.draw(View.java:18920)
// at android.view.ViewGroup.drawChild(ViewGroup.java:4236)
// at android.view.ViewGroup.dispatchDraw(ViewGroup.java:4022)
// at android.view.View.updateDisplayListIfDirty(View.java:18133)
// at android.view.View.draw(View.java:18920)
// at android.view.ViewGroup.drawChild(ViewGroup.java:4236)
// at android.view.ViewGroup.dispatchDraw(ViewGroup.java:4022)
// at android.view.View.updateDisplayListIfDirty(View.java:18133)
// at android.view.View.draw(View.java:18920)
// at android.view.ViewGroup.drawChild(ViewGroup.java:4236)
// at android.view.ViewGroup.dispatchDraw(ViewGroup.java:4022)
// at android.view.View.updateDisplayListIfDirty(View.java:18133)
// at android.view.View.draw(View.java:18920)
// at android.view.ViewGroup.drawChild(ViewGroup.java:4236)
// at android.view.ViewGroup.dispatchDraw(ViewGroup.java:4022)
// at android.view.View.updateDisplayListIfDirty(View.java:18133)
// at android.view.View.draw(View.java:18920)
// at android.view.ViewGroup.drawChild(ViewGroup.java:4236)
// at android.view.ViewGroup.dispatchDraw(ViewGroup.java:4022)
// at android.view.View.updateDisplayListIfDirty(View.java:18133)
// at android.view.View.draw(View.java:18920)
// at android.view.ViewGroup.drawChild(ViewGroup.java:4236)
// at android.view.ViewGroup.dispatchDraw(ViewGroup.java:4022)
// at android.view.View.draw(View.java:19195)
// at com.android.internal.policy.DecorView.draw(DecorView.java:788)
// at android.view.View.updateDisplayListIfDirty(View.java:18142)
// at android.view.ThreadedRenderer.updateViewTreeDisplayList(ThreadedRenderer.java:669)
// at android.view.ThreadedRenderer.updateRootDisplayList(ThreadedRenderer.java:675)
// at android.view.ThreadedRenderer.draw(ThreadedRenderer.java:783)
// at android.view.ViewRootImpl.draw(ViewRootImpl.java:2992)
// at android.view.ViewRootImpl.performDraw(ViewRootImpl.java:2806)
// at android.view.ViewRootImpl.performTraversals(ViewRootImpl.java:2359)
// at android.view.ViewRootImpl.doTraversal(ViewRootImpl.java:1392)
// at android.view.ViewRootImpl$TraversalRunnable.run(ViewRootImpl.java:6752)
// at android.view.Choreographer$CallbackRecord.run(Choreographer.java:911)
// at android.view.Choreographer.doCallbacks(Choreographer.java:723)
// at android.view.Choreographer.doFrame(Choreographer.java:658)
// at android.view.Choreographer$FrameDisplayEventReceiver.run(Choreographer.java:897)
// at android.os.Handler.handleCallback(Handler.java:790)
// at android.os.Handler.dispatchMessage(Handler.java:99)
// at android.os.Looper.loop(Looper.java:164)
// at android.app.ActivityThread.main(ActivityThread.java:6494)
// at java.lang.reflect.Method.invoke(Native Method)
// at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:438)
// at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:807)
//
And this was the same run, with the PR applied:
// CRASH: org.wordpress.aztec (pid 4908)
// Short Msg: java.lang.IndexOutOfBoundsException
// Long Msg: java.lang.IndexOutOfBoundsException: setSpan (569 ... 597) ends beyond length 569
// Build Label: google/sdk_gphone_x86/generic_x86:8.1.0/OSM1.180201.007/4586646:user/release-keys
// Build Changelist: 4586646
// Build Time: 1517878968000
// java.lang.IndexOutOfBoundsException: setSpan (569 ... 597) ends beyond length 569
// at android.text.SpannableStringBuilder.checkRange(SpannableStringBuilder.java:1309)
// at android.text.SpannableStringBuilder.setSpan(SpannableStringBuilder.java:680)
// at android.text.SpannableStringBuilder.setSpan(SpannableStringBuilder.java:672)
// at android.widget.Editor$SuggestionsPopupWindow.updateSuggestions(Editor.java:3719)
// at android.widget.Editor$SuggestionsPopupWindow.show(Editor.java:3591)
// at android.widget.Editor.replace(Editor.java:386)
// at android.widget.Editor$3.run(Editor.java:2231)
// at android.os.Handler.handleCallback(Handler.java:790)
// at android.os.Handler.dispatchMessage(Handler.java:99)
// at android.os.Looper.loop(Looper.java:164)
// at android.app.ActivityThread.main(ActivityThread.java:6494)
// at java.lang.reflect.Method.invoke(Native Method)
// at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:438)
// at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:807)
⃰I did get an IOB crash at one point, on API 29, without the PR, but realized later the trace was on stderr
, so did not get saved to the log. Since then, I have not been able to observe the crash on API 29, neither with nor without the PR applied.
Since this touches some "deeper" parts, it may be helpful to have an extra set of eyes on this one.
Nice work on logging those details! Very clever way of "peeking under the hood". I found it useful to redirect stderr as well: adb -s emulator-5554 shell monkey -p org.wordpress.aztec -v 10000 | monkey-log.txt 2>&1
Thank you! Nice one there too
Also, at one point, I had closed the emulator while the monkey was running, forgetting that the monkey process is actually running on the emulator, not the host. I was reminded of this the next time I launched that particular image. 😆
Lol - forgot to add this bit of info, we talked over DM but for reference of others reading this here, you can kill the monkey process running on the device by issuing this adb command:
adb shell ps | awk '/com.android.commands.monkey/ { system("adb shell kill " $2) }'
.
I left a few comments in the code. I had a difficult time reproducing the specific IOB crash on Android 9.0, but the overall concept makes sense to me, as well as the workaround.
I actually couldn't see this specific crash on Android 9 either (seen other weirder ones though).
I only wonder why the current behavior doesn't wreak more havoc in other scenarios. I.e. what is it about this app that surfaces this underlying Android behavior?
And that's a good question, my take on this is I think something particular that other apps don't have here is pretty much the Android focus is always on the same EditText
(AztecText) here, which makes things quite different from other typical screens where you have some buttons maybe various EditText and you won't spend much time on these screens either, as opposed to the situation of writing a blog post where the screen is the same, you may switch to different kind of inputs (voice, text, change keyboard language), things you probably wouldn't do on the same EditText normally on a typical form screen.
And this was the same run, with the PR applied: // java.lang.IndexOutOfBoundsException: setSpan (569 ... 597) ends beyond length 569 // at android.text.SpannableStringBuilder.checkRange(SpannableStringBuilder.java:1309) // at android.text.SpannableStringBuilder.setSpan(SpannableStringBuilder.java:680) // at android.text.SpannableStringBuilder.setSpan(SpannableStringBuilder.java:672) // at android.widget.Editor$SuggestionsPopupWindow.updateSuggestions(Editor.java:3719) // at android.widget.Editor$SuggestionsPopupWindow.show(Editor.java:3591)
I got that one too without the patch - in fact I forgot to mention earlier I also did tests with suggestions turned off (by adding noSuggestions
value to the inputType attribute in the manifest) to avoid getting stopped there as well. I tracked that one here as well: https://github.com/wordpress-mobile/AztecEditor-Android/issues/833.
I was tempted to try and do some more research on that one as it looked a bit easier to grasp but, since it doesn't seem to be affecting many users (haven't heard of that being reported by actual users before) I only tracked it as an issue and continue onto chasing the other IOB problem.
Thank you so much for taking the time to try and review this one @mkevins !! ❤️
Since this touches some "deeper" parts, it may be helpful to have an extra set of eyes on this one.
💯 , hoping to get maybe @daniloercoli and / or @hypest as you find time to play with monkeys a bit? 🙏 🐒
I got that one too without the patch - in fact I forgot to mention earlier I also did tests with suggestions turned off (by adding noSuggestions value to the inputType attribute in the manifest) to avoid getting stopped there as well. I tracked that one here as well: #833.
Ah, thanks for the extra context. That makes sense, and I guess it's a good sign that this monkey can find all these obscure bugs.. even if they aren't currently being reported. I suppose there is always a chance a code change later on could put more traffic in these code paths, and it's good to find these kind of things beforehand.
I wasn't able to reproduce the crash on develop (hash: 167ffe6a).
Still, the fix in this PR is interesting since it seems to fix the issue on your tests @mzorz . By the way @mkevins , I'm, not sure from you comments so far: does this PR fix the crash for you on any API levels?
There's only one thing that bothers me and not sure how to decide upon: we're breaking the constructor's assumption that it will create a new InputConnection. That is, the framework code that calls it is most probably assuming that a new InputConnection will be created and so, not sure how disruptive it can be to break this assumption.
The IndexOutOfBounds
crash seems to be a whole class of issues, with different root causes and different manifestations (some crashes are caused by the DynamicLayout class, some have the indexes only be off by one from the text's length and others have the indexes be far away for the text's length. Which particular subcase(s) do you think this fix addresses @mzorz ?
By the way, the original issue mentioned here (https://github.com/wordpress-mobile/WordPress-Android/issues/9904) seems to be about DynamicLayout reflow, and there's at least this PR that mentions that DynamicLayout had a bug before Android P (taken from this comment).
I guess what I'm trying to say is, I'd prefer if we could "focus" this fix, maybe only apply it in Android versions lower than Android P, and see if the reports for the particular subcases decrease.
Thanks for all those good questions @hypest!
Which particular subcase(s) do you think this fix addresses?
This is the full stacktrace for the IOB crash mentioned in #9904, which this PR says in its description is aiming at fixing:
java.lang.IndexOutOfBoundsException: 105, -99
at android.text.PackedIntVector.deleteAt(PackedIntVector.java:222)
at android.text.DynamicLayout.reflow(DynamicLayout.java:317)
at android.text.DynamicLayout.-wrap0
at android.text.DynamicLayout$ChangeWatcher.reflow(DynamicLayout.java:750)
at android.text.DynamicLayout$ChangeWatcher.onSpanChanged(DynamicLayout.java:779)
at android.text.SpannableStringBuilder.sendSpanChanged(SpannableStringBuilder.java:1296)
at android.text.SpannableStringBuilder.sendToSpanWatchers(SpannableStringBuilder.java:651)
at android.text.SpannableStringBuilder.replace(SpannableStringBuilder.java:581)
at android.text.SpannableStringBuilder.replace(SpannableStringBuilder.java:504)
at android.text.SpannableStringBuilder.replace(SpannableStringBuilder.java:502)
at android.view.inputmethod.BaseInputConnection.replaceText(BaseInputConnection.java:848)
at android.view.inputmethod.BaseInputConnection.commitText(BaseInputConnection.java:197)
at com.android.internal.widget.EditableInputConnection.commitText(EditableInputConnection.java:183)
at com.android.internal.view.IInputConnectionWrapper.executeMessage(IInputConnectionWrapper.java:344)
at com.android.internal.view.IInputConnectionWrapper$MyHandler.handleMessage(IInputConnectionWrapper.java:85)
at android.os.Handler.dispatchMessage(Handler.java:108)
at android.os.Looper.loop(Looper.java:166)
at android.app.ActivityThread.main(ActivityThread.java:7529)
at java.lang.reflect.Method.invoke(Method.java)
at com.android.internal.os.Zygote$MethodAndArgsCaller.run(Zygote.java:245)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:921)
This crash I managed to reproduce in my tests with monkeys (as well as the other mentioned crashes - my efforts in investigating this crash were mostly about trying to avoid hitting the other crashes first, and managed to do so by following the steps described above). So it's true there is a set of causes that make an IOB Exception occur - the investigation done for this PR only tried to isolate the one described in #9904 and the code in this PR only tries to workaround the situation in which those conditions are met.
With the help of much logging using this wrapper for InputConnection (I may have forgotten to link it here), I was able to determine the relation between a beginBatchEdit
starting and endBatchEdit
finishing, and seeing the change in IME connection in between was something that appeared as a pattern in my tests right a few lines before crashing.
So, that crash pasted here in this comment 2 paragraphs above is the particular crash this PR aims at fixing, but the true information about context of when it's about to happen is given by the logs one can see once implementing the wrapper. Note, the wrapper is not part of the solution, it was just a tool to be able to gain more insight on why this was happening.
There's only one thing that bothers me and not sure how to decide upon: we're breaking the constructor's assumption that it will create a new InputConnection.
The docs say here:
An editor should start by implementing View.onCreateInputConnection(EditorInfo) to return its own input connection.
It doesn't say anything about an assumption that it should create a new one each time it gets called. I'll take documentation may not be updated or not directly reflecting reality, but also we could argue reality is what we can experiment and observe for sure and not what documentation or method names suggest 😅 (this is supposed to be nothing else but a funny comment BTW)
So, the InputConnection we're returning is that of AztecText (actually, its super class EditText) first, and then keeping its reference. When the IME changes, we can tell them to use the already created InputConnection.
By the way, the original issue mentioned here (wordpress-mobile/WordPress-Android#9904) seems to be about DynamicLayout reflow, and there's at least this PR that mentions that DynamicLayout had a bug before Android P (taken from this comment).
True, that's another different issue. This PR does not intend to fix that one. In fact, it was sometimes hard to be able to get the crash I was investigating without getting that other crash first, as described above.
I guess what I'm trying to say is, I'd prefer if we could "focus" this fix, maybe only apply it in Android versions lower than Android P, and see if the reports for the particular subcases decrease.
💯 I think I haven't seen it on Android 9, and @mkevins apparently has seen it once as per this:
⃰I did get an IOB crash at one point, on API 29, without the PR, but realized later the trace was on stderr, so did not get saved to the log. Since then, I have not been able to observe the crash on API 29, neither with nor without the PR applied.
It would look like API 29 is more stable in this regard so, if we'd like to focus this better, I'd say we can make the code run conditionally only for Android 8.0 and 8.1.
What do you think? @mkevins @hypest ?
Hi @mzorz :wave:
My apologies for the long delay in getting back on this one. Do we have any numbers to guide us on this? I certainly like the idea of limiting the fix to affected versions only, since the workaround has the potential for side effects undocumented in the Android source code and docs. Also, do you know if we have a way to try the fix in an extremely "conservative" way (i.e. very limited rollout), to observe effectiveness in a low-risk way?
Do we have any numbers to guide us on this?
At the details link at https://sentry.io/share/issue/25c5482018784560b498f2fddb147974/ you can see that in the last 90 days there were ~2000 crash events logged by Sentry and the percentages of OSs affected were:
Do you know if we have a way to try the fix in an extremely "conservative" way (i.e. very limited rollout), to observe effectiveness in a low-risk way?
I can check with our release managers about this, however, I think rolling out a change that's ready to revert in a hotfix would be preferable in this case because it would be far easier to measure the impact using a tool such as Sentry.
It is possible to do a staged rollout, FWIW – we do it for our Android apps – you have to update the % by hand manually, but it is doable.
It would look like API 29 is more stable in this regard so, if we'd like to focus this better, I'd say we can make the code run conditionally only for Android 8.0 and 8.1.
and
in the last 90 days there were ~2000 crash events logged by Sentry and the percentages of OSs affected were:
62% Android 7.0 17% Android 6.0 9% Android 7.1.1 6% Android 7.1.2 3% Android 7.1 <1% Android 6.0.1
Judging from these stats Sheri shared in the comment above, I'd suggest we run the code conditionally for Android v7.x only, and see if the v7.x stats decrease. We can roll this out in the wild for two weeks (one sprint or app release cycle) and see. @designsimply , can you also share here the stats for just the last 15days perhaps, so we can compare at the end of the app release cycle?
Thank you all for keeping up with this from where we left! 🙇
Judging from these stats Sheri shared in the comment above, I'd suggest we run the code conditionally for Android v7.x only, and see if the v7.x stats decrease
Agreed on that approach @hypest 👍 just a note though - Sentry reports the same problem as a different issue in https://sentry.io/share/issue/81b90713f6fe4893b463255d6895b21b/ (first reported here).
In this other issue, the affected platform is 100% Android 8.0:
100.00% | Android 8.0.0
Affecting 314 users and ~1.8k events in the past 14 days (that's more than the 62% of 340 users / 2000 events happening on Android 7.0). I'd lean towards trying on Android 8 (also since it's where I based most of my monkey tests setup on).
I'll make the version check and update here 👍
Conditional Android version check added in 2169affc15e7804e17259ea119abe0409db6100f - tested on a few emulators to check it effectively goes through / bypasses the fix.
Ready for another 👀pass @mkevins @hypest 🙇
I'll merge this PR and also include it in the currently open gutenberg-mobile release.
Fix
This PR aims at fixing or alleviating the crash being observed in https://github.com/wordpress-mobile/WordPress-Android/issues/9904
Preamble
Long story short as per what I could observe: each time focus is changed, or when the input method is changed (for example, you change the keyboard or tap the microphone on the keyboard), this will create a new input method and the actively focused EditText will be called on its
onCreateInputConnection
. Here, the base implementation (same one as AztecText is using, as it's not overriding that method) will create a new instance and return that one.By using an
InputConnectionWrapper
and plenty of logs I was able to identify the following situation:Then we have this: a
beginBatchEdit
is followed by a constructor call. This means, a newInputConnection
is going to be established, but also a batch edit has already started on the InputConnection that is being detached. This was followed by many entries like this one:At some point this can be followed by an IndexOutOfBounds crash, for example looking in the logs here:
These logs were added by using an
InputConnectionWrapper
that didn't affect behavior (just added the logs).This piece here shows an insertion of a $ character at position 718. When an insertion / addition happens, the new length should be added by one and the new cursor position should be also augmented, as it will stay to the right of the appended/inserted character. The problem we can see here in that log snippet is newCursorPosition should have moved the selection by one (1), but still previousSelectionStart and previousSelectionEnd is 718, and currentSelectionStart after moving should be 718+1=719, but instead it's still 718. See how things can go wrong there? On the other runs where no new constructor call is seen in between, things are added up correctly.
What this PR does
This PR just keeps a reference for the latest
InputConnection
created, as long as no new attributes for a newInputConnection
to be created are sent (so we're working under the premise that there is no need to create a newInputConnection
if the parameters passed are the same as per the current one).Test
have been trying to test this using monkey https://developer.android.com/studio/test/monkey, you can test it the same way issuing a command like this one from Terminal:
adb -s emulator-5554 shell monkey -p org.wordpress.aztec -v 10000
To make conditions more convenient for this crash to happen:
android:inputType="textNoSuggestions"
add this to the xml https://github.com/wordpress-mobile/AztecEditor-Android/blob/develop/app/src/main/res/layout-v17/activity_main.xml#L26echo -n '' | pbcopy
on a terminal window), then run:adb -s emulator-5554 shell monkey -p org.wordpress.aztec -v 10000
Alternatively, you can also try pinning the Aztec demo screen so no unneeded app switches happen, and in this case do not disable sourceView, as you'll want to have focus to alternate so the new InputConnection is created when switching between EditText:
If you need to kill the monkey process running in the device/emulator, issue the following command:
adb shell ps | awk '/com.android.commands.monkey/ { system("adb shell kill " $2) }'
Make sure strings will be translated:
strings.xml
as a part of the integration PR.