Closed loremattei closed 3 years ago
It looks like this error has been added in 15.7 and we are getting some reports about it. I'm adding this to Triage as I guess it's something we want to investigate since we added the error reporting for it.
Thanks @loremattei. @malinajirka could you investigate this one?
This is not a crash and it's an expected error. We added it to get a better idea how many users are encountering it. It doesn't have any user facing impact - the post gets saved correctly when the user exits the editor. This issue can be reproduced by making a change in the editor and exiting the editor within the next 500ms.
Thanks @malinajirka. Does that mean this issue can be closed?
Hello π If this is not a crash, can we log it in a different way? (Not to Sentry or as info instead of an error?) Also, do we know why we are getting so many events? Are more users quitting the editor real quick or is it more about Gutenberg failing to update the app?
cc: @mchowning
Sentry has two types of logs: Error and Fatal (crash). This issue has correct type (error). Ideally this log would disappear if everything in Gutenberg and WPAndroid would work correctly. Also there is a huge spike in the number of occurrences, so there is a new bug in the app.
I'm not sure if we changed the app since September 23rd or I just wrongly evaluated this, but This issue can be reproduced by making a change in the editor and exiting the editor within the next 500ms.
this is not true. The message would be different in this scenario.
I think this issue might be related to https://github.com/wordpress-mobile/WordPress-Android/issues/13169 and https://github.com/wordpress-mobile/WordPress-Android/issues/13361#issuecomment-741773403. Especially the second issue should have a very high priority in my opinion. Because if the gutenberg fragment doesn't return the most up to date content when the EditPostActivity asks for it, the EditPostActivity can't save it -> users might lose their changes.
Here is what I see when I check this issue in Sentry:
Events in the last 90d: 107,000 errors (not fatals) Users affected in the last 90d: 24,000 WORDPRESS-ANDROID-QA1: https://sentry.io/share/issue/648ba279d5064d88af6d395f2fa73358/
I think this issue might be related to #13169 and #13361 (comment). Especially the second issue should have a very high priority in my opinion. Because if the gutenberg fragment doesn't return the most up to date content when the EditPostActivity asks for it, the EditPostActivity can't save it -> users might lose their changes.
Just want to comment that #13169 does not seem to cause the SaveOnExitException$NoAutosavePending warning (though it is still a important issue to solve).
As for the second issue #13361, I investigated, and I was able to get the SaveOnExitException$NoAutosavePending exception to happen when testing a large article. I also opened up a second related issue that showed that if you left the article while in this state you could potentially experience content loss, see: https://github.com/wordpress-mobile/gutenberg-mobile/issues/3177
Also noting that another case of SaveOnExitException$NoAutosavePending
was documented in this issue https://github.com/wordpress-mobile/gutenberg-mobile/issues/2701, but we have a fix up for that case here: https://github.com/wordpress-mobile/WordPress-Android/pull/14141, so we should see some but not all of the sentry issues here go down once that fix is merged. In the mean time we're still working on the other case related to https://github.com/wordpress-mobile/WordPress-Android/issues/13361 and identifying if any other paths can lead to autosave not behaving as expected in WPAndroid.
Looks like we found another potential cause of this Exception in sentry: https://github.com/wordpress-mobile/gutenberg-mobile/pull/3205#issuecomment-790015642 Caused by leaving the editor while an upload is ongoing.
We had potential steps for reproducing exiting a page without autosaving, with steps detailed in this comment here (attempting to publish a specific page template: https://github.com/wordpress-mobile/WordPress-Android/pull/14231#issuecomment-797227717. I was unable to reproduce.
A previous issue with publishing empty page templates was fixed here: https://github.com/wordpress-mobile/WordPress-Android/pull/14141
Checked Sentry and this error is still being logged as of release 17.1-rc-1 so definitely still an issue. Leaving this ticket open.
π @designsimply and @cameronvoell !
So, I have been investigating this today as well and, after I got accustomed to everything said within this issue, its related ones and those related to the related ones (inception intensifies), after I got accustomed with the codebase (EditPostActivity
, etc), after I got to be playing around and trying to replicate the NoAutosavePending
exception (AutosavePending
was easier I must admit), I then managed to replicate a case where it seems to me as a false positive.
The video below describes a scenario where the NoAutosavePending
exception is triggered, on a long post, but the content is actually saved anyway and everything seems to be working okay. The pending local changes end up being available, the post can be also published as expected and the update becomes available online afterwards.
https://user-images.githubusercontent.com/9729923/119673994-81d1dc80-be44-11eb-9767-bbf4fef39081.mp4
Would that be considered as false positive to you as well? These kind of cases might appear in production, thus contributing to the high number of NoAutosavePending
hitting Sentry, right?
Also, it seems that some of the recent changes (like #2701 via #14141) aim to decrease the amount of NoAutosavePending
exceptions seen, and from what I am seeing they did their job, Sentry's numbers are almost halved recently.
With all that going on, and all the other related issues that surround this one, I am not sure how and whether to progress further in investigating and solving this for good. However, what I might suggest is to:
Saving
dialog.I am probably missing lots of context here and I am really unfamiliar with how the editor works, or the EditPostActivity
codebase for that matter. So, I am most probably making assumptions and not helping much. IMHO, I think this is a problem that can't be easily fixed in one go, thus I am not sure how else I should be contributing here. @designsimply and @cameronvoell , would you mind letting me know if I am missing anything before me progressing any further with this?
Thanks for looking into this @ParaskP7 - great job getting oriented in EPA π ! I'd like to provide some history context, which might clarify some tricky parts.
EditPostActivity is a class which was created 10 years ago and it carries a lot of baggage. We tried to refactor it a bit and we phased our refactoring in 4 stages. Unfortunately, only stage 1 was prioritized, so we never got to the remaining stages.
One of the main changes we did during stage 1 was that we refactored how (auto)saving content works. The main idea is, that both gutenberg and Aztec invoke a callback when the content changes -> the EditPostActivity(EPA) retrieves the updated content and saves it into the database. This approach worked great for several long months.
Unfortunately, it turned out that some changes in gutenberg introduced an issue and the content was not being saved -> gutenberg stopped invoking the callback in some scenarios and therefore the app didn't save the changes into the database.
The solution was that the we started manually saving the content when the user leaves the screen (even when the callback was not invoked). Since this breaks the whole idea of imho a nice "inform me of changes and I'll (read as EPA will) save the content into the db" approach, we added the "NoAutosavePending" log (I believe it's not a crash, just a log).
This log simply keeps informing us of the fact that gutenberg still contains the bug and is not informing EPA about changes (is not invoking the callback) -> the fact that the content gets saved is because we manually save it before the user leaves the screen -> if you try to reproduce this scenario, but instead of leaving the app you kill the app, the changes won't get saved, right? -> so it's imo not a false positive.
Hope this helps clarify some tricky parts. Let me know if you'd want me to elaborate more on some parts.
Thanks for looking into this @ParaskP7 - great job getting oriented in EPA π ! I'd like to provide some history context, which might clarify some tricky parts.
π
EditPostActivity is a class which was created 10 years ago and it carries a lot of baggage. We tried to refactor it a bit and we phased our refactoring in 4 stages. Unfortunately, only stage 1 was prioritized, so we never got to the remaining stages.
One of the main changes we did during stage 1 was that we refactored how (auto)saving content works. The main idea is, that both gutenberg and Aztec invoke a callback when the content changes -> the EditPostActivity(EPA) retrieves the updated content and saves it into the database. This approach worked great for several long months.
Unfortunately, it turned out that some changes in gutenberg introduced an issue and the content was not being saved -> gutenberg stopped invoking the callback in some scenarios and therefore the app didn't save the changes into the database.
The solution was that the we started manually saving the content when the user leaves the screen (even when the callback was not invoked). Since this breaks the whole idea of imho a nice "inform me of changes and I'll (read as EPA will) save the content into the db" approach, we added the "NoAutosavePending" log.
Thanks so much for this historical background, it is really valuable! π
(I believe it's not a crash, just a log)
The app crashes on debug mode, otherwise, and on release, it just logs the crash, without crashing. But, since we are investigating this crash logging, I am perceiving it as a crash or at least a malfunctioning. I know I might incorrectly use the crash
wording, but when I do, please bare in mind that I am referring to crash logging
.
This log simply keeps informing us of the fact that gutenberg still contains the bug and is not informing EPA about changes (is not invoking the callback) -> the fact that the content gets saved is because we manually save it before the user leaves the screen...
Thanks for explaining that to me, I am now more aware about this feature and its expected behavior! π
... -> if you try to reproduce this scenario, but instead of leaving the app you kill the app, the changes won't get saved, right? -> so it's imo not a false positive.
I agree. Actually, and since there are two mechanisms for saving (I'll call them Gutenberg-on-Update
and Manual-on-Exit
), me thinking that it was just one, made me think that there exist a false positive, but this is indeed not the case.
Hope this helps clarify some tricky parts. Let me know if you'd want me to elaborate more on some parts.
Once more, thanks so much for all the information you shared and for taking the time to do so, I really appreciate that. I am now more able to understand the basics or more forward. π
Having said that, I am not sure if I am able to proceed with trying to identify what's wrong with Gutenberg-on-Update
, that is terms of picking up where others, with more experience, on Gutenberg and ETA, left things and then to move those forward. I feel I would not do much progress, at least with the time given, I will be just catching up on the specifics, which might not be the most effective time of my Groundskeeping
. @designsimply what are your thoughts on that?
PS: I am actually a bit skeptical on how effective some team members might end up being while working on such problems during Groundskeeping
, especially when they don't have prior experience with it. Also, even if experience is present, it seems to me that such problems need a dedicated team or project to get them totally resolved. Going forward, we might want to identify and categorize those prior to assign them during Groundskeeping
. Wdyt? Please consider that as an initial feedback. It might totally be that I am not full understanding the big picture or the end goal of Groundskeeping
, doing it for the first time.
I agree that for this particular issue a team/project might be needed. I think we have the following options:
Recommended solution: Fix gutenberg and make sure it always invokes the "contentChanged" callback when the content changes. That's imho the only solution which will "guarantee" that we don't lose user content (I realize we might still lose the changes if the app crashes/is killed before the changes get saved into the db).
Current solution: Don't make any changes and use the log as a reminder that we still have a bug in gutenberg and user might lose changes in some edge cases (eg. if they leave the app while editing a post and the system kills the app)
Fallback solution: Decide that gutenberg will never be able to inform EPA about content changes, remove the log and live with the fact that users might lose their changes in some scenarios. (Update: the log might still be actually useful for debugging, so I wouldn't remove it, I'd just stop crashing in debug builds)
Even if we "fix Gutenberg", we can't be confident that none of the hundreds of web developers working on Gutenberg won't accidentally break sending all the contentChanged events back to mobile at some point in the future. That's why I think this event would still be important even if Gutenberg worked perfectly right now. Sudden changes in the frequency of the event is a strong signal that something has broken (or gotten worse) with the autosave mechanism, so it's a really valuable signal to have.
π @malinajirka and @mchowning !
Thanks a lot for your input. π
It is true that picking a solution, even in the idea world, where we would "fix Gutenberg" might not be enough. @mchowning I am thinking, and please correct me if I am way off, having zero knowledge on Gutenberg and how the autosave mechanism works, what if some kind of a linter mechanism was put in place to warn/error those hundred of web developers working on Gutenberg when they accidentally break sending all the contentChanged
events back to mobile, while they are developing new features or updating existing. Is that even possible somehow?
In the meanwhile, I too believe that this event is still important and I wouldn't remove it as well.
Suggestion (π‘): Just like the other OOM issue (#9997) that I was working yesterday, I believe that these kind of issues deserve a different kind of treatment to the rest of the issues that we are having, another space, in addition to the Groundskeeping
space, maybe a new Monitoring
space, where everyone will be aware that those are an ongoing issues, not likely to be solved, but instead monitored closely and improved upon in case the numbers go higher (or even lower) than expected. Then, dedicated SMEs (Subject Matter Experts) will be assigned to them, on rotation even, to keep an eye on all that. Also, it might be good if as much documentation as possible is included so that the effort to catch-up becomes minimal and future converstations like the one I opened is avoided, especially since all that was already well known.
Wdyt?
what if some kind of a linter mechanism was put in place to warn/error those hundred of web developers working on Gutenberg when they accidentally break sending all the contentChanged events back to mobile, while they are developing new features or updating existing. Is that even possible somehow?
We don't have any automated way to detect if contentChanged
events get broken currently. Increasing the automated checks that CI does for us on Gutenberg to catch breakages is a very high priority issue for us now, but automating something that catches this particular type of error will be quite difficult since it both involves many different parts of Gutenberg and also is not something that can be detected at the UI level. With all that said, trying to find an automated way to catch these breakages like you suggest is a very worthwhile goal, I'm just not optimistic that we'll have that anytime in the near future. π
π @mchowning !
We don't have any automated way to detect if contentChanged events get broken currently. Increasing the automated checks that CI does for us on Gutenberg to catch breakages is a very high priority issue for us now, but automating something that catches this particular type of error will be quite difficult since it both involves many different parts of Gutenberg and also is not something that can be detected at the UI level. With all that said, trying to find an automated way to catch these breakages like you suggest is a very worthwhile goal, I'm just not optimistic that we'll have that anytime in the near future. π
Once more, thanks so much for helping bring more context to this and our limitations, time-wise or otherwise. I guess then what we can at least do is to setup a well-defined monitoring process around this issue and hope that in the future, if we get some additional time we can work on the next level of automation for preventing this kind of issues appearing in the first place.
π
I think what we might want to consider is adding a link to this convo to the sentry event and using "Ignore this issue until it occurs againβ¦"
to get notified about any spikes. One downside is that if we fix a bug a spike of a newly introduced bug might not be visible. But I guess that's a risk that is ok. Wdyt?
π @malinajirka !
I think what we might want to consider is adding a link to this convo to the sentry event and using "Ignore this issue until it occurs again... to get notified about any spikes."
Indeed, that's a nice idea and good starting point!
One downside is that if we fix a bug a spike of a newly introduced bug might not be visible. But I guess that's a risk that is ok.
True, but if there is someone on rotation, keeping a close eye on that, then that person might be adjusting the Number of times
regularly and if need be. As long as there will be someone to look at it I am comfortable with this first step.
Wdyt?
I am personally okay with your suggestion and the potential risk it holds. We can at least try it for some period and see if it provides value, else we will just drop it. In the meanwhile we can think of additional actions we can take to monitor this more accurately.
@malinajirka @ParaskP7 Did either of you set that "ignore" message in Sentry and link to this convo? Just wanted to make sure we had done that before I remove this from groundskeeping for the time being :)
π @thehenrybyrd !
@malinajirka @ParaskP7 Did either of you set that "ignore" message in Sentry and link to this convo?
I haven't done anything but adding my comments and ideas on this issue.
Just wanted to make sure we had done that before I remove this from groundskeeping for the time being :)
I am actually not aware of what we had to do, like setting the ignore
message in Sentry, link to this convo, etc. Especially since all those were pure suggestions waiting for confirmation. So, if we are good with what @malinajirka suggested, then this makes sense and we can progress doing so. π
There was a related discussion on Slack (p1622214161104300-slack-C02QANACA) where we (@mkevins @mchowning) discussed if we could put the autosaving mechanism on change of "blocks" or a similar field. @mkevins mentioned "Here are some high-order functions that may be relevant and worth investigating: withPersistentBlockChange and withBlockCache".
Considering this issue may lead to a content-loss I'd hold the last resort "solution" of ignoring it in Sentry. It'd be great if someone from Gutenberg could look into it and make the final decision whether the fix requires a too high effort and ignoring it is a better call.
A quick note for additional information, please see the write-up @cameronvoell did at p9ugOq-1Jh-p2
Closing this issue as the SaveOnExitException
logging is not required anymore and has been removed as part of the Saving Posts Task Force
(Internal reference: pb3aDo-18j-p2) and below PRs:
WordPress-Android
: https://github.com/wordpress-mobile/WordPress-Android/pull/15509
Gutenberg
: https://github.com/WordPress/gutenberg/pull/36072
Cc @thehenrybyrd
Sentry Url: https://sentry.io/share/issue/648ba279d5064d88af6d395f2fa73358/ User Count: 45 Count: 177 First Release: org.wordpress.android@15.7-rc-1+916 First Seen: 2020-09-10T06:54:21Z Last Seen: 2020-09-22T05:53:33Z 24 Hours: 19 30 Days: 177