Signal 11 crash still present after SIMCTL_CHILD_NSZombieEnabled=YES + latest Firebase version

shamilovtim commented 2 years ago

Description

A clear and concise description of what the bug is.

[x] I have tested this issue on the latest Detox release and it still reproduces

Reproduction

We still see Signal 11 crashes in our project even after adding both SIMCTL_CHILD_NSZombieEnabled=YES detox test -c ios.sim.release as well as updating firebase + firebase perf to the latest versions and ensuring that GoogleUtilities are at 7.7+.

The error is commonly reproduced on CircleCI and is incredibly difficult / practically impossible to reproduce locally.

In order to help debug I've provided a list of all of the dependencies in our project. We use Expo 43 (no expo updates), React Native 0.65.2 and Codepush (probably unrelated, the error happens in the middle of a test, not at the beginning). I've found that the Stripe iOS SDK (via tipsi-stripe) is a source of method swizzling. Also, while Flipper is a source of method swizzling, we are building a release project so Flipper should be excluded.

Let me know if there's anything else I can provide to help debug and resolve this issue.

The stacktrace is:

DetoxRuntimeError: The pending request #99 ("invoke") has been rejected due to the following error:

The app has crashed, see the details below:

Signal 11 was raised
(
    0   Detox                               0x00000001127a56a5 +[NSThread(DetoxUtils) dtx_demangledCallStackSymbols] + 37
    1   Detox                               0x00000001127a8230 __DTXHandleCrash + 464
    2   Detox                               0x00000001127a8971 __DTXHandleSignal + 59
    3   libsystem_platform.dylib            0x0000000118ef1d7d _sigtramp + 29
    4   ???                                 0x00006000162b0180 0x0 + 105553488183680
    5   DetoxSync                           0x0000000154151b37 -[_DTXTimerTrampoline fire:] + 188
    6   DetoxSync                           0x0000000154139b70 _DTXCFTimerTrampoline + 74
    7   CoreFoundation                      0x00000001156416b6 __CFRUNLOOP_IS_CALLING_OUT_TO_A_TIMER_CALLBACK_FUNCTION__ + 20
    8   CoreFoundation                      0x00000001156411b2 __CFRunLoopDoTimer + 923
    9   CoreFoundation                      0x0000000115640771 __CFRunLoopDoTimers + 265
    10  CoreFoundation                      0x000000011563adb0 __CFRunLoopRun + 2010
    11  CoreFoundation                      0x000000011563a0f3 CFRunLoopRunSpecific + 567
    12  Shipt                               0x000000010ff87c74 +[RCTCxxBridge runRunLoop] + 281
    13  DetoxSync                           0x000000015414173a swz_runRunLoopThread + 291
    14  Foundation                          0x0000000113539550 __NSThread__start__ + 1025
    15  libsystem_pthread.dylib             0x0000000118e8a8fc _pthread_start + 224
    16  libsystem_pthread.dylib             0x0000000118e86443 thread_start + 15
)
    at _callee4$ (/Users/distiller/nebula/e2e/helpers/helpers.ts:22:42)
    at tryCatch (/Users/distiller/nebula/node_modules/regenerator-runtime/runtime.js:63:40)

List of dependencies present in the project:

    "dependencies": {
        "@babel/preset-typescript": "7.16.7",
        "@bugsnag/react-native": "7.15.1",
        "@formatjs/intl-datetimeformat": "^2.8.3",
        "@formatjs/intl-numberformat": "^5.7.5",
        "@oracle/react-native-pushiomanager": "https://github.com/shamilovtim/pushiomanager-react-native#shipt-next",
        "@react-native-community/async-storage": "^1.6.2",
        "@react-native-community/geolocation": "^2.0.2",
        "@react-native-community/hooks": "2.8.1",
        "@react-native-community/push-notification-ios": "^1.10.1",
        "@react-native-firebase/app": "14.2.4",
        "@react-native-firebase/perf": "14.2.4",
        "@react-navigation/bottom-tabs": "^6.0.9",
        "@react-navigation/native": "^6.0.6",
        "@react-navigation/routers": "^6.1.0",
        "@react-navigation/stack": "^6.0.11",
        "@segment/analytics-react-native": "^1.5.0",
        "@segment/analytics-react-native-firebase": "^1.5.0",
        "accounting-js": "^1.1.1",
        "concurrently": "6.5.1",
        "core-js": "3.20.1",
        "date-fns": "2.7.0",
        "date-fns-tz": "1.0.12",
        "expo": "43.0.4",
        "expo-apple-authentication": "^4.1.0",
        "expo-facebook": "12.0.3",
        "expo-google-sign-in": "10.0.3",
        "expo-linear-gradient": "10.0.3",
        "expo-secure-store": "^11.1.0",
        "expo-tracking-transparency": "2.0.3",
        "graphql": "^15.7.0",
        "graphql-request": "3.7.0",
        "imagemin": "^8.0.1",
        "immer": "9.0.12",
        "inquirer": "^8.2.0",
        "intl": "^1.2.5",
        "invariant": "2.2.4",
        "jsdom": "19.0.0",
        "lodash": "^4.17.21",
        "lottie-ios": "^3.2.3",
        "lottie-react-native": "4.1.3",
        "mailosaur": "^8.1.0",
        "medallia-digital": "https://repository.medallia.com/digital-npm/medallia-digital-rn/medallia-digital-rn-3.8.1.tgz",
        "multi-progress": "4.0.0",
        "nano-memoize": "1.2.1",
        "node-emoji": "^1.11.0",
        "normalizr": "^3.6.1",
        "prop-types": "15.8.0",
        "query-string": "7.0.1",
        "react": "17.0.2",
        "react-native": "0.65.2",
        "react-native-action-sheet": "2.2.0",
        "react-native-android-open-settings": "1.3.0",
        "react-native-branch": "5.3.0",
        "react-native-camera": "4.2.1",
        "react-native-code-push": "^7.0.4",
        "react-native-device-info": "8.4.8",
        "react-native-easing-gradient": "^1.0.0",
        "react-native-fs": "^2.18.0",
        "react-native-geolocation-service": "5.2.0",
        "react-native-gesture-handler": "1.10.3",
        "react-native-get-random-values": "1.7.2",
        "react-native-haptic-feedback": "1.13.0",
        "react-native-html-parser": "^0.1.0",
        "react-native-in-app-review": "^3.2.3",
        "react-native-keyboard-aware-scroll-view": "0.9.5",
        "react-native-mmkv": "1.6.3",
        "react-native-modal": "13.0.0",
        "react-native-progress": "5.0.0",
        "react-native-push-notification": "8.1.1",
        "react-native-reanimated": "1.13.1",
        "react-native-redash": "^15.6.0",
        "react-native-render-html": "4.2.4",
        "react-native-safe-area-context": "3.3.2",
        "react-native-screens": "3.10.1",
        "react-native-share": "7.3.2",
        "react-native-svg": "^12.1.1",
        "react-native-v8": "0.65.1-patch.1",
        "react-native-webview": "11.15.0",
        "react-query": "3.34.6",
        "react-redux": "^7.2.6",
        "reactotron-redux": "^3.1.3",
        "redux": "4.1.2",
        "redux-persist": "^6.0.0",
        "redux-saga": "^1.1.3",
        "reselect": "4.1.5",
        "shelljs": "^0.8.5",
        "sift-react-native": "^0.1.3",
        "tipsi-stripe": "adkenyon/tipsi-stripe#5cab5f303bedc5ffd3fb543b14dc229f5259b0a0",
        "uri-js": "4.4.1",
        "uuid": "^8.3.2",
        "v8-android-jit-nointl": "9.93.0"
    }

Expected behavior

Expect no signal 11

Screenshots / Video

Not necessary right now, will add later if needed.

Environment

Detox: 19.4.2
React Native: 0.65.2
Node: 14.17.3
Device: iOS Emulator
Xcode: 13.2.1
iOS: latest
macOS: 11.6
Test-runner (select one): jest-circus

Logs

Device and verbose Detox logs

[x] I have run my tests using the --loglevel trace argument and am providing the verbose log below:

Detox logs


The stacktrace is:

```
DetoxRuntimeError: The pending request #99 ("invoke") has been rejected due to the following error:

The app has crashed, see the details below:

Signal 11 was raised
(
    0   Detox                               0x00000001127a56a5 +[NSThread(DetoxUtils) dtx_demangledCallStackSymbols] + 37
    1   Detox                               0x00000001127a8230 __DTXHandleCrash + 464
    2   Detox                               0x00000001127a8971 __DTXHandleSignal + 59
    3   libsystem_platform.dylib            0x0000000118ef1d7d _sigtramp + 29
    4   ???                                 0x00006000162b0180 0x0 + 105553488183680
    5   DetoxSync                           0x0000000154151b37 -[_DTXTimerTrampoline fire:] + 188
    6   DetoxSync                           0x0000000154139b70 _DTXCFTimerTrampoline + 74
    7   CoreFoundation                      0x00000001156416b6 __CFRUNLOOP_IS_CALLING_OUT_TO_A_TIMER_CALLBACK_FUNCTION__ + 20
    8   CoreFoundation                      0x00000001156411b2 __CFRunLoopDoTimer + 923
    9   CoreFoundation                      0x0000000115640771 __CFRunLoopDoTimers + 265
    10  CoreFoundation                      0x000000011563adb0 __CFRunLoopRun + 2010
    11  CoreFoundation                      0x000000011563a0f3 CFRunLoopRunSpecific + 567
    12  Shipt                               0x000000010ff87c74 +[RCTCxxBridge runRunLoop] + 281
    13  DetoxSync                           0x000000015414173a swz_runRunLoopThread + 291
    14  Foundation                          0x0000000113539550 __NSThread__start__ + 1025
    15  libsystem_pthread.dylib             0x0000000118e8a8fc _pthread_start + 224
    16  libsystem_pthread.dylib             0x0000000118e86443 thread_start + 15
)
    at _callee4$ (/Users/distiller/nebula/e2e/helpers/helpers.ts:22:42)
    at tryCatch (/Users/distiller/nebula/node_modules/regenerator-runtime/runtime.js:63:40)
```

asafkorem commented 2 years ago

Hi @shamilovtim, thanks for the report!

Try replacing NSZombieEnabled=YES detox test -c ios.sim.release with SIMCTL_CHILD_NSZombieEnabled=YES detox test -c ios.sim.release

Also, note that the flag that we pass as a fix for the Firebase-related Signal 11 deallocation errors is GULGeneratedClassDisposeDisabled=YES, which replaced the NSZombieEnabled flag that was passed from Detox before it, (see https://github.com/wix/Detox/pull/3167, GoogleUtilities/#66).

shamilovtim commented 2 years ago

Hey @asafkorem! Just confirmed that we do pass SIMCTL_CHILD_NSZombieEnabled=YES correctly, I had just typed it wrong on the Github ticket. I updated the original ticket to reflect that. Screen Shot 2022-02-05 at 5 12 05 PM

I know that you're passing the more specific flag inside of Detox itself but I was hoping that the Zombie flag might deal with any other method swizzling related crashes. Unfortunately the crashes were still present so the more general Zombie flag did not work. Passing no flag doesn't work either.

asafkorem commented 2 years ago

@shamilovtim ah okay, I see.. Was it working with the zombie-enable workaround before upgrading Detox to the latest version?

shamilovtim commented 2 years ago

On December 17 I added SIMCTL_CHILD_NSZombieEnabled=YES to yarn detox test -c ci.ios.release in our project.
On December 29 I introduced Detox 19.4.1 to the project, which is supposed to include this fix inside detox instead of me declaring it as an env var and removed SIMCTL_CHILD_NSZombieEnabled=YES from detox test -c. In between this time we still saw signal 11 in the project.
After December 29, we continued to see Signal 11.
On January 25, we upgraded to Detox 19.4.2 and after that continued to receive signal 11 in the project.

No matter what version of Detox or whether or not we apply SIMCTL_CHILD_NSZombieEnabled=YES Signal 11 continues to live in our project. We used to see it way more often in the past, so it has gone down in frequency. I assume that's because the Firebase fix did remove some instances of it but the other ones are still occurring.

asafkorem commented 2 years ago

@shamilovtim this sounds like the current Signal 11 errors are more Expo related (and aren't coming from Firebase), based on some previous reports and my gut feeling...

Also, can you tell me if you have some other dependency that uses swizzling in iOS?

That would be hard to investigate based on the logs only, so we must have some project I can work with that reproduces this issue. I can try and invest some time soon in testing an Expo + Detox project with the Expo version you have, but if you have something ready that reproduces the crash on the latest Detox version and you can send it me this would be very helpful.

shamilovtim commented 2 years ago

@asafkorem In my opinion the suspects are:

Expo (we don't use expo updates), Stripe (v14 stripe-ios via tipsi stripe), Bugsnag, Codepush, Segment Analytics

Stripe iOS (v14) for sure has method swizzling. I was able to find it by scanning the source code of the Pod.

Flipper uses method swizzling but will not be active during Release builds. (However, I do wonder if Flipper would crash detox debug)?

Excuse my ignorance but is there any way of making Detox play nice with any and all method swizzling rather than the Firebase workaround created in the Google repo?

asafkorem commented 2 years ago

@shamilovtim the problem in Firebase/Performance wasn't in Detox end but in their code (it pretty much assumed that no other framework does dynamic ISA-Swizzling for iOS classes), so it's pretty hard to assume that there is a general solution to the problem from Detox' code.

However, once we'll finish with Detox-iOS transition to XCUITest framework, we may be able to avoid doing ISA swizzling for iOS classes so hopefully it will resolve such issues.

asafkorem commented 2 years ago

@asafkorem In my opinion the suspects are:

Expo (we don't use expo updates), Stripe (v14 stripe-ios via tipsi stripe), Bugsnag, Codepush, Segment Analytics

Stripe iOS (v14) for sure has method swizzling. I was able to find it by scanning the source code of the Pod.

Flipper uses method swizzling but will not be active during Release builds. (However, I do wonder if Flipper would crash detox debug)?

Excuse my ignorance but is there any way of making Detox play nice with any and all method swizzling rather than the Firebase workaround created in the Google repo?

These definitely sound suspicious 😅 It will be very helpful for us if you'll run DetoxTemplate (or any example RN project) with Detox tests and any of your dependencies that might cause this error, and try to reproduce the bug. I know it might take a lot of time, but it is also hard for me to reproduce this issue with other libraries or frameworks I'm not familiar with.

ball-hayden commented 2 years ago

We're seeing this, I can maybe narrow down a bit as we intersect with some of those dependencies.

We're using bits of Expo and CodePush. (We don't use Stripe, Bugsnag, or Segment Analytics)

We also use reanimated (which I see in your dependency list) and our tests do seem to fail with Signal 11s around about places where there are animations - would that match up with what you're seeing @shamilovtim?

shamilovtim commented 2 years ago

We can't really nail down the cause. What version reanimated are you on?

ball-hayden commented 2 years ago

We're running react-native-reanimated: ^2.3.1

shamilovtim commented 2 years ago

We're on 1.13.1. Makes me skeptical it's reanimated. Could it just be Expo in some way?

ball-hayden commented 2 years ago

Possibly, although I can't help but feel it might be animation-timing related given the places we see this are always slightly after an animation has completed.

The places we are seeing this aren't places where we're interacting with Expo directly.

I've tried creating a project based on the Detox Repro Template that has Expo and reanimated in it but I haven't managed to get it to crash yet.

Something I have noticed is that the amount of time it takes Detox to "react" (i.e. continue to the next command) to an animation finishing seems to vary somewhat - see the video below at 19 seconds vs the other repetitions from my attempted reproduction:

https://user-images.githubusercontent.com/1596233/160562283-93f25e5b-da13-4706-bc16-4d654b5b66d6.mov

I also wondered if it might be a race between an animation finishing and a request completing, although again haven't managed to coax the reproduction into crashing.

shamilovtim commented 2 years ago

Possibly, although I can't help but feel it might be animation-timing related given the places we see this are always slightly after an animation has completed.

The places we are seeing this aren't places where we're interacting with Expo directly.

I've tried creating a project based on the Detox Repro Template that has Expo and reanimated in it but I haven't managed to get it to crash yet.

Something I have noticed is that the amount of time it takes Detox to "react" (i.e. continue to the next command) to an animation finishing seems to vary somewhat - see the video below at 19 seconds vs the other repetitions from my attempted reproduction:

https://user-images.githubusercontent.com/1596233/160562283-93f25e5b-da13-4706-bc16-4d654b5b66d6.mov

I also wondered if it might be a race between an animation finishing and a request completing, although again haven't managed to coax the reproduction into crashing.

Are you using the template in CI? Do you have your code up somewhere so i can pull it or add to it?

We only get these crashes in CI fyi

ball-hayden commented 2 years ago

We only get these crashes in CI

^ that is a point. I don't think I've ever seen it locally.

Fighting some other fires, but I'll get the template pushed somewhere when I can.

shamilovtim commented 2 years ago

@asafkorem I wonder if we can add reanimated, gesture handler, expo and react navigation to the official Detox e2e tests? By adding all of the common dependencies it might help to catch some of this crashing. What CI provider do you guys use for the project?

noomorph commented 2 years ago

@shamilovtim

What CI provider do you guys use for the project?

Buildkite + Macstadium

ball-hayden commented 2 years ago

@shamilovtim sorry for the delay.

https://github.com/PlayerData/DetoxReanimatedRepro

Just trying to set up GitHub actions to see if that might result in a reproduction.

ball-hayden commented 2 years ago

@shamilovtim can I ask, do you run multiple simulators at once in CI (as in, do you use the --max-workers flag)?

ball-hayden commented 2 years ago

Running https://github.com/PlayerData/DetoxReanimatedRepro/pull/4 locally, Detox is consistently not waiting for animations to finish for some workers (between 1 and 3).

It doesn't, however, Signal 11

ball-hayden commented 2 years ago

To brain dump a bit more - the stack trace takes us to here (specifically, the block):

https://github.com/wix/DetoxSync/blob/f5cd8cbde60311bb8e41df7c773850cd73c8f84b/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m#L120

Could fire be running for a timer that's been disposed, leading to mode or timer being invalid by the time we execute the block?

Sorry to ping @LeoNatan, but I'm waaay out of my depth here - would you have any ideas around this space?

shamilovtim commented 2 years ago

@shamilovtim can I ask, do you run multiple simulators at once in CI (as in, do you use the --max-workers flag)?

No we don't. CircleCI doesn't allow hardware acceleration so we are limited performance wise

shamilovtim commented 2 years ago

To brain dump a bit more - the stack trace takes us to here (specifically, the block):

https://github.com/wix/DetoxSync/blob/f5cd8cbde60311bb8e41df7c773850cd73c8f84b/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m#L120

Could fire be running for a timer that's been disposed, leading to mode or timer being invalid by the time we execute the block?

Sorry to ping @LeoNatan, but I'm waaay out of my depth here - would you have any ideas around this space?

I wonder if this has something to do with software rendered, slow CI VMs? Maybe the VM is so slow that timer times out and gets GCCed or whatever like you said? Which is why we don't see this on Detox running on bare metal instances

ball-hayden commented 2 years ago

Our CI doesn't use VMs but instead a pair of slightly ageing Mac Minis (2018) so the same theory holds.

ball-hayden commented 2 years ago

Couple of patches I tried yesterday.

Firstly, this change which guards against the returned proxy being null.

diff --git a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/CADisplayLink+DTXSpy.m b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/CADisplayLink+DTXSpy.m
index 023eabc..c296644 100644
--- a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/CADisplayLink+DTXSpy.m
+++ b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/CADisplayLink+DTXSpy.m
@@ -102,13 +102,16 @@ extern atomic_cfrunloop __RNRunLoop;
    if(self.isPaused != paused)
    {
        id<DTXTimerProxy> proxy = [DTXTimerSyncResource existingTimerProxyWithDisplayLink:self create:NO];
-       if(paused == YES)
-       {
-           [proxy untrack];
-       }
-       else
-       {
-           [self _detox_sync_trackIfNeeded];
+
+       if(proxy) {
+           if(paused == YES)
+           {
+               [proxy untrack];
+           }
+           else
+           {
+               [self _detox_sync_trackIfNeeded];
+           }
        }
    }

This didn't help, which kinda makes sense - there should always be a proxy at this point if I've understood properly.

The more interesting change is this one:

diff --git a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/NSTimer+DTXSpy.m b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/NSTimer+DTXSpy.m
index 8f5549c..9a82b9e 100644
--- a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/NSTimer+DTXSpy.m
+++ b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/NSTimer+DTXSpy.m
@@ -78,7 +78,7 @@ CFRunLoopTimerRef __detox_sync_CFRunLoopTimerCreateWithHandler(CFAllocatorRef al
 static void (*__orig_CFRunLoopAddTimer)(CFRunLoopRef rl, CFRunLoopTimerRef timer, CFRunLoopMode mode);
 void __detox_sync_CFRunLoopAddTimer(CFRunLoopRef rl, CFRunLoopTimerRef timer, CFRunLoopMode mode)
 {
-// NSLog(@"🤦‍♂️ addTimer: %@", NS(timer));
+   NSLog(@"🤦‍♂️ addTimer: %@", NS(timer));

    id<DTXTimerProxy> trampoline = [DTXTimerSyncResource existingTimerProxyWithTimer:NS(timer)];
    trampoline.runLoop = rl;
@@ -91,7 +91,7 @@ void __detox_sync_CFRunLoopAddTimer(CFRunLoopRef rl, CFRunLoopTimerRef timer, CF
 static void (*__orig_CFRunLoopRemoveTimer)(CFRunLoopRef rl, CFRunLoopTimerRef timer, CFRunLoopMode mode);
 void __detox_sync_CFRunLoopRemoveTimer(CFRunLoopRef rl, CFRunLoopTimerRef timer, CFRunLoopMode mode)
 {
-// NSLog(@"🤦‍♂️ removeTimer: %@", NS(timer));
+   NSLog(@"🤦‍♂️ removeTimer: %@", NS(timer));

    id<DTXTimerProxy> trampoline = [DTXTimerSyncResource existingTimerProxyWithTimer:NS(timer)];
    [trampoline untrack];
@@ -102,7 +102,7 @@ void __detox_sync_CFRunLoopRemoveTimer(CFRunLoopRef rl, CFRunLoopTimerRef timer,
 static void (*__orig_CFRunLoopTimerInvalidate)(CFRunLoopTimerRef timer);
 void __detox_sync_CFRunLoopTimerInvalidate(CFRunLoopTimerRef timer)
 {
-// NSLog(@"🤦‍♂️ invalidate: %@", NS(timer));
+   NSLog(@"🤦‍♂️ invalidate: %@", NS(timer));

    id<DTXTimerProxy> trampoline = [DTXTimerSyncResource existingTimerProxyWithTimer:NS(timer)];
    [trampoline untrack];
@@ -113,7 +113,7 @@ void __detox_sync_CFRunLoopTimerInvalidate(CFRunLoopTimerRef timer)
 static void (*__orig___NSCFTimer_invalidate)(NSTimer* timer);
 void __detox_sync___NSCFTimer_invalidate(NSTimer* timer)
 {
-   //  NSLog(@"🤦‍♂️ invalidate: %@", timer);
+    NSLog(@"🤦‍♂️ invalidate: %@", timer);

    id<DTXTimerProxy> trampoline = [DTXTimerSyncResource existingTimerProxyWithTimer:timer];
    [trampoline untrack];
diff --git a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m
index 5235fdc..b66d945 100644
--- a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m
+++ b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m
@@ -162,6 +162,8 @@ const void* __DTXTimerTrampolineKey = &__DTXTimerTrampolineKey;
        return;
    }

+    NSLog(@"🤦‍♂️ track: %@", _timer);
+   
    _tracking = YES;
    [DTXTimerSyncResource.sharedInstance trackTimerTrampoline:self];
 }
@@ -173,7 +175,7 @@ const void* __DTXTimerTrampolineKey = &__DTXTimerTrampolineKey;
        return;
    }

-   //  NSLog(@"🤦‍♂️ untrack: %@", _timer);
+    NSLog(@"🤦‍♂️ untrack: %@", _timer);

    [DTXTimerSyncResource.sharedInstance untrackTimerTrampoline:self];
    _tracking = NO;

All this change is doing is re-enabling some of the commented out logging (and adding a little bit extra).

With this change, I wasn't able to reproduce a segfault over around 10 runs (usually, I would have seen at least one if not two in that space).

Presumably, adding the logging is causing a reference to the timer to be retained for longer than it would be otherwise. Either that, or the extra slowness of logging makes the race no longer happen.

shamilovtim commented 2 years ago

Couple of patches I tried yesterday.

Firstly, this change which guards against the returned proxy being null.

diff --git a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/CADisplayLink+DTXSpy.m b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/CADisplayLink+DTXSpy.m
index 023eabc..c296644 100644
--- a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/CADisplayLink+DTXSpy.m
+++ b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/CADisplayLink+DTXSpy.m
@@ -102,13 +102,16 @@ extern atomic_cfrunloop __RNRunLoop;
  if(self.isPaused != paused)
  {
      id<DTXTimerProxy> proxy = [DTXTimerSyncResource existingTimerProxyWithDisplayLink:self create:NO];
-     if(paused == YES)
-     {
-         [proxy untrack];
-     }
-     else
-     {
-         [self _detox_sync_trackIfNeeded];
+
+     if(proxy) {
+         if(paused == YES)
+         {
+             [proxy untrack];
+         }
+         else
+         {
+             [self _detox_sync_trackIfNeeded];
+         }
      }
  }

This didn't help, which kinda makes sense - there should always be a proxy at this point if I've understood properly.

The more interesting change is this one:

diff --git a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/NSTimer+DTXSpy.m b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/NSTimer+DTXSpy.m
index 8f5549c..9a82b9e 100644
--- a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/NSTimer+DTXSpy.m
+++ b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Spies/NSTimer+DTXSpy.m
@@ -78,7 +78,7 @@ CFRunLoopTimerRef __detox_sync_CFRunLoopTimerCreateWithHandler(CFAllocatorRef al
 static void (*__orig_CFRunLoopAddTimer)(CFRunLoopRef rl, CFRunLoopTimerRef timer, CFRunLoopMode mode);
 void __detox_sync_CFRunLoopAddTimer(CFRunLoopRef rl, CFRunLoopTimerRef timer, CFRunLoopMode mode)
 {
-//   NSLog(@"🤦‍♂️ addTimer: %@", NS(timer));
+ NSLog(@"🤦‍♂️ addTimer: %@", NS(timer));

  id<DTXTimerProxy> trampoline = [DTXTimerSyncResource existingTimerProxyWithTimer:NS(timer)];
  trampoline.runLoop = rl;
@@ -91,7 +91,7 @@ void __detox_sync_CFRunLoopAddTimer(CFRunLoopRef rl, CFRunLoopTimerRef timer, CF
 static void (*__orig_CFRunLoopRemoveTimer)(CFRunLoopRef rl, CFRunLoopTimerRef timer, CFRunLoopMode mode);
 void __detox_sync_CFRunLoopRemoveTimer(CFRunLoopRef rl, CFRunLoopTimerRef timer, CFRunLoopMode mode)
 {
-//   NSLog(@"🤦‍♂️ removeTimer: %@", NS(timer));
+ NSLog(@"🤦‍♂️ removeTimer: %@", NS(timer));

  id<DTXTimerProxy> trampoline = [DTXTimerSyncResource existingTimerProxyWithTimer:NS(timer)];
  [trampoline untrack];
@@ -102,7 +102,7 @@ void __detox_sync_CFRunLoopRemoveTimer(CFRunLoopRef rl, CFRunLoopTimerRef timer,
 static void (*__orig_CFRunLoopTimerInvalidate)(CFRunLoopTimerRef timer);
 void __detox_sync_CFRunLoopTimerInvalidate(CFRunLoopTimerRef timer)
 {
-//   NSLog(@"🤦‍♂️ invalidate: %@", NS(timer));
+ NSLog(@"🤦‍♂️ invalidate: %@", NS(timer));

  id<DTXTimerProxy> trampoline = [DTXTimerSyncResource existingTimerProxyWithTimer:NS(timer)];
  [trampoline untrack];
@@ -113,7 +113,7 @@ void __detox_sync_CFRunLoopTimerInvalidate(CFRunLoopTimerRef timer)
 static void (*__orig___NSCFTimer_invalidate)(NSTimer* timer);
 void __detox_sync___NSCFTimer_invalidate(NSTimer* timer)
 {
- //  NSLog(@"🤦‍♂️ invalidate: %@", timer);
+    NSLog(@"🤦‍♂️ invalidate: %@", timer);

  id<DTXTimerProxy> trampoline = [DTXTimerSyncResource existingTimerProxyWithTimer:timer];
  [trampoline untrack];
diff --git a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m
index 5235fdc..b66d945 100644
--- a/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m
+++ b/node_modules/detox/DetoxSync/DetoxSync/DetoxSync/Utils/_DTXTimerTrampoline.m
@@ -162,6 +162,8 @@ const void* __DTXTimerTrampolineKey = &__DTXTimerTrampolineKey;
      return;
  }

+    NSLog(@"🤦‍♂️ track: %@", _timer);
+ 
  _tracking = YES;
  [DTXTimerSyncResource.sharedInstance trackTimerTrampoline:self];
 }
@@ -173,7 +175,7 @@ const void* __DTXTimerTrampolineKey = &__DTXTimerTrampolineKey;
      return;
  }

- //  NSLog(@"🤦‍♂️ untrack: %@", _timer);
+    NSLog(@"🤦‍♂️ untrack: %@", _timer);

  [DTXTimerSyncResource.sharedInstance untrackTimerTrampoline:self];
  _tracking = NO;

All this change is doing is re-enabling some of the commented out logging (and adding a little bit extra).

With this change, I wasn't able to reproduce a segfault over around 10 runs (usually, I would have seen at least one if not two in that space).

Presumably, adding the logging is causing a reference to the timer to be retained for longer than it would be otherwise. Either that, or the extra slowness of logging makes the race no longer happen.

Awesome findings man.

Presumably, adding the logging is causing a reference to the timer to be retained for longer than it would be otherwise

Couldn't you easily prove this hypothesis by setting the timer to a variable rather than using it in the log?

shamilovtim commented 2 years ago

I wonder if in all of these methods if we create a new timer from the old one that would be good enough to prevent the reference nulling out from its parent?

ball-hayden commented 2 years ago

Couldn't you easily prove this hypothesis by setting the timer to a variable rather than using it in the log?

I am very much not an Objective C programmer, but my assumption is that a local variable wouldn't be enough to prevent GC (from past similar experiences with Swift).

I wonder if we can keep a reference within DTXTimerProxy though?

I didn't have time to look further into this today - tomorrow is a new day, and hopefully I can find time again.

shamilovtim commented 2 years ago

This article might help: https://stackoverflow.com/questions/4945028/repeating-nstimer-weak-reference-owning-reference-or-ivar

ball-hayden commented 2 years ago

So - turns out I was just "lucky".

Rebuilt the branch with the logging changes today and it did segfault still. I guess the good news is that I now have those logs.

Slightly redacted / filtered version here: device-redacted.log

ball-hayden commented 2 years ago

Do we maybe need some sort of synchronisation logic between fire and untrack?

ball-hayden commented 2 years ago

I've left some thoughts on https://github.com/wix/DetoxSync/pull/33, although I still haven't managed to get to the bottom of the fault.

ball-hayden commented 2 years ago

My conclusion is that anyone not seeing this is just lucky - the memory management in DetoxSync around timer tracking is dodgy.

Certain races cause DetoxSync's NSTimer spy to attempt to trigger callbacks on objects that don't exist anymore.

(This is also why it's so hard to reproduce - it is very timing dependent)

ball-hayden commented 2 years ago

Okay. I'm giving up for now - this has had a week of my time we can't afford to spend any more on this.

https://github.com/wix/Detox/issues/3319 is preventing me from debugging further within our app, and everything I've thrown at https://github.com/wix/DetoxSync/pull/33 has left me without any further ideas.

For now, I'm adding the --retry flag to our runs - we'll just have to take the hit on increased runtimes.

shamilovtim commented 2 years ago

Wish I could help but I've had no luck reproducing signal 11 locally so I'm flying blind

ball-hayden commented 2 years ago

Me neither @shamilovtim - I've only been able to go off CI runs.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you believe the issue is still relevant, please test on the latest Detox and report back.

Thank you for your contributions!

For more information on bots in this reporsitory, read this discussion.

shamilovtim commented 2 years ago

Not stale

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you believe the issue is still relevant, please test on the latest Detox and report back.

Thank you for your contributions!

For more information on bots in this reporsitory, read this discussion.

shamilovtim commented 2 years ago

Valid

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you believe the issue is still relevant, please test on the latest Detox and report back.

Thank you for your contributions!

For more information on bots in this reporsitory, read this discussion.

sharan-SKY commented 2 years ago

has there been any more investigation on this? things were running pretty consistently in macOS10.5 but issue started once we upgraded our mac to 11 or 12

shamilovtim commented 2 years ago

I think we left off at @ball-hayden's efforts which have been the most thorough to date. Our team turned retry on and so we're swallowing the cost of the retry time rather than thoroughly deflaking now.

sharan-SKY commented 2 years ago

that's no good. do you have some metrics on the retry? Like how many retries does it take for your test to pass etc?

looks like best solution is to stick with macos10.5 till this issue is fixed

shamilovtim commented 2 years ago

One retry is all we need.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you believe the issue is still relevant, please test on the latest Detox and report back.

Thank you for your contributions!

For more information on bots in this reporsitory, read this discussion.

shamilovtim commented 1 year ago

I think this is still valid

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you believe the issue is still relevant, please test on the latest Detox and report back.

Thank you for your contributions!

For more information on bots in this repository, read this discussion.

shamilovtim commented 1 year ago

valid

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you believe the issue is still relevant, please test on the latest Detox and report back.

Thank you for your contributions!

For more information on bots in this repository, read this discussion.

wix / Detox