xuxiandi / angleproject

Automatically exported from code.google.com/p/angleproject
Other
0 stars 0 forks source link

ANGLE gets locked in tight loop in Context::sync after DoS #262

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. DoS an ANGLE context while debugging it.
2. Break All.
3. At any point, it will be locked in the below code fragment:

do
{
    result = eventQuery->GetData(NULL, 0, D3DGETDATA_FLUSH);

    if(block && result == S_FALSE)
    {
        // Keep polling, but allow other threads to do something useful first.
        Sleep(0);
    }
}

What is the expected output? What do you see instead?
It should not get locked in this code. If I Break All in MSVC and set "block" 
to "false" and then allow it to continue, everything seems to work properly and 
as expected afterwards.

What version of the product are you using? On what operating system?
r885, Windows 7.

Please provide any additional information below.

Original issue reported on code.google.com by d...@sherk.me on 6 Dec 2011 at 12:48

GoogleCodeExporter commented 9 years ago
Whoops, forgot to mention that it's created with robustness enabled.

Original comment by d...@sherk.me on 6 Dec 2011 at 12:53

GoogleCodeExporter commented 9 years ago
It does look like there is a problem here.

To reproduce in top of tree Chromium, open the JavaScript console and run e.g. 
sdk/tests/extra/slow-shader-example.html. Note that the loss of context is 
detected and reported to the console.

Then navigate to e.g. the San Angeles demo. First, a "sad tab" page is 
displayed. This is likely a bug in Chromium. Reload the page. The system will 
claim that WebGL is not enabled.

Killing the GPU process in Chrome's Task Manager and reloading the page allows 
it to continue. This indicates the GPU process is probably wedged.

It looks like this might have regressed in 
http://code.google.com/p/angleproject/source/detail?r=807 .

Original comment by kbr@chromium.org on 6 Dec 2011 at 1:18

GoogleCodeExporter commented 9 years ago
Thanks for the more in-depth repro. I'm using the Khronos 
lots-of-polys-example.html test. Fx doesn't have a separate GPU process, so 
this bug locks up the whole browser.

Original comment by d...@sherk.me on 6 Dec 2011 at 2:26

GoogleCodeExporter commented 9 years ago
I don't think this is a regression from r807 -- We had the same polling loop 
previously it just moved around.  In theory the GetData call should return 
D3DERR_DEVICELOST instead of S_FALSE if the device has been lost, and then the 
loop should break out...

Original comment by dan...@transgaming.com on 6 Dec 2011 at 4:01

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
I was able to reproduce the GPU process hang, but it's not stuck in 
Context::sync. Instead there are continuous glGetError calls, which return 
GL_OUT_OF_MEMORY because the context is lost. The proper response to 
GL_OUT_OF_MEMORY is to recreate the context.

ANGLE r902
Chrome 17.0.963.1 canary
Windows 7 x64
ATI FirePro V5700 (FireGL)

Original comment by nicolas....@gmail.com on 7 Dec 2011 at 4:45

GoogleCodeExporter commented 9 years ago
I was using an NVIDIA NVS 4200M. I haven't tried it on ANGLE r902 yet, though.

Original comment by d...@sherk.me on 8 Dec 2011 at 1:32

GoogleCodeExporter commented 9 years ago
Note that Chromium bug 95740 tracks a similar issue.

Original comment by kbr@chromium.org on 8 Dec 2011 at 7:29

GoogleCodeExporter commented 9 years ago
jbauman, since you're working on the Chromium side bug and I expect part of the 
fix will be in ANGLE, could you take this? Let me know if I can help with any 
WebGL specific issues.

Original comment by kbr@chromium.org on 8 Dec 2011 at 7:34

GoogleCodeExporter commented 9 years ago
Please note that r903 and r905 contain important fixes which likely affect 
device recovery.

Original comment by nicolas....@gmail.com on 8 Dec 2011 at 7:56

GoogleCodeExporter commented 9 years ago
Looks like chrome is spinning in a loop because ANGLE continually returns 
GL_OUT_OF_MEMORY, which seems to be against the GL spec: "When GetError is 
called,
the code is returned and the flag is cleared, so that a further error will 
again record
its code. If a call to GetError returns NO ERROR, then there has been no 
detectable
error since the last call to GetError (or since the GL was initialized)".

Original comment by jbauman@chromium.org on 9 Dec 2011 at 11:25

GoogleCodeExporter commented 9 years ago
Yes -- we're repeatedly returning GL_OUT_OF_MEMORY if the context is lost.  As 
you pointed out this isn't correct and will break any application that polls 
until glGetError returns GL_NONE.  

The problem is that we're manufacturing the error in glGetError.  If we want to 
keep the OOM error for GL calls with a lost context, the error will instead 
have to be set when calling other GL functions, not from glGetError itself.  
I'll fix this issue.

Original comment by dan...@transgaming.com on 12 Dec 2011 at 3:57

GoogleCodeExporter commented 9 years ago
r905 fixes the OOM error reporting when the context is lost.  The OOM error 
will only be returned on a lost device when another GL call has been made 
between sets of calls to glGetError, and thus it should be safe to call 
glGetError in a loop until GL_NO_ERROR is returned.

Recovery is still not fully working in Chrome though because the device reset 
is failing.  Looks like there are still some dangling references.

Original comment by dan...@transgaming.com on 12 Dec 2011 at 9:08

GoogleCodeExporter commented 9 years ago
We found what prevented the Reset from succeeding, and we're currently auditing 
the code for similar issues. We'll have a fix ASAP.

Original comment by nicolas....@gmail.com on 13 Dec 2011 at 4:45

GoogleCodeExporter commented 9 years ago
r908 and r909 should get context recovery working again.

With these changes I can do the following in Chrome Canary 17.0.936.6.
 - run San Angeles Demo in one tab
 - run the lots-of-polys example in another tab.  This produces a TDR and the driver is reset.
 - after the driver recovers, I can go to the San Angeles tab, reload and it works fine.
This has been verified on Win7 with d3d9 and d3d9ex.

(Note that I've never been able to reproduce the original issue reported by 
Doug.  Doug - can you retest with r909 to see if it works better for you now?)

Original comment by dan...@transgaming.com on 13 Dec 2011 at 5:37

GoogleCodeExporter commented 9 years ago
This is not fixed on r912.

Here's a tiny stack trace (although I don't suspect this will help):
>   libGLESv2.dll!gl::Context::sync(bool block)  Line 2905  C++
    libGLESv2.dll!glFinish()  Line 2028 C++
    xul.dll!mozilla::gl::GLContext::raw_fFinish()  Line 1845    C++

My guess is that you can't repro it because of our difference in hardware. My 
hardware is the following:

Thinkpad T520
NVIDIA NVS 4200M
Driver version 8.17.12.8562

Original comment by d...@sherk.me on 15 Dec 2011 at 9:21

GoogleCodeExporter commented 9 years ago
Ok, chrome (or most other programs) rarely uses glFinish, so that's probably 
why we haven't been hitting this.

Original comment by jbauman@chromium.org on 15 Dec 2011 at 10:10

GoogleCodeExporter commented 9 years ago
It probably doesn't hurt to test for context loss in the polling loop.

Original comment by nicolas....@gmail.com on 15 Dec 2011 at 10:17

GoogleCodeExporter commented 9 years ago
Yeah, busy-waiting slightly less efficiently probably won't be a problem.

Original comment by jbauman@chromium.org on 15 Dec 2011 at 10:20

GoogleCodeExporter commented 9 years ago
r924 adds in a check for context loss while busy waiting.  Should fix this 
problem.

Original comment by dan...@transgaming.com on 16 Dec 2011 at 11:39

GoogleCodeExporter commented 9 years ago
I can confirm that r924 fixes this problem for me.

Original comment by d...@sherk.me on 17 Dec 2011 at 12:31