Daniel Vetter 7abb690a0e drm/i915: Fix spurious -EIO/SIGBUS on wedged gpus
Chris Wilson noticed that since

commit 1f83fee08d625f8d0130f9fe5ef7b17c2e022f3c [v3.9]
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Nov 15 17:17:22 2012 +0100

    drm/i915: clear up wedged transitions

X can again get -EIO when it does not expect it. And even worse score
a SIGBUS when accessing gtt mmaps. The established ABI is that we
_only_ return an -EIO from execbuf - all other ioctls should just
work. And since the reset code moves all bos out of gpu domains and
clears out all the last_seqno/ring tracking there really shouldn't be
any reason for non-execbuf code to ever touch the hw and see an -EIO.

After some extensive discussions we've noticed that these spurios -EIO
are caused by i915_gem_wait_for_error:

http://www.mail-archive.com/intel-gfx@lists.freedesktop.org/msg20540.html

That is easy to fix by returning 0 instead of -EIO, since grabbing the
dev->struct_mutex does not yet mean that we actually want to touch the
hw. And so there is no reason at all to fail with -EIO.

But that's not the entire since, since often (at least it's easily
googleable) dmesg indicates that the reset fails and we declare the
gpu wedged. Then, quite a bit later X wakes up with the "Timed out
waiting for the gpu reset to complete" DRM_ERROR message in
wait_for_errror and brings down the desktop with an -EIO/SIGBUS.

So clearly we're missing a wakeup somewhere, since the gpu reset just
doesn't take 10 seconds to complete. And indeed we're do handle the
terminally wedged state wrong.

Fix this all up.

References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Damien Lespiau <damien.lespiau@intel.com>
Cc: stable@vger.kernel.org
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
2013-06-03 14:35:18 +02:00
..
2013-02-19 17:57:44 -05:00
2013-05-31 12:45:09 +10:00
2013-02-27 19:10:16 -08:00
2013-02-27 19:10:16 -08:00
2013-02-27 19:10:16 -08:00
2013-02-27 19:10:15 -08:00
2013-04-30 15:15:58 +02:00
2013-04-30 10:02:25 +10:00
2013-04-26 10:20:00 +10:00

************************************************************
* For the very latest on DRI development, please see:      *
*     http://dri.freedesktop.org/                          *
************************************************************

The Direct Rendering Manager (drm) is a device-independent kernel-level
device driver that provides support for the XFree86 Direct Rendering
Infrastructure (DRI).

The DRM supports the Direct Rendering Infrastructure (DRI) in four major
ways:

    1. The DRM provides synchronized access to the graphics hardware via
       the use of an optimized two-tiered lock.

    2. The DRM enforces the DRI security policy for access to the graphics
       hardware by only allowing authenticated X11 clients access to
       restricted regions of memory.

    3. The DRM provides a generic DMA engine, complete with multiple
       queues and the ability to detect the need for an OpenGL context
       switch.

    4. The DRM is extensible via the use of small device-specific modules
       that rely extensively on the API exported by the DRM module.


Documentation on the DRI is available from:
    http://dri.freedesktop.org/wiki/Documentation
    http://sourceforge.net/project/showfiles.php?group_id=387
    http://dri.sourceforge.net/doc/

For specific information about kernel-level support, see:

    The Direct Rendering Manager, Kernel Support for the Direct Rendering
    Infrastructure
    http://dri.sourceforge.net/doc/drm_low_level.html

    Hardware Locking for the Direct Rendering Infrastructure
    http://dri.sourceforge.net/doc/hardware_locking_low_level.html

    A Security Analysis of the Direct Rendering Infrastructure
    http://dri.sourceforge.net/doc/security_low_level.html