The system:

MSI Raider GE67 HX 12UHS

Intel Core i9-12900HX

NVIDIA GeForce RTX 3080 Ti (laptop)

32GiB RAM

Win11 Pro 64-bit

The problem:

Once in a while (usually 2-3 times per day), the system crashes, usually resulting in a blue screen with one of various error codes. Codes I’ve seen include:

HYPERVISOR_ERROR

CLOCK_WATCHDOG_TIMEOUT

VIDEO_TDR_FAILURE

IRQL_NOT_LESS_OR_EQUAL
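For anyone trying to spot a pattern across crashes like these, here is a minimal Python sketch that tallies stop codes from a hand-kept log. The hex values are the standard Windows bug check codes for the four names above; the `tally` helper and the log format (a list of date/name pairs) are hypothetical and only there for illustration.

```python
from collections import Counter

# Standard Windows bug check values for the stop codes listed above.
BUGCHECK_CODES = {
    "IRQL_NOT_LESS_OR_EQUAL": 0x0000000A,
    "CLOCK_WATCHDOG_TIMEOUT": 0x00000101,
    "VIDEO_TDR_FAILURE": 0x00000116,
    "HYPERVISOR_ERROR": 0x00020001,
}

def tally(crash_log):
    """Count stop codes in a hand-kept log of (date, stop_code_name) pairs.

    Hypothetical helper: returns (name, hex_code, count) tuples,
    most frequent first. Names not in the table get "unknown".
    """
    counts = Counter(name for _, name in crash_log)
    rows = []
    for name, n in counts.most_common():
        code = BUGCHECK_CODES.get(name)
        hex_code = f"0x{code:08X}" if code is not None else "unknown"
        rows.append((name, hex_code, n))
    return rows
```

Called with a list such as `[("Mon", "VIDEO_TDR_FAILURE"), ("Tue", "HYPERVISOR_ERROR")]`, it shows at a glance whether one code dominates or the crashes are spread evenly.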

Sometimes the system hangs but the blue screen never comes, and I have to power it off manually. When this happens, the fans go to full speed and yet the laptop quickly becomes incredibly hot if I don’t power it off as soon as possible, suggesting that the CPU or GPU is maxing out for some reason.

Checking with Event Viewer shows nothing out of the ordinary in the lead-up to the crash.
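The same check can be scripted instead of clicking through Event Viewer. Below is a minimal sketch, assuming Python on the affected Windows machine, that asks the built-in `wevtutil` tool for recent crash-related entries in the System log. The choice of event IDs (41 for an unclean reboot, 1001 for a bug check record, 6008 for an unexpected shutdown) and the helper names are assumptions for illustration, not something taken from the steps above.

```python
import subprocess

# Event IDs commonly associated with crashes in the Windows System log
# (an assumption for this sketch): 41 = Kernel-Power unclean reboot,
# 1001 = BugCheck record, 6008 = unexpected shutdown.
CRASH_EVENT_IDS = (41, 1001, 6008)

def build_query(event_ids=CRASH_EVENT_IDS):
    """Build an XPath filter matching any of the given event IDs."""
    clause = " or ".join(f"EventID={eid}" for eid in event_ids)
    return f"*[System[({clause})]]"

def build_command(max_events=20, event_ids=CRASH_EVENT_IDS):
    """Return the wevtutil argument list: newest events first, text output."""
    return [
        "wevtutil", "qe", "System",
        f"/q:{build_query(event_ids)}",
        f"/c:{max_events}",
        "/rd:true",
        "/f:text",
    ]

def recent_crash_events(max_events=20):
    """Run the query and return its text output (Windows only)."""
    result = subprocess.run(
        build_command(max_events),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Running `print(recent_crash_events())` from an elevated prompt should list the most recent matching entries; comparing their timestamps against when the crashes happened may show whether Windows recorded a bug check at all for the hangs that never reached a blue screen.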

Things I’ve ruled out:

I initially thought it only happened while plugged in, so I bought a new power supply. That didn’t seem to affect the frequency of the issue, and I have since seen it happen on battery as well.

I also initially thought it was more frequent while playing games that use the dedicated graphics card, but I’m not sure that’s actually true; I have seen it happen even while just watching YouTube.

At one point I felt that it happened more when I moved the laptop or plugged in USB devices, but I think that may be magical thinking; I have never been able to make it happen on purpose by doing those things.

One pattern does seem to hold: after a crash, if I let the laptop restart automatically, it often crashes again within a short time, whereas shutting down and then turning it back on gives more time before the next incident.

Solutions I’ve tried:

I updated the BIOS and the Intel firmware to the latest versions available on MSI’s website, but that doesn’t seem to have helped. I also updated my NVIDIA drivers.

A possibly related issue:

A week or so before this happened for the first time, I updated the BIOS to fix a different issue. What happened then was: I was playing a game on battery unintentionally, and didn’t notice until the “low battery - switching to Super Battery” warning appeared and the system began throttling performance. I plugged the laptop in, but performance didn’t improve. I restarted and performance was terrible across all applications, even Firefox. I checked Resource Manager and noticed that the CPU was being throttled down to around 0.16 GHz. Event Viewer was showing warnings about this that said the processor was being limited by system firmware.

I tried using various Windows and MSI power management settings to resolve the issue, which persisted across restarts, fully charging the battery, etc. In the end, I solved it by updating the BIOS (to a version that is now one version back from the most current one).

It was a while, maybe a week, after running the update that the crash happened for the first time.

Current theory:

Is it possible I screwed up the BIOS update somehow? I noticed that it instructs you to return clock speeds to stock before doing the update. I don’t think I’ve manually adjusted them, but MSI’s “MSI Center” software seems to offer automatic adjustment. It was set to “Balanced” when I did the most recent update, but it may have been set to “Auto” when I did the first one, which I guess could be a problem if the CPU was automatically overclocked.

  • ryven@lemmy.dbzer0.comOP
    11 months ago

    Update: memtest86 passed! That’s good, I guess, but I really did think this was the best suggestion, so I’m kind of surprised. I’m going to find a test for the graphics card, and if it passes I’m following the other recommendation to clean reinstall the OS.

    • GrundlButter@lemmy.dbzer0.com
      11 months ago

      Good and bad news indeed. I think you’ve got the right course of action: if it’s not a discernible piece of hardware, then a nuclear approach to software is warranted. BIOS/microcode updates are another effort I would add as well. I wish you luck!

      • ryven@lemmy.dbzer0.comOP
        9 months ago

        Hey just so you know, I finally got around to fixing this after puttering around with it a bit at a time for months, and long story short the SSD was failing, despite several test programs claiming it was good (???). New SSD is running fine.

        Edit: Well, that didn’t last long. The bluescreens are back on the new hardware with a clean install. New hypothesis: whatever is causing them is also what caused the previous SSD to fail. Rather than sacrifice additional components trying to figure it out, I’m just going to call it here and see if it’s still under warranty.