Forum Discussion
Hello, thank you for taking time to help out on the forum.
Right before my game crashes I see a big spike in commit charge (see image). Because of this I thought the crash had something to do with memory. Or would the spike be an effect of the crash?
Also, do any of you feel that the crashes have gotten more and more frequent with the last two patches? Would this be because of the lifted FPS cap?
@FatSpacePanda, I don't know for sure, but I speculate that the spike is probably when the OS comes in and starts to handle the crash.
- OrioStorm, 7 years ago
EA Staff (Retired)
I want to thank everybody in this thread for their help on this issue, especially @Falkentyne, @JorPorCorTTV, @MrDakk, and @TEZZ0FIN0.
Based on all the logs and investigation we've done, I'm now convinced that this is a flaw with Intel chips. There is some sequence of instructions that causes results to be used before they're ready on Intel chips. There is one function in Apex that executes this sequence. It doesn't have to be a complex, CPU-intensive program to expose flaws like this; they could even show up in Notepad! I know that we have other functions that are heavily optimized that don't crash; this function that crashes so much is hardly optimized because it's so simple.
It seems that overclocking Intel chips can cause this to happen pretty reliably. It also seems that this can happen when Intel SpeedStep technology raises the clock speed without actually overclocking at all. In all cases, it seems that lowering your clock speed causes the crashes in this code to stop.
So, I base my conclusion that it is an Intel hardware flaw on the experimental evidence that (1) the information about the CPU state that the OS reports for these crashes is always impossible for a properly functioning CPU, (2) these crashes are only reported on Intel chips, and (3) lowering the clock speed always fixes these crashes (even if it wasn't overclocked).
These crashes are exceedingly rare from a CPU perspective. If you crash every other game, that feels like a lot, because it is. My personal goal is nobody crashing ever! However, say that crashing every other game translates into crashing about once every 20 minutes. The CPU runs the instructions that crash at least 100,000 times a second when you're playing the game. With these conservative estimates, the CPU crashes about once every 120 million times it runs this code. So, even though it truly does crash a lot, the conservative estimate is that even a malfunctioning CPU actually functions properly in this code about 99.999999% of the time.
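The estimate above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the post (one crash per 20 minutes, 100,000 executions per second); the variable names are made up for illustration.

```python
# Figures taken from the post; names are illustrative only.
MINUTES_BETWEEN_CRASHES = 20
CALLS_PER_SECOND = 100_000

# How many times the code runs between two crashes.
calls_per_crash = MINUTES_BETWEEN_CRASHES * 60 * CALLS_PER_SECOND

# Fraction of executions that complete without crashing.
success_rate = 1 - 1 / calls_per_crash

print(f"{calls_per_crash:,} executions per crash")  # 120,000,000 executions per crash
print(f"{success_rate:.6%} of executions succeed")  # 99.999999% of executions succeed
```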
So, what next?
Well, I tried to isolate the crashing function to make a standalone program to exhibit the CPU bug. Unfortunately, I couldn't get the compiler to generate the exact same disassembly. Since the crash appears to depend on the actual sequence of instructions the CPU actually runs, if the disassembly is not equivalent, I don't think it's a good test. Even if I had identical instructions, I don't know that I could reproduce a data set that would cause this function to crash. The real data set depends on where you are in King's Canyon and which way you are looking. I can't replicate all the code and data to generate this data set in a standalone test program, so I'd have to generate random data and hope it crashes. But we've never seen this crash on any of our work machines, so I don't really have any way to verify that I've come up with a crashing data set.
All this to say, I've decided that it takes too much time to make a standalone program to try to repro this bug. The program has a low chance of succeeding, and I can't locally test whether it succeeds or not. Even if we got lucky and it happened to work, it would only help highly technical people dial in their clock speeds, and it might help Intel fix their hardware flaw.
But, it wouldn't help everybody else who has an Intel chip who keeps crashing and doesn't post here and who doesn't want to mess with their clock speeds. I want to help them too! And as a consumer, I know that when you're crashing you don't really care whether it's the CPU's fault or the video card driver's fault or the game's fault; you just want the game to work.
So, I'm going to try changing the function that has all these impossible crashes to do the same job in a slightly different way, and hope that nudges it out of the "sweet spot" that causes some Intel chips to crash every few matches. Until we verify the fix, I'll leave in the old way as a hidden option, so we can be sure that we don't accidentally make things worse with no quick way to go back to the way things were. This is still a shot in the dark, because we can't repro the crashes on any of our machines, but it's the best shot I can take based on all the evidence in this thread.
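As a rough illustration of the "hidden option" idea above, here is a minimal sketch of keeping the old code path selectable while a rewritten one becomes the default. This is not the actual Apex code: the functions, the summing workload, and the `USE_LEGACY_PATH` environment variable are all hypothetical stand-ins.

```python
import os


def compute_legacy(values):
    """Hypothetical stand-in for the original implementation."""
    total = 0
    for v in values:
        total += v
    return total


def compute_reworked(values):
    """Hypothetical stand-in for the rewrite: same job, different code path."""
    return sum(values)


def compute(values, use_legacy=None):
    """Use the reworked path by default; fall back to the old one on request."""
    if use_legacy is None:
        # Hidden switch (assumed name) so the old behavior can be restored quickly.
        use_legacy = os.environ.get("USE_LEGACY_PATH") == "1"
    return compute_legacy(values) if use_legacy else compute_reworked(values)
```

Both paths must return identical results, so the switch only changes which instructions run, not what the game sees.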
Unfortunately, I don't know when these changes will go live through our regular release schedule.
- 7 years ago
@OrioStorm Thank you very much!
I knew it had to be a bug. If you look at one of my earlier posts, I guessed it was some sort of internal bug.
I guessed this because, as you may remember, I mentioned the "internal parity error".
That error is the "bug" happening and then being corrected by error correction (ECC).
If you decrease the voltage even lower, you get a "Translation Lookaside Buffer" error (instead of just internal parity errors), so these bugs are possibly happening somewhere in the TLB area (all CPUs have them), in an instruction register, but NOT in the "L0" cache (L0 cache is basically some sort of register also, almost like a bridge between the cores and L1 cache).
The only way we can fix this is to contact Intel.
They will need to release YET ANOTHER microcode patch which can fix this bug.
Can you please document your findings, if possible, and see if you can either contact Intel, call their 800 number, or post on one of the Linux / Debian developer threads so that this bug can be sent to Intel?
This is NOT in any way similar to the Pentium "FDIV" bug or the F00F bug (that one was an instant, 100% guaranteed crash). Maybe it's more similar to that Skylake FFT bug, where certain prime number FFT sizes would crash the processor (this was fixed in a microcode update).
I'm just a gamer so I have no access to Intel. But since you are a developer, you may be able to reach them. Reference the crash threads on these forums and hopefully this will be escalated to a "high priority" bug, since SOME users with stock CPUs are encountering this also.
These are the only links I can give you to help get this addressed by Intel:
Their toll-free tech number, of course (you will have to somehow reach their programming department, good luck with that)
https://www.intel.com/content/www/us/en/support/contact-support.html
(Intel):408 765-8080
(and some 800 number but there seems to be a business to business relations link only)
https://github.com/platomav/CPUMicrocodes
https://downloadcenter.intel.com/download/28727/Linux-Processor-Microcode-Data-File
*Edit* a number that works:
(916) 377-7000
- 7 years ago
@OrioStorm I know it's a post bump, but I contacted Intel on the phone, talked to some technical guy and explained that there may be an internal flaw or microcode bug in the 7600k-9900K processors (not sure about older Haswell or HEDT), and referenced your posts and the other thread in a followup email reply.
Let's hope they identify a code sequencing bug, or something similar to that Skylake prime number FFT bug, that they can replicate and fix in a microcode update.
Hopefully either you or Intel will find some way to establish communication so the processor team has a go at it.
- 7 years ago
I fixed the problem with the WHEA internal parity error. This is a possible fix for all of you; at least it works for me.
Specs: 9900K clocked at 5 GHz, AVX offset 0, 1.3 V, LLC 8. It is fully stable, I don't have to downclock my CPU, and it no longer crashes in any shape or form. The tutorials that recommend lowering settings or downclocking are not a smart idea unless you have first made sure the vcore is stable and tried increasing the LLC. If you have a stable 4.9 GHz overclock at 1.285 V, raise or lower the LLC so that the voltage under load matches your desired voltage. Getting slightly more than 1.285 V, for example 1.290 V, doesn't hurt. But when the voltage drops during a game, to 1.284 V and below I believe, that is where you would strangle the instructions being executed. I don't pretend to know how the CPU works in a game; this is my theory.
Before, I got many WHEA internal parity errors in the Event Viewer with load-line calibration in the BIOS set to 6 or 7 and below (ASUS levels 1-8). With LLC 6 and 7 the voltage dropped from the 1.3 V set in the BIOS to 1.288 V in Windows during a game, which caused the errors. With LLC 8 and 1.3 V set, I get 1.305 V in Windows both at idle and during a game, which is fine.
I suspect that Apex Legends exposes potential errors that, in most cases, no other stress test will reveal.
I want to highlight that this is my case. Some people crash at stock settings because motherboards and CPUs behave differently; some are badly optimized at default and
What do you think? Hopefully you can understand my informal text.
- 7 years ago
This is absolutely **NOT** a good idea!
This works in Apex Legends because Apex Legends does not put a high amp load on the VRMs, which limits the guardband penalty, and transient response is not affected as much. LLC8 (aka Ultra Extreme loadline calibration on Gigabyte boards and Mode 1 LLC on MSI and ASRock) uses a 0 mOhm loadline, which in theory prevents load voltage from dropping below idle. The way vdroop works is this formula: I * R = vdroop, where I = current in amps and R = resistance in mOhms, which gives the vdroop in mv. So for example 150 amps * 0 mOhms = 0mv vdroop. (It's actually 0.01 mOhms.) But there is a very serious problem.
VRMs do not respond to a 0 mOhm loadline gracefully, because the MOSFETs cannot charge and discharge their capacitors fast enough to follow the changes in power (wattage) the CPU requires; the CPU responds MUCH, MUCH faster than the VRMs (we already know how fast CPUs can respond, while VRMs usually operate at a 500 kHz switching frequency). And it's not just the voltage that the VRMs must step down from +12v; there's also the *current* (which leads to watts, as watts = volts * amps) that has to be released by the MOSFETs too! These are designed to operate with a loadline (vdroop), so that when the CPU goes from an idle state to a load state, or a load state to an idle state, the VRMs and caps are given time to charge and discharge, to adjust the current to what the CPU is needing.
When you have a 0 mOhm loadline, there is *no* time at all for the VRMs to respond to a change in CPU load (to adjust the current they supply to the CPU, which the CPU burns up as watts). This causes problems. For example: a CPU is operating at a heavy load, let's say 150 amps, e.g. running Prime95 with AVX small FFT (29.8 build 3) at 5 GHz, 1.30v, with Level 8 LLC, Ultra Extreme, whatever. Then that load stops (change in iteration, super fast calculation load change, etc., or even stopping the test). The CPU responds instantly. The load is stopped. But the VRMs? The VRMs can respond nowhere NEAR that fast. So that means, for something like 45 *microseconds* or so, the VRMs are still supplying 150 amps of current to the CPU! But the CPU is not using those amps anymore. Energy cannot be created or destroyed. It can't be released as heat, since heat is a byproduct of watts. So what happens? The voltage spikes up *hard*.
It's so fast that a digital multimeter can't pick it up. You need an oscilloscope to see it. But these spikes, at heavier loads, can exceed *200mv* on cheaper motherboards that have bad transient response! So that 1.3v becomes 1.5v. This is the worst case scenario (full load to no load), so load balancing from normal heavy sustained loads (though no load is 100% balanced) will have smaller, but repeated, oscillations, probably reaching 1.4v often. These can and will very slowly degrade your processor.
The opposite also happens, when going from no load to heavy load or a lighter load to a heavier load. CPU requires amps, VRM's can't supply the amps fast enough as there is no vdroop cushion to give them time to do so, so the CPU voltage drops as much as 200mv (worst case!). So that 1.3v can become 1.1v for a few microseconds. What happens when this occurs? BSOD or crash.
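The vdroop formula and the worst-case transient figures above can be worked through in a few lines. This is just a sketch of the arithmetic in this post: the function names are made up, the 200mv swing is the worst-case figure quoted above, and treating the spike and the dip as symmetric is a simplification.

```python
def vdroop_mv(current_a, loadline_mohm):
    """vdroop = I * R; amps times milliohms gives millivolts."""
    return current_a * loadline_mohm


def transient_window(set_v, swing_mv):
    """Lowest and highest momentary voltage for a symmetric worst-case swing."""
    swing_v = swing_mv / 1000
    return round(set_v - swing_v, 3), round(set_v + swing_v, 3)


# 150 amp load on an ideal 0 mOhm loadline vs. the 0.01 mOhm mentioned above.
print(vdroop_mv(150, 0))                # no droop at all
print(round(vdroop_mv(150, 0.01), 3))   # roughly 1.5 mv of droop

# 1.30v set with a 200mv worst-case swing: dips toward 1.1v, spikes toward 1.5v.
low_v, high_v = transient_window(1.30, 200)
print(low_v, high_v)
```

With a loadline this close to zero the steady-state droop is negligible, which is why the momentary swings are what matter here.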
Again that is worst case scenario. Apex Legends won't do this as it doesn't put a 150 amp load on you. However OTHER programs, like Realbench or Prime95 or Cinebench R20 can.
So if you set the bios voltage to the minimum load voltage you normally need to be stable, but use Loadline Calibration level 8, you would not only find yourself possibly crashing/BSOD'ing when trying to run Realbench 2.56 or Cinebench R20 (and especially Prime95!!), you would also slowly degrade your chip from the spikes.
Here is what worst case 0 mOhms loadline looks like on a scope:

Elmor did some tests comparing the VMIN (minimum voltage required for stability) with LLC6 and LLC8. Using a multimeter, he found the load voltage at which LLC6 (with vdroop) was stable under the worst case scenario (FMA3 small FFT Prime95), then set the bios to that exact voltage with LLC8 so that it matched at idle and load. The transients ruined his stability with LLC8.
----------
I fired up my Maximus XI Gene + 9900K to see if I could replicate your behavior.
Core = 4.7G
Cache = 4.4G
P95 29.1 FMA3 Small FFTs 15K
LLC=6, Vcore set = 1.130V, Vcore read = 1.066V: 1 thread failed after 6 minutes
LLC=6, Vcore set = 1.140V, Vcore read = 1.074V: pass 20m+
LLC=8, Vcore set = 1.075V, Vcore read = 1.074V: 1 thread failed after 2 minutes
LLC=8, Vcore set = 1.085V, Vcore read = 1.083V: 1 thread failed after 4 minutes
LLC=8, Vcore set = 1.095V, Vcore read = 1.092V: 1 thread failed after 2 minutes
LLC=8, Vcore set = 1.105V, Vcore read = 1.101V: 1 thread failed after 9 minutes
LLC=8, Vcore set = 1.115V, Vcore read = 1.110V: 1 thread failed after 6 minutes
LLC=8, Vcore set = 1.125V, Vcore read = 1.119V: 1 thread failed after 2 minutes
LLC=8, Vcore set = 1.135V, Vcore read = 1.137V: pass 1h+
I repeated it again with LLC=6, Vcore set = 1.140V, Vcore read = 1.074V and 1 thread failed after 14 minutes. Probably 10-20mV extra would pass for 1h+.
----------
Basically, with LLC8, the transient dips ruined any benefit of using LLC8 with a lower bios voltage, compared to LLC6 with a higher bios voltage! The stable load voltage with LLC8 (1.135v set, 1.137v read) was thus MUCH MUCH higher, and thus much hotter, than with LLC6 at 1.145v set and 1.080v read. But again this is 150 amps worst case load here.
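To make the comparison concrete, here is a small sketch that reads the quoted test log as data and pulls out the lowest passing load voltage per LLC level. The tuple layout and the helper name are my own; the numbers are copied from the log above. Note that the LLC6 run at 1.140V set passed once and failed on the repeat, so the real LLC6 figure is probably 10-20mV higher, as the log itself says.

```python
# (llc_level, vcore_set, vcore_read, passed), copied from the quoted log.
RESULTS = [
    (6, 1.130, 1.066, False),
    (6, 1.140, 1.074, True),   # passed 20m+, but failed after 14 minutes on a repeat
    (8, 1.075, 1.074, False),
    (8, 1.085, 1.083, False),
    (8, 1.095, 1.092, False),
    (8, 1.105, 1.101, False),
    (8, 1.115, 1.110, False),
    (8, 1.125, 1.119, False),
    (8, 1.135, 1.137, True),
]


def lowest_passing(results, llc):
    """Return the passing run with the lowest measured load voltage, or None."""
    passing = [r for r in results if r[0] == llc and r[3]]
    return min(passing, key=lambda r: r[2]) if passing else None


llc6 = lowest_passing(RESULTS, 6)
llc8 = lowest_passing(RESULTS, 8)

# Extra load voltage LLC8 needed over LLC6, in millivolts.
extra_mv = round((llc8[2] - llc6[2]) * 1000)
print(f"LLC6 passes at {llc6[2]}V read, LLC8 at {llc8[2]}V read (+{extra_mv}mV)")
```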
Ok the point?
While your LLC8 is working in Apex Legends, you are putting your hardware at risk long term. I would strongly suggest that instead of using 1.30v with LLC8, you use 1.325v-1.35v with LLC6. It's much safer for your CPU, and you will have lower CPU temps than with LLC8 + 1.30v, and lower VRM temps also.
- FatSpacePanda, 7 years ago
New Novice
Okay, thank you. I first thought that the two were related since it was so much data (maybe the CPU got other instructions at the same time Apex had something that otherwise would be a micro stutter??). I don't know, I'm way out of my depth.
- 7 years ago
@OrioStorm Thanks for detailed summary!
It may also be worth noting that you have to increase the CPU voltage (vCore) to avoid the following non-crash error:
Event ID 19
WHEA-Logger
A corrected hardware error has occurred.
Reported by component: Processor Core
Error Source: Corrected Machine Check
Error Type: Internal parity error
Processor APIC ID: 0
For me, a stable vCore was 1.28v for the i9-9900K; with vDroop on my motherboard this gave me a consistent 1.25v-1.26v vCore while gaming. If it drops to 1.21v-1.23v, I get the CPU parity errors above while I play.
- 7 years ago
Hi @OrioStorm, thank you for the detailed and scientific explanation of your theories about these crashes. Mine's a 9900K OC'd to 4.8 GHz @ 1.21V. I was wondering if mine's the same case as most people here.