Random system lockup
OK, so I have been folding for cure for almost 2 years now. I have a 4 GPU rig built just for this. My rig randomly freezes, requiring a hard restart, and it started doing it more frequently as time went on. So I uninstalled and reinstalled the client. It seemed to work well at first: my GPUs were actually working like they are supposed to, and my PPD went up. But then the random freezing started up again. I don't see any pattern, or anything in the error logs that would explain it. I am at a loss, and now I have to worry about my system locking up and doing nothing while I am at work.
This sounds like a piece of hardware going bad. Troubleshooting that kind of issue can be a PITA. You pretty much have to stress only 1 component at a time, and note which components are in the system when a failure occurs, in order to rule each thing out one by one.
I would start by using only 1 video card at a time. If you only get a failure with a specific video card installed it's probably that. If not, then I would start looking at individual sticks of RAM. If you're still getting failures after that then I would look to the motherboard, CPU, or even PSU; although without replacements to swap out it can be difficult to narrow down beyond that.
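If it helps to keep track of the swap testing, here is a minimal Python sketch of the elimination logic described above. It assumes exactly one part is faulty and that every "no freeze" run was long enough to be trusted; the part labels (gpu0, ram_a, and so on) are made up for illustration.

```python
def suspects(runs):
    """runs: list of (installed_parts, froze) tuples.
    Returns the parts that are still under suspicion."""
    failing = [set(parts) for parts, froze in runs if froze]
    passing = [set(parts) for parts, froze in runs if not froze]
    if not failing:
        return set()
    # Assuming a single bad part, it was present in every run that froze...
    candidates = set.intersection(*failing)
    # ...and it cannot be a part that was installed during a clean run.
    for parts in passing:
        candidates -= parts
    return candidates

# Hypothetical test log: one GPU installed at a time, same RAM throughout.
runs = [
    ({"gpu0", "ram_a", "ram_b"}, False),
    ({"gpu1", "ram_a", "ram_b"}, True),
    ({"gpu2", "ram_a", "ram_b"}, False),
]
print(suspects(runs))  # {'gpu1'}
```

Parts that are in every run (motherboard, CPU, PSU) can never be cleared this way, which is why those are the hard ones to narrow down without spares.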
Thanks for the suggestions. I am adding a fan to help cool the rig to see if that has any effect. If not, I will start troubleshooting each piece of hardware one by one.
Wilding makes a good point about a corrupted OS also being a possibility, and while we're on that track it's possible it could be corrupted drivers as well. I would probably try those two solutions first before you start tearing the guts out of your system.
Use Display Driver Uninstaller (DDU) to purge all traces of the old drivers from your system before reinstalling. You can find it here:
I know reinstalling the OS can be a pain though, especially if you have a lot of things to back up first, and it also depends on how easy it is to restore your system afterwards. It is usually really simple for a dedicated folding/mining rig, but probably not so much if you use it for other things. Depending on how long it normally takes before you get these system freezes, it might or might not be quicker to do an OS wipe before testing each GPU. Either way, it would be good to rule that possibility out. Then again, if you're pretty sure it's a dying GPU, you might want to test that first. It's a judgement call that ultimately you have to make.
It's very frustrating. It was actually on a tear for about a day and a half straight, making a solid 500,000 to 600,000 PPD, which is what it used to do back in the day. The added fans seem to help, but I found it frozen this morning; it locked up at some point during the night. The logs show no errors whatsoever. Since it is a mining rig, there is really nothing on it. I am afraid to do anything to the drivers because the driver update for the 2 Radeon cards in my main PC screwed folding up big time, so I can't fold on my primary PC anymore.
Let's try and narrow this down a bit. What does the Event Viewer say (or whatever the Linux equivalent is, if you're using Linux)?
I'm using Windows 7 actually. The Event Viewer is very, very busy. But I am seeing a pattern under Windows Logs > Application. There is a slew of events, but 3 actual error events that appear grouped together in the same pattern every time it has frozen. They are as follows:
The description of the first two nearly identical errors reads as follows:
"Unloading the performance counter strings for service WmiApRpl (WmiApRpl) failed. The first DWORD in the Data section contains the error code"
As for the 3rd error, listed as WMI, the details read like this:
"Event filter with query "SELECT * FROM __InstanceModificationEvent WITHIN 60 WHERE TargetInstance ISA "Win32_Processor" AND TargetInstance.LoadPercentage > 99" could not be reactivated in namespace "//./root/CIMV2" because of error 0x80041003. Events cannot be delivered through this filter until the problem is corrected."
Whew, that's a mouthful and a brainful. But those are consistent.
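One way to check whether errors like these really show up around every freeze, rather than eyeballing a busy log, is sketched below in Python. It assumes you have copied the event times and sources out of Event Viewer by hand; the timestamps and the "OtherApp" source are invented for illustration, and the 10-minute window is an arbitrary choice.

```python
from datetime import datetime, timedelta

def recurring_before_freezes(events, freezes, window_minutes=10):
    """events: list of (timestamp, source) pairs; freezes: list of timestamps.
    Returns the sources logged inside the window before every freeze."""
    window = timedelta(minutes=window_minutes)
    common = None
    for freeze in freezes:
        seen = {src for ts, src in events if freeze - window <= ts <= freeze}
        common = seen if common is None else common & seen
    return common or set()

# Hypothetical data: two freezes and the errors logged near them.
events = [
    (datetime(2016, 11, 1, 1, 0), "OtherApp"),
    (datetime(2016, 11, 1, 2, 55), "WmiApRpl"),
    (datetime(2016, 11, 1, 2, 56), "WMI"),
    (datetime(2016, 11, 3, 4, 10), "WmiApRpl"),
    (datetime(2016, 11, 3, 4, 11), "WMI"),
    (datetime(2016, 11, 3, 4, 12), "OtherApp"),
]
freezes = [datetime(2016, 11, 1, 3, 0), datetime(2016, 11, 3, 4, 15)]
print(sorted(recurring_before_freezes(events, freezes)))  # ['WMI', 'WmiApRpl']
```

Keep in mind that an error lining up with every freeze is not proof that it caused the freeze; some entries may be written during the restart that follows.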
Are you folding on the CPU or on the GPUs only?
GPUs only. The CPU is always on pause in the client.
Hmm... I am leaning towards it being a GPU going bad, but I would highly recommend a driver reinstall first. It is quick and easy and would be good to rule out. You'd feel pretty silly if you spent a few days testing different components and getting nowhere, only to find out later that it was a corrupted driver. I am on Windows 7 with both Pitcairn and Hawaii cards, and I can tell you that the 16.9 drivers worked fine for over a month and a half. The 16.10 drivers are out now too. They haven't been out as long, so I can't really say how stable they are, but they seemed to work fine as well.
That said, I was using the beta client v7.4.15, which had its own issues with detecting the right OpenCL device ID, so that may have been a factor. In any case, you could just download the same driver version that was working for you before and reinstall that.
Take a look at CPU and GPU temperatures with HWMonitor, GPU-Z, or similar. Do you see any trends? Any significant differences between GPUs?
If you see an outlier, remove it from the system, and see how it does.
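For a rough idea of what "outlier" could mean here, this small Python sketch flags any card running well above the median of the group. The readings are made-up numbers typed in by hand from a monitoring tool, and the 10 °C margin is only an assumption; adjust it to what is normal for your cards and case.

```python
from statistics import median

def hot_outliers(temps_c, margin_c=10.0):
    """temps_c: dict of GPU label -> load temperature in Celsius.
    Returns the labels running more than margin_c above the median card."""
    mid = median(temps_c.values())
    return [gpu for gpu, temp in temps_c.items() if temp - mid > margin_c]

# Hypothetical load temperatures for a 4 GPU rig.
readings = {"gpu0": 68, "gpu1": 71, "gpu2": 88, "gpu3": 70}
print(hot_outliers(readings))  # ['gpu2']
```

Comparing against the median rather than the average keeps one very hot card from dragging the baseline up and hiding itself.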