Wednesday, May 14, 2014

GA-P35-DS3R fan control

The Gigabyte GA-P35-DS3R motherboard has an IT8718F chip. It can control the speed of 3 fans via PWM and monitor the speed of 5 fans. The chip also has 3 thermal sensor inputs, which can be read by software. The SmartGuardian feature allows any thermal sensor input to be used to automatically control any fan without software intervention. Fan speeds can also be controlled from software. A PDF datasheet is available.

The BIOS has options to enable CPU fan speed control, and to use voltage or PWM to control the fan. I'm using voltage. The stock Q6600 fan has a 4 pin connector and supports PWM, but PWM causes noise. Voltage can control the speed just as well without the noise.

The BIOS programs the SmartGuardian feature to control CPU fan speed, but it doesn't provide any options for changing that configuration. Both Linux and SpeedFan support the IT8718F chip, but neither can program the SmartGuardian feature. SpeedFan only has an option in the Advanced tab to switch a fan from SmartGuardian to software control, which allows SpeedFan to control its speed. At least for the second PWM output, SpeedFan may not properly re-enable SmartGuardian.
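On Linux, drivers for chips like this expose their readings through the standard hwmon sysfs files (fanN_input in RPM, tempN_input in millidegrees Celsius, pwmN from 0 to 255). Here is a minimal sketch of reading them. The device name "it8718" and the /sys/class/hwmon location are assumptions about how the it87 driver registers this chip on a given system, and the function names are made up for illustration.

```python
from pathlib import Path


def find_device(name="it8718", root="/sys/class/hwmon"):
    """Return the first hwmon directory whose 'name' file matches, else None.

    The default name is an assumption; check the 'name' files on your system.
    """
    for device in sorted(Path(root).glob("hwmon*")):
        name_file = device / "name"
        if name_file.is_file() and name_file.read_text().strip() == name:
            return device
    return None


def read_hwmon(device_dir):
    """Collect fan, temperature and PWM readings from one hwmon directory."""
    device = Path(device_dir)
    readings = {}
    for path in sorted(device.glob("fan*_input")):
        readings[path.name] = int(path.read_text().strip())  # RPM
    for path in sorted(device.glob("temp*_input")):
        # hwmon reports temperatures in millidegrees Celsius
        readings[path.name] = int(path.read_text().strip()) / 1000.0
    for path in sorted(device.glob("pwm[0-9]")):
        readings[path.name] = int(path.read_text().strip())  # duty cycle, 0..255
    return readings
```

Calling read_hwmon(find_device()) on a machine where the driver is loaded should return a dictionary of whatever fan, temperature and PWM files exist. This only reads values, so it doesn't touch the SmartGuardian configuration.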

A program called 8718fans can change SmartGuardian and other fan-related settings in the chip. It can also display the current settings.

This GA-P35-DS3L information seems similar or identical to the GA-P35-DS3R. CPU fan speed is measured via the first fan sensor, and controlled via the first PWM output. That page claims that the first PWM output controls CPU fan voltage and the third output controls CPU fan PWM, which I didn't test. The second fan output controls voltage on SYS_FAN2, the 4 pin fan connector near the DIMMs and 24 pin power connector.

The BIOS sets up CPU fan control by using the second temperature sensor to control the first PWM output. This is a sensor at the CPU, but not one of the internal core temperature sensors that can be seen in programs like Core Temp. The IT8718F chip cannot use such sensors, because they can only be read by software running on the CPU. The second sensor measures temperatures which are about 10°C colder than the cores. According to 8718fans, full fan speed would be reached at 66°C, which probably corresponds to core temperatures near 76°C.

The BIOS also sets up sensor one to control PWM output two with the same settings. This is probably not reasonable for a case fan, because sensor one isn't at a particularly hot location. Its normal temperature is near 40°C, and if it reached 66°C, hotter areas would overheat.

The IT8718F SmartGuardian algorithm uses a slope, essentially just setting fan speed based on a linear relationship with temperature with some smoothing features. This means temperature depends on load, rather than being controlled to a particular level. If a certain fan speed corresponds to a certain temperature at a certain CPU load and CPU load increases, temperature increases until a new equilibrium is found, with a higher temperature and higher fan speed.
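The slope behaviour and the shifting equilibrium can be illustrated with a toy model. This is only a sketch of a linear temperature-to-PWM mapping, not the chip's actual register-level algorithm, and it ignores the smoothing features. The 66°C full-speed point comes from the settings above; the 40°C start point, the starting PWM of 80, the 30°C ambient and the thermal resistance figures are all invented for illustration.

```python
def slope_pwm(temp_c, start_temp_c, full_temp_c, start_pwm, max_pwm=255.0):
    """Linear mapping from temperature to PWM duty (0..255 scale)."""
    if temp_c <= start_temp_c:
        return float(start_pwm)
    if temp_c >= full_temp_c:
        return float(max_pwm)
    fraction = (temp_c - start_temp_c) / (full_temp_c - start_temp_c)
    return start_pwm + fraction * (max_pwm - start_pwm)


def equilibrium(load_watts, ambient_c=30.0, steps=200):
    """Iterate a made-up thermal model until temperature and fan speed settle."""
    temp_c = ambient_c
    pwm = 0.0
    for _ in range(steps):
        # Hypothetical settings: start at 40 C with PWM 80, full speed at 66 C.
        pwm = slope_pwm(temp_c, 40.0, 66.0, 80.0)
        # Hypothetical cooler: thermal resistance (K/W) drops as the fan speeds up.
        resistance = 0.6 - 0.3 * pwm / 255.0
        target_c = ambient_c + load_watts * resistance
        # Move halfway toward the target each step so the iteration settles.
        temp_c += 0.5 * (target_c - temp_c)
    return temp_c, pwm


if __name__ == "__main__":
    for load in (40.0, 80.0):
        temp_c, pwm = equilibrium(load)
        print(f"{load:.0f} W: {temp_c:.1f} C at PWM {pwm:.0f}")
```

In this model, doubling the load gives both a higher settled temperature and a higher fan speed, which is the behaviour described above: the controller follows a line rather than holding temperature at a fixed target.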

I'm now using SpeedFan to control a case fan, but still letting SmartGuardian control the CPU fan. Maybe I will inject some code into the MBR or elsewhere to set up SmartGuardian for the case fan, because I prefer not to depend on an application for fan speed control.

Tuesday, May 13, 2014

A DIMM which won't work with any other DIMMs in the same channel

Since I first got it, I used my Gigabyte GA-P35-DS3R motherboard with 2 GB of DDR2 RAM. This was more than enough at first, but now it results in too much disk access even when only running KDE and Firefox.

The old RAM is an OCZ2G8002GK DDR2-800 OCZ Gold 2*1GB 5-5-5-15 kit consisting of two OCZ28001G modules. I was running it at stock speeds and voltages, and it seemed perfectly stable, never producing any errors in tests. I concluded that 4GB would probably be enough, but I chose to upgrade using 2*2GB for two reasons: I would have 4GB even if I can't get the kits working together, and it's always better to have more memory than you think you need. I found a really good deal on eBay for OCZ2P8004GK DDR2-800 OCZ Platinum 2*2GB, consisting of two OCZ2P8002G modules.

When I put in the new RAM together with the old RAM, I got a hang at the initial graphical BIOS screen, but I could boot if I only put in the new RAM. At first, this seemed like a compatibility problem, maybe because the old RAM required 1.8V and the new RAM required 2.1V. I had forgotten about OCZ Platinum requiring 2.1V, and that requirement wasn't stated anywhere on the eBay item page or on the labels in the RAM photos. I became skeptical when I saw that the old and new kits both worked alone at 5-5-5-15 timings, at either normal voltage or 2.1V. They only failed to work together.

Then I tried relaxing various primary and secondary timings and reducing the frequency. It seems the OCZ28001G modules couldn't handle CL6, but I could relax all the other timings. Nothing helped. In most cases, my computer would power off and back on twice and then hang. That seemed to be the motherboard's attempt to switch to more conservative settings. I assume it is meant to recover from a failed overclock, but it never managed to recover from this. I would have to remove a DIMM to get into BIOS setup and change settings for another attempt. Early on in this process I removed the hard drive so it wouldn't be subjected to all these power cycles. Eventually I was forced to give up because I couldn't imagine what else I could change.

I tried another experiment, putting the old RAM in the slots closest to the CPU, and the new RAM in the slots further away. The intention was to put the old RAM in one channel and the new RAM in another channel, in case they were incompatible in the same channel. This configuration allowed me to boot, but caused lots of Memtest86+ errors past 4 GB. According to DMI data, the 4 GB address was in the middle of one of the new DIMMs. I didn't conclude anything based on this, because I didn't know if the configuration was supposed to work, and because it was weird to see errors start in the middle of a DIMM.
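Mapping a failing address to a DIMM is a simple range lookup once the address ranges are known. DMI data can report such ranges per memory device (dmidecode shows them as "Memory Device Mapped Address" entries). The sketch below uses an entirely hypothetical layout, chosen only so that 4 GiB lands in the middle of a 2 GiB module; the labels and ranges are not the real values from this board, and channel interleaving or memory remapping can make the real mapping less straightforward.

```python
GIB = 1024 ** 3

# Hypothetical layout: (label, start address, end address), end exclusive.
HYPOTHETICAL_RANGES = [
    ("new DIMM 1 (2 GiB)", 0 * GIB, 2 * GIB),
    ("old DIMM 1 (1 GiB)", 2 * GIB, 3 * GIB),
    ("new DIMM 2 (2 GiB)", 3 * GIB, 5 * GIB),
    ("old DIMM 2 (1 GiB)", 5 * GIB, 6 * GIB),
]


def dimm_for_address(address, ranges):
    """Return (label, position) for the DIMM containing address, or None.

    position is how far into the DIMM the address falls, from 0.0 to 1.0.
    """
    for label, start, end in ranges:
        if start <= address < end:
            return label, (address - start) / (end - start)
    return None


if __name__ == "__main__":
    print(dimm_for_address(4 * GIB, HYPOTHETICAL_RANGES))
```

With this made-up layout, an error at exactly 4 GiB falls halfway into the second 2 GiB module.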

Later, I was inspired to try yet another experiment: two DIMMs in one channel, with nothing in the other channel. This would allow more possibilities with only 4 DIMMs. I found that one of the DIMMs wouldn't work with any other DIMM in the same channel, but the other DIMMs would work together. This finally made it seem like one DIMM is defective.
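The reasoning behind this pairwise test can be written down as a small check: a DIMM is suspect if it fails with every other DIMM while all the remaining DIMMs work with each other. This is only a sketch of the logic; the labels are hypothetical and the results table just encodes the outcome described above rather than a log of the actual tests.

```python
from itertools import combinations


def find_suspects(dimms, pair_works):
    """Return DIMMs that fail with every partner while all other pairs work.

    pair_works maps frozenset({a, b}) to True if that pair booted and tested
    clean when installed together in one channel.
    """
    suspects = []
    for dimm in dimms:
        others = [d for d in dimms if d != dimm]
        fails_with_all = all(not pair_works[frozenset((dimm, o))] for o in others)
        rest_ok = all(pair_works[frozenset(p)] for p in combinations(others, 2))
        if fails_with_all and rest_ok:
            suspects.append(dimm)
    return suspects


# Hypothetical labels; "new2" stands in for the DIMM that never worked in a pair.
DIMMS = ["old1", "old2", "new1", "new2"]
RESULTS = {
    frozenset(pair): "new2" not in pair
    for pair in combinations(DIMMS, 2)
}

if __name__ == "__main__":
    print(find_suspects(DIMMS, RESULTS))
```

If two DIMMs failed only with each other, this check would return nothing, which would point to an incompatibility between them rather than a single defective module.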

After getting a new G.Skill F2-6400CL5D-4GBPQ set, the suspect DIMM wouldn't work with either of those in the same channel, but the other OCZ2P8002G DIMM worked fine as part of a 2*2+1*1 GB DDR2-800 5-5-5-15 configuration with one of the new G.Skill DIMMs. This seems to confirm that one OCZ2P8002G DIMM is defective.

It's surprising that a DIMM can be bad in a way that it passes tests if alone in a channel but fails when there is another DIMM in the channel. However, it makes sense. Diagnostic programs can only tell you if the memory subsystem of that computer is reliably storing and retrieving data. They can't tell you if a DIMM is meeting its electronic specifications.

Thursday, May 08, 2014

Integrated heat spreader thermal contact failure

I recently upgraded an old PC to a Manchester socket 939 Athlon 64 X2 4200+. After booting into Stresslinux, I ran mprime (the Linux version of Prime95) to check stability and temperatures. I didn't encounter any errors, but after a few minutes, core temperatures rose past 70°C and approached 80°C. That's bad in general and especially bad for that CPU, so I had to cut power.

When I removed the heat sink, the paste application seemed fine. I tried reapplying paste several times, and even got some new Zalman ZM-STG2 paste, but nothing helped. I couldn't even get a result as good as the first one.

Eventually I found some message board posts about decapping, the thermal paste below the CPU integrated heat spreader (lid), and how that paste can fail. The fact that the heat sink wasn't even warm when the CPU cores approached 80°C, and the paste between the heat spreader and CPU seemed fine afterwards made this seem like a probable explanation.

I first attempted to cut off the integrated heat spreader (IHS) with a utility knife. This didn't work because the blade was too thick and it couldn't fit into the narrow gap between the CPU circuit board and IHS. Then I pried apart a disposable razor and got one of the blades out. It's very thin and sharp, and it fit into the gap and cut nicely. The only difficulty was that it's also highly flexible, so it can cut the circuit board. Here are pictures of the decapped CPU:

The black material that was holding the IHS is like rubber. Note that it wasn't providing a hermetic seal; there is a gap on the left of the CPU. The brown material is remnants of Brasso, which I had used earlier to try to lap the IHS. The IHS was definitely slightly concave, but that wasn't the problem. The grey thermal paste was entirely dry and kind of like silicone rubber, but much easier to remove. My theory is that it works fine even when dry, and that problems happen due to the force used to separate the heat sink from the IHS. The black rubber holding the IHS allows some movement. If the dry thermal paste inside breaks apart due to that, it can't re-establish good contact.

I didn't want to run the CPU decapped because that would require modifying the motherboard, and maybe the heat sink retention. The plastic frame surrounding the socket prevents the heat sink from getting low enough to make good contact with the chip, and even if that was cut away, I'm not sure if the heat sink retention would provide enough force when the heat sink sits lower. Also, the IHS seems to be plated copper, and it might actually help with heat transfer to the stock aluminum heat sink. I just shaved down the black rubber a bit, cleaned off old paste, added new paste, and reassembled without attaching the IHS to the CPU.

After all this, running mprime on both cores resulted in temperatures of 45°C and below. This was with the stock cooler for an Athlon 64 3500+ Newcastle (also 89W TDP) and Zalman ZM-STG2 paste.