Friday, October 03, 2014

Booting GRUB from a logical drive via the extended partition

The master boot record (MBR) contains code which is loaded at boot time, and a table which can list up to 4 partitions. Any data partitions created here are called primary partitions, and booting from them should be possible. One of those partitions can instead be an extended partition, which contains the same type of table, with up to 4 partitions. Partitions inside the extended partition are called logical drives, and it may not be possible to boot from those.

I installed Linux on a logical drive. Due to known problems installing Windows service packs when GRUB is in the MBR, I refused to put GRUB there and instead put it in the logical drive containing Linux. GRUB was normally loaded by the Windows 7 bootloader.

Now I wanted to boot directly into GRUB, so I can use it to hide and unhide partitions and select between two versions of Windows. Making just the logical drive with Linux active did not work. The result was as if there is no active partition. It is possible to make the extended partition active, but grub-install refuses to install there, probably because grub-probe can't figure out the mapping for it.

I solved this by copying the code from the Linux logical drive boot sector to the first sector of the extended partition. It was simple, via "sudo dd if=linux_logical_drive_partition bs=446 count=1 of=extended_partition". The code is 446 bytes long, starting at the beginning of the sector. The rest of the destination sector contains information about partitions, which must not be overwritten. It is extremely important to use the right device names here. (They will typically be something like /dev/sda5, with the logical drive having a higher number than the extended partition.) Mistakes in dd commands writing to raw disk devices can cause unrecoverable data loss.

GRUB's first stage code doesn't care from where it's loaded. However, it does contain hard-coded sector locations, and if that needs to change it would have to be copied again in the same way.

Wednesday, May 14, 2014

GA-P35-DS3R fan control

The Gigabyte GA-P35-DS3R motherboard has a IT8718F chip. It can control the speed of 3 fans via PWM and monitor the speed of 5 fans. The chip also has 3 thermal sensor inputs, which can be read by software. The SmartGuardian feature allows any thermal sensor input to be used to automatically control any fan without software intervention. Fan speeds can also be controlled from software. A PDF datasheet is available.

The BIOS has options to enable CPU fan speed control, and to use voltage or PWM to control the fan. I'm using voltage. The stock Q6600 fan has a 4 pin connector and supports PWM, but PWM causes noise. Voltage can control the speed just as well without the noise.

The BIOS programs the SmartGuardian feature to control CPU fan speed, but it doesn't provide any options for changing that configuration. Both Linux and SpeedFan support the IT8718F chip, but neither can program the SmartGuardian feature. SpeedFan only has an option in the Advanced tab to switch a fan from SmartGuardian to software control, which allows SpeedFan to control its speed. At least for the second PWM output, SpeedFan may not properly re-enable SmartGuardian.

An 8718fans program allows changing of SmartGuardian and other fan-related settings in the chip. It also allows viewing of current settings.

This GA-P35-DS3L information seems similar or identical to the GA-P35-DS3R. CPU fan speed is measured via the first fan sensor, and controlled via the first PWM output. That page claims that the first PWM output controls CPU fan voltage and the third output controls CPU fan PWM, which I didn't test. The second fan output controls voltage on SYS_FAN2, the 4 pin fan connector near the DIMMs and 24 pin power connector.

The BIOS sets up CPU fan control by using the second temperature sensor to control the first PWM output. This is a sensor at the CPU, but not one of the internal core temperature sensors that can be seen in programs like Core Temp. The IT8718F chip cannot use such sensors, because they can only be read by software running on the CPU. The second sensor measures temperatures which are about 10°C colder than the cores. According to 8718fans, full fan speed would be reached at 66°C, which probably corresponds to core temperatures near 76°C.

The BIOS also sets up sensor one to control PWM output two with the same settings. This is probably not reasonable for a case fan, because sensor one isn't at a particularly hot location. Its normal temperature is near 40°C, and if it reached 66°C, hotter areas would overheat.

The IT8718F SmartGuardian algorithm uses a slope, essentially just setting fan speed based on a linear relationship with temperature with some smoothing features. This means temperature depends on load, rather than being controlled to a particular level. If a certain fan speed corresponds to a certain temperature at a certain CPU load and CPU load increases, temperature increases until a new equilibrium is found, with a higher temperature and higher fan speed.

I'm now using SpeedFan to control a case fan, but still letting SmartGuardian control the CPU fan. Maybe I will inject some code into the MBR or elsewhere to set up SmartGuardian for the case fan, because I perfer not depending on an application for fan speed control.

Tuesday, May 13, 2014

A DIMM which won't work with any other DIMMs in the same channel

Since I first got it, I used my Gigabyte GA-P35-DS3R motherboard with 2 GB of DDR2 RAM. This was more than enough at first, but now it results in too much disk access even when only running KDE and Firefox.

The old RAM is an OCZ2G8002GK DDR2-800 OCZ Gold 2*1GB 5-5-5-15 kit consisting of two OCZ28001G modules. I was running it at stock speeds and voltages, and it seemed perfectly stable, never producing any errors in tests. I concluded that 4GB would probably be enough, but I chose to upgrade using 2*2GB for two reasons: I would have 4GB even if I can't get the kits working together, and it's always better to have more memory than you think you need. I found a really good deal on eBay for OCZ2P8004GK DDR2-800 OCZ Platinum 2*2GB, consisting of two OCZ2P8002G modules.

When I put in the new RAM together with the old RAM, I got a hang at the initial graphical BIOS screen, but I could boot if I only put in the new RAM. At first, this seemed like a compatibility problem, maybe because the old RAM was required 1.8V and the new RAM required 2.1V.  I had forgotten about OCZ Platinum requiring 2.1V, and that requirement wasn't stated anywhere on the eBay item page or the labels on RAM photos. I became skeptical when I saw that the old and new kits both worked alone at 5-5-5-15 timings, at either normal voltage or 2.1V. They only failed to work together.

Then I tried relaxing various primary and secondary timings and reducing the frequency. It seems the OCZ28001G modules couldn't handle CL6, but I could relax all the other timings. Nothing helped. In most cases, my computer would power off and back on twice and then hang. That seemed to be the motherboard's attempt to switch to more conservative settings. I assume it is meant to recover from a failed overclock, but it never managed to recover from this. I would have to remove a DIMM to get into BIOS setup and change settings for another attempt. Early on in this process I removed the hard drive so it doesn't get subjected to all these power cycles. Eventually I was forced to give up because I couldn't imagine what else I could change.

I tried another experiment, putting the old RAM in the slots closest to the CPU, and the new RAM in the slots further away. The intention was to put the old RAM in one channel and the new RAM in another channel, in case they were incompatible in the same channel. This configuration allowed me to boot, but caused lots Memtest86+ errors past 4 GB. According to DMI data, this address was in the middle of one of the new DIMMs. I didn't conclude anything based on this, because I didn't know if the configuration was supposed to work, and because it was weird to see errors start in the middle of a DIMM.

Later, I was inspired to try yet another experiment: two DIMMs in one channel, with nothing in the other channel. This would allow more possibilities with only 4 DIMMs. I found that one of the DIMMs wouldn't work with any other DIMM in the same channel, but the other DIMMs would work together. This finally made it seem like one DIMM is defective.

After getting a new G.Skill F2-6400CL5D-4GBPQ set, the suspect DIMM wouldn't work with either of those in the same channel, but the other OCZ2P8002G DIMM worked fine as part of a 2*2+1*1 GB DDR2-800 5-5-5-15 configuration with one of the new G.Skill DIMMs. This seems to confirm that one OCZ2P8002G DIMM is defective.

It's surprising that a DIMM can be bad in a way that it passes tests if alone in a channel but fails when there is another DIMM in the channel. However, it makes sense. Diagnostic programs can only tell you if the memory subsystem of that computer is reliably storing and retrieving data. They can't tell you if a DIMM is meeting its electronic specifications.

Thursday, May 08, 2014

Integrated heat spreader thermal contact failure

I recently upgraded an old PC to a Manchester socket 939 Athlon 4200+. After booting into Stresslinux, I ran mprime (the Linux version of Prime95) to check stability and temperatures. I didn't encounter any errors, but after a few minutes, core temperatures rose past 70, and approached 80. That's bad in general and especially bad for that CPU, so I had to cut power.

After removing the heat sink, the paste application seemed fine. I tried applying paste several times, and even got some new Zalman ZM-STG2 paste, but nothing helped. I couldn't even get anything as good as the first result.

Eventually I found some message board posts about decapping, the thermal paste below the CPU integrated heat spreader (lid), and how that paste can fail. The fact that the heat sink wasn't even warm when the CPU cores approached 80, and the paste between the heat spreader and CPU seemed fine afterwards made this seem like a probable explanation.

I first attempted to cut off the integrated heat spreader (IHS) with a utility knife. This didn't work because the blade was too thick and it couldn't fit into the narrow gap between the CPU circuit board and IHS. Then I pried apart a disposable razor and got one of the blades out. It's very thin and sharp, and it fit into the gap and cut nicely. The only difficulty was that it's also highly flexible, so it can cut the circuit board. Here are pictures of the decapped CPU:

The black material that was holding the IHS is like rubber. Note that it wasn't providing a hermetic seal; there is a gap on the left of the CPU. The brown material is remnants of brasso which I used to try to lap the IHS before. The IHS was definitely slightly concave, but that wasn't the problem. The grey thermal paste was entirely dry and kind of like silicone rubber, but much easier to remove. My theory is that it works fine even when dry, and that problems happen due to the force used to separate the heat sink from the IHS. The black rubber holding the IHS allows some movement. If the dry thermal paste inside breaks apart due to that, it can't re-establish good contact.

I didn't want to run the CPU decapped because that would require modifying the motherboard, and maybe the heat sink retention. The plastic frame surrounding the socket prevents the heat sink from getting low enough to make good contact with the chip, and even if that was cut away, I'm not sure if the heat sink retention would provide enough force when the heat sink sits lower. Also, the IHS seems to be plated copper, and it might actually help with heat transfer to the stock aluminum heat sink. I just shaved down the black rubber a bit, cleaned off old paste, added new paste, and reassembled without attaching the IHS to the CPU.

After all this, running mprime on both cores resulted in temperatures of 45°C and below. This was with the stock cooler for an Athlon 64+ 3500 Newcastle (also 89W TDP) and Zalman ZM-STG2 paste.

Tuesday, November 05, 2013

Enable volume keys in Flash Player using a hotkey program

Flash Player running in Firefox in Windows 7 steals keyboard events when focused. This is a well known bug that has been around for a while and isn't getting fixed. There are many Firefox and Flash bug reports on the issue. It should probably be considered a Flash bug, not a Firefox bug.

This affects all Firefox keyboard shortcuts, such as Control+Tab for switching to the next tab. It also affects some keys that Windows normally handles, such as volume up, volume down, and mute. Fortunately, it doesn't affect keys handled by HoeKey, a hotkey program. This means the problem with volume keys can be fixed by handling volume keys via HoeKey, instead of directly through Windows. Here is what needs to be added to the HoeKey configuration file:

173=Msg|Progman|793|0|524288 ; mute
174=Msg|Progman|793|0|589824 ; vol down
175=Msg|Progman|793|0|655360 ; vol up

I expect other hotkey programs could do the same thing. I like HoeKey because it uses minimal system resources, it works perfectly, and is free.

Monday, April 01, 2013

Running code on the Mercury ME-DPF24MG digital photo frame

In my previous post, I reported reverse engineering findings about the Mercury ME-DPF24MG digital photo frame. Once I figured out the most important things, it was time to start running custom code on the frame.

If firmware page 0 was writable, I would have just flashed a modified version of the code used for other Sitronix frames. Instead, I extended the "get version" command to enable displaying the USB DMA buffer for data sent by the host, and running code there. The command is parsed at flash offset $A40B, which is at $640B in firmware page 2. At that point code has access to the first 16 bytes of the sector at $381, and the last 64 bytes of the sector in the DMA buffer at $200. I used the first 4 parameter bytes at $382-$385 to trigger my code. Here is an example of the display function:

I displayed the text by calling the same function used by the message command. This is not practically useful for displaying messages, but it can confirm that my code works, and that data in the USB DMA buffer can be used.

(After the first 64 bytes, you see the DMA buffer for data being sent to the host, which starts with USBS because it is a USB mass storage status reply. The green and pink corruption outside is because the firmware upgrade code overwrites the second picture and pictures are stored in a compressed format. To prevent overwriting the code doing the firmware upgrade, the new firmware is first written there, and then copied to its final position using code copied to RAM.)

Then it was time to run code. It worked, but 64 bytes isn't much space. It wasn't even enough for the LCD set window code. At $580-$67F there is 512 bytes of RAM where flash writing procedures are copied. Since code is always copied there immediately before being executed, that area is free when not writing to flash. I wrote a small program which copied part of the USB DMA buffer to that free area. Then I used this code appended with small chunks of a longer program to assemble a longer program at $580.

Finally, this allowed me to display images. I wrote code which interacts with the USB interrupt handler to read packets sent to the data port. This is very simple. I just had to deal with a few flags and call one function after I received a packet. For faster performance, it should be possible to use polling for everything except the start and end of a packet.

Probing the device using code

The ability to run code and dump RAM also allowed me to probe the device and discover information about its hardware. By looking at flash ID bytes and the Common Flash Interface (CFI), I found that the chip is an Eon EN29LV320B in bottom boot configuration. It reports that the first two sectors are protected, but I don't know if that's due to a protection setting stored in flash or a result the WP#/ACC pin, which protects the first two sectors when low. I suspect it's due to the pin. Maybe it's connected to a GPIO pin? I also probed the LCD controller and found that it is the Ilitek IL9320.

Port A is connected to the buttons, with bits being normally high and going low when the button is pressed:
PA & 1 = left
PA & 2 = pause
PA & 4 = right
PA & 8 = menu
The slide switch seems to simply cut power, so there is no way to read it. The pause button is read at startup, changing program flow. It might be for entering a recovery mode, not calling photo frame code and immediately going into USB mode.

This is still a work in progress and I'm not releasing any code now. If you want some code, feel free to ask.

Friday, March 29, 2013

Mercury ME-DPF24MG digital photo frame reverse engineering

I got two Mercury ME-DPF24MG digital photo frames at the Leamington Zellers liquidation sale at 70% and 80% off the yellow tag price. They're nice 2.4" photo frames with a 320x240 LCD, rechargeable battery, a magnetic back, and a leg for standing the frame up on a desk. Some reverse engineering has already been done by others, but it's incomplete and the frame cannot be used as USB controlled display yet. The frame is also similar to the Technaxx Magno, which has been investigated more thoroughly but still cannot be used as a display.

Basic information

The frame has 4 megabytes of flash memory. The manual claims "32 MB", but I guess they mean 32 megabits. Phack from st2205tool-1.4.3 does a check for the amount of memory, and quits with "Expected response 8 on cmd 1, got 0x1f!" because the frame reports more memory than expected. Here is the external memory map, in terms of 32 kilobyte (0x8000 size) pages:

0x000-0x07F flash, starting with firmware in pages 0 and 1.
0x080-0x2FF unused
0x300-0x37F LCD
0x380-0x3FF 4 pages repeating, similar to firmware
0x400- address space repeats, probably because bank register ignores bits

The area at 0x380 seems to contain a firmware for a different photo frame. The menus are smaller than in the true firmware, implying it is for a lower resolution. Maybe it is ROM inside the chip? I did not investigate that area further.

The read command adds two to the low byte of the page number. As a result, a normal read starts right after the firmware. I dumped the firmware by reading in page 0xFE, because when firmware adds 2 to that number, it gets 0. I used 0xFE, not 0xFFFFFFFE because the firmware only adds 2 to the least significant byte of the page number.

The chip is probably a Sitronix ST2203U. It is definitely not a ST2205U because the DMA controller is different. I wasn't able to find a full User's Manual for that chip, but the ST2205U and ST2202U manuals are helpful. RAM is at 0x80-0x880 internal addresses, meaning there is only 2 kilobytes.

The LCD controller is unknown, and probably similar to Ilitek ILI9325C. The command number is 16 bit, with the most significant byte first, and coordinates are input the same way.  To talk to the LCD, set DRR to $300, send commands to $8000 and send data to $C000. Here is a sequence for setting a rectangle: C=$20, D=Y1, C=$51, D=Y1, C=$51, D=Y2, C=$21, D=X1, C=$52 D=X1, C=$53, D=X2, C=$22, followed by data with 3 bytes per pixel. This is untested.

Sunday, March 24, 2013

ROM dumping via the sound card

Once I got code running on the RCA RC3000A, the first task was dumping the TCC760 boot ROM and the SST39VF1601 firmware flash. The resistor I soldered for entering USB boot mode provided a convenient connection point for GPIO_B[22]. The chip runs at 2.5V, and that voltage should not be too high for line in. I carefully used an alligator clip to connect it to the tip of a 1/8" plug:

I transferred data serially using a simple ARM assembler program. Zeroes are a short pulse and ones are longer pulses. There is a short pause between bits, and a longer pause between bytes. Here is an example of 11000101 binary. I sent the least significant bit first because right shifts conveniently move it into the carry flag.
Dumping the two megabyte flash chip took an hour and a half. That's not a problem, so it's not worth investing effort in a faster communication method. For larger amounts of data, the TCC76X USB device controller would be a better choice. It seems very easy to use.

The 4KB TCC76X boot ROM has MD5 value 2579641d5be434eea15f4ec3c27a5f53. It implements USB boot mode and secure modes. However, it is not part of the normal operation of the RCA RC3000A. The boot mode is set 000, and execution starts from the SST39VF1601 firmware flash.

The 2MB firmware flash has MD5 value 89ae93fedd7fc5dff0f118aebdd4c7b6. Text strings identify it as "Thomson D100", "V2.20" and "2006.07.13". Apparently unused space with 0xFF bytes starts at 0x92E00, though there is a small chunk used at 0xFE000-0xFE0B2. This means there should be plenty of space for adding a second firmware for dual boot.