The Tale of a GCC Upgrade

Once upon a time, I decided it is a good time to upgrade the version of gcc compiler we use in our ARM926-based firmware project. Historically, this project uses gcc version 4.6, which is quite outdated nowadays (January 2018). In a different project, built around ARM-Cortex, we already successfully use the gcc version 5.4.1 (Linaro release in January 2017), so I wanted to upgrade our ARM926 project to this newer gcc version.

But some things never go so smoothly as one may imagine at the beginning… 🙂

Issue #1

The first problem we encounter is a simple assembly syntax error:

Error: .size expression for firstLoader.reset does not evaluate to a constant

For some reason, our original assembler code used ‘.’ and ‘_’ interchengeably:

.size firstLoader.reset, . - firstLoader.reset

The correct code must be:

.size firstLoader_reset, . - firstLoader_reset

Issue #2

Now the compilation finishes without warnings, and the final ELF file is produced just fine. Yet, when it is loaded into the embedded system via Lauterbach debugger, there is a strange error during memory verification:

 Hmm… Removing the option /VERIFY from LOAD gets us past the error, however, assembly code displayed by debugger is quite wrong. As shown below, the debugger thinks that the main() starts with ‘nop’ and ‘bx r14’ (subroutine call), which is completely bogus:

It seems as if all symbols are shifted around, and nothing quite matches up. Running the code comes to a crash somewhere in the startup.

Normal disassembly via ‘objdump -s’ shows that itcm_code has an offset: it should start at physical address 0x0000, but it starts from 0x0040:

Disassembly of section .itcm:

00000028 <__start_itcm_code__-0x18>:

00000040 <__start_itcm_code__>:

00000140 <secondLoader>:

It is not clear at all why this happens…

Finally, a full disassembly of all sections in the ELF file, including the data sections, reveals where the problem lies:

Disassembly of section

00000000 <>:
0: 00000004 andeq r0, r0, r4
4: 00000014 andeq r0, r0, r4, lsl r0
8: 00000003 andeq r0, r0, r3
c: 00554e47 subseq r4, r5, r7, asr #28
10: 8ede0a2f vfnmshi.f32 s1, s28, s31
14: 72934257 addsvc r4, r3, #1879048197 ; 0x70000005
18: 2fb63d30 svccs 0x00b63d30
1c: a3a01a65 movge r1, #413696 ; 0x65000
20: 12fa8758 rscsne r8, sl, #88, 14 ; 0x1600000

Disassembly of section .itcm:

00000028 <__start_itcm_code__-0x18>:

00000040 <__start_itcm_code__>:

00000140 <secondLoader>:
140: e59f030c ldr r0, [pc, #780] ; 454 <secondLoader_never+0x10>
144: e3e024ff mvn r2, #-16777216 ; 0xff000000
148: e5802000 str r2, [r0]

The new compiler toolchain inserts a special unique build-ID to the output binary, and if it is explicitly not told where to place it, it ends up in the worst place possible – before the start code and right in the interrupt vector table.

There are at least two solutions to this: 1) use gcc option ‘-Wl,–build-id=none’ to disable the feature completely, or 2) put the build ID in the ELF file at a place where it makes no harm, e.g. after the end of code. This is done by adding the following instructions in a suitable place into the linker script:

.note : {
   KEEP (*(.note*))

After the linker script is fixed, the software starts and runs just fine, but strangely only up to the first interrupt…

 Issue #3

We have already fixed two problems: the assembly syntax problem, and the linker build-ID problem. The third problem is the most puzzling one: the program starts and runs correctly, but after the first interrupt is served, no further interrupts are ever served. The processor happily continues running the main application code, but no ISRs are ever entered again. Yet the identical code compiled with the old compiler 4.6 works perfectly. What could be a problem now?

This is execution state before the first interrupt is raised:

This is inside interrupt service route, serving the first IRQ:

The ISR finishes its work, and returns back to the application code:

After that, no further interrupts are ever served…

Perhaps you see the problem right away in the screenshots, but it took me half a day to spot it.

Look in the last picture in the right window labelled ‘REGISTERS’. The left-most column in the REGISTERS window shows from top to bottom the processor flags (N, Z , C, …) and after that the processor state. It shows the processor is running in the… IRQ state! Yet the program counter is already back in the main application code.

Let’s see now how the ISR function is defined in our C code:

volatile t_int main_irq_handler(void) __attribute__ ((interrupt("IRQ"))) __attribute__ ((section(".isr")));

The prototype already has the __attribute__((interrupt(“IRQ”))), and anyway it works in the old gcc-4.6 quite well, so where’s the issue? Let’s compare codes generated by both compilers. This is the ISR function epilogue of main_irq_handler as generated by the good old gcc-4.6:

618: e1a00003 mov r0, r3
61c: e24bd010 sub sp, fp, #16
620: e8fd800f ldm sp!, {r0, r1, r2, r3, pc}^

And now the same epilogue code, but from gcc-5.4.1:

8ac: e1a00003 mov r0, r3
8b0: e24bd018 sub sp, fp, #24
8b4: e8bd980f pop {r0, r1, r2, r3, fp, ip, pc}

Ok, different compilers generate slightly different codes. But what is the peculiar symbol ‘^’ in the good gcc’s version? Google search of ‘arm ldm’ quickly finds the answer:

So the character ‘^’ is a special suffix that must be used exactly at the return from ISR. But the new code generated by gcc-5.4.1 uses the normal return with ‘pop’!

Now it is becoming clear why only the first interrupt request is ever served. Due to a software bug in our ISR epilogue the processor does not recognize that ISR has ended. The CPU remains in the IRQ state even after a formal return to the main application code, and thus it cannot receive any further IRQs.

At this point, it is a good idea to simply test if another compiler works better. And yes, in gcc-7.2.1, which is the current newest Linaro release in January 2018, the code works yet again well. We evidently have hit a bug in gcc-5.4.1 in its ARM926 ISR code generation path.


In conclusion, what I have learned:

  1. assembly syntax is not rock-stable, and it changes over the time,
  2. default linker behaviour changes over the time,
  3. compiler bugs exist, but they are subtle, and can bite unexpectedly.