## PN7120 does not respond on the I2C bus

NXP’s PN7120 is an NFC controller for contactless communication at 13.56 MHz. It interfaces with the host CPU via the I2C bus. The I2C slave address of the PN7120 is 0b010100Lx, where L is a configurable LSB of the address (set by pin B2, I2CADDR0) and x is the standard I2C R/W bit. Hence the 7-bit I2C address is either 0x28 or 0x29:

// write to device at address 0x28, R/W=Write
i2c_master_write_byte(cmd, (0x28 << 1) | WRITE_BIT, ACK_CHECK_EN);

According to the UM10819 User Manual, Fig. 23 on page 35, the first NCI command after reset must be CORE_RESET_CMD, which is followed by an appropriate response from the PN7120.

The CORE_RESET_CMD is simply this byte sequence:

const uint8_t nci_core_reset_cmd[] = { 0x20, 0x00, 0x01, 0x01 };
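For reference, these bytes follow the NCI 1.0 control-packet format (my annotation, not copied from UM10819): 0x20 encodes message type ‘command’ and group ‘Core’, 0x00 is the CORE_RESET opcode, 0x01 is the payload length, and the final 0x01 selects ‘reset configuration’. A small sanity check:

```c
#include <stdint.h>

/* NCI control packet header: byte 0 = MT(7:5) | PBF(4) | GID(3:0),
   byte 1 = OID, byte 2 = payload length. */
#define NCI_MT(b)   (((b) >> 5) & 0x7)
#define NCI_GID(b)  ((b) & 0xF)

enum { NCI_MT_CMD = 1, NCI_GID_CORE = 0, NCI_OID_CORE_RESET = 0x00 };

static const uint8_t nci_core_reset_cmd[] = { 0x20, 0x00, 0x01, 0x01 };
```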

What they don’t tell us in the datasheet (or else I could not find it): there is a boot timeout of circa 990 ms after the reset is released (pin VEN=1) within which the CORE_RESET_CMD must be received. If it is not received in time, the PN7120 goes into a strange power-down mode and will no longer respond on the I2C bus! Only a new hardware reset via VEN can bring it out of the power-down.

This picture provides an updated boot state diagram of the PN7120:

CORE_RESET_CMD on the I2C bus:

There is simply no response on I2C address 0x28 if no CORE_RESET_CMD arrived within 990 ms after the PN7120 reset:

Our hardware: ESP32 and Mikroelektronika’s NFC Click.

## The Tale of a GCC Upgrade

Once upon a time, I decided it was a good time to upgrade the version of the gcc compiler we use in our ARM926-based firmware project. Historically, this project has used gcc version 4.6, which is quite outdated nowadays (January 2018). In a different project, built around an ARM Cortex, we already successfully use gcc version 5.4.1 (Linaro release from January 2017), so I wanted to upgrade our ARM926 project to this newer gcc version.

But some things never go as smoothly as one may imagine at the beginning… 🙂

### Issue #1

The first problem we encountered was a simple assembly syntax error:

Error: .size expression for firstLoader.reset does not evaluate to a constant

For some reason, our original assembler code used ‘.’ and ‘_’ interchangeably:

.size firstLoader.reset, . - firstLoader.reset

The correct code must be:

.size firstLoader_reset, . - firstLoader_reset

### Issue #2

Now the compilation finishes without warnings, and the final ELF file is produced just fine. Yet, when it is loaded into the embedded system via a Lauterbach debugger, there is a strange error during memory verification:

Hmm… Removing the option /VERIFY from LOAD gets us past the error; however, the assembly code displayed by the debugger is quite wrong. As shown below, the debugger thinks that main() starts with ‘nop’ and ‘bx r14’ (a return from subroutine), which is completely bogus:

It seems as if all symbols are shifted around and nothing quite matches up. Running the code leads to a crash somewhere in the startup.

A normal disassembly via ‘objdump -d’ shows that itcm_code has an offset: it should start at physical address 0x0000, but it starts at 0x0040:

Disassembly of section .itcm:

00000028 <__start_itcm_code__-0x18>:
...

00000040 <__start_itcm_code__>:
...


It is not clear at all why this happens…

Finally, a full disassembly of all sections in the ELF file, including the data sections, reveals where the problem lies:

Disassembly of section .note.gnu.build-id:

00000000 <.note.gnu.build-id>:
0: 00000004 andeq r0, r0, r4
4: 00000014 andeq r0, r0, r4, lsl r0
8: 00000003 andeq r0, r0, r3
c: 00554e47 subseq r4, r5, r7, asr #28
10: 8ede0a2f vfnmshi.f32 s1, s28, s31
14: 72934257 addsvc r4, r3, #1879048197 ; 0x70000005
18: 2fb63d30 svccs 0x00b63d30
1c: a3a01a65 movge r1, #413696 ; 0x65000
20: 12fa8758 rscsne r8, sl, #88, 14 ; 0x1600000

Disassembly of section .itcm:

00000028 <__start_itcm_code__-0x18>:
...

00000040 <__start_itcm_code__>:
...

140: e59f030c ldr r0, [pc, #780] ; 454 <secondLoader_never+0x10>
144: e3e024ff mvn r2, #-16777216 ; 0xff000000
148: e5802000 str r2, [r0]

The new compiler toolchain inserts a special unique build ID into the output binary, and if it is not explicitly told where to place it, it ends up in the worst place possible – before the start code and right in the interrupt vector table.

There are at least two solutions to this: 1) use the gcc option ‘-Wl,--build-id=none’ to disable the feature completely, or 2) place the build ID in the ELF file where it does no harm, e.g. after the end of the code. This is done by adding the following rule in a suitable place in the linker script:

.note : {
KEEP (*(.note*))
} > SDRAM

After the linker script is fixed, the software starts and runs just fine, but strangely only up to the first interrupt…

### Issue #3

We have already fixed two problems: the assembly syntax problem and the linker build-ID problem. The third problem is the most puzzling one: the program starts and runs correctly, but after the first interrupt is served, no further interrupts are ever served. The processor happily continues running the main application code, but no ISR is ever entered again. Yet the identical code compiled with the old compiler 4.6 works perfectly. What could be the problem now?

This is execution state before the first interrupt is raised:

This is inside the interrupt service routine, serving the first IRQ:

The ISR finishes its work, and returns back to the application code:

After that, no further interrupts are ever served…

Perhaps you see the problem right away in the screenshots, but it took me half a day to spot it.

Look in the last picture at the right window labelled ‘REGISTERS’. The left-most column of the REGISTERS window shows, from top to bottom, the processor flags (N, Z, C, …) and below them the processor state. It shows the processor is running in the… IRQ state! Yet the program counter is already back in the main application code.

Let’s see now how the ISR function is defined in our C code:

volatile t_int main_irq_handler(void) __attribute__ ((interrupt("IRQ"))) __attribute__ ((section(".isr")));

The prototype already has __attribute__((interrupt("IRQ"))), and anyway it worked with the old gcc-4.6 quite well, so where’s the issue? Let’s compare the code generated by the two compilers. This is the ISR function epilogue of main_irq_handler as generated by the good old gcc-4.6:

618: e1a00003 mov r0, r3
61c: e24bd010 sub sp, fp, #16
620: e8fd800f ldm sp!, {r0, r1, r2, r3, pc}^

And now the same epilogue code, but from gcc-5.4.1:

8ac: e1a00003 mov r0, r3
8b0: e24bd018 sub sp, fp, #24
8b4: e8bd980f pop {r0, r1, r2, r3, fp, ip, pc}

Ok, different compilers generate slightly different code. But what is the peculiar symbol ‘^’ in the old gcc’s version? A Google search for ‘arm ldm’ quickly finds the answer:

So the character ‘^’ is a special suffix that must be used at the return from an ISR: when the register list includes the PC, it also restores CPSR from the saved SPSR. But the new code generated by gcc-5.4.1 uses a normal return with ‘pop’!

Now it is becoming clear why only the first interrupt request is ever served. Due to the buggy compiler-generated ISR epilogue, the processor never learns that the ISR has ended. The CPU remains in the IRQ state even after the formal return to the main application code, and thus it cannot take any further IRQs.

At this point, it is a good idea to simply test whether another compiler works better. And indeed, with gcc-7.2.1, the newest Linaro release as of January 2018, the code works well again. We have evidently hit a bug in gcc-5.4.1’s ARM926 ISR code generation path.

### Summary

In conclusion, what I have learned:

1. assembly syntax is not rock-stable, and it changes over time,
2. default linker behaviour changes over time,
3. compiler bugs exist, but they are subtle and can bite unexpectedly.

During a pre-Christmas sale on the Seeed Studio Bazaar they offered these digital radio modules with the nRF24L01+ chip for only US\$0.81 each. So I bought 10 of them outright 🙂

What would YOU suggest to do with them? Build a wireless flower life-support monitoring network? A mobile voice communications radio system? Retrofit them into talking toasters and robotic vacuum cleaners? Let me know!

## Fun with a switching regulator MC34063A

Being mostly a ‘digital’ guy, I’ve always shied away from switch-mode power supplies because they are too analog for my liking. I decided to break the habit by playing around with the MC34063A, a 1.5-A boost/buck/inverting switching regulator.

I tried an inverting topology that generates −12 V from a +6 V power supply. A copy of the schematic from the datasheet is below, and an implementation on a breadboard can be seen above in the page title image:

The converter works by first charging the coil L with a current drawn through transistor Q1. When Q1 is switched off, the energy stored in the coil is discharged through the 1N5819 diode into the output capacitor Co. Because the current through the coil L always flows from top to bottom (as drawn in the schematic above), the discharging phase effectively pulls the output voltage below ground level.
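Out of curiosity, the required switch timing can be estimated from the MC34063A datasheet design equation for the inverting topology; the diode and switch voltage drops below are my assumed typical values, not measurements:

```c
/* MC34063A inverting converter: required on/off time ratio of Q1,
   Ton/Toff = (|Vout| + VF) / (Vin - Vsat), per the datasheet design table. */
static double ton_toff_ratio(double vin, double vout_abs,
                             double vf, double vsat)
{
    return (vout_abs + vf) / (vin - vsat);
}
```

With Vin = 6 V, |Vout| = 12 V, an assumed VF ≈ 0.4 V for the 1N5819 Schottky, and Vsat ≈ 0.8 V for the internal switch, the ratio comes out at about 2.4, i.e. Q1 must conduct for roughly 70 % of each switching period.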

## Logic Voltage Levels

Again and again I need to look up the logic voltage levels of various logic families like HC, HCT, LVT, and so on. I found a nice article on EETimes: A brief recap of popular logic standards. The page has a picture that recaps all the popular standards along with their Voh, Vih, Vil, and Vol voltage levels. However, the picture resolution is very low, and it is also not evident at first glance what the relation between the standards is. So I redrew the picture in Inkscape, an open-source SVG editor. The result is here (click to see the large image):

## Real-time motion detection in video on Zynq FPGA

UPDATE 23.2.2017: Since I no longer work with Xilinx products (more than 3 years!) I cannot provide any updates or support regarding the information given below. You may continue reading at your own risk!

Since I am leaving my current employer UTIA by the end of the year, and thus will–likely–not be working with FPGAs any more (at least not for a living), I wanted to implement one larger design in Zynq to enjoy the technology for the last time 😉

The selected application is real-time motion detection in a video stream. We had implemented it in the SMECY project on a Spartan-6 FPGA using master/worker abstractions. The original Spartan-6 design achieved 5 FPS at best, and 3 FPS when other features (morphology, labeling) were included.

Here I used a Zynq XC7Z020 FPGA on the ZC702 board with the IMAGEON daughter card. No code is reused from the SMECY solution. The video pipeline is realized using AXI Streams, HDMI is used for input and output, and the accelerator was implemented using Vivado HLS (high-level synthesis from C). The synthesis tool used is Vivado 2013.3 with IP Integrator (which replaces XPS).

One possible practical application of the motion detection is in smart cameras for surveillance (security, safety) use — see the second youtube video below. The HDMI input would be replaced with a camera interface and the FPGA system could be integrated in the camera module.

Below is a demonstration video. The application runs at 8.2 FPS with one accelerator, and 14 FPS with two accelerators (not shown in the videos).

Video input and output go via HDMI on the Imageon extension card. The input 1080p video is fed over HDMI from a PC running Ubuntu. The output is 1280×720p to a monitor. The output image contains, in its top-right corner, a 640×480 window which is (also) the input to the motion detection. The output 640×480 black & white mask is positioned visually next to it.

## Motion detection algorithm

The system implements real-time video motion detection, sometimes also called foreground/background pixel segmentation. The algorithm is derived from a paper by KaewTraKulPong; the implementation does not use shadow detection and has several modifications intended to lower the computational complexity.

The goal of the image segmentation is to mark each pixel in an image frame as part of either the static background or the moving foreground. The decision depends on statistical models and their mixtures. All pixels in the image are considered independently. Each pixel is modelled by a mixture of the K strongest Gaussian models of the background, K=4 in the implementation. Each Gaussian model k is defined by a set of 3 mean values $\mu_{R,k}$, $\mu_{G,k}$, $\mu_{B,k}$, corresponding to the three primary colours red, green, blue; by a variance $\sigma_k$; and by a weight $w_k$. The models represent RGB colours that are considered to be the stationary ‘background’ colours of the pixel. As there are K=4 independent Gaussian models kept for each pixel, the algorithm allows for situations when the pixel periodically changes between two colours, such as moving escalators or trees in the wind – these scenes are classified as stationary. Each model also contains the weight parameter $w_k$ indicating how often that particular model successfully described the background in the pixel.

The picture above shows how the algorithm updates the models; for simplicity the presentation ignores RGB colours and shows only three models. The first picture at the top shows the initial situation with three models M1=$(\mu_1,\sigma_1)$, M2=$(\mu_2,\sigma_2)$, and M3=$(\mu_3, \sigma_3)$. The mean values $\mu_i$ position the ‘bell’-shaped models on the horizontal colour (greyscale) axis; the variances $\sigma_i$ define the widths of the ‘bells’; and the model weights $w_i$ are represented by the heights of the ‘bells’. When a new pixel colour hits in one of the models, that model is ‘strengthened’ by slightly increasing its weight, and the pixel colour is classified as background, i.e. stationary. This situation is shown in the picture in the middle: the colour hits in model M3, so the weight $w_3$ is increased. If the hit is not precise, the model is also slightly shifted towards the new colour.

However, when a new pixel colour does not hit any existing Gaussian model, the colour is classified as foreground, i.e. moving. The weakest model is erased and replaced by a new model representing the new colour, albeit with a small initial weight. This is illustrated in the last subpicture above: the weakest model M3 has been replaced by a new model.
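The per-pixel update described above can be sketched in C. This is a minimal greyscale illustration of the technique, not the project’s implementation; the 2.5σ match threshold and the learning rate α are common illustrative choices, and all the constants below are assumptions:

```c
#include <math.h>

#define K 4                 /* Gaussian models per pixel */
#define ALPHA 0.05f         /* learning rate (illustrative) */
#define MATCH_SIGMAS 2.5f   /* match threshold in std deviations */
#define INIT_SIGMA 30.0f    /* variance of a freshly created model */
#define INIT_WEIGHT 0.05f   /* weight of a freshly created model */

typedef struct { float mu, sigma, w; } Model;

/* Update the per-pixel mixture with a new greyscale value x.
   Returns 1 if x matched a model (background), 0 otherwise (foreground). */
int mog_update(Model m[K], float x)
{
    int hit = -1;
    for (int k = 0; k < K; k++)
        if (fabsf(x - m[k].mu) < MATCH_SIGMAS * m[k].sigma) { hit = k; break; }

    if (hit >= 0) {
        /* strengthen the matched model and shift it towards x */
        m[hit].w  += ALPHA * (1.0f - m[hit].w);
        m[hit].mu += ALPHA * (x - m[hit].mu);
        for (int k = 0; k < K; k++)
            if (k != hit) m[k].w *= (1.0f - ALPHA);   /* decay the others */
        return 1;   /* background */
    }

    /* no match: replace the weakest model with a new one centred on x */
    int weakest = 0;
    for (int k = 1; k < K; k++)
        if (m[k].w < m[weakest].w) weakest = k;
    m[weakest].mu = x;
    m[weakest].sigma = INIT_SIGMA;
    m[weakest].w = INIT_WEIGHT;
    return 0;       /* foreground */
}
```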

This algorithm was selected and implemented in a “high-level” C code (intended for CPU execution) by Roman Bartosinsky, a colleague in the SMECY project.

## Implementation details

The picture below shows annotated block diagram from IP Integrator. Click to see larger version.

The system consists of three main parts: video input pipeline, video output pipeline, and accelerator subsystem.

The video input path is highlighted using the yellow colour in the system image above. The pipeline consists of the following processing cores:

1. HDMI input and decoder
2. video to AXI-Stream converter
3. YUV 4:2:2 to 4:4:4 expander (16 to 24 bits) (custom core in VHDL)
4. YUV 4:4:4 to RGB colour space converter
5. Video DMA storing the input video stream into the main memory via
6. AXI memory interconnect and
7. Zynq HP0 port (150 MHz, 64 bits).

The video output path is highlighted using the blue colour. It basically mirrors the input path in the reverse order:

1. Zynq HP1 port (150MHz, 64bits),
2. AXI memory interconnect,
3. Video DMA reading via the above ports and producing pixel stream on its AXI-Stream output,
4. 32-to-24 bits trim
5. RGB to YUV 4:4:4 colour space converter
6. YUV 4:4:4 to YUV 4:2:2 conversion (24 to 16 bits)
7. AXI-Stream to video stream
8. HDMI output coder.

The accelerator subsystem uses the red path.

1. Pixel and context data is accessed using Zynq HP2 port (100MHz, 64bits),
2. by the Central DMA engine (in scatter/gather mode that automatically fetches new block descriptors via the ACP port)
3. and transferred via AXI interconnects to
4. AXI BRAM controllers
5. that connect to the actual BRAMs.
6. The BRAMs keep the working data for the accelerator – pixels in, context, pixels out.
7. The accelerator HW block implemented with Vivado HLS.

The accelerator HW block requires 17 DSP48E blocks, 5967 FFs, and 7229 LUTs. It runs at a 100 MHz clock. It can process 1024 pixels in one activation. This requires 4 kB for pixel input and output data, and 5×16 kB = 80 kB for context data. The accelerator is internally pipelined: processing of a single pixel takes 96 clock cycles, and a new pixel can be accepted into the pipeline every 9 clock cycles. Using a single accelerator instance delivers about 8.2 FPS.
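A back-of-the-envelope estimate (my own arithmetic, not from the original measurements) shows how far the 8.2 FPS figure is from the pipeline’s own limit, suggesting that data movement rather than computation dominates:

```c
/* Upper bound on frame rate from the accelerator pipeline alone,
   ignoring DMA transfers and software overhead. */
static double fps_upper_bound(long width, long height, double clk_hz,
                              long block, long latency, long interval)
{
    long activations = width * height / block;   /* 1024-pixel blocks per frame */
    double cycles_per_act = latency + (double)(block - 1) * interval;
    return clk_hz / (activations * cycles_per_act);
}
```

For a 640×480 frame in 1024-pixel activations, with 96 cycles of latency and a 9-cycle initiation interval at 100 MHz, this gives roughly 36 FPS; the measured 8.2 FPS therefore points at the cost of moving pixels and context in and out of the BRAMs.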

UPDATE: A configuration with two accelerator instances achieves 14 FPS.

## Hardware/Software Co-Simulation (PORTAL)

I devised PORTAL as a means for validating our hardware accelerator cores (ASVP in SMECY) in a co-simulated environment together with control software. The basic structure of PORTAL is shown below:

PORTAL is a communication library that connects a hardware model simulated in ModelSim to its control software running on a PC. Communication is done over TCP/IP.

In PORTAL the primitive communication abstraction is a shared memory. All PORTAL clients have common access to a (virtually) shared 32-bit address space. Any client can dynamically claim and register any unoccupied memory range in the address space and start to serve read/write requests generated by other clients. Management of the virtual address space is handled by the central server, PHUB.

In our use case, each ASVP core typically has 4 data memory banks and 2 to 3 control banks (firmware, control/status, vector partitions). In co-simulation environment the VHDL top level test-bench registers each bank as a memory extent with the PHUB server and serves accesses to the banks. Meanwhile, the ASVP master control program that runs on a PC also connects to the PHUB, discovers the ASVP core’s memory extents and presents them using WAL (Worker Abstraction Library) to the software.

## MASSTEST: Automated validation/verification testing

In 2009 I joined UTIA and started working on the Apple-CORE project. The project was already in its second year, so plenty of implementation work had already been done. My first assignment was to prepare a test suite of simple assembly-level programs for validation of the UTLEON3 processor. The initial approach was quite naive, though: I produced a test program, observed that it did not run as intended, and reported a bug via e-mail to my colleagues. However, it quickly turned out that bugs were so widespread that fixing one place often broke other things… Running the validation tests had to be automated, and MASSTEST was born.

MASSTEST is a collection of simple scripts written in Perl for running validation test suites. The main script is called mtest.pl. This script runs test-sets: first it compiles test programs using assembler/makefile toolchain and then it runs ModelSim VHDL simulator to execute the test program in a simulated UTLEON3. Simulation results are placed in dedicated directories. The second important script is called mcollect.pl: it gathers results from the directories, processes ModelSim output files (mainly the transcript) and generates summary tables.

The mtest.pl script is driven by a test-set configuration file. Configuration files contain lines that assign lists of values to parameters. For example:

# directories: software, hardware
prog_dir        ../../utbench
hdl_dir         ../../integration-V2/designs/utleon3-ml509-hwt03/_scripts
timeout         400
simscr          do-sim-c.sh
# hardware parameters
hw-PCNT_KERNEL  '1'
hw-RF_1W_PORT    1
# programs to run
program         t07-ut t07-ut-unroll2 t07-ut-unroll4 t07-ut-swch t07-ut-unroll2-swch t07-ut-unroll4-swch
# software parameters
sw-BLOCKSIZE_1  2 5 7
sw-BLOCKSIZE_2  6

The parameters prog_dir, hdl_dir, timeout, and simscr specify system values for running the test. The parameter “program” specifies a list of programs that will be run; each program is a directory under “prog_dir”. The parameters that start with “hw-” are passed to VHDL simulator as generics of the top-level test-bench. The parameters that start with “sw-” are passed to compiler/assembler as macro expansions.

Each parameter is really a list of space-separated values. The important thing about MASSTEST is that it creates the Cartesian product of the configuration parameters: in the script above, each combination of the parameters ‘program’ and ‘sw-BLOCKSIZE_1’ creates an independent test run.
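For the example configuration above, that means 6 programs × 3 values of sw-BLOCKSIZE_1 × 1 value of sw-BLOCKSIZE_2 = 18 independent runs. A sketch of the counting (in C, not the actual Perl):

```c
/* Total number of test runs = product of the per-parameter value counts. */
static int run_count(const int counts[], int n)
{
    int total = 1;
    for (int i = 0; i < n; i++)
        total *= counts[i];
    return total;
}
```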

Oftentimes, however, I don’t need the full Cartesian product of several parameters. Indeed, as individual simulation runs can take tens of minutes, it becomes too expensive to run so many tests. Therefore, bundled parameters were devised:

hw-TECHNO;hw-SPEED;hw-dfuclkspeed   1;A3M2;1.0  1;A4M3;1.5  1;A5M3;1.66  1;A6M3;2.0  3;A2M3;1.0  3;A4M4;2.5
cmd-WLEN                4 8 16 32 64 128 196

Bundled parameters are separated by semicolons. The example above bundles the three parameters ‘hw-TECHNO’, ‘hw-SPEED’, and ‘hw-dfuclkspeed’ and specifies a set of value triples that are assigned (iterated) together. The other parameter ‘cmd-WLEN’, on the other hand, is not bound to those three; hence MASSTEST will create the Cartesian product of the triples and the individual values of ‘cmd-WLEN’.

Even though all test-runs generated and executed in MASSTEST are independent of each other, parallelization of the runs is not so easy. The problem is that the program (software) and design (hardware) directories are re-used for each run. Indeed, VHDL compilation for ModelSim takes time and disk space, and redoing it for each test-run is wasteful.

So, parallelization is done differently. You first decide on the number of concurrent runs that should be executed together. Usually this is the number of processors in the system, typically four. The program and design directories must be replicated four times and specified in the configuration file using a bracket notation:

# directories: software, hardware; replicated four times.
prog_dir[0]        ../../utbench_0
prog_dir[1]        ../../utbench_1
prog_dir[2]        ../../utbench_2
prog_dir[3]        ../../utbench_3
hdl_dir[0]         ../../integration-V2/designs/utleon3-ml509-hwt03-0/_scripts
hdl_dir[1]         ../../integration-V2/designs/utleon3-ml509-hwt03-1/_scripts
hdl_dir[2]         ../../integration-V2/designs/utleon3-ml509-hwt03-2/_scripts
hdl_dir[3]         ../../integration-V2/designs/utleon3-ml509-hwt03-3/_scripts

This tells MASSTEST that there are four independent copies of program and HDL design directories. Then you specify that the parallelization should be done over those four replicas:

parmodulo    4

Configuration scripts using parallelization need to be run using mptest.pl, not mtest.pl. The mptest.pl is a wrapper that forks off four instances (or any number specified in parmodulo) of mtest.pl and–this is important–tells each instance ‘i’ that it should execute only those tests whose index equals i modulo four. All instances iterate over the whole test space, but the first instance executes tests 0, 4, 8, …, the second instance tests 1, 5, 9, …, and so on. The test runs are interleaved over the instances.
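The interleaving scheme is simple enough to sketch (my illustration in C; the real logic lives in the Perl scripts):

```c
/* With parmodulo = p, instance `inst` executes exactly the tests whose
   index is congruent to inst modulo p. Fill `out` with those indices
   and return how many there are (at most `max`). */
static int schedule(int total, int p, int inst, int out[], int max)
{
    int n = 0;
    for (int t = inst; t < total && n < max; t += p)
        out[n++] = t;
    return n;
}
```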

Finally, it is also possible to parallelize test runs over a cluster of computers. This requires that the computers in the cluster have a unified filesystem view, i.e. a file name on one computer refers to the same object on the others. In practice this means having a shared network disk mounted in the same place on all nodes. Secondly, remote command execution is facilitated by SSH with public-key password-less authentication. MASSTEST configuration scripts are extended with a new “remote” directive:

remote[0]    blue03
remote[1]    blue03
remote[2]    blue04
remote[3]    blue04

The “remote” directive specifies cluster host names where the mtest.pl instances should be delegated to. The parmodulo mechanism is used to break test-runs into independent groups, and the mtest.pl for each group is run in the remote host via ssh.