Using STM32 DMA to speed up hardware SPI and U8G2 - Attempt 2

Using STM32 DMA to speed up hardware SPI and U8G2 - Attempt 2 - buffered DMA

So we have tried the simple DMA method, simple DMA methodwhich doesn't work, let's try another way.

In this test, we will cache everything the u8g2 trys to send to the LCD screen over SPI, then we will find a good chance to send the whole huge buffer to LCD using SPI DMA.

Step 1 store data to our buffer

define a function to be put inside the SPI back function for message U8X8_MSG_BYTE_SEND.

cache_spi_data_to_buffer_for_dma_u8g2_callback

this function doesn't send anything at all, it only cache the data into our DMA buffer.

Since STM32 bluepill has 20kb and blackpill has more, so it's easy for us to spare 2.65KB for u8g2 and LCD DMA buffer.

According to my tests so far, every time we call u8g2 sendbuffer() function, all the accumulated data before the last EndTransfer call, the total transmitted size is 25XX, so 2.65KB should be enough for our buffer size.

Step 1.1 In this store and cache function, we first need to wait for DMA is idle.

This is a bit confusing, since we are not sending anything through DMA, why do we still care about DMA status ? that is because, if DMA is still working on the previous data, which is stored in our 2.65KB buffer, then should NOT touch the buffer at all, because that could tamp the old data which has not be fully sent yet.

Step 1.2 store the data it trys to send, and adjust our data counter for our buffer, so we know how much data we currently have.

we may also need to consider the situations causes dead lock or buffer full error, like our buffer is already full so we cannot store more, yet DMA is not working so the buffer is full forever etc. but let's skip this for now.

Step 2 send large buffer using DMA

So the easy parts are now done. next question would be : when do we send the buffer? as soon as the buffer is full ? what if our buffer is abit larger than the required buffer size? so it will never be full?

Define a smaller size buffer ? So every time we after we send, u8g2 trys to send again, it has to wait for last DMA finish sending, a blocking wait, so it will be slow as non DMA regular hw SPI send.

I think there is a good solution to this problem, which is using u8g2's message - U8X8_MSG_BYTE_START_TRANSFER and U8X8_MSG_BYTE_END_TRANSFER.

so obviously u8g2 calls start before sending SPI and calls end after sending SPI, and according to a lot of example code and document it seems u8g2 original intent is to use these two functions to pull up and pull down CS - SPI Chip Selection PIN.

So that we could use the U8X8_MSG_BYTE_END_TRANSFER to fire up the send_via_DMA function.

case U8X8_MSG_BYTE_END_TRANSFER:

send_buffer_to_stm32_spi_dma();

we can define above new function so we can do other useful stuffs inside(not much really, maybe store a timestamp so we know how much time it takes), or just simply put here the stm32 API function: HAL_SPI_Transmit_DMA(&LCD_SPI, buffer, buffer_size)

Step 3 Modify u8g2

Now it's the difficult part, as much as I try to avoid modifying U8G2's code, however it seems hard not to change it. Because the u8g2 keep calling the startTransfer and endTransfer for each line of dots, that seems too much, this completely render our DMA method, or any DMA method useless. Because DMA has advantage to send large piece of data, not multiple smaller data many times.

Yet the modification to u8g2 is very simple, don't let it call start and endTransfer for every single line of dots.

only do this for each sendbuffer(), for the whole screen.

Step 3.1 remove endTransfer from u8g2 source code file u8x8_d_st7920.c

Find function u8x8_d_st7920_common, ( well, we are only working with st7920 LCD12864). Find switch case for case U8X8_MSG_DISPLAY_DRAW_TILE, comment out the u8x8_cad_StartTransfer(u8x8) and u8x8_cad_EndTransfer(u8x8);

These two function call are for each dot line I per my understanding, and we don't need them to be called so many times.

Step 3.2 Add endTransfer in u8g2_buffer.c

find function u8g2_SendBuffer(u8g2_t *u8g2), add start and endtranfer here instead, for the whole screen.

So, that's all.

Now, every time u8g2 trys to send one dot of data to LCD via SPI, we cache it, and cache, and cache it again, until at last u8g2 finishes sending the whole screen's data, it then calls the endTransfer inside sendBuffer(), then we start sending the whole buffer using DMA, fire and forget, the send SPI DMA returns quickly, no wait at all.

Testing Speed

Now let's test 1. Is it still working correctly ? 2. How much faster than non-DMA ? worth all the code changes at all ?

Test1 no delay added to test the full speed

note that since storing data sent from u8g2 is much much faster than actually sending it over SPI (even with DMA), so our send_DMA method always need to wait for the last DMA finishes first. so this is as fast as we can get.

According to the timestamp, now our sendbuffer only uses 22-23ms ( comparing to non-DMA send which is about 43ms).

And the LCD screen can reach 24FPS ( non DMA is about 17FPS)

So it's much faster already.

However, this is not all the advantage of using DMA. now we can add a delay in the main while loop to simulate some other real work, let's add delay(20ms) in the main loop see how much slower it will get:

interestingly we can see now the sendbuffer() function only uses 6ms , much less than the previous one - no delay. that is because, the previous sendbuffer has to wait for the last DMA to finish first. Now we added a delay(20) in the main loop, so most likely the sendbuffer() doesn't need to wait for DMA (which is a bad wait, wasting time). and this 6ms is all for the storing buffer memcopy.

while (1)
    {
        draw_test_screen(fps);
      ...
        HAL_Delay(20);
    }

Now let's check LCD fps: we can see now we have only a little slower FPS 22, ( no delay is 24)

so that means, even we waste 20ms, that deosn't slow us down a lot, that is the big advantage of using DMA.

There we have it, using DMA, it helps us successfully get a free 20ms each loop, which could have been taken by slow SPI and waiting.

To compare, without using DMA, we have 17FPS, after added the delay(20ms), fps dropped to 12FPS.

Buffer DMA Method Limitations

1. There must be some u8g2 APIs doesn't work with our code change of startTransfer endTransfer, I only used sendbuffer() which sends the whole LCD screen, this works very well. If other api function don't work, I think we could apply the similar code changes to those, then they should work, I hope.

2. About the SPI speed, I tested it only works with 562KBit/s (System speed 72MHZ, SPI prescaler 64), if faster, then it doesn't work. I think the reason behind this is, u8g2 sends some command in inside the buffer, if the SPI speed is too high, then the command can not be handled by LCD ST7920 correctly.

Source code for buffered DMA method

Search This Blog

Super DIY Projects