Introduction
Pseudo-Static Random Access Memory (PSRAM) is used for high-speed transfer of data streams. The chip communicates with PSRAM through the PSRAM controller (PSRAMC). PSRAM can be accessed by both KM4 and KM0, and code execution from PSRAM is supported.
The features of PSRAM are:

- Clock rate: up to 200MHz
- Double Data Rate (DDR)
- Read-Write Data Strobe (DQS)
- Half-sleep and deep power-down modes
- Programmable drive strength
- Temperature Compensated Refresh
- 16/32/64/128 bytes wrap burst access
- Distributed refresh interval varies with temperature
- Address mapping: 0x6000_0000~0x6040_0000
Pseudo-Static Random Access Memory (PSRAM) is used for high-speed transfer of data streams. The chip communicates with PSRAM through the PSRAM controller (PSRAMC). PSRAM can be accessed by both KM4 and KR4, and code execution from PSRAM is supported.
The features of PSRAM are:

- Clock rate: 150MHz
- Double Data Rate (DDR)
- Read-Write Data Strobe (DQS)
- Half-sleep and deep power-down modes
- Programmable drive strength
- Configurable refresh rate
- Temperature Compensated Refresh
- 16/32/64/1024 bytes wrap burst access
Pseudo-Static Random Access Memory (PSRAM) is used for high-speed transfer of data streams. The chip communicates with PSRAM through the PSRAM controller (PSRAMC). PSRAM can be accessed by KM4, KR4 and DSP, and code execution from PSRAM is supported.
The features of PSRAM are:

- Clock rate: 250MHz
- Double Data Rate (DDR)
- Read-Write Data Strobe (DQS)
- Half-sleep and deep power-down modes
- Programmable drive strength
- Configurable refresh rate
- Temperature Compensated Refresh
- 16/32/64/1024 bytes wrap burst access
Pseudo-Static Random Access Memory (PSRAM) is used for high-speed transfer of data streams. The chip communicates with PSRAM through the PSRAM controller (PSRAMC). PSRAM can be accessed by NP, LP and AP, and code execution from PSRAM is supported.
The features of PSRAM are:

- Clock rate: 250MHz
- Density: 64M bits
- Double Data Rate (DDR)
- Read-Write Data Strobe (DQS)
- Half-sleep and deep power-down modes
- Programmable drive strength
- Configurable refresh rate
- Temperature Compensated Refresh
- 16/32/64/1024 bytes wrap burst access
- Address mapping: 0x6000_0000~0x6080_0000
Throughput
PSRAM supports direct access and DMA access. The throughput of PSRAM is listed in the following table.
| Access mode | Writing 32 bytes: theory | Writing 32 bytes: test on KM4 | Reading 32 bytes: theory | Reading 32 bytes: test on KM4 |
|---|---|---|---|---|
| Direct access (write back) | 1523.81Mbps | (32*8)/(199.68ns) = 1282.05Mbps | 1454.55Mbps | (32*8)/(212.16ns) = 1204.14Mbps |
| DMA access | 2206.9Mbps | 1641.03Mbps | 2133.33Mbps | 1172.16Mbps |
Note

Throughput theoretical calculation:

- The test data above uses variable initial latency, so the initial latency is taken once or twice depending on RWDS.
- The header overlaps with the delay by 1T.
- Since this is DDR PSRAM, 16T are used to transmit 32 bytes.

Direct access:

- By default, the cache attribute is assigned to PSRAM, so cache behavior must be considered when measuring PSRAM access performance.
- When reading 4 bytes: on a read hit (the addressed data is in the cache), the CPU reads the 4 bytes directly from the cache; on a read miss (the addressed data is not in the cache), one cache line of data must be read from PSRAM into the cache.
- When writing 4 bytes: on a write hit (the addressed data is in the cache), the content in the cache is updated, and one cache line is written back to PSRAM when the cache is flushed; on a write miss (the address to be written is not in the cache), the write-allocate policy makes the CPU first read one cache line of data from PSRAM into the cache and then update the content in the cache.
- The read/write throughput in the table is measured on a read miss/cache flush, both of which access PSRAM. The throughput of write allocate equals that of read miss.
- Instruction execution time also needs to be taken into consideration.
| Item | Writing 32 bits | Reading 32 bits (in fact, 32 bytes) |
|---|---|---|
| Header + delay | (3 + 22) * 4ns = 100ns | (3 + 23) * 4ns = 104ns |
| Data transmit period | 2 * 4ns = 8ns | 16 * 4ns = 64ns |
| Hardware hold | 1 * 4ns = 4ns | 2 * 4ns = 8ns |
| Total (without instruction execution time) | 100ns + 8ns + 4ns = 112ns | 104ns + 64ns + 8ns = 176ns |
| Theoretical throughput | 32/112ns = 285.71Mbps | (32*8)/176ns = 1454.55Mbps |
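The arithmetic behind the theoretical values can be reproduced with a short sketch (a minimal illustration only; the latency counts and the 4ns clock period are taken from the table above):

```c
#include <stdio.h>

/* One T is one bus clock period. Total bus time for a transaction is
 * (header + delay) + data transmit + hardware hold, measured in T. */
static double throughput_mbps(double t_ns, int header_delay_t,
                              int data_t, int hold_t, int bits)
{
	double total_ns = (double)(header_delay_t + data_t + hold_t) * t_ns;
	return (double)bits / total_ns * 1000.0;  /* bits per ns -> Mbps */
}

int main(void)
{
	/* Reading 32 bytes: (3+23)T header+delay, 16T data (DDR), 2T hold */
	printf("read : %.2f Mbps\n", throughput_mbps(4.0, 26, 16, 2, 32 * 8)); /* 1454.55 */
	/* Writing 32 bits: (3+22)T header+delay, 2T data, 1T hold */
	printf("write: %.2f Mbps\n", throughput_mbps(4.0, 25, 2, 1, 32));      /* 285.71 */
	return 0;
}
```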
| Item | Writing 32 bits (in fact, 32 bytes) | Reading 32 bits (in fact, 32 bytes) |
|---|---|---|
| Header + delay | [3 + (6 or 12) - 1] * (1000/150)ns | [3 + (6 or 12) - 1] * (1000/150)ns |
| Data transmit period | 16 * (1000/150)ns | 16 * (1000/150)ns |
| Hardware hold | 1 * (1000/150)ns | 2 * (1000/150)ns |
| Total (without instruction execution time) | [(8 or 14) + 16 + 1] * (1000/150)ns = 166.667 or 206.667ns | [(8 or 14) + 16 + 2] * (1000/150)ns = 173.333 or 213.333ns |
| Theoretical throughput | (32*8)/(166.667 or 206.667)ns = 1536 or 1238.71Mbps | (32*8)/(173.333 or 213.333)ns = 1476.923 or 1200Mbps |
| Item | Writing 32 bits (in fact, 32 bytes) | Reading 32 bits (in fact, 32 bytes) |
|---|---|---|
| Header + delay | [3 + (7 or 14) - 1] * 4ns = 36 or 64ns | [3 + (7 or 14) - 1] * 4ns = 36 or 64ns |
| Data transmit period | 16 * 4ns = 64ns | 16 * 4ns = 64ns |
| Hardware hold | 1 * 4ns = 4ns | 2 * 4ns = 8ns |
| Total (without instruction execution time) | (36 or 64)ns + 64ns + 4ns = 104 or 132ns | (36 or 64)ns + 64ns + 8ns = 108 or 136ns |
| Theoretical throughput | (32*8)/(104 or 132)ns = 2461.538 or 1939.394Mbps | (32*8)/(108 or 136)ns = 2370.37 or 1882.353Mbps |
PSRAM supports direct access and DMA access. The throughput of PSRAM is listed in the following table.
| Access mode | Writing 32 bytes: theory (Mbps) | Writing 32 bytes: test on NP (Mbps) | Reading 32 bytes: theory (Mbps) | Reading 32 bytes: test on NP (Mbps) |
|---|---|---|---|---|
| Direct access (write back) | 1939.394 | (32*8)/(123ns) = 2081.30 | 1882.353 | (32*8)/(152ns) = 1684.21 |
| DMA access | - | (32*8)/(180ns) = 1422.22 | - | (32*8)/(175.60ns) = 1457.85 |
Note

Throughput theoretical calculation:

- The test data above uses fixed initial latency, so the initial latency is always taken twice regardless of RWDS.
- The header overlaps with the delay by 1T.
- Since this is DDR PSRAM, 16T are used to transmit 32 bytes.

Direct access:

- By default, the cache attribute is assigned to PSRAM, so cache behavior must be considered when measuring PSRAM access performance.
- When reading 4 bytes: on a read hit (the addressed data is in the cache), the CPU reads the 4 bytes directly from the cache; on a read miss (the addressed data is not in the cache), one cache line of data must be read from PSRAM into the cache.
- When writing 4 bytes: on a write hit (the addressed data is in the cache), the content in the cache is updated, and one cache line is written back to PSRAM when the cache is flushed; on a write miss (the address to be written is not in the cache), the write-allocate policy makes the CPU first read one cache line of data from PSRAM into the cache and then update the content in the cache.
- The read/write throughput in the table is measured on a read miss/cache flush, both of which access PSRAM. The throughput of write allocate equals that of read miss.
- Instruction execution time also needs to be taken into consideration.
| Item | Writing 32 bits (in fact, writing 32 bytes) | Reading 32 bits (in fact, reading 32 bytes) |
|---|---|---|
| Header + delay | [3 + (14 - 1)] * 4ns = 64ns | [3 + (14 - 1)] * 4ns = 64ns |
| Data transmit period | 16 * 4ns = 64ns | 16 * 4ns = 64ns |
| Hardware hold | 1 * 4ns = 4ns | 2 * 4ns = 8ns |
| Total (without instruction execution time) | 64ns + 64ns + 4ns = 132ns | 64ns + 64ns + 8ns = 136ns |
| Theoretical throughput | (32*8)/132ns = 1939.394Mbps | (32*8)/136ns = 1882.353Mbps |
Boot from PSRAM
If the PSRAM is embedded in the chip, follow these steps to boot from PSRAM in the SDK.
1. Enable the power supply of PSRAM in the bootloader.
2. Initialize the PSRAM controller, PSRAM PHY and PSRAM device to synchronize the relevant parameters.
3. Calibrate the PSRAM.
```c
/* Step 1: enable the PSRAM function and clock in the bootloader */
RCC_PeriphClockCmd(APBPeriph_PSRAM, APBPeriph_PSRAM_CLOCK, ENABLE);

DBG_PRINT(MODULE_BOOT, LEVEL_INFO, "Init PSRAM\r\n");

/* Steps 2-3: initialize PSRAMC, PSRAM PHY and the PSRAM device, then calibrate */
BOOT_PSRAM_Init();
```
PSRAM Cache “Write Back” Policy
When a cache hit occurs on a store access, the data is written only to the cache, so data in the cache can be more up-to-date than data in memory. Any such data is written back to memory when the cache line is cleaned or reallocated. Another common term for a write-back cache is a copy-back cache. By default, the cache attribute is assigned to PSRAM, so the CPU accesses PSRAM only on a read miss, cache flush, or write allocate, one cache line at a time.
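As a minimal sketch of this rule (assuming DCache_CleanInvalidate() takes a start address and a byte count, as in the DMA excerpts later in this section; the buffer name and size are illustrative): after the CPU fills a PSRAM buffer, the dirty cache lines must be written back before another master reads the buffer.

```c
#include <stdint.h>

#define BUF_LEN 128  /* a multiple of the 32-byte cache line size */

/* Buffer assumed to be located in the cacheable PSRAM region */
static uint8_t tx_buf[BUF_LEN] __attribute__((aligned(32)));

extern void DCache_CleanInvalidate(uint32_t addr, uint32_t len);

void share_psram_buffer(void)
{
	/* With the write-back policy, these stores land in the D-cache first */
	for (int i = 0; i < BUF_LEN; i++)
		tx_buf[i] = (uint8_t)i;

	/* Clean the dirty lines so PSRAM holds the up-to-date data
	 * before a DMA engine or another core reads it */
	DCache_CleanInvalidate((uint32_t)tx_buf, BUF_LEN);
}
```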
Row Hammer
With the increasing density of DRAM, its memory cells become smaller and store less charge. As a result, the noise margin between memory cells is reduced, allowing the charges of two independent memory cells to interact. Row hammer is caused by this defect in the design of the memory chip: repeatedly accessing one row address in a DRAM array causes charge leakage in the adjacent rows, so bits in those rows flip, i.e. 0 flips to 1 and 1 flips to 0.
Therefore, when a large number of accesses are made to PSRAM in a short time and the refresh rate is not high enough, every 2KB of memory space (i.e. two rows) can affect each other: a large number of continuous write operations on one row can disturb the charges of the adjacent rows and change their values.
Row hammer is an inherent weakness of PSRAM. If the cache is disabled, PSRAM may be affected by an excessive access load. With the cache enabled, we have verified that APM PSRAM is safe on the AP. The test conditions, sketched in code below, are as follows:

- The AP clock is 1.2GHz.
- The cache is in write-back mode.
- Every 4KB of memory is treated as one group: write the first 800 bytes 80,000 times (with a cache flush for each 800-byte write), then read the rest of the group to check whether any unwritten values have changed.
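A hedged sketch of that procedure (the base address, region length and helper signature are illustrative; DCache_CleanInvalidate() is assumed to take an address and a byte count, as in the SDK excerpts below):

```c
#include <stdint.h>
#include <string.h>

#define GROUP_SIZE   4096u   /* every 4KB of memory is one group */
#define HAMMER_LEN   800u    /* first 800 bytes of each group are hammered */
#define HAMMER_ITERS 80000u  /* 80,000 write passes per group */

extern void DCache_CleanInvalidate(uint32_t addr, uint32_t len);

/* Returns 0 if no bit flip is observed in the unwritten part of any group */
int rowhammer_check(uint8_t *base, uint32_t len, uint8_t fill)
{
	memset(base, fill, len);                       /* known pattern everywhere */
	DCache_CleanInvalidate((uint32_t)base, len);

	for (uint32_t g = 0; g + GROUP_SIZE <= len; g += GROUP_SIZE) {
		for (uint32_t i = 0; i < HAMMER_ITERS; i++) {
			memset(base + g, (uint8_t)i, HAMMER_LEN);
			/* flush each 800-byte write so it really reaches PSRAM */
			DCache_CleanInvalidate((uint32_t)(base + g), HAMMER_LEN);
		}
		/* drop any cached copies so the check re-reads PSRAM itself */
		DCache_CleanInvalidate((uint32_t)(base + g + HAMMER_LEN),
		                       GROUP_SIZE - HAMMER_LEN);
		for (uint32_t j = HAMMER_LEN; j < GROUP_SIZE; j++) {
			if (base[g + j] != fill)
				return -1;  /* value changed: possible row hammer victim */
		}
	}
	return 0;
}
```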
Note
If you set memory to the non-cacheable attribute through the MMU or MPU, take the row hammer boundary into consideration.
Notice
Cache Operation
Under the write-back policy, synchronization between the cache and PSRAM is needed to keep their contents consistent, especially when PSRAM is accessed by multiple masters, e.g. the CPU, serial ports and other peripherals.
- As the KM4/KM0 cache line is 32 bytes and cache operations always work on whole cache lines, the buffer size and buffer start address are recommended to be 32/64-byte aligned to avoid synchronization issues.
- As the KM4/KR4 cache line is 32 bytes and cache operations always work on whole cache lines, the buffer size and buffer start address are recommended to be 32-byte aligned to avoid synchronization issues.
- As the KM4/KR4 cache line is 32 bytes, the DSP cache line is 128 bytes, and cache operations always work on whole cache lines, the buffer size and buffer start address are recommended to be 32/128-byte aligned to avoid synchronization issues.
- As the NP/LP cache line is 32 bytes, the AP cache line is 64 bytes, and cache operations always work on whole cache lines, the buffer size and buffer start address are recommended to be 64-byte aligned to avoid synchronization issues.
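For example (a sketch; the alignment attribute is GCC-style and the buffer is illustrative), a buffer shared between a 32-byte-line cache and a 64-byte-line cache can simply use the larger line size for both its alignment and its length:

```c
#include <stdint.h>

/* 64-byte alignment also satisfies the 32-byte cache line requirement */
#define CACHE_LINE_MAX 64u

/* Length is a whole number of 64-byte lines, so cache clean/invalidate
 * operations on this buffer never touch adjacent variables */
static uint8_t dma_buf[4 * CACHE_LINE_MAX] __attribute__((aligned(CACHE_LINE_MAX)));
```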
DMA Operation
The following steps should be added when executing DMA Rx/Tx.
| Operation | Step |
|---|---|
| DMA Rx | Clean/invalidate the D-cache lines covering the Rx buffer before the transfer starts; see the UART_RXGDMA_Init() excerpt below. |
| DMA Tx | Clean/invalidate the D-cache lines covering the Tx buffer before the transfer starts; see the UART_TXGDMA_Init() excerpt below. |
In the SDK, only one-time transmission with a one-time xxx_GDMA_Init is illustrated; the cache maintenance step is included in xxx_GDMA_Init by default. If you need multiple DMA Tx/Rx transfers with only a one-time xxx_GDMA_Init, DCache_CleanInvalidate() should be called before each DMA transmission starts.
```c
BOOL UART_TXGDMA_Init(
	u8 UartIndex,
	GDMA_InitTypeDef *GDMA_InitStruct,
	void *CallbackData,
	IRQ_FUN CallBackFunc,
	u8 *pTxBuf,
	int TxCount)
{
	u8 GdmaChnl;

	assert_param(GDMA_InitStruct != NULL);

	/* Write dirty cache lines of the Tx buffer back to PSRAM
	 * so the DMA engine reads up-to-date data */
	DCache_CleanInvalidate((u32)pTxBuf, TxCount);

	/* ... remaining GDMA channel configuration omitted ... */
}
```
```c
BOOL UART_RXGDMA_Init(
	u8 UartIndex,
	GDMA_InitTypeDef *GDMA_InitStruct,
	void *CallbackData,
	IRQ_FUN CallBackFunc,
	u8 *pRxBuf,
	int RxCount)
{
	u8 GdmaChnl;
	UART_TypeDef *UARTx;

	assert_param(GDMA_InitStruct != NULL);

	/* Clean and invalidate the cache lines of the Rx buffer so no dirty
	 * line is evicted over the DMA-written data, and the CPU later
	 * re-reads the fresh data from PSRAM */
	DCache_CleanInvalidate((u32)pRxBuf, RxCount);

	/* ... remaining GDMA channel configuration omitted ... */
}
```
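A hedged sketch of the multi-transmission case described above (GDMA_Cmd() and the GDMA_Index/GDMA_ChNum fields are assumed SDK names; fill_tx_buffer() and NUM_TX are hypothetical):

```c
#define NUM_TX 4  /* number of transmissions reusing one init (illustrative) */

void uart_multi_tx(GDMA_InitTypeDef *GDMA_InitStruct, u8 *pTxBuf, int TxCount)
{
	/* One-time init: UART_TXGDMA_Init() already cleans the first buffer */
	UART_TXGDMA_Init(0, GDMA_InitStruct, NULL, NULL, pTxBuf, TxCount);

	for (int n = 1; n < NUM_TX; n++) {
		/* wait for the previous transfer to complete before refilling (omitted) */
		fill_tx_buffer(pTxBuf, TxCount);               /* hypothetical helper */

		/* New CPU writes sit in the D-cache: clean them before restarting */
		DCache_CleanInvalidate((u32)pTxBuf, TxCount);
		GDMA_Cmd(GDMA_InitStruct->GDMA_Index,
		         GDMA_InitStruct->GDMA_ChNum, ENABLE); /* assumed restart API */
	}
}
```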
TCEM Setting
TPR0[24:31] (CS_TCEM) provides a guard function: when the CSN low pulse width reaches (CS_TCEM * 32) bus clock cycles, the SPI Flash Controller automatically chops the current transmission and pulls CS up.

Winbond:

- When the temperature is below 85°C, the PSRAM refreshes the internal cell array at the normal rate (4us).
- When the temperature is between 85°C and 125°C, the PSRAM refreshes the internal cell array at a faster rate (1us). This sets an upper limit on the length of read and write transactions so that the automatic distributed refresh can be done between transactions. This limit is called the CS# low maximum time (tCSM), and tCSM equals the maximum distributed refresh interval.
- So when the temperature is below 85°C, for higher performance, we recommend CS_TCEM = 4us/busclk/32; when the temperature is above 85°C, the value should be 1us/busclk/32.

APM:

- APM PSRAM is in extended mode by default, so it always keeps the fast refresh rate (1us). Here CS_TCEM = 1us/busclk/32 is recommended.
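For instance, a sketch of that computation (integer math truncates toward a shorter, safer CS low time; the 250MHz example bus clock is illustrative):

```c
#include <stdint.h>

/* CS_TCEM = tCSM / busclk_period / 32.
 * e.g. with a 250MHz bus clock (4ns period):
 *   below 85°C: 4000ns / 4ns / 32 = 31
 *   above 85°C: 1000ns / 4ns / 32 = 7  */
static uint32_t cs_tcem_value(uint32_t bus_clk_mhz, uint32_t tcsm_ns)
{
	uint32_t tcsm_cycles = (tcsm_ns * bus_clk_mhz) / 1000u; /* tCSM in bus clocks */
	return tcsm_cycles / 32u;
}
```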
RTL8730E does not need this setting.