DMA

From NESdev Wiki

The 2A03 contains a pair of DMA units, one for copying sprite data to PPU OAM and the other for copying DPCM sample data to the APU's DMC sample buffer. DMA is required for DPCM playback, and it is difficult to fill OAM without DMA. Unfortunately, DMC DMA can also result in data loss when reading registers with side effects, such as the joypads.

Summary

  • The CPU alternates between cycles on which DMA can get (read) and cycles on which DMA can put (write). These are the first and second halves of APU cycles, respectively. At power-on, whether the first CPU cycle is get or put is random.
  • If DMA tries to get on a put cycle, it waits and tries again next cycle. This wait is called an alignment cycle.
  • DMA can only halt on CPU read cycles. On write cycles, the halt fails and the DMA unit tries again next CPU cycle, repeating until successful.
  • OAM DMA halts the CPU, performs an optional alignment cycle, and then gets and puts 256 times, taking 513 or 514 cycles. It attempts to halt on the first CPU cycle after the $4014 write.
  • DMC DMA halts the CPU, performs a dummy cycle and an optional alignment cycle, and then gets once, taking 3 or 4 cycles.
  • The first, "load" DMC DMA after the $4015 write attempts to halt on the get cycle during the 2nd following APU cycle. The other, "reload" DMC DMAs attempt to halt on a put cycle. Failed halts try again on the next CPU cycle, repeating until successful.
  • The DMC DMA get takes precedence over OAM DMA get, delaying it, but the DMA cycles otherwise overlap, reducing the total cycle cost. This delay can force OAM DMA to add an alignment cycle.
  • DMC DMA has bugs triggered by stopping the sample at specific times.
  • DMC DMA can corrupt joypad and PPUDATA ($2007) reads and cause extraneous reads of $4015-4017.

Cadence

The DMA units cannot just read or write on any cycle as they wish. Instead, they alternate between get cycles where they can read and put cycles where they can write. If they need to perform an action that is not permitted on the current cycle, they wait.

Get and put cycles are aligned to the first and second halves of the APU clock, respectively (called apu_clk1 and apu_clk2 in Visual2A03). While these cycles are sometimes described as even and odd CPU cycles, this is not accurate because the CPU and APU randomly power into either of 2 alignments relative to each other. Therefore, get and put may occur on different CPU cycle parities across different power cycles.
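
The cadence can be sketched as follows. This is a minimal model, not taken from any emulator: the names `is_get_cycle`, `wait_cycles_before_get`, and `first_cycle_is_get` are illustrative, and the power-on phase is passed in as a parameter because it is random on hardware.

```python
def is_get_cycle(cpu_cycle: int, first_cycle_is_get: bool) -> bool:
    """Return True if the given CPU cycle (0-based) is a get cycle."""
    # Get and put strictly alternate; only the starting phase differs
    # between power cycles.
    return (cpu_cycle % 2 == 0) == first_cycle_is_get


def wait_cycles_before_get(cpu_cycle: int, first_cycle_is_get: bool) -> int:
    """Number of cycles a DMA unit waits before it is allowed to read."""
    # A read attempted on a put cycle is deferred by exactly one cycle.
    return 0 if is_get_cycle(cpu_cycle, first_cycle_is_get) else 1
```

The same CPU cycle number can therefore be a get cycle after one power-on and a put cycle after another, which is why CPU cycle parity alone is not a reliable description.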

Behavior

During the DMA process, the CPU is halted and the 2A03's address and data lines are used for data transfer. The process involves a combination of no-operation cycles and access cycles. No-operation cycles come in 3 equivalent types: the halt cycle, a DMC-only dummy cycle, and an optional alignment cycle.

When DMA is scheduled, the associated DMA unit attempts to halt the CPU. The CPU only allows this on read cycles. If the CPU is writing, it ignores the halt and the DMA unit waits until the next cycle to try again, repeating until successful. Delays of up to 3 cycles are possible, with read-modify-write instructions having 2 consecutive writes and interrupts having 3. The halting process itself takes 1 CPU cycle, during which no useful work is done.

Once the CPU is halted, the DMA unit may need to perform some amount of non-access setup, taking up to 2 cycles. The exact timing of when DMA is scheduled and what kind of setup it needs depends on the type of DMA.

The CPU is halted using its internal RDY input. When RDY is deasserted, the 6502 core repeats the last read cycle indefinitely, making no forward progress nor handling interrupts. On 2A03 CPUs, these repeated reads are externally visible on any no-operation DMA cycle, causing data loss if reading a register with side effects. On 2A07 CPUs, it is suspected that a different address (perhaps the DMA address) is on the bus instead during all no-operation cycles. See Register conflicts for more information.

When the DMA process completes, the CPU performs the read it attempted when halted.

Examples - General behavior
  • DMA halts normally:
(halted) CPU reads from address A  <- DMA halt cycle
(halted) [DMA occurs]
         CPU reads from address A  <- CPU resumes execution
  • DMA halt is delayed by writes:
         CPU writes                <- DMA attempts to halt
         CPU writes                <- DMA attempts to halt
(halted) CPU reads from address A  <- DMA halt cycle
(halted) [DMA occurs]
         CPU reads from address A  <- CPU resumes execution
  • DMA has a non-access cycle:
(halted) CPU reads from address A  <- DMA halt cycle
(halted) CPU reads from address A  <- DMA does not read or write
(halted) DMA accesses address B
         CPU reads from address A  <- CPU resumes execution
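
The halt rule shown in the examples above can be sketched as a small search over CPU cycles. The function name `find_halt_cycle` and the `'R'`/`'W'` encoding are assumptions made for illustration only.

```python
from typing import Sequence


def find_halt_cycle(cycles: Sequence[str], first_attempt: int) -> int:
    """Return the index of the cycle on which the halt succeeds.

    cycles: one entry per CPU cycle, 'R' for a read and 'W' for a write.
    first_attempt: index of the cycle on which the DMA unit first tries to halt.
    """
    t = first_attempt
    # The CPU ignores the halt on write cycles, so the DMA unit retries
    # on each following cycle until it lands on a read.
    while cycles[t] == "W":
        t += 1
    return t
```

For instance, `find_halt_cycle("WWR", 0)` returns 2, matching the "delayed by writes" example. The sketch raises `IndexError` if the trace ends before a read cycle is found.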

OAM DMA


OAM DMA copies 256 bytes from a CPU page to PPU OAM via the OAMDATA ($2004) register. It is triggered by writing the page number (the high byte of the address) to OAMDMA ($4014). OAM DMA is scheduled to halt the CPU on the first cycle after the register write. In the common case, it performs a halt cycle, an optional alignment cycle, and 256 get/put pairs.

The 256 get/put pairs copy forward from the start of the page. Because DMA can only read on get cycles, an alignment cycle performing no useful work may be required before being able to read. All together, OAM DMA on its own takes 513 or 514 cycles, depending on whether alignment is needed.

OAM DMA will copy from the page most recently written to $4014. This means that read-modify-write instructions such as INC $4014, which are able to perform a second write before the CPU can be halted, will copy from the second page written, not the first.

OAM DMA has a lower priority than DMC DMA. If a DMC DMA get occurs during OAM DMA, OAM DMA is briefly paused. (See DMC DMA during OAM DMA)

Examples - OAM DMA
  • Alignment is not needed:
         (get) CPU writes to $4014
(halted) (put) CPU reads from address A  <- DMA halt cycle
(halted) (get) DMA reads from $xx00
(halted) (put) DMA writes to $2004
(halted) (get) DMA reads from $xx01
(halted) (put) DMA writes to $2004
             ...
(halted) (get) DMA reads from $xxFF
(halted) (put) DMA writes to $2004       <- DMA completes
         (get) CPU reads from address A  <- CPU resumes execution
  • Alignment is needed:
         (put) CPU writes to $4014
(halted) (get) CPU reads from address A  <- DMA halt cycle
(halted) (put) CPU reads from address A  <- DMA alignment cycle
(halted) (get) DMA reads from $xx00
(halted) (put) DMA writes to $2004
(halted) (get) DMA reads from $xx01
(halted) (put) DMA writes to $2004
             ...
(halted) (get) DMA reads from $xxFF
(halted) (put) DMA writes to $2004       <- DMA completes
         (get) CPU reads from address A  <- CPU resumes execution
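
The 513 versus 514 cycle count follows directly from the phase of the halt cycle. The sketch below is a simplified model of OAM DMA on its own (no DMC DMA); `oam_dma_cycles` and `oam_dma_source_addresses` are hypothetical helper names.

```python
from typing import List


def oam_dma_cycles(halt_on_get: bool) -> int:
    """Cycles the CPU spends halted for one OAM DMA with no DMC DMA."""
    halt = 1
    # If the halt cycle is a get cycle, the next cycle is a put cycle,
    # so one alignment cycle is needed before the first read.
    alignment = 1 if halt_on_get else 0
    transfers = 256 * 2  # 256 get/put pairs
    return halt + alignment + transfers


def oam_dma_source_addresses(page: int) -> List[int]:
    """Addresses read by OAM DMA for the page number written to $4014."""
    # The copy runs forward from the start of the page.
    return [(page << 8) | offset for offset in range(256)]
```

With these definitions, a $4014 write on a get cycle (halt on put) gives 513 cycles, and a write on a put cycle (halt on get) gives 514.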

DMC DMA


DMC DMA copies a single byte to the DMC unit's sample buffer. This occurs automatically after the DMC enable bit, bit 4, of the sound channel enable register ($4015) is set to 1, which starts DPCM sample playback using the current DMC settings in registers $4010-4013. DMC DMA is scheduled when all of the following hold: DPCM playback is enabled, there are bytes left in the sample, and the sample buffer is empty (see Memory reader and Output unit). In the common cases, DMC DMA performs a halt cycle, a dummy cycle, an optional alignment cycle, and a get.

The exact timing depends on the type of DMC DMA. There are two types: load and reload. Load DMAs occur after $4015 D4 is set, but only if the sample buffer is empty. They are scheduled to halt the CPU on a get cycle during the 2nd APU cycle after the write (that is, the 3rd or 4th CPU cycle). Reload DMAs occur in response to the sample buffer being emptied. Unlike load DMAs, they are scheduled to halt the CPU on a put cycle.

After the halt, DMC DMA always performs a dummy cycle where no work is done. If the next cycle is not a get cycle, then a cycle will be spent on alignment. Then the DMA read is performed.

DMC DMA normally takes 3 or 4 cycles, depending on whether alignment is needed. Because load and reload DMAs schedule on different cycle types, load DMAs take 3 cycles and reload DMAs take 4 unless the halt is delayed by an odd number of cycles. However, bugs can cause additional cycles; see Bugs below.

Examples - DMC DMA
  • Load DMA:
         (get) \ CPU writes to $4015     <- DMC enabled 
         (put) / during this APU cycle   <- DMC enabled
         (get) CPU reads
         (put) CPU reads
(halted) (get) CPU reads from address A  <- DMA halt cycle
(halted) (put) CPU reads from address A  <- DMA dummy cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A  <- CPU resumes execution
  • Load DMA (delayed 1 cycle):
         (get) \ CPU writes to $4015     <- DMC enabled 
         (put) / during this APU cycle   <- DMC enabled
         (get) CPU reads
         (put) CPU reads
         (get) CPU writes                <- DMA attempts to halt
(halted) (put) CPU reads from address A  <- DMA halt cycle
(halted) (get) CPU reads from address A  <- DMA dummy cycle
(halted) (put) CPU reads from address A  <- DMA alignment cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A  <- CPU resumes execution
  • Reload DMA:
(halted) (put) CPU reads from address A  <- DMA halt cycle
(halted) (get) CPU reads from address A  <- DMA dummy cycle
(halted) (put) CPU reads from address A  <- DMA alignment cycle
(halted) (get) DMA reads from address B 
         (put) CPU reads from address A  <- CPU resumes execution
  • Reload DMA (delayed 1 cycle):
         (put) CPU writes                <- DMA attempts to halt
(halted) (get) CPU reads from address A  <- DMA halt cycle
(halted) (put) CPU reads from address A  <- DMA dummy cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A  <- CPU resumes execution
  • Reload DMA (delayed 2 cycles):
         (put) CPU writes                <- DMA attempts to halt
         (get) CPU writes                <- DMA attempts to halt
(halted) (put) CPU reads from address A  <- DMA halt cycle
(halted) (get) CPU reads from address A  <- DMA dummy cycle
(halted) (put) CPU reads from address A  <- DMA alignment cycle
(halted) (get) DMA reads from address B 
         (put) CPU reads from address A  <- CPU resumes execution
  • Reload DMA (delayed 3 cycles):
         (put) CPU writes                <- DMA attempts to halt
         (get) CPU writes                <- DMA attempts to halt
         (put) CPU writes                <- DMA attempts to halt
(halted) (get) CPU reads from address A  <- DMA halt cycle
(halted) (put) CPU reads from address A  <- DMA dummy cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A  <- CPU resumes execution
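
The examples above reduce to a simple parity rule, sketched below. This ignores the bugs described in the next section; the name `dmc_dma_cycles` and its parameters are illustrative assumptions.

```python
def dmc_dma_cycles(scheduled_on_put: bool, write_delay: int = 0) -> int:
    """Cycles the CPU spends halted for one DMC DMA (bugs not modeled).

    scheduled_on_put: False for a load DMA (scheduled on a get cycle),
                      True for a reload DMA (scheduled on a put cycle).
    write_delay: number of CPU write cycles that postponed the halt.
    """
    # Each failed halt attempt moves the halt to the opposite phase.
    halt_on_put = scheduled_on_put ^ (write_delay % 2 == 1)
    halt, dummy, get = 1, 1, 1
    # After a put-phase halt, the dummy cycle lands on get and the cycle
    # after it is put, so one alignment cycle precedes the read.
    alignment = 1 if halt_on_put else 0
    return halt + dummy + alignment + get
```

Under this model an undelayed load DMA takes 3 cycles and an undelayed reload DMA takes 4, and a delay of 1 or 3 write cycles swaps the two.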

Bugs

DMC DMA suffers from two bugs[1][2] related to sample playback stopping around the time a DMC output cycle ends, which is what empties the sample buffer and triggers a reload DMA. This can happen explicitly, where the sample is stopped by clearing $4015 D4, or implicitly, where a non-looping 1-byte sample is started while the sample buffer is empty shortly before a reload DMA would schedule. Implicit stops require this type of sample because the load DMA for the first byte when the sample buffer is empty is the only way to implicitly end the sample just before a reload DMA; longer samples will instead be ended by the reload DMA itself, as normal.

When sample playback is stopped during the APU cycle before a reload DMA would schedule (that is, on the 2nd or 3rd CPU cycle before the halt attempt), the DMA starts, but is aborted after a single cycle. If the halt is delayed due to a write cycle, the aborted DMA doesn't occur at all. This aborted DMA schedules regardless of how playback was stopped, whether explicitly or implicitly. In the implicit case, the write to begin the sample will normally be the 4th APU cycle before the reload DMA would schedule (8th or 9th CPU cycle before).

On RP2A03H and late RP2A03G CPUs, when playback is stopped implicitly on the same APU cycle that a reload DMA would schedule (that is, the 1st CPU cycle before the halt attempt), an unexpected reload DMA occurs from the same address. This extra byte goes into the sample buffer and is played after the first byte, as with any normal fetch. On RP2A03G CPUs, this bug was introduced sometime in 1990; earlier chips are unaffected.

It is not known whether 2A07 CPUs are affected by these bugs. Some clone hardware is known to be affected, and behavior on affected clones may differ from official CPUs. For example, UA6527P-based clones feature both the aborted DMA and unexpected DMA bugs, but samples take 1 APU cycle longer to end than on official CPUs, so to trigger these bugs with implicit stops, the sample-ending byte must be fetched 1 APU cycle earlier.

Examples - DMC DMA bugs
  • Explicit-stop aborted DMA:
         (get) \ CPU writes to $4015     <- DMC disabled
         (put) / during this APU cycle   <- DMC disabled
         (get) CPU reads
(halted) (put) CPU reads from address A  <- DMA halt cycle
         (get) CPU reads from address A  <- CPU resumes execution
  • Implicit-stop aborted DMA:
         (get) \ CPU writes to $4015     <- DMC enabled w/ buffer empty
         (put) / during this APU cycle   <- DMC enabled w/ buffer empty
         (get) CPU reads
         (put) CPU reads
(halted) (get) CPU reads from address A  <- DMA halt cycle
(halted) (put) CPU reads from address A  <- DMA dummy cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A  <- CPU resumes execution
         (get) CPU reads
(halted) (put) CPU reads from address C  <- DMA halt cycle
         (get) CPU reads from address C  <- CPU resumes execution
  • Implicit-stop unexpected DMA:
         (get) \ CPU writes to $4015     <- DMC enabled w/ buffer empty
         (put) / during this APU cycle   <- DMC enabled w/ buffer empty
         (get) CPU reads
         (put) CPU reads
(halted) (get) CPU reads from address A  <- DMA halt cycle
(halted) (put) CPU reads from address A  <- DMA dummy cycle
(halted) (get) DMA reads from address B
(halted) (put) CPU reads from address A  <- DMA halt cycle
(halted) (get) CPU reads from address A  <- DMA dummy cycle
(halted) (put) CPU reads from address A  <- DMA alignment cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A  <- CPU resumes execution

The implicit-stop aborted DMA can be prevented with carefully placed write cycles. This can be necessary for synchronized code, where the aborted DMA's odd number of cycles can invert cycle parity. The following code synchronizes to a put cycle and uses precise write cycles to prevent any aborted DMA that may occur:

STx $4015  ; Initiate DMC DMA.
STx zp     ; Force load DMA to the 4th cycle.
           ; (If on a UA6527P-based clone, place a NOP here.)
STx zp     ; Override the aborted DMA.
; The first cycle of the next instruction is a put cycle.

Examples - aborted DMA workaround
  • Implicit-stop aborted DMA bypass (write on get):
         (get) CPU writes to $4015       <- DMC enabled w/ buffer empty
         (put) CPU reads
         (get) CPU reads
         (put) CPU writes
(halted) (get) CPU reads from address A  <- DMA halt cycle
(halted) (put) CPU reads from address A  <- DMA dummy cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A  <- CPU resumes execution
         (get) CPU reads
         (put) CPU writes                <- DMA attempts to halt
         (get) CPU reads
  • Implicit-stop aborted DMA bypass (write on put):
         (put) CPU writes to $4015       <- DMC enabled w/ buffer empty
         (get) CPU reads
         (put) CPU reads
         (get) CPU writes                <- DMA attempts to halt
(halted) (put) CPU reads from address A  <- DMA halt cycle
(halted) (get) CPU reads from address A  <- DMA dummy cycle
(halted) (put) CPU reads from address A  <- DMA alignment cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from address A  <- CPU resumes execution
         (get) CPU reads
         (put) CPU writes                <- DMA attempts to halt
         (get) CPU reads

DMC DMA during OAM DMA

DMC and OAM use independent DMA units that only interact when both attempt to access memory on the same cycle. When accesses collide, DMC DMA is allowed to run and OAM DMA is paused, trying again on the next cycle. This can cause OAM DMA to have to perform an additional alignment cycle before continuing. No-operation cycles are allowed to overlap with each other and with access cycles, allowing cycles to be saved.

In the common case, DMC DMA occurring during OAM DMA will cost only 2 cycles: 1 cycle for the DMC DMA get and then 1 cycle for OAM DMA to align back to a get. However, if DMC DMA occurs at the end of OAM DMA, it can take 1 or 3 cycles. If it schedules for the second-to-last put, its get will occur on the first cycle after OAM DMA, taking just 1 cycle total. If it schedules for the last put, it will instead extend 3 cycles beyond the end of OAM DMA.

OAM DMA is sometimes used to synchronize code to avoid conflicts with DMC DMA when reading hardware registers, but because DMC DMA takes an odd number of cycles if it lands at the end, synchronization is not guaranteed.

Examples - DMC DMA during OAM DMA
  • DMC DMA at the start of OAM DMA (write on get), taking 2 cycles
         (get) CPU writes to $4014
(halted) (put) CPU reads from address A      <- DMC and OAM DMA halt cycle
(halted) (get) OAM DMA reads from $xx00      <- DMC DMA dummy cycle
(halted) (put) OAM DMA writes to $2004       <- DMC DMA alignment cycle
(halted) (get) DMC DMA reads from address B
(halted) (put) CPU reads from address A      <- OAM DMA alignment cycle
(halted) (get) OAM DMA reads from $xx01
(halted) (put) OAM DMA writes to $2004
             ...
  • DMC DMA at the start of OAM DMA (write on put), taking 2 cycles
         (put) CPU writes to $4014           <- DMC DMA attempts to halt
(halted) (get) CPU reads from address A      <- DMC and OAM DMA halt cycle
(halted) (put) CPU reads from address A      <- DMC DMA dummy cycle, OAM DMA alignment cycle
(halted) (get) DMC DMA reads from address B
(halted) (put) CPU reads from address A      <- OAM DMA alignment cycle
(halted) (get) OAM DMA reads from $xx00
(halted) (put) OAM DMA writes to $2004
             ...
  • DMC DMA in the middle of OAM DMA, taking 2 cycles
             ...
(halted) (get) OAM DMA reads from address C
(halted) (put) OAM DMA writes to $2004         <- DMC DMA halt cycle
(halted) (get) OAM DMA reads from address C+1  <- DMC DMA dummy cycle
(halted) (put) OAM DMA writes to $2004         <- DMC DMA alignment cycle
(halted) (get) DMC DMA reads from address B
(halted) (put) CPU reads from address A        <- OAM DMA alignment cycle
(halted) (get) OAM DMA reads from address C+2
             ...
  • DMC DMA on second-to-last OAM DMA put, taking 1 cycle
             ...
(halted) (get) OAM DMA reads from $xxFE
(halted) (put) OAM DMA writes to $2004       <- DMC DMA halt cycle
(halted) (get) OAM DMA reads from $xxFF      <- DMC DMA dummy cycle
(halted) (put) OAM DMA writes to $2004       <- DMC DMA alignment cycle
(halted) (get) DMC DMA reads from address B
         (put) CPU reads from address A      <- CPU resumes execution
  • DMC DMA on last OAM DMA put, taking 3 cycles
             ...
(halted) (get) OAM DMA reads from $xxFF
(halted) (put) OAM DMA writes to $2004       <- DMC DMA halt cycle
(halted) (get) CPU reads from address A      <- DMC DMA dummy cycle
(halted) (put) CPU reads from address A      <- DMC DMA alignment cycle
(halted) (get) DMC DMA reads from address B
         (put) CPU reads from address A      <- CPU resumes execution
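
The interaction in these examples can be reproduced with a small cycle-by-cycle model. This is a sketch under stated assumptions, not a description of the hardware's internal logic: it handles a single DMC DMA whose halt cycle falls inside the OAM DMA, takes the cycle index of that halt as an input (`dmc_halt_at`, counted from the OAM DMA halt cycle at `t = 0`), and ignores the DMC DMA bugs. All names are hypothetical.

```python
from typing import Optional


def oam_dma_halted_cycles(halt_on_put: bool,
                          dmc_halt_at: Optional[int] = None) -> int:
    """Count halted CPU cycles for one OAM DMA, optionally with one DMC DMA."""

    def is_get(t: int) -> bool:
        # Cycles alternate; t = 0 is a put cycle when halt_on_put is True.
        return (t % 2 == 1) if halt_on_put else (t % 2 == 0)

    last_oam_cycle = 512 if halt_on_put else 513
    if dmc_halt_at is not None and not 0 <= dmc_halt_at <= last_oam_cycle:
        raise ValueError("this sketch only models a DMC halt inside OAM DMA")

    # DMC DMA: halt cycle, dummy cycle, then a read on the first get cycle.
    dmc_read_at = None
    if dmc_halt_at is not None:
        dmc_read_at = dmc_halt_at + 2
        if not is_get(dmc_read_at):
            dmc_read_at += 1  # DMC alignment cycle

    bytes_written = 0
    byte_pending = False
    t = 1  # t = 0 was the OAM DMA halt cycle
    while bytes_written < 256 or (dmc_read_at is not None and t <= dmc_read_at):
        if t == dmc_read_at:
            pass  # the DMC get takes priority; OAM DMA waits this cycle
        elif is_get(t) and not byte_pending and bytes_written < 256:
            byte_pending = True  # OAM DMA get
        elif not is_get(t) and byte_pending:
            byte_pending = False  # OAM DMA put
            bytes_written += 1
        # Any other combination is a no-operation (alignment) cycle.
        t += 1
    return t
```

Under these assumptions, with a 513-cycle baseline (`halt_on_put=True`), a DMC halt at `t = 100` gives 515 cycles (2 extra), a halt on the second-to-last put (`t = 510`) gives 514 (1 extra), and a halt on the last put (`t = 512`) gives 516 (3 extra), in line with the examples.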

Register conflicts

On the 2A03, while the CPU is halted, it repeats the read cycle on which it was halted during every no-operation DMA cycle (that is, when the DMA units are not reading or writing). If the CPU was reading a register with side effects, this can cause data to be lost. While this isn't realistically a problem with OAM DMA because of its particular timing constraints, it is a very real problem with DMC DMA that must usually be worked around. Example registers include PPUSTATUS ($2002), PPUDATA ($2007), and sound status ($4015). When conflicting with DMC DMA, these will see 2 or more extra reads. (Note that $2007 reads on adjacent cycles may have unexpected behavior.)

Most frequently, these DMC DMA conflicts occur with joypad reads, though the mechanism is slightly different. Joypads are clocked via direct lines from the CPU, called joypad 1 /OE and joypad 2 /OE, rather than going over the address bus. These output enables remain asserted the entire CPU cycle and even across adjacent cycles if they're both reading the same joypad register. Therefore, controllers only see a single read for each contiguous set of reads of a joypad register.

The console type affects joypad extra-read behavior. On the RF Famicom, additional hardware outside the 2A03 only passes joypad 1 /OE and joypad 2 /OE through during one half of the clock cycle, meaning the joypad sees a clock on every single CPU read cycle rather than just every contiguous set[3]. The RF Famicom, Twin Famicom, and Famicom Titler are known to behave this way. The AV Famicom and NES-001 are confirmed to use the per-contiguous-set behavior.[4]
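
The difference between the two console behaviors can be sketched by counting the clocks a controller sees over a run of CPU cycles. The function name `joypad_clocks` and the boolean-per-cycle input are assumptions chosen for illustration.

```python
from typing import Sequence


def joypad_clocks(reads_joypad: Sequence[bool], per_cycle: bool) -> int:
    """Count the clocks one controller sees over consecutive CPU cycles.

    reads_joypad: one entry per CPU cycle, True when that cycle reads the
                  joypad register (its /OE line is asserted).
    per_cycle: True for the RF Famicom style (a clock on every read cycle),
               False for the per-contiguous-set style (NES-001, AV Famicom).
    """
    clocks = 0
    previous = False
    for current in reads_joypad:
        # Without the extra gating hardware, /OE stays asserted across
        # adjacent reads, so only the start of each run counts as a clock.
        if current and (per_cycle or not previous):
            clocks += 1
        previous = current
    return clocks
```

For a 4-cycle DMC DMA that interrupts a $4016 read and fetches from an unrelated address, the pattern is `[True, True, True, False, True]`: 2 clocks with per-contiguous-set behavior (1 extra read) and 4 clocks with per-cycle behavior (3 extra reads).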

This is further complicated by esoteric behavior regarding how 2A03 registers are activated: instead of checking the full 2A03 address bus, it checks bits 4-0 from the 2A03 address bus and bits 15-5 from the 6502 core. The 6502 core keeps the same address while halted, so if it was reading from $4000-401F, the DMA address can unintentionally activate 2A03 registers. This can lead to bus conflicts and an extra read from the other joypad, and can even prevent an extra read of the current joypad.[5]
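
The register-select behavior described above amounts to splicing two addresses together. The following sketch shows the bit masks involved; the function names are hypothetical, and the model says nothing about what each activated register then does.

```python
def register_select_address(core_addr: int, dma_addr: int) -> int:
    """Address seen by 2A03 register decoding during a DMA access."""
    # Bits 15-5 come from the halted 6502 core, bits 4-0 from the
    # address the DMA unit places on the 2A03 address bus.
    return (core_addr & 0xFFE0) | (dma_addr & 0x001F)


def activates_2a03_register(core_addr: int, dma_addr: int) -> bool:
    """True if the combined address falls in $4000-401F."""
    return 0x4000 <= register_select_address(core_addr, dma_addr) <= 0x401F
```

For example, with the core halted on a $4016 read, a DMA read from $C017 yields a combined address of $4017, while a core halted on a $2007 read cannot activate any 2A03 register regardless of the DMA address.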

The 2A07 fixes these extra read problems, but the mechanism is not yet understood. Experimentally, the CPU still performs a read on the halting cycle; if OAM DMA is done from the $4000 page, the open bus value used by the OAM DMA is the opcode of the instruction following the $4014 write. This means that 6502 core reads still occur when halted, at least on the first cycle. As on the NTSC 2A03, this DMA also does not trigger 2A07 registers if the 6502 core is not reading from $4000-401F. Unlike on the NTSC 2A03, reading a joypad register when DMA reads an address that matches the other joypad register in its low 5 bits does not clock the other joypad.

Workarounds exist for this issue. Most commonly, joypads are read multiple times until the same result is seen twice in a row, reducing or eliminating the chance of accepting corrupted data. However, any collisions during this time may corrupt the DPCM data, and this strategy is not suitable for all affected registers. Alternatively, reads synchronized using OAM DMA can ensure a collision never happens, but this enforces strict timing constraints on the code and has numerous caveats, particularly for functions longer than one DMC output cycle (sample byte period).
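
The repeated-read workaround can be sketched as a retry loop. On real hardware this would be written in 6502 assembly; the version below only illustrates the control flow, and `read_until_stable`, `read_once`, and `max_attempts` are hypothetical names.

```python
from typing import Callable


def read_until_stable(read_once: Callable[[], int], max_attempts: int = 8) -> int:
    """Re-read until two consecutive reads agree, then return that value.

    read_once: performs one complete joypad read and returns the button byte.
    max_attempts: upper bound on re-reads (an assumption, to avoid looping forever).
    """
    previous = read_once()
    for _ in range(max_attempts):
        current = read_once()
        if current == previous:
            return current  # two matching reads in a row are accepted
        previous = current
    raise RuntimeError("no two consecutive reads agreed")
```

A single corrupted read costs one extra full read in this scheme, since the corrupted value will not match the read that follows it.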

Examples - Register conflicts
  • DMC DMA collides with $2007 read (3 extra reads)
(halted) (put) CPU reads from $2007      <- DMA halt cycle
(halted) (get) CPU reads from $2007      <- DMA dummy cycle
(halted) (put) CPU reads from $2007      <- DMA alignment cycle
(halted) (get) DMA reads from address B
         (put) CPU reads from $2007      <- CPU resumes execution
  • DMC DMA collides with JOYPAD1 ($4016) read (1 or 3 extra reads)
(halted) (put) CPU reads from $4016      <- DMA halt cycle
(halted) (get) CPU reads from $4016      <- DMA dummy cycle
(halted) (put) CPU reads from $4016      <- DMA alignment cycle
(halted) (get) DMA reads from $C000
         (put) CPU reads from $4016      <- CPU resumes execution
  • DMC DMA collides with JOYPAD1 read (0 or 4 extra reads)
    • The combined address from 6502 core bits 15-5 and 2A03 bits 4-0 is $4016.
    • This triggers a bus conflict, corrupting the DMA read.
(halted) (put) CPU reads from $4016      <- DMA halt cycle
(halted) (get) CPU reads from $4016      <- DMA dummy cycle
(halted) (put) CPU reads from $4016      <- DMA alignment cycle
(halted) (get) DMA reads from $C016 and $4016
         (put) CPU reads from $4016      <- CPU resumes execution
  • DMC DMA collides with JOYPAD1 read (1 extra read of JOYPAD2, 1 or 3 extra of JOYPAD1)
    • The combined address from 6502 core bits 15-5 and 2A03 bits 4-0 is $4017.
    • This triggers a bus conflict, corrupting the DMA read.
(halted) (put) CPU reads from $4016      <- DMA halt cycle
(halted) (get) CPU reads from $4016      <- DMA dummy cycle
(halted) (put) CPU reads from $4016      <- DMA alignment cycle
(halted) (get) DMA reads from $C017 and $4017
         (put) CPU reads from $4016      <- CPU resumes execution
  • DMC DMA collides with JOYPAD1 read (1 or 3 extra reads)
    • The combined address from 6502 core bits 15-5 and 2A03 bits 4-0 is $4015.
    • This causes the 2A03 to read $4015 and ignore the DMA value on the external data bus.
(halted) (put) CPU reads from $4016      <- DMA halt cycle
(halted) (get) CPU reads from $4016      <- DMA dummy cycle
(halted) (put) CPU reads from $4016      <- DMA alignment cycle
(halted) (get) DMA reads from $C015, 2A03 reads from $4015 internally
         (put) CPU reads from $4016      <- CPU resumes execution

References

  1. Forum post: Fiskbit's manual DMA test suite
  2. Forum post: Fiskbit's explicit and implicit stop tests
  3. Forum post: lidnariq's RF Famicom joypad clocking explanation
  4. Forum post: Fiskbit's joypad read cycle breakdown
  5. Forum post: Fiskbit's APU register activation test