Monday, September 14, 2015

Haskell and Hardware

Trying to reach the smallest possible audience, I thought I'd try and combine two of my current favorite niche things, FPGAs and Haskell. This post isn't really intended to be an introduction to anything, but rather something to whet the appetite of the roughly 12 people or so interested in both functional languages and FPGAs.

I've been really enjoying messing around with FPGAs for a few months now. I have a specific project in mind, but after I taught myself VHDL and Verilog I got somewhat sidetracked in finding a better way to work with them. VHDL is a horribly verbose language based loosely on Ada's syntax; Verilog is a language based on C syntax, roughly 80% of which can't actually be synthesized into hardware. Neither has anything even as basic as structures to work with.

My first thought was that functional languages seemed like a much better model in which to design hardware. I didn't think much more of it until a friend pointed me at this talk by Conal Elliott. I tracked down the GitHub repository and failed to get it to build, exchanged some email with Conal, and looked at some alternatives.

Apparently I'm not the only person to ever consider functional languages for hardware design. On the Haskell side there are several variants of a DSL called Lava, Conal's work, and Clash; there is a Scala-based DSL called Chisel; and there is the commercial HDL Bluespec, which I'm told by a friend is heavily influenced by Haskell.

I was unable to get Conal's work compiling and he told me that development was on hiatus, and Bluespec is commercial only and expensive, so I ended up looking at Kansas Lava and Clash. Both are Haskell based and contain similar abstractions, but they take very different approaches: where Kansas Lava relies heavily on an embedded DSL, Clash tries to compile Haskell code directly into hardware. The use of standard Haskell syntax was the primary reason I ended up preferring the latter.

There are a fair number of non-trivial Lava examples floating around, including this 6502 implementation, but far fewer Clash examples. I was looking for a manageable project to improve my understanding of Clash, so I thought I'd try to implement a 6502 in Clash.

As my first reasonably sized Clash project, my solution is likely far from optimal or even good, but I think there are several things that stand out.
  • The readability of the core state machine
  • How little of the code is aware it's targeted at hardware
  • How easy it is to simulate the design by running the Haskell code
In typical "I'm hacking this together" fashion I didn't really think through the design before I started coding, and I changed my mind several times during development on exactly how certain aspects of the processor would be implemented. Being able to quickly "hack" in these changes was really nice, and I don't think it's something I could have done as easily in a Verilog or VHDL implementation.

The implementation is naive; the goal here isn't to be cycle accurate or even efficient, it's about understanding what's involved in building a "simple" but "real" processor in an FPGA using Clash. It would be an interesting exercise to try to build an efficient implementation of a 6502 in Clash, but this isn't it.

I made some initial guesses at what state would need to be carried around, and in the end it didn't change much. I started by coding the startup state: on boot the 6502 reads the PC from 0xFFFC and continues execution from there. This requires 3 states: an initial state where the initial address is set on the address bus, and two more where we read the low and high PC address bytes. The latter two states will be reused for the RTS and RTI instructions later, with some slight changes to the original implementation.
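
For reference, the snippets below use Byte, Addr and a resetVector constant that don't appear in this post; the definitions are along these lines (the aliases are an assumption, 0xFFFC is the 6502's documented reset vector):

-- 8-bit data and 16-bit addresses, plus the 6502 reset vector at 0xFFFC.
-- (Assumed definitions, not copied from the repository.)
type Byte = Unsigned 8
type Addr = Unsigned 16

resetVector :: Addr
resetVector = 0xFFFC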

Clash has a mealy function that "lifts" a function into a state machine; the basic signature is

(s -> i -> (s, o)) -> s -> Signal i -> Signal o 

Where s is some state, i is some input, and o is some output.

You write a function that takes a state and an input and returns a new state and an output, provide an initial state, and you get hardware that on every clock maps an input Signal to an output Signal. The big design decision is what the State, the Input and the Output should be; after a toy sketch of the mealy pattern below, the definitions that follow are what I started with.
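
To get a feel for the pattern, here's a toy accumulator (a minimal sketch, not part of the 6502 design; the import assumes the 2015-era module name, newer releases use Clash.Prelude and add a clock-domain parameter to Signal):

import CLaSH.Prelude

-- Pure transition function: add the input to the running total and
-- output the new total.
acc :: Unsigned 8 -> Unsigned 8 -> (Unsigned 8, Unsigned 8)
acc total input = (total', total')
  where total' = total + input

-- mealy turns the pure function into a clocked circuit, starting from 0.
accA :: Signal (Unsigned 8) -> Signal (Unsigned 8)
accA = mealy acc 0

-- In GHCi, sampleN 5 (accA (fromList [1,2,3,4,5])) should give [1,3,6,10,15].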

data CpuIn = CpuIn 
  { dataIn :: Byte
  } deriving (Show)

data CpuOut = CpuOut 
  { dataOut :: Byte
  , addr :: Addr
  , writeEn :: Bool
  } deriving (Show)

data State =  Halt
            | Init
            | FetchPCL
            | FetchPCH 
            | Fetch1
            | Fetch2
            | Fetch3
            | Write1
            | FetchAddr1
            deriving (Show)

data CpuProbes = CpuProbes
  { prStateIn :: State
  , prState :: State
  , prPC :: Addr
  , prA :: Byte
  , prX :: Byte
  , prY :: Byte
  , prFlags :: Byte
  , prAddr :: Addr 
  , prDIn :: Byte
  }

data CpuRegisters = CpuRegisters
  { aReg :: Byte
  , xReg :: Byte
  , yReg :: Byte
  , pReg :: Byte
  , spReg :: Byte
  , pcReg :: Addr
  , addrReg :: Addr   -- Used for Indirect addressing 
  , decoded :: DecodedInst -- current decoded instruction
  } deriving (Show)

type CpuState = (State, CpuRegisters)


My input is CpuIn, which right now is just the byte returned by the memory that clock; the state is CpuState; and the output is a tuple of CpuOut (combining the dataOut byte, the address lines and the write enable flag) and the CpuProbes structure, which is used during simulation to report the state of the system.

The initial 3 states were simple enough. Initially I'd have any unimplemented state transition to a Halt state which just stopped the processor, so I could easily validate things behaved as anticipated (a sketch of such a catch-all clause appears after the code below).

cpu :: CpuState -> CpuIn -> (CpuState, (CpuOut, CpuProbes))
-- Initial CPU State, just sets up the read address to the resetVector and initiates the read
cpu (Init, reg@CpuRegisters{..}) CpuIn{..} = ((st', reg'), (cpuOut, cpuProbes)) where
  st' = FetchPCL
  reg' = reg {addrReg = resetVector}
  cpuOut = CpuOut {dataOut = 0 , addr = resetVector, writeEn = False}
  cpuProbes = probes Init st' reg' dataIn

-- Two states to Fetch the 16 bit PC from memory
cpu (FetchPCL, reg@CpuRegisters{..}) CpuIn{..} = ((st', reg'), (cpuOut, cpuProbes)) where 
  st' = FetchPCH
  pc' = resize dataIn
  addr' = addrReg + 1
  reg' = reg {pcReg = pc'}
  cpuOut = CpuOut {dataOut = 0 , addr = addr', writeEn = False}
  cpuProbes = probes FetchPCL st' reg' dataIn

cpu (FetchPCH, reg@CpuRegisters{..}) CpuIn{..} = ((st', reg'), (cpuOut, cpuProbes)) where 
  st' = Fetch1
  pc' = (pcReg .&. 0xff) .|. ((resize dataIn) `shiftL` 8) 
  -- TODO RTS instruction requires incrementing the PC on return
  reg' = reg {pcReg = pc'}
  cpuOut = CpuOut {dataOut = 0 , addr = pc', writeEn = False}
  cpuProbes = probes FetchPCH st' reg' dataIn
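
A catch-all clause along these lines is enough to park the processor in Halt for anything not yet implemented (this is a sketch rather than the exact clause from the repository):

-- Catch-all: anything without an explicit transition drops into Halt and
-- stays there, holding the address bus at the PC with writes disabled.
cpu (st, reg@CpuRegisters{..}) CpuIn{..} = ((Halt, reg), (cpuOut, cpuProbes)) where
  cpuOut    = CpuOut {dataOut = 0, addr = pcReg, writeEn = False}
  cpuProbes = probes st Halt reg dataIn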


The probes function is just a helper to copy register state etc. into the CpuProbes output.

probes :: State -> State -> CpuRegisters -> Byte -> CpuProbes
probes stIn st CpuRegisters{..} d = CpuProbes stIn st pcReg aReg xReg yReg pReg addrReg d

I needed to lift this transition function into a Mealy-style state machine:

initialProcessorRegisters :: CpuRegisters
initialProcessorRegisters = CpuRegisters 0xaa 0 0 0x02 0xfd 0x00ff 0 decodedNop

type CpuState = (State, CpuRegisters)
initialCpuState :: CpuState
initialCpuState = (Init, initialProcessorRegisters)

cpuA = cpu `mealy` initialCpuState

And provide a driver. My general rule in development is to keep things as simple as possible, so initially I had no RAM in the system and I didn't care about the output. I created the system with the dataIn byte always set to A9, the LDA immediate opcode, so it was enough to see the "processor" start executing at address A9A9 and load the A register with A9. Doing this let me get something running, and once that part at least was functional I could worry about some sort of basic system which would include RAM/ROM.
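
A test bench along these lines is all that first experiment needs (a sketch; the name system is arbitrary, and printing the probes assumes a Show instance for CpuProbes):

-- Minimal "no memory" test bench: feed the CPU a constant 0xA9
-- (LDA immediate) every cycle and watch only the probe output.
system :: Signal CpuProbes
system = snd <$> cpuA cpuIn where
  cpuIn = pure (CpuIn { dataIn = 0xA9 })

-- In GHCi, something like mapM_ print (sampleN 20 system) shows the
-- processor set its PC to A9A9 and then load A with A9.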

Once I had that running I used BRAM to add 64K of memory to the system, and started building test images as I added addressing modes and instructions.
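
The memory hookup is just a feedback loop: the CPU's address, data-out and write-enable drive the RAM, and the RAM's registered read data comes back as the next cycle's dataIn. A sketch of that wiring, with memory64k standing in for the block-RAM-based component (its signature here is an assumption; Clash's block RAM API has changed between versions):

-- Sketch of the CPU/memory loop; memory64k is a placeholder whose
-- registered read output is what breaks the combinational cycle.
memory64k :: Signal Addr -> Signal Byte -> Signal Bool -> Signal Byte
memory64k = undefined  -- stands in for a blockRam-backed 64K memory

system64k :: Signal CpuProbes
system64k = snd <$> out where
  out    = cpuA (CpuIn <$> ramOut)
  bus    = fst <$> out
  ramOut = memory64k (addr <$> bus) (dataOut <$> bus) (writeEn <$> bus)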

The "working" code is here, it builds for the Papilio Pro FPGA board using the Xilinx tools. However since I use a Mac and the Xilinx tools run only on Windows or Linux, the makefile launches the tools in a VM via SSH and is unlikely to work for anyone else without modification.

Not everything went smoothly: about halfway through the implementation I ran into an issue with exponential runtime in the Clash compiler on large case statements. Clash's maintainer (Christiaan Baaij) gave me a workaround and fixed the bug shortly afterwards.

The one thing that never failed to work was that when Clash generated hardware, it always behaved the same as the Haskell implementation. Initially I would generate and verify on the FPGA all the time, but since it always just worked I did it less and less frequently, relying on testing against the Haskell implementation.

Once I had what I thought was a complete implementation I did a Google search for a 6502 functional test, and came across this, which I modified slightly and ran using the Haskell implementation. It took a couple of days to track down and fix all of the issues it exposed. Beyond the usual brain-dead bugs and missed edge cases, a couple of issues turned out to be badly documented behavior in the reference I had been using. Luckily the internet contains many references, and with some googling establishing the correct behavior was simple enough. In every case where the documentation and the functional test diverged, the test was correct. This was easily verified using this (very cool in and of itself).

The ability to run this test using the "native" Haskell implementation was invaluable. Since I'm logging all of the processor state every cycle, the test, which completes in 75 million cycles (my 6502 uses fewer cycles than a real processor), takes over an hour to run and generates over 4GB of logs. But there is no way I could have run anything like as comprehensive a test suite in the Xilinx simulation tools, especially not the crippled free version. The FPGA implementation, clocked at 50MHz, completes the test in under 2 seconds; but had I run the test only on the FPGA, I'd have been able to isolate the failing instruction, not the specific input that caused it to fail, and having the latter made tracking down the issues with BCD arithmetic so much easier.





2 comments:

  1. Please note, there are now 13 people interested in this! I'm interested in discovering what F# might have to offer.

  2. This comment has been removed by the author.
