Hello AFU – Part 2

This is the second part of my Hello AFU tutorial. In the last part we set up the base project and wrote a few scripts that help view signals in ModelSim.

In this part we’ll look at the signals received by the AFU from the PSL and implement the reset handling required by all AFUs.

Resetting the AFU

The first thing the PSL requests of the AFU is to reset to a known good state. This is very easy to implement at this point as we have no internal state! Let’s look at the signals coming in for this.


There are two implemented job control commands documented in the CAPI User’s Manual: START (0x90) and RESET (0x80). Initially and between jobs, the RESET command is sent to the AFU. When given a RESET command, the AFU is expected to reset its internal state and then raise the ah_jdone signal for one cycle.

Tip: All signals that begin with ah_ go from the Accelerator to the Host; signals that begin with ha_ go from the Host to the Accelerator.

The purpose of each signal is documented in the CAPI User’s Manual. For the RESET command, only ha_jval and ha_jcom matter on the receiving side. ha_jcompar should also be set to the odd parity bit for ha_jcom.
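As a rough illustration of what odd parity means here, the sketch below computes a parity bit for the two job commands. The module and function names are made up for this example and are not part of capi.sv, and the 8-bit command width is an assumption based on the two-digit hex values above.

```systemverilog
// Hypothetical stand-alone example; not part of capi.sv or the AFU itself.
module jcom_parity_example;

  // Odd parity: the command bits plus the parity bit should together
  // contain an odd number of 1s. The XNOR reduction (~^) returns 1 when
  // the command holds an even number of 1 bits, and 0 otherwise.
  function automatic logic odd_parity(input logic [7:0] command);
    return ~^command;
  endfunction

  initial begin
    // RESET (0x80) has a single 1 bit, so the parity bit is 0.
    $display("RESET parity: %b", odd_parity(8'h80));
    // START (0x90) has two 1 bits, so the parity bit is 1.
    $display("START parity: %b", odd_parity(8'h90));
  end

endmodule
```

An AFU with parity checking enabled could compare a value computed this way against ha_jcompar; since I’m setting parity_enabled low, I won’t be doing that here.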

Before we handle this signal, let’s fix all the floating signals we’re sending back. They are floating because we aren’t explicitly setting these signals high or low, so let’s set them all low. I’ll set the timebase_request and parity_enabled signals low while I’m at it. The signal names are different in my code, as I’ve given them more verbose names in the capi.sv package file used to abstract these out. I’ve added this to my parity_afu module definition just above the always_ff statement.

assign job_out.running = 0,
       job_out.done = 0,
       job_out.cack = 0,
       job_out.error = 0,
       job_out.yield = 0,
       timebase_request = 0,
       parity_enabled = 0;

Additionally, I will need to uncomment the portion of afu.sv that routes these signals into the AFU.

Also, before testing this in the simulator, I wrote one more DO file, test.do, to make it easier to see what I’m working on in simulation. It prepares the simulation, watches the job interface, then runs for 10 cycles (40 ns at 4 ns per cycle).

vsim work.top
do watch_job_interface.do
run 40

With those changes made I’ll verify the signals are now being driven low.


Driving signals

There are two main ways to drive signals in SystemVerilog: blocking (=) and non-blocking (<=). These can be confusing terms; this page can help clear it up a little bit. Until you’re comfortable with the difference, I suggest you use = only in assign statements where you are driving a signal to a constant value, as we are so far (strictly speaking, the = in an assign statement is a continuous assignment rather than a blocking one). When you use <= in a clocked block, the value will stick in a register, preserving its value until changed later.
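To make the difference a little more concrete, here is a small sketch that pushes the same input through two stages, once with blocking and once with non-blocking assignments. The module and signal names are made up for illustration and are not part of the AFU.

```systemverilog
// Hypothetical example module; not part of the parity AFU.
module assignment_example (
  input  logic clock,
  input  logic in,
  output logic blocking_out,
  output logic nonblocking_out);

  logic stage_b;
  logic stage_nb;

  // Blocking (=): statements take effect in order, so stage_b is already
  // updated when the second line reads it. `in` reaches blocking_out
  // after a single clock edge.
  always @(posedge clock) begin
    stage_b = in;
    blocking_out = stage_b;
  end

  // Non-blocking (<=): every right-hand side is sampled before any
  // left-hand side is updated, so the second line still sees the old
  // stage_nb. `in` takes two clock edges to reach nonblocking_out.
  always_ff @(posedge clock) begin
    stage_nb <= in;
    nonblocking_out <= stage_nb;
  end

endmodule
```

The non-blocking version behaves like two registers in a row, which is usually what you want in clocked logic.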

To send our ah_jdone signal, we need to detect when ha_jval is high while ha_jcom is set to RESET; then we’ll know it’s appropriate to raise ah_jdone.

For signals that we want to change from cycle to cycle, we can put non-blocking assignments in an always_ff block. We first need to remove the job_out.done assignment made with the assign statement, as only one driver can set the signal. Next we’ll add an if statement that drives the done signal high only if a valid reset command is given. We also need to set it low in all other conditions, so we’ll use an else statement to ensure we get that behavior.

always_ff @(posedge clock) begin
  if (job_in.valid && job_in.command == RESET) begin
    job_out.done <= 1;
  end else begin
    job_out.done <= 0;
  end
end


I’ve been advised that there is one thing wrong with this design: the done signal should be sent on the next clock cycle. We need something to help us delay the signal.

Making a shift register

There are a few ways to do this; I’ve elected to use a shift register.

This shift register will pass its input to its output, delaying changes by a single clock cycle. It’s not the most useful shift register but it will do for this purpose.

module shift_register (
  input logic clock,
  input logic in,
  output logic out);

  always_ff @(posedge clock) begin
    out <= in;
  end
endmodule

To use this module in our parity_afu module, we’ll need to create an instance of our shift_register module and a variable to reference its input. In the instance we create we’ll reference the inputs and outputs, then change our job logic to use the new jdone variable instead of the direct output.

logic jdone;

shift_register jdone_shift(
  .clock(clock),
  .in(jdone),
  .out(job_out.done));

Once this is all said and done, we should get the output delayed by a cycle as we desired. Since the shift register is set up with our internal jdone as input and job_out.done as output, this will affect all assignments to jdone.
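For reference, here is a stand-alone sketch of how the pieces fit together. To keep it self-contained I’ve replaced the capi.sv structures with plain ports, so valid, command and done are stand-ins for job_in.valid, job_in.command and job_out.done, and the reset_done_example module name is made up for this example.

```systemverilog
// One-cycle delay element, as described in the text.
module shift_register (
  input  logic clock,
  input  logic in,
  output logic out);

  always_ff @(posedge clock) begin
    out <= in;
  end
endmodule

// Hypothetical wrapper with simplified ports instead of the CAPI structs.
module reset_done_example (
  input  logic       clock,
  input  logic       valid,    // stands in for job_in.valid
  input  logic [7:0] command,  // stands in for job_in.command
  output logic       done);    // stands in for job_out.done

  localparam logic [7:0] RESET = 8'h80;

  logic jdone;

  // The job logic now drives the internal jdone signal
  // instead of the output directly.
  always_ff @(posedge clock) begin
    if (valid && command == RESET) begin
      jdone <= 1;
    end else begin
      jdone <= 0;
    end
  end

  // The shift register forwards jdone to the output one cycle later.
  shift_register jdone_shift (
    .clock(clock),
    .in(jdone),
    .out(done));

endmodule
```

Because the shift register is the only driver of the output, any other logic that later assigns jdone picks up the same one-cycle delay.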


See these changes committed here.

This should be sufficient to handle the reset command for now. The next set of signals our AFU will receive will be requests for the AFU descriptor over the MMIO interface. I’ll walk through implementing this in my next post.

Hello AFU – Part 1

These posts describe the process I’ve followed to build an AFU (Accelerator Function Unit) for use in CAPI development. I’ve only been tinkering with RTL design for a few weeks, so this may not be the ideal design flow but it has worked for me so far!

Tools and Setup

My development setup follows what I described in my Tinkering with CAPI post, though I have switched to using the Gnome window manager, as some software I was using didn’t have as good support for XFCE. Additionally, my 30-day evaluation for Quartus Prime was close to expiring, so I’m now using the free Lite Edition.

Since I’ve started this digital adventure I’ve come to appreciate some of the nice features SystemVerilog adds to the design flow. As a developer more familiar with userspace programming, the addition of packages, enums, structures and type-definitions are very attractive and I cannot resist using them! I have found that not all design software supports all SystemVerilog semantics but I will use as much as I can get away with because they are very handy.

The design I will go through is something I’ve already put together and is available on Github. I will pull in some files from this repo when they are handy, but step through my process as if this were an external resource and find some areas for improvement along the way.

Parity Generator AFU

The overall goal of this AFU is to practice making a fully-functioning AFU that is paired up with a client application that utilizes it. The goal is to take two memory buffers in an application and generate parity for those buffers using XOR. We’ll start with a nearly clean slate and build our way up to this goal.
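As a very rough sketch of the core operation, parity over two buffers is just a bitwise XOR of matching chunks. The module name and the WIDTH parameter below are assumptions for illustration only; the real AFU also has to fetch and store the data through the PSL, which is not shown.

```systemverilog
// Hypothetical combinational sketch; names and WIDTH are illustrative.
module xor_parity_example #(
  parameter int WIDTH = 512) (
  input  logic [WIDTH-1:0] stripe_a,  // chunk from the first buffer
  input  logic [WIDTH-1:0] stripe_b,  // matching chunk from the second buffer
  output logic [WIDTH-1:0] parity);   // XOR parity of the two chunks

  // Each parity bit is the XOR of the two corresponding data bits.
  assign parity = stripe_a ^ stripe_b;

endmodule
```

A handy property of XOR parity is that either input can be rebuilt from the other one and the parity, since stripe_a equals parity ^ stripe_b.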

Getting the Project Started

As with the Tinkering with CAPI post, I start by creating a new empty project using top as the top-level design entity; this time I will name my project and its folder hello-afu.

Next, I copy top.v from the PSLSE project into the root directory of my project.

Additionally, I will pull in some helper files from my capi-parity example project. I wrote capi.sv to organize and simplify some of the interfaces the PSL exposes to the AFU and afu.sv to map signals to that abstraction for use in SystemVerilog.
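To give a feel for the kind of abstraction a package can provide, here is a simplified sketch using an enum and a struct. This is not the actual contents of capi.sv; the package, type and field names are made up for the example, and only the two command values come from the CAPI User’s Manual.

```systemverilog
// Hypothetical package; the real capi.sv is organized differently.
package example_job_pkg;

  // Job control commands as a named enum instead of raw hex values.
  typedef enum logic [7:0] {
    EXAMPLE_START = 8'h90,
    EXAMPLE_RESET = 8'h80
  } example_job_command_t;

  // Related job input signals grouped into one structure.
  typedef struct packed {
    logic                 valid;
    example_job_command_t command;
    logic                 command_parity;
  } example_job_input_t;

endpackage

// A module can import the package and use one port for the whole group.
module example_job_consumer
  import example_job_pkg::*;
(
  input  example_job_input_t job_in,
  output logic               is_reset);

  assign is_reset = job_in.valid && (job_in.command == EXAMPLE_RESET);

endmodule
```

Grouping signals this way keeps module port lists short and lets the logic refer to fields by name.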

These files don’t get automatically imported into the project, so I’ll add them via Project->Add/Remove Files in Project. The Add All button in this window picks up all the Verilog and SystemVerilog in the project directory.


At this point I tried to synthesize the project and got the error Node instance "svAFU" instantiates undefined entity "parity_afu", since my afu.sv references a module that does not exist yet.

You can use File->New to create a new SystemVerilog design file. In this file, I import my CAPI package and define the input and output ports for the module. The only functionality is a debug message to notify me of the rising clock edge.

import CAPI::*;

module parity_afu (
  input clock,
  output timebase_request,
  output parity_enabled,
  input JobInterfaceInput job_in,
  output JobInterfaceOutput job_out,
  input CommandInterfaceInput command_in,
  output CommandInterfaceOutput command_out,
  input BufferInterfaceInput buffer_in,
  output BufferInterfaceOutput buffer_out,
  input ResponseInterface response,
  input MMIOInterfaceInput mmio_in,
  output MMIOInterfaceOutput mmio_out);

  always_ff @(posedge clock) begin
    $display("Clock edge!");
  end
endmodule


Again I try to synthesize the project, this time getting a smattering of messages along the lines of Port "buffer_in" does not exist in the macrofunction "svAFU". I suppose Quartus Prime does not like that I’m not yet using these inputs, so I opened the afu.sv file and commented out the ports that are defined but unused structures. It doesn’t seem to be as picky about the inputs and outputs that are not structures.

At this stage I have a useless AFU that will at least synthesize, so I prepare my project for version control and push my initial commit to GitHub.

Meeting ModelSim

Tip: I find it inconvenient that I need to edit simulation/modelsim/modelsim.ini to reference the PSLSE veriuser.sl each time I open up ModelSim. So I’ve elected to modify my /home/kwilke/altera_lite/15.1/modelsim_ase/modelsim.ini file so that I don’t have to set this manually and remember to reload it through Compile Options in the ModelSim interface. This will affect all of my projects, but since I’m only using this for working with PSLSE, that is something I find acceptable.

My next step is to begin simulating to see how my empty AFU is faring so far. Find the button for RTL Simulation, or go to Tools->Run Simulation Tool->RTL Simulation, to open ModelSim.

Since my last article, I’ve learned a few tricks to make my life with ModelSim easier. Most of the interactions within the interface correspond to commands run in the VSIM prompt at the bottom of the main interface. Instead of going to Simulate->Start Simulation and picking the same options every time, I can just run vsim work.top in the command line. Even better, you can combine these commands into DO files that help automate some of the things you want the simulator to… do.

After entering a vsim work.top command, the simulator is ready to go. The AFU as written so far does only one thing, which is to display an output in the simulator to notify us of the rising edge of the clock signal. Since the PSL provides a 250 MHz clock, the clock period is 4 nanoseconds. We can use run 4 to simulate 4 nanoseconds of work. As before with the memcopy example, ModelSim will hang a bit as it’s waiting for a connection to PSLSE. Once started, the output will be something like below:

# AFU Server is waiting for connection on kbawx:32768
# PSL client connection from localhost
# Using PSL protocol level : 0.9908.0
# Clock edge!
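The 4 nanosecond figure is just 1000 / 250 when working in MHz and ns. In our project the clock comes from PSLSE through top.v, but as a stand-alone illustration of the arithmetic, a free-running clock with the same period could be sketched as below. The module and parameter names are made up for this example.

```systemverilog
`timescale 1ns/1ps

// Hypothetical stand-alone clock generator; PSLSE drives the real clock.
module clock_example;

  localparam real CLOCK_MHZ = 250.0;
  localparam real PERIOD_NS = 1000.0 / CLOCK_MHZ;  // 4 ns per cycle

  logic clock = 0;

  // Toggle every half period so a full cycle takes PERIOD_NS.
  always #(PERIOD_NS / 2) clock = ~clock;

  // Same kind of debug message the AFU prints on each rising edge.
  always @(posedge clock) begin
    $display("Clock edge at %0t", $time);
  end

  // Stop after about ten clock periods.
  initial #(10 * PERIOD_NS) $finish;

endmodule
```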

Text is great for some events, but something more useful is a visual timeline of the signals as they are changing. Let’s watch the clock signal itself.

In the sim window near the upper left part of the interface, you can see a hierarchy of instances. Drill down into this window and select our svAFU from within top->a0->svAFU.


After selecting svAFU, the Objects window on the right side will contain a list of the objects within that instance. We can find the clock signal as it exists coming into our parity_afu module. With clock selected, you can right click and select Add Wave or use CTRL+w to open it in ModelSim’s waveform viewer. The command add wave -position insertpoint sim:/top/a0/svAFU/clock can be seen in the console, providing you the syntax for the command to watch a waveform.


Running the simulation for a few more cycles, we can see the clock signal is now being recorded.

Tip: There are buttons in the interface to run the simulation for different amounts of time; by default the run length is set to 100ns. In the picture below I’ve changed this to 4ns and used the button just to the right to run it for that amount of time.


To help scripting these commands, you can start to type a command in the vsim prompt, and it will show you the available flags and options.


I have a few scripts in my capi-parity project that you can put in the simulation/modelsim/ directory within your project to add the various PSL signals. These scripts set the radix for many signals to something more useful than binary and change the color of AFU outputs to yellow to help differentiate what’s coming in and out of your AFU. Once a DO file is in your simulation/modelsim/ directory, you can run it like do watch_all_interfaces.do. Tab completion is supported to ease that process as well.

With these scripts added, I commit my repo changes as I want to share and take them with me if I work on this project from another workstation.

I feel this covers a good amount for a single post. In the next post we’ll start looking at what needs to be implemented to get this AFU to function in the most basic sense.

Tinkering with CAPI

CAPI (Coherent Accelerator Processor Interface) is an exciting technology that should allow developers to more easily design applications that utilize a FPGA accelerator. This article documents my initial spelunking into this technology.

A little context


This is my first foray into tinkering with FPGAs and digital logic design in general. For those unfamiliar with this technology, an FPGA (Field-Programmable Gate Array) is a type of integrated circuit that essentially allows software defined hardware. For the designer it’s almost like a pile of digital logic gates and some mechanisms that allow you to define how they are connected together. With this technology, and the appropriate skill sets, the FPGA can be programmed to act like nearly any piece of digital hardware.


An accelerator is used kind of like a co-processor that implements computationally expensive algorithms in hardware. The idea is that instead of processing something on the general purpose CPU in your computer, you delegate processing to a piece of hardware designed specifically for the task at hand, somewhat similar to using a GPU for 3D rendering. In my first couple of projects my goal is to make something more functional than practical.


CAPI is a technology that should allow me to focus on the interesting parts of designing an accelerated application. Instead of worrying about how I’m going to communicate between code running in a Linux userspace application and a custom piece of hardware, I get to focus my efforts on the application and the hardware itself!

To run it on real gear you’ll need a POWER8-based server. For me, the plan is to tinker with this on the Barreleye server that I work on for Rackspace. To make this more accessible to other developers, I will focus mostly on my design process and simulation on my x86_64 workstation.

My simulation environment

For my initial testbed I’m using Xubuntu 15.10 on my laptop and the Quartus Prime design software.

If you want to set this up for yourself I recommend you grab the flavor of Ubuntu that you like the most and install the Quartus Prime software. I’m using the 30-day Evaluation of Quartus Prime Standard Edition, but I believe the free Lite Edition would suffice for this tinkering as well. Elect to install ModelSim Starter Edition as part of the Quartus installation process.

Moar CAPI talk and terminology

Nallatech offers a CAPI developer kit and provides a copy of the CAPI User’s Manual. This manual describes a lot of the core components that make up the CAPI systems.

CAPI Diagram
Diagram from CAPI User’s Manual

An important part of the CAPI system on the FPGA side of things is the POWER Service Layer (PSL), which helps create the bridge between your custom hardware and userspace application. The accelerator itself is referred to as an Accelerator Function Unit (AFU) in the context of CAPI; this is the part I am most interested in designing.

PSL Diagram
Diagram of the PSL from CAPI User’s Manual

On the userspace side of things, libcxl is the library you include in your application to communicate with the PSL and the AFU(s) behind it.

The Power Service Layer Simulation Engine (pslse) can be used to help design and test this technology without the need for the physical gear. In the next few bits I’ll outline the process I have taken to set this up on my machine and run a sample project.

Building and setting up PSLSE

First, clone down the pslse repo from github

git clone https://github.com/ibm-capi/pslse

Build the AFU driver

The AFU driver is used by ModelSim to transmit signals between a simulated design and a running instance of PSLSE. To build it you’ll need to find the vpi_user.h header included in your ModelSim installation. For me this is located in /home/$USER/altera/15.1/modelsim_ase/include/. You’ll also need to compile for 32bit as ModelSim is a 32bit application.

cd pslse/afu_driver/src/
export VPI_USER_H_DIR="/home/$USER/altera/15.1/modelsim_ase/include/"
BIT32=y make

If you get an error about not finding a cdefs.h header, you’ll just need to install the libc6-dev-i386 package.

You can run file veriuser.sl to verify it generated an ELF 32-bit LSB shared object.

Build pslse itself

PSLSE has a straightforward build process; just make sure to build this for 32bit use as well.

cd ../../pslse/
BIT32=y make

Build libcxl from pslse repo

There is a variant of libcxl inside of the PSLSE repo that is modified for use in a simulated environment. This can be compiled for 64bit architecture as it communicates with the pslse over a socket.

cd ../libcxl/
make

Memcopy example project

IBM has a downloadable Memcopy Demo you can find here to test your setup. The next couple steps will outline the process I’ve taken to run this sample project.

Create new Quartus project

In Quartus, go to File->New Project Wizard.... If you get the introduction screen, click next to get to the Directory, Name, Top-Level Entity page. Set the working project directory to a new directory to store the project files, I named my directory memcopy-example. I also named my project memcopy-example. After naming the project the top level design entity field will mimic the project name, but for this project we’ll use top as our top level entity to match the top.v file provided in the pslse repo. After filling in those fields you can hit Finish to exit the wizard.

New Project dialog in Quartus

Copy files into project directory

From the pslse repo, copy afu_driver/verilog/top.v into your new project directory.

From the MemcopyDemoKit.tar.gz archive, copy all the files in capi-memcpy/memcpy/ into your project directory.

Synthesize and start simulation

Press Ctrl+k or go to Processing->Start->Start Analysis & Synthesis to build the project. When complete the bottom message area should say something like Quartus Prime Analysis & Synthesis was successful. 0 errors, 57 warnings and you’ll see a green check next to Analysis & Elaboration in the tasks window.

Next, go to Tools->Run Simulation Tool->RTL Simulation to open ModelSim. On my box this initially gave me an error about some license file stuff, even though I was using the free version of ModelSim. I followed this helpful post to fix the missing dependencies.

Point ModelSim to the pslse veriuser.sl

When ModelSim starts it will create simulation/modelsim/modelsim.ini in your Quartus project directory. Open this file and search for Veriuser. Add a line to this file that sets Veriuser to a path to the veriuser.sl you compiled within the afu_driver/src directory of the pslse repo. Example below:

Veriuser = /tmp/pslse/afu_driver/src/veriuser.sl

Back in ModelSim, go to Compile->Compile Options and hit OK so that ModelSim will reload the configuration. Unfortunately, this file is overwritten when you open ModelSim later, so you’ll need to do this each time you open ModelSim unless you modify the template at ~/altera/15.1/modelsim_ase/modelsim.ini, which would affect all of your Quartus projects.

Power up the AFU

In ModelSim, go to Simulate->Start Simulation. In the window that comes up, expand the work node, select the top module and hit OK.

Pick top module to simulate

Once started, the Transcript box should end with something like Errors: 0, Warnings: 0. If you do get an error, it might be because the veriuser.sl was compiled for 64bit architecture but is being loaded by a 32bit application.

Now that the simulation is prepared, we can run it via Simulate->Run->Continue. After a short bit ModelSim should output a message in the transcript that reads something like # AFU Server is waiting for connection on localhost:32768 and ModelSim might appear to freeze up, though it actually seems to be blocking on a connection attempt.

Open a terminal and go into the pslse/ directory within the pslse repo. Check shim_host.dat to make sure the port matches what the AFU server is waiting on. Then kick off the pslse server with ./pslse. Once ModelSim connects, the window should become responsive again. Within ModelSim, use Simulate->Run->All to keep the simulation going.

In the terminal where pslse is running, you should get some output like this:

INFO:PSLSE version 1.002 compiled @ Jan 11 2016 11:00:04
INFO:PSLSE parm values:
    Seed     = 13
    Timeout  = 10 seconds
    Response = 16%
    Paged    = 3%
    Reorder  = 86%
    Buffer   = 82%
INFO:Attempting to connect AFU: afu0.0 @ localhost:32768
PSL_SOCKET: Using PSL protocol level : 0.9908.0
INFO:Clocking afu0.0
INFO:Started PSLSE server, listening on localhost:16384
INFO:Stopping clocks to afu0.0

At this point, ModelSim and PSLSE are both ready for a userspace application to put them to good use.

Run the userspace application that utilizes the AFU

Go into the capi-memcpy folder extracted from the MemcopyDemoKit archive. Edit the Makefile included in this folder to set the PSLSE_DIR variable to the libcxl directory within the pslse repo. Run make to build the application.

Since the libcxl library isn’t installed to the system’s library path, you will need to set LD_LIBRARY_PATH to the same path you pointed PSLSE_DIR to before the program will be able to run.

LD_LIBRARY_PATH=/tmp/pslse/libcxl/ ./capi_memcpy

That should do it! My output is something like this (duplicate lines removed for brevity):

INFO:Connecting to host 'localhost' port 16384
Using seed 1452713916
Starting copy of 8192 bytes from 0x00000000015bc580 to 0x00000000015be600
Timeout after 1 seconds waiting for AFU to start
Command events:
0x0000000022: Tag:0x09,1 Command:0x0a00,1 Addr:0x00000000015bc480,1 abt:0 cch:0x0 size:128
0xffffffffff: Tag:0xff,1 Command:0x1fff,1 Addr:0xffffffffffffffff,1 abt:7 cch:0xffff size:4095
0xffffffffff: Tag:0xff,1 Command:0x1fff,1 Addr:0xffffffffffffffff,1 abt:7 cch:0xffff size:4095
Response events:
0x000000002c: Tag:0x09,1 Code:0x08 credits:1
0xffffffffff: Tag:0xff,1 Code:0xff credits:7
0xffffffffff: Tag:0xff,1 Code:0xff credits:7
Control events:
0x0000000002:0x0000000002: Done, Error:0x0000000000000000

Final thoughts

I’ve not fully wrapped my head around how this works nor verified if it’s actually copying blocks of memory as it alludes to, though it does appear to be a working setup. As I gain some more knowledge with VHDL/Verilog and the CAPI technology I hope to produce an easily understood breakdown of what is going on here and how a developer could dabble with their own designs. Stay tuned.