Hello AFU – Part 6

This is the 6th and final part of my Hello AFU tutorial. In the last post, I started building out a state machine for the AFU and read from the data structure that the WED points to. In this post, I’ll finish off the state machine: pulling down the data in our stripes, XORing them together, and writing the result back to userland.

Reading the Stripes

Since the largest memory size I can request via the PSL is 128 bytes, I’ll make requests for that amount. I need a scratch pad for this data, so I’ll add two 1024-bit internal registers for these chunks of data. I’ll also need to know when I’ve received both chunks, so I’ll set up a small register for that as well.

logic [0:1023] stripe1_data;
logic [0:1023] stripe2_data;
logic stripe_received;

In my REQUEST_STRIPES state I’ll request data from stripe1 in one cycle, then stripe2 in the next, using the command’s tag to know where I am in that process. I’ll also set stripe_received to 0 to indicate I’ve not yet received either.

  command_out.valid <= 1;
  command_out.size <= 128;
  command_out.command <= READ_CL_NA;
  if (command_out.tag == REQUEST_READ) begin
    // First cycle: request stripe1
    command_out.tag <= STRIPE1_READ;
    command_out.address <= request.stripe1;
  end else begin
    // Second cycle: request stripe2, then wait for both
    command_out.tag <= STRIPE2_READ;
    command_out.address <= request.stripe2;
    current_state <= WAITING_FOR_STRIPES;
    stripe_received <= 0;
  end

With the requests for stripe data sent, I need to wait for the data to come back. This could happen in any order, so I need to be ready for either.

  command_out.valid <= 0;
  if (buffer_in.write_valid) begin
    case (buffer_in.write_tag)
      STRIPE1_READ: begin
        if (buffer_in.write_address == 0) begin
          stripe1_data[0:511] <= buffer_in.write_data;
        end else begin
          stripe1_data[512:1023] <= buffer_in.write_data;
        end
      end
      STRIPE2_READ: begin
        if (buffer_in.write_address == 0) begin
          stripe2_data[0:511] <= buffer_in.write_data;
        end else begin
          stripe2_data[512:1023] <= buffer_in.write_data;
        end
      end
    endcase
  end

In the same state, I’ll look for the tags to come in over the response interface. On the first response I set the stripe_received register; on the second, the state progresses to WRITE_PARITY.

if (response.valid) begin
  if (response.tag == STRIPE1_READ ||
      response.tag == STRIPE2_READ) begin
    if (stripe_received) begin
      current_state <= WRITE_PARITY;
    end else begin
      stripe_received <= 1;
    end
  end
end

Where is this Parity?

I decided to generate the parity of the stripes via assign. By creating one new internal variable, parity_data, I can reference the XOR’d value of stripe1 and stripe2 at any time.

logic [0:1023] parity_data;

assign parity_data = stripe1_data ^ stripe2_data;
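To check what the AFU should produce, the same XOR can be done in software. This is only a rough reference sketch; the function name xor_parity is my own and is not part of the AFU or of libcxl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software reference for the AFU's parity: XOR two equally sized
 * stripes byte by byte. The name xor_parity is hypothetical. */
void xor_parity(const uint8_t *stripe1, const uint8_t *stripe2,
                uint8_t *parity, size_t len)
{
    assert(stripe1 != NULL && stripe2 != NULL && parity != NULL);
    for (size_t i = 0; i < len; i++)
        parity[i] = stripe1[i] ^ stripe2[i];
}
```

Because XOR is its own inverse, calling xor_parity() on the parity and either stripe gives back the other stripe, which makes for an easy sanity check.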


Since I set the buffer latency to 1, the data being put on the buffer for writing to memory needs to be shifted back a cycle.

logic [0:511] write_buffer;

shift_register #(512) write_shift (

Now I need to write the parity data to the memory at request.parity. This is pretty similar to reading memory. I’ll send a WRITE_NA (“write, no allocate”) command and align my data with buffer_out.read_data, returning the low half for address 0 and the high half for address 1.

  if (command_out.tag != PARITY_WRITE) begin
    command_out.command <= WRITE_NA;
    command_out.address <= request.parity;
    command_out.tag <= PARITY_WRITE;
    command_out.valid <= 1;
  end else begin
    command_out.valid <= 0;
    // Read half depending on address
    if (buffer_in.read_address == 0) begin
      write_buffer <= parity_data[0:511];
    end else begin
      write_buffer <= parity_data[512:1023];
    end
    // Handle response
    if (response.valid &&
        response.tag == PARITY_WRITE) begin
      current_state <= DONE;
    end
  end

After the parity is written, the job is complete. The state progresses to DONE when the write comes back on the response interface.

Aligned Writing

Writing the done flag is a little trickier, since it is not on a 128 or 64-byte alignment. The PSL can handle writing to any address, but the data must be aligned within the 128-byte read bus. If the data size you’re writing to is 64 bytes or less you can let the same data sit on the buffer interface for both addresses.

In this case, the done field is 32 bytes past the WED, and I’m doing a 1-byte write. I’ll align my data starting at the 256th bit, writing 8 bits. I’ll write a 1 in the first byte to set the little-endian unsigned 64-bit number to a non-zero value.

DONE: begin
  if (command_out.tag != DONE_WRITE) begin
    command_out.tag <= DONE_WRITE;
    command_out.size <= 1;
    command_out.address <= wed + 32;
    command_out.valid <= 1;
    // Byte 32 of the cacheline: 8 bits starting at bit 256
    write_buffer[256:263] <= 1;
  end else begin
    command_out.valid <= 0;
  end
end
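The alignment arithmetic for a small write can be sketched in C. This is an illustration only: the names locate_write and bit_span are hypothetical, and it assumes bit 0 of each 512-bit half holds the first byte, matching the [0:511] declarations used in this post.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 128u /* one 128-byte cacheline */
#define HALF_BYTES 64u  /* one 512-bit half of the buffer interface */

/* Hypothetical helper type: where a small write sits on the buffer. */
typedef struct {
    unsigned half;      /* 0 = low half, 1 = high half */
    unsigned first_bit; /* first bit index within the 512-bit half */
    unsigned last_bit;  /* last bit index within the 512-bit half */
} bit_span;

/* Locate a write of size_bytes at address, assuming it fits in one half. */
bit_span locate_write(uint64_t address, unsigned size_bytes)
{
    unsigned offset = (unsigned)(address % LINE_BYTES);
    bit_span span;

    assert(size_bytes > 0);
    assert(offset % HALF_BYTES + size_bytes <= HALF_BYTES);

    span.half = offset / HALF_BYTES;
    span.first_bit = (offset % HALF_BYTES) * 8;
    span.last_bit = span.first_bit + size_bytes * 8 - 1;
    return span;
}
```

For the done field at 0x7fa520 with a 1-byte write, this gives half 0 and bits 256 through 263, which is the span written in the DONE state.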

With that, the parity is written and the userspace application can see when it completes. Here’s the output from the test_afu application.

INFO:Connecting to host 'localhost' port 16384
[example structure
  example: 0x7fa500
  example->size: 128
  example->stripe1: 0x7fa600
  example->stripe2: 0x7fa780
  example->parity: 0x7fa880
  &(example->done): 0x7fa520
Attached to AFU
Waiting for completion by AFU
done: 0
done: 0
done: 1
That is some proper parity! This is exactly what I'm expecting to see. I'd also like to see this running on some real gear soon
Releasing AFU

That completes the basic function of this AFU, so I’ll commit my changes here.

Larger buffers

Now I’ll extend the design to support buffers larger than 128 bytes. This just requires an offset register that keeps track of the current position relative to the total size of the buffer to generate parity for.

I’ll start by adding a new variable for the offset that has the same data type as size.

longint unsigned offset;

Then I’ll set it to 0 in the START state.

offset <= 0;

In the REQUEST_STRIPES state I’ll add the offset to the stripe pointers.

command_out.address <= request.stripe1 + offset;

In the WRITE_PARITY state I’ll add the offset to the parity pointer, and check to see if the operation is complete.

command_out.address <= request.parity + offset;
if (offset + 128 < request.size) begin
  // More data left: advance to the next 128-byte chunk
  offset <= offset + 128;
  current_state <= REQUEST_STRIPES;
end else begin
  current_state <= DONE;
end
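The offset logic is easier to reason about as a plain loop. Here is a minimal software sketch of the same walk over the buffers, assuming the total size is a non-zero multiple of 128 bytes; the function name parity_chunks is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK 128u /* bytes handled per pass through the state machine */

/* Walk the buffers 128 bytes at a time, mirroring the offset check in
 * WRITE_PARITY. Returns the number of chunks processed. */
size_t parity_chunks(const uint8_t *stripe1, const uint8_t *stripe2,
                     uint8_t *parity, uint64_t size)
{
    uint64_t offset = 0;
    size_t chunks = 0;

    assert(size > 0 && size % CHUNK == 0);

    for (;;) {
        /* REQUEST_STRIPES, WAITING_FOR_STRIPES and WRITE_PARITY in one go */
        for (unsigned i = 0; i < CHUNK; i++)
            parity[offset + i] = stripe1[offset + i] ^ stripe2[offset + i];
        chunks++;

        if (offset + CHUNK < size)
            offset += CHUNK; /* back to REQUEST_STRIPES */
        else
            break;           /* on to DONE */
    }
    return chunks;
}
```

With a size of 256 the loop body runs twice, at offsets 0 and 128, and then falls through to the equivalent of DONE.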

With that I’d say this AFU is good enough for this tutorial. I’ll commit my changes and welcome pull requests if you find improvements to this tutorial. Hope this helps you hack on CAPI!

Hello AFU – Part 5

This is part 5 of my Hello AFU tutorial. In the last post, I built the C application that would attach and utilize the AFU that’s the focus of these posts. In this post I’ll start pulling data from the application’s memory space into the AFU and read the WED structure.

Keeping it Running

Before I start requesting data, some modifications are necessary to notify the underlying systems that the AFU is running. So far, I’m not managing the ah_jrunning signal that should be set high when the AFU is performing a task. After a short time the PSL will stop driving the AFU’s clock if the AFU hasn’t raised the ah_jrunning signal, so let’s quickly fix this and improve the parity_afu module a little bit.

I’ll refactor the always_ff block of the parity_afu module to use a case statement to handle commands and add handling for the START command in addition to our existing RESET command.

always_ff @(posedge clock) begin
  if(job_in.valid) begin
      RESET: begin
        jdone <= 1;
        job_out.running <= 0;
      START: begin
        jdone <= 0;
        job_out.running <= 1;
  end else begin
    jdone <= 0;

Now that I’m setting job_out.running, I’ll also remove my static assignment of that signal. These changes are committed here.

Planning for the Work Element Submodule

The groundwork to actually deal with the issue at hand is almost completely laid out. The module that will do the real work will have considerably more complexity than the components so far, so I’ll start planning and creating a new module to hold that functionality: my parity_workelement.

First I’ll define the inputs and outputs of this module.

Direction | Name        | Purpose
Input     | clock       | Clock signal to follow
Input     | enabled     | High while AFU is in running state
Input     | reset       | Signal triggering reset of internal state
Input     | wed         | The WED pointer from userspace
Input     | buffer_in   | For reading userspace buffer data
Input     | response    | To check responses of commands
Output    | command_out | To request buffer reads and writes
Output    | buffer_out  | For writing userspace buffer data

We’ll also define a mostly linear finite state machine to describe the work to be done.

State               | Purpose                                        | Next State
REQUEST_STRIPES     | Send commands to read stripe1 and stripe2      | WAITING_FOR_STRIPES
WAITING_FOR_STRIPES | Wait for stripe data to be available           | WRITE_PARITY
WRITE_PARITY        | Write XOR’d parity from stripes back to memory | REQUEST_STRIPES if more data to read; DONE otherwise
DONE                | Write done flag and halt                       | n/a
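Before writing any SystemVerilog, the transitions in this table can be modelled in a few lines of C. This is just a sketch for reasoning about the flow: next_state and its three condition flags are hypothetical names, and only the states listed in the table are modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* The states from the table above. */
typedef enum {
    REQUEST_STRIPES,
    WAITING_FOR_STRIPES,
    WRITE_PARITY,
    DONE
} state_t;

/* Return the next state given the current one and what has happened. */
state_t next_state(state_t current, bool stripes_ready,
                   bool parity_written, bool more_data)
{
    assert(current >= REQUEST_STRIPES && current <= DONE);

    switch (current) {
    case REQUEST_STRIPES:
        return WAITING_FOR_STRIPES;
    case WAITING_FOR_STRIPES:
        return stripes_ready ? WRITE_PARITY : WAITING_FOR_STRIPES;
    case WRITE_PARITY:
        if (!parity_written)
            return WRITE_PARITY;
        return more_data ? REQUEST_STRIPES : DONE;
    case DONE:
        break;
    }
    return DONE; /* DONE is terminal */
}
```

The only branch in the machine is out of WRITE_PARITY, which is why I call it mostly linear.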

Now I’ll write the first couple of portions of this module. I’ll create an enumeration that contains the various states used by the module. In the module definition itself I’ll define the input/output ports and create an internal register for the current_state. While I’m in here I’ll set up some signals with assign, mostly some settings I don’t want to change and a few parity generators as well. Lastly I’ll start off the always_ff block that’ll contain the reset logic and the case statement that implements my state machine.

import CAPI::*;

typedef enum {
  START, WAITING_FOR_REQUEST, REQUEST_STRIPES,
  WAITING_FOR_STRIPES, WRITE_PARITY, DONE
} state;

module parity_workelement (
  input logic clock,
  input logic enabled,
  input logic reset,
  input pointer_t wed,
  input BufferInterfaceInput buffer_in,
  input ResponseInterface response,
  output CommandInterfaceOutput command_out,
  output BufferInterfaceOutput buffer_out
);

  state current_state;

  assign command_out.abt = 0,
         command_out.context_handle = 0,
         buffer_out.read_latency = 1,
         command_out.command_parity = ~^command_out.command,
         command_out.address_parity = ~^command_out.address,
         command_out.tag_parity = ~^command_out.tag,
         buffer_out.read_parity = ~^buffer_out.read_data;

  always_ff @ (posedge clock) begin
    if (reset) begin
      current_state <= START;
    end else if (enabled) begin
      case (current_state)
        START: begin
          // State logic is filled in over the rest of this post
        end
      endcase
    end
  end

endmodule
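The ~^ operator used for the parity signals is SystemVerilog’s reduction XNOR: it yields 1 when the number of set bits is even, so the data plus the parity bit always carry an odd number of ones. A small C sketch of the same calculation follows; the function name odd_parity and its width parameter are my own, for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Mimic SystemVerilog's ~^ reduction over the low `width` bits of value:
 * returns 1 when the count of set bits is even, 0 when it is odd. */
unsigned odd_parity(uint64_t value, unsigned width)
{
    unsigned parity = 0;

    assert(width >= 1 && width <= 64);

    for (unsigned i = 0; i < width; i++)
        parity ^= (unsigned)((value >> i) & 1u);

    return parity ^ 1u;
}
```

For an 8-bit tag of 0x03 (two bits set) this returns 1, and for 0x01 it returns 0.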


With that defined, I’ll modify my parity_afu module to include an instance of my parity_workelement:

parity_workelement workelement(

To reduce how much I’m looking at during simulation, I’ll also modify my test.do to just show what’s going on in my workelement.

vsim work.top
add wave -position insertpoint sim:/top/a0/svAFU/workelement/*
run 136

Since this is a significant amount of code I’ll commit here before implementing the state machine.

Requesting Data

Requesting the WED data will be easy enough, but I first want a handy container to put it in. I’ll define a new type in SystemVerilog that matches my WED structure in C, skipping the done field since I don’t need to look at what’s currently in there; I can set that later by its offset relative to the WED.

typedef struct {
  longint unsigned size;
  pointer_t stripe1;
  pointer_t stripe2;
  pointer_t parity;
} parity_request;

Next I’ll add an internal register to the parity_workelement module that can hold this structure.

parity_request request;

To use the PSL’s Command Interface to request this data, the PSL requires that each active command has a unique tag ID. I’ll define another enum that will be used to automatically ensure I have a unique tag for each purpose.

typedef enum logic [0:7] {
  REQUEST_READ, STRIPE1_READ, STRIPE2_READ,
  PARITY_WRITE, DONE_WRITE
} request_tag;

The simplest way to request data from userspace is the READ_CL_NA, or “read cacheline, no allocate”, command. I’ll request a read size of 32 bytes, as I’m reading in four 64-bit values. I’ll set the tag to REQUEST_READ and use the wed as my address. As with the other interfaces, I need to set a valid signal high for 1 clock. I’ll do this by setting it high in the START state, transitioning to the WAITING_FOR_REQUEST state, and setting it back low there.

  START: begin
    command_out.command <= READ_CL_NA;
    command_out.tag <= REQUEST_READ;
    command_out.size <= 32;
    command_out.address <= wed;
    command_out.valid <= 1;
    current_state <= WAITING_FOR_REQUEST;
  end
  WAITING_FOR_REQUEST: begin
    command_out.valid <= 0;
  end

When the data I’ve requested comes back, it’ll come via two writes on the buffer_in.write_data bus. This bus is 512 bits wide, but supports 128-byte (1024-bit) requests. As such, there are two writes that occur to deliver the lower (address 0) and higher (address 1) halves. Since I’ve only requested 32 bytes, the data will be in the first 256 bits of the write to address 0 for the REQUEST_READ tag.

One important thing to look out for is that you can get multiple cycles of data on this bus, so you need to capture that data until the response interface lets you know the last cycle was valid.

With this in mind, I’ll read the buffer interface whenever its valid signal is high, the tag is mine, and the address is the one I’m looking for. It’s also important to remember that the terms read and write for the buffer interface are named from the PSL’s perspective, so even though I’m making a request to read data, it comes to the AFU on the buses named write_data and such.

if (buffer_in.write_valid &&
    buffer_in.write_tag == REQUEST_READ &&
    buffer_in.write_address == 0) begin
  request.size <= buffer_in.write_data[0:63];
  request.stripe1 <= buffer_in.write_data[64:127];
  request.stripe2 <= buffer_in.write_data[128:191];
  request.parity <= buffer_in.write_data[192:255];
end

When the data comes back, it’s not quite as I’d like it to be.


My application code spits out what these values should be:

[example structure
  example: 0x1d91500
  example->size: 128
  example->stripe1: 0x1d91600
  example->stripe2: 0x1d91780
  example->parity: 0x1d91880
  &(example->done): 0x1d91520

The issue here is that I’m reading in data that is in a little-endian byte format, but is being interpreted as big-endian. To deal with this issue I wrote a SystemVerilog function that can swap the endianness of the bytes in a generic way.

function logic [0:63] swap_endianness(logic [0:63] in);
  // Reverse the order of the eight bytes
  return {in[56:63], in[48:55], in[40:47], in[32:39], in[24:31], in[16:23],
          in[8:15], in[0:7]};
endfunction

I’ll modify my assignments to make use of this function.

request.size <= swap_endianness(buffer_in.write_data[0:63]);
request.stripe1 <= swap_endianness(buffer_in.write_data[64:127]);
request.stripe2 <= swap_endianness(buffer_in.write_data[128:191]);
request.parity <= swap_endianness(buffer_in.write_data[192:255]);
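The same byte reversal is easy to try out in C, which helps when comparing what shows up in the simulator against the values the application prints. This is only an illustrative sketch: the C swap_endianness mirrors the SystemVerilog function above, and read_le64 is a hypothetical helper of my own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of a 64-bit value. */
uint64_t swap_endianness(uint64_t in)
{
    uint64_t out = 0;
    for (int i = 0; i < 8; i++) {
        out = (out << 8) | (in & 0xff);
        in >>= 8;
    }
    return out;
}

/* Decode a little-endian 64-bit field from raw bytes as they sit in memory. */
uint64_t read_le64(const uint8_t *bytes)
{
    uint64_t value = 0;

    assert(bytes != NULL);

    for (int i = 7; i >= 0; i--)
        value = (value << 8) | bytes[i];
    return value;
}
```

Taking the stripe1 pointer 0x1d91600 as an example, its bytes in memory are 00 16 d9 01 00 00 00 00. Read as big-endian that looks like 0x0016d90100000000, and swap_endianness() turns it back into 0x1d91600.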

Now that this is in the right byte order, my internal request register is being filled with the appropriate values.

I’ll add a touch of logic to catch when these values are set to something valid then move to the next state.

if (response.valid && response.tag == REQUEST_READ) begin
  current_state <= REQUEST_STRIPES;
end

With our WED data all the way into our AFU I’ll commit my changes and call it a wrap for this post. In the next post I’ll write the remaining states and write some data back to userspace memory, completing this AFU!

Hello AFU – Part 4

This is part 4 of my Hello AFU tutorial. In the previous section we implemented the functionality to handle requests for the AFU descriptor. In this part we’ll shift focus a little bit to writing the C code that runs on the application side, and send our first bit of data to our AFU.

Getting the Code Started

I like to start a new C project by writing a basic Makefile. This one will just set up some variables to include the libCXL library from PSLSE.

CC=gcc -Wall -o $@ $< $(LIBRARIES)

all: test_afu

test_afu: test_afu.c

    rm -f test_afu

Next I’ll write a basic C file that will just open a handle to the AFU and clean up.

#include <stdio.h>
#include "libcxl.h"

int main(int argc, char *argv[])
{
    struct cxl_afu_h *afu;

    /* Open a handle to the AFU */
    afu = cxl_afu_open_dev("/dev/cxl/afu0.0d");
    if (!afu) {
        printf("Failed to open AFU: %m\n");
        return 1;
    }

    cxl_afu_attach(afu, 0x0123456789abcdef);
    printf("Attached to AFU\n");

    /* Clean up */
    printf("Releasing AFU\n");
    cxl_afu_free(afu);

    return 0;
}

Next, just to make things a little faster, I’ve noticed my AFU typically becomes ready around 136ns, so I’ll modify my test.do to run for 136ns right at the start. At this point I can make my test_afu binary and run it as long as I set my linker path via export LD_LIBRARY_PATH=~/workprojects/pslse/libcxl/ prior to running it.

The last thing to set up before running is a pslse_server.dat file that contains the host:port the simulated libCXL should connect to. I’ll point mine to localhost:16384, which is the default if you’re testing locally.

After kicking off my test_afu application and running the AFU for a few cycles, I’ll see my second argument to cxl_afu_attach show up on my ha_jea bus. This chunk of data is usually referred to as the Work Element Descriptor (WED).


I’ll commit my changes and we’ll start making a little better use of that WED.

Aligning data

Many of the requests we’ll make soon to read data from the application’s memory space will require that the data is aligned to 128-byte addresses. There are a few ways to accomplish this; my go-to is the aligned_alloc() function that is part of the C11 standard.

This function provides an interface that is very similar to the classic malloc() function, except that its first parameter lets you specify what memory alignment you want.
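A minimal sketch of a 128-byte-aligned allocation looks like this. The wrapper name alloc_cacheline_aligned is hypothetical; it rounds the size up because C11 expects the size passed to aligned_alloc() to be a multiple of the alignment.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHELINE 128u /* alignment the PSL requests will need */

/* Allocate at least `size` bytes on a 128-byte boundary.
 * Returns NULL on failure; release the buffer with free(). */
void *alloc_cacheline_aligned(size_t size)
{
    /* Round up to a non-zero multiple of the alignment */
    size_t rounded = (size + CACHELINE - 1) / CACHELINE * CACHELINE;
    if (rounded == 0)
        rounded = CACHELINE;

    void *buffer = aligned_alloc(CACHELINE, rounded);
    assert(buffer == NULL || (uintptr_t)buffer % CACHELINE == 0);
    return buffer;
}
```

Memory returned this way is released with the ordinary free(), the same as memory from malloc().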

Now that we can align data, I’ll create my WED structure for this parity-generating AFU.

typedef struct {
    __u64 size;
    void *stripe1;
    void *stripe2;
    void *parity;
    __u64 done;
} parity_request;
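The AFU will find each field by its byte offset from the WED pointer, so it is worth confirming the layout. The sketch below assumes a 64-bit platform and uses uint64_t as a stand-in for __u64 so it compiles without the kernel headers; done_offset is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* This sketch only holds where pointers are 8 bytes wide. */
static_assert(sizeof(void *) == 8, "sketch assumes 64-bit pointers");

/* Same layout as the WED structure, with uint64_t standing in for __u64. */
typedef struct {
    uint64_t size;
    void *stripe1;
    void *stripe2;
    void *parity;
    uint64_t done;
} parity_request;

/* Byte offset of the done field from the start of the WED. */
size_t done_offset(void)
{
    return offsetof(parity_request, done);
}
```

On such a platform the four leading fields occupy 32 bytes, so done sits 32 bytes past the start of the structure.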

Next I’ll create my example parity request, using aligned allocations for each block.

parity_request *example;
size_t size = 128, alignment = 128;

example = aligned_alloc(alignment, sizeof(*example));
example->size = size;
example->stripe1 = aligned_alloc(alignment, size);
example->stripe2 = aligned_alloc(alignment, size);
example->parity = aligned_alloc(alignment, size);

The intention here is that the data in the structure members stripe1 and stripe2 will be XOR’d together, and the results put in the parity member. Once the operation is complete, the AFU will set the done field to a non-zero value.

Before sending this request to the AFU, I’ll copy some data into both buffers and zero out the done field.

       "x3t8wefiankxkfmgm ncmbqx8ehn2jkaeubgfbuapwnjxkg09f0w9es80872981",
example->done = 0;

I’ll also add some print statements to show me these structure members.

printf("[example structure\n");
printf("  example: %p\n", example);
printf("  example->size: %llu\n", example->size);
printf("  example->stripe1: %p\n", example->stripe1);
printf("  example->stripe2: %p\n", example->stripe2);
printf("  example->parity: %p\n", example->parity);
printf("  &(example->done): %p\n", &(example->done));

I’ll modify my cxl_afu_attach() call to send the pointer to this parity_request structure.

cxl_afu_attach(afu, (__u64)example);

Lastly, I’ll add a while loop to wait until the AFU has completed its operation, then spit out the data in the parity member.

printf("Waiting for completion by AFU\n");

printf("PARITY:\n%s\n", (char *)example->parity);

At this point we can get the address of our WED structure in our AFU, but we’ll need to use the PSL’s Command and Buffer interfaces to request the data inside of that structure, which I’ll cover in the next post. Ending on this point I’ll commit my application code changes and see you in the next post!