Hello AFU – Part 5

This is part 5 of my Hello AFU tutorial. In the last post, I built the C application that would attach and utilize the AFU that’s the focus of these posts. In this post I’ll start pulling data from the application’s memory space into the AFU and read the WED structure.

Keeping it Running

Before I start requesting data, some modifications are necessary to notify the underlying systems that the AFU is running. So far, I’m not managing the ah_jrunning signal that should be set high when the AFU is performing a task. After a short time the PSL will stop driving the AFU’s clock if the AFU hasn’t raised the ah_jrunning signal, so let’s quickly fix this and improve the parity_afu module a little bit.

I’ll refactor the always_ff block of the parity_afu module to use a case statement to handle commands and add handling for the START command in addition to our existing RESET command.

always_ff @(posedge clock) begin
  if(job_in.valid) begin
    case(job_in.command)
      RESET: begin
        jdone <= 1;
        job_out.running <= 0;
      end
      START: begin
        jdone <= 0;
        job_out.running <= 1;
      end
    endcase
  end else begin
    jdone <= 0;
  end
end

Now that I’m setting job_out.running, I’ll also remove my static assignment of that signal. These changes are committed here.



Planning for the Work Element Submodule

The groundwork to actually deal with the issue at hand is almost completely laid out. The module that will do the real work will have considerably more complexity than the components so far, so I’ll start planning and creating a new module to hold that functionality: my parity_workelement.

First I’ll define the inputs and outputs of this module:

| Direction | Name | Purpose |
| --- | --- | --- |
| Input | clock | Clock signal to follow |
| Input | enabled | High while AFU is in running state |
| Input | reset | Signal triggering reset of internal state |
| Input | wed | The WED pointer from userspace |
| Input | buffer_in | For reading userspace buffer data |
| Input | response | To check responses of commands |
| Output | command_out | To request buffer reads and writes |
| Output | buffer_out | For writing userspace buffer data |

We’ll also define a mostly linear finite state machine to describe the work to be done.

| State | Purpose | Next State |
| --- | --- | --- |
| START | Request data at WED | WAITING_FOR_REQUEST |
| WAITING_FOR_REQUEST | Wait for WED data to be available | REQUEST_STRIPES |
| REQUEST_STRIPES | Send commands to read stripe1 and stripe2 | WAITING_FOR_STRIPES |
| WAITING_FOR_STRIPES | Wait for stripe data to be available | WRITE_PARITY |
| WRITE_PARITY | Write XOR’d parity from stripes back to memory | REQUEST_STRIPES if more data to read; DONE otherwise |
| DONE | Write done flag and halt | n/a |
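To sanity-check the transitions before committing them to hardware, the table can be modelled in a few lines of C. This is only a sketch: the ST_ prefix, the next_state() helper, and the ready and more_data flags are names I made up for illustration, and the real module will derive those conditions from the PSL's response interface.

```c
#include <stdbool.h>

/* States mirror the table above; the ST_ prefix is only there to
 * avoid clashing with other identifiers. */
typedef enum {
    ST_START,
    ST_WAITING_FOR_REQUEST,
    ST_REQUEST_STRIPES,
    ST_WAITING_FOR_STRIPES,
    ST_WRITE_PARITY,
    ST_DONE
} we_state;

/* Hypothetical helper: returns the state that follows `current`.
 * `ready` models "the data being waited on has arrived";
 * `more_data` models "there are still stripes left to process". */
we_state next_state(we_state current, bool ready, bool more_data)
{
    switch (current) {
    case ST_START:               return ST_WAITING_FOR_REQUEST;
    case ST_WAITING_FOR_REQUEST: return ready ? ST_REQUEST_STRIPES : current;
    case ST_REQUEST_STRIPES:     return ST_WAITING_FOR_STRIPES;
    case ST_WAITING_FOR_STRIPES: return ready ? ST_WRITE_PARITY : current;
    case ST_WRITE_PARITY:        return more_data ? ST_REQUEST_STRIPES : ST_DONE;
    case ST_DONE:                return ST_DONE;
    }
    return ST_DONE; /* unreachable for valid inputs */
}
```

The two waiting states are the only ones that can hold their position, which is what makes the machine "mostly linear".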

Now I’ll write the first couple of portions of this module. I’ll create an enumeration that contains the various states used by the module. In the module definition itself I’ll define the input/output ports and create an internal register for the current_state. While I’m in here I’ll set up some signals with assign, mostly some settings I don’t want to change and a few parity generators as well. Lastly I’ll start off the always_ff block that’ll contain the reset logic and the case statement that implements my state machine.

import CAPI::*;

typedef enum {
  START,
  WAITING_FOR_REQUEST,
  REQUEST_STRIPES,
  WAITING_FOR_STRIPES,
  WRITE_PARITY,
  DONE
} state;

module parity_workelement (
  input logic clock,
  input logic enabled,
  input logic reset,
  input pointer_t wed,
  input BufferInterfaceInput buffer_in,
  input ResponseInterface response,
  output CommandInterfaceOutput command_out,
  output BufferInterfaceOutput buffer_out
);

  state current_state;

  assign command_out.abt = 0,
         command_out.context_handle = 0,
         buffer_out.read_latency = 1,
         command_out.command_parity = ~^command_out.command,
         command_out.address_parity = ~^command_out.address,
         command_out.tag_parity = ~^command_out.tag,
         buffer_out.read_parity = ~^buffer_out.read_data;

  always_ff @ (posedge clock) begin
    if (reset) begin
      current_state <= START;
    end else if (enabled) begin
      case(current_state)
        START: begin
          $display("Started!");
        end
      endcase
    end
  end

endmodule

With that defined, I’ll modify my parity_afu module to include an instance of my parity_workelement:

parity_workelement workelement(
  .clock(clock),
  .enabled(job_out.running),
  .reset(jdone),
  .wed(job_in.address),
  .buffer_in(buffer_in),
  .response(response),
  .command_out(command_out),
  .buffer_out(buffer_out));

To reduce how much I’m looking at during simulation, I’ll also modify my test.do to just show what’s going on in my workelement.

vsim work.top
add wave -position insertpoint sim:/top/a0/svAFU/workelement/*
run 136

Since this is a significant amount of code I’ll commit here before implementing the state machine.

Requesting Data

Requesting the WED data will be easy enough, but I first want a handy container to put it in, so I’ll define a new type in SystemVerilog that matches my WED structure in C. I’ll skip the done field, as I don’t need to look at what’s currently in there; I can set it later by its offset relative to the WED.

typedef struct {
  longint unsigned size;
  pointer_t stripe1;
  pointer_t stripe2;
  pointer_t parity;
} parity_request;
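The offsets involved can be checked on the C side with offsetof. The sketch below uses a stand-in struct, wed_layout, with the pointers modelled as uint64_t so the numbers hold on any host; that name and the two helper functions are assumptions for illustration, not part of the application code.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the application's WED layout. Pointers are modelled as
 * uint64_t so the offsets do not depend on the host's pointer size. */
typedef struct {
    uint64_t size;
    uint64_t stripe1;
    uint64_t stripe2;
    uint64_t parity;
    uint64_t done;
} wed_layout;

/* Bytes the AFU has to read to cover every field ahead of `done`. */
size_t wed_read_size(void)
{
    return offsetof(wed_layout, done);
}

/* Address the AFU would target to set `done`, given the WED pointer. */
uint64_t wed_done_address(uint64_t wed)
{
    return wed + offsetof(wed_layout, done);
}
```

With the four 64-bit fields ahead of it, done sits 32 bytes past the start of the WED.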

Next I’ll add an internal register to the parity_workelement module that can hold this structure.

parity_request request;

To use the PSL’s Command Interface to request this data, the PSL requires that each active command has a unique tag ID. I’ll define another enum that will be used to automatically ensure I have a unique tag for each purpose.

typedef enum logic [0:7] {
  REQUEST_READ,
  STRIPE1_READ,
  STRIPE2_READ,
  PARITY_WRITE,
  DONE_WRITE
} request_tag;

The simplest way to request data from userspace is the READ_CL_NA, or “read cacheline, no allocate”, command. I’ll request a read size of 32 bytes, as I’m reading four 64-bit values: the size and three pointers. I’ll set the tag to REQUEST_READ and use the wed as my address. As with the other interfaces, I need to set a valid signal high for one clock. I’ll do this by setting it high in the START state, transitioning to the WAITING_FOR_REQUEST state, and setting it back low there.

case(current_state)
  START: begin
    command_out.command <= READ_CL_NA;
    command_out.tag <= REQUEST_READ;
    command_out.size <= 32;
    command_out.address <= wed;
    command_out.valid <= 1;
    current_state <= WAITING_FOR_REQUEST;
  end
  WAITING_FOR_REQUEST: begin
    command_out.valid <= 0;
  end
endcase

When the data I’ve requested comes back, it’ll arrive via writes on the buffer_in.write_data bus. This bus is 512 bits wide, but supports 128-byte (1024-bit) requests. As such, a full request is delivered in two writes: the lower half (address 0) and the upper half (address 1). Since I’ve only requested 32 bytes, the data will be in the first 256 bits of the write to address 0 for the REQUEST_READ tag.

One important thing to look out for is that you can get multiple cycles of data on this bus, so you need to capture that data until the response interface lets you know the last cycle was valid.

With this in mind, I’ll read the buffer interface each time its valid signal is high, the tag is mine, and the address is the one I’m looking for. It’s also important to remember that the terms read and write for the buffer interface are named from the PSL’s perspective, so even though I’m making a read request to read data, it comes to the AFU on the buses named write_data and such.

if (buffer_in.write_valid &&
    buffer_in.write_tag == REQUEST_READ &&
    buffer_in.write_address == 0) begin
  request.size <= buffer_in.write_data[0:63];
  request.stripe1 <= buffer_in.write_data[64:127];
  request.stripe2 <= buffer_in.write_data[128:191];
  request.parity <= buffer_in.write_data[192:255];
end

When the data comes back, it’s not quite as I’d like it to be.

[Image: wed_data]

My application code spits out what these values should be:

[example structure
  example: 0x1d91500
  example->size: 128
  example->stripe1: 0x1d91600
  example->stripe2: 0x1d91780
  example->parity: 0x1d91880
  &(example->done): 0x1d91520

The issue here is that I’m reading in data that is in a little-endian byte format, but is being interpreted as big-endian. To deal with this issue I wrote a SystemVerilog function that can swap the endianness of the bytes in a generic way.

function logic [0:63] swap_endianness(logic [0:63] in);
  return {in[56:63], in[48:55], in[40:47], in[32:39], in[24:31], in[16:23],
          in[8:15], in[0:7]};
endfunction
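The same byte reversal is easy to try out in C. This is a sketch with a made-up function name, swap_endianness64, working on a plain uint64_t rather than a SystemVerilog bus.

```c
#include <stdint.h>

/* Reverse the byte order of a 64-bit value: the lowest byte of `in`
 * becomes the highest byte of the result, and so on. */
uint64_t swap_endianness64(uint64_t in)
{
    uint64_t out = 0;
    for (int i = 0; i < 8; i++) {
        out = (out << 8) | (in & 0xff); /* append the current low byte */
        in >>= 8;                       /* move on to the next byte */
    }
    return out;
}
```

For example, my size of 128 is stored as the bytes 80 00 00 00 00 00 00 00. Read as big-endian that is 0x8000000000000000, and swapping it gives back 0x80.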

I’ll modify my assignments to make use of this function.

request.size <= swap_endianness(buffer_in.write_data[0:63]);
request.stripe1 <= swap_endianness(buffer_in.write_data[64:127]);
request.stripe2 <= swap_endianness(buffer_in.write_data[128:191]);
request.parity <= swap_endianness(buffer_in.write_data[192:255]);

Now that this is in the right byte order, my internal request register is being filled with the appropriate values.

I’ll add a touch of logic to catch when these values are set to something valid then move to the next state.

if (response.valid && response.tag == REQUEST_READ) begin
  current_state <= REQUEST_STRIPES;
end

With our WED data all the way into our AFU I’ll commit my changes and call it a wrap for this post. In the next post I’ll write the remaining states and write some data back to userspace memory, completing this AFU!

Hello AFU – Part 4

This is part 4 of my Hello AFU tutorial. In the previous section we implemented the functionality to handle requests for the AFU descriptor. In this part we’ll shift focus a little bit into writing the C code that runs on the application side, and send our first bit of data to our AFU.

Getting the Code Started

I like to start a new C project by writing a basic Makefile. This one will just set up some variables to include the libCXL library from PSLSE.

LIBCXL_PATH=~/workprojects/pslse/libcxl
LIBCXL_INCLUDE=-I $(LIBCXL_PATH) -L $(LIBCXL_PATH) -lcxl -lpthread
LIBRARIES=$(LIBCXL_INCLUDE)
CC=gcc -Wall -o $@ $< $(LIBRARIES)

all: test_afu

test_afu: test_afu.c
    $(CC)

clean:
    rm -f test_afu

Next I’ll write a basic C file that will just open a handle to the AFU and clean up.

#include <stdio.h>
#include "libcxl.h"


int main(int argc, char *argv[])
{
    struct cxl_afu_h *afu;

    afu = cxl_afu_open_dev("/dev/cxl/afu0.0d");
    if(!afu)
    {
        printf("Failed to open AFU: %m\n");
        return 1;
    }

    cxl_afu_attach(afu, 0x0123456789abcdef);
    printf("Attached to AFU\n");

    cxl_afu_free(afu);

    return 0;
}

Next, just to make things a little faster, I’ve noticed my AFU typically becomes ready around 136ns, so I’ll modify my test.do to run for 136ns right at the start. At this point I can make my test_afu binary and run it as long as I set my linker path via export LD_LIBRARY_PATH=~/workprojects/pslse/libcxl/ prior to running it.

The last thing to set up before running is a pslse_server.dat file that contains the host:port the simulated libCXL should connect to. I’ll point mine to localhost:16384, which is the default if you’re testing locally.

After kicking off my test_afu application and running the AFU for a few cycles, I’ll see my second argument to cxl_afu_attach show up on my ha_jea bus. This chunk of data is usually referred to as the Work Element Descriptor (WED).

[Image: wed_signal]

I’ll commit my changes and we’ll start making a little better use of that WED.



Aligning data

Many of the requests we’ll soon make to read data from the application’s memory space will require that the data is aligned to 128-byte addresses. There are a few ways to accomplish this; my go-to is the aligned_alloc() function that is part of the C11 standard.

This function provides an interface that is very similar to the classic malloc() function, with an added first parameter that lets you specify the memory alignment you want.
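One thing to watch for: C11 specifies that the size passed to aligned_alloc() should be an integral multiple of the alignment. Many C libraries are lenient about this, but a small wrapper that rounds the size up keeps the call portable. The helper names round_up and alloc_aligned below are my own, so treat this as a sketch rather than anything libCXL requires.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: round `size` up to the next multiple of `alignment`. */
size_t round_up(size_t size, size_t alignment)
{
    return (size + alignment - 1) / alignment * alignment;
}

/* Allocate at least `size` bytes on an `alignment`-byte boundary.
 * Returns NULL on failure; release the memory with free(). */
void *alloc_aligned(size_t alignment, size_t size)
{
    return aligned_alloc(alignment, round_up(size, alignment));
}
```

A call like alloc_aligned(128, 40) requests a full 128-byte block, so a small structure still ends up alone on its own cacheline.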

Now that we can align data, I’ll create my WED structure for this parity-generating AFU.

typedef struct
{
    __u64 size;
    void *stripe1;
    void *stripe2;
    void *parity;
    __u64 done;
} parity_request;

Next I’ll create my example parity request, using aligned allocations for each block.

parity_request *example;
size_t size = 128, alignment = 128;

example = aligned_alloc(alignment, sizeof(*example));
example->size = size;
example->stripe1 = aligned_alloc(alignment, size);
example->stripe2 = aligned_alloc(alignment, size);
example->parity = aligned_alloc(alignment, size);

The intention here is that the data in the structure members stripe1 and stripe2 will be XOR’d together, and the results put in the parity member. Once the operation is complete, the AFU will set the done field to a non-zero value.
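As a software reference for what the AFU is expected to produce, the same operation takes only a few lines of C. The function name xor_parity is made up for this sketch; it can be handy later for checking the AFU's output against a known-good result.

```c
#include <stddef.h>
#include <stdint.h>

/* Software reference: parity[i] = stripe1[i] XOR stripe2[i] for
 * every byte in the buffers. All three buffers hold `size` bytes. */
void xor_parity(const uint8_t *stripe1, const uint8_t *stripe2,
                uint8_t *parity, size_t size)
{
    for (size_t i = 0; i < size; i++)
        parity[i] = stripe1[i] ^ stripe2[i];
}
```

Because XOR is its own inverse, running the same function on the parity and either stripe recovers the other stripe.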

Before sending this request to the AFU, I’ll copy some data into both buffers and zero out the done field.

memcpy(example->stripe1,
       "asfb190jwqsefx0amxAqa1nlkaf78sa0g&0ha8dngj3t21078fnajl38n32j3np2"
       "x3t8wefiankxkfmgm ncmbqx8ehn2jkaeubgfbuapwnjxkg09f0w9es80872981",
       size);
memcpy(example->stripe2,
       "\x35\x1b\x07\x16\x11\x50\x43\x4a\x04\x1e\x1e\x00\x46\x08\x42\x0e"
       "\x1d\x1d\x33\x51\x11\x50\x1c\x05\x1f\x18\x47\x17\x6c\x1b\x08\x43"
       "\x47\x4f\x43\x48\x04\x40\x05\x0d\x13\x06\x4a\x54\x45\x59\x51\x43"
       "\x18\x2f\x49\x0c\x4a\x09\x4b\x48\x0b\x50\x46\x03\x5d\x09\x50\x46"
       "\x17\x13\x07\x5d\x12\x4b\x46\x20\x46\x0a\x4b\x19\x07\x15\x02\x47"
       "\x01\x49\x05\x06\x4d\x16\x1e\x58\x4b\x00\x0d\x4e\x46\x02\x02\x12"
       "\x45\x07\x17\x09\x08\x0b\x1b\x06\x50\x18\x00\x4a\x0b\x04\x0a\x55"
       "\x19\x14\x55\x16\x55\x45\x14\x5d\x51\x4a\x17\x41\x56\x57\x5f",
       size);
example->done = 0;

I’ll also add some print statements to show me these structure members.

printf("[example structure\n");
printf("  example: %p\n", example);
printf("  example->size: %llu\n", example->size);
printf("  example->stripe1: %p\n", example->stripe1);
printf("  example->stripe2: %p\n", example->stripe2);
printf("  example->parity: %p\n", example->parity);
printf("  &(example->done): %p\n", &(example->done));

I’ll modify my cxl_afu_attach() call to send the pointer to this parity_request structure.

cxl_afu_attach(afu, (__u64)example);

Lastly, I’ll add a while loop to wait until the AFU has completed its operation, then spit out the data in the parity member.

printf("Waiting for completion by AFU\n");
while(!example->done){
  sleep(1);
}

printf("PARITY:\n%s\n", (char *)example->parity);
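While the AFU is still under development it may never set done, in which case the loop above spins forever. A bounded variant is sketched below; wait_for_done, max_polls, and interval_s are hypothetical names, and the volatile qualifier is there because the field is changed by something other than this program's own code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: check `*done` up to `max_polls` times, sleeping
 * `interval_s` seconds after each unsuccessful check.
 * Returns true as soon as a non-zero value is read, false otherwise. */
bool wait_for_done(const volatile uint64_t *done, unsigned max_polls,
                   unsigned interval_s)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if (*done)
            return true;
        sleep(interval_s);
    }
    return false;
}
```

The caller can then print the parity buffer on success or report a timeout instead of hanging.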

At this point we can get the address of our WED structure in our AFU, but we’ll need to use the PSL’s Command and Buffer interfaces to request the data inside of that structure, which I’ll cover in the next post. Ending on this point I’ll commit my application code changes and see you in the next post!

Hello AFU – Part 3

This is part 3 of my Hello AFU tutorial. In the last part we built components to handle the AFU reset. In this part we’ll look at the requests coming in for the AFU descriptor and build a mechanism to send this data back to the PSL.

svAFU Ports

Before we get started here, I noticed something odd that’s good to be aware of. In the first post I commented out the structured inputs and outputs because Quartus was throwing errors that they were not defined. I assumed it didn’t like that they weren’t being used, but after re-cloning the repo down it started to give me those errors again.

It looks like this error might just be related to the order the components are being built in. If you comment out all the structured inputs, it’ll synthesize the project successfully. After that, you can uncomment it all and it will synthesize just fine. If anyone has insight into why this is happening or a better fix, I’d appreciate any feedback you have to offer.



AFU Descriptor Read Requests

After the AFU handles its initial reset signal, the next batch of signals are requests over the MMIO interface for the AFU descriptor. The AFU descriptor provides some details about the AFU’s function and setup.

To add the MMIO interfaces to the wave viewer, I’ll add a do watch_mmio_interface.do line to my test.do script before the run 40 command.

The MMIO operations are synchronous so the PSL will send a single request and wait until it gets a response. We can see the first signal coming in here:

[Image: first_mmio]

Like with the job interface, the ha_mmval will be raised when a valid command is active. The ha_mmcfg being high lets us know this is a request for data in the AFU descriptor. ha_mmrnw is high for read requests and low for write requests. ha_mmdw is low for 32-bit requests and high for 64-bit. ha_mmad is the address of the data being requested, and ha_mmadpar is the odd parity bit for that address. ha_data and ha_datapar are only used for write requests, so we don’t need to look at those quite yet.
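Odd parity means the data bits plus the parity bit always contain an odd number of ones, which is what the ~^ (XNOR reduction) operator produces in SystemVerilog. Here is a small C sketch of the same calculation; the function name odd_parity and its width argument are assumptions for illustration.

```c
#include <stdint.h>

/* Returns the odd-parity bit for the low `width` bits of `value`
 * (width between 1 and 64). The result is 1 when the data holds an
 * even number of set bits, so data plus parity always has an odd
 * number of ones. */
unsigned odd_parity(uint64_t value, unsigned width)
{
    unsigned ones = 0;
    for (unsigned i = 0; i < width; i++)
        ones += (unsigned)((value >> i) & 1u);
    return (ones % 2 == 0) ? 1u : 0u;
}
```

An all-zero address therefore carries a parity bit of 1, which is an easy thing to look for in the wave viewer.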

Parameterizing the Shift Register

Before we can send data back, we need to make a modification to our shift register. Just as we shifted our jdone signal back a clock cycle, we need to do the same here for ah_mmack and ah_mmdata. Our shift register as-is will work fine for ah_mmack, as it’s also a single-bit signal. For ah_mmdata we need it to be 64 bits wide to support the whole bus being shifted back a clock cycle.

SystemVerilog provides a construct to parameterize a module, allowing us to modify some of how it operates on a per-instance basis. In this case I want to add a width parameter that lets us set the width of the bus.

The logic in the always_ff block does not need to change for this; we just need to define the parameter and use it in the input and output port declarations.

module shift_register #(parameter width = 1) (
  input logic clock,
  input logic [0:width-1] in,
  output logic [0:width-1] out);

In this change, we now have a default width of 1, so that we don’t need to change the shift registers already in use. For the module we’re about to build we can now create an instance like this for a 64 bit wide shifter:

shift_register #(64) data_shift(
  .clock(clock),
  .in(data),
  .out(mmio_out.data));

Handling AFU Descriptor requests

I will define and add a new file mmio.sv to the project that will be responsible for all MMIO request handling. It will have some internal variables ack and data to hold the data that will be shifted back. Additionally it will have some logic to set the ah_mmdatapar bit. That parity bit doesn’t need to be shifted because we can hook it up to the current output to save a couple logic gates.

import CAPI::*;

module mmio (
  input logic clock,
  input MMIOInterfaceInput mmio_in,
  output MMIOInterfaceOutput mmio_out);

  logic ack;
  logic [0:63] data;

  shift_register ack_shift(
    .clock(clock),
    .in(ack),
    .out(mmio_out.ack));

  shift_register #(64) data_shift(
    .clock(clock),
    .in(data),
    .out(mmio_out.data));

  // Set parity bit for MMIO output
  assign mmio_out.data_parity = ~^mmio_out.data;

  always_ff @(posedge clock) begin
    if(mmio_in.valid) begin
      if(mmio_in.cfg) begin
        if(mmio_in.read) begin
          ack <= 1;
          data <= 1;
        end
      end
    end else begin
      ack <= 0;
      data <= 0;
    end
  end

endmodule

For now, I’m not as worried about sending proper data as I am getting all the pieces laid out and working. I’ll add an instance of this new mmio module in my parity_afu module.

mmio mmio_handler(
  .clock(clock),
  .mmio_in(mmio_in),
  .mmio_out(mmio_out));

Looking at the waves now, we can see 7 MMIO requests coming in, and for each we’re sending back a simple 1 on the data bus.

[Image: first_mmio_writes]

Since we didn’t send a proper descriptor, PSLSE complains ERROR:AFU descriptor num_of_processes=0!

Either way it’s starting to come together so I’ll commit my changes and move on.

Defining a New Type

It took me a while to find a way to handle these AFU requests that I felt was functional and cleanly coded. Most AFU descriptor implementations I’ve seen so far use some Verilog implementation of ROM, and this is how I first implemented this.

I found this method to be a bit cumbersome, so I decided to extend my capi.sv to include a new structure definition for an AFU descriptor. This format is modeled after what’s described in the CAPI User’s Manual.

  typedef struct packed {
    bit [0:15] num_ints_per_process;
    bit [0:15] num_of_processes;
    bit [0:15] num_of_afu_crs;
    bit [0:15] req_prog_model;
    bit [0:199] reserved_1;
    bit [0:55] afu_cr_len;
    bit [0:63] afu_cr_offset;
    bit [0:5] reserved_2;
    bit psa_per_process_required;
    bit psa_required;
    bit [0:55] psa_length;
    bit [0:63] psa_offset;
    bit [0:7] reserved_3;
    bit [0:55] afu_eb_len;
    bit [0:63] afu_eb_offset;
  } AFUDescriptor;

To support reading the right portions of the AFU descriptor, a SystemVerilog function felt like the best route. This initial implementation is built just to support the regions of the AFU descriptor that I’ve seen requests come in to so far.

function bit [0:63] read_afu_descriptor(AFUDescriptor descriptor,
                                        bit [0:23] address);
  case(address)
    'h0: begin
      return {descriptor.num_ints_per_process,
              descriptor.num_of_processes,
              descriptor.num_of_afu_crs,
              descriptor.req_prog_model};
    end
    default: begin
      return 0;
    end
  endcase
endfunction

With this new type and function to help reading it added to my CAPI package, I can create an instance of this type in my mmio module and set the values appropriately.

  AFUDescriptor afu_desc;

  assign afu_desc.num_ints_per_process = 0,
         afu_desc.num_of_processes = 1,
         afu_desc.num_of_afu_crs = 0,
         afu_desc.req_prog_model = 16'h8010,
         afu_desc.reserved_1 = 0,
         afu_desc.afu_cr_len = 0,
         afu_desc.afu_cr_offset = 0,
         afu_desc.reserved_2 = 0,
         afu_desc.psa_per_process_required = 0,
         afu_desc.psa_required = 0,
         afu_desc.psa_length = 0,
         afu_desc.psa_offset = 0,
         afu_desc.reserved_3 = 0,
         afu_desc.afu_eb_len = 0,
         afu_desc.afu_eb_offset = 0;

The last step is to replace our hard-coded response with the newly defined function.

data <= read_afu_descriptor(afu_desc, mmio_in.address);
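To know what to look for on the data bus, the concatenation at address 'h0 can be worked out in C. The first field in a SystemVerilog concatenation lands in the most significant bits, so the four 16-bit fields pack as shown below. The function name descriptor_word0 is made up for this sketch.

```c
#include <stdint.h>

/* Pack the four 16-bit fields returned for descriptor address 0 into
 * one 64-bit word, first field in the most significant bits. */
uint64_t descriptor_word0(uint16_t num_ints_per_process,
                          uint16_t num_of_processes,
                          uint16_t num_of_afu_crs,
                          uint16_t req_prog_model)
{
    return ((uint64_t)num_ints_per_process << 48) |
           ((uint64_t)num_of_processes << 32) |
           ((uint64_t)num_of_afu_crs << 16) |
           (uint64_t)req_prog_model;
}
```

With the values assigned above (0, 1, 0 and 16'h8010), the word read back for address 0 should be 0x0000000100008010.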

With that completed, I’ll verify I’m getting the expected behavior during simulation.

[Image: afu_mmio]

Now that we’ve returned enough AFU data, the PSLSE output shows us we’re ready to connect a client!

INFO:PSLSE version 1.002 compiled @ Feb  5 2016 11:47:34
INFO:PSLSE parm values:
    Seed     = 13
    Timeout  = 10 seconds
    Response = 16%
    Paged    = 3%
    Reorder  = 86%
    Buffer   = 82%
INFO:Attempting to connect AFU: afu0.0 @ localhost:32768
PSL_SOCKET: Using PSL protocol level : 0.9908.0
INFO:Clocking afu0.0
WARNING:ah_brlat must be either 1 or 3!
WARNING:ah_brlat must be either 1 or 3!
INFO:Started PSLSE server, listening on kbawx:16384

There are also a couple of warnings about the buffer read latency, but I’ll wait to address that when we look at using the buffer interface. With this bit implemented, I’ll commit my changes and in the next post we’ll look at communicating with our AFU from userspace.