Hello AFU – Part 4

This is part 4 of my Hello AFU tutorial. In the previous section we implemented the functionality to handle requests for the AFU descriptor. In this part we’ll shift focus a little to writing the C code that runs on the application side, and send our first bit of data to our AFU.

Getting the Code Started

I like to start a new C project by writing a basic Makefile. This one will just set up some variables to include the libCXL library from PSLSE.

LIBCXL_PATH=~/workprojects/pslse/libcxl
LIBCXL_INCLUDE=-I $(LIBCXL_PATH) -L $(LIBCXL_PATH) -lcxl -lpthread
LIBRARIES=$(LIBCXL_INCLUDE)
CC=gcc -Wall -o $@ $< $(LIBRARIES)

all: test_afu

test_afu: test_afu.c
    $(CC)

clean:
    rm -f test_afu

Next I’ll write a basic C file that will just open a handle to the AFU and clean up.

#include <stdio.h>
#include "libcxl.h"


int main(int argc, char *argv[])
{
    struct cxl_afu_h *afu;

    afu = cxl_afu_open_dev("/dev/cxl/afu0.0d");
    if(!afu)
    {
        printf("Failed to open AFU: %m\n");
        return 1;
    }

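    /* cxl_afu_attach() returns non-zero on failure, so check it */
    if(cxl_afu_attach(afu, 0x0123456789abcdef))
    {
        printf("Failed to attach to AFU: %m\n");
        cxl_afu_free(afu);
        return 1;
    }
    printf("Attached to AFU\n");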

    cxl_afu_free(afu);

    return 0;
}

Next, just to make things a little faster, I’ve noticed my AFU typically becomes ready around 136ns, so I’ll modify my test.do to run for 136ns right at the start. At this point I can make my test_afu binary and run it as long as I set my linker path via export LD_LIBRARY_PATH=~/workprojects/pslse/libcxl/ prior to running it.

The last thing to set up before running is a pslse_server.dat file that contains the host:port the simulated libCXL should connect to. I’ll point mine to localhost:16384, which is the default if you’re testing locally.

After kicking off my test_afu application and running the AFU for a few cycles, I’ll see my second argument to cxl_afu_attach show up on my ha_jea bus. This chunk of data is usually referred to as the Work Element Descriptor (WED).

[Image: wed_signal]

I’ll commit my changes and we’ll start making a little better use of that WED.



Aligning data

Many of the requests we’ll soon make to read data from the application’s memory space require the data to be aligned to 128-byte addresses. There are a few ways to accomplish this; my go-to is the aligned_alloc() function that is part of the C11 standard.

This function provides an interface very similar to the classic malloc() function, except that its first parameter lets you specify the memory alignment you want.
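As a rough sketch of what such a call can look like, here is a small wrapper around aligned_alloc(). The helper names alloc_aligned_128 and is_aligned_128 are made up for illustration and are not part of libcxl. The rounding step is there because the C11 text, as originally published, requires the size to be a multiple of the alignment; some C libraries are more lenient, but I would not rely on that.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not part of libcxl: allocate at least `size` bytes
 * on a 128-byte boundary. The size is rounded up to a multiple of the
 * alignment before it is handed to aligned_alloc(). */
static void *alloc_aligned_128(size_t size)
{
    const size_t alignment = 128;
    size_t rounded = (size + alignment - 1) / alignment * alignment;

    if (rounded == 0)
        rounded = alignment;  /* avoid a zero-sized request */
    return aligned_alloc(alignment, rounded);
}

/* Returns 1 when the pointer sits on a 128-byte boundary, 0 otherwise. */
static int is_aligned_128(const void *p)
{
    return ((uintptr_t)p % 128) == 0;
}
```

Memory obtained this way is released with the normal free() call.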

Now that we can align data, I’ll create my WED structure for this parity-generating AFU.

typedef struct
{
    __u64 size;
    void *stripe1;
    void *stripe2;
    void *parity;
    __u64 done;
} parity_request;
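
Since the AFU will eventually read this structure straight out of memory, the byte offset of each member matters. The sketch below spells out the layout I would expect, assuming a 64-bit host with 8-byte pointers. It uses uint64_t in place of __u64 and a differently named type (parity_request_layout, chosen only for this illustration) so it builds without the Linux headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Same member order as parity_request, with uint64_t standing in for
 * __u64 so the sketch builds on its own. */
typedef struct
{
    uint64_t size;
    void *stripe1;
    void *stripe2;
    void *parity;
    uint64_t done;
} parity_request_layout;

/* Assumption: LP64 host, so every member is 8 bytes wide. */
_Static_assert(sizeof(void *) == 8, "sketch assumes 64-bit pointers");
_Static_assert(offsetof(parity_request_layout, size) == 0, "size");
_Static_assert(offsetof(parity_request_layout, stripe1) == 8, "stripe1");
_Static_assert(offsetof(parity_request_layout, stripe2) == 16, "stripe2");
_Static_assert(offsetof(parity_request_layout, parity) == 24, "parity");
_Static_assert(offsetof(parity_request_layout, done) == 32, "done");
_Static_assert(sizeof(parity_request_layout) == 40, "total size");
```

If any of these assumptions do not hold on your platform, the build stops at the failing check rather than producing a layout the AFU would misread.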

Next I’ll create my example parity request, using aligned allocations for each block.

parity_request *example;
size_t size = 128, alignment = 128;

example = aligned_alloc(alignment, sizeof(*example));
example->size = size;
example->stripe1 = aligned_alloc(alignment, size);
example->stripe2 = aligned_alloc(alignment, size);
example->parity = aligned_alloc(alignment, size);

The intention here is that the data in the structure members stripe1 and stripe2 will be XOR’d together, and the result put in the parity member. Once the operation is complete, the AFU will set the done field to a non-zero value.
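
For reference, the same operation in software is just a byte-wise XOR. This is a minimal sketch with a made-up function name (xor_parity); it is handy for computing the expected result on the host so the AFU output can be checked against it.

```c
#include <stddef.h>
#include <stdint.h>

/* Software reference for the operation the AFU is meant to perform:
 * parity[i] = stripe1[i] ^ stripe2[i] for every byte of the stripes. */
static void xor_parity(const uint8_t *stripe1, const uint8_t *stripe2,
                       uint8_t *parity, size_t size)
{
    for (size_t i = 0; i < size; i++)
        parity[i] = stripe1[i] ^ stripe2[i];
}
```

A nice property of XOR parity is that either stripe can be rebuilt by XOR’ing the parity with the other stripe.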

Before sending this request to the AFU, I’ll copy some data into both buffers and zero out the done field.

memcpy(example->stripe1,
       "asfb190jwqsefx0amxAqa1nlkaf78sa0g&0ha8dngj3t21078fnajl38n32j3np2"
       "x3t8wefiankxkfmgm ncmbqx8ehn2jkaeubgfbuapwnjxkg09f0w9es80872981",
       size);
memcpy(example->stripe2,
       "\x35\x1b\x07\x16\x11\x50\x43\x4a\x04\x1e\x1e\x00\x46\x08\x42\x0e"
       "\x1d\x1d\x33\x51\x11\x50\x1c\x05\x1f\x18\x47\x17\x6c\x1b\x08\x43"
       "\x47\x4f\x43\x48\x04\x40\x05\x0d\x13\x06\x4a\x54\x45\x59\x51\x43"
       "\x18\x2f\x49\x0c\x4a\x09\x4b\x48\x0b\x50\x46\x03\x5d\x09\x50\x46"
       "\x17\x13\x07\x5d\x12\x4b\x46\x20\x46\x0a\x4b\x19\x07\x15\x02\x47"
       "\x01\x49\x05\x06\x4d\x16\x1e\x58\x4b\x00\x0d\x4e\x46\x02\x02\x12"
       "\x45\x07\x17\x09\x08\x0b\x1b\x06\x50\x18\x00\x4a\x0b\x04\x0a\x55"
       "\x19\x14\x55\x16\x55\x45\x14\x5d\x51\x4a\x17\x41\x56\x57\x5f",
       size);
example->done = 0;

I’ll also add some print statements to show me these structure members.

printf("[example structure]\n");
printf("  example: %p\n", (void *)example);
printf("  example->size: %llu\n", (unsigned long long)example->size);
printf("  example->stripe1: %p\n", example->stripe1);
printf("  example->stripe2: %p\n", example->stripe2);
printf("  example->parity: %p\n", example->parity);
printf("  &(example->done): %p\n", &(example->done));

I’ll modify my cxl_afu_attach() call to send the pointer to this parity_request structure.

cxl_afu_attach(afu, (__u64)example);

Lastly, I’ll add a while loop to wait until the AFU has completed its operation, then print the data in the parity member.

printf("Waiting for completion by AFU\n");
while(!example->done){
  sleep(1);
}

printf("PARITY:\n%s\n", (char *)example->parity);
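
While the AFU cannot yet set the done field, a loop like this never returns, so a bounded wait can be convenient during bring-up. The following is only a sketch: wait_for_done and its parameters are names I made up, and I am assuming that a volatile read is enough for the host to observe the AFU’s write on your platform.

```c
#include <stdint.h>
#include <time.h>

/* Poll a completion flag that another agent (here, the AFU) writes.
 * Returns 0 once the flag is non-zero, or -1 after max_polls attempts.
 * The volatile qualifier forces a fresh load on every iteration.
 * interval_ns must be below one second (1000000000 ns). */
static int wait_for_done(const volatile uint64_t *done, unsigned max_polls,
                         long interval_ns)
{
    struct timespec delay = { .tv_sec = 0, .tv_nsec = interval_ns };

    for (unsigned i = 0; i < max_polls; i++) {
        if (*done)
            return 0;
        nanosleep(&delay, NULL);
    }
    return *done ? 0 : -1;
}
```

Called as wait_for_done(&example->done, 10, 500000000L), this would give up after roughly five seconds instead of hanging.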

At this point we can get the address of our WED structure in our AFU, but we’ll need to use the PSL’s Command and Buffer interfaces to request the data inside of that structure, which I’ll cover in the next post. Ending on this point I’ll commit my application code changes and see you in the next post!

Hello AFU – Part 3

This is part 3 of my Hello AFU tutorial. In the last part we built components to handle the AFU reset. In this part we’ll look at the requests coming in for the AFU descriptor and build a mechanism to send this data back to the PSL.

svAFU Ports

Before we get started here, I noticed something odd that’s good to be aware of. In the first post I commented out the structured inputs and outputs because Quartus was throwing errors that they were not defined. I assumed it didn’t like that they weren’t being used, but after re-cloning the repo down it started to give me those errors again.

It looks like this error might be related to the order in which the components are built. If you comment out all the structured inputs, the project will synthesize successfully. After that, you can uncomment them all and it will synthesize just fine. If anyone has insight into why this is happening or a better fix, I’d appreciate any feedback you have to offer.



AFU Descriptor Read Requests

After the AFU handles its initial reset signal, the next batch of signals are requests over the MMIO interface for the AFU descriptor. The AFU descriptor provides some details about the AFU’s function and setup.

To add the MMIO interfaces to the wave viewer, I’ll add a do watch_mmio_interface.do line to my test.do script before the run 40 command.

The MMIO operations are synchronous so the PSL will send a single request and wait until it gets a response. We can see the first signal coming in here:

[Image: first_mmio]

Like with the job interface, ha_mmval will be raised when a valid command is active. ha_mmcfg being high lets us know this is a request for data in the AFU descriptor. ha_mmrnw is high for read requests and low for write requests. ha_mmdw is low for 32-bit requests and high for 64-bit. ha_mmad is the address of the data being requested, and ha_mmadpar is the odd parity bit for that address. ha_data and ha_datapar are only used for write requests, so we don’t need to look at those quite yet.
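
If odd parity is unfamiliar: the parity bit is chosen so that the data bits plus the parity bit together contain an odd number of ones. Here is a small C model of that rule (odd_parity is a name I picked for illustration). It computes the same thing as the ~^ reduction used for the MMIO data parity later in this post.

```c
#include <stdint.h>

/* Odd parity: return the bit that makes the total number of ones in
 * (value + parity bit) odd. An even count of ones in value gives 1,
 * an odd count gives 0. */
static unsigned odd_parity(uint64_t value)
{
    unsigned ones = 0;

    while (value) {
        ones += (unsigned)(value & 1u);
        value >>= 1;
    }
    return (ones % 2 == 0) ? 1u : 0u;
}
```

For example, an address of 0 has no bits set, so its odd parity bit is 1.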

Parameterizing the Shift Register

Before we can send data back, we need to make a modification to our shift register. Just as we shifted the jdone signal back a clock cycle, we need to do the same here for ah_mmack and ah_mmdata. Our shift register as-is will work fine for ah_mmack, as it’s also a single-bit signal. For ah_mmdata we need it to be 64 bits wide to support the whole bus being shifted back a clock cycle.

SystemVerilog provides a construct to parameterize a module, allowing us to change aspects of how it operates on a per-instance basis. In this case I want to add a width parameter that lets us set the width of the bus.

The logic in the always_ff block does not need to change for this; we just need to define the parameter and use it in the input and output port declarations:

module shift_register #(parameter width = 1) (
  input logic clock,
  input logic [0:width-1] in,
  output logic [0:width-1] out);

In this change, we now have a default width of 1, so that we don’t need to change the shift registers already in use. For the module we’re about to build we can now create an instance like this for a 64 bit wide shifter:

shift_register #(64) data_shift(
  .clock(clock),
  .in(data),
  .out(mmio_out.data));

Handling AFU Descriptor requests

I will define and add a new file mmio.sv to the project that will be responsible for all MMIO request handling. It will have some internal variables ack and data to hold the data that will be shifted back. Additionally it will have some logic to set the ah_mmdatapar bit. That parity bit doesn’t need to be shifted because we can hook it up to the current output to save a couple logic gates.

import CAPI::*;

module mmio (
  input logic clock,
  input MMIOInterfaceInput mmio_in,
  output MMIOInterfaceOutput mmio_out);

  logic ack;
  logic [0:63] data;

  shift_register ack_shift(
    .clock(clock),
    .in(ack),
    .out(mmio_out.ack));

  shift_register #(64) data_shift(
    .clock(clock),
    .in(data),
    .out(mmio_out.data));

  // Set parity bit for MMIO output
  assign mmio_out.data_parity = ~^mmio_out.data;

  always_ff @(posedge clock) begin
    if(mmio_in.valid) begin
      if(mmio_in.cfg) begin
        if(mmio_in.read) begin
          ack <= 1;
          data <= 1;
        end
      end
    end else begin
      ack <= 0;
      data <= 0;
    end
  end

endmodule

For now, I’m not as worried about sending proper data as I am getting all the pieces laid out and working. I’ll add an instance of this new mmio module in my parity_afu module.

mmio mmio_handler(
  .clock(clock),
  .mmio_in(mmio_in),
  .mmio_out(mmio_out));

Looking at the waves now, we can see 7 MMIO requests coming in, and for each we’re sending back a simple 1 on the data bus.

[Image: first_mmio_writes]

Since we didn’t send a proper descriptor, PSLSE complains ERROR:AFU descriptor num_of_processes=0!

Either way it’s starting to come together so I’ll commit my changes and move on.

Defining a New Type

It took me a while to find a way to handle these AFU requests that I felt was functional and cleanly coded. Most AFU descriptor implementations I’ve seen so far use some Verilog implementation of ROM, and this is how I first implemented it.

I found this method to be a bit cumbersome, so I decided to extend my capi.sv to include a new structure definition for an AFU descriptor. This format is modeled after what’s described in the CAPI User’s Manual.

  typedef struct packed {
    bit [0:15] num_ints_per_process;
    bit [0:15] num_of_processes;
    bit [0:15] num_of_afu_crs;
    bit [0:15] req_prog_model;
    bit [0:199] reserved_1;
    bit [0:55] afu_cr_len;
    bit [0:63] afu_cr_offset;
    bit [0:5] reserved_2;
    bit psa_per_process_required;
    bit psa_required;
    bit [0:55] psa_length;
    bit [0:63] psa_offset;
    bit [0:7] reserved_3;
    bit [0:55] afu_eb_len;
    bit [0:63] afu_eb_offset;
  } AFUDescriptor;

To support reading the right portions of the AFU descriptor, a SystemVerilog function felt like the best route. This initial implementation is built just to support the regions of the AFU descriptor that I’ve seen requests come in to so far.

function bit [0:63] read_afu_descriptor(AFUDescriptor descriptor,
                                        bit [0:23] address);
  case(address)
    'h0: begin
      return {descriptor.num_ints_per_process,
              descriptor.num_of_processes,
              descriptor.num_of_afu_crs,
              descriptor.req_prog_model};
    end
    default: begin
      return 0;
    end
  endcase
endfunction

With this new type and function to help reading it added to my CAPI package, I can create an instance of this type in my mmio module and set the values appropriately.

  AFUDescriptor afu_desc;

  assign afu_desc.num_ints_per_process = 0,
         afu_desc.num_of_processes = 1,
         afu_desc.num_of_afu_crs = 0,
         afu_desc.req_prog_model = 16'h8010,
         afu_desc.reserved_1 = 0,
         afu_desc.afu_cr_len = 0,
         afu_desc.afu_cr_offset = 0,
         afu_desc.reserved_2 = 0,
         afu_desc.psa_per_process_required = 0,
         afu_desc.psa_required = 0,
         afu_desc.psa_length = 0,
         afu_desc.psa_offset = 0,
         afu_desc.reserved_3 = 0,
         afu_desc.afu_eb_len = 0,
         afu_desc.afu_eb_offset = 0;
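
To sanity-check what should appear on the data bus for a read of address 0, here is a C model of the concatenation that read_afu_descriptor() performs. The function name descriptor_word0 is made up for this illustration. The first field in a SystemVerilog concatenation lands in the most significant bits, which is what the shifts below reproduce.

```c
#include <stdint.h>

/* Pack the four 16-bit fields returned for address 0 into one 64-bit
 * word, most significant field first, mirroring the concatenation in
 * read_afu_descriptor(). */
static uint64_t descriptor_word0(uint16_t num_ints_per_process,
                                 uint16_t num_of_processes,
                                 uint16_t num_of_afu_crs,
                                 uint16_t req_prog_model)
{
    return ((uint64_t)num_ints_per_process << 48) |
           ((uint64_t)num_of_processes << 32) |
           ((uint64_t)num_of_afu_crs << 16) |
           (uint64_t)req_prog_model;
}
```

With the values assigned above (0, 1, 0 and 16'h8010), this works out to 0x0000000100008010, which is the value to look for in the wave viewer.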

The last step is to replace our hard-coded response with the newly defined function.

data <= read_afu_descriptor(afu_desc, mmio_in.address);

With that completed, I’ll verify I’m getting the expected behavior during simulation.

[Image: afu_mmio]

Now that we’ve returned enough AFU data, the PSLSE output shows us we’re ready to connect a client!

INFO:PSLSE version 1.002 compiled @ Feb  5 2016 11:47:34
INFO:PSLSE parm values:
    Seed     = 13
    Timeout  = 10 seconds
    Response = 16%
    Paged    = 3%
    Reorder  = 86%
    Buffer   = 82%
INFO:Attempting to connect AFU: afu0.0 @ localhost:32768
PSL_SOCKET: Using PSL protocol level : 0.9908.0
INFO:Clocking afu0.0
WARNING:ah_brlat must be either 1 or 3!
WARNING:ah_brlat must be either 1 or 3!
INFO:Started PSLSE server, listening on kbawx:16384

There are also a couple of warnings about the buffer read latency, but I’ll wait to address that when we look at using the buffer interface. With this bit implemented, I’ll commit my changes and in the next post we’ll look at communicating with our AFU from userspace.

Hello AFU – Part 2

This is the second part of my Hello AFU tutorial. In the last part we set up the base project and wrote a few scripts that will help view signals in ModelSim.

In this part we’ll look at the signals received by the AFU from the PSL and implement the reset handling required by all AFUs.

Resetting the AFU

The first thing the PSL requests of the AFU is to reset to a known good state. This is very easy to implement at this point as we have no internal state! Let’s look at the signals coming in for this.

[Image: reset_command]

There are two implemented job control commands documented in the CAPI User’s Manual: START(0x90) and RESET(0x80). Initially and between jobs, the RESET command is sent to the AFU. The expectation on the AFU when given a RESET command is that it will reset its internal state, then raise the ah_jdone signal for a cycle.

Tip: All signals that begin with ah_ are signals from the Accelerator to the Host; signals starting with ha_ are from the Host to the Accelerator.

The purpose of each signal is documented in the CAPI User’s Manual. For the RESET command, only ha_jval and ha_jcom matter on the receiving side. ha_jcompar should also be set to the odd parity bit for ha_jcom.

Before we handle this signal, let’s fix all the floating signals we’re sending back. They are floating because we aren’t explicitly setting these signals high or low, so let’s set them all low. I’ll set the timebase_request and parity_enabled signals low while I’m at it. The signal names are different in my code, as I’ve given them more verbose names in the capi.sv package file used to abstract these out. I’ve added this to my parity_afu module definition just above the always_ff statement.

assign job_out.running = 0,
       job_out.done = 0,
       job_out.cack = 0,
       job_out.error = 0,
       job_out.yield = 0,
       timebase_request = 0,
       parity_enabled = 0;

Additionally, I will need to uncomment the portion of afu.sv that routes these signals into the AFU.

Also, before testing this in the simulator, I’ll write one more do file, test.do, to make it easier to see what I’m working on in simulation. It will prepare the simulation, watch the job interface, then run for 10 cycles.

vsim work.top
do watch_job_interface.do
run 40

With those changes made I’ll verify the signals are now being driven low.

[Image: low_signals]



Driving signals

There are two main ways to drive signals in SystemVerilog: blocking = and non-blocking <=. These terms can be confusing, and this page can help clear them up a little. Until you’re comfortable with the difference, I suggest using = only in assign statements where you are driving a signal to a constant value, as we have so far. When you use <=, the value is held in a register, which preserves it until it is changed later.
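
C has no direct equivalent of a non-blocking assignment, so take the following only as an analogy. It models two registers that try to exchange values on a clock edge; the struct and function names are invented for this sketch. The blocking version updates one value at a time, while the non-blocking version samples every right-hand side first and then updates all the targets together.

```c
/* Two "registers" that exchange values on a clock edge. */
struct regs { int a; int b; };

/* Blocking-style update: each statement takes effect immediately,
 * so the second statement sees the already-updated a. */
static struct regs step_blocking(struct regs r)
{
    r.a = r.b;
    r.b = r.a;
    return r;
}

/* Non-blocking-style update: every right-hand side is sampled first,
 * then all targets are updated together, so the values swap. */
static struct regs step_nonblocking(struct regs r)
{
    struct regs next;

    next.a = r.b;
    next.b = r.a;
    return next;
}
```

Starting from a = 1 and b = 2, the blocking version ends with both registers holding 2, while the non-blocking version ends with a = 2 and b = 1.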

To send our ah_jdone signal, we need to detect when the ha_jval is high combined with a ha_jcom set to RESET, then we’ll know it’s appropriate to raise the ah_jdone signal.

For signals that we want to change during a clock cycle, we can put non-blocking assignments in an always_ff block. We first need to remove the assignment made with the assign statement, as only one driver can be used to set the signal. Next we’ll add an if statement that drives the done signal high only if a valid reset command is given. We also need to set it low in all other conditions, so we’ll use an else statement to ensure we get that behavior.

always_ff @(posedge clock) begin
  if(job_in.valid & job_in.command == RESET) begin
    job_out.done <= 1;
  end else begin
    job_out.done <= 0;
  end
end

[Image: done_wrong]

I’ve been advised that there is one thing wrong with this design: the done signal should be sent on the next clock cycle. We need something to help us delay the signal.

Making a shift register

There are a few ways to do this; I’ve elected to use a shift register.

This shift register will pass its input to its output, delaying changes by a single clock cycle. It’s not the most useful shift register but it will do for this purpose.

module shift_register (
  input logic clock,
  input logic in,
  output logic out);

  always_ff @ (posedge clock) begin
    out <= in;
  end
endmodule

To use this module in our parity_afu module, we’ll need to create an instance of our shift_register module and a variable to reference its input. In the instance we create we’ll reference the inputs and outputs, then change our job logic to use the new jdone variable instead of the direct output.

logic jdone;

shift_register jdone_shift(
  .clock(clock),
  .in(jdone),
  .out(job_out.done));

Once this is all said and done, we should get the output shifted back a cycle as desired. Since the shift register is set up with our internal jdone as input and job_out.done as output, this will affect all assignments to jdone.
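
To make the timing concrete, here is a small C model of the reset detection feeding the one-cycle shift register. It is a sketch of the behavior rather than of the hardware: the names afu_model, clock_edge and JOB_RESET are invented for the model, and each call to clock_edge() stands in for one rising clock edge.

```c
#include <stdint.h>

#define JOB_RESET 0x80  /* RESET command code from the CAPI User's Manual */

/* State of the model: jdone is the internal signal, done is the
 * registered output that the PSL sees one cycle later. */
struct afu_model {
    int jdone;
    int done;
};

/* Advance the model by one rising clock edge. Both registers sample
 * their inputs before either one updates, like non-blocking assignments. */
static void clock_edge(struct afu_model *m, int valid, uint8_t command)
{
    int next_done = m->jdone;                       /* shift register */
    int next_jdone = valid && command == JOB_RESET; /* reset detection */

    m->done = next_done;
    m->jdone = next_jdone;
}
```

Feeding a valid RESET on one edge sets jdone on that edge, and done follows on the next edge before dropping back to 0 on the one after.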

[Image: done_right]

See these changes committed here.

This should be sufficient to handle the reset command for now. The next set of signals our AFU will receive will be requests for the AFU descriptor over the MMIO interface, I’ll walk through implementing this in my next post.