Hello AFU – Part 5

This is part 5 of my Hello AFU tutorial. In the last post, I built the C application that would attach and utilize the AFU that’s the focus of these posts. In this post I’ll start pulling data from the application’s memory space into the AFU and read the WED structure.

Keeping it Running

Before I start requesting data, some modifications are necessary to notify the underlying system that the AFU is running. So far, I’m not managing the ah_jrunning signal, which should be held high while the AFU is performing a task. If the AFU hasn’t raised ah_jrunning, the PSL will stop driving the AFU’s clock after a short time, so let’s quickly fix this and improve the parity_afu module a little.

I’ll refactor the always_ff block of the parity_afu module to use a case statement to handle commands and add handling for the START command in addition to our existing RESET command.

always_ff @(posedge clock) begin
  if(job_in.valid) begin
    case(job_in.command)
      RESET: begin
        jdone <= 1;
        job_out.running <= 0;
      end
      START: begin
        jdone <= 0;
        job_out.running <= 1;
      end
    endcase
  end else begin
    jdone <= 0;
  end
end

Now that I’m setting job_out.running, I’ll also remove my static assignment of that signal. These changes are committed here.



Planning for the Work Element Submodule

The groundwork for the actual task is almost completely laid. The module that does the real work will be considerably more complex than the components so far, so I’ll plan and create a new module, parity_workelement, to keep that functionality separate.

First I’ll define the inputs and outputs of this module:

Direction | Name | Purpose
Input | clock | Clock signal to follow
Input | enabled | High while AFU is in running state
Input | reset | Signal triggering reset of internal state
Input | wed | The WED pointer from userspace
Input | buffer_in | For reading userspace buffer data
Input | response | To check responses of commands
Output | command_out | To request buffer reads and writes
Output | buffer_out | For writing userspace buffer data

We’ll also define a mostly linear finite state machine to describe the work to be done.

State | Purpose | Next State
START | Request data at WED | WAITING_FOR_REQUEST
WAITING_FOR_REQUEST | Wait for WED data to be available | REQUEST_STRIPES
REQUEST_STRIPES | Send commands to read stripe1 and stripe2 | WAITING_FOR_STRIPES
WAITING_FOR_STRIPES | Wait for stripe data to be available | WRITE_PARITY
WRITE_PARITY | Write XOR’d parity from stripes back to memory | REQUEST_STRIPES if more data to read; DONE otherwise
DONE | Write done flag and halt | n/a
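
To make the transitions in the table concrete, here is a minimal C model of the next-state logic. It is only a sketch for reasoning about the flow, not the hardware implementation: the next_state function and its ready and more_data parameters are names made up for illustration.

```c
#include <stdbool.h>

/* States from the table above. */
typedef enum {
    START,
    WAITING_FOR_REQUEST,
    REQUEST_STRIPES,
    WAITING_FOR_STRIPES,
    WRITE_PARITY,
    DONE
} state_t;

/*
 * Hypothetical helper: returns the state that follows `current`.
 * `ready` models "the data this state waits for has arrived" and is
 * only consulted in the two WAITING_* states. `more_data` is only
 * consulted in WRITE_PARITY.
 */
state_t next_state(state_t current, bool ready, bool more_data)
{
    switch (current) {
    case START:               return WAITING_FOR_REQUEST;
    case WAITING_FOR_REQUEST: return ready ? REQUEST_STRIPES : WAITING_FOR_REQUEST;
    case REQUEST_STRIPES:     return WAITING_FOR_STRIPES;
    case WAITING_FOR_STRIPES: return ready ? WRITE_PARITY : WAITING_FOR_STRIPES;
    case WRITE_PARITY:        return more_data ? REQUEST_STRIPES : DONE;
    case DONE:                return DONE;
    }
    return DONE; /* not reached for valid states */
}
```

The only branch in the machine is at WRITE_PARITY, which is why I call it mostly linear.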

Now I’ll write the first few pieces of this module. I’ll create an enumeration containing the states used by the module. In the module definition itself I’ll define the input/output ports and create an internal register for the current_state. While I’m in here I’ll set up some signals with assign: mostly settings I don’t want to change, plus a few parity generators. Lastly I’ll start the always_ff block that will contain the reset logic and the case statement that implements my state machine.

import CAPI::*;

typedef enum {
  START,
  WAITING_FOR_REQUEST,
  REQUEST_STRIPES,
  WAITING_FOR_STRIPES,
  WRITE_PARITY,
  DONE
} state;

module parity_workelement (
  input logic clock,
  input logic enabled,
  input logic reset,
  input pointer_t wed,
  input BufferInterfaceInput buffer_in,
  input ResponseInterface response,
  output CommandInterfaceOutput command_out,
  output BufferInterfaceOutput buffer_out
);

  state current_state;

  assign command_out.abt = 0,
         command_out.context_handle = 0,
         buffer_out.read_latency = 1,
         command_out.command_parity = ~^command_out.command,
         command_out.address_parity = ~^command_out.address,
         command_out.tag_parity = ~^command_out.tag,
         buffer_out.read_parity = ~^buffer_out.read_data;

  always_ff @ (posedge clock) begin
    if (reset) begin
      current_state <= START;
    end else if (enabled) begin
      case(current_state)
        START: begin
          $display("Started!");
        end
      endcase
    end
  end

endmodule
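
The parity generators above use the reduction XNOR operator (~^), which yields 1 when its operand has an even number of set bits, so that the data plus the parity bit always contain an odd number of ones. As a rough software model of that operator, here is a C sketch; the odd_parity name and the fixed 64-bit width are assumptions for illustration.

```c
#include <stdint.h>

/*
 * Models SystemVerilog's reduction XNOR (~^) over a 64-bit value:
 * returns 1 when `value` has an even number of set bits, 0 otherwise.
 */
unsigned odd_parity(uint64_t value)
{
    unsigned ones = 0;
    while (value) {
        ones += (unsigned)(value & 1u); /* count the lowest bit */
        value >>= 1;
    }
    return (ones % 2u == 0u) ? 1u : 0u;
}
```

For example, an all-zero value has zero set bits (an even count), so its parity bit is 1.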

With that defined, I’ll modify my parity_afu module to include an instance of my parity_workelement:

parity_workelement workelement(
  .clock(clock),
  .enabled(job_out.running),
  .reset(jdone),
  .wed(job_in.address),
  .buffer_in(buffer_in),
  .response(response),
  .command_out(command_out),
  .buffer_out(buffer_out));

To reduce how much I’m looking at during simulation, I’ll also modify my test.do to just show what’s going on in my workelement.

vsim work.top
add wave -position insertpoint sim:/top/a0/svAFU/workelement/*
run 136

Since this is a significant amount of code I’ll commit here before implementing the state machine.

Requesting Data

Requesting the WED data will be easy enough, but I first want a handy container to put it in, so I’ll define a new type in SystemVerilog that matches my WED structure in C. I’ll skip the done field, as I don’t need to look at what’s currently in there; I can set it later by its offset relative to the WED.

typedef struct {
  longint unsigned size;
  pointer_t stripe1;
  pointer_t stripe2;
  pointer_t parity;
} parity_request;
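
To illustrate what setting the done field by its offset means, here is a small C sketch. The struct is my assumption of how the WED structure could be laid out on the application side; the parity_request_c name, the done_address helper, and the use of 64-bit integers for every field are hypothetical. With four 8-byte fields ahead of it, done sits 32 bytes past the WED pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed application-side layout; pointers are held as 64-bit values. */
typedef struct {
    uint64_t size;
    uint64_t stripe1;
    uint64_t stripe2;
    uint64_t parity;
    uint64_t done;
} parity_request_c;

/* Hypothetical helper: address of the done flag for a given WED pointer. */
uint64_t done_address(uint64_t wed)
{
    return wed + (uint64_t)offsetof(parity_request_c, done);
}
```

With a WED at 0x1d91500, this gives 0x1d91520 for the done flag.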

Next I’ll add an internal register to the parity_workelement module that can hold this structure.

parity_request request;

To use the PSL’s Command Interface to request this data, each active command must have a unique tag ID. I’ll define another enum that automatically gives me a unique tag for each purpose.

typedef enum logic [0:7] {
  REQUEST_READ,
  STRIPE1_READ,
  STRIPE2_READ,
  PARITY_WRITE,
  DONE_WRITE
} request_tag;

The simplest way to request data from userspace is using the READ_CL_NA, or “read cacheline, no allocate”, command. I’ll request a read size of 32 bytes, as I’m reading four 64-bit fields: the size and three pointers. I’ll set the tag to REQUEST_READ and use the wed as my address. As with the other interfaces, I need to set a valid signal high for one clock cycle. I’ll do this by setting it high in the START state, transitioning to the WAITING_FOR_REQUEST state, and setting it back low there.

case(current_state)
  START: begin
    command_out.command <= READ_CL_NA;
    command_out.tag <= REQUEST_READ;
    command_out.size <= 32;
    command_out.address <= wed;
    command_out.valid <= 1;
    current_state <= WAITING_FOR_REQUEST;
  end
  WAITING_FOR_REQUEST: begin
    command_out.valid <= 0;
  end
endcase

When the data I’ve requested comes back, it arrives as writes on the buffer_in.write_data bus. This bus is 512 bits wide, but supports 128-byte (1024-bit) requests. As such, two writes occur to deliver the lower (address 0) and higher (address 1) halves. Since I’ve only requested 32 bytes, the data will be in the first 256 bits of the write to address 0 for the REQUEST_READ tag.
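
As a way to picture where a field lands, here is a C sketch that maps a byte offset within the 128-byte request to the write address (which half) and the bit range on the 512-bit bus, using the same [0:511] numbering where byte 0 occupies bits 0 to 7. The buffer_slice type and locate_field function are hypothetical names, and the sketch assumes a field never straddles the two halves.

```c
/* Hypothetical description of where a field sits on the buffer bus. */
typedef struct {
    unsigned write_address; /* 0 = lower 64 bytes, 1 = upper 64 bytes */
    unsigned first_bit;     /* first bit index within write_data */
    unsigned last_bit;      /* last bit index within write_data */
} buffer_slice;

/* Assumes byte_length >= 1 and that the field fits within one 64-byte half. */
buffer_slice locate_field(unsigned byte_offset, unsigned byte_length)
{
    buffer_slice s;
    s.write_address = byte_offset / 64u;
    s.first_bit = (byte_offset % 64u) * 8u;
    s.last_bit = s.first_bit + byte_length * 8u - 1u;
    return s;
}
```

For instance, an 8-byte field at byte offset 8 maps to write address 0, bits 64 to 127.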

One important thing to look out for is that you can see data for your tag on this bus on more than one cycle, so you need to keep capturing it until the response interface confirms that the most recent data is valid.

With this in mind I’ll read the buffer interface each time its valid signal is high, the tag is mine, and the address is the one I’m looking for. It’s also important to remember that the terms read and write for the buffer interface are named from the PSL’s perspective, so even though I’m making a read request to read data, it comes to the AFU on the buses named write_data and such.

if (buffer_in.write_valid &&
    buffer_in.write_tag == REQUEST_READ &&
    buffer_in.write_address == 0) begin
  request.size <= buffer_in.write_data[0:63];
  request.stripe1 <= buffer_in.write_data[64:127];
  request.stripe2 <= buffer_in.write_data[128:191];
  request.parity <= buffer_in.write_data[192:255];
end

When the data comes back, it’s not quite as I’d like it to be.

[Image: wed_data]

My application code spits out what these values should be:

[example structure
  example: 0x1d91500
  example->size: 128
  example->stripe1: 0x1d91600
  example->stripe2: 0x1d91780
  example->parity: 0x1d91880
  &(example->done): 0x1d91520

The issue here is that I’m reading in data that is in a little-endian byte format, but it is being interpreted as big-endian. To deal with this issue I wrote a SystemVerilog function that reverses the byte order of a 64-bit value.

function logic [0:63] swap_endianness(logic [0:63] in);
  return {in[56:63], in[48:55], in[40:47], in[32:39], in[24:31], in[16:23],
          in[8:15], in[0:7]};
endfunction
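
To check the byte swap by hand, here is an equivalent sketch in C. It is a software model for verifying values, not part of the AFU. A pointer such as 0x1d91600 is stored in memory as the bytes 00 16 d9 01 00 00 00 00; read as a big-endian number that is 0x0016d90100000000, and reversing the bytes recovers the original value.

```c
#include <stdint.h>

/* Reverses the byte order of a 64-bit value, one byte at a time. */
uint64_t swap_endianness(uint64_t in)
{
    uint64_t out = 0;
    for (int i = 0; i < 8; i++) {
        out = (out << 8) | (in & 0xffu); /* move lowest byte of in to out */
        in >>= 8;
    }
    return out;
}
```

Applying the function twice returns the original value, which is a handy sanity check.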

I’ll modify my assignments to make use of this function.

request.size <= swap_endianness(buffer_in.write_data[0:63]);
request.stripe1 <= swap_endianness(buffer_in.write_data[64:127]);
request.stripe2 <= swap_endianness(buffer_in.write_data[128:191]);
request.parity <= swap_endianness(buffer_in.write_data[192:255]);

Now that this is in the right byte order, my internal request register is being filled with the appropriate values.

I’ll add a touch of logic to catch the response for my REQUEST_READ tag, which tells me the read has completed, then move to the next state.

if (response.valid && response.tag == REQUEST_READ) begin
  current_state <= REQUEST_STRIPES;
end

With our WED data all the way into our AFU I’ll commit my changes and call it a wrap for this post. In the next post I’ll write the remaining states and write some data back to userspace memory, completing this AFU!
