Hello AFU – Part 6

This is the sixth and final part of my Hello AFU tutorial. In the last post, I started building out a state machine for the AFU and read from the data structure that the WED points to. In this post, I’ll finish off the state machine: pull down the data in our stripes, XOR them together, and write the result back to userland.

Reading the Stripes

Since the largest memory size I can request via the PSL is 128 bytes, I’ll make requests for that amount. I need a scratch pad for this data, so I’ll add two 1024-bit internal registers, one per stripe. I’ll also need to know when I’ve received both chunks, so I’ll set up a single-bit register for that as well.

logic [0:1023] stripe1_data;
logic [0:1023] stripe2_data;
logic stripe_received;

In my REQUEST_STRIPES state I’ll request data from stripe1 in one cycle, then stripe2 in the next. I’ll use the command’s tag to know where I am in that process. I’ll also set stripe_received to 0 to indicate that I’ve not yet received either.

REQUEST_STRIPES: begin
  command_out.valid <= 1;
  command_out.size <= 128;
  command_out.command <= READ_CL_NA;
  if (command_out.tag == REQUEST_READ) begin
    command_out.tag <= STRIPE1_READ;
    command_out.address <= request.stripe1;
  end else begin
    command_out.tag <= STRIPE2_READ;
    command_out.address <= request.stripe2;
    current_state <= WAITING_FOR_STRIPES;
    stripe_received <= 0;
  end
end

With the requests for stripe data sent, I need to wait for the data to come back. This could happen in any order, so I need to be ready for either.

WAITING_FOR_STRIPES: begin
  command_out.valid <= 0;
  if (buffer_in.write_valid) begin
    case(buffer_in.write_tag)
      STRIPE1_READ: begin
        if (buffer_in.write_address  == 0) begin
          stripe1_data[0:511] <= buffer_in.write_data;
        end else begin
          stripe1_data[512:1023] <= buffer_in.write_data;
        end
      end
      STRIPE2_READ: begin
        if (buffer_in.write_address == 0) begin
          stripe2_data[0:511] <= buffer_in.write_data;
        end else begin
          stripe2_data[512:1023] <= buffer_in.write_data;
        end
      end
    endcase
  end
end
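
As a cross-check on the indexing above, here is a small software model in C of how the two 64-byte beats land in one 128-byte line. It is only a sketch: store_half and the two constants are hypothetical names for illustration, and it assumes bits [0:511] of the register correspond to bytes 0 to 63 of the cache line.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 128  /* one PSL cache line */
#define HALF_BYTES 64   /* one 512-bit beat on the buffer interface */

/* Copy one 64-byte beat into its half of the line.
 * write_address 0 selects bytes 0..63, anything else selects bytes 64..127,
 * mirroring the if/else in WAITING_FOR_STRIPES. */
static void store_half(uint8_t line[LINE_BYTES], unsigned write_address,
                       const uint8_t beat[HALF_BYTES])
{
    assert(line != NULL && beat != NULL);
    size_t start = (write_address == 0) ? 0 : HALF_BYTES;
    memcpy(line + start, beat, HALF_BYTES);
}
```

Because each beat is placed by its address rather than by arrival order, the model gives the same line whichever half shows up first.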

In the same state, I’ll watch for the tags to come back over the response interface. The first response sets the stripe_received register; the second moves the state on to WRITE_PARITY.

if (response.valid) begin
  if (response.tag == STRIPE1_READ ||
      response.tag == STRIPE2_READ) begin
    if (stripe_received) begin
      current_state <= WRITE_PARITY;
    end else begin
      stripe_received <= 1;
    end
  end
end
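
To convince myself the ordering really doesn’t matter, the same logic can be modelled in a few lines of C. This is a sketch rather than part of the AFU: afu_model and on_response are hypothetical names, and the tag values are placeholders, not the ones the AFU actually uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder tag values; the real ones are defined elsewhere in the AFU. */
enum { STRIPE1_READ = 1, STRIPE2_READ = 2 };

typedef enum { WAITING_FOR_STRIPES, WRITE_PARITY } state_t;

typedef struct {
    state_t state;
    bool stripe_received;  /* set once the first stripe response arrives */
} afu_model;

/* Apply one response: the first stripe response sets the flag,
 * the second moves on to WRITE_PARITY. Other tags are ignored. */
static void on_response(afu_model *m, int tag)
{
    assert(m != NULL);
    if (tag != STRIPE1_READ && tag != STRIPE2_READ)
        return;
    if (m->stripe_received)
        m->state = WRITE_PARITY;
    else
        m->stripe_received = true;
}
```

Feeding the two tags in either order leaves the model in WRITE_PARITY after the second one, which is the behaviour I want from the hardware.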



Where is this Parity?

I decided to compute the parity with an assign: one new internal variable, parity_data, always holds the XOR of stripe1_data and stripe2_data.

logic [0:1023] parity_data;

assign parity_data = stripe1_data ^ stripe2_data;
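
On the userland side, the same XOR can be computed in software to check what the AFU writes back. Here is a minimal sketch in C; xor_parity is a hypothetical helper of mine and not something test_afu is known to contain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR two equally sized stripes byte by byte into out.
 * Software counterpart of:
 *   assign parity_data = stripe1_data ^ stripe2_data; */
static void xor_parity(const uint8_t *stripe1, const uint8_t *stripe2,
                       uint8_t *out, size_t len)
{
    assert(stripe1 != NULL && stripe2 != NULL && out != NULL);
    for (size_t i = 0; i < len; i++)
        out[i] = stripe1[i] ^ stripe2[i];
}
```

XOR-ing the parity with either stripe gives back the other stripe, so the same helper can be used to check a result in both directions.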

Since I set the buffer latency to 1, the data being put on the buffer for writing to memory needs to be shifted back a cycle.

logic [0:511] write_buffer;

shift_register #(512) write_shift (
  .clock(clock),
  .in(write_buffer),
  .out(buffer_out.read_data));

Now I need to write the parity data to the memory at request.parity. This is pretty similar to reading memory. I’ll send a WRITE_NA command and feed my data to buffer_out.read_data through write_buffer, returning the low half for read address 0 and the high half for address 1.

WRITE_PARITY: begin
  if (command_out.tag != PARITY_WRITE) begin
    command_out.command <= WRITE_NA;
    command_out.address <= request.parity;
    command_out.tag <= PARITY_WRITE;
    command_out.valid <= 1;
  end else begin
    command_out.valid <= 0;
    // Read half depending on address
    if (buffer_in.read_address == 0)  begin
      write_buffer <= parity_data[0:511];
    end else begin
      write_buffer <= parity_data[512:1023];
    end
    // Handle response
    if (response.valid &&
        response.tag == PARITY_WRITE) begin
        current_state <= DONE;
    end
  end
end

After the parity is written, the job is complete. The state progresses to DONE when the write comes back on the response interface.

Aligned Writing

Writing the done flag is a little trickier, since it is not on a 128- or 64-byte alignment. The PSL can handle writing to any address, but the data must be aligned within the 128-byte read bus. If the data you’re writing is 64 bytes or less, you can let the same data sit on the buffer interface for both addresses.

In this case, the done field is 32 bytes past the WED, and I’m doing a 1-byte write. I’ll align my data starting at bit 256 and write 8 bits. Writing a 1 in the first byte sets the little-endian unsigned 64-bit number to a non-zero value.

DONE: begin
  if (command_out.tag != DONE_WRITE) begin
    command_out.tag <= DONE_WRITE;
    command_out.size <= 1;
    command_out.address <= wed + 32;
    command_out.valid <= 1;
    write_buffer[256:263] <= 1;
  end else begin
    command_out.valid <= 0;
  end
end
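
The offset arithmetic can be sanity-checked in C. This sketch rests on several assumptions: parity_request is a hypothetical layout inferred from the field names and addresses that test_afu prints (the real definition lives in the userland code), pointers are held as 64-bit integers, and first_bit_in_half reflects my reading that a byte’s position within its 64-byte half decides where it sits in the 512-bit buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical work element layout, inferred from the test_afu output. */
typedef struct {
    uint64_t size;
    uint64_t stripe1;  /* pointer values held as 64-bit integers */
    uint64_t stripe2;
    uint64_t parity;
    uint64_t done;
} parity_request;

/* First bit index, in [0:511] numbering, of the byte at byte_address
 * within its 64-byte half of the cache line. */
static unsigned first_bit_in_half(uint64_t byte_address)
{
    return (unsigned)(byte_address % 64) * 8;
}

/* Read a little-endian 64-bit value from raw bytes,
 * independent of the host's own endianness. */
static uint64_t load_le64(const uint8_t bytes[8])
{
    assert(bytes != NULL);
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | bytes[i];
    return v;
}
```

With this layout, offsetof(parity_request, done) is 32, the address 0x7fa520 from the output below maps to bit 256, and a buffer whose first byte is 1 reads back as the 64-bit value 1.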

With that, the parity is written and the userspace application can see when it completes. Here’s the output from the test_afu application.

INFO:Connecting to host 'localhost' port 16384
[example structure
  example: 0x7fa500
  example->size: 128
  example->stripe1: 0x7fa600
  example->stripe2: 0x7fa780
  example->parity: 0x7fa880
  &(example->done): 0x7fa520
Attached to AFU
Waiting for completion by AFU
done: 0
done: 0
done: 1
PARITY:
That is some proper parity! This is exactly what I'm expecting to see. I'd also like to see this running on some real gear soon
Releasing AFU

That completes the basic function of this AFU, so I’ll commit my changes here.

Larger buffers

Now I’ll extend the design to support buffers larger than 128 bytes. This just requires an offset register that keeps track of the current position relative to the total size of the buffer to generate parity for.

I’ll start by adding a new variable for the offset with the same data type as size.

longint unsigned offset;

Then I’ll set it to 0 in the START state.

offset <= 0;

In the REQUEST_STRIPES state I’ll add the offset to the stripe pointers.

command_out.address <= request.stripe1 + offset;

In the WRITE_PARITY state I’ll add the offset to the parity pointer, and check to see if the operation is complete.

command_out.address <= request.parity + offset;
if (offset + 128 < request.size) begin
  offset <= offset + 128;
  current_state <= REQUEST_STRIPES;
end else begin
  current_state <= DONE;
end
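
The loop this creates can be modelled in software to see how many passes a given size produces. The following C sketch uses the same continuation condition as the snippet above; xor_chunked and CHUNK_BYTES are hypothetical names, and the buffers are assumed to be padded to a multiple of 128 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_BYTES 128  /* one cache line per pass */

/* XOR the stripes one 128-byte chunk at a time and return the number of
 * passes made. Continues while offset + 128 < size, like WRITE_PARITY. */
static uint64_t xor_chunked(const uint8_t *s1, const uint8_t *s2,
                            uint8_t *parity, uint64_t size)
{
    assert(s1 != NULL && s2 != NULL && parity != NULL);
    uint64_t offset = 0;
    uint64_t passes = 0;
    for (;;) {
        for (size_t i = 0; i < CHUNK_BYTES; i++)
            parity[offset + i] = s1[offset + i] ^ s2[offset + i];
        passes++;
        if (offset + CHUNK_BYTES < size)
            offset += CHUNK_BYTES;   /* more data left: go around again */
        else
            break;                   /* last chunk done */
    }
    return passes;
}
```

In this model a size of 128 takes one pass and 256 takes two. A size that is not a multiple of 128, such as 129, still processes a whole final chunk, which is why the model assumes padded buffers.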

With that I’d say this AFU is good enough for this tutorial. I’ll commit my changes and welcome pull requests if you find improvements to this tutorial. Hope this helps you hack on CAPI!
