Such Programming

Tinkerings and Ramblings

Fun With Pointers

Pointers are one of the most misunderstood concepts in C, but also one of the most powerful tools it provides. In general they are just numbers in memory like everything else, but their value is interpreted as the address of other data. In this document I’ll attempt to demystify the arcane pointer and show some practical examples of its power.


Everything Lives In Memory

Before I get into pointers, I want to clarify some aspects of how a program runs. This will all be based on my local Linux environment so your mileage may vary, but conceptually it’s all the same.

The C compiler and linker take your code, parse it, and convert it to a binary file that works for the targeted processor and operating system. The compiler validates your C code and generates machine code that your processor can work with; the linker does some organizational tasks that allow your program to use libraries and be loaded by the operating system.

Most modern UNIX-like systems, Linux included, use the Executable and Linkable Format (ELF) for compiled programs and libraries (macOS is a notable exception and uses Mach-O). An ELF file starts with a header that describes what type of binary it is, followed by a lot of other data that tells the operating system where in virtual memory the program’s assets should be loaded.

Here’s a simple program that we’ll inspect a little bit. It spits out the addresses of a few things, the locations in the running virtual memory space of a program. The & operator before a variable will give you the address of that variable.

#include <stdio.h>

int my_global = 123;

int main(void)
{
  int my_local = 456;

  /* %p expects a void pointer, so cast each address before printing it */
  printf("main has an address, it is %p\n", (void *) main);
  printf("my_global has an address, it is %p\n", (void *) &my_global);
  printf("my_local has an address, it is %p\n", (void *) &my_local);
  printf("printf has an address, it is %p\n", (void *) printf);
  return 0;
}

Tons of crap ends up in the binary for a C program, like all the stuff that provides you the goodness of the standard C library and the stuff the C library itself needs. One of the more interesting things we can look at is the symbol table. Symbols give us names for finding important things in the program. I’ll use readelf to dump the symbol table from the binary, and filter it down to the global symbols.

From this list there are a few things that were defined in my source file. Entries 4 and 53 both refer to the printf call I’m using from glibc which for this program is located at the memory address 0x400470, entry 59 is the my_global variable from my program at the address 0x601040 and entry 64 is my main function at 0x400596. The only thing missing is the my_local variable that’s inside of my main function.

If I run the program a few times I’ll see where the program says these things are.

As seen here, main, my_global and printf all have consistent locations in memory that match what was listed in the symbol table, while my_local changes between runs of the program. This is because the local variable lives on the call stack, and Linux uses a security feature (address space layout randomization, or ASLR) that randomizes where the stack begins. As functions are called in your C program the stack is used for function arguments, function local variables and a pointer to the code that should be run after the function is complete. Each function call gets a new section of the stack, referred to as a Stack Frame.

Just like a string and everything else on the computer, the executable code resides somewhere in memory. I can use objdump to get the machine code version of main, and it’ll show us each instruction that main is composed of and the addresses of each processor operation.

0000000000400596 <main>:
  400596:       55                      push   rbp
  400597:       48 89 e5                mov    rbp,rsp
  40059a:       48 83 ec 10             sub    rsp,0x10
  40059e:       64 48 8b 04 25 28 00    mov    rax,QWORD PTR fs:0x28
  4005a5:       00 00 
  4005a7:       48 89 45 f8             mov    QWORD PTR [rbp-0x8],rax
  4005ab:       31 c0                   xor    eax,eax
  4005ad:       c7 45 f4 c8 01 00 00    mov    DWORD PTR [rbp-0xc],0x1c8
  4005b4:       be 96 05 40 00          mov    esi,0x400596
  4005b9:       bf b8 06 40 00          mov    edi,0x4006b8
  4005be:       b8 00 00 00 00          mov    eax,0x0
  4005c3:       e8 a8 fe ff ff          call   400470 <printf@plt>
  4005c8:       be 40 10 60 00          mov    esi,0x601040
  4005cd:       bf d8 06 40 00          mov    edi,0x4006d8
  4005d2:       b8 00 00 00 00          mov    eax,0x0
  4005d7:       e8 94 fe ff ff          call   400470 <printf@plt>
  4005dc:       48 8d 45 f4             lea    rax,[rbp-0xc]
  4005e0:       48 89 c6                mov    rsi,rax
  4005e3:       bf 00 07 40 00          mov    edi,0x400700
  4005e8:       b8 00 00 00 00          mov    eax,0x0
  4005ed:       e8 7e fe ff ff          call   400470 <printf@plt>
  4005f2:       be 70 04 40 00          mov    esi,0x400470
  4005f7:       bf 28 07 40 00          mov    edi,0x400728
  4005fc:       b8 00 00 00 00          mov    eax,0x0
  400601:       e8 6a fe ff ff          call   400470 <printf@plt>
  400606:       b8 00 00 00 00          mov    eax,0x0
  40060b:       48 8b 55 f8             mov    rdx,QWORD PTR [rbp-0x8]
  40060f:       64 48 33 14 25 28 00    xor    rdx,QWORD PTR fs:0x28
  400616:       00 00 
  400618:       74 05                   je     40061f <main+0x89>
  40061a:       e8 41 fe ff ff          call   400460 <__stack_chk_fail@plt>
  40061f:       c9                      leave  
  400620:       c3                      ret    

Don’t be too daunted by the assembly here; we’re not here to talk about that. Just grasp that when this program is running, main will be residing here in the memory space of the program.

To the memory, there is no difference between program code and bytes that may be strings, integers or pictures of cats. To prove this, I’ll tell the compiler that I want to use main as a string, and I’ll look at the first character of it.

#include <stdio.h>

int main(void)
{
  char *string = (char*) main;

  printf("character: %c\n", string[0]);
  printf("hex: %x\n", (unsigned char) string[0]);

  return 0;
}

When I build and run this program, this is what I get:

$ ./test 
character: U
hex: 55

Just like in my previous program, main starts with the assembly operation push rbp, which is encoded as the byte 0x55. Looked at as an ASCII character, that byte is a U.

Why You Confuse Me So?!

The reason I’m pushing this so hard is to really drive home the fact that everything lives in memory and that the only difference between data types is how you treat the data in memory. You can interpret any given chunk of memory as any type. In the case of pointers, we treat a chunk of memory as an address that points to another chunk of memory.

Let’s look at some less esoteric examples of how this can be useful.

The Great Unknown

For someone writing a C application, it would be fantastic to know how much memory a program will use before running it. Sadly this is almost never the case, so we need ways to get more memory when we need it. The call stack suffices for the variables we use inside of a function, but when the function returns those variables go away. To dynamically allocate space for keeps we use the heap, most commonly via malloc() (memory allocate).

The malloc() function is provided by the C standard library. It hands your program a chunk of memory to work on, asking the operating system for more when it needs to. This is useful for making space as needed to store data in your program.

Here’s an example: this program takes a number of bytes to allocate. It’ll request a chunk of memory of that size, print out its address, and then release that chunk of memory back to the system (to avoid memory leaks!).

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	void *dynamic_space;
	int size;

	if(argc != 2)
	{
		fprintf(stderr, "Usage: %s <number of bytes to allocate>\n", argv[0]);
		return 1;
	}

	size = atoi(argv[1]);
	if(size < 1)
	{
		fprintf(stderr, "Number of bytes must be larger than 0\n");
		return 1;
	}

	dynamic_space = malloc(size);
	if(dynamic_space == NULL)
	{
		/* malloc returns NULL when it cannot provide the memory */
		fprintf(stderr, "Could not allocate %d bytes\n", size);
		return 1;
	}

	printf("We allocated %d bytes, they are located at %p\n", size, dynamic_space);

	free(dynamic_space);

	return 0;
}

I’ll give this program a few runs, asking for different amounts of memory.

As you can see, the memory gets allocated in a variety of places even if you ask for the same amount of memory on another run of the program. On my machine, when I asked for a huge allocation it provided me a much higher memory address to use. The most likely reason is that glibc’s malloc serves large requests (above 128 KiB by default) with mmap() instead of carving them out of the regular heap, so they land in a different region of the address space.

It’s easy enough to ask for space dynamically. The new problem this introduces is how to organize it all.

Pointer Usage

There is no one perfect way to organize your pointers, but there are a ton of tried and true methods that are useful for a variety of cases.

Null terminated arrays

One of the most common structures is a null terminated array; in the case of the char type we normally just call this a string. One of the handy things about these types of structures is that they lend themselves to pointer arithmetic. A null terminated array is a contiguous block of memory that contains no information about its size, but you can find the end of it by looking for the zero value at the end of it (for a string, that is the '\0' character).

In C the ++ and -- operators increment and decrement values, but on pointers they step by the size of the data type the pointer points to. For character pointers that is identical to incrementing a number, since a char uses a single byte, but for a 4 byte integer the address moves as if you had written += 4. Using this functionality you can loop through a null terminated array pretty easily, stopping when the value the pointer points to is zero.

Here’s an example of that:

#include <stdio.h>
 
char my_string[] = "well hello there";
int many_numbers[] = {1, 2, 3, 0};
 
int main(int argc, char *argv[]) {
  char *a_character; 
  int *an_integer;
 
  // Start where my_string points, printing each character until a_character
  // points to a zero value
  for(a_character = my_string; *a_character; a_character++) {
    printf("at %p we have '%c'\n", (void *) a_character, *a_character);
  }
  printf("\n");
 
  for (an_integer = many_numbers; *an_integer; an_integer++) {
    printf("at %p we have %d\n", (void *) an_integer, *an_integer);
  }
 
  return 0;
}

And here’s that program in action.

As you can see, the address is incremented by 1 for character pointers and by 4 for integer pointers. We can keep reading as long as the dereferenced value (the value at the address the pointer points to) is not zero. When it is zero we’re at the end of the array.

To drive this home further, here’s a program that’ll determine the length of a string given as the first argument.

#include <stdio.h>
 
int main(int argc, char *argv[]) {
	int size = 0;
	char *ptr;
 
	if (argc != 2) {
		fprintf(stderr, "Usage: %s <string>\n", argv[0]);
		return 1;
	}
 
	ptr = argv[1];
	while (*ptr) {
		size++;
		ptr++;
	}
 
	printf("first argument is %d bytes long\n", size);
 
	return 0;
}

And let’s see how that fares.

Passing By Reference

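In many languages, we hear about how things are passed by reference, as opposed to things that are passed by value. In many interpreted languages you have little to no control over how that works, but C lets (makes?) you control that completely: arguments are always copied, so if you want a function to change your variable you have to hand it the variable’s address yourself.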

Let’s say I want to make a function that randomizes two numbers for me. In C you can’t return more than one value from a function, so setting two variables with one function call isn’t possible without pointers. I could return a struct holding two random integers and then set my variables to those values, but that means my function doesn’t do everything I want it to on its own.

The easiest way to do this is to pass the addresses (pointers) of the variables I want randomized. My function can then dereference them with * and use them as if they were ordinary variables.

#include <stdio.h>
#include <stdlib.h>


void randomize_them(int *a, int *b) {
  *a = random();
  *b = random();
}

int main(void) {
  int my_a = 1;
  int my_b = 2;

  printf("before\na = %d\nb = %d\n", my_a, my_b);
  randomize_them(&my_a, &my_b);
  printf("after\na = %d\nb = %d\n", my_a, my_b);

  return 0;
}

Now that wasn’t too bad, was it?

That’ll wrap things up for this post, which covers most of the basics about pointers. I will be covering a bit more pointer-fu in the next post about defining and using structures in C, but I think this provides a good start for now.
