Pointers are one of the most misunderstood concepts in C, but also one of the most powerful tools it provides. In general they are just numbers in memory like everything else, but their value is interpreted as the address of other data. In this document I’ll attempt to demystify the arcane pointer and show some practical examples of its power.
Everything Lives In Memory
Before I get into pointers, I want to clarify some aspects of how a program runs. This will all be based on my local Linux environment so your mileage may vary, but conceptually it’s all the same.
The C compiler and linker you use take your code, parse it, and convert it to a binary file that works for the targeted processor and operating system. The compiler validates your C code and generates machine code that your processor can work with; the linker does some organizational tasks to allow your program to use libraries and be loaded by the operating system.
Most modern UNIX-like systems, including Linux, use the Executable and Linkable Format (ELF) for compiled programs and libraries (macOS, which uses Mach-O, is a notable exception). An ELF file starts with a header that describes what type of binary the file is, followed by a lot of other data that tells the operating system where in virtual memory the program’s assets should be loaded.
Here’s a simple program that we’ll inspect a little bit. It prints the addresses of a few things, that is, their locations in the virtual memory space of the running program. The & operator before a variable gives you the address of that variable.
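#include <stdio.h>
int my_global = 123;
int main(void)
{
    int my_local = 456;
    /* %p expects a void pointer, so every address is cast explicitly.
     * Casting a function to void * is not guaranteed by ISO C, but it
     * works on Linux and other POSIX systems. */
    printf("main has an address, it is %p\n", (void *)main);
    printf("my_global has an address, it is %p\n", (void *)&my_global);
    printf("my_local has an address, it is %p\n", (void *)&my_local);
    printf("printf has an address, it is %p\n", (void *)printf);
    return 0;
}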
Tons of crap ends up in the binary for a C program, like all the stuff that provides you the goodness of the standard C library and the stuff the C library itself needs. One of the more interesting things we can look at is the symbol table. Symbols give us names for finding important things in the program. I’ll use readelf to dump the symbol table from the binary, and filter the output down to the global symbols.
From this list, a few entries correspond to things in my source file. Entries 4 and 53 both refer to the printf call I’m using from glibc, which for this program is reached through the address 0x400470 (the printf@plt stub, as the disassembly further down shows). Entry 59 is the my_global variable from my program at the address 0x601040, and entry 64 is my main function at 0x400596. The only thing missing is the my_local variable that’s inside of my main function.
If I run the program a few times I’ll see where the program says these things are.
As seen here, main, my_global and printf all have consistent locations in memory that match what was listed in the symbol table, while my_local changes between runs of the program. This is because the local variable lives on the call stack, and Linux uses a security feature (address space layout randomization, or ASLR) that randomizes where the stack begins. The other addresses stay put only because this binary was not built as a position-independent executable; on toolchains that build those by default, they would move between runs as well. As functions are called in your C program, the stack is used for function arguments, function local variables and a pointer to the code that should be run after the function is complete (the return address). Each function call gets a new section of the stack, referred to as a stack frame.
Just like strings and everything else on the computer, executable code resides somewhere in memory. I can use objdump to get the machine code version of main, and it’ll show us each instruction that main is composed of along with the address of each instruction.
0000000000400596 <main>:
400596: 55 push rbp
400597: 48 89 e5 mov rbp,rsp
40059a: 48 83 ec 10 sub rsp,0x10
40059e: 64 48 8b 04 25 28 00 mov rax,QWORD PTR fs:0x28
4005a5: 00 00
4005a7: 48 89 45 f8 mov QWORD PTR [rbp-0x8],rax
4005ab: 31 c0 xor eax,eax
4005ad: c7 45 f4 c8 01 00 00 mov DWORD PTR [rbp-0xc],0x1c8
4005b4: be 96 05 40 00 mov esi,0x400596
4005b9: bf b8 06 40 00 mov edi,0x4006b8
4005be: b8 00 00 00 00 mov eax,0x0
4005c3: e8 a8 fe ff ff call 400470 <printf@plt>
4005c8: be 40 10 60 00 mov esi,0x601040
4005cd: bf d8 06 40 00 mov edi,0x4006d8
4005d2: b8 00 00 00 00 mov eax,0x0
4005d7: e8 94 fe ff ff call 400470 <printf@plt>
4005dc: 48 8d 45 f4 lea rax,[rbp-0xc]
4005e0: 48 89 c6 mov rsi,rax
4005e3: bf 00 07 40 00 mov edi,0x400700
4005e8: b8 00 00 00 00 mov eax,0x0
4005ed: e8 7e fe ff ff call 400470 <printf@plt>
4005f2: be 70 04 40 00 mov esi,0x400470
4005f7: bf 28 07 40 00 mov edi,0x400728
4005fc: b8 00 00 00 00 mov eax,0x0
400601: e8 6a fe ff ff call 400470 <printf@plt>
400606: b8 00 00 00 00 mov eax,0x0
40060b: 48 8b 55 f8 mov rdx,QWORD PTR [rbp-0x8]
40060f: 64 48 33 14 25 28 00 xor rdx,QWORD PTR fs:0x28
400616: 00 00
400618: 74 05 je 40061f <main+0x89>
40061a: e8 41 fe ff ff call 400460 <__stack_chk_fail@plt>
40061f: c9 leave
400620: c3 ret
Don’t be too daunted by the assembly here; we’re not here to talk about that. Just grasp that when this program is running, main will be residing at these addresses in the memory space of the program.
To the memory, there is no difference between program code and bytes that may be strings, integers or pictures of cats. To prove this, I’ll tell the compiler that I want to use main as a string, and I’ll look at the first character of it.
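#include <stdio.h>
int main(void)
{
    /* Treat the machine code of main as an array of bytes. unsigned char
     * keeps bytes of 0x80 and above from being sign extended when printed. */
    unsigned char *string = (unsigned char *)main;
    printf("character: %c\n", string[0]);
    printf("hex: %x\n", string[0]);
    return 0;
}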
When I build and run this program, this is what I get:
$ ./test
character: U
hex: 55
Just like in my previous program, main starts with the instruction push rbp, which is encoded as the single byte 0x55. Read as an ASCII character, that byte is a U.
Why You Confuse Me So?!
The reason I’m pushing this so hard is to really drive home the fact that everything lives in memory, and that the only difference between data types is how you treat the data in memory. You can interpret any given chunk of memory as any type; in the case of pointers, we treat a chunk of memory as an address that points to another chunk of memory.
Let’s look at some less esoteric examples of how this can be useful.
The Great Unknown
For someone writing a C application, it would be fantastic to know how much memory the program will use before running it. Sadly this is almost never the case, so we need ways to get more memory when we need it. The call stack suffices for the variables we use inside of a function, but when the function returns those variables cease to exist. To dynamically allocate space for keeps we use the heap, most commonly via malloc() (memory allocate).
The malloc() function is provided by the C standard library. It hands your program a chunk of memory to work with, asking the operating system for more when its own pool runs out. This is useful for making space as needed to store data in your program.
Here’s an example: this program takes a number of bytes to allocate. It requests a chunk of memory of that size, prints out its address, and then releases that chunk of memory back to the system (to avoid memory leaks!).
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
    void *dynamic_space;
    int size;
    if (argc != 2)
    {
        fprintf(stderr, "Usage: %s <number of bytes to allocate>\n", argv[0]);
        return 1;
    }
    size = atoi(argv[1]);
    if (size < 1)
    {
        fprintf(stderr, "Number of bytes must be larger than 0\n");
        return 1;
    }
    dynamic_space = malloc(size);
    /* malloc returns NULL when it cannot satisfy the request. */
    if (dynamic_space == NULL)
    {
        fprintf(stderr, "Could not allocate %d bytes\n", size);
        return 1;
    }
    printf("We allocated %d bytes, they are located at %p\n", size, dynamic_space);
    free(dynamic_space);
    return 0;
}
I’ll give this program a few runs asking for different amounts of memory
As you can see, the memory gets allocated in a variety of places even if you ask for the same amount of memory on another run of the program. On my machine, when I asked for a huge allocation it provided me a much higher memory address to use. The likely reason is that glibc’s malloc serves large requests (above 128 KiB by default) with a separate mmap mapping instead of carving them out of the main heap, and those mappings live in a different part of the address space. If the system thinks that’s where it should be, who am I to argue.
It’s easy enough to ask for space dynamically; the new problem this introduces is how to organize it all.
Pointer Usage
There is no one perfect way to organize your pointers, but there are a ton of tried and true methods that are useful for a variety of cases.
Null terminated arrays
One of the most common structures is a null terminated array; in the case of the char type we normally just call this a string. One of the handy things about these types of structures is that they lend themselves to pointer arithmetic. A null terminated array is a contiguous block of memory that contains no information about its size, but you can find the end of it by looking for the zero value that terminates it ('\0' for a string, NULL for an array of pointers).
In C the ++ and -- operators increment and decrement values, but in the case of pointers they step by the size of the data type the pointer points to. For character pointers that is identical to incrementing a number, since a char is a single byte, but for a pointer to a 4 byte integer the address changes as if by += 4. Using this functionality you can pretty easily loop through a null terminated list until the value the pointer points to is zero.
Here’s an example of that:
#include <stdio.h>
char my_string[] = "well hello there";
int many_numbers[] = {1, 2, 3, 0};
int main(void) {
    char *a_character;
    int *an_integer;
    // Start where my_string points, printing each character until a_character
    // points to a zero value
    for (a_character = my_string; *a_character; a_character++) {
        printf("at %p we have '%c'\n", (void *)a_character, *a_character);
    }
    printf("\n");
    // Same loop over integers: each ++ moves the address by sizeof(int)
    for (an_integer = many_numbers; *an_integer; an_integer++) {
        printf("at %p we have %d\n", (void *)an_integer, *an_integer);
    }
    return 0;
}
And here’s that program in action
As you can see, the address is incremented by 1 for character pointers and by 4 for integer pointers (the size of an int on my machine). We can keep reading as long as the dereferenced value (the value at the address the pointer points to) is not zero. When it is zero we’re at the end of the array.
To drive this home further, here’s a program that’ll determine the length of a string given as the first argument.
#include <stdio.h>
int main(int argc, char *argv[]) {
    int size = 0;
    char *ptr;
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <string>\n", argv[0]);
        return 1;
    }
    // Walk the string one character at a time until we hit the terminator
    ptr = argv[1];
    while (*ptr) {
        size++;
        ptr++;
    }
    printf("first argument is %d bytes long\n", size);
    return 0;
}
And let’s see how that fares
Passing By Reference
In many languages, we hear about how things are passed by reference, as opposed to things that are passed by value. In many interpreted languages you have little to no control over how that works, but C lets (makes?) you control that completely.
Let’s say I want to make a function that randomizes two numbers for me. In C a function can only return a single value, so setting two separate variables with one function call isn’t possible without pointers. I could return a struct holding two random integers and then copy those values into my variables, but that means my function doesn’t do everything I want it to on its own.
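The easiest way to do this is to pass the addresses of (pointers to) the variables I want randomized; my function can then dereference them with * and use them as if they were ordinary variables. Strictly speaking C still passes the pointers themselves by value, but a copy of an address points at the same variable, which is all we need.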
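#include <stdio.h>
#include <stdlib.h>
// random() is a POSIX function that returns a long; without a call to
// srandom() it produces the same sequence on every run
void randomize_them(int *a, int *b) {
    *a = (int)random();
    *b = (int)random();
}
int main(void) {
    int my_a = 1;
    int my_b = 2;
    printf("before\na = %d\nb = %d\n", my_a, my_b);
    // Pass the addresses so the function can write to our variables
    randomize_them(&my_a, &my_b);
    printf("after\na = %d\nb = %d\n", my_a, my_b);
    return 0;
}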
Now that wasn’t too bad, was it?
That’ll wrap things up for this post, which covers most of the basics about pointers. I will be covering a bit more pointer-fu in the next post about defining and using structures in C, but I think this provides a good start for now.