A Minimal Introduction to gawk Internals

The truth is that gawk was not designed for simple extensibility. The facilities for adding functions using shared libraries work, but are something of a "bag on the side." Thus, this tour is brief and simplistic; would-be gawk hackers are encouraged to spend some time reading the source code before trying to write extensions based on the material presented here. Of particular note are the files awk.h, builtin.c, and eval.c. Reading awk.y in order to see how the parse tree is built would also be of use.

With the disclaimers out of the way, the following types, structure members, functions, and macros are declared in awk.h and are of use when writing extensions (the next section shows how they are used):

AWKNUM
An AWKNUM is the internal type of awk floating-point numbers. Typically, it is a C double.
NODE
Just about everything is done using objects of type NODE. These contain both strings and numbers, as well as variables and arrays.
AWKNUM force_number(NODE *n)
This macro forces a value to be numeric. It returns the actual numeric value contained in the node. It may end up calling an internal gawk function.
void force_string(NODE *n)
This macro guarantees that a NODE's string value is current. It may end up calling an internal gawk function. It also guarantees that the string is zero-terminated.
n->param_cnt
The number of parameters actually passed in a function call at runtime.
n->stptr
n->stlen
The data and length of a NODE's string value, respectively. The string is not guaranteed to be zero-terminated. If you need to pass the string value to a C library function, save the value in n->stptr[n->stlen], assign '\0' to it, call the routine, and then restore the value.
n->type
The type of the NODE. This is a C enum. Values should be either Node_var or Node_var_array for function parameters.
n->vname
The "variable name" of a node. This is not of much use inside externally written extensions.
void assoc_clear(NODE *n)
Clears the associative array pointed to by n. Make sure that n->type == Node_var_array first.
NODE **assoc_lookup(NODE *symbol, NODE *subs, int reference)
Finds, and installs if necessary, array elements. symbol is the array and subs is the subscript, which is usually a value created with tmp_string (see below). reference should be TRUE if it is an error to use the value before it is created. Typically, FALSE is the correct value to use from extension functions.
NODE *make_string(char *s, size_t len)
Take a C string and turn it into a pointer to a NODE that can be stored appropriately. This is permanent storage; understanding of gawk memory management is helpful.
NODE *make_number(AWKNUM val)
Take an AWKNUM and turn it into a pointer to a NODE that can be stored appropriately. This is permanent storage; understanding of gawk memory management is helpful.
NODE *tmp_string(char *s, size_t len)
Take a C string and turn it into a pointer to a NODE that can be stored appropriately. This is temporary storage; understanding of gawk memory management is helpful.
NODE *tmp_number(AWKNUM val)
Take an AWKNUM and turn it into a pointer to a NODE that can be stored appropriately. This is temporary storage; understanding of gawk memory management is helpful.
NODE *dupnode(NODE *n)
Duplicate a node. In most cases, this increments an internal reference count instead of actually duplicating the entire NODE; understanding of gawk memory management is helpful.
void free_temp(NODE *n)
This macro releases the memory associated with a NODE allocated with tmp_string or tmp_number. Understanding of gawk memory management is helpful.
void make_builtin(char *name, NODE *(*func)(NODE *), int count)
Register the C function pointed to by func as the new built-in function name. name is a regular C string. count is the maximum number of arguments that the function takes. The function should be written in the following manner:
/* do_xxx --- do xxx function for gawk */

NODE *
do_xxx(NODE *tree)
{
    ...
}

NODE *get_argument(NODE *tree, int i)
This function is called from within a C extension function to get the i-th argument from the function call. The first argument is argument zero.
void set_value(NODE *tree)
This function is called from within a C extension function to set the return value from the extension function. This value is what the awk program sees as the return value from the new awk function.
void update_ERRNO(void)
This function is called from within a C extension function to set the value of gawk's ERRNO variable, based on the current value of the C errno variable. It is provided as a convenience.
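
The save-and-restore advice given for n->stptr and n->stlen in the list above can be sketched as follows. This is only an illustration: struct str_node is a stand-in for the two relevant NODE members so that the fragment compiles without awk.h, node_to_long is a hypothetical helper name, and the code assumes that the byte at stptr[stlen] is valid, writable storage (which the advice above implies).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the two string members of a NODE described above;
 * the real NODE declared in awk.h has many more members. */
struct str_node {
    char *stptr;    /* string data, not necessarily zero-terminated */
    size_t stlen;   /* length of the string data */
};

/* node_to_long --- hypothetical helper: hand the string value to strtol() */

long
node_to_long(struct str_node *n)
{
    char save;
    long result;

    assert(n != NULL && n->stptr != NULL);

    save = n->stptr[n->stlen];      /* save the byte just past the data */
    n->stptr[n->stlen] = '\0';      /* terminate it for the C library */
    result = strtol(n->stptr, NULL, 10);
    n->stptr[n->stlen] = save;      /* restore the original byte */

    return result;
}
```

The same three steps (save, terminate, restore) apply to any C library routine that expects a zero-terminated string; only the call in the middle changes.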

An argument that is supposed to be an array needs to be handled with some extra code, in case the array being passed in is actually from a function parameter. The following boilerplate code shows how to do this:

NODE *the_arg;

the_arg = get_argument(tree, 2); /* assume need 3rd arg, 0-based */

/* if a parameter, get it off the stack */
if (the_arg->type == Node_param_list)
    the_arg = stack_ptr[the_arg->param_cnt];

/* parameter referenced an array, get it */
if (the_arg->type == Node_array_ref)
    the_arg = the_arg->orig_array;

/* check type */
if (the_arg->type != Node_var && the_arg->type != Node_var_array)
    fatal("newfunc: third argument is not an array");

/* force it to be an array, if necessary, clear it */
the_arg->type = Node_var_array;
assoc_clear(the_arg);

Again, you should spend time studying the gawk internals; don't just blindly copy this code.
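
As a rough illustration of how several of the routines listed earlier fit together inside a do_xxx-style function, here is a sketch that compiles on its own. Everything in the first half is a stand-in written for this example, not gawk's code: the NODE layout, the numbr and args member names, and the ret_value variable are all hypothetical, and the stubs only imitate the behavior described in the list. The function name do_double is also made up, and returning a dummy temporary node at the end is an assumption about the calling convention. In a real extension, the declarations come from awk.h instead.

```c
#include <assert.h>
#include <stdlib.h>

/* ---- Stand-ins so the sketch compiles without awk.h (NOT gawk's code) ---- */
typedef double AWKNUM;

typedef struct node {
    AWKNUM numbr;           /* hypothetical member: numeric value */
    struct node *args[2];   /* hypothetical member: arguments of a call */
} NODE;

static NODE *ret_value;     /* hypothetical: where set_value() puts the result */

static NODE *
tmp_number(AWKNUM val)
{
    NODE *n = calloc(1, sizeof *n);

    assert(n != NULL);
    n->numbr = val;
    return n;
}

static AWKNUM force_number(NODE *n) { return n->numbr; }
static void free_temp(NODE *n) { free(n); }
static NODE *get_argument(NODE *tree, int i) { return tree->args[i]; }
static void set_value(NODE *n) { ret_value = n; }

/* ---- The part that follows the pattern described in the list ---- */

/* do_double --- hypothetical extension function: return twice its argument */

NODE *
do_double(NODE *tree)
{
    NODE *arg;
    AWKNUM val;

    arg = get_argument(tree, 0);        /* the first argument is argument zero */
    val = force_number(arg);            /* get its numeric value */
    free_temp(arg);                     /* release it; the stub assumes it was temporary */

    set_value(tmp_number(val * 2));     /* what the awk program sees as the result */

    return tmp_number((AWKNUM) 0);      /* assumption: a dummy node is returned */
}
```

The order matters in this sketch: the numeric value is copied out before the argument is released, and the value the awk program sees is passed to set_value rather than returned from the C function.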