]> Using and porting gnu <wordasword>lightning</wordasword> gnu lightning This document describes installing, using and porting the gnu lightning library for dynamic code generation. Unlike other dynamic code generation systems, which are usually either inefficient or non-portable, gnu lightning is both retargetable and very fast. Introduction to gnu lightning Dynamic code generation is the generation of machine code at runtime. It is typically used to strip a layer of interpretation by allowing compilation to occur at runtime. One of the most well-known applications of dynamic code generation is perhaps that of interpreters that compile source code to an intermediate bytecode form, which is then recompiled to machine code at run-time: this approach effectively combines the portability of bytecode representations with the speed of machine code. Another common application of dynamic code generation is in the field of hardware simulators and binary emulators, which can use the same techniques to translate simulated instructions to the instructions of the underlying machine. Yet other applications come to mind: for example, windowing bitblt operations, matrix manipulations, and network packet filters. Albeit very powerful and relatively well known within the compiler community, dynamic code generation techniques are rarely exploited to their full potential and, with the exception of the two applications described above, have remained curiosities because of their portability and functionality barriers: binary instructions are generated, so programs using dynamic code generation must be retargeted for each machine; in addition, coding a run-time code generator is a tedious and error-prone task more than a difficult one. This manual describes the gnu lightning dynamic code generation library. gnu lightning provides a portable, fast and easily retargetable dynamic code generation system. To be fast, gnu lightning emits machine code without first creating intermediate data structures such as RTL representations traditionally used by optimizing compilers (see See section ``RTL representation'' in Using and porting GNU CC). gnu lightning translates code directly from a machine independent interface to that of the underlying architecture. This makes code generation more efficient, since no intermediate data structures have to be constructed and consumed. A collateral benefit it that gnu lightning consumes little space: other than the memory needed to store generated instructions and data structures such as parse trees, the only data structure that client will usually need is an array of pointers to labels and unresolved jumps, which you can often allocate directly on the system stack. To be portable, gnu lightning abstracts over current architectures' quirks and unorthogonalities. The interface that it exposes to is that of a standardized RISC architecture loosely based on the SPARC and MIPS chips. There are a few general-purpose registers (six, not including those used to receive and pass parameters between subroutines), and arithmetic operations involve three operands—either three registers or two registers and an arbitrarily sized immediate value. On one hand, this architecture is general enough that it is possible to generate pretty efficient code even on CISC architectures such as the Intel x86 or the Motorola 68k families. On the other hand, it matches real architectures closely enough that, most of the time, the compiler's constant folding pass ends up generating code which assembles machine instructions without further tests. Drawbacks gnu lightning has been useful in practice; however, it does have at least four drawbacks: it has limited registers, no peephole optimizer, no instruction scheduler and no symbolic debugger. Of these, the last is the most critical even though it does not affect the quality of generated code: the only way to debug code generated at run-time is to step through it at the level of host specific machine code. A decent knowledge of the underlying instruction set is thus needed to make sense of the debugger's output. The low number of available registers (six) is also an important limitation. However, let's take the primary application of dynamic code generation, that is, bytecode translators. The underlying virtual machines tend to have very few general purpose registers (usually 0 to 2) and the translators seldom rely on sophisticated graph-coloring algorithms to allocate registers to temporary variables. Rather, these translators usually obtain performance increases because: a) they remove indirect jumps, which are usually poorly predicted, and thus often form a bottleneck, b) they parameterize the generated code and go through the process of decoding the bytecodes just once. So, their usage of registers is rather sparse—in fact, in practice, six registers were found to be enough for most purposes. The lack of a peephole optimizer is most important on machines where a single instruction can map to multiple native instructions. For instance, Intel chips' division instruction hard-codes the dividend to be in EAX and the quotient and remainder to be output, respectively, in EAX and EDX: on such chips, gnu lightning does lots of pushing and popping of EAX and EDX to save those registers that are not used. Unnecessary stack operations could be removed by looking at whether preserved registers are destroyed soon. Unfortunately, the current implementation of gnu lightning is so fast because it only knows about the single instruction that is being generated; performing these optimizations would require a flow analysis pass that would probably hinder gnu lightning's speed. The lack of an instruction scheduler is not very important—pretty good instruction scheduling can actually be obtained by separating register writes from register reads. The only architectures on which a scheduler would be useful are those on which arithmetic instructions have two operands; an example is, again, the x86, on which the single instruction subr_i R0, R1, R2 !Compute R0 = R1 - R2 is translated to two instruction, of which the second depends on the result of the first: movl %ebx, %eax ! Move R1 into R0 subl %edx, %eax ! Subtract R2 from R0 Using gnu lightning This chapter describes installing and using gnu lightning. Configuring and installing gnu lightning The first thing to do to use gnu lightning is to configure the program, picking the set of macros to be used on the host architecture; this configuration is automatically performed by the configure shell script; to run it, merely type: ./configure gnu lightning supports cross-compiling in that you can choose a different set of macros from the one needed on the computer that you are compiling gnu lightning on. For example, ./configure --host=sparc-sun-linux will select the SPARC set of runtime assemblers. You can use configure's ability to make reasonable assumptions about the vendor and operating system and simply type ./configure --host=i386 ./configure --host=ppc ./configure --host=sparc Another option that configure accepts is --enable-assertions, which enables several consistency checks in the run-time assemblers. These are not usually needed, so you can decide to simply forget about it; also remember that these consistency checks tend to slow down your code generator. After you've configured gnu lightning, you don't have to compile it because it is nothing more than a set of include files. If you want to compile the examples, run make as usual. The next important step is: make install This ends the process of installing gnu lightning. gnu lightning's instruction set gnu lightning's instruction set was designed by deriving instructions that closely match those of most existing RISC architectures, or that can be easily syntesized if absent. Each instruction is composed of: an operation, like sub or mul sometimes, an register/immediate flag (r or i) a type identifier or, occasionally, two The second and third field are separated by an underscore; thus, examples of legal mnemonics are addr_i (integer add, with three register operands) and muli_l (long integer multiply, with two register operands and an immediate operand). Each instruction takes two or three operands; in most cases, one of them can be an immediate value instead of a register. gnu lightning supports a full range of integer types: operands can be 1, 2 or 4 bytes long (64-bit architectures might support 8 bytes long operands), either signed or unsigned. The types are listed in the following table together with the C types they represent: c signed char uc unsigned char s short us unsigned short i int ui unsigned int l long ul unsigned long f float d double p void * Some of these types may not be distinct: for example, (e.g., l is equivalent to i on 32-bit machines, and p is substantially equivalent to ul). There are at least seven integer registers, of which six are general-purpose, while the last is used to contain the stack pointer (SP). The stack pointer can be used to allocate and access local variables on the stack (which is supposed to grow downwards in memory on all architectures). Of the general-purpose registers, at least three are guaranteed to be preserved across function calls (V0, V1 and V2) and at least three are not (R0, R1 and R2). Six registers are not very much, but this restriction was forced by the need to target CISC architectures which, like the x86, are poor of registers; anyway, backends can specify the actual number of available caller- and callee-save registers. In addition, there is a special RET register which contains the return value. You should always remember, however, that writing this register could overwrite either a general-purpose register or an incoming parameter, depending on the architecture. There are at least six floating-point registers, named FPR0 to FPR5. These are separate from the integer registers on all the supported architectures; on Intel architectures, the register stack is mapped to a flat register file. The complete instruction set follows; as you can see, most non-memory operations only take integers, long integers (either signed or unsigned) and pointers as operands; this was done in order to reduce the instruction set, and because most architectures only provide word and long word operations on registers. There are instructions that allow operands to be extended to fit a larger data type, both in a signed and in an unsigned way. Binary ALU operations These accept three operands; the last one can be an immediatevalue for integer operands, or a register for all operand types.addx operations must directly follow addc, andsubx must follow subc; otherwise, results are undefined. addr i ui l ul p f d O1 = O2 + O3 addi i ui l ul p O1 = O2 + O3 addxr i ui l ul O1 = O2 + (O3 + carry) addxi i ui l ul O1 = O2 + (O3 + carry) addcr i ui l ul O1 = O2 + O3, set carry addci i ui l ul O1 = O2 + O3, set carry subr i ui l ul p f d O1 = O2 - O3 subi i ui l ul p O1 = O2 - O3 subxr i ui l ul O1 = O2 - (O3 + carry) subxi i ui l ul O1 = O2 - (O3 + carry) subcr i ui l ul O1 = O2 - O3, set carry subci i ui l ul O1 = O2 - O3, set carry rsbr i ui l ul p f d O1 = O3 - O2 rsbi i ui l ul p O1 = O3 - O2 mulr i ui l ul f d O1 = O2 * O3 muli i ui l ul O1 = O2 * O3 hmulr i ui l ul O1 = high bits of O2 * O3 hmuli i ui l ul O1 = high bits of O2 * O3 divr i ui l ul f d O1 = O2 / O3 divi i ui l ul O1 = O2 / O3 modr i ui l ul O1 = O2 % O3 modi i ui l ul O1 = O2 % O3 andr i ui l ul O1 = O2 & O3 andi i ui l ul O1 = O2 & O3 orr i ui l ul O1 = O2 | O3 ori i ui l ul O1 = O2 | O3 xorr i ui l ul O1 = O2 ^ O3 xori i ui l ul O1 = O2 ^ O3 lshr i ui l ul O1 = O2 << O3 lshi i ui l ul O1 = O2 << O3 rshr i ui l ul O1 = O2 >> O3 The sign bit is propagated for signed types. rshi i ui l ul O1 = O2 >> O3 The sign bit is propagated for signed types. Unary ALU operations These accept two operands, both of which must be registers. negr i l f d O1 = -O2 notr i ui l ul O1 = ~O2 Compare instructions These accept three operands; again, the last can be an immediatevalue for integer data types. The last two operands are compared,and the first operand is set to either 0 or 1, according towhether the given condition was met or not.The conditions given below are for the standard behavior of C,where the “unordered” comparison result is mapped to false. ltr i ui l ul p f d O1 = (O2 < O3) lti i ui l ul p O1 = (O2 < O3) ler i ui l ul p f d O1 = (O2 <= O3) lei i ui l ul p O1 = (O2 <= O3) gtr i ui l ul p f d O1 = (O2 > O3) gti i ui l ul p O1 = (O2 > O3) ger i ui l ul p f d O1 = (O2 >= O3) gei i ui l ul p O1 = (O2 >= O3) eqr i ui l ul p f d O1 = (O2 == O3) eqi i ui l ul p O1 = (O2 == O3) ner i ui l ul p f d O1 = (O2 != O3) nei i ui l ul p O1 = (O2 != O3) unltr f d O1 = !(O2 >= O3) unler f d O1 = !(O2 > O3) ungtr f d O1 = !(O2 <= O3) unger f d O1 = !(O2 < O3) uneqr f d O1 = !(O2 < O3) && !(O2 > O3) ltgtr f d O1 = !(O2 >= O3) || !(O2 <= O3) ordr f d O1 = (O2 == O2) && (O3 == O3) unordr f d O1 = (O2 != O2) || (O3 != O3) Transfer operations These accept two operands; for ext both of them must beregisters, while mov accepts an immediate value as the secondoperand.Unlike movr and movi, the other instructions are appliedbetween operands of different data types, and they need twodata type specifications. You can use extr to convert betweeninteger data types, in which case the first must be smaller in sizethan the second; for example extr_c_ui is correct whileextr_ul_us is not. You can also use extr to convertan integer to a floating point value: the only available possibilitiesare extr_i_f and extr_i_d. The other instructionsconvert a floating point value to an integer, so the possiblesuffixes are _f_i and _d_i. movr i ui l ul p f d O1 = O2 movi i ui l ul p f d O1 = O2 extr c uc s us i ui l ul f d O1 = O2 roundr i f d O1 = round(O2) truncr i f d O1 = trunc(O2) floorr i f d O1 = floor(O2) ceilr i f d O1 = ceil(O2) Note that the order of the arguments is destination first,source second as for all other gnu lightning instructions, butthe order of the types is always reversed with respect to thatof the arguments: shorter—source—first,longer—destination—second. This happens for historicalreasons. Network extensions These accept two operands, both of which must be registers; thesetwo instructions actually perform the same task, yet they areassigned to two mnemonics for the sake of convenience andcompleteness. As usual, the first operand is the destination andthe second is the source. hton us ui Host-to-network (big endian) order ntoh us ui Network-to-host order Load operations ld accepts two operands while ldx accepts three;in both cases, the last can be either a register or an immediatevalue. Values are extended (with or without sign, according tothe data type specification) to fit a whole register. ldr c uc s us i ui l ul p f d O1 = *O2 ldi c uc s us i ui l ul p f d O1 = *O2 ldxr c uc s us i ui l ul p f d O1 = *(O2+O3) ldxi c uc s us i ui l ul p f d O1 = *(O2+O3) Store operations st accepts two operands while stx accepts three; inboth cases, the first can be either a register or an immediatevalue. Values are sign-extended to fit a whole register. str c uc s us i ui l ul p f d *O1 = O2 sti c uc s us i ui l ul p f d *O1 = O2 stxr c uc s us i ui l ul p f d *(O1+O2) = O3 stxi c uc s us i ui l ul p f d *(O1+O2) = O3 Stack management These accept a single register parameter. These operations are notguaranteed to be efficient on all architectures. pushr i ui l ul p push O1 on the stack popr i ui l ul p pop O1 off the stack Argument management These are: prepare i f d pusharg c uc s us i ui l ul p f d getarg c uc s us i ui l ul p f d arg c uc s us i ui l ul p f d Of these, the first two are used by the caller, while the last twoare used by the callee. A code snippet that wants to call anotherprocedure and has to pass registers must, in order: use theprepare instruction, giving the number of arguments tobe passed to the procedure (once for each data type); usepusharg to push the arguments in reverse order;and use calli or finish (explained below) toperform the actual call.arg and getarg are used by the callee.arg is different from other instruction in that it does notactually generate any code: instead, it is a function which returnsa value to be passed to getarg. “Return avalue” means that gnu lightning macros that compile theseinstructions return a value when expanded. You should callarg as soon as possible, before any function call or, moreeasily, right after the prolog or leaf instructions(which are treated later).getarg accepts a register argument and a value returned byarg, and will move that argument to the register, extendingit (with or without sign, according to the data type specification)to fit a whole register. These instructions are more intimatelyrelated to the usage of the gnu lightning instruction set in codethat generates other code, so they will be treated morespecifically in Generating code at run-time.You should observe a few rules when using these macros. First ofall, it is not allowed to call functions with more than six arguments;this was done to simplify and speed up the implementation onarchitectures that use registers for parameter passing.You should not nest calls to prepare, nor call zero-argumentfunctions (which do not need a call to prepare) inside aprepare/calli or prepare/finish block. Doing thismight corrupt already pushed arguments.You cannot pass parameters between subroutines usingthe six general-purpose registers. This might work only whentargeting particular architectures.On the other hand, it is possible to assume that callee-saved registers(R0 through R2) are not clobbered by another dynamicallygenerated function which does not use them as operands in its code andwhich does not return a value. Branch instructions Like arg, these also return a value which, in this case,is to be used to compile forward branches as explained inFibonacci numbers. They accept a pointer to thedestination of the branch and two operands to be compared; of these,the last can be either a register or an immediate. They are: bltr i ui l ul p f d if (O2 < O3) goto O1 blti i ui l ul p if (O2 < O3) goto O1 bler i ui l ul p f d if (O2 <= O3) goto O1 blei i ui l ul p if (O2 <= O3) goto O1 bgtr i ui l ul p f d if (O2 > O3) goto O1 bgti i ui l ul p if (O2 > O3) goto O1 bger i ui l ul p f d if (O2 >= O3) goto O1 bgei i ui l ul p if (O2 >= O3) goto O1 beqr i ui l ul p f d if (O2 == O3) goto O1 beqi i ui l ul p if (O2 == O3) goto O1 bner i ui l ul p f d if (O2 != O3) goto O1 bnei i ui l ul p if (O2 != O3) goto O1 bunltr f d if !(O2 >= O3) goto O1 bunler f d if !(O2 > O3) goto O1 bungtr f d if !(O2 <= O3) goto O1 bunger f d if !(O2 < O3) goto O1 buneqr f d if !(O2 < O3) && !(O2 > O3) goto O1 bltgtr f d if !(O2 >= O3) || !(O2 <= O3) goto O1 bordr f d if (O2 == O2) && (O3 == O3) goto O1 bunordr f d if !(O2 != O2) || (O3 != O3) goto O1 bmsr i ui l ul if O2 & O3 goto O1 bmsi i ui l ul if O2 & O3 goto O1 bmcr i ui l ul if !(O2 & O3) goto O1 bmci i ui l ul if !(O2 & O3) goto O1 These mnemonics mean, respectively, branch if mask set and branch if mask cleared. boaddr i ui l ul O2 += O3 , goto O1 on overflow boaddi i ui l ul O2 += O3 , goto O1 on overflow bosubr i ui l ul O2 -= O3 , goto O1 on overflow bosubi i ui l ul O2 -= O3 , goto O1 on overflow Jump and return operations These accept one argument except ret which has none; thedifference between finish and calli is that thelatter does not clean the stack from pushed parameters (if any)and the former must always follow a prepareinstruction. Results are undefined when using function callsin a leaf function. calli (not specified) function call to O1 callr (not specified) function call to a register finish (not specified) function call to O1 finishr (not specified) function call to a register jmpi/jmpr (not specified) unconditional jump to O1 prolog (not specified) function prolog for O1 args leaf (not specified) the same for leaf functions ret (not specified) return from subroutine retval c uc s us i ui l ul p f d move return value to register Like branch instruction, jmpi also returns a value which is tobe used to compile forward branches. See Fibonacci numbers. As a small appetizer, here is a small function that adds 1 to the input parameter (an int). I'm using an assembly-like syntax here which is a bit different from the one used when writing real subroutines with gnu lightning; the real syntax will be introduced in See Generating code at run-time. incr: leaf 1 in = arg_i ! We have an integer argument getarg_i R0, in ! Move it to R0 addi_i RET, R0, 1 ! Add 1, put result in return value ret ! And return the result And here is another function which uses the printf function from the standard C library to write a number in hexadecimal notation: printhex: prolog 1 in = arg_i ! Same as above getarg_i R0, in prepare 2 ! Begin call sequence for printf pusharg_i R0 ! Push second argument pusharg_p "%x" ! Push format string finish printf ! Call printf ret ! Return to caller Generating code at run-time To use gnu lightning, you should include the lightning.h file that is put in your include directory by the ‘make install’ command. That include files defines about four hundred public macros (plus others that are private to gnu lightning), one for each opcode listed above. Each of the instructions above translates to a macro. All you have to do is prepend jit_ (lowercase) to opcode names and JIT_ (uppercase) to register names. Of course, parameters are to be put between parentheses, just like with every other cpp macro. This small tutorial presents three examples: A function which increments a number by one Let's see how to create and use the sample incr function created in gnu lightning's instruction set: #include <stdio.h> #include "lightning.h" static jit_insn codeBuffer[1024]; typedef int (*pifi)(int); /* Pointer to Int Function of Int */ int main() { pifi incr = (pifi) (jit_set_ip(codeBuffer).iptr); int in; jit_leaf(1); /* leaf 1 */ in = jit_arg_i(); /* in = arg_i */ jit_getarg_i(JIT_R0, in); /* getarg_i R0 */ jit_addi_i(JIT_RET, JIT_R0, 1); /* addi_i RET, R0, 1 */ jit_ret(); /* ret */ jit_flush_code(codeBuffer, jit_get_ip().ptr); /* call the generated code, passing 5 as an argument */ printf("%d + 1 = %d\n", 5, incr(5)); return 0; } Let's examine the code line by line (well, almost…): #include "lightning.h" You already know about this. It defines all of gnu lightning's macros. static jit_insn codeBuffer[1024]; You might wonder about what is jit_insn. It is just a type thatis defined by gnu lightning. Its exact definition depends on thearchitecture; in general, defining an array of 1024 jit_insnsallows one to write 100 to 400 gnu lightning instructions (depending onthe architecture and exact instructions). typedef int (*pifi)(int); Just a handy typedef for a pointer to a function that takes anint and returns another. pifi incr = (pifi) (jit_set_ip(codeBuffer).iptr); This is the first gnu lightning macro we encounter that does not map toan instruction. It is jit_set_ip, which takes a pointer to anarea of memory where compiled code will be put and returns the samevalue, cast to a union type whose members are pointers tofunctions returning different C types. This union is calledjit_code and is defined as follows: typedef union jit_code { char *ptr; void (*vptr)(); char (*cptr)(); unsigned char (*ucptr)(); short (*sptr)(); unsigned short (*usptr)(); int (*iptr)(); unsigned int (*uiptr)(); long (*lptr)(); unsigned long (*ulptr)(); void * (*pptr)(); float (*fptr)(); double (*dptr)(); } jit_code; Any of the members could have been used, since the result is soon castedto type pifi but, for the sake of clarity, the program usesiptr, a pointer to a function with no prototype and returning anint.Analogous to jit_set_ip is jit_get_ip, which does notmodify the instruction pointer—it is nothing more than a cast of thecurrent ip to jit_code. int in; A footnote in gnu lightning's instructionset, under the description of arg, says that macros implementingarg return a value—we'll be using this variable to store theresult of arg. jit_leaf(1); Ok, so we start generating code for our beloved function… it willaccept one argument and won't call any other function. in = jit_arg_i();jit_getarg_i(JIT_R0, in); We retrieve the first (and only) argument, an integer, and store itinto the general-purpose register R0. jit_addi_i(JIT_RET, JIT_R0, 1); We add one to the content of the register and store the result in thereturn value. jit_ret(); This instruction generates a standard function epilog that returnsthe contents of the RET register. jit_flush_code(codeBuffer, jit_get_ip().ptr); This instruction is very important. It flushes the generated codearea out of the processor's instruction cache, avoiding the processorexecutes bogus data that it happens to find there. Thejit_flush_code function accepts the first and the last addressto flush; we use jit_get_ip to find out the latter. printf("%d + 1 = %d", 5, incr(5)); Calling our function is this simple—it is not distinguishable froma normal C function call, the only difference being that incris a variable. gnu lightning abstracts two phases of dynamic code generation: selecting instructions that map the standard representation, and emitting binary code for these instructions. The client program has the responsibility of describing the code to be generated using the standard gnu lightning instruction set. Let's examine the code generated for incr on the SPARC and x86 architectures (on the right is the code that an assembly-language programmer would write): SPARC save %sp, -96, %sp mov %i0, %l0 retl add %l0, 1, %i0 add %o0, 1, %o0 ret restore In this case, gnu lightning introduces overhead to create a registerwindow (not knowing that the procedure is a leaf procedure) and tomove the argument to the general purpose register R0 (whichmaps to %l0 on the SPARC). The former overhead could beavoided by teaching gnu lightning about leaf procedures (see );the latter could instead be avoided by rewriting the getarg instructionas jit_getarg_i(JIT_RET, in), which was not done in thisexample. x86 pushl %ebp movl %esp, %ebp pushl %ebx pushl %esi pushl %edi movl 8(%ebp), %eax movl 4(%esp), %eax addl $1, %eax incl %eax popl %edi popl %esi popl %ebx popl %ebp ret ret In this case, the main overhead is due to the function's prolog andepilog, which is nine instructions long on the x86; a hand-writtenroutine would not save unused callee-preserved registers on the stack.It is to be said, however, that this is not a problem in morecomplicated uses, because more complex procedure would probably usethe V0 through V2 registers (%ebx, %esi,%edi); in this case, a hand-written routine would have includedthe prolog too. Also, a ten byte prolog would probably be a smalloverhead in a more complex function. In such a simple case, the macros that make up the back-end compile reasonably efficient code, with the notable exception of prolog/epilog code. A simple function call to printf Again, here is the code for the example: #include <stdio.h> #include "lightning.h" static jit_insn codeBuffer[1024]; typedef void (*pvfi)(int); /* Pointer to Void Function of Int */ int main() { pvfi myFunction; /* ptr to generated code */ char *start, *end; /* a couple of labels */ int in; /* to get the argument */ myFunction = (pvfi) (jit_set_ip(codeBuffer).vptr); start = jit_get_ip().ptr; jit_prolog(1); in = jit_arg_i(); jit_movi_p(JIT_R0, "generated %d bytes\n"); jit_getarg_i(JIT_R1, in); jit_prepare(2); jit_pusharg_i(JIT_R1); /* push in reverse order */ jit_pusharg_p(JIT_R0); jit_finish(printf); jit_ret(); end = jit_get_ip().ptr; /* call the generated code, passing its size as argument */ jit_flush_code(start, end); myFunction(end - start); } The function shows how many bytes were generated. Most of the code is not very interesting, as it resembles very closely the program presented in A function which increments a number by one. For this reason, we're going to concentrate on just a few statements. start = jit_get_ip().ptr;end = jit_get_ip().ptr; These two instruction call the jit_get_ip macro which wasmentioned in A function which increments a number by onetoo. In this case we use the only field of jit_code that isnot a function pointer: ptr, which is a simple char *. jit_movi_p(JIT_R0, "generated %d bytes\n"); Note the use of the ‘p’ type specifier, which automaticallycasts the second parameter to an unsigned long to make thecode more clear and less cluttered by typecasts. jit_prepare(2);jit_pusharg_i(JIT_R1);jit_pusharg_p(JIT_R0);jit_finish(printf); Once the arguments to printf have been put in general-purposeregisters, we can start a prepare/pusharg/finish sequence thatmoves the argument to either the stack or registers, then callsprintf, then cleans up the stack. Note how gnu lightningabstracts the differences between different architectures andABI's – the client program does not know how parameter passingworks on the host architecture. A more complex example, an RPN calculator We create a small stack-based RPN calculator which applies a series of operators to a given parameter and to other numeric operands. Unlike previous examples, the code generator is fully parameterized and is able to compile different formulas to different functions. Here is the code for the expression compiler; a sample usage will follow. #include <stdio.h> #include "lightning.h" typedef int (*pifi)(int); /* Pointer to Int Function of Int */ pifi compile_rpn(char *expr) { pifi fn; int in; fn = (pifi) (jit_get_ip().iptr); jit_leaf(1); in = jit_arg_i(); jit_getarg_i(JIT_R0, in); while (*expr) { char buf[32]; int n; if (sscanf(expr, "%[0-9]%n", buf, &n)) { expr += n - 1; jit_push_i(JIT_R0); jit_movi_i(JIT_R0, atoi(buf)); } else if (*expr == '+') { jit_pop_i(JIT_R1); jit_addr_i(JIT_R0, JIT_R1, JIT_R0); } else if (*expr == '-') { jit_pop_i(JIT_R1); jit_subr_i(JIT_R0, JIT_R1, JIT_R0); } else if (*expr == '*') { jit_pop_i(JIT_R1); jit_mulr_i(JIT_R0, JIT_R1, JIT_R0); } else if (*expr == '/') { jit_pop_i(JIT_R1); jit_divr_i(JIT_R0, JIT_R1, JIT_R0); } else { fprintf(stderr, "cannot compile: %s\n", expr); abort(); } ++expr; } jit_movr_i(JIT_RET, JIT_R0); jit_ret(); return fn; } The principle on which the calculator is based is easy: the stack top is held in R0, while the remaining items of the stack are held on the hardware stack. Compiling an operand pushes the old stack top onto the stack and moves the operand into R0; compiling an operator pops the second operand off the stack into R1, and compiles the operation so that the result goes into R0, thus becoming the new stack top. Try to locate a call to jit_set_ip in the source code. You will not find one; this means that the client has to manually set the instruction pointer. This technique has one advantage and one drawback. The advantage is that the client can simply set the instruction pointer once and then generate code for multiple functions, one after another, without caring about passing a different instruction pointer each time; see Re-entrant usage of gnu lightning for the disadvantage. Source code for the client (which lies in the same source file) follows: static jit_insn codeBuffer[1024]; int main() { pifi c2f, f2c; int i; jit_set_ip(codeBuffer); c2f = compile_rpn("9*5/32+"); f2c = compile_rpn("32-5*9/"); jit_flush_code(codeBuffer, jit_get_ip().ptr); printf("\nC:"); for (i = 0; i <= 100; i += 10) printf("%3d ", i); printf("\nF:"); for (i = 0; i <= 100; i += 10) printf("%3d ", c2f(i)); printf("\n"); printf("\nF:"); for (i = 32; i <= 212; i += 10) printf("%3d ", i); printf("\nC:"); for (i = 32; i <= 212; i += 10) printf("%3d ", f2c(i)); printf("\n"); return 0; } The client displays a conversion table between Celsius and Fahrenheit degrees (both Celsius-to-Fahrenheit and Fahrenheit-to-Celsius). The formulas are, F(c) = c*9/5+32 and C(f) = (f-32)*5/9, respectively. Providing the formula as an argument to compile_rpn effectively parameterizes code generation, making it possible to use the same code to compile different functions; this is what makes dynamic code generation so powerful. The rpn.c file in the gnu lightning distribution includes a more complete (and more complex) implementation of compile_rpn, which does constant folding, allows the argument to the functions to be used more than once, and is able to assemble instructions with an immediate parameter. Fibonacci numbers The code in this section calculates a variant of the Fibonacci sequence. While the traditional Fibonacci sequence is modeled by the recurrence relation: f(0) = f(1) = 1 f(n) = f(n-1) + f(n-2) the functions in this section calculates the following sequence, which is more interesting as a benchmarkThat's because, as is easily seen, the sequence represents the number of activations of the nfibs procedure that are needed to compute its value through recursion.: nfibs(0) = nfibs(1) = 1 nfibs(n) = nfibs(n-1) + nfibs(n-2) + 1 The purpose of this example is to introduce branches. There are two kind of branches: backward branches and forward branches. We'll present the calculation in a recursive and iterative form; the former only uses forward branches, while the latter uses both. #include <stdio.h> #include "lightning.h" static jit_insn codeBuffer[1024]; typedef int (*pifi)(int); /* Pointer to Int Function of Int */ int main() { pifi nfibs = (pifi) (jit_set_ip(codeBuffer).iptr); int in; /* offset of the argument */ jit_insn *ref; /* to patch the forward reference */ jit_prolog (1); in = jit_arg_ui (); jit_getarg_ui(JIT_V0, in); /* V0 = n */ ref = jit_blti_ui (jit_forward(), JIT_V0, 2); jit_subi_ui (JIT_V1, JIT_V0, 1); /* V1 = n-1 */ jit_subi_ui (JIT_V2, JIT_V0, 2); /* V2 = n-2 */ jit_prepare(1); jit_pusharg_ui(JIT_V1); jit_finish(nfibs); jit_retval(JIT_V1); /* V1 = nfibs(n-1) */ jit_prepare(1); jit_pusharg_ui(JIT_V2); jit_finish(nfibs); jit_retval(JIT_V2); /* V2 = nfibs(n-2) */ jit_addi_ui(JIT_V1, JIT_V1, 1); jit_addr_ui(JIT_RET, JIT_V1, JIT_V2); /* RET = V1 + V2 + 1 */ jit_ret(); jit_patch(ref); /* patch jump */ jit_movi_i(JIT_RET, 1); /* RET = 1 */ jit_ret(); /* call the generated code, passing 32 as an argument */ jit_flush_code(codeBuffer, jit_get_ip().ptr); printf("nfibs(%d) = %d", 32, nfibs(32)); return 0; } As said above, this is the first example of dynamically compiling branches. Branch instructions have three operands: two contains the values to be compared, while the first is a label; gnu lightning label's are represented as jit_insn * values. Unlike other instructions (apart from arg, which is actually a directive rather than an instruction), branch instructions also return a value which, as we see in the example above, can be used to compile forward references. Compiling a forward reference is a two-step operation. First, a branch is compiled with a dummy label, since the actual destination of the jump is not yet known; the dummy label is returned by the jit_forward() macro. The value returned by the branch instruction is saved to be used later. Then, when the destination of the jump is reached, another macro is used, jit_patch(). This macro must be called once for every point in which the code had a forward branch to the instruction following jit_patch (in this case a movi_i instruction). Now, here is the iterative version: #include <stdio.h> #include "lightning.h" static jit_insn codeBuffer[1024]; typedef int (*pifi)(int); /* Pointer to Int Function of Int */ int main() { pifi nfibs = (pifi) (jit_set_ip(codeBuffer).iptr); int in; /* offset of the argument */ jit_insn *ref; /* to patch the forward reference */ jit_insn *loop; /* start of the loop */ jit_leaf (1); in = jit_arg_ui (); jit_getarg_ui(JIT_R2, in); /* R2 = n */ jit_movi_ui (JIT_R1, 1); ref = jit_blti_ui (jit_forward(), JIT_R2, 2); jit_subi_ui (JIT_R2, JIT_R2, 1); jit_movi_ui (JIT_R0, 1); loop= jit_get_label(); jit_subi_ui (JIT_R2, JIT_R2, 1); /* decr. counter */ jit_addr_ui (JIT_V0, JIT_R0, JIT_R1); /* V0 = R0 + R1 */ jit_movr_ui (JIT_R0, JIT_R1); /* R0 = R1 */ jit_addi_ui (JIT_R1, JIT_V0, 1); /* R1 = V0 + 1 */ jit_bnei_ui (loop, JIT_R2, 0); /* if (R2) goto loop; */ jit_patch(ref); /* patch forward jump */ jit_movr_ui (JIT_RET, JIT_R1); /* RET = R1 */ jit_ret (); /* call the generated code, passing 36 as an argument */ jit_flush_code(codeBuffer, jit_get_ip().ptr); printf("nfibs(%d) = %d", 36, nfibs(36)); return 0; } This code calculates the recurrence relation using iteration (a for loop in high-level languages). There is still a forward reference (indicated by the jit_forward/jit_patch pair); there are no function calls anymore: instead, there is a backward jump (the bnei at the end of the loop). In this case, the destination address should be known, because the jumps lands on an instruction that has already been compiled. However the program must make a provision and remember the address where the jump will land. This is achieved with jit_get_label, yet another macro that is much similar to jit_get_ip but, instead of a jit_code union, it answers an jit_insn * that the branch macros accept. Now, let's make one more change: let's rewrite the loop like this: jit_delay( jit_movi_ui (JIT_R1, 1), ref = jit_blti_ui (jit_forward(), JIT_R2, 2)); jit_subi_ui (JIT_R2, JIT_R2, 1); loop= jit_get_label(); jit_subi_ui (JIT_R2, JIT_R2, 1); /* decr. counter */ jit_addr_ui (JIT_V0, JIT_R0, JIT_R1); /* V0 = R0 + R1 */ jit_movr_ui (JIT_R0, JIT_R1); /* R0 = R1 */ jit_delay( jit_addi_ui (JIT_R1, JIT_V0, 1), /* R1 = V0 + 1 */ jit_bnei_ui (loop, JIT_R2, 0)); /* if (R2) goto loop; */ The jit_delay macro is used to schedule delay slots in jumps and branches. This is optional, but might lead to performance improvements in tight inner loops (of course not in a loop that is executed 35 times, but this is just an example). jit_delay takes two gnu lightning instructions, a delay instruction and a branch instruction. Note that the two instructions must be written in execution order (first the delay instruction, then the branch instruction), not with the branch first. If the current machine has a delay slot, the delay instruction (or part of it) is placed in the delay slot after the branch instruction; otherwise, it emits the delay instruction before the branch instruction. The delay instruction must not depend on being executed before or after the branch. Instead of jit_patch, you can use jit_patch_at, which takes two arguments: the first is the same as for jit_patch, and the second is the valued to be patched in. In other words, these two invocations have the same effect: jit_patch (jump_pc); jit_patch_at (jump_pc, jit_get_ip ()); Dual to branches and jit_patch_at are jit_movi_p and jit_patch_movi, which can also be used to implement forward references. jit_movi_p is carefully implemented to use an encoding that is as long as possible, so that it can always be patched; in addition, like branches, it will return an address which is then passed to jit_patch_movi. The usage of jit_patch_movi is similar to jit_patch_at. Re-entrant usage of gnu lightning By default, gnu lightning is able to compile different functions at the same time as long as it happens in different object files, and on the other hand constrains code generation tasks to reside in a single object file. The reason for this is not apparent, but is easily explained: the lightning.h header file defines its state as a static variable, so calls to jit_set_ip and jit_get_ip residing in different files access different instruction pointers. This was not done without reason: it makes the usage of gnu lightning much simpler, as it limits the initialization tasks to the bare minimum and removes the need to link the program with a separate library. On the other hand, multi-threaded or otherwise concurrent programs require reentrancy in the code generator, so this approach cannot be the only one. In fact, it is possible to define your own copy of gnu lightning's instruction state by defining a variable of type jit_state and #define-ing _jit to it: struct jit_state lightning; #define _jit lightning You are free to define the jit_state variable as you like: extern, static to a function, auto, or global. This feature takes advantage of an aspect of macros (cascaded macros), which is documented thus in CPP's reference manual: A cascade of macros is when one macro's body contains a reference to another macro. This is very common practice. For example, #define BUFSIZE 1020 #define TABLESIZE BUFSIZE This is not at all the same as defining TABLESIZE to be ‘1020’. The #define for TABLESIZE uses exactly the body you specify—in this case, BUFSIZE—and does not check to see whether it too is the name of a macro; it's only when you use TABLESIZE that the result of its expansion is checked for more macro names. This makes a difference if you change the definition of BUFSIZE at some point in the source file. TABLESIZE, defined as shown, will always expand using the definition of BUFSIZE that is currently in effect: #define BUFSIZE 1020 #define TABLESIZE BUFSIZE #undef BUFSIZE #define BUFSIZE 37 Now TABLESIZE expands (in two stages) to `37'. (The #undef is to prevent any warning about the nontrivial redefinition of BUFSIZE.) In the same way, jit_get_label will adopt whatever definition of _jit is in effect: #define jit_get_label() (_jit.pc) Special care must be taken when functions residing in separate files must access the same state. This could be the case, for example, if a special library contained function for strength reduction of multiplications to adds & shifts, or maybe of divisions to multiplications and shifts. The function would be compiled using a single definition of _jit and that definition would be used whenever the function would be called. Since gnu lightning uses a feature of the preprocessor to obtain re-entrancy, it makes sense to rely on the preprocessor in this case too. The idea is to pass the current struct jit_state to the function: static void _opt_muli_i(jit, dest, source, n) register struct jit_state *jit; register int dest, source, n; { #define _jit jit … #undef _jit } doing this unbeknownst to the client, using a macro in the header file: extern void _opt_muli_i(struct jit_state *, int, int, int); #define opt_muli_i(rd, rs, n) _opt_muli_i(&_jit, (rd), (rs), (n)) Registers Accessing the whole register file As mentioned earlier in this chapter, all gnu lightning back-ends are guaranteed to have at least six integer registers and six floating-point registers, but many back-ends will have more. To access the entire register files, you can use the JIT_R, JIT_V and JIT_FPR macros. They accept a parameter that identifies the register number, which must be strictly less than JIT_R_NUM, JIT_V_NUM and JIT_FPR_NUM respectively; the number need not be constant. Of course, expressions like JIT_R0 and JIT_R(0) denote the same register, and likewise for integer callee-saved, or floating-point, registers. Using autoconf with gnu lightning It is very easy to include gnu lightning's source code (without the documentation and examples) into your program's distribution so that people don't need to have it installed in order to use it. Here is a step by step explanation of what to do: Run lightningize from your package's maindistribution directory. lightningize If you're using Automake, you might be pleased to know thatMakefile.am files will be already there. If you're not using Automake and aclocal, instead,you should delete the Makefile.am files (they are of no useto you) and copy the contents of the lightning.m4 file, found inaclocal's macro repository (usually /usr/share/aclocal,to your configure.in or acinclude.m4 or aclocal.m4 file. Include a call to the LIGHTNING_CONFIGURE_IF_NOT_FOUNDmacro in your configure.in file. LIGHTNING_CONFIGURE_IF_NOT_FOUND will first look for a pre-installed copy of gnu lightning and, if it can be found, it will use it; otherwise, it will do exactly the same things that gnu lightning's own configure script does. If gnu lightning is already installed, or if the configuration process succeeds, it will define the HAVE_LIGHTNING symbol. In addtion, an Automake conditional named HAVE_INSTALLED_LIGHTNING will be set if gnu lightning is already installed, which can be used to set up include paths appropriately. Finally, LIGHTNING_CONFIGURE_IF_NOT_FOUND accepts two optional parameters: respectively, an action to be taken if gnu lightning is available, and an action to be taken if it is not. Porting gnu lightning This chapter describes the process of porting gnu lightning. It assumes that you are pretty comfortable with the usage of gnu lightning for dynamic code generation, as described in . An overview of the porting process A particular port of gnu lightning is composed of four files. These have a common suffix which identifies the port (for example, i386 or ppc), and a prefix that identifies their function; they are: asm-suffix.h, which contains the description of thetarget machine's instruction format. The creation of this fileis discussed in Creating the run-time assembler. core-suffix.h, which contains the mappings fromgnu lightning's instruction set to the target machine's assemblylanguage format. The creation of this file is discussed inCreating the platform-independent layer. funcs-suffix.h, for now, only contains the definitionof jit_flush_code. The creation of this file is brieflydiscussed in More complex tasks in the platform-independent layer. fp-suffix.h, which contains the description of thetarget machine's instruction format and the internal macros for doingfloating point computation. The creation of this file is discussedin Implementing macros for floating point. Before doing anything, you have to add the ability to recognize the new port during the configuration process. This is explained in Automatically recognizing the new platform. Automatically recognizing the new platform Before starting your port, you have to add the ability to recognize the new port during the configure process. You only have to run config.guess, which you'll find in the main distribution directory, and note down the first part of the output (up to the first dash). Then, in the two files configure.in and lightning.m4, lookup the line case "$host_cpu" in and, right after it, add the line: cpu-name) cpu=file-suffix ;; where cpu-name is the cpu as output by config.guess, and file-suffix is the suffix that you are going to use for your files (see An overview of the porting process). Now create empty files for your new port: touch lightning/asm-xxx.h touch lightning/fp-xxx.h touch lightning/core-xxx.h touch lightning/funcs-xxx.h and run configure, which should create the symlinks that are needed by lightning.h. This is important because it will allow you to use gnu lightning (albeit in a limited way) for testing even before the port is completed. Creating the run-time assembler The run-time assembler is a set of macros whose purpose is to assemble instructions for the target machine's assembly language, translating mnemonics to machine language together with their operands. While a run-time assembler is not, strictly speaking, part of gnu lightning (it is a private layer to be used while implementing the standard macros that are ultimately used by clients), designing a run-time assembler first allows you to think in terms of assembly language rather than binary code (ouch!…), making it considerably easier to write the standard macros. Creating a run-time assembler is a tedious process rather than a difficult one, because most of the time will be spent collecting and copying information from the architecture's manual. Macros defined by a run-time assembler are conventionally named after the mnemonic and the type of its operands. Examples took from the SPARC's run-time assembler are ADDrrr, a macro that assembles an ADD instruction with three register operands, and SUBCCrir, which assembles a SUBCC instruction whose second operand is an immediate and the remaining two are registers. The first step in creating the assembler is to pick a convention for operand specifiers (r and i in the example above) and for register names. On the SPARC, this convention is as follows r A register name. For every r in the macro name, a numericparameter RR is passed to the macro, and the operand is assembledas %rRR. i An immediate, usually a 13-bit signed integer (with exception forinstructions such as SETHI and branches). The macros checkthe size of the passed parameter if gnu lightning is configured with--enable-assertions. x A combination of two r parameters, which are summed to determinethe effective address in a memory load/store operation. m A combination of an r and i parameter, which are summed todetermine the effective address in a memory load/store operation. Additional macros can be defined that provide easier access to register names. For example, on the SPARC, _Ro(3) and _Rg(5) map respectively to %o3 and %g5; on the x86, instead, symbolic representations of the register names are provided (for example, _EAX and _EBX). CISC architectures sometimes have registers of different sizes–this is the case on the x86 where %ax is a 16-bit register while %esp is a 32-bit one. In this case, it can be useful to embed information on the size in the definition of register names. The x86 machine language, for example, represents all three of %bh, %di and %edi as 7; but the x86 run-time assemblers defines them with different numbers, putting the register's size in the upper nybble (for example, ‘17h’ for %bh and ‘27h’ for %di) so that consistency checks can be made on the operands' sizes when --enable-assertions is used. The next important part defines the native architecture's instruction formats. These can be as few as ten on RISC architectures, and as many as fifty on CISC architectures. In the latter case it can be useful to define more macros for sub-formats (such as macros for different addressing modes) or even for sub-fields in an instruction. Let's see an example of these macros. #define _2i( OP, RD, OP2, IMM) _I((_u2 (OP )<<30) | (_u5(RD)<<25) | (_u3(OP2)<<22) | _u22(IMM) ) The name of the macro, _2i, indicates a two-operand instruction comprising an immediate operand. The instruction format is: .------.---------.------.-------------------------------------------. | OP | RD | OP2 | IMM | |------+---------+------+-------------------------------------------| |2 bits| 5 bits |3 bits| 22 bits | |31-30 | 29-25 | 22-24| 0-21 | '------'---------'------'-------------------------------------------' gnu lightning provides macros named _sXX(OP) and _uXX(OP), where XX is a number between 1 and 31, which testOnly when --enable-assertions is used. whether OP can be represented as (respectively) a signed or unsigned integer of the given size. What the macro above does, then, is to shift and or together the different fields, ensuring that each of them fits the field. Here is another definition, this time for the PowerPC architecture. #define _X(OP,RD,RA,RB,XO,RC) _I((_u6 (OP)<<26) | (_u5(RD)<<21) | (_u5(RA)<<16) | ( _u5(RB)<<11) | (_u10(XO)<<1) | _u1(RC) ) Here is the bit layout corresponding to this instruction format: .--------.--------.--------.--------.---------------------.-------. | OP | RD | RA | RB | X0 | RC | |--------+--------+--------+--------+-----------------------------| | 6 bits | 5 bits | 5 bits | 5 bits | 10 bits | 1 bit | | 31-26 | 25-21 | 16-20 | 11-15 | 1-10 | 0 | '--------'---------'-------'--------'-----------------------------' How do these macros actually generate code? The secret lies in the _I macro, which is one of four predefined macros which actually store machine language instructions in memory. They are _B, _W, _I and _L, respectively for 8-bit, 16-bit, 32-bit, and long (either 32-bit or 64-bit, depending on the architecture) values. Next comes another set of macros (usually the biggest) which represents the actual mnemonics—macros such as ADDrrr and SUBCCrir, which were cited earlier in this chapter, belong to this set. Most of the times, all these macros will do is to use the “instruction format” macros, specifying the values of the fields in the different instruction formats. Let's see a few of these definitions, again taken from the SPARC assembler: #define BAi(DISP) _2 (0, 0, 8, 2, DISP) #define BA_Ai(DISP) _2 (0, 1, 8, 2, DISP) #define SETHIir(IMM, RD) _2i (0, RD, 4, IMM) #define ADDrrr(RS1, RS2, RD) _3 (2, RD, 0, RS1, 0, 0, RS2) #define ADDrir(RS1, IMM, RD) _3i (2, RD, 0, RS1, 1, IMM) #define ADDCCrrr(RS1, RS2, RD) _3 (2, RD, 16, RS1, 0, 0, RS2) #define ADDCCrir(RS1, IMM, RD) _3i (2, RD, 16, RS1, 1, IMM) #define ANDrrr(RS1, RS2, RD) _3 (2, RD, 1, RS1, 0, 0, RS2) #define ANDrir(RS1, IMM, RD) _3i (2, RD, 1, RS1, 1, IMM) #define ANDCCrrr(RS1, RS2, RD) _3 (2, RD, 17, RS1, 0, 0, RS2) #define ANDCCrir(RS1, IMM, RD) _3i (2, RD, 17, RS1, 1, IMM) A few things have to be noted. For example: The SPARC assembly language sometimes uses a comma inside a mnemonic(for example, ba,a). This symbol is not allowed inside acpp macro name, so it is replaced with an underscore; the sameis done with the dots found in the PowerPC assembly language (forexample, andi. is defined as ANDI_rri). It can be useful to group together instructions with the sameinstruction format, as doing this tends to make the source codemore readable (numbers are put in the same columns). Using an editor without automatic wrap at end of line can be useful,since run-time assemblers tend to have very long lines. A final touch is to define the synthetic instructions, which are usually found on RISC machines. For example, on the SPARC, the LD instruction has two synonyms (LDUW and LDSW) which are defined thus: #define LDUWxr(RS1, RS2, RD) LDxr(RS1, RS2, RD) #define LDUWmr(RS1, IMM, RD) LDmr(RS1, IMM, RD) #define LDSWxr(RS1, RS2, RD) LDxr(RS1, RS2, RD) #define LDSWmr(RS1, IMM, RD) LDmr(RS1, IMM, RD) Other common case are instructions which take advantage of registers whose value is hard-wired to zero, and short-cut instructions which hard-code some or all of the operands: /* Destination is %g0, which the processor never overwrites. */ #define CMPrr(R1, R2) SUBCCrrr(R1, R2, 0) /* subcc %r1, %r2, %g0 */ /* One of the source registers is hard-coded to be %g0. */ #define NEGrr(R,S) SUBrrr(0, R, S) /* sub %g0, %rR, %rS */ /* All of the operands are hard-coded. */ #define RET() JMPLmr(31,8 ,0) /* jmpl [%r31+8], %g0 */ /* One of the operands acts as both source and destination */ #define BSETrr(R,S) ORrrr(R, S, S) /* or %rR, %rS, %rS */ Specific to RISC computers, finally, is the instruction to load an arbitrarily sized immediate into a register. This instruction is usually implemented as one or two basic instructions: If the number is small enough, an instruction is sufficient(LI or ORI on the PowerPC, MOV on the SPARC). If the lowest bits are all zeroed, an instruction is sufficient(LIS on the PowerPC, SETHI on the SPARC). Otherwise, the high bits are set first (with LIS orSETHI), and the result is then ored with the lowbits Here is the definition of such an instruction for the PowerPC: #define MOVEIri(R,I) (_siP(16,I) ? LIri(R,I) : \ /* case 1 */ (_uiP(16,I) ? ORIrri(R,0,I) : \ /* case 1 */ _MOVEIri(R, _HI(I), _LO(I)) )) /* case 2/3 */ #define _MOVEIri(H,L,R) (LISri(R,H), (L ? ORIrri(R,R,L) : 0)) and for the SPARC: #define SETir(I,R) (_siP(13,I) ? MOVir(I,R) : \ _SETir(_HI(I), _LO(I), R)) #define _SETir(H,L,R) (SETHIir(H,R), (L ? ORrir(R,L,R) : 0)) In both cases, _HI and _LO are macros for internal use that extract different parts of the immediate operand. You should take a look at the run-time assemblers distributed with gnu lightning before trying to craft your own. In particular, make sure you understand the RISC run-time assemblers (the SPARC's is the simplest) before trying to decypher the x86 run-time assembler, which is significantly more complex. Creating the platform-independent layer The platform-independent layer is the one that is ultimately used by gnu lightning clients. Creating this layer is a matter of creating a hundred or so macros that comprise part of the interface used by the clients, as described in gnu lightning's instruction set. Fortunately, a number of these definitions are common to the different platforms and are defined just once in one of the header files that make up gnu lightning, that is, core-common.h. Most of the macros are relatively straight-forward to implement (with a few caveats for architectures whose assembly language only offers two-operand arithmetic instructions). This section will cover the tricky points, before presenting the complete listing of the macros that make up the platform-independent interface provided by gnu lightning. Implementing forward references Implementation of forward references takes place in: The branch macros The jit_patch_at macros Roughly speaking, the branch macros, as seen in Generating code at run-time, return a value that later calls to jit_patch or jit_patch_at use to complete the assembly of the forward reference. This value is usually the contents of the program counter after the branch instruction is compiled (which is accessible in the _jit.pc variable). Let's see an example from the x86 back-end: #define jit_bmsr_i(label, s1, s2) \ (TESTLrr((s1), (s2)), JNZm(label,0,0,0), _jit.pc) The bms (branch if mask set) instruction is assembled as the combination of a TEST instruction (bit-wise and between the two operands) and a JNZ instruction (jump if non-zero). The macro then returns the final value of the program counter. jit_patch_at is one of the few macros that need to possess a knowledge of the machine's instruction formats. Its purpose is to patch a branch instruction (identified by the value returned at the moment the branch was compiled) to jump to the current position (that is, to the address identified by _jit.pc). On the x86, the displacement between the jump and the landing point is expressed as a 32-bit signed integer lying in the last four bytes of the jump instruction. The definition of _jit_patch_at is: #define jit_patch(jump_pc, pv) (*_PSL((jump_pc) - 4) = \ (pv) - (jump_pc)) The _PSL macro is nothing more than a cast to long *, and is used here to shorten the definition and avoid cluttering it with excessive parentheses. These type-cast macros are: _PUC(X) to cast to a unsigned char *. _PUS(X) to cast to a unsigned short *. _PUI(X) to cast to a unsigned int *. _PSL(X) to cast to a long *. _PUL(X) to cast to a unsigned long *. On other platforms, notably RISC ones, the displacement is embedded into the instruction itself. In this case, jit_patch_at must first zero out the field, and then or in the correct displacement. The SPARC, for example, encodes the displacement in the bottom 22 bits; in addition the right-most two bits are suppressed, which are always zero because instruction have to be word-aligned. #define jit_patch_at(delay_pc, pv) jit_patch_ (((delay_pc) - 1), (pv)) /* branch instructions return the address of the delay * instruction---this is just a helper macro that makes the code more * readable. */ #define jit_patch_(jump_pc, pv) (*jump_pc = \ (*jump_pc & ~_MASK(22)) | \ ((_UL(pv) - _UL(jump_pc)) >> 2) & _MASK(22)) This introduces more predefined shortcut macros: _UC(X) to cast to a unsigned char. _US(X) to cast to a unsigned short. _UI(X) to cast to a unsigned int. _SL(X) to cast to a long. _UL(X) to cast to a unsigned long. _MASK(N) gives a binary number made of N ones. Dual to branches and jit_patch_at are jit_movi_p and jit_patch_movi, since they can also be used to implement forward references. jit_movi_p should be carefully implemented to use an encoding that is as long as possible, and it should return an address which is then passed to jit_patch_movi. The implementation of jit_patch_movi is similar to jit_patch_at. Common features supported by core-common.h The core-common.h file contains hundreds of macro definitions which will spare you defining a lot of things in the files the are specific to your port. Here is a list of the features that core-common.h provides. Support for common synthetic instructions These are instructions that can be represented as a simple operation,for example a bit-wise and or a subtraction. core-common.hrecognizes when the port-specific header file defines these macros andavoids compiler warnings about redefined macros, but there should beno need to define them. They are: #define jit_extr_c_ui(d, rs) #define jit_extr_s_ui(d, rs) #define jit_extr_c_ul(d, rs) #define jit_extr_s_ul(d, rs) #define jit_extr_i_ul(d, rs) #define jit_negr_i(d, rs) #define jit_negr_l(d, rs) Support for the abi All of jit_prolog, jit_leaf and jit_finish are notmandatory. If not defined, they will be defined respectively as anempty macro, as a synonym for jit_prolog, and as a synonym forjit_calli. Whether to define them or not in the port-specificheader file, it depends on the underlying architecture's abi—ingeneral, however, you'll need to define at least jit_prolog. Support for uncommon instructions These are instructions that many widespread architectures lack.core-common.h is able to provide default definitions, but theyare usually inefficient if the hardware provides a way to do theseoperations with a single instruction. They are extension with signand “reverse subtraction” (that is, REG2=IMM-REG1): #define jit_extr_c_i(d, rs) #define jit_extr_s_i(d, rs) #define jit_extr_c_l(d, rs) #define jit_extr_s_l(d, rs) #define jit_extr_i_l(d, rs) #define jit_rsbi_i(d, rs, is) #define jit_rsbi_l(d, rs, is) #define jit_rsbi_p(d, rs, is) Conversion between network and host byte ordering These macros are no-ops on big endian systems. Don't define them onsuch systems; on the other hand, they are mandatory on little endiansystems. They are: #define jit_ntoh_ui(d, rs) #define jit_ntoh_us(d, rs) Support for a “zero” register Many RISC architectures provide a read-only register whose value ishard-coded to be zero; this register is then used implicitly whenreferring to a memory location using a single register. For example,on the SPARC, an operand like [%l6] is actually assembled as[%l6+%g0]. If this is the case, you should defineJIT_RZERO to be the number of this register; core-common.hwill use it to implement all variations of the ld and stinstructions. For example: #define jit_ldi_c(d, is) jit_ldxi_c(d, JIT_RZERO, is) #define jit_ldr_i(d, rs) jit_ldxr_c(d, JIT_RZERO, rs) If available, JIT_RZERO is also used to provide more efficientdefinitions of the neg instruction (see “Support for commonsynthetic instructions”, above). Synonyms core-common.h provides a lot of trivial definitions which makethe instruction set as orthogonal as possible. For example, adding twounsigned integers is exactly the same as adding two signed integers(assuming a two's complement representation of negative numbers); yet,gnu lightning provides both jit_addr_i and jit_addr_uimacros. Similarly, pointers and unsigned long integers behave in thesame way, but gnu lightning has separate instruction for the two datatypes—those that operate on pointers usually include a typecastthat makes programs clearer. Shortcuts These define “synthetic” instructions whose definition is not astrivial as in the case of synonyms, but is anyway standard. Thisis the case for bitwise not (which is implemented by XORing astring of ones), “reverse subtraction” between registers (which isconverted to a normal subtraction with the two source operandsinverted), and subtraction of an immediate from a register (which isconverted to an addition). Unlike neg and ext (see“Support for common synthetic instructions”, above), which aresimply non-mandatory, you must not define these functions. Support for longs On most systems, longs and unsigned longs are the sameas, respectively, ints and unsigned ints. In this case,core-common.h defines operations on these types to be synonyms. jit_state Last but not least, core-common.h defines the jit_statetype. Part of this struct is machine-dependent and includesall kinds of state needed by the back-end; this part is alwaysaccessible in a re-entrant way as _jitl. _jitl will beof type struct jit_local_state; this struct must be definedeven if no state is required. Supporting scheduling of delay slots Delay slot scheduling is obtained by clients through the jit_delay macro. However this macro is not to be defined in the platform-independent layer, because gnu lightning provides a common definition in core-common.h. Instead, the platform-independent layer must define another macro, called jit_fill_delay_after, which has to exchange the instruction to be scheduled in the delay slot with the branch instruction. The only parameter accepted by the macro is a call to a branch macro, which must be expanded exactly once by jit_fill_delay_after. The client must be able to pass the return value of jit_fill_delay_after to jit_patch_at. There are two possible approaches that can be used in jit_fill_delay_after. They are summarized in the following pictures: The branch instructions assemble a nop instruction which isthen removed by jit_fill_delay_after. before | after ---------------------------------+----------------------------- ... | <would-be delay instruction> | <branch instruction> <branch instruction> | <delay instruction> NOP | <--- _jit.pc <--- _jit.pc | The branch instruction assembles the branch so that the delayslot is annulled, jit_fill_delay_after toggles the bit: before | after ---------------------------------+----------------------------- ... | <would-be delay instruction> | <branch instruction> <branch with annulled delay> | <delay instruction> <--- _jit.pc | <--- _jit.pc Don't forget that you can take advantage of delay slots in the implementation of boolean instructions such as le or gt. Supporting arbitrarily sized immediate values This is a problem that is endemic to RISC machines. The basic idea is to reserve one or two register to represent large immediate values. Let's see an example from the SPARC: addi_i R0, V2, 45 | addi_i R0, V2, 10000 ---------------------------+--------------------------- add %l5, 45, %l0 | set 10000, %l6 | add %l5, %l6, %l0 In this case, %l6 is reserved to be used for large immediates. An elegant solution is to use an internal macro which automatically decides which version is to be compiled. Beware of register conflicts on machines with delay slots. This is the case for the SPARC, where %l7 is used instead for large immediates in compare-and-branch instructions. So the sequence jit_delay( jit_addi_i(JIT_R0, JIT_V2, 10000), jit_blei_i(label, JIT_R1, 20000) ); is assembled this way: set 10000, %l6 ! prepare immediate for add set 20000, %l7 ! prepare immediate for cmp cmp %l1, %l7 ble label add %l5, %l6, %l0 ! delay slot Note that using %l6 in the branch instruction would have given an incorrect result—R0 would have been filled with the value of V2+20000 rather than V2+10000. Implementing the ABI Implementing the underlying architecture's abi is done in the macros that handle function prologs and epilogs and argument passing. Let's look at the prologs and epilogs first. These are usually pretty simple and, what's more important, with constant content—that is, they always generate exactly the same instruction sequence. Here is an example: SPARC x86 save %sp, -96, %sp push %ebp push %ebx push %esi push %edi movl %esp, %ebp ... ... ret popl %edi restore popl %esi popl %ebx popl %ebp ret The registers that are saved (%ebx, %esi, %edi) are mapped to the V0 through V2 registers in the gnu lightning instruction set. Argument passing is more tricky. There are basically three casesFor speed and ease of implementation, gnu lightning does not currently support passing some of the parameters on the stack and some in registers.: Register windows Output registers are different from input registers—the prolog takescare of moving the caller's output registers to the callee's inputregisters. This is the case with the SPARC. Passing parameters via registers In this case, output registers are the same as input registers. Theprogram must take care of saving input parameters somewhere (on thestack, or in non-argument registers). This is the case with thePowerPC. All the parameters are passed on the stack This case is by far the simplest and is the most common in CISCarchitectures, like the x86 and Motorola 68000. In all cases, the port-specific header file will define two variable for private use—one to be used by the caller during the prepare/pusharg/finish sequence, one to be used by the callee, specifically in the jit_prolog and jit_arg macros. Let's look again, this time with more detail, at each of the cases. Register windows jit_finish is the same as jit_calli, and is definedin core-common.h (see Common featuressupported by core-common.h). #define jit_prepare_i(numargs) (_jitl.pusharg = _Ro(numargs)) #define jit_pusharg_i(rs) (--_jitl.pusharg, \ MOVrr((rs), _jitl.pusharg)) Remember that arguments pushing takes place in reverse order, thusgiving a pre-decrement (rather than post-increment) injit_pusharg_i.Here is what happens on the callee's side: #define jit_arg_c() (_jitl.getarg++) #define jit_getarg_c(rd, ofs) jit_extr_c_i ((rd), (ofs)) #define jit_prolog(numargs) (SAVErir(JIT_SP, -96, JIT_SP), \ _jitl.getarg = _Ri(0)) The jit_arg macros return nothing more than a register index,which is then used by the jit_getarg macros. jit_prologresets the counter used by jit_arg to zero; the numargsparameter is not used. It is sufficient for jit_leaf to be asynonym for jit_prolog. Passing parameter via registers The code is almost the same as that for the register windows case, butwith an additional complexity—jit_arg will transfer theargument from the input register to a non-argument register so thatfunction calls will not clobber it. The prolog and epilog code can thenbecome unbearably long, up to 20 instructions on the PPC; a commonsolution in this case is that of trampolines.The prolog does nothing more than put the function's actual address in acaller-preserved register and then call the trampoline: mflr r0 ! grab return address movei r10, trampo_2args ! jump to trampoline mtlr r10 blrl here: mflr r31 ! r31 = address of epilog ...actual code... mtlr r31 ! return to the trampoline blr In this case, jit_prolog does use its argument containing thenumber of parameters to pick the appropriate trampoline. Here,trampo_2args is the address of a trampoline designed for2-argument functions.The trampoline executes the prolog code, jumps to the contents ofr10, and upon return from the subroutine it executes theepilog code. All the parameters are passed on the stack jit_pusharg uses a hardware push operation, which is commonlyavailable on CISC machines (where this approach is most likelyfollowed). Since the stack has to be cleaned up after the call,jit_prepare_i remembers how many parameters have been put there,and jit_finish adjusts the stack pointer after the call. #define jit_prepare_i(numargs) (_jitl.args += (numargs)) #define jit_pusharg_i(rs) PUSHLr(rs) #define jit_finish(sub) (jit_calli((sub)), \ ADDLir(4 * _jitl.args, JIT_SP), \ _jitl.numargs = 0) Note the usage of += in jit_prepare_i. This is doneso that one can defer the popping of the arguments that were savedon the stack (stack pollution). To do so, it is sufficient touse jit_calli instead of jit_finish in all but thelast call.On the caller's side, arg returns an offset relative to theframe pointer, and getarg loads the argument from the stack: #define jit_getarg_c(rd, ofs) jit_ldxi_c((rd), _EBP, (ofs)); #define jit_arg_c() ((_jitl.frame += sizeof(int) \ - sizeof(int)) The _jitl.frame variable is initialized by jit_prologwith the displacement between the value of the frame pointer(%ebp) and the address of the first parameter. These schemes are the most used, so core-common.h provides a way to employ them automatically. If you do not define the jit_getarg_c macro and its companions, core-common.h will presume that you intend to pass parameters through either the registers or the stack. If you define JIT_FP, stack-based parameter passing will be employed and the jit_getarg macros will be defined like this: #define jit_getarg_c(reg, ofs) jit_ldxi_c((reg), JIT_FP, (ofs)); In other words, the jit_arg macros (which are still to be defined by the platform-specific back-end) shall return an offset into the stack frame. On the other hand, if you don't define JIT_FP, register-based parameter passing will be employed and the jit_arg macros shall return a register number; in this case, jit_getarg will be implemented in terms of jit_extr and jit_movr operations: #define jit_getarg_c(reg, ofs) jit_extr_c_i ((reg), (ofs)) #define jit_getarg_i(reg, ofs) jit_movr_i ((reg), (ofs)) Macros composing the platform-independent layer Register names (all mandatory but the last two) #define JIT_R #define JIT_R_NUM #define JIT_V #define JIT_V_NUM #define JIT_FPR #define JIT_FPR_NUM #define JIT_SP #define JIT_FP #define JIT_RZERO Helper macros (non-mandatory): #define jit_fill_delay_after(branch) Mandatory: #define jit_arg_c() #define jit_arg_i() #define jit_arg_l() #define jit_arg_p() #define jit_arg_s() #define jit_arg_uc() #define jit_arg_ui() #define jit_arg_ul() #define jit_arg_us() #define jit_abs_d(rd,rs) #define jit_addi_i(d, rs, is) #define jit_addr_d(rd,s1,s2) #define jit_addr_i(d, s1, s2) #define jit_addxi_i(d, rs, is) #define jit_addxr_i(d, s1, s2) #define jit_andi_i(d, rs, is) #define jit_andr_i(d, s1, s2) #define jit_beqi_i(label, rs, is) #define jit_beqr_d(label, s1, s2) #define jit_beqr_i(label, s1, s2) #define jit_bgei_i(label, rs, is) #define jit_bgei_ui(label, rs, is) #define jit_bger_d(label, s1, s2) #define jit_bger_i(label, s1, s2) #define jit_bger_ui(label, s1, s2) #define jit_bgti_i(label, rs, is) #define jit_bgti_ui(label, rs, is) #define jit_bgtr_d(label, s1, s2) #define jit_bgtr_i(label, s1, s2) #define jit_bgtr_ui(label, s1, s2) #define jit_blei_i(label, rs, is) #define jit_blei_ui(label, rs, is) #define jit_bler_d(label, s1, s2) #define jit_bler_i(label, s1, s2) #define jit_bler_ui(label, s1, s2) #define jit_bltgtr_d(label, s1, s2) #define jit_blti_i(label, rs, is) #define jit_blti_ui(label, rs, is) #define jit_bltr_d(label, s1, s2) #define jit_bltr_i(label, s1, s2) #define jit_bltr_ui(label, s1, s2) #define jit_bmci_i(label, rs, is) #define jit_bmcr_i(label, s1, s2) #define jit_bmsi_i(label, rs, is) #define jit_bmsr_i(label, s1, s2) #define jit_bnei_i(label, rs, is) #define jit_bner_d(label, s1, s2) #define jit_bner_i(label, s1, s2) #define jit_boaddi_i(label, rs, is) #define jit_boaddi_ui(label, rs, is) #define jit_boaddr_i(label, s1, s2) #define jit_boaddr_ui(label, s1, s2) #define jit_bordr_d(label, s1, s2) #define jit_bosubi_i(label, rs, is) #define jit_bosubi_ui(label, rs, is) #define jit_bosubr_i(label, s1, s2) #define jit_bosubr_ui(label, s1, s2) #define jit_buneqr_d(label, s1, s2) #define jit_bunger_d(label, s1, s2) #define jit_bungtr_d(label, s1, s2) #define jit_bunler_d(label, s1, s2) #define jit_bunltr_d(label, s1, s2) #define jit_bunordr_d(label, s1, s2) #define jit_calli(label) #define jit_callr(label) #define jit_ceilr_d_i(rd, rs) #define jit_divi_i(d, rs, is) #define jit_divi_ui(d, rs, is) #define jit_divr_d(rd,s1,s2) #define jit_divr_i(d, s1, s2) #define jit_divr_ui(d, s1, s2) #define jit_eqi_i(d, rs, is) #define jit_eqr_d(d, s1, s2) #define jit_eqr_i(d, s1, s2) #define jit_extr_i_d(rd, rs) #define jit_floorr_d_i(rd, rs) #define jit_gei_i(d, rs, is) #define jit_gei_ui(d, s1, s2) #define jit_ger_d(d, s1, s2) #define jit_ger_i(d, s1, s2) #define jit_ger_ui(d, s1, s2) #define jit_gti_i(d, rs, is) #define jit_gti_ui(d, s1, s2) #define jit_gtr_d(d, s1, s2) #define jit_gtr_i(d, s1, s2) #define jit_gtr_ui(d, s1, s2) #define jit_hmuli_i(d, rs, is) #define jit_hmuli_ui(d, rs, is) #define jit_hmulr_i(d, s1, s2) #define jit_hmulr_ui(d, s1, s2) #define jit_jmpi(label) #define jit_jmpr(reg) #define jit_ldxi_f(rd, rs, is) #define jit_ldxr_f(rd, s1, s2) #define jit_ldxi_c(d, rs, is) #define jit_ldxi_d(rd, rs, is) #define jit_ldxi_i(d, rs, is) #define jit_ldxi_s(d, rs, is) #define jit_ldxi_uc(d, rs, is) #define jit_ldxi_us(d, rs, is) #define jit_ldxr_c(d, s1, s2) #define jit_ldxr_d(rd, s1, s2) #define jit_ldxr_i(d, s1, s2) #define jit_ldxr_s(d, s1, s2) #define jit_ldxr_uc(d, s1, s2) #define jit_ldxr_us(d, s1, s2) #define jit_lei_i(d, rs, is) #define jit_lei_ui(d, s1, s2) #define jit_ler_d(d, s1, s2) #define jit_ler_i(d, s1, s2) #define jit_ler_ui(d, s1, s2) #define jit_lshi_i(d, rs, is) #define jit_lshr_i(d, r1, r2) #define jit_ltgtr_d(d, s1, s2) #define jit_lti_i(d, rs, is) #define jit_lti_ui(d, s1, s2) #define jit_ltr_d(d, s1, s2) #define jit_ltr_i(d, s1, s2) #define jit_ltr_ui(d, s1, s2) #define jit_modi_i(d, rs, is) #define jit_modi_ui(d, rs, is) #define jit_modr_i(d, s1, s2) #define jit_modr_ui(d, s1, s2) #define jit_movi_d(rd,immd) #define jit_movi_f(rd,immf) #define jit_movi_i(d, is) #define jit_movi_p(d, is) #define jit_movr_d(rd,rs) #define jit_movr_i(d, rs) #define jit_muli_i(d, rs, is) #define jit_muli_ui(d, rs, is) #define jit_mulr_d(rd,s1,s2) #define jit_mulr_i(d, s1, s2) #define jit_mulr_ui(d, s1, s2) #define jit_negr_d(rd,rs) #define jit_nei_i(d, rs, is) #define jit_ner_d(d, s1, s2) #define jit_ner_i(d, s1, s2) #define jit_nop() #define jit_ordr_d(d, s1, s2) #define jit_ori_i(d, rs, is) #define jit_orr_i(d, s1, s2) #define jit_patch_at(jump_pc, value) #define jit_patch_movi(jump_pc, value) #define jit_pop_i(rs) #define jit_prepare_d(numargs) #define jit_prepare_f(numargs) #define jit_prepare_i(numargs) #define jit_push_i(rs) #define jit_pusharg_i(rs) #define jit_ret() #define jit_retval_i(rd) #define jit_roundr_d_i(rd, rs) #define jit_rshi_i(d, rs, is) #define jit_rshi_ui(d, rs, is) #define jit_rshr_i(d, r1, r2) #define jit_rshr_ui(d, r1, r2) #define jit_sqrt_d(rd,rs) #define jit_stxi_c(rd, id, rs) #define jit_stxi_d(id, rd, rs) #define jit_stxi_f(id, rd, rs) #define jit_stxi_i(rd, id, rs) #define jit_stxi_s(rd, id, rs) #define jit_stxr_c(d1, d2, rs) #define jit_stxr_d(d1, d2, rs) #define jit_stxr_f(d1, d2, rs) #define jit_stxr_i(d1, d2, rs) #define jit_stxr_s(d1, d2, rs) #define jit_subr_d(rd,s1,s2) #define jit_subr_i(d, s1, s2) #define jit_subxi_i(d, rs, is) #define jit_subxr_i(d, s1, s2) #define jit_truncr_d_i(rd, rs) #define jit_uneqr_d(d, s1, s2) #define jit_unger_d(d, s1, s2) #define jit_ungtr_d(d, s1, s2) #define jit_unler_d(d, s1, s2) #define jit_unltr_d(d, s1, s2) #define jit_unordr_d(d, s1, s2) #define jit_xori_i(d, rs, is) #define jit_xorr_i(d, s1, s2) Non mandatory—there should be no need to define them: #define jit_extr_c_ui(d, rs) #define jit_extr_s_ui(d, rs) #define jit_extr_c_ul(d, rs) #define jit_extr_s_ul(d, rs) #define jit_extr_i_ul(d, rs) #define jit_negr_i(d, rs) #define jit_negr_l(d, rs) Non mandatory—whether to define them depends on the abi: #define jit_prolog(n) #define jit_finish(sub) #define jit_finishr(reg) #define jit_leaf(n) #define jit_getarg_c(reg, ofs) #define jit_getarg_i(reg, ofs) #define jit_getarg_l(reg, ofs) #define jit_getarg_p(reg, ofs) #define jit_getarg_s(reg, ofs) #define jit_getarg_uc(reg, ofs) #define jit_getarg_ui(reg, ofs) #define jit_getarg_ul(reg, ofs) #define jit_getarg_us(reg, ofs) #define jit_getarg_f(reg, ofs) #define jit_getarg_d(reg, ofs) Non mandatory—define them if instructions that do this exist: #define jit_extr_c_i(d, rs) #define jit_extr_s_i(d, rs) #define jit_extr_c_l(d, rs) #define jit_extr_s_l(d, rs) #define jit_extr_i_l(d, rs) #define jit_rsbi_i(d, rs, is) #define jit_rsbi_l(d, rs, is) Non mandatory if condition code are always set by add/sub, needed on other systems: #define jit_addci_i(d, rs, is) #define jit_addci_l(d, rs, is) #define jit_subci_i(d, rs, is) #define jit_subci_l(d, rs, is) Mandatory on little endian systems—don't define them on other systems: #define jit_ntoh_ui(d, rs) #define jit_ntoh_us(d, rs) Mandatory if JIT_RZERO not defined—don't define them if it is defined: #define jit_ldi_c(d, is) #define jit_ldi_i(d, is) #define jit_ldi_s(d, is) #define jit_ldr_c(d, rs) #define jit_ldr_i(d, rs) #define jit_ldr_s(d, rs) #define jit_ldi_uc(d, is) #define jit_ldi_ui(d, is) #define jit_ldi_ul(d, is) #define jit_ldi_us(d, is) #define jit_ldr_uc(d, rs) #define jit_ldr_ui(d, rs) #define jit_ldr_ul(d, rs) #define jit_ldr_us(d, rs) #define jit_sti_c(id, rs) #define jit_sti_i(id, rs) #define jit_sti_s(id, rs) #define jit_str_c(rd, rs) #define jit_str_i(rd, rs) #define jit_str_s(rd, rs) #define jit_ldi_f(rd, is) #define jit_sti_f(id, rs) #define jit_ldi_d(rd, is) #define jit_sti_d(id, rs) #define jit_ldr_f(rd, rs) #define jit_str_f(rd, rs) #define jit_ldr_d(rd, rs) #define jit_str_d(rd, rs) Synonyms—don't define them: #define jit_addi_p(d, rs, is) #define jit_addi_ui(d, rs, is) #define jit_addi_ul(d, rs, is) #define jit_addr_p(d, s1, s2) #define jit_addr_ui(d, s1, s2) #define jit_addr_ul(d, s1, s2) #define jit_andi_ui(d, rs, is) #define jit_andi_ul(d, rs, is) #define jit_andr_ui(d, s1, s2) #define jit_andr_ul(d, s1, s2) #define jit_beqi_p(label, rs, is) #define jit_beqi_ui(label, rs, is) #define jit_beqi_ul(label, rs, is) #define jit_beqr_p(label, s1, s2) #define jit_beqr_ui(label, s1, s2) #define jit_beqr_ul(label, s1, s2) #define jit_bmci_ui(label, rs, is) #define jit_bmci_ul(label, rs, is) #define jit_bmcr_ui(label, s1, s2) #define jit_bmcr_ul(label, s1, s2) #define jit_bmsi_ui(label, rs, is) #define jit_bmsi_ul(label, rs, is) #define jit_bmsr_ui(label, s1, s2) #define jit_bmsr_ul(label, s1, s2) #define jit_bgei_p(label, rs, is) #define jit_bger_p(label, s1, s2) #define jit_bgti_p(label, rs, is) #define jit_bgtr_p(label, s1, s2) #define jit_blei_p(label, rs, is) #define jit_bler_p(label, s1, s2) #define jit_blti_p(label, rs, is) #define jit_bltr_p(label, s1, s2) #define jit_bnei_p(label, rs, is) #define jit_bnei_ui(label, rs, is) #define jit_bnei_ul(label, rs, is) #define jit_bner_p(label, s1, s2) #define jit_bner_ui(label, s1, s2) #define jit_bner_ul(label, s1, s2) #define jit_eqi_p(d, rs, is) #define jit_eqi_ui(d, rs, is) #define jit_eqi_ul(d, rs, is) #define jit_eqr_p(d, s1, s2) #define jit_eqr_ui(d, s1, s2) #define jit_eqr_ul(d, s1, s2) #define jit_extr_c_s(d, rs) #define jit_extr_c_us(d, rs) #define jit_extr_uc_s(d, rs) #define jit_extr_uc_us(d, rs) #define jit_extr_uc_i(d, rs) #define jit_extr_uc_ui(d, rs) #define jit_extr_us_i(d, rs) #define jit_extr_us_ui(d, rs) #define jit_extr_uc_l(d, rs) #define jit_extr_uc_ul(d, rs) #define jit_extr_us_l(d, rs) #define jit_extr_us_ul(d, rs) #define jit_extr_ui_l(d, rs) #define jit_extr_ui_ul(d, rs) #define jit_gei_p(d, rs, is) #define jit_ger_p(d, s1, s2) #define jit_gti_p(d, rs, is) #define jit_gtr_p(d, s1, s2) #define jit_ldr_p(d, rs) #define jit_ldi_p(d, is) #define jit_ldxi_p(d, rs, is) #define jit_ldxr_p(d, s1, s2) #define jit_lei_p(d, rs, is) #define jit_ler_p(d, s1, s2) #define jit_lshi_ui(d, rs, is) #define jit_lshi_ul(d, rs, is) #define jit_lshr_ui(d, s1, s2) #define jit_lshr_ul(d, s1, s2) #define jit_lti_p(d, rs, is) #define jit_ltr_p(d, s1, s2) #define jit_movi_p(d, is) #define jit_movi_ui(d, rs) #define jit_movi_ul(d, rs) #define jit_movr_p(d, rs) #define jit_movr_ui(d, rs) #define jit_movr_ul(d, rs) #define jit_nei_p(d, rs, is) #define jit_nei_ui(d, rs, is) #define jit_nei_ul(d, rs, is) #define jit_ner_p(d, s1, s2) #define jit_ner_ui(d, s1, s2) #define jit_ner_ul(d, s1, s2) #define jit_hton_ui(d, rs) #define jit_hton_us(d, rs) #define jit_ori_ui(d, rs, is) #define jit_ori_ul(d, rs, is) #define jit_orr_ui(d, s1, s2) #define jit_orr_ul(d, s1, s2) #define jit_pop_ui(rs) #define jit_pop_ul(rs) #define jit_push_ui(rs) #define jit_push_ul(rs) #define jit_pusharg_c(rs) #define jit_pusharg_p(rs) #define jit_pusharg_s(rs) #define jit_pusharg_uc(rs) #define jit_pusharg_ui(rs) #define jit_pusharg_ul(rs) #define jit_pusharg_us(rs) #define jit_retval_c(rd) #define jit_retval_p(rd) #define jit_retval_s(rd) #define jit_retval_uc(rd) #define jit_retval_ui(rd) #define jit_retval_ul(rd) #define jit_retval_us(rd) #define jit_rsbi_p(d, rs, is) #define jit_rsbi_ui(d, rs, is) #define jit_rsbi_ul(d, rs, is) #define jit_rsbr_p(d, rs, is) #define jit_rsbr_ui(d, s1, s2) #define jit_rsbr_ul(d, s1, s2) #define jit_sti_p(d, is) #define jit_sti_uc(d, is) #define jit_sti_ui(d, is) #define jit_sti_ul(d, is) #define jit_sti_us(d, is) #define jit_str_p(d, rs) #define jit_str_uc(d, rs) #define jit_str_ui(d, rs) #define jit_str_ul(d, rs) #define jit_str_us(d, rs) #define jit_stxi_p(d, rs, is) #define jit_stxi_uc(d, rs, is) #define jit_stxi_ui(d, rs, is) #define jit_stxi_ul(d, rs, is) #define jit_stxi_us(d, rs, is) #define jit_stxr_p(d, s1, s2) #define jit_stxr_uc(d, s1, s2) #define jit_stxr_ui(d, s1, s2) #define jit_stxr_ul(d, s1, s2) #define jit_stxr_us(d, s1, s2) #define jit_subi_p(d, rs, is) #define jit_subi_ui(d, rs, is) #define jit_subi_ul(d, rs, is) #define jit_subr_p(d, s1, s2) #define jit_subr_ui(d, s1, s2) #define jit_subr_ul(d, s1, s2) #define jit_subxi_p(d, rs, is) #define jit_subxi_ui(d, rs, is) #define jit_subxi_ul(d, rs, is) #define jit_subxr_p(d, s1, s2) #define jit_subxr_ui(d, s1, s2) #define jit_subxr_ul(d, s1, s2) #define jit_xori_ui(d, rs, is) #define jit_xori_ul(d, rs, is) #define jit_xorr_ui(d, s1, s2) #define jit_xorr_ul(d, s1, s2) Shortcuts—don't define them: #define JIT_R0 #define JIT_R1 #define JIT_R2 #define JIT_V0 #define JIT_V1 #define JIT_V2 #define JIT_FPR0 #define JIT_FPR1 #define JIT_FPR2 #define JIT_FPR3 #define JIT_FPR4 #define JIT_FPR5 #define jit_patch(jump_pc) #define jit_notr_c(d, rs) #define jit_notr_i(d, rs) #define jit_notr_l(d, rs) #define jit_notr_s(d, rs) #define jit_notr_uc(d, rs) #define jit_notr_ui(d, rs) #define jit_notr_ul(d, rs) #define jit_notr_us(d, rs) #define jit_rsbr_d(d, s1, s2) #define jit_rsbr_i(d, s1, s2) #define jit_rsbr_l(d, s1, s2) #define jit_subi_i(d, rs, is) #define jit_subi_l(d, rs, is) Mandatory unless target arithmetic is always done in the same precision: #define jit_abs_f(rd,rs) #define jit_addr_f(rd,s1,s2) #define jit_beqr_f(label, s1, s2) #define jit_bger_f(label, s1, s2) #define jit_bgtr_f(label, s1, s2) #define jit_bler_f(label, s1, s2) #define jit_bltgtr_f(label, s1, s2) #define jit_bltr_f(label, s1, s2) #define jit_bner_f(label, s1, s2) #define jit_bordr_f(label, s1, s2) #define jit_buneqr_f(label, s1, s2) #define jit_bunger_f(label, s1, s2) #define jit_bungtr_f(label, s1, s2) #define jit_bunler_f(label, s1, s2) #define jit_bunltr_f(label, s1, s2) #define jit_bunordr_f(label, s1, s2) #define jit_ceilr_f_i(rd, rs) #define jit_divr_f(rd,s1,s2) #define jit_eqr_f(d, s1, s2) #define jit_extr_d_f(rs, rd) #define jit_extr_f_d(rs, rd) #define jit_extr_i_f(rd, rs) #define jit_floorr_f_i(rd, rs) #define jit_ger_f(d, s1, s2) #define jit_gtr_f(d, s1, s2) #define jit_ler_f(d, s1, s2) #define jit_ltgtr_f(d, s1, s2) #define jit_ltr_f(d, s1, s2) #define jit_movr_f(rd,rs) #define jit_mulr_f(rd,s1,s2) #define jit_negr_f(rd,rs) #define jit_ner_f(d, s1, s2) #define jit_ordr_f(d, s1, s2) #define jit_roundr_f_i(rd, rs) #define jit_rsbr_f(d, s1, s2) #define jit_sqrt_f(rd,rs) #define jit_subr_f(rd,s1,s2) #define jit_truncr_f_i(rd, rs) #define jit_uneqr_f(d, s1, s2) #define jit_unger_f(d, s1, s2) #define jit_ungtr_f(d, s1, s2) #define jit_unler_f(d, s1, s2) #define jit_unltr_f(d, s1, s2) #define jit_unordr_f(d, s1, s2) Mandatory if sizeof(long) != sizeof(int)—don't define them on other systems: #define jit_addi_l(d, rs, is) #define jit_addr_l(d, s1, s2) #define jit_andi_l(d, rs, is) #define jit_andr_l(d, s1, s2) #define jit_beqi_l(label, rs, is) #define jit_beqr_l(label, s1, s2) #define jit_bgei_l(label, rs, is) #define jit_bgei_ul(label, rs, is) #define jit_bger_l(label, s1, s2) #define jit_bger_ul(label, s1, s2) #define jit_bgti_l(label, rs, is) #define jit_bgti_ul(label, rs, is) #define jit_bgtr_l(label, s1, s2) #define jit_bgtr_ul(label, s1, s2) #define jit_blei_l(label, rs, is) #define jit_blei_ul(label, rs, is) #define jit_bler_l(label, s1, s2) #define jit_bler_ul(label, s1, s2) #define jit_blti_l(label, rs, is) #define jit_blti_ul(label, rs, is) #define jit_bltr_l(label, s1, s2) #define jit_bltr_ul(label, s1, s2) #define jit_bosubi_l(label, rs, is) #define jit_bosubi_ul(label, rs, is) #define jit_bosubr_l(label, s1, s2) #define jit_bosubr_ul(label, s1, s2) #define jit_boaddi_l(label, rs, is) #define jit_boaddi_ul(label, rs, is) #define jit_boaddr_l(label, s1, s2) #define jit_boaddr_ul(label, s1, s2) #define jit_bmci_l(label, rs, is) #define jit_bmcr_l(label, s1, s2) #define jit_bmsi_l(label, rs, is) #define jit_bmsr_l(label, s1, s2) #define jit_bnei_l(label, rs, is) #define jit_bner_l(label, s1, s2) #define jit_divi_l(d, rs, is) #define jit_divi_ul(d, rs, is) #define jit_divr_l(d, s1, s2) #define jit_divr_ul(d, s1, s2) #define jit_eqi_l(d, rs, is) #define jit_eqr_l(d, s1, s2) #define jit_extr_c_l(d, rs) #define jit_extr_c_ul(d, rs) #define jit_extr_s_l(d, rs) #define jit_extr_s_ul(d, rs) #define jit_extr_i_l(d, rs) #define jit_extr_i_ul(d, rs) #define jit_gei_l(d, rs, is) #define jit_gei_ul(d, rs, is) #define jit_ger_l(d, s1, s2) #define jit_ger_ul(d, s1, s2) #define jit_gti_l(d, rs, is) #define jit_gti_ul(d, rs, is) #define jit_gtr_l(d, s1, s2) #define jit_gtr_ul(d, s1, s2) #define jit_hmuli_l(d, rs, is) #define jit_hmuli_ul(d, rs, is) #define jit_hmulr_l(d, s1, s2) #define jit_hmulr_ul(d, s1, s2) #define jit_ldi_l(d, is) #define jit_ldi_ui(d, is) #define jit_ldr_l(d, rs) #define jit_ldr_ui(d, rs) #define jit_ldxi_l(d, rs, is) #define jit_ldxi_ui(d, rs, is) #define jit_ldxi_ul(d, rs, is) #define jit_ldxr_l(d, s1, s2) #define jit_ldxr_ui(d, s1, s2) #define jit_ldxr_ul(d, s1, s2) #define jit_lei_l(d, rs, is) #define jit_lei_ul(d, rs, is) #define jit_ler_l(d, s1, s2) #define jit_ler_ul(d, s1, s2) #define jit_lshi_l(d, rs, is) #define jit_lshr_l(d, s1, s2) #define jit_lti_l(d, rs, is) #define jit_lti_ul(d, rs, is) #define jit_ltr_l(d, s1, s2) #define jit_ltr_ul(d, s1, s2) #define jit_modi_l(d, rs, is) #define jit_modi_ul(d, rs, is) #define jit_modr_l(d, s1, s2) #define jit_modr_ul(d, s1, s2) #define jit_movi_l(d, rs) #define jit_movr_l(d, rs) #define jit_muli_l(d, rs, is) #define jit_muli_ul(d, rs, is) #define jit_mulr_l(d, s1, s2) #define jit_mulr_ul(d, s1, s2) #define jit_nei_l(d, rs, is) #define jit_ner_l(d, s1, s2) #define jit_ori_l(d, rs, is) #define jit_orr_l(d, s1, s2) #define jit_pop_l(rs) #define jit_push_l(rs) #define jit_pusharg_l(rs) #define jit_retval_l(rd) #define jit_rshi_l(d, rs, is) #define jit_rshi_ul(d, rs, is) #define jit_rshr_l(d, s1, s2) #define jit_rshr_ul(d, s1, s2) #define jit_sti_l(d, is) #define jit_str_l(d, rs) #define jit_stxi_l(d, rs, is) #define jit_stxr_l(d, s1, s2) #define jit_subr_l(d, s1, s2) #define jit_xori_l(d, rs, is) #define jit_xorr_l(d, s1, s2) More complex tasks in the platform-independent layer There is actually a single function that you must define in the funcs-suffix.h file, that is, jit_flush_code. As explained in Generating code at run-time, its purpose is to flush part of the processor's instruction cache (usually the part of memory that contains the generated code), avoiding the processor executing bogus data that it happens to find in the cache. The jit_flush_code function takes the first and the last address to flush. On many processors (for example, the x86 and the all the processors in the 68k family up to the 68030), it is not even necessary to flush the cache. In this case, the contents of the file will simply be #ifndef __lightning_funcs_h #define __lightning_funcs_h #define jit_flush_code(dest, end) #endif /* __lightning_core_h */ On other processors, flushing the cache is necessary for proper behavior of the program; in this case, the file will contain a proper definition of the function. However, we must make yet another distinction. On some processors, flushing the cache is obtained through a call to the operating system or to the C run-time library. In this case, the definition of jit_flush_code will be very simple: two examples are the Alpha and the 68040. For the Alpha the code will be: #define jit_flush_code(dest, end) \ __asm__ __volatile__("call_pal 0x86"); and, for the Motorola #define jit_flush_code(start, end) \ __clear_cache((start), (end)) As you can see, the Alpha does not even need to pass the start and end address to the function. It is good practice to protect usage of the GNU CC-specific __asm__ directive by relying on the preprocessor. For example: #if !defined(__GNUC__) && !defined(__GNUG__) #error Go get GNU C, I do not know how to flush the cache #error with this compiler. #else #define jit_flush_code(dest, end) \ __asm__ __volatile__("call_pal 0x86"); #endif gnu lightning's configuration process tries to compile a dummy file that includes lightning.h, and gives a warning if there are problem with the compiler that is installed on the system. In more complex cases, you'll need to write a full-fledged function. Don't forget to make it static, otherwise you'll have problems linking programs that include lightning.h multiple times. An example, taken from the funcs-ppc.h file, is: #ifndef __lightning_funcs_h #define __lightning_funcs_h #if !defined(__GNUC__) && !defined(__GNUG__) #error Go get GNU C, I do not know how to flush the cache #error with this compiler. #else static void jit_flush_code(start, end) void *start; void *end; { register char *dest = start; for (; dest <= end; dest += SIZEOF_CHAR_P) __asm__ __volatile__ ("dcbst 0,%0; sync; icbi 0,%0; isync"::"r"(dest)); } #endif #endif /* __lightning_funcs_h */ The funcs-suffix.h file is also the right place to put helper functions that do complex tasks for the core-suffix.h file. For example, the PowerPC assembler defines jit_prolog as a function and puts it in that file (for more information, see ). Take special care when defining such a function, as explained in Reentrant usage of gnu lightning. Implementing macros for floating point The future of gnu lightning Presented below is the set of tasks that I feel need to be performed to make gnu lightning a more fully functional, viable system. They are presented in no particular order. I would very much welcome any volunteers who would like to help with the implementation of one or more of these tasks. Please write to me, Paolo Bonzini, at if you are interested in adding your efforts to the gnu lightning project. Tasks: The most important task to make gnu lightning more widely usableis to retarget it. Although currently supported architectures(x86, SPARC, PowerPC) are certainly some of the most widely used,gnu lightning could be ported to others—namely, the Alpha andMIPS architectures. Another interesting task is to allow the instruction stream to growdynamically. This is a problem because not all architectures allowto write position independent code. The x86's absolutejumps, for example, are actually slow indirect jumps, and need aregister. Optimize leaf procedures on the SPARC. This involves using theoutput registers (%oX) instead of the local registers(%lX) when writing leaf procedures; the problem is,leaf procedures also receive parameters in the output registers,so they would be overwritten by write accesses to general-purposeregisters. Acknowledgements As far as I know, the first general-purpose portable dynamic code generator is dcg, by Dawson R. Engler and T. A. Proebsting. Further work by Dawson R. Engler resulted in the vcode system; unlike dcg, vcode used no intermediate representation and directly inspired gnu lightning. Thanks go to Ian Piumarta, who kindly accepted to release his own program ccg under the GNU General Public License, thereby allowing gnu lightning to use the run-time assemblers he had wrote for ccg. ccg provides a way of dynamically assemble programs written in the underlying architecture's assembly language. So it is not portable, yet very interesting. I also thank Steve Byrne for writing GNU Smalltalk, since gnu lightning was first developed as a tool to be used in GNU Smalltalk's dynamic translator from bytecodes to native code.