                        itracer
                                        (c) 2012 deroko of ARTeam
                                        
     The itracer project was started as a custom replacement for two popular
instrumentation frameworks, PIN and DynamoRIO, because I always like to
have full control over the code I'm using, and also to have code
which I can quickly fix if something goes wrong. This code operates
on basic blocks, and doesn't care about the instructions inside of
basic blocks (although it can be changed to do that as well).

Problems which I faced during development are:

1. fast lookup of basic blocks
2. handling of sysenter
3. preserving hooks of entrypoints
4. Exception injection
5. Processing TF
6. EIP redirection instructions
7. memory allocation
8. disassembly
9. DbgPrint

1. Fast Lookup of Basic Blocks
        First I need to describe what a basic block is, and how decoding
is done. Decoding is done within page boundaries, before the code is executed.
Instructions which overlap a page boundary are always stored
as an individual basic block. This is required so that, in case of a page
protection change, we can nuke only this part of the code, and not the
whole basic block cache of the page. Here is a simple example of a basic block:

        mov     esi, eax
        mov     edi, edx
        mov     ecx, 200h
__loop: mov     eax, [esi]
        mov     [edi], eax
        add     esi, 4
        add     edi, 4
        dec     ecx
        jnz     __loop
        
All code up to the jnz is considered to be one basic block. Every EIP redirection
is considered the end of a basic block, and all EIP redirections are emulated
inside of my code. So in this case the basic block will be executed natively, and
the jnz will be emulated. For fast lookup of basic blocks, for every page I allocate
0x1000 * sizeof(ULONG_PTR) bytes, thus I can easily look up the basic block for the
current instruction. If the basic block is NULL, I decode it and execute it. Next time
this basic block will be looked up fast, and executed without the need to decode
it again.
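
The per-page table described above can be sketched roughly as follows. This is
only a minimal illustration, not the code from itracer: the names basic_block,
page_cache, bbl_lookup, bbl_store and crosses_page are hypothetical, and the
layout of a decoded block is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 0x1000u

/* Hypothetical decoded-block record; the real itracer structure is not shown
 * in this document. */
typedef struct basic_block {
    uintptr_t start;   /* address of the first instruction of the block */
    size_t    length;  /* number of bytes covered by the block */
} basic_block;

/* One pointer slot per byte offset in the page: 0x1000 * sizeof(pointer). */
typedef struct page_cache {
    uintptr_t    page_base;
    basic_block *slot[PAGE_SIZE];
} page_cache;

/* Allocate an empty cache for the page that contains addr. */
page_cache *page_cache_new(uintptr_t addr)
{
    page_cache *pc = calloc(1, sizeof *pc);   /* every slot starts as NULL */
    if (pc != NULL)
        pc->page_base = addr & ~(uintptr_t)(PAGE_SIZE - 1);
    return pc;
}

/* O(1) lookup; NULL means "not decoded yet, decode it now". */
basic_block *bbl_lookup(const page_cache *pc, uintptr_t eip)
{
    assert((eip & ~(uintptr_t)(PAGE_SIZE - 1)) == pc->page_base);
    return pc->slot[eip & (PAGE_SIZE - 1)];
}

/* Remember a freshly decoded block under its start address. */
void bbl_store(page_cache *pc, basic_block *bb)
{
    assert((bb->start & ~(uintptr_t)(PAGE_SIZE - 1)) == pc->page_base);
    pc->slot[bb->start & (PAGE_SIZE - 1)] = bb;
}

/* Nonzero when an instruction of len bytes at eip spills into the next page,
 * which is the case that gets its own basic block. */
int crosses_page(uintptr_t eip, size_t len)
{
    return (eip & (PAGE_SIZE - 1)) + len > PAGE_SIZE;
}
```

The table costs one pointer per byte of code, which is a lot of memory, but the
lookup is a single index operation with no hashing and no collisions.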

To deal with self-modifying code I hook NtProtectVirtualMemory, and
every time a page is made writable I mark it as WRITE, and before executing a
basic block I check whether the data in the basic block differs from the data in
the real image. If there is a change I flush the whole page cache and start
decoding again. There is also a flag called WAS_WRITE, which is for the special
case when a page was set to WRITE and changed back to READ, but no code was
executed while it was WRITE. If this happens, I free everything for this page,
clear the WAS_WRITE flag and decode it again. The best example is the
WriteProcessMemory case:
        
        call someptr            <---- build BBL
        WriteProcessMemory(GetCurrentProcess(), <someptr>, <mydata>, 5, 0);
                -> NtProtectVirtualMemory(someptr);  <-- PAGE_EXECUTE_READWRITE
                -> NtWriteVirtualMemory(someptr);    <-- make change
                -> NtProtectVirtualMemory(someptr);  <-- PAGE_EXECUTE_READ
        call someptr            <---- free and rebuild BBL due to WAS_WRITE

In this case the hook of NtProtectVirtualMemory will mark the page of <someptr> as
WRITE, but the next NtProtectVirtualMemory will set the page back to READ/EXEC, which
on bbl execution would suggest that the page has not changed. That would be wrong, so
in this case I mark the page as WAS_WRITE. When we hit <someptr>, the cached code is
flushed due to WAS_WRITE, the bbl is rebuilt, and WAS_WRITE is removed from the
internal memory cache for this page.
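
The WRITE / WAS_WRITE bookkeeping can be sketched like this. It is a simplified
model under stated assumptions: page_state, on_protect_change and must_flush are
hypothetical names, and the sketch sets WAS_WRITE on every WRITE -> READ
transition instead of tracking whether code ran while the page was writable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PG_WRITE     0x1u   /* page is currently writable */
#define PG_WAS_WRITE 0x2u   /* page was writable and has been locked again */

/* Hypothetical per-page bookkeeping kept by the tracer. */
typedef struct page_state {
    unsigned flags;
} page_state;

/* Called from the (assumed) NtProtectVirtualMemory hook for each page. */
void on_protect_change(page_state *ps, int now_writable)
{
    assert(ps != NULL);
    if (now_writable) {
        ps->flags |= PG_WRITE;
    } else if (ps->flags & PG_WRITE) {
        /* writable -> read-only: the page may have been modified meanwhile */
        ps->flags &= ~PG_WRITE;
        ps->flags |= PG_WAS_WRITE;
    }
}

/* Called before executing a cached block. Returns 1 when the cache for the
 * page must be dropped and the code decoded again. */
int must_flush(page_state *ps, const void *cached, const void *live, size_t len)
{
    assert(ps != NULL);
    if (ps->flags & PG_WAS_WRITE) {
        ps->flags &= ~PG_WAS_WRITE;          /* one-shot: rebuild once */
        return 1;
    }
    if (ps->flags & PG_WRITE)                /* writable: compare with image */
        return memcmp(cached, live, len) != 0;
    return 0;                                /* never writable: trust cache */
}
```

In the WriteProcessMemory case above, the second "call someptr" sees
PG_WAS_WRITE and rebuilds the block even though the page is read-only again.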

2. sysenter
        sysenter is a very tough instruction to handle, as it always returns
to KiFastSystemCallRet in ntdll.dll. We can cheat and replace it with int 2e, but
that's cheating and not really nice. I want a generic solution, so that when I port
this code to linux (one day?!?!?!) I have proper handling of sysenter.
Every time sysenter is executed, I build code which replaces the return address on
the stack with my code, and keep all the information about where this sysenter should
return in an internal struct, which is freed once the ret is executed. You can see
more in trace.c how I handle it.

        mov [edx], ret_after_sysenter
        sysenter
ret_after_sysenter:
        jmp     vm
        
Now the vm, based on stack index/stack base, will locate the proper return address and
execute the ret properly. This is done to handle all kinds of weird code obfuscation.
E.g. somebody could hook KiFastSystemCallRet with a jmp to different code, so assuming
that there is a ret is simply false!!!! This is one way to detect PIN and DynamoRIO, as
PIN emulates the syscall via int 2e, and DynamoRIO, well I don't know what they do...

Update of sysenter handling:
        As I have a shadow ntdll.dll, and KiFastSystemCallRet hooked with a jmp, I have
to distinguish between what is executed from my code and what is not. To make this
happen, I replace all sysenter calls in the real ntdll.dll with int 2e, thus making
my code use int 2e for syscalls, and I avoid a conflict between the hooked
KiFastSystemCallRet reached from instrumented code and sysenter coming from my code.
Now all should be good, and I removed most of the stuff that I no longer need (e.g.
keeping the stack offset to know where KiFastSystemCallRet should return, etc...),
and I've also removed all the lists which weren't actually needed once sysenter was
handled properly.
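
The byte-level part of that replacement can be illustrated with a small sketch.
sysenter encodes as 0F 34 and int 2Eh as CD 2E, so both are two bytes and can be
swapped in place. The function name patch_sysenter is hypothetical, and the raw
byte scan is an assumption made for brevity: a real patcher should walk decoded
instructions (e.g. with the disassembler) so that a 0F 34 pair inside an operand
is not touched. Any register setup that the int 2e path expects is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Replace every 0F 34 (sysenter) byte pair in buf with CD 2E (int 2Eh).
 * Returns the number of replacements made. The buffer is assumed to be a
 * writable copy of the code being patched. */
size_t patch_sysenter(uint8_t *buf, size_t len)
{
    size_t patched = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i + 1 < len; i++) {
        if (buf[i] == 0x0F && buf[i + 1] == 0x34) {
            buf[i]     = 0xCD;   /* int      */
            buf[i + 1] = 0x2E;   /* ... 2Eh  */
            patched++;
            i++;                 /* skip the second byte of the patched pair */
        }
    }
    return patched;
}
```

Because the two encodings have the same length, no code has to be moved and no
relative offsets inside ntdll.dll change.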

3. preserving hooks of entrypoints
        Preserving hooks is very important in order not to lose control over the code.
There are a few entrypoints which might be hit, and my way of preserving hooks inside
of ntdll.dll is to rebase ntdll.dll to a different base, thus the process can't access
the original ntdll.dll. For the process, the old ntdll.dll doesn't exist, and all
entries and code are executed in the remapped ntdll.dll, which I always force to base
0x50000000. In this way I don't have to care what really happens in this ntdll.dll, as
all code blocks are properly generated, and my entry point hooks always survive. Simple,
and dirty trick. On Vista/win7/win8 ntdll.dll is present in \KnownDlls, thus
NtOpenSection -> NtMapViewOfSection is sufficient. On XP, where ntdll is not present in
\KnownDlls, I use NtCreateFile -> NtCreateSection -> NtMapViewOfSection to map ntdll.dll
into the remote process. I use this trick to make ntdll.dll SEC_IMAGE; otherwise I could
use NtAllocateVirtualMemory -> NtWriteVirtualMemory to allocate ntdll in the remote
process. It's also important to know that SEH will sometimes check whether it is executed
from SEC_IMAGE, thus in this way SEH inside of ntdll works properly.

4. Exception injection
        Every time some code is about to be executed I check the protection of its page.
If the page protection is not known, or ProbeForRead/ProbeForWrite raises an exception,
I inject an exception via traceInjectException, where I emulate exception delivery by
setting up the stack and calling KiUserExceptionDispatcher.
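
A rough model of that stack setup is sketched below. It is not the code of
traceInjectException: guest_stack, mini_record, mini_context and
inject_exception are hypothetical, and the two mini structures are tiny stand-ins
for the real EXCEPTION_RECORD and CONTEXT. The only layout taken from this
document is that the dispatcher finds the record pointer at [esp] and the
context pointer at [esp+4].

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simulated 32-bit guest stack (an assumption, so the sketch runs anywhere). */
typedef struct guest_stack {
    uint8_t  mem[1024];
    uint32_t base;   /* guest address of mem[0] */
    uint32_t esp;    /* current guest stack pointer */
} guest_stack;

/* Simplified stand-ins for EXCEPTION_RECORD and CONTEXT. */
typedef struct mini_record  { uint32_t code; uint32_t address; } mini_record;
typedef struct mini_context { uint32_t eip; uint32_t esp; uint32_t eflags; } mini_context;

static void *gs_ptr(guest_stack *gs, uint32_t addr)
{
    return gs->mem + (addr - gs->base);
}

/* Push len raw bytes, growing the stack downwards. */
void gs_push_bytes(guest_stack *gs, const void *src, uint32_t len)
{
    assert(gs->esp - gs->base >= len);   /* refuse to overflow the stack */
    gs->esp -= len;
    memcpy(gs_ptr(gs, gs->esp), src, len);
}

/* Read a 32-bit value back from the simulated stack (host byte order). */
uint32_t gs_read32(guest_stack *gs, uint32_t addr)
{
    uint32_t v;
    memcpy(&v, gs_ptr(gs, addr), sizeof v);
    return v;
}

/* Build the frame the dispatcher expects and return the new EIP. */
uint32_t inject_exception(guest_stack *gs, const mini_record *rec,
                          const mini_context *ctx, uint32_t dispatcher)
{
    uint32_t ctx_addr, rec_addr;

    gs_push_bytes(gs, ctx, sizeof *ctx);            /* CONTEXT copy */
    ctx_addr = gs->esp;
    gs_push_bytes(gs, rec, sizeof *rec);            /* EXCEPTION_RECORD copy */
    rec_addr = gs->esp;
    gs_push_bytes(gs, &ctx_addr, sizeof ctx_addr);  /* becomes [esp+4] */
    gs_push_bytes(gs, &rec_addr, sizeof rec_addr);  /* becomes [esp]   */
    return dispatcher;              /* execution continues at the dispatcher */
}
```

After the call the tracer would continue decoding basic blocks at the returned
address, exactly as if the kernel had delivered the exception.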


5. Processing TF
        TF processing was very important for me. So how does it happen? There are
several ways in which TF can be set:
        - popfd
        - iretd
        - NtContinue
        - NtSetContextThread

        TF is never actually set while the code runs; instead, after execution of
every instruction I use traceInjectException to inject STATUS_SINGLE_STEP. This all
sounds great, but there is one specific case where TF must be handled carefully.
That's the case of an exception while TF is set. Let's have a look at one simple
example:

        xor     eax, eax
        pushfd
        or      dword ptr[esp], 100h
        popfd
        xor     [eax], eax      

        Probably everybody can spot the problem here. TF would fire after
"xor    [eax], eax", but as the instruction itself generates an exception, the next
instruction executed will be inside of KiUserExceptionDispatcher, thus TF is injected
after execution of the first instruction in KiUserExceptionDispatcher. What is also very
important is that the CONTEXT on the stack will have TF set in EFlags. Let's follow the
previous example with ia32 logic:

        xor     eax, eax
        pushfd
        or      dword ptr[esp], 100h
        popfd
        xor     [eax], eax              <--- TF set, exception happens
        
        - stack on entering int 0eh will have eflags | 100h
        - code will be redirected to KiUserExceptionDispatcher and CONTEXT.Eflags is
          picked from stack which has TF
        - EIP is changed to point to KiUserExceptionDispatcher
        - iretd <--- remember eflags still have TF set
        
        KiUserExceptionDispatcher:
                cld             <---- TF exception generated after cld
                mov     ecx, [esp+4]
                mov     ebx, [esp]
        
        So in this particular case we need to inject TF after cld, and also to set
CONTEXT.EFlags |= 0x100. Also, while TF is active, every code block is split into
single-instruction code blocks, thus we can inject TF after execution of every single
instruction.
        I don't handle NtSetContextThread; this is to be added in the future.
        

6. EIP redirection instructions
        EIP redirection instructions are emulated. There is code support for
emulation of jcc, but for speed reasons jccs are inlined for now. However,
if a certain part of the code is removed, jccs will be emulated; the support
for this is already in the code. Emulation of all EIP redirections was intended
to make it easy to find redirections of EIP while unpacking, and it also gave me
nice support for hunting down some weird checksums.
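
As an illustration of what emulating a jcc means, here is a minimal sketch for
the short form of jnz (opcode 75 followed by a signed 8-bit displacement). The
function name emulate_jnz_rel8 is hypothetical and this is not the emulation
code from itracer; only the encoding and the ZF bit (0x40) are standard ia32
facts.

```c
#include <assert.h>
#include <stdint.h>

#define EFLAGS_ZF 0x40u

/* Computes the next EIP for "jnz rel8" located at eip.
 * insn points at the two instruction bytes: 75 <rel8>. */
uint32_t emulate_jnz_rel8(uint32_t eip, const uint8_t insn[2], uint32_t eflags)
{
    uint32_t next;

    assert(insn[0] == 0x75);           /* only the short jnz/jne form */
    next = eip + 2;                    /* fall-through address */
    if ((eflags & EFLAGS_ZF) == 0)     /* taken when ZF is clear */
        next += (uint32_t)(int32_t)(int8_t)insn[1];   /* sign-extended rel8 */
    return next;
}
```

For the copy loop from section 1, the jnz sits 11 bytes after __loop and is
encoded as 75 F3, so with ZF clear the function returns the address of __loop,
and with ZF set it returns the address right after the jnz.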

7. memory allocation
        For all memory allocation I use Doug Lea's malloc implementation, with
some small modifications to use native ntdll APIs, and to always allocate
PAGE_EXECUTE_READWRITE memory. The rest of the library is not changed.

8. disassembly
        For disassembly I use XED, which comes as part of the PIN tool, and which
is the best disassembly library I have found.

9. DbgPrint
        For DbgPrint I would usually use OutputDebugStringA/W, but I don't
want to import any windows dll except the existing ntdll.dll, thus I rewrote
OutputDebugStringA/W to use only native ntdll.dll APIs, so that output
to DebugView can be done at an early stage of instrumentation.

At the end, this code was written only for fun, and nothing else. 

                                                (c) 2012 deroko of ARTeam
                                                
                                