Since scripting Ghidra is such a hot topic right now, I want to tell you that scripting it in Java is so much better than scripting it in Python. After that I'll randomly motivate why one wants to get the "original bytes" from a sample and how to do it.
## Why Java?
Python is a quasi-standard in the IT-security industry, probably because it is easy to read, ships with a lot of features, and has an ecosystem mature enough that there are packages for everything else. The obvious question is: why should I script Ghidra with Java and not Python? For me, the answer breaks down into two categories: Python versions and Developer Experience (DX).
## Versions
The integrated Python in Ghidra is not a "real" Python but Jython, a re-implementation of Python designed to run within Java. In theory, they are compatible, but in practice there are tiny nerve-wracking differences. Also, installing packages is not that easy. In addition to that, the Jython version integrated in Ghidra is quite dated (2.7). So I played around with `execnet` and different packages that use Jython within Ghidra to start a server capable of exchanging data with other Python processes on the same machine. Needless to say, this didn't go well.
## Developer Experience
DX in Java is much better than in Python. The ecosystem is _very_ mature: external dependencies are normally stable and IDEs (especially Eclipse) are feature-rich. Both of these may be caused by differences in the language (types in Java are much more powerful, for example). Also, Ghidra's integration with Eclipse is very good and supports code completion for all the Ghidra API functions, which saves a lot of time if you don't know the API by heart. You definitely need to jump through some extra hoops because Java was not meant to fiddle around with bits, but I think the benefits outweigh the disadvantages.
## Why do we Patch Binaries?
Many reverse engineering tools like IDA or Ghidra have a feature to replace data in the original file. Sometimes you do that to help the decompiler figure something out, sometimes you just replace obfuscated data with the decoded counterpart to make the disassembly or decompiled code easier to read.
Say we are facing a binary where, instead of referencing a string directly, the code contains a call to a deobfuscation function. One argument of this call is the obfuscated string:
```c
char *local_78 = sub_123456("7A66666261283D3D707E73753C7C677E7E66777B7E77607460777B3C76773D", 0x12);
```
If you guessed that `sub_123456` accepts a hex-encoded string, decodes it, XORs the result with `0x12` and returns a pointer to the result, you are right. Hence this would roughly translate to the following pseudo-code:
```c
char *local_78 = "https://blag.nullteilerfrei.de/";
```
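The behaviour described above can be sketched in a few lines of Java. This is only a stand-in, not the sample's actual code: the class name `XorHexDecoder` and the method `decode` are made up for illustration, and the result is assumed to be plain ASCII.

```java
// Hypothetical stand-in for sub_123456: hex-decode the argument,
// then XOR every byte with the key.
final class XorHexDecoder {

    static String decode(String hex, int key) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have an even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            // Character.digit returns -1 for anything that is not a hex digit.
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit near offset " + 2 * i);
            }
            out[i] = (byte) (((hi << 4) | lo) ^ key);
        }
        // Assumption: the decoded bytes are ASCII text.
        return new String(out, java.nio.charset.StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String encoded = "7A66666261283D3D707E73753C7C677E7E66777B7E77607460777B3C76773D";
        System.out.println(decode(encoded, 0x12)); // prints https://blag.nullteilerfrei.de/
    }
}
```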
A quick side note: Malware reverse engineers are lazy mortal beings and need to carefully divide their time between different activities. So, instead of replacing the function call with code that allocates a string and stores it in the variable `local_78`, they often decide to just replace the hex-string with the decoded string. This results in decompiled code like the following:
```c
char *local_78 = sub_123456("https://blag.nullteilerfrei.de/", 0x12);
```
This is not ideal, since the code is now _wrong_. But if you imagine that the function `sub_123456` just returns the string passed to it as the first argument, the code becomes right again. And who cares about facts when you can have imagination.
## Get Original Bytes
Anyway, let's imagine we wrote a script that searches for all calls to `sub_123456`, reads the first argument from the binary and replaces it with the decoded string. If we ran this script twice, all hell would break loose: the second run does not read the encoded string `7A66666261283D3D707E73753C7C677E7E66777B7E77607460777B3C76773D` but the already decoded string `https://blag.nullteilerfrei.de/` from the last run.
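To make the failure mode concrete, here is a toy model in plain Java. It is not the Ghidra API: `ToyImage` and its methods are hypothetical and only mimic a tool that keeps the bytes as loaded next to a patched view. The example uses a shortened version of the encoded string.

```java
// Toy model, not the Ghidra API: an image that keeps the bytes as loaded
// next to the view that gets patched.
final class ToyImage {
    private final byte[] original; // bytes as loaded from the sample, never modified
    private final byte[] current;  // what the tool shows after patching

    ToyImage(byte[] fileBytes) {
        this.original = fileBytes.clone();
        this.current = fileBytes.clone();
    }

    // Overwrite bytes in the patched view only.
    void patch(int offset, byte[] data) {
        System.arraycopy(data, 0, current, offset, data.length);
    }

    byte[] readCurrent(int offset, int size) {
        return java.util.Arrays.copyOfRange(current, offset, offset + size);
    }

    byte[] readOriginal(int offset, int size) {
        return java.util.Arrays.copyOfRange(original, offset, offset + size);
    }

    public static void main(String[] args) {
        java.nio.charset.Charset ascii = java.nio.charset.StandardCharsets.US_ASCII;
        ToyImage image = new ToyImage("7A666662".getBytes(ascii)); // hex for "http" XOR 0x12
        image.patch(0, "http".getBytes(ascii));                    // first script run
        // A naive second run would read the patched view ...
        System.out.println(new String(image.readCurrent(0, 8), ascii));
        // ... while the pristine copy still holds the encoded string.
        System.out.println(new String(image.readOriginal(0, 8), ascii));
    }
}
```

After the first run, the naive read returns `http6662`, which is neither valid hex nor the decoded string, while the pristine copy still holds `7A666662`.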
For precisely such situations, IDA has a convenient function `getOriginalBytes` to get the bytes from the sample before any patching operation ((or was it called `GetOriginalBytes` or `get_original_bytes` or `bytesGet_original`?)). It seems as if the plain Ghidra API does not have such a function ((please let us know if we missed it!)). So we decided to implement it ourselves. Have fun with it:
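```java
// Imports for the top of the script file:
import java.io.IOException;
import ghidra.program.database.mem.FileBytes;
import ghidra.program.model.address.Address;
import ghidra.program.model.mem.MemoryBlock;
import ghidra.program.model.mem.MemoryBlockSourceInfo;

private byte[] getOriginalBytes(Address addr, int size) {
    MemoryBlock stringMemoryBlock = currentProgram.getMemory().getBlock(addr);
    if (stringMemoryBlock == null)
        return null;
    // Assumes the program was imported from a single file and that the block
    // is backed by a single source, hence the two get(0) calls.
    FileBytes fileBytes = currentProgram.getMemory().getAllFileBytes().get(0);
    MemoryBlockSourceInfo memoryInformation = stringMemoryBlock.getSourceInfos().get(0);
    // Offset of addr within the block, shifted to where the block starts in the file.
    long fileOffset = addr.getOffset() - memoryInformation.getMinAddress().getOffset()
            + memoryInformation.getFileBytesOffset();
    try {
        byte[] result = new byte[size];
        // Reads the bytes as they were at import time, ignoring later patches.
        fileBytes.getOriginalBytes(fileOffset, result);
        return result;
    } catch (IOException e) {
        return null;
    }
}
```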
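The offset computation in the function above is easy to get wrong by a sign, so here is the same arithmetic as a standalone sketch. The addresses and the helper `toFileOffset` are made up for illustration; in a real program the values come from the `MemoryBlockSourceInfo`.

```java
// Worked example of the address-to-file-offset arithmetic; all numbers are made up.
final class FileOffsetExample {

    // Same arithmetic as in getOriginalBytes: distance into the block
    // plus the position where the block's bytes start in the file.
    static long toFileOffset(long address, long blockMinAddress, long blockFileOffset) {
        return address - blockMinAddress + blockFileOffset;
    }

    public static void main(String[] args) {
        long blockMinAddress = 0x402000L; // hypothetical: where the block starts in memory
        long blockFileOffset = 0x1400L;   // hypothetical: where its bytes start in the file
        long address = 0x402010L;         // hypothetical: address of the encoded string
        System.out.printf("0x%x%n", toFileOffset(address, blockMinAddress, blockFileOffset));
    }
}
```

With these numbers, the string sits 0x10 bytes into the block, so it is read from file offset 0x1410.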