Flashing your C String

On a project I worked on a few months ago, I was tasked with taking data from a binary source file, reading it into Flash as a ByteArray, and parsing the bytes of that ByteArray as a UTF-8 String. Straightforward enough, right?

What I know of the file to parse is that it is a very long list of words, compressed and encoded, and that it needs to be decoded at runtime inside the SWF. Fortunately, I have the algorithm for decoding the string on hand, so I know I can implement this with relative ease. All I have to do is load the .dat file containing the list, run the resulting byte array through the parser, and get my list of words.

var ba:ByteArray = myImportedByteArray;
ba.position = 0;
var encodedString:String = ba.readUTFBytes(ba.length);
trace(encodedString);
/* My expected result was something akin to:
"Header Metadata - (Gibberish representing encoded list goes here)"
*/

Funnily enough, when I do that, what do I get instead?

"Header Metadata"

Huh.

I spent a good long time puzzling over where the encoded word list was in all of this. I could open the .dat file in Eclipse, and plainly see the list right there. I wondered if perhaps there were some sort of problem reading the data. Maybe the ByteArray was incomplete? No, because when I traced ba.length it indicated that the array was well over 90,000 bytes in size. Definitely too many to simply be “Header Metadata”. Next, I did this:

// 200 chosen simply for the sake of an easier to read log...
ba.position = 0;
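var byte:uint;
while (ba.position < 200 && ba.bytesAvailable > 0) {
	// readUnsignedByte() yields 0-255; readByte() is signed and would show bytes above 127 as negative
	byte = ba.readUnsignedByte();
	trace(byte, String.fromCharCode(byte));
}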

What I received in the output window was in fact the first 200 characters of the file, just as I’d expected them to be!

72 H
101 e
97 a
100 d
101 e
114 r
32  
77 M
101 e
116 t
97 a
100 d
97 a
116 t
97 a
0
32  
63 ?
32  
0
102 f
111 o
111 o
98 b
97 a
46 .
46 .
46 .

This gets more and more curious by the moment. At some point, though, the lightbulb goes on. Look at this right here:

100 d
97 a
116 t
97 a
0        // <-- see that 0?
32  
63 ?
32  
0        // <-- and that 0?

Anatomy of a String

So, in Flash-land we very seldom have to know about what might be going on under the hood of the String class. I mean, we just declare Strings, read them, print them, concatenate them, and in general just don’t really care much about how they work. It is not like this in all languages, however. Hop aboard the black stallion, for the Horseman is about to take you to a magical world known as “C Strings.”

The C language generally operates “close to the metal” of a machine. You can think of it as an abstraction over the hardware: the code you write is compiled more or less directly into the machine instructions that the processor executes. In that sense it is a “high-level” language (at least in comparison to writing Assembly by hand), but it is lower level than what an ActionScript or Java programmer has to deal with, as our respective virtual machines abstract away the nitty-gritty of memory allocation and deallocation and the direct pointer manipulation inherent in C. So what does a String look like in this lower-level world? It is ultimately nothing more than an array of char values.

char someString[10]; // an array of char values with a length of 10

It needs to be pointed out that there is a very important distinction between C Arrays and ActionScript Arrays. C Arrays are fixed-length. If you declare an Array to have a length of 10, then a contiguous set of memory addresses, of a size capable of storing 10 units of the data (in this case, char values) is allocated for that Array. That’s all. Nothing else happens. An Array in C is not an Object, so it doesn’t have a notion of a “length” property, nor does it have .push() or .pop() functions. Unless you know the length already, you can iterate your way past the end of the Array and keep on going into some other variable’s data… possibly even into undefined space.
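
To make the "no length property" point concrete, here is a minimal sketch in C. The names sum_chars and ARRAY_LENGTH are made up for illustration. Once an array is handed to a function, all the function sees is a pointer, so the length has to travel alongside it as a separate argument.

```c
#include <assert.h>
#include <stddef.h>

/* Element count of a real array. Only works where the array declaration
   itself is visible; applied to a plain pointer it gives a wrong answer. */
#define ARRAY_LENGTH(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical helper: adds up `count` chars starting at `values`.
   The function receives only a pointer, so it cannot discover the
   length on its own; the caller must supply it. */
int sum_chars(const char *values, size_t count)
{
    assert(values != NULL);
    int total = 0;
    for (size_t i = 0; i < count; i++) {
        total += values[i];
    }
    return total;
}
```

A caller that declared char someString[10] would pass ARRAY_LENGTH(someString) as the count. Pass a number larger than 10 and the loop happily reads whatever lies beyond the array.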

This is considered “a very bad thing to do”. (As an existential aside about walking off the end of an array, please read this blog post by Steve Yegge)

But wait, if you can’t know the length of an Array in C then how can you possibly know when you’ve reached the “end” of a string?

Good question! Here’s what the string “Hello World!” looks like in C:

const char *foo = "Hello World!"; // string literals are read-only, hence const
// looks to the machine like this in ASCII....
// {72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33, 0}

Each of those numbers is the ASCII code for a character (Unicode uses the same values for these characters), so for example 72 maps to “H” while 33 maps to “!”. If you want to see this in the Flash context, you can set up a KeyboardEvent listener and trace the event. You’ll find that the event’s charCode property matches the values above.

But do you notice anything funny here? “Hello World!” has 12 chars, but that array above has 13 entries! The very last one of course being 0. Now why is that?
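
If you would like to check the 12-versus-13 count yourself, a small sketch along these lines should do it. The function name dump_c_string is invented for this example; it prints every byte of a string in the same "code char" format as the Flash trace output, including the trailing 0.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Prints each byte of a C string as "code char", one per line, and
   finishes with the terminating 0. Returns the number of bytes printed,
   which is the character count plus one for the terminator. */
size_t dump_c_string(const char *s)
{
    assert(s != NULL);
    size_t total = strlen(s) + 1;   /* strlen() does not count the 0 */
    for (size_t i = 0; i < total; i++) {
        unsigned char b = (unsigned char)s[i];
        if (b != 0) {
            printf("%u %c\n", (unsigned)b, b);
        } else {
            printf("0\n");
        }
    }
    return total;
}
```

For an array declared as char foo[] = "Hello World!", sizeof(foo) reports the storage actually reserved, while strlen(foo) reports only the characters in front of the terminator.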

In C, when you want to know if you’ve reached the end of a string, you look for what is called a null terminator: a char with the value 0. So when a string is declared, it occupies one more char than the characters entered into it, the extra one being the terminating 0. Any code that wants to parse the string knows that it should stop when it hits a 0, because to keep going after that could lead anywhere! And we certainly wouldn’t want to tack some instruction set for connecting to the printer onto your string… or worse, overwrite it with some random garbage!
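
The "stop when you hit a 0" rule fits in a few lines. The sketch below is a hand-rolled equivalent of the standard strlen function, under the invented name count_until_terminator, written out so the scan is visible.

```c
#include <assert.h>
#include <stddef.h>

/* Walks the chars one at a time and stops at the first 0.
   Anything stored after that 0 is never looked at. */
size_t count_until_terminator(const char *s)
{
    assert(s != NULL);
    size_t length = 0;
    while (s[length] != '\0') {
        length++;
    }
    return length;
}
```

Note that the function has no other way of knowing where the data ends. If the 0 is missing it keeps walking, and if a 0 shows up early it stops early.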

And so this brings us back to our little ByteArray in Flash and the encoded word list. As you can see, the header metadata is immediately followed by a 0. In C terms, the portion of the file we traced is in fact 3 strings instead of 1. Knowing this, and observing that readUTFBytes hands back a mere 15 characters from a ByteArray of 90,000+ bytes, we can deduce that, at least on this code path, Flash treats the bytes the way C treats a string: everything after the first 0 is ignored.
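
The "3 strings instead of 1" reading can be sketched in C as well. This assumes the true size of the data is known from outside, the way ba.length is known in Flash. The function name count_segments and the sample bytes in the usage note are illustrative only, modeled on the trace output above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts the zero-delimited segments in `data`. The real size of the
   buffer is passed in, so the scan can continue past each 0 byte.
   A final run of bytes with no terminator still counts as a segment. */
size_t count_segments(const unsigned char *data, size_t size)
{
    assert(data != NULL || size == 0);
    size_t count = 0;
    size_t start = 0;
    while (start < size) {
        const unsigned char *zero = memchr(data + start, 0, size - start);
        count++;
        if (zero == NULL) {
            break;                          /* unterminated tail */
        }
        start = (size_t)(zero - data) + 1;  /* step over the 0 byte */
    }
    return count;
}
```

Given the 25 bytes "Header Metadata", 0, " ? ", 0, "fooba", this returns 3, while strlen on the same buffer stops at 15.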

My algorithm, then, had to account for this fact. Once I knew that Flash would treat a 0 byte as a null terminator when calling readUTFBytes, I was able to successfully reach the word list.
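
This is not the code from the original project. As a rough illustration of the general idea, here is a C sketch that walks a buffer of known size and hands back one zero-delimited piece at a time. The name find_segment and its parameters are assumptions made for this example. The same byte-by-byte approach should carry over to a ByteArray in ActionScript.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: locates zero-delimited segment number `index`
   (0-based) inside `data`. On success, returns a pointer to the first
   byte of the segment and stores its length (terminator excluded) in
   *out_len. Returns NULL if the buffer has no such segment. */
const unsigned char *find_segment(const unsigned char *data, size_t size,
                                  size_t index, size_t *out_len)
{
    assert(data != NULL && out_len != NULL);
    size_t start = 0;
    while (start < size) {
        const unsigned char *zero = memchr(data + start, 0, size - start);
        size_t end = zero ? (size_t)(zero - data) : size;
        if (index == 0) {
            *out_len = end - start;
            return data + start;
        }
        index--;
        start = end + 1;   /* continue after the 0 byte */
    }
    return NULL;
}
```

With the sample bytes from the trace output, asking for segment 0 yields the 15-byte header, and segment 2 yields whatever follows the second 0.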


  1. #1 written by Ryan December 29th, 2011 at 00:12

    Randomly came upon this while googling about null terminators in Flex. Very interesting and well-written read! Thanks!

  2. #2 written by The Horseman December 29th, 2011 at 23:40

    No problem Ryan, I hope it was helpful in some form for you. It’s not exactly a common topic at the Flash Player level!

  3. #3 written by Mikey December 5th, 2012 at 02:55

    Hiya,

    Do you still have the source you came up with for getting around this, perchance…? I’d be very grateful!

    Thanks,

    Mikey

  4. #4 written by The Horseman December 5th, 2012 at 15:39

    The code from the original project is not available for public consumption, unfortunately.

    That said, if you know when / where the 0 bytes appear and there’s a pattern to it you can account for them in your solution. You’ll have to examine the byte array and iterate through (at least part of) it one byte at a time.