Background
I am writing an esoteric language called Jolf. It is used on the lovely site codegolf SE. If you don’t already know, a lot of challenges are scored in bytes. People have made lots of languages that utilize either their own encoding or a pre-existing encoding.
On the interpreter for my language, I have a byte counter. As you might expect, it counts the number of bytes in the code. Until now, I’ve been using a UTF-8 en/decoder (utf8.js). I have now switched to the ISO 8859-7 encoding, which has Greek characters, so the UTF-8 count no longer matches. The text upload doesn’t work properly either: I need to count the actual bytes contained within an uploaded file. Also, is there a way to read the contents of said encoded file?
Question
Given a file encoded in ISO 8859-7 obtained from an <input> element on the page, is there any way to obtain the number of bytes contained in that file? And, given “plaintext” (i.e. text put directly into a <textarea>), how might I count the bytes in that text as if it were encoded in ISO 8859-7?
What I’ve tried
The input element is called isogreek. The file resides in that <input> element. The content is ΦX族: a Greek character and a Latin character (each of which should be one byte), plus a Chinese character, which should be more than one byte (?).
    isogreek.files[0].size; // is 3; should be more.

    var reader = new FileReader();
    reader.readAsBinaryString(isogreek.files[0]); // corrupts the string to `ÖX?`
    reader.readAsText(isogreek.files[0]); // �X?
    reader.readAsText(isogreek.files[0], "ISO 8859-7"); // �X?
Answer
Extended from this comment.
As @pvg mentioned in the comments, the string resulting from readAsBinaryString would be correct, but is corrupted for two reasons:
A. The result is decoded as ISO-8859-1: readAsBinaryString turns each byte into the character with the same code point, so the Greek bytes come back as the wrong Latin-1 characters. You can use a function to fix this:
    function convertFrom1to7(text) {
      // charset is the set of chars in the ISO-8859-7 encoding from 0xA0 and up, encoded with this format:
      // - If the character is in the same position as in ISO-8859-1/Unicode, use a "!".
      // - If the character is a Greek char with 720 subtracted from its char code, use a ".".
      // - Otherwise, use \uXXXX format.
      var charset = "!\u2018\u2019!\u20AC\u20AF!!!!.!!!!\u2015!!!!...!...!.!....................!............................................!";
      var newtext = "", newchar = "";
      for (var i = 0; i < text.length; i++) {
        var char = text[i];
        newchar = char;
        if (char.charCodeAt(0) >= 160) {
          newchar = charset[char.charCodeAt(0) - 160];
          if (newchar === "!") newchar = char;
          if (newchar === ".") newchar = String.fromCharCode(char.charCodeAt(0) + 720);
        }
        newtext += newchar;
      }
      return newtext;
    }
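The binary string also answers the byte-count question directly: it has exactly one character per byte of the file, so its length is the byte count. Here is a minimal usage sketch, assuming the isogreek input from the question:

    // Minimal sketch: count the bytes and recover the text of the uploaded file.
    var reader = new FileReader();
    reader.onload = function () {
      var binary = reader.result;              // one character per byte (read as ISO-8859-1)
      var byteCount = binary.length;           // so the length is the file's byte count
      var fixedText = convertFrom1to7(binary); // remap to the actual ISO-8859-7 characters
      console.log(byteCount, fixedText);
    };
    reader.readAsBinaryString(isogreek.files[0]);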
B. The Chinese character isn’t a part of the ISO-8859-7 charset (because the charset supports up to 256 unique chars, as the table shows). If you want to include arbitrary Unicode characters in a program, you will probably need to do one of these two things:
- Count the bytes of that program in e.g. UTF-8 or UTF-16 instead. This can be done pretty easily with the library you linked. However, if you want this to be done automatically, you’ll need a function that checks whether the content of the textarea is valid ISO-8859-7, like this:
    function isValidISO_8859_7(text) {
      var charset = /[\u0000-\u00A0\u2018\u2019\u00A3\u20AC\u20AF\u00A6-\u00A9\u037A\u00AB-\u00AD\u2015\u00B0-\u00B3\u0384-\u0386\u00B7\u0388-\u038A\u00BB\u038C\u00BD\u038E-\u03CE]/;
      var valid = true;
      for (var i = 0; i < text.length; i++) {
        valid = valid && charset.test(text[i]);
      }
      return valid;
    }
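With that check, the byte counter can treat valid ISO-8859-7 text as one byte per character and fall back to UTF-8 otherwise. A sketch of that idea, assuming utf8.js from the question is loaded (its utf8.encode returning one character per encoded byte) and a hypothetical <textarea id="source">:

    // Sketch: count the bytes of the textarea content.
    function countBytes(text) {
      if (isValidISO_8859_7(text)) {
        return text.length;             // every ISO-8859-7 character is exactly one byte
      }
      return utf8.encode(text).length;  // otherwise count UTF-8 bytes via utf8.js
    }

    // e.g. countBytes(document.getElementById("source").value);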
- Create your own custom variant of ISO-8859-7 that uses a specific byte (or more than one) to signify that the next 2 or 3 bytes belong to a single Unicode char. This can be pretty much as simple or as complex as you like, from one char signifying a 2-byte char and one signifying a 3-byter, up to everything between 0x80 and 0x9F setting up for the next few. Here’s a basic example that uses 0x80 as the 2-byter and 0x81 as the 3-byter (it assumes the text is encoded in ISO-8859-1):
    function reUnicode(text) {
      var newtext = "";
      for (var i = 0; i < text.length; i++) {
        if (text.charCodeAt(i) === 0x80) {
          newtext += String.fromCharCode((text.charCodeAt(++i) << 8) + text.charCodeAt(++i));
        } else if (text.charCodeAt(i) === 0x81) {
          var charcode = (text.charCodeAt(++i) << 16) + (text.charCodeAt(++i) << 8) + text.charCodeAt(++i) - 65536;
          newtext += String.fromCharCode(0xD800 + (charcode >> 10), 0xDC00 + (charcode & 1023)); // convert into a UTF-16 surrogate pair
        } else {
          newtext += convertFrom1to7(text[i]);
        }
      }
      return newtext;
    }
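Counting bytes under that custom variant follows the same markers in reverse: one byte for anything in ISO-8859-7, three bytes (0x80 plus two) for any other BMP character, and four bytes (0x81 plus three) for an astral character stored as a surrogate pair. A rough sketch:

    // Rough sketch: how many bytes a string would take in the custom variant above.
    function countCustomBytes(text) {
      var bytes = 0;
      for (var i = 0; i < text.length; i++) {
        var code = text.charCodeAt(i);
        if (code >= 0xD800 && code <= 0xDBFF) {
          bytes += 4; // astral character: 0x81 marker + 3 bytes
          i++;        // skip the trailing surrogate
        } else if (isValidISO_8859_7(text[i])) {
          bytes += 1; // fits in ISO-8859-7: 1 byte
        } else {
          bytes += 3; // other BMP character: 0x80 marker + 2 bytes
        }
      }
      return bytes;
    }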
I can go into either method in more detail if you desire.