[z-machine] [Spec 1.1] Sound
Kevin Bracey
kevin@bracey-griffith.freeserve.co.uk
Thu, 13 Nov 2003 11:04:07 GMT
In message <004f01c3a95a$3250c210$db397ad5@Computer>
"David Kinder" <d.kinder@btinternet.com> wrote:
> > Also, I had some issues with the Unicode in strings proposal; my
> > memory is that Kevin and I had basically come to an agreement in
> > principle on that, but I can't remember if we got anything final
> > written down. I'll look that up the next time I'm logged in at home.
>
> That would be good: I'd like to at least consider some bits of Z-Spec
> 1.1 being added to Inform 6.30, and Unicode is one I'm quite keen on.
>
> > What's the link to the latest version again?
>
> http://www.jczorkmid.net/~jpenney/ZSpec11-latest.txt
Hello again folks.
I haven't really had time recently to mess with any Z-machine stuff, but I've
dug up my Inform patches to give to David Kinder to ensure all relevant
things make it into 6.30. I'm particularly concerned to make sure all the
character fixes go in, and I also had a couple of otherwise unknown fixes
and the addition of shift and exclusive-or operators.
On the subject of the Unicode proposal, I think we were all happy with the
last scheme detailed in the 2 copied e-mails below.
Is there an on-line archive of the list available yet?
--Kevin
---------------------------------------------------------------------------
Date: Wed, 11 Sep 2002 09:00:38 +0100
From: Kevin Bracey <kevin@bracey-griffith.freeserve.co.uk>
To: z-machine@GMD.DE
Subject: Re: [z-machine] New Unicode in strings proposal
In message <ro1wupt8m34.fsf@jackfruit.Stanford.EDU>
David Carlton <carlton@math.stanford.edu> wrote:
> On Tue, 10 Sep 2002 23:22:20 +0100, Kevin Bracey
> <kevin@bracey-griffith.freeserve.co.uk> said:
>
> > Note that this proposal limits you to 16-bit (BMP) Unicode
> > characters, as does print_unicode. If this was seen as a problem,
> > the following more UTF-8-like scheme might be used, giving 22-bit
> > range, sufficient to match UTF-16:
>
> > Binary value
> > ------------
> > $$00nnnnnnnn ZSCII character n; terminates escape sequence
> > $$01nnnnnnnn reserved
> > $$10nnnnnnnn continuation of multi-code Unicode character
> > $$110nnnnnnn start of a 2-code Unicode character
> > $$1110nnnnnn start of a 3-code Unicode character
> > $$1111nnnnnn reserved
>
> > Comments?
>
> I like it. (Unsurprisingly.)
>
> I do think that we should take the 16-bit issue seriously, though: my
> understanding is that characters using more than 16 bits are intended
> for use by invented languages (Elvish, Klingon, whatever), and it
> wouldn't seem at all surprising to me for people to want to write
> games using those scripts.
Agreed. It's not currently possible to output them, but it would be
straightforward using this method.
>
> One thing to consider that is relevant to this issue: I don't think
> it's important to explicitly mark continuation characters.
I came to the same conclusion while pondering it last night.
> Keeping that in mind, here's another possible encoding:
>
> $$00nnnnnnnn ZSCII character n; terminates escape
> $$01nnnnnnnn reserved
> $$10nnnnnnnn $$nnnnnnnnnn Unicode character needing <= 18 bits
> $$110nnnnnnn $$nnnnnnnnnn $$nnnnnnnnnn Unicode <= 27 bits.
> $$111nnnnnnn reserved
>
> This packs 16-bit Unicode characters just as well as your
> proposal from the part of your message that I didn't quote and better
> than your modified proposal that I've quoted above; but, like your
> modified proposal, it handles larger Unicode characters.
Not bad, but here's my proposal, which deals with the termination problem
nicely.
$$00nnnnnnnn                                         ZSCII character n
$$01nnnnnnnn                                         Unicode character needing <= 8 bits
$$10nnnnnnnn                                         Unicode character needing <= 8 bits
$$110nnnnnnn $$xnnnnnnnnn                            Unicode character needing <= 16 bits
$$1110nnnnnn $$1nnnnnnnnn $$xnnnnnnnnn               Unicode <= 24 bits
$$111100nnnn $$1nnnnnnnnn $$1nnnnnnnnn $$xnnnnnnnnn  Unicode <= 31 bits
$$111101nnnn                                         reserved
$$11111nnnnn                                         reserved
The 10-bit sequence is terminated by a 10-bit code with its top bit clear.
This allows you to switch back to 5-bit without any overhead. The whole
16-bit BMP is also neatly fitted into only 2 codes. This handles all Unicode
characters, the full 31-bit space.
Unicode characters must be encoded with a minimum length code. It is worth
noting that Latin-1 characters will always be encoded as single-code
characters.
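
As a rough illustration of the table and the minimum-length rule, here is a
Python sketch of an encoder for one Unicode character. The names
`encode_10bit` and `last` are invented for this example. It assumes that the
"x" bit in the table is the "stay in 10-bit mode" flag (set to continue,
clear on the last character of a run), which is one reading of the
termination rule rather than something the proposal spells out. The
$$00nnnnnnnn ZSCII form is left out.

```python
def encode_10bit(cp, last):
    """Encode one Unicode code point as a list of 10-bit codes.

    last=True clears the top bit of the final code, which (under the
    assumption stated above) switches the string back to 5-bit mode.
    """
    if not 0 <= cp < (1 << 31):
        raise ValueError("code point outside the 31-bit space")
    if cp < 0x100:
        # $$01nnnnnnnn ends the run, $$10nnnnnnnn continues it.
        return [((0b01 if last else 0b10) << 8) | cp]
    # Pick the shortest form that fits: 7+9, 6+9+9 or 4+9+9+9 payload bits.
    if cp < 0x10000:
        lead, extra = 0b110 << 7, 1
    elif cp < 0x1000000:
        lead, extra = 0b1110 << 6, 2
    else:
        lead, extra = 0b111100 << 4, 3
    codes = [lead | (cp >> (9 * extra))]
    for i in range(extra - 1, -1, -1):
        # Intermediate codes always have the top bit set; only the
        # final code carries the continue/terminate flag.
        flag = 1 if i > 0 else (0 if last else 1)
        codes.append((flag << 9) | ((cp >> (9 * i)) & 0x1FF))
    return codes


# A Latin-1 character takes one code, a BMP character two.
print([hex(c) for c in encode_10bit(0xE9, last=False)])
print([hex(c) for c in encode_10bit(0x20AC, last=True)])
```

The payload widths add up as the table claims: 7+9 = 16, 6+9+9 = 24 and
4+9+9+9 = 31 bits.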
This whole scheme would be better viewed as a switch between "5-bit" and
"10-bit" mode, rather than just a special "ZSCII escape". Thus the 5-bit
sequence 5,6 switches you into 10-bit mode, and a 10-bit code with its top
bit clear switches you back into 5-bit mode.
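
The mode switch can be sketched from the interpreter's side as well. The
Python below decodes a run of 10-bit codes, starting just after the 5-bit
sequence 5,6, and stops at the first character whose final code has its top
bit clear. The name `decode_10bit_run` and the tagged-tuple output are
made up for illustration. It assumes that only the last code of a
multi-code character can terminate the run, and it rejects overlong forms
on the strength of the minimum-length rule.

```python
def decode_10bit_run(codes):
    """Decode 10-bit codes until 10-bit mode ends.

    Returns (items, consumed): items is a list of ("zscii", n) or
    ("unicode", n) pairs, consumed is the number of codes read.
    """
    items = []
    i = 0
    while i < len(codes):
        c = codes[i]
        i += 1
        top2 = c >> 8
        if top2 == 0b00:                  # ZSCII character, ends the run
            items.append(("zscii", c & 0xFF))
            return items, i
        if top2 == 0b01:                  # 8-bit Unicode, ends the run
            items.append(("unicode", c & 0xFF))
            return items, i
        if top2 == 0b10:                  # 8-bit Unicode, run continues
            items.append(("unicode", c & 0xFF))
            continue
        if c >> 7 == 0b110:
            value, extra = c & 0x7F, 1
        elif c >> 6 == 0b1110:
            value, extra = c & 0x3F, 2
        elif c >> 4 == 0b111100:
            value, extra = c & 0x0F, 3
        else:
            raise ValueError("reserved 10-bit code")
        if i + extra > len(codes):
            raise ValueError("truncated multi-code character")
        for k in range(extra):
            t = codes[i]
            i += 1
            if k < extra - 1 and not (t >> 9):
                raise ValueError("intermediate code with top bit clear")
            value = (value << 9) | (t & 0x1FF)
        if value < (1 << (8 * extra)):
            raise ValueError("not a minimum-length encoding")
        items.append(("unicode", value))
        if not (t >> 9):                  # final code had its top bit clear
            return items, i
    raise ValueError("string ended while still in 10-bit mode")


print(decode_10bit_run([0x2E9, 0x310, 0x0AC]))
```

The caller would resume 5-bit decoding at the code following the
`consumed` count.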
Now, as a related issue, if we're going to do this, which clearly compromises
backwards compatibility, the feature will be done through a compiler switch
of some sort. Having sacrificed the backwards compatibility, it would be nice
for the compiler to also be able to make use of 5-bit locking shifts, a
feature that was present in some, but not all Infocom terps. So how about
requiring that Standard 1.1 interpreters also handle these shifts?
Locking shifts
--------------
In Version 3 and later, a Standard 1.1 interpreter should interpret
consecutive 4 or 5 codes as shift lock characters, as per the following
table:
                      Next code
                   4              5
           A0  Shift to A1    Shift to A2
Current    A1  Lock to A1     Lock to A0
alphabet   A2  Lock to A0     Lock to A2
Thus, "DEADLINE: An INTERLOGIC Mystery" might be coded as
V1/V2: (35 codes)
4 D E A D L I N E 2 : 0 A 3 n 0 I N T E R L O G I C 0 M 5 y s t e r y
V3+, Std 1.0: (52 codes)
4 D 4 E 4 A 4 D 4 L 4 I 4 N 4 E 5 : 0 4 A n 0 4 I 4 N 4 T 4 E 4 R 4 L 4 O
4 G 4 I 4 C 0 4 M y s t e r y
V3+, Std 1.1: (39 codes)
4 4 D E A D L I N E 5 5 : 0 4 A n 0 4 4 I N T E R L O G I C 0 M 5 y s t e r y
Locking shifts are not used when encoding dictionary words. Locking state is
left unaltered by abbreviations and 10-bit sequences. Abbreviations are
always decoded starting in A0, regardless of the alphabet lock state of the
invoker.
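
To check the table against the examples, here is a small Python sketch of a
decoder that applies it. The helper names (`to_zchars`, `decode_zchars`,
`LOCK_TARGET`) are invented, and the sketch assumes the default V3+ alphabet
tables. It treats "current alphabet" as whichever alphabet is in force,
whether through a pending single shift or a lock, which is how the Std 1.1
example appears to use the table. Abbreviation codes 1-3 and the A2 escape
are not handled.

```python
A0 = "abcdefghijklmnopqrstuvwxyz"
A1 = A0.upper()
A2 = "\x00\n0123456789.,!?_#'\"/\\-:()"   # slot 0 (Z-char 6) is the escape
ALPHABETS = (A0, A1, A2)

# (current alphabet, code) -> alphabet to lock to, per the table above.
LOCK_TARGET = {(1, 4): 1, (1, 5): 0, (2, 4): 0, (2, 5): 2}


def to_zchars(tokens):
    """Turn the notation used in the examples into a list of Z-char codes."""
    codes = []
    for t in tokens.split():
        if t.isdigit():
            codes.append(int(t))                   # control codes 0-5
        elif t.isalpha():
            codes.append(6 + A0.index(t.lower()))  # same code in A0 and A1
        else:
            codes.append(6 + A2.index(t))
    return codes


def decode_zchars(codes):
    lock, shift, out = 0, None, []
    for code in codes:
        current = lock if shift is None else shift
        if code in (4, 5):
            if current == 0:
                shift = 1 if code == 4 else 2      # ordinary single shift
            else:
                lock, shift = LOCK_TARGET[(current, code)], None
        elif code == 0:
            out.append(" ")
            shift = None
        elif code >= 6 and not (current == 2 and code == 6):
            out.append(ALPHABETS[current][code - 6])
            shift = None
        else:
            raise ValueError("code not handled in this sketch")
    return "".join(out)


std11 = ("4 4 D E A D L I N E 5 5 : 0 4 A n 0 4 4 I N T E R L O G I C "
         "0 M 5 y s t e r y")
print(len(to_zchars(std11)), decode_zchars(to_zchars(std11)))
# prints: 39 DEADLINE: An INTERLOGIC Mystery
```

The Std 1.0 coding decodes to the same text under this state machine, since
it never contains two consecutive shift codes.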
--
Kevin Bracey
http://www.bracey-griffith.freeserve.co.uk/
-----------------------------------------------------------------------------
Date: Wed, 11 Sep 2002 22:50:50 +0100
From: Kevin Bracey <kevin@bracey-griffith.freeserve.co.uk>
To: z-machine@GMD.DE
Subject: Re: [z-machine] New Unicode in strings proposal
In message <ro17khs6wv3.fsf@jackfruit.Stanford.EDU>
David Carlton <carlton@math.stanford.edu> wrote:
> On Wed, 11 Sep 2002 09:00:38 +0100, Kevin Bracey <kevin@bracey-griffith.freeserve.co.uk> said:
>
> > The 10-bit sequence is terminated by a 10-bit code with its top bit
> > clear. This allows you to switch back to 5-bit without any
> > overhead. The whole 16-bit BMP is also neatly fitted into only 2
> > codes. This handles all Unicode characters, the full 31-bit space.
>
> > Unicode characters must be encoded with a minimum length code. It is
> > worth noting that Latin-1 characters will always be encoded as
> > single-code characters.
>
> Nice.
It's an entertaining puzzle, as you said in an earlier post. Stopped me going
to sleep for a good hour, shuffling bits in my head, before I came up with
that scheme.
> > This whole scheme would be better viewed as a switch between "5-bit" and
> > "10-bit" mode, rather than just a special "ZSCII escape". Thus the 5-bit
> > sequence 5,6 switches you into 10-bit mode, and a 10-bit code with its
> > top bit clear switches you back into 5-bit mode.
>
> Indeed. Seems useful and elegant to me (inasmuch as anything
> involving Z-machine strings can be called "elegant").
Two extra things worth noting are that the string must switch back to 5-bit
mode before ending, and that you have to switch back to 5-bit mode when you
generate a new-line, as we don't allow Unicode control characters.
Also, Unicode characters can't be used when tokenising (by definition,
almost, as you only tokenise ZSCII).
> > Now, as a related issue, if we're going to do this, which clearly
> > compromises backwards compatibility, the feature will be done
> > through a compiler switch of some sort. Having sacrificed the
> > backwards compatibility, it would be nice for the compiler to also
> > be able to make use of 5-bit locking shifts, a feature that was
> > present in some, but not all Infocom terps. So how about requiring
> > that Standard 1.1 interpreters also handle these shifts?
>
> Sounds reasonable. I assume there are no existing files that contain
> repeated 4's or 5's other than at the end of words?
No. We did a full search a while back. Infocom didn't implement this in all
their interpreters, and I suspect they never got around to getting their
compiler to take advantage of it. Because their interpreters weren't
consistent in behaviour for multiple 4's or 5's they steered well clear (or
at least their compiler did), apart from their use as padding.
Actually Zip 2000 already implements shift locks, and has for the last few
versions.
Graham was against documenting this when I first pointed it out to him,
because it wasn't clear how you would be able to take advantage of it usefully, but
it seems reasonable to me to put it in at the same time as the Unicode stuff.
--
Kevin Bracey
http://www.bracey-griffith.freeserve.co.uk/
-----------------------------------------------------------------------------