[z-machine] [Spec 1.1] Sound

Kevin Bracey kevin@bracey-griffith.freeserve.co.uk
Thu, 13 Nov 2003 11:04:07 GMT


In message <004f01c3a95a$3250c210$db397ad5@Computer>
          "David Kinder" <d.kinder@btinternet.com> wrote:

> > Also, I had some issues with the Unicode in strings proposal; my
> > memory is that Kevin and I had basically come to an agreement in
> > principle on that, but I can't remember if we got anything final
> > written down.  I'll look that up the next time I'm logged in at home.
> 
> That would be good: I'd like to at least consider some bits of Z-Spec
> 1.1 being added to Inform 6.30, and Unicode is one I'm quite keen on.
> 
> > What's the link to the latest version again?
> 
> http://www.jczorkmid.net/~jpenney/ZSpec11-latest.txt

Hello again folks.

I haven't really had time recently to mess with any Z-machine stuff, but I've
dug up my Inform patches to give to David Kinder to ensure all relevant
things make it into 6.30. I'm particularly concerned to make sure all the
character fixes go in, and I also had a couple of otherwise unknown fixes
and the addition of shift and exclusive-or operators.

On the subject of the Unicode proposal, I think we were all happy with the
last scheme detailed in the two copied e-mails below.

Is there an on-line archive of the list available yet?

--Kevin

---------------------------------------------------------------------------
Date: Wed, 11 Sep 2002 09:00:38 +0100
From: Kevin Bracey <kevin@bracey-griffith.freeserve.co.uk>
To: z-machine@GMD.DE
Subject: Re: [z-machine] New Unicode in strings proposal

In message <ro1wupt8m34.fsf@jackfruit.Stanford.EDU>
          David Carlton <carlton@math.stanford.edu> wrote:

> On Tue, 10 Sep 2002 23:22:20 +0100, Kevin Bracey
> <kevin@bracey-griffith.freeserve.co.uk> said:
> 
> > Note that this proposal limits you to 16-bit (BMP) Unicode
> > characters, as does print_unicode. If this was seen as a problem,
> > the following more UTF-8-like scheme might be used, giving 22-bit
> > range, sufficient to match UTF-16:
> 
> >      Binary value      
> >      ------------ 
> >      $$00nnnnnnnn       ZSCII character n; terminates escape sequence
> >      $$01nnnnnnnn       reserved
> >      $$10nnnnnnnn       continuation of multi-code Unicode character
> >      $$110nnnnnnn       start of a 2-code Unicode character
> >      $$1110nnnnnn       start of a 3-code Unicode character
> >      $$1111nnnnnn       reserved
> 
> > Comments?
> 
> I like it.  (Unsurprisingly.)
> 
> I do think that we should take the 16-bit issue seriously, though: my
> understanding is that characters using more than 16 bits are intended
> for use by invented languages (Elvish, Klingon, whatever), and it
> wouldn't seem at all surprising to me for people to want to write
> games using those scripts.

Agreed. It's not currently possible to output them, but it would be
straightforward using this method.

> 
> One thing to consider that is relevant to this issue: I don't think
> it's important to explicitly mark continuation characters.  

I came to the same conclusion while pondering it last night.

> Keeping that in mind, here's another possible encoding:
> 
> $$00nnnnnnnn                       ZSCII character n; terminates escape
> $$01nnnnnnnn                       reserved
> $$10nnnnnnnn $$nnnnnnnnnn          Unicode character needing <= 18 bits
> $$110nnnnnnn $$nnnnnnnnnn $$nnnnnnnnnn    Unicode <= 27 bits.
> $$111nnnnnnn                       reserved
> 
> This packs 16-bit Unicode characters just as well as your
> proposal from the part of your message that I didn't quote and better
> than your modified proposal that I've quoted above; but, like your
> modified proposal, it handles larger Unicode characters.

Not bad, but here's my proposal, which deals with the termination problem
nicely.

$$00nnnnnnnn                       ZSCII character n
$$01nnnnnnnn                       Unicode character needing <= 8 bits
$$10nnnnnnnn                       Unicode character needing <= 8 bits
$$110nnnnnnn $$xnnnnnnnnn          Unicode character needing <= 16 bits
$$1110nnnnnn $$1nnnnnnnnn $$xnnnnnnnnn                Unicode <= 24 bits
$$111100nnnn $$1nnnnnnnnn $$1nnnnnnnnn $$xnnnnnnnnn   Unicode <= 31 bits
$$111101nnnn                       reserved
$$11111nnnnn                       reserved

The 10-bit sequence is terminated by a 10-bit code with its top bit clear.
This allows you to switch back to 5-bit without any overhead. The whole
16-bit BMP is also neatly fitted into only 2 codes. This handles all Unicode
characters, the full 31-bit space.

Unicode characters must be encoded with a minimum length code. It is worth
noting that Latin-1 characters will always be encoded as single-code
characters.

This whole scheme would be better viewed as a switch between "5-bit" and
"10-bit" mode, rather than just a special "ZSCII escape". Thus the 5-bit
sequence 5,6 switches you into 10-bit mode, and a 10-bit code with its top
bit clear switches you back into 5-bit mode.
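
As an illustration only, here is a minimal Python sketch of the 10-bit scheme
in the table above. It assumes one particular reading of that table: each
continuation code carries 9 payload bits, "x" is the top bit of the final
code, and a clear top bit ends 10-bit mode (so $$00 and $$01 end it while
$$10 stays in it). The names encode_10bit and decode_10bit are hypothetical,
and the decoder does not check the minimum-length rule.

```python
def encode_10bit(cp, last):
    """Encode one Unicode code point as a list of 10-bit codes.

    `last` is True when this character should switch back to 5-bit mode,
    i.e. the top bit of its final code is left clear.
    """
    top = 0 if last else 0x200
    if cp < 0x100:                      # $$01nnnnnnnn or $$10nnnnnnnn
        return [(0x100 if last else 0x200) | cp]
    if cp < 0x10000:                    # $$110 + 7 bits, then 9 bits
        return [0x300 | (cp >> 9), top | (cp & 0x1FF)]
    if cp < 0x1000000:                  # $$1110 + 6 bits, then 9 + 9 bits
        return [0x380 | (cp >> 18),
                0x200 | ((cp >> 9) & 0x1FF),
                top | (cp & 0x1FF)]
    if cp < 0x80000000:                 # $$111100 + 4 bits, then 9 + 9 + 9 bits
        return [0x3C0 | (cp >> 27),
                0x200 | ((cp >> 18) & 0x1FF),
                0x200 | ((cp >> 9) & 0x1FF),
                top | (cp & 0x1FF)]
    raise ValueError("code point does not fit in 31 bits")


def decode_10bit(codes):
    """Decode 10-bit codes until one with a clear top bit has been consumed.

    Returns (characters, codes consumed); each character is a
    ("zscii", n) or ("unicode", n) pair.
    """
    chars, i = [], 0
    while i < len(codes):
        lead = codes[i]
        i += 1
        if lead < 0x300:                # single-code forms $$00, $$01, $$10
            kind = "zscii" if lead < 0x100 else "unicode"
            chars.append((kind, lead & 0xFF))
            final = lead
        else:
            if lead < 0x380:
                extra, value = 1, lead & 0x7F
            elif lead < 0x3C0:
                extra, value = 2, lead & 0x3F
            elif lead < 0x3D0:
                extra, value = 3, lead & 0x0F
            else:
                raise ValueError("reserved lead code")
            for _ in range(extra):      # each following code adds 9 bits
                final = codes[i]
                i += 1
                value = (value << 9) | (final & 0x1FF)
            chars.append(("unicode", value))
        if not final & 0x200:           # top bit clear: back to 5-bit mode
            break
    return chars, i
```

Under these assumptions, encode_10bit(0x20AC, True) gives the two codes
[0x310, 0x0AC], and a run of characters costs nothing extra to terminate
because only the last one is encoded with last=True.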


Now, as a related issue, if we're going to do this, which clearly compromises
backwards compatibility, the feature will be done through a compiler switch
of some sort. Having sacrificed the backwards compatibility, it would be nice
for the compiler to also be able to make use of 5-bit locking shifts, a
feature that was present in some, but not all Infocom terps. So how about
requiring that Standard 1.1 interpreters also handle these shifts?



Locking shifts
--------------
In Version 3 and later, a Standard 1.1 interpreter should interpret
consecutive 4 or 5 codes as shift lock characters, as per the following
table:

                            Next code
                        4              5
              A0  Shift to A1      Shift to A2
     Current  A1  Lock to A1       Lock to A0
    alphabet  A2  Lock to A0       Lock to A2
    
Thus, "DEADLINE: An INTERLOGIC Mystery" might be coded as

V1/V2: (35 codes)

 4 D E A D L I N E 2 : 0 A 3 n 0 I N T E R L O G I C 0 M 5 y s t e r y

V3+, Std 1.0: (52 codes)

 4 D 4 E 4 A 4 D 4 L 4 I 4 N 4 E 5 : 0 4 A n 0 4 I 4 N 4 T 4 E 4 R 4 L 4 O
 4 G 4 I 4 C 0 4 M y s t e r y
 
V3+, Std 1.1: (39 codes)

 4 4 D E A D L I N E 5 5 : 0 4 A n 0 4 4 I N T E R L O G I C 0 M 5 y s t e r y

Locking shifts are not used when encoding dictionary words. Locking state is
left unaltered by abbreviations and 10-bit sequences. Abbreviations are
always decoded starting in A0, regardless of the alphabet lock state of the
invoker.
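
For anyone wanting to experiment, the table above can be read as a small
state machine. The following Python sketch is one possible reading, not a
reference implementation: it uses the default V3 alphabet tables, treats
Z-character 0 as a space that also uses up a pending shift (an assumption),
and does not handle abbreviations or the 10-bit escape. The names
decode_locking and zc are hypothetical.

```python
A0 = "abcdefghijklmnopqrstuvwxyz"
A1 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# Slot 0 of A2 is really the escape code (Z-character 6); a space is a placeholder.
A2 = " \n0123456789.,!?_#'\"/\\-:()"
ALPHABETS = (A0, A1, A2)

# (current alphabet, code) -> (new current alphabet, does it lock?)
SHIFT_TABLE = {
    (0, 4): (1, False), (0, 5): (2, False),
    (1, 4): (1, True),  (1, 5): (0, True),
    (2, 4): (0, True),  (2, 5): (2, True),
}


def decode_locking(zchars):
    """Decode 5-bit Z-characters using the locking-shift table above."""
    locked = current = 0
    out = []
    for z in zchars:
        if z in (4, 5):
            current, locks = SHIFT_TABLE[(current, z)]
            if locks:
                locked = current
        elif z == 0:
            out.append(" ")
            current = locked            # assumption: a space uses up a shift
        elif z >= 6:
            out.append(ALPHABETS[current][z - 6])
            current = locked            # a temporary shift lasts one character
        else:
            raise ValueError("abbreviation codes 1-3 are not handled here")
    return "".join(out)


def zc(text):
    """Map letters or punctuation to Z-characters 6..31, whatever the alphabet."""
    return [next(a.index(c) for a in ALPHABETS if c in a) + 6 for c in text]


# The Std 1.1 example above, written out as Z-characters.
example = ([4, 4] + zc("DEADLINE") + [5, 5] + zc(":") + [0, 4] + zc("An")
           + [0, 4, 4] + zc("INTERLOGIC") + [0] + zc("M") + [5] + zc("ystery"))
print(len(example), decode_locking(example))
# prints: 39 DEADLINE: An INTERLOGIC Mystery
```

Note how "5 5 :" works in this reading: the first 5 locks back to A0 and the
second is an ordinary shift to A2, so the text continues in A0 afterwards.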

-- 
Kevin Bracey
http://www.bracey-griffith.freeserve.co.uk/

-----------------------------------------------------------------------------
Date: Wed, 11 Sep 2002 22:50:50 +0100
From: Kevin Bracey <kevin@bracey-griffith.freeserve.co.uk>
To: z-machine@GMD.DE
Subject: Re: [z-machine] New Unicode in strings proposal

In message <ro17khs6wv3.fsf@jackfruit.Stanford.EDU>
          David Carlton <carlton@math.stanford.edu> wrote:

> On Wed, 11 Sep 2002 09:00:38 +0100, Kevin Bracey <kevin@bracey-griffith.freeserve.co.uk> said:
> 
> > The 10-bit sequence is terminated by a 10-bit code with its top bit
> > clear.  This allows you to switch back to 5-bit without any
> > overhead. The whole 16-bit BMP is also neatly fitted into only 2
> > codes. This handles all Unicode characters, the full 31-bit space.
> 
> > Unicode characters must be encoded with a minimum length code. It is
> > worth noting that Latin-1 characters will always be encoded as
> > single-code characters.
> 
> Nice.

It's an entertaining puzzle, as you said in an earlier post. Stopped me going
to sleep for a good hour, shuffling bits in my head, before I came up with
that scheme.

> > This whole scheme would be better viewed as a switch between "5-bit" and
> > "10-bit" mode, rather than just a special "ZSCII escape". Thus the 5-bit
> > sequence 5,6 switches you into 10-bit mode, and a 10-bit code with its
> > top bit clear switches you back into 5-bit mode.
> 
> Indeed.  Seems useful and elegant to me (inasmuch as anything
> involving Z-machine strings can be called "elegant").

Two extra things worth noting are that the string must switch back to 5-bit
mode before ending, and that you have to switch back to 5-bit mode when you
generate a new-line, as we don't allow Unicode control characters.

Also, Unicode characters can't be used when tokenising (by definition,
almost, as you only tokenise ZSCII).

> > Now, as a related issue, if we're going to do this, which clearly
> > compromises backwards compatibility, the feature will be done
> > through a compiler switch of some sort. Having sacrificed the
> > backwards compatibility, it would be nice for the compiler to also
> > be able to make use of 5-bit locking shifts, a feature that was
> > present in some, but not all Infocom terps. So how about requiring
> > that Standard 1.1 interpreters also handle these shifts?
> 
> Sounds reasonable.  I assume there are no existing files that contain
> repeated 4's or 5's other than at the end of words?

No. We did a full search a while back. Infocom didn't implement this in all
their interpreters, and I suspect they never got around to getting their
compiler to take advantage of it. Because their interpreters weren't
consistent in behaviour for multiple 4's or 5's they steered well clear (or
at least their compiler did), apart from their use as padding.

Actually Zip 2000 already implements shift locks, and has for the last few
versions.

Graham was against documenting this when I first pointed it out to him,
because it wasn't clear how you would be able to take advantage of it usefully, but
it seems reasonable to me to put it in at the same time as the Unicode stuff.

-- 
Kevin Bracey
http://www.bracey-griffith.freeserve.co.uk/
-----------------------------------------------------------------------------


-- 
Kevin Bracey
http://www.bracey-griffith.freeserve.co.uk/