Ticket #339 (closed defect: fixed)

Opened 2 years ago

Last modified 2 years ago

YAML error with UTF-16 string

Reported by: dev@…
Owned by: lsansonetti@…
Priority: critical
Milestone: MacRuby 0.5
Component: MacRuby
Keywords: YAML encoding
Cc:

Description

macruby -e 'require "yaml"; puts "Rübe".to_yaml'

produces

--- ! "R\xFCb"

with the last character missing!

BTW, I googled that "YAML streams must be UTF-8 or UTF-16", so why is an escape sequence visible in the output of the above puts statement?

Change History

Changed 2 years ago by lsansonetti@…

It's simply a bug :) Thanks for reporting it.

Changed 2 years ago by neeracher@…

  • status changed from new to closed
  • resolution set to fixed

Fixed as of rev 2995 (the actual fixes were in 2788 for the exclamation mark and in 2825 for the truncated output).

Changed 2 years ago by jazzbox@…

$ macruby -e 'require "yaml"; puts "Rübe".to_yaml'
--- "R\xFCbe"
$ ruby1.9 -e 'require "yaml"; puts "Rübe".to_yaml'
--- "R\xC3\xBCbe"

Seems to work now! MacRuby escapes to UTF-16 and Ruby 1.9 escapes to UTF-8. I didn't find anything in YAML docs that describes that behaviour, both methods seem to be correct. But ruby 1.8 fails to load the UTF-16 YAML. That is not astonishing because IMHO there is no way to guess what is the correct escaping mode.

I think escaping is not necessary here because the encoding of input and output is the same. This can easily be tested by

$ macruby -e 'require "yaml"; puts YAML::load "--- Rübe"'
Rübe
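
For reference, the same round trip can be checked in a script. This is only a sketch, assuming a current Ruby whose standard yaml library is Psych rather than MacRuby 0.5, so it says nothing about what MacRuby emitted at the time. The variable names are illustrative.

```ruby
require "yaml"

original = "R\u00FCbe" # "Rübe", written with an escape so the file stays ASCII

# Dump and reload: the string should survive whatever escaping the emitter chose.
restored = YAML.load(original.to_yaml)

# Load a document that contains the unescaped character directly.
unescaped = YAML.load("--- R\u00FCbe")

puts restored == original
puts unescaped == original
```

If both comparisons hold, the loader accepts the character with or without escaping, which is the point made above.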

Changed 2 years ago by neeracher@…

I responded to this in e-mail, but to preserve the answer for posterity, I'm copying it here.

The ruby1.9 encoding is simply wrong: It translates the accented character into UTF-8, and then escapes the two UTF-8 bytes separately. What this ends up encoding is "RÃ¼be", which is not what you want.

> I didn't find anything in YAML docs that describes that behaviour, both methods seem to be correct.

They can't possibly be BOTH correct, as interpreting the output of one according to the theory of the other would give a different result. If you look at the section in the YAML spec: <http://www.yaml.org/spec/1.2/spec.html#id2776092>, you will see

[ 57 ] "Escaped 8-bit Unicode character."

This is NOT a UTF-8 character.
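
To make the difference concrete, here is a minimal Ruby sketch that reads \xNN the way the spec describes it, as the Unicode character U+00NN. The helper name decode_x_escapes is hypothetical and handles only \x escapes; it is an illustration, not the code used by MacRuby or syck.

```ruby
# Hypothetical helper: decode YAML "\xNN" escapes as the Unicode
# characters U+0000..U+00FF. Other escape forms are left untouched.
def decode_x_escapes(text)
  text.gsub(/\\x(\h\h)/) { [$1.hex].pack("U") }
end

macruby_output = 'R\xFCbe'     # escapes the code point U+00FC
ruby19_output  = 'R\xC3\xBCbe' # escapes the two UTF-8 bytes of U+00FC

puts decode_x_escapes(macruby_output)
puts decode_x_escapes(ruby19_output)
```

Read this way, the first string comes back as "Rübe", while the second comes back as "RÃ¼be" (U+00C3 followed by U+00BC).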

> But ruby 1.8 fails to load the UTF-16 YAML. That is not astonishing because IMHO there is no way to guess what is the correct escaping mode.

It's not astonishing because (a) 1.8 has very poor Unicode support anyway and (b) this would hardly be the only bug in syck.

> I think escaping is not necessary here because the encoding of input and output is the same. This can easily be tested by
>
> $ macruby -e 'require "yaml"; puts YAML::load "--- Rübe"'
> Rübe

That's an interesting point. I think you're right that the YAML spec does not require escaping of printable characters >\u007F. However, non-printable characters DO have to be escaped, and for the printable ones, it could be argued that erring on the side of escaping helps readability if the OS does not have font coverage for some printable characters. In any case, the current implementation tries to be conservative in what it generates and liberal in what it accepts. I'm open to persuasion that we should avoid escaping characters, provided there is a low-cost test for printability of general Unicode characters (I have not yet checked whether one of the built-in CFCharacterSets can give that; the descriptions were inconclusive).
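
One candidate for a low-cost test that does not depend on CFCharacterSet is the c-printable production of the YAML 1.2 spec itself, which is just a handful of code point ranges. The Ruby sketch below assumes that definition is an acceptable meaning of "printable" (it says nothing about font coverage). The names PRINTABLE_RANGES, yaml_printable? and escape_for_double_quotes are hypothetical, and this is not how MacRuby implements it.

```ruby
# Code point ranges of c-printable in the YAML 1.2 spec.
PRINTABLE_RANGES = [
  0x09..0x0A, 0x0D..0x0D, 0x20..0x7E, 0x85..0x85,
  0xA0..0xD7FF, 0xE000..0xFFFD, 0x10000..0x10FFFF
].freeze

# Characters that need a short escape inside a double-quoted scalar.
SHORT_ESCAPES = {
  '"' => '\"', '\\' => '\\\\', "\n" => '\n', "\t" => '\t', "\r" => '\r'
}.freeze

def yaml_printable?(codepoint)
  PRINTABLE_RANGES.any? { |range| range.cover?(codepoint) }
end

# Escape only what has to be escaped; printable characters pass through.
def escape_for_double_quotes(str)
  str.each_char.map do |ch|
    cp = ch.ord
    if SHORT_ESCAPES.key?(ch)
      SHORT_ESCAPES[ch]
    elsif yaml_printable?(cp)
      ch
    elsif cp <= 0xFF
      format('\x%02X', cp)
    else
      format('\u%04X', cp) # only U+FFFE and U+FFFF can reach this branch
    end
  end.join
end

puts escape_for_double_quotes("R\u00FCbe")
puts escape_for_double_quotes("bell\u0007")
```

With this rule the "ü" is written as-is, while a control character such as U+0007 is still emitted as \x07.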
