Ticket #225 (closed defect: duplicate)

Opened 3 years ago

Last modified 3 years ago

Regexp engine broken when a string contains non-ASCII characters

Reported by: mattaimonetti@…
Owned by: lsansonetti@…
Priority: critical
Milestone: MacRuby 0.4
Component: MacRuby
Keywords: regexp, bug
Cc:

Description

Here is sample code to reproduce the problem:

html = %{<p><a href="http://www.flickr.com/people/jeanelietrujillo/">jeanelietrujillo</a> posted a photo:</p>
<p><a href="http://www.flickr.com/photos/jeanelietrujillo/2211862262/" title="Galgani Décoration"><img src="http://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg" width="240" height="240" alt="Galgani Décoration" /></a></p>}

html.scan(/<img\s+src="(.+?)"/)[0][0]

Ruby 1.9 returns:

  => "http://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg"

MacRuby returns:

  => "ttp://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg\""

Now let's replace the é in the title attribute (the only one that appears before the match) with an e:

html = %{<p><a href="http://www.flickr.com/people/jeanelietrujillo/">jeanelietrujillo</a> posted a photo:</p>
<p><a href="http://www.flickr.com/photos/jeanelietrujillo/2211862262/" title="Galgani Decoration"><img src="http://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg" width="240" height="240" alt="Galgani Décoration" /></a></p>}

html.scan(/<img\s+src="(.+?)"/)[0][0]

MacRuby now returns:

  => "http://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg"

My guess is that the Unicode characters throw off the offset used to extract the matched string, resulting in a substring that starts (and ends) one character too late.

To test my hypothesis, here is another sample, this time with two "é" characters before the match:

html = %{<p><a href="http://www.flickr.com/people/jeanelietrujillo/">jeanelietrujillo</a>a posté une photo:</p>
<p><a href="http://www.flickr.com/photos/jeanelietrujillo/2211862262/" title="Galgani Décoration"><img src="http://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg" width="240" height="240" alt="Galgani Décoration" /></a></p>}

html.scan(/<img\s+src="(.+?)"/)[0][0]

MacRuby returns:

  => "tp://farm3.static.flickr.com/2262/2211862262_2f08c343a3_m.jpg\" "

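The three results above fit one possible explanation: the engine reports the capture's position as a byte offset into the UTF-8 data, and that number is then used as a character index. Each "é" occupies two bytes in UTF-8, so every "é" before the match would shift the extracted substring one character to the right. The sketch below only simulates that suspected behavior in plain Ruby. It is not taken from the MacRuby source, and the helper name and the shortened sample strings are made up for illustration.

```ruby
# encoding: utf-8

# Hypothetical helper: returns the first capture the way a buggy engine might,
# by computing the capture's BYTE offset and then using it as a CHARACTER index.
def capture_with_byte_offset_as_char_index(str, regexp)
  m = str.match(regexp)
  return nil unless m

  char_start = m.begin(1)                    # correct character offset of capture 1
  byte_start = str[0, char_start].bytesize   # byte offset of that same position
  str[byte_start, m[1].length]               # wrong on purpose: bytes used as chars
end

IMG_SRC = /<img\s+src="(.+?)"/

# \u00E9 is "é" (two bytes in UTF-8). These are shortened stand-ins for the
# samples in the description, not the original HTML.
NO_ACCENT   = %{<p title="Decoration"><img src="http://example.com/a.jpg" width="240" /></p>}
ONE_ACCENT  = %{<p title="D\u00E9coration"><img src="http://example.com/a.jpg" width="240" /></p>}
TWO_ACCENTS = %{<p>post\u00E9</p><p title="D\u00E9coration"><img src="http://example.com/a.jpg" width="240" /></p>}

p capture_with_byte_offset_as_char_index(NO_ACCENT, IMG_SRC)
# => "http://example.com/a.jpg"
p capture_with_byte_offset_as_char_index(ONE_ACCENT, IMG_SRC)
# => "ttp://example.com/a.jpg\""
p capture_with_byte_offset_as_char_index(TWO_ACCENTS, IMG_SRC)
# => "tp://example.com/a.jpg\" "
```

The simulated output shows the same shift as the MacRuby results above: one character with one preceding "é", two characters with two. That supports the hypothesis but does not confirm where the offset is actually miscomputed.
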
Change History

Changed 3 years ago by mattaimonetti@…

Looks like a duplicate of http://www.macruby.org/trac/ticket/94

Changed 3 years ago by mattaimonetti@…

  • status changed from new to closed
  • resolution set to duplicate