The input used the ISO-8859-1 character set, while Cygwin uses UTF-8 by default. The characters that were not matched formed invalid sequences in UTF-8. Setting ‘LANG=C’ also fixes the problem.
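As a rough illustration of the encoding mismatch, the sketch below checks whether some bytes are valid UTF-8 and converts them from ISO-8859-1. The sample text is made up for the example, and it assumes an `iconv` command is installed; this is one way to inspect the input, not necessarily how the original problem was diagnosed.

```shell
# Hypothetical sample: "café au lait" encoded as ISO-8859-1 (byte 0xE9 is "é").
sample=$(printf 'caf\351 au lait')

# iconv exits with a non-zero status when the input is not valid in the
# source encoding, so a UTF-8 to UTF-8 pass works as a validity check.
if printf '%s' "$sample" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    is_valid_utf8=yes
else
    is_valid_utf8=no
fi

# Converting to UTF-8 gives sed input that matches a UTF-8 locale.
converted=$(printf '%s' "$sample" | iconv -f ISO-8859-1 -t UTF-8)

echo "valid UTF-8 before conversion: $is_valid_utf8"
echo "converted: $converted"
```

Once the input has been converted, ‘.’ should match every character in a UTF-8 locale, because no invalid sequences remain.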
‘s/.*//’ does not clear pattern space
This happens if your input stream includes invalid multibyte sequences. POSIX mandates that such sequences are not matched by ‘.’, so that ‘s/.*//’ will not clear pattern space as you would expect. In fact, there is no way to clear sed's buffers in the middle of the script in most multibyte locales (including UTF-8 locales). For this reason, GNU sed provides a ‘z’ command (for ‘zap’) as an extension.
To work around these problems, which may cause bugs in shell scripts, set the LC_COLLATE and LC_CTYPE environment variables to ‘C’.
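The behaviour described above can be tried with a small sketch like the following. The sample line is hypothetical, the `C.UTF-8` locale name is an assumption (it may not exist on every system, in which case sed falls back to the C locale), and the `z` command requires GNU sed.

```shell
# Hypothetical input line containing byte 0xE9, which is "é" in
# ISO-8859-1 but an invalid sequence on its own in UTF-8.
line='caf\351 au lait\n'

# In a UTF-8 locale, '.' is not supposed to match the invalid byte,
# so part of the line is expected to survive the substitution.
utf8_result=$(printf "$line" | LC_ALL=C.UTF-8 sed 's/.*//')

# In the C locale every byte is a character, so the whole line is cleared.
c_result=$(printf "$line" | LC_ALL=C sed 's/.*//')

# GNU sed extension: 'z' empties pattern space regardless of its content.
z_result=$(printf "$line" | LC_ALL=C.UTF-8 sed 'z')

printf 'UTF-8 locale leaves %d byte(s)\n' "$(printf '%s' "$utf8_result" | wc -c)"
printf 'C locale leaves %d byte(s)\n' "$(printf '%s' "$c_result" | wc -c)"
printf 'z command leaves %d byte(s)\n' "$(printf '%s' "$z_result" | wc -c)"
```

Setting `LC_ALL` for a single command keeps the override local to that sed invocation, which is usually safer in a shell script than exporting it globally.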
Wednesday, December 08, 2010
Sometimes, a period doesn't match any character
Recently I found that in regular expressions in GNU sed, ‘.’ failed to match some characters. At first it was very surprising, but there's a simple explanation in the GNU sed manual: