* doc/re.rdoc: use rdoc mode.

git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@24992 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
author: nobu <nobu@b2dd03c8-39d4-4d8f-98ff-823fe69b080e> 2009-09-18 01:11:55 +0000
committer: nobu <nobu@b2dd03c8-39d4-4d8f-98ff-823fe69b080e> 2009-09-18 01:11:55 +0000
commit: d7f76f84d923ac8c7ef585e1f43fc0d691625377 (patch)
tree: 3a9fe7c617df7b28d63e6b95ad917826a1994d60 /doc/re.rdoc
parent: 96ac19481183e11bd12ca0a704d90e415c4b3954 (diff)
download: ruby-d7f76f84d923ac8c7ef585e1f43fc0d691625377.tar.gz
1 files changed, 583 insertions, 582 deletions
diff --git a/doc/re.rdoc b/doc/re.rdoc
index 9671a7bd0b..6ca3325682 100644
--- a/doc/re.rdoc
+++ b/doc/re.rdoc
@@ -1,583 +1,584 @@
-# -*- coding: utf-8 -*-
-#  Regular expressions (<i>regexp</i>s) are patterns which describe the
-#  contents of a string. They're used for testing whether a string contains a
-#  given pattern, or extracting the portions that match. They are created
-#  with the <tt>/</tt><i>pat</i><tt>/</tt> and
-#  <tt>%r{</tt><i>pat</i><tt>}</tt> literals or the <tt>Regexp.new</tt>
-#  constructor.
-#
-#  A regexp is usually delimited with forward slashes (<tt>/</tt>). For
-#  example:
-#
-#      /hay/ =~ 'haystack'   #=> 0
-#      /y/.match('haystack') #=> #<MatchData "y">
-#
-#  If a string contains the pattern it is said to <i>match</i>. A literal
-#  string matches itself.
-#
-#      # 'haystack' does not contain the pattern 'needle', so doesn't match.
-#      /needle/.match('haystack') #=> nil
-#      # 'haystack' does contain the pattern 'hay', so it matches
-#      /hay/.match('haystack')    #=> #<MatchData "hay">
-#
-#  Specifically, <tt>/st/</tt> requires that the string contains the letter
-#  _s_ followed by the letter _t_, so it matches _haystack_, also.
-#
-#  == Metacharacters and Escapes
-#
-#  The following are <i>metacharacters</i> <tt>(</tt>, <tt>)</tt>,
-#  <tt>[</tt>, <tt>]</tt>, <tt>{</tt>, <tt>}</tt>, <tt>.</tt>, <tt>?</tt>,
-#  <tt>+</tt>, <tt>*</tt>. They have a specific meaning when appearing in a
-#  pattern. To match them literally they must be backslash-escaped. To match
-#  a backslash literally backslash-escape that: <tt>\\\\\\</tt>.
-#
-#      /1 \+ 2 = 3\?/.match('Does 1 + 2 = 3?') #=> #<MatchData "1 + 2 = 3?">
-#
-#  Patterns behave like double-quoted strings so can contain the same
-#  backslash escapes.
-#
-#      /\s\u{6771 4eac 90fd}/.match("Go to 東京都")
-#          #=> #<MatchData " 東京都">
-#
-#  Arbitrary Ruby expressions can be embedded into patterns with the
-#  <tt>#{...}</tt> construct.
-#
-#      place = "東京都"
-#      /#{place}/.match("Go to 東京都")
-#          #=> #<MatchData "東京都">
-#
-#  == Character Classes
-#
-#  A <i>character class</i> is delimited with square brackets (<tt>[</tt>,
-#  <tt>]</tt>) and lists characters that may appear at that point in the
-#  match. <tt>/[ab]/</tt> means _a_ or _b_, as opposed to <tt>/ab/</tt> which
-#  means _a_ followed by _b_.
-#
-#      /W[aeiou]rd/.match("Word") #=> #<MatchData "Word">
-#
-#  Within a character class the hyphen (<tt>-</tt>) is a metacharacter
-#  denoting an inclusive range of characters. <tt>[abcd]</tt> is equivalent
-#  to <tt>[a-d]</tt>. A range can be followed by another range, so
-#  <tt>[abcdwxyz]</tt> is equivalent to <tt>[a-dw-z]</tt>. The order in which
-#  ranges or individual characters appear inside a character class is
-#  irrelevant.
-#
-#      /[0-9a-f]/.match('9f') #=> #<MatchData "9">
-#      /[9f]/.match('9f')     #=> #<MatchData "9">
-#
-#  If the first character of a character class is a caret (<tt>^</tt>) the
-#  class is inverted: it matches any character _except_ those named.
-#
-#      /[^a-eg-z]/.match('f') #=> #<MatchData "f">
-#
-#  A character class may contain another character class. By itself this
-#  isn't useful because <tt>[a-z[0-9]]</tt> describes the same set as
-#  <tt>[a-z0-9]</tt>. However, character classes also support the <tt>&&</tt>
-#  operator which performs set intersection on its arguments. The two can be
-#  combined as follows:
-#
-#      /[a-w&&[^c-g]z]/ # ([a-w] AND ([^c-g] OR z))
-#      # This is equivalent to:
-#      /[abh-w]/
-#
-#  The following metacharacters also behave like character classes:
-#
-#  * <tt>/./</tt> - Any character except a newline.
-#  * <tt>/./m</tt> - Any character (the +m+ modifier enables multiline mode)
-#  * <tt>/\w/</tt> - A word character (<tt>[a-zA-Z0-9_]</tt>)
-#  * <tt>/\W/</tt> - A non-word character (<tt>[^a-zA-Z0-9_]</tt>)
-#  * <tt>/\d/</tt> - A digit character (<tt>[0-9]</tt>)
-#  * <tt>/\D/</tt> - A non-digit character (<tt>[^0-9]</tt>)
-#  * <tt>/\h/</tt> - A hexdigit character (<tt>[0-9a-fA-F]</tt>)
-#  * <tt>/\H/</tt> - A non-hexdigit character (<tt>[^0-9a-fA-F]</tt>)
-#  * <tt>/\s/</tt> - A whitespace character: <tt>/[ \t\r\n\f]/</tt>
-#  * <tt>/\S/</tt> - A non-whitespace character: <tt>/[^ \t\r\n\f]/</tt>
-#
-#  POSIX <i>bracket expressions</i> are also similar to character classes.
-#  They provide a portable alternative to the above, with the added benefit
-#  that they encompass non-ASCII characters. For instance, <tt>/\d/</tt>
-#  matches only the ASCII decimal digits (0-9); whereas <tt>/[[:digit:]]/</tt>
-#  matches any character in the Unicode _Nd_ category.
-#
-#  * <tt>/[[:alnum:]]/</tt> - Alphabetic and numeric character
-#  * <tt>/[[:alpha:]]/</tt> - Alphabetic character
-#  * <tt>/[[:blank:]]/</tt> - Space or tab
-#  * <tt>/[[:cntrl:]]/</tt> - Control character
-#  * <tt>/[[:digit:]]/</tt> - Digit
-#  * <tt>/[[:graph:]]/</tt> - Non-blank character (excludes spaces, control
-#    characters, and similar)
-#  * <tt>/[[:lower:]]/</tt> - Lowercase alphabetical character
-#  * <tt>/[[:print:]]/</tt> - Like [:graph:], but includes the space character
-#  * <tt>/[[:punct:]]/</tt> - Punctuation character
-#  * <tt>/[[:space:]]/</tt> - Whitespace character (<tt>[:blank:]</tt>, newline,
-#     carriage return, etc.)
-#  * <tt>/[[:upper:]]/</tt> - Uppercase alphabetical
-#  * <tt>/[[:xdigit:]]/</tt> - Digit allowed in a hexadecimal number (i.e.,
-#    0-9a-fA-F)
-#
-#  Ruby also supports the following non-POSIX character classes:
-#
-#  * <tt>/[[:word:]]/</tt> - A character in one of the following Unicode
-#    general categories _Letter_, _Mark_, _Number_,
-#    <i>Connector_Punctuation<i/i>
-#  * <tt>/[[:ascii:]]/</tt> - A character in the ASCII character set
-#
-#      # U+06F2 is "EXTENDED ARABIC-INDIC DIGIT TWO"
-#      /[[:digit:]]/.match("\u06F2")    #=> #<MatchData "\u{06F2}">
-#      /[[:upper:]][[:lower:]]/.match("Hello") #=> #<MatchData "He">
-#      /[[:xdigit:]][[:xdigit:]]/.match("A6")  #=> #<MatchData "A6">
-#
-#  == Repetition
-#
-#  The constructs described so far match a single character. They can be
-#  followed by a repetition metacharacter to specify how many times they need
-#  to occur. Such metacharacters are called <i>quantifiers</i>.
-#
-#  * <tt>*</tt> - Zero or more times
-#  * <tt>+</tt> - One or more times
-#  * <tt>?</tt> - Zero or one times (optional)
-#  * <tt>{</tt><i>n</i><tt>}</tt> - Exactly <i>n</i> times
-#  * <tt>{</tt><i>n</i><tt>,}</tt> - <i>n</i> or more times
-#  * <tt>{,</tt><i>m</i><tt>}</tt> - <i>m</i> or less times
-#  * <tt>{</tt><i>n</i><tt>,</tt><i>m</i><tt>}</tt> - At least <i>n</i> and
-#    at most <i>m</i> times
-#
-#      # At least one uppercase character ('H'), at least one lowercase
-#      # character ('e'), two 'l' characters, then one 'o'
-#      "Hello".match(/[[:upper:]]+[[:lower:]]+l{2}o/) #=> #<MatchData "Hello">
-#
-#  Repetition is <i>greedy</i> by default: as many occurrences as possible
-#  are matched while still allowing the overall match to succeed. By
-#  contrast, <i>lazy</i> matching makes the minimal amount of matches
-#  necessary for overall success. A greedy metacharacter can be made lazy by
-#  following it with <tt>?</tt>.
-#
-#      # Both patterns below match the string. The fist uses a greedy
-#      # quantifier so '.+' matches '<a><b>'; the second uses a lazy
-#      # quantifier so '.+?' matches '<a>'.
-#      /<.+>/.match("<a><b>")  #=> #<MatchData "<a><b>">
-#      /<.+?>/.match("<a><b>") #=> #<MatchData "<a>">
-#
-#  A quantifier followed by <tt>+</tt> matches <i>possessively</i>: once it
-#  has matched it does not backtrack. They behave like greedy quantifiers,
-#  but having matched they refuse to "give up" their match even if this
-#  jeopardises the overall match.
-#
-#  == Capturing
-#
-#  Parentheses can be used for <i>capturing</i>. The text enclosed by the
-#  <i>n</i><sup>th</sup> group of parentheses can be subsequently referred to
-#  with <i>n</i>. Within a pattern use the <i>backreference</i>
-#  <tt>\</tt><i>n</i>; outside of the pattern use
-#  <tt>MatchData[</tt><i>n</i><tt>]</tt>.
-#
-#      # 'at' is captured by the first group of parentheses, then referred to
-#      # later with \1
-#      /[csh](..) [csh]\1 in/.match("The cat sat in the hat")
-#          #=> #<MatchData "cat sat in" 1:"at">
-#      # Regexp#match returns a MatchData object which makes the captured
-#      # text available with its #[] method.
-#      /[csh](..) [csh]\1 in/.match("The cat sat in the hat")[1] #=> 'at'
-#
-#  Capture groups can be referred to by name when defined with the
-#  <tt>(?<</tt><i>name</i><tt>>)</tt> or <tt>(?'</tt><i>name</i><tt>')</tt>
-#  constructs.
-#
-#      /\$(?<dollars>\d+)\.(?<cents>\d+)/.match("$3.67")
-#          => #<MatchData "$3.67" dollars:"3" cents:"67">
-#      /\$(?<dollars>\d+)\.(?<cents>\d+)/.match("$3.67")[:dollars] #=> "3"
-#
-#  Named groups can be backreferenced with <tt>\k<</tt><i>name</i><tt>></tt>,
-#  where _name_ is the group name.
-#
-#      /(?<vowel>[aeiou]).\k<vowel>.\k<vowel>/.match('ototomy')
-#          #=> #<MatchData "ototo" vowel:"o">
-#
-#  *Note*: A regexp can't use named backreferences and numbered
-#  backreferences simultaneously.
-#
-#  When named capture groups are used with a literal regexp on the left-hand
-#  side of an expression and the <tt>=~</tt> operator, the captured text is
-#  also assigned to local variables with corresponding names.
-#
-#      /\$(?<dollars>\d+)\.(?<cents>\d+)/ =~ "$3.67" #=> 0
-#      dollars #=> "3"
-#
-#  == Grouping
-#
-#  Parentheses also <i>group</i> the terms they enclose, allowing them to be
-#  quantified as one <i>atomic</i> whole.
-#
-#      # The pattern below matches a vowel followed by 2 word characters:
-#      # 'aen'
-#      /[aeiou]\w{2}/.match("Caenorhabditis elegans") #=> #<MatchData "aen">
-#      # Whereas the following pattern matches a vowel followed by a word
-#      # character, twice, i.e. <tt>[aeiou]\w[aeiou]\w</tt>: 'enor'.
-#      /([aeiou]\w){2}/.match("Caenorhabditis elegans")
-#          #=> #<MatchData "enor" 1:"or">
-#
-#  The <tt>(?:</tt>...<tt>)</tt> construct provides grouping without
-#  capturing. That is, it combines the terms it contains into an atomic whole
-#  without creating a backreference. This benefits performance at the slight
-#  expense of readabilty.
-#
-#      # The group of parentheses captures 'n' and the second 'ti'. The
-#      # second group is referred to later with the backreference \2
-#      /I(n)ves(ti)ga\2ons/.match("Investigations")
-#          #=> #<MatchData "Investigations" 1:"n" 2:"ti">
-#      # The first group of parentheses is now made non-capturing with '?:',
-#      # so it still matches 'n', but doesn't create the backreference. Thus,
-#      # the backreference \1 now refers to 'ti'.
-#      /I(?:n)ves(ti)ga\1ons/.match("Investigations")
-#          #=> #<MatchData "Investigations" 1:"ti">
-#
-#  === Atomic Grouping
-#
-#  Grouping can be made <i>atomic</i> with
-#  <tt>(?></tt><i>pat</i><tt>)</tt>. This causes the subexpression <i>pat</i>
-#  to be matched independently of the rest of the expression such that what
-#  it matches becomes fixed for the remainder of the match, unless the entire
-#  subexpression must be abandoned and subsequently revisited. In this
-#  way <i>pat</i> is treated as a non-divisible whole. Atomic grouping is
-#  typically used to optimise patterns so as to prevent the regular
-#  expression engine from backtracking needlesly.
-#
-#      # The <tt>"</tt> in the pattern below matches the first character of
-#      # the string, then <tt>.*</tt> matches <i>Quote"</i>. This causes the
-#      # overall match to fail, so the text matched by <tt>.*</tt> is
-#      # backtracked by one position, which leaves the final character of the
-#      # string available to match <tt>"</tt>
-#            /".*"/.match('"Quote"')     #=> #<MatchData "\"Quote\"">
-#      # If <tt>.*</tt> is grouped atomically, it refuses to backtrack
-#      # <i>Quote"</i>, even though this means that the overall match fails
-#      /"(?>.*)"/.match('"Quote"') #=> nil
-#
-#  == Subexpression Calls
-#
-#  The <tt>\g<</tt><i>name</i><tt>></tt> syntax matches the previous
-#  subexpression named _name_, which can be a group name or number, again.
-#  This differs from backreferences in that it re-executes the group rather
-#  than simply trying to re-match the same text.
-#
-#      # Matches a <i>(</i> character and assigns it to the <tt>paren</tt>
-#      # group, tries to call that the <tt>paren</tt> sub-expression again
-#      # but fails, then matches a literal <i>)</i>.
-#      /\A(?<paren>\(\g<paren>*\))*\z/ =~ '()'
-#
-#
-#      /\A(?<paren>\(\g<paren>*\))*\z/ =~ '(())' #=> 0
-#      # ^1
-#      #      ^2
-#      #           ^3
-#      #                 ^4
-#      #      ^5
-#      #           ^6
-#      #                      ^7
-#      #                       ^8
-#      #                       ^9
-#      #                           ^10
-#
-#  1.  Matches at the beginning of the string, i.e. before the first
-#      character.
-#  2.  Enters a named capture group called <tt>paren</tt>
-#  3.  Matches a literal <i>(</i>, the first character in the string
-#  4.  Calls the <tt>paren</tt> group again, i.e. recurses back to the
-#      second step
-#  5.  Re-enters the <tt>paren</tt> group
-#  6.  Matches a literal <i>(</i>, the second character in the
-#      string
-#  7.  Try to call <tt>paren</tt> a third time, but fail because
-#      doing so would prevent an overall successful match
-#  8.  Match a literal <i>)</i>, the third character in the string.
-#      Marks the end of the second recursive call
-#  9.  Match a literal <i>)</i>, the fourth character in the string
-#  10. Match the end of the string
-#
-#  == Alternation
-#
-#  The vertical bar metacharacter (<tt>|</tt>) combines two expressions into
-#  a single one that matches either of the expressions. Each expression is an
-#  <i>alternative</i>.
-#
-#      /\w(and|or)\w/.match("Feliformia") #=> #<MatchData "form" 1:"or">
-#      /\w(and|or)\w/.match("furandi")    #=> #<MatchData "randi" 1:"and">
-#      /\w(and|or)\w/.match("dissemblance") #=> nil
-#
-#  == Character Properties
-#
-#  The <tt>\p{}</tt> construct matches characters with the named property,
-#  much like POSIX bracket classes.
-#
-#  * <tt>/\p{Alnum}/</tt> - Alphabetic and numeric character
-#  * <tt>/\p{Alpha}/</tt> - Alphabetic character
-#  * <tt>/\p{Blank}/</tt> - Space or tab
-#  * <tt>/\p{Cntrl}/</tt> - Control character
-#  * <tt>/\p{Digit}/</tt> - Digit
-#  * <tt>/\p{Graph}/</tt> - Non-blank character (excludes spaces, control
-#    characters, and similar)
-#  * <tt>/\p{Lower}/</tt> - Lowercase alphabetical character
-#  * <tt>/\p{Print}/</tt> - Like <tt>\p{Graph}</tt>, but includes the space character
-#  * <tt>/\p{Punct}/</tt> - Punctuation character
-#  * <tt>/\p{Space}/</tt> - Whitespace character (<tt>[:blank:]</tt>, newline,
-#    carriage return, etc.)
-#  * <tt>/\p{Upper}/</tt> - Uppercase alphabetical
-#  * <tt>/\p{XDigit}/</tt> - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
-#  * <tt>/\p{Word}/</tt> - A member of one of the following Unicode general
-#    category <i>Letter</i>, <i>Mark</i>, <i>Number</i>,
-#    <i>Connector\_Punctuation</i>
-#  * <tt>/\p{ASCII}/</tt> - A character in the ASCII character set
-#  * <tt>/\p{Any}/</tt> - Any Unicode character (including unassigned
-#    characters)
-#  * <tt>/\p{Assigned}/</tt> - An assigned character
-#
-#  A Unicode character's <i>General Category</i> value can also be matched
-#  with <tt>\p{</tt><i>Ab</i><tt>}</tt> where <i>Ab</i> is the category's
-#  abbreviation as described below:
-#
-#  * <tt>/\p{L}/</tt> - 'Letter'
-#  * <tt>/\p{Ll}/</tt> - 'Letter: Lowercase'
-#  * <tt>/\p{Lm}/</tt> - 'Letter: Mark'
-#  * <tt>/\p{Lo}/</tt> - 'Letter: Other'
-#  * <tt>/\p{Lt}/</tt> - 'Letter: Titlecase'
-#  * <tt>/\p{Lu}/</tt> - 'Letter: Uppercase
-#  * <tt>/\p{Lo}/</tt> - 'Letter: Other'
-#  * <tt>/\p{M}/</tt> - 'Mark'
-#  * <tt>/\p{Mn}/</tt> - 'Mark: Nonspacing'
-#  * <tt>/\p{Mc}/</tt> - 'Mark: Spacing Combining'
-#  * <tt>/\p{Me}/</tt> - 'Mark: Enclosing'
-#  * <tt>/\p{N}/</tt> - 'Number'
-#  * <tt>/\p{Nd}/</tt> - 'Number: Decimal Digit'
-#  * <tt>/\p{Nl}/</tt> - 'Number: Letter'
-#  * <tt>/\p{No}/</tt> - 'Number: Other'
-#  * <tt>/\p{P}/</tt> - 'Punctuation'
-#  * <tt>/\p{Pc}/</tt> - 'Punctuation: Connector'
-#  * <tt>/\p{Pd}/</tt> - 'Punctuation: Dash'
-#  * <tt>/\p{Ps}/</tt> - 'Punctuation: Open'
-#  * <tt>/\p{Pe}/</tt> - 'Punctuation: Close'
-#  * <tt>/\p{Pi}/</tt> - 'Punctuation: Initial Quote'
-#  * <tt>/\p{Pf}/</tt> - 'Punctuation: Final Quote'
-#  * <tt>/\p{Po}/</tt> - 'Punctuation: Other'
-#  * <tt>/\p{S}/</tt> - 'Symbol'
-#  * <tt>/\p{Sm}/</tt> - 'Symbol: Math'
-#  * <tt>/\p{Sc}/</tt> - 'Symbol: Currency'
-#  * <tt>/\p{Sc}/</tt> - 'Symbol: Currency'
-#  * <tt>/\p{Sk}/</tt> - 'Symbol: Modifier'
-#  * <tt>/\p{So}/</tt> - 'Symbol: Other'
-#  * <tt>/\p{Z}/</tt> - 'Separator'
-#  * <tt>/\p{Zs}/</tt> - 'Separator: Space'
-#  * <tt>/\p{Zl}/</tt> - 'Separator: Line'
-#  * <tt>/\p{Zp}/</tt> - 'Separator: Paragraph'
-#  * <tt>/\p{C}/</tt> - 'Other'
-#  * <tt>/\p{Cc}/</tt> - 'Other: Control'
-#  * <tt>/\p{Cf}/</tt> - 'Other: Format'
-#  * <tt>/\p{Cn}/</tt> - 'Other: Not Assigned'
-#  * <tt>/\p{Co}/</tt> - 'Other: Private Use'
-#  * <tt>/\p{Cs}/</tt> - 'Other: Surrogate'
-#
-#  Lastly, <tt>\p{}</tt> matches a character's Unicode <i>script</i>. The
-#  following scripts are supported: <i>Arabic</i>, <i>Armenian</i>,
-#  <i>Balinese</i>, <i>Bengali</i>, <i>Bopomofo</i>, <i>Braille</i>,
-#  <i>Buginese</i>, <i>Buhid</i>, <i>Canadian_Aboriginal</i>, <i>Carian</i>,
-#  <i>Cham</i>, <i>Cherokee</i>, <i>Common</i>, <i>Coptic</i>,
-#  <i>Cuneiform</i>, <i>Cypriot</i>, <i>Cyrillic</i>, <i>Deseret</i>,
-#  <i>Devanagari</i>, <i>Ethiopic</i>, <i>Georgian</i>, <i>Glagolitic</i>,
-#  <i>Gothic</i>, <i>Greek</i>, <i>Gujarati</i>, <i>Gurmukhi</i>, <i>Han</i>,
-#  <i>Hangul</i>, <i>Hanunoo</i>, <i>Hebrew</i>, <i>Hiragana</i>,
-#  <i>Inherited</i>, <i>Kannada</i>, <i>Katakana</i>, <i>Kayah_Li</i>,
-#  <i>Kharoshthi</i>, <i>Khmer</i>, <i>Lao</i>, <i>Latin</i>, <i>Lepcha</i>,
-#  <i>Limbu</i>, <i>Linear_B</i>, <i>Lycian</i>, <i>Lydian</i>,
-#  <i>Malayalam</i>, <i>Mongolian</i>, <i>Myanmar</i>, <i>New_Tai_Lue</i>,
-#  <i>Nko</i>, <i>Ogham</i>, <i>Ol_Chiki</i>, <i>Old_Italic</i>,
-#  <i>Old_Persian</i>, <i>Oriya</i>, <i>Osmanya</i>, <i>Phags_Pa</i>,
-#  <i>Phoenician</i>, <i>Rejang</i>, <i>Runic</i>, <i>Saurashtra</i>,
-#  <i>Shavian</i>, <i>Sinhala</i>, <i>Sundanese</i>, <i>Syloti_Nagri</i>,
-#  <i>Syriac</i>, <i>Tagalog</i>, <i>Tagbanwa</i>, <i>Tai_Le</i>,
-#  <i>Tamil</i>, <i>Telugu</i>, <i>Thaana</i>, <i>Thai</i>, <i>Tibetan</i>,
-#  <i>Tifinagh</i>, <i>Ugaritic</i>, <i>Vai</i>, and <i>Yi</i>.
-#
-#      # Unicode codepoint U+06E9 is named "ARABIC PLACE OF SAJDAH" and
-#      # belongs to the Arabic script.
-#      /\p{Arabic}/.match("\u06E9") #=> #<MatchData "\u06E9">
-#
-#  All character properties can be inverted by prefixing their name with a
-#  caret (<tt>^</tt>).
-#
-#      # Letter 'A' is not in the Unicode Ll (Letter; Lowercase) category, so
-#      # this match succeeds
-#      /\p{^Ll}/.match("A") #=> #<MatchData "A">
-#
-#  == Anchors
-#
-#  Anchors are metacharacter that match the zero-width positions between
-#  characters, <i>anchoring</i> the match to a specific position.
-#
-#  * <tt>^</tt> - Matches beginning of line
-#  * <tt>$</tt> - Matches end of line
-#  * <tt>\A</tt> - Matches beginning of string.
-#  * <tt>\Z</tt> - Matches end of string. If string ends with a newline,
-#    it matches just before newline
-#  * <tt>\z</tt> - Matches end of string
-#  * <tt>\G</tt> - Matches point where last match finished
-#  * <tt>\b</tt> - Matches word boundaries when outside brackets; backspace
-#     (0x08) inside brackets
-#  * <tt>\B</tt> - Matches non-word boundaries
-#  * <tt>(?=</tt><i>pat</i><tt>)</tt> - <i>Positive lookahead</i> assertion:
-#    ensures that the following characters match <i>pat</i>, but doesn't
-#    include those characters in the matched text
-#  * <tt>(?!</tt><i>pat</i><tt>)</tt> - <i>Negative lookahead</i> assertion:
-#    ensures that the following characters do not match <i>pat</i>, but
-#    doesn't include those characters in the matched text
-#  * <tt>(?<=</tt><i>pat</i><tt>)</tt> - <i>Positive lookbehind</i>
-#    assertion: ensures that the preceding characters match <i>pat</i>, but
-#    doesn't include those characters in the matched text
-#  * <tt>(?<!</tt><i>pat</i><tt>)</tt> - <i>Negative lookbehind</i>
-#    assertion: ensures that the preceding characters do not match
-#    <i>pat</i>, but doesn't include those characters in the matched text
-#
-#      # If a pattern isn't anchored it can begin at any point in the string
-#      /real/.match("surrealist") #=> #<MatchData "real">
-#      # Anchoring the pattern to the beginning of the string forces the
-#      # match to start there. 'real' doesn't occur at the beginning of the
-#      # string, so now the match fails
-#      /\Areal/.match("surrealist") #=> nil
-#      # The match below fails because although 'Demand' contains 'and', the
-#      pattern does not occur at a word boundary.
-#      /\band/.match("Demand")
-#      # Whereas in the following example 'and' has been anchored to a
-#      # non-word boundary so instead of matching the first 'and' it matches
-#      # from the fourth letter of 'demand' instead
-#      /\Band.+/.match("Supply and demand curve") #=> #<MatchData "and curve">
-#      # The pattern below uses positive lookahead and positive lookbehind to
-#      # match text appearing in <b></b> tags without including the tags in the
-#      # match
-#      /(?<=<b>)\w+(?=<\/b>)/.match("Fortune favours the <b>bold</b>")
-#          #=> #<MatchData "bold">
-#
-#  == Options
-#
-#  The end delimiter for a regexp can be followed by one or more single-letter
-#  options which control how the pattern can match.
-#
-#  * <tt>/pat/i</tt> - Ignore case
-#  * <tt>/pat/m</tt> - Treat a newline as a character matched by <tt>.</tt>
-#  * <tt>/pat/x</tt> - Ignore whitespace and comments in the pattern
-#  * <tt>/pat/o</tt> - Perform <tt>#{}</tt> interpolation only once
-#
-#  <tt>i</tt>, <tt>m</tt>, and <tt>x</tt> can also be applied on the
-#  subexpression level with the
-#  <tt>(?</tt><i>on</i><tt>-</tt><i>off</i><tt>)</tt> construct, which
-#  enables options <i>on</i>, and disables options <i>off</i> for the
-#  expression enclosed by the parentheses.
-#
-#      /a(?i:b)c/.match('aBc') #=> #<MatchData "aBc">
-#      /a(?i:b)c/.match('abc') #=> #<MatchData "abc">
-#
-#  == Free-Spacing Mode and Comments
-#
-#  As mentioned above, the <tt>x</tt> option enables <i>free-spacing</i>
-#  mode. Literal white space inside the pattern is ignored, and the
-#  octothorpe (<tt>#</tt>) character introduces a comment until the end of
-#  the line. This allows the components of the pattern to be organised in a
-#  potentially more readable fashion.
-#
-#      # A contrived pattern to match a number with optional decimal places
-#      float_pat = /\A
-#          [[:digit:]]+ # 1 or more digits before the decimal point
-#          (\.          # Decimal point
-#              [[:digit:]]+ # 1 or more digits after the decimal point
-#          )? # The decimal point and following digits are optional
-#      \Z/x
-#      float_pat.match('3.14') #=> #<MatchData "3.14" 1:".14">
-#
-#  *Note*: To match whitespace in an <tt>x</tt> pattern use an escape such as
-#  <tt>\s</tt> or <tt>\p{Space}</tt>.
-#
-#  Comments can be included in a non-<tt>x</tt> pattern with the
-#  <tt>(?#</tt><i>comment</i><tt>)</tt> construct, where <i>comment</i> is
-#  arbitrary text ignored by the regexp engine.
-#
-#  == Encoding
-#
-#  Regular expressions are assumed to use the source encoding. This can be
-#  overridden with one of the following modifiers.
-#
-#  * <tt>/</tt><i>pat</i><tt>/u</tt> - UTF-8
-#  * <tt>/</tt><i>pat</i><tt>/e</tt> - EUC-JP
-#  * <tt>/</tt><i>pat</i><tt>/s</tt> - Windows-31J
-#  * <tt>/</tt><i>pat</i><tt>/n</tt> - ASCII-8BIT
-#
-#  A regexp can be matched against a string when they either share an
-#  encoding, or the regexp's encoding is _US-ASCII_ and the string's encoding
-#  is ASCII-compatible.
-#
-#  If a match between incompatible encodings is attempted an
-#  <tt>Encoding::CompatibilityError</tt> exception is raised.
-#
-#  The <tt>Regexp#fixed_encoding?</tt> predicate indicates whether the regexp
-#  has a <i>fixed</i> encoding, that is one incompatible with ASCII. A
-#  regexp's encoding can be explicitly fixed by supplying
-#  <tt>Regexp::FIXEDENCODING</tt> as the second argument of
-#  <tt>Regexp.new</tt>:
-#
-#      r = Regexp.new("a".force_encoding("iso-8859-1"),Regexp::FIXEDENCODING)
-#      r =~"a\u3042"
-#         #=> Encoding::CompatibilityError: incompatible encoding regexp match
-#              (ISO-8859-1 regexp with UTF-8 string)
-#
-#  == Performance
-#
-#  Certain pathological combinations of constructs can lead to abysmally bad
-#  performance.
-#
-#  Consider a string of 25 <i>a</i>s, a <i>d</i>, 4 <i>a</i>s, and a
-#  <i>c</i>.
-#
-#      s = 'a' * 25 + 'd' 'a' * 4 + 'c'
-#          #=> "aaaaaaaaaaaaaaaaaaaaaaaaadadadadac"
-#
-#  The following patterns match instantly as you would expect:
-#
-#      /(b|a)/ =~ s #=> 0
-#      /(b|a+)/ =~ s #=> 0
-#      /(b|a+)*\/ =~ s #=> 0
-#
-#  However, the following pattern takes appreciably longer:
-#
-#      /(b|a+)*c/ =~ s #=> 32
-#
-#  This happens because an atom in the regexp is quantified by both an
-#  immediate <tt>+</tt> and an enclosing <tt>*</tt> with nothing to
-#  differentiate which is in control of any particular character. The
-#  nondeterminism that results produces super-linear performance. (Consult
-#  <i>Mastering Regular Expressions</i> (3rd ed.), pp 222, by
-#  <i>Jeffery Friedl</i>, for an in-depth analysis). This particular case
-#  can be fixed by use of atomic grouping, which prevents the unnecessary
-#  backtracking:
-#
-#      (start = Time.now) && /(b|a+)*c/ =~ s && (Time.now - start)
-#         #=> 24.702736882
-#      (start = Time.now) && /(?>b|a+)*c/ =~ s && (Time.now - start)
-#         #=> 0.000166571
-#
-#  A similar case is typified by the following example, which takes
-#  approximately 60 seconds to execute for me:
-#
-#      # Match a string of 29 <i>a</i>s against a pattern of 29 optional
-#      # <i>a</i>s followed by 29 mandatory <i>a</i>s.
-#      Regexp.new('a?' * 29 + 'a' * 29) =~ 'a' * 29
-#
-#  The 29 optional <i>a</i>s match the string, but this prevents the 29
-#  mandatory <i>a</i>s that follow from matching. Ruby must then backtrack
-#  repeatedly so as to satisfy as many of the optional matches as it can
-#  while still matching the mandatory 29. It is plain to us that none of the
-#  optional matches can succeed, but this fact unfortunately eludes Ruby.
-#
-#  One approach for improving performance is to anchor the match to the
-#  beginning of the string, thus significantly reducing the amount of
-#  backtracking needed.
-#
-#      Regexp.new('\A' 'a?' * 29 + 'a' * 29).match('a' * 29)
-#          #=> #<MatchData "aaaaaaaaaaaaaaaaaaaaaaaaaaaaa">
-#
-#
+# -*- mode: rdoc; coding: utf-8; fill-column: 74; -*-
+=begin rdoc
+
+Regular expressions (<i>regexp</i>s) are patterns which describe the
+contents of a string. They're used for testing whether a string contains a
+given pattern, or extracting the portions that match. They are created
+with the <tt>/</tt><i>pat</i><tt>/</tt> and
+<tt>%r{</tt><i>pat</i><tt>}</tt> literals or the <tt>Regexp.new</tt>
+constructor.
+
+A regexp is usually delimited with forward slashes (<tt>/</tt>). For
+example:
+
+    /hay/ =~ 'haystack'   #=> 0
+    /y/.match('haystack') #=> #<MatchData "y">
+
+If a string contains the pattern it is said to <i>match</i>. A literal
+string matches itself.
+
+    # 'haystack' does not contain the pattern 'needle', so doesn't match.
+    /needle/.match('haystack') #=> nil
+    # 'haystack' does contain the pattern 'hay', so it matches
+    /hay/.match('haystack')    #=> #<MatchData "hay">
+
+Specifically, <tt>/st/</tt> requires that the string contains the letter
+_s_ followed by the letter _t_, so it matches _haystack_, also.
+
+== Metacharacters and Escapes
+
+The following are <i>metacharacters</i> <tt>(</tt>, <tt>)</tt>,
+<tt>[</tt>, <tt>]</tt>, <tt>{</tt>, <tt>}</tt>, <tt>.</tt>, <tt>?</tt>,
+<tt>+</tt>, <tt>*</tt>. They have a specific meaning when appearing in a
+pattern. To match them literally they must be backslash-escaped. To match
+a backslash literally backslash-escape that: <tt>\\\\\\</tt>.
+
+    /1 \+ 2 = 3\?/.match('Does 1 + 2 = 3?') #=> #<MatchData "1 + 2 = 3?">
+
+Patterns behave like double-quoted strings so can contain the same
+backslash escapes.
+
+    /\s\u{6771 4eac 90fd}/.match("Go to 東京都")
+        #=> #<MatchData " 東京都">
+
+Arbitrary Ruby expressions can be embedded into patterns with the
+<tt>#{...}</tt> construct.
+
+    place = "東京都"
+    /#{place}/.match("Go to 東京都")
+        #=> #<MatchData "東京都">
+
+== Character Classes
+
+A <i>character class</i> is delimited with square brackets (<tt>[</tt>,
+<tt>]</tt>) and lists characters that may appear at that point in the
+match. <tt>/[ab]/</tt> means _a_ or _b_, as opposed to <tt>/ab/</tt> which
+means _a_ followed by _b_.
+
+    /W[aeiou]rd/.match("Word") #=> #<MatchData "Word">
+
+Within a character class the hyphen (<tt>-</tt>) is a metacharacter
+denoting an inclusive range of characters. <tt>[abcd]</tt> is equivalent
+to <tt>[a-d]</tt>. A range can be followed by another range, so
+<tt>[abcdwxyz]</tt> is equivalent to <tt>[a-dw-z]</tt>. The order in which
+ranges or individual characters appear inside a character class is
+irrelevant.
+
+    /[0-9a-f]/.match('9f') #=> #<MatchData "9">
+    /[9f]/.match('9f')     #=> #<MatchData "9">
+
+If the first character of a character class is a caret (<tt>^</tt>) the
+class is inverted: it matches any character _except_ those named.
+
+    /[^a-eg-z]/.match('f') #=> #<MatchData "f">
+
+A character class may contain another character class. By itself this
+isn't useful because <tt>[a-z[0-9]]</tt> describes the same set as
+<tt>[a-z0-9]</tt>. However, character classes also support the <tt>&&</tt>
+operator which performs set intersection on its arguments. The two can be
+combined as follows:
+
+    /[a-w&&[^c-g]z]/ # ([a-w] AND ([^c-g] OR z))
+    # This is equivalent to:
+    /[abh-w]/
+
+The following metacharacters also behave like character classes:
+
+* <tt>/./</tt> - Any character except a newline.
+* <tt>/./m</tt> - Any character (the +m+ modifier enables multiline mode)
+* <tt>/\w/</tt> - A word character (<tt>[a-zA-Z0-9_]</tt>)
+* <tt>/\W/</tt> - A non-word character (<tt>[^a-zA-Z0-9_]</tt>)
+* <tt>/\d/</tt> - A digit character (<tt>[0-9]</tt>)
+* <tt>/\D/</tt> - A non-digit character (<tt>[^0-9]</tt>)
+* <tt>/\h/</tt> - A hexdigit character (<tt>[0-9a-fA-F]</tt>)
+* <tt>/\H/</tt> - A non-hexdigit character (<tt>[^0-9a-fA-F]</tt>)
+* <tt>/\s/</tt> - A whitespace character: <tt>/[ \t\r\n\f]/</tt>
+* <tt>/\S/</tt> - A non-whitespace character: <tt>/[^ \t\r\n\f]/</tt>
+
+POSIX <i>bracket expressions</i> are also similar to character classes.
+They provide a portable alternative to the above, with the added benefit
+that they encompass non-ASCII characters. For instance, <tt>/\d/</tt>
+matches only the ASCII decimal digits (0-9); whereas <tt>/[[:digit:]]/</tt>
+matches any character in the Unicode _Nd_ category.
+
+* <tt>/[[:alnum:]]/</tt> - Alphabetic and numeric character
+* <tt>/[[:alpha:]]/</tt> - Alphabetic character
+* <tt>/[[:blank:]]/</tt> - Space or tab
+* <tt>/[[:cntrl:]]/</tt> - Control character
+* <tt>/[[:digit:]]/</tt> - Digit
+* <tt>/[[:graph:]]/</tt> - Non-blank character (excludes spaces, control
+  characters, and similar)
+* <tt>/[[:lower:]]/</tt> - Lowercase alphabetical character
+* <tt>/[[:print:]]/</tt> - Like [:graph:], but includes the space character
+* <tt>/[[:punct:]]/</tt> - Punctuation character
+* <tt>/[[:space:]]/</tt> - Whitespace character (<tt>[:blank:]</tt>, newline,
+   carriage return, etc.)
+* <tt>/[[:upper:]]/</tt> - Uppercase alphabetical
+* <tt>/[[:xdigit:]]/</tt> - Digit allowed in a hexadecimal number (i.e.,
+  0-9a-fA-F)
+
+Ruby also supports the following non-POSIX character classes:
+
+* <tt>/[[:word:]]/</tt> - A character in one of the following Unicode
+  general categories _Letter_, _Mark_, _Number_,
+  <i>Connector_Punctuation<i/i>
+* <tt>/[[:ascii:]]/</tt> - A character in the ASCII character set
+
+    # U+06F2 is "EXTENDED ARABIC-INDIC DIGIT TWO"
+    /[[:digit:]]/.match("\u06F2")    #=> #<MatchData "\u{06F2}">
+    /[[:upper:]][[:lower:]]/.match("Hello") #=> #<MatchData "He">
+    /[[:xdigit:]][[:xdigit:]]/.match("A6")  #=> #<MatchData "A6">
+
+== Repetition
+
+The constructs described so far match a single character. They can be
+followed by a repetition metacharacter to specify how many times they need
+to occur. Such metacharacters are called <i>quantifiers</i>.
+
+* <tt>*</tt> - Zero or more times
+* <tt>+</tt> - One or more times
+* <tt>?</tt> - Zero or one times (optional)
+* <tt>{</tt><i>n</i><tt>}</tt> - Exactly <i>n</i> times
+* <tt>{</tt><i>n</i><tt>,}</tt> - <i>n</i> or more times
+* <tt>{,</tt><i>m</i><tt>}</tt> - <i>m</i> or less times
+* <tt>{</tt><i>n</i><tt>,</tt><i>m</i><tt>}</tt> - At least <i>n</i> and
+  at most <i>m</i> times
+
+    # At least one uppercase character ('H'), at least one lowercase
+    # character ('e'), two 'l' characters, then one 'o'
+    "Hello".match(/[[:upper:]]+[[:lower:]]+l{2}o/) #=> #<MatchData "Hello">
+
+Repetition is <i>greedy</i> by default: as many occurrences as possible
+are matched while still allowing the overall match to succeed. By
+contrast, <i>lazy</i> matching makes the minimal amount of matches
+necessary for overall success. A greedy metacharacter can be made lazy by
+following it with <tt>?</tt>.
+
+    # Both patterns below match the string. The fist uses a greedy
+    # quantifier so '.+' matches '<a><b>'; the second uses a lazy
+    # quantifier so '.+?' matches '<a>'.
+    /<.+>/.match("<a><b>")  #=> #<MatchData "<a><b>">
+    /<.+?>/.match("<a><b>") #=> #<MatchData "<a>">
+
+A quantifier followed by <tt>+</tt> matches <i>possessively</i>: once it
+has matched it does not backtrack. They behave like greedy quantifiers,
+but having matched they refuse to "give up" their match even if this
+jeopardises the overall match.
+
+== Capturing
+
+Parentheses can be used for <i>capturing</i>. The text enclosed by the
+<i>n</i><sup>th</sup> group of parentheses can be subsequently referred to
+with <i>n</i>. Within a pattern use the <i>backreference</i>
+<tt>\</tt><i>n</i>; outside of the pattern use
+<tt>MatchData[</tt><i>n</i><tt>]</tt>.
+
+    # 'at' is captured by the first group of parentheses, then referred to
+    # later with \1
+    /[csh](..) [csh]\1 in/.match("The cat sat in the hat")
+        #=> #<MatchData "cat sat in" 1:"at">
+    # Regexp#match returns a MatchData object which makes the captured
+    # text available with its #[] method.
+    /[csh](..) [csh]\1 in/.match("The cat sat in the hat")[1] #=> 'at'
+
+Capture groups can be referred to by name when defined with the
+<tt>(?<</tt><i>name</i><tt>>)</tt> or <tt>(?'</tt><i>name</i><tt>')</tt>
+constructs.
+
+    /\$(?<dollars>\d+)\.(?<cents>\d+)/.match("$3.67")
+        => #<MatchData "$3.67" dollars:"3" cents:"67">
+    /\$(?<dollars>\d+)\.(?<cents>\d+)/.match("$3.67")[:dollars] #=> "3"
+
+Named groups can be backreferenced with <tt>\k<</tt><i>name</i><tt>></tt>,
+where _name_ is the group name.
+
+    /(?<vowel>[aeiou]).\k<vowel>.\k<vowel>/.match('ototomy')
+        #=> #<MatchData "ototo" vowel:"o">
+
+*Note*: A regexp can't use named backreferences and numbered
+backreferences simultaneously.
+
+When named capture groups are used with a literal regexp on the left-hand
+side of an expression and the <tt>=~</tt> operator, the captured text is
+also assigned to local variables with corresponding names.
+
+    /\$(?<dollars>\d+)\.(?<cents>\d+)/ =~ "$3.67" #=> 0
+    dollars #=> "3"
+
+== Grouping
+
+Parentheses also <i>group</i> the terms they enclose, allowing them to be
+quantified as one <i>atomic</i> whole.
+
+    # The pattern below matches a vowel followed by 2 word characters:
+    # 'aen'
+    /[aeiou]\w{2}/.match("Caenorhabditis elegans") #=> #<MatchData "aen">
+    # Whereas the following pattern matches a vowel followed by a word
+    # character, twice, i.e. <tt>[aeiou]\w[aeiou]\w</tt>: 'enor'.
+    /([aeiou]\w){2}/.match("Caenorhabditis elegans")
+        #=> #<MatchData "enor" 1:"or">
+
+The <tt>(?:</tt>...<tt>)</tt> construct provides grouping without
+capturing. That is, it combines the terms it contains into an atomic whole
+without creating a backreference. This benefits performance at the slight
+expense of readabilty.
+
+    # The group of parentheses captures 'n' and the second 'ti'. The
+    # second group is referred to later with the backreference \2
+    /I(n)ves(ti)ga\2ons/.match("Investigations")
+        #=> #<MatchData "Investigations" 1:"n" 2:"ti">
+    # The first group of parentheses is now made non-capturing with '?:',
+    # so it still matches 'n', but doesn't create the backreference. Thus,
+    # the backreference \1 now refers to 'ti'.
+    /I(?:n)ves(ti)ga\1ons/.match("Investigations")
+        #=> #<MatchData "Investigations" 1:"ti">
+
+=== Atomic Grouping
+
+Grouping can be made <i>atomic</i> with
+<tt>(?></tt><i>pat</i><tt>)</tt>. This causes the subexpression <i>pat</i>
+to be matched independently of the rest of the expression such that what
+it matches becomes fixed for the remainder of the match, unless the entire
+subexpression must be abandoned and subsequently revisited. In this
+way <i>pat</i> is treated as a non-divisible whole. Atomic grouping is
+typically used to optimise patterns so as to prevent the regular
+expression engine from backtracking needlesly.
+
+    # The <tt>"</tt> in the pattern below matches the first character of
+    # the string, then <tt>.*</tt> matches <i>Quote"</i>. This causes the
+    # overall match to fail, so the text matched by <tt>.*</tt> is
+    # backtracked by one position, which leaves the final character of the
+    # string available to match <tt>"</tt>
+          /".*"/.match('"Quote"')     #=> #<MatchData "\"Quote\"">
+    # If <tt>.*</tt> is grouped atomically, it refuses to backtrack
+    # <i>Quote"</i>, even though this means that the overall match fails
+    /"(?>.*)"/.match('"Quote"') #=> nil
+
+== Subexpression Calls
+
+The <tt>\g<</tt><i>name</i><tt>></tt> syntax matches the previous
+subexpression named _name_, which can be a group name or number, again.
+This differs from backreferences in that it re-executes the group rather
+than simply trying to re-match the same text.
+
+    # Matches a <i>(</i> character and assigns it to the <tt>paren</tt>
+    # group, tries to call that the <tt>paren</tt> sub-expression again
+    # but fails, then matches a literal <i>)</i>.
+    /\A(?<paren>\(\g<paren>*\))*\z/ =~ '()'
+
+
+    /\A(?<paren>\(\g<paren>*\))*\z/ =~ '(())' #=> 0
+    # ^1
+    #      ^2
+    #           ^3
+    #                 ^4
+    #      ^5
+    #           ^6
+    #                      ^7
+    #                       ^8
+    #                       ^9
+    #                           ^10
+
+1.  Matches at the beginning of the string, i.e. before the first
+    character.
+2.  Enters a named capture group called <tt>paren</tt>
+3.  Matches a literal <i>(</i>, the first character in the string
+4.  Calls the <tt>paren</tt> group again, i.e. recurses back to the
+    second step
+5.  Re-enters the <tt>paren</tt> group
+6.  Matches a literal <i>(</i>, the second character in the
+    string
+7.  Try to call <tt>paren</tt> a third time, but fail because
+    doing so would prevent an overall successful match
+8.  Match a literal <i>)</i>, the third character in the string.
+    Marks the end of the second recursive call
+9.  Match a literal <i>)</i>, the fourth character in the string
+10. Match the end of the string
+
+== Alternation
+
+The vertical bar metacharacter (<tt>|</tt>) combines two expressions into
+a single one that matches either of the expressions. Each expression is an
+<i>alternative</i>.
+
+    /\w(and|or)\w/.match("Feliformia") #=> #<MatchData "form" 1:"or">
+    /\w(and|or)\w/.match("furandi")    #=> #<MatchData "randi" 1:"and">
+    /\w(and|or)\w/.match("dissemblance") #=> nil
+
+== Character Properties
+
+The <tt>\p{}</tt> construct matches characters with the named property,
+much like POSIX bracket classes.
+
+* <tt>/\p{Alnum}/</tt> - Alphabetic and numeric character
+* <tt>/\p{Alpha}/</tt> - Alphabetic character
+* <tt>/\p{Blank}/</tt> - Space or tab
+* <tt>/\p{Cntrl}/</tt> - Control character
+* <tt>/\p{Digit}/</tt> - Digit
+* <tt>/\p{Graph}/</tt> - Non-blank character (excludes spaces, control
+  characters, and similar)
+* <tt>/\p{Lower}/</tt> - Lowercase alphabetical character
+* <tt>/\p{Print}/</tt> - Like <tt>\p{Graph}</tt>, but includes the space character
+* <tt>/\p{Punct}/</tt> - Punctuation character
+* <tt>/\p{Space}/</tt> - Whitespace character (<tt>[:blank:]</tt>, newline,
+  carriage return, etc.)
+* <tt>/\p{Upper}/</tt> - Uppercase alphabetical
+* <tt>/\p{XDigit}/</tt> - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
+* <tt>/\p{Word}/</tt> - A member of one of the following Unicode general
+  category <i>Letter</i>, <i>Mark</i>, <i>Number</i>,
+  <i>Connector\_Punctuation</i>
+* <tt>/\p{ASCII}/</tt> - A character in the ASCII character set
+* <tt>/\p{Any}/</tt> - Any Unicode character (including unassigned
+  characters)
+* <tt>/\p{Assigned}/</tt> - An assigned character
+
+A Unicode character's <i>General Category</i> value can also be matched
+with <tt>\p{</tt><i>Ab</i><tt>}</tt> where <i>Ab</i> is the category's
+abbreviation as described below:
+
+* <tt>/\p{L}/</tt> - 'Letter'
+* <tt>/\p{Ll}/</tt> - 'Letter: Lowercase'
+* <tt>/\p{Lm}/</tt> - 'Letter: Mark'
+* <tt>/\p{Lo}/</tt> - 'Letter: Other'
+* <tt>/\p{Lt}/</tt> - 'Letter: Titlecase'
+* <tt>/\p{Lu}/</tt> - 'Letter: Uppercase
+* <tt>/\p{Lo}/</tt> - 'Letter: Other'
+* <tt>/\p{M}/</tt> - 'Mark'
+* <tt>/\p{Mn}/</tt> - 'Mark: Nonspacing'
+* <tt>/\p{Mc}/</tt> - 'Mark: Spacing Combining'
+* <tt>/\p{Me}/</tt> - 'Mark: Enclosing'
+* <tt>/\p{N}/</tt> - 'Number'
+* <tt>/\p{Nd}/</tt> - 'Number: Decimal Digit'
+* <tt>/\p{Nl}/</tt> - 'Number: Letter'
+* <tt>/\p{No}/</tt> - 'Number: Other'
+* <tt>/\p{P}/</tt> - 'Punctuation'
+* <tt>/\p{Pc}/</tt> - 'Punctuation: Connector'
+* <tt>/\p{Pd}/</tt> - 'Punctuation: Dash'
+* <tt>/\p{Ps}/</tt> - 'Punctuation: Open'
+* <tt>/\p{Pe}/</tt> - 'Punctuation: Close'
+* <tt>/\p{Pi}/</tt> - 'Punctuation: Initial Quote'
+* <tt>/\p{Pf}/</tt> - 'Punctuation: Final Quote'
+* <tt>/\p{Po}/</tt> - 'Punctuation: Other'
+* <tt>/\p{S}/</tt> - 'Symbol'
+* <tt>/\p{Sm}/</tt> - 'Symbol: Math'
+* <tt>/\p{Sc}/</tt> - 'Symbol: Currency'
+* <tt>/\p{Sc}/</tt> - 'Symbol: Currency'
+* <tt>/\p{Sk}/</tt> - 'Symbol: Modifier'
+* <tt>/\p{So}/</tt> - 'Symbol: Other'
+* <tt>/\p{Z}/</tt> - 'Separator'
+* <tt>/\p{Zs}/</tt> - 'Separator: Space'
+* <tt>/\p{Zl}/</tt> - 'Separator: Line'
+* <tt>/\p{Zp}/</tt> - 'Separator: Paragraph'
+* <tt>/\p{C}/</tt> - 'Other'
+* <tt>/\p{Cc}/</tt> - 'Other: Control'
+* <tt>/\p{Cf}/</tt> - 'Other: Format'
+* <tt>/\p{Cn}/</tt> - 'Other: Not Assigned'
+* <tt>/\p{Co}/</tt> - 'Other: Private Use'
+* <tt>/\p{Cs}/</tt> - 'Other: Surrogate'
+
+Lastly, <tt>\p{}</tt> matches a character's Unicode <i>script</i>. The
+following scripts are supported: <i>Arabic</i>, <i>Armenian</i>,
+<i>Balinese</i>, <i>Bengali</i>, <i>Bopomofo</i>, <i>Braille</i>,
+<i>Buginese</i>, <i>Buhid</i>, <i>Canadian_Aboriginal</i>, <i>Carian</i>,
+<i>Cham</i>, <i>Cherokee</i>, <i>Common</i>, <i>Coptic</i>,
+<i>Cuneiform</i>, <i>Cypriot</i>, <i>Cyrillic</i>, <i>Deseret</i>,
+<i>Devanagari</i>, <i>Ethiopic</i>, <i>Georgian</i>, <i>Glagolitic</i>,
+<i>Gothic</i>, <i>Greek</i>, <i>Gujarati</i>, <i>Gurmukhi</i>, <i>Han</i>,
+<i>Hangul</i>, <i>Hanunoo</i>, <i>Hebrew</i>, <i>Hiragana</i>,
+<i>Inherited</i>, <i>Kannada</i>, <i>Katakana</i>, <i>Kayah_Li</i>,
+<i>Kharoshthi</i>, <i>Khmer</i>, <i>Lao</i>, <i>Latin</i>, <i>Lepcha</i>,
+<i>Limbu</i>, <i>Linear_B</i>, <i>Lycian</i>, <i>Lydian</i>,
+<i>Malayalam</i>, <i>Mongolian</i>, <i>Myanmar</i>, <i>New_Tai_Lue</i>,
+<i>Nko</i>, <i>Ogham</i>, <i>Ol_Chiki</i>, <i>Old_Italic</i>,
+<i>Old_Persian</i>, <i>Oriya</i>, <i>Osmanya</i>, <i>Phags_Pa</i>,
+<i>Phoenician</i>, <i>Rejang</i>, <i>Runic</i>, <i>Saurashtra</i>,
+<i>Shavian</i>, <i>Sinhala</i>, <i>Sundanese</i>, <i>Syloti_Nagri</i>,
+<i>Syriac</i>, <i>Tagalog</i>, <i>Tagbanwa</i>, <i>Tai_Le</i>,
+<i>Tamil</i>, <i>Telugu</i>, <i>Thaana</i>, <i>Thai</i>, <i>Tibetan</i>,
+<i>Tifinagh</i>, <i>Ugaritic</i>, <i>Vai</i>, and <i>Yi</i>.
+
+    # Unicode codepoint U+06E9 is named "ARABIC PLACE OF SAJDAH" and
+    # belongs to the Arabic script.
+    /\p{Arabic}/.match("\u06E9") #=> #<MatchData "\u06E9">
+
+All character properties can be inverted by prefixing their name with a
+caret (<tt>^</tt>).
+
+    # Letter 'A' is not in the Unicode Ll (Letter; Lowercase) category, so
+    # this match succeeds
+    /\p{^Ll}/.match("A") #=> #<MatchData "A">
+
+== Anchors
+
+Anchors are metacharacter that match the zero-width positions between
+characters, <i>anchoring</i> the match to a specific position.
+
+* <tt>^</tt> - Matches beginning of line
+* <tt>$</tt> - Matches end of line
+* <tt>\A</tt> - Matches beginning of string.
+* <tt>\Z</tt> - Matches end of string. If string ends with a newline,
+  it matches just before newline
+* <tt>\z</tt> - Matches end of string
+* <tt>\G</tt> - Matches point where last match finished
+* <tt>\b</tt> - Matches word boundaries when outside brackets; backspace
+   (0x08) inside brackets
+* <tt>\B</tt> - Matches non-word boundaries
+* <tt>(?=</tt><i>pat</i><tt>)</tt> - <i>Positive lookahead</i> assertion:
+  ensures that the following characters match <i>pat</i>, but doesn't
+  include those characters in the matched text
+* <tt>(?!</tt><i>pat</i><tt>)</tt> - <i>Negative lookahead</i> assertion:
+  ensures that the following characters do not match <i>pat</i>, but
+  doesn't include those characters in the matched text
+* <tt>(?<=</tt><i>pat</i><tt>)</tt> - <i>Positive lookbehind</i>
+  assertion: ensures that the preceding characters match <i>pat</i>, but
+  doesn't include those characters in the matched text
+* <tt>(?<!</tt><i>pat</i><tt>)</tt> - <i>Negative lookbehind</i>
+  assertion: ensures that the preceding characters do not match
+  <i>pat</i>, but doesn't include those characters in the matched text
+
+    # If a pattern isn't anchored it can begin at any point in the string
+    /real/.match("surrealist") #=> #<MatchData "real">
+    # Anchoring the pattern to the beginning of the string forces the
+    # match to start there. 'real' doesn't occur at the beginning of the
+    # string, so now the match fails
+    /\Areal/.match("surrealist") #=> nil
+    # The match below fails because although 'Demand' contains 'and', the
+    pattern does not occur at a word boundary.
+    /\band/.match("Demand")
+    # Whereas in the following example 'and' has been anchored to a
+    # non-word boundary so instead of matching the first 'and' it matches
+    # from the fourth letter of 'demand' instead
+    /\Band.+/.match("Supply and demand curve") #=> #<MatchData "and curve">
+    # The pattern below uses positive lookahead and positive lookbehind to
+    # match text appearing in <b></b> tags without including the tags in the
+    # match
+    /(?<=<b>)\w+(?=<\/b>)/.match("Fortune favours the <b>bold</b>")
+        #=> #<MatchData "bold">
+
+== Options
+
+The end delimiter for a regexp can be followed by one or more single-letter
+options which control how the pattern can match.
+
+* <tt>/pat/i</tt> - Ignore case
+* <tt>/pat/m</tt> - Treat a newline as a character matched by <tt>.</tt>
+* <tt>/pat/x</tt> - Ignore whitespace and comments in the pattern
+* <tt>/pat/o</tt> - Perform <tt>#{}</tt> interpolation only once
+
+<tt>i</tt>, <tt>m</tt>, and <tt>x</tt> can also be applied on the
+subexpression level with the
+<tt>(?</tt><i>on</i><tt>-</tt><i>off</i><tt>)</tt> construct, which
+enables options <i>on</i>, and disables options <i>off</i> for the
+expression enclosed by the parentheses.
+
+    /a(?i:b)c/.match('aBc') #=> #<MatchData "aBc">
+    /a(?i:b)c/.match('abc') #=> #<MatchData "abc">
+
+== Free-Spacing Mode and Comments
+
+As mentioned above, the <tt>x</tt> option enables <i>free-spacing</i>
+mode. Literal white space inside the pattern is ignored, and the
+octothorpe (<tt>#</tt>) character introduces a comment until the end of
+the line. This allows the components of the pattern to be organised in a
+potentially more readable fashion.
+
+    # A contrived pattern to match a number with optional decimal places
+    float_pat = /\A
+        [[:digit:]]+ # 1 or more digits before the decimal point
+        (\.          # Decimal point
+            [[:digit:]]+ # 1 or more digits after the decimal point
+        )? # The decimal point and following digits are optional
+    \Z/x
+    float_pat.match('3.14') #=> #<MatchData "3.14" 1:".14">
+
+*Note*: To match whitespace in an <tt>x</tt> pattern use an escape such as
+<tt>\s</tt> or <tt>\p{Space}</tt>.
+
+Comments can be included in a non-<tt>x</tt> pattern with the
+<tt>(?#</tt><i>comment</i><tt>)</tt> construct, where <i>comment</i> is
+arbitrary text ignored by the regexp engine.
+
+== Encoding
+
+Regular expressions are assumed to use the source encoding. This can be
+overridden with one of the following modifiers.
+
+* <tt>/</tt><i>pat</i><tt>/u</tt> - UTF-8
+* <tt>/</tt><i>pat</i><tt>/e</tt> - EUC-JP
+* <tt>/</tt><i>pat</i><tt>/s</tt> - Windows-31J
+* <tt>/</tt><i>pat</i><tt>/n</tt> - ASCII-8BIT
+
+A regexp can be matched against a string when they either share an
+encoding, or the regexp's encoding is _US-ASCII_ and the string's encoding
+is ASCII-compatible.
+
+If a match between incompatible encodings is attempted an
+<tt>Encoding::CompatibilityError</tt> exception is raised.
+
+The <tt>Regexp#fixed_encoding?</tt> predicate indicates whether the regexp
+has a <i>fixed</i> encoding, that is one incompatible with ASCII. A
+regexp's encoding can be explicitly fixed by supplying
+<tt>Regexp::FIXEDENCODING</tt> as the second argument of
+<tt>Regexp.new</tt>:
+
+    r = Regexp.new("a".force_encoding("iso-8859-1"),Regexp::FIXEDENCODING)
+    r =~"a\u3042"
+       #=> Encoding::CompatibilityError: incompatible encoding regexp match
+            (ISO-8859-1 regexp with UTF-8 string)
+
+== Performance
+
+Certain pathological combinations of constructs can lead to abysmally bad
+performance.
+
+Consider a string of 25 <i>a</i>s, a <i>d</i>, 4 <i>a</i>s, and a
+<i>c</i>.
+
+    s = 'a' * 25 + 'd' 'a' * 4 + 'c'
+        #=> "aaaaaaaaaaaaaaaaaaaaaaaaadadadadac"
+
+The following patterns match instantly as you would expect:
+
+    /(b|a)/ =~ s #=> 0
+    /(b|a+)/ =~ s #=> 0
+    /(b|a+)*\/ =~ s #=> 0
+
+However, the following pattern takes appreciably longer:
+
+    /(b|a+)*c/ =~ s #=> 32
+
+This happens because an atom in the regexp is quantified by both an
+immediate <tt>+</tt> and an enclosing <tt>*</tt> with nothing to
+differentiate which is in control of any particular character. The
+nondeterminism that results produces super-linear performance. (Consult
+<i>Mastering Regular Expressions</i> (3rd ed.), pp 222, by
+<i>Jeffery Friedl</i>, for an in-depth analysis). This particular case
+can be fixed by use of atomic grouping, which prevents the unnecessary
+backtracking:
+
+    (start = Time.now) && /(b|a+)*c/ =~ s && (Time.now - start)
+       #=> 24.702736882
+    (start = Time.now) && /(?>b|a+)*c/ =~ s && (Time.now - start)
+       #=> 0.000166571
+
+A similar case is typified by the following example, which takes
+approximately 60 seconds to execute for me:
+
+    # Match a string of 29 <i>a</i>s against a pattern of 29 optional
+    # <i>a</i>s followed by 29 mandatory <i>a</i>s.
+    Regexp.new('a?' * 29 + 'a' * 29) =~ 'a' * 29
+
+The 29 optional <i>a</i>s match the string, but this prevents the 29
+mandatory <i>a</i>s that follow from matching. Ruby must then backtrack
+repeatedly so as to satisfy as many of the optional matches as it can
+while still matching the mandatory 29. It is plain to us that none of the
+optional matches can succeed, but this fact unfortunately eludes Ruby.
+
+One approach for improving performance is to anchor the match to the
+beginning of the string, thus significantly reducing the amount of
+backtracking needed.
+
+    Regexp.new('\A' 'a?' * 29 + 'a' * 29).match('a' * 29)
+        #=> #<MatchData "aaaaaaaaaaaaaaaaaaaaaaaaaaaaa">
+=end
 class Regexp; end
author	nobu <nobu@b2dd03c8-39d4-4d8f-98ff-823fe69b080e>	2009-09-18 01:11:55 +0000
committer	nobu <nobu@b2dd03c8-39d4-4d8f-98ff-823fe69b080e>	2009-09-18 01:11:55 +0000
commit	d7f76f84d923ac8c7ef585e1f43fc0d691625377 (patch)
tree	3a9fe7c617df7b28d63e6b95ad917826a1994d60 /doc/re.rdoc
parent	96ac19481183e11bd12ca0a704d90e415c4b3954 (diff)
download	ruby-d7f76f84d923ac8c7ef585e1f43fc0d691625377.tar.gz