Capturing groups are numbered by counting their opening
parentheses from left to right. In the expression ((A)(B(C))),
for example, there are four such groups:
1 | ((A)(B(C))) |
2 | (A) |
3 | (B(C)) |
4 | (C) |
Group zero always stands for the entire expression.
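As a minimal sketch of the numbering above, the following lists every group after matching the expression against the input "ABC". The class and method names are hypothetical and chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupNumberingSketch {
    // Returns groups 0..groupCount() after matching ((A)(B(C))) against
    // the whole input, or null if the input does not match.
    static String[] groups(String input) {
        Matcher m = Pattern.compile("((A)(B(C)))").matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("ABC");
        for (int i = 0; i < g.length; i++) {
            System.out.println(i + " | " + g[i]);
        }
    }
}
```

For the input "ABC" this prints group zero and group one as ABC, group two as A, group three as BC, and group four as C. Note that groupCount() reports four, since group zero is not included in the count.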
Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete.
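The two uses of a captured subsequence can be illustrated with a small sketch. The expression below is an assumption made for this example: it uses the back reference \1 to find a word that is immediately repeated, and then retrieves the captured word from the matcher. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BackReferenceSketch {
    // \1 must match exactly the text that group one captured.
    private static final Pattern REPEATED_WORD =
            Pattern.compile("\\b(\\w+) \\1\\b");

    // Returns the first immediately repeated word, or null if there is none.
    static String firstRepeatedWord(String input) {
        Matcher m = REPEATED_WORD.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("it is is fine")); // prints "is"
        System.out.println(firstRepeatedWord("it is fine"));    // prints "null"
    }
}
```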
The captured input associated with a group is always the
subsequence that the group most recently matched. If a group is
evaluated a second time because of quantification then its
previously-captured value, if any, will be retained if the second
evaluation fails. Matching the string "aba" against the
expression (a(b)?)+, for example, leaves group two set to "b".
All captured input is discarded at the beginning of each match.
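The "aba" case described above can be checked with a short sketch. It also shows, under the assumption that a single matcher is reused with find(), that captured input from one match does not carry over to the next. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RetainedGroupSketch {
    private static final Pattern P = Pattern.compile("(a(b)?)+");

    // Returns { group 1, group 2 } after matching the whole input,
    // or null if the input does not match.
    static String[] groupsFor(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    // Returns group 2 of the second successive match found in the input,
    // or the string "no second match" if fewer than two matches exist.
    static String groupTwoOfSecondMatch(String input) {
        Matcher m = P.matcher(input);
        if (m.find() && m.find()) {
            return m.group(2);
        }
        return "no second match";
    }

    public static void main(String[] args) {
        String[] g = groupsFor("aba");
        // Group 1 holds the most recent iteration; group 2 keeps its
        // earlier value because (b)? did not match in the second iteration.
        System.out.println(g[0] + " " + g[1]);                // prints "a b"
        // A new match starts with all groups cleared.
        System.out.println(groupTwoOfSecondMatch("ab a"));    // prints "null"
    }
}
```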
Groups beginning with (? are pure, non-capturing groups that do
not capture text and do not count towards the group total.
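A brief sketch of the difference, using the non-capturing form (?:X) as an assumed example of a group beginning with (?. The helper names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonCapturingSketch {
    // Number of capturing groups in the expression (group zero excluded).
    static int capturingCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Text captured by group one after matching the whole input,
    // or null if the input does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(capturingCount("(a)(b)"));         // prints "2"
        System.out.println(capturingCount("(?:a)(b)"));       // prints "1"
        // With the first group non-capturing, (b) becomes group one.
        System.out.println(firstGroup("(?:a)(b)", "ab"));     // prints "b"
    }
}
```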
Unicode support
This class follows Unicode Technical Report #18: Unicode Regular Expression Guidelines, implementing its second level of support though with a slightly different concrete syntax.
Unicode escape sequences such as \u2014 in Java source code are
processed as described in §3.3 of the Java Language Specification.
Such escape sequences are also implemented directly by the
regular-expression parser so that Unicode escapes can be used in
expressions that are read from files or from the keyboard. Thus the
strings "\u2014" and "\\u2014", while not equal, compile into the
same pattern, which matches the character with hexadecimal value
0x2014.
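The two strings can be compared directly in a short sketch. The constant and method names are hypothetical; the first literal is translated by the Java compiler into a single character, while the second reaches the regular-expression parser as six characters and is translated there.

```java
import java.util.regex.Pattern;

final class UnicodeEscapeSketch {
    // One character: the compiler has already replaced the escape.
    static final String COMPILER_TRANSLATED = "\u2014";
    // Six characters: backslash, 'u', '2', '0', '1', '4'.
    static final String PARSER_TRANSLATED = "\\u2014";

    // True if both expressions match the single character U+2014.
    static boolean bothMatch() {
        String input = String.valueOf((char) 0x2014);
        return Pattern.matches(COMPILER_TRANSLATED, input)
                && Pattern.matches(PARSER_TRANSLATED, input);
    }

    public static void main(String[] args) {
        System.out.println(COMPILER_TRANSLATED.length());                   // prints "1"
        System.out.println(PARSER_TRANSLATED.length());                     // prints "6"
        System.out.println(COMPILER_TRANSLATED.equals(PARSER_TRANSLATED));  // prints "false"
        System.out.println(bothMatch());                                    // prints "true"
    }
}
```

The second form is the one that would result from reading the six characters from a file or from the keyboard, where no compiler is involved.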
Unicode blocks and categories are written with the \p and \P
constructs as in Perl. \p{prop} matches if the input has the
property prop, while \P{prop} does not match if the input has
that property. Blocks are specified with the prefix In, as in
InMongolian. Categories may be specified with the optional prefix
Is: Both \p{L} and \p{IsL} denote the category of Unicode letters.
Blocks and categories can be used both inside and outside of a
character class.
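The constructs above can be exercised with a small sketch. The sample inputs are assumptions chosen for illustration: "a" as a letter, "1" as a non-letter, and U+1820 as a character from the Mongolian block. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class UnicodePropertySketch {
    // True if the whole input matches the expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Category with and without the optional Is prefix.
        System.out.println(matches("\\p{L}", "a"));                  // prints "true"
        System.out.println(matches("\\p{IsL}", "a"));                // prints "true"
        // \P negates the property.
        System.out.println(matches("\\P{L}", "a"));                  // prints "false"
        System.out.println(matches("\\P{L}", "1"));                  // prints "true"
        // Block, written with the In prefix.
        System.out.println(matches("\\p{InMongolian}", "\u1820"));   // prints "true"
        // Categories inside a character class: letters or decimal digits.
        System.out.println(matches("[\\p{L}\\p{Nd}]+", "abc123"));   // prints "true"
    }
}
```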
The supported blocks and categories are those of The Unicode
Standard, Version 3.0. The block names are those defined in
Chapter 14 and in the file Blocks-3.txt of the Unicode Character
Database except that the spaces are removed; "Basic Latin", for
example, becomes "BasicLatin". The category names are those defined
in table 4-5 of the Standard (p. 88), both normative and informative.
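As a sketch of the naming rule, the block "Basic Latin" is written without its space and with the In prefix. The sample inputs and the class name are assumptions made for this example; U+00E9 is used as a character that lies outside the Basic Latin block.

```java
import java.util.regex.Pattern;

final class BlockNameSketch {
    // True if every character of the input lies in the Basic Latin block.
    static boolean isBasicLatin(String input) {
        return Pattern.matches("\\p{InBasicLatin}+", input);
    }

    public static void main(String[] args) {
        System.out.println(isBasicLatin("Basic Latin"));   // prints "true"
        System.out.println(isBasicLatin("caf\u00e9"));     // prints "false"
    }
}
```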