Capturing groups are numbered by counting their opening parentheses from left to right.
In the expression ((A)(B(C))), for example, there are four such groups:
1 | ((A)(B(C))) |
2 | (A) |
3 | (B(C)) |
4 | (C) |
Group zero always stands for the entire expression.
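The numbering above can be checked with a short sketch. The class name GroupNumberingDemo and the helper groups are hypothetical names chosen for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupNumberingDemo {
    // Returns groups 0..groupCount() captured when the whole input
    // matches ((A)(B(C))), or an empty array if it does not match.
    static String[] groups(String input) {
        Matcher m = Pattern.compile("((A)(B(C)))").matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups("ABC");
        for (int i = 0; i < g.length; i++) {
            System.out.println(i + " | " + g[i]);
        }
    }
}
```

Note that groupCount() reports four groups; group zero is not included in that count.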
Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete.
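As a minimal sketch of both uses, the following finds the first doubled word character: the back reference \1 reuses the capture inside the expression, and group(1) retrieves it afterwards. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BackReferenceDemo {
    // Returns the first word character that is immediately repeated,
    // or null if the input contains no doubled character.
    static String firstDoubled(String input) {
        // (\w) captures one word character; \1 requires the same text again.
        Matcher m = Pattern.compile("(\\w)\\1").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDoubled("hello"));
        System.out.println(firstDoubled("abc"));
    }
}
```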
The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.
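The "aba" example can be reproduced with a small sketch. RetainedCaptureDemo and capture are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RetainedCaptureDemo {
    // Matches the whole input against (a(b)?)+ and returns
    // { group 1, group 2 }, or null if the input does not match.
    static String[] capture(String input) {
        Matcher m = Pattern.compile("(a(b)?)+").matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] g = capture("aba");
        // Group one holds the text of the last iteration; group two keeps
        // the value from the first iteration because (b)? did not match
        // anything in the second.
        System.out.println("group 1 = " + g[0]);
        System.out.println("group 2 = " + g[1]);
    }
}
```

For an input such as "a", where group two never participates in the match, group(2) returns null.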
Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total.
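A brief sketch of the effect on the group total, using the (?:X) form of a non-capturing group. The class and helper names are hypothetical; groupCount() is read from a Matcher because that is where the method is defined.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonCapturingDemo {
    // Number of capturing groups in the given expression.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Text captured by group one when the whole input matches, else null.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(a)(b)"));        // two capturing groups
        System.out.println(groupCount("(?:a)(b)"));      // (?:a) is not counted
        System.out.println(firstGroup("(?:a)(b)", "ab")); // group one is (b)
    }
}
```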
Unicode support
This class follows Unicode Technical Report #18: Unicode Regular Expression Guidelines, implementing its second level of support though with a slightly different concrete syntax.
Unicode escape sequences such as \u2014 in Java source code are processed as described in section 3.3 of the Java Language Specification. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
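The difference between the two strings can be seen in a short sketch. The class and method names are hypothetical; the first string is turned into a single character by the Java compiler, while the second reaches the regular-expression parser as six characters.

```java
import java.util.regex.Pattern;

final class UnicodeEscapeDemo {
    // One character: the compiler processes the escape in the source text.
    static String literalForm() {
        return "\u2014";
    }

    // Six characters (backslash, u, 2, 0, 1, 4): the regular-expression
    // parser processes the escape when the pattern is compiled.
    static String escapedForm() {
        return "\\u2014";
    }

    // True if the given expression matches the single character 0x2014.
    static boolean matchesTarget(String regex) {
        String target = String.valueOf((char) 0x2014);
        return Pattern.matches(regex, target);
    }

    public static void main(String[] args) {
        System.out.println(literalForm().equals(escapedForm()));
        System.out.println(matchesTarget(literalForm()));
        System.out.println(matchesTarget(escapedForm()));
    }
}
```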
Unicode blocks and categories are written with the \p and \P constructs as in Perl. \p{prop} matches if the input has the property prop, while \P{prop} does not match if the input has that property. Blocks are specified with the prefix In, as in InMongolian. Categories may be specified with the optional prefix Is: Both \p{L} and \p{IsL} denote the category of Unicode letters. Blocks and categories can be used both inside and outside of a character class.
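The constructs described above can be exercised with a small sketch. The class and method names are hypothetical, and the sample character 0x1820 (MONGOLIAN LETTER A) is an illustrative choice, not taken from the text.

```java
import java.util.regex.Pattern;

final class UnicodePropertyDemo {
    // Category without the optional Is prefix.
    static boolean isLetter(String s) {
        return Pattern.matches("\\p{L}", s);
    }

    // Same category, written with the Is prefix.
    static boolean isLetterWithPrefix(String s) {
        return Pattern.matches("\\p{IsL}", s);
    }

    // Negated form: matches a single character that is not a letter.
    static boolean isNonLetter(String s) {
        return Pattern.matches("\\P{L}", s);
    }

    // Block membership, written with the In prefix.
    static boolean isMongolian(String s) {
        return Pattern.matches("\\p{InMongolian}", s);
    }

    // Categories used inside a character class: letters or decimal digits.
    static boolean isAlphanumeric(String s) {
        return Pattern.matches("[\\p{L}\\p{Nd}]+", s);
    }

    public static void main(String[] args) {
        String mongolianA = String.valueOf((char) 0x1820);
        System.out.println(isLetter("a") + " " + isLetterWithPrefix("a"));
        System.out.println(isNonLetter("1"));
        System.out.println(isMongolian(mongolianA) + " " + isMongolian("a"));
        System.out.println(isAlphanumeric("abc123"));
    }
}
```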
The supported blocks and categories are those of The Unicode Standard, Version 3.0. The block names are those defined in Chapter 14 and in the file Blocks-3.txt of the Unicode Character Database except that the spaces are removed; "Basic Latin", for example, becomes "BasicLatin". The category names are those defined in table 4-5 of the Standard (p. 88), both normative and informative.
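As a brief illustration of the naming rule, the sketch below uses the space-free block name BasicLatin with the In prefix. The class and method names are hypothetical, and the test character 0x00E9 (which lies outside the Basic Latin block) is an illustrative choice.

```java
import java.util.regex.Pattern;

final class BlockNameDemo {
    // True if s is a single character in the Basic Latin block.
    static boolean inBasicLatin(String s) {
        // The block name is written without spaces after the In prefix.
        return Pattern.matches("\\p{InBasicLatin}", s);
    }

    public static void main(String[] args) {
        System.out.println(inBasicLatin("A"));
        System.out.println(inBasicLatin(String.valueOf((char) 0x00E9)));
    }
}
```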