Boost C++ Libraries Home Libraries People FAQ More

PrevUpHomeNext

Creating a Regex Object

Static Regexes
Dynamic Regexes

When using xpressive, the first thing you'll do is create a basic_regex<> object. This section goes over the nuts and bolts of building a regular expression in the two dialects xpressive supports: static and dynamic.

Overview

The feature that really sets xpressive apart from other C/C++ regular expression libraries is the ability to author a regular expression using C++ expressions. xpressive achieves this through operator overloading, using a technique called expression templates to embed a mini-language dedicated to pattern matching within C++. These "static regexes" have many advantages over their string-based brethren. In particular, static regexes:

  • are syntax-checked at compile-time; they will never fail at run-time due to a syntax error.
  • can naturally refer to other C++ data and code, including other regexes, making it simple to build grammars out of regular expressions and bind user-defined actions that execute when parts of your regex match.
  • are statically bound for better inlining and optimization. Static regexes require no state tables, virtual functions, byte-code or calls through function pointers that cannot be resolved at compile time.
  • are not limited to searching for patterns in strings. You can declare a static regex that finds patterns in an array of integers, for instance.

Since we compose static regexes using C++ expressions, we are constrained by the rules for legal C++ expressions. Unfortunately, that means that "classic" regular expression syntax cannot always be mapped cleanly into C++. Rather, we map the regex constructs, picking new syntax that is legal C++.

Construction and Assignment

You create a static regex by assigning one to an object of type basic_regex<>. For instance, the following defines a regex that can be used to find patterns in objects of type std::string:

sregex re = '$' >> +_d >> '.' >> _d >> _d;

Assignment works similarly.

Character and String Literals

In static regexes, character and string literals match themselves. For instance, in the regex above, '$' and '.' match the characters '$' and '.' respectively. Don't be confused by the fact that $ and . are meta-characters in Perl. In xpressive, literals always represent themselves.

When using literals in static regexes, you must take care that at least one operand is not a literal. For instance, the following are not valid regexes:

sregex re1 = 'a' >> 'b';         // ERROR!
sregex re2 = +'a';               // ERROR!

The two operands to the binary >> operator are both literals, and the operand of the unary + operator is also a literal, so these statements will call the native C++ binary right-shift and unary plus operators, respectively. That's not what we want. To get operator overloading to kick in, at least one operand must be a user-defined type. We can use xpressive's as_xpr() helper function to "taint" an expression with regex-ness, forcing operator overloading to find the correct operators. The two regexes above should be written as:

sregex re1 = as_xpr('a') >> 'b'; // OK
sregex re2 = +as_xpr('a');       // OK

Sequencing and Alternation

As you've probably already noticed, sub-expressions in static regexes must be separated by the sequencing operator, >>. You can read this operator as "followed by".

// Match an 'a' followed by a digit
sregex re = 'a' >> _d;

Alternation works just as it does in Perl with the | operator. You can read this operator as "or". For example:

// match a digit character or a word character one or more times
sregex re = +( _d | _w );

Grouping and Captures

In Perl, parentheses () have special meaning. They group, but as a side-effect they also create back-references like $1 and $2. In C++, parentheses only group -- there is no way to give them side-effects. To get the same effect, we use the special s1, s2, etc. tokens. Assigning to one creates a back-reference. You can then use the back-reference later in your expression, like using \1 and \2 in Perl. For example, consider the following regex, which finds matching HTML tags:

"<(\\w+)>.*?</\\1>"

In static xpressive, this would be:

'<' >> (s1= +_w) >> '>' >> -*_ >> "</" >> s1 >> '>'

Notice how you capture a back-reference by assigning to s1, and then you use s1 later in the pattern to find the matching end tag.

[Tip] Tip

Grouping without capturing a back-reference

In xpressive, if you just want grouping without capturing a back-reference, you can just use () without s1. That is the equivalent of Perl's (?:) non-capturing grouping construct.

Case-Insensitivity and Internationalization

Perl lets you make part of your regular expression case-insensitive by using the (?i:) pattern modifier. xpressive also has a case-insensitivity pattern modifier, called icase. You can use it as follows:

sregex re = "this" >> icase( "that" );

In this regular expression, "this" will be matched exactly, but "that" will be matched irrespective of case.

Case-insensitive regular expressions raise the issue of internationalization: how should case-insensitive character comparisons be evaluated? Also, many character classes are locale-specific. Which characters are matched by digit and which are matched by alpha? The answer depends on the std::locale object the regular expression object is using. By default, all regular expression objects use the global locale. You can override the default by using the imbue() pattern modifier, as follows:

std::locale my_locale = /* initialize a std::locale object */;
sregex re = imbue( my_locale )( +alpha >> +digit );

This regular expression will evaluate alpha and digit according to my_locale. See the section on Localization and Regex Traits for more information about how to customize the behavior of your regexes.

Static xpressive Syntax Cheat Sheet

The table below lists the familiar regex constructs and their equivalents in static xpressive.

Table 1.4. Perl syntax vs. Static xpressive syntax

Perl

Static xpressive

Meaning

.

_

any character (assuming Perl's /s modifier).

ab

a >> b

sequencing of a and b sub-expressions.

a|b

a | b

alternation of a and b sub-expressions.

(a)

(s1= a)

group and capture a back-reference.

(?:a)

(a)

group and do not capture a back-reference.

\1

s1

a previously captured back-reference.

a*

*a

zero or more times, greedy.

a+

+a

one or more times, greedy.

a?

!a

zero or one time, greedy.

a{n,m}

repeat<n,m>(a)

between n and m times, greedy.

a*?

-*a

zero or more times, non-greedy.

a+?

-+a

one or more times, non-greedy.

a??

-!a

zero or one time, non-greedy.

a{n,m}?

-repeat<n,m>(a)

between n and m times, non-greedy.

^

bos

beginning of sequence assertion.

$

eos

end of sequence assertion.

\b

_b

word boundary assertion.

\B

~_b

not word boundary assertion.

\n

_n

literal newline.

.

~_n

any character except a literal newline (without Perl's /s modifier).

\r?\n|\r

_ln

logical newline.

[^\r\n]

~_ln

any single character not a logical newline.

\w

_w

a word character, equivalent to set[alnum | '_'].

\W

~_w

not a word character, equivalent to ~set[alnum | '_'].

\d

_d

a digit character.

\D

~_d

not a digit character.

\s

_s

a space character.

\S

~_s

not a space character.

[:alnum:]

alnum

an alpha-numeric character.

[:alpha:]

alpha

an alphabetic character.

[:blank:]

blank

a horizontal white-space character.

[:cntrl:]

cntrl

a control character.

[:digit:]

digit

a digit character.

[:graph:]

graph

a graphable character.

[:lower:]

lower

a lower-case character.

[:print:]

print

a printing character.

[:punct:]

punct

a punctuation character.

[:space:]

space

a white-space character.

[:upper:]

upper

an upper-case character.

[:xdigit:]

xdigit

a hexadecimal digit character.

[0-9]

range('0','9')

characters in range '0' through '9'.

[abc]

as_xpr('a') | 'b' |'c'

characters 'a', 'b', or 'c'.

[abc]

(set= 'a','b','c')

same as above

[0-9abc]

set[ range('0','9') | 'a' | 'b' | 'c' ]

characters 'a', 'b', 'c' or in range '0' through '9'.

[0-9abc]

set[ range('0','9') | (set= 'a','b','c') ]

same as above

[^abc]

~(set= 'a','b','c')

not characters 'a', 'b', or 'c'.

(?i:stuff)

icase(stuff)

match stuff disregarding case.

(?>stuff)

keep(stuff)

independent sub-expression, match stuff and turn off backtracking.

(?=stuff)

before(stuff)

positive look-ahead assertion, match if before stuff but don't include stuff in the match.

(?!stuff)

~before(stuff)

negative look-ahead assertion, match if not before stuff.

(?<=stuff)

after(stuff)

positive look-behind assertion, match if after stuff but don't include stuff in the match. (stuff must be constant-width.)

(?<!stuff)

~after(stuff)

negative look-behind assertion, match if not after stuff. (stuff must be constant-width.)

(?P<name>stuff)

mark_tag name(n);
...
(name= stuff)

Create a named capture.

(?P=name)

mark_tag name(n);
...
name

Refer back to a previously created named capture.




PrevUpHomeNext