For the next two weeks I am preparing for Sun's Certified Java Programmer (SCJP) exam. So far it really does not kick ass. My main trouble with the studies at the moment is a handful of topics in exam objective 3.5 (Tokenizing).
Objective 3.5 - Tokenizing with Pattern and Matcher
Let's say there's the pattern 'a?' and we would like to tokenize the string 'aba' using the following Java class.
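import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Tokenizer { // class name chosen for this listing
    public static void main(String[] args) {
        Pattern p = Pattern.compile("a?"); // compile the pattern
        Matcher m = p.matcher("aba");      // bind it to the source data
        while (m.find()) {                 // move on to the next match, if any
            // print the start index and the matched text between > and <
            System.out.println(m.start() + " >" + m.group() + "<");
        }
    }
}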
What I expected out of this listing was something like this.
0 >a<
2 >a<
But what it actually printed was this.
0 >a<
1 ><
2 >a<
3 ><
Mhm. My study book (from which I took this example) talks about zero-length matches, which can occur when using the greedy quantifiers '*' or '?'. It says that zero-length matches can appear under the following circumstances.
- After the last character of source data
- In between characters after a match has been found
- At the beginning of source data (if the first character is not a match; try tokenizing the string '2aba')
- At the beginning of zero-length source data
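The cases in the list can be checked with a small sketch like the one below. The class name ZeroLengthDemo and the helper matches() are made up for illustration, and the empty input string is my own pick; only 'aba' and '2aba' come from the text above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZeroLengthDemo {

    // Hypothetical helper: collects "start >group<" for every hit of find().
    static List<String> matches(String regex, String input) {
        List<String> result = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start() + " >" + m.group() + "<");
        }
        return result;
    }

    public static void main(String[] args) {
        // In between characters (index 1) and after the last character (index 3)
        System.out.println(matches("a?", "aba"));
        // At the beginning of the source data, because '2' is not a match
        System.out.println(matches("a?", "2aba"));
        // At the beginning of zero-length source data
        System.out.println(matches("a?", ""));
    }
}
```

An empty pair of brackets (><) in the printed lists marks a zero-length match, and the number in front of it shows where it was found.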
Well, these rules seem to be specific to regular expressions and not to have anything to do with the Java implementation. The only funny thing is that I cannot reproduce this in RegEx Buddy, my favourite regex editor, which I use to develop my regexes. Yet another topic to learn by rote (meaning not really understanding it, but memorizing it for the purpose of the exam only).
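For what it's worth, the output I originally expected can be had by simply skipping the zero-length matches, which are the ones where m.start() equals m.end(). This is only a sketch; the class name SkipEmptyMatches and the helper nonEmptyMatches() are my own inventions and not something from the study book.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SkipEmptyMatches {

    // Hypothetical helper: same loop as in the listing, minus zero-length hits.
    static String nonEmptyMatches(String regex, String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            if (m.start() == m.end()) {
                continue; // zero-length match: nothing was consumed, so skip it
            }
            out.append(m.start()).append(" >").append(m.group()).append("<\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints the two lines I expected in the first place
        System.out.print(nonEmptyMatches("a?", "aba"));
    }
}
```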