Regular expressions are powerful and powerfully frustrating. Frustrating? I don’t use them enough to be able to think in regular expressions. So I search the internet for someone who’s solved my exact problem and I copy their solution. When that doesn’t work, it’s very frustrating. One problem is that there are different variants of RE. I, of course, am using the VB Script 5.5 version, which I understand is almost identical to the JavaScript variant.
I want to match an Amazon product link. I start with the RE from this stackoverflow answer.
http://www.amazon.com/([\\w-]+/)?(dp|dp/product|gp/product|exec/obidos/asin)/(\\w+/)?(\\w{10})
It didn’t work. So I started by pasting it into MyEZApp’s Analyzer to get the English equivalent. You still have to know something about RE, but it’s helpful when trying to read someone else’s pattern. Using that and this basic reference and this advanced reference, I started to convert the above pattern into one that will work with VB Script 5.5. I ended up with
http://www\.amazon\.com/([\w\-]+/)?(?:dp|dp/product|gp/product|exec/obidos/asin)/(?:\w+/)?(\w{10})
Not too different than where I started. But it seems to work, which is key. Here’s how it breaks down.
http://www
\.amazon\.com/([\w\-]+/)?(?:dp|dp/product|gp/product|exec/obidos/asin)/(?:\w+/)?(\w{10})
Match those 10 characters exactly.
http://www\.
amazon\.
com/([\w\-]+/)?(?:dp|dp/product|gp/product|exec/obidos/asin)/(?:\w+/)?(\w{10})
Match a period exactly. The backslash is called an escape character. A period normally means to match any character except a line break. I don’t want that meaning, I want to match a period, so I have to escape it.
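Here is a minimal sketch of the difference between an escaped and an unescaped period. It uses Python's re module as a stand-in for the VB Script 5.5 engine, which is an assumption on my part; the dot and the backslash escape behave the same way in both for this simple case.

```python
import re

# Unescaped dot: matches any character except a line break.
loose = re.compile(r"www.amazon")
# Escaped dot: matches only a literal period.
strict = re.compile(r"www\.amazon")

print(loose.search("www.amazon") is not None)   # True
print(loose.search("wwwXamazon") is not None)   # True, the dot accepted "X"
print(strict.search("www.amazon") is not None)  # True
print(strict.search("wwwXamazon") is not None)  # False
```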
http://www\.amazon
\.com/
([\w\-]+/)?(?:dp|dp/product|gp/product|exec/obidos/asin)/(?:\w+/)?(\w{10})
More exact matching. The forward slash after the .com is also an exact match. As far as I know, the front slash has no special meaning.
http://www\.amazon\.com/(...)
?(?:dp|dp/product|gp/product|exec/obidos/asin)/(?:\w+/)?(\w{10})
Parentheses create a group. It’s kind of like an order of precedence thing. Sometimes you want to perform an operation on part of a regular expression taken as a whole. Most commonly, you see groups with decisions (alternation). Because the pipe, the decision operator, has such low precedence, you generally need to create a group.
Diet|Crystal Pepsi
will match Diet
or Crystal Pepsi
whereas
(Diet|Crystal) Pepsi
will match Diet Pepsi
or Crystal Pepsi
When’s the last time you saw a Crystal Pepsi reference? Grouping has another important characteristic. It saves the part of the string that matched for later use. In some Amazon links, this portion of the link represents a description of the product. I’m using the group, so that I can extract that portion of the string and use it elsewhere, as you’ll see later.
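The Pepsi example can be checked with a short sketch. Python's re module is used here as an assumed stand-in for VB Script 5.5, and fullmatch is used so that only whole-string matches count.

```python
import re

# The pipe has low precedence, so this means "Diet" OR "Crystal Pepsi".
without_group = re.compile(r"Diet|Crystal Pepsi")
# The group limits the decision to "Diet" or "Crystal"; " Pepsi" always follows.
with_group = re.compile(r"(Diet|Crystal) Pepsi")

print(without_group.fullmatch("Diet") is not None)           # True
print(without_group.fullmatch("Diet Pepsi") is not None)     # False
print(without_group.fullmatch("Crystal Pepsi") is not None)  # True

# The group also saves the part of the string it matched.
print(with_group.fullmatch("Diet Pepsi").group(1))     # Diet
print(with_group.fullmatch("Crystal Pepsi").group(1))  # Crystal
```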
http://www\.amazon\.com/(...)?
(?:dp|dp/product|gp/product|exec/obidos/asin)/(?:\w+/)?(\w{10})
A question mark makes the preceding item optional. Another reason I’m using a group, in addition to saving the value for later use, is because I want to apply the optional flag to the whole group. In some Amazon links, there is no product description. When it’s not there, it’s not a problem, the grouping is optional. If it is there, it saves it.
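As a rough sketch of the optional group, the snippet below uses a cut-down version of the pattern that stops after dp. Python's re module is an assumed stand-in for VB Script 5.5, and the product name in the URL is made up for illustration.

```python
import re

# Cut-down pattern: optional, saved description group followed by "dp".
pattern = re.compile(r"http://www\.amazon\.com/([\w\-]+/)?dp")

with_description = pattern.match("http://www.amazon.com/Some-Product-Name/dp")
without_description = pattern.match("http://www.amazon.com/dp")

# When the description is present, the group saves it (trailing slash included).
print(with_description.group(1))     # Some-Product-Name/
# When it is absent, the pattern still matches and the group is simply unset.
print(without_description.group(1))  # None
```

Note that the saved text includes the trailing slash, because the slash sits inside the parentheses.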
http://www\.amazon\.com/([\w\-]
+/)?(?:dp|dp/product|gp/product|exec/obidos/asin)/(?:\w+/)?(\w{10})
Square brackets enclose a character set. In a character set, you can specify characters, ranges of characters, and a whole bunch of other stuff. In this character set, I’ve used a shorthand character class (\w) for “word characters”. Word characters are letters, digits, and underscores. I’ve also included a minus sign, escaped with a backslash because a minus sign between two characters inside a character set indicates a range, like A-Z, and I don’t want that special meaning. I want to include minus signs explicitly. The highlighted portion above will match exactly one letter, digit, underscore, or hyphen.
http://www\.amazon\.com/([\w\-]+
/)?(?:dp|dp/product|gp/product|exec/obidos/asin)/(?:\w+/)?(\w{10})
The plus sign repeats the previous item one or more times. The previous item in this case is a character set. I’m not trying to match one character from that set, I’m trying to match an unknown length string that only contains certain characters (word characters and hyphens).
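A small sketch shows the character set with and without the plus sign. It assumes Python's re module in place of VB Script 5.5; the re.ASCII flag is my addition so that \w is limited to A-Z, a-z, 0-9, and underscore, and the sample text is made up.

```python
import re

one_char = re.compile(r"[\w\-]", re.ASCII)     # exactly one character from the set
many_chars = re.compile(r"[\w\-]+", re.ASCII)  # one or more characters from the set

text = "Some-Product_Name2/dp"

print(one_char.match(text).group(0))    # S
# The run stops at the slash, which is not in the set.
print(many_chars.match(text).group(0))  # Some-Product_Name2
```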
http://www\.amazon\.com/([\w\-]+/
)?(?:dp|dp/product|gp/product|exec/obidos/asin)/(?:\w+/)?(\w{10})
Another exact match character, matching a front slash.
http://www\.amazon\.com/([\w\-]+/)?
(?:dp|dp/product|gp/product|exec/obidos/asin)/(?:\w+/)?(\w{10})
To this point we have: match http://www.amazon.com/AnyNumberOfCharsDigitsUnderscoresHyphensFollowedByAFrontSlashOrNothing
http://www\.amazon\.com/([\w\-]+/)?(?:dp|dp/product|gp/product|exec/obidos/asin)
/(?:\w+/)?(\w{10})
You know that parentheses mean a group and that groups can be accessed later. When I start a group with the question mark/colon combination, it keeps all of the normal group properties except that it doesn’t save the match for later. There’s no real harm in saving a group for later even if you don’t intend to use it. But I wanted to keep things clean. I knew I only wanted the description and the ASIN, so I marked all the other groups to not save.
Inside the group is a decision using the pipe operator. In this case it will match one of four strings: dp, dp/product, gp/product, or exec/obidos/asin
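The effect of ?: can be sketched by comparing a saving and a non-saving version of this one group. Python's re module is an assumed stand-in for VB Script 5.5; the groups attribute reports how many saved groups a compiled pattern has.

```python
import re

capturing = re.compile(r"/(dp|dp/product|gp/product|exec/obidos/asin)/")
non_capturing = re.compile(r"/(?:dp|dp/product|gp/product|exec/obidos/asin)/")

# Number of groups whose matches are saved for later.
print(capturing.groups)      # 1
print(non_capturing.groups)  # 0

# Both versions match the same text; only the saving differs.
print(capturing.search("/gp/product/").group(1))      # gp/product
print(non_capturing.search("/gp/product/").group(0))  # /gp/product/

# Alternatives are tried left to right, so "dp" wins over "dp/product" here.
print(capturing.search("/dp/product/").group(1))      # dp
```

The last line is worth a note: because dp is listed before dp/product, a dp/product link is matched by the dp alternative, and in the full pattern the product/ part is then picked up by the optional (?:\w+/)? group that follows.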
http://www\.amazon\.com/([\w\-]+/)?(?:dp|dp/product|gp/product|exec/obidos/asin)/(?:\w+/)?
(\w{10})
Skipping over another exact match front slash, the next item is an optional group that doesn’t save its value and contains one or more word characters followed by a front slash. Got that? The parentheses mean it’s a group: a self-contained regular expression that can be treated as a whole. The question mark after it means that the whole group is optional: great if it’s there, no worries if it’s not. The question mark/colon combo tells the group to look for a match, but not to save the matching substring because we won’t be asking for it later. The backslash-w is a shorthand character class called “word characters” that consists of letters, digits, and underscores. The plus sign following backslash-w means match one or more word characters. The front slash simply matches a front slash.
It’s pretty similar to the group we had earlier for the product description. The product description also included hyphens while this one doesn’t. Because the product description could contain word characters and hyphens we needed to create a character set that included both, while this group only has word characters so no character set [] necessary. This group has a ?: but the product description we wanted to use later, so no ?: in that group.
http://www\.amazon\.com/([\w\-]+/)?(?:dp|dp/product|gp/product|exec/obidos/asin)/(?:\w+/)?(\w{10})
By now you know that backslash-w is a word character. The {10} indicates that we want exactly 10 of them. A plus sign after the backslash-w would mean one or more of them, but if you know you’re looking for a string of a specific length, you can specify that length with the curly braces. This would not have to be a group except that I want to use this value later (it’s the ASIN). None of the other properties of a group are needed.
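Here is a quick sketch of the {10} quantifier on its own, again assuming Python's re module in place of VB Script 5.5. The ASIN-like values are made up for illustration.

```python
import re

asin = re.compile(r"\w{10}", re.ASCII)

print(asin.fullmatch("B000123456") is not None)  # True, exactly 10 word characters
print(asin.fullmatch("B00012345") is not None)   # False, only 9

# Nothing anchors the end of the pattern, so a longer run of word
# characters still matches and only the first 10 are taken.
print(asin.match("B000123456789").group(0))      # B000123456
```

Because the full pattern has no end anchor either, a link with extra word characters after the tenth one would still match, and the saved value would be the first ten.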
Wikipedia lists out some URLs that contain the ASIN, but they don’t quite match the stackoverflow examples. Here are a few that will match.
http://www.amazon.com/gp/product/ASIN-VALUE-HERE
http://www.amazon.com/([\w\-]+/)?(?:dp|dp/product|gp/product|exec/obidos/asin)/(?:\w+/)?(\w{10})
http://www.amazon.com/dp/ASIN-VALUE-HERE
http://www.amazon.com/([\w\-]+/)?(?:dp|dp/product|gp/product|exec/obidos/asin)/(?:\w+/)?(\w{10})
http://www.amazon.com/dp/product/ASIN-VALUE-HERE
http://www.amazon.com/([\w\-]+/)?(?:dp|dp/product|gp/product|exec/obidos/asin)/(?:\w+/)?(\w{10})
http://www.amazon.com/
http://www.amazon.com/([\w\-]+/)? (?:dp|dp/product|gp/product|exec/obidos/asin)/(?:\w+/)?(\w{10})
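To try the whole pattern against links of these shapes, here is a sketch in Python. The re module stands in for the VB Script 5.5 engine (an assumption, though the pattern only uses syntax the two share), the parse helper is a hypothetical name of mine, and the ASIN and product name in the sample URLs are made up.

```python
import re

AMAZON_LINK = re.compile(
    r"http://www\.amazon\.com/([\w\-]+/)?"
    r"(?:dp|dp/product|gp/product|exec/obidos/asin)/(?:\w+/)?(\w{10})",
    re.ASCII,  # keep \w to A-Z, a-z, 0-9, and underscore
)


def parse(url):
    """Return (description, asin), or None if the URL does not match."""
    m = AMAZON_LINK.match(url)
    if m is None:
        return None
    description = m.group(1)
    if description is not None:
        # The saved group includes the trailing slash; drop it.
        description = description.rstrip("/")
    return description, m.group(2)


samples = [
    "http://www.amazon.com/gp/product/B000123456",
    "http://www.amazon.com/dp/B000123456",
    "http://www.amazon.com/dp/product/B000123456",
    "http://www.amazon.com/Some-Product-Name/dp/B000123456",
    "http://www.amazon.com/",
]

for url in samples:
    print(url, "->", parse(url))
```

The first three samples give a description of None, the fourth gives Some-Product-Name, and the bare home page link does not match at all.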
We’ll put this regex to use in the next post.
I learned something new today! I didn’t know a ? after a group made it optional. I’ve just been adding a blank “or” alternative in the group, e.g., ([\w\-]+/|).
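The two spellings can be compared with a short sketch. This is Python's re module, not VBA's, so treat the behavior of the unset group as specific to Python; the patterns are cut down to the optional part followed by dp.

```python
import re

question_mark = re.compile(r"([\w\-]+/)?dp")
blank_alternative = re.compile(r"([\w\-]+/|)dp")

# With a description present, both save the same text.
print(question_mark.fullmatch("Name/dp").group(1))      # Name/
print(blank_alternative.fullmatch("Name/dp").group(1))  # Name/

# Without one, both still match, but the saved value differs:
# the optional group is unset, the blank alternative saves an empty string.
print(question_mark.fullmatch("dp").group(1))           # None
print(repr(blank_alternative.fullmatch("dp").group(1))) # ''
```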
I’m not a big fan of VBA’s regex. I do it using Notepad++ and http://gskinner.com/RegExr/ but then it won’t work with VBA’s regex. Pretty aggravating, I was having a problem with VBA regex being greedy even though the pattern worked perfectly on the other two platforms.
@Jon
[quote]
Pretty aggravating, I was having a problem with VBA regex being greedy even though the pattern worked perfectly on the other two platforms.
[end quote]
I suggest reading this article; it is very interesting:
http://swtch.com/~rsc/regexp/regexp1.html
Consider also that VBScript’s RegExp uses an NFA architecture.
Jeffrey E.F. Friedl, in his book Mastering Regular Expressions, offers a test to check the architecture type of a regex engine … DFA or NFA
pattern="X(.+)+X"
source="-XX-----------------------------------"
A DFA responds immediately … an NFA takes a long time
Then there is another test you can run:
source="xx-----------------------------------x"
An NFA responds immediately; POSIX (mixed architecture) takes a long time
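The first test can be tried on a small scale with the sketch below. It uses Python's re module, which is a backtracking engine, as an assumed stand-in for VBScript's RegExp. The time_failed_search helper is a hypothetical name, and the sources are kept much shorter than the one above so the script finishes quickly; timings will vary by machine.

```python
import re
import time

pattern = re.compile(r"X(.+)+X")


def time_failed_search(n):
    """Search "-XX" followed by n hyphens; return (result, seconds taken)."""
    source = "-XX" + "-" * n
    start = time.perf_counter()
    result = pattern.search(source)
    elapsed = time.perf_counter() - start
    return result, elapsed


# There is no second X after the hyphens, so every search fails.
# A backtracking engine tries many ways of splitting the hyphens
# between repetitions of (.+) before giving up.
for n in (10, 14, 18):
    result, elapsed = time_failed_search(n)
    print(n, result, f"{elapsed:.4f}s")
```

On a backtracking engine the time should grow quickly as hyphens are added, which is why the full-length source in the test takes so long.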
best regards
r
http://regexpal.com/
Use this site. It helped me a lot!