Answers to Episode 2 (Real-life regular expressions)

Yeah, I’m a little late getting these answers posted. Sorry!

If you missed it, last week’s challenge dealt with deciphering regular expressions and finding subtle bugs within ’em.

As with last week, before getting to the actual answers please indulge me while I pontificate a bit:

Hopefully it’s pretty obvious that regular expressions are a double-edged sword. Sure, deciphering them makes a fun quiz, but imagine running across these monsters in code and trying to figure out what they do… not fun.

Fortunately, nearly every regex implementation has a “verbose” mode that allows you to embed comments inside regular expressions (in most languages this is the x flag). For the sake of those who must read your code, please use the verbose mode!
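As a rough illustration of what the x flag buys you, here is a minimal sketch in Python, where the flag is spelled re.VERBOSE. The date pattern and the sample string are made up for this example and are not part of the quiz.

```python
import re

# Compact form: correct, but the reader has to decode it unaided.
compact = re.compile(r"\d{4}-\d{2}-\d{2}")

# Same pattern in verbose mode: whitespace in the pattern is ignored
# and everything after an unescaped "#" is treated as a comment.
verbose = re.compile(r"""
    \d{4}   # four-digit year
    -
    \d{2}   # two-digit month
    -
    \d{2}   # two-digit day
""", re.VERBOSE)

sample = "posted 2006-11-28"  # made-up input
compact_hit = compact.search(sample).group()
verbose_hit = verbose.search(sample).group()
```

Both objects match exactly the same strings; only the readability differs.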

OK, on to the answers:

1. [A-PR-Y0-9]{3}-[A-PR-Y0-9]{3}-[A-PR-Y0-9]{4}

This is a US phone number, including ones that use letters (e.g. 831-555-CODE). Rewritten in verbose mode, it makes a lot more sense:

   [A-PR-Y0-9]{3}  # Area code prefix
   -
   [A-PR-Y0-9]{3}  # 3-digit exchange
   -
   [A-PR-Y0-9]{4}  # 4-digit suffix

birman had a nice roundup of the problems with this pattern:

[It] doesn’t account for a preceding 1, if the area code is in parenthesis, if the digit groups are separated by a dot or space instead of a dash, or the fact that cell phones have Q and Z on them. It also doesn’t make sure the group is isolated, and not part of something like 1234888-234-123456123.

That last point — the isolation error — is a very common error when writing regular expressions.
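One way to see the isolation error, and one possible fix, is sketched below in Python. The lookarounds in ISOLATED are an assumption about how you might repair it, not something from the original quiz; the names LOOSE and ISOLATED are invented for this example.

```python
import re

# The quiz pattern as given: nothing stops it matching inside a longer run.
LOOSE = re.compile(r"[A-PR-Y0-9]{3}-[A-PR-Y0-9]{3}-[A-PR-Y0-9]{4}")

# Assumed fix: reject a match that has an uppercase letter or digit
# directly before or after it.
ISOLATED = re.compile(
    r"(?<![A-Z0-9])"
    r"[A-PR-Y0-9]{3}-[A-PR-Y0-9]{3}-[A-PR-Y0-9]{4}"
    r"(?![A-Z0-9])"
)

embedded = "1234888-234-123456123"       # example from birman's comment
standalone = "call 831-555-CODE today"

loose_embedded = LOOSE.search(embedded)            # finds a "number" inside the run
isolated_embedded = ISOLATED.search(embedded)      # finds nothing
isolated_standalone = ISOLATED.search(standalone)  # still finds the real number
```

Anchoring with ^ and $ would also work when the whole string is supposed to be a phone number; lookarounds are the more flexible choice when searching inside longer text.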

2. &(?!(\w+|#\d+);)

This is not, as most people thought, a mistaken attempt to match HTML entities. It’s actually a pattern that will match ampersands in HTML that are not part of entities (it’s taken from Django’s fix_ampersands template filter).

Here’s the verbose mode:

   &             # Match an ampersand...
   (?!           # ... that is *not* followed by...
     (
       \w+       # ... word characters...
       |         # ... or...
       \#\d+     # ... numeric entity symbols...
     )
     ;           # ... and a semi-colon.
   )

The “problem” with this pattern is pretty subtle: it accepts entities that are well-formed but still invalid (e.g. &ggxy;), leaving their ampersands alone. As a way of finding unencoded ampersands it’s just fine, but if you wanted to use it as part of an HTML validator, it would be unacceptable.
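Here is a small Python sketch of the pattern at work. The helper name escape_bare_ampersands and the sample strings are invented for illustration; this is not Django’s actual implementation, only the same regular expression applied with re.sub. The name &bogus; is a made-up stand-in for something entity-shaped that is not a defined HTML entity.

```python
import re

# The pattern from the quiz, with its escapes written out.
BARE_AMP = re.compile(r"&(?!(\w+|#\d+);)")


def escape_bare_ampersands(text):
    """Replace ampersands that do not start an entity-shaped token."""
    return BARE_AMP.sub("&amp;", text)


# Bare ampersands are escaped; named and numeric entities are left alone.
fixed = escape_bare_ampersands("Tom & Jerry &copy; 2006 &#169; AT&T")

# Entity-shaped but undefined: the pattern cannot tell the difference.
untouched = escape_bare_ampersands("&bogus;")
```

The second call is the subtle part: the regex only checks the shape of what follows the ampersand, never whether the entity actually exists.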

3. (-?(?:0|[1-9]\d*))(\.\d+)?([eE][-+]?\d+)?

Most readers got this one; it’s an IEEE floating point number, with optional exponent. In verbose mode:

   (               # The non-fractional part of the base
     -?            # could be a leading negative sign
     (?:           # Non-capturing group...
       0|[1-9]\d*  # 0, or digits with no leading zero
     )
   )
   (\.\d+)?        # Decimal point and fractional part of the base
   (               # Exponent
     [eE]          # \
     [-+]?         #  > "e", plus or minus, exponent.
     \d+           # /
   )?

Some readers thought the \d in the base part was a bug. It’s not: that expression matches either 0, or a number that starts with 1-9 and then contains any digits.

The actual bug is that this pattern matches non-normalized numbers (e.g. 123.45e3, which should more properly be written 1.2345e5).
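To check both claims, here is a short Python sketch that runs the pattern against a few strings. The helper split_number is a hypothetical name introduced only for this example, and using fullmatch (so the whole string must match) is my choice, not something the quiz specifies.

```python
import re

# The pattern from the quiz, with its escapes written out.
NUMBER = re.compile(r"(-?(?:0|[1-9]\d*))(\.\d+)?([eE][-+]?\d+)?")


def split_number(text):
    """Return (integer, fraction, exponent) if the whole string matches, else None."""
    m = NUMBER.fullmatch(text)
    return m.groups() if m else None


plain = split_number("-0.5")               # integer part may be a lone 0
non_normalized = split_number("123.45e3")  # accepted, although not normalized
leading_zero = split_number("0123")        # rejected: no leading zeros allowed
```

Groups that did not take part in the match come back as None, which is why the exponent slot is empty for -0.5.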

4. ([\da-f]{2}:){5}([\da-f]{2})

Nearly everyone got this one: it’s a MAC address:

   ([\da-f]{2}:){5}  # Two hex digits followed by a colon, x5
   ([\da-f]{2})      # Two hex digits to end.

As birman noted, this pattern fails to match a few other forms allowed for MAC addresses; they can be written with hyphens (12-34-56-78-9A-BC), or as dotted quads (1234.5678.9ABC).
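A sketch of how the pattern could be widened to cover those other notations, in Python. Everything here beyond the original colon form is an assumption on my part: one alternative per separator style, plus case-insensitive matching so that 9A and 9a are both accepted. The names MAC and is_mac are invented for the example.

```python
import re

# Assumed extension of the quiz pattern: one alternative per notation.
MAC = re.compile(r"""
      (?:[\da-f]{2}:){5}[\da-f]{2}     # 12:34:56:78:9a:bc
    | (?:[\da-f]{2}-){5}[\da-f]{2}     # 12-34-56-78-9A-BC
    | (?:[\da-f]{4}\.){2}[\da-f]{4}    # 1234.5678.9ABC
""", re.VERBOSE | re.IGNORECASE)


def is_mac(text):
    """True if the whole string is a MAC address in one of the three notations."""
    return MAC.fullmatch(text) is not None
```

Writing the alternatives out separately, rather than using a character class like [:-] for the separator, keeps mixed forms such as 12:34-56:78:9a:bc from slipping through.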

5. <[^>]*?>

This one also seemed to be easy for most readers; it matches any SGML tag. In verbose syntax:

   <        # Start the tag
   [^>]*?   # Any non-gt character
   >        # End the tag

The “bug” in this one is a little more abstract: malformed SGML/HTML will severely muck it up. I’ll leave finding such code as an exercise for the reader, though.
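If you’d rather not do the exercise, here is one possible answer, sketched in Python. The two HTML snippets are made up for this example; the first deliberately leaves the opening p tag unclosed.

```python
import re

# The pattern from the quiz.
TAG = re.compile(r"<[^>]*?>")

# Malformed input: the opening <p is never closed, so the match runs on
# until the ">" that belongs to the following <b> tag.
broken = TAG.findall('<p class="intro" <b>hi</b>')

# Well-formed input for comparison: every tag is found separately.
clean = TAG.findall('<p class="intro"><b>hi</b></p>')
```

In the malformed case the pattern reports two tags, and the first of them swallows the b tag whole.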

Next time

Tune in tomorrow for the next installment of the quiz. This week’s question will be a “things that every web developer should know” quiz; I think it’s a lot of fun.

See you tomorrow!

Category: community Time: 2006-11-28 Views: 0