Thursday, July 26, 2007

A Plain English Regular Expression Language

Well, I haven't posted in a while due to some heavy workload, but things are beginning to wrap up a bit on this particular phase, so hopefully I will find a bit more time to blog again (life doesn't always seem to allow me to make this the priority I'd like it to be, so we'll see).

Anyway, I've recently been revisiting a thought I've been developing for a while now.

I've been using Regular Expressions for quite a few years now, and I've noticed some issues with them that I would like to find or create a solution for.

What Regular Expressions Are

For those of you who are not familiar with what Regular Expressions are, the quick answer is that they are like wildcard searches on steroids. They can find text that follows patterns extremely well and are often used for text validation and searching for common patterns in text documents (like web addresses in e-mails that automatically link to those addresses, for instance).

Well, the great thing about regular expressions is that they are an extremely powerful way to express search, replace and validation requests on text. I would use them a whole lot more if it weren't for some of their shortcomings.

Some problems I have with Regular Expressions

I have a few problems with Regular Expressions, though, and most of them are related to one primary issue.

Regular Expressions are, as they currently stand, extremely cryptic.

Let me give you an example. Let's say we are validating an e-mail address. A simple regex for that might look something like this:

[\w\.\-]+\@\w+\.\w{2,3}


This expression wouldn't even accurately cover all the rules for an e-mail address, but it should match most.

The above expression looks for the following:
  • One or more of the following in any order
    • A word character (a letter in either case, a digit, or an underscore)
    • A period (.)
    • A dash (-)
  • Followed by the @ sign
  • Followed by one or more word characters (same as above) in any order
  • Followed by a period
  • Followed by 2 or 3 word characters
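To make the breakdown above concrete, here is a minimal sketch in Python. The language and the helper name looks_like_email are my own choices for illustration, not part of the original pattern. It uses the re module's VERBOSE flag so that each piece of the expression can carry a comment matching one bullet in the list.

```python
import re

# The pattern from the post, split across lines with re.VERBOSE so that
# each piece can be commented. It matches the same text as the one-liner.
EMAIL = re.compile(r"""
    [\w\.\-]+   # one or more word characters, periods, or dashes
    \@          # the @ sign
    \w+         # one or more word characters
    \.          # a literal period
    \w{2,3}     # two or three word characters
""", re.VERBOSE)


def looks_like_email(text):
    """Return True if the whole string fits the simplified pattern."""
    return EMAIL.fullmatch(text) is not None


print(looks_like_email("john.doe@example.com"))  # True
print(looks_like_email("user@example.info"))     # False: four characters after the period
```

The second call shows one of the gaps mentioned earlier: a valid address whose last part is longer than three characters is rejected by this simplified pattern.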

Even though I've become pretty adept at creating regular expressions, expressions like these are difficult even for me to read and maintain. This particular one is short, but sometimes there are many more rules to apply, and the result gets even harder to read.

Reluctance to use Regular Expressions

Because of this, many people are reluctant to even learn about regular expressions, and even if they understand their power, many development teams will be reluctant to use them because:
  • They can be extremely difficult to create, even for those who are adept
  • They are difficult, if not impossible, to debug without some really good tools
  • They become harder to change the bigger they get
  • The longer it has been since they were created, the harder they become to read (even when I comment, it can be difficult for me to see how a particular regular expression I've created all goes together after I've been away from it)
  • Typically very few people on a development team even know how to write them

There are other reasons, but these are some of the big ones. These days, programmers don't want to be indispensable on a particular project, simply because that ties them to the project and can make it difficult for them to advance later. When they use regular expressions, they can find themselves becoming the only person who has any idea how these things work in the project, and that's not good for anybody.

Another problem I've had with regular expressions is simply the amount of repeat work needed to deal with patterns that are similar to, but not exactly the same as, other patterns. There is nothing inherent in the regular expression syntax I use that allows me to reuse part of a pattern in another pattern, with the possible exception of a back-reference, which checks for the exact text matched by an earlier group. Sometimes I want to reuse only the pattern fragment (not the exact matching text) in another part of the pattern without having to write the whole fragment out again.
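The difference between a back-reference and real pattern reuse can be sketched in a few lines of Python. This assumes Python's built-in re module, which is only one regex flavor, and the names same_text, WORD, and same_shape are made up for the example. The reuse here happens in the host language through string concatenation, not in the regex syntax itself.

```python
import re

# A back-reference (\1) requires the exact text captured by group 1.
same_text = re.compile(r"(\w+)-\1")

# To reuse the fragment itself, the pattern text has to be repeated;
# here an ordinary Python string stands in for a named fragment.
WORD = r"\w+"
same_shape = re.compile(WORD + "-" + WORD)

print(bool(same_text.fullmatch("abc-abc")))   # True: identical text on both sides
print(bool(same_text.fullmatch("abc-xyz")))   # False: the text differs
print(bool(same_shape.fullmatch("abc-xyz")))  # True: both sides fit \w+
```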

The Point

I'm finally getting to the point of this post. I think it's time that someone provided better support for regular expressions by creating a "plain English" version of them.

Let's take the pattern I used above. What if I could write it like this:

LookFor {
OneOrMore Of { WordChar, Literal ".", Literal "-" }
Then Literal "@"
Then OneOrMore Of WordChar
Then Literal "."
Then Between 2 AND 3 Of WordChar
}


This could potentially also be written like this, to illustrate my point about pattern reuse:

Pattern Dot = Literal "."
LookFor {
OneOrMore Of {WordChar, Dot, Literal "-"}
Then Literal "@"
Then OneOrMore Of WordChar
Then Dot
Then Between 2 AND 3 Of WordChar
}

Now, obviously this would be more verbose than the Regular Expressions we've come to know and love. However, I think something like this would lend itself to better readability and, once compiled, wouldn't be any more costly in terms of size than current regular expression patterns.
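One way to picture the "compiled" form is a handful of small functions that each return an ordinary regex fragment. The sketch below is in Python and is only an assumption about how such a language might be lowered to a regex. Every helper name (literal, any_of, one_or_more, between, look_for) is hypothetical, chosen to mirror the keywords in the examples above.

```python
import re

# Hypothetical helpers: each returns a plain regex fragment as a string.
WORD_CHAR = r"\w"


def literal(text):
    """Match the given text exactly, escaping any regex metacharacters."""
    return re.escape(text)


def any_of(*fragments):
    """Match any one of the fragments (non-capturing alternation)."""
    return "(?:" + "|".join(fragments) + ")"


def one_or_more(fragment):
    return "(?:" + fragment + ")+"


def between(low, high, fragment):
    return "(?:%s){%d,%d}" % (fragment, low, high)


def look_for(*steps):
    """Join the steps in order and compile them into one pattern."""
    return re.compile("".join(steps))


DOT = literal(".")  # plays the role of: Pattern Dot = Literal "."

EMAIL = look_for(
    one_or_more(any_of(WORD_CHAR, DOT, literal("-"))),
    literal("@"),
    one_or_more(WORD_CHAR),
    DOT,
    between(2, 3, WORD_CHAR),
)

print(bool(EMAIL.fullmatch("john.doe@example.com")))  # True
print(bool(EMAIL.fullmatch("john.doe@example")))      # False: no period after the domain
```

Because the helpers only build a string, the compiled result is an ordinary regular expression, so the extra verbosity lives in the source and not in the pattern that actually runs.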

In fact, I think if it turned out that some expression simply couldn't be translated well using this format, it could be assigned to a pattern variable, as with the "Dot" pattern above.

For instance, let's say that you could not easily express "any value on the keyboard" in this format. You could assign it as a legacy pattern to a pattern variable, and then use that variable in other expressions:

Pattern KBKey = [\r\w\~\!\@\#\$\%\^\&\*\(\)\_\+\-\=\{\}\[\]\:\"\;\'\<\>\?\,\.\/\`\|\\ ]
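As a rough illustration of the same idea in an existing language, the legacy character class can be held in an ordinary variable and combined with other fragments. This is a sketch in Python, assuming its built-in re module accepts the class as written. The names KB_KEY, THREE_KEYS, and ONLY_KEYS are invented for the example.

```python
import re

# The legacy character class from the post, kept as raw regex text.
KB_KEY = r"[\r\w\~\!\@\#\$\%\^\&\*\(\)\_\+\-\=\{\}\[\]\:\"\;\'\<\>\?\,\.\/\`\|\\ ]"

# Reuse the variable inside larger patterns by plain concatenation.
THREE_KEYS = re.compile(KB_KEY + "{3}")  # exactly three keyboard characters
ONLY_KEYS = re.compile(KB_KEY + "+")     # one or more keyboard characters

print(bool(THREE_KEYS.fullmatch("a@ ")))           # True
print(bool(ONLY_KEYS.fullmatch("Hello, World!")))  # True
print(bool(ONLY_KEYS.fullmatch("tab\there")))      # False: a tab is not in the class
```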

Moving Forward

Honestly, the concepts I have indicated here are just preliminary and I haven't put a whole lot of thought into whether this is actually the best way of going about this. I hope, however, that I have conveyed an idea of what it is that I'd like to accomplish.

With any luck, I will be able to get started on this, this year.

I will post here as I come up with ideas. Please share any ideas or findings you may have on this topic. If anyone has already started (and/or completed) something like this, I would very much like to know about it.

bye for now
