How (not) to validate phone numbers

I haven’t shown any signs of life here for some time now. That’s because I have been doing great things. Honestly!

One of these things is a software project I’m currently working on. I had to fix a bug related to users entering their phone number on the contact information page.

After some digging I found the place where the phone number was validated, and how it was done. Of course, a regular expression was used:

This is how you should validate a phone number… NOT! I don’t see myself as a regular expression hacker, but come on! What is this? If this developer wanted to make a joke, then he (or she) really succeeded.

Anyway, if you are not familiar with regular expressions, I’ll try to explain the one above in more detail.

In short, a phone number is valid if it is either an empty string or a string made of zero or more whitespace characters (space, tab, line feed and so on), one digit, zero or more whitespace characters, one digit, zero or more whitespace characters… and finally zero or one digit followed by zero or more whitespace characters. In other words, it is a valid phone number when it has 7-8 digits with optional whitespace around them. So “1234567”, “12 34 5678” and “ 1 2 3 4 5 6 7 8 ” are valid, while “123456”, “12a34567”, “12-34-567” and “+3721234567” are not.
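The description above can be turned into a rough sketch. The expression below is a reconstruction from that description, not the original code; the constant name VERBOSE_PHONE and the \A / \z anchors are assumptions made for illustration.

```ruby
# Hypothetical reconstruction of the verbose pattern described above:
# an empty string, or seven mandatory digits plus one optional digit,
# each of them surrounded by optional whitespace.
VERBOSE_PHONE = /\A(|\s*\d{1}\s*\d{1}\s*\d{1}\s*\d{1}\s*\d{1}\s*\d{1}\s*\d{1}\s*\d{0,1}\s*)\z/

["1234567", "12 34 5678", "123456", "12-34-567"].each do |number|
  puts "#{number.inspect} valid? #{VERBOSE_PHONE.match?(number)}"
end
```

Written out like this, the repetition is hard to miss: the same "whitespace, digit" fragment appears eight times.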

First of all, let’s create a simple PhoneNumber class and tests for it to verify that it actually works:
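A minimal sketch of such a class could look like the following. The method names, the PATTERN constant and the reconstructed expression are all assumptions, and the checks are written in plain Ruby so the snippet runs without a test framework.

```ruby
# Illustrative PhoneNumber class wrapping the (reconstructed) verbose pattern.
class PhoneNumber
  PATTERN = /\A(|\s*\d{1}\s*\d{1}\s*\d{1}\s*\d{1}\s*\d{1}\s*\d{1}\s*\d{1}\s*\d{0,1}\s*)\z/

  def initialize(number)
    @number = number.to_s
  end

  def valid?
    PATTERN.match?(@number)
  end
end

# The example numbers from the text, split by expected outcome.
valid   = ["", "1234567", "12 34 5678", " 1 2 3 4 5 6 7 8 "]
invalid = ["123456", "12a34567", "12-34-567", "+3721234567"]

# Collect every number whose result disagrees with the expectation.
failures = valid.reject { |n| PhoneNumber.new(n).valid? } +
           invalid.select { |n| PhoneNumber.new(n).valid? }

puts failures.empty? ? "all checks pass" : "unexpected results: #{failures.inspect}"
```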

Even though no unit tests had been written for it, this seemed to work as long as we consider phone numbers with “-” characters or a country code invalid. But before changing these requirements, let’s look at the regular expression again. As you may have noticed, I had to repeat myself quite a lot when explaining it above. Hell, this is not how regular expressions should be used. Let’s try to make it more elegant, shall we?

First of all, let’s remove the parts which are not needed - there’s no need to spell out a repetition of exactly one, and a repetition of zero or one has a shorter form. Let’s clean those up:
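Assuming the original spelled the counts out as {1} and {0,1}, this cleanup step might look like the sketch below. The constant name SIMPLER_PHONE is made up for illustration, and the expression is a reconstruction.

```ruby
# Same reconstructed pattern with the redundant quantifiers removed:
# "\d{1}" is just "\d", and "\d{0,1}" can be written as "\d?".
SIMPLER_PHONE = /\A(|\s*\d\s*\d\s*\d\s*\d\s*\d\s*\d\s*\d\s*\d?\s*)\z/

puts SIMPLER_PHONE.match?("12 34 5678")
puts SIMPLER_PHONE.match?("12-34-567")
```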

That already looks better, lol. If you think about the pattern that needs to be matched, you’ll notice that there is really just one pattern repeating itself, so why not write a regular expression for that one pattern?
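One way to express that single repeating pattern is with a group and a counted repetition. This is a sketch under the same assumptions as before; COMPACT_PHONE is a hypothetical name.

```ruby
# The repeating unit is "optional whitespace, one digit, optional whitespace";
# {7,8} repeats that unit seven or eight times.
COMPACT_PHONE = /\A(|(\s*\d\s*){7,8})\z/

puts COMPACT_PHONE.match?(" 1 2 3 4 5 6 7 8 ")
puts COMPACT_PHONE.match?("123456")
```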

An empty string, or a pattern of optional whitespace, a digit and optional whitespace repeated seven to eight times, is allowed. And of course all our tests pass. Isn’t this regexp a lot smaller than the original? Well, I think it is. But what happens if we change our original requirements to allow dashes between the digits and a country code in front of the number? Let’s change our tests first:
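In plain Ruby, the changed expectations could be sketched as below. The constant names and the failing_numbers helper are hypothetical, and the compact expression is the reconstruction used so far, not the original code.

```ruby
# Compact pattern from before the requirements changed (reconstruction).
COMPACT_PATTERN = /\A(|(\s*\d\s*){7,8})\z/

# New expectations: dashes and a country code are now acceptable.
NOW_VALID   = ["", "1234567", "12 34 5678", "12-34-567", "+3721234567"]
NOW_INVALID = ["123456", "12a34567"]

# Returns the numbers that should be valid but are rejected by the pattern.
def failing_numbers(pattern, numbers)
  numbers.reject { |n| pattern.match?(n) }
end

p failing_numbers(COMPACT_PATTERN, NOW_VALID)
# prints ["12-34-567", "+3721234567"]
```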

And let’s see them fail!

1)
'PhoneNumber validates number 12-34-567 as valid' FAILED
expected 12-34-567 to be valid, but was not!
./phone_number_validator.rb:15:

2)
'PhoneNumber validates number +3721234567 as valid' FAILED
expected +3721234567 to be valid, but was not!
./phone_number_validator.rb:15:

At least our tests work as expected. Let’s try to change the regular expression so that all tests pass:
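A sketch of an expression that satisfies the new expectations follows. It is one possible reconstruction: the name RELAXED_PHONE and the way the optional country code is grouped are assumptions.

```ruby
# Optional "+" with a 3-4 digit country code, then 7-8 digits, each of
# which may have whitespace or "-" characters before and after it.
RELAXED_PHONE = /\A(|(\+\d{3,4})?([\s\-]*\d[\s\-]*){7,8})\z/

["12-34-567", "+3721234567", "123456", "12a34567"].each do |number|
  puts "#{number.inspect} valid? #{RELAXED_PHONE.match?(number)}"
end
```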

And the tests pass. You can read it as “allow an empty string, or a string which has an optional ‘+’ character with a three to four digit country code, followed by seven to eight repetitions of a digit with zero or more whitespace or ‘-’ characters before it and after it”. This actually counts “1------23-45 678” as a valid phone number. Let’s create a test for it.
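Such a check might be sketched as follows, writing the run of dashes with plain hyphens. RELAXED_PATTERN is a hypothetical name for a reconstruction of the relaxed expression, not the original code.

```ruby
# Reconstruction of the relaxed pattern: separators are unbounded,
# so any run of "-" or whitespace between digits is accepted.
RELAXED_PATTERN = /\A(|(\+\d{3,4})?([\s\-]*\d[\s\-]*){7,8})\z/

odd_number = "1------23-45 678"
puts "#{odd_number.inspect} valid? #{RELAXED_PATTERN.match?(odd_number)}"
# prints "1------23-45 678" valid? true
```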

Should we try to create a regular expression that rejects this number? To be honest, the answer is no. All the regexps created so far have actually been quite meaningless (I wrote them anyway to have some examples of regular expression optimization). We should have deleted the original regular expression right away and written a proper validation method instead! Let’s normalize phone numbers by stripping all unneeded characters before matching against a much simpler regular expression. This also lets us keep all phone numbers in our system in the same format. If numbers need special formatting when shown on screen, that should be applied at display time. A new PhoneNumber class might look something like this:
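The normalize-then-validate idea can be sketched like this. Which characters count as unneeded (whitespace and “-” here), the constant names, and the decision to keep accepting an empty string are assumptions carried over from the earlier description.

```ruby
# Illustrative PhoneNumber that strips formatting before validating.
class PhoneNumber
  # Characters treated as formatting noise (assumption): whitespace and "-".
  NOISE = /[\s\-]/
  # Normalized form: optional "+" with 3-4 digit country code, then 7-8 digits.
  NORMALIZED = /\A(\+\d{3,4})?\d{7,8}\z/

  attr_reader :number

  def initialize(raw)
    # Store the number in one canonical format, without separators.
    @number = raw.to_s.gsub(NOISE, "")
  end

  def valid?
    # An empty number is still allowed, as in the earlier patterns.
    number.empty? || NORMALIZED.match?(number)
  end
end

phone = PhoneNumber.new("12-34 567")
puts phone.number   # prints 1234567
puts phone.valid?   # prints true
```

Because the stored value no longer contains separators, an oddly formatted input simply ends up as the same canonical string as its clean equivalent.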

Notice how the regular expression used in the validation method has become a lot smaller and easier to understand. Let’s line up the original, the intermediate and the final result here for a better overview:
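Side by side, the three stages might look like the sketch below. All three expressions are illustrative reconstructions with made-up names, and the final one assumes separators have already been stripped and the empty string is handled separately.

```ruby
# Original: every digit spelled out, whitespace only (reconstruction).
ORIGINAL_RE     = /\A(|\s*\d{1}\s*\d{1}\s*\d{1}\s*\d{1}\s*\d{1}\s*\d{1}\s*\d{1}\s*\d{0,1}\s*)\z/
# Intermediate: grouped repetition, dashes and country code allowed.
INTERMEDIATE_RE = /\A(|(\+\d{3,4})?([\s\-]*\d[\s\-]*){7,8})\z/
# Final: applied only after whitespace and dashes have been removed.
FINAL_RE        = /\A(\+\d{3,4})?\d{7,8}\z/

{ original: ORIGINAL_RE, intermediate: INTERMEDIATE_RE, final: FINAL_RE }.each do |name, re|
  puts "#{name}: #{re.source.length} characters"
end
```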

Now you know how, and how not, to validate phone numbers, or actually anything else.