Searching a text string

When asked to pull an eight digit account number out of a random string there are many methods that can be used. I will attempt to explain my method and some of the reasons why I chose it, but first I think I should mention more about the problem.

The strings in question were made up of user typed sentence which should include the account number but did not contain a consistent format.  This was made more difficult by the fact that the account number could be continuous, contain hyphens or spaces and in some cases may not exist at all. I did know that an account number should be eight digits and could not begin with a zero. Some examples of the problem are shown below:

“Please transfer 500.00 to 12345678”
“For the benefit of 1234-5678, John”
“Please send $1000, 12/19 to 1234 5678 for Christmas.”
“12345678 Mary Smith, please call with any questions (555)-555-5555”

 As the examples indicate other numbers may be present and the account could be anywhere in the string. Removing all non numbers would occasionally leave strings integers with no clue which belonged to the account. Removing just spaces and special characters could have similar problems in cases such as the one above with $1000 12/19 which would become 10001219 without spaces or special characters. 

I decided to start with a search using regular expressions to find the digit 1-9 followed by nine more digits. I also had to check that the nine digit number did not contain an added number before or after the string I found. It was possible that a person included a phone number or SSN which could be picked up in error if I found my string but did not check for either nothing before or after or at least a non digit.

The regular expression check wouldn’t work if the number contained a space or hyphen so the next step was to then remove these likely account spacings and run the search again. Removing the characters after the first run helped to reduce the chance of picking up some other number incase two other numbers happened to form an eight digit number.

Once I had completed the two rounds of searches I was typically left with a usable account number if one was provided. If an account could not be found the case would have to be manually reviewed.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s