Street parsing

2012-09-26 11:10

At the Apeldoorn sprint today we had several discussions about street parsing and what we should do about it. A couple of solutions came up. I spoke first with Joe Murray and Xavier Dutoit, and at that point one Extension per country for street parsing seemed a logical solution. After discussing a little more with Lobo and Tim Otten the idea changed: perhaps a single Extension for international street parsing would be enough. Let me first explain the issue for all of you out there who are not into street parsing.


We have the option in CiviCRM to activate street parsing. That means that the street address (for example 512 Rodeo Drive) is split into separate fields (street name, street number, appt/building). So far so good. The issue is that different countries put the parts of an address in a different order. In my country, for example, an address is made up of street name, street number, appt/building, postal code and city, so Ambachstraat 21, 6971 BN Brummen is a valid address. In other countries an address consists of street number, appt/building, street name, city, county, postal code, so in that system 52 Cheviot Close, Camberley, Surrey, GU13 SW4 is a valid address.
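To make the difference concrete, here is a minimal Python sketch of splitting the street line in the two orders described above. It is not CiviCRM's actual parser: the pattern names, the `parse_street` function and the `order` values are assumptions made up for illustration, and the apartment/building part is left out.

```python
import re

# Hypothetical patterns for the street line only (not CiviCRM code).
# Number-first order, e.g. "512 Rodeo Drive".
NUMBER_FIRST = re.compile(r"^(?P<number>\d+)\s+(?P<name>.+)$")
# Name-first order, e.g. "Ambachstraat 21".
NAME_FIRST = re.compile(r"^(?P<name>.+?)\s+(?P<number>\d+)$")


def parse_street(line, order):
    """Split a street line into name and number for the given order.

    `order` is "number_first" or "name_first" (illustrative values).
    Returns None when the line does not fit the chosen order.
    """
    pattern = NUMBER_FIRST if order == "number_first" else NAME_FIRST
    match = pattern.match(line.strip())
    if match is None:
        return None
    return {
        "street_name": match.group("name"),
        "street_number": match.group("number"),
    }
```

Parsing "Ambachstraat 21" with the number-first pattern finds no match at all, which is the problem Dutch sites run into with a parser built for the US order.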


The current functionality in CiviCRM supports street parsing in the US format, and some of us (I for one, for the Dutch way) have modified this in specific projects to make it behave according to local standards.


The aim would be to make it possible to use different ways of street parsing based on the country code of the contact. So I could have a site for a Dutch NGO, with English as the language, but have street parsing in the Dutch way for Dutch contacts, in the English way for English contacts, etc.


Initially we thought we would create an Extension per country code, each with its own local street parsing. But actually there are only a limited number of street parsing options, and each of those is used in many countries. I can find the following varieties:

  1. {street number}{ }{appt}{ }{street name}<next line>{city}{, }{county}{, }{state}{ }{postal code}
  2. {street number}{, }{appt}{, }{street name}<next line>{city}{, }{county}{, }{state}{ }{postal code}
  3. {street name}{ }{street number}{ }{appt}<next line>{postal code}{ }{city}
  4. {street name}{, }{street number}{ }{appt}<next line>{postal code}{ }{city}
  5. {street number}{ }{appt}{ }{street name}<next line>{postal code}{ }{city}
  6. {street number}{, }{appt}{, }{street name}<next line>{postal code}{ }{city}

First of all, I would like to know whether you know of any other street parsing sequences.


My line of thinking is now that we develop an extension that caters for these six ways of street parsing, with the first one as default for all countries. We then have a setting that allows the user to change the street parsing for a country, with the five other ways as options. Does that make sense? Would love to hear from you all!
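As a rough sketch of that idea in Python (not extension code): a table of numbered formats, a default, and a per-country setting that overrides it. Only formats 1 and 3 from the list are filled in. The field names, the `COUNTRY_FORMAT` setting and the ISO-style country codes are assumptions for illustration, and the sketch renders an address from its fields rather than parsing one.

```python
# Each format has two lines; each line is a list of (separator, field) pairs.
# Only formats 1 and 3 from the list above are shown.
FORMATS = {
    1: [
        [("", "street_number"), (" ", "unit"), (" ", "street_name")],
        [("", "city"), (", ", "county"), (", ", "state"), (" ", "postal_code")],
    ],
    3: [
        [("", "street_name"), (" ", "street_number"), (" ", "unit")],
        [("", "postal_code"), (" ", "city")],
    ],
}

DEFAULT_FORMAT = 1              # the first format is the default for all countries
COUNTRY_FORMAT = {"NL": 3}      # hypothetical per-site setting


def format_id_for(country_code):
    """Return the format chosen for a country, or the default."""
    return COUNTRY_FORMAT.get(country_code, DEFAULT_FORMAT)


def render(address, country_code):
    """Render an address dict using the format selected for the country."""
    lines = []
    for line_spec in FORMATS[format_id_for(country_code)]:
        out = ""
        for separator, field in line_spec:
            value = address.get(field, "")
            if value:
                # Only emit the separator between two non-empty fields.
                out += (separator if out else "") + value
        lines.append(out)
    return "\n".join(lines)
```

With this shape, adding a country means adding one entry to the setting, and adding a new sequence means adding one entry to the format table.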


I like the idea of putting all parsing into a single extension. We'll just need to open up commit permission to the code repository for the extension to several people.

Your list doesn't include the normal English Canadian civic street address (i.e. the one used in cities and towns), which is also similar to the US address format:

{apartment or unit}{-}{street number}{optional street number suffix}{ }{street name}{ }{street type}{ }{street direction}<next line>{city}{, }{province}{  }{postal code}


{street number}{optional street number suffix}{ }{street name}{ }{street type}{ }{street direction}{, }{Unit type, eg Apt or Ste or #}{ }{apartment or unit}<next line>{city}{, }{province}{  }{postal code}

Currently in CiviCRM we don't have reason to parse street type (i.e. Ave, St, Blvd, Ln...) or street direction (N, S, E, W, NE, NW, SE, SW) since we are not doing detailed address matching for duplicate identification and elimination yet. On a simplified level, the following is good enough for now:

{street number including optional street number suffix}{ }{street name including street type and street direction}<next line>{city}{, }{province}{, country}{  }{postal code}
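A minimal Python sketch of parsing that simplified street line might look like the following. It is not the Canadian parser in CiviCRM. The function name is hypothetical, the street number suffix is assumed to be a single letter, and the optional "unit-" prefix from the fuller English Canadian format is accepted as well.

```python
import re

# Hypothetical pattern: optional "unit-" prefix, street number with an
# optional one-letter suffix, then everything else as the street name
# (street type and direction stay part of the name).
CIVIC_LINE = re.compile(
    r"^(?:(?P<unit>\w+)-)?"      # apartment or unit, e.g. "4-"
    r"(?P<number>\d+)"           # street number
    r"(?P<suffix>[A-Za-z])?"     # optional suffix, assumed to be one letter
    r"\s+(?P<name>.+)$"          # street name incl. type and direction
)


def parse_civic_line(line):
    """Split a civic street line into unit, number (with suffix) and name."""
    match = CIVIC_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "unit": match.group("unit") or "",
        "street_number": match.group("number") + (match.group("suffix") or ""),
        "street_name": match.group("name"),
    }
```

Rural formats (PO Box, RR, Concession and so on) do not start with a street number and simply return None here.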

The French Canadian civic address format is:

{apartment}{-}{street number}{optional street number suffix}{, }{street type}{ }{street name}{ }{street direction}<next line>{city}{, }{province}{  }{postal code}

The comma after the street number is commonly used though it is not in the official standard, and there is the same common switch of the apartment or unit information from the beginning of the civic address line to the end of line as in English.

There is a large variety of rural address formats used in different parts of the country, which can be usefully parsed into things like PO Box, RR, Concession, Lot, Quarter, Meridian, Range, etc. for matching purposes. The Canadian address parser in CiviCRM currently ignores them.

I think there will be many more than your original 6 formats if the presence or absence of commas and dashes, and upper or lower case for the various components, are taken into account. For example, the Australian convention is a variation of the English Canadian one:

{apartment or unit}{ }{street number}{ }{street name}{ }{street type}<next line>{CITY}{  }{STATE}{  }{postal code}

How about a different approach altogether: a street parsing/address format table with a column that specifies which country codes each format should be applied to? If your CRM includes addresses for numerous countries, the addresses would then be formatted correctly for each country, rather than by the 'one format fits all' approach that currently applies.

The standard address format would include {Country}

A customisable format would also be included with 'use for country X', which would take precedence over the preset formats in the table.


I am suggesting this based on my use of CiviCRM and my previous use of a different CRM.

Most of our members live in Australia, but some do not. I would like the addresses formatted correctly for each country, but with the current 'one size fits all' approach that is not always possible. I also need the country included in non-Australian addresses but do not want it included for Australian addresses (including it is non-standard address formatting for our postal service, which means I lose access to bulk mail discounts for NFPs if it is included).

With a customisable option I could set it up to be in the standard Australian format (no commas or dashes, uppercase {State} and no country) and specify it is to be used for country code 1013.

Addresses for all other countries would be formatted according to the preset formats in the Address format table.
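The precedence described here could be sketched in Python roughly as follows. Everything in it is hypothetical: the rule keys, the `PRESET` and `CUSTOM` tables and the function names are made up for illustration, and only the last lines of the address are handled. The custom entry follows the Australian convention shown earlier in the thread (uppercase city and state, double-space separators, no country).

```python
# Preset rules that would ship with the table (illustrative content only).
PRESET = {
    "default": {"separator": ", ", "uppercase": (), "include_country": True},
}

# Site-specific rules; these take precedence over the presets.
# 1013 is the country code quoted in the comment above.
CUSTOM = {
    "1013": {"separator": "  ", "uppercase": ("city", "state"), "include_country": False},
}


def rules_for(country_id):
    """Custom format first, then a preset for the country, then the default."""
    return CUSTOM.get(country_id) or PRESET.get(country_id) or PRESET["default"]


def last_lines(address, country_id):
    """Build the city/state/postal code line and, if wanted, the country line."""
    rules = rules_for(country_id)
    parts = []
    for field in ("city", "state", "postal_code"):
        value = address.get(field, "")
        if field in rules["uppercase"]:
            value = value.upper()
        if value:
            parts.append(value)
    lines = [rules["separator"].join(parts)]
    if rules["include_country"] and address.get("country"):
        lines.append(address["country"])
    return lines
```

A change in one country's addressing standard would then be a change to one custom entry on the site, without waiting for the presets to be updated.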

This customisable option would also allow changes in addressing standards for a particular country to be addressed promptly. For example, the standard UK Postal format is now:
52 Cheviot Close, Camberley, GU13 SW4    (County not included)

Thanks for all your comments. I think the thing should be customisable (obviously). The objective should certainly be to make correct street parsing available for every country format, even if you want to deviate from that for one specific install. At the same time, we need a structure that leaves maximum room for community development, because I do not think 'core' should take care of all street parsing. To my mind that does not agree with the whole idea of Civi.

So what I am looking for is a structure that allows me to do my bit quite easily (I am fine with taking care of the Dutch format) without taking on the responsibility for the whole thing. That is way too big for me. All in all, the suggestion of a table that tells me what country uses what format option, with a customisable setup too (much like the current email/postal greeting stuff), makes sense to me. We do need to make sure we have a good idea of what needs adapting. Is there anyone willing to take part in the development of this stuff (and I think Joe Murray and I are both up for a part of this)?


Sorry, I am not able to help with any coding as I have zero skills in that area, but if you opt for a table listing what country uses what format I could collate that info if that would help.

That would certainly help if you can dive into what country uses what format!!

This might help:

Thanks for that dalin! That is really helpful, I will test the API from gisgraphy. Looks like we can incorporate calls to that API in our extension! Cool


And that will allow us to do the same for addresses for different countries. Maybe we should convince the gisgraphy folks to do so if they don't have it already :)





Looking at the civicrm_country table, there is already a column address_format_id. It doesn't seem to be used so far, but could it be used to store that?



Xavier's point makes me wonder - is this just about storing an option value (for each country that we have data for) with the relevant address parsing string? Which would be of course UI editable through existing mechanisms on a per site basis.

Is this an extension requirement or a minor tweak plus some data gathering?