Duplicate Address Check
Latest revision as of 13:46, 14 April 2015
The purpose of this feature is to use fuzzy logic to determine whether two addresses might be duplicates.
How it worked in Petra
Load all PLocation records and do the following for each record (location 1):
Step 1 - Put the following fields from location 1 into a single string: Locality, StreetName, Address3, City, County, PostalCode, CountryCode.
Step 2 - Iterate through every other address and perform steps 3-4 for each address (location 2).
Step 3 - Check to see if the string for location 1 contains each field from location 2.
Step 4 - Calculate the percentage of characters from location 2 that are contained in the location 1 string. If this percentage is above a minimum amount then we have a potential duplicate.
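As an illustration only, the Petra approach above could be sketched in Python as follows. This is not the original Petra code. It assumes each location is a plain dict keyed by field name, that the substring test is case-sensitive, and that "percentage of characters" means the total length of the location 2 fields found in the location 1 string divided by the total length of all non-empty location 2 fields. The 80 percent threshold is a hypothetical value; the source does not state the minimum.

```python
FIELDS = ("Locality", "StreetName", "Address3", "City",
          "County", "PostalCode", "CountryCode")


def petra_match_percentage(loc1: dict, loc2: dict) -> float:
    """Percentage of location 2's characters found in location 1's string."""
    # Step 1: put all fields of location 1 into a single string.
    combined = " ".join(loc1.get(field, "") for field in FIELDS)
    total = 0
    matched = 0
    # Step 3: check whether the string contains each field of location 2.
    for field in FIELDS:
        value = loc2.get(field, "")
        if not value:
            continue  # empty fields carry no characters to count
        total += len(value)
        if value in combined:
            matched += len(value)
    # Step 4: share of location 2's characters that were found.
    return 100.0 * matched / total if total else 0.0


def petra_is_potential_duplicate(loc1: dict, loc2: dict,
                                 minimum_percentage: float = 80.0) -> bool:
    # minimum_percentage is an assumed value, not taken from Petra.
    return petra_match_percentage(loc1, loc2) > minimum_percentage
```

Because whole fields are compared, a single typing error in a long field removes all of that field's characters from the matched total, which is one reason a per-word comparison can be more forgiving.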
How it currently works in OpenPetra
Load all PLocation records, skipping those with empty Locality, StreetName and Address3 fields (such an address is too vague to be identified as a duplicate). Categorise the locations by CountryCode. Do the following for each record (location 1):
Step 1 - Iterate through every record that has the same CountryCode as Location 1 and has not already been compared to Location 1. Perform steps 2-6 for each address (location 2).
Step 2 - If both locations have a postcode then the postcodes must match. If they are not an exact match (ignoring case) then move on to the next location. (Note: this feature will have a considerably longer run time for countries that do not use postcodes.)
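A minimal sketch of the postcode rule in Step 2, written in Python for illustration. The function name is hypothetical, and trimming surrounding whitespace before comparing is an assumption; the source only says that case is ignored.

```python
from typing import Optional


def postcodes_compatible(postcode1: Optional[str],
                         postcode2: Optional[str]) -> bool:
    """Return False only when both postcodes are present and differ."""
    p1 = (postcode1 or "").strip()  # stripping whitespace is an assumption
    p2 = (postcode2 or "").strip()
    if p1 and p2:
        # Both locations have a postcode: require an exact match, ignoring case.
        return p1.lower() == p2.lower()
    # At least one postcode is missing, so the check cannot rule the pair out.
    return True
```

When either postcode is missing the pair always proceeds to the more expensive text comparison, which is consistent with the longer run time noted for countries that do not use postcodes.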
Step 3 - Put the following fields from each of the two location records into its own single string: Locality, StreetName, Address3. Make the strings lower case, replace punctuation characters with spaces, and insert a space between any letter and number not already separated by one.
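The normalisation in Step 3 could look roughly like the Python sketch below. It is not the OpenPetra implementation. The function name is hypothetical, and treating "punctuation" as the ASCII set in `string.punctuation` is an assumption.

```python
import re
import string

# ASCII punctuation only; which characters count as punctuation is an assumption.
_PUNCTUATION = re.compile("[" + re.escape(string.punctuation) + "]")
# Zero-width positions where a letter meets a digit, or a digit meets a letter.
_LETTER_DIGIT_BOUNDARY = re.compile(r"(?<=[^\W\d_])(?=\d)|(?<=\d)(?=[^\W\d_])")


def normalise_address(locality: str, street_name: str, address3: str) -> str:
    """Build the lower-case comparison string for one location."""
    text = " ".join(part for part in (locality, street_name, address3) if part)
    text = text.lower()
    text = _PUNCTUATION.sub(" ", text)            # punctuation becomes spaces
    text = _LETTER_DIGIT_BOUNDARY.sub(" ", text)  # "12a" becomes "12 a"
    return text


print(normalise_address("Flat 2B", "12a High-Street", ""))
# prints: flat 2 b 12 a high street
```

Splitting the result on whitespace yields the individual items (words and numbers) used by the later steps.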
Step 4 - Check for a match in location 2 for each item in location 1.
1. If item is a number then Location 2 must either contain no numbers or contain an exact match.
2. If item is a word then check if this word is contained anywhere in Location 2.
3. If the word is not found in location 2 then compare it to each individual word in location 2. For each pair calculate the Levenshtein Distance. If this distance is 2 or less then this is a possible match. This allows for small spelling mistakes.
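The three rules of Step 4 can be sketched in Python as below. This is an illustrative reading, not the OpenPetra code. It assumes both address strings are already normalised, so that every whitespace-separated item is either all digits or a word, and it assumes the fuzzy comparison is made against words only, not numbers. All function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def item_matches(item: str, other_text: str, max_distance: int = 2) -> bool:
    """Apply rules 1-3 of Step 4 to a single item of location 1."""
    other_items = other_text.split()
    if item.isdigit():
        # Rule 1: the other address has no numbers, or has this exact number.
        other_numbers = [t for t in other_items if t.isdigit()]
        return not other_numbers or item in other_numbers
    if item in other_text:
        # Rule 2: the word is contained anywhere in the other address.
        return True
    # Rule 3: tolerate small spelling mistakes (distance of 2 or less).
    return any(levenshtein(item, word) <= max_distance
               for word in other_items if not word.isdigit())


def all_items_match(text1: str, text2: str) -> bool:
    return all(item_matches(item, text2) for item in text1.split())
```

One consequence of the fixed distance of 2 is that very short words are easy to match: for example, "rd" and "st" are exactly two substitutions apart.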
Step 5 - If any words of the address did not match in Step 4 but the numbers did match then try the comparison the other way around. Check for a match in location 1 for each item in location 2 (ignoring numbers this time).
1. If item is a word then check if this word is contained anywhere in location 1.
2. If the word is not found in location 1 then compare it to each individual word in location 1. For each pair calculate the Levenshtein Distance. If this distance is 2 or less then this is a possible match.
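How Steps 4 and 5 combine into a single decision might be sketched as follows. This is one possible reading of the text: the pair is rejected if the numbers conflict, accepted if every word of location 1 matches in location 2, and otherwise accepted only if every word of location 2 matches in location 1. The word-level fuzzy test is passed in as `fuzzy_equal`, a hypothetical parameter that in practice would be the "Levenshtein Distance of 2 or less" check. Both strings are assumed to be normalised already.

```python
from typing import Callable

FuzzyEqual = Callable[[str, str], bool]


def numbers_match(source_text: str, target_text: str) -> bool:
    """Target has no numbers, or contains every number of the source."""
    target_numbers = {t for t in target_text.split() if t.isdigit()}
    if not target_numbers:
        return True
    return all(n in target_numbers
               for n in source_text.split() if n.isdigit())


def words_all_match(source_text: str, target_text: str,
                    fuzzy_equal: FuzzyEqual) -> bool:
    """Every word of source is contained in, or fuzzily equals a word of, target."""
    target_words = [w for w in target_text.split() if not w.isdigit()]
    for word in source_text.split():
        if word.isdigit():
            continue  # numbers are ignored here
        if word in target_text:
            continue  # contained anywhere in the other address
        if not any(fuzzy_equal(word, other) for other in target_words):
            return False
    return True


def is_possible_duplicate(text1: str, text2: str,
                          fuzzy_equal: FuzzyEqual) -> bool:
    if not numbers_match(text1, text2):
        return False
    if words_all_match(text1, text2, fuzzy_equal):
        return True  # Step 4 succeeded in the forward direction
    # Step 5: some words failed, so try the comparison the other way around.
    return words_all_match(text2, text1, fuzzy_equal)
```

The reverse pass lets a longer address such as "rose cottage 12 high street" still match the shorter "12 high street", because every word of the shorter address is present in the longer one.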
Step 6 - Iterate through every record that has a blank or 99 CountryCode and repeat steps 2-5.
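The outer iteration (Step 1 together with Step 6) could be organised as in the Python sketch below, which only decides which pairs of locations get compared. It assumes locations are dicts with a "CountryCode" key, that the records have already been filtered as described above, and that records with a blank or 99 CountryCode are also compared with each other. The names are hypothetical, and the sketch is not taken from the OpenPetra source.

```python
from collections import defaultdict
from itertools import combinations

# Country codes treated as "unknown", following Step 6.
UNKNOWN_COUNTRY_CODES = ("", "99")


def candidate_pairs(locations):
    """Yield each pair of locations that should be compared, exactly once."""
    by_country = defaultdict(list)
    for location in locations:
        code = (location.get("CountryCode") or "").strip()
        by_country[code].append(location)

    # Pull the blank / 99 records out of the per-country groups.
    unknown = [location
               for code in UNKNOWN_COUNTRY_CODES
               for location in by_country.pop(code, [])]

    for same_country in by_country.values():
        # Step 1: every pair within one country, never repeated.
        yield from combinations(same_country, 2)
        # Step 6: each record against those with an unknown country.
        for location in same_country:
            for other in unknown:
                yield location, other

    # Assumption: unknown-country records are compared among themselves too.
    yield from combinations(unknown, 2)
```

Grouping by country first keeps the number of comparisons well below that of an all-pairs scan, since records from different known countries are never compared with each other.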