Although this case study focuses on a solution for British postal addresses, the same process and techniques can be applied to virtually any country’s addressing system. Nor is the approach limited to the industry of the customer presented here. Our engineering teams can provide a reliable, efficient and user-friendly method of address management that optimises delivery costs and times for companies working in logistics, e-commerce, customer engagement and many other areas of modern business.
Our customer is a global retailer serving more than one million online grocery orders per week. The huge scale of operations requires the best industry standards.
With tens of thousands of new users daily, it is important to keep the registration process frictionless. The most arduous and error-prone part of that process is typing in a delivery address. The same problem occurs in every company that processes a lot of address data, e.g. in retail, logistics or freight forwarding.
We helped our client design and build a solution that makes this operation fast, intuitive and simple.
For a big global retailer that is well known for offering home grocery delivery, precise information about the customer’s address is vital. An address that contains typos or lacks important information may extend the time the driver needs to reach the purchaser, or even make the whole order impossible to fulfil. In this competitive market, such situations bring additional costs and decrease customer satisfaction. Every minute of delivery delay scales up to large company-wide expenses.
Precise knowledge of order destinations has to be considered part of the strategy for staying ahead of the competition. This is why the global retailer acquires a collection of currently existing mailing addresses from a third-party company.
When customers register a new account or place an order, they are encouraged to pick an address from a list of official postal addresses. This ensures that delivery details are of good quality and can be assigned approximate geocoordinates. The existing registration page asked users to provide a postcode and then select a matching address from a dropdown list. If the correct address was not on the list, it could be entered manually through a mailing details form.
This seemed to be a sensible approach. People generally know their postcodes, and postcodes in most countries cover rather small areas. The dropdown list was not expected to contain more than 30 elements. However, after this solution was implemented, some edge cases emerged. After a more precise dataset was purchased from the third-party company, some postcodes came to include hundreds or even thousands of addresses. This forced users either to scroll through a never-ending list or to provide an address via the form, which increased the share of potentially imprecise, manually typed addresses. Statistics showed that around 4000 users per day were registering with manual addresses. In most of these cases the proper address was present in the dropdown, but the sheer number of entries made it difficult to find, so many users preferred to enter the address manually. It is worth emphasizing that manually typed addresses do not guarantee precision or correctness and may bring additional cost and delay during order delivery.
Our goal was to prepare an interactive, user-friendly tool for entering users’ postal addresses. We decided to replace the two-step postcode-then-address selection with a single text input field providing a search-as-you-type experience. Users would be able to search by any meaningful part of an address: postcode, street, town or even company or building name. Matching addresses would be returned as soon as the user started to type, and the result set would be continuously refined as the user added more detail to the query. Although such a change appears very simple from the website’s perspective, the underlying mechanisms introduced multiple engineering challenges. The new address finder needed to handle tens of thousands of new user registrations per day. Searching for a single address can trigger multiple requests, resulting in hundreds of thousands of queries to the application every day. Around 10% of registrations resulted in the creation of a new manually typed address, either because the address was missing or because it was difficult to find. This was an indicator we definitely wanted to lower.
Understanding the requirements
We were to start our work with an address finder for the United Kingdom. This densely populated European country has over 30 million postal addresses. Rapid filtering over such a dataset every time the user types a few more letters is a really demanding task. The other difficulty was the structure of British addresses. An address in the UK can contain up to seven address lines, consisting not only of a thoroughfare with a number but also a business park name, a building name, or an additional dependent street or sub-locality.
People are very likely to make spelling errors in their addresses. We noticed this after analyzing hundreds of manually typed addresses and getting to know the domain thoroughly. This is why we wanted to handle typos in the entered address query. However, not all parts of the address are equally immune to typographical errors. While a misspelled word in a street or town name will not be a serious problem, changing a single digit in a postcode will return addresses from a different postal area.
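The difference in typo tolerance between address parts can be illustrated with a minimal sketch. It assumes, purely for illustration, that any token containing a digit (a postcode fragment or a building number) must match exactly, while purely alphabetic tokens may be matched fuzzily. The field name `full_address` and both helper functions are hypothetical; the query body uses the standard Elasticsearch `bool` and `match` query syntax with the `fuzziness` option, and it is not necessarily the query shape used in the production finder.

```python
def split_query(query):
    """Split a raw query into strict tokens (containing digits) and fuzzy tokens."""
    tokens = query.lower().split()
    strict = [t for t in tokens if any(c.isdigit() for c in t)]
    fuzzy = [t for t in tokens if not any(c.isdigit() for c in t)]
    return strict, fuzzy


def build_search_body(query, field="full_address"):
    """Build an Elasticsearch query body; `field` is a hypothetical field name."""
    strict, fuzzy = split_query(query)
    # Tokens with digits must match exactly: one wrong digit means another area.
    must = [{"match": {field: {"query": t}}} for t in strict]
    # Alphabetic tokens tolerate small typos via Elasticsearch's fuzziness option.
    must += [{"match": {field: {"query": t, "fuzziness": "AUTO"}}} for t in fuzzy]
    return {"query": {"bool": {"must": must}}}


if __name__ == "__main__":
    # "hihg" is a deliberate typo; "zz1" is a made-up postcode fragment.
    print(build_search_body("12 hihg street zz1"))
```

Splitting on "contains a digit" is a deliberately crude rule. A real parser would need to recognise partial postcodes and flat or building numbers separately.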
Data driven testing
We knew that standard testing approaches would not be sufficient in our case. Of course, we were going to cover our code with unit, integration and performance tests, but these could verify only specific parts of our logic. The behaviour of the whole solution, its accuracy and its responsiveness would be influenced by the address dataset. We wanted to see quickly how changes in our application influenced the ability of the finder to return the most suitable results. This is why we introduced another type of test: regression tests.
We started by obtaining data for test cases. Fortunately, we had access to a database with the addresses of our users. We focused on manually typed addresses, as we wanted to verify how our finder would behave in situations where the old solution had failed to return a proper address. We took a large set of such addresses and distributed them among our engineering team. Each team member then spent time manually analysing hundreds of addresses to find the most interesting and sometimes tricky cases. Once the test dataset was complete, we implemented a tool that generates possible search queries such as street + town, postcode + building number, company name with other address parts, etc. The tool checks whether, for a given query, the expected address is returned near the top of the result list. Once we were sure that each version of our finder could be tested within minutes, we were able to start developing the finder itself.
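A regression tool of the kind described above could look roughly like the following sketch. The address fields, the pairwise query generation strategy, the `search` callable and the in-memory `toy_search` stand-in are all assumptions made for illustration. The real tool queried the actual finder and used a richer set of query templates.

```python
from itertools import combinations


def generate_queries(address):
    """Build search queries from every pair of non-empty address parts."""
    parts = [value for value in address.values() if value]
    return [" ".join(pair) for pair in combinations(parts, 2)]


def hit_rate(cases, search, top_n=5):
    """Fraction of generated queries whose expected id is in the top_n results."""
    hits = total = 0
    for address, expected_id in cases:
        for query in generate_queries(address):
            total += 1
            if expected_id in search(query)[:top_n]:
                hits += 1
    return hits / total if total else 0.0


# Hypothetical stand-in for the finder: every query word must appear in the text.
INDEX = {
    "a1": "10 example road testtown zz1 1zz",
    "a2": "12 example road testtown zz1 1zz",
}


def toy_search(query):
    words = query.lower().split()
    return [doc_id for doc_id, text in INDEX.items()
            if all(word in text.split() for word in words)]


if __name__ == "__main__":
    case = ({"number": "12", "street": "Example Road",
             "town": "Testtown", "postcode": "ZZ1 1ZZ"}, "a2")
    print(hit_rate([case], toy_search, top_n=1))
    print(hit_rate([case], toy_search, top_n=5))
```

Passing the search function as a parameter keeps the harness independent of the finder's implementation, so the same cases can be replayed against every new version.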
Start with a simple solution
We always try to start with the simplest solution. It gives our customers almost immediate feedback and gives them precious time to refine the requirements. It also saves money because we can validate our concepts quickly.
To solve this particular problem, we decided to base our project on the most popular text search engine, Elasticsearch. To decrease the cost of cluster maintenance and improve the resilience and availability of the solution, we used the managed AWS Elasticsearch service.
The first basic idea was to feed the Elasticsearch cluster with the address data and use the simplest queries to return address suggestions to users. Our tests quickly proved that this solution was not good enough, because addresses are a specific kind of document.
What is acceptable for a standard free-text search is not necessarily the best choice for every use case. Typos, fuzzy matching of numbers, discrepancies between user queries and the data, poor performance: these were the challenges we faced after the first iteration.
Room for improvement
When we started testing our initial approach, we did not have precise requirements for performance, responsiveness or search accuracy. Even without knowing the exact figures for predicted traffic or acceptable response times, we were aware that the first test results were not promising.
Our simple implementation, stressed with performance tests, delivered results in 3-4 seconds on average. These are not the response times you expect from an interactive solution that should display new candidate addresses every time the user types a letter or two. And even if our users turned out to be patient, their waiting would not be rewarded with the addresses they wanted. Our regression tests showed that we were able to serve the correct address at the top of the list in only 60% of cases. Such numbers could suggest only one thing: we needed improvement!
The iterative approach
Since we were not satisfied with the performance and accuracy of our initial simple solution we started the gradual process of improving our address finder. We organised our work into consecutive iterations during which we attempted to fulfil one of the following goals:
- improve the finder’s performance without losing accuracy
- make the finder return more accurate results without increasing response times
In each iteration, we focused on only one of these targets. We monitored the effects of our efforts with regression and performance tests. We prototyped and verified multiple ideas, starting with universal ones that could be applied to other countries’ address finders and then moving to ideas more specific to the British address domain.
As a result, we managed to decrease the volume of data stored in the Elasticsearch cluster by grouping similar addresses together. We tested different strategies for handling building numbers, postcodes and frequent words like ‘street’, ‘flat’ or ‘road’. We found an effective way of providing a fuzzy search for queries containing typos. Finally, to optimise both performance and upkeep costs, we adjusted the infrastructure running our Elasticsearch cluster.
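The idea of grouping similar addresses into a single indexed document can be sketched as follows. The grouping key (street, town and postcode) and the field names are assumptions made for illustration; the source only states that similar addresses were grouped to reduce the volume of stored data.

```python
from collections import defaultdict


def group_addresses(addresses):
    """Collapse addresses that differ only by building number into one document."""
    groups = defaultdict(list)
    for address in addresses:
        # Hypothetical grouping key: everything except the building number.
        key = (address["street"], address["town"], address["postcode"])
        groups[key].append(address["number"])
    return [
        {"street": street, "town": town, "postcode": postcode, "numbers": numbers}
        for (street, town, postcode), numbers in groups.items()
    ]


if __name__ == "__main__":
    raw = [
        {"number": "10", "street": "Example Road", "town": "Testtown", "postcode": "ZZ1 1ZZ"},
        {"number": "12", "street": "Example Road", "town": "Testtown", "postcode": "ZZ1 1ZZ"},
        {"number": "3", "street": "Sample Lane", "town": "Testtown", "postcode": "ZZ1 2ZZ"},
    ]
    for document in group_addresses(raw):
        print(document)
```

With this layout the search engine matches on the shared part of the address once, and the building number can be resolved inside the matched group.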
Although the address finder was ready for production release it wasn’t the end of improvements. We collect statistics to find room for other optimisations.
Explore the range of possibilities
We believe in all our developers. Innovative ideas do not come only from senior colleagues. This is why we organise brainstorming sessions whenever we face a challenge that requires a non-standard approach. The best proposals deserve prototype implementations, so we hold internal hackathons where we try to prove their feasibility and effectiveness.
We applied the same strategy when building the address finder. This way, we were able to choose the most promising directions and focus on them entirely. The ideas of parsing a query before sending it to Elasticsearch, grouping all buildings from the same street, and building n-gram indexes were all born and prototyped during these rounds of gathering ideas from engineers. They were then tested with our extensive data-driven test suites and later successfully incorporated into the final solution.
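To show why an n-gram index suits search-as-you-type, here is a small in-memory sketch. It uses edge n-grams (prefixes of each word), which is an assumption: the source mentions n-gram indexes without specifying the variant or its parameters. The gram sizes and function names are hypothetical, and in Elasticsearch the equivalent would be configured in the index analysis settings rather than written by hand.

```python
from collections import defaultdict

MIN_GRAM = 2   # hypothetical minimum prefix length
MAX_GRAM = 10  # hypothetical maximum prefix length


def edge_ngrams(token):
    """Return the prefixes of `token` with lengths MIN_GRAM..MAX_GRAM."""
    upper = min(len(token), MAX_GRAM)
    return [token[:n] for n in range(MIN_GRAM, upper + 1)]


def build_index(docs):
    """Map every edge n-gram to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            for gram in edge_ngrams(token):
                index[gram].add(doc_id)
    return index


def lookup(index, query):
    """Return ids of documents in which every query token is a word prefix."""
    result = None
    for token in query.lower().split():
        ids = index.get(token[:MAX_GRAM], set())
        result = ids if result is None else result & ids
    return sorted(result or [])


if __name__ == "__main__":
    docs = {"a1": "10 example road testtown", "a2": "12 exeter road testtown"}
    index = build_index(docs)
    print(lookup(index, "ex ro"))
    print(lookup(index, "exa ro"))
```

Because prefixes are computed at indexing time, each keystroke becomes a plain dictionary lookup instead of a scan. The cost is a larger index, and in this sketch tokens shorter than `MIN_GRAM` are not searchable at all.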
The solution currently works successfully in production. For a few hundred dollars a month, the customer gets a solution that is robust, effective and scalable.
Our continuous regression tests prove that we return the matching address as the first result for 90% of expected results.
This iterative approach helped us to:
- decrease the number of indexed documents by 70%
- reduce the monthly cost of infrastructure by 15%
- improve P99 response latency 50-fold
- increase search result accuracy (the percentage of expected results returned as the first result) from 60% to 90%
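For readers unfamiliar with the P99 metric mentioned above: it is the latency below which 99% of requests complete. A minimal sketch using the nearest-rank method follows. The function name and the sample values are hypothetical and are not measurements from this project.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Hypothetical latencies in milliseconds, for illustration only.
    latencies_ms = list(range(1, 101))
    print(percentile(latencies_ms, 99))
```

Tracking a high percentile rather than the average matters for an interactive tool, because a few slow responses are enough to make typing feel sluggish.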
We are proud that
and reduces the cost
Also worth mentioning is that our solution was designed as a multi-purpose, independent service that can be easily used in many different scenarios, like:
- address comparison used for fraud-detection
- address cleanup when migrating old databases
We understand that the difficulty of adopting new solutions and managing cross-team programmes grows with company size, and that our component is only one part of a bigger story about handling an increasing number of new customers in a cost-efficient manner. But even though our service is still waiting to be incorporated into the main business flow, we believe that the client and its users will see the benefits of our component from day one.