Software Development
From legacy to agility: re-architecting search on Grover.com
A story of a team renewing and taking end-to-end ownership of search
6.22.2023

Product search is fundamental to any e-commerce business: for conversion to happen, the user has to be able to search for something specific, narrow down the results by desired criteria, or casually browse through the catalog. Our team consists of 4 engineers, an engineering manager, a designer, and a product manager—all dedicated to bringing you the best possible experience for product search on Grover.com.

Grover is a subscription service, but from a technical point of view, the user-facing part of the product catalog is much closer to a standard e-commerce storefront than to something like Netflix. The overall scale is by no means small—we have over 5000 products in 4 languages and 64 different categories across 10 different stores. By 2022, we had been experiencing several problems with the existing search architecture for a while, slowing down development or, in some cases, completely blocking us from building a particular feature. It was time to start figuring out how to fix the situation and plan the next steps.

If you’re an engineer working on search, or just curious about how we did our search renewal project, hopefully this article can shed some light on our process and give you some ideas on how to solve similar problems that you might be facing. This article is only an overview, but we might also do some follow-ups on the same topic that dive deeper into the technical details, if that’s something you are interested in.

A closer look at our situation

On Grover.com we provide 2 main ways for the user to browse our catalog:

  • Product listing pages—where the user can filter products based on criteria such as category, brand, price, or a specification unique to a particular category, such as screen size for smartphones.

  • Product search—where the results are returned based on keywords provided by the user.

Both of these features are powered by the same search engine, which maintains an up-to-date index of the product catalog and allows us to resolve the data required for the above scenarios efficiently. Our previous search endpoints were part of a monolithic backend service dating from the very early days of Grover, and it no longer fulfilled all of our business requirements. The monolith is also written in Ruby, which isn't the best fit for our team's skill set, preventing us from working independently and slowing down development.

previous search architecture

The search engine used was ElasticSearch, which was tightly coupled with the backend monolith. Don't get me wrong, ElasticSearch is battle-tested and feature-rich, but it can be a bit cumbersome. For all its complexity, it's not something that's ideal for a small team such as ours to own. This meant some issues that might seem simple to solve remained unaddressed—for example, faceting on product listing pages didn't work as well as we wanted: the user could end up with a combination of filters for which no results were found at all. More importantly, we had issues with synonyms: you could not find a "PlayStation 5" using the keyword "ps5".

ps5 search result
Where’s my ps5? Sure, the user would be able to find all the ps5’s with “playstation 5” or at the very top of the “gaming consoles” category, but we can do better.

Performance also left a lot to be desired: in the keyword search endpoint we were seeing latencies of up to ~800ms, and the endpoint used for product listing pages performed even worse with latencies of about twice that. This had nothing to do with ElasticSearch. It was simply because we were doing multiple calls to fetch additional data from a separate database.

previous search transaction trace
One would think that a search endpoint would only do one or only a handful of parallel queries against the search engine, but a peek at a transaction trace will tell you what’s really going on behind the scenes. What this graph doesn’t show is that the client is also doing an additional call to fetch the data for rendering facets!

Moving to a more modular approach

We knew we wanted our new architecture to be more modular and decoupled—individual services being down should not necessarily mean that search stops working from the user's point of view. Being search engine agnostic was also important: we should be able to plug in whichever search-optimized database we want with little effort. Our solution consists of 3 main components:

  • The search engine—running on Grover’s own infrastructure.

  • Indexer service—a service responsible for pulling data from different data sources, then formatting and indexing it into the search engine. This service is also a Kafka consumer, with individual Kafka events, such as product price or availability changes, triggering per-product (re)indexing to keep data up to date (see the sketch after this list).

  • Query service—a service providing the subgraph that allows querying the search engine. We also handle some common business logic here, so the client doesn’t have to worry about it, and it’s connected to some other backend services that provide configuration for formatting facets, etc.
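To make the indexer's event-driven path more concrete, here is a minimal sketch of how a Kafka-triggered per-product reindex could look. It assumes a Node/TypeScript service using the kafkajs and typesense clients; the topic name, collection name, field names, and the fetchProduct helper are hypothetical placeholders, not our actual implementation.

```typescript
import { Kafka } from "kafkajs";
import Typesense from "typesense";

// Hypothetical broker, topic, and collection names -- illustration only.
const kafka = new Kafka({ clientId: "search-indexer", brokers: ["kafka:9092"] });
const typesense = new Typesense.Client({
  nodes: [{ host: "typesense", port: 8108, protocol: "http" }],
  apiKey: process.env.TYPESENSE_API_KEY ?? "",
});

// Placeholder for pulling the full product from its sources of truth
// and mapping it onto the search schema.
async function fetchProduct(sku: string): Promise<Record<string, unknown>> {
  return { id: sku, name: "PlayStation 5", price_per_month: 2999, in_stock: true };
}

async function run() {
  const consumer = kafka.consumer({ groupId: "search-indexer" });
  await consumer.connect();
  await consumer.subscribe({ topics: ["product-price-updated"] });

  await consumer.run({
    eachMessage: async ({ message }) => {
      const event = JSON.parse(message.value?.toString() ?? "{}");
      // Re-fetch the affected product and upsert it into the index, so price
      // and availability changes are reflected without a full reindex.
      const doc = await fetchProduct(event.sku);
      await typesense.collections("products").documents().upsert(doc);
    },
  });
}

run().catch(console.error);
```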

new search architecture

As for the search engine, we knew we wanted something more manageable that would allow us to respond swiftly to product requirements and move faster as a team. We did some extensive research and a number of POCs and considered both proprietary and open-source solutions. Our shortlist of options included ElasticSearch, Algolia, Solr, TypeSense, and MeiliSearch—some of the options are cloud-based, some are self-hosted, and some can do both.

In the end, we decided that TypeSense would be a great fit. It provides a robust feature set that covers our requirements well:

  • Filtering—we have product listing pages where we display products only from a particular category.

  • Faceting—we want to allow users to narrow down the search based on product characteristics such as brand, price, or specification. Algolia has a good article about the difference between Filters and Facets.

  • Typo tolerance—if a user makes a small mistake while typing search keywords, we still want to return meaningful results.

  • Grouping—we want to index product variants (e.g. color) instead of whole products. Most of the time, however, we don’t display product variants to the user separately—result grouping is necessary to achieve this.

  • Synonyms—a user might search for “ps5” when the product name is “PlayStation 5.”

  • High availability—product search is a critical component of the user journey. Running TypeSense in cluster mode means that it can tolerate individual node failures.

  • Result pinning—we want to be able to promote individual products and make sure they are displayed in the exact position that we want in the product list. 
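To give a feel for how these features map onto TypeSense, here is a rough sketch of a product-listing query using the official JavaScript client. The collection, field names, and document IDs are made up for illustration and are not our actual schema.

```typescript
import Typesense from "typesense";

const client = new Typesense.Client({
  nodes: [{ host: "localhost", port: 8108, protocol: "http" }],
  apiKey: "search-only-key",
});

async function example() {
  // Multi-way synonym so that "ps5" matches "PlayStation 5" (and vice versa).
  await client.collections("products").synonyms().upsert("ps5", {
    synonyms: ["ps5", "playstation 5"],
  });

  // One query covering filtering, faceting, typo tolerance, grouping, and pinning.
  const results = await client.collections("products").documents().search({
    q: "playstatoin",                                        // typo tolerance still matches "PlayStation"
    query_by: "name,brand",
    filter_by: "category:=gaming-consoles && price_per_month:<60",
    facet_by: "brand,color",                                 // used to render the filter sidebar
    group_by: "product_sku",                                 // collapse color variants into one result
    group_limit: 1,
    pinned_hits: "sku123:1",                                 // promote a specific product to position 1
  });
  return results;
}

example().catch(console.error);
```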

TypeSense also ships with sensible defaults that give good search results with minimal configuration, accelerating the development process. Its relative simplicity made it a great fit for a team that wanted to own the entire search end-to-end, and self-hosting would give us more flexibility and a faster time to market.

Wrong turns and dead ends

At Grover, we use federated GraphQL for pretty much all of our customer-facing data needs, allowing individual services to expose a subgraph that can be queried via our common gateway. These services can either define their own entities or add fields to existing ones, allowing the client to pull data from multiple services in a single query. As our previous search endpoints were REST, we often ended up in cases where search data had to be stitched together by hand with additional product data resolved from other services. This added business logic on the client side, or in some cases, in the BFF.

When starting the project, we were unsure if our new search should be part of the federated graph. Including it could potentially increase latency, and client-side stitching of data could be avoided by just indexing all the necessary fields in the search engine, or by handling it in the query service. We started off developing our query service independently of the federated graph, and this proved to be a mistake: the ability to index only a subset of the required fields and have the rest resolved from other services ended up giving us a ton of flexibility and enabled us to ship a fully working version faster. Later on, we could add any expensive fields to the search index, giving us a boost in performance as they no longer have to be resolved from elsewhere at query time. In our case, this approach was made possible by the federation-specific GraphQL directives @provides and @external.
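As a rough, federation-v1-style illustration of that pattern (not our production schema), the search subgraph can return lightweight Product stubs resolved elsewhere, while serving cheap indexed fields like name directly via @provides. All type, field, and argument names below are invented for the example.

```typescript
import { gql } from "graphql-tag";
import { buildSubgraphSchema } from "@apollo/subgraph";

// Illustrative subgraph schema -- type and field names are made up.
const typeDefs = gql`
  type Query {
    searchProducts(keywords: String!): ProductSearchResult!
  }

  type ProductSearchResult {
    totalCount: Int!
    # "name" is indexed in the search engine, so this subgraph can serve it
    # directly instead of having the gateway resolve it from the catalog service.
    products: [Product!]! @provides(fields: "name")
  }

  # Product is owned by another subgraph; here we only reference it by key
  # and mark the provided field as @external.
  extend type Product @key(fields: "sku") {
    sku: ID! @external
    name: String! @external
  }
`;

const resolvers = {
  Query: {
    searchProducts: async (_: unknown, { keywords }: { keywords: string }) => {
      // ...query the search engine and map hits to { sku, name } stubs...
      return { totalCount: 0, products: [] };
    },
  },
};

export const schema = buildSubgraphSchema({ typeDefs, resolvers });
```

Any field not listed in @provides is resolved by its owning subgraph via the entity key, which is what let us start with a thin index and move expensive fields into it later.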

We also cut corners in some places, and that turned out to be the wrong call. We didn't initially build category hierarchy into our new search, as it wasn't necessary for feature parity. This changed very shortly after launch. It was disappointing to have to refactor such a significant part so soon, but on the other hand, it was rewarding to see the new search unlocking new feature requests. Predicting future requirements can be hard, and making the wrong trade-offs just so you can ship faster initially might end up costing a ton of time and effort in the long run.

Ensuring result parity

So after you have your new search up and running, how do you make sure that it’s returning the correct results? Even with all of its shortcomings, the previous search was the best reference point we had for validating our new implementation. We needed to make sure that our shiny new search would index the products that are supposed to be indexed and essentially return the same results as the previous search. After all, our main goal for the initial phase of the project was result parity—more improvements would be made in later phases.

One of the most time-consuming parts of our implementation was handling edge cases. Over time, our previous search had accumulated a lot of these, often implemented in very creative ways. Just reading the code and re-implementing everything didn't seem like a viable plan, as the architecture was different and, in a lot of cases, our new architecture provided a different and better way of tackling the same problem.

We ended up building 2 different comparison tools. The first had a UI and the ability to quickly test different search parameters for TypeSense and compare the results side by side with the previous search. The second was a script that diffed the new results against the previous ones. We ended up running it periodically and adding monitoring to it—the main use case was to make sure that Kafka events were consumed correctly and that the indices would not drift apart over time.
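As an illustration of the second tool, the core of such a parity check can be as simple as querying both systems with the same keywords and comparing the returned SKUs. The URLs, response shape, and keyword list below are placeholders rather than our real endpoints.

```typescript
// Minimal sketch of a result-parity check; endpoints and fields are hypothetical.
const KEYWORDS = ["ps5", "iphone 14", "dyson"];

async function fetchSkus(url: string, keyword: string): Promise<string[]> {
  const res = await fetch(`${url}?q=${encodeURIComponent(keyword)}`);
  const body = await res.json();
  return body.results.map((r: { sku: string }) => r.sku);
}

async function compare() {
  for (const keyword of KEYWORDS) {
    const [oldSkus, newSkus] = await Promise.all([
      fetchSkus("https://old-search.internal/search", keyword),
      fetchSkus("https://new-search.internal/search", keyword),
    ]);
    const missing = oldSkus.filter((sku) => !newSkus.includes(sku));
    const extra = newSkus.filter((sku) => !oldSkus.includes(sku));
    if (missing.length || extra.length) {
      // In our real setup, discrepancies fed into monitoring rather than logs.
      console.warn({ keyword, missing, extra });
    }
  }
}

compare().catch(console.error);
```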

Rollout and early results

In the previous search, keyword search was the area where we had the most issues with results (see the "ps5" example above!). Implementing all the functionality needed for product listing pages (filtering and faceting) also proved to be the far more time-consuming part of the development process. Keyword search therefore seemed like a good opportunity to get some early insight into how our new search performed.

So we did what we usually do in this type of situation: set up an A/B test, rolled it out to just a small percentage of users, and tweaked accordingly. Also, as each of our product listing pages has a slightly different set of features, the rollout for these could be done one page at a time.

Some of the highlights for performance improvements include:

  • Lower latency in keyword search—a reduction from ~800ms to ~100ms at p95 provides a snappier user experience, which matters especially as we use autocomplete.

  • Lower latency for product listing pages—in this context the data is required for server-side rendering, which made the ~50% drop in latency all the sweeter, with a significant impact on user-facing performance metrics as well.

So far, our new search infrastructure has successfully served over 12 million requests without major issues. To be clear, we’re only getting started here. Now that the project’s initial phase is done, and we have reached feature and result parity, we have shifted our focus to providing better search results to the user, helping them find the products they want. After all, one of the project’s main goals was to build a platform enabling fast iteration and experimentation, so it’s time to put it to work and see what it’s capable of doing.


Lauri Viitala
