Model Cascades for Efficient Image Search


Modern neural encoders offer unprecedented text-image retrieval (TIR) accuracy.
However, their high computational cost impedes their adoption in large-scale image search.
We propose a novel image ranking algorithm that uses
a cascade of increasingly powerful neural encoders
to progressively filter images by how well they match a given text.
Our algorithm reduces lifetime TIR costs by over 3x.

Introduction

Search engines are the most widely used tool for information retrieval (IR) on the internet — Google alone processes over 8.5 billion searches a day. A search engine takes as input some query $q$ and returns a list of documents $\mathbb{D}$ ranked by how well they match $q$. Keyword-based search ranks results by naively matching query keywords with documents. Semantic search tries to improve on keyword-based search by matching queries to documents based on their meaning.
A fruitful domain for semantic search is TIR, where documents are images and queries are texts. New semantic search engines for TIR leverage recent advances in deep learning for processing images and natural language. Typically, these engines use neural networks to construct an image encoder $I$ and a text encoder $T$ that process a text $q$ and each image $d \in \mathbb{D}$ into embeddings $v_q = T(q)$ and $V_{\mathbb{D}} = \{v_d = I(d) : d \in \mathbb{D}\}$ that capture their semantics. Then, the engines rank the images in $\mathbb{D}$ by some similarity measure of $v_d$ and $v_q$. In large-scale search scenarios, $\mathbb{D}$ may contain several million documents. This makes it computationally expensive to compute embeddings for all documents.

We seek to lower this computational cost while preserving search quality. To this end, we measure search quality as Recall@$k$, which denotes the fraction of searches that include the desired result in the top-$k$ results. Even small increases in $k$ can significantly improve the Recall@$k$. Hence, for $g \gg k$, the top-$k$ results of a large and expensive encoder $I_l$ are likely included in the top-$g$ results of a small and cheap encoder $I_s$. This observation leads to our main idea: *At build time, pre-compute $V_{\mathbb{D}}$ with $I_s$. Then, at runtime, to handle a query $q$, retrieve the top-$m$ results $\mathbb{D}_m \subset \mathbb{D}$ for some $m \gg k$, recompute $V_{\mathbb{D}_m}$ with $I_l$, and return the top-$k$ results.* This idea, illustrated in the schematic below, naturally extends to a cascade of $r$ progressively larger encoders that compute progressively smaller sets $V_1, \ldots, V_r$.

In practice, it is possible for over 90% of all documents in $\mathbb{D}$ to never be included in any search result over the lifetime of a large-scale search engine. This means that our technique would evaluate $I_l$ on less than 10% of $\mathbb{D}$, resulting in significant lifetime computational savings. In this work we make the following contributions:

  • We introduce a novel cascading algorithm for fast TIR.
  • We show that our algorithm speeds up TIR on standard benchmarks by over 3x at no reduction in search quality.
  • We investigate the benefits of deep cascades and
    demonstrate a 2x reduction in query latency.

Figure: Schematic of our algorithm for a 2-level cascade $[I_s, I_1]$. In this example, encoder $I_s$ computes embeddings $V_s$ of all four images (leftmost four squares) at build time. At runtime, the images that correspond to the two highest-ranking embeddings (green) are processed by encoder $I_1$, which produces embeddings $V_1$ of higher quality. Finally, we rerank the top-2 images with $V_1$ to output the highest-ranking image.

Model cascading is a recurrent theme in the literature on efficient machine learning (ML) systems.
FrugalML minimizes access costs of ML APIs by cascading two calls to a cheap API and to an expensive API.
NoScope speeds up object detection in videos by splitting a reference model into a sequence of two specialized models.
Model cascades have also been applied to facial key point estimation, pedestrian detection and other domains.

Recent work on encoders for TIR is dominated by transformer-based bi-encoders (BEs) and
cross-encoders (CEs). BEs process images and texts with separate encoders, whereas CEs also add cross-connections between the encoders. Hence, CEs are more powerful, but need to recompute $V_{\mathbb{D}}$ for new queries. This makes them impractical for large-scale searches and unsuitable for our idea. Therefore, we focus on BEs.
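To make the difference concrete, here is a schematic sketch of the two scoring patterns (plain Python, not any specific model's API); `image_encoder`, `text_encoder`, and `cross_encoder` are placeholders for the respective networks:

```python
def dot(u, v):
    # Cosine similarity reduces to a dot product if embeddings are L2-normalized.
    return sum(a * b for a, b in zip(u, v))

def be_scores(query, images, image_encoder, text_encoder, cache):
    # Bi-encoder: image embeddings depend only on the image, so they can be
    # computed once at build time, cached, and reused for every future query.
    t = text_encoder(query)
    for img_id, img in images:
        if img_id not in cache:
            cache[img_id] = image_encoder(img)  # paid once per image, ever
    return {img_id: dot(cache[img_id], t) for img_id, _ in images}

def ce_scores(query, images, cross_encoder):
    # Cross-encoder: every score requires a joint forward pass over (image, query),
    # so nothing useful can be precomputed before the query is known.
    return {img_id: cross_encoder(img, query) for img_id, img in images}
```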

Several methods for fast TIR with CEs have been developed: VLDeformer trains a decomposable CE that can be used as a BE at inference time with minimal loss in quality.
CrispSearch, LightningDot and Retrieve Fast, Rerank Smart all introduce two-level sequences of a BE whose results can be cached for approximate inference and a CE for precise inference on a subset of the BE results. This is similar to our idea but differs in two key ways:
First, we consider arbitrarily deep model cascades, whereas these approaches are fundamentally limited to two models.
Second, we target BE inference instead of CE inference. In fact, our approach could complement these existing techniques by serving as the BE in their first stage for even faster TIR.

Models and Methods

Let $\mathbb{D}$ be a collection of $n$ images that we want to query with a cascade of BEs. Consider a cascade of image encoders $I = \{I_s, I_1, \ldots, I_r\}$ that all use the same text encoder $T$. We propose the algorithm Cascaded Search (below) to query $\mathbb{D}$ by ranking all images with $I_s$ and subsequently reranking the top $m_j$ images with $I_j$. Note that with $r=0$, Cascaded Search reduces to a standard BE search.

Computational cost
Assume that the function Query in Cascaded Search is invoked $q$ times and denote the computational cost of Cascaded Search by $C(I, q)$.
We want to minimize the lifetime computational cost of Cascaded Search, that is, $C(I, q)$ as $q\rightarrow\infty$. We can decompose $C(I, q)$ into the sum of the lifetime image encoding cost $a(I, q)$ and a term $b(q)$ that is independent of $I$ and thus irrelevant for optimization over $I$.
Next, we formalize our introductory observation on the set of a search engine’s lifetime search results into the following key assumption:

Algorithm: Cascaded Search. Here, $\mathrm{Rank}(D, V, t)$ sorts the images in $D$ by the cosine similarity of their encodings $V$ with the text encoding $t$.

  1. Input: $\{I_s, I_1,\ldots,I_r\}$, $m_1 > \ldots > m_r \in \mathbb{N}$, $\mathbb{D}$
  2. Init: for $c \in \mathbb{D}$ do $V_s[c] \leftarrow I_s(c)$
  3. Function Query(text)
     $\mathrm{Top} \leftarrow \mathrm{Rank}(\mathbb{D}, V_s, T(\mathrm{text}))$
  4. for $j=1$ to $r$ do
     for $c \in \mathrm{Top}[1\ldots m_j]$ do $V_j[c] \xleftarrow{\text{if empty}} I_j(c)$
     $\mathrm{Top} \leftarrow \mathrm{Rank}(\mathrm{Top}[1\ldots m_j], V_j, T(\mathrm{text}))$
     end for
     Return $\mathrm{Top}[1]$
     End Function
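To make the control flow concrete, the following is a minimal Python/NumPy sketch of Cascaded Search (illustrative only; the class and parameter names are ours, not from a released implementation). Here `encoders` stands for $[I_s, I_1, \ldots, I_r]$ as callables from an image to an embedding vector, `text_encoder` for $T$, and `ms` for the cutoffs $m_1 > \ldots > m_r$:

```python
import numpy as np

def rank(indices, embeddings, text_emb):
    # Sort image indices by cosine similarity of their embeddings to the text embedding.
    embs = np.stack([embeddings[i] for i in indices])
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    order = np.argsort(-(embs @ t))
    return [indices[i] for i in order]

class CascadedSearch:
    def __init__(self, images, encoders, text_encoder, ms):
        self.images, self.text_encoder = images, text_encoder
        self.I_s, self.stages, self.ms = encoders[0], encoders[1:], ms
        # Build time: embed every image with the cheap encoder I_s.
        self.V_s = {i: self.I_s(img) for i, img in enumerate(images)}
        # Lazily filled caches V_1 ... V_r for the larger encoders.
        self.V = [dict() for _ in self.stages]

    def query(self, text):
        t = self.text_encoder(text)
        top = rank(list(self.V_s), self.V_s, t)
        for j, (I_j, m_j) in enumerate(zip(self.stages, self.ms)):
            # Re-embed only the current top-m_j images with the larger encoder,
            # caching the result so no image is ever encoded twice by the same stage.
            for i in top[:m_j]:
                if i not in self.V[j]:
                    self.V[j][i] = I_j(self.images[i])
            top = rank(top[:m_j], self.V[j], t)
        return top[0]
```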

Assumption 1
For $q\in\mathbb{N}$, let $S_q \subset \mathbb{D}$ be the set of all images pushed to $\mathrm{Top}$ in query $q$. Then, $\frac{1}{n}\left|\bigcup_{q\in\mathbb{N}}S_q\right| =: f \ll 1$.

If $I_s, I_1, \ldots, I_r$ have per-image encoding costs $t_s < t_1 < \ldots < t_r$, then Assumption 1 implies that $a(I, q) = nt_s + fn\sum_{i=1}^r t_i$. Hence, the 2-level cascade $[I_s, I_1]$ is cheaper than the uncascaded baseline $[I_1]$ if the speedup factor $t_1/[t_s + ft_1]$ exceeds 1.
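As a rough illustration (the timings here are assumptions, not measurements): with $t_s \approx t_1/4$, roughly the cost ratio of the ViT-B/32 and ViT-B/16 encoders used later, and $f = 0.1$, the lifetime speedup of $[I_s, I_1]$ over $[I_1]$ comes out to $t_1/[t_s + ft_1] \approx t_1/[0.25\,t_1 + 0.1\,t_1] \approx 2.9$, consistent with the measured speedups reported below.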

We note that Assumption 1 implies no lifetime computational advantage of an $r$-level cascade $I$ over an equally powerful 2-level cascade $I' = [I_s, I_r]$ with $m'_1 = m_1$. However, if $q$ is low enough that the cache $V$ is not hit, then the $r$-level cascade speeds up individual queries by a factor of

$$\frac{m_1 t_r}{\sum_{i=1}^r m_i t_i}$$

This is useful because, unlike uncascaded models that execute the expensive image encoder only during build time, 2-level cascades incur a runtime overhead of $m_1 t_r$ when $V$ is not hit. Hence, deep cascades can mitigate the increased latency of early queries in 2-level cascades.
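For intuition, a small worked instance using the parameter choices from our experiments ($m_1 = 50$, $m_2 = 10$) and an assumed cost ratio $t_2/t_1 = 3.3$ (the ViT-B/16 to ViT-L/14 ratio cited below): setting $t_1 = 1$, the 2-level cascade $[I_s, I_2]$ pays $m_1 t_2 = 165$ cost units on a cache miss, whereas the 3-level cascade $[I_s, I_1, I_2]$ pays $m_1 t_1 + m_2 t_2 = 83$ units, roughly a 2x query speedup.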

Creating the Cascade

| Dataset | Method | R@1 | R@5 | R@10 | Speedup |
|---|---|---|---|---|---|
| MSCOCO | No Cascade | 30.1 | 54.2 | 64.6 | 1x |
| MSCOCO | Cascade | +0.2 | +0.4 | +0.5 | 3.2x |
| Flickr30k | No Cascade | 29.9 | 52.0 | 61.3 | 1x |
| Flickr30k | Cascade | +0.8 | +2.0 | +2.4 | 3.2x |

Table 1: Recall@$k$ in % and lifetime speedup of the 2-level cascade $[\textrm{ViT-B/32}, \textrm{ViT-B/16}]$ over the uncascaded ViT-B/16 baseline.

We apply our proposed methods to CLIP, a powerful transformer-based text-image BE. CLIP uses the GPT-2 architecture for the text encoder, the vision transformer (ViT) architecture
for the image encoder and matches images to texts by the cosine similarity of their embeddings.
We create a cascade $[I_s, I_1, \ldots, I_r]$ from publicly available trained CLIP image encoders of different sizes.
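As a concrete illustration (a minimal sketch under the shared-text-encoder assumption above, not the exact experimental code), differently sized pre-trained CLIP checkpoints can be loaded from the Hugging Face hub and used as cascade stages:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

small = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # cheap stage I_s
large = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")  # expensive stage I_1
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_images(model, images):
    # Encode a list of PIL images into L2-normalized CLIP image embeddings.
    feats = model.get_image_features(**processor(images=images, return_tensors="pt"))
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def embed_text(model, text):
    # Encode a query string into an L2-normalized CLIP text embedding.
    feats = model.get_text_features(**processor(text=[text], return_tensors="pt", padding=True))
    return feats / feats.norm(dim=-1, keepdim=True)
```

Note that each checkpoint ships with its own processor and text tower; reusing a single processor and text embedding across stages here mirrors the shared text encoder $T$ assumed by Cascaded Search.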

Experiments

Experimental Setup

  • Metrics Given a dataset $\mathbb{D}$ of image-caption pairs, we measure the Recall@$k$ (R@$k$) metric as the fraction of captions in $\mathbb{D}$ whose corresponding image is among the top-$k$ search results (a minimal sketch of this computation follows this list).
    In line with the IR literature, we report the Recall@$k$ for $k\in\{1,5,10\}$. In addition, we report the lifetime speedup for 2-level cascades and the query speedup for deeper cascades, as discussed above. We run all experiments on an Intel i7-11800H CPU at 2.30 GHz with Turbo Boost disabled and compute speedups by measuring the total CPU time of queries.

  • Datasets We evaluate our algorithm on the MSCOCO validation dataset with 5k samples and on the Flickr30k dataset with 32k samples.

  • Parameters We set the top-$m$ value of encoder $I_1$ to $m_1 = 50$ and assume a lifetime return fraction of $f = 0.1$.
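A minimal sketch of the Recall@$k$ computation (an illustrative helper, assuming each caption is paired with exactly one ground-truth image id):

```python
def recall_at_k(ranked_results, ground_truth, k):
    # ranked_results[i]: ranked list of image ids returned for caption i.
    # ground_truth[i]: id of the image actually paired with caption i.
    hits = sum(gt in ranked[:k] for ranked, gt in zip(ranked_results, ground_truth))
    return hits / len(ground_truth)
```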

2-level cascades

We use the Huggingface CLIP implementation with a ViT-B/16 image encoder as our uncascaded baseline $[I_1]$. We use the faster ViT-B/32 image encoder as $I_s$ to create the 2-level cascade $[I_s, I_1]$.
Table 1 shows empirical results. The cascaded model reduces lifetime computational costs threefold. Surprisingly, the cascaded model at the same time achieves consistently higher Recall@$k$ than the uncascaded model. One explanation may be that ViT-B/32 initially processes input images into 32x32 tiles. Since these tiles are more coarse-grained than the 16x16 tiles used by ViT-B/16, they may offer superior approximate filtering of search results. Hence, $I_s$ could determine the top $m_1$ images more effectively than $I_1$. Further research is needed to explain why 2-level cascades show superior Recall@$k$.

$r$-level cascades

| Dataset | Method | R@1 | R@5 | R@10 | Speedup |
|---|---|---|---|---|---|
| MSCOCO | No Cascade | 32.5 | 57.2 | 68.1 | 1x |
| MSCOCO | Cascade | +0.5 | +0.2 | -3.0 | 2.0x |
| Flickr30k | No Cascade | 35.3 | 58.5 | 67.4 | 1x |
| Flickr30k | Cascade | +1.0 | +0.0 | -3.7 | 2.0x |

Table 2: Recall@$k$ in % and query speedup of the 3-level cascade $[\textrm{ViT-B/32}, \textrm{ViT-B/16}, \textrm{ViT-L/14}]$ with $m_2=10$ over the 2-level cascade $[\textrm{ViT-B/32}, \textrm{ViT-L/14}]$.


As noted in the Models and Methods section, $r$-level cascades offer no reduced lifetime costs over 2-level cascades, but may speed up individual queries. This is important for large image encoders that slow down queries, such as the ViT-L/14 encoder, which is 3.3x slower than ViT-B/16. Therefore, we introduce the 2-level cascade $[\textrm{ViT-B/32}, \textrm{ViT-L/14}]$ and compare it against the 3-level cascade $[\textrm{ViT-B/32}, \textrm{ViT-B/16}, \textrm{ViT-L/14}]$. Concretely, we set a target speedup of 2x and use the query speedup formula above to determine the corresponding number $m_2$ of top-ranked images on which Cascaded Search should execute ViT-L/14. This yields $m_2 = m_1[\frac{1}{2} - \frac{t_1}{t_2}] = 50[\frac{1}{2} - \frac{1}{3.3}] \approx 10$. Table 2 reports the empirically measured query speedups and the change in Recall@$k$ of the 3-level cascade. Similarly to Table 1, the deeper cascade achieves higher recall for small $k$. However, for Recall@10 the predictions become significantly worse. This is because Cascaded Search only uses ViT-L/14 to rerank the top $m_2 = 10$ images, so the set of the top 10 images stays unchanged. Hence, for $m_2 = 10$, the cascade $[\textrm{ViT-B/32}, \textrm{ViT-B/16}, \textrm{ViT-L/14}]$ is equivalent to the less powerful cascade $[\textrm{ViT-B/32}, \textrm{ViT-B/16}]$ with respect to the Recall@10 metric.

Conclusion

Our experiments show that Cascaded Search can lower lifetime computational search costs by over 3x at no reduction in search quality. At the same time, deeper model cascades can mitigate the increased latency of early queries.

However, single-digit speedups may not sufficiently reduce computational costs to economically rank large-scale image databases with expensive transformer-based BEs. Instead, a practitioner may use traditional search engines to retrieve the top-$k$ images and apply a neural search cascade on top of the results. This heterogeneous cascade may offer a viable path towards integrating state-of-the-art neural networks with established image search platforms.

It is important to note that all our observations rely on Assumption 1. While we have provided anecdotal evidence to support our choice of the lifetime return fraction $f = 10\%$, different search scenarios likely vary in $f$ and will accordingly achieve different speedups.

