From Surf Wiki (app.surf) — the open knowledge base

Johnson–Lindenstrauss lemma

Mathematical result

Summary

Mathematical result

In mathematics, the Johnson–Lindenstrauss lemma is a result named after William B. Johnson and Joram Lindenstrauss concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. In the classical proof of the lemma, the embedding is a random orthogonal projection.

The lemma has applications in compressed sensing, manifold learning, dimensionality reduction, graph embedding, and natural language processing. Much of the data stored and manipulated on computers, including text and images, can be represented as points in a high-dimensional space (see vector space model for the case of text). However, the essential algorithms for working with such data tend to become bogged down very quickly as dimension increases.For instance, writing about nearest neighbor search in high-dimensional data sets, Jon Kleinberg writes: "The more sophisticated algorithms typically achieve a query time that is logarithmic in n at the expense of an exponential dependence on the dimension d; indeed, even the average case analysis of heuristics such as k-d trees reveal an exponential dependence on d in the query time. {{citation

Statement

Given 0 , a set X of N points in \mathbb{R}^n , and an integer k 8(\ln N)/\varepsilon^2, there is a linear map f: \mathbb{R}^n \rightarrow \mathbb{R}^k such that

: (1-\varepsilon)|u-v|^2 \leq |f(u) - f(v)|^2 \leq (1+\varepsilon)|u-v|^2

for all u,v \in X.

The formula can be rearranged:(1+\varepsilon)^{-1}|f(u)-f(v)|^2 \leq |u-v|^2 \leq (1-\varepsilon)^{-1}|f(u)-f(v)|^2

Alternatively, for any \epsilon\in(0,1) and any integer k\ge15(\ln N)/\varepsilon^2Or any integer k128(\ln N)/(9\varepsilon^2). there exists a linear function f: \mathbb{R}^n \rightarrow \mathbb{R}^k such that the restriction f|_X is (1+\varepsilon)-bi-Lipschitz.This result follows from the above result. Sketch of proof: Note 1/(1+\varepsilon) and \sqrt{1+3\varepsilon/4} for all \varepsilon\in(0,1). Do casework for 1=N and 13\varepsilon/4 in the latter case, noting 128/9

Also, the lemma is tight up to a constant factor, i.e. there exists a set of points of size N that needs dimension

: \Omega \left(\frac{\log(N)}{\varepsilon^2}\right)

in order to preserve the distances between all pairs of points within a factor of (1 \pm \varepsilon). |chapter-url=https://www.researchgate.net/publication/313162957 }}

The classical proof of the lemma takes f to be a scalar multiple of an orthogonal projection P onto a random subspace of dimension k in \mathbb{R}^n. An orthogonal projection collapses some dimensions of the space it is applied to, which reduces the length of all vectors, as well as distance between vectors in the space. Under the conditions of the lemma, concentration of measure ensures there is a nonzero chance that a random orthogonal projection reduces pairwise distances between all points in X by roughly a constant factor c. Since the chance is nonzero, such projections must exist, so we can choose one P and set f(v) = Pv/c.

To obtain the projection algorithmically, it suffices with high probability to repeatedly sample orthogonal projection matrices at random. If you keep rolling the dice, you will eventually obtain one in polynomial random time.

Proof

Based on.

Construct a random matrix A \sim \mathcal N(0,1)^{k \times n} , obtained by sampling each entry from the standard normal distribution. Then define P := A/\sqrt k . Then, for any nonzero vector x \in \R^n , let the projected vector be \hat x := A x . Standard geometric argument show that r := \frac{|\hat x|^2}{| x|^2} is chi-square distributed, that is, r \sim \chi^2(k) . Thus, it satisfies a concentration inequality for the chi-squared distribution:\Pr(r \in (1\pm \epsilon ) k ) \geq 1 - 2e^{-\frac k2 (\frac 12 \epsilon^2 - \frac 13 \epsilon^3)}By the union bound, the probability that this relation is true for all of x_1, \dots, x_N is greater than 1 - 2N e^{-\frac k2 (\frac 12 \epsilon^2 - \frac 13 \epsilon^3)}.

When k \geq \frac{4\ln 2N}{\epsilon^2(1 - 2\epsilon / 3)}, the probability is nonzero.

More generally, when k \geq \frac{4(d+1)\ln 2N}{\epsilon^2(1 - 2\epsilon / 3)}, the probability is \geq 1 - 1/(2N)^d , allowing arbitrarily high probability of success per sample, and a fortiori polynomial random time.

Alternate statement

A related lemma is the distributional JL lemma. This lemma states that for any 0 and positive integer d , there exists a distribution over \mathbb{R}^{k \times d} from which the matrix A is drawn such that for k = O(\varepsilon^{-2} \log(1/\delta)) and for any unit-length vector x \in \mathbb{R}^{d} , the claim below holds.{{citation | author1-link = William B. Johnson (mathematician) | author2-link = Joram Lindenstrauss | editor1-last = Beals | editor1-first = Richard | editor2-last = Beck | editor2-first = Anatole | editor3-last = Bellow | editor3-first = Alexandra | display-editors = 3 | editor4-last = Hajian | editor4-first = Arshag | url-access = registration : P(|\Vert Ax\Vert_2^2-1|\varepsilon)

One can obtain the JL lemma from the distributional version by setting x = (u-v)/|u-v|_2 and \delta for some pair u,v both in X. Then the JL lemma follows by a union bound over all such pairs.

Sparse JL transform

Database-friendly JL transform

(Achlioptas, 2003) proposed "database-friendly" JL transform, using matrices with only entries from (-1, 0, +1).

Let the random k \times n projection matrix R have entries drawn i.i.d., either from

R_{i j}= \begin{cases}+1 & \text { with probability } 1 / 2 \ -1 & \text { with probability } 1 / 2\end{cases}

or from R_{i j}= \begin{cases}+\sqrt{3} & \text { with probability } 1 / 6 \ 0 & \text { with probability } 2/3 \ -\sqrt{3} & \text { with probability } 1 / 6 \end{cases}

Given a vector v, we define the random projection f(v)=\frac{1}{\sqrt{k}} R v. Then for any vector v\in \R^n, we have \begin{aligned} &-\ln Pr(|f(v)|_2^2 \geq (1+\epsilon) |v|_2^2) \geq \frac k2 \left(\frac{\epsilon^2}{2} -\frac{\epsilon^3}{3} \right) \quad &\forall \epsilon 0 \ &-\ln Pr(|f(v)|_2^2 \leq (1-\epsilon) |v|_2^2) \geq \frac k2 \left(\frac{\epsilon^2}{2} -\frac{\epsilon^3}{3} \right) \quad &\forall \epsilon \in (0, 1) \end{aligned}

Fix some unit vector v \in \R^n. Define Q_i := \sum_j R_{ij} v_j. We have |f(v)|_2^2 = \frac{1}{k} \sum_i Q_i^2 .

Now, since the Q_1, \dots, Q_k are IID, we want to apply a Chernoff concentration bound for \frac 1k \sum_i Q_i^2 around 1. This requires upper-bounding the cumulant generating function (CGF).

For any k \in 1, 2, \dots, the moment of Q_i is upper-bound by the standard gaussian Z \sim N(0, 1): E[Q_i^{2k-1}] = 0 = E[Z^{2k-1}], \quad E[Q_i^{2k}] \leq E[Z^{2k}]

E[Q_i^{2k-1}] = 0 is easy: just apply the fact that E[R_{ij_1}\dots R_{ij_l}] = 0 when l is odd, since we can decompose it into a product of expectations, and one of those is the expectation of an odd power of Radamacher, which is zero.

Now, the trick is that we can rewrite Z as Z = \sum_i Z_iv_i, where each Z_1, \dots, Z_d is a standard gaussian. Then we need to compare: E[Q_i^{2k}] = \sum_{j_1, j_2, \dots, j_{2k-1}, j_{2k}}E[R_{ij_1}R_{ij_2}\dots R_{ij_{2k-1}}R_{ij_{2k}}] v_{j_1}v_{j_2}\dots v_{j_{2k-1}}v_{j_{2k}} and

E[Z^{2k}] = \sum_{j_1, j_2, \dots, j_{2k-1}, j_{2k}}E[Z_{j_1}Z_{j_2}\dots Z_{j_{2k-1}}Z_{j_{2k}}] v_{j_1}v_{j_2}\dots v_{j_{2k-1}}v_{j_{2k}}

In the top sum, a term E[R_{ij_1}R_{ij_2}\dots R_{ij_{2k-1}}R_{ij_{2k}}] v_{j_1}v_{j_2}\dots v_{j_{2k-1}}v_{j_{2k}} decomposes into a product of expectations, times v_{j_1}v_{j_2}\dots v_{j_{2k-1}}v_{j_{2k}}. The product of expectations is zero, unless the indices j_1, j_2, \dots, j_{2k} are paired off. In that case, the term v_{j_1}v_{j_2}\dots v_{j_{2k-1}}v_{j_{2k}} is the square of something, and so

v_{j_1}v_{j_2}\dots v_{j_{2k-1}}v_{j_{2k}} \geq 0 while R_{ij_1}R_{ij_2}\dots R_{ij_{2k-1}}R_{ij_{2k}} is also the square of \pm 1, and so

E[R_{ij_1}R_{ij_2}\dots R_{ij_{2k-1}}R_{ij_{2k}}] = 1

In the bottom sum, we run a similar argument with each such term E[Z_{j_1}Z_{j_2}\dots Z_{j_{2k-1}}Z_{j_{2k}}] v_{j_1}v_{j_2}\dots v_{j_{2k-1}}v_{j_{2k}} but in this case, since we have E[Z^{2k}] = (2k-1)!! \geq 1, we find that in each case,

E[R_{ij_1}R_{ij_2}\dots R_{ij_{2k-1}}R_{ij_{2k}}] v_{j_1}v_{j_2}\dots v_{j_{2k-1}}v_{j_{2k}} \leq E[Z_{j_1}Z_{j_2}\dots Z_{j_{2k-1}}Z_{j_{2k}}] v_{j_1}v_{j_2}\dots v_{j_{2k-1}}v_{j_{2k}}

And so, summing all of them up, E[Q_i^{2k}] \leq E[Z^{2k}].

The same argument works for the other case. Specifically, if R_{ij} is distributed like that, then E[R_{ij}^{2k}] =3^{k-1}\leq (2k-1)!!, and the proof goes through exactly the same way.

Now that Q_i is stochastically dominated by the standard gaussian, and E[Q_i^2] = 1, it remains to perform a Chernoff bound for Q_i^2, which requires bounding the cumulant generating function on both ends.

For any t \in (0, 1/2), we can compute the cumulant generating function \begin{aligned} K_{Q_i^2}(t) &= \ln E[e^{Q_i^2t}] \ &= \ln\sum_k \frac{t^k}{k!}E[Q_i^{2k}] \ &\leq \ln\left( 1 + \sum_k \frac{t^k}{k!}(2k-1)!!\right) \ &= -\frac 12 \ln(1-2t) \end{aligned}

Similarly, for any t\in (0, k/2), K_{\frac 1k \sum_i Q_i^2}(t) = \sum_i K_{ Q_i^2}(t/k) \leq -\frac k2 \ln(1-2t/k)

So by the standard Chernoff bound method, for any t \in (0, k/2) and any \epsilon 0, -\ln Pr\left(\frac 1k \sum_i Q_i^2 \geq 1+\epsilon\right) \geq (1+\epsilon)t + \frac k2 \ln(1-2t/k)

The right side is maximized at t = \frac{k\epsilon}{2(1+\epsilon)}, at which point we have -\ln Pr\left(\frac 1k \sum_i Q_i^2 \geq 1 + \epsilon \right) \geq \frac k2 (\epsilon - \ln(1+\epsilon)) \geq \frac k2 (\epsilon^2/2 - \epsilon^3/3)

That’s one half of the bound done. For the other half, begin with some t 0, and expand the exponential to the second order: \begin{aligned} K_{Q_i^2}(-t) &= \ln E[e^{-Q_i^2t}] \ &\leq \ln E[1-Q_i^2t + Q_i^4t^2/2] \ &\leq \ln (1 - t + 3t^2/2)\ \end{aligned}

K_{\frac 1k \sum_i Q_i^2}(-t) \leq k\ln (1 - t/k + 3t^2/(2k^2))

So by the standard Chernoff bound method, for any t 0 and any \epsilon \in (0, 1), -\ln Pr\left(\frac 1k \sum_i Q_i^2 \leq 1 - \epsilon \right) \geq -k[(1-\epsilon)(t/k) + \ln (1 - t/k + 3t^2/(2k^2))]

Plug in t = \frac{k\epsilon}{2(1+\epsilon)}, and simplify, we find the right side is \geq k \left(\frac{(\epsilon-1) \epsilon}{2(\epsilon+1)}-\ln\left(\frac{7 \epsilon^2+12 \epsilon+8}{8(\epsilon+1)^2}\right) \right) and expand to third Taylor power,

\geq k(\epsilon^2/4 - 7\epsilon^3/48) \frac k2 (\epsilon^2/2 - \epsilon^3/3)

Sparser JL transform on well-spread vectors

(Matoušek, 2008) proposed a variant of the above JL transform that is even more sparsified, though it only works on "well-spread" vectors.

Let R be a k \times n matrix sampled IID with

R_{ij}= \begin{cases}+q^{-1 / 2} & \text { with probability } \frac{1}{2} q \ -q^{-1 / 2} & \text { with probability } \frac{1}{2} q \ 0 & \text { with probability } 1-q\end{cases}

Then, for any unit vector v \in \R^n such that |v|_\infty \leq \alpha, we have Pr(|f(v)|_2^2 \in [1\pm \epsilon] ) \geq 1-\delta

where f(v) = \frac{1}{\sqrt k}Rv.

Speeding up the JL transform

Given A, computing the matrix vector product takes O(kd) time. There has been some work in deriving distributions for which the matrix vector product can be computed in less than O(kd) time.

There are two major lines of work. The first, Fast Johnson Lindenstrauss Transform (FJLT),{{citation was introduced by Ailon and Chazelle in 2006. This method allows the computation of the matrix vector product in just d\log d + k^{2+\gamma} for any constant \gamma0.

Another approach is to build a distribution supported over matrices that are sparse.{{citation This method allows keeping only an \varepsilon fraction of the entries in the matrix, which means the computation can be done in just kd\varepsilon time. Furthermore, if the vector has only b non-zero entries, the Sparse JL takes time kb\varepsilon, which may be much less than the d\log d time used by Fast JL.

Tensorized random projections

It is possible to combine two JL matrices by taking the so-called face-splitting product, which is defined as the tensor products of the rows (was proposed by V. Slyusar{{citation More directly, let {C}\in\mathbb R^{3\times 3} and {D}\in\mathbb R^{3\times 3} be two matrices. Then the face-splitting product {C}\bullet {D} is : {C} \bullet {D}

\left[ \begin{array} { c } {C}_1 \otimes {D}_1\\hline {C}_2 \otimes {D}_2\\hline {C}_3 \otimes {D}_3\ \end{array} \right]. This idea of tensorization was used by Kasiviswanathan et al. for differential privacy.{{citation | editor-last = Schulman | editor-first = Leonard J.

JL matrices defined like this use fewer random bits, and can be applied quickly to vectors that have tensor structure, due to the following identity: :(\mathbf{C} \bull \mathbf{D})(x\otimes y) = \mathbf{C}x \circ \mathbf{D} y = \left[ \begin{array} { c } (\mathbf{C}x)_1 (\mathbf{D} y)_1 \ (\mathbf{C}x)_2 (\mathbf{D} y)_2 \ \vdots \end{array}\right] , where \circ is the element-wise (Hadamard) product. Such computations have been used to efficiently compute polynomial kernels and many other .{{citation

In 2020{{citation :O(\epsilon^{-2}\log1/\delta + \epsilon^{-1}(\tfrac1c\log1/\delta)^c).

For large \epsilon this is as good as the completely random Johnson-Lindenstrauss, but a matching lower bound in the same paper shows that this exponential dependency on (\log1/\delta)^c is necessary. Alternative JL constructions are suggested to circumvent this.

Notes

References

"Lecture notes 5: Random projections".
[https://ocw.mit.edu/courses/18-s096-topics-in-mathematics-of-data-science-fall-2015/f9261308512f6b90e284599f94055bb4_MIT18_S096F15_Ses15_16.pdf MIT 18.S096 (Fall 2015): Topics in Mathematics of Data Science, Lecture 5, Johnson-Lindenstrauss Lemma and Gordons Theorem]
Achlioptas, Dimitris. (June 2003). "Database-friendly random projections: Johnson-Lindenstrauss with binary coins". Journal of Computer and System Sciences.
Matoušek, Jiří. (September 2008). "On variants of the Johnson–Lindenstrauss lemma". Random Structures & Algorithms.
Dirksen, Sjoerd. (2016-10-01). "Dimensionality Reduction with Subgaussian Matrices: A Unified Theory". Foundations of Computational Mathematics.
Slyusar, V. I.. (December 27, 1996). "End products in matrices in radar applications.". Radioelectronics and Communications Systems.
Slyusar, V. I.. (1997-05-20). "Analytical model of the digital antenna array on a basis of face-splitting matrix products.". Proc. ICATT-97, Kyiv.
Slyusar, V. I.. (1997-09-15). "New operations of matrices product for applications of radars". Proc. Direct and Inverse Problems of Electromagnetic and Acoustic Wave Theory (DIPED-97), Lviv..
Slyusar, V. I.. (March 13, 1998). "A Family of Face Products of Matrices and its Properties". Cybernetics and Systems Analysis C/C of Kibernetika I Sistemnyi Analiz.- 1999..
Slyusar, V. I.. (2003). "Generalized face-products of matrices in models of digital antenna arrays with nonidentical channels". Radioelectronics and Communications Systems.

Wikipedia Source

This article was imported from Wikipedia and is available under the Creative Commons Attribution-ShareAlike 4.0 License. Content has been adapted to SurfDoc format. Original contributors can be found on the article history page.

lemmas metric-geometry

Want to explore this topic further?

Ask Mako anything about Johnson–Lindenstrauss lemma — get instant answers, deeper analysis, and related topics.

Research with Mako

Free with your Surf account

Content sourced from Wikipedia, available under CC BY-SA 4.0.

This content may have been generated or modified by AI. CloudSurf Software LLC is not responsible for the accuracy, completeness, or reliability of AI-generated content. Always verify important information from primary sources.

Report