From Surf Wiki (app.surf) — the open knowledge base
Yule–Simon distribution
Discrete probability distribution
Name: Yule–Simon
Type: probability mass function
Figures: plot of the Yule–Simon PMF on a log-log scale; plot of the Yule–Simon CMF. (The function is defined only at integer values of k. The connecting lines do not indicate continuity.)
Parameters: \rho > 0, shape (real)
Support: k \in \{1,2,\dotsc\}
PMF: \rho\operatorname{B}(k, \rho+1)
CDF: 1 - k\operatorname{B}(k, \rho+1)
Mean: \frac{\rho}{\rho-1} for \rho > 1
Mode: 1
Variance: \frac{\rho^2}{(\rho-1)^2(\rho-2)} for \rho > 2
Skewness: \frac{(\rho+1)^2\sqrt{\rho-2}}{(\rho-3)\rho} for \rho > 3
Kurtosis: \rho+3+\frac{11\rho^3-49\rho-22}{(\rho-4)(\rho-3)\rho} for \rho > 4
MGF: does not exist
CF: \frac{\rho}{\rho+1}\,{}_2F_1(1,1; \rho+2; e^{it})\,e^{it}

In probability and statistics, the Yule–Simon distribution is a discrete probability distribution named after Udny Yule and Herbert A. Simon. Simon originally called it the Yule distribution.
The probability mass function (pmf) of the Yule–Simon (ρ) distribution is
:f(k;\rho) = \rho\operatorname{B}(k, \rho+1),
for integer k \geq 1 and real \rho > 0, where \operatorname{B} is the beta function. Equivalently, the pmf can be written in terms of the falling factorial as
: f(k;\rho) = \frac{\rho\Gamma(\rho+1)}{(k+\rho)^{\underline{\rho+1}}},
where \Gamma is the gamma function. Thus, if \rho is an integer,
: f(k;\rho) = \frac{\rho\,\rho!\,(k-1)!}{(k+\rho)!}.
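The two forms of the pmf above can be checked against each other numerically. The following is a minimal sketch using only the Python standard library; the function names are illustrative assumptions, not part of any established package. The beta function is evaluated through log-gamma values so that large k does not overflow.

```python
import math


def yule_simon_pmf(k: int, rho: float) -> float:
    """f(k; rho) = rho * B(k, rho + 1), evaluated on the log scale."""
    if k < 1 or rho <= 0:
        raise ValueError("requires integer k >= 1 and rho > 0")
    # log B(k, rho + 1) = log Gamma(k) + log Gamma(rho + 1) - log Gamma(k + rho + 1)
    log_beta = math.lgamma(k) + math.lgamma(rho + 1) - math.lgamma(k + rho + 1)
    return rho * math.exp(log_beta)


def yule_simon_pmf_integer_rho(k: int, rho: int) -> float:
    """Closed form rho * rho! * (k - 1)! / (k + rho)! for integer rho."""
    return rho * math.factorial(rho) * math.factorial(k - 1) / math.factorial(k + rho)


def yule_simon_cdf(k: int, rho: float) -> float:
    """F(k; rho) = 1 - k * B(k, rho + 1)."""
    log_beta = math.lgamma(k) + math.lgamma(rho + 1) - math.lgamma(k + rho + 1)
    return 1.0 - k * math.exp(log_beta)
```

For integer \rho the two pmf functions should agree up to floating-point rounding, and the partial sums of the pmf should match the cdf.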
The parameter \rho can be estimated using a fixed point algorithm.
The probability mass function f has the property that for sufficiently large k we have
: f(k;\rho) \approx \frac{\rho\Gamma(\rho+1)}{k^{\rho+1}} \propto \frac 1 {k^{\rho+1}}.

This means that the tail of the Yule–Simon distribution is a realization of Zipf's law: f(k;\rho) can be used to model, for example, the relative frequency of the kth most frequent word in a large collection of text, which according to Zipf's law is inversely proportional to a (typically small) power of k.
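The power-law tail can be illustrated by comparing the exact pmf with the approximation \rho\Gamma(\rho+1)/k^{\rho+1}. This is a small sketch with assumed helper names; the ratio of the two quantities should approach 1 as k grows.

```python
import math


def yule_simon_pmf(k: int, rho: float) -> float:
    """Exact pmf rho * B(k, rho + 1), evaluated via log-gamma."""
    log_beta = math.lgamma(k) + math.lgamma(rho + 1) - math.lgamma(k + rho + 1)
    return rho * math.exp(log_beta)


def power_law_tail(k: int, rho: float) -> float:
    """Large-k approximation rho * Gamma(rho + 1) / k^(rho + 1)."""
    return rho * math.gamma(rho + 1) / k ** (rho + 1)


def tail_ratio(k: int, rho: float) -> float:
    """Exact pmf divided by its power-law approximation."""
    return yule_simon_pmf(k, rho) / power_law_tail(k, rho)


if __name__ == "__main__":
    for k in (10, 100, 1000, 10000):
        print(k, tail_ratio(k, 1.5))
```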
Occurrence
The Yule–Simon distribution arose originally as the limiting distribution of a particular model studied by Udny Yule in 1925 to analyze the growth in the number of species per genus in some higher taxa of biotic organisms.
The preferential attachment process can also be studied as an urn process in which balls are added to a growing number of urns, each ball being allocated to an urn with probability linear in the number (of balls) the urn already contains.
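The urn description can be sketched as a simulation. The version below makes assumptions that the text does not spell out: at each step a new urn is opened with a fixed probability `p_new` (a hypothetical parameter name), and otherwise the ball goes to an existing urn with probability proportional to its current ball count. Under Simon's formulation of this process, the urn sizes are expected to approach a Yule–Simon distribution with \rho = 1/(1 - p_new) as the number of steps grows.

```python
import random
from collections import Counter


def simulate_urns(n_steps: int, p_new: float, seed: int = 0):
    """Return the list of urn sizes after n_steps balls have been added.

    ball_to_urn[i] records which urn holds ball i, so choosing a uniformly
    random ball selects an urn with probability proportional to its size.
    """
    rng = random.Random(seed)
    ball_to_urn = [0]  # start with one urn holding one ball
    n_urns = 1
    for _ in range(n_steps):
        if rng.random() < p_new:
            # open a new urn containing the new ball
            ball_to_urn.append(n_urns)
            n_urns += 1
        else:
            # preferential attachment: copy the urn of a random existing ball
            ball_to_urn.append(rng.choice(ball_to_urn))
    counts = Counter(ball_to_urn)
    return [counts[u] for u in range(n_urns)]


if __name__ == "__main__":
    sizes = simulate_urns(100_000, p_new=0.5, seed=1)
    share_of_singletons = sizes.count(1) / len(sizes)
    # With p_new = 0.5 the assumed limit has rho = 2, so f(1) = rho / (rho + 1).
    print(share_of_singletons, 2 / 3)
```

Storing one list entry per ball trades memory for simplicity; a weighted sampler over urn sizes would behave the same way.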
The distribution also arises as a compound distribution, in which the parameter of a geometric distribution is treated as a function of a random variable having an exponential distribution. Specifically, assume that W follows an exponential distribution with rate \rho (equivalently, scale 1/\rho):
:W \sim \operatorname{Exponential}(\rho),
with density
:h(w;\rho) = \rho \exp(-\rho w).
Then a Yule–Simon distributed variable K has the following geometric distribution conditional on W:
: K \sim \operatorname{Geometric}(\exp(-W)).
The pmf of a geometric distribution is
:g(k; p) = p (1-p)^{k-1}
for k\in\{1,2,\dotsc\}. The Yule–Simon pmf is then the following exponential-geometric compound distribution:
:f(k;\rho) = \int_0^\infty g(k;\exp(-w))\, h(w;\rho)\,dw.
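The compound construction gives a direct way to draw Yule–Simon variates: sample W from the exponential distribution, then sample K from a geometric distribution with success probability exp(-W). The sketch below uses only the standard library; the function name and the inversion method for the geometric draw are choices made for this illustration.

```python
import math
import random


def sample_yule_simon(rho: float, rng: random.Random) -> int:
    """Draw K by compounding: W ~ Exponential(rate=rho), K | W ~ Geometric(exp(-W))."""
    w = rng.expovariate(rho)  # expovariate takes the rate parameter
    p = math.exp(-w)          # success probability of the geometric draw
    if p >= 1.0:
        return 1              # success is certain on the first trial
    if p == 0.0:
        # exp(-w) underflowed; practically unreachable unless rho is tiny
        raise OverflowError("success probability underflowed to zero")
    u = 1.0 - rng.random()    # uniform on (0, 1], avoids log(0)
    # Inversion: P(K > k) = (1 - p)^k for the geometric distribution on {1, 2, ...}
    return int(math.floor(math.log(u) / math.log1p(-p))) + 1


if __name__ == "__main__":
    rng = random.Random(42)
    rho = 4.0
    draws = [sample_yule_simon(rho, rng) for _ in range(100_000)]
    # The empirical share of ones should be close to f(1; rho) = rho / (rho + 1).
    print(draws.count(1) / len(draws), rho / (rho + 1))
```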
The maximum likelihood estimator for the parameter \rho given the observations k_1,k_2,k_3,\dots,k_N is the solution to the fixed point equation
: \rho^{(t+1)} = \frac{N+a-1}{b+\sum_{i=1}^N\sum_{j=1}^{k_i}\frac{1}{\rho^{(t)} + j}},
where a and b are the shape and rate parameters of a gamma distribution prior on \rho. Setting a=1 and b=0 gives the maximum likelihood estimator.
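The fixed point equation can be iterated directly. The following is a minimal sketch, not the cited authors' implementation: the function name, starting value, tolerance and iteration cap are all assumptions. With the defaults a=1 and b=0 it targets the maximum likelihood estimate.

```python
def fit_rho(data, a=1.0, b=0.0, rho0=1.0, tol=1e-12, max_iter=10_000):
    """Iterate rho <- (N + a - 1) / (b + sum_i sum_{j=1..k_i} 1 / (rho + j)).

    data is a sequence of positive integers k_1, ..., k_N.
    The iteration cap matters: if every observation equals 1 the maximum
    likelihood estimate is unbounded and the iterates keep growing.
    """
    n = len(data)
    rho = rho0
    for _ in range(max_iter):
        s = sum(1.0 / (rho + j) for k in data for j in range(1, k + 1))
        new_rho = (n + a - 1.0) / (b + s)
        if abs(new_rho - rho) < tol * max(1.0, rho):
            return new_rho
        rho = new_rho
    return rho


if __name__ == "__main__":
    observations = [1, 1, 2, 1, 3, 1, 1, 5, 2, 1]  # made-up illustrative data
    print(fit_rho(observations))
```

At the returned value the score N/\rho - \sum_i\sum_j 1/(\rho+j) should be close to zero, which gives a simple check on convergence.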
This algorithm was derived by Garcia by directly optimizing the likelihood. Roberts and Roberts generalize the algorithm to Bayesian settings using the compound geometric formulation described above. They also use the expectation-maximization (EM) framework to show that the fixed point algorithm converges, derive the sub-linearity of its convergence rate, and give two alternative derivations of the standard error of the estimator from the fixed point equation. The variance of the estimator \hat{\rho} is
: \operatorname{Var}(\hat{\rho}) = \frac{1}{\frac{N}{\hat{\rho}^2} - \sum_{i=1}^N\sum_{j=1}^{k_i}\frac{1}{(\hat{\rho} + j)^2}}.
The denominator is the observed information of the full sample of N observations, so the standard error is the square root of this variance.
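The variance formula can be evaluated once an estimate is available. This sketch assumes the estimate is passed in as `rho_hat` (a name chosen for illustration) and simply computes the reciprocal of the denominator shown above.

```python
def estimator_variance(rho_hat: float, data) -> float:
    """1 / (N / rho_hat^2 - sum_i sum_{j=1..k_i} 1 / (rho_hat + j)^2)."""
    n = len(data)
    curvature = sum(
        1.0 / (rho_hat + j) ** 2 for k in data for j in range(1, k + 1)
    )
    information = n / rho_hat ** 2 - curvature
    if information <= 0.0:
        # The formula is only meaningful where this quantity is positive.
        raise ValueError("non-positive information at rho_hat")
    return 1.0 / information


if __name__ == "__main__":
    observations = [1, 1, 2, 1, 3, 1, 1, 5, 2, 1]  # made-up illustrative data
    print(estimator_variance(1.96, observations))  # 1.96 is an assumed estimate
```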
Generalizations
The two-parameter generalization of the original Yule distribution replaces the beta function with an incomplete beta function. The probability mass function of the generalized Yule–Simon(ρ, α) distribution is defined as
: f(k;\rho,\alpha) = \frac{\rho}{1-\alpha^\rho} \; \mathrm{B}_{1-\alpha}(k, \rho+1),
with 0 \leq \alpha < 1. For \alpha = 0 the ordinary Yule–Simon(ρ) distribution is obtained as a special case. The use of the incomplete beta function has the effect of introducing an exponential cutoff in the upper tail.
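The generalized pmf can be evaluated numerically. The sketch below is an illustration only: it approximates the incomplete beta function B_x(a, b) = \int_0^x t^{a-1}(1-t)^{b-1}\,dt with a composite Simpson rule rather than a library routine, and the function names and the number of subintervals are assumptions.

```python
def incomplete_beta(x: float, a: float, b: float, n: int = 2000) -> float:
    """Approximate B_x(a, b) by composite Simpson's rule with n (even) subintervals.

    Intended for a >= 1 and b >= 1, where the integrand is bounded on [0, x].
    """
    if x <= 0.0:
        return 0.0
    h = x / n

    def integrand(t: float) -> float:
        return t ** (a - 1) * (1.0 - t) ** (b - 1)

    total = integrand(0.0) + integrand(x)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * integrand(i * h)
    return total * h / 3.0


def generalized_yule_simon_pmf(k: int, rho: float, alpha: float) -> float:
    """f(k; rho, alpha) = rho / (1 - alpha^rho) * B_{1 - alpha}(k, rho + 1)."""
    if k < 1 or rho <= 0 or not (0.0 <= alpha < 1.0):
        raise ValueError("requires k >= 1, rho > 0 and 0 <= alpha < 1")
    return rho / (1.0 - alpha ** rho) * incomplete_beta(1.0 - alpha, k, rho + 1)


if __name__ == "__main__":
    # alpha = 0 should reproduce the ordinary pmf, e.g. 4 / (3 * 4 * 5) for rho = 2, k = 3.
    print(generalized_yule_simon_pmf(3, 2.0, 0.0))
    # With alpha > 0 the tail is cut off, so a modest range of k carries almost all the mass.
    print(sum(generalized_yule_simon_pmf(k, 1.5, 0.2) for k in range(1, 201)))
```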
Bibliography
- Colin Rose and Murray D. Smith, Mathematical Statistics with Mathematica. New York: Springer, 2002. (See page 107, where it is called the "Yule distribution".)
This article was imported from Wikipedia and is available under the Creative Commons Attribution-ShareAlike 4.0 License. Content has been adapted to SurfDoc format. Original contributors can be found on the article history page.