From Surf Wiki (app.surf) — the open knowledge base

Boolean model of information retrieval

Classical information retrieval model

Summary

Classical information retrieval model

The (standard) Boolean model of information retrieval (BIR) is a classical information retrieval (IR) model where documents are retrieved based on whether they satisfy the conditions of a query that uses Boolean logic. As the first and most-adopted information retrieval model, it treats each document as a set of words, or terms. A user's query uses logical operators like AND, OR, and NOT to create a rule for retrieval. The system then returns all documents that match the rule.

Definitions

In the Boolean model, documents and queries are represented using concepts from set theory. A document is seen as a simple collection (a set) of terms, and a query is a formal statement (a Boolean expression) that specifies which terms must or must not be present in a retrieved document.

An index term (or term) is a keyword that characterizes the content of a document. Terms are the fundamental units of the model. Common, low-information words (called stop words) like "a", "the", and "is" are typically excluded from being used as index terms.
A document is represented as a set of index terms. This is a bag-of-words model, meaning the order and frequency of terms in the original document are ignored. For example, a document about Bayes' theorem might be represented simply as the set {\text{Bayes' theorem, probability, decision-making}}.
A query is a formal expression of the user's information need, written using index terms and Boolean operators (AND, OR, NOT). The model retrieves every document that is considered a "match" for this logical expression.

Formal representation

The model can be defined formally as follows:

Let T = {t_1, t_2, \ldots, t_k} be a set of all index terms.
A document D_j is any subset of T.
A query Q is a Boolean expression, typically in conjunctive normal form:Q = (t_a \lor t_b) \land (\lnot t_c \lor t_d) \land \dotswhere t_a, t_b, \dots \in T. Retrieval is the process of identifying the set of all documents {D_j} that satisfy the query Q. For example, for the simple query Q = t_a \land t_b, the system would retrieve all documents whose set of terms contains both t_a and t_b.

Example

Let the set of original (real) documents be, for example

: D = {D_1,\ D_2,\ D_3}

where

D_1 = "Bayes' principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution)."

D_2 = "Bayesian decision theory: A mathematical theory of decision-making which presumes utility and probability functions, and according to which the act to be chosen is the Bayes act, i.e. the one with highest subjective expected utility. If one had unlimited time and calculating power with which to make every decision, this procedure would be the best way to make any decision."

D_3 = "Bayesian epistemology: A philosophical theory which holds that the epistemic status of a proposition (i.e. how well proven or well established it is) is best measured by a probability and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures. A Bayesian epistemologist would use probability to define, and explore the relationship between, concepts such as epistemic status, support or explanatory power."

Let the set T of terms be:T = {t_1=\text{Bayes' principle}, t_2=\text{probability}, t_3=\text{decision-making}, t_4=\text{Bayesian epistemology}}Then, the set D of documents is as follows:D = {D_1,\ D_2,\ D_3}where \begin{align} D_1 &= {\text{probability},\ \text{Bayes' principle}} \ D_2 &= {\text{probability},\ \text{decision-making}} \ D_3 &= {\text{probability},\ \text{Bayesian epistemology}} \end{align}Let the query Q be ("probability" AND "decision-making"):Q = \text{probability} \and \text{decision-making}Then to retrieve the relevant documents:

Firstly, the following sets S_1 and S_2 of documents D_i are obtained (retrieved):\begin{align} S_1 &= {D_1,\ D_2,\ D_3} \ S_2 &= {D_2} \end{align}Where S_1 corresponds to the documents which contain the term "probability" and S_2 contain the term "decision-making".
Finally, the following documents D_i are retrieved in response to Q: Q: {D_1,\ D_2,\ D_3}\ \cap\ {D_2}\ =\ {D_2}Where the query looks for documents that are contained in both sets S using the intersection operator. This means that the original document D_2 is the answer to Q.

If there is more than one document with the same representation (the same subset of index terms t_n), every such document is retrieved. Such documents are indistinguishable in the BIR (in other words, equivalent).

Advantages

Clean formalism
Easy to implement
Intuitive concept
If the resulting document set is either too small or too big, it is directly clear which operators will produce respectively a bigger or smaller set.
It gives (expert) users a sense of control over the system. It is immediately clear why a document has been retrieved given a query.

Disadvantages

Exact matching may retrieve too few or too many documents
Hard to translate a query into a Boolean expression
Ineffective for Search-Resistant Concepts
All terms are equally weighted
More like data retrieval than information retrieval
Retrieval based on binary decision criteria with no notion of partial matching
No ranking of the documents is provided (absence of a grading scale)
Information need has to be translated into a Boolean expression, which most users find awkward
The Boolean queries formulated by the users are most often too simplistic
The model frequently returns either too few or too many documents in response to a user query

Data structures and algorithms

From a pure formal mathematical point of view, the BIR is straightforward. From a practical point of view, however, several further problems should be solved that relate to algorithms and data structures, such as, for example, the choice of terms (manual or automatic selection or both), stemming, hash tables, inverted file structure, and so on.

Hash sets

Another possibility is to use hash sets. Each document is represented by a hash table which contains every single term of that document. Since hash table size increases and decreases in real time with the addition and removal of terms, each document will occupy much less space in memory. However, it will have a slowdown in performance because the operations are more complex than with bit vectors. On the worst-case performance can degrade from O(n) to O(n2). On the average case, the performance slowdown will not be that much worse than bit vectors and the space usage is much more efficient.

Signature file

Each document can be summarized by Bloom filter representing the set of words in that document, stored in a fixed-length bitstring, called a signature. The signature file contains one such superimposed code bitstring for every document in the collection. Each query can also be summarized by a Bloom filter representing the set of words in the query, stored in a bitstring of the same fixed length. The query bitstring is tested against each signature. Justin Zobel; Alistair Moffat; and Kotagiri Ramamohanarao. "Inverted Files Versus Signature Files for Text Indexing". Bob Goodwin; et al. "BitFunnel: Revisiting Signatures for Search". 2017. Richard Startin. "Bit-Sliced Signatures and Bloom Filters".

The signature file approached is used in BitFunnel.

Inverted file

An inverted index file contains two parts: a vocabulary containing all the terms used in the collection, and for each distinct term an inverted index that lists every document that mentions that term.

References

(1973). "Information Retrieval On-Line". Melville Publishing Co., Los Angeles, California.
"Information Retrieval".
(6 August 2024). "Stop searching and you will find it: Search-Resistant Concepts in systematic review searches". BMJ Evidence-Based Medicine.
Wartik, Steven. (1992). "Information Retrieval Data Structures & Algorithms". Prentice-Hall, Inc..

Wikipedia Source

This article was imported from Wikipedia and is available under the Creative Commons Attribution-ShareAlike 4.0 License. Content has been adapted to SurfDoc format. Original contributors can be found on the article history page.

mathematical-modeling information-retrieval-techniques

Want to explore this topic further?

Ask Mako anything about Boolean model of information retrieval — get instant answers, deeper analysis, and related topics.

Research with Mako

Free with your Surf account

Content sourced from Wikipedia, available under CC BY-SA 4.0.

This content may have been generated or modified by AI. CloudSurf Software LLC is not responsible for the accuracy, completeness, or reliability of AI-generated content. Always verify important information from primary sources.

Report