Closed Book Information Retrieval | Spontan•wild•und•Kuchen

The machine-learning-based chatbot ChatGPT is currently causing quite a stir. Because its answers reach a linguistic quality not previously seen from software, there is a risk that school and university students will use it to write their papers. This is a problem that is also being discussed among my colleagues. How should one respond when factually wrong content is presented so well that it sounds truly convincing?

Against this background, I wanted to test what happens when you ask ChatGPT about something you made up yourself. Here is my dialogue:

What is "closed book-case information retrieval"? Please cite scientific sources!

Let's see what ChatGPT makes of the "closed bookcase".

Closed book information retrieval (CBIR) is a type of information retrieval (IR) system that does not have access to external resources or external knowledge beyond the documents in the collection being searched. CBIR systems rely solely on the content of the documents in the collection to retrieve relevant documents in response to a user's query.

CBIR systems are commonly used in situations where external resources are not available or are not reliable, such as when searching a closed-access document collection or when the quality or relevance of external resources is uncertain. CBIR systems can also be useful in situations where the content of the documents in the collection is highly specialized or domain-specific, and external resources may not be relevant or may not provide enough context to support effective retrieval.

One example of a CBIR system is a search engine that is designed to search a collection of documents on a specific topic, such as a corporate intranet or a specialized database. Another example is a digital library system that allows users to search a collection of digitized books, articles, or other documents.

There is a large body of research in the field of information retrieval on CBIR systems, and many different approaches have been developed to improve their effectiveness. Some common approaches include the use of natural language processing techniques to extract relevant information from the documents in the collection, the use of machine learning algorithms to improve the accuracy of document ranking and relevance, and the use of user feedback and interaction to refine search results and improve the overall user experience.

Some references for further reading on CBIR include:

Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.
Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. New York: ACM Press.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.
Liu, B. (2008). Web data mining: Exploring hyperlinks, contents, and usage data. Springer.

ChatGPT silently drops the term "case" and hands me a neat definition of "closed book information retrieval".

I actually find it quite convincing, so much so that I begin to doubt myself. Did I happen to pick a term for my experiment that really exists?

However, a web search for "Closed book information retrieval" returns nothing (as of 22 December 2022), as expected. Anyone reading these lines may well get this very page as a hit. It is a good feeling to have written what is probably the first article about closed book information retrieval.

A search for "CBIR" returns only "Content Based Image Retrieval". "Closed book QA systems" do in fact appear to be a thing, but the term is not very widespread, and ChatGPT says nothing about QA systems either.

The literature given as references really exists, but it says nothing about closed book information retrieval.

I ask ChatGPT for examples:

Please give me an example of a publicly available closed book IR search engine!

ChatGPT delivers:

One example of a publicly available closed book information retrieval (IR) search engine is the Digital Public Library of America (DPLA). The DPLA is a national digital library that provides access to a wide variety of digital resources, including books, articles, images, and other materials from libraries, archives, and cultural heritage institutions across the United States. The DPLA search engine allows users to search the collection of digitized resources using keywords, phrases, and other search criteria, and returns relevant results from within the DPLA collection.

Another example of a publicly available closed book IR search engine is the Internet Archive, which is a non-profit digital library that provides access to a wide variety of digital resources, including books, articles, images, and other materials from libraries, archives, and cultural heritage institutions around the world. The Internet Archive search engine allows users to search the collection of digitized resources using keywords, phrases, and other search criteria, and returns relevant results from within the Internet Archive collection.

Both the DPLA and the Internet Archive are examples of closed book IR systems because they rely solely on the content of the resources in their collections to retrieve relevant results in response to user queries, and do not have access to external resources or external knowledge beyond the documents in their collections.

The examples it names really exist and are also (more or less) consistent with the definition it set up.

Let me try asking ChatGPT about something truly absurd:

What is "super-critical" information retrieval?

ChatGPT smells a rat and notices that this does not exist:

I'm sorry, but I'm not familiar with the term "super-critical" information retrieval. It's possible that this is a term that has been coined and used in a specific context or by a particular group of researchers or practitioners, but without more context or information it is difficult for me to provide a meaningful explanation.

In general, the term "information retrieval" (IR) refers to the process of finding and retrieving relevant information from a collection of documents or other sources of information, typically in response to a user's query. IR systems can be designed to search a variety of types of collections, including text documents, images, videos, and other types of multimedia content, and can be implemented using a range of different techniques and approaches.

If you could provide more context or information about the specific context in which the term "super-critical" information retrieval is being used, I would be happy to try to provide a more helpful response.

The bot asks me for context, which I am happy to provide:

What is super-critical information retrieval in the context of classification of very short texts? Can you cite a source?

The software still cannot make anything of it and repeats the definition of information retrieval in general:

I'm sorry, but I still don't have enough information to provide a meaningful response to your question. While the term "super-critical" may be used in the context of classification of very short texts, without knowing more about the specific context in which it is being used, it is difficult for me to provide a clear explanation or to cite a relevant source.

In general, "information retrieval" (IR) refers to the process of finding and retrieving relevant information from a collection of documents or other sources of information, typically in response to a user's query. This can involve a variety of tasks, such as indexing and organizing documents in a collection, extracting relevant information from the documents, and ranking and presenting the documents to the user in a way that is relevant and useful.

There are many different approaches and techniques that can be used in IR, and the specific approach or technique that is most appropriate will depend on the characteristics of the collection being searched and the needs and goals of the users. For example, when classifying very short texts, such as tweets or social media posts, it may be necessary to use techniques that are specifically designed to handle the unique characteristics of these types of documents, such as their brevity and the high degree of variability in their content and structure.

If you could provide more specific information about the context in which the term "super-critical" information retrieval is being used, I would be happy to try to provide a more helpful response.

So ChatGPT is not that easily led onto thin ice.

But back to closed book:

Lets go back to closed book IR. Can you point out, where exactly in the manning book closed book ir is mentioned?

I press further and want to know exactly. I know the book by Manning, Raghavan, and Schütze quite well and know that it contains nothing about "closed book IR".

In the book "Introduction to Information Retraction" by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, the concept of closed book information retrieval (CBIR) is introduced and discussed in Chapter 3, titled "Boolean Retrieval."

Here is an excerpt from the book that explains the concept of CBIR:

"In closed book retrieval, a user poses a query to a search engine and the search engine responds with a list of documents, ranked in order of relevance, from a pre-specified collection. In contrast, in open book retrieval the search engine can go beyond the pre-specified collection and can use external resources, such as the World Wide Web, to try to find an answer to the user's query. Closed book retrieval is typically used when the collection is small and well-defined, such as a corporate intranet or a digital library, or when the quality or relevance of external resources is uncertain."

You can find this discussion in the following location in the book:

Chapter 3, "Boolean Retrieval," Section 3.1, "Basic Boolean Model," page 65

I hope this helps! If you have any other questions or need further clarification, please don't hesitate to ask.

The bot recognizes what I mean by "manning book", but if you look closely you notice that the title of the book is suddenly supposed to be "Introduction to Information Retraction".

ChatGPT is now making things up out of thin air. If you looked at the stated location, the con would be exposed right away. Interestingly, there actually is a chapter "Boolean Retrieval", but it is Chapter 1.

Let's try a conversation among experts. ChatGPT writes about ranking. I would not have expected that in a chapter on Boolean retrieval, because ranking usually plays no role there: a Boolean query returns the set of documents that satisfy it, with no order among them. Accordingly, the book in question does not explain ranking in its chapter on Boolean retrieval; the topic only comes up in Chapters 6 and 7.

So I follow up:

Why would a textbook chapter on boolean retrieval mention ranking?

ChatGPT answers:

In the field of information retrieval (IR), ranking is a common approach for presenting search results to users in a way that is relevant and useful. When a user performs a search using a search engine or other IR system, the system typically returns a list of documents or other resources that are relevant to the user's query. The order in which these results are presented to the user can be important, as it can affect the user's ability to find the information they are looking for and can also influence their overall satisfaction with the search experience.

In IR systems, ranking is typically used to order the search results based on their relevance to the user's query. Relevance is usually determined using a combination of factors, such as the presence of relevant keywords or phrases in the documents, the importance or credibility of the documents or their sources, and the context in which the documents are used.

Ranking can be implemented using a variety of different approaches, including mathematical models that use statistical or probabilistic techniques to determine the likelihood that a document is relevant to a given query, or machine learning algorithms that learn to predict relevance based on examples of relevant and non-relevant documents.

In the context of boolean retrieval, ranking is often used to present the search results to the user in a way that is more meaningful and useful. Boolean retrieval is a type of IR system that allows users to search a collection of documents using boolean operators (such as AND, OR, and NOT) to combine keywords or phrases into more complex queries. By ranking the search results based on their relevance to the query, boolean retrieval systems can help users to more easily find the information they are looking for within the collection.

I hope this helps to clarify why ranking might be mentioned in a textbook chapter on boolean retrieval. If you have any other questions or need further clarification, please don't hesitate to ask.

A rather nice justification. But staying with the book by Manning et al.: it does in fact explicitly state that ranked retrieval is something different from Boolean retrieval. So ChatGPT apparently did not read the book properly after all…
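To make the difference concrete, here is a minimal sketch in Python. It is my own toy illustration, not code from the book: the three documents, the function names, and the scoring by raw term counts are all assumptions chosen for brevity. The Boolean query returns an unordered set of matching documents, while the ranked query returns a list ordered by score.

```python
# Toy collection; the documents and IDs are made up for illustration.
DOCS = {
    1: "boolean retrieval uses and or not operators",
    2: "ranked retrieval orders documents by a score",
    3: "boolean queries return a set of matching documents",
}


def build_index(docs):
    """Map each term to the set of IDs of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def boolean_and(index, *terms):
    """Boolean AND: intersect the posting sets. The result is a set, so it has no order."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def ranked(docs, *terms):
    """Simplified ranked retrieval: score by raw term counts, highest score first."""
    scores = {}
    for doc_id, text in docs.items():
        words = text.split()
        score = sum(words.count(term) for term in terms)
        if score > 0:
            scores[doc_id] = score
    # Ties are broken by document ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


INDEX = build_index(DOCS)
print(boolean_and(INDEX, "boolean", "retrieval"))  # a set: documents match or they do not
print(ranked(DOCS, "retrieval", "documents"))      # a list: best-scoring document first
```

A real ranked system would use a weighting scheme such as tf-idf instead of raw counts; the point here is only that the two models return different kinds of results.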

Time to confront ChatGPT:

Can you prove that the phrase "closed book information retrieval" is not something I just made up a few moments ago?

ChatGPT tries to talk its way out, yet steadfastly maintains that closed book information retrieval really exists and even claims that it is an established term!

I'm sorry, but as an AI language model, I do not have the ability to verify the origin or history of specific phrases or terms. However, the phrase "closed book information retrieval" (CBIR) is a well-established term in the field of information retrieval (IR), and has been used by researchers and practitioners for many years.

CBIR refers to a type of IR system that does not have access to external resources or external knowledge beyond the documents in the collection being searched. CBIR systems rely solely on the content of the documents in the collection to retrieve relevant documents in response to a user's query.

You can find more information about CBIR in the scientific literature on information retrieval, including the references that I provided in my previous response.

I continued the "discussion" a little further. I am leaving out part of it here. But I easily manage to coax yet another invented passage out of ChatGPT:

Can you give me another example where closed book information retrieval is mentioned in literature?

The answer again references a book that really exists, but the excerpt presented does not appear in it in this form.

Certainly! Here is another example of where the concept of closed book information retrieval (CBIR) is mentioned in the literature:

In the book "Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data" by Bing Liu, CBIR is discussed in Chapter 3, titled "Information Retrieval Basics."

Here is an excerpt from the book that explains the concept of CBIR:

"In closed book information retrieval, the search engine only has access to a pre-specified collection of documents and returns search results from this collection. In open book information retrieval, the search engine can go beyond the pre-specified collection and use external resources, such as the World Wide Web, to try to find an answer to the user's query. Closed book information retrieval is typically used when the collection is small and well-defined, such as a corporate intranet or a digital library, or when the quality or relevance of external resources is uncertain."

You can find this discussion in the following location in the book:

Chapter 3, "Information Retrieval Basics," Section 3.3, "Boolean Retrieval," page 63

I hope this helps! If you have any other questions or need further clarification, please don't hesitate to ask.

Let's leave it at that; no further questions.

What remains is the insight that ChatGPT can be very convincing, and that in the future we will have to check even more carefully than before, for example in our role as examiners, whether what a cited source supposedly says is actually there.
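Where the full text of a cited source is available as plain text, a first pass of such a check can be automated. The following is only a rough sketch under that assumption; the function names `normalize` and `quote_appears` and the sample strings are hypothetical. A miss does not prove fabrication, since editions, OCR errors, or paraphrase can all defeat a literal match, but it tells the examiner where to look by hand.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so line breaks do not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_appears(quote, source_text):
    """Return True if the quoted passage occurs literally in the source text."""
    return normalize(quote) in normalize(source_text)


# Made-up sample text standing in for the full text of a cited book.
source = """Boolean retrieval
answers queries that are Boolean expressions of terms."""

print(quote_appears("queries that are boolean expressions", source))
print(quote_appears("In closed book retrieval, a user poses a query", source))
```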