Bramblethorn
Sleep-deprived
- Joined
- Feb 16, 2012
- Posts
- 19,275
No, GPT is not a "search engine". In some cases it will give you results similar to what you'd get from a search engine, but it's built for a very different purpose, and treating it like a search engine will lead you astray.
The point of a search engine is to index pages (or other documents) that already exist, and help people find pages that relate to what they're looking for. Sometimes it might point to a page that isn't really what you needed, and sometimes it might point to a page that contains misinformation, but either way it's pointing you to a thing that exists (or did when the search engine indexed it).
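For anyone who wants the mechanics made concrete: an index is basically a map from words to the documents that already contain them. Here's a toy Python sketch (nothing like a real engine's internals, and the document names are invented for illustration) showing the key property, that a search only ever returns pointers to existing documents:

```python
# Toy corpus: documents that already exist. A search engine never writes
# new text; it only points you at entries in a collection like this.
docs = {
    "doc1": "salmon and trout are freshwater fish",
    "doc2": "the constitution of the united states",
    "doc3": "tropical fish need warm water",
}

# Build an inverted index: word -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

def search(word):
    """Return the ids of existing documents that mention `word`."""
    return sorted(index.get(word, set()))

print(search("fish"))     # -> ['doc1', 'doc3']
print(search("dragons"))  # -> [] (nothing indexed, so nothing returned)
```

The point of the sketch: a query for something the index never saw returns nothing, rather than a plausible-sounding invention.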
The point of GPT is to "read" documents and learn patterns in how those documents are written, then produce similar patterns in response to prompts. In some cases, when it sees the same document over and over - the Bible, or the US Constitution, for instance - it might learn patterns in enough detail to reproduce that document word for word. But in general, it doesn't have anywhere near perfect recall of the things it's read; it's not intended to.
Search engine: I go to the library and ask the librarian for books about fish. Then I read the books the librarian found for me.
GPT: I find a guy who used to spend a lot of time in the library (but no longer has access, for reasons that aren't important to this analogy). I ask him to write me a book about fish, with the instruction that it's more important to be complete and convincing than it is to be accurate. He's going to include the things he remembers from the fish books he read, along with a few things he misremembered, but where he doesn't remember, he'll just make it up.
(He might well include a references section, which looks like he's providing cites for the info in his own book. But if I check out those references, I will find that most of them don't exist, and the ones that do probably don't say what he's attributing to them.)
I was simplifying. The key point, and its relevance to this discussion, is its sources.
ChatGPT's material came from 'a massive corpus of data written by humans. That includes books, articles, and other documents across all different topics, styles, and genres—and an unbelievable amount of content scraped from the open internet.'
So while it may not technically be a 'search engine', it has all the data that a search engine has. It's just already searched it.
This is still inaccurate. Yes, it's read much the same data that a search engine has, but it doesn't have all that data, any more than I "have" the complete text of Lord of the Rings from reading it several times. Nor does it have live access to all that data when responding to prompts. What it has is a highly compressed, gappy description of the kinds of patterns it encountered in the data it read.
Aside from anything else, GPT isn't anywhere near large enough to perfectly "memorise" everything in its training data. An instance of GPT-3 is defined by about 175 billion 16-bit parameters, i.e. 3.5 x 10^11 bytes. The Common Crawl data set alone - which is a large part of GPT-3's training data, though not all - is about a thousand times larger. Lossless compression could reduce that a little, but nowhere near a thousandfold.