Using Generative AI to Stand on the Shoulders of Giants
Tim McCallum
research
oai-pmh
ai
llm
This article delves into the prospect of employing Generative AI to translate scholarly content into layman’s terms, making academic knowledge more accessible to the public and bridging the academic-public knowledge gap.
Here at Fermyon, our CEO is a philosophy PhD, and he’s not afraid to share the love. I was inspired by a recent blog that discussed Artificial Intelligence (AI) and proposed philosophical experiments to explore Generative AI’s capacities and limitations. It made me reflect on the fact that most (if not all) Information Technology (IT) systems are subject to the Garbage In, Garbage Out (GIGO) principle; computers will process whatever input they are given, regardless of quality. Talking about garbage is a pretty negative perspective, though. The GIGO I’d like to see from modern IT is “Greatness In, Greatness Out” - a modern take on Isaac Newton’s “standing on the shoulders of giants.”
Another recent blog, which discussed open-source software licenses, also inspired this article, specifically around the monetization challenge that the Free and Open Source Software (FOSS) development model grapples with. As that article describes, high-quality output needs funding. Thankfully, from a scientific research point of view, most of our governments take care of the costs. Nevertheless, the global software industry has maintained a healthy FOSS community over the past decade or more (and as a result, everyone has access to an incredible number of software libraries that underpin most of the technology we all use daily). This is something that engineers should be very proud of!
Back to the topic of broader (less software-related) research. The public, encompassing taxpayers, entrepreneurs, educators, agriculturists, researchers, and students, stands to gain significantly from unrestricted, complimentary access to all knowledge and information funded by the public. Government bodies earmark several billion dollars annually for research projects and initiatives. These ventures yield research documents, papers, and various types of literature. Higher education students also contribute to this body of literature through their theses and dissertations. Most of this publicly funded content is peer-reviewed and is arguably of the highest standard. But let’s face it: when we click on a PDF of a research paper, we flat-out don’t understand the academic jargon, and if we get past that, we immediately stumble when we are hit with the LaTeX-formatted formulas and complex explanations. So … what if there were a way to access and present the latest research to us in lay terms?
Would you believe me if I told you that there is a protocol that exposes much of the world’s publicly funded research output in a way that lets a machine conveniently slurp it all up?
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) facilitates the harvesting of research paper metadata (and links to the full text), essentially acting as a bridge between digital repositories and the receiving aggregation service. In this case, a proposed generative AI model would be trained only on the latest peer-reviewed scholarly content from around the globe.
So, how does OAI-PMH work? For many years, various publicly funded universities have set up repositories where researchers and students can upload research papers. Each paper has associated metadata, which is like a small summary containing key information such as the title, author, publication date, keywords, and an abstract. On the receiving end, a harvester is set up by a service that wants to index or gather information from various repositories. This could be a university library, a research database, or a search engine focused on academic papers. Traditionally, services like Google Scholar index this information in a way that makes it easy to search. But what if we could provide not just a way to search but a way to explain what the heck the paper is actually saying in lay terms? This is where we start to think about using generative AI to stand on the shoulders of giants.
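To make this concrete, here is a minimal harvester sketch in Python, assuming the `requests` library. OAI-PMH is just HTTP plus XML: each request names one of the protocol’s six verbs, and the `ListRecords` verb returns pages of Dublin Core metadata records. The endpoint URL below points at arXiv’s public OAI-PMH interface purely as an illustration (check a repository’s documentation for its current base URL); any compliant repository can be swapped in.

```python
# Minimal OAI-PMH harvester sketch (assumes the `requests` library).
# OAI-PMH is plain HTTP plus XML: a GET request carrying a `verb` parameter.
import xml.etree.ElementTree as ET

import requests

# Illustrative endpoint only (arXiv's public OAI-PMH interface); any
# compliant repository's base URL can be substituted here.
ENDPOINT = "http://export.arxiv.org/oai2"

# Namespaces used in OAI-PMH responses carrying Dublin Core (oai_dc) records.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def harvest(endpoint: str):
    """Yield (title, abstract) pairs from a repository, page by page."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    while True:
        response = requests.get(endpoint, params=params, timeout=60)
        root = ET.fromstring(response.content)
        for record in root.findall(".//oai:record", NS):
            title = record.findtext(".//dc:title", default="", namespaces=NS)
            abstract = record.findtext(".//dc:description", default="", namespaces=NS)
            yield title.strip(), abstract.strip()
        # OAI-PMH paginates large result sets with a resumptionToken;
        # a missing or empty token means the harvest is complete.
        token = root.findtext(".//oai:resumptionToken", namespaces=NS)
        if not token:
            break
        params = {"verb": "ListRecords", "resumptionToken": token}

if __name__ == "__main__":
    for title, _abstract in harvest(ENDPOINT):
        print(title)
```

In practice, a harvester would scope the request with the protocol’s `set`, `from`, and `until` arguments to collect only the records added since its last run, rather than walking an entire repository every time.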
As we learned in a recent exercise about using generative AI to write Haiku poetry, when training AI, we must always remember that “more work must be done”. But think of what we could build! We could start by downloading each full-text article (most of which are in the Portable Document Format (PDF)). We could then extract the text, using Optical Character Recognition (OCR) for any scanned documents that lack an embedded text layer. Once the text is fragmented into a training set, we could start to train the AI.
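Here is a sketch of that extract-and-fragment step, assuming Python with the `pypdf` library for PDFs that carry an embedded text layer (a scanned paper would need a real OCR engine such as Tesseract instead). The `paper.pdf` file name and the 512-word chunk size are hypothetical placeholders.

```python
# Sketch of the extract-and-fragment step (assumes the `pypdf` library;
# scanned PDFs with no text layer would need OCR, e.g. Tesseract, instead).
from pypdf import PdfReader

def pdf_to_text(path: str) -> str:
    """Concatenate the embedded text layer of every page in a PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def fragment(text: str, max_words: int = 512) -> list[str]:
    """Naively split text into word-bounded chunks for a training set."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# Hypothetical usage: "paper.pdf" stands in for a harvested full-text article.
chunks = fragment(pdf_to_text("paper.pdf"))
print(f"{len(chunks)} training fragments")
```

A production pipeline would split on sentence and section boundaries and measure chunks with the target model’s tokenizer, but even this naive version turns a harvested paper into trainable fragments.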
This is when you should hear the penny drop … at this point, we realize that all of the data we are training our AI on has already undergone a rigorous peer review process.
The peer review process can feel grueling if you are an academic. However, the great news from an AI training perspective is that once a paper has made it into a peer-reviewed publication, it contributes to the world’s highest-quality knowledge and adds to the ever-growing collection of empirical facts. This peer-reviewed content is a far cry from an anecdotal soundbite or someone’s opinion on the “internet”. Can we even put a value on using generative AI to stand on the shoulders of giants in this way? I don’t know about you … but I am excited.