A “Silly Walk” through Fermyon Serverless AI
Matt Butcher
ai
serverless
example
AI should be put to good use. And what use is better than generating Pythonesque quotes from a Large Language Model (LLM)? Let’s take a quick tour of the basics of Fermyon Serverless AI by creating an homage to one of the most famous Monty Python sketches.
If you’re new to AI and want a quick and entertaining way to get started, this post is for you.
Prerequisite: Watch This Video (and Create a Fermyon Cloud Account)
Okay, you don’t really have to watch this video, but it may improve your day nonetheless.
Monty Python’s Ministry of Silly Walks (Full Sketch)
We’ll be building our application with Spin, the developer tool for writing serverless apps for Fermyon. The installation instructions and Taking Spin for a spin docs can get you started. You will need Spin 1.5 or later.
There is one more prerequisite for building this app: You will need AI inferencing enabled on your Fermyon Cloud account. If you haven’t already created a Fermyon Cloud account, all you need is a GitHub account. From there, you can either run spin login at the command line or log in from a browser.
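At the command line, that is simply:
$ spin login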
At the time of this writing, Fermyon Serverless AI is not automatically enabled. To request the feature on your cloud account, sign up for the beta.
Creating a New Spin App using TypeScript
While we could create an app using any Spin SDK that supports Serverless AI inferencing, we’ll use TypeScript in this demo.
To get started, we will scaffold out a new app and use NPM to install the basic dependencies:
$ spin new http-ts silly-walk --accept-defaults
$ cd silly-walk
$ npm install
Once that is done, you should have a full TypeScript Spin application.
A Silly Prompt
The main source file that we will be working with is src/index.ts. Open that file in the editor of your choice.
A quick word on terminology:
- An LLM (Large Language Model) is a pre-trained AI model that is provided for us. We’re using the default LLM, the open source-style LLaMa2.
- Querying an LLM is called running an inference.
- We pass a prompt, which is just a plain text description of what we want.
- And we receive back a response that the LLM generates.
- Tokens are words or chunks of words that the LLM treats as significant. (Tokens are a rough indicator of how much work the LLM has to do.) Fermyon Cloud places resource limits based on the number of tokens the LLM uses, so it’s best to ask the LLM to do small things (3 sentences) rather than large things (a movie-length script).
For our app, we are going to use the following Monty Python-inspired prompt:
As a monty python character explain how to walk. Limit to 3 sentences.
And what we’ll expect to get back is a silly explanation of how to walk (or, perhaps, a Pythonesque non sequitur).
Configuring Our App
By default, a given component of a Spin application will not have access to any Serverless AI models. Access must be provided explicitly via the Spin application’s manifest (the spin.toml file). For example, an individual component in a Spin application could be given access to the llama2-chat model by adding the following configuration inside the specific [[component]] section: ai_models = ["llama2-chat"]. Your spin.toml file should look something like this:
spin_manifest_version = "1"
authors = ["Matt <example@users.noreply.github.com>"]
description = ""
name = "silly-walk"
trigger = { type = "http", base = "/" }
version = "0.1.0"
[[component]]
id = "silly-walk"
source = "target/silly-walk.wasm"
exclude_files = ["**/node_modules"]
ai_models = ["llama2-chat"]
[component.trigger]
route = "/..."
[component.build]
command = "npm run build"
Coding Our App
We’re going to use the Llm object from the Fermyon Spin SDK. (Note: That’s a capital L and lowercase l followed by an m.) We can import that and then use the infer() function to pass a prompt to the LLM and then get back a response.
Our complete code is only twelve lines:
import { Llm, InferencingModels, HandleRequest, HttpRequest, HttpResponse } from "@fermyon/spin-sdk"

const model = InferencingModels.Llama2Chat

export const handleRequest: HandleRequest = async function (request: HttpRequest): Promise<HttpResponse> {
  const prompt = "As a monty python character explain how to walk. Limit to 3 sentences."
  const out = Llm.infer(model, prompt)
  return {
    status: 200,
    body: out.text
  }
}
What we have created above is an HTTP serverless app that will, when accessed on the web, return an answer to the silly walk prompt.
To build it, use spin build. If you have any problems:
- Run npm install and make sure there are no errors.
- Double-check that your version of spin is 1.5 or newer.
And, of course, feel free to drop into Discord if you get stuck. We’re there and ready to help.
When the spin build finishes successfully, we’re ready to run it!
Running it on Fermyon Cloud
You can run inferencing on your local machine. But you will first need to install the LLaMa2 model (which is rather large), and the inferencing operations will likely be slow. I had to wait over 15 minutes for the above to run locally simply because my workstation is not powerful enough.
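If you do want to try it locally, the workflow looks roughly like this. The model download step and the directory Spin loads the model from are covered in the Serverless AI documentation; the location in the comment below is only an assumption, so check the docs for the exact path.
# Assumption: the LLaMa2 chat model has already been downloaded to the location
# the Serverless AI docs describe (for example, a .spin/ai-models/ folder in the app directory).
$ spin build
$ spin up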
But there’s an easier way: Fermyon Cloud’s free tier has access to Serverless AI. And Fermyon Cloud uses powerful AI-grade GPUs to do inferencing, which means the above will run in a second or two, rather than 15 minutes.
Testing it out is as simple as running spin deploy after your spin build. That will package and send your app to Fermyon Cloud and then return an HTTP endpoint that you can access with your browser or curl.
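In other words, from the project directory:
$ spin build
$ spin deploy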
Once you have it, you can test your app like this (substituting AAAA with whatever random string Fermyon Cloud gave you):
$ curl https://silly-walk-AAAA.fermyon.app
"Oh ho ho! Walking, you say? Well, first you gotta put one foot in front of the other, see? And then you gotta lift that foot up and put it down again, oh ho ho! But watch out for those banana peels, they'll have you slipping and sliding all over the place, ha ha ha!"
As you can see, we got a silly response.
A Different Answer Every Time
Those used to coding with relational databases and other forms of storage may initially be surprised to find that each inference will respond with a different answer. This is because AI inferencing is non-deterministic.
$ curl https://silly-walk-AAAA.fermyon.app
This is an absurd and ridiculous request, but I'll do my best to explain how to walk as a Monty Python character:
"Right-o then, listen up, me ol' chum! To walk like a Monty Python character, ye must first ensure that yer feet be properly attached to yer legs. Then, ye must lift one foot and place it in front of the other, repeatin' this process until ye reach yer
Notice that the above was truncated? That’s because the response hit the default limit of 100 tokens. There is an alternate way of writing the request that allows us to set a higher limit:
import { Llm, InferencingModels, HandleRequest, HttpRequest, HttpResponse } from "@fermyon/spin-sdk"

const model = InferencingModels.Llama2Chat

export const handleRequest: HandleRequest = async function (request: HttpRequest): Promise<HttpResponse> {
  const prompt = "As a monty python character explain how to walk. Limit to 3 sentences."
  // We added maxTokens here:
  const out = Llm.infer(model, prompt, { maxTokens: 200 })
  return {
    status: 200,
    body: out.text
  }
}
The above will run the same request, but allow an answer up to 200 tokens, which is more than enough for three sentences.
Sometimes the answers generated are good. Sometimes they are not so good. And sometimes the LLM generates something that just doesn’t make sense. That is just the current state of the technology. Our silly example here is certainly amenable to strange answers. But if you are doing more serious things, you may have to get very specific with your prompts to get something that is likely to meet your needs.
Remember also that each time you run an inference, it counts against your total inferencing allowance. On the free tier, you will eventually hit your daily cap. If you are testing a lot, or if you are exposing this publicly, you might consider caching your responses for a time in the Key Value Store to reduce load and save those tokens.
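Here is a minimal sketch of that caching idea. It assumes the Key Value Store API from the Spin SDK (Kv.openDefault, exists, get, and set), that the component has been granted key_value_stores = ["default"] in spin.toml, and that TextDecoder is available to turn the stored bytes back into a string; double-check those details against the SDK documentation before relying on them.
import { Llm, InferencingModels, Kv, HandleRequest, HttpRequest, HttpResponse } from "@fermyon/spin-sdk"

const model = InferencingModels.Llama2Chat
const cacheKey = "silly-walk-answer"

export const handleRequest: HandleRequest = async function (request: HttpRequest): Promise<HttpResponse> {
  // Open the default Key Value Store (the component also needs key_value_stores = ["default"] in spin.toml).
  const store = Kv.openDefault()

  // If we already have an answer cached, return it and skip the inference entirely.
  if (store.exists(cacheKey)) {
    return { status: 200, body: new TextDecoder().decode(store.get(cacheKey)) }
  }

  const prompt = "As a monty python character explain how to walk. Limit to 3 sentences."
  const out = Llm.infer(model, prompt, { maxTokens: 200 })

  // Cache the response for later requests. For a time-based expiry, you could also
  // store a timestamp alongside the answer and ignore entries older than some cutoff.
  store.set(cacheKey, out.text)

  return { status: 200, body: out.text }
}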
Where to Go from Here
Inferencing is the main way you work with an existing LLM, and that’s what we’ve illustrated above. But there are more advanced things you can do as well, such as tailoring the domain in which the LLM responds. Using a technique called embedding, you can pass additional context to the AI inferencing system to tell it how to more accurately meet the user’s expectations.
For example, if we were writing an app to generate Pythonesque responses to questions about the Fermyon documentation, we’d need to make sure that the inferencing system was primed with Fermyon-specific information and understood that it should keep its answers related to technical docs.
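As a rough illustration of what that first step could look like, here is a sketch that turns a few documentation snippets into embedding vectors. It assumes the Llm.generateEmbeddings function and the EmbeddingModels.AllMiniLmL6V2 model from the Spin SDK, plus an ai_models entry for "all-minilm-l6-v2" in the manifest; verify both against the Serverless AI developer guide before using them.
import { Llm, EmbeddingModels } from "@fermyon/spin-sdk"

// Turn documentation snippets into embedding vectors. These vectors can be stored
// (for example in the Key Value Store) and later compared with the embedding of an
// incoming question to pick the most relevant context to include in the prompt.
function embedDocs(snippets: string[]): number[][] {
  const result = Llm.generateEmbeddings(EmbeddingModels.AllMiniLmL6V2, snippets)
  return result.embeddings
}

const vectors = embedDocs([
  "Spin applications are configured in the spin.toml manifest.",
  "Use spin deploy to push an application to Fermyon Cloud.",
])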
To learn more about this and other things you can do with LLMs, head over to the Fermyon Serverless AI developer guide. And whether your AI idea is a silly app or a serious one, we hope you enjoy using Fermyon Serverless AI.