Announcing Spin v1.5
Radu Matei
spin
wasm
wasi
Today, we are excited to introduce Spin 1.5, which includes performance improvements, a few bugfixes, and an exciting new set of features:
- support for running AI inferencing for Large Language Models (LLMs) and for generating sentence embeddings
- improved performance when handling concurrent requests by using Wasmtime’s pooling memory allocator
- support for intra-component outbound HTTP with allowed_http_hosts = ["self"]
- SQLite support in the TinyGo SDK
Let’s dive into some of the highlights from this release!
AI Inferencing for Large Language Models
A few weeks ago, we announced Fermyon Serverless AI, a new set of features for Fermyon Cloud that represent the building blocks for adding AI capabilities to your serverless applications. At the core of those features lies the ability to build a Spin application that can perform AI operations such as inferencing or generating embeddings. In Spin 1.5, you can write applications that use these new building blocks directly through the Spin SDK, currently with support for Rust, JavaScript, and TypeScript. Here is an example that uses the new features to perform sentiment analysis on the input:
import { Llm, InferencingModels, InferencingOptions } from "@fermyon/spin-sdk";

// Decoder used to read the raw bytes of the request body.
const decoder = new TextDecoder();

export async function handler(req, res) {
  // Take the request body and prepare the Llama2 prompt.
  let input = decoder.decode(req.body);
  let prompt = `<s>[INST] <<SYS>>
You are a utility that performs sentiment analysis on the supplied text.
Only respond with one of: positive, neutral, negative.
<</SYS>>
${input} [/INST]
`;

  // Use the new Llm.infer implementation to perform the inference.
  // Control the model, prompt, and inference parameters such as
  // number of tokens or temperature.
  let inferenceResponse = Llm.infer(InferencingModels.Llama2Chat, prompt, { maxTokens: 6 });
  console.log("Executed inference with input " + input + " Result: ");
  console.log(inferenceResponse);

  // This is a full web application, send the HTTP response.
  res.status(200).body(inferenceResponse.text);
}
This is a complete HTTP handler function, and the most important part of it is the Llm.infer function call, which executes an inferencing operation on the Llama 2 chat model using a prompt constructed from the input in the request body.
In the application manifest, we also need to grant this component the capability to execute the AI model:
[[component]]
id = "sentiment-analysis"
# this is a WebAssembly component.
source = "component.wasm"
[component.trigger]
route = "/api/..."
# this component is not allowed to make ANY external requests!
allowed_http_hosts = []
# we grant the component the capability to call the Llama2 chat model.
ai_models = ["llama2-chat"]
[component.build]
command = "npm run build"
Finally, we can build the application using spin build and, after fetching the right model locally, run spin up. When the application runs, Spin loads the appropriate language model from disk and performs the inferencing operation on the local machine.
You can find a complete tutorial for the inferencing functionality here, and you can find a growing list of examples, templates, and tutorials on the Spin Up Hub.
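Alongside inferencing, the SDK also covers the other building block mentioned above: generating sentence embeddings. Below is a minimal sketch of a JavaScript handler; it assumes the SDK's Llm.generateEmbeddings function and the all-MiniLM-L6-v2 embedding model, granted to the component with ai_models = ["all-minilm-l6-v2"] in the manifest:

import { Llm, EmbeddingModels } from "@fermyon/spin-sdk";

export async function handler(req, res) {
  // Generate an embedding vector for each sentence using the
  // all-MiniLM-L6-v2 sentence transformer model (assumed model name).
  let result = Llm.generateEmbeddings(EmbeddingModels.AllMiniLmL6V2, [
    "The quick brown fox jumps over the lazy dog",
    "Spin components can now generate sentence embeddings",
  ]);

  // One embedding vector is returned per input sentence.
  console.log("Generated " + result.embeddings.length + " embeddings");
  res.status(200).body(JSON.stringify(result.embeddings));
}

The returned vectors can then be stored or compared to power use cases such as semantic search or similarity scoring.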
Executing inferencing on large language models is extremely resource intensive and requires hardware acceleration, such as a graphics card or CPU multi-threading, neither of which is currently widely available from within WebAssembly. To address this, we used the component model to define a high-level interface that describes how a guest Wasm component can execute an inferencing operation. Here is the inference function from the Spin WIT interface:
/// Perform inferencing using the provided model and prompt with the given optional params
infer: func(model: inferencing-model, prompt: string, params: option<inferencing-params>) -> result<inferencing-result, error>
At Fermyon, we have long been supporters of the WASI NN proposal for working with neural networks in Wasm (and our team built one of the first implementations of WASI NN). Recent improvements to the specification should make it possible to implement the Spin inferencing operation in terms of WASI NN.
Wasm components running in Spin call the infer function above, which, to optimize performance and to have access to hardware acceleration, is currently implemented on the host. Even so, executing inferencing on large language models is extremely resource intensive. To address this when running locally (potentially on machines without a significant amount of resources), Spin implements inferencing using GGML, a machine learning library optimized for running on consumer hardware (including on the CPU), and in particular the Rust bindings for GGML from the Rustformers project.
Inferencing would not be possible in Spin without these projects, and we would like to thank the maintainers and contributors of those projects for making it possible!
If you are interested in learning more about how this is implemented, have a look at the Spin Improvement Proposal that introduced this feature.
Performance Improvements for Concurrent Requests
Every time it handles a new request, Spin creates a new WebAssembly instance, executes the handler function for that request, then terminates the instance. It can do this for thousands of very short-lived instances over very short periods of time. Spin 1.5 makes use of a Wasmtime feature specifically designed for such scenarios, the pooling memory allocator, which can significantly speed up handling of concurrent requests.
You can read more about the scenarios used to benchmark this change, and the resulting improvements, here.
Intra-Component HTTP Calls
Spin 1.5 adds the ability for a component to send HTTP requests to other components in the same application using the special value self in the allowed_http_hosts configuration field.
By adding allowed_http_hosts = ["self"] to a component's configuration in the spin.toml manifest file, that component can now make requests to relative routes served by other components in the same application:
// send a request to the current application on the /hello route
resp, err := spinhttp.Get("/hello")
if err != nil {
    http.Error(w, err.Error(), http.StatusInternalServerError)
    return
}
SQLite Support for TinyGo
Spin 1.5 adds support for using SQLite as a relational data store when using the TinyGo SDK, bringing it to feature parity for relational storage with Rust, JavaScript, TypeScript, and Python.
Below is an example for a TinyGo Spin handler that uses SQLite:
spinhttp.Handle(func(w http.ResponseWriter, r *http.Request) {
    // open the "default" store
    db := sqlite.Open("default")
    defer db.Close()

    // execute a query against the database
    rows, err := db.Query("SELECT * FROM pets")
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    // collect each row returned from the query into a slice, serialize it,
    // then return it as the response body for the request.
    var pets []*Pet
    for rows.Next() {
        var pet Pet
        if err := rows.Scan(&pet.ID, &pet.Name, &pet.Prey, &pet.IsFinicky); err != nil {
            fmt.Println(err)
        }
        pets = append(pets, &pet)
    }
    json.NewEncoder(w).Encode(pets)
})
Next Steps
Upgrade to the latest Spin to try out these features.
We are starting to think about Spin 2.0, and we would love to hear your thoughts about the features you would like to see in the next major version of Spin!
If you are interested in Spin, Fermyon Cloud, or other Fermyon projects, join the chat in the Fermyon Discord server and follow us on Twitter @fermyontech and @spinframework!
If you would like to get involved in the Spin project, join us at our Spin Community Developers Meeting which is held on the fourth Monday of every month at 1500 UTC.