Fireworks.ai: Fast, Affordable, Customizable Gen AI Platform
By Fireworks.ai|8/17/2023
tl;dr Fireworks.ai releases the fast, affordable, and customizable Fireworks GenAI Platform. It enables product developers to run, fine-tune, and share Large Language Models (LLMs) to best solve their product problems. The platform provides state-of-the-art machine performance in both latency-optimized and throughput-optimized settings, cost reductions of 20–120x for affordable serving, and a customizable cookbook for tuning models to your product use cases.
Generative AI (GenAI) has shifted the product landscape and redefined product experiences for consumer and business-facing products. Large Language Models (LLMs) enable never-before-seen performance on tasks like document/code generation, auto-completion, chat, summarization, reranking, and retrieval-augmented generation.
Foundation models (FMs) and parameter-efficient fine-tuning (PEFT) now enable more efficient AI customization. Instead of training large models from scratch with vast data, companies can now tailor GenAI models using open FMs like LLaMA, Falcon, and StarCoder. These FMs, sourced from internet data, serve broad tasks. But with PEFT techniques like LoRA, FMs can be customized for areas like legal or finance. Models can be customized with company data for specific tasks or personalized AI products. This approach speeds up product development and reduces data needs.
Open LLMs and PEFT enable customization of foundational LLMs for new use cases
At Fireworks, we believe in developer-centric AI: LLMs should be fast, affordable, and customizable for integration into modern products. Fireworks delivers on all three.
Efficient inference of LLMs is an active area of research, and our team of industry veterans from PyTorch specializes in performance optimization. We use model optimizations including multi-/grouped-query attention, sharding, quantization, kernel optimizations, CUDA graphs, and custom cross-GPU communication primitives. At the service level, we employ continuous batching, paged attention, prefill disaggregation, and pipelining to maximize throughput and reduce latency, as sketched below. We carefully tune deployment parameters, including the level of parallelism and hardware choice, for each model.
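To make the idea of continuous batching concrete, here is a toy scheduler loop. This is a minimal sketch only (our production scheduler is far more involved): new requests join the batch between decode steps, and finished sequences free their slots immediately rather than waiting for the whole batch to complete.

```python
# Toy continuous-batching loop (illustrative only, not our production scheduler).
from collections import deque

MAX_BATCH = 4
pending = deque(range(10))         # queued request ids
active = {}                        # request id -> tokens still to generate

steps = 0
while pending or active:
    # Admit waiting requests into free batch slots between decode steps.
    while pending and len(active) < MAX_BATCH:
        rid = pending.popleft()
        active[rid] = 3 + rid % 5  # pretend generation length per request
    # One batched decode step: every active sequence emits one token.
    for rid in list(active):
        active[rid] -= 1
        if active[rid] == 0:       # finished: its slot is reused next step,
            del active[rid]        # so short requests never wait on long ones
    steps += 1

print(f"served 10 requests in {steps} decode steps")
```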
Machine-efficiency improvements from our runtime allow us to pass cost savings on to users. Compared to other providers serving similar-sized models, we offer prices that are 1–2 orders of magnitude lower. In particular, we provide special optimizations for fine-tuned models, resulting in significant savings compared with OpenAI and OSS model providers.
The Fireworks GenAI Platform delivers significantly lower costs than comparable providers
The free Fireworks Developer tier gets you going easily. For more advanced, higher-volume usage, Fireworks Developer Pro pricing is based on dollars per million input tokens and dollars per million output tokens, listed in the table below. To illustrate the cost for product usage, the study below examines several common LLM use cases and their typical numbers of input and output tokens; we apply the Fireworks per-token pricing to compute a normalized price for each use case.
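The normalized price is simple to reproduce: each request costs (input tokens / 10⁶) × input price + (output tokens / 10⁶) × output price. A worked example with hypothetical prices (see the pricing table for actual rates):

```python
# Hypothetical per-token prices for illustration (not Fireworks' actual rates).
PRICE_IN = 0.20    # $ per million input tokens (assumed)
PRICE_OUT = 0.80   # $ per million output tokens (assumed)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Normalized dollar cost of a single request."""
    return input_tokens / 1e6 * PRICE_IN + output_tokens / 1e6 * PRICE_OUT

# Example: a summarization call with 3,000 input and 300 output tokens.
print(f"${request_cost(3_000, 300):.6f}")   # $0.000840
```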
In addition to cost savings, we deliver lower latency than existing solutions. For example, comparing against the popular open-source vLLM framework, with both solutions running on a lightly loaded server with 8 A100 GPUs, we measure 2–3x lower latency. Importantly, our service's latency stays low as server load increases, and it maintains 3x higher maximum throughput than vLLM while staying within the latency constraint.
The Fireworks platform offers latency significantly lower than comparable open-source offerings
The Fireworks Inference Service stack is built with first-class support for serving LoRA fine-tuned models. We compose LoRA with all of our optimization techniques, including sharding and continuous batching. Further, we enable multi-tenancy of multiple LoRA adapters on the same base model with cross-model request batching. This brings efficiency that is not available when hosting your tuned model elsewhere. We pass these savings on to you by making the cost of serving a fine-tuned model as low as that of the base model, regardless of how much traffic it receives.
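Conceptually, cross-model batching works because each LoRA adapter is a small low-rank delta on shared base weights: one batched matmul over the base model is combined with cheap per-request corrections. A toy numpy sketch of the idea (illustrative shapes and adapter names, not our serving internals):

```python
# Toy sketch of batching requests across LoRA adapters on one base weight.
import numpy as np

d, r = 16, 2
W = np.random.randn(d, d)                        # shared base weight
adapters = {name: (np.random.randn(d, r), np.random.randn(r, d))
            for name in ("legal", "finance")}    # per-tenant LoRA (A, B)

batch = [("legal", np.random.randn(d)), ("finance", np.random.randn(d))]
xs = np.stack([x for _, x in batch])
base_out = xs @ W                                # one batched base matmul
lora_out = np.stack([x @ adapters[n][0] @ adapters[n][1] for n, x in batch])
y = base_out + lora_out                          # cheap per-request correction
print(y.shape)                                   # (2, 16)
```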
We believe this pricing is key for unlocking the power of expert models fine-tuned for a specific use case. If you have many use cases or many variants of a single model, each of the model variants might serve very few requests. Deploying each of them on a separate GPU would be prohibitively expensive on a per-token basis. With Fireworks, you can experiment with fine-tuning without breaking the bank.
The cost of serving fine-tuned models on Fireworks does not scale with the number of models
Fireworks.ai provides several state-of-the-art foundational models for you to use off the shelf, including those from the LLaMA 2, Falcon, and StarCoder families. Sign up for an API key here and access the models for free today. The Fireworks console allows you to interact with models right in your browser.
Trying out LLaMA v2 70B in the Fireworks console
Fireworks also provides a convenient REST API that allows you to call LLMs programmatically from your product. The API is OpenAI API-compatible and thus interoperates with the broader LLM ecosystem. Try out the REST API using our interactive explorer.
The Fireworks.ai API explorer helps you to test and implement LLM API calls
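For reference, since the API is OpenAI-compatible, a call is a standard chat-completions request. The model name and payload fields below are illustrative:

```python
# Illustrative chat-completions request against the Fireworks REST API.
import requests

resp = requests.post(
    "https://api.fireworks.ai/inference/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_FIREWORKS_API_KEY"},
    json={
        "model": "accounts/fireworks/models/llama-v2-13b-chat",
        "messages": [{"role": "user", "content": "Say hello!"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```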
We also provide a dedicated Python API:
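A minimal sketch, assuming the fireworks.client package mirrors the OpenAI SDK surface:

```python
# Sketch of the dedicated Python API; exact surface may differ by version.
import fireworks.client

fireworks.client.api_key = "YOUR_FIREWORKS_API_KEY"
response = fireworks.client.ChatCompletion.create(
    model="accounts/fireworks/models/llama-v2-13b-chat",
    messages=[{"role": "user", "content": "Say hello!"}],
)
print(response.choices[0].message.content)
```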
For applications that already use the OpenAI Python SDK, migration to Fireworks involves simply switching the API endpoint:
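For example, with the pre-1.0 OpenAI Python SDK, only the base URL and API key change (model name below is illustrative):

```python
# Point the OpenAI SDK at the Fireworks OpenAI-compatible endpoint.
import openai

openai.api_base = "https://api.fireworks.ai/inference/v1"
openai.api_key = "YOUR_FIREWORKS_API_KEY"

response = openai.ChatCompletion.create(
    model="accounts/fireworks/models/llama-v2-13b-chat",
    messages=[{"role": "user", "content": "Say hello!"}],
)
```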
Fireworks has partnered with LangChain to provide Fireworks integration, allowing LangChain-powered applications to use Fireworks fine-tuned and optimized models. Integrating Fireworks into your LangChain app is just a few calls away using the LLM module:
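A minimal sketch of the integration (class path and parameter names may differ across LangChain versions):

```python
# Use a Fireworks-hosted model through LangChain's LLM interface.
from langchain.llms import Fireworks

llm = Fireworks(model="accounts/fireworks/models/llama-v2-13b-chat")
print(llm("Name three uses of retrieval-augmented generation."))
```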
Off-the-Shelf Models and Add-ons: We test and curate a list of top models (both in-house and from the community) for various product use cases, with two application areas we'd like to highlight in particular.
Check out all of the latest available models on our models page.
Foundational models and community-uploaded adapters are great off the shelf, but you often need to customize a model for a new task or your own data. However, the information required to recreate these models and implement the latest modeling techniques is dispersed across repositories, online forums, and research papers. To help, Fireworks provides an easy-to-use cookbook repository that lets you fine-tune models and upload them to the Fireworks inference service, as illustrated below.
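The cookbook's actual recipes live in the repository; as a flavor of what parameter-efficient fine-tuning involves, here is a minimal LoRA setup using the Hugging Face peft library (illustrative, not the cookbook's exact code):

```python
# Illustrative LoRA fine-tuning setup with Hugging Face peft.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)      # trains a tiny fraction of weights
model.print_trainable_parameters()
# ...train on your data, then upload the adapter to Fireworks for serving.
```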
The Fireworks cookbooks are open-source for the community to use and modify. We invite contributions to the repository to help build strong fine-tuning tools for all!
We are excited to announce the Fireworks Generative AI Platform for fast, affordable, and customizable serving of the latest open Large Language Model architectures. Try the platform and see how it can help you unlock the power of Generative AI. Follow us on Twitter and LinkedIn and join our Discord channel for more discussion!