Since the first public cloud companies were founded in the early 2000s, they have radically transformed the landscape of the internet, arguably more than any other group of companies.
Amazon was the first to make its computing infrastructure available for rent through an API on a per-time basis, in what became known as “Infrastructure as a Service” (IaaS). Google and Microsoft subsequently followed with their own IaaS products, Google Cloud Platform (GCP) and Microsoft Azure.
Starting a web-based or SaaS (Software as a Service) business was virtually unheard of before the age of IaaS companies. There were simply too many hurdles, and the expenses were too high: you’d have to purchase dedicated servers and a high-bandwidth connection to handle the load of incoming visitors to your website, hire engineers to build a scalable system, and, if you planned to go international, purchase servers in other geographical locations.
IaaS companies gave businesses (and still give them) a much shorter time to market and lowered development costs to the point where launching a scalable, internationally available online service can be done in a matter of days at a fraction of the cost required 20 years ago.
Why has Enzymit decided to off-load many of its processes to an on-premise private cloud?
First, it’s important to note that Enzymit’s use of cloud computing has mainly consisted of computationally intensive calculations for protein design. We do not yet have any public-facing applications that need to scale across multiple geographical zones and handle millions of requests per minute. Our primary use case is running CPU- and GPU-heavy analyses, and for that use case we have found the IaaS/public cloud solution to be far from cost-effective in the long term.
The first apparent downside from our perspective is the lack of cost transparency. Although several of the largest IaaS companies have dedicated a lot of resources to providing their customers with pricing simulators and spending predictions, we found that those are much more reliable if you use the providers’ product-based APIs, which serve as a layer between the engineer and the “raw” compute infrastructure. In other words, you get much better transparency if you vendor-lock your company to the IaaS provider (or simply accept a less cost-optimized platform). Suppose you don’t want to vendor-lock your software to any cloud company and prefer to use the bare-bones infrastructure so that you can easily migrate to a different cloud provider in the future; well, good luck. If you want a glimpse of how complex a pricing model can be, take a look at this guide, which discusses only one provider’s data transfer pricing model!
Adding storage, CPU time, and many other costs makes understanding and verifying the cost structure a task suitable for certified experts (and those certificates are another source of income for public cloud companies).
Upon founding Enzymit, we received a significant amount of credits from one of the largest public cloud providers. Like many other startups that are short on staff, we made the most of it, but not without architectural blunders that cost us thousands of dollars of credit money. Our primary use of the platform is GPU-based neural network training and inference, along with some CPU-based calculations.
After spending $100k in a year, we calculated that each thread-hour cost us about $0.06 to $0.08, including everything from storage to data transfer and CPU usage (GPUs are not included in this analysis).
At about 120k CPU hours per month, that amounts to $7,200 to $9,600 per month!
That is roughly the cost of a very well-equipped server with an RTX 3090 GPU.
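The arithmetic behind that monthly figure can be sketched in a few lines of Python. The function name is purely illustrative, and the sketch assumes the all-in thread-hour rate stays constant across the month; the inputs are the figures quoted above.

```python
def monthly_cpu_cost(thread_hours_per_month, rate_low, rate_high):
    """Return the (low, high) monthly cost in dollars for a given usage."""
    return (thread_hours_per_month * rate_low,
            thread_hours_per_month * rate_high)


# 120k thread-hours per month at $0.06 to $0.08 per thread-hour
low, high = monthly_cpu_cost(120_000, 0.06, 0.08)
print(f"${low:,.0f} to ${high:,.0f} per month")
```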
This finding got us thinking about the option of off-loading at least some of our workload to on-premise compute servers. After doing the math, we decided to purchase three workstations at a total cost of $17k. One is a GPU-based workstation with two RTX 3090s and an Intel i9-12900 CPU, and the other two are workstations with 16-core AMD Ryzen 5950X CPUs.
It took us a few FTE days to set those up to our satisfaction with Slurm, NFS, backups, and several other services.
We noticed that our RTX cards, although considered gaming cards, are comparable (if not better) in performance to the Tesla V100, which some cloud providers rent at the staggering price of $3.06 an hour.
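As a rough illustration of that math, the sketch below estimates how many months of avoided cloud CPU spend it would take to cover the $17k purchase. It rests on several assumptions that are ours for illustration, not measured data: the thread counts are the vendors' published figures (32 threads per Ryzen 5950X, 24 for the i9-12900), the machines are kept fully busy, and electricity, maintenance, and staff time are ignored. The $17k also includes the GPUs, while the thread-hour rate does not, so the real picture is more favorable on the CPU side than this shows.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def payback_months(hardware_cost, threads, cloud_rate_per_thread_hour,
                   utilization=1.0):
    """Months until the avoided cloud spend equals the hardware cost."""
    thread_hours = threads * HOURS_PER_MONTH * utilization
    avoided_per_month = thread_hours * cloud_rate_per_thread_hour
    return hardware_cost / avoided_per_month


# Assumed thread counts: 2 x Ryzen 5950X (32 each) + 1 x i9-12900 (24)
threads = 2 * 32 + 24

for rate in (0.06, 0.08):
    months = payback_months(17_000, threads, rate)
    print(f"at ${rate:.2f} per thread-hour: {months:.1f} months")
```

Lowering the `utilization` argument shows how quickly the payback period stretches when the workstations sit idle, which is why planning the workload matters.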
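To put that hourly price in perspective, here is a minimal sketch of what renting two such GPUs would cost per month at different utilization levels. The utilization values are hypothetical, and the sketch assumes simple on-demand billing at the quoted $3.06 per GPU-hour with no discounts.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def gpu_rental_cost(hourly_rate, gpus, utilization):
    """Monthly rental cost in dollars for `gpus` cards at a given utilization."""
    return hourly_rate * gpus * HOURS_PER_MONTH * utilization


# Hypothetical utilization levels for two rented GPUs
for utilization in (0.25, 0.5, 1.0):
    cost = gpu_rental_cost(3.06, gpus=2, utilization=utilization)
    print(f"{utilization:.0%} utilization: ${cost:,.0f} per month")
```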
We are aware that hosting our own compute infrastructure doesn’t come cheap, and there are some hidden costs that need to be factored in, such as maintenance, security, and more.
Over the next few months, we will collect more data on those factors and hope to converge on the most cost-effective solution, which will probably be a balance between our own infrastructure and the public cloud. In any case, after a few weeks of running on our local infrastructure, we can say that it requires more planning than simply running on the cloud, since resources are much more limited. On the other hand, once those computational experiments are planned well, we barely require additional resources from cloud providers.