The examples below are real customer cases where ByteRouter was used to establish cost-efficient data analysis infrastructure and realize significant cost savings. Notably, the examples come from small teams where ByteRouter allowed analysis capacity to be scaled up quickly without the help of infrastructure experts.
As part of an innovative signal protein analysis platform, Raven Biosciences required a cost-efficient way of performing large-scale molecular docking using the HADDOCK software suite. Molecular docking is CPU-intensive, and HADDOCK also requires fast disk I/O to fully utilize multiple CPU cores.
The pipeline consisted of several containerized applications handling data preparation, docking and post-processing. The input consisted of small files describing the structure of the proteins involved. The docking process generated large amounts of intermediate data (50-200 GB per dataset), but only a few MB were needed for downstream processing. As both the input and the retained output were small, it wasn't necessary to establish any fast shared storage.
S3-compatible object storage was used to store both input and output data. The S3 storage was hosted by DigitalOcean in US East, which also hosted the customer systems that initiated the analyses and displayed the output.
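To illustrate the data flow, the sketch below shows how a pipeline container might exchange data with the S3-compatible storage using boto3; the endpoint, bucket names and object keys are hypothetical placeholders rather than the customer's actual configuration.

```python
# Minimal sketch (not the actual pipeline code): exchanging data through
# S3-compatible object storage such as DigitalOcean Spaces using boto3.
# Endpoint, credentials, bucket names and object keys are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://nyc3.digitaloceanspaces.com",  # hypothetical US East endpoint
    aws_access_key_id="SPACES_KEY",                       # injected via secrets in practice
    aws_secret_access_key="SPACES_SECRET",
)

# Fetch the small protein structure inputs before docking ...
s3.download_file("docking-input", "pairs/pair-0001/receptor.pdb", "/work/receptor.pdb")
s3.download_file("docking-input", "pairs/pair-0001/ligand.pdb", "/work/ligand.pdb")

# ... and upload only the few MB of post-processed results; the 50-200 GB of
# intermediate docking data stays on the node's local disk.
s3.upload_file("/work/results/summary.tsv", "docking-output", "pairs/pair-0001/summary.tsv")
```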
Contabo was used as the primary hosting provider, and an existing on-premises server was also connected. The setup is visualized below.
Over a span of 18 months, more than 5000 protein pairs were analyzed. The estimated cost savings compared to AWS are described below. The execution time for each dataset depends primarily on the size of the proteins being docked; the cost estimates below therefore use an average protein size, which also allowed benchmarking on AWS infrastructure.
Note that AWS spot instances were not considered, as they are subject to capacity constraints and interruptions that would not work well with this type of pipeline. For reference, the AWS spot instance cost index is 270%, before the extra time and storage costs needed to handle interruptions.
Using the 3-server setup described above, it was possible to achieve a utilization of more than 80% across all nodes. Assuming 80% utilization and the per-dataset costs from the table above, the total cost of processing the 5000 datasets was USD 8,855. Contabo is an "all-inclusive" service with no additional costs. Instances were rented on a per-month basis and were shut down as the project neared its end.
With the most cost-effective AWS instance type, processing the 5000 datasets on on-demand instances would have cost USD 50,301. Additional costs for storage, data egress and API calls should be expected, but they are estimated to be small for this pipeline.
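The savings follow directly from these two totals, and dividing by the 5000 datasets gives the approximate implied cost per dataset:

$$
\text{Savings} = 50{,}301 - 8{,}855 = 41{,}446\ \text{USD},
\qquad
\frac{8{,}855}{5000} \approx 1.77\ \text{USD/dataset}
\quad \text{vs.} \quad
\frac{50{,}301}{5000} \approx 10.06\ \text{USD/dataset}.
$$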
Total estimated savings: USD 41,446.
In another project with Raven Biosciences, protein-protein interactions were simulated using Molecular Dynamics (MD) to estimate the binding affinity of antibodies to protein targets. Modern MD simulation makes extensive use of GPUs, and floating-point performance is central to simulation speed, which is measured in nanoseconds per day (ns/day). Besides a fast GPU, simulations also require a relatively fast CPU, as some calculations must be performed there.
The analysis pipeline was based on GROMACS, an open-source MD software package. The software was first installed directly on the machines but was quickly containerized to allow faster scaling with additional nodes. Each simulation took several days to complete, which made the transfer time of the 1-20 GB output negligible in comparison and made this pipeline suitable for distribution across multiple data centers.
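As a rough illustration, the sketch below shows how a single containerized simulation might be launched on a GPU node; the container image, paths and GROMACS invocation are assumptions for illustration rather than the project's actual setup.

```python
# Rough sketch: launching one containerized GROMACS simulation on a GPU node.
# Image name, paths and input file names are hypothetical placeholders.
import subprocess

def run_simulation(workdir: str) -> None:
    """Run one GROMACS simulation inside a container with GPU access."""
    subprocess.run(
        [
            "docker", "run", "--rm", "--gpus", "all",
            "-v", f"{workdir}:/data", "-w", "/data",
            "gromacs/gromacs",                # public GROMACS image (an assumption here)
            "gmx", "mdrun", "-deffnm", "md",  # expects md.tpr prepared beforehand (gmx grompp);
                                              # GPU offload is used automatically when available
        ],
        check=True,
    )

if __name__ == "__main__":
    run_simulation("/scratch/antibody-target-001")
```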
The analysis infrastructure for this pipeline is visualized below. Two data centers were used to rent GPU nodes, and an existing on-premises server with 2 GPUs installed was also connected. One data center was located in the Netherlands and the other on the US East Coast. Two data centers were used for redundancy and to have one data center close to the development team and another close to the customer systems. A positive side effect of using two vendors was that special offers from both could be leveraged to reduce the cost. Data was fed into the system from an S3 bucket hosted by DigitalOcean, and results were uploaded to another S3 bucket where they could be post-processed by smaller VMs running on DigitalOcean.
The infrastructure consisted of 18 GPU nodes and was operational for 8 months with an average utilization above 80%. The table below compares the cost of the different nodes with the most cost-efficient AWS GPU instances available at the time. The MD analysis pipeline depends on fast floating-point performance, which means that more powerful GPUs such as the A100/H100 would not be cost-efficient, as they are optimized for AI applications where tensor cores are utilized.
Note: spot instances were not considered since the availability of the required instance types is very poor. MD can handle interruptions, but at the expense of additional storage costs to save state, and additional effort would be required to handle the interruptions.
The cost of simulating 150,000 ns of an average-sized dataset with the infrastructure described above, assuming 80% utilization, is USD 21,539. Using AWS on-demand instances, and ignoring overhead from e.g. elastic storage, API calls and data egress, the cost would be USD 95,162.
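As before, the savings follow from the two totals, and dividing by the 150,000 ns simulated gives the approximate implied cost per nanosecond:

$$
\text{Savings} = 95{,}162 - 21{,}539 = 73{,}623\ \text{USD},
\qquad
\frac{21{,}539}{150{,}000} \approx 0.14\ \text{USD/ns}
\quad \text{vs.} \quad
\frac{95{,}162}{150{,}000} \approx 0.63\ \text{USD/ns}.
$$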
Both HostKey and DatabaseMart are "all-inclusive" services where instances are rented on a monthly basis. Further savings were realized in the project, as several instances were rented at a discount; those discounts are not included in these calculations.
Total estimated savings: USD 73,623.