Revamping IT: How VA Can Save $100 Million Through Cloud Migration
Modernizing how the government acquires and runs IT infrastructure is a significant opportunity for federal agencies to improve their operations. This article is the second installment in a two-part series detailing how the Digital Services team at Veterans Affairs (DSVA) has positioned the agency as a pioneer in adopting contemporary, cloud-based IT frameworks. In Part One, we examined how DSVA laid the groundwork for this migration through stakeholder engagement, the creation of an Authority To Operate (ATO), and cloud-compatible contracts.
Over the past year, the Veterans Affairs Enterprise Cloud team (VAEC) has led a comprehensive IT modernization initiative aimed at migrating numerous on-premises applications to the cloud. The main goal of this transition is to improve the scalability and reliability of VA’s applications while lowering IT infrastructure costs through cloud technologies. Through building and operating platforms such as Vets.gov and Caseflow, the Digital Service team at VA (DSVA) became one of the agency's early adopters of Amazon Web Services (AWS). That experience led the team to review VA’s Initial Cloud Reference Architecture for AWS, during which we identified an opportunity to optimize the architecture, potentially saving around $100 million over the next decade.
The Cloud Migration
The Department of Veterans Affairs (VA) operates a vast technological ecosystem, with 632 on-premises applications currently in production. Many of these systems are interconnected, at various stages of their lifecycle, and utilize a diverse array of technologies, some of which date back 40 years. These applications are essential for Veterans to receive their benefits and assistance.
Given the scale and interconnectivity of these critical systems, VA requires a cloud architecture that guarantees reliable, high-bandwidth connectivity to its on-premises network. Additionally, the architecture must be flexible and scalable to accommodate the different application frameworks utilized across VA’s systems.
Transit VPC and Direct Connect
To fulfill its scalability requirements, VA is implementing the Transit Virtual Private Cloud (VPC) solution alongside AWS Direct Connect. This architecture adheres to industry best practices for security, scalability, and availability. The Transit VPC employs a hub-and-spoke model: numerous spoke VPCs connect through a central Transit VPC (the hub), which in turn connects to VA data centers. AWS Direct Connect, capable of 10 Gbit/s connections, links VA's on-premises network with the cloud (AWS GovCloud), providing reliable, high-bandwidth connectivity at a controlled cost.
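The scaling benefit of the hub-and-spoke model can be sketched with quick arithmetic. The spoke count below is purely illustrative, not VA's actual figure:

```python
def full_mesh_links(n_vpcs: int) -> int:
    """Connections needed if every VPC peered directly with every other VPC."""
    return n_vpcs * (n_vpcs - 1) // 2

def hub_and_spoke_links(n_vpcs: int, n_hubs: int = 1) -> int:
    """Connections needed if each spoke VPC connects only to the Transit VPC hub(s)."""
    return n_vpcs * n_hubs

spokes = 300  # hypothetical: e.g., 100 apps x 3 environments
print(full_mesh_links(spokes))      # 44850 pairwise connections
print(hub_and_spoke_links(spokes))  # 300 connections
```

With a hub, the number of connections grows linearly in the number of spokes rather than quadratically, which is why the model suits an environment with hundreds of VPCs.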
Single and Multi-tenant Environments
With the foundational connection infrastructure established, the next challenge is selecting a VPC Network Architecture. We opted for a blend of single-tenant and multi-tenant VPCs (where multiple applications share a VPC).
The multi-tenant VPC setup enables VA to centralize resource provisioning and manage network security at an enterprise level. Among the hundreds of applications previously mentioned, mature applications (those in the sustainment phase) will migrate into the multi-tenant environment using a lift-and-shift strategy.
Applications supported by a DevOps culture (e.g., Vets.gov, Caseflow) will transition into single-tenant VPCs using a cloud-native approach. This setup grants teams greater control over their environments, enabling them to implement their own CI/CD pipelines and fully leverage cloud scalability.
Initial Reference Architecture
With these basic requirements established, the VAEC drafted the Initial Reference Architecture, serving as the blueprint for the entire VA AWS environment. Thanks to the Digital Service's cloud expertise, our engineering team was invited to review this architecture. During our review, we noted that it used Generic Routing Encapsulation (GRE) tunnels, rather than VPC Peering, as a layer-3 overlay for VPC interconnectivity and access to Direct Connect.
Compared with the AWS-provided VPC Peering and VPN services, these GRE tunnels offer several additional capabilities:
- Scaling beyond the 125-connection VPC peering limit
- Multicast packet delivery
- Overlapping IP address space among VPCs
As we discussed potential drawbacks, we recognized a significant issue. Managing GRE tunnels requires at least two Cisco Cloud Service Routers (CSRs) for each Spoke VPC to ensure high availability. With numerous VPCs, the costs for Cisco CSR licenses and AWS EC2 instances could be substantial, with ongoing maintenance and upgrades further inflating costs. A second issue arose: since all Spoke VPCs must peer with Transit VPCs, additional Transit VPCs would need to be deployed to avoid exceeding VPC peering capacity. Managing multiple Transit VPCs would significantly increase network complexity and maintenance expenses.
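A quick sketch of how these two issues compound as the spoke count grows. The 300-spoke figure is illustrative; the 125-peering limit and the two-CSRs-per-spoke requirement come from the discussion above:

```python
import math

def transit_vpcs_needed(spoke_vpcs: int, peering_limit: int = 125) -> int:
    """Each Transit VPC can peer with at most `peering_limit` spoke VPCs."""
    return math.ceil(spoke_vpcs / peering_limit)

def csr_count(spoke_vpcs: int, csrs_per_spoke: int = 2) -> int:
    """Each spoke VPC needs a high-availability pair of CSRs."""
    return spoke_vpcs * csrs_per_spoke

print(transit_vpcs_needed(300))  # 3 Transit VPCs to stay under the limit
print(csr_count(300))            # 600 CSRs to license and maintain
```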
CSR Resource Cost
To project the steady-state costs of this architecture, we assumed that out of the 600 applications in the cloud, 100 would reside in a single-tenant environment. For traffic isolation, we also considered three environments: development, staging, and production.
Working through a rough estimate of the CSR cost breakdown, we concluded that at steady state, the annual CSR cost would be at least $9,897,120. After accounting for engineering maintenance costs (e.g., software upgrades) and licensing overhead, the total annual cost could easily surpass $10 million. Given that this architecture may be in use for over a decade, the overall projected cost reaches $100 million.
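The arithmetic behind that figure can be sketched as follows. The per-CSR annual cost here is back-solved from the stated total rather than taken from an actual price list, so treat it as illustrative:

```python
single_tenant_apps = 100   # of ~600 cloud applications
environments = 3           # development, staging, production
csrs_per_vpc = 2           # high-availability pair per spoke VPC

spoke_vpcs = single_tenant_apps * environments  # 300 spoke VPCs
total_csrs = spoke_vpcs * csrs_per_vpc          # 600 CSRs

# Back-solved from the article's total: $9,897,120 / 600 ~= $16,495
# per CSR per year (license plus EC2 instance), not a quoted price.
per_csr_annual_usd = 9_897_120 / total_csrs

annual_cost = total_csrs * per_csr_annual_usd
print(f"${annual_cost:,.0f} per year")            # $9,897,120 per year
print(f"${annual_cost * 10:,.0f} over a decade")  # $98,971,200 over a decade
```

Adding maintenance and licensing overhead on top of this resource cost is what pushes the ten-year projection to roughly $100 million.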
Revised Architecture
Reassessing the situation, the team concluded that the costs and complexity of GRE tunnels could outweigh their benefits in both the short and long term. We determined that we could largely replace the GRE tunnels with AWS’s Managed VPN service and VPC Peering service. Both options are low-cost and would meet the needs of most applications. This adjustment eliminates the hundreds of CSRs the GRE tunnel approach requires, yet still leaves room for GRE tunnels: as VA's cloud architecture evolves, we can iteratively implement them for applications that require exceptions to AWS standards.

By offloading VPN endpoints to AWS’s Virtual Private Gateways, we significantly reduce network complexity and the number of CSRs required. With this design, we estimate that only 18 CSRs will be needed at steady state. Combining the CSR and AWS VPN costs yields an estimated total of $403,128 annually, a 95.9% reduction in resource costs compared to the original architecture.
Final Thoughts
Being invited to preview the cloud architecture of a large organization before its production launch is a rare opportunity. It is even rarer to have the chance to optimize the architecture for substantial impact. The Digital Service at VA was fortunate to collaborate with the VA Enterprise Cloud, resulting in a streamlined design that could potentially save $100 million over the next decade. By the end of 2017, this new architecture was implemented, and we began migrating applications to the cloud environment. Caseflow and Vets.gov will be among the first applications to transition to this new infrastructure. This initiative will undoubtedly enhance the reliability of applications utilized by Veterans daily, and the Digital Service team is excited to partner with the VAEC team to realize this mission.
Reference to specific commercial products, processes, or services, or the use of any trade, firm, or corporation name is for informational purposes and does not imply endorsement by the U.S. government.
The best of technology. The best of government. And we want you. We’re seeking dedicated designers, software engineers, product managers, and others who are passionate about reimagining and redesigning essential government services. Join a team of the most talented technologists from both the private sector and government. For inquiries about employment with the U.S. Digital Service, please contact us at [email protected] and visit usds.gov/join.
Join the U.S. Digital Service | Follow us on Twitter | Visit our Site | GitHub