AWS VPCs: A Reintroduction

or, how to allocate 5 million private IPs in 17 regions, 10+ countries, 2 commands, 5 minutes.

A Starting

Do you know how many services AWS has these days?

299 services as of this writing!

So much for the cloud simplifying things, right?

Since we have 299 wonderful services to discover, let’s ignore all of them and look at a foundational cloud concept everybody forgets about because one click deploy magic cloud magic scale no ops no ops no ops!!!

Networking Background

Networking in AWS evolved from its original infrastructure hacks, where everybody shared the same 10.0.0.0/8 internally with no peer protection, into our current reality of fully software defined isolated networks you can create and destroy on-demand, with somewhat thoughtful features like optionally peering your virtual networks across regions or accounts and even bringing your own IP addresses to AWS then announcing them over BGP.

The software defined network in your AWS account is called a VPC — “virtual private cloud” — and by default you can create 5 fully isolated VPC network environments per region in each of your AWS accounts.

cool cool, right? But if you’re like most cloud people these days, you’ve never looked at or configured anything to do with your AWS networking setup. You click randomly when a new service launch asks “which subnets?” and just hope things work.

Actually though, your AWS VPC config has massive impacts on your service scalability and reliability, and it can make or break your ability to connect to other services as your global infrastructure eventually needs to expand.

Unfortunately, AWS gives all accounts VPC configurations with really bad default settings.

By default, AWS gives each region in your account its own VPC (VPCs are single region-bound resources), but each default VPC AWS creates for you is exactly the same and uses duplicate IP address space in every region.

Not only is the default AWS VPC configuration poor because it uses the same IP ranges in every region, it also uses subnet sizes that aren’t optimized for growth, which can limit your scalability unexpectedly (less so now with recent changes, but not long ago, AWS was allocating up to 3,000 IP addresses for high throughput Lambda functions — did you know that? Do you know if your subnets even have 3,000 free IP addresses in each AZ?)

Automated Global Network Configuration

AWS currently has 26 regions and 84 AZs, but only about 17-20 regions are generally usable because the others are provisioned for private government services (looking at you, us-gov-topsecret-2) or are politically restricted regions.

By default AWS gives you access to about 17 isolated[1] geographic regions around the world. Also by default, AWS creates a default VPC inside each region with questionable choices you should definitely change. The AWS default VPC in each region uses the same /16 (172.31.0.0/16) configuration everywhere (which is actually bad since you’re likely not employing a full time network architect to properly design your systems from the start anymore because cloud cloud cloud no knowledge single click cloud cloud!!!).

For a more scalable and interoperable network architecture, you should define custom unique CIDR blocks in each VPC (region) in your account, but how do you create, manage, maintain, then grow 17+ network configurations[2] over time? We can do it! We have the technology.

An AWS VPC can contain 5 CIDR Blocks by default (up to 50 with a quota increase). Each CIDR Block can be at most a /16 (64k IPs), so even with no quota increases, you can allocate over 300k IPs per VPC (5 × 65,536 = 327,680) if you configure things correctly.
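
For reference, associating an extra CIDR block with an existing VPC is a one-resource operation; a minimal sketch (aws_vpc.main here is a placeholder for an already-created VPC, not anything planvpc names for you):

# Attach a secondary /16 to the VPC (up to 4 secondaries alongside the
# primary block at the default quota).
resource "aws_vpc_ipv4_cidr_block_association" "secondary" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.3.0.0/16"
}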

What does a correct global scalable interoperable AWS VPC configuration require and why does it matter?

AWS allows you to connect independent VPCs together across accounts and regions to ensure your traffic both remains in-AWS (encrypted in transit, on off-Internet private IPs across all regions) and uses the cheapest network path available—but this feature is ONLY AVAILABLE if your VPCs DO NOT have any overlapping CIDR blocks. Remember, by default, AWS gives you the same duplicate /16 in every region, so to use the very very useful VPC peering feature, you need to custom-configure your entire AWS networking infrastructure from the ground up.
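
As a concrete illustration, cross-region peering between two of the planned VPCs is just a requester plus an accepter in Terraform. This is a minimal sketch, not part of planvpc; the vpc_id output and the us-west-2 module/provider names are assumptions about how the generated config might be wired up:

resource "aws_vpc_peering_connection" "use1_to_usw2" {
  provider    = aws.us-east-1
  vpc_id      = module.planvpc-us-east-1.vpc_id # assumed module output
  peer_vpc_id = module.planvpc-us-west-2.vpc_id # assumed module output
  peer_region = "us-west-2"
}

resource "aws_vpc_peering_connection_accepter" "usw2_accepts" {
  provider                  = aws.us-west-2
  vpc_peering_connection_id = aws_vpc_peering_connection.use1_to_usw2.id
  auto_accept               = true
}

# Each side still needs routes pointing the other VPC's CIDR blocks at the
# peering connection, which only works because no CIDRs overlap.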

To configure the best global AWS VPC setup ever, you just need to pick your global base network/supernet (10.0.0.0/8) then start slicing off 5 new /16s for each region around the world to end up with a globally unique, non-shared, non-overlapping private IP space. Then you can effortlessly use direct VPC peering between any region pair without last minute panic re-configurations due to CIDR block conflicts you didn’t even know existed.

After selecting all the proper /16 CIDR blocks, inside each VPC (inside each Region) simply assign sequential /19 IP ranges to each Availability Zone in the region (“Subnets” in AWS speak) for each network access configuration you need (e.g. 100% public Internet-facing Subnets, 100% internal no-Internet Subnets). Each Subnet is geographically bound to an Availability Zone. “Subnet” is how Amazon lets you specify both network controls and deployment location controls when provisioning new services[3].

Now our global VPC network planning goal looks like:

  • For each Region:
    • reserve the next 5 /16 for Region CIDR blocks
    • For each Subnet Type (public Internet routable, Internal non-routable)
      • For each AZ in the Region:
        • reserve the next /19 from the Region CIDR blocks
      • repeat until /19 allocations are exhausted or until subnet types are exhausted

Note the pre-selection of /16 CIDR blocks in each region — we reserve 5 /16s for each VPC up front regardless of how many Subnets we want to populate — this means even if we don’t use the full CIDR capacity for Subnet allocations right now, the unused CIDR blocks are still reserved for any future expansion or reconfiguration of the region.
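
The arithmetic behind this plan is just fixed-width slicing, which you can sanity-check with Terraform's own cidrsubnet() function. A minimal illustrative sketch (not planvpc's actual generator; the region list, ordering, and offsets are placeholders, and planvpc's real numbering also depends on its account offset):

locals {
  supernet = "10.0.0.0/8"
  regions  = ["us-east-1", "us-east-2", "us-west-2"] # list order decides offsets

  # Each region reserves the next 5 consecutive /16s (newbits = 8 turns a /8 into /16s).
  region_cidrs = {
    for idx, region in local.regions : region => [
      for n in range(5) : cidrsubnet(local.supernet, 8, idx * 5 + n)
    ]
  }

  # Each /16 splits into eight /19s (newbits = 3) to hand out per subnet type per AZ,
  # e.g. 10.0.0.0/19, 10.0.32.0/19, ..., 10.0.224.0/19 for the first /16 above.
  example_region_slices = [
    for n in range(8) : cidrsubnet(local.region_cidrs["us-east-1"][0], 3, n)
  ]
}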

great great great we have a plan to generate a network plan, but what do we do with it? Are we going to sit in the AWS Console and create 20+ VPCs and 100+ CIDR blocks and 200+ subnets by hand?

no no no no no.

We can use my new planvpc system to auto-generate both the network plan and a Terraform script to deploy your network plan around the world with fully automated commands.

First we auto-generate a configuration defining custom VPC CIDR blocks and custom per-AZ subnets across all current AWS regions in our account:

Which creates planned.myregions.json looking like:

great nice json but what now

Automatically Deploying Globally Unique VPC Networks

Since we have nicely formatted network plan output, planvpc can auto-populate a Terraform script to deploy your globally non-overlapping VPC and subnet configurations around the world as quickly as:

to which terraform will reply:

and the “340 to add” contains (for this example[4]):

  • creating 17 VPCs across 17 regions
  • creating 17 igw for all VPCs (how public subnets route to the Internet)
  • creating 17 default network ACLs for all VPCs
  • creating 17 public IPv4 routes for all VPCs
  • creating 17 public IPv6 routes for all VPCs
  • creating 17 internal routes for all VPCs
  • creating 55 total public subnets across all AZs
    • plus creating 55 routes to the local VPC igw
  • creating 55 total private subnets across all AZs
    • plus creating 55 routes with no public Internet access

planvpc includes a Terraform module[5] to create a VPC from a single region network description in the JSON output. With the module, we can loop over each region to create VPCs and Subnets all around the world (arguably with as few as 2 commands (generate_terraform_config above and terraform apply -auto-approve)).

Running the above network creation layout takes 6-8 minutes depending on the slowest region responding to creation requests (us-east-1 can regularly take 2-3 minutes longer than every other region combined to create resources).

Sample auto-generated Terraform script from your pre-planned network output:

# ================================================================================
# module.planvpc-us-east-1:
# ================================================================================
module "planvpc-us-east-1" {
  source = "./tf/modules/vpc-auto"

  cidr_primary            = "10.2.0.0/16"
  cidr_secondaries        = ["10.3.0.0/16"]
  cidr_secondaries_unused = ["10.4.0.0/16", "10.5.0.0/16", "10.6.0.0/16"]
  subnets = {
    "public" = {
      "use1-az1" = "10.2.0.0/19",
      "use1-az2" = "10.2.32.0/19",
      "use1-az3" = "10.2.64.0/19",
      "use1-az4" = "10.2.96.0/19",
      "use1-az5" = "10.2.128.0/19",
      "use1-az6" = "10.2.160.0/19"
    },
    "internal" = {
      "use1-az1" = "10.2.192.0/19",
      "use1-az2" = "10.2.224.0/19",
      "use1-az3" = "10.3.0.0/19",
      "use1-az4" = "10.3.32.0/19",
      "use1-az5" = "10.3.64.0/19",
      "use1-az6" = "10.3.96.0/19"
    },
  }

  providers = {
    aws = aws.us-east-1
  }
}

# ================================================================================
# module.planvpc-us-east-2:
# ================================================================================
module "planvpc-us-east-2" {
  source = "./tf/modules/vpc-auto"

  cidr_primary            = "10.7.0.0/16"
  cidr_secondaries        = []
  cidr_secondaries_unused = ["10.8.0.0/16", "10.9.0.0/16", "10.10.0.0/16", "10.11.0.0/16"]
  subnets = {
    "public" = {
      "use2-az1" = "10.7.0.0/19",
      "use2-az2" = "10.7.32.0/19",
      "use2-az3" = "10.7.64.0/19"
    },
    "internal" = {
      "use2-az1" = "10.7.96.0/19",
      "use2-az2" = "10.7.128.0/19",
      "use2-az3" = "10.7.160.0/19"
    },
  }

  providers = {
    aws = aws.us-east-2
  }
}

... and 1,000 more lines covering every region ...

Note the Terraform script also records unused CIDR blocks directly in the script for self-documentation (plus unused subnets, left out here for clarity).

You can save this Terraform config in revision control as the ground truth of your global VPC+Subnet configuration[6], then move CIDR blocks and unused subnets into future usage as necessary directly in this configuration without needing to consult the übermegajson again.

You can also use your global VPC+Subnet Terraform config to further customize your global network with Network ACLs and more specific Subnet configurations as needed (a good practice here is to keep your “AWS Network Terraform Config” independent from any per-service or per-application scalability configurations—so your Network Terraform config controls the underlying network, and anything else needing to grab the VPCs or Subnet IDs just uses data sources filtered by appropriate tag or metadata fields instead of using the direct resources themselves).
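
For instance, a minimal sketch of that data-source pattern (the tag names here are assumptions, not something planvpc sets for you; tag your VPCs and Subnets however you like in the network config and filter on that):

# Application/service Terraform looks the network up by tags instead of
# importing the network module's resources directly.
data "aws_vpc" "this" {
  tags = {
    ManagedBy = "planvpc" # assumed tag added in your network config
  }
}

data "aws_subnets" "internal" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.this.id]
  }
  tags = {
    tier = "internal" # assumed tag distinguishing public vs internal Subnets
  }
}

# Downstream services consume data.aws_subnets.internal.ids and never touch
# the network Terraform state directly.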

but that’s just a theory A NETWORKING THEORY

Feel more enlightened now? This system is how I’m going to run all my AWS accounts going forward. It’s often useful to bridge VPCs together with PrivateLink across dimensions of accounts+regions, and creating globally unique non-conflicting IP space among multiple accounts up-front enables more “best quick solution no require brain hurt” optimizations than are possible using the default VPC and subnet configurations AWS gives you without any initial input when creating new accounts (much less when inheriting existing accounts from less enlightened times).

Any questions? Well, you probably shouldn’t ask me since Amazon has rejected me over multiple interviews in the past three years, so I’m kinda useless as a person and don’t have any value to provide to any organizations or people or society at large ever again apparently.

More Details Than You Can Shake a Bezos At

these are additional notes i made for this article about points to hit or explain or emphasize, but i ran out of caring energy.

enjoy the raw notes!

ALSO MENTION:

  • subnets become hard-line capacity limits when they run out of addresses
  • PrivateLink kinda scammy
    • “let’s charge customers $3,000 per month if they want to use some of our services in their private IP space across all zones in all regions!” uh, no tanks?
  • “public” “private” “internal”
    • in AWS Speak, “public” means a subnet has a default route to an AWS internet gateway, so instances in a subnet can access the Internet directly
    • in AWS Speak, “private” means a subnet doesn’t have an igw, but practically most people use “private subnet” to mean “subnet with a NAT gateway default route instead of direct Internet access,” but this is a shitty security practice and those people should feel bad.
    • private not private, NAT is not a firewall, ask log4j how having a “private” globally routable NAT gateway protects your “private” subnets.
    • design your systems to split between igw public or 100% non-routable.
      • being non-routable is difficult in AWS though because every public AWS service uses public IPs for connections. S3 and DynamoDB at least have free Gateway endpoints that work from non-routable subnets, but for most other services in a non-routable subnet without a NAT gateway, you have to enable PrivateScamLink where you pay AWS $7.20/month in each AZ where you need to alias a public AWS service into your private IP space without global routing (plus you get to—PRAIZE BEZOS—pay $0.01/GB for all traffic over your PrivateScamLink instead of free traffic if you just use the regular public endpoints). See the sketch after this list.
      • most architectures end up going overweight on “public + security groups” instead of building out proper public/private network infrastructures these days because it’s simpler and more cost effective, even though it doesn’t look great from a network architecture point of view.
  • IPAM very scammy (AWS charging you a monthly rental fee to use your own self-defined private IP space)
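
The endpoint sketch referenced above: a Gateway endpoint (free, S3/DynamoDB only) versus an Interface endpoint (PrivateLink, the per-AZ-billed one). A hedged sketch only; the VPC, route table, subnet, and security group names are placeholders, not anything planvpc creates:

# Gateway endpoint: free; works by injecting routes into the subnet route
# tables, so it works from 100% non-routable subnets (S3 and DynamoDB only).
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.internal.id]
}

# Interface endpoint (PrivateLink): billed per AZ per hour plus per GB, but
# it aliases the service onto private IPs inside your own subnets.
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [aws_subnet.internal_use1_az1.id]
  private_dns_enabled = true
  security_group_ids  = [aws_security_group.endpoints.id]
}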

SUBNETS ARE (AT LEAST) FOUR THINGS:

  • deployment location by AZ / distributed redundancy by default
  • network security configuration (Routes and Network ACLs)
  • capacity limits for deployments based on your private IP address availability
  • capacity limits based on AWS service availability
    • AWS may have 500 instances available in AZ 2 and 0 in AZ 1, so if you limit your deploy only to AZ 1, your deploy fails even though capacity is available elsewhere.
  • the only unit of “not the same failure domain” AWS exposes
    • e.g. if you deploy 10 things in the same AZ, they could all be in the same rack or even same server, you don’t know.
  • the only unit of workload placement you can control.
    • if you want to guarantee your data cache servers and your database and your app servers all have the lowest latency, give them all ONE subnet to deploy into (especially if all the services are singular and can’t operate independently)
    • alternatively, you can give your DB all 4 AZs as choices for primary + replica redundancy, then give your app servers + cache servers only ONE subnet to deploy into, so your cache is close to your app, but your DB is allowed to be elsewhere, while the DB replica will (with good placement logic) always be independently located away from the primary.
  • don’t be pressured into thinking you have to use ALL AZs available to you though (for example: us-west-2 has 4 AZs, us-east-1 has 6 AZs, us-west-1 has 2 AZs). Usually distributing workloads across just two AZs is fine.
    • one of my favorite use cases: deploy everything into EC2 spot or FARGATE_SPOT instances across multiple AZs and just let ECS service desired_tasks automatically relaunch things when spot kills them off (see the sketch after this list).
    • Also have to account for extra fees:
      • FREE: traffic inside one AZ is free (or from any “AWS hosted data service” in a region to any AZ in the region (like an S3 bucket with origin location in the same region regardless of zone))
      • $0.01/GB: traffic between AZs using your private IP space in their parent region is $0.01/GB BOTH WAYS IN/OUT
      • $0.02/GB: traffic between regions (private/bridged or public AWS<->AWS network) is $0.02/GB
      • $0.09/GB: traffic out to the Internet is $0.09/GB (inbound free)
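
The FARGATE_SPOT pattern mentioned in the list above, as a minimal hedged sketch (the cluster, task definition, subnet lookup, and security group names are illustrative, and the cluster is assumed to already have the FARGATE_SPOT capacity provider enabled):

# ECS keeps replacing tasks that spot reclaims because the service always
# converges back toward desired_count.
resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 10

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 1
  }

  network_configuration {
    # every internal subnet means every AZ is a placement candidate
    subnets         = data.aws_subnets.internal.ids
    security_groups = [aws_security_group.app.id]
  }
}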

Everything you do in AWS requires networking capacity, so you need to have good VPC architecture up front.

You want to launch 1,000 fargate containers? It’ll use 1,000 IP addresses (both private and public by default!)

You want to skip fargate and run containers on your own instances (spot instances to save money, right?) — enable VPC Trunking, but then it uses more internal IPs too, so you need to make sure all your subnets are big enough for all your use cases.
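
Turning trunking on is a single ECS account setting; a minimal sketch (this only matters for ECS tasks in awsvpc mode on your own EC2 instances, and it has to be enabled before the container instances register):

# Opt the account default into ENI trunking so each supported EC2 container
# instance gets a trunk ENI and can pack in more awsvpc-mode tasks.
resource "aws_ecs_account_setting_default" "trunking" {
  name  = "awsvpcTrunking"
  value = "enabled"
}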

Price hacks:

  • S3 gives you 100 GB outbound free per month
  • CloudFront gives you 1 TB outbound free per month
  • CloudFront doesn’t charge for S3 origin fetches
  • So configure a persistent cloudfront origin to your S3 bucket then add an alarm to notify you when you reach 1 TB, then move back to S3 directly for the next 100 GB, then pick cloudfront or S3 going forward after. Saves you $100/month.

  • CloudFront is 5% cheaper for NA and Europe access than other AWS network access too (but more expensive internationally).

ACCESS CONTROLS FLOW FROM:

  • does current Network ACL attached to subnet allow access?
  • yes? -> send to security groups for target ENI
  • no? -> reject

AWS, in its infallible wisdom, made up its own terminology for basic network practices and now I guess we all just have to adjust.

also because AWS made confusing and conflicting service requirements, they happily create even more services to manage their broken services, with everything being charged per hour forever.

Basically:

  • “Subnets” in AWS networking speak are basically VLANs, but isolated to an individual Availability Zone
    • Even though “Subnets” are your unit of allocation for services, DHCP is configured and attached to the VPC itself, not configured per-Subnet (and if you don’t configure DHCP directly, a default DHCP configuration is provided automatically and you don’t have to provide any DHCP forwarding settings anywhere).
  • “Network ACL” is your overall per-subnet (stateless!) network firewall attached to each Subnet
    • by default, AWS creates a single Network ACL for Subnets to use if you don’t create any other ACLs. The default Network ACL has two rules to start out: allow all inbound, allow all outbound.
    • this is where you go if you need to block something immediately from all your infrastructure in one place (see the sketch after this list)
    • many organizations never touch their Network ACL and only use security groups, but knowing how to quickly add Network ACL entries is important if you ever need to quickly respond to security issues.
    • ACL docs: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-network-acls.html
      • Notably, each subnet can only have one Network ACL and a Network ACL is limited to 19 custom rules total by default with 39 max with a quota increase.
        • One way to hack around not having enough Subnet-Wide ACL rule capacity is by using per-interface Security Groups (the SG quota is 60 rules per group by default with a max of 1,000 rules per network interface (across all attached Security Groups combined) if you request quota increases).
  • “Security Groups” are like individual firewalls attached on-demand based on deployment descriptions
    • Security Groups are DENY by default, so can only ALLOW traffic, not block traffic.
    • Because AWS wants to give you random IP addresses for everything launched, a more traditional network-wide firewall doesn’t make sense because you can’t target consistent internal IPs over time. Their workaround is now firewall rules floating with individual deployed entities (ENIs) instead of access controls attached to “the network” itself.
  • Each EC2 instance type has a maximum number of network interfaces attachable, but there are workarounds for higher tenancy ECS usage via enabling ENI Trunking to provision a special additional untagged/interswitch/trunk ENI on your EC2 instance to multiplex network traffic for optimizing ECS service deployments.
  • “Internet Gateway” is basically what defines your default gateway service in the VPC (has no config options)
  • “Availability Zone” can be thought of as a single datacenter from a latency and DR point of view
    • but for a redundancy/placement point of view, one AZ is basically “one rack” in a mental landscape because you can’t guarantee two services deployed in one AZ aren’t in the same immediate failure domain.
  • “Region” can be thought of as a private metro network between Availability Zones
    • AWS claims all regular AZs in a region are within 100 km of each other, but geographically distinct.
  • AWS also has “Local Zones” which are basically AWS colo’ing cages or tertiary buildings at regular exchanges/datacenters away from a traditional physical AWS region to avoid more immediate full hundo-million dollar buildouts.
  • AWS also has “Wavelength Zones” which are basically AWS colo’ing inside Verizon datacenters for faster access to wireless customers. Good places to deploy your highly optimized, memory-efficient end-user cache + IoT platforms.
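
The quick-block sketch referenced in the Network ACL notes above, assuming you manage the ACL yourself as aws_network_acl.main (rules evaluate lowest-number-first, so a low rule number wins before the usual allow rules):

# Block one bad CIDR across every subnet associated with this ACL.
resource "aws_network_acl_rule" "emergency_block" {
  network_acl_id = aws_network_acl.main.id # assumed: an ACL you already manage
  rule_number    = 10
  egress         = false            # inbound rule
  protocol       = "-1"             # all protocols
  rule_action    = "deny"
  cidr_block     = "203.0.113.0/24" # example/documentation range
}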

  • Amazon provides a JSON document of all their public IP ranges for you to create custom firewall rules if necessary.
    • You can do things like:
      • Get all S3 IPs in us-west-1:
        • jq '[.prefixes[] | select((.service == "S3") and (.region == "us-west-1")).ip_prefix]' ip-ranges.json
      • Get all AWS IP ranges but exclude EC2 (because user traffic only comes from the EC2 service):
        • jq '[.prefixes[] | select(.service == "AMAZON").ip_prefix] - [.prefixes[] | select(.service == "EC2").ip_prefix]' ip-ranges.json
      • and note: AWS can update their IP list at any time (sometimes multiple times per week), so if you want to use it for network filtering, you need to subscribe to their SNS endpoint arn:aws:sns:us-east-1:806199016981:AmazonIpSpaceChanged and probably run your firewall updater at least once an hour anyway to update any new rules you may need.
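
A hedged sketch of wiring that subscription up in Terraform; the firewall_updater Lambda is a hypothetical function assumed to exist already, and the subscription is created in us-east-1 because that's where Amazon's topic lives:

# Subscribe an existing updater Lambda to Amazon's IP-range change topic.
resource "aws_sns_topic_subscription" "ip_ranges_changed" {
  provider  = aws.us-east-1 # the AmazonIpSpaceChanged topic is in us-east-1
  topic_arn = "arn:aws:sns:us-east-1:806199016981:AmazonIpSpaceChanged"
  protocol  = "lambda"
  endpoint  = aws_lambda_function.firewall_updater.arn # hypothetical function
}

# Allow SNS to invoke that Lambda.
resource "aws_lambda_permission" "allow_ip_ranges_topic" {
  statement_id  = "AllowExecutionFromSNS"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.firewall_updater.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = "arn:aws:sns:us-east-1:806199016981:AmazonIpSpaceChanged"
}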

END OF DOCUMENT. THX FOR THE MEMORIES.


  1. well, isolated except for all global AWS login credentials being stored only on a single brown Zune in us-east-1 somewhere.

  2. 17 network configurations is the minimum, since each VPC can have 5 CIDR blocks, so you potentially need to manage 17 * 5 = 85 CIDR blocks, then manage the subnets inside those CIDR blocks themselves, and each /16 CIDR block can hold 8 /19 subnets, so that’s a total of 85 CIDR blocks across your account * 8 /19 subnets inside each CIDR block == 680 individual final AZ Subnet network configurations at maximum deployment (before any additional quota increases!)

  3. when AWS asks you which Subnets to use for launching a service, the Subnet(s) you pick decide the collection of deployment AZs available to the launcher plus any default (or custom) Network config (gateways, routes, ACLs) already attached to the final deployed subnet.

  4. creating all of these at once for all configured regions plus subnets.

  5. and my Terraform VPC+Subnets module uses the correct terraform for_each syntax so you can add/remove subnets without erasing your entire configuration, unlike the most popular terraform VPC creation module which uses count-based list offsets for subnet creation.

    the problem with count-based offsets in Terraform is resources become indexed in state by list position offset—so if you ever add or remove or change the order of your defined subnets, Terraform decides to delete all of them then re-create them (because their position in the state list changed and Terraform can’t maintain persistent identifiers for list offset positions); the only proper way to manage multi-input auto-iterating Terraform configurations in a long-term maintainable way is using for_each with named keys, not fixed-position maybe-future-changing list offsets.

  6. to use the repo as revision control, clone aws-vpc-global-planner, create a new branch for your private management, delete the existing .gitignore so you can save your own plans and terraform config locally, run the plan generator and the terraform generator, then commit your local plans and configs directly to your branch of the repo.

    You will also likely want to modify the generated terraform config to use an external backend with proper distributed state locking. Another idea is to modify the generation script to create one terraform script per-region instead of generating them all in one file (for easier long term feature additions per-region/per-vpc), but I haven’t bothered to flesh out the feature set that far yet.

    It’s probably also useful to record the original commands for your subnet generation if not using config file defaults or writing your own config file.

    Example config I ended up using to create globally unique regions across 3 AWS accounts so any region in any account can be peer’d/bridged to any other region in any account (where, for each config generation, I increment account_offset=N and change profile=X for each account’s profile definition in ~/.aws/config):