ansible ec2_facts returns false data (if NAT is configured on the system level; it is OK if You use the AWS router interface gateway)

THE PROBLEM:
I’ve just realised why my playbook sometimes fills the template with false data.

This happens when the instance is in my VPC subnet (with an internet gateway) but the default route is pointed at a NAT instance on the system level. Requests to the metadata address then go through the NAT instance, and the AWS response describes the NAT instance.
So the NAT instance’s facts are returned, NOT the current instance’s facts.

THE DEBUGGING:

If You look into the code, ec2_facts sends a bunch of requests to

http://169.254.169.254/latest/meta-data

For example:

curl http://169.254.169.254/latest/meta-data/local-ipv4
172.16.0.200

while the real data is

eth0: ***

inet 172.16.0.110/24 brd 172.16.0.255 scope global eth0
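
The mismatch shown above can be checked mechanically. Below is a minimal Python sketch, assuming the output of `ip addr` has been captured as text and the metadata answer is passed in as a string. The function names are hypothetical and are not part of ec2_facts.

```python
import re


def interface_ipv4_addresses(ip_addr_output):
    """Return every IPv4 address found on 'inet' lines of `ip addr` output."""
    # 'inet6' lines are skipped because the pattern requires 'inet ' + a digit.
    return re.findall(
        r"^\s*inet (\d+\.\d+\.\d+\.\d+)/\d+", ip_addr_output, flags=re.MULTILINE
    )


def metadata_matches_interface(metadata_ipv4, ip_addr_output):
    """True if the address reported by the metadata service is configured locally."""
    return metadata_ipv4.strip() in interface_ipv4_addresses(ip_addr_output)


# Sample taken from the output quoted in this post.
SAMPLE_IP_ADDR = """\
eth0: ***
    inet 172.16.0.110/24 brd 172.16.0.255 scope global eth0
"""

if __name__ == "__main__":
    # 172.16.0.200 is what curl returned for local-ipv4 in this post.
    print(metadata_matches_interface("172.16.0.200", SAMPLE_IP_ADDR))
```

If the function returns False, the metadata answer most likely belongs to another machine.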

THE INSTANCE CONFIGURATION:

$ ip r

default via 172.16.0.200 dev eth0

172.16.0.0/24 dev eth0 proto kernel scope link src 172.16.0.110

172.16.0.0/16 via 172.16.0.1 dev eth0

$ ip a

eth0: ***

inet 172.16.0.110/24 brd 172.16.0.255 scope global eth0
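
With this routing table, the only route that matches 169.254.169.254 is the default one, so metadata requests leave via 172.16.0.200. The Python sketch below does that lookup on the text of `ip r`. It is a simplification: it only does a longest-prefix match and ignores metrics and policy routing, and the function names are hypothetical.

```python
import ipaddress

METADATA_IP = ipaddress.ip_address("169.254.169.254")


def parse_routes(ip_route_output):
    """Parse `ip route` lines into (network, gateway_or_None) tuples."""
    routes = []
    for line in ip_route_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        dest = "0.0.0.0/0" if fields[0] == "default" else fields[0]
        try:
            network = ipaddress.ip_network(dest, strict=False)
        except ValueError:
            continue  # skip line types this sketch does not understand
        gateway = fields[fields.index("via") + 1] if "via" in fields else None
        routes.append((network, gateway))
    return routes


def route_for(ip_route_output, target=METADATA_IP):
    """Return the most specific (network, gateway) matching target, or None."""
    candidates = [r for r in parse_routes(ip_route_output) if target in r[0]]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r[0].prefixlen)


# Routing table quoted in this post.
SAMPLE_ROUTES = """\
default via 172.16.0.200 dev eth0
172.16.0.0/24 dev eth0 proto kernel scope link src 172.16.0.110
172.16.0.0/16 via 172.16.0.1 dev eth0
"""

if __name__ == "__main__":
    network, gateway = route_for(SAMPLE_ROUTES)
    print("metadata requests use route %s via %s" % (network, gateway))
```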

If You keep the remote files, You can check it Yourself:

export ANSIBLE_KEEP_REMOTE_FILES=1

and then

python /home/ubuntu/.ansible/tmp/ansible-tmp-1436872330.49-72199016469620/ec2_facts

will return as one of the facts:

"ansible_ec2_local_ipv4": "172.16.0.200",

(or run a curl)

curl http://169.254.169.254/latest/meta-data/local-ipv4

THE CURRENT WORKAROUND:

  1. Do NOT use (in roles or tasks)

    • action: ec2_facts

    DRAWBACKS:

    • You will not have some variables available (ansible_ec2_* will be unavailable)
    • You will have only the ec2_* facts from Your LOCAL inventory cache (ec2.py, if I’m correct)
    • If You add “gather_facts: True” to the playbook, You can also use the ansible_* facts gathered by the setup module
    • So instead of ansible_ec2_local_ipv4 You can use ansible_eth0['ipv4']['address']
    • BUT this can cause problems when You have a role that expects some variable (example: ansible_hostname) while the playbook has system fact gathering disabled (“gather_facts: False”). You will have to be careful
    • OR when You would like to access some AWS variable independently of Your LOCAL cache

  2. Configure Your VPC routing tables so that they point to the NAT instance interface, rather than to an IP address

    • 0.0.0.0/0 eni-xxx / i-xxx

    instead of:

    • 0.0.0.0/0 igw-zzzzz + system routing tables

    • Then You do not have to override the routing table on the system level
    • You rely on the AWS router

    DRAWBACKS:

    • You will have to change the routing table in the VPC to point to another physical interface when Your NAT instance shuts down
    • vs. if You keep the system routing table, You will launch a new NAT instance with the “old IP address” attached

QUESTIONS / CONCLUSION:

  • Be aware of the ec2_facts limitation
  • If possible, rely on the Amazon routing table
  • How do You prevent a SPOF in Your VPC subnets?
  • What is Your best practice for configuring VPC subnets (private and public) so that they have outbound internet access (for github, apt) and are still safe, without the SPOF that is the NAT instance?

I’m using Ansible with AWS VPCs, most of which have public and private subnets, and I have never had the problem you are seeing. This is definitely a misconfiguration on your side and has nothing to do with Ansible. ec2_facts is doing the right thing: there is no other way of collecting this data except querying the meta-data repository, which is what the AWS CLI tools do anyway. That means you will get wrong data using the AWS CLI as well. Don’t forget you are in the cloud, and your networking is configured at the hypervisor/SDN level and NOT at the instance level. You can create as many network interfaces as you want at the instance level and set IPs on them, but none of them will work, since you have bypassed the SDN and there is no record of them in the meta-data repository. In the end, facts collected locally on the instance mean nothing if they don’t match what is in the meta-data repository.

Now that we have that cleared up, let’s move on to your problem, which looks to me like the AWS routing tables, or more specifically the lack of them. For an instance to be in a private subnet, the subnet needs a routing table separate from the VPC’s default one (which has the IGW created for you when the VPC was created), one that has the NAT instance in place of the IGW (internet gateway). That is all you need; you don’t have to set any routing tables on the system level, the SDN will route the traffic for you.

Hope this makes sense. Since you haven’t provided any info about your subnets, routing tables, ACLs etc., this is more of a guess at what’s going on, so please correct my assumptions if needed.

Thanks,
Igor

Have to correct myself, you do provide the subnet information. So in answer to your questions/conclusions, the way I do it is:

  • Use private routing table for the private subnets pointing to the NAT as IGW
  • Use 2 x NAT instances and a NAT takeover script that modifies the private subnets’ routing table and points the IGW to itself in case the other NAT instance has failed
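
The core step of such a takeover script could look roughly like the Python sketch below. It is not the script I actually run, just an illustration. It assumes `ec2_client` is a boto3 EC2 client (for example `boto3.client("ec2")`), whose `replace_route` call rewrites an existing route. The function names, the `peer_is_healthy` callback, the `RecordingClient` stand-in and the IDs are all hypothetical.

```python
def take_over_default_route(ec2_client, route_table_id, my_instance_id):
    """Point the private route table's default route at this NAT instance."""
    return ec2_client.replace_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        InstanceId=my_instance_id,
    )


def failover_if_peer_down(ec2_client, peer_is_healthy, route_table_id, my_instance_id):
    """Take over the route only when the other NAT instance fails its health check.

    peer_is_healthy is a caller-supplied function (for example a ping test)
    that returns True while the other NAT instance is still working.
    """
    if peer_is_healthy():
        return False
    take_over_default_route(ec2_client, route_table_id, my_instance_id)
    return True


class RecordingClient(object):
    """Dry-run stand-in for the EC2 client: records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def replace_route(self, **kwargs):
        self.calls.append(kwargs)
        return {}


if __name__ == "__main__":
    client = RecordingClient()
    # Placeholder IDs; a real script would use its own route table and instance.
    failover_if_peer_down(client, lambda: False, "rtb-example", "i-example")
    print(client.calls)
```

Each NAT instance would run this periodically against the other one. The route table ID and instance ID have to come from your own setup.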

Thanks Igor.

You are right, it is not an Ansible “bug” but a configuration feature, though a “bad one”, since it silently provides false data. I had to dig into the source code to track it down.
There could be a warning in ec2_facts when it detects such a default route, but that would take some work.
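
Such a warning could start from a simple heuristic. AWS reserves the subnet's network address + 1 for the VPC router, so a default gateway at any other address suggests the system routing table has been overridden. The Python sketch below only illustrates that idea. The function name is hypothetical, nothing like it exists in ec2_facts, and it would give a false alarm on hosts that are not in a VPC subnet.

```python
import ipaddress


def default_route_warning(local_cidr, default_gateway):
    """Return a warning string if the default gateway is not the VPC router.

    local_cidr is the interface address with prefix, e.g. "172.16.0.110/24".
    Assumption: the VPC router sits at the subnet's network address + 1.
    Returns None when the gateway looks like the VPC router.
    """
    subnet = ipaddress.ip_interface(local_cidr).network
    vpc_router = subnet.network_address + 1
    if ipaddress.ip_address(default_gateway) == vpc_router:
        return None
    return (
        "default route points at %s, not the VPC router %s; "
        "EC2 metadata may describe another instance" % (default_gateway, vpc_router)
    )


if __name__ == "__main__":
    # Values from the instance configuration quoted in the first post.
    print(default_route_warning("172.16.0.110/24", "172.16.0.200"))
```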