@mmmarfy wrote:
An elastic IP is quite the opposite of what you're describing; it's used to create a static IP that doesn't change. An IP changing wouldn't cause this though, it's definitely a networking issue on EA's side. Notice that you aren't immediately getting blocked when you attempt to connect, like a firewall would do. It just sits and spins until a timeout is reached. Traceroute shows it is making it to AWS just fine; it's dropped right after that, when it's passed into EA's VPC.
In short, it looks like everyone's requests are being sent into the abyss on EA's side due to some sort of egregious mistake made in their routing. Mind-blowing that a production outage has gone unresolved for 2 weeks. I would have been fired if I'd been unable to resolve this after one day.
I also don't appreciate the passive-aggressive replies with emojis from EA. You have no solution because you don't even understand how the system *you designed* works. It's incredibly bad PR to talk down to someone in this situation, especially when you then point to a solution that everyone has confirmed DOES NOT WORK.
Nothing is resolved; it has only gotten worse and affected more people *that paid money to access this* as time goes on.
You're right that it isn't firewall or NACL behavior, but I don't believe it's a routing issue either. There is still two-way traffic to these IPs; there are just a ton of TCP errors, and only at a certain point in the day (if it's 8:00 PM CST sharp, that confirms my idea that "something" on a schedule is causing this). Something failing over due to routine maintenance is the only thing I can think of beyond some kind of time-of-day QoS setting.
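If anyone wants to check the "something on a schedule" idea themselves, here's a rough sketch of a probe that logs timestamped TCP connect attempts so you can see exactly when things start failing. It's only a sketch under some assumptions: `probe`, `log_line`, and `run` are names I made up, and the host and port you pass in are up to you (I don't know the actual EA endpoints). It only sees connect-level outcomes (connected, timeout, refused), not retransmits or resets mid-stream; you'd need a packet capture for those.

```python
import socket
import time
from datetime import datetime, timezone


def probe(host, port, timeout=5.0):
    """Attempt one TCP connect and classify the outcome.

    "timeout" means no answer at all (packets silently dropped somewhere),
    "refused" means something actively rejected the connection.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except socket.timeout:
        outcome = "timeout"
    except ConnectionRefusedError:
        outcome = "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, resets, and so on.
        outcome = "error:" + type(exc).__name__
    return outcome, time.monotonic() - start


def log_line(host, port, timeout=5.0):
    """Return one UTC-timestamped log line for a single probe."""
    outcome, elapsed = probe(host, port, timeout)
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return f"{stamp} {host}:{port} {outcome} {elapsed:.2f}s"


def run(host, port, count, interval=60.0):
    """Probe `count` times, `interval` seconds apart, printing each result."""
    for i in range(count):
        print(log_line(host, port), flush=True)
        if i < count - 1:
            time.sleep(interval)
```

Something like `run("example.com", 443, count=240, interval=60)` (placeholder host) started a couple of hours before the suspected cutoff would show whether the failures begin at the same minute every day.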
The fact that it's only AT&T-based routes is what makes it extra perplexing. The more I think about it, the less sense I can make of it. All of the internet circuits that AWS has would carry the full routing table, so it's not like one particular router has AT&T routes and another has T-Mobile routes. Not every AT&T customer is going to be sourced from the same block of public IP addresses either, so the idea that it's a VPC routing issue seems highly unlikely: for this type of traffic it's all just going to follow the default route of 0.0.0.0/0. Regardless, two-way traffic = not a routing issue. When I start going down this path, it makes me believe it's actually just an AT&T issue. This could also explain the time it's taken to get it resolved. EA would not have any particular business contract with AT&T in this regard, and neither would AWS.
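To illustrate the default route point, here's a small sketch of longest-prefix matching against a made-up VPC route table. The table itself is an assumption (a 10.0.0.0/16 local route plus a default route to an internet gateway is a common layout, but I have no idea what EA's actually looks like), `select_route` is a name I invented, and the client addresses are documentation-range placeholders, not real AT&T or T-Mobile IPs.

```python
import ipaddress

# Hypothetical VPC route table: CIDR -> target. Not EA's real configuration.
ROUTES = {
    "10.0.0.0/16": "local",
    "0.0.0.0/0": "internet-gateway",
}


def select_route(dest, routes=ROUTES):
    """Return (cidr, target) for the most specific route containing dest."""
    addr = ipaddress.ip_address(dest)
    matches = [
        ipaddress.ip_network(cidr)
        for cidr in routes
        if addr in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    # Longest prefix wins, as in ordinary IP routing.
    best = max(matches, key=lambda net: net.prefixlen)
    return str(best), routes[str(best)]


# Two placeholder public client addresses standing in for different ISPs.
for client in ("203.0.113.10", "198.51.100.7"):
    print(client, "->", select_route(client))
```

Both placeholder clients fall through to the same 0.0.0.0/0 entry, which is the point: at this level, return traffic to any public address takes the same route no matter which ISP owns it.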
At the end of the day, none of us have administrative access to or knowledge of their infrastructure or architecture, so there's a lot of speculation beyond what we've been able to identify as affected clients. I agree a production outage like this would also have me out of a job. I don't think EA is being fully transparent about the issue.