Taking another look at this, and as far as I can tell the issue is still roughly the same: tons of TCP retransmissions and TLS issues (likely from TCP segment loss) when connecting to pub-bf-prod-grpc.ops.dice.se. (The application actually looks up kingston-prod-cgw.ops.dice.se, which has a CNAME record pointing to pub-bf-prod-grpc.ops.dice.se.)
It seems like EA did make some changes here.
Old IPs:
108.128.42.217
34.248.44.177
52.208.207.227
Current IPs:
108.128.42.217
99.80.71.123
54.74.237.202
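To make the change between the two lists easier to see, here is a quick sketch that diffs them. The IPs are copied from the lists above; the variable names are just mine.

```python
# IPs copied from the "Old IPs" and "Current IPs" lists above.
old_ips = {"108.128.42.217", "34.248.44.177", "52.208.207.227"}
current_ips = {"108.128.42.217", "99.80.71.123", "54.74.237.202"}

unchanged = sorted(old_ips & current_ips)  # present in both lists
removed = sorted(old_ips - current_ips)    # only in the old list
added = sorted(current_ips - old_ips)      # only in the current list

print("unchanged:", unchanged)
print("removed:  ", removed)
print("added:    ", added)
```

Only 108.128.42.217 appears in both lists; the other two addresses were swapped out, including 52.208.207.227.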
Before the "fix", only traffic destined to 52.208.207.227 was problematic. Now all 3 current IPs are problematic and show the same errors in the PCAP. Not sure how it was temporarily working, but the issue still seems scoped to that service.
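For anyone who wants to check the per-IP scoping in their own capture, a rough sketch of the tally I have in mind is below. It assumes you have already exported one record per TCP packet as (destination IP, retransmission flag), for example by filtering on Wireshark's tcp.analysis.retransmission. The function name and the sample records are made up for illustration and are not from the real PCAP.

```python
from collections import Counter


def retransmission_rate_by_destination(packets):
    """Return {dst_ip: fraction of packets flagged as retransmissions}.

    `packets` is an iterable of (dst_ip, is_retransmission) tuples.
    This record shape is an assumption about how the capture was exported.
    """
    totals = Counter()
    retrans = Counter()
    for dst_ip, is_retransmission in packets:
        totals[dst_ip] += 1
        if is_retransmission:
            retrans[dst_ip] += 1
    return {ip: retrans[ip] / totals[ip] for ip in totals}


# Made-up sample records, NOT taken from the real capture.
sample = [
    ("108.128.42.217", False),
    ("108.128.42.217", True),
    ("99.80.71.123", True),
    ("99.80.71.123", True),
    ("203.0.113.10", False),  # documentation-range IP standing in for unrelated traffic
]

rates = retransmission_rate_by_destination(sample)
for ip, rate in sorted(rates.items()):
    print(f"{ip}: {rate:.0%} retransmissions")
```

If the problem really is scoped to this service, the three current IPs should stand out clearly against every other destination in the same capture.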
Very bizarre that it seems to be time-of-day related as well. Something is on a schedule. I'm not familiar with service provider networks, but I can't think of much that would run on a schedule and cause this. QoS/traffic shaping based on time of day? Policy-based forwarding/routing somewhere? For this to affect so many users without being a core routing problem (which would have to be fixed by now, no way it would take this long), the traffic must be taking a common path through a problematic device somewhere. That makes it seem more like it's on the AWS side (EA config or infrastructure, who knows).
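One way to test the schedule theory would be to bucket retransmission timestamps by hour across several days of captures and see whether the same hours light up. A minimal sketch, assuming the timestamps are Unix epoch seconds and bucketing in UTC; the function name and sample values are hypothetical, not real data.

```python
from collections import Counter
from datetime import datetime, timezone


def retransmissions_per_hour(epoch_seconds):
    """Count retransmission events per UTC hour of day (0-23)."""
    hours = Counter()
    for ts in epoch_seconds:
        hours[datetime.fromtimestamp(ts, tz=timezone.utc).hour] += 1
    return dict(sorted(hours.items()))


# Made-up timestamps, NOT from the real capture.
sample_timestamps = [
    0,               # 00:00:00 UTC
    60,              # 00:01:00 UTC
    3600,            # 01:00:00 UTC
    20 * 3600 + 5,   # 20:00:05 UTC
    20 * 3600 + 10,  # 20:00:10 UTC
    20 * 3600 + 15,  # 20:00:15 UTC
]

per_hour = retransmissions_per_hour(sample_timestamps)
print(per_hour)
```

A consistent spike in the same hours on different days would support something scheduled; a spread that just follows player count would point more toward plain congestion.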
Would love to hear the engineer's current thoughts on this. It seems almost unbelievable that it hasn't been resolved yet, so either there's some incompetence at play or it's a very interesting issue.