AT&T network - "Unable to load persistence data" - Season 5

2 years ago

Taking another look at this, and as far as I can tell the issue is still roughly the same: tons of TCP retransmissions and TLS errors (likely from TCP segment loss) when connecting to pub-bf-prod-grpc.ops.dice.se. (The application actually looks up kingston-prod-cgw.ops.dice.se, which has a CNAME record that points to pub-bf-prod-grpc.ops.dice.se.)

It seems like EA did make some changes here.

Old IPs:

108.128.42.217
34.248.44.177
52.208.207.227

Current IPs:

108.128.42.217
99.80.71.123
54.74.237.202

Before the "fix", only traffic destined for 52.208.207.227 was problematic. Now all 3 current IPs are problematic and show the same errors in the PCAP. I'm not sure how it was temporarily working, but the issue still seems scoped to that service.
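For anyone who wants to track the record changes over time, here is a minimal sketch that diffs the two address lists above. The `resolve_ipv4` helper is only an illustration of how a fresh list could be fetched (it needs network access and is not called here); the function names are my own, not anything from EA's tooling.

```python
import ipaddress
import socket

# Addresses copied from the two lists above.
OLD_IPS = ["108.128.42.217", "34.248.44.177", "52.208.207.227"]
CURRENT_IPS = ["108.128.42.217", "99.80.71.123", "54.74.237.202"]


def diff_ip_sets(old, current):
    """Return (kept, removed, added) as numerically sorted address strings."""
    old_set = {ipaddress.ip_address(ip) for ip in old}
    cur_set = {ipaddress.ip_address(ip) for ip in current}

    def fmt(addresses):
        return [str(ip) for ip in sorted(addresses)]

    return fmt(old_set & cur_set), fmt(old_set - cur_set), fmt(cur_set - old_set)


def resolve_ipv4(hostname):
    """Hypothetical helper: look up the current IPv4 addresses for a hostname.

    Requires network access, so it is defined but not called in this sketch,
    e.g. resolve_ipv4("kingston-prod-cgw.ops.dice.se").
    """
    _canonical, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return addresses


kept, removed, added = diff_ip_sets(OLD_IPS, CURRENT_IPS)
print("kept:   ", kept)
print("removed:", removed)
print("added:  ", added)
```

With the lists above, only 108.128.42.217 is kept, and the previously problematic 52.208.207.227 is one of the two addresses that were removed.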

Very bizarre that it seems time-of-day related as well. Something is on a schedule. I'm not familiar with service provider networks, but I can't think of much that would run on a schedule and cause this. QoS/traffic shaping based on time of day? Policy-based forwarding/routing somewhere? To affect this many users, if it's not a core routing problem (which would have been fixed by now; no way it would take this long), the traffic must be taking a common path through a problematic device somewhere. That makes it seem more like it's on the AWS side (EA config or infrastructure, who knows).

Would love to hear the engineer's current thoughts on this. It seems almost unbelievable that it hasn't been resolved yet, so either there's some incompetence at play or it's a very interesting issue.

mmmarfy
2 years ago
An elastic IP is quite the opposite of what you're describing; it's used to create a static IP that doesn't change. An IP changing wouldn't cause this though; it's definitely a networking issue on EA's side. Notice that you aren't immediately getting blocked when you attempt to connect, like a firewall would do. It just sits and spins until a timeout is reached. Traceroute shows it is making it to AWS just fine; it's dropped right after that, when it's passed into EA's VPC.
In short, it looks like everyone's requests are being sent into the abyss on EA's side due to some sort of egregious mistake made in their routing. Mind-blowing that a production outage has gone unresolved for 2 weeks. I would have been fired for being unable to resolve this after one day.
I also don't appreciate the passive-aggressive replies with emojis from EA. You have no solution because you don't even understand how the system *you designed* works. It's incredibly bad PR to talk down to someone in this situation, especially when you then point to a solution that everyone has confirmed DOES NOT WORK.
Nothing is resolved; it has only gotten worse and affected more people *that paid money to access this* as time goes on.
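The difference between an immediate block and a connection that "sits and spins" can be checked with a small probe. This is a rough sketch, not a diagnosis tool: it only looks at the TCP handshake, so a "connected" result does not rule out trouble later in the session (for example during the TLS exchange). The function name, the default timeout, and the port in the usage comment are assumptions; the thread does not say which port the game uses.

```python
import socket


def classify_connect(host, port, timeout=5.0):
    """Try a TCP connection and report how the attempt ended.

    "refused"   -> the far end (or a device in the path) actively rejected it
    "timeout"   -> nothing answered before the timeout expired
    "connected" -> the TCP handshake completed
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors land here.
        return f"error: {exc}"


# Example call (port 443 is an assumption, not confirmed anywhere in the thread):
# print(classify_connect("pub-bf-prod-grpc.ops.dice.se", 443))
```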
5ubj3ctz3r0
2 years ago
@mmmarfy wrote:
An elastic IP is quite the opposite of what you're describing; it's used to create a static IP that doesn't change. An IP changing wouldn't cause this though; it's definitely a networking issue on EA's side. Notice that you aren't immediately getting blocked when you attempt to connect, like a firewall would do. It just sits and spins until a timeout is reached. Traceroute shows it is making it to AWS just fine; it's dropped right after that, when it's passed into EA's VPC.
In short, it looks like everyone's requests are being sent into the abyss on EA's side due to some sort of egregious mistake made in their routing. Mind-blowing that a production outage has gone unresolved for 2 weeks. I would have been fired for being unable to resolve this after one day.
I also don't appreciate the passive-aggressive replies with emojis from EA. You have no solution because you don't even understand how the system *you designed* works. It's incredibly bad PR to talk down to someone in this situation, especially when you then point to a solution that everyone has confirmed DOES NOT WORK.
Nothing is resolved; it has only gotten worse and affected more people *that paid money to access this* as time goes on.
You're right that it's not firewall or NACL behavior, but I don't believe it's a routing issue either. There is still two-way traffic to these IPs; there are just a ton of TCP errors, and only at a certain point in the day (if it starts at 8:00 PM CST sharp, that confirms my idea that "something" on a schedule is causing this). Something failing over due to routine maintenance is the only other thing I can think of beyond some kind of time-of-day QoS setting.
The fact that it's only AT&T-based routes is what makes it extra perplexing. The more I think about it, the less sense I can make of it. All of the internet circuits that AWS has would carry the full routing table, so it's not like one router has the AT&T routes and another has the T-Mobile routes. Not every AT&T customer is going to be sourced from the same block of public IP addresses either, so the idea that it's a VPC routing issue seems highly unlikely; for this type of traffic it's all just going to follow the default route of 0.0.0.0/0. Regardless, two-way traffic = not a routing issue. When I start going down this path, it makes me believe it's actually just an AT&T issue. That could also explain the time it's taken to get it resolved: EA would not have any particular business contract with AT&T in this regard, and neither would AWS.
At the end of the day, none of us have administrative access to or knowledge of their infrastructure or architecture, so there's a lot of speculation beyond what we've been able to identify as affected clients. I agree a production outage like this would also have me out of a job. I don't think EA is being fully transparent about the issue.
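One way to test the "something is on a schedule" idea against a capture is to bucket packets by hour and compare retransmission rates. The sketch below assumes you have already exported (timestamp, is-retransmission) pairs from a PCAP, for instance by flagging packets that match Wireshark's `tcp.analysis.retransmission` filter. The function name is my own and the sample data is made up purely for illustration; it is not from any real capture.

```python
from collections import Counter
from datetime import datetime


def retransmission_rate_by_hour(packets):
    """packets: iterable of (datetime, is_retransmission) pairs.

    Returns {hour: retransmissions / total packets seen in that hour}.
    """
    totals, retrans = Counter(), Counter()
    for timestamp, is_retransmission in packets:
        totals[timestamp.hour] += 1
        if is_retransmission:
            retrans[timestamp.hour] += 1
    return {hour: retrans[hour] / totals[hour] for hour in sorted(totals)}


# Made-up sample data (local time), for illustration only.
sample = [
    (datetime(2023, 1, 1, 19, 30), False),
    (datetime(2023, 1, 1, 19, 45), False),
    (datetime(2023, 1, 1, 20, 0), True),
    (datetime(2023, 1, 1, 20, 5), True),
    (datetime(2023, 1, 1, 20, 10), False),
    (datetime(2023, 1, 1, 20, 15), True),
]

rates = retransmission_rate_by_hour(sample)
for hour, rate in rates.items():
    print(f"{hour:02d}:00  {rate:.0%} retransmitted")
```

If the rate jumps at the same hour across several days of captures, that would support the schedule theory; a flat rate would point elsewhere.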
mmmarfy
2 years ago
@5ubj3ctz3r0 I have seen a lot of people who aren't on AT&T say they are having the same issue, even people outside the US who don't even have AT&T as an option. Hell, my boyfriend and I can be on the exact same internet connection and it will work for him on PC while it doesn't work for me on PS5. AT&T definitely sucks at fixing things too, but given that only Battlefield 2042 is having this issue, not other Battlefield games and not other AWS customers (including myself), I would be shocked if it ended up being AT&T.

Maybe there is increased traffic at the times when the issue happens more consistently? You can run into dropped packets with an NLB if you don't have a sufficient number of targets in the backend. Just throwing whatever out there at this point in the hopes of being able to play again.

It's such a weird issue that I hope they share the resolution, so I can avoid whatever caused this in the future lol
merman42069
2 years ago
Not only does this happen with BF 2042 but also with Battlefront II. I am sure there are other games it affects as well. I am, however, able to play BF4.
