Non-Technical Explanation
The server has reached an edge case where too many zones are being cycled and the world server goes into a loop trying to determine available zones.
A root cause has been found and a fix will be applied.
Technical Explanation
Symptom
World goes into an infinite loop and becomes unresponsive: character select doesn't work, zoning doesn't work, and the server appears down on the Loginserver list.
Investigation
I finally had some time to dig deeper into this issue after the work week and some very long days. After running a trace on the process, I quickly found the problem:
100.00% 0.00% 0 world world [.] operator() (inlined)
|
---operator() (inlined)
?? (inlined)
EQ::Net::TCPConnection::Start()::{lambda(uv_stream_s*, long, uv_buf_t const*)#2}::_FUN
EQ::Net::ServertalkServerConnection::ProcessReadBuffer
EQ::Net::ServertalkServerConnection::ProcessMessage
EQ::Net::ServertalkServerConnection::ProcessMessage
ZoneServer::HandleMessage
ZSList::GetAvailableZonePort
|
|--76.67%--?? (inlined)
| ZSList::GetAvailableZonePort
|
--18.99%--ZSList::GetAvailableZonePort
World was stuck in a loop in this code:
uint16 ZSList::GetAvailableZonePort()
{
    const WorldConfig *Config = WorldConfig::get();
    int i;
    uint16 port = 0;

    if (LastAllocatedPort == 0) {
        i = Config->ZonePortLow;
    }
    else {
        i = LastAllocatedPort + 1;
    }

    while (i != LastAllocatedPort && port == 0) {
        if (i > Config->ZonePortHigh) {
            i = Config->ZonePortLow;
        }

        if (!FindByPort(i)) {
            port = i;
            break;
        }

        i++;
    }

    LastAllocatedPort = port;

    return port;
}
Because we have ports 7000 -> 7100 allocated, we normally cycle safely through the same ports all day, every day.
However, recently, due to a lot more zone activity, we have run into scenarios where we've had 100 simultaneous zone processes up at once from people spawning way too many instances.
When world can't find a free port, the loop can run forever. A search that fails leaves port at 0, so LastAllocatedPort is reset to 0. The next call then starts at ZonePortLow, and because i wraps back to ZonePortLow whenever it passes ZonePortHigh, it never equals 0 and the while condition never becomes false while every port is taken. This is faulty logic, but it can easily be mitigated by expanding our available port range pool.
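Independently of the port range, the search itself could be bounded so that it always terminates. The following is only a minimal sketch of that idea, not the actual fix applied here. The free function FindFreePortBounded, its parameters, and the in_use callback (standing in for FindByPort) are hypothetical names chosen for illustration and are not part of the existing code.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical stand-alone port search (illustrative names, not from the
// world server source). Returns a free port in [low, high], or 0 if every
// port in the range is in use.
uint16_t FindFreePortBounded(
    uint16_t low,
    uint16_t high,
    uint16_t last_allocated,
    const std::function<bool(uint16_t)> &in_use)
{
    // 0 is used as the "no port" sentinel, so it cannot be part of the range.
    if (low == 0 || high < low) {
        return 0;
    }

    const uint32_t range = static_cast<uint32_t>(high) - low + 1;

    // Start just after the last allocated port. Fall back to `low` when there
    // is no previous allocation, when it was the top of the range, or when it
    // lies outside the range.
    uint32_t candidate = (last_allocated >= low && last_allocated < high)
        ? last_allocated + 1u
        : low;

    // Visit each port at most once, so the loop always terminates.
    for (uint32_t tried = 0; tried < range; ++tried) {
        if (!in_use(static_cast<uint16_t>(candidate))) {
            return static_cast<uint16_t>(candidate);
        }
        candidate = (candidate == high) ? low : candidate + 1;
    }

    return 0; // every port in [low, high] is taken
}
```

With this shape, a full range makes the call return 0 after one pass instead of spinning, and the caller can decide how to handle the failure (for example, by refusing to boot another zone).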
Fix
I will be expanding our available zone port range to 200 ports, and we should see this issue go away.