Researching lag

Topic Development / Developer's Corner / Researching lag

By PeterW

Date 2013-10-23 22:55 Edited 2013-10-23 23:00

Sparked by some discussion and ala constantly bugging me (thanks!) I started investigating some lag issues with CR. Might be interesting to OC as well - also this is where the developers are, so why not.

Here's an illustration of a particularly laggy spot:
http://www.personal.leeds.ac.uk/~scpmw/timings_1.svg

Shown: scaba (id 1), ala (id 7), jok (id 6) and Mave (id 8)
Not shown: Sven2 (id 0) who either used the wrong engine or sent in the wrong log
Network: We have direct connections between all clients, jok and Mave actually being in the same (W)LAN

Reading guide:
in_[cid]_[tick] = Client [cid] packed its control for tick [tick] and broadcasted it to other clients
ct_[cid]_[tick]_[cid2] = Client [cid2] received the control from client [cid] for tick [tick]
(wait [x]ms) = The client has enough control to continue game simulation, and had to delay the game for the given amount.
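
To make the naming scheme concrete, here is a minimal Python sketch that parses the two event kinds from the reading guide and derives packet travel times from them. The function names, the tuple layout and the assumption that all timestamps share one clock are mine for illustration; none of this is taken from the engine or from the actual log format.

```python
import re

# Event names as described in the reading guide above.
IN_RE = re.compile(r"^in_(\d+)_(\d+)$")        # in_[cid]_[tick]
CT_RE = re.compile(r"^ct_(\d+)_(\d+)_(\d+)$")  # ct_[cid]_[tick]_[cid2]


def parse_event(name):
    """Return ("in", sender, tick) or ("ct", sender, tick, receiver)."""
    m = IN_RE.match(name)
    if m:
        return ("in", int(m.group(1)), int(m.group(2)))
    m = CT_RE.match(name)
    if m:
        return ("ct", int(m.group(1)), int(m.group(2)), int(m.group(3)))
    raise ValueError("unknown event name: " + name)


def travel_times(events):
    """events: iterable of (timestamp_ms, name) on a common clock (assumed).

    Returns {(sender, tick, receiver): travel time in ms}.
    """
    sent = {}
    result = {}
    for ts, name in sorted(events):
        ev = parse_event(name)
        if ev[0] == "in":
            sent[(ev[1], ev[2])] = ts
        elif (ev[1], ev[2]) in sent:
            result[(ev[1], ev[2], ev[3])] = ts - sent[(ev[1], ev[2])]
    return result
```

Feeding it a hypothetical pair such as `(600, "in_6_1011")` and `(620, "ct_6_1011_8")` gives a travel time of 20 ms for that packet, which corresponds to the length of one arrow in the graph.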

Dramatic narration:

0ms: Everything looking extremely smooth. Packets only take a few dozen milliseconds to reach their destination. Especially packets between jok and Mave arrive almost instantaneously - no surprise given their proximity. We get some low delays due to jitter, but nothing serious (even 30ms wait per 100ms means we have only slowed down the game by 30%).

600ms: Out of nowhere we start seeing longer arrows. The "in_6_1011" broadcast takes significantly longer, now taking almost 20ms to reach Mave (before <2 ms!) and about 60 ms to reach scaba and ala (used to be about 15ms). But worse, packets sent *to* jok start taking even longer to reach their destination, with the worst being scaba's control packet in_1_2011 that gets delayed almost 200ms. Surprisingly, we get a delay even on Mave's packet, which arrives really late given that they're supposed to be sitting right next to each other.

800ms: This takes the network by surprise; PreSends are low, so we get a nice significant 160ms wait time on jok. Note that in_8_1012, in_7_1011 and in_7_1012 come in at the same time. If packets by one client are clustered, this generally points towards packets that were lost and recovered. However, this wouldn't explain the packets of different clients clustering up. Might be a coincidence.

850ms: Finally the in_6_1012 packet everybody has been waiting for gets generated, the delay spreads to all other clients (wait times of 100-130ms).

900ms: Even after getting to tick 1012, jok is still waiting, because the last packet he got from Sven was at about 610ms. The missing packet is in_0_1012, which all other clients received at around 690ms. It arrives at 890ms, probably traveling for 200ms+.

Back to 880ms: Now comes the big one. Note the almost vertical arrow coming from Mave's in_8_1013 packet? It is going to arrive for jok at 2550ms, which is about 1800ms later. Now note that in the meantime, other clients are much more successful at sending packets to jok: at about 980ms, the situation looks almost normal, apart from the missing packet by Mave. Note that some additional packets by Sven2 come in, which might be because the lag just increased his PreSend faster than for the other clients.

2550ms: Packets arrive, and all of a sudden everything's back to normal. Packets arrive, the engine starts asking itself why it set the PreSends so high, until...

3700ms: Yeah, long arrows again. It's ala and - again - Mave. Out of nowhere, and we get a good nice 1800ms lag out of it. Note again the pattern: All outgoing as well as incoming packets for jok are delayed, but some extremely so (probably packet losses). This time the game takes some time before it grinds to a halt due to higher PreSends.

13230ms: If you're looking for spectacle: The message sent by ala is the longest delay we have in the whole log. 3000ms. Note that due to the extremely high PreSends, we actually manage to squeeze in two ticks in the time the packet needs to reach its destination.

So what do we learn from this? :)

By Sven2

Date 2013-10-24 09:38

> Not shown: Sven2 (id 0) who either used the wrong engine or sent in the wrong log

Are you sure it's not a later game in the log? Also, isn't there a copy of the log in the replay?

By PeterW

Date 2013-10-24 11:06

Well, the engine can't well change in the middle of a log, can it? It very much looks like the log is from different rounds, btw.

And no, I can't see anything usable in a recdump. Was that an OC change?

By Sven2

Date 2013-10-24 13:14

Hm, pretty sure we added in CR that the Clonk.log is copied into the record group. Maybe only for debugrecs then.

Well, maybe I had another engine running so it was stored as Clonk2.log.

By Zapper

Date 2013-10-24 17:38

>So what do we learn from this? :)

WLAN sucks?

By PeterW

Date 2013-10-24 17:41

Correct! Zapper wins a hundred Internet brownie points for citing my favourite cop-out :)

By Pyrit

Date 2013-10-24 19:26

Why doesn't the game run slow only for the dudes with a shitty WLAN, and normal for the other players?
And why does the whole game slow down when there's a connection problem? Normally you wouldn't expect the internet connection to have an impact on the framerate of a game.

By Caesar

Date 2013-10-24 19:50 Edited 2013-10-24 20:02

You have been given a basic introduction about Clonk's network architecture?
Speaking abstractly, it makes use of determinism. If you give a function the same input, it returns the same results (i.e. if you press the same keys, the clonks run in the same direction). That way, it is sufficient to transmit the initial parameters (a random seed or a savegame) and the continuous input the players make. While that makes for a relatively simple and low-bandwidth architecture, a single mistimed input can make the game go out of sync (remember the butterfly and the hurricane? Chaos theory.). That's why inputs have to be processed at the same time on all machines, and in a decentralized network, you have to receive information about the input (was there any? which?) at a high frequency. If an input is delayed, the frame calculation has to be delayed. [1]
The other option is to register the input with the server and let the server decide, when the input is going to be processed (I think that is what asynchronous mode did), but that will double the control latency.
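
The lockstep idea described above can be sketched in a few lines of Python. This is a toy model under my own assumptions, not Clonk's implementation: the class name, the integer "state" and the mixing formula are all made up, and only the rule "no tick without control from every client" reflects what the post describes.

```python
class LockstepSim:
    """Advance a tick only once control from every client has arrived."""

    def __init__(self, client_ids, seed=0):
        self.clients = set(client_ids)
        self.tick = 0
        self.state = seed      # stand-in for the whole game state
        self.pending = {}      # tick -> {client_id: control}

    def receive(self, client_id, tick, control):
        self.pending.setdefault(tick, {})[client_id] = control

    def try_advance(self):
        """Return True if the tick was simulated, False if we must wait."""
        controls = self.pending.get(self.tick, {})
        if set(controls) != self.clients:
            return False       # some control is missing: the frame is delayed
        # Deterministic update: same state + same controls -> same result,
        # independent of the order in which the packets arrived.
        for cid in sorted(controls):
            self.state = (self.state * 31 + cid * 7 + controls[cid]) % 1_000_003
        self.pending.pop(self.tick, None)
        self.tick += 1
        return True
```

Two instances that receive the same controls in a different arrival order end up with an identical `state`, and a single missing control keeps `try_advance()` returning `False` for everyone, which is exactly the "whole game slows down" effect asked about.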

[1] I'm getting a funny idea here. You could save the current game state (e.g. with a fork, not that I want to implement a forking 3D application), assume the client didn't make any input (which is probably the most common situation) and let the calculation continue from that. If a few hundred ms later a control packet with input arrives, the incorrect calculations could be abandoned (just kill the parent process), and possibly, performed actions could be replayed upon the saved state (they'd have to be transferred to the forked client). (The idea is inspired by hyper, a database that does snapshots with fork.) Let me remind you: it is a funny idea. Windows can't do it, and it would probably require doing horrible things to our architecture.
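
A rough Python sketch of this speculate-and-roll-back idea, with loud caveats: it uses `copy.deepcopy` on a tiny dictionary instead of `fork`, the `step` function and the state layout are invented for illustration, and it says nothing about whether snapshotting a real game state would be fast enough.

```python
import copy


def step(state, control):
    """Deterministic toy update; control None means 'no input'."""
    new = copy.deepcopy(state)   # stand-in for the snapshot/fork
    new["x"] += 1 if control == "right" else 0
    new["tick"] += 1
    return new


class SpeculativeSim:
    def __init__(self, state):
        self.confirmed = state   # last state built only from real controls
        self.guessed = []        # controls assumed for the ticks after it
        self.current = state     # what the player currently sees

    def speculate(self):
        """Run one tick ahead assuming the remote client pressed nothing."""
        self.guessed.append(None)
        self.current = step(self.current, None)

    def confirm(self, control):
        """The real control for the oldest unconfirmed tick has arrived."""
        had_guess = bool(self.guessed)
        if had_guess:
            self.guessed.pop(0)
        self.confirmed = step(self.confirmed, control)
        if control is not None or not had_guess:
            # Guess was wrong (or there was none): drop the speculated
            # states and replay the remaining guesses on the saved state.
            self.current = self.confirmed
            for g in self.guessed:
                self.current = step(self.current, g)
```

If the guess "no input" turns out right, `confirm(None)` leaves `current` untouched; only a real input forces the replay.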

By Zapper

Date 2013-10-24 21:31

>The idea is inspired by hyper, a database that does snapshots with fork.

Yeah, and afair by Quake and the Source engine and probably many more - that's not the newest idea in multiplayer programming :)

By Caesar

Date 2013-10-24 21:33

Ah, interesting. Then, how do they facilitate a quick snapshot?

By Zapper

Date 2013-10-24 23:58

Dunno, might be explained here

By PeterW

Date 2013-10-25 17:15

The variable data of their game state consists pretty much just of the player positions - really easy to snapshot.

By Isilkor

Date 2013-10-24 21:55

> Windows can't do it

Windows can do shared memory though, which would allow similar stuff. But no.

By PeterW

Date 2013-10-25 10:42

> forking gamestate

This was actually discussed quite a bit in #clonk. First and foremost, it's just too expensive for how Clonk currently handles state. It would essentially be equivalent to making a savegame, and the optimized LZB-version already takes a second for saving, to say nothing about loading. This is something we'll never be able to do fast enough to just speculate with it.

Another thing to note is that this sort of thing can be very disorienting to a player - say he's walking and jumping somewhere, and all of a sudden the network realizes there was a huge explosion a few ticks back, and the Clonk was actually dead for a second already. Now all of a sudden the player might be controlling something entirely different without ever having seen the events leading up to this. I think this can often be worse than slowing the game down a little.

By Caesar

Date 2013-10-25 16:47

>it's just too expensive for how Clonk currently handles State.

I guess then it would have to rely on the MMU to handle that. It's still a no, though.

By PeterW

Date 2013-10-25 17:11

Memory Management Unit? Not quite sure what you have in mind, but mucking around at that level is highly unlikely to make anything easier or faster.

By Caesar

Date 2013-10-25 17:20

I basically have something like fork in mind. And no, it certainly isn't easier.

By PeterW

Date 2013-10-25 18:52

Nor faster. If you are thinking of copy-on-write, that would mean thousands of context switches and page duplications per frame. It's likely that would be even slower than "manual" saving. The only way to get around that would be to separate things that change often from things that change rarely - and at that point we could just do it directly.

By Zapper

Date 2013-10-24 21:39

A random question:
You probably already put some thought into the following - why did you not choose it?

- the host cares for the transmission of all controls
- the host sends "ok, next frame" in fixed intervals (38 times per second with control rate 1)
- the host sends player controls as soon as possible between those fixed ticks
- if no player command arrived at the host before the next tick, he just sends an empty control packet for that player

That would imply that the usual latency for input is [your ping] + [host ping] and that when a player lags only his control packet gets delayed.
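
The proposal in the list above can be sketched as follows. It is a simplified Python model with made-up names (`RelayHost`, `on_control`, `on_timer`); in particular it bundles the player controls into the tick packet instead of forwarding them immediately between ticks, which the proposal would allow.

```python
EMPTY = ()   # stand-in for "this player did nothing this tick"


class RelayHost:
    def __init__(self, player_ids):
        self.players = list(player_ids)
        self.tick = 0
        self.inbox = {}   # player_id -> controls received since the last tick

    def on_control(self, player_id, control):
        self.inbox.setdefault(player_id, []).append(control)

    def on_timer(self):
        """Called at a fixed rate; never waits for anybody."""
        packet = {
            "tick": self.tick,
            # Players whose control did not arrive in time get an empty entry.
            "controls": {p: tuple(self.inbox.get(p, EMPTY)) for p in self.players},
        }
        self.inbox.clear()
        self.tick += 1
        return packet
```

A control that arrives after `on_timer()` has fired is not lost here; it simply ends up in the next tick's packet, so only the lagging player's input is delayed.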

By Sven2

Date 2013-10-25 09:50

That is (the old?) asynchronous mode. Iirc, problem was that if people have slightly varying amount of lag, then it becomes very hard to execute certain actions in Clonk such as walking just to the tip of a cliff. Currently, if you know the ControlRate and have a stable PreSend, you can predict exactly when your keypress will be executed. If this is stable over a few rounds, motor learning kicks in and you can get really good at playing with your lag.

Additionally, if you have a huge spike of lag every minute (like e.g. Mave does) it would mean a dead Clonk every time that happens.

There's also the PreSend trick: Clients currently send their controls ahead of time such that, given their ping is stable, every other client has the control by its determined frame. The client then assumes everyone has got the control and executes it as well (without any confirmation from other clients). This means you can effectively cut latency in half!

But other than that, it's still a possible model of course.
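
A back-of-the-envelope Python sketch of why PreSend roughly halves input latency. The formulas are my own simplification (a fixed one-way latency, rounding up to whole control ticks); the real engine adapts PreSend dynamically, which is not modelled here.

```python
import math

FPS = 38  # nominal frame rate mentioned in this thread


def control_tick_ms(control_rate):
    """Duration of one control tick: control_rate frames at FPS."""
    return 1000.0 * control_rate / FPS


def presend_ticks(one_way_ms, control_rate=1):
    """Ticks a control must be sent ahead so it arrives before it is due."""
    return math.ceil(one_way_ms / control_tick_ms(control_rate))


def input_delay_ms(one_way_ms, control_rate=1, wait_for_confirmation=False):
    """Delay between keypress and execution on the sender's own machine.

    Without confirmation the sender only has to cover the one-way trip;
    with a host confirming every control it has to cover the round trip.
    """
    trip = 2 * one_way_ms if wait_for_confirmation else one_way_ms
    return presend_ticks(trip, control_rate) * control_tick_ms(control_rate)
```

With a hypothetical one-way latency of 60 ms and control rate 1, `presend_ticks(60)` is 3, so the keypress is executed about 79 ms later; waiting for a confirmation would need 5 ticks, about 132 ms.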

By Zapper

Date 2013-10-25 11:38

But a huge disadvantage for lagging clients (aka jumping off cliffs) would be in every implementation of a network mode which punishes lagging clients rather than the whole game. It should even be the same issue in the current asynchronous mode, or not? (I have to admit I never played it a lot..)

By PeterW

Date 2013-10-25 10:57

One of the problems with this is that it handles slow clients badly - if a client is just slightly below 38 FPS, his game will start lagging behind further and further the longer the game lasts, quickly making it impossible for him to meaningfully participate in the game. This is pretty much why matthes gave up on this idea for CP, if I read the code correctly.

The way the asynchronous mode works right now in CR is therefore different: The clients still do all the PreSend stuff they do for synchronous modes, but the host *caps* the amount of time the network waits for the client. This still has the above problem, but if we now assume that there is a certain maximum variability in the game simulation speeds, we can choose a cap accordingly. So let's say that, with the host running at 38 FPS, we assume no client will ever be slower than 19 FPS: then we can wait for clients a maximum of "KR / slowest FPS", which would be about 100ms for KR 2.
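
The two numbers in this post can be checked with a small Python sketch. The function names are hypothetical and `host_should_continue` is only my reading of "the host caps the wait", not code from the engine.

```python
HOST_FPS = 38.0


def seconds_behind(client_fps, elapsed_s, host_fps=HOST_FPS):
    """Game time a slow client has fallen behind if the host never waits."""
    return max(0.0, host_fps - client_fps) * elapsed_s / host_fps


def wait_cap_ms(control_rate, slowest_fps):
    """'KR / slowest FPS': time the slowest tolerated client needs per control tick."""
    return 1000.0 * control_rate / slowest_fps


def host_should_continue(waited_ms, missing_clients, cap_ms):
    """Go on if all control has arrived, or if the cap has been reached."""
    return not missing_clients or waited_ms >= cap_ms
```

For example, a client running at 36 FPS is about 3.2 seconds behind after one minute without waiting, and `wait_cap_ms(2, 19)` gives roughly 105 ms, the "about 100ms" quoted above.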

Also of course in this sort of mode the host has a massive control advantage. Not only 100% predictable input rate, but actually instant reactions. I therefore only really advocate it for dedicated servers.

By Zapper

Date 2013-10-25 11:42

The naive idea would be to limit the simulation speed in general to the stable speed of the slowest client - then this model would only catch spikes efficiently and the game would run with f.e. stable 20FPS (the slow client could catch up by skipping the drawing of frames, for example).

But I must say that I didn't really think that through.

By PeterW

Date 2013-10-25 12:30 Edited 2013-10-25 12:34

Yeah, that's pretty much what matthes attempted. But it's more complicated than that: Game speed changes during the game. If we have a lot of stuff happening in a few frames, all clients might sail through it without noticing, but the slow client might find himself falling behind. Once everybody's noticed, the client will be significantly behind - and now it wouldn't even be enough to match the slow client's speed, because we need him to *catch up*. From the viewpoint of other clients, we'd get strange lag waves coming in with a delay. And the slow client would have to live with a huge variance in control delay.

All this "rubber band" stuff is very prone to being unstable and causing the network to start oscillating between configurations, so I try to be extra-careful with this sort of thing.

By Zapper

Date 2013-10-25 12:46

Ah, I see. Yes, that sounds like a problem

By ala

Date 2013-10-28 17:44

>One of the problems with this is that it handles slow clients badly - if a client is just slightly below 38 FPS, his game will start lagging behind further and further the longer the game lasts, quickly making it impossible for him to meaningfully participate in the game.

If we talk about FPS on a local machine, we talk about computing power, right? Two ideas for that:

1) A few weeks ago Sven implemented that Skip-Frame thing for slow machines, so why not expand this to effects (reduce effects while running a game)? It's a changeable variable in the menu anyway, so it should be possible, right?

2) In a tournament we once determined that everything above 28 FPS is considered playable, and no complaints about a lagging opponent would be taken seriously above that amount. So, if we have a couple of clients that don't quite make it to 35 FPS, why not play with 33 or 30 instead? - Sure, in rare cases (a burning forest, and 1000 objects moved by an explosion) it would still lag, but not for the majority of the game.

By Maikel

Date 2013-10-28 17:53

At 2):

I don't like that, the FPS is already quite low compared to shooters, etc. and clonk gameplay is even more fast paced if you consider melee, especially involving magic. To reduce the game speed just because some people play with too high settings compared to their hardware is ridiculous. Moreover, this change is really hard to do, since it affects literally everything in the game.

Having different frame rates for drawing and control inputs would, however, be a good thing to have.

By Sven2

Date 2013-10-28 18:28

> 1) A few weeks ago Sven implanted that Skip-Frame thing for slow machines, so why not expand this for effects (reduce effects while running a game? It's a changeable variable in the menu anyway, so it should be possible right?)

I implemented this for Clonk Rage only. I hope we can find a better solution for OpenClonk, so I haven't ported it (yet).

By PeterW

Date 2013-10-28 20:26 Edited 2013-10-28 20:28

> 1) A few weeks ago Sven implanted that Skip-Frame thing for slow machines, so why not expand this for effects (reduce effects while running a game? It's a changeable variable in the menu anyway, so it should be possible right?)

I'm actually surprised that graphics is still a factor in this day and age. Note that this can only ever address sub-problems. If game simulation itself is what slows clients down, there's just nothing you can do.

> 2) In a tournament we once determined that everything about 28 fps is considered playable, and no complains about a lagging opponent would be taken seriously above that amount. So, like if we have a couple of clients that don't quite make it to 35 fps, why not play with 33 or 30 instead?

What exactly are we talking about here? If it's asynchronous mode, this is pretty much how it works - as explained above, you can adjust wait times to cap the amount a slow client can slow down a game.

By Zapper

Date 2013-10-28 21:47

>I'm actually surprised that graphics is still a factor in this day and age

That's because the Clonk graphics code is so bad compared to all the high-end games which run fluently on most computers that struggle with Clonk

By PeterW

Date 2013-10-28 21:49

Well, CR ran decently when it got released, and GPUs are supposed to still be getting exponentially more powerful. Why does the software side need to get more efficient all of a sudden?

By Zapper

Date 2013-10-28 22:27

Because suddenly scenario developers started using stuff like dark gamma + alpha-blended lights (Hazard) for a moodier setting. Or amped up the particle amount used. And then the players got bigger monitors and wanted to see everything not in 800x600 resolution but 1920x1080.
I am pretty sure the standard gold mine scenario runs better than it did a few years ago on 800x600 :)

By PeterW

Date 2013-10-28 23:07

Hm, yeah. Speaking of which - is there anything we can really do on those large blits? If (!!) we get a lights system, this might help the particular Hazard problem, but it is still puzzling that three large particles can kill the whole performance. And unless I'm wrong, the new code won't help here either, right?

By Zapper

Date 2013-10-28 23:41

Well, usually you would do as few blits and state changes as possible for the optimal performance (and shaders!). Clonk sets up the whole OGL state machine for every single blit.. Didn't get better with the meshes, either.
Those unnecessary state changes are probably the biggest performance-eater atm. To change that would be a lot of work, though.
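
To illustrate the point about state changes, here is a tiny Python sketch that only counts how often the render state would have to be switched for a given draw order. It makes no OpenGL calls, and the `(texture, blend_mode)` tuples are made-up stand-ins for whatever state a real blit needs.

```python
def count_state_changes(blits):
    """blits: sequence of hashable state keys, e.g. (texture, blend_mode), in draw order."""
    changes = 0
    current = None
    for state in blits:
        if state != current:
            changes += 1      # a real renderer would reconfigure the pipeline here
            current = state
    return changes


def batched(blits):
    """Reorder so that blits sharing the same state are drawn back to back."""
    return sorted(blits)
```

Four blits that alternate between two states cost four state changes as they come, but only two after `batched()`. The catch is that reordering is only safe where draw order does not affect the result, which is not a given for alpha-blended sprites.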

By PeterW

Date 2013-10-29 00:22

But big objects are just a couple of blits, so that explanation doesn't really fit... I know that for Sunshine I just added about three big fog objects per screen, and the framerate plummeted to barely-playable levels. Had to cut down heavily on them, even though I really liked the effect.

By Zapper

Date 2013-10-29 00:30

Ah, I get what you mean now with "large blits" - I was talking mainly about the additivity aspect and the sheer amount of blits before :)

I don't know. When I tested the particle stuff, I used some very, very large particles and don't remember anything especially slow. I am not really sure why the normal, large blits are so much slower (if they are). I am not really familiar with everything the rendering code does, though. From all I know I could imagine a pixel-wise preprocessing or something like that.. :).

By Caesar

Date 2013-10-28 22:27

Because hardware doesn't get better the way it used to anymore.

By Zapper

Date 2013-10-25 15:13

Another random thought:
Only the connection between ala and jok had that giant hiccup, right? Mave received the packet pretty quickly and jok still received packets from Mave way before the missing packet from ala arrived.

So, had Mave forwarded the missing packet to jok, the lag spike would have only lasted around 100ms instead of around 3sec?

By PeterW

Date 2013-10-25 15:16

No. It's all connections to jok, even the one from Mave. It's just random which one gets the heaviest packet loss. Best bet at this point might be to just spam jok with the packets in question until one gets through (that's what Quake 3 does, according to flgr).
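
How much "spamming" buys can be estimated with a short Python sketch. It assumes every copy is lost independently with the same probability, which is optimistic: the logs above look more like bursts of loss, where copies sent close together tend to be lost together.

```python
def delivery_probability(loss_rate, copies):
    """Chance that at least one of `copies` independent sends arrives."""
    return 1.0 - loss_rate ** copies


def copies_needed(loss_rate, target):
    """Smallest number of copies so that delivery_probability >= target."""
    if not 0.0 <= loss_rate < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("need 0 <= loss_rate < 1 and 0 < target < 1")
    copies = 1
    while delivery_probability(loss_rate, copies) < target:
        copies += 1
    return copies
```

At a hypothetical 50% loss, three copies already get through 87.5% of the time, and seven copies are enough for 99%.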

By ala

Date 2013-10-28 17:30 Edited 2013-10-28 17:32

Well, my simple thoughts for half a solution (not based on ANY knowledge, so it's probably rather naive):

Let's assume we have a group of fast players, and one or two laggers.

So the first thing I would like to know: do the laggers lag whilst receiving, or whilst sending packets?

A) My first thought is that we could just ignore their failed send attempts: This means that their control commands won't reach the other players, and they cannot move for the duration of the lag, usually fractions of a second. I think they still have a decent chance to save their clonk in case of an engagement if they lose something like a third of a second of reaction time. If the lag is bigger than a second, the game could stop so that it doesn't go out of sync.
Also, once the lagger knows he lags, he can adapt to a more careful play style and not walk on cliff edges.

B) Ok, so in the second scenario: The lagger fails to receive a packet. This packet comes from one client perhaps? I don't know how this works exactly, but if the packet has a number, like 1001, he could send all clients "1001 not received" - and all clients would immediately spam him with the packet he is missing. So in this case a bad connection from, for example, ala to jok is not a problem - because scaba's client jumps in and sends the lost packet to the lagger(?).
Well, you probably already do this from what I understand. So I assume this point is the main problem?

--

In case of A), some might argue that it's better to have a little lag than "random deaths" for the laggers, which they can't control. As a regular player for years I must sadly say that far over 50% of my deaths are due to lag, and that is true for most other players as well. So currently, we are already very frustrated with random deaths. And any attempt to limit those to the laggers only sounds like a big improvement to me. It would mean 80% fewer random deaths; only for the laggers nothing changes.

There is also a psychological argument (it's not new, just for the summary): The lagger will think something is wrong with his machine, and will probably improve his network - whereas with lag for all, he doesn't feel punished and changes nothing. We had players playing with too high resolutions, or with downloads running, and not feeling bad about it before, and I still think this accounts for a small but not negligible share of the lag that we have to deal with.

This approach would feel more like an asynchronous mode. The current asynchronous mode - well, it's not really asynchronous, is it? It still waits big time for lagging players, and provides no really fluent play compared to the normal network mode.

By PeterW

Date 2013-10-28 20:42 Edited 2013-10-28 20:46

Well, yes, this is pretty much exactly what we're doing already. Note though that going over other clients wouldn't actually help - both incoming as well as outgoing packets get delayed and have a good chance to be dropped. The more packets you need to invest into recovering a packet, the higher the likelihood that the mechanism in turn will get disrupted by dropped packets (this is pretty much what causes those "mega" lags). We can probably gain the most by doing some spamming here, but I'll have to wrap my head around how exactly to do that.
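
The self-disruption argument can be put into numbers with a small Python sketch. It assumes independent losses and that one recovery attempt only works if every packet involved in it arrives (for instance a "not received" request plus the re-sent packet); whether C4NetIOUDP's recovery has exactly this shape is an assumption on my part.

```python
def recovery_success(loss_rate, packets_involved):
    """Chance one recovery attempt works if every packet in it must arrive."""
    return (1.0 - loss_rate) ** packets_involved


def expected_attempts(loss_rate, packets_involved):
    """Mean number of attempts until one succeeds (geometric distribution)."""
    p = recovery_success(loss_rate, packets_involved)
    if p == 0.0:
        raise ValueError("recovery can never succeed at this loss rate")
    return 1.0 / p
```

At a hypothetical 50% loss in both directions, a two-packet recovery succeeds only a quarter of the time, so it takes four attempts on average, each costing at least one timeout; a scheme that needs only one packet to arrive would need two.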

> It still wait's big time for lagging players

Whatever "big time" means here. In the current default configuration, it should wait the equivalent of 2 frames, which should be about 55 ms maximum. With KR 2, that means that FPS shouldn't go below around 17 FPS ever (unless game simulation at the host slows the game down).

If this doesn't work for some reason, it's just a bug we need to find.

By ala

Date 2013-10-29 14:43

>If this doesn't work for some reason, it's just a bug we need to find.

Ah, yes I'm pretty sure there is something wrong. The games had quite a big amount of fullstops (0fps for a certain amount of time, mostly half a second, sometimes longer). I think I'll organize another testing round.

By PeterW

Date 2013-10-29 15:09

Yeah, that shouldn't happen. At least not for the host / players with good connections. I fear I didn't have a look at the second log yet - the first one is kind of useless in this regard, as I positively need the host viewpoint for this. I'll try to get that done by the end of the week. Hopefully then I'll be able to provide you with another engine, logging more fine-grained information. Packet re-sends and game continuation checks are currently high on my list.

Theory behind the latter would be that currently these continuation checks only happen when the game timer fires or a network thread notification comes in. But there's actually no timer running for when the asynchronous wait time is over, so we have to rely on the game timer to tick often enough - not the smoothest solution, but it should normally introduce a maximum of 28ms of lag.
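
A minimal Python sketch of the effect described above: if the end of the wait is only noticed when the game timer next fires, the extra lag is the distance from the deadline to the next timer tick. The timer period of 1000/38 ms and the assumption that the timer fires at exact multiples of that period are mine; the real timer may behave differently.

```python
import math

GAME_TIMER_MS = 1000.0 / 38   # assumed period of the game timer


def continue_time_polled(deadline_ms, timer_period_ms=GAME_TIMER_MS):
    """Time at which a timer firing at multiples of the period notices the deadline."""
    return math.ceil(deadline_ms / timer_period_ms) * timer_period_ms


def extra_lag_ms(deadline_ms, timer_period_ms=GAME_TIMER_MS):
    """Lag added on top of the wait cap by polling instead of a dedicated timer."""
    return continue_time_polled(deadline_ms, timer_period_ms) - deadline_ms
```

The extra lag is always less than one timer period, which is in the same ballpark as the maximum quoted above; a dedicated timer set to the deadline would remove it.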

By PeterW

Date 2013-11-02 13:38

Okay, more arrows to look at - asynchronous edition:
http://www.personal.leeds.ac.uk/~scpmw/timings2.svg

Only real change in naming is that we now have "in_x" and "ct_x" events, which are the combined controls of all clients as packed by the host.

Notes:

0ms: This is how things should look normally. All clients send their control to the host, who sends combined control packets back.

100ms: The packet from jok takes a bit longer than normal, but well within the time limit. Host simply waits the required 22ms.

150ms: Note that due to the delay introduced, the "ct_x" event now comes after the "in_" event.

180ms: Packet "in_x_2053" gets a long arrow towards jok, as apparently it gets delayed on the way. But this delay gets buffered well by the PreSends - note that jok keeps sending, so the game keeps on running. It actually strikes me as a bit overly conservative in the expected case.

350 ms: jok does two ticks quickly in order to catch up, and the delayed in_2_2056 packet only causes around 40ms of lag at the host

950 ms: Let's stop singling out jok - now ala is the villain. The in_x_2066 doesn't get to him until 1530 ms into the log, a significant 600 ms packet travel time (a packet loss, probably). ala apparently has a more aggressive PreSend and stops sending almost immediately (last packet at 980 ms).

1100 ms: For the first time the host decides to cap waits and continue without ala after waiting for 112 ms.

1200 ms: The game has slowed down considerably. Puzzlingly, all arrows are long at this point, especially the ones coming from Sven - even if they're not going in the direction of jok. So either the problem is somehow on Sven's side, or packet recovery for ala actually manages to clog up the packet buffers to the point that we get side-effects on other connections. Hm.

1530 ms: Some time later, ala finally manages to get the packets he needs, and starts catching up. He starts ticking a lot faster than the other clients, which are at this point around 5 ticks ahead.

1800 ms: Things start to normalize, wait times at the host are reducing again.

2500 ms: Next lag event, this time again jok not getting the memo. Recovery takes until 3300 ms in.

4450 ms: jok clearly feels like he's been robbed of the spotlight, and produces the most extreme lag scenario in the log. The packet in question takes a full 2.3 seconds to reach its destination. Note that in the meantime the host keeps sending, so by the time recovery is finished, we have a nice stack of 15 control packets for jok to work through. He's fallen behind massively; it will take about 3 seconds for him to catch up again. Again note that systematically, all arrows coming from the host are very long at this point, but normalize almost immediately before and after.

What do we learn from this? Asynchronous mode works as intended, from what I can see. It just looks a lot like C4NetIOUDP's recovery actually manages to sabotage itself.
