Discrete-event simulator

I have created a discrete-event simulator in Python to better understand what is going on under the hood of the mesh algorithm. For example, it shows which nodes are flooding a message over time, who receives it and when it leads to collisions.

Furthermore, it can be used to analyze the scalability of the routing protocol, which might be useful to optimize or extend it later on. See the README for an example.

The simulator is based on Meshtastic-device commit e7a825d. I only subtracted the important parts of it, so e.g. each generated message is sent to everyone on the channel as one-to-one messaging is not used now. I hope I did not make any mistakes in mimicking the current algorithm.

Feel free to try it out if you think it is useful/fun and let me know if you have any questions.

7 Likes

Do you suggest any improvements to the routing algorithm as it stands today?

In general, I think it is quite good considering it’s made with low complexity in mind. However, I have seen some flaws while running these simulations and got some ideas on improving it, but these need to be tested.

For example, right now the zero-hop reliability retransmit timer depends on shortPacketMsec, i.e. the airtime of a packet without payload, so only the header. This means that if you send a large packet, it will be on the air for quite some time and you might try a retransmit too early. I think it would be good to make the retransmit timer dependent on the airtime of the packet you just transmitted.

Next to that, the new SNR-based flooding sounds good, but does not have any randomness build in anymore. This means that two nodes that are roughly at the same distance from the original sender will try to flood roughly at the same time, which leads to collisions. At a lower layer it might still need some randomness.

2 Likes

I think you have created a very valuable tool to better understand the mesh algorithm by using simulation.

I have just noticed that you only ran the simulation up to a count of 4 hops, however it is my understanding that the hard coded hop count is 7?

And you have already highlighted some areas of possible improvement.

1 Like

I think the hard-coded maximum is indeed 7, since there are three bits reserved for it in the header. However, the default setting is 3, to limit flooding.
I could also run the simulations for a hop limit of up till 7. Though, I now ran each simulation of 200s one hundred times, which already took quite some time for a larger number of nodes.

Update: I added the results of up till 7 hops for completeness :slight_smile:

1 Like

@mc-hamster You might want to read this, before we finalise 1.3

If this is true, it won’t stop 1.3.

Please file a defect with full reproduction steps.

So this will wait for a few years too?

EDIT: “You might want to read this, before we finalize 1.3” I am referring to the main post here, talking about routing. I will make a new post about the bug related to delivery confirmation.

You are welcome to make your first pull request if you find a bug.

@kokroo Please be mindful of this kind of snark. We’re all volunteers and this project needs to stay fun.

If you have found a bug, please file it in our issue tracker in GitHub with the steps to reproduce it. If the details are here in the forum, it will be lost. This forum is just-in-time and not built for prioritization or provide a dashboard for a volunteer to step up and address defects/enhancements.

Like @garth said, we will grow together if you submit a pull request for this.

There is no snark. I don’t know in what tone you read my messages.

It was a serious question because the OP spoke about some important points, and since you worked on the new routing algorithm, I thought I should bring it to your attention.

I asked “So this will wait for a few years too?” because you said the changes for 1.3 are locked and the mesh protocol will not be changed for the next few years.

I don’t know why everybody loves bringing out pitchforks so quickly.

Please file a defect with full steps to reproduce the problem.

Thank you.

1 Like

It read as really snarky to me as well. Honesty not sure why you think it sounded nice or genuine as it does not read that way at all.

I think there’s some confusion. I was simply asking MC Hamster if the suggestions to routing by the original poster of the thread will also need to wait a few years before being included in the protocol, since he told me on discord that the protocol will be “locked” for the next few years and no changes will be allowed that break backwards compatibility.

Very cool tool, working on my Mac now and the updated explanation made it more obvious what is going on. Great work @GUVWAF this will be a huge help debugging and visualizing networks and should help debug and test all the routing changes in 1.3.

2 Likes

My first concern should be fixed with PR #1444.
For my second concern, I would like to propose a revision on the transmit delays. My idea is to use a contention window, basically the pool to pick a random number from. Then a node waits this random number of slottimes before transmitting in order to avoid collisions. The slottime is the time required for channel activity detection, such that if node 1 picks random number n and node 2 picks n + 1, node 2 can detect that node 1 is already transmitting and waits for the channel to be clear before picking another random number.
In order to keep the SNR-based flooding proposed for 1.3 in place, the size of the contention window will be larger for a higher received SNR.
Since if a retransmission is needed it is more likely that collisions are currently happening, the contention window size for a retransmitted packet is based on the number of retransmissions already tried. For a router, the contention window size will always be lower than that of a normal node.

I implemented this behavior in the simulator branch contention_window. Here are some results, where the solid lines represent the current Meshtastic logic and the dotted lines my proposal.


You can see that the collision rate for the new method is quite some lower, especially for the Fast modem settings. This in turn leads to lower transmit air utilization. Here it is compared against the average delay (time between message generation and end of reception).



You can see that the per node air utilization is quite some lower, whereas it has only little effect on the average delay.

I would like to hear what you think of the new method such that I can decide whether to implement it in the device source for a PR.

7 Likes

@GUVWAF That’s so awesome! Totally spot on. Let’s do it!

2 Likes

FYI, absolute maximum delay before transmitting on Short Fast is 3 seconds and on Very Long Slow it is 35.5 seconds.

3 Likes