Proposal of New BGP Community Standard Supporting Cumulative Latency Calculation

I have just started enforcing BGP communities on my network. However, I found that the current latency values only record the worst segment of a path's latency range, which is not that valuable for latency optimization; many DN42 members tend to rely more on the region values than on the latency ones.

I am thinking of the following new set of values, which reserves (64511, 60..69) and can record a reasonably accurate cumulative latency value together with the count of participating nodes. I'd like to hear more opinions on this.

Update: I would like to hear more thoughts on the tie-breaking issue, and on whether we should always add at least 1 to cumu_lat (so that among routes within the same location, all reporting 0-5 ms, the shorter ones would win?).

## This is a set of BGP Community values that implements cumulative latency
## calculation across participating routers, which may improve the ability
## to optimize for latency.
##
## The utilized range is (64511, 60..69), of which:
##   I.  (64511, 60..65) are bits encoding the cumulative latency value in
##       steps of 5 ms, so they can carry latency values from 0 ms to 315 ms.
##       This binary number is bitwise NOT'd, so that a route with no
##       participating nodes decodes to 315 ms and is not mistaken for a
##       very good route with 0-5 ms latency.
##   II. (64511, 66..69) are bits encoding the participating node count,
##       supporting up to 15 nodes. This binary number is not bitwise NOT'd,
##       so routes with no participating nodes automatically have a count
##       of zero.
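To cross-check the arithmetic above, here is a small Python model of the complemented 6-bit encoding (Python is used only for illustration; the actual implementation below is in BIRD's filter language):

```python
# Model of the (64511, 60..65) cumulative-latency encoding.
# Bit i (weight 5 * 2**i ms) maps to community value 60 + i.
# The stored bits are the bitwise NOT of the 5 ms step count, so an
# empty community set decodes to the 315 ms "no participants" sentinel.

def encode(lat_ms):
    steps = lat_ms // 5
    if steps > 63:  # overflow: leave no communities, which decodes to 315
        return set()
    # Add a community for every *zero* bit of the step count (bitwise NOT).
    return {60 + i for i in range(6) if not (steps >> i) & 1}

def decode(communities):
    lat = 315
    for i in range(6):
        if 60 + i in communities:
            lat -= 5 * (1 << i)
    return lat
```

Round-tripping floors to the 5 ms step, e.g. `decode(encode(123))` gives 120, and anything above 315 ms saturates at the sentinel.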

function get_cumu_lat()
int cumu_lat;
{
  cumu_lat = 315;
  if (64511, 65) ~ bgp_community then { cumu_lat = cumu_lat - 160; }
  if (64511, 64) ~ bgp_community then { cumu_lat = cumu_lat - 80; }
  if (64511, 63) ~ bgp_community then { cumu_lat = cumu_lat - 40; }
  if (64511, 62) ~ bgp_community then { cumu_lat = cumu_lat - 20; }
  if (64511, 61) ~ bgp_community then { cumu_lat = cumu_lat - 10; }
  if (64511, 60) ~ bgp_community then { cumu_lat = cumu_lat - 5; }
  return cumu_lat;
}

function set_cumu_lat(int cumu_lat)
int tmp;
{
  bgp_community.delete([(64511, 60..65)]);
  tmp = cumu_lat / 5;
  # Overflow (> 315 ms): clamp to all-ones, so no communities are added
  # below and the route decodes to the 315 ms sentinel.
  if (tmp > 63) then tmp = 63;
  if(tmp >= 32) then tmp = tmp - 32;
  else bgp_community.add((64511, 65));
  if(tmp >= 16) then tmp = tmp - 16;
  else bgp_community.add((64511, 64));
  if(tmp >= 8) then tmp = tmp - 8;
  else bgp_community.add((64511, 63));
  if(tmp >= 4) then tmp = tmp - 4;
  else bgp_community.add((64511, 62));
  if(tmp >= 2) then tmp = tmp - 2;
  else bgp_community.add((64511, 61));
  if(tmp >= 1) then tmp = tmp - 1;
  else bgp_community.add((64511, 60));
}

function get_cumu_cnt()
int cumu_cnt;
{
  cumu_cnt = 0;
  if (64511, 69) ~ bgp_community then { cumu_cnt = cumu_cnt + 8; }
  if (64511, 68) ~ bgp_community then { cumu_cnt = cumu_cnt + 4; }
  if (64511, 67) ~ bgp_community then { cumu_cnt = cumu_cnt + 2; }
  if (64511, 66) ~ bgp_community then { cumu_cnt = cumu_cnt + 1; }
  return cumu_cnt;
}

function set_cumu_cnt(int cnt)
int tmp;
{
  if (cnt > 15) then tmp = 15;
  else tmp = cnt;
  bgp_community.delete([(64511, 66..69)]);
  if (tmp >= 8) then { bgp_community.add((64511, 69)); tmp = tmp - 8; }
  if (tmp >= 4) then { bgp_community.add((64511, 68)); tmp = tmp - 4; }
  if (tmp >= 2) then { bgp_community.add((64511, 67)); tmp = tmp - 2; }
  if (tmp >= 1) then { bgp_community.add((64511, 66)); tmp = tmp - 1; }
}

# link_latency: in milliseconds
function update_cumu_latency(int link_latency)
int cnt;
int cumu_lat;
{
  cnt = get_cumu_cnt();
  # A route with no participating nodes decodes to the 315 ms sentinel,
  # so the first participant must start accumulating from zero.
  if (cnt = 0) then cumu_lat = link_latency;
  else cumu_lat = get_cumu_lat() + link_latency;
  set_cumu_lat(cumu_lat);
  set_cumu_cnt(cnt + 1);
  return cumu_lat;
}
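As a rough Python walk-through (hypothetical link latencies, not from the proposal) of how the running total behaves across participating hops, given that each stored value is floored to a 5 ms step:

```python
# Hypothetical walk-through of cumulative-latency updates across three
# participating hops; each hop stores the running total floored to the
# 5 ms granularity the communities can carry.

def store(lat_ms):
    # What the encoding preserves: floor to 5 ms, saturating at 315 ms.
    return min((lat_ms // 5) * 5, 315)

carried = 0   # the first participant starts from zero, not the sentinel
count = 0
for link_latency in (12, 7, 23):
    carried = store(carried + link_latency)
    count = count + 1

# carried ends at 35 ms for a true total of 42 ms: each hop can lose up
# to just under 5 ms to quantization, so the carried value is a lower bound.
```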

Interesting idea of using communities as a binary number.
A few comments:

  • The spec needs to be clear on whether the community is set on import or export (we definitely don't want two ASes doing opposite things)
  • How would it be used during BGP route calculation? I'd be concerned about inexperienced users simply setting local_pref and causing issues for the network through flaps, loops or other misconfigurations. In other words, how can it be used safely?
  • I think there is a risk it would encourage bad design, with ASNs advertising smaller, per-node or per-service prefixes and making reactive changes in order to try and optimise routes to their node. This would move DN42 away from how clearnet works, and would considerably multiply the size of the DN42 global routing table and the number of global route updates. With significantly more global routes and the associated instability, the cost and overhead of joining (and staying in) DN42 could increase, making it harder for users who are learning or have less time or resources to spend. The value of DN42 as a learning environment 'like clearnet' would also diminish.

The official place for consensus in DN42 is the mailing list, so you may want to post there too.

Simon

    burble The spec needs to be clear whether the community is set on import or export (definitely don't want two AS doing opposite things)

    Definitely. This is only a very first draft that we came up with yesterday. I just want to give it an "early access" release so I can hear more voices about it before doing too much useless work xD.

    burble How would it be used during BGP route calculation ? I'd be concerned about inexperienced users simply setting local_pref and causing issues for the network through flaps, loops or other misconfigurations. In other words, how would it be used safely ?

    You may get the participation rate from cumu_cnt / bgppath.len; then you may estimate the actual latency to the origin by either:

    • calculating the average latency across the participating nodes by dividing the cumulative latency by cumu_cnt, then multiplying by the path length: total_lat = bgppath.len * (cumu_lat / cumu_cnt), or
    • assuming the non-participating nodes have the worst value carried by (64511, 1..9) (denote it worst_lat), so total_lat = worst_lat * (bgppath.len - cumu_cnt) + cumu_lat.
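As a hedged sketch, the two estimates could be computed like this (hypothetical numbers; `worst_lat` would come from the existing (64511, 1..9) latency community):

```python
def estimate_total_latency(cumu_lat, cumu_cnt, path_len, worst_lat=None):
    """Estimate origin latency when only cumu_cnt of path_len hops
    participated in the cumulative scheme."""
    if cumu_cnt == 0:
        return None  # no participants, no data
    if worst_lat is None:
        # Extrapolate the participants' average over the whole path.
        return path_len * (cumu_lat / cumu_cnt)
    # Assume each non-participating hop adds the worst-case latency.
    return worst_lat * (path_len - cumu_cnt) + cumu_lat

# e.g. 3 of 5 hops participated, reporting 60 ms in total:
#   average-based estimate:               5 * (60 / 3)  -> 100 ms
#   worst-case estimate (worst_lat=50):   50 * 2 + 60   -> 160 ms
```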

    burble causing issues for the network through flaps, loops or other misconfigurations

    As for flapping... I assume community values are manually specified when peering and are not likely to change frequently? As for other kinds of misconfiguration, I see the risk as equal to that of the existing community values, as long as we settle those behaviors, including whether to apply on import or export, etc.

    burble I think there is a risk it would encourage bad design through ASNs advertising smaller, per node or per service prefixes and making reactive changes in order to try and optimize routes to their node.

    It surely does make people worry... However, if that happens, it also means this measurement actually works for optimization, even more easily than doing IGP optimizations? 😃 Actually, I am not that optimistic about how many networks would actually adopt this new standard; that's why cumu_cnt is part of the design. I also don't have statistics on how widely the current BGP communities are adopted.

    burble would multiply up the size of the DN42 global routing table

    If any system needs to shrink the table size, it could drop some precision bits, such as (64511, 60..62).
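As a sketch in Python (the decoder mirrors the spec above): because the stored bits are complemented, deleting the (64511, 60..62) communities removes their 5/10/20 ms subtractions, so the decoded value can only move up, by at most 35 ms, onto a coarser 40 ms grid:

```python
# Decoder for the complemented (64511, 60..65) latency bits, as in the
# spec above; community 60+i carries weight 5 * 2**i ms, subtracted
# from the 315 ms sentinel when present.

def decode(communities):
    lat = 315
    for i in range(6):
        if 60 + i in communities:
            lat -= 5 * (1 << i)
    return lat

fine = {60, 61, 63, 65}       # 315 - 5 - 10 - 40 - 160 = 100 ms
coarse = fine - {60, 61, 62}  # drop the three low precision bits
# decode(coarse) gives 115 ms: dropping the low bits only raises the
# value, by at most 35 ms, so the coarse reading stays pessimistic.
```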

    burble The value of DN42 being a learning environment 'like clearnet' would also diminish.

    I agree with your point. However, DN42 also serves as a lab or experimental playground to test things that cannot be done on the clearnet, right? I am definitely going to classify this new community as very experimental and optional.

    burble The official place for consensus in DN42 is the mailing list, so you may want to post there too.

    Definitely. As I said earlier, I am simply looking for advice at this very early stage before making any further proposal in a more formal way. A forum thread is especially suitable for gathering opinions and keeping track of them, compared with IMs.

    Your system is interesting because it also works well if only parts of the path (even with gaps) support it. But I see a few things that need to be clarified:

    • How often should the metric be updated? If only once when the peering is established, then it will become less accurate or maybe even misleading as clearnet connectivity changes over time. But updating every minute like I do at the moment causes a big load on the whole network and might also lead to strange issues.
    • local_pref overrides are risky, especially for beginners. What can we do to reduce the risk for everybody?
    • It will make deaggregation look like a good solution to beginners. Similar optimisations are already possible with prepending, but this new community is a lot more precise and easier to predict and that will make it much more attractive. And we definitely don't want unneeded deaggregation.
    • How can it be implemented correctly on other bgp implementations? Bird has by far the most flexible filtering as far as I know.
    • How many bits do we need/want? More bits make it more accurate but also cost more memory and generate more traffic.
    • How are you supposed to increment it on iBGP sessions? For example bird doesn't expose the cost of the IGP route.
    • What should happen if one of the parts overflows?

    I don't think switching from a logarithmic to a linear scale will really fix anything, since in my opinion the main issue is the lack of propagation of network internal latency.

    It doesn't really matter if the external peering has 20ms or 30ms of latency if the internal latency between the ingress and egress node is above 100ms and is completely ignored.
    Therefore I would start with that: apply the current latency attribute correctly on export (by, say, tagging each node on route import into the network and having a precomputed latency for each ingress/egress pairing).
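The precomputed-pairing idea could be sketched roughly like this (all node names and latency figures here are hypothetical, purely to illustrate the lookup):

```python
# Hypothetical internal latency matrix between ingress and egress nodes,
# measured out of band and refreshed periodically.
internal_ms = {
    ("fra", "nyc"): 85,
    ("fra", "sgp"): 160,
}

def export_latency(ingress, egress, external_ms):
    # Latency advertised on export: internal ingress->egress latency
    # plus the latency of the external peering itself.
    return internal_ms.get((ingress, egress), 0) + external_ms
```

For example, a route learned in `fra` and exported in `nyc` over a 20 ms peering would be tagged with 105 ms instead of just the 20 ms external hop.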