Clustering Downstream SNR Time Series For Better Visualization

 

Look at the picture on the left. Do you see anything? Probably, not.

The goal of this exercise is to find a smart way to group the SNR data over time of many channels belonging to a single cable modem dynamically to make visualization better: for example, have 24 lines appear as roughly 5-8 grouped lines that can be drilled down on.

For this, we use DBSCAN with custom metrics. DBSCAN can be parameterized in a few ways, but the main ones are its two hyperparameters: epsilon and min_pts. Another quite important option is the metric used. It turns out, epsilon and metric are the two main parameters to look out for, especially in higher dimensional data. In our case we are clustering lines, so we are working in two dimensions.

The main use of DBSCAN is for cases where a fixed set of clusters doesn't make sense. For instance, when using k-means or similar, the number of clusters is pre-defined. Since the densities of the SNR data vary a lot across modems, we end up with both very well clustered modems, but also with very bad looking ones. DBSCAN is able to find clusters based on the density by looking at neighbors and appending them to the current cluster given a set of constraints (minimum amount of points and distance). It goes point by point, denoting some of them as 'core' points, if they have enough (min_pts) neighbors in a certain range (epsilon). In our case, a single 'point' is actually a whole line, and the metric is defined between lines.

This is where epsilon comes in. It obviously needs to be fitted to the metric and the data, and it ultimately controls the amount of clusters you end up with. The issue we have here is that different modems have different densities (note: OPTICS, which is very closely related to DBSCAN and based on it, is able to find clusters of different densities, but we did not look into it since our approach already worked), i.e. some modems have a SNR range of 40 to 45, while others have a range of 42 to 43. An epsilon that works well in the first case, likely will not work well in the second one.

Now, we tried multiple metrics and ended up with an extremely steep one. Below you can see a few candidates and how the different roots behave over different values.

A widely used metric is the mean squared error (MSE), which is kind of similar looking to the ones presented here. However, it punishes outliers heavily, and this is something we do not want here: two lines that are equal in all points except for one that is extremely far off would have a large MSE. Such outliers inflate the distance between two candidate lines far too much. We discourage these outliers by using a much steeper metric. Observe in the figure above how steep x**0.05 is. In this figure, x is the distance between two points of two candidate lines. At some point (>0.1) the distance does not change that much any more. This steep metric works since we know our data to be in a very tight range (for example, a SNR between 43 and 44).

As for epsilon, we came up with a trick. We used the average of the pairwise distances of all lines as our epsilon and thus made it dynamic. And it worked remarkably well!

A small counterpoint against this  trick: it is inefficient and gives us O(n^2) complexity. For comparison, with a good epsilon, DBSCAN has an excellent time complexity of O(n*log(n)).

Below is the result for 6 random modems. The shaded area is the min/max of the cluster. The blue cluster is always the 'outlier' cluster, thus it sometimes spans the whole range. It contains channels with its SNR having a rather high variance over time. Observe the varying ranges (e.g. 30 to 42 vs 38.5 to 39.5). Looks pretty cool!

That's it for this post. I hope you could gain some insight into how DBSCAN helps in visualization. For clustering, DBSCAN is an indispensable method to have in your toolbox. Often, it is extremely tough or even impossible to define a specific number of clusters, like required by k-means. With DBSCAN, we define a cluster using eps and the metric. This has its own complexities, but gives us much more room for parameterization. See you next time!


Previous
Previous

NQI - One Of Many DOCSIS Health Indicators For Cable Modems In HFC Networks

Next
Next

Detecting Channels From FBC Spectrum Data Using Machine Learning