Blog

Status Update + Future Projects

If you are a recurring visitor, or just happen to browse the site, you may have noticed the absence of recent posts. Let’s just say I have been a bit busy…

I’ve got some fun projects I have been working on as well as some IRL stuff to handle. I will hopefully be posting a new project series here soon.

Patiently waiting for the arrival of my second kid in December :)

Site Migration

You may have noticed both my site [alexsguardian.net] and the Grafana – Experts Exchange wiki going down a lot recently. Here is why:

TL;DR: I migrated both sites from my home-hosted web server to a dedicated host.

I logged into Cloudflare’s dashboard the other day to update a DNS record while testing a new self-hosted service at home. I normally don’t pay much attention to the stats page, but this time I noticed something a bit crazy: my “site” had over 373k requests in 30 days. At that point it was apparent I was effectively running a web server, and I’m pretty sure that’s a breach of contract with my ISP since I am on a residential gigabit line and not a business connection.

Note: I have not received any type of warning from my ISP about this, yet.

I decided to take preemptive action and migrate both sites to a dedicated Linode VM. This freed up my bandwidth at home and gave me access to a proper dedicated cloud host with automated backups. Both sites are now happily running on this host with no issues so far.

Linode VM Portainer view

I only had to restore a backup once during the migration, after I messed up the iptables firewall. The backup file wasn’t saved correctly, so I couldn’t restore the original rules.

Expanding Pi-Hole Stats with Prometheus

The other day I came across a Prometheus exporter for Pi-hole (found in a comment on /r/pihole) that gives WAY more stats/data compared to the InfluxDB script I posted about a while back. With this exporter, I was able to set up a more detailed dashboard.

Currently I only have this set up for a single instance of Pi-Hole. I am in the process of setting up a second instance as a backup for when my primary one goes down for updates. This dashboard can easily be updated to either clone this info for your second instance or add a drop-down selector for instances. You’ll have to update your queries to support dashboard variables, which isn’t very hard to do.
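
For example, something like this is all the drop-down really needs: a Grafana dashboard variable plus a regex match in your panel queries. The metric name below is just a placeholder, so swap in whatever your exporter actually exposes:

# Grafana dashboard variable (Query type, Prometheus data source)
label_values(pihole_ads_blocked_today, instance)

# Panel query filtered by the variable
pihole_ads_blocked_today{instance=~"$instance"}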

The dashboard json can be found here.

Monitoring Nvidia GPUs via Telegraf

The nvidia-smi plugin for Telegraf, in its most current iteration (v1.10.4), basically gives you an overview of your GPU usage. This “guide” assumes you are using Windows as your host OS. Linux should be fairly easy to get going as long as you know where your nvidia-smi executable is located.
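
On Linux, a quick way to find the binary (assuming it is already on your PATH):

# BASH
which nvidia-smi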

If you do not have Telegraf installed, check out my guides here.

Create a new conf file in the telegraf.d folder.

notepad.exe C:\telegraf\telegraf.d\nvidiasmi.conf

Paste the following into the new file and save/close it.

# Pulls statistics from nvidia GPUs attached to the host
[[inputs.nvidia_smi]]
  ## Optional: path to nvidia-smi binary, defaults to $PATH via exec.LookPath
  bin_path = "C:\\Program Files\\NVIDIA Corporation\\NVSMI\\nvidia-smi.exe"

  ## Optional: timeout for GPU polling
  timeout = "5s"

Restart Telegraf.

net stop telegraf
net start telegraf

On Windows you have to escape the backslashes (\\) when setting bin_path, otherwise you’ll get errors when Telegraf queries nvidia-smi.exe.
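
Before building panels you can do a one-shot test run to make sure the nvidia_smi metrics actually show up (paths below assume the default C:\telegraf layout from my guides):

# PowerShell
C:\telegraf\telegraf.exe --config C:\telegraf\telegraf.conf --config-directory C:\telegraf\telegraf.d --test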

Once you have verified Telegraf is reporting Nvidia stats, you can start creating your panels in Grafana. Use the nvidia_smi measurement from your telegraf data source to build the panels.

Update February 2020

Telegraf recently updated its SMI plugin to return more data, which can be used to create more monitoring panels. Here is a list of the fields that are now returned (a sample panel query follows the list):

  • clocks_current_graphics
  • clocks_current_memory
  • clocks_current_sm
  • clocks_current_video
  • encoder_stats_average_fps
  • encoder_stats_average_latency
  • encoder_stats_session_count
  • fan_speed
  • memory_free
  • memory_total
  • memory_used
  • pcie_link_gen_current
  • pcie_link_width_current
  • power_draw
  • temperature_gpu
  • utilization_gpu
  • utilization_memory
Here you can see the new encoder FPS and session counts. These use the new stat panel in Grafana 6.6.
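
As a starting point for a panel, a rough InfluxQL query against one of these fields looks like this (assuming the default nvidia_smi measurement name and a Telegraf/InfluxDB data source):

SELECT mean("utilization_gpu") FROM "nvidia_smi" WHERE $timeFilter GROUP BY time($__interval) fill(null)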

Monitoring Hyper-V via Telegraf

The cool thing about Telegraf on Windows is that you can monitor basically any system service that reports to the Windows performance counters, so creating a Hyper-V dashboard is actually fairly easy.
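
If you want to see which Hyper-V counter sets your host actually exposes before writing the config, PowerShell can list them:

# PowerShell
Get-Counter -ListSet "Hyper-V*" | Select-Object -ExpandProperty CounterSetName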

Create a new input configuration file in the telegraf.d directory:

# PowerShell
notepad C:\telegraf\telegraf.d\hyperv.conf

Paste the following into your new file, then save.

[[inputs.win_perf_counters.object]]
  ObjectName = "Hyper-V Virtual Machine Health Summary"
  Instances = ["------"]
  Measurement = "hyperv_health"
  Counters = [
    "Health Ok",
    "Health Critical",
  ]

[[inputs.win_perf_counters.object]]
  ObjectName = "Hyper-V Hypervisor"
  Instances = ["------"]
  Measurement = "hyperv_hypervisor"
  Counters = [
    "Logical Processors",
    "Partitions",
  ]

[[inputs.win_perf_counters.object]]
  ObjectName = "Hyper-V Hypervisor Virtual Processor"
  Instances = ["*"]
  Measurement = "hyperv_processor"
  Counters = [
    "% Guest Run Time",
    "% Hypervisor Run Time",
    "% Idle Time",
    "% Total Run Time",
  ]

[[inputs.win_perf_counters.object]]
  ObjectName = "Hyper-V Dynamic Memory VM"
  Instances = ["*"]
  Measurement = "hyperv_dynamic_memory"
  Counters = [
    "Current Pressure",
    "Guest Visible Physical Memory",
  ]

[[inputs.win_perf_counters.object]]
  ObjectName = "Hyper-V VM Vid Partition"
  Instances = ["*"]
  Measurement = "hyperv_vid"
  Counters = [
    "Physical Pages Allocated",
  ]

[[inputs.win_perf_counters.object]]
  ObjectName = "Hyper-V Virtual Switch"
  Instances = ["*"]
  Measurement = "hyperv_vswitch"
  Counters = [
    "Bytes Received/Sec",
    "Bytes Sent/Sec",
    "Packets Received/Sec",
    "Packets Sent/Sec",
  ]

[[inputs.win_perf_counters.object]]
  ObjectName = "Hyper-V Virtual Network Adapter"
  Instances = ["*"]
  Measurement = "hyperv_vmnet"
  Counters = [
    "Bytes Received/Sec",
    "Bytes Sent/Sec",
    "Packets Received/Sec",
    "Packets Sent/Sec",
  ]

[[inputs.win_perf_counters.object]]
  ObjectName = "Hyper-V Virtual IDE Controller"
  Instances = ["*"]
  Measurement = "hyperv_vmdisk"
  Counters = [
    "Read Bytes/Sec",
    "Write Bytes/Sec",
    "Read Sectors/Sec",
    "Write Sectors/Sec",
  ]

[[inputs.win_perf_counters.object]]
  ObjectName = "Hyper-V Virtual Storage Device"
  Instances = ["*"]
  Measurement = "hyperv_storage"
  Counters = [
    "Write Operations/Sec",
    "Read Operations/Sec",
    "Read Bytes/Sec",
    "Write Bytes/Sec",
    "Latency",
    "Throughput",
  ]

Restart Telegraf with the new config file.

# PowerShell Administrator
net stop telegraf
net start telegraf

Import dashboard ID: 2618 into Grafana and set your data source to telegraf.

To see all Hyper-V counters you can check out this PowerShell counters export, here.

50% Packet loss? What?!

If you are a frequent visitor to my website, you may recall some issues with my site that started last week. Well, here’s the scoop on what happened.

Last Wednesday [ 4/26/19 ] around 3AM, I started getting reports of low bandwidth in my overview dashboard in Grafana. Note: I basically monitor everything that I can. These reports set off a massive find-and-fix hunt, as I basically rely on my “gigabit” connection for everything.

Prepare for long post.

First Issue (Mostly unrelated)

I was getting reports from a few Facebook group members that my site was throwing occasional TLS errors, so I started investigating my reverse proxy. It turned out that Caddy had just launched version 1.0 and my automatic Docker image build had failed to update my image due to a plugin issue. I fixed that and pulled the new image to update my local Docker container running Caddy.

Caddy 1.0 introduced a new ALPN challenge for obtaining Let’s Encrypt certificates. This challenge does not work when you are behind Cloudflare with an orange (proxied) cloud, as LE is unable to talk directly to the server making the request. This caused my certificates to not renew, which was causing the TLS errors. I fixed it by disabling the ALPN challenge and falling back to the older HTTP one. I also had to wait a bit because I may have accidentally hit LE’s rate limit.
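
For reference, Caddy v1 exposed startup flags to control which ACME challenge it used; disabling the ALPN one looked roughly like this (flag name from memory, so double-check caddy -help on your build):

# BASH
caddy -disable-tls-alpn-challenge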

Second Issue

I run a Ubiquiti EdgeRouter Lite-3 as my primary routing device behind my modem. I assumed my low-bandwidth issue could have been caused by my router being updated to the latest firmware, so I downgraded the firmware from the recently released version 2.0.1 back to version 1.10.9. This ended up leading to my third issue.

Third Issue

I noticed that my router’s CPU was spiking rather frequently, so I jumped into an SSH session and ran top -i on the router. The SNMP service was spiking to 100% CPU usage quite often. After some googling, I found out that SNMP can bug out when changing firmware versions. The only fix was to disable it completely, and doing so returned my router’s CPU usage to normal.
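
For anyone hitting the same thing, disabling SNMP on EdgeOS boils down to a couple of config commands (from memory, so verify against your firmware version):

# EdgeOS CLI
configure
delete service snmp
commit
save
exit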

So far it’s been 24 hours since the first low-bandwidth report…

Fourth Issue

At this point I’m pulling my hair out trying to figure out what’s wrong with my internet. I ran a pcap on my router, a few speed tests, and a few ping tests from it and a few other network devices. I also tried a laptop plus my old ER-X connected directly to my modem… and… HOLYCRAPMONKIES, I’m losing 20-50% of my incoming packets. This was confirmed when I checked my modem’s status page and saw channel ID 33 with a corrected packet rate of 2 billion in my downstream channel list. I called my modem manufacturer to verify this was the issue, and they told me that channel 33 is an internal Comcast data channel. It also helped that it was labeled as Other and not QAM256.

Channel 33, labeled as “Other”, with a corrected packet rate of 2,052,783,457… notice the corrected rate.

Now this was the fun part: I had to convince the Comcast Script Monkies that this was an issue on their end and not because I “owned my own equipment.” So I called Comcast TS and got the usual Tier 1 person. Before they could even talk, I asked if they knew what channel bonding, MER/SNR, or signal power was, and that if they didn’t, I needed someone that does. This is how it went (my thoughts in parentheses):

Friday Night Call

Me: “Hi, before we go anywhere, do you know what channel bonding, MER/SNR, or signal power level is? If not, I need someone that does.”
TS: “No I do not, and there is no one here at the moment that does. (Um, OK.) But I can put in a T2 ticket for you after I do a few modem checks… (FML)”
Me: “Ok, well I am losing 20-50% of my incoming packets, which is basically making my internet useless. It’s been going on since Wednesday at 3AM.” [It’s now Friday night, btw]
TS: “You said 20-50%, sir?”
Me: “Yes… I checked my modem and I have a channel labeled ID 33 – Other with a corrected packet rate of 2 billion. It looks like it was provisioned wrong.”
TS: “Ok, well I can see if I can change the channels.”

TS: “I see you own your own modem and unfortunately we can’t edit its settings. I can schedule a tech to come out… The earliest I have is Sunday.”
Me: “No. Every time a tech comes out, they plug in their little speed-testing Raspberry Pi, which has a cap of 300Mbps, so they won’t even be able to fully test my gigabit connection. Second, this is a signal issue that recently started happening; it’s not my modem, as confirmed by the manufacturer. I have also double-checked the cable connections and even the coax connection box in my apartment.”
TS: “I’ll schedule the tech for Sunday at 2-4PM.”
Me: “ffs, fine.”

Saturday Call

Comcast T2 calls me:
TS: “Hi, Can I speak to Mr. Henderson?”
Me: “Speaking”
TS: “Hi, this is Comcast T2 calling about your cable issue. We have identified a signal issue in your neighborhood which was resulting in poor signal performance. We have since fixed this issue and would like to know if your issue is resolved.”
Me: “I am currently not home but I can verify when I am.”
TS: “Ok Mr. Henderson. If your issue is fixed let us know and we can cancel your tech visit tomorrow.”

Okay, so maybe laying the tech information on the T1 person helped, lol. (Also, I condensed some of the Friday call since it had a lot of me being annoyed with T1 for being a script monkey. I even called them out on it, lol.) Anyway, after the call on Saturday I RDP’d into my server via VPN + Guacamole and ran a slew of network tests. Looks like my connection was finally fixed, and it’s way more stable now too!

What a 72 hour ride!

Quick Grafana update

Now that I have my permanent Grafana setup running, I want to go through and update the guide with everything I have learned. This will hopefully happen this weekend if I get time. The update will include an updated docker-compose file as well, which I will most likely host on my GitHub repo.

In the meantime, be sure to check out the current guide (which utilizes Docker, etc.), as well as the Discord server and the Facebook group!

Overview

PS: Sorry for the site downtime. I ran an upgrade and it borked the WordPress database… lol

Deluge -> InfluxDB

Ever since I got my remote seedbox set up from seedboxes.cc, I have been trying to figure out the best way to get Deluge stats to show up in my Grafana stack.

I first tried a Deluge exporter for Prometheus, but it didn’t seem to work as it required access to Deluge’s config directory in order to export the stats. Really dumb, but it’s whatever. I then came across an InfluxDB script that sent Deluge stats to Influx. However, that did not work either, as the /json endpoint used a self-signed certificate and the script errored out because of that.

BUT, I got that script to actually work! I had to use a Deluge “thin client” to connect to the remote seedbox and basically mirror the data locally. This was done by running a Deluge container in Docker and using the connection preferences to connect to Cerberus (my remote seedbox).
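
The local mirror is nothing special; a plain Deluge container along these lines is what I mean (image, ports, and paths here are just an example, adjust to your own layout), then point its connection manager at the remote daemon:

# BASH [ LINUX VM ]
docker run -d --name deluge-thinclient \
  -e PUID=1000 -e PGID=1000 \
  -p 8112:8112 \
  -v /opt/containers/deluge:/config \
  linuxserver/deluge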

W.I.P. Deluge Influx dashboard.

A quick note here: this dashboard is currently a full WIP as I learn what data is what and how to properly visualize it in Grafana.

What you will need to set this up

First, make sure you have Docker installed and set up (preferably on a Linux host). Then make sure your Deluge client is set up and configured properly for hosting Linux ISO downloads, etc.

Create the Deluge database and user, and assign the appropriate permissions. If you do not have a Grafana/Influx stack going, see my guide here.

curl -XPOST "http://ip.of.influx.db:8086/query" -u admin:password --data-urlencode "q=CREATE DATABASE deluge"

curl -XPOST "http://ip.of.influx.db:8086/query" -u admin:password --data-urlencode "q=CREATE USER deluge WITH PASSWORD 'deluge'"

curl -XPOST "http://ip.of.influx.db:8086/query" -u admin:password --data-urlencode "q=GRANT WRITE ON deluge TO deluge"

curl -XPOST "http://ip.of.influx.db:8086/query" -u admin:password --data-urlencode "q=GRANT READ ON deluge TO grafana"
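
You can double-check that the database and permissions took with a quick query using the same credentials:

curl -G "http://ip.of.influx.db:8086/query" -u admin:password --data-urlencode "q=SHOW DATABASES"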

Create an exporters folder inside your InfluxDB directory.

# BASH [ LINUX VM ]
mkdir -p /opt/containers/influxdb/exporters/deluge

Pull the compose file down to your docker host and then edit it to match your setup.

# BASH [ LINUX VM ]
curl https://bin.alexsguardian.net/raw/deluge2influx -o /opt/containers/influxdb/exporters/deluge/deluge2influx-compose.yml
# BASH [ LINUX VM ]
nano /opt/containers/influxdb/exporters/deluge/deluge2influx-compose.yml

When you finish editing hit CTRL+X then Y to save and close the file.

Now startup the container.

# BASH [ LINUX VM ]
docker-compose -f /opt/containers/influxdb/exporters/deluge/deluge2influx-compose.yml up -d
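
To make sure the exporter is actually shipping stats before touching Grafana, tail the container logs:

# BASH [ LINUX VM ]
docker-compose -f /opt/containers/influxdb/exporters/deluge/deluge2influx-compose.yml logs -f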

Create a new dashboard in Grafana and import this .json file. Note that this dashboard expects the data source in Grafana to be called “deluge”.

HDD Failure Update

Ok, so a few weeks ago I suffered a partial HDD failure. Basically, the HDD hosting my two Docker VMs started producing a ton of bad sectors, which caused partial corruption on the VMs. This in turn caused issues with containers that read persistent data volumes. On the plus side, this HDD was a WD Blue 500GB that was almost 7 years old before it started having issues.

Due to this failure, my entire Grafana metrics stack got corrupted, forcing me to start from scratch. Which is what I have now done:

Main Overview

Instead of having a single, super massive dashboard, I decided to create an overview dashboard that just houses alert panels pulling alerts from other dashboards. This gives me the ability to see, at a glance, what’s going on. I can then click each alert to go to its respective panel.

This is far from done. I plan on expanding my Hyper-V stats as well as my System Alerts. I also need to add my remote seedbox stats and monitoring for my Virtualized AD network and game servers.