Category: Telegraf Adv. Config

Advanced Telegraf configuration guides.

Monitoring Nvidia GPUs via Telegraf

The nvidia-smi plugin for Telegraf reports usage statistics for each Nvidia GPU in the host. The current Telegraf release at the time of writing is v1.10.4. This “guide” assumes you are using Windows as your host OS. Linux should be fairly easy to get going as long as you know where your nvidia-smi executable is located.

If you do not have Telegraf installed, check out my guides here.

Create a new conf file in the telegraf.d folder.

notepad.exe C:\telegraf\telegraf.d\nvidiasmi.conf

Paste the following into the new file and save/close it.

# Pulls statistics from nvidia GPUs attached to the host
[[inputs.nvidia_smi]]
  ## Optional: path to nvidia-smi binary, defaults to $PATH via exec.LookPath
  bin_path = "C:\\Program Files\\NVIDIA Corporation\\NVSMI\\nvidia-smi.exe"

  ## Optional: timeout for GPU polling
  timeout = "5s"

Restart Telegraf.

net stop telegraf
net start telegraf

On Windows you have to escape every backslash in bin_path with a second backslash (\\), otherwise you’ll get errors when Telegraf queries nvidia-smi.exe.
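
As an alternative to doubling every backslash, standard TOML also has single-quoted literal strings, which do no escape processing at all. The block below is a minimal sketch of the same plugin configuration using that form. It assumes the TOML parser in your Telegraf release accepts literal strings, so it is worth testing on your own install before relying on it.

```toml
# Same plugin block as above, using a TOML literal string for the path.
[[inputs.nvidia_smi]]
  ## Single quotes: backslashes are taken literally, so no doubling is needed
  bin_path = 'C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe'

  ## Optional: timeout for GPU polling
  timeout = "5s"
```

This would replace the block in nvidiasmi.conf rather than sit alongside it, since two [[inputs.nvidia_smi]] blocks mean two plugin instances.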

Once you have verified Telegraf is reporting Nvidia stats you can start creating your panels in Grafana. Use the nvidia_smi measurement from your Telegraf data source to build the panels.

Update February 2020

Telegraf recently updated its nvidia-smi plugin to return more data, which can be used to create more monitoring panels. Here is the list of fields that are now returned:

  • clocks_current_graphics
  • clocks_current_memory
  • clocks_current_sm
  • clocks_current_video
  • encoder_stats_average_fps
  • encoder_stats_average_latency
  • encoder_stats_session_count
  • fan_speed
  • memory_free
  • memory_total
  • memory_used
  • pcie_link_gen_current
  • pcie_link_width_current
  • power_draw
  • temperature_gpu
  • utilization_gpu
  • utilization_memory

The new encoder FPS and session count fields can be displayed with the new stat panel in Grafana 6.6.
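
If you only want some of these fields stored, Telegraf's generic fieldpass filter can be added to the plugin block. This is a sketch rather than a recommendation: the selection of fields is just an example, and it assumes the same Windows bin_path used earlier.

```toml
[[inputs.nvidia_smi]]
  bin_path = "C:\\Program Files\\NVIDIA Corporation\\NVSMI\\nvidia-smi.exe"
  timeout = "5s"

  ## Keep only the listed fields; glob patterns are allowed
  fieldpass = ["encoder_stats_*", "temperature_gpu", "utilization_gpu"]
```

Any field not matched by the list is dropped before it reaches the output, so leave fieldpass out entirely if you want everything.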

Monitoring Hyper-V via Telegraf

Now the cool thing about Telegraf on Windows is that you can basically monitor any system service that reports to the Windows performance counters. So creating a Hyper-V dashboard is actually fairly easy.
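
Each performance object is described by one object block under the win_perf_counters input. As a rough sketch of that shape, here is a non-Hyper-V example for the built-in Processor object. The measurement name win_cpu is an arbitrary choice for illustration, and this block is not needed for the Hyper-V dashboard.

```toml
[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
    # Performance object name, as shown in Windows Performance Monitor
    ObjectName = "Processor"
    # "*" collects every instance; objects without instances use ["------"]
    Instances = ["*"]
    # Counters to read from that object
    Counters = [
      "% Idle Time",
      "% Processor Time",
    ]
    # Measurement name the values are written under
    Measurement = "win_cpu"
```

The Hyper-V configuration follows the same pattern, with one object block per Hyper-V performance object.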

Create a new input configuration file in the telegraf.d directory:

# PowerShell
notepad C:\telegraf\telegraf.d\hyperv.conf

Paste the following into your new file, then save.

[[inputs.win_perf_counters]]
    [[inputs.win_perf_counters.object]]
    ObjectName = "Hyper-V Virtual Machine Health Summary"
    Instances = ["------"]
    Measurement = "hyperv_health"
    Counters = [
      "Health Ok",
      "Health Critical",
    ]
    
    [[inputs.win_perf_counters.object]]
    ObjectName = "Hyper-V Hypervisor"
    Instances = ["------"]
    Measurement = "hyperv_hypervisor"
    Counters = [
      "Logical Processors",
      "Partitions",
    ]

    [[inputs.win_perf_counters.object]]
    ObjectName = "Hyper-V Hypervisor Virtual Processor"
    Instances = ["*"]
    Measurement = "hyperv_processor"
    Counters = [
      "% Guest Run Time",
      "% Hypervisor Run Time",
      "% Idle Time",
      "% Total Run Time",
    ]
    
    [[inputs.win_perf_counters.object]]
    ObjectName = "Hyper-V Dynamic Memory VM"
    Instances = ["*"]
    Measurement = "hyperv_dynamic_memory"
    Counters = [
      "Current Pressure",
      "Guest Visible Physical Memory",
    ]

    [[inputs.win_perf_counters.object]]
    ObjectName = "Hyper-V VM Vid Partition"
    Instances = ["*"]
    Measurement = "hyperv_vid"
    Counters = [
      "Physical Pages Allocated",
    ]
    
    [[inputs.win_perf_counters.object]]
    ObjectName = "Hyper-V Virtual Switch"
    Instances = ["*"]
    Measurement = "hyperv_vswitch"
    Counters = [
      "Bytes Received/Sec",
      "Bytes Sent/Sec",
      "Packets Received/Sec",
      "Packets Sent/Sec",
    ]
    
    [[inputs.win_perf_counters.object]]
    ObjectName = "Hyper-V Virtual Network Adapter"
    Instances = ["*"]
    Measurement = "hyperv_vmnet"
    Counters = [
      "Bytes Received/Sec",
      "Bytes Sent/Sec",
      "Packets Received/Sec",
      "Packets Sent/Sec",
    ]
    
    [[inputs.win_perf_counters.object]]
    ObjectName = "Hyper-V Virtual IDE Controller"
    Instances = ["*"]
    Measurement = "hyperv_vmdisk"
    Counters = [
      "Read Bytes/Sec",
      "Write Bytes/Sec",
      "Read Sectors/Sec",
      "Write Sectors/Sec",
    ]
    
    [[inputs.win_perf_counters.object]]
    ObjectName = "Hyper-V Virtual Storage Device"
    Instances = ["*"]
    Measurement = "hyperv_storage"
    Counters = [
      "Write Operations/Sec",
      "Read Operations/Sec",
      "Read Bytes/Sec",
      "Write Bytes/Sec",
      "Latency",
      "Throughput",
    ]

Restart Telegraf with the new config file.

# PowerShell Administrator
net stop telegraf
net start telegraf

Import dashboard ID: 2618 into Grafana and set your data source to telegraf.

To see all Hyper-V counters you can check out this PowerShell counters export, here.

IPMI Monitoring via Telegraf

Telegraf supports IPMI inputs for monitoring via ipmitool. This will only work if your server supports the Intelligent Platform Management Interface, aka IPMI. To check if your server supports it you can either look up your server’s documentation or take a look at the UEFI/BIOS for IPMI settings. Usually you have to enable it, as it’s not enabled by default.

Getting started

Note: If you followed my guide, Grafana – Start from Scratch, you should already have IPMItool installed with Telegraf in Docker! If so you can skip right to the configuration section!

First you will need to download and install ipmitool. I am running this on a 2-core, 2 GB Ubuntu Server 17.10 VM along with a Telegraf install with [[inputs.ipmi_sensor]] enabled.

Installing IPMItool

sudo apt-get install ipmitool -y

To check that it installed correctly you can run:

ipmitool -H IP.OF.SERVER.HERE -U username -P password sensor

Configuring the IPMI Input

Install Telegraf and create a new input configuration file in the telegraf.d directory.

nano /etc/telegraf/telegraf.d/ipmi-input.conf

Paste the following into the new file and edit the servers and metric_version values to match your setup.

Note: you can have multiple IPMI inputs. Just copy the whole block and paste it again, once for each server you want to monitor.

[[inputs.ipmi_sensor]]
  path = "/usr/bin/ipmitool" # This is the default install location of ipmitool
  servers = ["USERNAME:PASSWORD@lan(IP.OF.IPMI.SERVER)"] # Format: user:password@protocol(address)
  interval = "30s"
  timeout = "20s"
  metric_version = 2 # Telegraf output schema version, 1 or 2 (an integer, no quotes)
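
The note about multiple IPMI inputs could look something like the sketch below, with one block per server. Everything specific here is a placeholder: the addresses 192.0.2.10 and 192.0.2.11, the credentials, and the choice of lan versus lanplus as the protocol, which depends on what your BMC accepts.

```toml
# First server
[[inputs.ipmi_sensor]]
  path = "/usr/bin/ipmitool"
  servers = ["USERNAME:PASSWORD@lan(192.0.2.10)"]
  interval = "30s"
  timeout = "20s"

# Second server, with its own credentials and protocol
[[inputs.ipmi_sensor]]
  path = "/usr/bin/ipmitool"
  servers = ["USERNAME:PASSWORD@lanplus(192.0.2.11)"]
  interval = "30s"
  timeout = "20s"
```

Since servers is a list, servers that share the same interval and timeout can also be added as extra entries in a single block instead.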

Save and close ipmi-input.conf and start telegraf.

sudo systemctl start telegraf.service

Adding IPMI to Grafana

Now I am going to assume you already have Telegraf reporting to InfluxDB, with an InfluxDB Telegraf data source already added to Grafana. If not, go check out the Telegraf install guide(s).

Add a single stat panel to your dashboard with the following info under Metrics:

FROM default ipmi_sensor WHERE server = IP.OF.IPMI.SERVER AND name = cpu1_temp SELECT field(value) mean() GROUP BY time(30s) fill(null)

Now the problem with IPMI is that all machines report their values differently, so one server may have it as cpu_1_temp_C and another may have it as proc1_temp_C. You’ll have to play with your queries to get the right values.

Under options set Unit to Temperature > Celsius (°C)

You should now have a singlestat panel that displays the current CPU temp every 30s. You can speed up the polling rate by editing the interval = "30s" value in ipmi-input.conf and changing time(30s) in the query to the same value.
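
For instance, polling every 10 seconds would look roughly like this. The 10s value is an arbitrary example, the servers string uses placeholder credentials and assumes the lan protocol, and lowering timeout so it stays shorter than the interval is my own assumption rather than a requirement.

```toml
[[inputs.ipmi_sensor]]
  path = "/usr/bin/ipmitool"
  servers = ["USERNAME:PASSWORD@lan(IP.OF.IPMI.SERVER)"]
  interval = "10s"  # poll every 10 seconds instead of the 30s used above
  timeout = "8s"    # assumption: kept shorter than the interval
```

In the Grafana query, GROUP BY time(30s) would then become time(10s) to match.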