Opsis + recording laptop monitoring

xfxf commented 7 years ago

This spec is split out into a super basic spec (minimum spec), and then a second spec of the tool we'd like to be using. The latter is not meant to be a full 'wishlist' spec, just the minimum of what's realistic for v1.

The problem this is meant to solve: during conferences, we occasionally have issues with HDMI2USB + laptop compatibility. We rarely get useful debug information to solve this, as the priority during the event is just to get the laptop working (which might end up meaning swapping in a different laptop, using a different dongle, etc). We need to log this information.

Secondly, we'd love for this information to be central, displayed on a single dashboard we can show in AV NOC, so we can solve problems faster. Statuses should dynamically update using JS - i.e. we can see when people plug in / unplug laptops, the current version of the firmware flashed on the device, the current mode the device is configured in, and the status of the recording machines (is voctomix running, hard drive space left, etc), with appropriate alerts occurring.

MINIMUM SPEC

Write a daemon that connects to the Opsis (via serial using HDMI2USBd, or via a TCP connection if @mithro prefers this), enables debug mode, and dumps the output with timestamps to a local file. Ensure new V4L connections to the webcam device (i.e. voctomix starting a capture) aren't interfered with by this.
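A minimal sketch of the "dump with timestamps" part of the minimum spec. It assumes the device output arrives as a line-oriented stream (a serial port or socket wrapped as a file-like object); how that stream is opened, and how debug mode is enabled, is not shown. The function name `dump_with_timestamps` and the log line layout are illustrative assumptions, not an agreed format.

```python
from datetime import datetime, timezone


def dump_with_timestamps(source, sink, now=lambda: datetime.now(timezone.utc)):
    """Copy each line from `source` to `sink`, prefixed with an ISO 8601 UTC timestamp.

    `source` is any iterable of lines (bytes or str); `sink` is any writable text file.
    Returns the number of lines written.
    """
    count = 0
    for raw in source:
        # Serial ports yield bytes; decode leniently so line noise cannot crash the logger.
        line = raw.decode("utf-8", errors="replace") if isinstance(raw, bytes) else raw
        sink.write(f"{now().isoformat()} {line.rstrip()}\n")
        sink.flush()  # keep the file current in case the machine loses power
        count += 1
    return count
```

Taking the stream as a parameter keeps the logging independent of the transport, so the same function would work whether the connection ends up being serial, hdmi2usbd, or TCP.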

SECOND SPEC

Daemon should be able to parse events (list events we care about below? including getting current firmware info + status of input/outputs/encoder), and understand what's happening.
Daemon should push these events via REST back to an app running on a central AV server (with an appropriate standardised scheme that describes the source machine, etc).
A simple dashboard should be developed that reads these events and displays them.
Dashboard should understand various 'states' of components, so if an error/warning condition occurs, it can appropriately display this on the dashboard.
Dashboard should be extended to report things such as hard drive space, status of various voctomix components, machine reboot events, etc (list other things on the machine we care about as a minimum).
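One possible shape for the event scheme and the REST push, as a sketch only. Every field name in the envelope (`timestamp`, `source`, `room`, `kind`, `detail`) is an assumption made for illustration, and `url` is a placeholder for whatever endpoint the central app ends up exposing. Only the Python standard library is used.

```python
import json
import shutil
import socket
import urllib.request
from datetime import datetime, timezone


def make_event(kind, detail, room, source=None, when=None):
    """Build one event in a hypothetical envelope; all field names are assumptions."""
    return {
        "timestamp": (when or datetime.now(timezone.utc)).isoformat(),
        "source": source or socket.gethostname(),  # which recording machine sent it
        "room": room,
        "kind": kind,      # hypothetical event type label
        "detail": detail,  # free-form, event-specific payload
    }


def disk_space_event(room, path="/", **kwargs):
    """Report free space on the filesystem holding `path` as a percentage."""
    usage = shutil.disk_usage(path)
    free_pct = round(100 * usage.free / usage.total, 1)
    return make_event("disk_space", {"path": path, "free_percent": free_pct}, room, **kwargs)


def push_event(event, url, timeout=5):
    """POST the event as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Carrying the source machine and room in every event means the dashboard never has to infer where an event came from, which also covers boards moving between rooms.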

mithro commented 7 years ago

I don't really care how this is done.

I want to know about things like:

When the inputs of the HDMI change state in any way
When the boards are restarted / power cycled
When the version changed (i.e. the board was flashed)
If they moved from one location or role to another

Basically similar to all the things that the paper sheet on the Opsis is currently used for but more detailed info.

If someone says "we had issues with the opsis in room XYZ all day today" I want to be able to go and look at what was happening.

xfxf commented 7 years ago

@mithro to do this well I'd want the HDMI2USB to emit events when things change, which can be parsed, otherwise we have to poll 'status' repeatedly. Do you agree / disagree? (Assume this means somebody needs to modify the firmware...)

mithro commented 7 years ago

I have no problem with polling frequently. Less likely to fail because an event is missed.
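With polling, changes are detected by comparing each status snapshot with the previous one. A minimal sketch, assuming each poll has already been turned into a flat dict; the keys used in the example are hypothetical, not real firmware fields.

```python
def diff_status(previous, current):
    """Return (key, old, new) for every key whose value differs between two polls."""
    changes = []
    for key in sorted(set(previous) | set(current)):
        old, new = previous.get(key), current.get(key)
        if old != new:
            changes.append((key, old, new))
    return changes


# Hypothetical snapshots from two consecutive polls.
before = {"firmware": "v1", "input0": "connected"}
after = {"firmware": "v2", "input0": "connected", "input1": "connected"}
print(diff_status(before, after))
# [('firmware', 'v1', 'v2'), ('input1', None, 'connected')]
```

The trade-off is that a state which changes and reverts between two polls is never seen, so the polling interval bounds the shortest event that can be logged.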

xfxf commented 7 years ago

@mithro OK. polling every 5-10 seconds isn't expected to cause any issues?

also, what is your preferred mechanism to reliably talk to the device? serial directly, serial via hdmi2usbd (unsure of current status), or via telnet? if the latter, won't we need to compile a separate firmware per device if IP addrs etc are hardcoded currently?

mithro commented 7 years ago

Doing telnet requires getting a bunch of things working.

The connection system doesn't really matter all that much and should be pretty easy to adapt to whatever.

xfxf commented 7 years ago

@mithro I ask because of the flow control issues + the fact that opening a serial connection to the device before V4L on the firmware we used at LCA / Pycon AU 2016 causes V4L not to work, so this is more complicated... or are these fixed now?

mithro commented 7 years ago

Nope, not fixed.

Just connect, probe the status, disconnect. Wait 30 seconds, repeat. The possible blocking time is then only as long as the serial port is open for each poll.
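The connect / probe / disconnect cycle could be sketched like this. `open_port` and `probe` are hypothetical callables supplied by the caller: `open_port` returns anything with a `close()` method (for example a function wrapping pyserial's `serial.Serial` with a read timeout), and `probe` sends the status command and returns the result. Neither is defined by the thread.

```python
import time
from contextlib import closing


def poll_once(open_port, probe):
    """Open the port, run one probe, and always close the port again."""
    with closing(open_port()) as port:
        return probe(port)


def poll_forever(open_port, probe, handle, interval=30, sleep=time.sleep, max_polls=None):
    """Repeat poll_once every `interval` seconds; a failed poll is reported, not fatal."""
    polls = 0
    while max_polls is None or polls < max_polls:
        try:
            handle(poll_once(open_port, probe))
        except Exception as exc:  # a failed poll must not kill the daemon
            handle({"error": repr(exc)})
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval)
```

Because `closing` releases the port even when the probe raises, the port is held only for the duration of one probe, provided the probe itself uses a read timeout and cannot hang.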

xfxf commented 7 years ago

@mithro OK, is it possible to have you prioritise fixing this issue (specifically V4L not working if a serial connection is opened first)? we'll try to make our code defensive but I can absolutely foresee an issue in production where polling hangs + causes a voctomix element not to start up properly for a volunteer.

mithro commented 7 years ago

I can't see the problem being fixed this hackfest. It requires too many other things to be finished first and we don't have enough people to work on it.

xfxf commented 7 years ago

OK, noted. I can just foresee this polling creating headaches in prod that we didn't have before (as we don't normally have a serial connection open to the device).

xfxf commented 7 years ago

Plan is to use hdmi2usbd, with a Python client connecting to it that parses status / debug messages and then talks back to a generic monitoring suite. We shouldn't need to build the logging/monitoring side of it, purely write a client that handles/normalises the Opsis monitoring.

@joeladdison had some ideas on suites/libraries we could use for this.
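A sketch of the normalising step in such a client. The real format of the status / debug output is not described in this thread, so this assumes, purely for illustration, that it consists of `key: value` lines; the parsing would need to be adapted to whatever the firmware actually prints.

```python
def parse_status(text):
    """Parse 'key: value' lines into a dict with normalised keys.

    ASSUMPTION: the line format is invented for this sketch. Lines without a
    colon, or with an empty key, are skipped rather than treated as errors.
    """
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep or not key.strip():
            continue
        # Lower-case and underscore the key so downstream tools see stable names.
        status[key.strip().lower().replace(" ", "_")] = value.strip()
    return status
```

Returning a flat dict keeps the client independent of the monitoring suite: the same output can be logged, compared between polls, or exported as metrics.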

joeladdison commented 7 years ago

I'm thinking of using a time series database and dumping lots of stuff into it. We can then set up graphs pretty quickly from that without much work.

The two main choices for time series databases that I have found are:

Prometheus (https://prometheus.io)
InfluxDB (https://www.influxdata.com/)

Both of these can be used with Grafana (http://grafana.org/) for graphical display of the data.

A big difference between the two is how data gets there. Prometheus prefers to poll sources for data, while Influx normally has data pushed to it. You can do both push and pull for each of them using additional tools.
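To make the pull/push difference concrete: Prometheus scrapes samples in its text exposition format from an HTTP endpoint the client serves, while InfluxDB accepts points written to it in line protocol. The sketch below formats one sample each way. It handles only float values and simple escaping, and the metric, tag, and label names passed in are up to the caller; none are defined by this thread.

```python
def _esc(text, chars):
    """Backslash-escape each character in `chars` (InfluxDB line protocol style)."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def prometheus_line(name, labels, value):
    """One sample in the Prometheus text exposition format (pulled by the server)."""
    def quote(v):
        return str(v).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    label_str = ",".join(f'{k}="{quote(v)}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {float(value)}"


def influx_line(measurement, tags, fields, timestamp_ns):
    """One point in InfluxDB line protocol (pushed by the client); float fields only."""
    tag_str = "".join(
        f",{_esc(str(k), ',= ')}={_esc(str(v), ',= ')}" for k, v in sorted(tags.items())
    )
    field_str = ",".join(
        f"{_esc(str(k), ',= ')}={float(v)}" for k, v in sorted(fields.items())
    )
    return f"{_esc(measurement, ', ')}{tag_str} {field_str} {int(timestamp_ns)}"
```

Note that the InfluxDB point carries its own timestamp, whereas the Prometheus sample is normally stamped at scrape time, which matters if events are buffered before being sent.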

A tool we might find useful is mtail (https://github.com/google/mtail). It takes application logs and makes them useable for time series databases.

For later - https://finestructure.co/blog/2016/5/16/monitoring-with-prometheus-grafana-docker-part-1

mithro commented 7 years ago

I'm actually happy just having the logs with timestamps and host information at the moment. Graphs and stuff can come later.
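Logs with timestamps and host information need nothing beyond Python's standard `logging` module. A minimal sketch; the logger name prefix `opsis-monitor` and the line layout are assumptions.

```python
import logging
import socket


def make_logger(path, host=None):
    """Return a logger whose lines carry a timestamp and the host name."""
    logger = logging.getLogger(f"opsis-monitor.{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # write only to our own file handler
    if not logger.handlers:  # avoid duplicate lines if called twice for the same path
        handler = logging.FileHandler(path)
        handler.setFormatter(logging.Formatter(
            f"%(asctime)s {host or socket.gethostname()} %(levelname)s %(message)s"
        ))
        logger.addHandler(handler)
    return logger
```

Putting the host name into every line keeps the logs usable after files from several recording machines are collected in one place.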
