[tech] Tech/Wheel Meeting 2021-11-14 14:00 - One week reminder

root root at ucc.gu.uwa.edu.au
Sun Nov 7 14:00:00 AWST 2021


Tech/Wheel Meeting Agenda - Sunday 2021-11-14 14:00
===================================================
- VENUE: UCC Clubroom
- [BOB] was 'ere
  - and online at https://meetings.ucc.asn.au/b/tech

*Meeting opened hh:mm*

## Attendance
- Present
- Apologies
- Absent

## Next meeting
- Schedule next meeting
  - *day 202Y-MM-ddTHH:mm
- ACTION: [???] shall be this meeting's secretary! This entails:
  - Copying the following checklist into a new issue under [[https://gitlab.ucc.asn.au/UCC/tech-todo-list/-/issues]], and assigning it to yourself.
    This is to keep track of any async secretarial duties detailed ahead. See our new Action Items section below.
  - [ ] Set and verify reminders of next meeting: `motsugo# crontab -e`
  - [ ] Promptly update agenda.next with the TIME/DATE/VENUE
  - [ ] Perform initial curation of agenda.next, and move any longstanding action items out of it and into GitLab (see Action Items section below).
  - [ ] Check at T-7days that the notice really went out, fix for T-4days if needed
- [ ] Everyone, before next meeting: Curate agenda.next, and move any items you think should be tracked as GitLab issues into GitLab issues, as above.
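
The T-7 and T-4 reminder dates in the checklist above can be derived from the meeting time. A minimal Python sketch, assuming reminders fire at the same time of day as the meeting; the function name `reminder_cron_lines` and the command path `/usr/local/bin/send-reminder` are hypothetical and are not taken from motsugo's crontab:

```python
from datetime import datetime, timedelta


def reminder_cron_lines(meeting, days_before=(7, 4),
                        command="/usr/local/bin/send-reminder"):
    """Return one crontab line per reminder, each `days_before` days ahead of `meeting`.

    `command` is a placeholder; substitute whatever actually sends the notice.
    """
    lines = []
    for days in days_before:
        when = meeting - timedelta(days=days)
        # crontab fields: minute hour day-of-month month day-of-week
        lines.append(f"{when.minute} {when.hour} {when.day} {when.month} * {command}")
    return lines


if __name__ == "__main__":
    for line in reminder_cron_lines(datetime(2021, 11, 14, 14, 0)):
        print(line)
```

Entries written this way would fire again on the same date every year, so they need removing or updating once the meeting has passed.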

## Optional items - choose at the start of the meeting
- Ethical guidelines
- Monitoring
- Backups
- Password rotations
- New members
  - [BRD] nominated for wheel. ACTION: [TEC] to raise this with committee
- Quick check of ChangeLog
- Lessons learnt
  - 2021-10-04 magikarp's SD card wasn't so secure after all...
    - Filesystem went readonly on 2021-10-04 and [BOB] tried to run `fsck` to no avail
      - when exactly did it fail? Noticed as greyed out in the web UI at 19:35, but
        VMs were still running until the reboot test at about 20:00
    - [333] More details to be added to agenda as they come...
      - [333]'s editor is leaving `foobar~` backup files about the place
    - No spare SD card on hand (they're cheap!)
      - Nearest replacement was a USB thumb key, slightly smaller so `dd` isn't a direct option
    - with one VM host down, ceph was over-capacity and could not meet goals
      - TODO: add NVMe to legacy hosts
      - TODO: bring machop online
  - 2021-10-05T0318 Power outage
    - ssh.ucc.asn.au
      - auth failures triggered fail2ban?
    - samson
      - manual, post-reboot `mount -av`
      - manual, post-reboot `systemctl restart samba-ad-dc.service`
      - samson RADIUS dead? -> broken wifi auth, IPSec VPN
    - portal
      - https://portal.ucc.asn.au/ was `403 Forbidden`-ing
      - `uccportal# mount -av`
        - standardise/document/expose www -> hostname mappings? DocumentRoot?
        - Cloudflare -> F5 -> mussel/mailauesi proxy config?
          - https://wiki.ucc.asn.au/TheCloudflarening
        - portal, bbb, gitlab, uccmonitor, element+matrix, wiki, www ...
    - mailfish
      - manual, post-reboot `mount -av` (try autofs?)
    - motsugo
      - md0 scrubbed? or rebuilt? more than once recently, but new spare SSD /dev/sdh not yet in use
    - mollitz
      - some long-running and failed backups: away, motsugo

## Current Action Items
- We'll start maintaining them in GitLab at [[https://gitlab.ucc.asn.au/UCC/tech-todo-list/-/issues/]]
- Briefly discuss anything in here that's worth discussing, but don't spend too long rehashing unresolved issues that have already been discussed ;)

## Known Broken Stuff
- [BRD] `universitycomputer.club.passwd.org` vs `*Everything.html`
- IPv6 inbound
  - ACTION: [TEC] to email UWA IT
- lard
  - Still needs a spare PSU OR replacement with something less... fatty.
  - ACTION: [???] to send email out requesting a 1U Cisco switch to replace Lard
- ACTION: [MTL] to update Ansible scripts for mail*
  - ACTION: [DBA] wants to give it a shot, good reason to try out Proxmox
- samson the https://wiki.ucc.asn.au/ActiveDirectory server has no freshly built DC friends
  - this is risky, a single-point-of-failure, which in turn depends on the running VM cluster
  - something to do with the current configuration is probably why mussel
    and mooneye still have auth problems
    - can we upgrade or rebuild or document our way out of this?
  - ...so making a quick clone and calling it "done" really isn't enough; is continuous integration called for?
  - vucc testbed in https://wiki.ucc.asn.au/NewActiveDirectory
- mollitz is missing prometheus-node-exporter since the rebuild, months ago?
  - [NTU] anyone want a hand with a https://gitlab.ucc.asn.au/ucc-systems/ansiblemonitoring run ?
  - can we use the DebianPkg:prometheus-node-exporter/stable where possible?
- motsugo i801_smbus spam
  - [BOB] I think we should change the bios battery before we go down any other rabbit holes
    - @2021-10-08 I've rebooted the BMC, let's see if that fixes things
      - the fact that the BMC thinks it's 2007 is rather telling

## Matters arising previously

## Extra items (rename/refile as appropriate)
- machop: new EPYC box from Michael/Wings

- Monitoring
  - Drive health
    - Uncorrectable errors, reallocated sectors, TBW on SSDs, temperature
    - ACTION: [NTU] and [MTL] to work together on how best to start drive monitoring, and make it standard/SOE config via ansible
      - proof-of-concept tested
        - https://matrix.to/#/!zAfheZzGazlYUQqAeJ:ucc.asn.au/$HuKyvV8eVoTXKah1Ua3hwR9jWyodlIt2P1iO4upAPmE
      - `/etc/cron.d/node_prometheus-SMART-export`: `*/5 * * * * root /usr/local/bin/smartmon.sh > /var/lib/node_exporter/smart_metrics.prom`
      - `-rwxr-xr-x 1 nick wheel 11287 Sep 27 19:49 /usr/local/bin/smartmon.sh`
      - http://uccmonitor.ucc.asn.au:3000/d/PkPI4xGWz/s-m-a-r-t-dashboard
      - both the script and dashboard probably need a bit of a rewrite, so take it as inspiration?
      - The SSDs and HDDs report with temperatures in different SMART metrics; and the models are exposed in slightly different text strings... so it might be better if the `smartmon.sh` did some normalisation
        - Alternatively, can do some normalisation in the Prometheus query - so that's working
          - was graphing: `label_replace(avg(smartmon_temperature_celsius_raw_value{ instance=~"$instance", disk=~"$disk" }) by (instance, disk, name), "instance", "$1", "instance", "([^.]+).*")`
          - now graphing: `... smartmon_temperature_celsius_raw_value{ ... } or smartmon_airflow_temperature_cel_raw_value{ ... }`
          - so if the first fails, it uses the second... though it would be nice if (both existed) then {use the maximum}
      - ACTION: [???] ansible-ise and roll out more? and/or do some rewriting/tweaking
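
The "if both exist, use the maximum" idea above could also be done outside the dashboard query. A minimal sketch that reads the text written to `smart_metrics.prom` and keeps the highest temperature per disk. The two metric names and the `disk` label come from the queries above; the function name `max_temperature_by_disk` and the simplified line parsing are assumptions, and label values containing `}` are not handled:

```python
import re

# Metric names taken from the dashboard queries above
TEMPERATURE_METRICS = (
    "smartmon_temperature_celsius_raw_value",
    "smartmon_airflow_temperature_cel_raw_value",
)

# Simplified match for one sample line: name{labels} value
SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def max_temperature_by_disk(prom_text):
    """Map each `disk` label to the highest value seen across the temperature metrics."""
    temps = {}
    for line in prom_text.splitlines():
        match = SAMPLE.match(line)
        if not match or match.group("name") not in TEMPERATURE_METRICS:
            continue  # comment, other metric, or unparseable line
        labels = dict(LABEL.findall(match.group("labels")))
        disk = labels.get("disk")
        if disk is None:
            continue
        value = float(match.group("value"))
        temps[disk] = max(value, temps.get(disk, value))
    return temps
```

An untested PromQL alternative might be `max by (instance, disk) ({__name__=~"smartmon_(temperature_celsius|airflow_temperature_cel)_raw_value"})`, which would keep the normalisation in the dashboard rather than in the script.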

- [MTL] continues looking at DNS and CI setup
  - Not much progress
  - Have played with coredns for a resolving server
    - Need to do some more testing of resolving internal UWA things (to check behaviours)
  - Working on ansible to set up a primary DNS server
  - Done a little bit of playing with GitLab CI
  - Need to finalise working out the best way to do this securely
    - split out ucc.machines from zonemake.py code

- Group Policy and Ansible on Windows machines
  - ACTION: [333] to figure out most supported way to install official SSHD build on Windows
  - ACTION: [MTL] promises to look at this in more detail once back in the clubroom, including WinRM
  - Best host to run playbooks from for the Windows machines?

- Post-O-Day account locking
  - cleanup accounts e.g. `getent passwd|grep zv`, primary group memberships
  - Fall out and thoughts from account locking
    - not bad!
    - on typical schedule: warnings due before O-Day, lockings due after the AGM
    - online payment options limited (bank transfer still works)
    - time to zip/rm some old home and away directories to save space
      - for backups: every byte removed saves 3 (more like 4,5,6!)
      - to move more directories onto SSD

- Staging storage server:
  - [TEC] Old DELL R710 server[s] from dadams
  - Store images or less selective backups onsite, for rapid recovery or offsite replication
    - zfs send? btrfs send? borgbackup? expose to https://pbs.proxmox.com/ appliance?
  - Want some extra caddies: 3.5" slots, 3.5" + 2.5" SATA drives
    - https://discord.com/channels/264401248676085760/264401248676085760/878831917133353031
    - https://discord.com/channels/264401248676085760/264401248676085760/879354657976229958
    - 3D print? does [DBA] or anyone else at UWA Makers have the model?
      - 2021-09-07 update: .STL parts here: https://discord.com/channels/264401248676085760/264411219627212801/884778548156563517
        - ACTION: [???] print a couple?
      - Currently just for 3.5" drives in 3.5" slots?
        - ACTION: [???] tweak it for a 2.5" drive in a 3.5" slot?
      - a few similar ones:
        - https://www.thingiverse.com/search?q=dell+r710
        - https://www.yeggi.com/q/dell+hard+drive+caddy/
    - or ebay?

- [MPT] Began (unofficial) discussions with [DBA] and CS faculty about making GPU compute accessible to students
  - Potential for funding? No assurances yet
  - What else would UCC need to buy/build to make it happen in our MR?
  - Plan on 2021-07-18.txt to get moving on `loveday` upgrades - wait for this instead?
    - or test with existing hardware?

- ...so if `loveday` doesn't have upgrade quotes yet, how about `medico` -> `machops`?
  - https://discord.com/channels/264401248676085760/264411219627212801/883522265466146869
  - https://docs.google.com/spreadsheets/d/1mbszgk9T7FU0jGXrdTKXXLzW62vuOvqG3xZ-x9CpALE/edit?usp=sharing

- Build and break a PC 2021-04-20, followup
  - Brand new motherboard missing audio capacitor, but [DBA] will resolder it
    - ACTION: [DBA] to resolder audio capacitor on new motherboard

- ACTION: [MTL] to update Ansible scripts for mail*
  - In response to spam campaign

- Rebuild rather than upgrade `discord-irc` ?
  - ansible driven install
    - config files:
      - `~discord/discord-irc-config.json`
      - `/etc/systemd/system/discord-irc.service`
    - this machine is a non-complicated test case?
  - https://github.com/reactiflux/discord-irc
    - requires Debian 11 "bullseye", DebianPkg:nodejs 12.x
  - occasionally dies, config tweaks could help?
    - https://github.com/reactiflux/discord-irc/issues/594
    - `journalctl -xe -u discord-irc.service`
```
Aug 31 17:45:40 discord-irc discord-irc[45040]: TypeError: Converting circular structure to JSON
Aug 31 17:45:40 discord-irc systemd[1]: discord-irc.service: Main process exited, code=exited, status=1/FAILURE
Aug 31 17:45:40 discord-irc systemd[1]: discord-irc.service: Start request repeated too quickly.
```

*Meeting closed hh:mm*

----

```
# https://demo.hedgedoc.org/Hlsapf47RsqpgIjqLVfMUw
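cd /home/wheel/docs/meetings
# Export the pad to a dated file, then stage it: `git commit -a` alone skips new, untracked files
today=$(date +%Y-%m-%d)
HEDGEDOC_SERVER=https://demo.hedgedoc.org /home/wheel/bin/hedgedoc export --md Hlsapf47RsqpgIjqLVfMUw "./$today.txt"
git add "./$today.txt"
git commit -am "Tech meeting minutes $today"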
```

<!-- vim: tabstop=2 shiftwidth=2 expandtab
-->
<!-- Local Variables: -->
<!-- tab-width: 2 -->
<!-- End: -->

