[tech] Tech/Wheel Meeting 2021-11-14 14:00 - One week reminder
root
root at ucc.gu.uwa.edu.au
Sun Nov 7 14:00:00 AWST 2021
Tech/Wheel Meeting Agenda - Sunday 2021-11-14 14:00
===================================================
- VENUE: UCC Clubroom
- [BOB] was 'ere
- and online at https://meetings.ucc.asn.au/b/tech
*Meeting opened hh:mm*
## Attendance
- Present
- Apologies
- Absent
## Next meeting
- Schedule next meeting
- *day 202Y-MM-ddTHH:mm
- ACTION: [???] shall be this meeting's secretary! This entails:
- Copying the following checklist into a new issue under [[https://gitlab.ucc.asn.au/UCC/tech-todo-list/-/issues]], and assigning it to yourself.
This is to keep track of any async secretarial duties detailed ahead. See our new Action Items section below.
- [ ] Set and verify reminders of next meeting: `motsugo# crontab -e`
- [ ] Promptly update agenda.next with the TIME/DATE/VENUE
- [ ] Perform initial curation of agenda.next, and move any longstanding action items out of it and into GitLab (see Action Items section below).
- [ ] Check at T-7days that the notice really went out, fix for T-4days if needed
- [ ] Everyone, before next meeting: Curate agenda.next, and move any items you think should be tracked as GitLab issues into GitLab issues, as above.
## Optional items - choose at the start of the meeting
- Ethical guidelines
- Monitoring
- Backups
- Password rotations
- New members
- [BRD] nominated for wheel. ACTION: [TEC] to raise this with committee
- Quick check of ChangeLog
- Lessons learnt
- 2021-10-04 magikarp's SD card wasn't so secure after all...
- Filesystem went read-only on 2021-10-04 and [BOB] tried to run `fsck` to no avail
- when exactly did it fail? Noticed as greyed out in the web UI at 19:35, but VMs were still running until the reboot test at about 20:00
- [333] More details to be added to agenda as they come...
- [333]'s editor is leaving `foobar~` backup files about the place
- No spare SD card on hand (they're cheap!)
- Nearest replacement was a USB thumb key, slightly smaller so `dd` isn't a direct option
- with one VM host down, ceph was over-capacity and could not meet goals
- TODO: add NVMe to legacy hosts
- TODO: bring machop online
- 2021-10-05T0318 Power outage
- ssh.ucc.asn.au
- auth failures triggered fail2ban?
- samson
- manual, post-reboot `mount -av`
- manual, post-reboot `systemctl restart samba-ad-dc.service`
- samson RADIUS dead? -> broken wifi auth, IPSec VPN
- portal
- https://portal.ucc.asn.au/ was `403 Forbidden`-ing
- `uccportal# mount -av`
- standardise/document/expose www -> hostname mappings? DocumentRoot?
- Cloudflare -> F5 -> mussel/mailauesi proxy config?
- https://wiki.ucc.asn.au/TheCloudflarening
- portal, bbb, gitlab, uccmonitor, element+matrix, wiki, www ...
- mailfish
- manual, post-reboot `mount -av` (try autofs?)
- motsugo
- md0 scrubbed? or rebuilt? more than once recently, but new spare SSD /dev/sdh not yet in use
- mollitz
- some long-running and failed backups: away, motsugo
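The repeated manual, post-reboot `mount -av` steps above (samson, uccportal, mailfish) could be caught by a small check run after boot or from monitoring. The following is only a sketch, not something currently deployed: it compares mount points listed in fstab-style text against `/proc/mounts`-style text and reports the ones that are missing. The function names, the `nas:` exports and the sample tables are hypothetical, and the logic assumes standard whitespace-separated fstab fields.

```python
def parse_mount_table(table_text):
    """Return (mount_point, fs_type, options) tuples from fstab-style text."""
    entries = []
    for line in table_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Need at least: device, mount point, type, options.
        if len(fields) < 4:
            continue
        entries.append((fields[1], fields[2], fields[3].split(",")))
    return entries


def missing_mounts(fstab_text, mounts_text):
    """List fstab mount points that do not appear in the mounted table."""
    mounted = {point for point, _type, _opts in parse_mount_table(mounts_text)}
    missing = []
    for point, fs_type, options in parse_mount_table(fstab_text):
        # Swap and "noauto" entries are not expected to be mounted at boot.
        if fs_type == "swap" or "noauto" in options:
            continue
        if point not in mounted:
            missing.append(point)
    return missing


# Hypothetical sample data for illustration only (not a real UCC host).
SAMPLE_FSTAB = """
# /etc/fstab sample
UUID=abcd / ext4 errors=remount-ro 0 1
UUID=ef01 none swap sw 0 0
nas:/home /home nfs defaults 0 0
nas:/away /away nfs noauto 0 0
"""

SAMPLE_MOUNTS = """
/dev/sda1 / ext4 rw,relatime 0 0
proc /proc proc rw 0 0
"""

print(missing_mounts(SAMPLE_FSTAB, SAMPLE_MOUNTS))
```

On a real host the two arguments would be the contents of `/etc/fstab` and `/proc/mounts`. A non-empty result is the cue to run `mount -av` or to look at why the mount failed at boot.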
## Current Action Items
- We'll start maintaining them in GitLab at [[https://gitlab.ucc.asn.au/UCC/tech-todo-list/-/issues/]]
- Briefly discuss anything in here that's worth discussing, but don't spend too long rehashing unresolved issues that have already been discussed ;)
## Known Broken Stuff
- [BRD] `universitycomputer.club.passwd.org` vs `*Everything.html`
- IPv6 inbound
- ACTION: [TEC] to email UWA IT
- lard
- Still needs a spare PSU OR replacement with something less... fatty.
- ACTION: [???] to send email out requesting a 1U Cisco switch to replace Lard
- ACTION: [MTL] to update Ansible scripts for mail*
- ACTION: [DBA] wants to give it a shot, good reason to try out Proxmox
- samson, the https://wiki.ucc.asn.au/ActiveDirectory server, has no freshly built DC friends
- this is risky, a single-point-of-failure, which in turn depends on the running VM cluster
- something to do with the current configuration is probably why mussel and mooneye still have auth problems
- can we upgrade or rebuild or document our way out of this?
- ...so making a quick clone and calling it "done" really isn't enough, continuous integration is called for?
- vucc testbed in https://wiki.ucc.asn.au/NewActiveDirectory
- mollitz is missing prometheus-node-exporter since the rebuild, months ago?
- [NTU] anyone want a hand with a https://gitlab.ucc.asn.au/ucc-systems/ansiblemonitoring run ?
- can we use the DebianPkg:prometheus-node-exporter/stable where possible?
- motsugo i801_smbus spam
- [BOB] I think we should change the bios battery before we go down any other rabbit holes
- @2021-10-08 I've rebooted the BMC, let's see if that fixes things
- the fact that the BMC thinks it's 2007 is rather telling
## Matters arising previously
## Extra items (rename/refile as appropriate)
- machop: new EPYC box from Michael/Wings
- Monitoring
- Drive health
- Uncorrectable errors, reallocated sectors, TBW on SSDs, temperature
- ACTION: [NTU] and [MTL] to work together on how best to start drive monitoring, and make it standard/SOE config via ansible
- proof-of-concept tested
- https://matrix.to/#/!zAfheZzGazlYUQqAeJ:ucc.asn.au/$HuKyvV8eVoTXKah1Ua3hwR9jWyodlIt2P1iO4upAPmE
- `/etc/cron.d/node_prometheus-SMART-export`: `*/5 * * * * root /usr/local/bin/smartmon.sh > /var/lib/node_exporter/smart_metrics.prom`
- `-rwxr-xr-x 1 nick wheel 11287 Sep 27 19:49 /usr/local/bin/smartmon.sh`
- http://uccmonitor.ucc.asn.au:3000/d/PkPI4xGWz/s-m-a-r-t-dashboard
- both the script and dashboard probably need a bit of a rewrite, so take it as inspiration?
- The SSDs and HDDs report temperatures in different SMART metrics, and the models are exposed in slightly different text strings... so it might be better if `smartmon.sh` did some normalisation
- Alternatively, can do some normalisation in the Prometheus query - so that's working
- was graphing: `label_replace(avg(smartmon_temperature_celsius_raw_value{ instance=~"$instance", disk=~"$disk" }) by (instance, disk, name), "instance", "$1", "instance", "([^.]+).*")`
- now graphing: `... smartmon_temperature_celsius_raw_value{ ... } or smartmon_airflow_temperature_cel_raw_value{ ... }`
- so if the first metric is absent, it falls back to the second... though it would be nice if, when both exist, it used the maximum
- ACTION: [???] ansible-ise and roll out more? and/or do some rewriting/tweaking
- [MTL] continues looking at DNS and CI setup
- Not much progress
- Have played with coredns for a resolving server
- Need to do some more testing of resolving internal UWA things (to check behaviours)
- Working on ansible to set up a primary DNS server
- Done a little bit of playing with Gitlab CI
- Need to finalise working out the best way to do this securely
- split out ucc.machines from zonemake.py code
- Group Policy and Ansible on Windows machines
- ACTION: [333] to figure out most supported way to install official SSHD build on Windows
- ACTION: [MTL] promises to look at this in more detail once back in the clubroom, including WinRM
- Best host to run playbooks from for the Windows machines?
- Post-O-Day account locking
- cleanup accounts e.g. `getent passwd|grep zv`, primary group memberships
- Fall out and thoughts from account locking
- not bad!
- on typical schedule: warnings due before O-Day, lockings due after the AGM
- online payment options limited (bank transfer still works)
- time to zip/rm some old home and away directories to save space
- for backups: every byte removed saves 3 (more like 4,5,6!)
- to move more directories onto SSD
- Staging storage server:
- [TEC] Old DELL R710 server[s] from dadams
- Store images or less selective backups onsite, for rapid recovery or offsite replication
- zfs send? btrfs send? borgbackup? expose to https://pbs.proxmox.com/ appliance?
- Want some extra caddies: 3.5" slots, 3.5" + 2.5" SATA drives
- https://discord.com/channels/264401248676085760/264401248676085760/878831917133353031
- https://discord.com/channels/264401248676085760/264401248676085760/879354657976229958
- 3D print? does [DBA] or anyone else at UWA Makers have the model?
- 2021-09-07 update: .STL parts here: https://discord.com/channels/264401248676085760/264411219627212801/884778548156563517
- ACTION: [???] print a couple?
- Currently just for 3.5" drives in 3.5" slots?
- ACTION: [???] tweak it for a 2.5" drive in a 3.5" slot?
- a few similar ones:
- https://www.thingiverse.com/search?q=dell+r710
- https://www.yeggi.com/q/dell+hard+drive+caddy/
- or ebay?
- [MPT] Began (unofficial) discussions with [DBA] and CS faculty about making GPU compute accessible to students
- Potential for funding? No assurances yet
- What else would UCC need to buy/build to make it happen in our MR?
- Plan on 2021-07-18.txt to get moving on `loveday` upgrades - wait for this instead?
- or test with existing hardware?
- ...so if `loveday` doesn't have upgrade quotes yet, how about `medico` -> `machops`?
- https://discord.com/channels/264401248676085760/264411219627212801/883522265466146869
- https://docs.google.com/spreadsheets/d/1mbszgk9T7FU0jGXrdTKXXLzW62vuOvqG3xZ-x9CpALE/edit?usp=sharing
- Build and break a PC 2021-04-20, followup
- Brand new motherboard missing audio capacitor, but [DBA] will resolder it
- ACTION: [DBA] to resolder audio capacitor on new motherboard
- ACTION: [MTL] to update Ansible scripts for mail*
- In response to spam campaign
- Rebuild rather than upgrade `discord-irc` ?
- ansible driven install
- config files:
- `~discord/discord-irc-config.json`
- `/etc/systemd/system/discord-irc.service`
- this machine is a non-complicated test case?
- https://github.com/reactiflux/discord-irc
- requires Debian 11 "bullseye", DebianPkg:nodejs 12.x
- occasionally dies, config tweaks could help?
- https://github.com/reactiflux/discord-irc/issues/594
- `journalctl -xe -u discord-irc.service`
```
Aug 31 17:45:40 discord-irc discord-irc[45040]: TypeError: Converting circular structure to JSON
Aug 31 17:45:40 discord-irc systemd[1]: discord-irc.service: Main process exited, code=exited, status=1/FAILURE
Aug 31 17:45:40 discord-irc systemd[1]: discord-irc.service: Start request repeated too quickly.
```
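
For the drive-temperature item under Monitoring above, one way to get "use the maximum when both metrics exist" is to normalise before graphing. This is a sketch only, not the current `smartmon.sh` behaviour: it reads textfile-collector output and keeps the highest reading per disk across the two metric names quoted in the notes. The function name, the `disk` label format and the sample lines are assumptions about what the script writes to `smart_metrics.prom`.

```python
import re

# The two metric names quoted in the notes; SSDs and HDDs may expose either.
TEMPERATURE_METRICS = (
    "smartmon_temperature_celsius_raw_value",
    "smartmon_airflow_temperature_cel_raw_value",
)

# Assumed line shape: metric_name{label="value",...} number
SAMPLE_LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)\s*$')
DISK_LABEL = re.compile(r'disk="([^"]*)"')


def max_temperature_by_disk(prom_text):
    """Return {disk: highest temperature} across both temperature metrics."""
    hottest = {}
    for line in prom_text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        # Ignore comments, other metrics and anything unparseable.
        if not match or match.group(1) not in TEMPERATURE_METRICS:
            continue
        disk = DISK_LABEL.search(match.group(2))
        if disk is None:
            continue
        try:
            value = float(match.group(3))
        except ValueError:
            continue
        name = disk.group(1)
        hottest[name] = max(value, hottest.get(name, value))
    return hottest


# Hypothetical sample in the assumed format (values are made up).
SAMPLE_PROM = """
# HELP smartmon_temperature_celsius_raw_value SMART metric
smartmon_temperature_celsius_raw_value{disk="/dev/sda",smart_id="194"} 3.4e+01
smartmon_airflow_temperature_cel_raw_value{disk="/dev/sda",smart_id="190"} 36
smartmon_airflow_temperature_cel_raw_value{disk="/dev/sdb",smart_id="190"} 29
smartmon_power_on_hours_raw_value{disk="/dev/sda",smart_id="9"} 12345
"""

print(max_temperature_by_disk(SAMPLE_PROM))
```

The same result could be written back out as a single normalised metric, so the dashboard query would no longer need the `or` fallback.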
*Meeting closed hh:mm*
----
```
# https://demo.hedgedoc.org/Hlsapf47RsqpgIjqLVfMUw
cd /home/wheel/docs/meetings
HEDGEDOC_SERVER=https://demo.hedgedoc.org /home/wheel/bin/hedgedoc export --md Hlsapf47RsqpgIjqLVfMUw ./$(date +%Y-%m-%d).txt
git commit -am "Tech meeting minutes $(date +%Y-%m-%d)"
```
<!-- vim: tabstop=2 shiftwidth=2 expandtab
-->
<!-- Local Variables: -->
<!-- tab-width: 2 -->
<!-- End: -->