[tech] UCC backup status, was Re: Cron <backups at mollitz> /backups/bin/rdiff-manager/rdiff-manager.py (fwd)

Tue Jun 21 13:35:01 AWST 2022

On Mon, May 11, 2020 at 10:25:34PM +0800, Nick Bannon wrote:
> On Sun, Sep 17, 2017 at 11:21:17AM +0800, David Adam wrote:
> > On Wed, 9 Aug 2017, David Adam wrote:
> > > Backups for molmol have been failing for the last month or so as the
> > > backup server (Mollitz) does not have enough space.
> Plus ça change, plus c'est la même chose.

Congrats, [GPO] on stepping up to put some fresh eyes on the legacy backups!

Like some other things, it failed in stages:

1. When things are going well, we can be building the next overlapping
   improvement, but backups tend to be out-of-sight, out-of-mind.

2. We used to get emails when things went wrong. It looks like that broke
   when we started to seriously decommission mooneye last year(?), and now
   all outgoing messages sit in mollitz's queue, going nowhere:
```
root@mollitz:/var/log# mailq
-Queue ID-  --Size-- ----Arrival Time---- -Sender/Recipient-------
E66311F8       2903 Mon Jun 20 07:30:48  MAILER-DAEMON
(delivery temporarily suspended: connect to smarthost.mail.ucc.asn.au[130.95.13.19]:25: Connection timed out)
                                         root at ucc.gu.uwa.edu.au
08814236      16202 Sat Jun 18 02:00:16  backups at ucc.gu.uwa.edu.au
(delivery temporarily suspended: connect to smarthost.mail.ucc.asn.au[130.95.13.19]:25: Connection timed out)
                                         hostmaster at ucc.gu.uwa.edu.au
[...]
root@mollitz:/var/log# postcat -vqb 08814236|less
[...]
regular_text: OSError: [Errno 28] No space left on device: '/backups/murasoi/rdiff-backup-data/rdiff-backup.tmp.0'
regular_text: Fatal Error: Lost connection to the remote system
regular_text: Backup failed with error 1
```

3. The #alerts monitoring via uccmonitor/Grafana had been getting
   people's attention since about 2022-04-23 but uh... no-one fixed
   things for a couple of months.

   Here's a pretty graph from the last 24 hours, check it out by 21:00 tonight:
   http://uccmonitor.ucc.asn.au:3000/d/uYiRn3BZk/node-exporter-full?orgId=1&var-job=other&var-name=mollitz&var-node=mollitz.ucc.asn.au&var-port=9100

4. We're looking for a bit more than the quick minimal fix: we're trying
   to get people to fix whatever else is broken and go back to step 1.
   Lucrative real-world IT skills, and complementary to what one learns
   in University studies!

Oops. Turns out that I had basically left a snapshot of the very large
`molmol` backup around in `mollitz:~backups/tmp/` - that was fine for
months until:
- some failed `rdiff-backup` run
- then the followup finally rewrote all the hard-links
- so it diverged from the "live" backup copy, and
- started taking up its full disc space again.
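For anyone who hasn't watched hard links diverge before, here is a minimal sketch of the effect. It is not the actual `rdiff-backup` behaviour; the `live` and `snapshot` directories are hypothetical stand-ins created in a throwaway temp directory, and the "backup run" is just a file being replaced by rename.

```shell
#!/bin/sh
# Illustrative only: a scratch directory stands in for the backup areas.
set -eu
demo=$(mktemp -d)
mkdir "$demo/live" "$demo/snapshot"

# One 1 MiB file in the "live" copy, hard-linked into the "snapshot".
dd if=/dev/zero of="$demo/live/data" bs=1024 count=1024 2>/dev/null
ln "$demo/live/data" "$demo/snapshot/data"

# Both names share one inode, so the data is stored only once.
links_before=$(ls -l "$demo/live/data" | awk '{print $2}')

# Replace the live file by writing a new one and renaming it over the old
# name. The snapshot keeps the old inode, so the data is now stored twice.
dd if=/dev/zero of="$demo/live/data.new" bs=1024 count=1024 2>/dev/null
mv "$demo/live/data.new" "$demo/live/data"
links_after=$(ls -l "$demo/live/data" | awk '{print $2}')

echo "link count before: $links_before, after: $links_after"
du -sk "$demo"
rm -rf "$demo"
```

The link count drops from 2 to 1 once the live file is replaced, which is the point where a forgotten snapshot stops being free.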

5. Most backups seem to have worked last night; we might need to try one
   manually and nurse it through until it succeeds:
```
backups@mollitz:~$ rdiff-backup --list-increments motsugo
Fatal Error: Previous backup to /backups/motsugo seems to have failed.
Rerun rdiff-backup with --check-destination-dir option to revert
directory to state before unsuccessful session.
```
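A hedged sketch of that manual nursing step, using only the option named in the error message above and the `/backups/motsugo` path it reports. The `repair_backup` helper and the `DRY_RUN` variable are hypothetical, and it defaults to printing the command rather than running it, so try it on a destination you can afford to revert.

```shell
# Hypothetical helper: revert a failed session, then list the increments.
# Dry-run by default: prints the command instead of running it.
repair_backup() {
    dest=$1
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "rdiff-backup --check-destination-dir $dest"
    else
        # Flag taken from the rdiff-backup error message itself.
        rdiff-backup --check-destination-dir "$dest" &&
            rdiff-backup --list-increments "$dest"
    fi
}

repair_backup /backups/motsugo
```

Run it as the `backups` user with `DRY_RUN=0` to actually revert the destination directory.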

6. Actually, [DAA]'s https://gitlab.ucc.asn.au/UCC/rdiff-manager does a
   great job of force-retrying failed backups all by itself, but it's time
   for us to do the Python 2 to Python 3 upgrade to get `rdiff-manager` and
   `rdiff-backup` up to date, on Debian 11 "bullseye". That's not something
   that magically happens by itself without reading release and upgrade
   notes.

The general status is:
* https://wiki.ucc.asn.au/Backups
* mollitz, a 24GiB DELL PowerEdge 2950, boots off a budget 120GB GIGABYTE
  GP-GSTF QLC SSD
* it has 6TB (5.4TiB) of /backups space, RAID-5 over 4*2TB drives
  * that sure is slow(!) but at least we're down to 23M inodes, from
    about 124M in 2020-05
* it can't easily fit more drives
* the PERC 5/i RAID controller can't use larger capacity drives
  (same for the PERC 6/i in our other DELLs? or whatever's in the old
   IBM M-series)
* it would normally use https://gitlab.ucc.asn.au/UCC/rdiff-manager to
  run rdiff-backup over ssh, to fetch daily backups from some UCC hosts
* Not as many hosts as we might like, not the scratch areas or member
  VMs, prioritising the most important data, excluding some. Please
  refer to the SLA.
* Has data backups that one can restore selectively from, but not the
  sort of full images you could boot straight up on the Proxmox cluster
  or replacement hardware
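
Since the space and inode figures above are what bit us, here is a small sketch of a usage check that could run from cron alongside the Grafana alerts. The `check_usage` function name and the limits are made up for illustration; it only relies on the POSIX `df -P` output format.

```shell
# Hypothetical helper: complain when a filesystem passes a usage limit.
# Usage: check_usage MOUNTPOINT LIMIT_PERCENT
check_usage() {
    # df -P prints one POSIX-format line per filesystem; field 5 is "NN%".
    used=$(df -P "$1" | awk 'NR==2 {sub(/%/, "", $5); print $5}')
    if [ "$used" -ge "$2" ]; then
        echo "WARNING: $1 is ${used}% full (limit $2%)"
        return 1
    fi
    echo "OK: $1 is ${used}% full"
}

# A limit of 101 can never be reached, so this only reports current usage.
check_usage / 101
```

On mollitz one would point it at `/backups` with a real limit, and `df -i /backups` shows the matching inode usage.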

Nick.

-- 
   Nick Bannon   | "I made this letter longer than usual because
nick-sig at rcpt.to | I lack the time to make it shorter." - Pascal