Thursday 20 June 2024

Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress

After a helm upgrade failed, subsequent upgrades to the release returned the following error message:

Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress

To determine the problematic release revision, issue:

helm history <release-name> -n <namespace>
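
In this case the output looked something like the following (illustrative, with a hypothetical chart name and the APP VERSION column omitted):

REVISION  UPDATED                   STATUS           CHART          DESCRIPTION
4         Mon Jun 17 10:02:11 2024  deployed         mychart-1.2.3  Upgrade complete
5         Thu Jun 20 09:41:56 2024  pending-upgrade  mychart-1.2.4  Preparing upgrade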


The above error happened because revision 5 was stuck in the 'pending-upgrade' status.

We can roll back the release in one of the following ways. Passing 0 (or omitting the revision) rolls back to the previous release:

helm rollback <release-name> 0 -n <namespace>

or explicitly specify the revision number to roll back to:

helm rollback <release-name> 4 -n <namespace>
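
Once rolled back, re-running helm history should show a new revision in the 'deployed' state, and the original upgrade can then be retried.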


Monday 28 August 2023

ArgoCD: Namespace deletion stuck on deleting

This scenario reared its head when it looked like all of the resources within the offending namespace had been deleted, but the deletion of the namespace itself had hung.

Firstly, try to delete the namespace forcefully:

kubectl delete ns ns-example --force --grace-period=0

If this fails, check whether active finalizers are blocking the deletion (a common cause of namespaces stuck in the 'Terminating' state):

kubectl get namespace ns-example -o json
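
To inspect just the finalizer list, the output can be piped through jq:

kubectl get namespace ns-example -o json | jq '.spec.finalizers'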

If finalizers are present, we can empty the list via the namespace's finalize subresource:

kubectl get ns ns-example -o json | jq '.spec.finalizers = []' | kubectl replace --raw "/api/v1/namespaces/ns-example/finalize" -f -

The namespace, and with it the app deletion in ArgoCD, should now (hopefully) complete.
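
As an aside: if the ArgoCD Application object itself also hangs on deletion, the usual culprit is ArgoCD's resources-finalizer.argocd.argoproj.io finalizer on the Application. As a sketch (assuming the app lives in the argocd namespace), it can be cleared in the same fashion:

kubectl patch app <app-name> -n argocd --type merge -p '{"metadata": {"finalizers": null}}'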

Wednesday 18 November 2020

Vertiv Avocent ACS800 / ACS8000 Series LTE Dongle / Modem Setup

Forewarning: The below is not officially supported and should not be used in a production environment. It's here to simply demonstrate that it's possible to configure 'unsupported' LTE modems on Avocent units.

Since there is next to no documentation for this I thought I'd provide some (hopefully) useful information for anyone else trying to get an LTE modem / dongle set up with the unit.

Firstly ensure you are running the latest firmware as more recent Linux kernels have greatly improved support for LTE modems.

For this exercise I am using a 'D-Link DWM-222 4G LTE USB Adapter'.

We'll firstly need to set the 'Moderate' security profile (System >> Security >> Security Profile), as we'll need root access over SSH.

Next, define the APN for our mobile operator (Smarty Mobile in this case):

echo "APN=mob.asm.net" >/etc/qmi-network.conf

Ensure the 802.3 data format is set:

qmicli -d /dev/cdc-wdm0 --wda-set-data-format=802-3

Start the modem driver:

qmi-network /dev/cdc-wdm0 start
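
At this point it's worth confirming the modem has registered with the network - qmicli can report the signal strength (the exact output varies by modem / firmware):

qmicli -d /dev/cdc-wdm0 --nas-get-signal-strength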

Ensure the interface is started automatically and that the qmi-network session is brought up before the interface itself:

vi /etc/network/interfaces

auto lo eth0 eth2 wwan0
iface wwan0 inet dhcp
        pre-up /usr/bin/qmi-network /dev/cdc-wdm0 start

If the interface does not come up automatically despite 'auto wwan0' being set you can create a startup script instead:

vi /etc/init.d/S99lteconfig

#!/bin/sh
# Give the cdc-wdm device time to initialise after boot
sleep 15
echo "Bringing up LTE interface..."
/sbin/ifup wwan0
# Ensure the default route goes via the LTE interface
/sbin/ip route add 0.0.0.0/0 dev wwan0

Then make the script executable and link it into the boot sequence:

chmod 755 /etc/init.d/S99lteconfig

ln -s /etc/init.d/S99lteconfig /etc/rcS.d/S99lteconfig

Restart the device and check whether the interface has come up:

ip addr
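
and that the default route points at the LTE interface:

ip route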

Lastly, ensure the 'Security Profile' is set back to 'Secure' mode.

Saturday 22 August 2020

Recovering from a replication failure in a MariaDB Master/Master replication setup

For the purposes of this post I'll assume we are working with two MariaDB servers that have been configured to perform master/master replication and one of them has failed. In this case Server01 is healthy while Server02 has stopped replicating.

We firstly need to ensure that no queries are hitting Server02 / the failed server - this will typically mean stopping services or blocking network access to services that hit it, e.g. stopping httpd.

We'll also want to ensure replication is stopped on the failed server (Server02):

SERVER02> stop slave;

Now on Server01 / the working server issue:

SERVER01> stop slave;

SERVER01> flush tables with read lock; (This temporarily blocks writes so the binary log position below stays consistent)

SERVER01> show master status;

Make a note of the output of the above command - it should read something like:

File: mysql-bin.123456
Position: 123
Binlog_Do_DB: <replicated_database>

Then on Server01 / the working server take a backup of the database (run from the shell rather than the MySQL prompt):

SERVER01> mysqldump -u<username> -p --lock-tables --databases <database-name[s]> > export.sql

and on Server02 / the failed server - restore the backup:

SERVER02> mysql -u root -p < export.sql

Now on Server01 / the working server issue the following command to start processing changes again:

SERVER01> UNLOCK TABLES;

Then on Server02 / the failed server issue the following to repoint the logs, using the information we recorded from Server01 above:

SERVER02> CHANGE MASTER TO master_log_file='mysql-bin.xxxxxx', master_log_pos=yy;
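
Using the example values recorded from Server01 earlier, this would read:

SERVER02> CHANGE MASTER TO master_log_file='mysql-bin.123456', master_log_pos=123;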

SERVER02> START SLAVE;

To verify, we can issue:

SERVER02> show slave status \G

and confirm that Slave_IO_Running and Slave_SQL_Running both read 'Yes'.

Now we need to do the reverse and ensure Server01 / the working server replicates from Server02 / the failed server. On Server02 issue:

SERVER02> show master status \G

Record the output again.

Now on Server01 / the working server repoint the logs using the values just recorded from Server02:

SERVER01> CHANGE MASTER TO master_log_file='mysql-bin.xxxxxx', master_log_pos=yy;

SERVER01> START SLAVE;

and then verify replication with:

SERVER01> SHOW SLAVE STATUS \G

Finally, reverse anything you performed at the start to block communications with Server02 / the failed server, e.g. start services, update firewall rules etc.


Monday 16 March 2020

Invoking sysprep (Generalising a Windows install) on AWS EC2

  1. From the Windows Start menu:
    For Windows Server 2008 through Windows Server 2012 R2, open EC2ConfigService Settings, and then choose the Image tab.
    For Windows Server 2016 or later, open EC2 Launch Settings.
  2. For Administrator Password, choose Random.
  3. Choose Shutdown with Sysprep.
  4. Choose Yes.
    Note: Retrieve the new administrator password from the EC2 console the next time the instance starts - it's encrypted with the instance's key pair.
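
Once the instance has been started again, the new password can also be retrieved via the AWS CLI (hypothetical instance ID and key path):

aws ec2 get-password-data --instance-id i-0123456789abcdef0 --priv-launch-key ./my-key.pem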