Wednesday 13 December 2017

vSphere Replication 6.5 Bug: 'Not Active' Status

This happened to me while setting up a brand new vSphere lab with vSphere 6.5 and the vSphere Replication Appliance 6.5.1.

After setting up a new replicated VM I was presented with the 'Not Active' status - although the tooltip gave no further information.

So to dig a little deeper we can use the CLI on the ESXi host to query the replicated VM's status - but first we'll need to obtain the VM ID:

vim-cmd vmsvc/getallvms

and then query the state with:

vim-cmd hbrsvc/vmreplica.getState <id>

Retrieve VM running replication state:
        The VM is configured for replication. Current replication state: Group: CGID-1234567-9f6e-4f09-8487-1234567890 (generation=1234567890)
        Group State: full sync (0% done: checksummed 0 bytes of 1.0 TB, transferred 0 bytes of 0 bytes)
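
If you need to repeat this check, the two steps above could be wrapped in small shell helpers. This is only a sketch: vmid_by_name and sync_percent are hypothetical names of my own, the first assumes the VM name contains no spaces and appears in the second column of the getallvms output, and the second assumes the 'Group State' line keeps the format shown above.

```shell
# Hypothetical helpers (not part of vim-cmd); plain POSIX sh.

# Print the Vmid of the VM whose Name column equals $1.
# Assumption: the name has no spaces and is the second column.
vmid_by_name() {
    awk -v name="$1" '$2 == name { print $1 }'
}

# Print the sync percentage from the "Group State" line.
# Assumption: the line contains "(<number>% done".
sync_percent() {
    sed -n 's/.*Group State:.*(\([0-9][0-9]*\)% done.*/\1/p'
}

# Usage on the ESXi host (myvm is a placeholder name):
#   id=$(vim-cmd vmsvc/getallvms | vmid_by_name myvm)
#   vim-cmd hbrsvc/vmreplica.getState "$id" | sync_percent
```

A percentage that stays the same between two runs a few minutes apart is a reasonable hint that the sync is stuck rather than just slow.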

So it looks like it's at least attempting to perform the replication - however it is stuck at 0% - so now delving into the logs:

cat /var/log/vmkernel.log | grep Hbr

2017-12-13T10:12:18.983Z cpu21:17841592)WARNING: Hbr: 4573: Failed to establish connection to [10.11.12.13]:10000(groupID=CGID-123456-9f6e-4f09-8487-123456): Timeout
2017-12-13T10:12:45.102Z cpu18:17806591)WARNING: Hbr: 549: Connection failed to 10.11.12.13 (groupID=CGID-123456-9f6e-4f09-8487-123456): Timeout
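
On a busy host the Hbr warnings can run to many lines, so it may help to reduce them to the distinct endpoints that could not be reached. A minimal sketch, assuming the "Failed to establish connection to [IP]:PORT" wording seen above; hbr_failed_targets is a hypothetical helper name.

```shell
# Hypothetical helper: list each distinct "ip:port" that Hbr failed to reach.
# Assumption: log lines use the "connection to [IP]:PORT" format shown above.
# Lines that do not match that format are ignored.
hbr_failed_targets() {
    sed -n 's/.*Failed to establish connection to \[\([0-9.]*\)]:\([0-9]*\).*/\1:\2/p' | sort -u
}

# Usage on the ESXi host:
#   grep Hbr /var/log/vmkernel.log | hbr_failed_targets
```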

It looks like the ESXi host is failing to connect to 10.11.12.13 (the vSphere Replication Appliance in my case) on port 10000 - so we can double check this with:

cat /dev/zero | nc -v 10.11.12.13 10000

(Fails)

However if we attempt to ping it:

ping 10.11.12.13

we get a response - so it looks like it's a firewall issue.

I then attempted to connect to the replication appliance on the same port from another server:
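
The reasoning here (ping answers but the TCP probe fails, so suspect a firewall) can be written down as a tiny decision helper. This is a sketch only: classify_probe is a hypothetical name, and the nc flags in the usage comment are an assumption about the nc build on the host.

```shell
# Hypothetical helper: interpret the exit codes of a ping ($1) and a TCP probe ($2).
# An exit code of 0 means the corresponding check succeeded.
classify_probe() {
    ping_rc=$1
    tcp_rc=$2
    if [ "$tcp_rc" -eq 0 ]; then
        echo "port reachable"
    elif [ "$ping_rc" -eq 0 ]; then
        echo "host up, port blocked"
    else
        echo "host unreachable"
    fi
}

# Usage (assumes nc supports -z and -w on this host):
#   ping -c 1 10.11.12.13 >/dev/null 2>&1; p=$?
#   nc -z -w 3 10.11.12.13 10000 >/dev/null 2>&1; t=$?
#   classify_probe "$p" "$t"
```

Bear in mind that ICMP can be filtered independently of TCP, so "host unreachable" from this helper is a hint rather than proof.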

cat /dev/zero | nc -v 10.11.12.13 10000

Ncat: Version 7.60 ( https://nmap.org/ncat )
Ncat: Connected to 10.11.12.13:10000.

So it looks like the firewall on this specific host is blocking outbound connections on port 10000.

My suspicions were confirmed when I reviewed the firewall rules from within vCenter on the Security Profile tab of the ESXi host:



Usually the relevant firewall rules are created automatically - however this time for whatever reason they have not been - so we'll need to
proceed by creating a custom firewall rule (which unfortunately is quite cumbersome...):

SSH into the problematic ESXi host and create a new firewall config with:

touch /etc/vmware/firewall/replication.xml

and set the relevant write permissions:

chmod 644 /etc/vmware/firewall/replication.xml
chmod +t /etc/vmware/firewall/replication.xml

Then open the file:

vi /etc/vmware/firewall/replication.xml

and add the following:

<!-- Firewall configuration information for vSphere Replication -->
<ConfigRoot>
<service>
<id>vrepl</id>
<rule id='0000'>
<direction>outbound</direction>
<protocol>tcp</protocol>
<porttype>dst</porttype>
<port>
<begin>10000</begin>
<end>10010</end>
</port>
</rule>
<enabled>true</enabled>
<required>false</required>
</service>
</ConfigRoot>
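
If you would rather not paste into vi, the same file could be written with a here-document. This is a sketch rather than the documented procedure: write_vrepl_ruleset is a hypothetical helper name, the XML is exactly the ruleset shown above, and the chmod steps before and after still apply.

```shell
# Hypothetical helper: write the ruleset shown above to the path given as $1.
# The quoted 'EOF' stops the shell from expanding anything inside the XML.
write_vrepl_ruleset() {
    cat > "$1" <<'EOF'
<!-- Firewall configuration information for vSphere Replication -->
<ConfigRoot>
<service>
<id>vrepl</id>
<rule id='0000'>
<direction>outbound</direction>
<protocol>tcp</protocol>
<porttype>dst</porttype>
<port>
<begin>10000</begin>
<end>10010</end>
</port>
</rule>
<enabled>true</enabled>
<required>false</required>
</service>
</ConfigRoot>
EOF
}

# Usage on the ESXi host:
#   write_vrepl_ruleset /etc/vmware/firewall/replication.xml
```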

Revert the permissions with:

chmod 444 /etc/vmware/firewall/replication.xml

and refresh the firewall so that it loads the new ruleset:

esxcli network firewall refresh

and check it's there with:

esxcli network firewall ruleset list

(Make sure it's set to 'enabled' - if not you can enable it via the vSphere GUI: ESXi Host >> Configuration >> Security Profile >> Edit Firewall
Settings.)

and the rules with:

esxcli network firewall ruleset rule list | grep vrepl
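
The 'enabled' check can also be scripted instead of read by eye. A minimal sketch, assuming the ruleset list prints a two-column Name / Enabled table; ruleset_enabled is a hypothetical helper name.

```shell
# Hypothetical helper: print the Enabled column for the ruleset named $1.
# Assumption: input is a whitespace-separated "Name  Enabled" table.
# Prints nothing if the ruleset is not listed at all.
ruleset_enabled() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Usage on the ESXi host:
#   esxcli network firewall ruleset list | ruleset_enabled vrepl
```

An empty result would suggest the XML file was not picked up by the refresh, which is a different problem from the ruleset being present but disabled.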

Then re-check connectivity with:

cat /dev/zero | nc -v 10.11.12.13 10000
Connection to 10.11.12.13 10000 port [tcp/*] succeeded!

Looks good!

On reviewing the vSphere Replication monitor, everything had started syncing again.


Sources:

https://kb.vmware.com/s/article/2008226
https://kb.vmware.com/s/article/2059893



