Thursday, October 27, 2022

Recovering Dell OS10 switch from boot to ONIE

Issue - I received two Dell switches that booted into ONIE install O/S mode.

Interrupting the ONIE boot menu by pressing x, then exit, let it boot into OS10 that one time, confirming that OS10 and, most importantly, the license were intact.  I then SCPed the license file off to somewhere safe - I didn't / don't have access to the customer's Dell Digital Locker, so I would have had no way to recover it.

Reboot and let it boot into ONIE O/S install mode again, then install OS10.  In my case this rebuilt both OS10 partitions but did leave the license intact.  Had I not backed it up first, I would have been distressed to see it repartition and format the flash...

Confirmed it now boots directly into OS10.

Wednesday, June 22, 2022

Sonic Networking

Sonic is an open source networking operating system which runs on a variety of platforms.  Azure and other hyperscalers use it as an abstraction layer - they can manage multiple vendors' network hardware the same way - which may have been especially useful during the last couple of years.  Thanks to the supply chain issues at all the major vendors, the plentiful availability of used 32x100G switches made them a candidate for a short notice project, and I figured I'd replace the ancient H3C 10G switch in my lab too.

They work well, as you would expect, since the hardware is very similar to anything else based on the Broadcom Trident, be it white box or major vendor.  Sonic is not user friendly though: there are multiple release streams, with a variety of bugs in each.  User friendliness is not a high priority when all the big Sonic users out there are heavily into automation.

You're going to want to read this huge thread through:
https://www.reddit.com/r/homelab/comments/n5opo2/initial_configuration_of_a_celestica_dx010_100ge/
Between that thread, the EdgeCore Sonic docs and the official docs, these are my notes...

Initial configuration:

Change the device to layer2.  The default is layer3, which makes sense for a switch with no Spanning-Tree.  Be absolutely sure there are no loops before doing this (I created a nice 400G broadcast storm, which was enough to kill management plane access).

sudo sonic-cfggen --preset l2 -p -H -k Seastone-DX010 | sudo tee /etc/sonic/config_db.json >/dev/null

Then reboot - in theory sudo config reload should work, but I've had mixed results.  There are multiple warm reboot options for upgrading code without interrupting forwarding, or for rebuilding the config without reloading the O/S.  I will go back and experiment when I have time.

Hostname & management port (which is DHCP by default)

sudo config hostname switch01

sudo config interface ip add eth0 192.168.10.10/24 192.168.10.1

The default gateway doesn't really show up anywhere.  Don't be surprised, it does still work.

The management port also needs the following added into config_db.json.  It works without it, but a couple of the commands fail, specifically when you try to break out ports dynamically.  Breaking out ports by editing config_db.json and restarting always works though.

    "MGMT_PORT": {
        "eth0": {
            "admin_status": "up",
            "alias": "eth0"
        }
    },
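If you would rather not hand-edit the file, here is a minimal Python sketch of merging that block in.  The helper name add_mgmt_port is made up for illustration, and it assumes you have already loaded config_db.json into a dict with json.load; it only fills in the entry when it is missing.

```python
import json


def add_mgmt_port(config, port="eth0"):
    """Return a copy of the config dict with a MGMT_PORT entry for `port`.

    Hypothetical helper: the field values mirror the block shown in the post.
    An existing entry for the port is left untouched.
    """
    updated = dict(config)
    mgmt = dict(updated.get("MGMT_PORT", {}))
    mgmt.setdefault(port, {"admin_status": "up", "alias": port})
    updated["MGMT_PORT"] = mgmt
    return updated


# In-memory demonstration with a stand-in config (not a real switch config).
example = {"DEVICE_METADATA": {"localhost": {"hostname": "switch01"}}}
print(json.dumps(add_mgmt_port(example), indent=4, sort_keys=True))
```

On a real switch you would json.load the file, pass the result through the helper, and write it back out with json.dump before restarting.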

Dynamic port breakout:


sudo config interface breakout '4x25G[10G]' Ethernet0

Turns 100G port 0 into 4 x 25G ports, which you can then configure to 10G (all four of them) to make a 40G to 4x10G breakout work.  Extra arguments that should work (-y, -v etc.) seem to break it, as does not having the extra MGMT_PORT blob above.

MCLAG

Because I was doing layer2, MCLAG was a requirement.  The daemon for it, ICCPD, is not included in the builds by default, so I needed a build server.  This is well documented on the Sonic site, though one step is installing Docker, which they have you do with Snap - which then doesn't work - whereas a proper Docker install works fine.  (Look for instructions to add the docker.com repository and gpg key to sources.list.)  Code upgrades fail to persist the configuration in my experience, so archive config_db.json beforehand and plan to push it back on afterwards.
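Archiving config_db.json before an upgrade is easy to script.  This is only a sketch: archive_config is a hypothetical helper name, and the default source path assumes the usual /etc/sonic/config_db.json location used elsewhere in these notes.

```python
import shutil
import time
from pathlib import Path


def archive_config(src="/etc/sonic/config_db.json", dest_dir="."):
    """Copy the config file to dest_dir with a timestamp in the name.

    Returns the path of the copy, e.g. config_db-20220622-101500.json.
    """
    src = Path(src)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_dir) / f"{src.stem}-{stamp}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 also preserves timestamps and permissions
    return dest
```

Call archive_config() before the upgrade, then SCP the copy somewhere off the switch so it survives whatever the installer does to the flash.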

The Edge-Core MCLAG page is mostly right:
https://support.edge-core.com/hc/en-us/articles/900002380706--Edgecore-SONiC-MC-LAG 
Note you can't designate which VLAN interfaces will be the unique-ip while they already have IPs on them, so designate the interface, then add the IP.

The JSON block ends up looking like this if you've done it right:
    "MCLAG_DOMAIN": {
        "1": {
            "peer_ip": "10.210.1.2",
            "peer_link": "PortChannel01",
            "source_ip": "10.210.1.1"
        }
    },
    "MCLAG_INTERFACE": {
        "1|PortChannel02": {
            "if_type": "PortChannel"
        }
    },
    "MCLAG_UNIQUE_IP": {
        "Vlan1000": {
            "unique_ip": "enable"
        }
    },

Here our IP on Vlan1000 is 10.210.1.1, with our partner on .2; PortChannel01 is our peer link and our only MCLAG enabled channel is PortChannel02.
Ensure ICCPD and TEAMD are running / set to start at boot.  The easiest way is to ensure those services are set to enabled in config_db.json, but you may need to unmask and start the daemon first: 'systemctl unmask iccpd' and 'systemctl start iccpd'.
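For the second switch in the pair you need the same three tables with the IPs swapped, so it can help to generate them.  A minimal sketch follows; mclag_tables is a made-up helper, and the table and field names are simply copied from the JSON block above rather than checked against any particular Sonic release.

```python
import json


def mclag_tables(domain_id, source_ip, peer_ip, peer_link, member_channels, unique_ip_vlans):
    """Build the MCLAG_DOMAIN / MCLAG_INTERFACE / MCLAG_UNIQUE_IP tables as a dict."""
    return {
        "MCLAG_DOMAIN": {
            str(domain_id): {
                "peer_ip": peer_ip,
                "peer_link": peer_link,
                "source_ip": source_ip,
            }
        },
        "MCLAG_INTERFACE": {
            f"{domain_id}|{channel}": {"if_type": "PortChannel"}
            for channel in member_channels
        },
        "MCLAG_UNIQUE_IP": {
            vlan: {"unique_ip": "enable"} for vlan in unique_ip_vlans
        },
    }


# The first switch, matching the block in the post.
switch_a = mclag_tables(1, "10.210.1.1", "10.210.1.2", "PortChannel01",
                        ["PortChannel02"], ["Vlan1000"])
# The partner: same domain and channels, source and peer IPs swapped.
switch_b = mclag_tables(1, "10.210.1.2", "10.210.1.1", "PortChannel01",
                        ["PortChannel02"], ["Vlan1000"])
print(json.dumps(switch_b, indent=4))
```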

There are references to various 'show mclag' commands which don't seem to exist, but 'mclagdctl dump state' is the key thing to show whether the partner switches can see one another and the daemons are talking.
Some of the other mclagdctl commands are implemented, though not all of them... but useful stuff does go to syslog.

VLAN assignments
It took me hours to build out a config with 35 VLANs tagged on all ports.  As far as I can see there's no efficient way to do it interactively, so if you need to do it in bulk, script it to a file and then dump the file into config_db.
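A rough Python sketch of that kind of script is below.  Everything specific in it is an assumption: vlan_tables is a made-up helper, the VLAN IDs 1000-1034 are placeholders, the port names assume the usual Ethernet0, Ethernet4 ... Ethernet124 numbering on an unbroken 32-port switch, and the VLAN / VLAN_MEMBER layout is the config_db schema as I understand it, so compare it against a VLAN you created by hand on your own build before trusting it.

```python
import json


def vlan_tables(vlan_ids, ports, tagging_mode="tagged"):
    """Build VLAN and VLAN_MEMBER tables with every VLAN on every port."""
    vlans = {}
    members = {}
    for vid in vlan_ids:
        name = f"Vlan{vid}"
        vlans[name] = {"vlanid": str(vid)}
        for port in ports:
            # Member keys take the form "Vlan1000|Ethernet0".
            members[f"{name}|{port}"] = {"tagging_mode": tagging_mode}
    return {"VLAN": vlans, "VLAN_MEMBER": members}


# 35 placeholder VLANs tagged on 32 assumed port names.
ports = [f"Ethernet{n}" for n in range(0, 128, 4)]
tables = vlan_tables(range(1000, 1035), ports)
print(len(tables["VLAN"]), "VLANs,", len(tables["VLAN_MEMBER"]), "memberships")
# json.dumps(tables, indent=4) gives the text to merge into config_db.json.
```

Generating the tables as a dict rather than as text means the output is always valid JSON, and the same helper merges cleanly into a config loaded with json.load.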