Sunday 9 October 2016

Fun facts about ethernet debugging, number 3+4i in an occasional series

So, I've been writing a driver for an OS-less machine which uses (for reasons you will be relieved that I am not going to go into) a LAN9221 as its ethernet interface.

Now, this chip is moderately spiffy - in particular, it seems Microchip have now encountered every possible bus-related design cock-up and have a handy register ready for reversing out most of them. However, it requires quite careful FIFO handling and turns out to be exquisitely read-sensitive. So far, so good.

So, one thing I tried was doing a Tx test. Send a packet every couple of seconds, read status back out of the status FIFO, report. Check that all is well.

Now, I didn't have a LAN port that didn't have too much chatter on it, so I used an ASIX AX88178 I had lying around to do the capture (and incidentally, what is it with Linux desktops these days that they just can't seem to shut up? You no sooner plug something in than you have kilobits of traffic asking if, against all knowledge and probability, an ethernet interface with no IP is a link back to some mothership or other. Sheesh).

Anyway, you look at the trace in Wireshark, and all is well, except that if you send:

0000  68 05 ca 1e 35 18 00 50 9a 00 00 00 08 00 01 02   h...5..P........
0010  03 04 05 76 40 30 00 00 40 00 0b 00 00 10 01 02   ...v@0..@.......
0020  03 04 05 06 07 08 01 02 03 04 05 06 07 08 01 02   ................
0030  03 04 05 06 07 08 01 02 03 04 05 06 07 08 01 02   ................

You get your original packet, but immediately after it, you also get:

0000  40 00 bf ff 68 05 ca 1e 35 18 00 50 9a 00 00 00   @...h...5..P....
0010  08 00 01 02 03 04 05 76 40 30 00 00 40 00 0b 00   .......v@0..@...
0020  00 10 01 02 03 04 05 06 07 08 01 02 03 04 05 06   ................
0030  07 08 01 02 03 04 05 06 07 08 01 02 03 04 05 06   ................
0040  07 08 01 02                                       ....

which is your packet, with 0x4000bfff prepended to it. No status word in the Tx status FIFO, no IRQ_SIS TXE bit, nothing. Weird, huh?

It's a replay, so it can't be bad FIFO management (well, not obviously bad FIFO management), and it's not the CPU double-writing or you'd get word dups, not packet dups. If you vary the size of your packet, you find that the first byte is the length of the original, and the third and fourth bytes are always something like 0xNfff, where N seems to have something to do with the top nybble of your packet length.
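One reading that fits the dump above: the four extra bytes are a little-endian 16-bit length (0x0040, the 64-byte packet) followed by the bitwise complement of that length (0xffbf, which shows up on the wire as bf ff). That is a guess from the bytes shown here, nothing more, and the function name is made up for illustration. A minimal Python sketch for testing the guess against a capture:

```python
import struct

def decode_prefix(frame: bytes):
    """Split a captured frame into (length, check_ok, payload).

    Assumption (hypothetical, inferred from the dump): the first four bytes
    are a little-endian 16-bit length followed by its 16-bit complement.
    """
    length, inverted = struct.unpack_from("<HH", frame, 0)
    # A value XORed with its own complement has every bit set.
    check_ok = (length ^ inverted) == 0xFFFF
    return length, check_ok, frame[4:]

# The first eight bytes of the odd packet from the capture above.
sample = bytes.fromhex("4000bfff6805ca1e")
length, check_ok, payload = decode_prefix(sample)
print(hex(length), check_ok, payload.hex())
```

If the guess holds, the N in 0xNfff is simply the complement of the top nybble of the low length byte: 0x4 gives 0xb.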

So, you try sending two packets back to back. Send:

0000  68 05 ca 1e 35 18 00 50 9a 00 00 00 08 00 01 02   h...5..P........
0010  03 04 05 77 40 30 00 00 40 00 0c 00 00 10 01 02   ...w@0..@.......
0020  03 04 05 06 07 08 01 02 03 04 05 06 07 08 01 02   ................
0030  03 04 05 06 07 08 01 02 03 04 05 06 07 08 01 02   ................


0000  68 05 ca 1e 35 18 00 50 9a 00 00 00 08 00 01 02   h...5..P........
0010  03 04 05 78 40 30 00 00 40 00 0c 00 00 10 01 02   ...x@0..@.......
0020  03 04 05 06 07 08 01 02 03 04 05 06 07 08 01 02   ................
0030  03 04 05 06 07 08 01 02 03 04 05 06 07 08 01 02   ................

Get:

0000  40 00 bf ff 68 05 ca 1e 35 18 00 50 9a 00 00 00   @...h...5..P....
0010  08 00 01 02 03 04 05 77 40 30 00 00 40 00 0c 00   .......w@0..@...
0020  00 10 01 02 03 04 05 06 07 08 01 02 03 04 05 06   ................
0030  07 08 01 02 03 04 05 06 07 08 01 02 03 04 05 06   ................
0040  07 08 01 02 40 00 bf ff 68 05 ca 1e 35 18 00 50   ....@...h...5..P
0050  9a 00 00 00 08 00 01 02 03 04 05 78 40 30 00 00   ...........x@0..
0060  40 00 0c 00 00 10 01 02 03 04 05 06 07 08 01 02   @...............
0070  03 04 05 06 07 08 01 02 03 04 05 06 07 08 01 02   ................
0080  03 04 05 06 07 08 01 02                           ........

Sometimes you get two of these curious runt packets, and sometimes one. Awooga! You then spend all sodding night debugging the cursed thing, convinced that your MMU has somehow rewritten the FIFO, or you've accidentally written it a negative length and it's wrapped, or that you're trying to send a packet in the middle of a reset.
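For what it's worth, the 0x88-byte capture above looks like two length-prefixed packets glued together. Here is a small Python sketch that walks such a blob and pulls the original packets back out. It assumes each packet is preceded by a little-endian 16-bit length and then that length's 16-bit complement, with no padding in between. Both assumptions are inferred from the dumps in this post, and `packet` and `split_frames` are illustrative names only.

```python
import struct

def packet(seq: int) -> bytes:
    """Rebuild the 64-byte test packet from the dumps; seq is byte 0x13."""
    return (bytes.fromhex("6805ca1e351800509a00000008000102" "030405")
            + bytes([seq])
            + bytes.fromhex("403000004000" "0c0000100102")
            + bytes.fromhex("0304050607080102") * 4)

def split_frames(blob: bytes):
    """Return the payloads found in a run of (length, ~length, payload) records."""
    frames = []
    offset = 0
    while offset + 4 <= len(blob):
        length, inverted = struct.unpack_from("<HH", blob, offset)
        if (length ^ inverted) != 0xFFFF:
            raise ValueError(f"prefix check failed at offset {offset:#x}")
        start, end = offset + 4, offset + 4 + length
        if end > len(blob):
            raise ValueError(f"record at offset {offset:#x} runs past the end")
        frames.append(blob[start:end])
        offset = end
    if offset != len(blob):
        raise ValueError("trailing bytes after the last record")
    return frames

# Recreate the merged capture: two prefixed packets, 0x88 bytes in total.
blob = b"".join(struct.pack("<HH", 64, 64 ^ 0xFFFF) + packet(s)
                for s in (0x77, 0x78))
for frame in split_frames(blob):
    print(len(frame), hex(frame[0x13]))
```

If a capture splits cleanly like this, the bytes on the wire were probably fine and something on the capture side is leaking its own framing.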

Anyway, finally, in desperation, you plug your ASIX adaptor into a different machine, running kernel 3.13 rather than the 3.11 (sheesh, that old?) on your original box, and you get your original packets. Two at a time, all fine and dandy - and my onboard adaptors on two different laptops seem to agree with that.

It seems that older ASIX drivers and/or 3.11 will insert comedy packets into your Wireshark captures, for fun and profit. Now, the reason I switched to the ASIX in the first place was because I was seeing these runt packets on another interface, so I'm not sure if I blame the ASIX driver or not. My desktop has both 802.1Q and IPv6 enabled, so it may be that some component of the network stack is simply failing to cope with an attempt to configure 802.1Q and IFF_PROMISC at the same time. Be warned!
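If you want to rule the promiscuous-mode theory in or out, it helps to know whether the capture interface really has IFF_PROMISC set while the capture is running. A rough Python sketch follows. It assumes a Linux box where /sys/class/net/<ifname>/flags holds the interface flags as a hex string, and it takes IFF_PROMISC to be 0x100 as in <linux/if.h>. The function names and the interface name in the comment are made up for illustration.

```python
from pathlib import Path

IFF_PROMISC = 0x100  # value from <linux/if.h>

def is_promiscuous(flags_text: str) -> bool:
    """Decode a sysfs-style hex flags string such as '0x1103'."""
    return bool(int(flags_text.strip(), 16) & IFF_PROMISC)

def interface_is_promiscuous(ifname: str) -> bool:
    """Read the flags for ifname from sysfs (Linux only; path is an assumption)."""
    return is_promiscuous(Path(f"/sys/class/net/{ifname}/flags").read_text())

# Decoding two example flag words by hand:
print(is_promiscuous("0x1103\n"))  # 0x100 bit set
print(is_promiscuous("0x1003\n"))  # 0x100 bit clear

# On a real machine, something like:
#   interface_is_promiscuous("eth1")   # "eth1" is a placeholder name
```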

Hopefully, if you found this post via google at 3am, you can now change adaptor, go to bed and sleep the sleep of the justly offended.

Gah. Onward to lwIP! (which is at least reasonably well-behaved)