Friday, 31 December 2010

Interesting facts about USB, number 2312 in an occasional series.

Ah, USB 2.0. How I don't love thee in any way at all.

I've recently been hacking the USB drivers on an SoC we have here. The initial problem was that a USB mouse connected via a high-speed hub was acting erratically.

[Update: for what it's worth, it turns out that none of the below actually works. But I do now have some nice clean traces showing a high-speed USB hub apparently blatantly failing to report transaction results back to the host. Hey ho. Work continues.. ]

It turns out that this behaviour is common to all mice and all the high-speed hubs we had about, and to our USB-IR widgets, which would occasionally send key-down without key-up, leading to fun with Android deciding that key-repeat was clearly what you meant and spinning your keypresses repeatedly.

We asked the manufacturer for support and they reckoned it worked fine at their end with one of these.

So: next things to do were to buy a couple of the hubs that seemed to work, get ourselves a hub we knew wasn't broken, pull out the USB analysers and see what we could see.

Fact #0: A Total Phase Beagle 480 analyser will mysteriously claim 'Host Disconnected' at the slightest provocation. This includes when your hub power supply is 4.97V - nothing else will care, but the Beagle 480 will look as though you've broken it. That took me two days to work out .. (amusingly, a Beagle 12 seems fine)

Fact #1: The Slim 4 port USB2.0 hubs that worked are full-speed.

Fact #2: The problem still exists on a self-built board using the SMSC LAN9514 (we wanted ethernet too) - my go-to 'it really does work' high-speed hub chip, as used on the BeagleBoard-xM.

Fact #3: The problem goes away if you connect the hub to a PC rather than our embedded board.

Fact #4: Our SoC uses the DesignWare USB 2.0 OTG host controller IP. The driver is an out-of-tree one from our vendor, though it has been submitted for inclusion in the mainline kernel, and the DENX people have made a few fixes which we'd merged into the vendor-supplied driver. So the host-side USB driver is even more of a mess than the stock DesignWare drivers, which are in themselves grotty in the extreme.

Fact #5: If you attach a Beagle 480 to the host side of the hub, you get:

IN txn (SPLIT) [ 1117 POLL ] 00 02 00 28 00 00 00 00 ..

whereas if you attach the same analyser to the device side, you find that our USB-IR widget (for it is he) sends:

IN txn [ 105 POLL ] 00 02 00 28 00 00 00 00
[4 SOF]
IN txn 00 01 00 00 00 00 00 00

So: the USB-IR widget is sending key-up, but it's never being received on the high-speed bus.

At this point, a little background. DANGER WILL ROBINSON! This is highly simplified and almost certainly contains mistakes - the USB 2.0 standard has a good (if long-winded) explanation of transaction translation which it is highly recommended that you read if you ever have to seriously care about these things.

High-speed USB 2.0 runs much faster than USB 1.x. USB 1.x timing is divided into 1 ms frames whose timing is broadcast by SOF tokens. At high speed each frame is further subdivided into eight 125 µs microframes (which I also call subframes below), numbered 0 to 7.
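A minimal sketch of that arithmetic, in Python for brevity. The function names and the idea of a running microframe count starting at zero are made up for illustration and do not come from any driver.

```python
FRAME_US = 1000                 # full-speed frame length, from the text
MICROFRAME_US = 125             # high-speed microframe length, from the text
MICROFRAMES_PER_FRAME = FRAME_US // MICROFRAME_US   # works out to 8


def split_count(microframe_count):
    """Turn a running microframe count into (frame, microframe within frame)."""
    return divmod(microframe_count, MICROFRAMES_PER_FRAME)


def start_time_us(frame, microframe):
    """Microseconds from the start of frame 0 to the start of this microframe."""
    return frame * FRAME_US + microframe * MICROFRAME_US
```

For example, `split_count(22)` gives frame 2, microframe 6, which starts 2750 µs after frame 0 begins.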

USB is a host-controlled token-based bus; the host issues tokens (IN, OUT .. ) and devices respond.

Now, obviously, if you have a high-speed and a full-speed device attached to a high-speed hub you don't want to be waiting around for the full-speed device to respond to your token before servicing the high-speed device. So there is a token called SPLIT, further subdivided into SSPLIT (Start SPLIT) and CSPLIT (Complete SPLIT).

When a host wants to issue a full-speed transaction to a downstream device, it issues SSPLIT to the (high-speed) hub port in some given microframe. This instructs a bit of mechanism in the hub called the transaction translator (TT) to initiate a full-speed transaction on a given port, send the packet (IN/OUT) that came with the SSPLIT, collect the answer and wait for the host to issue a CSPLIT to retrieve it.
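The hand-off can be pictured with a toy model. This is a deliberately crude sketch, not how any real hub is built: it ignores timing entirely, handles only IN transactions, and every name in it (`ToyTT`, `ssplit_in`, `csplit_in`, `device_replies`) is hypothetical. NYET is the real USB handshake a hub returns when it has no result to hand over.

```python
NYET = "NYET"   # handshake meaning "no result for you yet"


class ToyTT:
    """Toy transaction translator; real hubs are far more involved."""

    def __init__(self, device_replies):
        # device_replies maps a downstream port number to the bytes the
        # full-speed device on that port answers an IN with (hypothetical).
        self.device_replies = device_replies
        self.buffered = {}          # port -> reply waiting for a CSPLIT

    def ssplit_in(self, port):
        """Host sent SSPLIT + IN: run the full-speed transaction, keep the reply."""
        reply = self.device_replies.get(port)
        if reply is not None:
            self.buffered[port] = reply

    def csplit_in(self, port):
        """Host sent CSPLIT: hand over the buffered reply, or NYET if none."""
        return self.buffered.pop(port, NYET)
```

The point of the split is visible in the shape of the API: the host is free to talk to other devices between `ssplit_in` and `csplit_in`, because the slow part happens inside the hub.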

Now, there are some caveats about split transactions:

  • There is a limited amount of buffer space per TT. Old transactions will be thrown away when the limit is reached.
  • Replies will be discarded when 4 or more microframes have passed since the SSPLIT that issued them.
  • CSPLITs for which the hub can find no replies will get NYET (forever if the hub will never get a reply or has discarded it).
  • You must not issue an SSPLIT in subframe 6.
  • You can have one TT per hub or one TT per port (the hub descriptor tells you which).
  • You may have multiple concurrent SPLITs, but you must issue CSPLITs in the order in which you issued the SSPLITs, or the hub will assume it has lost your CSPLIT and discard transaction results.
  • The hub takes some number of FS bit times (8 for the 9514) to issue an FS transaction after an SSPLIT - this 'TT think time' is signalled in the hub descriptor.
  • To make life even more fun, HID devices like keyboards and mice are periodically polled, so there is something in the host called the periodic scheduler which attempts to make sure that each device is polled at least as often as its descriptor would like it to be. Luckily this is a best-effort polling mechanism.

    .. oh, and for added amusement, you can also have multiple transactions per subframe, though you can ignore this for the purposes of this article. There are also things called isochronous transfers which do need to be real-time, and I have probably broken what little support there was for them in the Synopsys drivers.

    The upshot of all this is that you need to be extremely careful when issuing splits - particularly periodic splits. Ideally you want to do it in hardware. Less ideally, you need to make sure your periodic scheduler knows about splits and respects the in-order and 'not in microframe 6' constraints.
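The in-order and 'not in microframe 6' constraints, together with the four-microframe lifetime of a reply, can be written down as a small checker. This is a sketch under simplifying assumptions, not anything from the driver: each transaction is assumed to get exactly one CSPLIT, buffer limits are ignored, and the event format and function name are invented for illustration.

```python
from collections import deque

MICROFRAMES_PER_FRAME = 8
RESULT_LIFETIME = 4   # microframes a hub keeps a reply, per the list above


def check_split_schedule(events):
    """Return a list of rule violations for a sequence of split tokens.

    events is a list of (kind, tag, count) in issue order, where kind is
    "SSPLIT" or "CSPLIT", tag names the transaction, and count is a running
    microframe count. Each tag is assumed to get exactly one CSPLIT.
    """
    problems = []
    outstanding = deque()   # (tag, count) for SSPLITs not yet completed
    for kind, tag, count in events:
        if kind == "SSPLIT":
            if count % MICROFRAMES_PER_FRAME == 6:
                problems.append(f"{tag}: SSPLIT in microframe 6")
            outstanding.append((tag, count))
            continue
        # Anything else is treated as a CSPLIT for transaction `tag`.
        started = next((c for t, c in outstanding if t == tag), None)
        if started is None:
            problems.append(f"{tag}: CSPLIT without an SSPLIT")
            continue
        if outstanding[0][0] != tag:
            problems.append(f"{tag}: CSPLIT out of order")
        if count - started >= RESULT_LIFETIME:
            problems.append(f"{tag}: CSPLIT too late, reply discarded")
        outstanding.remove((tag, started))
    return problems
```

A schedule such as SSPLIT a in microframe 0 followed by CSPLIT a in microframe 2 passes cleanly; starting a in microframe 6, b in microframe 7, and then completing b before a trips three of the rules at once.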

    So. Our next bet was to tie a Beagle 480 to the hub side of the system under test and a Beagle 12 to the device side; this doesn't give you as much information as you might like because a Beagle 12 is only capable of aggregate (rather than sequential) captures and so presents its results out of order, but it's good enough for now.

    Monitoring both sides of the bus simultaneously, we discover that the Synopsys drivers are guilty of several sins:

      1. Issuing SSPLITs in microframe 6 (believe it or not, the 9514 actually cares about this and will NYET not the microframe 6 SSPLIT but any SSPLIT in microframe 7. It's allowed to, but this seems odd).
      2. Issuing CSPLITs out of order from the SSPLITs that created them, causing the results of previous SSPLITs to be discarded.
      3. There is some logic in the Synopsys drivers intended to avoid runaway CSPLITs: if you are about to issue a CSPLIT in a different full-speed frame from the SSPLIT that created it, you abort the transaction. This causes SSPLITs issued in subframe 7 never to get a corresponding CSPLIT (and causes a similar, slighter problem for those in subframe 5, though the chances are that if you're going to get an answer you will have got it by subframe 7).
      4. Skipping two subframes between SSPLIT and CSPLIT - so you get SSPLIT:0, CSPLIT:3 ..; it's unclear why that CSPLIT gets NYETed - the 9514 ought to keep your results for you until the start of subframe 4 - but nevertheless, not issuing a CSPLIT in subframes 1 and 2 is a standards violation.

    These all (indirectly) arise from the fact that the DesignWare IP attempts to handle SSPLIT/CSPLIT transfers using the same mechanism it uses for other transfers: the drivers schedule a transaction (if periodic), then queue it on a hardware channel for transmission, and then, if it is a split, decide once the reply token (NYET, ACK, etc.) arrives whether to reschedule immediately for transmission.

    So: now we (think we) know what's wrong:

    The DesignWare IP tries to handle case 1 by pushing an SSPLIT into subframe 0. I'm not really sure why this doesn't work.

    It can't avoid 2, because the periodic scheduler doesn't know about splits - if a periodic transaction is ready, it'll be scheduled, in whatever order the list (for it is held in a kernel list) happens to be. If that list is somehow jumbled during scheduling - e.g. because you have two periodic transfers and they keep getting scheduled in the opposite order to each other, so a periodic transfer gets accepted half-way through your split - that's what happens.

    3 is a simple bug, and probably shouldn't arise given our ostensible policy of only ever scheduling a split in subframe 0. However, it is easily fixed by allowing a frame's grace - this sacrifices bandwidth, but it's good enough for now.
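The difference between the original abort check and the 'frame's grace' version can be sketched as follows. This is an illustration of the idea only, not the driver's code: the function and parameter names are invented, and the only outside fact relied on is that USB frame numbers are 11 bits wide and so wrap at 2048.

```python
FRAME_NUMBER_MODULUS = 2048   # USB frame numbers are 11 bits wide and wrap


def should_abort_csplit(ssplit_frame, csplit_frame, grace_frames=0):
    """True if the CSPLIT falls too many frames after its SSPLIT.

    grace_frames=0 models the original check (abort on any change of
    frame); grace_frames=1 models the "frame's grace" fix.
    """
    frames_elapsed = (csplit_frame - ssplit_frame) % FRAME_NUMBER_MODULUS
    return frames_elapsed > grace_frames
```

With `grace_frames=0`, an SSPLIT issued in subframe 7 of frame 10 is always aborted, because its first possible CSPLIT lands in frame 11. With `grace_frames=1` that CSPLIT is allowed through, and only a CSPLIT in frame 12 or later is abandoned.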

    4 is a pig; the host is simply too slow to run all that code and always get back in time for its CSPLIT.

    So: on to the fixes -

    Problem 2 is solved by adding logic to allow only one CSPLIT at a time. This also, obscurely, solves problem 1 (perhaps some kind of hardware channel tx conflict? hard to tell without knowing what the hardware channels are actually supposed to do).
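One way to read 'only one at a time' is as a gate in front of the scheduler: a split may start only when no other split is in flight, and the rest wait their turn in arrival order. The sketch below is a guess at the shape of such logic, not the actual patch; `SplitGate` and its methods are hypothetical names.

```python
from collections import deque


class SplitGate:
    """Let one split transaction run at a time; queue the rest in order."""

    def __init__(self):
        self.active = None      # tag of the split currently in flight
        self.waiting = deque()  # tags that asked while another was active

    def request(self, tag):
        """Return True if tag may issue its SSPLIT now, else queue it."""
        if self.active is None:
            self.active = tag
            return True
        self.waiting.append(tag)
        return False

    def complete(self, tag):
        """Call when tag's CSPLIT finished; return the next tag to start."""
        if tag != self.active:
            raise ValueError(f"{tag} is not the active split")
        self.active = self.waiting.popleft() if self.waiting else None
        return self.active
```

With only one split ever outstanding, CSPLITs cannot be issued out of order, at the cost of serialising transfers that the hub could otherwise have overlapped.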

    Problem 3 is easy: allow a frame's grace, as above.

    Problem 4 is a pig. I've tried to get around it by adding a fast path to the driver which spots a split in progress and immediately reschedules the transfer on the same host channel without going through the scheduling mechanism. However, the interrupt handling in that driver is a mess so I'm not sure if this is stable, and I'm certainly not sure it solves the problem.

    All seems promising so far, though more testing is needed with various peripherals and I have done nothing yet with non-periodic splits, which probably suffer from many of the same problems.

    As a side-note, I did have a quick look to see if ftrace would help me work out whether someone else was running for long periods with interrupts disabled, but all it did was crash my kernel. Maybe next time ..

    So: there you go. A whirlwind tour of how to get seriously confused about high-speed hubs; hope someone finds it useful. Questions, comments, etc. welcome - do let me know if you'd like the patches. They're against a vendor-only version, so until there's a formal source release for the product I shall need permission to release them.

    1 comment:

    Shar said...

    Hi Richard,
    Regarding
    "Problem 2 is solved by adding logic to allow only one CSPLIT at a time"

    I am facing the same problem of CSPLIT ordering in the driver, and I wanted to know whether it is already implemented in the driver and, if so, whether you can share the revision or a code snippet.

    rgds,
    Sharanu
