.dp-highlighter { font-family: "Consolas", "Monaco", "Courier New", Courier, monospace; font-size: 12px; background-color: #E7E5DC; width: 99%; overflow: auto; margin: 18px 0 18px 0 !important; padding-top: 1px; /* adds a little border on top when controls are hidden */ } /* clear styles */ .dp-highlighter ol, .dp-highlighter ol li, .dp-highlighter ol li span { margin: 0; padding: 0; border: none; } .dp-highlighter a, .dp-highlighter a:hover { background: none; border: none; padding: 0; margin: 0; } .dp-highlighter .bar { padding-left: 45px; } .dp-highlighter.collapsed .bar, .dp-highlighter.nogutter .bar { padding-left: 0px; } .dp-highlighter ol { list-style: decimal; /* for ie */ background-color: #fff; margin: 0px 0px 1px 45px !important; /* 1px bottom margin seems to fix occasional Firefox scrolling */ padding: 0px; color: #5C5C5C; } .dp-highlighter.nogutter ol, .dp-highlighter.nogutter ol li { list-style: none !important; margin-left: 0px !important; } .dp-highlighter ol li, .dp-highlighter .columns div { list-style: decimal-leading-zero; /* better look for others, override cascade from OL */ list-style-position: outside !important; border-left: 3px solid #6CE26C; background-color: #F8F8F8; color: #5C5C5C; padding: 0 3px 0 10px !important; margin: 0 !important; line-height: 14px; } .dp-highlighter.nogutter ol li, .dp-highlighter.nogutter .columns div { border: 0; } .dp-highlighter .columns { background-color: #F8F8F8; color: gray; overflow: hidden; width: 100%; } .dp-highlighter .columns div { padding-bottom: 5px; } .dp-highlighter ol li.alt { background-color: #FFF; color: inherit; } .dp-highlighter ol li span { color: black; background-color: inherit; } /* Adjust some properties when collapsed */ .dp-highlighter.collapsed ol { margin: 0px; } .dp-highlighter.collapsed ol li { display: none; } /* Additional modifications when in print-view */ .dp-highlighter.printing { border: none; } .dp-highlighter.printing .tools { display: none !important; } .dp-highlighter.printing li { display: list-item !important; } /* Styles for the tools */ .dp-highlighter .tools { padding: 3px 8px 3px 10px; font: 9px Verdana, Geneva, Arial, Helvetica, sans-serif; color: silver; background-color: #f8f8f8; padding-bottom: 10px; border-left: 3px solid #6CE26C; } .dp-highlighter.nogutter .tools { border-left: 0; } .dp-highlighter.collapsed .tools { border-bottom: 0; } .dp-highlighter .tools a { font-size: 9px; color: #a0a0a0; background-color: inherit; text-decoration: none; margin-right: 10px; } .dp-highlighter .tools a:hover { color: red; background-color: inherit; text-decoration: underline; } /* About dialog styles */ .dp-about { background-color: #fff; color: #333; margin: 0px; padding: 0px; } .dp-about table { width: 100%; height: 100%; font-size: 11px; font-family: Tahoma, Verdana, Arial, sans-serif !important; } .dp-about td { padding: 10px; vertical-align: top; } .dp-about .copy { border-bottom: 1px solid #ACA899; height: 95%; } .dp-about .title { color: red; background-color: inherit; font-weight: bold; } .dp-about .para { margin: 0 0 4px 0; } .dp-about .footer { background-color: #ECEADB; color: #333; border-top: 1px solid #fff; text-align: right; } .dp-about .close { font-size: 11px; font-family: Tahoma, Verdana, Arial, sans-serif !important; background-color: #ECEADB; color: #333; width: 60px; height: 22px; } /* Language specific styles */ .dp-highlighter .comment, .dp-highlighter .comments { color: #008200; background-color: inherit; } .dp-highlighter .string { color: blue; background-color: inherit; } .dp-highlighter .keyword { color: #069; font-weight: bold; background-color: inherit; } .dp-highlighter .preprocessor { color: gray; background-color: inherit; }

Friday, October 5, 2007

Explanation of Duff's device

This is a neat and informative explanation of Huff's Device. A famous optimization using loop unrolling. Notice that Huff's device was originally developed to avoid "many" loop branching when writing to a serial I/O port, so it is not a YAI of memcpy.

------------

[Someone asked for some more explanation on Duff's Device, which at first glance looks like it couldn't even be parsed correctly. This was my reply.]

It's a poser, isn't it?

There are a several different issues to understand.

One is that, as far as the compiler is concerned, the formal syntax of a switch statement is not

 switch ( expression )
{
case constant-expression :
statement-list
break ;

case constant-expression :
statement-list
break ;

...

default :
statement-list
break ;
}

Rather, the syntax is simply

 switch ( expression )
statement

where the statement can either be a single statement or a brace-enclosed list of statements. As you can imagine, this is the same sort of syntax specification as is used for if, for, and while statements.

Furthermore, the syntax rules for statements say, in effect, that any statement (anywhere) can be “labeled”, either with a goto label (that is, a label that can be the target of a goto) or a case label.

What this means so far is that the case labels in a switch statement do not, strictly speaking, have to be immediately within the outer set of braces that enclose the body of the switch statement; they can instead be hiding inside inner, nested blocks.

The next question is, if the case labels do not have to be immediately within the outer braces of the switch statement, how does the compiler find them and match them up? There are various ways to answer this, but perhaps the simplest and most convincing is to say that the compiler implements case labels in a similar way as it implements goto labels, and that a switch statement is really a sort of “computed goto”.

This is a bit of an oversimplification, because a goto can jump around to anywhere within a function, whereas case labels must always be somewhere within (if not immediately within) an enclosing switch statement. But the essential point is that the compiler simply does not notice that a case label might be “hiding” in a nested block within the body of the switch statement.

If you really want to understand this, it's worth discussing a few more points. You may not have realized it, but in many languages, including C, it is legal (though ill-advisable) to use a goto to jump into the body of a loop. For example, this code:

 int i = 7;
goto inside;
for(i = 0; i < 10; i++)
{
inside: printf("%d ", i);
}

prints “7 8 9 ”. The related code

 i = 42;
goto in2;
for(i = 0; i < 10; i++)
{
in2: printf("%d ", i);
}

prints “42 ”, and the related code

 i = 42;
goto in3;
for(i = 0; i != 10; i++)
{
in3: printf("%d ", i);
}

runs practically forever, printing thousands or billions of numbers.

I must emphasize that there are vanishingly few, and perhaps no, reasons why you would ever want to jump into a loop like this. It's sort of an accident that it's possible at all, and anyone who cares about programming style will assert that this questionable freedom is one you would never want to make use of. (They're right, too: this questionable freedom is one you would never want to make use of.) However, thinking about how a goto can jump into a loop may make it easier to imagine what's going on when a switch statement causes a jump to a case label buried in a nested inner block.

Another point worth mentioning concerns the question of why compilers are written like this in the first place. Why is the official syntax for the switch statement simply

 switch ( expression )
statement

rather than the more obvious, explicit syntax which I started this discussion by saying was not the official one? If the explicit syntax were the official syntax, it simply wouldn't be possible to write questionably-valid, impossible-to-understand code like Duff's Device.

The answer -- which may sound somewhat lame, but anyway -- has to do with making the compiler easier and cleaner to write. There's a whole bunch of machinery having to do with brace-enclosed lists of statements, and it's nice to be able to reuse that machinery in any context where a control flow statement might use a brace-enclosed list as its body. In fact, it's so nice to re-use that machinery that it's worth doing so even though that means the machinery has to be augmented with some additional complexity to handle interspersed case labels, complexity that ought to be needed only when the block being parsed is in fact the body of a switch statement.

In fact, whenever people say that a surprising feature was accidental, not intended, that it simply “fell out of” the chosen implementation of some system, this is exactly the sort of case they are talking about. It was never intended, I don't think, for case labels to be accepted (let alone properly handled) within nested inner blocks. But when the decision was made to re-use the block statement handling machinery for the body of switch statements, meaning that that machinery had to learn about case labels, it meant that case labels just happened to be accepted in front of statements in the bodies of loops, as long as there was a switch statement somewhere out there for them to hook up to. And, furthermore, though I won't try to argue whether this was accidental or deliberate or an accidental result of deliberately careful coding practices, it did turn out that the case labels were not only accepted, but worked “properly”, even when they were deeply nested.

* * *

I'm not sure whether you were more puzzled by the syntactic and semantic implications of an interleaved loop and switch statement, or by the particular way Duff's code made use of that combination to implement a practical algorithm. But if you had questions about the latter, let's now take a look at Duff's Device itself. Actually, for clarity, I'm going to present a variation on Duff's Device, a reimplementation of the standard memcpy function. (I presume you know what memcpy does.) A straightforward, bare-bones implementation of memcpy might look like this:

 void *memcpy(void *dest, const void *src, size_t n)
{
char *destp = dest;
const char *srcp = src;

for(; n > 0; n--)
*destp++ = *srcp++;

return dest;
}

This code works pretty well. But it's not perfect, and there are some things we could worry about, and one of those is efficiency.

A problem with the code as written is that it checks n each time through the loop to see if it's gotten down to 0. That is, if we copy, say, 1000 bytes, the running of the code will involve 1000 byte-copying operations and 1000 increment operations on srcp and 1000 more increment operations on destp and 1000 decrement operations on n and 1000 comparison operations on n. It might seem as if there's no way around this, but there is. The technique is called “loop unrolling”, and we can illustrate it with a set of alterations to the sample memcpy code above. Suppose we rewrite the loop like this:

  for(; n > 0; n -= 2)
{
*destp++ = *srcp++;
*destp++ = *srcp++;
}

The intent is to achieve exactly the same result, with a little less work (albeit with a little more code, which is a frequent tradeoff). Using the revised code, if n starts out as 1000, there will still be 1000 assignments and 1000+1000 increments on srcp and destp, but there will only be 500 comparisons of n against 0, not 1000. Also, there will be 500 subtractions where before we had 1000 decrements.

As you may have noticed, however, the revised code has a bug: it works properly only if n is even. If n is odd, it will copy too much. (In fact, since n has type size_t which is unsigned, the loop is likely to run out of control, running essentially forever instead of simply copying one byte too many. The situation is similar to our third jump-into-the-loop example, which set i to 42 but tested i != 10. But this is a side issue, so I won't digress further.)

To fix the code, we need to test for n being odd and handle the case of copying a single byte, since always copying by pairs obviously won't handle odd numbers correctly:

  if(n % 2 == 1)
{
*destp++ = *srcp++;
n--;
}

for(; n > 0; n -= 2)
{
*destp++ = *srcp++;
*destp++ = *srcp++;
}

Suppose we wanted to unroll the loop further. If we put four assignment statements inside the loop, we'll perform only n/4 subtractions and n/4 comparisons on n during the main loop. But now we have to worry not only about n being odd, but about it having a remainder of 1, 2, or 3 when divided by 4. So the code might look like

  switch(n % 4)
{
case 3: *destp++ = *srcp++; n--;
case 2: *destp++ = *srcp++; n--;
case 1: *destp++ = *srcp++; n--;
}

for(; n > 0; n -= 4)
{
*destp++ = *srcp++;
*destp++ = *srcp++;
*destp++ = *srcp++;
*destp++ = *srcp++;
}

Now you're probably starting to see where this is going. It's bad enough that when we unroll the loop by a factor of 4, to reduce the overhead of decrementing and testing n by a factor of 4, we also increase the code size by a factor of 4. But it's worse than that, because the initial handle-the-remainder cases mean that the code bloat is more like a factor of 8. Wouldn't it be nice if we could eliminate a few of those *destp++ = *srcp++ lines?

As the FAQ list notes, the essence of Duff's device is that it “solves the problem of handling the leftover bytes by interleaving a switch statement with the loop which copies bytes [N] at a time.” The idea is to jump into the middle of the unrolled loop, so that the first trip through the loop isn't a full trip, but one which handles the not-a-multiple-of-four leftover bytes, if necessary. This is obviously a screwy thing to do, at best -- after all, I was claiming up above that you would never want to make use of the questionable freedom C gives you to jump into the middle of a loop. We're about to do just that, and the fact that we're going to use a switch instead of a goto doesn't really make the exercise much (if any) less questionable.

Anyway, here is the interleaving trick -- that is, the essential idea of Duff's device -- applied to the memcpy example we've been working with:

  int nn = n;  /* need signed n to handle underflow */
/* also assumes initial n > 0 */

switch(nn % 4)
{
for(; nn > 0; nn -= 4)
{
case 0: *destp++ = *srcp++;
case 3: *destp++ = *srcp++;
case 2: *destp++ = *srcp++;
case 1: *destp++ = *srcp++;
}
}

This code works, although there are two additional complications. One is that since it always uses “-= 4” in the for loop, the count will go negative in the case where the initial n was not a multiple of 4. So we must do the subtraction on a signed variable -- if we kept subtracting 4 from n, which was passed in as a size_t, it would never go negative, but would wrap around to a very large positive number, that is, a number > 0, meaning that the loop would run far too long. (This is the explanation that I was trying to avoid digressing into up above.)

The second complication is that the code as written can no longer handle an initial n value of 0. (The “official” version of Duff's device has the same limitation.)

The code I've presented here captures, I think, the essential ideas of Duff's device. For the record, I'll explain the differences between this code and the “official” version. Duff's unrolls the loop by a factor of 8, it precomputes an n which is the byte count divided by 8 (and then adjusted) so that it can use “--n” instead of “n -= 8”, and it uses a do/while loop rather than a for loop. Those three differences are all mainly cosmetic; the one big difference is of course that the “official” version of Duff's device does not increment the to pointer at all. This is because to was pointing, not at a regular memory location into which the data was being copied, but rather, to a memory-mapped device register; the bytes being copied were actually being transmitted to a display device of some kind. (The application was some kind of animation, presumably real-time, which is why efficiency mattered so much.)

Source: http://c-faq.com/misc/duffexpln.html

No comments :