434,880 Members | 2,292 Online
Need help? Post your question and get tips & solutions from a community of 434,880 IT Pros & Developers. It's quick & easy.

duff's device / loop unriolling

 P: n/a Hi there, the Code below shows DJBs own implementation of strlen (str_len): unsigned int str_len(char *s) { register char *t; t = s; for (;;) { if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; } } I understand the code so far, but I have a question about the loops. To me, it seems he used loop unrolling (aka duff's device) to optimize the code while it is executed. But why did he use four loops? When the function is invoked, he didn't know how big "s" is. Or am I wrong here? I always thought, to unroll a loop I need to know how often the loop is used. Any help would be appreciated, JR Nov 15 '05 #1
22 Replies

 P: n/a Jan Richter wrote: Hi there, the Code below shows DJBs own implementation of strlen (str_len): unsigned int str_len(char *s) { register char *t; t = s; for (;;) { if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; } } I understand the code so far, but I have a question about the loops. To me, it seems he used loop unrolling (aka duff's device) to optimize the code while it is executed. But why did he use four loops? When the function is invoked, he didn't know how big "s" is. Or am I wrong here? I always thought, to unroll a loop I need to know how often the loop is used. Any help would be appreciated, JR In my linbox, strlen() gives by far better results... -- one's freedom stops where others' begin Giannis Papadopoulos http://dop.users.uth.gr/ University of Thessaly Computer & Communications Engineering dept. Nov 15 '05 #2

 P: n/a "Giannis Papadopoulos" wrote: [...] In my linbox, strlen() gives by far better results... Yes, same for me, so where is the benefit? Cheers, JR Nov 15 '05 #3

 P: n/a Jan Richter wrote: "Giannis Papadopoulos" wrote: [...]In my linbox, strlen() gives by far better results... Yes, same for me, so where is the benefit? Cheers, JR No benefit... Maybe it is written for a compiler that does not know how to unroll loops... And yes, loop unrolling works perfectly when you have a hint how many times this loop will be executed. Doesn't the author of this peculiar str_len() say anything? -- one's freedom stops where others' begin Giannis Papadopoulos http://dop.users.uth.gr/ University of Thessaly Computer & Communications Engineering dept. Nov 15 '05 #4

 P: n/a Jan Richter wrote:But why did he use four loops? Whenthe function is invoked, he didn't know how big "s" is. may be based on the consideration of pipeline. I heard Intel PII CPU has four levels pipelines of integer operation. is that right? Nov 15 '05 #5

 P: n/a "Jan Richter" wrote: "Giannis Papadopoulos" wrote: In my linbox, strlen() gives by far better results... Yes, same for me, so where is the benefit? The original Duff's Device did a bit more than merely count. It's in the FAQ, btw, which you should've read. . Richard Nov 15 '05 #6

 P: n/a On Fri, 19 Aug 2005 11:53:21 +0200, Jan Richter wrote: Hi there, the Code below shows DJBs own implementation of strlen (str_len): unsigned int str_len(char *s) { register char *t; t = s; for (;;) { if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; } } I understand the code so far, but I have a question about the loops. To me, it seems he used loop unrolling (aka duff's device) to optimize the code while it is executed. As someone else has said most optimising compilers will do this for you. It's a far more appropriate technique for assembly. But perhaps the author had a specific implementation in mind for which this happened to work best. Also as pointed out elsewhere Duff's Device is a more specific term than loop unrolling. But why did he use four loops? When the function is invoked, he didn't know how big "s" is. Or am I wrong here? I always thought, to unroll a loop I need to know how often the loop is used. There are two main considerations: speed vs code size. More unrolling == less jump opcodes == more loop body code. Which is equivalent to a speed increase at the cost of a code size increase. If you know how often the loop body will be iterated on average then you can avoid unrolling the loop by more than that number. Hence you avoid the extra code without affecting the speed increase. If you don't know, then you make a decision as to how much extra code you're willing to tolerate. There are other issues like keeping all the loop body code in the instruction cache and possibly others that I'm not too familiar with not being an assembly coder. For the strlen function as you point out you can't know what the average number of iterations will be so the only consideration is code size. Providing you can keep it all in the instruction cache the loop could be unrolled indefinitely and continue to provide speed increases for calls with a sufficiently long string. Why did the author choose four? Who knows - perhaps he wanted the function to be reasonably small whilst still providing a modest speed increase. Which as you and others point out, it doesn't. Going to show once again that usually optimisations are best left to the compiler or assembly. -- http://members.dodo.com.au/~netocrat Nov 15 '05 #7

 P: n/a Jan Richter wrote: Hi there, the Code below shows DJBs own implementation of strlen (str_len): unsigned int str_len(char *s) { register char *t; t = s; for (;;) { if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; } } Why only four if statements in the loop? I guess that's not supposed to be an example of portable code. The (t-s) expressions are of type ptrdiff_t, which is unsuitable for use that way, in portable code. That's not Duff's Device. http://www.lysator.liu.se/c/duffs-device.html -- pete Nov 15 '05 #8

 P: n/a Giannis Papadopoulos wrote: No benefit... Maybe it is written for a compiler that does not know how to unroll loops... Probably; an implementation simple-minded enough to trust the programmer when he uses the "register" storage class specifier probably could use some help unrolling a loop. The VAX compiler for which Duff originally wrote his device was (presumably) such an implementation. -- Christopher Benson-Manica | I *should* know what I'm talking about - if I ataru(at)cyberspace.org | don't, I need to know. Flames welcome. Nov 15 '05 #9

 P: n/a "Jan Richter" writes: the Code below shows DJBs own implementation of strlen (str_len): unsigned int str_len(char *s) { register char *t; t = s; for (;;) { if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; } } I understand the code so far, but I have a question about the loops. To me, it seems he used loop unrolling (aka duff's device) to optimize the code while it is executed. But why did he use four loops? When the function is invoked, he didn't know how big "s" is. Or am I wrong here? I always thought, to unroll a loop I need to know how often the loop is used. Duff's device is a specific technique, not a synonym for loop unrolling. See question 20.35 of the C FAQ. The use of unsigned int is questionable. The return type should be size_t. It's possible that this implementation predates the existence of size_t (though it's been modernized with a prototype). I would guess that it might be more efficient than strlen() only on a very old implementation. A matter of terminology: it doesn't have four loops. It has a single loop containing four if statements. The statement that's executed N times is if (!*t) return t - s; ++t; N needn't be a multiple of 4 because it can break out of the loop partway through the body of the for loop. In the following, I'll use the term "iteration" for an execution of the entire body of the loop, and "sub-iteration" for an execution of a single line within the loop. The idea of loop unrolling is that it avoids the overhead of a test on each iteration, falling through from one statement to the next and performing the test and branch only once every 4 elements. The number of times a loop is to be unrolled is a tradeoff between speed and code size -- and if it's unrolled too much (say, 1024 times), the code size itself can make it run slower due to cache issues. This is all *extremely* system-specific, which is why the whole thing is best left to the compiler. Furthermore, the "beauty" of Duff's Device is that it jumps into the middle of the loop, so the loop never has to exit from the middle, and it really does avoid the overhead on each sub-iteration. The code above doesn't do this; instead, it performs a test and branch on each sub-iteration. I'd be surprised to see the above code performing better than strlen() on any modern implementation, particularly since an implementation is free to implement strlen() using whatever non-portable tricks it likes to squeeze out the last clock cycle. -- Keith Thompson (The_Other_Keith) ks***@mib.org San Diego Supercomputer Center <*> We must do something. This is something. Therefore, we must do this. Nov 15 '05 #10

 P: n/a In article <3m*************@individual.net>, "Jan Richter" wrote: Hi there, the Code below shows DJBs own implementation of strlen (str_len): unsigned int str_len(char *s) { register char *t; t = s; for (;;) { if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; } } I understand the code so far, but I have a question about the loops. To me, it seems he used loop unrolling (aka duff's device) to optimize the code while it is executed. Loop unrolling != Duff's device. Duff's device is a combination of loop unrolling + completely perverted usage of a switch statement, which is likely to prevent the compiler from optimising the code But why did he use four loops? When the function is invoked, he didn't know how big "s" is. Or am I wrong here? I always thought, to unroll a loop I need to know how often the loop is used. As you can see, knowing how often a loop will be executed is not necessary. However, before you unroll a loop by hand, you should check whether the compiler can do loop unrolling itself, which is likely to produce more efficient code. Nov 15 '05 #11

 P: n/a Keith Thompson wrote: "Jan Richter" writes: the Code below shows DJBs own implementation of strlen (str_len): unsigned int str_len(char *s) { register char *t; t = s; for (;;) { if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; } } The idea of loop unrolling is that it avoids the overhead of a test on each iteration, That's supposed to be the idea, but the above code has one test per increment just like a naive portable implementation. size_t str_len(const char *s) { size_t n = 0; while (s[n] != '\0') { ++n; } return n; } -- pete Nov 15 '05 #12

 P: n/a pete wrote: Keith Thompson wrote: "Jan Richter" writes: > the Code below shows DJBs own implementation of strlen (str_len): > > unsigned int str_len(char *s) > { > register char *t; > t = s; > for (;;) { > if (!*t) return t - s; ++t; > if (!*t) return t - s; ++t; > if (!*t) return t - s; ++t; > if (!*t) return t - s; ++t; > } > } The idea of loop unrolling is that it avoids the overhead of a test on each iteration, That's supposed to be the idea, but the above code has one test per increment just like a naive portable implementation. size_t str_len(const char *s) { size_t n = 0; while (s[n] != '\0') { ++n; } return n; } IANAAP (I am not an assembly programmer) so I don't know how the code typically translates, but it seems to me that it does provide some benefit - it reduces by 75%[1] the number of jump statements executed (to return from the end to the start of the loop). [1] When the number of iterations is not a multiple of 4 this is only approximate. -- http://members.dodo.com.au/~netocrat Nov 15 '05 #13

 P: n/a Christian Bau wrote (in article ): In article <3m*************@individual.net>, "Jan Richter" wrote: Hi there, the Code below shows DJBs own implementation of strlen (str_len): unsigned int str_len(char *s) { register char *t; t = s; for (;;) { if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; if (!*t) return t - s; ++t; } } I understand the code so far, but I have a question about the loops. To me, it seems he used loop unrolling (aka duff's device) to optimize the code while it is executed. Loop unrolling != Duff's device. Duff's device is a combination of loop unrolling + completely perverted usage of a switch statement, which is likely to prevent the compiler from optimising the code It's suddenly being discussed again, probably as a result of an article in DDJ (Doctor Dobb's Journal) about it in the August issue. Ralf Holly proposed using it in macro form for generic loop unrolling, which probably makes people misunderstand its original purpose. He proposed something like #define DUFF_DEVICE_8(macroCount, macroAction) \ do { \ size_t duffcount = (macroCount); \ size_t dufftimes = (duffcount + 7) >>3u; \ switch(duffcount & 7) { \ case 0: do { macroAction; \ case 7: macroAction; \ case 6: macroAction; \ case 5: macroAction; \ case 4: macroAction; \ case 3: macroAction; \ case 2: macroAction; \ case 1: macroAction; \ } while (--dufftimes > 0); \ } \ } while (0) Of course, the caller has to know not to call with a 0 countvalue or it will execute 8 times instead. Probably not all that practical in general use, but you might find somewhere, (after profiling) where it makes some sense. -- Randy Howard (2reply remove FOOBAR) Nov 15 '05 #14

 P: n/a pete writes: Keith Thompson wrote: [...] The idea of loop unrolling is that it avoids the overhead of a test on each iteration, That's supposed to be the idea, but the above code has one test per increment just like a naive portable implementation. Yes, as I said in the portion of the article that you snipped. -- Keith Thompson (The_Other_Keith) ks***@mib.org San Diego Supercomputer Center <*> We must do something. This is something. Therefore, we must do this. Nov 15 '05 #15

 P: n/a In article Christopher Benson-Manica writes: Giannis Papadopoulos wrote: No benefit... Maybe it is written for a compiler that does not know how to unroll loops... Probably; an implementation simple-minded enough to trust the programmer when he uses the "register" storage class specifier probably could use some help unrolling a loop. The VAX compiler for which Duff originally wrote his device was (presumably) such an implementation. Make that probably and presumably to "certainly". Loop unrolling was pretty unheard of before the 80s. It came only in general use in the 80s when more pipe-lined processors appeared (and it made most sense on those). General compilers doing loop unrolling only appeared in the early 90s (if I remember right). Even CDC Fortran compilers of the 70s and early 80s did not do loop unrolling, while it made perfect sense on the CDC Cybers. -- dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131 home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/ Nov 15 '05 #16

 P: n/a In article Keith Thompson writes: .... The idea of loop unrolling is that it avoids the overhead of a test on each iteration, falling through from one statement to the next and performing the test and branch only once every 4 elements. No. It all has to do with pipelining processors, and the *instruction* cache. It also has effect on systems where a backward branch is much more expensive than a forward branch. The number of times a loop is to be unrolled is a tradeoff between speed and code size -- and if it's unrolled too much (say, 1024 times), the code size itself can make it run slower due to cache issues. This is all *extremely* system-specific, which is why the whole thing is best left to the compiler. Indeed. But I think the code dates from a time when compilers did not do loop unrolling. I'd be surprised to see the above code performing better than strlen() on any modern implementation, particularly since an implementation is free to implement strlen() using whatever non-portable tricks it likes to squeeze out the last clock cycle. Again, indeed. There are extremely easy, and efficient, techniques to determine (on a 64 bit system) whether any of 8 consecutive 8-bit entities is zero. -- dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131 home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/ Nov 15 '05 #17

 P: n/a In article <00*****************************@news.verizon.net> , Randy Howard writes: It's suddenly being discussed again, probably as a result of an article in DDJ (Doctor Dobb's Journal) about it in the August issue. Ralf Holly proposed using it in macro form for generic loop unrolling, which probably makes people misunderstand its original purpose. He proposed something like [Tabs converted to spaces to avoid wrapping.] #define DUFF_DEVICE_8(macroCount, macroAction) \ do { \ size_t duffcount = (macroCount); \ size_t dufftimes = (duffcount + 7) >>3u; \ switch(duffcount & 7) { \ case 0: do { macroAction; \ case 7: macroAction; \ case 6: macroAction; \ case 5: macroAction; \ case 4: macroAction; \ case 3: macroAction; \ case 2: macroAction; \ case 1: macroAction; \ } while (--dufftimes > 0); \ } \ } while (0) Of course, the caller has to know not to call with a 0 countvalue or it will execute 8 times instead. Unless I'm missing something, that's easily fixed with a if (!duffcount) break; before the switch. (Testing duffcount avoids using macroCount, which might have side effects, twice, of course.) A worse problem would be using it with a negative count; hopefully the compiler would provide a useful diagnostic for the conversion between a signed value and a size_t when duffcount is initialized, but that's a QoI issue. (Also, I know far too many C programmers who routinely ignore such diagnostics, partly because their code is full of them. I suppose that's a QoP issue.) Probably not all that practical in general use, but you might find somewhere, (after profiling) where it makes some sense. Agreed. I have found cases where relatively recent commercial implementations don't unroll even simple loops where unrolling has a significant benefit. Of course, if the loop isn't in a performance- critical path, it doesn't matter anyway. -- Michael Wojcik mi************@microfocus.com We are subdued to what we work in. (E M Forster) Nov 15 '05 #18

 P: n/a Michael Wojcik wrote (in article ): In article <00*****************************@news.verizon.net> , Randy Howard writes: It's suddenly being discussed again, probably as a result of an article in DDJ (Doctor Dobb's Journal) about it in the August issue. Ralf Holly proposed using it in macro form for generic loop unrolling, which probably makes people misunderstand its original purpose. He proposed something like [Tabs converted to spaces to avoid wrapping.] Ack, sorry about that. I recently changed newsreaders, and didn't pick up on it not doing that magically in the new version. I'll be more careful in the future. -- Randy Howard (2reply remove FOOBAR) Nov 15 '05 #19

 P: n/a mw*****@newsguy.com (Michael Wojcik) writes: In article <00*****************************@news.verizon.net> , Randy Howard writes: It's suddenly being discussed again, probably as a result of an article in DDJ (Doctor Dobb's Journal) about it in the August issue. Ralf Holly proposed using it in macro form for generic loop unrolling, which probably makes people misunderstand its original purpose. He proposed something like [Tabs converted to spaces to avoid wrapping.] #define DUFF_DEVICE_8(macroCount, macroAction) \ do { \ size_t duffcount = (macroCount); \ size_t dufftimes = (duffcount + 7) >>3u; \ switch(duffcount & 7) { \ case 0: do { macroAction; \ case 7: macroAction; \ case 6: macroAction; \ case 5: macroAction; \ case 4: macroAction; \ case 3: macroAction; \ case 2: macroAction; \ case 1: macroAction; \ } while (--dufftimes > 0); \ } \ } while (0) Of course, the caller has to know not to call with a 0 countvalue or it will execute 8 times instead. Unless I'm missing something, that's easily fixed with a if (!duffcount) break; before the switch. (Testing duffcount avoids using macroCount, which might have side effects, twice, of course.) Alternatively: #define DUFF_DEVICE_8(macroCount, macroAction) \ do { \ size_t duffcount = (macroCount); \ size_t dufftimes = duffcount >>3u; \ switch(duffcount & 7) { \ do { macroAction; \ case 7: macroAction; \ case 6: macroAction; \ case 5: macroAction; \ case 4: macroAction; \ case 3: macroAction; \ case 2: macroAction; \ case 1: macroAction; \ case 0: ; \ } while (dufftimes-- > 0); \ } \ } while (0) handles the case with zero iterations without needing an extra 'if' before the switch, and simplifies the calculation of 'dufftimes'. Nov 15 '05 #20

 P: n/a # #define DUFF_DEVICE_8(macroCount, macroAction) \ With a modern, optimising compiler, it's bad idea. Compilers can do unrolling for you. Duff's Device is going to make the code so complicated that it will prevent the compiler from doing a number of additional optimisations. If you insist on unrolling your loops, do something like for (i=0; i+8

 P: n/a SM Ryan wrote (in article <11*************@corp.supernews.com>): # #define DUFF_DEVICE_8(macroCount, macroAction) \ With a modern, optimising compiler, it's bad idea. Compilers can do unrolling for you. If you read the article in DDJ, you will see that your assumption, although likely broadly true is not /always/ true. As usual, it depends. Duff's Device is going to make the code so complicated that it will prevent the compiler from doing a number of additional optimisations. That is why profilers are useful. To find out, for each and every case, which happens to apply. -- Randy Howard (2reply remove FOOBAR) Nov 15 '05 #22

 P: n/a SM Ryan writes: # #define DUFF_DEVICE_8(macroCount, macroAction) \ With a modern, optimising compiler, it's bad idea. As with most widely known programming techniques, whether it makes sense to employ Duff's Device in some particular circumstances depends on the circumstances. Compilers can do unrolling for you. The question, however, usually is not whether some hypothetical compiler *can* but whether some actual compiler *does*. These questions often have different answers. Duff's Device is going to make the code so complicated that it will prevent the compiler from doing a number of additional optimisations. Use of Duff's Device may complicate the code to the point where it *might* prevent the compiler from doing additional useful optimizations, but there's no guarantee of that. At this point Duff's Device is well known enough so advanced compilers using structural analysis to do their flow analysis may very well recognize the particular form of multiple-entry loop that Duff's Device uses and deal with it appropriately. More significantly, the loop body that is being unrolled is usually very simple (otherwise why are we bothering to unroll it?); often times it won't be improved beyond what is already done in the multiple-entry loop form. If you insist on unrolling your loops, do something like for (i=0; i+8