
# Statistics of integer values

 P: n/a Hello everyone! I'm looking for statistics about the value ranges of integers in a set of "standard" C programs. I'm specifically interested in the average proportion of assignments etc. of integers whose value fits in (un)signed 16 bits versus 32 bits. Has anybody ever come across something similar? Bye, Stefan Dec 10 '05 #1
16 Replies

 P: n/a "Stefan Wallentowitz" wrote: I'm looking for statistics about the value ranges of integers in a set of "standard" C programs. I'm specifically interested in the average proportion of assignments etc. of integers whose value fits in (un)signed 16 bits versus 32 bits. Has anybody ever come across something similar? No. It will be rather tricky to define exactly what you mean. For instance, a loop in which a variable i ranges between 0 and N-1 is an extremely common construct. However, the counting variable usually isn't held for more than a few machine cycles. N, on the other hand, may well keep its value for the life of the program. However, if the loop is in a subroutine, N may be passed to that routine, copied from an "Nemployees" member of a data structure. So does N now count as one integer or two? Also, because of C semantics, i will actually be set to N at the end of the for loop. Probably, however, i will never be accessed outside the loop. So has i taken the value N, or not? Dec 10 '05 #2

 P: n/a On Sat, 10 Dec 2005 21:25:38 +0000, Malcolm wrote: Hi! No. It will be rather tricky to define exactly what you mean. OK, I thought it might be a bit imprecise. Also, because of C semantics, i will actually be set to N at the end of the for loop. Probably, however, i will never be accessed outside the loop. So has i taken the value N, or not? Yes, N is one of them, but so is the initialisation of i. In fact I'm interested in the distribution of common integer ranges being loaded into registers. Several examples: a=1, a=b+4, a=0x0feff0c4, etc. It is related to a compiler optimization, and it is interesting how many instructions are saved on average; e.g. when there's a high occurrence rate of 16-bit signed integers, the optimization brings more benefit. Some measurement data would be great. I also know that there is a machine dependency, but I envision some overall data from a more abstract view of C code. Greets, Stefan Dec 10 '05 #3

 P: n/a Stefan Wallentowitz wrote: I'm looking for statistics about the value ranges of integers in a set of "standard" C programs. You will probably have to generate your own. I'm specifically interested in the average proportion of assignments etc. of integers whose value fits in (un)signed 16 bits versus 32 bits. You need to be more precise about what you need. Does that mean you want to count the number of integer assignments executed in which the value has exactly 16 significant bits versus exactly 32 significant bits, and compute their ratio? Do you want to know how many assignment statements are made to 16-bit vs. 32-bit integer types, without regard to value and without regard to the number of times they execute? Or something else? -- Thad Dec 10 '05 #4

 P: n/a Stefan Wallentowitz wrote: It is related to a compiler optimization, and it is interesting how many instructions are saved on average; e.g. when there's a high occurrence rate of 16-bit signed integers, the optimization brings more benefit. Not a good idea. The compiler should be consistent, or the programmer will be surprised later on if the run-time code suddenly overflows at 65536. If the compiler implements 16-bit ints, that's fine. If it implements 32-bit ints, that's fine too. But if it is sometimes 32 and sometimes 16, that's not a good thing. An example of a compiler where this is configurable is the HI-TECH C compiler for PIC microcontrollers. The programmer gets to decide via compiler options or #pragma statements whether ints are 16 or 32 bits. Dec 11 '05 #5

 P: n/a Walter Roberson wrote: In article <11**********************@g14g2000cwa.googlegroups.com>, sl*******@yahoo.com wrote: Stefan Wallentowitz wrote: It is related to a compiler optimization, and it is interesting how many instructions are saved on average; e.g. when there's a high occurrence rate of 16-bit signed integers, the optimization brings more benefit. Not a good idea. The compiler should be consistent, or the programmer will be surprised later on if the run-time code suddenly overflows at 65536. If the compiler implements 16-bit ints, that's fine. If it implements 32-bit ints, that's fine too. But if it is sometimes 32 and sometimes 16, that's not a good thing. Ummm, it looked to me like the OP wasn't trying to say anything like that. If I understand correctly, the question is about integral literals appearing in source code -- though it should probably instead be about integral literals appearing after preprocessing and constant folding. In this case there would be no need to ask about statistics. It is always safe to do this, because the compiler knows there is ONLY one value for each literal -- not a range. This is implemented by almost all modern compilers. Not only that: if, for example, on a RISC machine you use the literal 1 or 0, the compiler will simply generate code to read the one and zero registers. Dec 11 '05 #7

 P: n/a In article <11**********************@g43g2000cwa.googlegroups.com>, sl*******@yahoo.com wrote: Walter Roberson wrote: If I understand correctly, the question is about integral literals appearing in source code -- though it should probably instead be about integral literals appearing after pre-processing and constant folding. In this case there would be no need to ask about statistics. It is always safe to do this, because the compiler knows there is ONLY one value for each literal -- not a range. There is more than one meaning of "safe" that might apply here. There is the "safe" of "Is there a certainty that I can find a valid instruction to use for this particular literal, that the literal will not overflow the range allowed for the instruction?". I suspect that's the meaning of "safe" that you are using. There is another meaning, though, when it comes to generating instructions: "Is there a certainty that if I chose to generate this particular one of the instruction sequences that I could generate, this choice will not result in worse performance than could be obtained with a different sequence?" And the answer to that is "It depends on the instruction set, upon the internal details of multiple dispatch or pipelining, and upon the cache characteristics of the surrounding code." Also, as I indicated earlier, the point might not be "can it safely be done", but rather, "Is it worth putting in the programming effort to do this potential optimization? Do small literals occur often enough that the savings would be appreciable? Is it worth the time to find good algorithms for the boundary cases, such as the short load taking longer to execute but the extra cost being worth it if it allows a tight loop to fit in cache when it would not otherwise fit?" And that's where the statistics come in.
You know the usual refrain about "premature optimization" -- well, it looks to me as if the OP is trying to get a feel for whether this kind of optimization is premature, or is worth while. -- "law -- it's a commodity" -- Andrew Ryan (The Globe and Mail, 2005/11/26) Dec 11 '05 #8

 P: n/a On Sat, 10 Dec 2005 23:35:03 -0800, sl*******@yahoo.com wrote: In this case there would be no need to ask about statistics. It is always safe to do this, because the compiler knows there is ONLY one value for each literal -- not a range. This is implemented by almost all modern compilers. Not only that: if, for example, on a RISC machine you use the literal 1 or 0, the compiler will simply generate code to read the one and zero registers. Yes, but writing my own compiler leads me to this question regarding the impact of a quite simple optimization. I am searching for statistics on how often a loaded integer (either in a register or as part of an immediate instruction, etc.) in general is: - in the 16-bit domain - above it (might be extended to: is zero, 8-bit, and so on) Bye, Stefan Dec 11 '05 #10

 P: n/a On Sun, 11 Dec 2005 08:21:25 +0000, Walter Roberson wrote: Also, as I indicated earlier, the point might not be "can it safely be done", but rather, "Is it worth putting in the programming effort to do this potential optimization? Do small literals occur often enough that the savings would be appreciable? Is it worth the time to find good algorithms for the boundary cases, such as the short load taking longer to execute but the extra cost being worth it if it allows a tight loop to fit in cache when it would not otherwise fit?" And that's where the statistics come in. You know the usual refrain about "premature optimization" -- well, it looks to me as if the OP is trying to get a feel for whether this kind of optimization is premature, or is worthwhile. Exactly. I want to illustrate the possible impact of a simple optimization I made. In fact this optimization most often speeds up execution, but a simple figure for the optimization ratio, reached in such an easy way, would be nice. It isn't anything very scientific, just for getting a feeling for the gain. Thanks for your detailed comments, Stefan Dec 11 '05 #11

 P: n/a Stefan Wallentowitz wrote: On Sat, 10 Dec 2005 23:35:03 -0800, sl*******@yahoo.com wrote: In this case there would be no need to ask about statistics. It is always safe to do this, because the compiler knows there is ONLY one value for each literal -- not a range. This is implemented by almost all modern compilers. Not only that: if, for example, on a RISC machine you use the literal 1 or 0, the compiler will simply generate code to read the one and zero registers. Yes, but writing my own compiler leads me to this question regarding the impact of a quite simple optimization. I am searching for statistics on how often a loaded integer (either in a register or as part of an immediate instruction, etc.) in general is: - in the 16-bit domain - above it (might be extended to: is zero, 8-bit, and so on) For what it's worth, I find the following values very common in code: 0, 1, 0xFF. Also, from experience, I find the majority of integer literals to be below 100, usually in the range of 0 to 10. The value 1 is common enough that most architectures have either increment/decrement instructions or a hardwired 1 register. The value 0 is common in a lot of data initialisations. The value 255 (0xFF) is very common in protocol-handling code, usually used for byte masking. Dec 11 '05 #12

 P: n/a On Sun, 11 Dec 2005 08:20:37 -0800, sl*******@yahoo.com wrote: For what it's worth, I find the following values very common in code: 0, 1, 0xFF. Also, from experience, I find the majority of integer literals to be below 100, usually in the range of 0 to 10. The value 1 is common enough that most architectures have either increment/decrement instructions or a hardwired 1 register. The value 0 is common in a lot of data initialisations. The value 255 (0xFF) is very common in protocol-handling code, usually used for byte masking. For sure, but a number says more than a thousand words.. Dec 11 '05 #13

 P: n/a Stefan Wallentowitz wrote: On Sun, 11 Dec 2005 08:20:37 -0800, sl*******@yahoo.com wrote: For what it's worth, I find the following values very common in code: 0, 1, 0xFF. Also, from experience, I find the majority of integer literals to be below 100, usually in the range of 0 to 10. The value 1 is common enough that most architectures have either increment/decrement instructions or a hardwired 1 register. The value 0 is common in a lot of data initialisations. The value 255 (0xFF) is very common in protocol-handling code, usually used for byte masking. For sure, but a number says more than a thousand words.. Download lots of source code, preferably large projects like the Linux kernel, Apache, XFree86 and Mozilla. Then find all literals. I did the following on my company's source code:

grep -r -e "= *[0-9]" SONaccessGateway/ > stat.txt

This gets all lines containing an equal sign followed by a number. Then, if you have Tcl installed, run the following script on the file above:

================== begin script
#! /usr/bin/env tclsh
if {$argc < 1} {
    set input stdin
} else {
    set input [open $argv "r"]
}
array set numbers {}
while {[eof $input] == 0} {
    set line [gets $input]
    # extract right hand side:
    set line [lindex [split $line "="] 1]
    if {$line != ""} {
        # extract first word:
        set line [split [string trim $line] " "]
        set n [lindex $line 0]
        # get and count the number:
        if {[string is integer -strict $n]} {
            set n [expr $n]  ;# normalise the number
            if {[info exists numbers($n)]} {
                incr numbers($n)
            } else {
                set numbers($n) 1
            }
        }
    }
}
if {$input != "stdin"} {
    close $input
}
# now we are done processing, print out stats:
puts "Literals found:"
foreach n [lsort -integer [array names numbers]] {
    puts "  $n - $numbers($n) times"
}
===================== end script

Save the script as sumstat.tcl. To run it, you can type the following:

tclsh sumstat.tcl stat.txt

where stat.txt is the file we generated earlier.
The following is the statistics of my employer's source codes: Literals found: -774712364 - 7 times -505224220 - 7 times -235736076 - 7 times -60 - 1 times -1 - 3 times 0 - 1039 times 1 - 723 times 2 - 119 times 3 - 159 times 4 - 159 times 5 - 87 times 6 - 10 times 7 - 16 times 8 - 15 times 9 - 1 times 10 - 17 times 11 - 2 times 12 - 41 times 13 - 1 times 14 - 4 times 15 - 6 times 16 - 20 times 17 - 7 times 20 - 172 times 23 - 6 times 24 - 2 times 25 - 2 times 27 - 1 times 28 - 2 times 30 - 10 times 31 - 3 times 32 - 29 times 36 - 2 times 38 - 2 times 39 - 3 times 40 - 5 times 42 - 10 times 44 - 2 times 48 - 19 times 50 - 6 times 54 - 1 times 56 - 5 times 57 - 2 times 60 - 21 times 64 - 46 times 67 - 2 times 71 - 2 times 78 - 2 times 80 - 3 times 85 - 1 times 90 - 35 times 96 - 3 times 100 - 7 times 112 - 3 times 120 - 6 times 127 - 4 times 128 - 12 times 144 - 3 times 160 - 3 times 164 - 1 times 167 - 3 times 168 - 4 times 169 - 2 times 176 - 3 times 192 - 4 times 200 - 9 times 208 - 3 times 210 - 6 times 211 - 8 times 212 - 8 times 224 - 3 times 226 - 1 times 253 - 1 times 255 - 7 times 256 - 14 times 271 - 1 times 299 - 1 times 300 - 1 times 384 - 23 times 400 - 3 times 420 - 4 times 493 - 10 times 500 - 35 times 513 - 1 times 600 - 10 times 636 - 1 times 647 - 1 times 648 - 4 times 691 - 16 times 730 - 1 times 800 - 4 times 802 - 1 times 995 - 1 times 1000 - 8 times 1024 - 15 times 1200 - 1 times 1234 - 3 times 1257 - 1 times 1500 - 3 times 1800 - 1 times 2000 - 1 times 2020 - 1 times 2048 - 3 times 2054 - 2 times 2345 - 2 times 3600 - 1 times 4096 - 3 times 5000 - 3 times 10000 - 3 times 16384 - 9 times 28800 - 3 times 32224 - 1 times 32768 - 6 times 36000 - 4 times 65535 - 3 times 86400 - 2 times 90000 - 1 times 123456 - 3 times 1000000 - 2 times 50331652 - 2 times 67452301 - 1 times 305441741 - 17 times 592100561 - 17 times 806429235 - 10 times 823206451 - 10 times 839983667 - 10 times 883674386 - 17 times 1073741824 - 2 times 2147483647 - 2 times Hope this 
helps. Of course, I'm assuming you're on a platform that has grep and tclsh installed. If not, just use my results. But note that the statistics above cover C, C++ and Perl code. I was too lazy to separate them. Dec 12 '05 #14

 P: n/a sl*******@yahoo.com wrote: Stefan Wallentowitz wrote: For sure, but a number says more than a thousand words.. Save the script as sumstat.tcl. To run it, you can type the following: tclsh sumstat.tcl stat.txt Hope this helps. Of course, I'm assuming you're on a platform that has grep and tclsh installed. If not, just use my results. But note that the statistics above cover C, C++ and Perl code. I was too lazy to separate them. I've modified the sumstat.tcl script to further analyse the results:

============================ begin sumstat.tcl script
#! /usr/bin/env tclsh
if {$argc < 1} {
    set input stdin
} else {
    set input [open $argv "r"]
}
array set numbers {}
while {[eof $input] == 0} {
    set line [gets $input]
    # extract words:
    set line [split $line " "]
    foreach n $line {
        # get and count the number:
        if {[string is integer -strict $n]} {
            set n [expr $n]  ;# normalise the number
            if {[info exists numbers($n)]} {
                incr numbers($n)
            } else {
                set numbers($n) 1
            }
        }
    }
}
if {$input != "stdin"} {
    close $input
}
# now we are done processing, print out stats:
set total 0
set bits8 0
set bits16 0
set bits32 0
puts "Literals found:"
foreach n [lsort -integer [array names numbers]] {
    puts "  $n - $numbers($n) times"
    if {$n < 256 && $n > -128} {
        incr bits8 $numbers($n)
    } elseif {$n < 65536 && $n > -32768} {
        incr bits16 $numbers($n)
    } else {
        incr bits32 $numbers($n)
    }
    incr total $numbers($n)
}
set percent8 [format "%0.2f" [expr $bits8*100.0/$total]]
set percent16 [format "%0.2f" [expr $bits16*100.0/$total]]
set percent32 [format "%0.2f" [expr $bits32*100.0/$total]]
puts "\nStatistics:"
puts "  8bit numbers  - $bits8 ($percent8%)"
puts "  16bit numbers - $bits16 ($percent16%)"
puts "  32bit numbers - $bits32 ($percent32%)"
============================ end script

Also, before, I used grep -r -e "= [0-9]". I now realise that this only captures assignments and misses integers used in other contexts, such as the 12 in a = b + 12.
I also modified the script to get all numbers in the source code instead of just numbers after the = sign. I ran this new script with: grep -r -E "[0-9]+" code_dir/ | tclsh sumstat.tcl piping the result of grep directly into the script. This simply gets all lines containing numbers. The new result is obviously much larger than before and is probably too long to post. The following is a snipped version of the result: Literals found: -2147483648 - 5 times -976170042 - 2 times -823302554 - 1 times -1 - 285 times 0 - 6704 times 1 - 5408 times 2 - 3600 times 3 - 2194 times 4 - 1735 times 1936484395 - 1 times 1936484398 - 1 times 2147483647 - 2 times Statistics: 8bit numbers - 34984 (83.66%) 16bit numbers - 6346 (15.18%) 32bit numbers - 486 (1.16%) Dec 12 '05 #15

 P: n/a On Sun, 11 Dec 2005 20:22:47 -0800, sl*******@yahoo.com wrote: sl*******@yahoo.com wrote: Stefan Wallentowitz wrote: > For sure, but a number says more than a thousand words.. Save the script as sumstat.tcl. To run it, you can type the following: tclsh sumstat.tcl stat.txt Hope this helps. Of course, I'm assuming you're on a platform that has grep and tclsh installed. If not, just use my results. But note that the statistics above cover C, C++ and Perl code. I was too lazy to separate them. Thank you very, very much. I will use this approach and apply it to some code specific to embedded systems. Although it isn't actual profiling data, it gives a good feel for the subject matter. Bye and thanks, Stefan Dec 12 '05 #16

 P: n/a In article <11**********************@z14g2000cwz.googlegroups.com>, "sl*******@yahoo.com" writes: Download lots of source code, preferably large projects like the Linux kernel, Apache, XFree86 and Mozilla. Then find all literals. I did the following on my company's source code: grep -r -e "= *[0-9]" SONaccessGateway/ > stat.txt As Walter Roberson suggested, it would probably be better to do this on the result of preprocessing the source (i.e., the output of translation phase 4, 5, or 6). That at least will catch literals that are expressed using object-like macros, though not necessarily folded constants (which could be computed during phase 7, or even not until runtime). Most implementations have an option to write the result of preprocessing to a file. It might be possible to get even more accurate results by analyzing an intermediate result from a later translation phase, such as the assembly-language output available from many implementations. I recently used such analysis to check for large auto-class objects in a large code base. (Details are implementation-dependent and OT, obviously.) -- Michael Wojcik mi************@microfocus.com Cooperation is just like two pagodas, one hardware and one software. -- Wen Jiabao Dec 13 '05 #17
