Windows/Linux optimization problem

Hi all,
I have a strange optimization problem. I have written a small program,
basically a matrix-vector multiplication at its core, that needs to
run as fast as possible.

The relevant code snippet is:

    for (e = start_e; e < end_e; e++)
        for (s = start_s; s < end_s;)
            *e += (*r++) * (*s++);

where all variables are float pointers, 'r' is the matrix, 's' the
vector and 'e' the result vector. I call the operation on the last
line a MAC (multiply-accumulate), a common measure of performance on
DSPs. The 'r' matrix is very large (about 15 MB) and does not fit into
the cache.

The program does a thousand iterations, each consisting of some setup
and of the matrix mult. above, and prints out the speed.
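
For readers who want to reproduce the Mmac/sec figures, here is a minimal sketch of how such a number can be derived. It is not the original test program: the function names `matvec` and `mmac_per_sec`, the row-major indexing, and the choice to zero each result element inside the function are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: matrix-vector product e = r * s, with r stored
   row-major as rows x cols floats. Each result element is zeroed here
   (an assumption; the original snippet accumulates into whatever *e
   already holds). Returns the number of MACs performed. */
size_t matvec(const float *r, const float *s, float *e,
              size_t rows, size_t cols)
{
    size_t i, j;
    for (i = 0; i < rows; i++) {
        e[i] = 0.0f;
        for (j = 0; j < cols; j++)
            e[i] += r[i * cols + j] * s[j];   /* one MAC */
    }
    return rows * cols;
}

/* Hypothetical helper: throughput in millions of MACs per second,
   given the MACs per iteration, the iteration count and the elapsed
   wall-clock time in seconds. */
double mmac_per_sec(size_t macs_per_iter, size_t iterations, double seconds)
{
    return (double)macs_per_iter * (double)iterations / seconds / 1e6;
}
```

The elapsed time would come from whatever timer the platform offers; the timer choice is left out here because it differs between Windows and Linux.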

On Linux, using an Intel Xeon 2.6 GHz (512 KB cache) I get the
following result:

Done in 0.42 seconds (2398.55 iterations/sec) (487.00 Mmac/sec)

The above result was with the optimizing Intel compiler v9.0, which
auto-vectorizes loops using SSE. The non-SSE version was only about 20%
slower.

On Windows, using my Athlon 2700+, I get this:

Done in 1.81 seconds (551.61 iterations/sec) (112.00 Mmac/sec)

I then learned that my non-professional copy of Visual C++ does not
optimize binaries (!), so I downloaded the Microsoft Visual C++
Toolkit 2003, which claims to have the same optimizing compiler
as the professional version of Microsoft Visual C++. The
result is even worse:

Done in 1.85 seconds (540.53 iterations/sec) (109.75 Mmac/sec)

The Windows version was compiled with this command line:

cl /O2 test2.c

Adding flags for SSE instructions did not help.

Does anyone have a clue what I'm doing wrong? The numbers are very
repeatable. Using a smaller 'r' matrix pushed the speed on the Linux
Xeon up to 1.5 Gmac/sec (!), while the Windows version on the Athlon never
went over 250 Mmac/sec.

Thanks for the answers :-)
Alfio


Feb 10 '06 #1
On Fri, 10 Feb 2006 21:17:32 GMT, re***@dodgeit.com (Renato) wrote:
> Hi all,
> I have a strange optimization problem. I have written a small program,


An interesting problem, but way off topic here, where we discuss the
standard C language, not specific implementations, and not
optimization. Look for a Microsoft newsgroup.

--
Al Balmer
Sun City, AZ
Feb 10 '06 #2
On Fri, 10 Feb 2006 15:09:42 -0700, Al Balmer <al******@att.net> wrote:
> On Fri, 10 Feb 2006 21:17:32 GMT, re***@dodgeit.com (Renato) wrote:
>> Hi all,
>> I have a strange optimization problem. I have written a small program,
>
> An interesting problem, but way off topic here, where we discuss the
> standard C language, not specific implementations, and not
> optimization. Look for a Microsoft newsgroup.


Sorry, I didn't realize that it was off topic. I'll post it somewhere
else.

Alfio
Feb 10 '06 #3

"Renato" <re***@dodgeit.com> wrote:
> The relevant code snippet is:
>
>     for (e = start_e; e < end_e; e++)
>         for (s = start_s; s < end_s;)
>             *e += (*r++) * (*s++);
>
> where all variables are float pointers, 'r' is the matrix, 's' the
> vector and 'e' the result vector. I call the operation on the last
> line a MAC (multiply-accumulate), a common measure of performance on
> DSPs. The 'r' matrix is very large (about 15 MB) and does not fit into
> the cache.

[ Windows worse than Linux ]

Try looking at the assembly code produced (if you have no tools,
make a minimal program and then hand-disassemble the binary).

This will tell you whether it is the greedy operating system or the bad
compiler causing your problems on the Windows machine.
Feb 10 '06 #4
Renato wrote:
> Hi all, The relevant code snippet is:
>
>     for (e = start_e; e < end_e; e++)
>         for (s = start_s; s < end_s;)
>             *e += (*r++) * (*s++);

More than this is relevant, including how the pointers are declared.

> The windows version was compiled with this command line:
>
> cl /O2 test2.c
>
> Adding flags for SSE instructions did not help.
>
> Anyone has a clue of what I'm doing wrong? The numbers are very
> repeatable. Using a smaller 'r' matrix pushed the speed on the Linux
> xeon up to 1.5 GMac (!), while the windows version on the Athlon never
> went over 250 Mmac.

Most Windows compilers do not default to requiring programs to comply
with the C standard on type-based aliasing. In fact, they don't even act as
C compilers by default. Thus, they may assume possible side effects
which prevent scalar reduction (registerization of the sum
accumulation), or unpredictable changes in the pointer values preventing
those from being registerized. You could make it easier on the compiler
by declaring a local scalar for the accumulator, and unambiguously
moving the assignment to *e to the outer loop.

Using pointers to make counted for loops has pitfalls. Purists might
replace the < condition with !=, but that introduces ambiguities in how
to treat the situation where the loop might wrap around in the address
space. So, it's possible that one compiler might choose not to optimize
for such reasons.
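
The local-accumulator suggestion above could look roughly like the following. This is only a sketch under stated assumptions: the function name `mac_rows` and its parameter list are hypothetical, the pointers are assumed to be plain `float *` into non-overlapping arrays, and whether a given compiler actually generates faster code from it has not been measured here.

```c
/* Hypothetical rewrite of the posted loop: the running sum lives in a
   local scalar, so the compiler does not have to assume that a store
   through e might change what r or s point at. The single store to *e
   happens once per row, in the outer loop. */
void mac_rows(float *start_e, float *end_e,
              const float *start_s, const float *end_s,
              const float *r)
{
    float *e;
    const float *s;

    for (e = start_e; e < end_e; e++) {
        float sum = *e;                 /* keep the original "+=" semantics */
        for (s = start_s; s < end_s;)
            sum += (*r++) * (*s++);     /* one MAC per step, as posted */
        *e = sum;                       /* single write-back per row */
    }
}
```

The additions are performed in the same order as in the posted loop, so the results should match it; only the place where the partial sum is held changes.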
Feb 12 '06 #5
