Hi all,
I will keep working on this following Michel and George suggestions. h
I agree with George that at first look the public code of dgemm looks quite ordinary. In my successive attempt I have tried to specialized a matrix times matrix (or vector) having in mind the special case of KF. Just counting the number of operations there is a reduction.
I will do the various test (in the line of George and Michel) and we will see.
I will off for a week or so, but I will do some work in the spare time. No much chance to have Internet connection though
Best
Andrea
On Sat, Jul 4, 2009 at 1:07 PM, G. Perendiageorge@perendia.orangehome.co.uk wrote:
Dear Michel (et al)
- I will continue refactoring KF.
When you say "The multiplications should be done using optimized Lapack DGEM" do you mean stand-alone ATLAS or Matlab DGEMM (I suspect the Matlab one)?
- Another possible reason we can not beat dgemm is that Matlab dgemm
function is not just a compilation of the publicly available BLAS dgemm.f , highly optimised by a compiler, but a function with the same name and interface which has been rewritten by Mathworks specially for Matlab, a system which justifies its commercial existence on fast matrix manipulation.
It is possible they rewrote few of the most commonly used routines to boost the performance even further. E.g. It may use direct memory manipulation, multi-threading, blocking/subdivision and poss. some other size reduction features for large or even special handlers for QT and symmetric matrices that are not present in the public domain version of dgemm.f (or in the BOOST library).
E.g., public version of dgemm.f uses classic nested loop such as IF (BETA.EQ.ZERO) THEN DO 20 J = 1,N DO 10 I = 1,M C(I,J) = ZERO ....
whilst memcpy() is a much quicker way to (re-)initialise a memory range to a values such as 0 respectively.
Hence, a test for our compilers, QT and the possible additional boosting features of Matlab dgemm would be the comparison of timing of our test routines being linked to: a) our own optimised compilation of the publicly available dgemm/dgemv in the way we do it for QT routines, and see how that scores against both, the same tests when using and linked to b) our QT routines on one and to c) the Matlab dgemm/dgemv on another hand.
Best regards
George artilogica@btconnect.com
----- Original Message ----- From: "Michel Juillard" michel.juillard@ens.fr To: "G. Perendia" george@perendia.orangehome.co.uk; "'Andrea Pagano'" andrea.pagano@jrc.it; "List for Dynare developers" dev@dynare.org Sent: Friday, July 03, 2009 8:07 PM Subject: C++ Kalman filter
Dear Andrea and George,
than you for your efforts on the Kalman filter. I would like to regroup efforts along the following lines:
- Kalman filter C++:
I would still like to be able to compare with an implementation that removes all object instantiation inside the loop: the lowest level object should be Kalman filter and all necessary work space should be allocated once for ever outside the main loop of the filter. The multiplications should be done using optimized Lapack DGEM. George should continue along this line. I'm suspicious of timings taken in Windows and we should double check them under Linux. For this reason as well, Makefile should also provide for building of tests under Linux. George and Andrea should have accounts on karaba. Sebastien, can you take care of that?
- Unless I'm missing something, there are only two explanations for not
being able to beat DGEMM a) our compilers don't optimize as well as the one used by Matlab b) there is a problem in the algorithm.
In order to eliminate and be able to quantify the importance of (a), I would like Andrea to write a standalone test for both operations:
- A*x
- A*P*A'
calling DGEMM (from Matlab's Lapack library) and compare it with his own routines.
A minimum modification would be to take DGEMM as it is and modify the loops so as to accomodate a quasi-triangular matrix instead of a general matrix. In this exercise, all lower level BLAS functions should be linked with Matlab's BLAS library.
We will make a new attempt to integrate Andrea's routines in the Kalman filter only when we obtain satisfactory standalone test results.
I think it is important to spend more time on these tests, because we are evaluating how much speed improvement we can expect from using compiled code instead of Matlab.
All the best,
Michel