Dear Michel and Andrea
Using the updated version of QTV, QTV_1.f90, we got about a 10% improvement on large matrices:
Running 1,000,000 loops with 100x100 matrix and vector:
Dgemv : 26 sec
QTV high level subroutine: 59 sec
QTV_1 (2nd version of high level subroutine): 53 sec
QT modular: 240 sec
Best regards
George Mob. +44(0)7951415480 artilogica@btconnect.com
----- Original Message -----
From: "G. Perendia" george@perendia.orangehome.co.uk
To: "Michel Juillard" michel.juillard@ens.fr
Cc: "andrea pagano" pagano.andrea@gmail.com; "List for Dynare developers" dev@dynare.org
Sent: Wednesday, June 24, 2009 2:19 PM
Subject: Re: Quasi triangular matrices in Kalman filters
Dear Michel and Andrea
I ran some initial standalone .exe tests with 2 matrix sizes and 3 optimised programs:
- dgemv_exe: calls the Matlab Lapack/Blas dgemv,
- qtvmv_exe: calls the new high-level QT library subroutine QTV.f90,
- qtamv_exe: calls 4 different low-level QT library subroutines in a highly modular fashion and performs the vector addition at the end.
Whilst the new high-level subroutine is marginally faster than dgemv for smaller matrices (11x11), both QT subroutines appear slower than dgemv for the larger matrices (100x100).
QT, Matlab and GM T*a stand-alone exe calculations
Running 10,000,000 loops with 11x11 matrix and vector:
Dgemv : 10 sec
QTV high level subroutine: 9 sec
QT modular: 32 sec
Running 1,000,000 loops with 100x100 matrix and vector:
Dgemv : 26 sec
QTV high level subroutine: 59 sec
QT modular: 240 sec
It is quite obvious that the modular approach is a no-go: it appears to spend too much time calling 4 f90 subroutines and passing data between C++ and f90. A single optimised (new) high-level subroutine (QTV.f90) is a much better solution, roughly 3.5 to 4 times faster in the timings above.
NOTES:
The QT matrices are from the qz decomposition of two rand(size) matrices.
The elapsed time has been very consistent across several repeated runs!
The elapsed execution time was measured as the difference between the Unix date after and before the .exe run, e.g.:
$ date;./qtamvm_exe QT100t_tab.txt aa100_dat 100 1000000;date
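As a complement to timing with date, the loop could also be timed inside the program itself, which excludes process start-up and file reading from the measurement. A minimal C++ sketch, assuming the work to repeat is wrapped in a callable; time_loops is a hypothetical helper and not part of the QT test programs:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical helper (not from the QT test programs): runs `body`
// `loops` times and returns the elapsed wall-clock time in seconds.
double time_loops(std::size_t loops, const std::function<void()>& body) {
    assert(static_cast<bool>(body));  // an empty std::function cannot be called
    // steady_clock is monotonic, so the difference is never negative.
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < loops; ++i) {
        body();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

A call such as time_loops(1000000, [&] { /* one T*a product */ }) would then return the seconds spent in the loop only.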
PS: For correct results, one needs to supply a tab-delimited file containing the transpose of the QT matrices as input, because the FORTRAN matrix orientation is different from that of C++.
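The orientation issue comes from Fortran storing matrices column by column, while C++ code usually stores them row by row. A minimal C++ sketch of the conversion, assuming the matrix is held in a flat row-major array; to_column_major is a hypothetical helper, not something from the QT library:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the QT library): copies a row-major
// rows x cols matrix into the column-major layout Fortran expects.
std::vector<double> to_column_major(const std::vector<double>& row_major,
                                    std::size_t rows, std::size_t cols) {
    assert(row_major.size() == rows * cols);
    std::vector<double> col_major(rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            // Element (i, j): row-major index i*cols + j,
            // column-major index j*rows + i.
            col_major[j * rows + i] = row_major[i * cols + j];
        }
    }
    return col_major;
}
```

Doing this conversion in the C++ driver would be an alternative to transposing the input file by hand.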
Shall I upload the data files too?
Best regards
George Mob. +44(0)7951415480 artilogica@btconnect.com
----- Original Message -----
From: "Michel Juillard" michel.juillard@ens.fr
To: "G. Perendia" george@perendia.orangehome.co.uk
Cc: "andrea pagano" pagano.andrea@gmail.com; "List for Dynare developers" dev@dynare.org
Sent: Monday, June 22, 2009 7:39 PM
Subject: Re: Quasi triangular matrices in Kalman filters
Andrea and George,
1) Please write a standalone test/timing of the QT code so that we can profile it using standard tools.
2) Compare with a call to dgemv() using Lapack + optimized Blas (possibly from a Matlab distribution).
3) Upload the code to SVN, so we can test it on other machines.
The huge variability between the 2 runs reported by George may be due to Windows; it is usually less pronounced under Linux.
All the best,
Michel
G. Perendia wrote:
Dear Andrea
Thanks for the new libraries.
I ran some initial performance tests today for the simple T*a matrix*vector multiplication with 2 different QT matrix sizes. In summary, this is what I am getting:
10000-iteration loop with a 100x100 random QT matrix (from qz decomposition) and a vector.
Times below are given for the 1st & 2nd run (after restart).
Native Matlab matrix multiplication in a loop:
Ta_time = 0.3010 & 0.6610
Calling dgemv() using Sylv Vector and General Matrix is faster than the Matlab loop:
GMcppTaInnrLoop_time = 0.1600 & 0.3310
Calling the QT f90 library using Sylv Vector and General Matrix:
QTcppTaInnerLoop_time = 8.5730 & 20.5300
Calling the QT f90 library without Sylv Vector and General Matrix, using only pure C/C++ double arrays, is only marginally faster:
QTcpp_noSylv_TaInnerLoop_time = 8.4420
1000-iteration loop with a 10x10 random QT matrix and vector:
For a 10x10 matrix, calling the QT f90 library takes about twice as long as the Matlab loop, and dgemv is still the fastest.
Matlab: 0.0400
GMcppTaInnrLoop_time = 0.0300
QTcppTaInnerLoop_time = 0.0800
It is, however, possible that the MinGW f95 I am using is not the best optimising compiler available and/or that the tests for PTP', which I am planning to do next, may turn out better.
What are your thoughts? Do you think that we may be able to improve the performance of this multiplication somehow?
I wonder whether making many cross-language calls is rather detrimental, and whether we may improve performance by reducing this high level of modularisation and calling, e.g. by using a higher-level subroutine that performs all operations within f90 and passes back only the final Ta?
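To illustrate the idea of one fused higher-level routine, here is a minimal C++ sketch of a T*a product that exploits the quasi-triangular structure in a single call. It assumes T is upper quasi-triangular (every entry below the first subdiagonal is zero) and stored column-major as in Fortran; qt_matvec is a hypothetical name and this is not the code of the QT f90 library:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch (not the QT f90 library): y = T*a for an n x n
// upper quasi-triangular T in column-major storage. Entries below the
// first subdiagonal are assumed to be zero and are never read.
std::vector<double> qt_matvec(const std::vector<double>& t,
                              const std::vector<double>& a,
                              std::size_t n) {
    assert(t.size() == n * n && a.size() == n);
    std::vector<double> y(n, 0.0);
    for (std::size_t j = 0; j < n; ++j) {
        // Column j can be non-zero only in rows 0 .. j+1
        // (row j+1 holds the subdiagonal entry).
        const std::size_t last = std::min(j + 1, n - 1);
        for (std::size_t i = 0; i <= last; ++i) {
            y[i] += t[j * n + i] * a[j];
        }
    }
    return y;
}
```

The whole product is done in one call with one result vector, so the caller crosses the language boundary once per multiplication rather than once per low-level operation.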
NOTES:
After a restart, Matlab appears to be much slower than later on!
Also, the Matlab multiplication reports both the real and the imaginary part of the result, which appears complex, but the real part matches the QT and dgemv outputs.
Best regards
George artilogica@btconnect.com
----- Original Message -----
From: "Michel Juillard" michel.juillard@ens.fr
To: "andrea pagano" pagano.andrea@gmail.com
Cc: "G. Perendia" george@perendia.orangehome.co.uk
Sent: Friday, June 19, 2009 1:33 PM
Subject: Re: Quasi triangular matrices in Kalman filters
Thanks Andrea
amities
Michel
andrea pagano wrote:
Hi all,
I would go for subroutines. I will do it over the weekend while looking at other possibilities, such as Fortran pointers.
Best
Andrea
On Fri, Jun 19, 2009 at 10:04 AM, G. Perendia george@perendia.orangehome.co.uk wrote:
Dear Andrea
Problem:
I have encountered a problem integrating KalmanFilter with the f90 QT library: passing the QT result arrays back to C++.
The QT Fortran routines have been written in the standard Fortran FUNCTION format (i.e., not SUBROUTINE), so they return a one- or two-dimensional array (the array that carries the function's name) by value, not by reference. However, it appears that only simple, single variables (e.g. INT or REAL) can be passed from Fortran FUNCTIONs back to C++.
On the other hand, the NAG, BLAS and LAPACK routines have all been written as Fortran SUBROUTINEs and can be integrated with C more easily: they receive their parameters and return their results through the variables passed, by reference, as calling parameters.
For example, dgemv.f from the BLAS library gets Y by reference and returns the modified Y through that same calling parameter.
SUBROUTINE DGEMV(TRANS,M,N,ALPHA,A,LDA,X,INCX,BETA,Y,INCY)
....
- Y - DOUBLE PRECISION array of DIMENSION at least
...
Before entry .... the incremented array Y
must contain the vector y. On exit, Y is overwritten by the
updated vector y.
....
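The SUBROUTINE convention described above can be sketched on the C++ side as a function that takes every argument by address and writes its result into the caller's array. This is a minimal illustration only: gemv_sketch is a hypothetical stand-in, not the BLAS routine, and it assumes a column-major matrix, no transposition and unit strides:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in (not the BLAS routine): y := alpha*A*x + beta*y
// for a column-major m x n matrix A with leading dimension lda.
// As with a Fortran SUBROUTINE, every argument is passed by address and
// the result overwrites the caller's y; nothing is returned by value.
void gemv_sketch(const int* m, const int* n, const double* alpha,
                 const double* a, const int* lda, const double* x,
                 const double* beta, double* y) {
    const int rows = *m, cols = *n, ld = *lda;
    assert(rows >= 0 && cols >= 0 && ld >= rows);
    // Scale the incoming y first; beta == 0 means y is not read at all.
    for (int i = 0; i < rows; ++i) {
        y[i] = (*beta == 0.0) ? 0.0 : *beta * y[i];
    }
    // Accumulate one column of A at a time (contiguous in memory).
    for (int j = 0; j < cols; ++j) {
        const double scaled = *alpha * x[j];
        const double* column = a + static_cast<std::ptrdiff_t>(j) * ld;
        for (int i = 0; i < rows; ++i) {
            y[i] += scaled * column[i];
        }
    }
}
```

Because the output lives in memory owned by the caller, no array has to be returned across the language boundary, which is the property the QT FUNCTIONs lack.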
Poss. Solutions:
I could not find any references on how to get an array back to C as the return value of a Fortran FUNCTION. Do you, or anyone around you, know how to do it, if it is possible at all? In any case, passing arrays by value is not recommended because it is rather uneconomical, especially for larger matrices.
One alternative I can think of is the less explored option of returning a Fortran pointer to the resulting array from the QT functions, instead of the array by value. I think I can work that out, but suggestions are more than welcome.
I can see a few options:
a) rewrite the QT library as SUBROUTINEs instead of FUNCTION routines, or,
b) try to use Fortran pointers and, if we can, then also rewrite the QT library to return pointers, or,
c) write 3 or more high-level f90 shell SUBROUTINEs that call the existing, unmodified QT functions and perform all the operations needed to construct the resulting Ta and TPT' (for both cases 1 and 2), instead of doing the low-level QT manipulation in C++.
This way the QT library need not be changed, and the new SUBROUTINEs will also act as the interface with C++. I think this is the most productive and optimal alternative of the three, since those combination utilities would have to be written anyway, and it seems easier to do that now in Fortran than in C++.
If you like and/or are busy, I can develop the Ta and the first case of the TPT' SUBROUTINEs by Monday, whilst the second case may need more thinking and a more granular approach to take advantage of multiple processors.
Please let me know your thoughts on this issue, and whether you have time to make the needed changes or additions in the f90 files.
Best regards
George artilogica@btconnect.com
----- Original Message -----
From: "andrea pagano" pagano.andrea@gmail.com
To: george@perendia.orangehome.co.uk
Cc: "Michel Juillard" Michel.Juillard@ens.fr
Sent: Monday, June 01, 2009 7:47 PM
Subject: Quasi triangular matrices in Kalman filters
> Hi all
>
> I am sending you a set of Fortran routines to calculate the matrix
> expressions in the Kalman filter, together with some explanations.
>
> Hope they can be a starting point in optimizing the overall computation
>
> Best
>
> Andrea
>
> --
> Andrea Pagano
> via Veratti VARESE
> tel. +3903321691261
> cell.+393403804397