Thanks George,
very interesting. Please proceed with simplifying the KF code.
All the best,
Michel
G. Perendia wrote:
Dear Michel
I already have a stand-alone exe test that I used last week (and uploaded today too), and I ran it through gprof earlier today (though only after wasting some time last week trying to use a reportedly sophisticated profiler, CodeAnalyst from AMD, which so far I could not get to work at all).
The other profiling tool I used (and reported on) last week, "Very Sleepy", is an "external" profiler: it does not get into the code itself, but it can be attached to either the stand-alone exe test or the DLL thread running under Matlab. For both, it reported a lot of time (~40%) spent in the ztrsv solver and, in one snapshot, also pointed to a lot of time in dtrsv and the GeneralMatrix copy, at 50% each. In contrast, gprof is a more internal, higher-resolution profiler: it puts the heaviest load (>10%) on housekeeping functions but does not even mention calls to external BLAS library functions such as dtrsv and ztrsv.
Both profiling tools, however, seem to confirm what my early code inspection also concluded: very heavy use of the (not very productive) GeneralMatrix copy constructor. For example, the C++ Kalman filter stores copies of several of the main time-varying system matrices (F, P and an intermediate L) at each step of the time-series evaluation, and also creates copies of the inputs T, H and Z at each step, as if they might be time-varying too, although that would not be the case for a simple, non-diffuse KF without missing observations.
This results in a high percentage of time spent in the GeneralMatrix copy() function (which the copy constructor calls explicitly), as reported by both profiling programs: Very Sleepy gives it up to 50% at one snapshot point, while gprof ranks it first, with 11% on its own or 27.3% of total time including its children.
The copy() function is followed by utility functions: the two varieties (const and non-const) of the Vector indexing operator[], and the const and non-const varieties of GeneralMatrix::get(), which use the Vector operator[] and are themselves called directly from, among others, the heavily used GeneralMatrix copy function.
According to gprof, this heavy burden of the copy constructor and related functions is only then followed by the productive functions: PLUFact::multInvRight, a matrix multiplication with inversion (used to invert the F matrix), the GeneralMatrix constructors, and GeneralMatrix::gemm(), a general matrix multiplication (itself calling BLAS dgemm), with 4.7%, 3.1% and 2.6% of total time respectively.
NOTE, however, that gprof paints a somewhat different picture and does not even mention external BLAS functions such as the dtrsv and ztrsv solvers that the "external" Very Sleepy profiler reported as heavy users.
All in all, it appears from both profiler reports and my initial inspection that, for a start (and as I initially intended and suggested), we should refactor the current heavy use of the unproductive GeneralMatrix copy constructor and its reliance on the element-by-element get() function before we get into any further performance improvements of the productive functions and external libraries.
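To make the cost concrete, here is a minimal, self-contained sketch (a hypothetical Mat class with a copy counter, not the actual GeneralMatrix) of the pattern the profilers flagged: copying the time-invariant inputs on every step accumulates thousands of copy-constructor calls over a time series, while passing const references copies nothing.

```cpp
#include <cassert>
#include <vector>

// Illustrative stand-in only (not the real GeneralMatrix): a dense matrix
// whose copy constructor increments a global counter for the demo.
struct Mat {
    static int copies;                 // counts copy-constructor calls
    std::vector<double> data;
    explicit Mat(int n) : data(n * n, 0.0) {}
    Mat(const Mat& o) : data(o.data) { ++copies; }
};
int Mat::copies = 0;

// Pattern flagged by the profilers: copy T, H, Z on every step, even
// though they do not change in the simple non-diffuse filter.
void step_with_copies(const Mat& T, const Mat& H, const Mat& Z) {
    Mat t(T), h(H), z(Z);              // three copies per observation
    (void)t; (void)h; (void)z;
}

// Refactored pattern: operate on const references, copying nothing.
void step_with_refs(const Mat& T, const Mat& H, const Mat& Z) {
    (void)T; (void)H; (void)Z;         // read-only use, zero copies
}
```

Over 100 observations the first pattern performs 300 matrix copies; the second performs none.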
Best regards
George
----- Original Message ----- From: "Michel Juillard" michel.juillard@ens.fr To: "List for Dynare developers" dev@dynare.org Sent: Saturday, June 06, 2009 3:08 PM Subject: Re: [DynareDev] Kalman Filter
There are tools to do profiling in C++. All we need is a standalone executable calling the filter. Don't lose time adding timing functions inside the code. It may be difficult to do profiling in Windows. In that case, just prepare the code and we will do the profiling in Linux.
Best
Michel
G. Perendia wrote:
Dear Michel
Yes, as agreed initially
these are the Matlab Dynare KF measures, mainly to show the proportion of inversion vs. pure update in the Matlab KF. I have not yet done fine profiling for C++, so there is not much to upload either.
- I agree..
Best regards
George
----- Original Message ----- From: "Michel Juillard" michel.juillard@ens.fr To: "List for Dynare developers" dev@dynare.org Sent: Saturday, June 06, 2009 2:10 PM Subject: Re: [DynareDev] Kalman Filter
Thanks George
One of the first things that we need to establish is whether identical basic matrix operations take much longer in the C++ implementation than in Matlab and, if that is the case, why.
- Indeed; and as a significant part of the overall work of updating P, one needs to invert the updated F too:
100000 loops of the small-model KF 4x4 F matrix inversion: iF = inv(F);
Fmx_inv_time = 2.2530
100000 loops of the corresponding core KF 8x8 P matrix update: P1 = T*(P-K*P(mf,:))*transpose(T)+QQ;
Pupdt_time = 3.4450
(and also, 100000 loops of the preceding K = P(:,mf)*iF; Kupdt_time = 0.5910)
How do these operations compare with Matlab on your machine?
The convergence of P exploited in the Matlab Dynare KFs (after which no further update of P and K, nor inversion of F, is required) can greatly improve the performance of the KF.
e.g.: running the Matlab Dynare KF with a 57x57 system matrix in a loop of 1000:
1000 of the usual: matlabKF_time = 337.1650
and then using P recursively in the loop with a modified kalman_filter.m which also returns P (thereby utilising P convergence and avoiding its update for most of the remaining 999 loops):
1000 of the recursive: Matlab_rec_KF_time = 11.7060
- And, although the convergence of P in the Matlab KF did not take place for the large sw_euro model, whose total run time was much closer to the C++ KF's (see 3 below), today's check shows that convergence did take place very early (at step t=3!) in the Matlab KF running the small model I initially tested. Judging from the above results, it therefore contributed greatly to the much faster KF loops we experienced running the Matlab KF versus the C++ KF in the initial tests with the same small model (the C++ KF does not yet take advantage of convergence, and the comparative results were even)!
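The convergence trick described above can be sketched as follows, using a scalar stand-in for the P update (all values illustrative, not Dynare code): once successive P updates agree to within a tolerance, the expensive recomputation of P, K and inv(F) is skipped for all remaining observations.

```cpp
#include <cassert>
#include <cmath>

// Scalar stand-in for the Riccati-type update P1 = T*(P-K*P(mf,:))*T'+QQ,
// to illustrate convergence monitoring: returns how many "expensive"
// P updates were actually computed over nobs observations.
int count_p_updates(int nobs, double tol) {
    const double a = 0.5, q = 1.0;     // illustrative system coefficients
    double P = 1.0;
    bool converged = false;
    int updates = 0;
    for (int t = 0; t < nobs; ++t) {
        // ... per-observation work (v, F, likelihood) would go here ...
        if (converged) continue;        // reuse frozen P, K, inv(F)
        double P1 = a * P * a + q;      // the "expensive" covariance update
        ++updates;
        if (std::fabs(P1 - P) < tol) converged = true;
        P = P1;
    }
    return updates;
}
```

For a long sample the update count stays small and fixed once P has converged, which is why the recursive Matlab run above is so much faster than the naive loop.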
OK, let's forget the first comparison on the small model, because C++ and Matlab didn't use the same algorithm (no convergence monitoring in C++). Matlab is still faster by 45% on the medium-size model. We should focus on explaining this difference, and we don't need to bring in convergence monitoring of the filter for this particular example.
Could you please upload on SVN the code that you use for profiling?
Best
Michel
Best regards
George
Dev mailing list Dev@dynare.org http://www.dynare.org/cgi-bin/mailman/listinfo/dev
Dear Michel
After the first cut of refactoring we have mixed results: running the small model there is about a 19% performance improvement and an approx. 30% reduction in use of the copy constructor, as expected, though no significant performance improvement could yet be measured on the larger models (euro_sw3.mod). I am looking into other possible causes for that lack of improvement.
Best regards
George
----- Original Message ----- From: "Michel Juillard" michel.juillard@ens.fr To: "G. Perendia" george@perendia.orangehome.co.uk; "List for Dynare developers" dev@dynare.org Sent: Sunday, June 07, 2009 9:01 PM Subject: Re: [DynareDev] Kalman Filter
Thanks George,
could you compare the execution time between C++ and Matlab subtask by subtask (almost line by line as far as Matlab is concerned)?
Best
Michel
Dear Michel
1) With the 2nd cut of refactoring we achieved another substantial performance improvement, about 25-30% for the basic KF over all models, i.e. the small model (both DLL and exe) and the larger sw_euro_3, using either the inner-loop DLL or calling the DLL in the loop. The times for the larger model are now similar to, if not marginally better than, those for the Matlab Dynare KF loops (95.6 sec for the new C++ versus 97.5 for the Matlab KF loop), while those for the small model are now significantly better.
The main change in the 2nd cut was overloading the member-by-member GeneralMatrix copy() (used by the copy constructor too) with a memcpy() version in the Dynare++ sylv/cc/GeneralMatrix.h and .cpp files. Together with Vector.h/.cpp and only three other utility headers from that directory, these are also used by the C++ KF. I added that small subset of sylv files needed for the KF to the new sylv/cc subdirectory of mex/sources/kalman (see NOTE (*) below).
Note also, however, that the same performance improvement may possibly be applied in the main Dynare++ sylv, as well as in the (similar) mex/sources/gensylv versions of those files!
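A minimal sketch of this change, on a simplified stand-in class rather than the real sylv/cc/GeneralMatrix: for dense, contiguous storage, one memcpy over the whole buffer replaces the two indexing calls per element of the member-by-member version, with identical results.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical, simplified stand-in for GeneralMatrix: dense column-major
// storage in one contiguous buffer.
struct Matrix {
    int rows, cols;
    std::vector<double> data;
    Matrix(int r, int c) : rows(r), cols(c), data(r * c, 0.0) {}
    double get(int i, int j) const { return data[j * rows + i]; }
    void set(int i, int j, double v) { data[j * rows + i] = v; }

    // Old style: element-by-element through get()/set(), two indexing
    // calls per element (the pattern the profilers flagged).
    void copy_elementwise(const Matrix& src) {
        for (int j = 0; j < cols; ++j)
            for (int i = 0; i < rows; ++i)
                set(i, j, src.get(i, j));
    }
    // New style: one flat memcpy over the contiguous buffer.
    void copy_flat(const Matrix& src) {
        std::memcpy(data.data(), src.data.data(),
                    data.size() * sizeof(double));
    }
};
```

Note that the flat copy is only valid when source and destination have identical, contiguous layouts; a submatrix view would still need the loop (or a per-column memcpy).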
2) I will start devising a method to compare subtask execution times as you suggested, but that may be a bit tricky.
However, I would first like to try a few more things that can improve the performance of the existing C++ code without yet implementing major changes such as adding Andrea's quasi-triangular matrix multiplication library.
As can be seen from the enclosed profile taken from running the optimised executable with the inner loop, the top five CPU-time-consuming sub-tasks are now the productive gemm, the matrix constructor (still) and the matrix inverter, besides the main KalmanTask::filterNonDiffuse and the _Unwind_SjLj_Register exception handler, two items about which little can be done.
_______________ NOTE (*): There are a few small differences between the gensylv directory in the Dynare mex sources and sylv in Dynare++, and those differences (e.g. GeneralMatrix.isZero() missing from gensylv, etc.) still prevent successful compilation of the Kalman filter against gensylv. I would therefore need either to modify gensylv (and I am afraid of breaking it) or to keep a copy of the small required subset of the Dynare++ sylv directory specifically associated with the Kalman filter. I suggest the latter, as a few more changes may be needed for the KF, and a merge could then possibly be performed at a later stage. ______________
Best regards
George
----- Original Message ----- From: "Michel Juillard" michel.juillard@ens.fr To: "List for Dynare developers" dev@dynare.org Sent: Wednesday, June 10, 2009 8:13 AM Subject: Re: [DynareDev] Kalman Filter
Thanks George,
could you compare the execution time between C++ and Matlab subtask by subtask (almost line by line as far as Matlab is concerned)?
Best
Michel
G. Perendia wrote:
Dear Michel
After the first cut of refactoring, we have mixed results: there is
about
19% performance improvement and approx 30% reduction in use of copy constructor running small model as expected, though no significant performance improvement could be measured on the larger models (euro_sw3.mod) yet - I am looking into the other possible causes for
that
lack of improvement.
Best regards
George
----- Original Message ----- From: "Michel Juillard" michel.juillard@ens.fr To: "G. Perendia" george@perendia.orangehome.co.uk; "List for Dynare developers" dev@dynare.org Sent: Sunday, June 07, 2009 9:01 PM Subject: Re: [DynareDev] Kalman Filter
Thanks George,
very interesting. Please proceed with simplifying the KF code.
All the best,
Michel
G. Perendia wrote:
Dear Michel
I already have a stand-alone exe test I used last week (and uploaded
it
today too) and I run it through gprof earlier today (though, after
wasting
some time last week trying to use a reportedly sophisticated
profiler -
CodeAnalyst from AMD - which, so far, I could not make to work at
all).
The other profiling tool I initially used (and reported) last week
("Very
Sleepy") is an "external", not getting into the code details itself
but
could be run attached externally to either the stand-alone exe test or
the
Matlab running DLL thread too, and that reported for both spending a
lot
(~40%) time in ztrsv solver but, at one snapshot, also pointed to a
lot
of
time spent in dtrsv and GeneralMatrix copy - 50% each. On the
contrast,
gprof is more internal, higher resolution profiler and it puts the
load
weight (>10%) on housekeeping functions but does not even mention
calls
to
external library BLAS functions such as dtrsv and ztrsv.
The both profiling tools, however, seem to confirm what my early code inspection concluded too: a very high, use of (not very productive) General Matrix copy constructor -(e.g. the C++ kalman filter stores
copy
of
few of the main time variant system matrices: F, P and an
intermediate
L
for each step of the time series evaluation and also creates a copy of
the
input T, H and Z at each step as if they may be time invariant too
although
this would not be the case for a simple, no-diffuse KF without missing observations).
This then resulted in high % of time in GeneralMatrix copy() function, (which is called by the copy-constructor explicitly) - i.e. as
reported
by
the both profiling programs: Sleepy: up to 50% at one snapshot point
whilst
gprof gives it the 1st rank with 11% on its own or 27.3% of total with
its
children.
The copy() function is followed by the utility functions such as two varieties (const and non-const) of Vector indexing [] operator and by
the
const and variable varieties of GeneralMatrix::get() elements that
utilise
the previous Vector indexing[] operator and are themselves directly
called
from the heavily used GenaralMatrix copy function among the rest.
According to gprof, the above high burden of copy constructor and the related functions are only then followed by the productive functions
such as
PLUFact::multInvRight matrix multiplication with inversion (used for inversion of the F matrix), the GeneralMatrix constructors and the GeneralMatrix::gemm() - a general matrix multiplication(itself
calling
BLAS dgemm) with 4.7, 3.1 and 2.6 % of total time respectively
NOTE however that gprof profiler paints to an extent different picture
and
does not even mention external BLASS functions such as dtrsv and ztrsv solvers reported as heavy users by the VerySleepy "external" profiler.
All in all, it appears form the both profiler reports and my initial inspection that , for start (and as I initially already intended and suggested to), we should refactor the current heavy use of the
un-productive
General Matrix copy constructor and its current reliance on element-by-element get() function before we get into any further
performance
improvements of the productive functions and external libraries.
Best regards
George
----- Original Message ----- From: "Michel Juillard" michel.juillard@ens.fr To: "List for Dynare developers" dev@dynare.org Sent: Saturday, June 06, 2009 3:08 PM Subject: Re: [DynareDev] Kalman Filter
There are tools to do profiling in C++. All we need is an standalone executable calling the filter. Don't loose time adding timing
function
inside the code. It may be difficult to do profiling in Windows. In
that
case, just prepare the code and we will do the profiling in Linux.
Best
Michel
G. Perendia wrote:
Dear Michel
Yes, as agreed initially
these are the Matlab Dynare KF measures, mainly to show the
proportion
of inversion vs. pure update in Matlab KF. I have not yet done fine profiling for C++, so, not much to upload
either.
- I agree..
Best regards
George
----- Original Message ----- From: "Michel Juillard" michel.juillard@ens.fr To: "List for Dynare developers" dev@dynare.org Sent: Saturday, June 06, 2009 2:10 PM Subject: Re: [DynareDev] Kalman Filter
> Thanks George
>
> One of the first things that we need to establish is whether identical basic matrix operations take much longer in the C++ implementation than in Matlab and, if that is the case, why.
>
>> 2) Indeed, and as a significant part of the overall task of updating P, one needs to invert the updated F too:
>>
>> 100000 loops of the small model KF 4x4 F matrix inversion: iF = inv(F);
>> Fmx_inv_time = 2.2530
>>
>> 100000 loops of the corresponding core KF 8x8 P matrix update: P1 = T*(P-K*P(mf,:))*transpose(T)+QQ;
>> Pupdt_time = 3.4450
>>
>> (and also, 100000 loops of the preceding K = P(:,mf)*iF;
>> Kupdt_time = 0.5910)
>
> How do these operations compare with Matlab on your machine?
>
>> The convergence of P exploited in the Matlab Dynare KFs (which does not require further updates of P and K or inversion of F) can greatly improve the performance of the KF.
>>
>> e.g. running the Matlab Dynare KF with a 57x57 system matrix in a 1000 loop:
>> 1000 of usual matlabKF_time = 337.1650
>>
>> and then using P recursively in the loop with a modified kalman_filter.m which returns P too (therefore utilising P convergence and avoiding its update for most of the remaining 999 loops):
>> 1000 of recursive: Matlab_rec_KF_time = 11.7060
>>
>> 3) And although the convergence of P in the Matlab KF did not take place for the large sw_euro model, whose total run time was much closer to the C++ KF's (see 3 below), as today's check shows, the convergence did take place very early in the Matlab KF running the small model I initially tested (at step t=3!). So it certainly did affect and, judging from the above results, greatly contribute to the very much faster KF loops we experienced running the Matlab KF versus the C++ KF in the initial tests with the same small model (the C++ KF does not yet take advantage of convergence, and before that the comparative results were even)!
>
> OK, we forget the first comparison on the small model, because C++ and Matlab didn't use the same algorithm (no convergence monitoring in C++). Matlab is still faster by 45% on the medium size model. We should focus on explaining this difference and we don't need to bring in monitoring the convergence of the filter for this particular example.
>
> Could you please upload to SVN the code that you use for profiling?
>
> Best
>
> Michel
>
>> Best regards
>>
>> George
_______________________________________________ Dev mailing list Dev@dynare.org http://www.dynare.org/cgi-bin/mailman/listinfo/dev
Thanks George,
I understand that for sw_euro_3 the times of C++ and Matlab are about the same. For the small model, is C++ "significantly better" than Matlab, or than before?
It seems a good time to try Andrea's code, but we need to be able to carefully measure its contribution. So it is necessary to time the operations that this code performs in the standard and in the improved implementation.
In the GPROF output, I'm surprised at the number -- and therefore time consumption -- of calls to matrix constructors and destructors. It looks as if matrices were constructed inside the filter loop. It would seem more efficient to allocate the necessary space once and use it over and over again. I suspect that it has to do with the very high modularity of the current implementation and that we will need to rewrite it basically from scratch in a more integrated manner.
Best,
Michel
G. Perendia wrote:
Dear Michel
- With the 2nd cut of refactoring we achieved another substantial performance improvement, about 25-30% for the basic KF over all models, i.e. the small model (DLL and exe) and the larger sw_euro_3, using either the inner-loop DLL or calling the DLL in the loop. The times for the larger model are now similar to, if not marginally better than, those for the Matlab Dynare KF loops (95.6 sec for the new C++ compared to 97.5 for the Matlab KF loop), whilst those for the small model are now significantly better.
The main change made in the 2nd cut was replacing the member-by-member GeneralMatrix copy() (used by the copy constructor too) with a memcpy() version in the Dynare++ sylv/cc/GeneralMatrix.h and .cpp files. Together with Vector.h/.cpp and only 3 other (utility) headers from that directory, they are also used by the C++ KF. I added the small subset of sylv files needed for the KF to the new sylv/cc subdirectory of mex/sources/kalman (see NOTE (*) below).
Note also, however, that the same performance improvement may possibly be applied in the main Dynare++ sylv as well as to the (similar) mex/sources/gensylv versions of those files.
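The change described above — swapping an element-by-element copy() for a memcpy() over contiguous storage — can be sketched roughly as below. The Matrix class here is a hypothetical minimal stand-in, not the actual GeneralMatrix interface from sylv/cc:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Minimal stand-in for a dense column-major matrix class (hypothetical;
// the real GeneralMatrix in sylv/cc has a richer interface).
struct Matrix {
    int rows, cols;
    std::vector<double> data;  // one contiguous column-major buffer
    Matrix(int r, int c) : rows(r), cols(c), data(r * c, 0.0) {}

    // Element-by-element copy, as in the original copy():
    void copySlow(const Matrix &src) {
        for (int j = 0; j < cols; ++j)
            for (int i = 0; i < rows; ++i)
                data[j * rows + i] = src.data[j * src.rows + i];
    }

    // memcpy version: valid because both matrices use one contiguous
    // buffer with identical layout and dimensions.
    void copyFast(const Matrix &src) {
        assert(rows == src.rows && cols == src.cols);
        std::memcpy(data.data(), src.data.data(),
                    data.size() * sizeof(double));
    }
};
```

Note that the single-memcpy shortcut only applies to whole-matrix copies of contiguous storage; a submatrix copy with a leading dimension larger than the row count would still need one memcpy per column.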
- I will start devising a method to compare subtask execution times as you
suggested but that may be a bit tricky.
However, I would first like to try a few more things that can be done to improve the performance of the existing C++ code — that is, without implementing major changes at this stage, such as adding Andrea's quasi-triangular matrix multiplication library.
As can be seen from the enclosed profile file, taken from running the optimised executable with the inner loop, the top 5 CPU-time-consuming sub-tasks are now the productive gemm, the matrix constructor (still) and the matrix inverter (besides the main KalmanTask::filterNonDiffuse and the _Unwind_SjLj_Register exception handler, about which little can be done).
NOTE: (*) There are a few small differences between the gensylv directory in the Dynare mex and sylv in Dynare++, and those differences (e.g. GeneralMatrix::isZero() missing in gensylv, etc.) still prevent successful compilation of the Kalman filter using gensylv. I would therefore need either to modify gensylv (and I am afraid of breaking it) or to keep a copy of the small required subset of the Dynare++ sylv directory specially associated with the Kalman filter. I would suggest the latter, as a few more changes may be needed for the KF and a merge could then possibly be performed at a later stage.
Best regards
George
Dear Michel
1) Re: Small model performance:
a) Recently running again 10,000 loops of the Matlab KF on the same small, fast-converging model, but this time with convergence monitoring turned off (*), I got the following results for matlabKF_time:
1st run: 277.6900, 2nd run: 283.9890, 3rd (today): 316.3650**
which is much higher than the matlabKF_time (normal, with convergence monitoring working) measured initially at around 48.9600.
b) Initially, calling Kalman_filters.DLL in the 10,000 loop (with preparation of the H and Z matrices in each loop): total_dll_time = 202.7320, and a rerun 161.7530, which is actually faster than Matlab without convergence monitoring!
Running the same tests with Kalman_filters.DLL called in the 10,000 loop after the two stages of refactoring, on 11th June: total_dll_times = 128.0240 and 117.9300.
(*) The P convergence and other shortcuts were switched off by setting the kalman and riccati tolerances (or just riccati) to -1. (**) Matlab execution times vary greatly from run to run.
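The convergence shortcut being switched off here (and still missing from the C++ KF) can be sketched as follows; the function name and the way the tolerance is threaded through are illustrative, not Dynare's actual API:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the P-convergence shortcut used by the Matlab KF:
// once successive P matrices differ by less than riccati_tol, stop updating
// P, K and inv(F) and reuse the converged values for the remaining steps.
// A negative riccati_tol disables the shortcut, as in the tests above.
static double maxAbsDiff(const std::vector<double> &a,
                         const std::vector<double> &b) {
    double m = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        m = std::max(m, std::fabs(a[i] - b[i]));
    return m;
}

// Inside the filter loop (schematically):
//   updateP(P1, P, ...);                      // P1 = T*(P-K*P(mf,:))*T' + QQ
//   if (riccati_tol >= 0 && maxAbsDiff(P1, P) < riccati_tol)
//       converged = true;                     // skip P/K/inv(F) from now on
//   P.swap(P1);
```

The small-model numbers above (48.96 with the shortcut vs ~280-316 without) show why this check matters: once P has converged, each remaining step avoids the P update, the K update and the inversion of F entirely.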
2) OK, I will start integrating Andrea's library now.
3) Re GPROF output: You are right; a few functions used inside the C++ KF loop work as copy constructors, e.g. A = B*C+D copy-constructs A too, whilst on occasion a matrix is constructed inside the KF loop first (e.g. F = H, as F(H)) before it is used as a host (and target) of a complex embedded operation. Some of these could be the subject of the next stage of refactoring, which is what I initially thought of doing next (i.e. before the integration).
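The direction both messages point to — preallocating workspace and folding A = B*C+D into an in-place update so that no matrix is constructed inside the loop — can be sketched roughly as below; multAndAdd is a hypothetical helper over raw column-major buffers, not an existing sylv routine:

```cpp
#include <vector>

// Hypothetical sketch: accumulate D += B*C in place into preallocated
// storage, instead of A = B*C+D, which copy-constructs a fresh matrix.
// Column-major layout; B is m x k, C is k x n, D is m x n; no aliasing
// between d and b/c is assumed.
void multAndAdd(std::vector<double> &d, const std::vector<double> &b,
                const std::vector<double> &c, int m, int k, int n) {
    for (int j = 0; j < n; ++j)
        for (int l = 0; l < k; ++l) {
            const double clj = c[j * k + l];
            for (int i = 0; i < m; ++i)
                d[j * m + i] += b[l * m + i] * clj;
        }
}

// In the filter, workspace would be allocated once before the loop and
// reused, e.g. (schematically):
//   std::vector<double> F(p * p), L(n * n);   // sized once
//   for (int t = 0; t < T; ++t) {
//       // ... fill F, L in place; no constructors inside the loop
//   }
```

In practice the inner product would be delegated to BLAS dgemm with beta = 1 rather than hand-rolled loops; the point of the sketch is only that the destination is preallocated and updated in place.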
Best regards
George
----- Original Message ----- From: "Michel Juillard" michel.juillard@ens.fr To: "List for Dynare developers" dev@dynare.org Sent: Thursday, June 11, 2009 8:33 PM Subject: Re: [DynareDev] Kalman Filter
Dear George,
concerning:
1) From now on, you should only look at the tests where the computations are exactly the same in Matlab and C++. It becomes hard to make sense of all the tests. At some point, we should run the tests on a Linux machine, which gives more consistent time measures than Windows.
2) Perfect, thanks.
3) The objective here is to remove every call to a constructor within the main loop of the Kalman filter. You are right that this may change how you integrate Andrea's code, so you may need to analyze the removal of constructors first. I'm afraid that actual implementation of the changes would push back testing of Andrea's code too far.
All the best,
Michel
Dear Michel
re 1) The Matlab Dynare KF tests comparable to the C++ KF are those with convergence monitoring turned off, since the C++ KF does not have convergence monitoring built in yet.
re 2) & 3)
I started looking into 2) and, on the way, made a small change replacing one of the copy constructors (F) with the improved assignment operator. Though the performance impact was negligible, the profile (enclosed) shows much less time spent in GeneralMatrix constructors (i.e. replaced by the copy() used by the assignment).
The profile also points to gemm as the main CPU time consumer in total time (16.4%), though not per call: it had 8 times more calls than the inversion multInvRight, which accounts for 5.8% in total and thus on average takes nearly 3 times longer per call than gemm.
There is, however, still a very large number (though proportionally not very time-consuming, 1.78% in total) of copy constructions used for recasting from GeneralMatrix to ConstGeneralMatrix, done to utilise member functions defined for the latter but not the former. Those could be optimised by overloading the member functions for GeneralMatrix so as to reduce the number of recasts required.
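One way to eliminate such recasting copies — sketched below with hypothetical class names, not the actual sylv/cc definitions — is to make the const variant a non-owning view, so that converting a mutable matrix costs a pointer and two integers rather than a deep copy:

```cpp
#include <vector>

// Sketch: ConstView is a non-owning, read-only view over matrix storage,
// so "recasting" a mutable matrix is just wrapping a pointer, not copying
// the data. Column-major layout.
struct ConstView {
    const double *data;
    int rows, cols;
    double get(int i, int j) const { return data[j * rows + i]; }
};

struct Mat {
    int rows, cols;
    std::vector<double> store;
    Mat(int r, int c) : rows(r), cols(c), store(r * c, 0.0) {}
    // Cheap implicit conversion replacing the copy-constructing recast:
    operator ConstView() const { return ConstView{store.data(), rows, cols}; }
};

// A function written once against ConstView now accepts both variants
// without any allocation:
double trace(ConstView m) {
    double s = 0.0;
    for (int i = 0; i < m.rows && i < m.cols; ++i)
        s += m.get(i, i);
    return s;
}
```

The alternative mentioned in the message — duplicating the member functions on GeneralMatrix itself — achieves the same end; the view approach just avoids maintaining two copies of each function body.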
I will thus now continue to analyse the best way to improve gemm by integrating Andrea's library, though it seems that I should first re-optimise at least the most relevant multiplications used for a and P, as I would need to use those for the f90 code too.
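For reference, the gain from a quasi-triangular multiply presumably comes from the transition matrix T being in real Schur form (zero below the first sub-diagonal, with 1x1 and 2x2 diagonal blocks), so T*P can skip the structurally zero part. A hedged sketch under that assumption, with column-major storage; this is not Andrea's actual library code:

```cpp
#include <vector>

// Sketch of a quasi-upper-triangular times dense product R = T*P.
// T (n x n) is in real Schur form: zero below the first sub-diagonal, so
// column l of T has nonzeros only in rows 0..l+1. Skipping the zero rows
// cuts the flops to roughly half of a full gemm for large n.
// P is n x m; R is overwritten with the n x m result.
void quasiTriMult(std::vector<double> &r, const std::vector<double> &t,
                  const std::vector<double> &p, int n, int m) {
    r.assign(static_cast<std::size_t>(n) * m, 0.0);
    for (int j = 0; j < m; ++j)
        for (int l = 0; l < n; ++l) {
            const double plj = p[j * n + l];
            const int last = (l + 1 < n) ? l + 1 : n - 1;  // sub-diagonal row
            for (int i = 0; i <= last; ++i)                // rows 0..l+1 only
                r[j * n + i] += t[l * n + i] * plj;
        }
}
```

In the P update P1 = T*(P-K*P(mf,:))*T' + QQ, both the left multiply by T and the right multiply by T' can exploit this structure, which is where such a library would pay off over plain dgemm.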
Best regards
George
----- Original Message ----- From: "Michel Juillard" michel.juillard@ens.fr To: "List for Dynare developers" dev@dynare.org Sent: Friday, June 12, 2009 5:46 PM Subject: Re: [DynareDev] Kalman Filter
Thanks for the update. Let me know how it works out.
Best
Michel
Dear Michel (et al)
Is there a plan to include (control) instruments in Dynare models in addition to exogenous shocks, but handled separately? And if so, what is the state of that development?
Best regards
George
I think that you mean deterministic exogenous variables. The feature exists but needs to be tested again. In what context do you want to apply it?
Best
Michel
Thanks Michel
Instruments are needed for the QL project by Joe Pearlman and Paul Levine, and Joe asked me yesterday to investigate whether there is anything like that already in Dynare.
E.g. it is needed to define one or more variables appearing in the model as control instruments (e.g. the interest rate) without their having their own equation - much like shocks, but they will be treated differently by the QL extension.
It sounds like deterministic exogenous variables may be what they need, and I will have a look at that over the weekend. Can one give a time series for them?
Best regards
George
----- Original Message ----- From: "Michel Juillard" michel.juillard@ens.fr To: "List for Dynare developers" dev@dynare.org Sent: Thursday, June 18, 2009 8:55 PM Subject: Re: [DynareDev] Instruments
I think that you mean deterministic exogenous variables. They exist but need to be tested again. In what context do you want to apply them?
Best
Michel
G. Perendia wrote:
Dear Michel (et al)
Is there a plan to include (control) instruments in Dynare models in addition to exogenous shocks - but handled separately - and if so, what is the state of that development?
Best regards
George
Hi George,
no, exogenous deterministic variables are something else. We need to define a new keyword for this. I think that we should discuss the whole interface needed by Joe in one block.
All the best,
Michel
G. Perendia wrote:
Thanks Michel
Instruments are needed for the QL project by Joe Pearlman and Paul Levine, and Joe asked me yesterday to investigate whether there is anything like that already in Dynare.
E.g. it is needed to define one or more variables appearing in the model as control instruments (e.g. the interest rate) without their having their own equation - much like shocks, but they will be treated differently by the QL extension.
It sounds like deterministic exogenous variables may be what they need, and I will have a look at that over the weekend. Can one give a time series for them?
Best regards
George
----- Original Message ----- From: "Michel Juillard" michel.juillard@ens.fr To: "List for Dynare developers" dev@dynare.org Sent: Thursday, June 18, 2009 8:55 PM Subject: Re: [DynareDev] Instruments
I think that you mean deterministic exogenous variables. They exist but need to be tested again. In what context do you want to apply them?
Best
Michel
G. Perendia wrote:
Dear Michel (et al)
Is there a plan to include (control) instruments in Dynare models in addition to exogenous shocks - but handled separately - and if so, what is the state of that development?
Best regards
George
Dear Andrea and Michel (et al)
I have encountered an unexpected problem integrating KalmanFilter with the f90 QT library - passing the QT result arrays back to C++.
The QT Fortran routines have been written in the standard Fortran FUNCTION format (i.e., not SUBROUTINE), so that they return a one- or two-dimensional array (named after the function) by value (not by reference), and, as it appears, only simple scalar variables (e.g. INTEGER or REAL) can be passed from Fortran FUNCTIONs back to C++.
On the other hand, the NAG, BLAS and LAPACK routines have all been written as Fortran SUBROUTINEs, and a subroutine is the equivalent of a C function returning void.
SUBROUTINEs can be integrated with C more easily, as they receive parameters and return their results through the variables passed by reference as calling parameters.
For example, dgemv.f from the BLAS library gets Y by reference and returns the modified Y through the same calling-parameter reference:
SUBROUTINE DGEMV(TRANS,M,N,ALPHA,A,LDA,X,INCX,BETA,Y,INCY)
....
*  Y - DOUBLE PRECISION array of DIMENSION at least ...
*  Before entry .... the incremented array Y
*  must contain the vector y. On exit, Y is overwritten by the
*  updated vector y.
....
I could not find any reference on how to get an array back to C as the return value of a Fortran FUNCTION - does anyone know how to do it, or whether it is possible at all?
One way I can think of is the less explored option of returning a Fortran pointer to the resulting array from the QT functions instead of the array by value, and I think I can work that out.
However, even if there is another way to pass function result arrays back to C++, I expect it is bound to be less efficient than passing them by reference (or via a Fortran pointer), especially for larger matrices. And if there is no efficient alternative, I can either easily rewrite the QT library as SUBROUTINEs instead of FUNCTIONs, or try to use Fortran pointers and, if that works, rewrite the QT library to return pointers.
Best regards
George
Hi George,
Andrea isn't on the dev@dynare.org. Please take contact with him directly (with cc to me) and discuss with him who should make the changes that you recommend.
All the best,
Michel
G. Perendia wrote:
Dear Andrea and Michel (et al)
I have encountered an unexpected problem integrating KalmanFilter with the f90 QT library - passing the QT result arrays back to C++.
The QT Fortran routines have been written in the standard Fortran FUNCTION format (i.e., not SUBROUTINE), so that they return a one- or two-dimensional array (named after the function) by value (not by reference), and, as it appears, only simple scalar variables (e.g. INTEGER or REAL) can be passed from Fortran FUNCTIONs back to C++.
On the other hand, the NAG, BLAS and LAPACK routines have all been written as Fortran SUBROUTINEs, and a subroutine is the equivalent of a C function returning void.
SUBROUTINEs can be integrated with C more easily, as they receive parameters and return their results through the variables passed by reference as calling parameters.
For example, dgemv.f from the BLAS library gets Y by reference and returns the modified Y through the same calling-parameter reference:
SUBROUTINE DGEMV(TRANS,M,N,ALPHA,A,LDA,X,INCX,BETA,Y,INCY)
....
*  Y - DOUBLE PRECISION array of DIMENSION at least ...
*  Before entry .... the incremented array Y
*  must contain the vector y. On exit, Y is overwritten by the
*  updated vector y.
....
I could not find any reference on how to get an array back to C as the return value of a Fortran FUNCTION - does anyone know how to do it, or whether it is possible at all?
One way I can think of is the less explored option of returning a Fortran pointer to the resulting array from the QT functions instead of the array by value, and I think I can work that out.
However, even if there is another way to pass function result arrays back to C++, I expect it is bound to be less efficient than passing them by reference (or via a Fortran pointer), especially for larger matrices. And if there is no efficient alternative, I can either easily rewrite the QT library as SUBROUTINEs instead of FUNCTIONs, or try to use Fortran pointers and, if that works, rewrite the QT library to return pointers.
Best regards
George
Hi George,
the Matlab code needs to be changed to make T quasi-diagonal before we can call Andrea's routines in the C++ code. We need to prepare the matrices as we do currently for the diffuse case (except Pinf). I'm in a rush and can't explain better. I'm off email for the day.
I will look into it tonight or tomorrow.
Best
Michel
G. Perendia wrote:
Dear Michel
- With the 2nd cut of refactoring we achieved another substantial performance improvement, about 25-30% for the basic KF over all models, i.e. the small (dll and exe) and the larger sw_euro_3 using either the inner-loop dll or calling the dll in the loop. The times for the larger model are now similar to, if not marginally better than, those for the Matlab Dynare KF loops (i.e. 95.6 sec for the new C++ compared to 97.5 for the Matlab KF loop), whilst those for the small model are now significantly better.
The main change made in the 2nd cut was overloading the member-by-member GeneralMatrix copy() (used by the constructor too) with a memcpy() version in the Dynare++ sylv/cc/GeneralMatrix.h and .cpp files. Together with Vector.h/.cpp and only 3 other (utility) headers from that directory, these are also used by the C++ KF. I added that small subset of sylv files needed for the KF to the new sylv/cc subdirectory of mex/sources/kalman (see NOTE (*) below).
Note also, however, that the same performance improvement may possibly be applied in the main Dynare++ sylv as well as to the (similar) mex/sources/gensylv versions of those files too!
- I will start devising a method to compare subtask execution times as you suggested, but that may be a bit tricky.
However, I would first like to try a few more things that can be done to improve the performance of the existing C++ code - that is, without major changes being implemented at this stage yet, such as e.g. adding Andrea's quasi-triangular matrix multiplication library.
As can be seen from the enclosed profile file taken from running the optimised executable with the inner loop, the top 5 CPU time "spending" sub-tasks now are the productive gemm, the matrix constructor (still) and the matrix inverter (i.e., besides the main KalmanTask::filterNonDiffuse and the _Unwind_SjLj_Register exception controller, two items about which little can be done).
NOTE: (*) There are a few small differences between the gensylv directory in the Dynare mex sources and the sylv directory in Dynare++, and those differences (e.g. the missing GeneralMatrix::isZero() in gensylv, etc.) still prevent successful compilation of the Kalman filter against gensylv. I would therefore need either to modify gensylv (and I am afraid of breaking it) or to keep a copy of the small required subset of the Dynare++ sylv directory specially associated with the Kalman filter. I would suggest the latter, as a few more changes may be needed for the KF and a merge may then possibly be performed at a later stage. ______________
Best regards
George
----- Original Message ----- From: "Michel Juillard" michel.juillard@ens.fr To: "List for Dynare developers" dev@dynare.org Sent: Wednesday, June 10, 2009 8:13 AM Subject: Re: [DynareDev] Kalman Filter
Thanks George,
could you compare the execution time between C++ and Matlab subtask by subtask (almost line by line as far as Matlab is concerned)?
Best
Michel
G. Perendia wrote:
Dear Michel
After the first cut of refactoring, we have mixed results: there is about a 19% performance improvement and approx. 30% reduction in the use of the copy constructor running the small model, as expected, though no significant performance improvement could be measured on the larger models (euro_sw3.mod) yet - I am looking into the other possible causes for that lack of improvement.
Best regards
George
----- Original Message ----- From: "Michel Juillard" michel.juillard@ens.fr To: "G. Perendia" george@perendia.orangehome.co.uk; "List for Dynare developers" dev@dynare.org Sent: Sunday, June 07, 2009 9:01 PM Subject: Re: [DynareDev] Kalman Filter
Thanks George,
very interesting. Please proceed with simplifying the KF code.
All the best,
Michel
G. Perendia wrote:
Dear Michel
I already have a stand-alone exe test I used last week (and uploaded today too), and I ran it through gprof earlier today (though, after wasting some time last week trying to use a reportedly sophisticated profiler - CodeAnalyst from AMD - which, so far, I could not make work at all).
The other profiling tool I initially used (and reported on) last week ("Very Sleepy") is an "external" one, not getting into the code details itself, but it can be attached externally to either the stand-alone exe test or the Matlab thread running the DLL. It reported both spending a lot (~40%) of time in the ztrsv solver and, at one snapshot, also pointed to a lot of time spent in dtrsv and GeneralMatrix copy - 50% each. In contrast, gprof is a more internal, higher resolution profiler; it puts the load weight (>10%) on housekeeping functions but does not even mention calls to external library BLAS functions such as dtrsv and ztrsv.
Both profiling tools, however, seem to confirm what my early code inspection concluded too: a very high use of the (not very productive) GeneralMatrix copy constructor (e.g. the C++ Kalman filter stores copies of a few of the main time-variant system matrices - F, P and an intermediate L - for each step of the time series evaluation, and also creates a copy of the input T, H and Z at each step as if they might be time-variant too, although this would not be the case for a simple, non-diffuse KF without missing observations).
This then resulted in a high % of time in the GeneralMatrix copy() function (which is called by the copy constructor explicitly), as reported by both profiling programs: Sleepy gives it up to 50% at one snapshot point, whilst gprof gives it the 1st rank with 11% on its own, or 27.3% of total with its children.
The copy() function is followed by utility functions such as the two varieties (const and non-const) of the Vector indexing [] operator, and by the const and non-const varieties of GeneralMatrix::get(), which utilise that Vector indexing [] operator and are themselves directly called from the heavily used GeneralMatrix copy function, among the rest.
According to gprof, the above high burden of the copy constructor and related functions is only then followed by the productive functions such as PLUFact::multInvRight - matrix multiplication with inversion (used for inversion of the F matrix) - the GeneralMatrix constructors, and GeneralMatrix::gemm() - a general matrix multiplication (itself calling BLAS dgemm) - with 4.7, 3.1 and 2.6% of total time respectively.
NOTE, however, that the gprof profiler paints a somewhat different picture and does not even mention external BLAS functions such as the dtrsv and ztrsv solvers reported as heavy users by the Very Sleepy "external" profiler.
All in all, it appears from both profiler reports and my initial inspection that, for a start (and as I initially intended and suggested), we should refactor the current heavy use of the unproductive GeneralMatrix copy constructor and its current reliance on the element-by-element get() function before we get into any further performance improvements of the productive functions and external libraries.
----- Original Message ----- From: "Michel Juillard" michel.juillard@ens.fr To: "List for Dynare developers" dev@dynare.org Sent: Saturday, June 06, 2009 3:08 PM Subject: Re: [DynareDev] Kalman Filter
There are tools to do profiling in C++. All we need is a standalone executable calling the filter. Don't lose time adding timing functions inside the code. It may be difficult to do profiling in Windows. In that case, just prepare the code and we will do the profiling in Linux.
Best
Michel
G. Perendia wrote:
> Dear Michel
>
> 1) Yes, as agreed initially
>
> 2) these are the Matlab Dynare KF measures, mainly to show the proportion of inversion vs. pure update in the Matlab KF. I have not yet done fine profiling for C++, so not much to upload either.
>
> 3) I agree..
>
> Best regards
> George
>
> ----- Original Message ----- From: "Michel Juillard" michel.juillard@ens.fr To: "List for Dynare developers" dev@dynare.org Sent: Saturday, June 06, 2009 2:10 PM Subject: Re: [DynareDev] Kalman Filter
>
>> Thanks George
>>
>> One of the first things that we need to establish is whether identical basic matrix operations take much longer in the C++ implementation than in Matlab and, if that is the case, why.
>>
>>> 2) Indeed, and as a significant part of the overall parcel of updating P, one needs to invert the updated F too:
>>>
>>> 100000 loops of the small model KF 4x4 F matrix inversion: iF = inv(F);
>>> Fmx_inv_time = 2.2530
>>>
>>> 100000 loops of the corresponding core KF 8x8 P matrix update:
>>> P1 = T*(P-K*P(mf,:))*transpose(T)+QQ;
>>> Pupdt_time = 3.4450
>>>
>>> (and also, 100000 loops of the preceding K = P(:,mf)*iF;
>>> Kupdt_time = 0.5910)
>>
>> How do these operations compare with Matlab on your machine?
>>
>>> The convergence of P exploited in the Matlab Dynare KFs (which then does not require further updates of P and K or inversion of F) can greatly improve the performance of the KF.
>>>
>>> e.g.: running the Matlab Dynare KF with a 57x57 system matrix in a 1000 loop:
>>> 1000 of usual: matlabKF_time = 337.1650
>>>
>>> and then using P recursively in the loop with a modified kalman_filter.m which returns P too (therefore utilising P convergence and avoiding its update for most of the remaining 999 loops):
>>> 1000 of recursive: Matlab_rec_KF_time = 11.7060
>>>
>>> 3) And, although the convergence of P in the Matlab KF did not take place for the large sw_euro model, which had a total run much closer to the C++ KF, as today's check shows, the convergence did take place very early in the Matlab KF running the small model I initially tested (at step t=3!!!), so it certainly did affect and, judging from the above results, rather greatly contribute to the very much faster KF loops we experienced running the Matlab KF versus the C++ KF in the initial tests with the same small model (the C++ KF does not yet take advantage of convergence, and the comparative results were even)!!!
>>
>> OK, we forget the first comparison on the small model, because C++ and Matlab didn't use the same algorithm (no convergence monitoring in C++). Matlab is still faster by 45% on the medium size model. We should focus on explaining this difference, and we don't need to bring in monitoring the convergence of the filter for this particular example.
>>
>> Could you please upload on SVN the code that you use for profiling?
>>
>> Best
>>
>> Michel