erp5_core: do one catalog call only
Catalog seems to correctly handle the OR parameters. Do not lose the possible external `query` parameter.
-
Developer
@romain not that you initially added this from scratch, but the created query still seems slow in large instances. While in the same context Metadata tab would be fast. So I wonder if we could simplify here like:
kw['portal_type'] = context.getPortalObject().getPortalDocumentTypeList() context.Base_getRelatedObjectList(**kw)
-
Owner
But this not not the same query anymore in such case, and it will not return the same results.
Anyway, do you know if this script was also slow before my change? If no, it means we have to again split it into 2 queries, while keeping the
query
parameter support. -
Developer
But this not not the same query anymore in such case, and it will not return the same results.
But for me semantically it makes sense that
Base_getRelatedDocumentList
returns whatBase_getRelatedObjectList
would with an extra filter on portal type. I do not understand why would we need different approach here.Anyway, do you know if this script was also slow before my change?
No, but it is a good idea to check, I will do some benchmarking soon
-
Owner
Because
Base_getRelatedDocumentList
also returns the embedded documents, whichBase_getRelatedObjectList
does not.Maybe what disturb you is the script naming. But changing it seems a bit risky for compatibility reasons.
-
Developer
I will do some benchmarking soon
In my large instance I created also old version, i.e. the version of this script before you your commit (taken from history from b6f4d46d) as
Base_getRelatedDocumentListOld
. Then I did this code:portal = context.getPortalObject() person_value = portal.person_module[...] # Here a custom URL of a Person for which this query should return nothing start_time = DateTime() document_list = person_value.Base_getRelatedDocumentList() print 'Query took in sec: %s' % ((DateTime() - start_time) * 86400) print 'Number of document found %s' % len(document_list) start_time = DateTime() document_list = person_value.Base_getRelatedDocumentListOld() print 'Query took in sec: %s' % ((DateTime() - start_time) * 86400) print 'Number of document found %s' % len(document_list) start_time = DateTime() document_list = person_value.Base_getRelatedDocumentListOld(portal_type=portal.getPortalDocumentTypeList()) print 'Query took in sec: %s' % ((DateTime() - start_time) * 86400) print 'Number of document found %s' % len(document_list) return printed
This outputs (tried several times, so times in seconds here are indicative, I get similar results each time):
Query took in sec: 6.832277 Number of document found 0 Query took in sec: 0.009946 Number of document found 10 Query took in sec: 0.007806 Number of document found 0
So:
- New code does look many times slower
- Old code, returns 10 documents that are the sub-objects no matter the type. This I find wrong for a script name
Base_getRelatedDocumentList
. I checked andBase_viewDocumentList/listbox
has a TALES on portal_type:
python: [[t, t] for t in list(context.getPortalDocumentTypeList())+list(context.getPortalEmbeddedDocumentTypeList())]
so it works in this case, but for me it does seem wrong
- Old code with portal_type filtering is also faster.
Because
Base_getRelatedDocumentList
also returns the embedded documents, whichBase_getRelatedObjectList
does not.OK, so it is the sub-objects we care mostly, I see this.
Maybe what disturb you is the script naming. But changing it seems a bit risky for compatibility reasons.
Yes, but I see the worry and it was not added in your commit here so OK for this.
-
Developer
Hi.
Did you check the above? W is your opinion? I would change it on my fork (most probably put previous version) if we want to keep this version in generic
-
Owner
Can you try to do 2 queries instead of 1 in the new
Base_getRelatedDocumentList
logic? -
Developer
Something like the below maybe (plus removing
query
from the arguments of the scipt)?It is fast in my testing (though still I have testing only on Person with no related documents):
kw.pop('relative_url', None) kw.pop('follow_up_uid', None) portal = context.getPortalObject() portal_catalog = portal.portal_catalog related_document_list = portal_catalog( portal_type=portal.getPortalDocumentTypeList() if portal_type is None else portal_type, follow_up_uid=context.getUid(), **kw ) embedded_document_list = portal.portal_catalog( portal_type=portal.getPortalEmbeddedDocumentTypeList() if portal_type is None else portal_type, relative_url='%s/%%' % context.getRelativeUrl().replace('_', r'\_'), **kw ) return [x for x in related_document_list] + [x for x in embedded_document_list]
EDIT: typo
-
Owner
No. I need the
query
argument. Thequery
argument allows to filter the list of the returned documents.And we can't concatenate
related_document_list
andembedded_document_list
, because it will not respect thelimit
parameter. -
Developer
And we can't concatenate
related_document_list
andembedded_document_list
, because it will not respect thelimit
parameter.True, and sorting will not work either.
So we would need second query to use the results of first with an
or
, similar to old version but also keeping theDo not lose the possible external query parameter.
.Something like the below?
from Products.ZSQLCatalog.SQLCatalog import Query, ComplexQuery kw.pop('relative_url', None) kw.pop('follow_up_uid', None) portal = context.getPortalObject() portal_catalog = portal.portal_catalog if portal_type is None: portal_type = portal.getPortalDocumentTypeList() + portal.getPortalEmbeddedDocumentTypeList() follow_up_related_document_list = portal_catalog( portal_type=portal_type, follow_up_uid=context.getUid(), **kw) if follow_up_related_document_list: document_query = ComplexQuery( Query(relative_url='%s/%%' % context.getRelativeUrl().replace('_', r'\_')), Query(uid=[x.getUid() for x in follow_up_related_document_list]), logical_operator='or' ) else: document_query = Query(relative_url='%s/%%' % context.getRelativeUrl().replace('_', r'\_')) if query is None: query = document_query else: query = ComplexQuery( query, document_query, logical_operator='and' ) return portal_catalog(portal_type=portal_type, limit=limit, **kw)
-
Owner
Yes, it looks like what I have in mind.
Did you check the performance with this version?
-
Owner
I think that the case of documents being contained as sub-documents is very rare these days, in most cases we only have documents as follow up, so maybe we should first search for
relative_url='%s/%%' % context.getRelativeUrl().replace('_', r'\_')
and only combine the queries if this yields something.Also please check this comment wendelin!81 (comment 128200)
brain.getUid()
is a "coding crime". -
Developer
I think that the case of documents being contained as sub-documents is very rare these days, in most cases we only have documents as follow up, so maybe we should first search for
relative_url='%s/%%' % context.getRelativeUrl().replace('_', r'\_')
and only combine the queries if this yields something.If this is the case indeed
Also please check this comment wendelin!81 (comment 128200)
brain.getUid()
is a "coding crime".Good catch thanks. Then:
Query(uid=[x.uid for x in ...])
-
Developer
I think that the case of documents being contained as sub-documents is very rare these days, in most cases we only have documents as follow up, so maybe we should first search for
relative_url='%s/%%' % context.getRelativeUrl().replace('_', r'\_')
and only combine the queries if this yields something.If this is the case indeed
Mmm, strangely enough this does not work. I have a document with Embedded File so the first query:
embedded_document_list = portal_catalog( relative_url='%s/%%' % context.getRelativeUrl().replace('_', r'\_'), portal_type=portal_type, **kw )
returns a document with uid 70056
Then the second query, having no
query
in kw is contructed like:portal_catalog( portal_type=portal_type, query=ComplexQuery( Query(follow_up_uid=context.getUid()), Query(uid=[x.uid for x in embedded_document_list]), logical_operator='or' ), **kw )
stripping security_uids etc looks like:
SELECT DISTINCT catalog.path, catalog.uid FROM ( catalog AS `catalog` INNER JOIN category AS `related_follow_up_uid_1_category` ON related_follow_up_uid_1_category.base_category_uid = 1896 AND related_follow_up_uid_1_category.uid = catalog.uid ) WHERE 1 = 1 AND (`catalog`.`portal_type` IN ('Drawing', 'Discussion Thread', 'Spreadsheet', 'File', 'Image', 'PDF', 'Web Page', 'Presentation', 'Sound', 'Test Page', 'Text', 'Video', 'External Video', 'Web Illustration', 'Web Manifest', 'Web Script', 'Web Style', 'Web Table', 'Embedded File') AND (`related_follow_up_uid_1_category`.`category_uid` = 89017 OR `catalog`.`uid` = 70056));
which returns nothing. To simplify, the following:
SELECT count(*) FROM ( catalog AS `catalog` INNER JOIN category AS `related_follow_up_uid_1_category` ON related_follow_up_uid_1_category.base_category_uid = 1896 AND related_follow_up_uid_1_category.uid = catalog.uid ) WHERE `catalog`.`uid` = 70056;
gives 0
While the uid exists, i.e.:
SELECT count(*) FROM catalog where uid = 70056;
gives 1.
So it is the join on follow_up that breaks it?
If I have the first query on follow_up and second on sub-documents, I do not get this issue
-
Owner
In this case, we should maybe use
left_join_list=['follow_up_uid']
-
Developer
BTW, I find that current version we have in master also has the issue I noted above, Embedded Documents are not found.
In this case, we should maybe use
left_join_list=['follow_up_uid']
You mean like:
if embedded_document_list: document_query = ComplexQuery( Query(follow_up_uid=context.getUid()), Query(uid=[x.uid for x in embedded_document_list]), logical_operator='or' ) kw['left_join_list']=['follow_up_uid']
because this seems to return too many documents
-
Developer
because this seems to return too many documents
Bah, sorry I was getting the length of the prited query...
So it works like above, but it is slower under circumstances. I have:
- Base_getRelatedDocumentListV2 : Search based on follow_up first
- Base_getRelatedDocumentListV1 : Search based on relative_url first
- Base_getRelatedDocumentList : current master
And my code is:
person_value = portal.person_module[...] start_time = DateTime() document_list = person_value.Base_getRelatedDocumentListV2() print 'Query took in sec: %s' % ((DateTime() - start_time) * 86400) print 'Number of document found %s' % len(document_list) start_time = DateTime() document_list = person_value.Base_getRelatedDocumentListV1() print 'Query took in sec: %s' % ((DateTime() - start_time) * 86400) print 'Number of document found %s' % len(document_list) start_time = DateTime() document_list = person_value.Base_getRelatedDocumentList() print 'Query took in sec: %s' % ((DateTime() - start_time) * 86400) print 'Number of document found %s' % len(document_list)
Ran on a Person with 1 Embedded File:
Query took in sec: 0.014952 Number of document found 1 Query took in sec: 8.285919 Number of document found 1 Query took in sec: 7.622406 Number of document found 0
Ran on a Person with 1 follow_up File:
Query took in sec: 0.023403 Number of document found 1 Query took in sec: 0.009755 Number of document found 1 Query took in sec: 7.64729 Number of document found 1
So
Base_getRelatedDocumentListV1
seems slow when we have Embedded File and we do the complet query.Base_getRelatedDocumentListV2
seems more robust in both cases. -
Owner
Ah yes, with left join it is much slower. I also tried something like
document_query = ComplexQuery( Query(follow_up_uid=context.getUid()), ComplexQuery( Query(uid=[x.uid for x in embedded_document_list]), Query(follow_up_uid=None), logical_operator='and', ), logical_operator='or', ) kw['left_join_list'] = ['follow_up_uid']
but it's not better, so checking first for
embedded_document_list
was not a good idea and checking first forfollow_up_related_document_list
is better. Thanks for trying ! -
mentioned in merge request !1373 (merged)