[Hippo-cms7-user] Indexing hippo:resource item in Hippo CMS 2.16.02 takes a long time

Jeroen Reijn j.reijn at onehippo.com
Wed Dec 1 14:30:59 CET 2010


We've already implemented this in our project. I'll try to see if I can find
some time to donate this patch.

Jeroen

On Wed, Dec 1, 2010 at 2:18 PM, Frank van Lankvelt
<f.vanlankvelt at onehippo.com> wrote:

> >>>>> this functionality is not broken in the CMS; it was never
> >>>>> implemented.
> >>>>> Would be nice though...
> >>>> Hmmm, that is really a bummer because the repository has all the bits
> >>>> and bytes for it. We have customers who are using this technique
> >>>> already. How come it is not yet part of the CMS? It is just adding one
> >>>> utility class which would pretty much just do what I wrote in a
> >>>> repository unit test for text extraction. I was really under the
> >>>> impression that this would be added as the default to the CMS. I hope
> >>>> we can pick it up and make it work this way.
> >>> By the way, I also vaguely recall that the last thing we discussed
> >>> about it was that it might have to be part of the DerivedData Engine,
> >>> to make sure it would work out of the box for any plugin, even when
> >>> uploading PDFs from the HST, for example. I still think it is a pity
> >>> we did not take this final step (yet).
> >>>
> >> yes, extracting the text in a derived data function would be nice, but
> >> I'm afraid it wouldn't work for updates. Say, you upload a newer
> >> version of the same PDF.  The upload logic should still be aware of
> >> the hippo:text property, if only to remove it.
> >>
> >> To do this completely in the repository, a checksum for the pdf should
> >> be stored next to the extracted text.  Then, when the derived data
> >> function is triggered, it can re-calculate the checksum and do the
> >> extraction again when it differs.  Checksumming is a lot cheaper than
> >> extracting.
> >>
> >> Also, I'm not sure if it is possible at this moment to access a binary
> >> property in a derived data function, but supporting that should be
> >> possible.
> > Do you think we could then, as an alternative, at least provide this
> > functionality for the default cms plugins that do upload binaries such
> > as a pdf? Then, later on, see if we can fix it in Derived Data Engine?
> > I really think it would be a pity to not leverage the very good
> > improvement we can have with this hippo:text property, just because
> > the 'work everywhere' solution needs possibly quite some time to
> > implement. For the short run, I would opt for simply adding it to the
> > cms plugins, but, of course, I am on thin ice here as I am not
> > familiar with the code base
> >
> yeah, this shouldn't be that hard.  I think that's also how some
> projects already implemented this, so perhaps they can step up and
> donate a patch?
>
> cheers, Frank
>
> > Regards Ard
> >
> >>
> >> cheers, Frank
> >>
> >>
> >>
> >>> Regards Ard
> >>>
> >>>>
> >>>> Regards Ard
> >>>>
> >>>>>
> >>>>> cheers, Frank
> >>>>>
> >>>>>
> >>>>> On Wed, Dec 1, 2010 at 1:06 PM, Ard Schrijvers
> >>>>> <a.schrijvers at onehippo.com> wrote:
> >>>>>> As you can see here [1], the cnd already contains this property in
> >>>>>> your version. The idea is that when uploading the pdf, this property
> >>>>>> gets the extracted text. This ensures extraction won't be needed
> >>>>>> again (particularly important in a clustered setup). Now, if this
> >>>>>> property does not get set (it is not done by the repo, but every
> >>>>>> plugin that uploads binaries should do this), this can mean that:
> >>>>>>
> >>>>>> 1) It is broken in the cms
> >>>>>> 2) You are using a custom plugin that does not do the extraction and
> >>>>>> setting of the hippo:text binary property (the logic needed to do
> >>>>>> this is really trivial)
> >>>>>>
> >>>>>> Hope this gives you enough pointers to help figure out why you do
> >>>>>> not have that property
> >>>>>>
> >>>>>> Regards Ard
> >>>>>>
> >>>>>> [1] https://svn.hippocms.org/repos/hippo/hippo-cms7/archive/tags/Tag-HREPTWO-v2_16_02/repository/engine/src/main/resources/hippo.cnd
> >>>>>>
> >>>>>> [hippo:resource]
> >>>>>> - jcr:encoding (string)
> >>>>>> - jcr:mimeType (string) mandatory
> >>>>>> - jcr:data (binary) primary mandatory
> >>>>>> - jcr:lastModified (date) mandatory ignore
> >>>>>> - hippo:text (binary)
> >>>>>>
> >>>>>> On Wed, Dec 1, 2010 at 12:31 PM, Mathijs Brand
> >>>>>> <m.brand at onehippo.com> wrote:
> >>>>>>> Hi Ard,
> >>>>>>>
> >>>>>>> On Wed, Dec 1, 2010 at 11:51 AM, Ard Schrijvers
> >>>>>>> <a.schrijvers at onehippo.com> wrote:
> >>>>>>>> What actually takes a long time is the text extraction of a pdf.
> >>>>>>>> However, it should only happen once: we fixed this for the hippo
> >>>>>>>> repository by having an extra property on the hippo:resource,
> >>>>>>>> namely hippo:text, which is a binary.
> >>>>>>>
> >>>>>>> Great, you've already fixed this and thanks for the quick reply :)
> >>>>>>>
> >>>>>>>> Can you confirm this property is on your
> >>>>>>>> resources?
> >>>>>>>>
> >>>>>>>
> >>>>>>> I don't see the hippo:text property on the hippo:resource document.
> >>>>>>>
> >>>>>>> I see:
> >>>>>>> - jcr:mimeType
> >>>>>>> - jcr:lastModified
> >>>>>>> - jcr:encoding
> >>>>>>>
> >>>>>>> Kind regards,
> >>>>>>> Mathijs
> >>>>>>> _______________________________________________
> >>>>>>> Hippo-cms7-user mailing list and forums
> >>>>>>> http://www.onehippo.org/cms7/support/forums.html
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Hippo
> >>>>>> Europe  •  Amsterdam  Oosteinde 11  •  1017 WT Amsterdam  •
> >>>>>> +31 (0)20 522 4466
> >>>>>> Hippo USA Inc  • 755 Baywood Drive  • Second Floor  • Petaluma, CA
> >>>>>> 94954 USA  • Phone +1 (707) 658-4535
> >>>>>> Canada    •   Montréal  5369 Boulevard St-Laurent  •  Montréal QC
> >>>>>> H2T 1S5  •  +1 (514) 316 8966
> >>>>>> www.onehippo.com  •  www.onehippo.org  •  info at onehippo.com
>
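
For readers following the thread, the checksum idea from Frank's quoted
message above (store a checksum next to the extracted text and re-extract
only when the binary changes) might be sketched roughly as follows. This is
a minimal illustration under stated assumptions, not the patch mentioned at
the top of this message and not Hippo code: a plain Map stands in for the
hippo:resource node, the property name hippo:textChecksum is hypothetical
(only jcr:data and hippo:text appear in the cnd quoted above), SHA-256 is an
arbitrary choice of checksum, and the extractor function is a placeholder
for whatever PDF text extraction a project already uses.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.function.Function;

/** Sketch only: a Map stands in for the properties of a hippo:resource node. */
final class TextExtractionCache {
    static final String DATA = "jcr:data";               // from the quoted cnd
    static final String TEXT = "hippo:text";             // from the quoted cnd
    static final String CHECKSUM = "hippo:textChecksum"; // hypothetical name

    /** Hex-encoded SHA-256 digest of the given bytes. */
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Re-runs the (expensive) extractor only when the stored checksum is
     * missing or no longer matches the binary. Returns true if it extracted.
     */
    static boolean refreshText(Map<String, byte[]> resource,
                               Function<byte[], String> extractor) {
        byte[] data = resource.get(DATA);
        if (data == null) {
            throw new IllegalArgumentException("resource has no " + DATA);
        }
        byte[] current = sha256Hex(data).getBytes(StandardCharsets.US_ASCII);
        byte[] stored = resource.get(CHECKSUM);
        boolean upToDate = stored != null
                && resource.containsKey(TEXT)
                && MessageDigest.isEqual(stored, current);
        if (upToDate) {
            return false; // same binary as last time: skip extraction
        }
        String text = extractor.apply(data);
        resource.put(TEXT, text.getBytes(StandardCharsets.UTF_8));
        resource.put(CHECKSUM, current);
        return true;
    }
}
```

Calling refreshText twice on an unchanged resource extracts only once;
replacing jcr:data with a newer version of the same PDF makes the checksums
differ, so the next call extracts again and overwrites hippo:text.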



-- 
Hippo
----------------------------------------------------------------------------------------------
Europe  •  Amsterdam  Oosteinde 11  •  1017 WT Amsterdam  •  +31 (0)20 522
4466
USA  • 755 Baywood Drive Second Floor  •  Petaluma CA. 94954
•  +1 (707) 658-4535
Canada    •   Montréal  5369 Boulevard St-Laurent #430 •  Montréal QC
H2T 1S5  •  +1 (514) 316 8966
----------------------------------------------------------------------------------------------
www.onehippo.com  •  www.onehippo.org  •  info at onehippo.com
----------------------------------------------------------------------------------------------