[Hippo-cms7-user] PDF indexing error while using JCR API

Jeroen Hoffman j.hoffman at onehippo.com
Thu Apr 12 14:47:34 CEST 2012


Hi,

We have a content migration application in its test phase and are experiencing a 
Lucene indexing problem with about 20% of the PDFs that are imported.

The application is standalone and talks plain JCR over RMI to the repository. It 
includes the Hippo-patched Jackrabbit jars as well. We have made no changes to 
the Lucene configuration.

The strange thing is that the indexing error (a PDFBox NullPointerException 
while getting the font's encoding) does not occur when the same PDF is uploaded 
into the CMS manually. We use very similar JCR code to what the CMS uses.

Any clues on how that can happen? It is the jcr:data property that is indexed, 
right?

(See code snippet and stack trace below)

TIA,
Jeroen


Setting jcr:data with JCR API:

   File file = new File("path/to/pdf");
   InputStream input = new FileInputStream(file);
   try {
      node.setProperty("jcr:data",
            node.getSession().getValueFactory().createBinary(input));
   } finally {
      IOUtils.closeQuietly(input);
   }
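
A check that might help narrow this down: compare a digest of the file on disk 
with a digest of the bytes read back from jcr:data after the import. If they 
differ, the binary changed somewhere between the client and the repository; if 
they match, the cause is more likely elsewhere. The sketch below is only an 
illustration. The class name StreamDigest and the method sha256Hex are 
hypothetical helpers, not part of JCR, Jackrabbit or Hippo, and the in-memory 
streams in main stand in for the real file and repository streams.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of any JCR or Hippo API.
final class StreamDigest {

    private StreamDigest() {
    }

    /** Reads the stream to its end and returns its SHA-256 digest as lowercase hex. */
    static String sha256Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            // Mask to 0..255 so negative bytes still print as two hex digits.
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins: in the migration tool these would be the FileInputStream
        // of the source PDF and the stream read back from jcr:data.
        byte[] sample = "%PDF-1.4 sample bytes".getBytes(StandardCharsets.US_ASCII);
        String source = sha256Hex(new ByteArrayInputStream(sample));
        String stored = sha256Hex(new ByteArrayInputStream(sample));
        System.out.println("digests match: " + source.equals(stored));
    }
}
```

In the import code, the second stream could be obtained with 
node.getProperty("jcr:data").getBinary().getStream() once the session has been 
saved. That part is not exercised by the sketch above and both streams should 
be closed after use.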


Indexing stack trace:

11.04.2012 12:32:09 WARN [org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run():180] Failed to extract text from a binary property
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@7d427d23
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
   at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:192)
   at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
   at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.NullPointerException
   at org.apache.pdfbox.pdmodel.font.PDFont.getEncodingFromFont(PDFont.java:832)
   at org.apache.pdfbox.pdmodel.font.PDFont.determineEncoding(PDFont.java:293)
   at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:178)
   at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:79)
   at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:139)
   at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:109)
   at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:76)
   at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:243)
   at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
   at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:441)
   at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:365)
   at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:321)
   at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:241)
   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:90)
   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   ... 11 more


