GetText hangs for one-page PDF

Posted by: rbroadwell on 9 March 2026, 3:56 pm EST

  • Posted 9 March 2026, 3:56 pm EST

    I exported an Excel file to a one-page PDF.

    I need to extract the text from that pdf.

    I am testing text extraction with a trial version of DS.Documents.Pdf, version 9.0.3.

    The GetText method hangs for the pdf file: GcPdfDocument.Pages[0].GetText()

    I have attached the sample pdf.

    Can this be looked into?

  • Posted 10 March 2026, 12:04 am EST

    Hi Robin,

    Thank you for reporting this behavior.

    We were unable to find any PDF file attached to your message. To investigate the behavior on our side, we created a test scenario by generating an Excel file with 300 rows and 26 columns, exporting it to a single-page PDF, and then extracting the text using DsPdf v9.0.3. In our test, the GetText() operation took approximately 20-30 seconds to complete. Please refer to the sample application attached below.

    Since this seems longer than expected for such a scenario, we have escalated the matter to our development team for further investigation under the internal tracking ID DOC-7441.

    In the meantime, to accurately reproduce the exact scenario you are encountering, could you please share the PDF file that causes DsPdf to hang during text extraction? This will help us analyze the issue more precisely.

    We look forward to your response.

    Best regards,

    Chirag

    Attachment: GetTextIssue.zip

  • Posted 10 March 2026, 6:54 am EST

    Hi Robin,

    Thank you for your patience.

    After further investigation with our development team, we identified the reason why the GetText() operation appears to hang or take a long time for your scenario.

    By default, DsPdf uses the RecognitionAlgorithm.Advanced mode for text extraction. This algorithm attempts to reconstruct the logical structure of the document (for example, grouping text into paragraphs and reconstructing layout). While this approach works well for documents with clear paragraph-based structures, it can be inefficient for PDFs generated from Excel where the content is arranged in dense tabular layouts across many columns.

    In such cases, the algorithm may attempt to combine column text into paragraphs, which significantly increases the processing time.

    For Excel-like PDFs, we recommend switching the recognition algorithm to AcrobatLike, which performs a simpler extraction similar to Acrobat’s behavior and is much faster for this type of document.

    You can apply this change before calling GetText():

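    // Load reads from a Stream; keep the stream open while the document is in use.
    using var fs = System.IO.File.OpenRead("sample.pdf");
    var doc = new GcPdfDocument();
    doc.Load(fs);
    // Switch from the default Advanced algorithm before extracting text.
    doc.RecognitionAlgorithm = GrapeCity.Documents.Pdf.Recognition.RecognitionAlgorithm.AcrobatLike;
    string text = doc.Pages[0].GetText();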

    In our internal tests with a similar Excel-generated PDF, this change reduced the extraction time from around 20-30 seconds to approximately 2-3 seconds.

    You can refer to the attached code sample that uses the above code snippet and extracts the text efficiently.

    Please let us know if you still encounter any issues with this approach in your PDF file.

    Best regards,

    Chirag

    Attachment: GetTextIssue.zip
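
To check whether the algorithm switch helps on a particular file, the two modes can be timed side by side. The following is a minimal sketch, not the code from the attached sample: it assumes the DsPdf package is referenced, that Load reads from a stream, and that the PDF path is passed as the first command-line argument. The Program class and the output format are illustrative only.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using GrapeCity.Documents.Pdf;
using GrapeCity.Documents.Pdf.Recognition;

class Program
{
    static void Main(string[] args)
    {
        // Assumption: the PDF path is supplied on the command line.
        string path = args[0];

        foreach (var algorithm in new[] { RecognitionAlgorithm.Advanced, RecognitionAlgorithm.AcrobatLike })
        {
            // Reload the document for each run so both measurements start from the same state.
            using var fs = File.OpenRead(path);
            var doc = new GcPdfDocument();
            doc.Load(fs);
            doc.RecognitionAlgorithm = algorithm;

            // Time only the text extraction of the first page.
            var sw = Stopwatch.StartNew();
            string text = doc.Pages[0].GetText();
            sw.Stop();

            Console.WriteLine($"{algorithm}: {text.Length} chars in {sw.Elapsed.TotalSeconds:F1} s");
        }
    }
}
```

On a file like the one discussed in this thread, the Advanced run may take very long, so it may be worth running each mode separately.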

  • Posted 10 March 2026, 9:45 am EST

    Hello, thank you for the quick turnaround.

    I tried changing the algorithm as you advised but the call to GetText is still hanging.

    I’m not sure why the sample pdf did not upload yesterday but I am attaching it again.

    Our application has a kill switch set to 20 minutes, and rendering the PDF to text takes longer than that, so I don’t know whether it would ever finish.

    Can you try running your test with the sample.pdf that is attached?

    Thank you!

    Attachment: ExcelWorksheet_2.zip
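
A kill switch like the one described above can be approximated in-process by running the blocking call on a worker thread and giving up after a deadline. This is a minimal sketch using only the .NET standard library; ExtractionGuard and TryRun are hypothetical names, and the lambdas in Main are stand-ins for a call such as `() => doc.Pages[0].GetText()`. It does not abort the work on timeout: the thread does not show a cancellable GetText, so the call keeps running in the background.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class ExtractionGuard
{
    // Runs a blocking call on a worker thread and stops waiting after `timeout`.
    // Returns true with the result if the call finished in time, false otherwise.
    // The worker is not aborted on timeout; it continues in the background.
    public static bool TryRun<T>(Func<T> work, TimeSpan timeout, out T result)
    {
        var task = Task.Run(work);
        if (task.Wait(timeout))
        {
            result = task.Result;
            return true;
        }
        result = default;
        return false;
    }
}

class Program
{
    static void Main()
    {
        // Stand-ins for the extraction call: one fast, one slower than the deadline.
        bool fastOk = ExtractionGuard.TryRun(() => "text", TimeSpan.FromSeconds(2), out var fast);
        bool slowOk = ExtractionGuard.TryRun(
            () => { Thread.Sleep(3000); return "late"; },
            TimeSpan.FromMilliseconds(200), out var slow);

        Console.WriteLine($"fast: {fastOk}, slow: {slowOk}");
    }
}
```

Because the timed-out call is only abandoned, not stopped, a hard limit on CPU or memory would need the extraction to run in a separate process that can be terminated.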

  • Posted 10 March 2026, 11:51 am EST

    Hi,

    I re-ran my test with that sample pdf using the AcrobatLike algorithm.

    It was able to extract the text but it took about 10 minutes.

    Are there any other options that could be tweaked (or turned off) to speed up the text extraction?

    (I don’t need the text to be pretty.)

    Thanks!

  • Posted 10 March 2026, 12:30 pm EST

    Hi Robin,

    Thanks for sharing the sample PDF document causing the issue.

    We can replicate the significant lag in the text extraction using DsPdf’s GetText method. We have escalated this behavior with our development team for further investigation and will update you as soon as we get any information from their end.

    At the moment, there is no viable workaround available. If we discover any possible workaround during further investigation, we will be sure to share it with you.

    Kind Regards,

    Chirag

  • Posted 11 March 2026, 11:28 pm EST

    Hi Robin,

    The issue has been identified as a bug, and a fix is scheduled to be included in DsPdf v9.1.0. The current estimated release date for this version is April 10, 2026.

    We appreciate your patience in the meantime.

    Kind Regards,

    Chirag

  • Posted 12 March 2026, 9:23 am EST

    OK, thank you for the quick turnaround!
