Find text in pdf "sections" of page

Posted by: aweber on 4 August 2024, 10:22 am EST

  • Posted 4 August 2024, 10:22 am EST

    I have some marketing PDF pages to analyze. Many of them separate sections/parts of the page using different background colors (rectangles or other drawn shapes to change the background of the area of the page to make the text there distinct).

    As a basic example, imagine this forum page were a PDF: there is a “section”/block on the right side in a shade of green for “Need extra support?” and another beneath it in off-white for “Forum Channels”. How can I find those blocks/rectangle borders in the PDF page, and then find the text within those borders?

    Is there a way to find the distinct background shapes (maybe borders) and therefore use something like the textmap to determine what text is in each “section” (shape)?

    Sorry, I don’t have an example I can share at this time due to its proprietary nature, but I am happy to explain my issue further if I was not clear.

  • Posted 5 August 2024, 4:02 pm EST

    Maybe we can take this in smaller steps…

    How would I use GC.Pdf to find all the Graphics (in this case Rectangles) on an existing page? I see a lot of methods to draw/create graphics. I do not see a lot of examples on how to enumerate existing graphics to inspect/modify them.

  • Posted 6 August 2024, 6:12 am EST

    Hi,

    We have created a small sample that fetches annotations based on searched text.

    Please refer to the attached sample: Sample.zip

    If you have another requirement, then please let us know. We will try to achieve that.

    Regards,

    Nitin

  • Posted 6 August 2024, 7:53 am EST

    These parts of the PDF page are not annotations. They are probably background/layer graphics behind the text. I cannot determine how they are drawn behind the paragraphs, because I cannot find methods in the GC API that report what they are and where they are.
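
    For context on what such background graphics usually look like inside the file: in a PDF content stream, a rectangle is added to the current path with the `re` operator (`x y w h re`) and painted with a fill operator such as `f`, using the fill color set earlier by `rg` or `g`. The following is only a minimal Python sketch of scanning an already-decompressed content stream for filled rectangles. It does not use GcPdf (whether GcPdf exposes this information is not established in this thread), the function name `filled_rectangles` is hypothetical, and it deliberately ignores coordinate transforms (`cm`), CMYK and other color spaces, form XObjects, and non-rectangular paths. It also assumes tokens are whitespace-separated, which real files do not guarantee.

```python
# Hypothetical helper: not part of GcPdf or any other library.
FILL_OPS = {"f", "F", "f*", "B", "B*", "b", "b*"}   # operators that fill the current path
END_PATH_OPS = {"S", "s", "n"}                      # operators that end the path without filling


def filled_rectangles(content):
    """Return a list of (x, y, w, h, rgb) for rectangles that were filled.

    `content` is a decompressed content stream as text. Coordinates are
    reported as written in the stream; any `cm` transform is ignored.
    """
    operands = []            # numeric operands seen since the last operator
    fill = (0.0, 0.0, 0.0)   # the default fill color in PDF is black
    saved = []               # fill colors saved by q, restored by Q
    pending = []             # rectangles in the current, not yet painted, path
    results = []
    for tok in content.split():
        try:
            operands.append(float(tok))
            continue
        except ValueError:
            pass             # not a number, so treat the token as an operator
        if tok == "rg" and len(operands) >= 3:
            fill = tuple(operands[-3:])
        elif tok == "g" and len(operands) >= 1:
            fill = (operands[-1],) * 3
        elif tok == "q":
            saved.append(fill)
        elif tok == "Q" and saved:
            fill = saved.pop()
        elif tok == "re" and len(operands) >= 4:
            pending.append(tuple(operands[-4:]))
        elif tok in FILL_OPS:
            results.extend(rect + (fill,) for rect in pending)
            pending = []
        elif tok in END_PATH_OPS:
            pending = []
        operands = []
    return results


# Made-up content stream: one green filled box, some text, one stroked-only box.
sample = (
    "q 0.8 1 0.8 rg 300 500 260 200 re f Q "
    "BT /F1 12 Tf 310 660 Td (Once diagnosed,) Tj ET "
    "10 10 50 50 re S"
)
rects = filled_rectangles(sample)
```

    Only the first rectangle is reported, because the second one is stroked rather than filled. A production version would need a real PDF tokenizer and would have to track the transformation matrix to get page coordinates.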

  • Posted 7 August 2024, 5:34 am EST

    Hi,

    Is that part a rectangle or some other shape drawn with graphics? Could you please confirm whether the shape is like the one in this demo: https://developer.mescius.com/document-solutions/dot-net-pdf-api/demos/features/graphics/round-rectangle/pdf-cs

    We tried to get these shapes by matching the text they contain, but we cannot access them. We are discussing this with the development team and will get back to you once we have any updates from them. [Internal tracking ID: DOC-6429]

    Regards,

    Nitin

  • Posted 7 August 2024, 8:26 am EST

    The example I am attaching is not exact; it is a public PDF I randomly found on the internet. How would I determine all of the text included in the “box” that starts with “Once diagnosed,” for example? Or the box that starts with the heading “Treatment Challenges”?

    These are the “sections” - identified by different background colors (graphics) - whose text I would like to identify as separate paragraphs.
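
    The grouping step described here can be sketched independently of any PDF library. The following is a minimal Python sketch, assuming the background rectangles and the bounding boxes of the text fragments have already been extracted by some means (this thread does not establish how to obtain either from GcPdf). The `Box` type, the `group_text_by_section` function, and all coordinates are hypothetical and for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in page coordinates (hypothetical type)."""
    left: float
    bottom: float
    right: float
    top: float

    @property
    def area(self):
        return (self.right - self.left) * (self.top - self.bottom)

    @property
    def center(self):
        return ((self.left + self.right) / 2, (self.bottom + self.top) / 2)

    def contains_point(self, x, y):
        return self.left <= x <= self.right and self.bottom <= y <= self.top


def group_text_by_section(sections, fragments):
    """Map each section index to the text fragments whose center lies inside it.

    `sections` is a list of Box; `fragments` is a list of (text, Box).
    When boxes nest, the smallest containing box wins. Fragments outside
    every section are collected under the key None.
    """
    grouped = {i: [] for i in range(len(sections))}
    grouped[None] = []
    for text, box in fragments:
        cx, cy = box.center
        hits = [i for i, s in enumerate(sections) if s.contains_point(cx, cy)]
        key = min(hits, key=lambda i: sections[i].area) if hits else None
        grouped[key].append(text)
    return grouped


# Made-up coordinates; only the quoted phrases come from the example PDF.
sections = [Box(300, 500, 560, 700), Box(300, 250, 560, 480)]
fragments = [
    ("Once diagnosed,", Box(310, 660, 420, 675)),
    ("Treatment Challenges", Box(310, 450, 470, 468)),
    ("Text outside any box", Box(40, 600, 250, 615)),
]
grouped = group_text_by_section(sections, fragments)
```

    Testing the fragment's center rather than requiring full containment tolerates text that slightly overhangs the edge of its background shape. Whether that tolerance is appropriate depends on the actual documents.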

  • Posted 7 August 2024, 8:28 am EST

  • Posted 8 August 2024, 5:02 am EST

    Hi,

    We are still discussing this with the development team and will get back to you once we have any updates from them.

    Regards,

    Nitin
