Extract Text from PDF

Posted by: john.burke on 14 August 2024, 9:01 am EST

john.burke
Posted 14 August 2024, 9:01 am EST - Updated 14 August 2024, 9:06 am EST

For several .PDF documents, I am able to use an export filter to export .PDFs to .html and then parse the .html for text.

I have this one .PDF where the .html file is split such that each individual character is in its own html element, so I am unable to parse the .html efficiently.
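
When only the per-character .html export is available, one possible fallback is to drop the markup and join whatever text remains. Below is a minimal sketch that uses only .NET base-class types. The class name `HtmlTextFallback` and the list of tags treated as line ends are assumptions, since the exact markup produced by the export filter is not shown here.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; not part of ComponentOne.
static class HtmlTextFallback
{
    // Assumption: <br> and these closing block-level tags mark the end of a line.
    static readonly Regex LineBreaks = new Regex(
        @"<\s*br\s*/?>|</\s*(p|div|tr|li|h[1-6])\s*>",
        RegexOptions.IgnoreCase);

    // Any remaining tag, such as the per-character <span> wrappers.
    static readonly Regex Tags = new Regex(@"<[^>]+>");

    public static string ToPlainText(string html)
    {
        // 1. Turn line-ending tags into newlines so rows stay separated.
        string withBreaks = LineBreaks.Replace(html, "\n");
        // 2. Remove every other tag; adjacent characters then join up again.
        string noTags = Tags.Replace(withBreaks, string.Empty);
        // 3. Decode entities such as &amp; and &nbsp;.
        return WebUtility.HtmlDecode(noTags);
    }

    static void Main()
    {
        // Made-up sample markup with one character per element.
        string html =
            "<div><span>T</span><span>o</span><span>t</span><span>a</span><span>l</span></div>" +
            "<div><span>4</span><span>2</span></div>";
        Console.Write(ToPlainText(html));
    }
}
```

For the made-up sample string in `Main`, this prints `Total` and `42` on separate lines. The sketch does not skip script or style blocks, and it cannot recover reading order if the export positions elements absolutely, so it is only a stopgap for simple layouts.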

The .PDF file has text, but when I use the Xls and Rtf export filters, the content is exported as an image.

Is there any sample that shows how to extract or iterate over the text fields in a .PDF?

I have the WinForms ComponentOne v4.0.20173.282 and a later version on my development machines.

Thanks for any advice or help in advance…

John

The PDF file has a bunch of tables in it, like the one below…

[Image: screenshot of a table from the PDF]

uttkarsh.matiyal
Posted 16 August 2024, 2:23 am EST
Hello John,

You can use the C1PdfDocumentSource.GetWholeDocumentRange().GetText() method to extract text from a PDF document, as follows:

    var mc = new C1.Win.C1Document.Util.C1DXTextMeasurementContext();
    var dr = _document.GetWholeDocumentRange(mc);
    var text = dr.GetText();
    textBox1.Text = text;
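
As a rough sketch, the same calls might be wrapped into a reusable helper. The class name `PdfTextHelper`, the method name `ExtractPdfText`, and the `pdfPath` parameter are hypothetical. Loading through `C1PdfDocumentSource.LoadFromFile` is an assumption about how `_document` gets populated, and the code needs a project reference to the C1.Win.C1Document assembly. Treat it as a starting point rather than the attached sample's actual code.

```csharp
using C1.Win.C1Document;
using C1.Win.C1Document.Util;

// Hypothetical helper; not part of the C1 API.
public static class PdfTextHelper
{
    public static string ExtractPdfText(string pdfPath)
    {
        // Assumption: the document source is loaded from a file on disk.
        var document = new C1PdfDocumentSource();
        document.LoadFromFile(pdfPath);

        // Same three calls as in the snippet above:
        // create a measurement context, take the whole-document range, read its text.
        var mc = new C1DXTextMeasurementContext();
        var range = document.GetWholeDocumentRange(mc);
        return range.GetText();
    }
}
```

With such a helper in place, a form could call `textBox1.Text = PdfTextHelper.ExtractPdfText(path);`, where `path` is whatever PDF file you want to read.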

Please refer to the attached sample for implementation and let us know if you face any issues (see PDF_TextHandling.zip).

If you still face issues, please share a sample PDF that reproduces them, along with any C1-specific code you are using, so that we can investigate.

FYI, the version of the controls you’re currently using is quite outdated and no longer supported. Therefore, we highly recommend updating your C1 version to the latest release to take advantage of the newest features and fixes.

Regards,

Uttkarsh.
john.burke
Posted 20 September 2024, 10:57 am EST - Updated 20 September 2024, 11:14 am EST

I forgot to come back and thank you for the solution.

Too busy using it to process data…

Thanks!

John
