How to Parse and Extract Content from PDF Documents in C# VB.NET
Quick Start Guide | |
---|---|
What You Will Need | .NET application, Document Solutions for PDF |
Controls Referenced | |
Tutorial Concept | Learn how to parse and modify text, extract images, and utilize regular expressions to find specific data within PDF documents using C# and a powerful .NET PDF API. |
PDF files are among the most common files in the digital world, yet they can also be among the most challenging to work with. Parsing text or images from a PDF can seem daunting, but Document Solutions for PDF v7 makes the process easy. Recent releases continue to improve text handling within PDF documents, including parsing and reading text from a PDF using C# and modifying text throughout a document, alongside many other feature enhancements. New samples covering every feature, each with a full code-behind view, also help developers who edit PDF documents get up and running quickly.
Starting with version 3.2, we’ve continually improved the logic for parsing, extracting, and reading text from a PDF. It efficiently handles special cases, such as text that is rendered multiple times to create bold or shadowed effects, so that such text appears only once in the output instead of being repeated.
For text within a PDF, the Document Solutions for PDF API provides the FindText method, which can find text that spans more than one line. FindText returns a list of FoundPosition objects, and each object’s Bounds property holds an array of Quadrilateral structures that outline the found text. A new property, ITextMap.Paragraphs, returns a collection of ITextParagraph objects associated with the ITextMap.
For images within a PDF, the Document Solutions for PDF API contains the GetImages method, which utilizes the ImageBrush class to return an array of images from the PDF file. You can see an example of this in action directly within our demos.
In this blog, we will be exploring the following topics:
- Parsing and extracting data from a PDF
- Formatting a newly generated PDF
- Reading, parsing, and saving text from an existing PDF into a new PDF using C#
- Parsing, reading, and extracting text from a PDF across multiple lines, paragraphs, or pages
- Utilizing REGEX to parse data from a PDF
- Parsing and extracting images from a PDF
Ready to Get Started? Download Document Solutions for PDF Today!
Create C# PDF Parsing Code with the ITextMap.Paragraphs Property
This example reads an existing multi-page PDF document and demonstrates how to use ITextMap.Paragraphs to extract paragraphs from each page of a PDF document. The complete example and code are included in the updated demo sample explorer for Document Solutions for PDF.

The code extracts the text paragraphs on each page and renders them in a new PDF document, alternating the background color from one paragraph to the next for clarity.

Set the Formatting for the Generated PDF
The sample first sets up the formatting for the PDF shown above. It defines an integer value for the page margins, along with the colors used throughout. It then creates a new PDF document, where the text paragraphs will be rendered, and adds a note explaining the sample at the top of the first page.
Next, separate TextFormat objects are created to format the captions and the paragraphs, and a TextLayout object is created to specify the page margins.
Finally, a TextSplitOptions object is created to handle pagination. With this setup in place, the new ITextMap.Paragraphs property keeps the extraction code itself straightforward.
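The setup described above can be sketched roughly as follows. This is a minimal sketch rather than the demo's exact code: it assumes the Document Solutions for PDF NuGet package and its GrapeCity.Documents.Pdf and GrapeCity.Documents.Text namespaces, the margin, colors, fonts, and sizes are illustrative choices, and the explanatory note on the first page is omitted.

```csharp
using System.Drawing;
using GrapeCity.Documents.Pdf;
using GrapeCity.Documents.Text;

// Page margin in points (72 points = 1 inch); the value is an illustrative choice.
const int margin = 36;

// Two background colors that the paragraphs will alternate between.
Color c1 = Color.PaleGreen;
Color c2 = Color.PaleGoldenrod;

// New document that will receive the extracted paragraphs.
var doc = new GcPdfDocument();
var page = doc.NewPage();

// Text format for captions (font and size are assumptions).
var tf = new TextFormat
{
    Font = StandardFonts.Helvetica,
    FontSize = 14,
    ForeColor = Color.Blue
};

// Text format for paragraphs; BackColor is later swapped between c1 and c2.
var tfpar = new TextFormat
{
    Font = StandardFonts.Times,
    FontSize = 12,
    BackColor = c1
};

// Text layout sized to the page, with the same margin on all sides.
var tl = page.Graphics.CreateTextLayout();
tl.MaxWidth = doc.PageSize.Width;
tl.MaxHeight = doc.PageSize.Height;
tl.MarginAll = margin;

// Split options used when paginating: keep at least two lines of a paragraph
// together at a page break, and start continuation pages below the top margin.
var splitOptions = new TextSplitOptions(tl)
{
    MinLinesInFirstParagraph = 2,
    MinLinesInLastParagraph = 2,
    RestMarginTop = margin
};
```

Keeping the caption and paragraph formats as separate TextFormat objects makes it easy to change only the paragraph background later without touching the captions.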
Code Analysis of Document Solutions Parsing/Reading PDF with C#
Now, we’ll showcase how to utilize the GetTextMap method to extract the text from the original PDF. First, the Wetlands.pdf document (the original PDF) is opened and loaded into a new GcPdfDocument object. Then, the new ITextMap.Paragraphs API is used to get the text paragraphs from each page and append them to the new document. Each paragraph is appended using the tfpar TextFormat, and after each one tfpar is replaced with a copy that uses the other background color, so that neighboring paragraphs are highlighted differently in the new document.
Then, the final document is completed using TextLayout.PerformLayout and TextLayout.Split to paginate the results, merging those into one single output document using the GcPdfDocument.MergeWithDocument method.
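A condensed sketch of this flow, from loading the source file through pagination, merging, and saving, might look like the following. It is written from the description in this section and is not the demo's exact code: the output file name, margin, and font are hypothetical, and the member names used (GetTextMap, Paragraphs, GetText, PerformLayout, Split, MergeWithDocument, Save) should be checked against the API reference for your installed version.

```csharp
using System.Drawing;
using System.IO;
using GrapeCity.Documents.Pdf;
using GrapeCity.Documents.Text;

Color c1 = Color.PaleGreen, c2 = Color.PaleGoldenrod;

// Target document and a page-sized text layout (margin of 36 points is an assumption).
var doc = new GcPdfDocument();
var page = doc.NewPage();
var tfpar = new TextFormat { Font = StandardFonts.Times, FontSize = 12, BackColor = c1 };
var tl = page.Graphics.CreateTextLayout();
tl.MaxWidth = doc.PageSize.Width;
tl.MaxHeight = doc.PageSize.Height;
tl.MarginAll = 36;
var splitOptions = new TextSplitOptions(tl) { RestMarginTop = 36 };

using (var fs = File.OpenRead("Wetlands.pdf"))
{
    // Load the original PDF into a temporary document.
    var source = new GcPdfDocument();
    source.Load(fs);

    foreach (var srcPage in source.Pages)
    {
        // The text map of a page exposes its paragraphs.
        foreach (var par in srcPage.GetTextMap().Paragraphs)
        {
            tl.AppendLine(par.GetText(), tfpar);
            // Copy the format with the other background color for the next paragraph.
            tfpar = new TextFormat(tfpar) { BackColor = tfpar.BackColor == c1 ? c2 : c1 };
        }
    }

    // Lay out all the text, then split it page by page.
    tl.PerformLayout(true);
    while (true)
    {
        // 'rest' receives the text that did not fit on the current page.
        var result = tl.Split(splitOptions, out TextLayout rest);
        doc.Pages.Last.Graphics.DrawTextLayout(tl, PointF.Empty);
        if (result != SplitResult.Split)
            break;
        tl = rest;
        doc.NewPage();
    }

    // Append the original pages for reference, then save while the source stream is still open.
    doc.MergeWithDocument(source, new MergeDocumentOptions());
    doc.Save("WetlandsParagraphs.pdf");
}
```

Saving inside the using block is deliberate in this sketch: the merged pages come from the source document, so its stream is kept open until the output has been written.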
Finalize C# PDF Parsing/Reading Code and Extract Data (Save)
The final result is saved as a new PDF using the GcPdfDocument.Save method.
Parse/Read the Text Across Multiple Lines or Paragraphs with C# and DsPdf

The FindText method now supports finding text that appears in multiple lines in a paragraph or across pages. To illustrate this, code similar to the code in the FindText demo sample is added, which searches for longer text strings that span across multiple lines and paragraphs. FindText will return a list with the found positions of all the instances where the indicated text string was found within the document. The FoundPosition.Bounds property returns an array of Quadrilateral structures, forming the bounds in each successive line or section.
The sample uses the FindText method to find two longer text strings: the first spans multiple lines, and the second spans multiple paragraphs.
The code uses GcGraphics.FillPolygon to highlight the found text and fill the area of the found text with a semi-transparent orange-red color, as shown in the output image above.
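The search-and-highlight step can be sketched as shown here. This is a hedged illustration, not the demo code: the search phrase and output file name are hypothetical placeholders, and the sketch assumes that FoundPosition exposes a PageIndex property and that FillPolygon accepts a Quadrilateral directly, so verify both against the API reference.

```csharp
using System.Drawing;
using System.IO;
using GrapeCity.Documents.Pdf;

using (var fs = File.OpenRead("Wetlands.pdf"))
{
    var doc = new GcPdfDocument();
    doc.Load(fs);

    // Hypothetical phrase; replace it with text that wraps across lines in your PDF.
    // Arguments: text, wholeWord, matchCase.
    var findParams = new FindTextParams("a phrase that wraps onto the next line", false, false);

    // Search the whole document; each result describes one occurrence.
    var finds = doc.FindText(findParams, OutputRange.All);

    foreach (var find in finds)
    {
        // Assumption: PageIndex identifies the page on which the occurrence was found.
        var g = doc.Pages[find.PageIndex].Graphics;

        // Bounds holds one quadrilateral per line (or section) of the found text.
        foreach (var quad in find.Bounds)
        {
            // Semi-transparent orange-red fill over the found text.
            g.FillPolygon(quad, Color.FromArgb(100, Color.OrangeRed));
        }
    }

    doc.Save("WetlandsHighlighted.pdf");
}
```

Because a match that wraps produces several quadrilaterals, looping over Bounds rather than drawing a single rectangle keeps the highlight tight around each line.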
Utilize REGEX to Extract Data from a PDF
Another useful application of DsPDF’s FindText method is the use of regular expressions so that specific known pieces of information can be quickly and easily extracted from PDF documents.
Document Solutions for PDF supports finding text based on regular expressions: the expression is passed to the FindText method of the GcPdfDocument class through the FindTextParams class. The sample uses FindText in this way to extract an invoice total and a customer email address from an invoice.
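A regular-expression search along these lines could look like the sketch below. Several details are assumptions to verify: the seven-argument FindTextParams constructor order (text, wholeWord, matchCase, xDpi, yDpi, withContext, regex), the NearText, PositionInNearText, and PageIndex members of FoundPosition, the hypothetical Invoice.pdf file name, and the two patterns, which are generic examples rather than the ones used in the original sample.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;
using GrapeCity.Documents.Pdf;

// Illustrative patterns: any dollar amount, and a simple email address.
const string amountPattern = @"\$[0-9]+(\.[0-9]{2})?";
const string emailPattern = @"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}";

using (var fs = File.OpenRead("Invoice.pdf"))
{
    var doc = new GcPdfDocument();
    doc.Load(fs);

    foreach (var pattern in new[] { amountPattern, emailPattern })
    {
        // Assumed argument order: text, wholeWord, matchCase, xDpi, yDpi, withContext, regex.
        var findParams = new FindTextParams(pattern, false, false, 72, 72, true, true);

        foreach (var pos in doc.FindText(findParams, OutputRange.All))
        {
            // NearText is assumed to contain the match plus surrounding context;
            // re-running the pattern from the reported offset isolates the matched value.
            var match = new Regex(pattern).Match(pos.NearText, pos.PositionInNearText);
            var value = match.Success ? match.Value : pos.NearText;
            Console.WriteLine($"Page {pos.PageIndex + 1}: {value}");
        }
    }
}
```

The amount pattern matches every dollar figure in the document, so picking out the invoice total specifically would need a pattern anchored to its label, which depends on how the invoice text is laid out.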
Here is an image of the input PDF document that the regular expressions are run against:

Below is the console window output after running the regular expression search from the code above against the PDF:

Parse and Extract Images from a PDF
While many people may only be interested in extracting text from PDFs, Document Solutions for PDF’s powerful API library also offers the ability to extract images from PDFs. Let’s revisit the Wetlands.pdf that we extracted text from in the first portion of this blog, but this time, we’ll only be extracting the images instead of the text. Each image extracted from the original PDF will exist on a separate page of our newly generated PDF. The full sample can be found in the demo section of our website.
The image below shows the extracted image from the original Wetlands.pdf inside a new PDF file:

To extract the image, the GetImages method was used. The sample starts by loading the Wetlands.pdf file as the data source for a new GcPdfDocument object. It then creates a variable called imageInfos to store the array of images returned from the GetImages method. Next, it creates a new PDF document to hold the extracted images and iterates through imageInfos, adding each image found in the original PDF to a new page of the new document. While doing this, it also marks each page of the new PDF with the page number where the image was found in the original document. This is possible because the objects returned from the GetImages method include the page indices and locations of each image in the original PDF. Lastly, the newly generated PDF document is saved to view later.
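The steps above can be sketched as follows. This is an approximation written from the description rather than the demo source: the output file name, font, and page geometry are hypothetical, and the sketch assumes that each item returned by GetImages exposes an Image property and a Locations collection whose entries carry a Page, and that DrawImage accepts the extracted image together with an ImageAlign value from GrapeCity.Documents.Drawing.

```csharp
using System.Drawing;
using System.IO;
using System.Linq;
using GrapeCity.Documents.Drawing;
using GrapeCity.Documents.Pdf;
using GrapeCity.Documents.Text;

using (var fs = File.OpenRead("Wetlands.pdf"))
{
    // Load the original PDF and collect information about its images.
    var source = new GcPdfDocument();
    source.Load(fs);
    var imageInfos = source.GetImages();

    // New document: one page per extracted image.
    var doc = new GcPdfDocument();
    var tf = new TextFormat { Font = StandardFonts.Times, FontSize = 12 };
    var textPt = new PointF(72, 72);
    var imageRc = new RectangleF(72, 144, doc.PageSize.Width - 144, doc.PageSize.Height - 216);

    foreach (var imageInfo in imageInfos)
    {
        // The same image can appear in several places; list the distinct
        // 1-based page numbers where it was found in the original PDF.
        var pages = imageInfo.Locations
            .Select(loc => loc.Page.Index + 1)
            .Distinct()
            .OrderBy(n => n);

        var g = doc.NewPage().Graphics;
        g.DrawString($"Found on page(s): {string.Join(", ", pages)}", tf, textPt);

        // Scale the image to fit the target rectangle.
        g.DrawImage(imageInfo.Image, imageRc, null, ImageAlign.ScaleImage);
    }

    // Save while the source stream is still open, since the images come from it.
    doc.Save("WetlandsImages.pdf");
}
```

For large files the GetImages call may take a while, so it is worth calling it once and reusing the result, as this sketch does.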
Ready to Try It Out? Download Document Solutions for PDF Today!
We hope you have found this helpful! Contact us with any questions you may have related to this blog or the Document Solutions product family, and keep on coding!