Programmatically Search and Highlight Text in PDFs using C# in .NET
Quick Start Guide | |
---|---|
What You Will Need | Visual Studio Code, .NET 8, NuGet Package: DS.Documents.Pdf 7.0.3 |
Controls Referenced | |
Tutorial Concept | This tutorial shows how to programmatically search for text and highlight the matches in PDFs using a C#/.NET PDF API. |
This tutorial covers several ways to programmatically search for, find, and highlight text in PDF documents using a .NET/C# API. We will go over loading a PDF, conducting text searches, and creating highlight markups in different colors and shapes. The examples use Document Solutions for PDF (DsPdf, formerly GcPdf), a PDF API for C#/.NET developers. The generated PDFs are shown in the included JavaScript Document Solutions PDF Viewer.
Learn More About Document Solutions for PDF by Downloading a Trial Today!
This blog will cover how to conduct the following PDF text searches programmatically using a C# .NET PDF API:
- Find and Highlight Text in a PDF Document
- Search for Text on a Specific PDF Page
- Find and Highlight Text From a Specific Range of PDF Pages
- Search for Text in a PDF Based on Structure Tags
- Find and Markup Transformed Text in PDFs
To Follow Along, Download a Sample App for this Tutorial Here.
Find and Highlight Text in a PDF Document Using C#
DsPdf's FindText method locates every instance of a given text in a PDF document. Each found item can then be highlighted by drawing on the page's Graphics object using the bounds returned for that item, with colors taken from System.Drawing. Search behavior is set through the FindTextParams constructor, whose options include wholeWord and matchCase. These options determine whether the search matches only whole words, is case-sensitive, or both.
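To make the two flags concrete, here is a minimal sketch in plain .NET. It does not use DsPdf: the `CountMatches` helper is hypothetical and relies on regular expressions, so DsPdf's own notion of a word boundary may differ in edge cases. It only illustrates how whole-word and case-sensitive matching change the number of hits.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SearchFlagsDemo
{
    // Hypothetical helper (not a DsPdf API): counts occurrences of `term` in `text`
    // under the given whole-word and match-case settings.
    public static int CountMatches(string text, string term, bool wholeWord, bool matchCase)
    {
        // Escape the term so characters such as '.' are matched literally.
        string pattern = Regex.Escape(term);
        if (wholeWord)
            pattern = @"\b" + pattern + @"\b";

        RegexOptions options = matchCase ? RegexOptions.None : RegexOptions.IgnoreCase;
        return Regex.Matches(text, pattern, options).Count;
    }

    public static void Main()
    {
        const string sample =
            "Wetlands store water. A wetland is not a lake. Many wetlands are protected.";

        Console.WriteLine(CountMatches(sample, "wetland", wholeWord: false, matchCase: false)); // 3
        Console.WriteLine(CountMatches(sample, "wetland", wholeWord: true, matchCase: false));  // 1
        Console.WriteLine(CountMatches(sample, "wetland", wholeWord: false, matchCase: true));  // 2
    }
}
```

With both flags off, "wetland" also matches inside "Wetlands" and "wetlands". Turning on whole-word matching leaves only the standalone word, and turning on case matching drops the capitalized "Wetlands".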
Note: To follow along with this section, you must include the GrapeCity.Documents.Common namespace.
The following code will search for the whole word "wetlands" in a PDF and then highlight the found text:
// Initialize the DsPdf document instance
var doc = new GcPdfDocument();
using (var fs = new FileStream(Path.Combine("wetlands.pdf"), FileMode.Open, FileAccess.Read))
{
    // Load a sample PDF
    doc.Load(fs);
    // Use the FindText method to search for "wetlands" using a case-insensitive, whole-word match
    var findsWetlands = doc.FindText(new FindTextParams("wetlands", true, false), OutputRange.All);
    // Highlight all found text using semi-transparent orange red
    foreach (var find in findsWetlands)
        doc.Pages[find.PageIndex].Graphics.FillPolygon(find.Bounds[0], Color.FromArgb(100, Color.OrangeRed));
    doc.Save("1 - Search and Highlight Text.pdf");
}
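The highlight color above is built with Color.FromArgb(100, Color.OrangeRed), where the first argument is the alpha channel on a 0 to 255 scale, so 100 is roughly 39% opaque. The sketch below uses only System.Drawing.Color. The `WithOpacity` helper is a hypothetical convenience, not part of DsPdf, for expressing the alpha value as a fraction instead.

```csharp
using System;
using System.Drawing;

public static class HighlightColors
{
    // Hypothetical helper: returns baseColor with the given opacity,
    // where 0.0 is fully transparent and 1.0 is fully opaque.
    public static Color WithOpacity(Color baseColor, double opacity)
    {
        if (opacity < 0.0 || opacity > 1.0)
            throw new ArgumentOutOfRangeException(nameof(opacity), "Opacity must be between 0 and 1.");

        int alpha = (int)Math.Round(opacity * 255);
        return Color.FromArgb(alpha, baseColor);
    }

    public static void Main()
    {
        // The color used in the tutorial: OrangeRed (255, 69, 0) with alpha 100.
        Color fromTutorial = Color.FromArgb(100, Color.OrangeRed);
        Console.WriteLine($"{fromTutorial.A} {fromTutorial.R} {fromTutorial.G} {fromTutorial.B}"); // 100 255 69 0

        // The same base color at 40% opacity.
        Color fortyPercent = WithOpacity(Color.OrangeRed, 0.4);
        Console.WriteLine(fortyPercent.A); // 102
    }
}
```

A lower alpha keeps the underlying text easier to read, while a higher alpha makes the highlight stand out more.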
Developers can do a multitude of searches and apply different types of markups. See our online documentation and demo explorer to learn more.
Search for Text on a Specific PDF Page using C#
In some scenarios, you may want to limit a text search to a single page rather than scanning the entire PDF document. To do this, get the text map of the page by its index and run the search only within that text map. The steps are: create a FindTextParams instance, get the page's text map with the GetTextMap method, and call the text map's FindText method with a callback that handles each found item.
The following code demonstrates this by searching and highlighting the word “the” on the 2nd page of the PDF document.
// Create a new instance of the PDF document
GcPdfDocument doc = new GcPdfDocument();
using (var fs = new FileStream(Path.Combine("wetlands.pdf"), FileMode.Open, FileAccess.Read))
{
    // Load existing PDF
    doc.Load(fs);
    // 1. Create a new instance of FindTextParams
    var ftp = new FindTextParams("the", true, false);
    // 2. Get the text map of a page by its index. Note: the index starts at 0, so this searches page 2
    var tm = doc.Pages[1].GetTextMap();
    if (tm != null)
        // 3. Search within the text map using the FindText method and highlight found text in orange
        tm.FindText(ftp, (p_) => {
            doc.Pages[1].Graphics.FillPolygon(p_.Bounds[0], Color.FromArgb(100, Color.OrangeRed));
        });
    doc.Save("2 - Search Text Only Page 2.pdf");
}
Find and Highlight Text From a Specific Range of PDF Pages Using C#
Searching for text within a specific page range in a PDF keeps the analysis focused. It can reduce the amount of content that has to be searched and isolates the pages of interest. To do this programmatically, pass an OutputRange to the FindText method instead of OutputRange.All. In the code below, the OutputRange constructor receives the first and last page of the range to search.
Note: To follow along with this section, you must include the GrapeCity.Documents.Common namespace.
The code below searches for the word "the" and highlights it only on pages 2 and 3 of the provided PDF document.
// Initialize the DsPdf document instance
var doc = new GcPdfDocument();
using (var fs = new FileStream(Path.Combine("wetlands.pdf"), FileMode.Open, FileAccess.Read))
{
    // Load an existing document from file stream
    doc.Load(fs);
    // Create a new FindTextParams instance
    var ftp = new FindTextParams("the", true, false);
    // Define the page range to search, from page 2 to page 3
    OutputRange searchRange = new OutputRange(2, 3);
    // Find all text using a case-insensitive, whole-word search within the page range
    var findsTextThe = doc.FindText(ftp, searchRange);
    // Highlight each found item on its page
    foreach (var find in findsTextThe)
        doc.Pages[find.PageIndex].Graphics.FillPolygon(find.Bounds[0], Color.FromArgb(100, Color.OrangeRed));
    doc.Save("3 - Find and Highlight Text From a Specific Range of PDF Pages.pdf");
}
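The samples so far count pages in two different ways: doc.Pages[1] and find.PageIndex are zero-based indices, while new OutputRange(2, 3) is described in terms of page numbers 2 and 3. The sketch below is plain .NET with hypothetical helper names (`ToPageIndex`, `ToPageIndices`); it is not a DsPdf API and only shows one way to convert between the two conventions without off-by-one mistakes.

```csharp
using System;
using System.Collections.Generic;

public static class PageNumbering
{
    // Converts a 1-based page number into a 0-based page index,
    // rejecting numbers outside the document.
    public static int ToPageIndex(int pageNumber, int pageCount)
    {
        if (pageNumber < 1 || pageNumber > pageCount)
            throw new ArgumentOutOfRangeException(nameof(pageNumber),
                $"Expected a page number between 1 and {pageCount}.");
        return pageNumber - 1;
    }

    // Expands an inclusive 1-based page range into the matching 0-based indices.
    public static List<int> ToPageIndices(int fromPage, int toPage, int pageCount)
    {
        if (fromPage > toPage)
            throw new ArgumentException("fromPage must not be greater than toPage.");

        var indices = new List<int>();
        for (int page = fromPage; page <= toPage; page++)
            indices.Add(ToPageIndex(page, pageCount));
        return indices;
    }

    public static void Main()
    {
        // Page 2 of a 5-page document lives at index 1.
        Console.WriteLine(ToPageIndex(2, 5)); // 1

        // Pages 2 and 3 correspond to indices 1 and 2.
        Console.WriteLine(string.Join(", ", ToPageIndices(2, 3, 5))); // 1, 2
    }
}
```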
Search for Text in a PDF Based on Structure Tags
Searching by structure tags is another way to narrow a text search. For instance, to locate headings such as H1, H2, or H3, use the GetLogicalStructure method to retrieve the PDF document's logical structure. Then take the root element, collect its child elements whose type matches the desired tag, such as "H1", and loop through them to mark the ones that contain the desired text.
Note: To follow along with this section, you must include the GrapeCity.Documents.Pdf.Recognition.Structure namespace.
The following code will get the PDF’s H1 tags and search through them for the text “C1Olap”.
GcPdfDocument doc = new GcPdfDocument();
using (var fs = new FileStream(Path.Combine("read-tags-to-outlines.pdf"), FileMode.Open, FileAccess.Read))
{
    doc.Load(fs);
    // Get the LogicalStructure of the doc
    LogicalStructure ls = doc.GetLogicalStructure();
    if (ls == null || ls.Elements == null || ls.Elements.Count == 0)
    {
        // No structure tags found: report it and stop
        Console.WriteLine("No structure tags were found in the source document.");
        return;
    }
    // The first top-level element serves as the root of the logical structure
    Element root = ls.Elements[0];
    // Find all the H1 tags among the root's children
    var find = root.Children.ToList().FindAll(e_ => e_.StructElement.Type == "H1");
    // Loop through all found H1 tags, looking for specific text
    foreach (Element e in find)
    {
        var color = Color.FromArgb(64, Color.Red);
        if (e.HasContentItems)
        {
            // Get the header's text
            var text = e.GetText();
            foreach (var i in e.ContentItems)
            {
                // Search for a title containing the text "C1Olap"
                if (text.Contains("C1Olap", StringComparison.OrdinalIgnoreCase))
                {
                    if (i is ContentItem ci)
                    {
                        var p = ci.GetParagraph();
                        if (p != null)
                        {
                            // Outline the found H1 using the coordinates of its paragraph
                            ci.Page.Graphics.DrawPolygon(p.GetCoords(), color, 1, null);
                        }
                    }
                }
            }
        }
        else
            Console.WriteLine("No Text Found");
    }
    doc.Save("4 - Search for Text in a PDF Based on Structure Tags.pdf");
    Console.WriteLine("PDF saved");
}
To learn more about reading PDF structure tags using C#, check out the online Read Structure Tags Demo.
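The sample above inspects only the direct children of the root element. If a document nests its headings deeper in the structure tree (an assumption about other files, not about the sample PDF), a full traversal may be needed. The sketch below shows a depth-first search over a hypothetical `TagNode` class; it is a stand-in for illustration and not a DsPdf type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a structure element; not a DsPdf type.
public sealed class TagNode
{
    public string Type { get; }
    public string Text { get; }
    public List<TagNode> Children { get; } = new List<TagNode>();

    public TagNode(string type, string text = "", params TagNode[] children)
    {
        Type = type;
        Text = text;
        Children.AddRange(children);
    }
}

public static class TagSearch
{
    // Walks the tree depth-first in document order and yields every node of the given type.
    public static IEnumerable<TagNode> FindByType(TagNode root, string type)
    {
        var stack = new Stack<TagNode>();
        stack.Push(root);
        while (stack.Count > 0)
        {
            TagNode node = stack.Pop();
            if (node.Type == type)
                yield return node;

            // Push children in reverse so the first child is visited first.
            for (int i = node.Children.Count - 1; i >= 0; i--)
                stack.Push(node.Children[i]);
        }
    }

    public static void Main()
    {
        var root = new TagNode("Document", "",
            new TagNode("H1", "C1Olap Quick Start"),
            new TagNode("Sect", "",
                new TagNode("H1", "Installation"),
                new TagNode("P", "Some body text")));

        // Find every H1 at any depth, then keep those mentioning "C1Olap".
        var hits = FindByType(root, "H1")
            .Where(n => n.Text.Contains("C1Olap", StringComparison.OrdinalIgnoreCase));

        foreach (TagNode hit in hits)
            Console.WriteLine(hit.Text); // C1Olap Quick Start
    }
}
```

An explicit stack avoids deep recursion on large structure trees while still returning matches in document order.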
Find and Markup Graphically Transformed Text in PDFs
PDFs often contain graphically transformed text, that is, text drawn on top of an existing PDF using page graphics. This is typical when adding a logo or watermark to a PDF. DsPdf can search within graphically transformed text and highlight the found items.
To accomplish this, use DsPdf's FindText method to search for the desired text.
Then, loop through each page that contains the searched text and insert a new content stream through the page's ContentStreams property. Get a graphics object for that stream using the GetGraphics method, and use it to highlight the bounds of each found item.
The following code searches a PDF document for the text "LOGO", which appears as graphically transformed text acting as a logo watermark, and highlights each found instance with a filled blue outline.
// Initialize the DsPdf document instance
var doc = new GcPdfDocument();
using (var fs = new FileStream(Path.Combine("Transformed Text.pdf"), FileMode.Open, FileAccess.Read))
{
    // Load an existing document from file stream
    doc.Load(fs);
    // Find all text items 'LOGO', using case-sensitive search
    var finds = doc.FindText(new FindTextParams("LOGO", false, true), OutputRange.All);
    // Highlight all finds: first, find all pages where the text was found
    var pgIndices = finds.Select(f_ => f_.PageIndex).Distinct();
    // Loop through pages with found text
    foreach (int pgIdx in pgIndices)
    {
        var page = doc.Pages[pgIdx];
        // Insert a new content stream at index 0 of the page's content streams
        PageContentStream pcs = page.ContentStreams.Insert(0);
        // Get a graphics object that draws into the new content stream
        var g = pcs.GetGraphics(page);
        foreach (var find in finds.Where(f_ => f_.PageIndex == pgIdx))
        {
            foreach (var ql in find.Bounds)
            {
                // Fill the bounds of the found text, then outline them
                g.FillPolygon(ql, Color.CadetBlue);
                g.DrawPolygon(ql, Color.Blue);
            }
        }
    }
    doc.Save("5 - Find and Markup Graphically Transformed Text in PDFs.pdf");
}
Console.WriteLine("PDF saved");
Try our online demo for Finding Transformed Text using a .NET PDF API to see another example.
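The sample above collects the distinct page indices first and then filters the finds again for each page. As an alternative, the finds can be grouped by page in a single pass, so each page's content stream is created once with its finds already at hand. The sketch below shows only that grouping step in plain .NET; the `Hit` record is a hypothetical stand-in for a search result and is not a DsPdf type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a search result; only the page index and a label are modeled.
public sealed record Hit(int PageIndex, string Label);

public static class HitGrouping
{
    // Groups hits by page index, preserving the original order of hits within each page.
    public static Dictionary<int, List<Hit>> ByPage(IEnumerable<Hit> hits) =>
        hits.GroupBy(h => h.PageIndex)
            .ToDictionary(g => g.Key, g => g.ToList());

    public static void Main()
    {
        var hits = new[]
        {
            new Hit(0, "a"),
            new Hit(2, "b"),
            new Hit(0, "c"),
        };

        foreach (var entry in ByPage(hits).OrderBy(kv => kv.Key))
        {
            string labels = string.Join(", ", entry.Value.Select(h => h.Label));
            Console.WriteLine($"Page index {entry.Key}: {labels}");
        }
        // Page index 0: a, c
        // Page index 2: b
    }
}
```

In a real document, the body of the loop would be where the page's content stream is inserted and each hit's bounds are drawn.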
Learn More About this .NET C# PDF API
This article scratches the surface of the full capabilities of Document Solutions for PDF. Learn how to create, extract, modify, redact, apply signatures, and more with this .NET C# PDF API. Document Solutions offers a full-fledged PDF solution, including a client-side JavaScript PDF viewer control. The JS PDF viewer control is showcased throughout this piece. To learn more about the .NET C# API and its JavaScript PDF viewer, check out our demos and documentation:
Document Solutions for PDF, .NET C# PDF API
Online Demo Explorer | Documentation
Document Solutions PDF Viewer, JavaScript PDF viewer control
Online Demo Explorer | Documentation