DsPdf lets you parse PDF documents by recognizing their text and logical document structure. Content elements such as plain text, tables, paragraphs, and elements in tagged PDF documents can be extracted by using the DsPdf API, as explained below.
To extract text from a PDF:
```csharp
GcPdfDocument doc = new GcPdfDocument();
FileStream fs = new FileStream("DsPdf.pdf", FileMode.Open, FileAccess.Read);
doc.Load(fs);

// Extract the text present on the last page
String text = doc.Pages.Last.GetText();

// Add the extracted text to a new PDF
GcPdfDocument doc1 = new GcPdfDocument();
PointF textPt = new PointF(72, 72);
doc1.NewPage().Graphics.DrawString(text,
    new TextFormat() { FontName = "ARIAL", FontItalic = true }, textPt);
doc1.Save("NewDocument.pdf");
Console.WriteLine("Press any key to exit");
Console.ReadKey();
```
Similarly, you can extract all the text from a document by using the GetText method of the GcPdfDocument class.
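A minimal sketch of that whole-document call, assuming the same DsPdf.pdf input file as in the sample above and that GcPdfDocument lives in the GrapeCity.Documents.Pdf namespace. It uses only the GetText method named in the text and is written as a C# top-level program:

```csharp
using System;
using System.IO;
using GrapeCity.Documents.Pdf;

// Assumption: DsPdf.pdf exists in the working directory.
using (var fs = new FileStream("DsPdf.pdf", FileMode.Open, FileAccess.Read))
{
    var doc = new GcPdfDocument();
    doc.Load(fs);

    // GetText on the document (rather than on a single page)
    // returns the text of the whole document.
    string allText = doc.GetText();
    Console.WriteLine(allText);
}
```

The stream is kept open for as long as the loaded document is in use, which is why the extraction happens inside the using block.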
DsPdf provides the ITextMap interface, which represents the text map of a page in a DsPdf document. It helps you find the geometric positions of the text lines on a page and extract the text at a specific position.
The text map for a specific page in the document can be retrieved using the GetTextMap method of the Page class, which returns an object of type ITextMap. ITextMap provides four overloads of the GetFragment method, which retrieve a text range and the text within that range. The text range is represented by the TextMapFragment class, and each line of text in this range is represented by the TextLineFragment class.
The example code below uses the GetFragment(out TextMapFragment range, out string text) overload to retrieve the geometric positions of all the text lines on a page, and the GetFragment(MapPos startPos, MapPos endPos, out TextMapFragment range, out string text) overload to retrieve the text at a specific position on the page.
```csharp
// 'tl' is a TextLayout created beforehand, for example with
// page.Graphics.CreateTextLayout() on the output document.

// Open an arbitrary PDF, load it into a temp document and use the map to find some texts:
using (var fs = new FileStream("Test.pdf", FileMode.Open, FileAccess.Read))
{
    var doc1 = new GcPdfDocument();
    doc1.Load(fs);
    var tmap = doc1.Pages[0].GetTextMap();

    // Retrieve the text at a specific (known to us) geometric location on the page.
    // The coordinates are given in inches and converted to points (72 per inch):
    float tx0 = 2.1f, ty0 = 3.37f, tx1 = 3.1f, ty1 = 3.5f;
    HitTestInfo htiFrom = tmap.HitTest(tx0 * 72, ty0 * 72);
    HitTestInfo htiTo = tmap.HitTest(tx1 * 72, ty1 * 72);
    tmap.GetFragment(htiFrom.Pos, htiTo.Pos, out TextMapFragment range1, out string text1);
    tl.AppendLine($"Looked for text inside rectangle x={tx0:F2}\", y={ty0:F2}\", " +
        $"width={tx1 - tx0:F2}\", height={ty1 - ty0:F2}\", found:");
    tl.AppendLine(text1);
    tl.AppendLine();

    // Get all text fragments and their locations on the page:
    tl.AppendLine("List of all texts found on the page");
    tmap.GetFragment(out TextMapFragment range, out string text);
    foreach (TextLineFragment tlf in range)
    {
        var coords = tmap.GetCoords(tlf);
        tl.Append($"Text at ({coords.B.X / 72:F2}\",{coords.B.Y / 72:F2}\"):\t");
        tl.AppendLine(tmap.GetText(tlf));
    }

    // Print the results:
    tl.PerformLayout(true);
}
```
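The sample multiplies inch values by 72 before hit testing and divides by 72 when printing, because PDF page coordinates are expressed in points (72 points per inch). The sketch below isolates that conversion. The helper names InchesToPoints and PointsToInches are hypothetical and are not part of the DsPdf API; the code is written as a C# top-level program.

```csharp
using System;

// PDF user space is measured in points: 72 points per inch.
const float PointsPerInch = 72f;

// Hypothetical helpers (not part of DsPdf) for converting between units.
float InchesToPoints(float inches) => inches * PointsPerInch;
float PointsToInches(float points) => points / PointsPerInch;

// Top-left corner of the search rectangle from the sample, given in inches
// and converted to the point values that a hit test would expect.
float x0 = InchesToPoints(2.1f);
float y0 = InchesToPoints(3.37f);
Console.WriteLine($"Hit-test point: ({x0:F2}, {y0:F2}) pt");

// Converting back recovers the inch values used for display.
Console.WriteLine($"Back in inches: ({PointsToInches(x0):F2}, {PointsToInches(y0):F2})");
```

Keeping the conversion in one place makes it harder to mix up the x and y arguments when building the two corner points of a search area.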
DsPdf allows extracting text paragraphs from a PDF document by using the Paragraphs property of the ITextMap interface. It returns a collection of ITextParagraph objects associated with the text map.
PDF documents sometimes contain repeated text (for example, the same text drawn more than once with a slight overlap to make it look bold). DsPdf extracts such text without returning the redundant lines. Tables with multi-line text in cells are also correctly recognized as text paragraphs.
The example code below shows how to extract all text paragraphs of a PDF document:
```csharp
GcPdfDocument doc = new GcPdfDocument();
var page = doc.NewPage();
var tl = page.Graphics.CreateTextLayout();
tl.MaxWidth = doc.PageSize.Width;
tl.MaxHeight = doc.PageSize.Height;

// Text split options for widow/orphan control
TextSplitOptions to = new TextSplitOptions(tl)
{
    MinLinesInFirstParagraph = 2,
    MinLinesInLastParagraph = 2,
};

// Open a PDF, load it into a temp document and get all page texts
using (var fs = new FileStream("Wetlands.pdf", FileMode.Open, FileAccess.Read))
{
    var doc1 = new GcPdfDocument();
    doc1.Load(fs);
    for (int i = 0; i < doc1.Pages.Count; ++i)
    {
        tl.AppendLine(string.Format("Paragraphs from page {0} of the original PDF:", i + 1));
        var pg = doc1.Pages[i];
        var pars = pg.GetTextMap().Paragraphs;
        foreach (var par in pars)
        {
            tl.AppendLine(par.GetText());
        }
    }

    tl.PerformLayout(true);
    while (true)
    {
        // 'rest' will accept the text that did not fit
        var splitResult = tl.Split(to, out TextLayout rest);
        doc.Pages.Last.Graphics.DrawTextLayout(tl, PointF.Empty);
        if (splitResult != SplitResult.Split)
            break;
        tl = rest;
        doc.NewPage();
    }

    // Append the original document for reference
    doc.MergeWithDocument(doc1, new MergeDocumentOptions());
}

// Save the document ('stream' is an output stream supplied by the caller)
doc.Save(stream);
return doc.Pages.Count;
```
Limitations
DsPdf allows you to extract data from tables in PDF documents. The GetTable method of the Page class extracts data from the area specified as a table. The method takes the table area as a parameter, parses that area, and returns the rows, columns, and cells along with their textual content. You can also pass TableExtractOptions as a parameter to specify table formatting options such as column width, row height, and the distance between rows or columns.
The example code below shows how to extract data from a table in a PDF document:
```csharp
const float DPI = 72;
const float margin = 36;
var doc = new GcPdfDocument();

var tf = new TextFormat()
{
    Font = Font.FromFile("segoeui.ttf"),
    FontSize = 9,
    ForeColor = Color.Black
};
var tfRed = new TextFormat(tf) { ForeColor = Color.Red };

// 'stream' is an output stream supplied by the caller
using (var fs = File.OpenRead("zugferd-invoice.pdf"))
{
    // The approx table bounds:
    var tableBounds = new RectangleF(0, 3 * DPI, 8.5f * DPI, 3.75f * DPI);

    var page = doc.NewPage();
    page.Landscape = true;
    var g = page.Graphics;

    var tl = g.CreateTextLayout();
    tl.MaxWidth = page.Bounds.Width;
    tl.MaxHeight = page.Bounds.Height;
    tl.MarginAll = margin;
    tl.DefaultTabStops = 150;
    tl.LineSpacingScaleFactor = 1.2f;

    var docSrc = new GcPdfDocument();
    docSrc.Load(fs);

    var itable = docSrc.Pages[0].GetTable(tableBounds);
    if (itable == null)
    {
        tl.AppendLine("No table was found at the specified coordinates.", tfRed);
    }
    else
    {
        tl.Append($"\nThe table has {itable.Cols.Count} column(s) and {itable.Rows.Count} row(s), table data is:", tf);
        tl.AppendParagraphBreak();
        for (int row = 0; row < itable.Rows.Count; ++row)
        {
            // Header and data rows use the same text format here
            var tfmt = tf;
            for (int col = 0; col < itable.Cols.Count; ++col)
            {
                var cell = itable.GetCell(row, col);
                if (col > 0)
                    tl.Append("\t", tfmt);
                if (cell == null)
                    tl.Append("<no cell>", tfRed);
                else
                    tl.Append(cell.Text, tfmt);
            }
            tl.AppendLine();
        }
    }

    TextSplitOptions to = new TextSplitOptions(tl)
    {
        RestMarginTop = margin,
        MinLinesInFirstParagraph = 2,
        MinLinesInLastParagraph = 2
    };
    tl.PerformLayout(true);
    while (true)
    {
        var splitResult = tl.Split(to, out TextLayout rest);
        doc.Pages.Last.Graphics.DrawTextLayout(tl, PointF.Empty);
        if (splitResult != SplitResult.Split)
            break;
        tl = rest;
        doc.NewPage().Landscape = true;
    }

    // Append the original document for reference
    doc.MergeWithDocument(docSrc);
    doc.Save(stream);
}
```
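The inner loop of the sample writes each row as tab-separated cell texts and substitutes a marker where no cell was found. A minimal sketch of that formatting step on plain strings, independent of DsPdf, is shown below. The FormatRow helper and the sample data are hypothetical; in real code the strings would come from the Text of each extracted cell.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of DsPdf): joins the texts of one table row
// with tabs, substituting a marker for missing (null) cells.
string FormatRow(IEnumerable<string> cellTexts) =>
    string.Join("\t", cellTexts.Select(t => t ?? "<no cell>"));

// Made-up rows standing in for extracted cell texts; a null entry stands
// for a position where no cell was found.
var rows = new List<string[]>
{
    new[] { "Item", "Qty", "Price" },
    new[] { "Widget", null, "9.99" },
};

foreach (var row in rows)
    Console.WriteLine(FormatRow(row));
```

Separating the formatting from the extraction makes it straightforward to send the same rows to a console, a log, or a text layout.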
Limitation
DsPdf can recognize the logical structure of a source document from which the PDF document is generated. This structure recognition is further used to extract content elements from tagged PDF documents.
Based on the PDF specification, DsPdf recognizes the logical structure by using the LogicalStructure class. It represents the parsed logical structure of a PDF document, created on the basis of the tags in the PDF structure tree. The StructElement property of the Element class can be used to get the element type, such as TR for table rows, H for headings, or P for paragraphs.
The example code below shows how to extract headings, tables and TOC elements from a tagged PDF document:
```csharp
static void ShowTable(Element e)
{
    List<List<IList<ITextParagraph>>> table = new List<List<IList<ITextParagraph>>>();

    // select all nested rows, elements with type TR
    void SelectRows(IList<Element> elements)
    {
        foreach (Element ec in elements)
        {
            if (ec.HasChildren)
            {
                if (ec.StructElement.Type == "TR")
                {
                    var cells = ec.Children.FindAll((e_) => e_.StructElement.Type == "TD").ToArray();
                    List<IList<ITextParagraph>> tableCells = new List<IList<ITextParagraph>>();
                    foreach (var cell in cells)
                        tableCells.Add(cell.GetParagraphs());
                    table.Add(tableCells);
                }
                else
                    SelectRows(ec.Children);
            }
        }
    }
    SelectRows(e.Children);

    // show table
    int colCount = table.Max((r_) => r_.Count);
    Console.WriteLine();
    Console.WriteLine();
    Console.WriteLine($"Table: {table.Count}x{colCount}");
    Console.WriteLine($"------");
    foreach (var r in table)
    {
        foreach (var c in r)
        {
            var s = c == null || c.Count <= 0 ? string.Empty : c[0].GetText();
            Console.Write(s);
            Console.Write("\t");
        }
        Console.WriteLine();
    }
}

static void Main(string[] args)
{
    GcPdfDocument doc = new GcPdfDocument();
    using (var s = new FileStream("C1Olap QuickStart.pdf", FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        doc.Load(s);

        // get the LogicalStructure and top parent element
        LogicalStructure ls = doc.GetLogicalStructure();
        Element root = ls.Elements[0];

        // select all headings
        Console.WriteLine("TOC:");
        Console.WriteLine("----");

        // iterate over elements and select all heading elements
        foreach (Element e in root.Children)
        {
            string type = e.StructElement.Type;
            if (string.IsNullOrEmpty(type) || !type.StartsWith("H"))
                continue;
            int headingLevel;
            if (!int.TryParse(type.Substring(1), out headingLevel))
                continue;

            // get the element text
            string text = e.GetText();
            if (string.IsNullOrEmpty(text))
                text = "H" + headingLevel.ToString();
            text = new string(' ', (headingLevel - 1) * 2) + text;
            Console.WriteLine(text);
        }

        // select all tables
        var tables = root.Children.FindAll((e_) => e_.StructElement.Type == "Table").ToArray();
        foreach (var t in tables)
        {
            ShowTable(t);
        }
    }
}
```
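The TOC part of the sample treats a structure type such as H1 or H2 as a heading, reads the digits after the H as the heading level, and indents the text by two spaces per level. The sketch below isolates that logic on plain strings so it can be tried without a PDF. The TryGetHeadingLevel and IndentHeading helpers are hypothetical names, not DsPdf API, and the lower bound on the level is an added guard that the sample does not have.

```csharp
using System;

// Hypothetical helper mirroring the sample's heading check: a structure type
// of the form "H<n>" yields its numeric level; anything else yields false.
bool TryGetHeadingLevel(string type, out int level)
{
    level = 0;
    return !string.IsNullOrEmpty(type)
        && type.StartsWith("H", StringComparison.Ordinal)
        && int.TryParse(type.Substring(1), out level)
        && level >= 1; // added guard: reject zero or negative levels
}

// Indent a heading by two spaces per nesting level, as the TOC sample does.
string IndentHeading(string text, int level) =>
    new string(' ', (level - 1) * 2) + text;

// Types without a numeric level ("P", "H", "Table") are skipped,
// just as the sample skips them with 'continue'.
foreach (var type in new[] { "H1", "H2", "P", "H", "Table" })
{
    if (TryGetHeadingLevel(type, out int lvl))
        Console.WriteLine(IndentHeading(type, lvl));
}
```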
The example code below rebuilds a Word document from the paragraphs of a tagged PDF. It uses the GcWordDocument class from DsWord:
```csharp
// Restore a Word document from a PDF
GcPdfDocument doc = new GcPdfDocument();
using (var s = new FileStream("CharacterFormatting.pdf", FileMode.Open, FileAccess.Read, FileShare.Read))
{
    doc.Load(s);

    // get the LogicalStructure and top parent element
    LogicalStructure ls = doc.GetLogicalStructure();
    Element root = ls.Elements[0];

    GcWordDocument wdoc = new GcWordDocument();

    // iterate over elements and select all paragraphs
    foreach (Element e in root.Children)
    {
        if (e.StructElement.Type != "P")
            continue;
        var tps = e.GetParagraphs();
        if (tps == null)
            continue;
        foreach (var tp in tps)
        {
            // build a Word paragraph from an ITextParagraph
            Paragraph p = wdoc.Body.Paragraphs.Add();
            foreach (var tr in tp.Runs)
            {
                var range = p.GetRange();
                var run = range.Runs.Add(tr.GetText());
                run.Font.Size = tr.Attrs.FontSize;
                if (tr.Attrs.NonstrokeColor.HasValue)
                    run.Font.Color.RGB = tr.Attrs.NonstrokeColor.Value;

                // copy the font family, weight and style where available
                tr.Attrs.Font.GetFontAttributes(out string fontFamily,
                    out FontWeight? fontWeight,
                    out FontStretch? fontStretch,
                    out bool? fontItalic);
                if (!string.IsNullOrEmpty(fontFamily))
                    run.Font.Name = fontFamily;
                if (fontWeight.HasValue)
                    run.Font.Bold = fontWeight.Value >= FontWeight.Bold;
                if (fontItalic.HasValue)
                    run.Font.Italic = fontItalic.Value;
            }
        }
    }
    wdoc.Save("CharacterFormatting.docx");
}
```
Refer to Tagged PDF to learn how to create tagged PDF files using DsPdf.