Text Handling in C1Document

Posted by: s.sagert on 16 July 2024, 2:10 am EST

  • Posted 16 July 2024, 2:10 am EST

    In build .657, text extraction from PDF documents behaves very differently from earlier builds.

    I search standard documents for certain strings, and I rely on the order in which the extracted text appears.

    Old (e.g. 631):
    BeforeCompany: MyCompany
    Header
    Before9/999999: 9/999999 some other Text
    Header2 ColHeader1 ColHeader2
    Line1Col1 Line1Col2
    Line2Col1 Line2Col2
    
    
    New (657):
    MyCompany Address 9999999 9/999999 Header1 Line1Col1 Line2Col1
    Header BeforeCompany: Before9999999 Before9/999999: Header2

    How can I restore the old behavior in the new version?

    SearchString = "Before9/999999:"
    ' Extract the text of the whole document and keep only the lines that contain the search string.
    Using mc As New C1.Win.C1Document.Util.C1DXTextMeasurementContext()
        Dim pdfLines As New List(Of String)
        Dim dr As C1DocumentRange = C1PDF_DS.GetWholeDocumentRange(mc)
        pdfLines.AddRange(dr.GetText().Split(Environment.NewLine))
        filteredPDFLines = pdfLines.Where(Function(line) line.Contains(SearchString)).ToList()
    End Using

    In the next step, I search for 9/999999 in the lines found, but can no longer find it there:

                For i = 0 To n - 1
                    Dim fp As C1FoundPosition = _textSearchManager.FoundPositions(i)
                    Dim pageNo As Integer = fp.GetPage().PageNo
                    ' Record the page number once for every item number matched in the filtered line.
                    For Each m As Match In regex.Matches(filteredPDFLines(i))
                        If ItemList.ContainsKey(m.Value) Then
                            If Not ItemList.Item(m.Value).Contains(pageNo) Then
                                ItemList.Item(m.Value).Add(pageNo)
                            End If
                        Else
                            ItemList.Add(m.Value, New List(Of Integer) From {pageNo})
                        End If
                    Next
                Next

    Because the line that has now been found looks like this:

    Header BeforeCompany: Before9999999 Before9/999999: Header2

    instead of as before:

    Before9/999999: 9/999999 some other Text
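
    The effect can be reproduced without the PDF library. The following is a minimal sketch in plain .NET that applies the same line filter to the two sample texts above. FilterLines and the pattern \b\d/\d{6}\b are assumptions made for illustration; the actual regex used in the application is not shown in this post.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module LineFilterSketch
    ' Hypothetical helper: returns the lines of documentText that contain searchString.
    ' Splitting on CR LF, LF and CR avoids stray characters with mixed line endings.
    Function FilterLines(documentText As String, searchString As String) As List(Of String)
        Dim separators As String() = {vbCrLf, vbLf, vbCr}
        Return documentText.Split(separators, StringSplitOptions.RemoveEmptyEntries).
            Where(Function(line) line.Contains(searchString)).
            ToList()
    End Function

    Sub Main()
        ' Sample texts copied from the post: old extraction order and new extraction order.
        Dim oldText As String = String.Join(vbCrLf, {
            "BeforeCompany: MyCompany",
            "Header",
            "Before9/999999: 9/999999 some other Text",
            "Header2 ColHeader1 ColHeader2"})
        Dim newText As String = String.Join(vbCrLf, {
            "MyCompany Address 9999999 9/999999 Header1 Line1Col1 Line2Col1",
            "Header BeforeCompany: Before9999999 Before9/999999: Header2"})

        ' Assumed pattern: a standalone item number such as 9/999999.
        Dim itemPattern As New Regex("\b\d/\d{6}\b")

        For Each sample As String In {oldText, newText}
            For Each line As String In FilterLines(sample, "Before9/999999:")
                ' Old order: the label and its value share a line, so one match is found.
                ' New order: the label line no longer contains the value, so no match is found.
                Console.WriteLine(line & " -> " & itemPattern.Matches(line).Count & " match(es)")
            Next
        Next
    End Sub
End Module
```

    With the old text order the filtered line yields one match; with the new order it yields none, because the label and its value end up on different lines.
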
  • Posted 17 July 2024, 3:43 am EST

    Hello,

    We investigated the behavior and observed that the .657 version of PdfDocumentSource does not read the newline characters from the PDF. We have reported this to the development team and will update you as soon as we have more information.

    Could you please let us know if this behavior matches what you have reported? If it differs, provide more details and update the attached sample so we can observe and investigate the behavior accurately.

    [Internal Tracking ID: C1WIN-32679]

    Sample: PDF_TextHandling_VB.zip

    Regards,

    Uttkarsh.

  • Posted 17 July 2024, 6:09 am EST - Updated 17 July 2024, 6:11 am EST

    Thank you for your reply.

    Not reading the newline characters may well be the problem. Unfortunately, I can’t test this at the moment because the system is running build .631.

    I have created a PDF that resembles our documents. The PDF is loaded into a C1.Win.C1Document.C1PdfDocumentSource with LoadFromFile.

    MessageBox.Show(dr.GetText()) outputs the following for this PDF:

    Test Customer
    
    Internet 1
    
    10000 World
    
    Deutschland
    
    Stand: 17.07.24
    
    %%Filepath:2024-7-17–9_999999-Test Customer.pdf%%
    
    Description: 9/999999, SOME (text)
    
    Customs Tariff No. 99999999
    
    Producer: MyCompany
    
    Header 1
    
    Product-Name: 9/999999 SOME (text)
    
    Header 2 Col1 Col2
    
    Line1Col1 Line1Col2
    
    Line2Col1 Line2Col2
    
    Line3Col1 Line3Col1
    
    Line4Col1 Line4Col2
    
    Line5Col1 Line5Col2
    
    Line6Col1 Line6Col2
    
    Header 3
    
    Text in Header 3 Info in Header3
    
    Header 4
    
    Text in Header 4 No

    This is correct and is the behavior we are used to.

    In build .657, the behavior described above occurred and the text came out in a scrambled order.

    2024-7-17-9_999999-Test Customer.zip

  • Posted 22 July 2024, 8:09 am EST

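    The two versions produce the following results. The first block is from build .631 and the second from build .657; each block ends with the list of loaded C1 assemblies: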

    Test Customer
    Internet 1
    10000 World
    Deutschland
    Stand: 17.07.24
    %%Filepath:2024-7-17–9_999999-Test Customer.pdf%%
    Description: 9/999999, SOME (text)
    Customs Tariff No. 99999999
    Producer: MyCompany
    Header 1
    Product-Name: 9/999999 SOME (text)
    Header 2 Col1 Col2
    Line1Col1 Line1Col2
    Line2Col1 Line2Col2
    Line3Col1 Line3Col1
    Line4Col1 Line4Col2
    Line5Col1 Line5Col2
    Line6Col1 Line6Col2
    Header 3
    Text in Header 3 Info in Header3
    Header 4
    Text in Header 4 No
    C1.C1Pdf.4.8, Version=4.8.20233.631, Culture=neutral, PublicKeyToken=79882d576c6336da: v4.0.30319
    C1.Win.C1Document.4.8, Version=4.8.20233.631, Culture=neutral, PublicKeyToken=944ae1ea0e47ca04: v4.0.30319
    C1.Win.4.8, Version=4.8.20233.631, Culture=neutral, PublicKeyToken=944ae1ea0e47ca04: v4.0.30319
    C1.Zip, Version=2.0.20233.3, Culture=neutral, PublicKeyToken=79882d576c6336da: v4.0.30319
    C1.Win.C1DX.4.8, Version=4.8.20233.631, Culture=neutral, PublicKeyToken=944ae1ea0e47ca04: v4.0.30319
    C1.Win.ImportServices.4.8, Version=4.8.20233.631, Culture=neutral, PublicKeyToken=944ae1ea0e47ca04: v4.0.30319
    C1.Win.Barcode.4.8, Version=4.8.20233.631, Culture=neutral, PublicKeyToken=79882d576c6336da: v4.0.30319
    C1.Win.FlexChart.4.8, Version=4.8.20233.631, Culture=neutral, PublicKeyToken=3aa2920c09e0aefd: v4.0.30319

    Test Customer Internet 1
    10000 World Deutschland
    Stand: 17.07.24
    %%Filepath:2024-7-17–9_999999-Test Customer.pdf%%
    9/999999, SOME (text)
    Description:
    MyCompany 99999999 9/999999 Col1 Line1Col1 Line2Col1 Line3Col1 Line4Col1 Line5Col1 Line6Col1
    Header 1 Producer: Customs Tariff No. Product-Name: Header 2
    SOME (text) Col2
    Line1Col2 Line2Col2 Line3Col1 Line4Col2 Line5Col2 Line6Col2
    Header 3
    Text in Header 3
    Info in Header3
    Header 4
    Text in Header 4
    No
    C1.C1Pdf.4.8, Version=4.8.20233.631, Culture=neutral, PublicKeyToken=79882d576c6336da: v4.0.30319
    C1.Win.C1Document.4.8, Version=4.8.20241.657, Culture=neutral, PublicKeyToken=944ae1ea0e47ca04: v4.0.30319
    C1.Win.4.8, Version=4.8.20241.657, Culture=neutral, PublicKeyToken=944ae1ea0e47ca04: v4.0.30319
    C1.Zip, Version=2.0.20233.4, Culture=neutral, PublicKeyToken=79882d576c6336da: v4.0.30319
    C1.Win.ImportServices.4.8, Version=4.8.20241.657, Culture=neutral, PublicKeyToken=944ae1ea0e47ca04: v4.0.30319
    C1.Win.Barcode.4.8, Version=4.8.20241.657, Culture=neutral, PublicKeyToken=79882d576c6336da: v4.0.30319
    C1.Win.FlexChart.4.8, Version=4.8.20241.657, Culture=neutral, PublicKeyToken=3aa2920c09e0aefd: v4.0.30319
    C1.Win.C1DX.4.8, Version=4.8.20241.657, Culture=neutral, PublicKeyToken=944ae1ea0e47ca04: v4.0.30319

    PDF_TextHandling_VB.zip
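
    For anyone who wants to compare builds in the same way, an assembly list in the shape shown above can be produced with standard .NET reflection. This is a minimal sketch, not necessarily how the list above was generated; DescribeAssemblies is a name made up for illustration. It only reports assemblies that are already loaded into the current AppDomain, so it should run after the PDF has been loaded.

```vb
Imports System
Imports System.Linq
Imports System.Reflection

Module LoadedAssembliesSketch
    ' Hypothetical helper: returns "<full name>: <runtime version>" for every
    ' loaded assembly whose simple name starts with namePrefix.
    Function DescribeAssemblies(namePrefix As String) As String()
        Return AppDomain.CurrentDomain.GetAssemblies().
            Where(Function(asm) asm.GetName().Name.StartsWith(namePrefix, StringComparison.Ordinal)).
            Select(Function(asm) asm.FullName & ": " & asm.ImageRuntimeVersion).
            ToArray()
    End Function

    Sub Main()
        ' "C1." selects the ComponentOne assemblies; in a program that does not
        ' reference them, this loop prints nothing.
        For Each entry As String In DescribeAssemblies("C1.")
            Console.WriteLine(entry)
        Next
    End Sub
End Module
```
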

  • Posted 23 July 2024, 6:42 am EST

    Hello,

    Thank you for sharing your observations. The behavior is similar to what we reported to the development team. We’ll update you as soon as we have more information.

    Thank you for your patience and coordination.

    Regards,

    Uttkarsh.
