
Invalid UTF-8 encoding in SVG from PDF-import

Summary:

  1. Inkscape does not sanitize invalid UTF-8 characters when writing to SVG.
  2. Inkscape produces such malformed characters when importing a PDF that contains a layer whose name contains the German special character "ä".

Steps to reproduce:

  • open input PDF
    • I cannot share the original input publicly. It was created with ArchiCAD.
    • Here's an artificially made PDF with similar problems: input.pdf
  • Choose internal import with standard settings
  • Save as SVG

What happened?

  • Note: one of the layers has a broken character
  • The resulting output (note: file has been "anonymized" and simplified) is not a valid SVG file (see validator result). It contains invalid UTF-8 byte sequences and is therefore not well-formed XML. Several other programs refuse to open such malformed files, e.g., Illustrator or the Apache Xerces XML parser.
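
For anyone who wants to confirm the problem without an online validator, here is a minimal Python sketch that checks a saved file for invalid UTF-8 and for XML well-formedness. The function name `check_svg_bytes` and the sample bytes are made up for illustration; the sample only imitates the kind of output described above (a Latin-1 "ä" byte, 0xE4, written into a file that should be UTF-8).

```python
import xml.etree.ElementTree as ET


def check_svg_bytes(data: bytes) -> list[str]:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []

    # Check 1: the raw bytes must decode as UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        bad = data[exc.start:exc.end]
        problems.append(f"invalid UTF-8 at byte offset {exc.start}: {bad!r}")

    # Check 2: the document must be well-formed XML.
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        problems.append(f"not well-formed XML: {exc}")

    return problems


# Hypothetical sample: a lone 0xE4 byte (Latin-1 "ä") inside an attribute value.
broken = b'<svg xmlns="http://www.w3.org/2000/svg"><g id="layer-\xe4"/></svg>'
# The same document with "ä" correctly encoded as UTF-8.
fixed = '<svg xmlns="http://www.w3.org/2000/svg"><g id="layer-ä"/></svg>'.encode("utf-8")

for name, payload in (("broken", broken), ("fixed", fixed)):
    print(name, check_svg_bytes(payload))
```

To test a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to `check_svg_bytes`.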

In later Inkscape 1.5 versions on Linux the behavior has changed slightly: the character encoding now seems to be converted correctly, but the resulting SVG has invalid ID attributes.
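
I have not pinned down exactly what makes the IDs invalid. Assuming they violate the XML name rules (for example, a layer name with spaces copied verbatim into `id`), the following Python sketch shows how such IDs could be checked and repaired. The helper names `is_valid_id` and `make_valid_id` are hypothetical and this is not Inkscape code; the character ranges follow the Name production of XML 1.0 (Fifth Edition) without the colon.

```python
import re

# Characters allowed at the start of an XML name (colon excluded).
_START = (
    r"A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    r"\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    r"\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# Characters allowed after the first position.
_REST = _START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

_NAME_RE = re.compile(f"[{_START}][{_REST}]*")
_START_RE = re.compile(f"[{_START}]")
_REST_RE = re.compile(f"[{_REST}]")


def is_valid_id(value: str) -> bool:
    """True if value is usable as an XML ID under the rules above."""
    return _NAME_RE.fullmatch(value) is not None


def make_valid_id(value: str) -> str:
    """Replace disallowed characters with '_' and fix an invalid first character."""
    cleaned = "".join(ch if _REST_RE.fullmatch(ch) else "_" for ch in value)
    if not cleaned or not _START_RE.fullmatch(cleaned[0]):
        cleaned = "_" + cleaned
    return cleaned


for candidate in ("layer-ä", "Layer 1", "1abc"):
    print(repr(candidate), is_valid_id(candidate), repr(make_valid_id(candidate)))
```

Note that "ä" itself is a legal name character, so a correctly decoded layer name is not a problem on its own. The sketch also does not enforce uniqueness, which IDs additionally require.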

I'm sorry that this bug report is slightly convoluted. Please reply if you need better test data.

What should have happened?

  1. Any SVG saved by Inkscape is valid XML with valid UTF-8, no matter which garbage you originally loaded into Inkscape
  2. Maybe the PDF layer name should be imported with the correct encoding (not sure if the original PDF is broken or not)
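
To illustrate both points, here is a rough Python sketch, not a description of how Inkscape or its PDF library actually work. As far as I understand the PDF specification, a text string such as a layer name is either UTF-16BE with a leading byte order mark (FE FF) or PDFDocEncoding. Python has no PDFDocEncoding codec, so the sketch uses Latin-1 as an approximation; that is an assumption which holds for "ä" (0xE4) but not for every byte value. The function names are made up.

```python
import codecs


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a PDF text string such as a layer name (approximation, see above)."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        # UTF-16BE with byte order mark FE FF.
        return raw[2:].decode("utf-16-be", errors="replace")
    # Assumption: Latin-1 stands in for PDFDocEncoding here.
    return raw.decode("latin-1")


def sanitize_utf8(data: bytes) -> bytes:
    """Last line of defence on save: replace invalid sequences with U+FFFD."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


# Point 2: decoding the layer name with the right encoding keeps the "ä".
print(decode_pdf_text_string(b"Geb\xe4ude"))

# Point 1: if a raw 0xE4 byte reaches the writer anyway, the output is still valid UTF-8.
print(sanitize_utf8(b"Geb\xe4ude"))
```

Replacing bad bytes loses the original character, so this only guarantees a loadable file; decoding correctly at import time is the better fix.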

Sample attachments:

see above

Version info

Tested on Inkscape 1.3.2 (WinServer2022) and on

Inkscape 1.5-dev (1fced74d68, 2024-07-05)

                      Compile  (Run)
    GLib version:     2.80.3
    GTK version:      4.14.4 (4.14.4)
    glibmm version:   2.80.0
    gtkmm version:    4.14.0
    libxml2 version:  2.12.8
    libxslt version:  1.1.41
    Cairo version:    1.18.0 (1.18.0)
    Pango version:    1.54.0 (1.54.0)
    HarfBuzz version: 9.0.0 (9.0.0)

    OS version:       Windows 10 22H2

Note that on later 1.5-dev versions the problem has changed; invalid IDs can still be generated but the text looks OK.

Edited by Max Gaukler