Spark / SPARK-51112

[Connect] Seg fault when converting empty dataframe with nested array columns to pandas


Details

    Description

      Run the following code with a running local connect server:

      from pyspark.sql import SparkSession
      from pyspark.sql.types import StructField, ArrayType, StringType, StructType, IntegerType
      import faulthandler
      faulthandler.enable()

      spark = SparkSession.builder \
          .remote("sc://localhost:15002") \
          .getOrCreate()
      sp_df = spark.createDataFrame(
          data = [],
          schema=StructType(
              [
                  StructField(
                      name='b_int',
                      dataType=IntegerType(),
                      nullable=False,
                  ),
                  StructField(
                      name='b',
                      dataType=ArrayType(ArrayType(StringType(), True), True),
                      nullable=True,
                  ),
              ]
          )
      )
      print(sp_df)
      print('Spark dataframe generated.')
      print(sp_df.toPandas())
      print('Pandas dataframe generated.') 

      When `sp_df.toPandas()` is called, a segmentation fault may occur. The crash is non-deterministic and does not reproduce on every run.

      Segfault:

      Thread 0x00000001f1904f40 (most recent call first):
        File "/Users/venkata.gudesa/spark/test_env/lib/python3.13/site-packages/pyarrow/pandas_compat.py", line 808 in table_to_dataframe
        File "/Users/venkata.gudesa/spark/test_env/lib/python3.13/site-packages/pyspark/sql/connect/client/core.py", line 949 in to_pandas
        File "/Users/venkata.gudesa/spark/test_env/lib/python3.13/site-packages/pyspark/sql/connect/dataframe.py", line 1857 in toPandas
        File "<python-input-3>", line 1 in <module>
        File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/code.py", line 92 in runcode
        File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/console.py", line 205 in runsource
        File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/code.py", line 313 in push
        File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/simple_interact.py", line 160 in run_multiline_interactive_console
        File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/main.py", line 59 in interactive_console
        File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/__main__.py", line 6 in <module>
        File "<frozen runpy>", line 88 in _run_code

      Observations:

      • When I added some sample data, the issue went away and the conversion was successful.
      • When I changed ArrayType(ArrayType(StringType(), True), True) to ArrayType(StringType(), True), there was no seg fault and execution was successful regardless of data.
      • When I converted the nested array column into a JSON field using to_json (and dropped the original nested array column), there was again no seg fault and execution was successful regardless of data.

       

      Conclusion: There is an issue in pyarrow/pandas that is triggered when converting empty datasets containing nested array columns.

People

  vicennial Venkata Sai Akhil Gudesa