Spark / SPARK-51112

[Connect] Seg fault when converting empty dataframe with nested array columns to pandas


Details

    Description

      Run the following code with a running local connect server:

      from pyspark.sql import SparkSession
      from pyspark.sql.types import StructField, ArrayType, StringType, StructType, IntegerType
      import faulthandler
      faulthandler.enable()

      spark = SparkSession.builder \
          .remote("sc://localhost:15002") \
          .getOrCreate()
      sp_df = spark.createDataFrame(
          data = [],
          schema=StructType(
              [
                  StructField(
                      name='b_int',
                      dataType=IntegerType(),
                      nullable=False,
                  ),
                  StructField(
                      name='b',
                      dataType=ArrayType(ArrayType(StringType(), True), True),
                      nullable=True,
                  ),
              ]
          )
      )
      print(sp_df)
      print('Spark dataframe generated.')
      print(sp_df.toPandas())
      print('Pandas dataframe generated.') 

      When `sp_df.toPandas()` is called, a segmentation fault may occur. The crash is non-deterministic and does not reproduce on every run.

      Segfault:

      Thread 0x00000001f1904f40 (most recent call first):
        File "/Users/venkata.gudesa/spark/test_env/lib/python3.13/site-packages/pyarrow/pandas_compat.py", line 808 in table_to_dataframe
        File "/Users/venkata.gudesa/spark/test_env/lib/python3.13/site-packages/pyspark/sql/connect/client/core.py", line 949 in to_pandas
        File "/Users/venkata.gudesa/spark/test_env/lib/python3.13/site-packages/pyspark/sql/connect/dataframe.py", line 1857 in toPandas
        File "<python-input-3>", line 1 in <module>
        File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/code.py", line 92 in runcode
        File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/console.py", line 205 in runsource
        File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/code.py", line 313 in push
        File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/simple_interact.py", line 160 in run_multiline_interactive_console
        File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/main.py", line 59 in interactive_console
        File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/__main__.py", line 6 in <module>
        File "<frozen runpy>", line 88 in _run_code

      Observations:

      • When I added some sample data, the issue went away and the conversion was successful.
      • When I changed ArrayType(ArrayType(StringType(), True), True) to ArrayType(StringType(), True), there was no seg fault and execution was successful regardless of data.
      • When I converted the nested array column into a JSON field using to_json (and dropped the original nested array column), there was again no seg fault and execution was successful regardless of data.

       

      Conclusion: There is an issue in pyarrow/pandas that is triggered when converting empty datasets containing nested array columns.

People

  vicennial Venkata Sai Akhil Gudesa