Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Versions: 4.1.0, 4.0.0
Description
Run the following code against a running local Spark Connect server:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, ArrayType, StringType, StructType, IntegerType
import faulthandler

faulthandler.enable()

spark = SparkSession.builder \
    .remote("sc://localhost:15002") \
    .getOrCreate()

sp_df = spark.createDataFrame(
    data=[],
    schema=StructType(
        [
            StructField(
                name='b_int',
                dataType=IntegerType(),
                nullable=False,
            ),
            StructField(
                name='b',
                dataType=ArrayType(ArrayType(StringType(), True), True),
                nullable=True,
            ),
        ]
    )
)

print(sp_df)
print('Spark dataframe generated.')
print(sp_df.toPandas())
print('Pandas dataframe generated.')
When `sp_df.toPandas()` is called, a segmentation fault may occur. The segfault is non-deterministic and does not occur on every run.
Segfault:
Thread 0x00000001f1904f40 (most recent call first):
  File "/Users/venkata.gudesa/spark/test_env/lib/python3.13/site-packages/pyarrow/pandas_compat.py", line 808 in table_to_dataframe
  File "/Users/venkata.gudesa/spark/test_env/lib/python3.13/site-packages/pyspark/sql/connect/client/core.py", line 949 in to_pandas
  File "/Users/venkata.gudesa/spark/test_env/lib/python3.13/site-packages/pyspark/sql/connect/dataframe.py", line 1857 in toPandas
  File "<python-input-3>", line 1 in <module>
  File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/code.py", line 92 in runcode
  File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/console.py", line 205 in runsource
  File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/code.py", line 313 in push
  File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/simple_interact.py", line 160 in run_multiline_interactive_console
  File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/main.py", line 59 in interactive_console
  File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/__main__.py", line 6 in <module>
  File "<frozen runpy>", line 88 in _run_code
Observations:
- When I added some sample data, the issue went away and the conversion succeeded.
- When I changed `ArrayType(ArrayType(StringType(), True), True)` to `ArrayType(StringType(), True)`, there was no segfault and execution succeeded regardless of data.
- When I converted the nested array column into a JSON string column using `to_json` (and dropped the original nested array column), there was again no segfault, and execution succeeded regardless of data.
Conclusion: There appears to be an issue in the pyarrow/pandas conversion path that is triggered when converting empty datasets containing nested array columns.