Tester Module Documentation

dataframes.py

Handling operations that help users improve their test cases.

This module brings together useful functions created to provide an easy way to fake Spark DataFrame objects. Its features can be imported and applied in any scenario that demands the creation of fake data rows, fake schemas or even fake Spark DataFrame objects (for example, a conftest file that defines fixtures for unit test cases).
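
As a quick illustration of that conftest use case, the sketch below shows one way a fixture could be wired up. The schema list, fixture names and local SparkSession setup are hypothetical and only meant to show the intended workflow (the import path mirrors the one used in the generate_dataframes_dict example further below):

# conftest.py (hypothetical sketch)
import pytest
from pyspark.sql import SparkSession
from sparksnake.tester import generate_fake_dataframe

# Hypothetical schema definition used to fake a source table
SCHEMA_INFO = [
    {"Name": "idx", "Type": "int", "nullable": True},
    {"Name": "order_id", "Type": "string", "nullable": True}
]

@pytest.fixture(scope="session")
def spark():
    # Local SparkSession used only during unit tests
    return SparkSession.builder.master("local[1]").getOrCreate()

@pytest.fixture
def df_fake(spark):
    # Spark DataFrame filled with fake rows based on the schema definition
    return generate_fake_dataframe(spark_session=spark, schema_info=SCHEMA_INFO)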


parse_string_to_spark_dtype(dtype)

Transform a string dtype reference into a valid Spark dtype.

This function takes the data type reference provided by users for a field in the JSON schema file and returns a valid Spark dtype based on that string reference.

Examples:

# Returning the Spark reference for a "string" data type
spark_dtype = parse_string_to_spark_dtype(dtype="string")
# spark_dtype now holds the StringType Spark dtype object
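
Note that for scalar references the function returns the dtype class itself, which still needs to be called to get an instance, while the array notation already yields a ready-to-use ArrayType instance:

# Scalar references return the dtype class, which must be instantiated
int_dtype = parse_string_to_spark_dtype(dtype="int")()
# int_dtype now holds IntegerType()

# Array references return an ArrayType instance with the parsed inner type
array_dtype = parse_string_to_spark_dtype(dtype="array<string>")
# array_dtype now holds ArrayType(StringType(), True)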

Parameters:

Name Type Description Default
dtype str

A string reference for any parseable Spark dtype

required

Returns:

Type Description

A callable Spark dtype object based on the string reference provided

Source code in sparksnake/tester/dataframes.py
def parse_string_to_spark_dtype(dtype: str):
    """Transform a string dtype reference into a valid Spark dtype.

    This function takes the data type reference provided by users for a field
    in the JSON schema file and returns a valid Spark dtype based on that
    string reference.

    Examples:
        ```python
        # Returning the Spark reference for a "string" data type
        spark_dtype = parse_string_to_spark_dtype(dtype="string")
        # spark_dtype now holds the StringType Spark dtype object
        ```

    Args:
        dtype (str): A string reference for any parseable Spark dtype

    Returns:
        A callable Spark dtype object based on the string reference provided
    """

    # Removing noise on string before validating
    dtype_prep = dtype.lower().strip()

    # Parsing string reference for dtype to spark data type
    if dtype_prep == "string":
        return StringType
    elif dtype_prep in ("int", "integer"):
        return IntegerType
    elif dtype_prep in ("bigint", "long"):
        return LongType
    elif dtype_prep == "decimal":
        return DecimalType
    elif dtype_prep == "float":
        return FloatType
    elif dtype_prep == "double":
        return DoubleType
    elif dtype_prep == "boolean":
        return BooleanType
    elif dtype_prep == "date":
        return DateType
    elif dtype_prep == "timestamp":
        return TimestampType
    elif dtype_prep[:5] == "array":
        # Checking if there is an inner array type
        if "<" not in dtype and ">" not in dtype:
            raise TypeError("Invalid entry for array type in schema "
                            f"(dtype={dtype}). When providing an array type "
                            "for a field in this definition schema, please "
                            "use the following approach: 'array<inner_type>' "
                            "where the tag 'inner_type' represents a valid "
                            "data type reference (such as 'string', 'int'). "
                            "It's also important not to forget to put this "
                            "inner data type between symbols < and >")
        else:
            # Extracting the array inner type and parsing it into a Spark dtype
            array_inner_dtype_str = dtype_prep.split("<")[-1].split(">")[0]
            array_inner_dtype = parse_string_to_spark_dtype(
                dtype=array_inner_dtype_str
            )

            # Returning the array data type with its inner type
            return ArrayType(array_inner_dtype())
    else:
        raise TypeError(f"Data type {dtype} is not valid or currently "
                        "parseable into a native Spark dtype")

generate_dataframe_schema(schema_info, attribute_name_key='Name', dtype_key='Type', nullable_key='nullable')

Generates a StructType Spark schema based on a list of fields info.

This function receives a preconfigured Python list extracted from a JSON schema definition file provided by the user and returns a valid Spark schema composed of a StructType structure with multiple StructField objects, each containing the name, data type and nullable information of an attribute.

Examples:

# Showing an example of an input schema list
schema_info = [
    {
        "Name": "idx",
        "Type": "int",
        "nullable": true
    },
    {
        "Name": "order_id",
        "Type": "string",
        "nullable": true
    }
]

# Returning a valid Spark schema object based on a dictionary
schema = generate_dataframe_schema(schema_info)
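
Array columns can also be declared using the 'array<inner_type>' notation accepted by parse_string_to_spark_dtype(); a short sketch:

# Declaring an array column in the schema definition list
schema_info_with_array = [
    {
        "Name": "tags",
        "Type": "array<string>",
        "nullable": True
    }
]

# The resulting schema holds an ArrayType(StringType()) field
schema_with_array = generate_dataframe_schema(schema_info_with_array)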

Parameters:

Name Type Description Default
schema_info list

A list with information about fields of a DataFrame

required
attribute_name_key str

A string identification of the attribute name defined on every attribute dictionary

'Name'
dtype_key str

A string identification of the attribute type defined on every attribute dictionary

'Type'
nullable_key str

The key in each attribute dictionary whose boolean value tells whether the given attribute can hold null values

'nullable'

Returns:

Type Description
StructType

A StructType object structured in such a way that makes it possible to create a Spark DataFrame with a predefined schema.
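
The returned StructType can be passed straight to the createDataFrame method; a minimal sketch, assuming an active SparkSession named spark:

# Creating an empty DataFrame with the generated schema
empty_df = spark.createDataFrame(data=[], schema=schema)
empty_df.printSchema()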

Source code in sparksnake/tester/dataframes.py
def generate_dataframe_schema(
    schema_info: list,
    attribute_name_key: str = "Name",
    dtype_key: str = "Type",
    nullable_key: str = "nullable"
) -> StructType:
    """Generates a StructType Spark schema based on a list of fields info.

    This function receives a preconfigured Python list extracted from a JSON
    schema definition file provided by the user and returns a valid Spark
    schema composed of a StructType structure with multiple StructField
    objects, each containing the name, data type and nullable information
    of an attribute.

    Examples:
        ```python
        # Showing an example of an input schema list
        schema_info = [
            {
                "Name": "idx",
                "Type": "int",
                "nullable": true
            },
            {
                "Name": "order_id",
                "Type": "string",
                "nullable": true
            }
        ]

        # Returning a valid Spark schema object based on a dictionary
        schema = generate_dataframe_schema(schema_info)
        ```

    Args:
        schema_info (list):
            A list with information about fields of a DataFrame

        attribute_name_key (str):
            A string identification of the attribute name defined on every
            attribute dictionary

        dtype_key (str):
            A string identification of the attribute type defined on every
            attribute dictionary

        nullable_key (str):
            The key in each attribute dictionary whose boolean value tells
            whether the given attribute can hold null values

    Returns:
        A StructType object structured in such a way that makes it possible to\
        create a Spark DataFrame with a predefined schema.
    """

    # Creating a list of Spark data types
    dtype_list = []
    for field_info in schema_info:
        # Removing noise for data type info
        dtype_prep = field_info[dtype_key].strip().lower()

        # Checking a special condition when dtype is an array
        if dtype_prep[:5] == "array":
            dtype = parse_string_to_spark_dtype(dtype=dtype_prep)
        else:
            # If it's not an array, we need to call the Spark type class
            dtype = parse_string_to_spark_dtype(dtype=dtype_prep)()

        # Appending the data type into a common list
        dtype_list.append(dtype)

    # Creating a list of attribute names
    field_names = [
        field_info[attribute_name_key] for field_info in schema_info
    ]

    # Creating a list of nullable information
    nullable_list = [
        field_info[nullable_key] if nullable_key in field_info else True
        for field_info in schema_info
    ]

    # Extracting the schema based on the preconfigured lists
    schema_zip_elements = zip(field_names, dtype_list, nullable_list)
    return StructType([
        StructField(field_name, dtype, nullable)
        for field_name, dtype, nullable in schema_zip_elements
    ])

generate_fake_data_from_schema(schema, n_rows=5)

Generates fake data based on a StructType Spark schema object.

This function receives a predefined DataFrame schema in order to return a list of tuples with fake data generated based on attribute types and the Faker library. The way the fake data is structured makes it easy to create Spark DataFrames to be used for test purposes.

Examples:

# Defining a list with attributes info to be used on schema creation
schema_info = [
    {
        "Name": "idx",
        "Type": "int",
        "nullable": true
    },
    {
        "Name": "order_id",
        "Type": "string",
        "nullable": true
    }
]

# Returning a valid Spark schema object based on a dictionary
schema = generate_dataframe_schema(schema_info)

# Generating fake data based on a Spark DataFrame schema
fake_data = generate_fake_data_from_schema(schema=schema, n_rows=10)

Parameters:

Name Type Description Default
schema StructType

a Spark DataFrame schema

required
n_rows int

the number of fake rows to be generated

5

Returns:

Type Description
list

A list of tuples where each tuple represents a row with fake data generated using the Faker library according to each data type of the given Spark DataFrame schema. For example, for a string attribute the fake data will be generated using the faker.word() method. For a date attribute, the fake data will be generated using the faker.date_this_year() method. And so it goes for all other dtypes.
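
Since the rows come back as plain Python tuples, they can be combined with the schema that generated them to build a test DataFrame; a minimal sketch, assuming an active SparkSession named spark:

# Building a test DataFrame from the fake rows and the schema
fake_df = spark.createDataFrame(data=fake_data, schema=schema)
fake_df.show(5)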

Source code in sparksnake/tester/dataframes.py
def generate_fake_data_from_schema(
    schema: StructType,
    n_rows: int = 5
) -> list:
    """Generates fake data based on a StructType Spark schema object.

    This function receives a predefined DataFrame schema in order to return
    a list of tuples with fake data generated based on attribute types and
    the Faker library. The way the fake data is structured makes it easy to
    create Spark DataFrames to be used for test purposes.

    Examples:
        ```python
        # Defining a list with attributes info to be used on schema creation
        schema_info = [
            {
                "Name": "idx",
                "Type": "int",
                "nullable": true
            },
            {
                "Name": "order_id",
                "Type": "string",
                "nullable": true
            }
        ]

        # Returning a valid Spark schema object based on a dictionary
        schema = generate_dataframe_schema(schema_info)

        # Generating fake data based on a Spark DataFrame schema
        fake_data = generate_fake_data_from_schema(schema=schema, n_rows=10)
        ```

    Args:
        schema (StructType): a Spark DataFrame schema
        n_rows (int): the number of fake rows to be generated

    Returns:
        A list of tuples where each tuple represents a row with fake data\
        generated using the Faker library according to each data type of\
        the given Spark DataFrame schema. For example, for a string attribute\
        the fake data will be generated using the `faker.word()` method. For a\
        date attribute, the fake data will be generated using the\
        `faker.date_this_year()`. And so it goes on for all other dtypes.
    """

    # Creating fake data based on each schema attribute
    fake_data_list = []
    for _ in range(n_rows):
        # Iterating over columns and faking data
        fake_row = []
        for field in schema:
            dtype = field.dataType.typeName()
            if dtype == "string":
                fake_row.append(faker.word())
            elif dtype in ("int", "integer"):
                fake_row.append(randrange(-10000, 10000))
            elif dtype in ("bigint", "long"):
                fake_row.append(randrange(-10000, 10000))
            elif dtype == "decimal":
                fake_row.append(Decimal(randrange(1, 100000)))
            elif dtype in ("float", "double"):
                fake_row.append(float(random() * randrange(1, 100000)))
            elif dtype == "boolean":
                fake_row.append(faker.boolean())
            elif dtype == "date":
                fake_row.append(faker.date_this_year())
            elif dtype == "timestamp":
                fake_row.append(faker.date_time_this_year())
            elif dtype == "array":
                # Extracting inner array data type
                inner_array_dtype = field.dataType.jsonValue()["elementType"]

                # Generating fake data according to array inner type
                if inner_array_dtype == "string":
                    array_fake_data = faker.word()
                elif inner_array_dtype in ("int", "integer", "bigint", "long"):
                    array_fake_data = randrange(-10000, 10000)

                # Transforming fake data into a list and appending to the row
                fake_row.append([array_fake_data])

        # Appending the row to the data list
        fake_data_list.append(fake_row)

    # Generating a list of tuples
    return [tuple(row) for row in fake_data_list]

generate_fake_dataframe(spark_session, schema_info, attribute_name_key='Name', dtype_key='Type', nullable_key='nullable', n_rows=5)

Creates a Spark DataFrame with fake data using Faker.

This function receives a list of dictionaries, each one populated with information about the desired attributes of a Spark DataFrame with fake data. This list of dictionaries (the schema_info function argument) is used to create a StructType Spark DataFrame schema object, and this object is then used to generate fake data with Faker based on the type of the attributes defined in the schema. Finally, with the schema object and the fake data, this function returns a Spark DataFrame that can be used for any purpose.

This function calls generate_dataframe_schema() and generate_fake_data_from_schema() in order to execute all the steps explained above.
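
In practice, the call is roughly equivalent to composing those two helpers by hand and handing the result to createDataFrame; a minimal sketch, assuming an active SparkSession named spark and a schema_info list like the one in the example below:

# Equivalent two-step composition of the helper functions
schema = generate_dataframe_schema(schema_info=schema_info)
fake_data = generate_fake_data_from_schema(schema=schema, n_rows=5)
fake_df = spark.createDataFrame(data=fake_data, schema=schema)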

Examples:

# Defining a list with attributes info to be used on schema creation
schema_info = [
    {
        "Name": "idx",
        "Type": "int",
        "nullable": true
    },
    {
        "Name": "order_id",
        "Type": "string",
        "nullable": true
    }
]

# Generating a Spark DataFrame object with fake data (spark is an active SparkSession)
fake_df = generate_fake_dataframe(spark_session=spark, schema_info=schema_info)
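
The returned object is an ordinary Spark DataFrame, so it can be inspected or handed to the transformation under test:

# Taking a quick look at the generated data
fake_df.printSchema()
fake_df.show(5)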

Parameters:

Name Type Description Default
spark_session SparkSession

A SparkSession object used to call the createDataFrame method

required
schema_info list

A list with information about fields of a DataFrame. Check the generate_dataframe_schema() for more details.

required
attribute_name_key str

A string identification of the attribute name defined on every attribute dictionary. Check the generate_dataframe_schema() for more details.

'Name'
dtype_key str

A string identification of the attribute type defined on every attribute dictionary. Check the generate_dataframe_schema() for more details.

'Type'
nullable_key str

The key in each attribute dictionary whose boolean value tells whether the given attribute can hold null values. Check the generate_dataframe_schema() for more details.

'nullable'
n_rows int

The number of fake rows to be generated. Check the generate_fake_data_from_schema() for more details.

5

Returns:

Type Description
DataFrame

A new Spark DataFrame with fake data generated by Faker providers and Python built-in libraries.

Source code in sparksnake/tester/dataframes.py
def generate_fake_dataframe(
    spark_session: SparkSession,
    schema_info: list,
    attribute_name_key: str = "Name",
    dtype_key: str = "Type",
    nullable_key: str = "nullable",
    n_rows: int = 5
) -> DataFrame:
    """Creates a Spark DataFrame with fake data using Faker.

    This function receives a list of dictionaries, each one populated with
    information about the desired attributes of a Spark DataFrame with fake
    data. This list of dictionaries (the schema_info function argument) is
    used to create a StructType Spark DataFrame schema object, and this object
    is then used to generate fake data with Faker based on the type of the
    attributes defined in the schema. Finally, with the schema object and the
    fake data, this function returns a Spark DataFrame that can be used for
    any purpose.

    This function calls generate_dataframe_schema() and
    generate_fake_data_from_schema() in order to execute all the steps
    explained above.

    Examples:
        ```python
        # Defining a list with attributes info to be used on schema creation
        schema_info = [
            {
                "Name": "idx",
                "Type": "int",
                "nullable": true
            },
            {
                "Name": "order_id",
                "Type": "string",
                "nullable": true
            }
        ]

        # Generating a Spark DataFrame object with fake data
        # (spark is an active SparkSession)
        fake_df = generate_fake_dataframe(
            spark_session=spark,
            schema_info=schema_info
        )
        ```

    Args:
        spark_session (SparkSession):
            A SparkSession object used to call the createDataFrame method

        schema_info (list):
            A list with information about fields of a DataFrame. Check the
            generate_dataframe_schema() for more details.

        attribute_name_key (str):
            A string identification of the attribute name defined on every
            attribute dictionary. Check the generate_dataframe_schema() for
            more details.

        dtype_key (str):
            A string identification of the attribute type defined on every
            attribute dictionary. Check the generate_dataframe_schema() for
            more details.

        nullable_key (str):
            The key in each attribute dictionary whose boolean value tells
            whether the given attribute can hold null values. Check the
            generate_dataframe_schema() for more details.

        n_rows (int):
            The number of fake rows to be generated. Check the
            generate_fake_data_from_schema() for more details.

    Returns:
        A new Spark DataFrame with fake data generated by Faker providers and\
        Python built-in libraries.
    """

    # Returning a valid Spark schema object based on a dictionary
    schema = generate_dataframe_schema(
        schema_info=schema_info,
        attribute_name_key=attribute_name_key,
        dtype_key=dtype_key,
        nullable_key=nullable_key
    )

    # Generating fake data based on a Spark DataFrame schema
    fake_data = generate_fake_data_from_schema(schema=schema, n_rows=n_rows)

    # Returning a fake Spark DataFrame
    return spark_session.createDataFrame(data=fake_data, schema=schema)

generate_dataframes_dict(definition_dict, spark_session)

Generates a Python dictionary with multiple Spark DataFrame objects.

This function uses a predefined Python dictionary with all the information needed to create Spark DataFrames, then checks all flags and conditions in order to deliver to users another Python dictionary made of Spark DataFrame objects created with the preconfigured info.

An example of a dictionary that can be used to simulate DataFrames can be found below:

Example of a dictionary used to create DataFrames:

SOURCE_DATAFRAMES_DEFINITION = {
    "tbl_name": {
        "name": "tbl_name",
        "dataframe_reference": "df_mocked",
        "empty": False,
        "fake_data": False,
        "fields": [
            {
                "Name": "idx",
                "Type": "int",
                "nullable": True
            },
            {
                "Name": "category",
                "Type": "string",
                "nullable": True
            }
        ],
        "data": [
            (1, "foo"),
            (2, "bar")
        ]
    }
}

In this approach, the dictionary is used to simulate and configure all elements of all datasets/tables to be created and returned as Spark DataFrame objects. In other words, users will be able to configure a Python dictionary with some predefined keys in order to generate DataFrame objects with a user defined schema that can simulate all tables that are part of the ETL process.

The aforementioned dictionary accepts the following keys:

  • "name": a name reference for the data structure to be simulated
  • "dataframe_reference": a name reference for the DataFrame
  • "empty": a boolean flag that indicates the creation of an empty df
  • "fake_data": a boolean flag to set fake data for the DataFrame
  • "fields": sets the schema of the data structure (check the example above)
  • "data": sets the data of the data structure (check the example above)

So, the generate_dataframes_dict() function can be called as in the following example:

Examples:

# Importing function
from sparksnake.tester import generate_dataframes_dict

# Generating a dictionary with Spark DataFrames
dataframes_dict = generate_dataframes_dict(
    definition_dict=SOURCE_DATAFRAMES_DEFINITION,
    spark_session=spark
)

# Indexing the dictionary to get individual objects
df_mocked = dataframes_dict["df_mocked"]

Parameters:

Name Type Description Default
definition_dict dict

A Python dictionary built with a predefined layout that holds all the elements needed to create DataFrame objects that can simulate source data and intermediate steps, helping users improve their unit test construction. Check the docs above for more details.

required
spark_session SparkSession

A SparkSession object used to create Spark DataFrames.

required

Returns:

Type Description
dict

A Python dictionary made of Spark DataFrame objects created using the definition_dict dictionary.

Source code in sparksnake/tester/dataframes.py
def generate_dataframes_dict(
    definition_dict: dict,
    spark_session: SparkSession
) -> dict:
    """Generates a Python dictionary with multiple Spark DataFrame objects.

    This function uses a predefined Python dictionary with all the information
    needed to create Spark DataFrames, then checks all flags and conditions
    in order to deliver to users another Python dictionary made of Spark
    DataFrame objects created with the user preconfigured info.

    An example of a dictionary that can be used to simulate DataFrames can
    be found below:

    Example of a dictionary used to create DataFrames:
    ```python
    SOURCE_DATAFRAMES_DEFINITION = {
        "tbl_name": {
            "name": "tbl_name",
            "dataframe_reference": "df_mocked",
            "empty": False,
            "fake_data": False,
            "fields": [
                {
                    "Name": "idx",
                    "Type": "int",
                    "nullable": True
                },
                {
                    "Name": "category",
                    "Type": "string",
                    "nullable": True
                }
            ],
            "data": [
                (1, "foo"),
                (2, "bar")
            ]
        }
    }
    ```

    In this approach, the dictionary is used to simulate and configure all
    elements of all datasets/tables to be created and returned as Spark
    DataFrame objects. In other words, users will be able to configure a
    Python dictionary with some predefined keys in order to generate DataFrame
    objects with a user defined schema that can simulate all tables that are
    part of the ETL process.

    The aforementioned dictionary accepts the following keys:

    - "name": a name reference for the data structure to be simulated
    - "dataframe_reference": a name reference for the DataFrame
    - "empty": a boolean flag that indicates the creation of an empty df
    - "fake_data": a boolean flag to set fake data for the DataFrame
    - "fields": sets the schema of the data structure (check the example above)
    - "data": sets the data of the data structure (check the example above)

    So, the generate_dataframes_dict() function can be called as in the
    following example:

    Examples:
    ```python
    # Importing function
    from sparksnake.tester import generate_dataframes_dict

    # Generating a dictionary with Spark DataFrames
    dataframes_dict = generate_dataframes_dict(
        definition_dict=SOURCE_DATAFRAMES_DEFINITION,
        spark_session=spark
    )

    # Indexing the dictionary to get individual objects
    df_mocked = dataframes_dict["df_mocked"]
    ```

    Args:
        definition_dict (dict):
            A Python dictionary built with a predefined layout that holds all
            the elements needed to create DataFrame objects that can simulate
            source data and intermediate steps, helping users improve their
            unit test construction. Check the docs above for more details.

        spark_session (SparkSession):
            A SparkSession object used to create Spark DataFrames.

    Returns:
        A Python dictionary made of Spark DataFrame objects created using the\
        definition_dict dictionary.
    """

    # Defining a dictionary to hold all DataFrame objects
    dfs_dict = {}

    # Iterating over all source dataframe definition
    for df_key in definition_dict:
        # Getting the dictionary definition for the given DataFrame of the loop
        df_info = definition_dict[df_key]

        # Collecting the schema definition from the dictionary
        schema_info = df_info["fields"]

        # Checking if the DataFrame will be created with fake data
        if bool(df_info["fake_data"]):
            df = generate_fake_dataframe(
                spark_session=spark_session,
                schema_info=schema_info
            )

        # Checking if the DataFrame will be empty
        elif bool(df_info["empty"]):
            # Generating a schema object and setting the empty data list
            schema = generate_dataframe_schema(schema_info=schema_info)
            data = []

            # Creating the DataFrame object
            df = spark_session.createDataFrame(data=data, schema=schema)

        # Checking if the DataFrame rows were provided
        else:
            # Generating a schema object and getting the data provided
            schema = generate_dataframe_schema(schema_info=schema_info)
            data = df_info["data"]

            # Creating the DataFrame object
            df = spark_session.createDataFrame(data=data, schema=schema)

        # Adding the DataFrame object into the dictionary
        dfs_dict[df_info["dataframe_reference"]] = df

    return dfs_dict

compare_schemas(df1, df2, compare_nullable_info=False)

Compares the schema from two Spark DataFrames with custom options.

This function helps users compare the schemas of two Spark DataFrames based on custom conditions provided to guide the comparison.

The schema of a Spark DataFrame is made of three main elements: the column name, the column type and a boolean flag telling whether the field accepts null values. In some cases, this third element can cause errors when comparing two DataFrame schemas. Imagine that a Spark DataFrame is created from a transformation function and there is no way to configure whether a field accepts null values (think of an aggregation step that may or may not create null values for some rows). So, when comparing the schemas of two DataFrames, we may be interested only in column names and data types, and not in whether an attribute is nullable.

This function enables users to compare their Spark DataFrame schemas using two different approaches:

  1. Comparing the DataFrame.schema object attribute and returning True if the two DataFrames have the same column names and all column data types match (this happens when compare_nullable_info is False)
  2. Comparing the DataFrame.schema object attribute and returning True if all the column names and their data types are the same, including the nullable information (this happens when compare_nullable_info is True)

Examples:

compare_schemas(df1, df2, compare_nullable_info=False)
# Result is True or False
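
In a unit test, the function typically backs an assertion on the schema produced by a transformation; a minimal sketch with hypothetical DataFrames, assuming an active SparkSession named spark and a some_transformation() function under test:

# Hypothetical expected DataFrame and DataFrame produced by the code under test
df_expected = spark.createDataFrame([(1, "foo")], schema="idx int, category string")
df_produced = some_transformation(df_source)

# Asserting that column names and data types match, ignoring nullable info
assert compare_schemas(df_expected, df_produced, compare_nullable_info=False)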

Parameters:

Name Type Description Default
df1 pyspark.sql.DataFrame

The first Spark DataFrame to be compared

required
df2 pyspark.sql.DataFrame

The second Spark DataFrame to be compared

required
compare_nullable_info bool

A boolean flag that enables comparing not only the column names and their data types, but also whether the columns accept null values.

False

Returns:

Type Description
bool

The function returns True if both DataFrame schemas are equal, or False otherwise.

Source code in sparksnake/tester/dataframes.py
def compare_schemas(
    df1: DataFrame,
    df2: DataFrame,
    compare_nullable_info: bool = False
) -> bool:
    """Compares the schema from two Spark DataFrames with custom options.

    This function helps users compare the schemas of two Spark DataFrames
    based on custom conditions provided to guide the comparison.

    The schema of a Spark DataFrame is made of three main elements: the
    column name, the column type and a boolean flag telling whether the field
    accepts null values. In some cases, this third element can cause errors
    when comparing two DataFrame schemas. Imagine that a Spark DataFrame is
    created from a transformation function and there is no way to configure
    whether a field accepts null values (think of an aggregation step that
    may or may not create null values for some rows). So, when comparing the
    schemas of two DataFrames, we may be interested only in column names and
    data types, and not in whether an attribute is nullable.

    This function enables users to compare their Spark DataFrame schemas using
    two different approaches:

    1. Comparing the DataFrame.schema object attribute and returning True if
    the two DataFrames have the same column names and all column data types
    match (this happens when `compare_nullable_info` is False)
    2. Comparing the DataFrame.schema object attribute and returning True if
    all the column names and their data types are the same, including the
    nullable information (this happens when `compare_nullable_info` is True)

    Examples:
        ```python
        compare_schemas(df1, df2, compare_nullable_info=False)
        # Result is True or False
        ```

    Args:
        df1 (pyspark.sql.DataFrame): The first Spark DataFrame to be compared
        df2 (pyspark.sql.DataFrame): The second Spark DataFrame to be compared
        compare_nullable_info (bool):
            A boolean flag that enables comparing not only the column names
            and their data types, but also whether the columns accept null
            values.

    Returns:
        The function returns True if both DataFrame schemas are equal or\
        False otherwise.
    """

    # Extracting infos to be compared based on user conditions
    if not compare_nullable_info:
        df1_schema = [[col.name, col.dataType] for col in df1.schema]
        df2_schema = [[col.name, col.dataType] for col in df2.schema]
    else:
        df1_schema = df1.schema
        df2_schema = df2.schema

    # Checking if schemas are equal
    return df1_schema == df2_schema