Sometimes, when dealing with a Spark data frame it may be necessary to reorder columns in a certain order. For example, to keep data consistent when trying to union two or more data frames with the same schema but different order of columns. Let's explore how that can be done relatively easy using a select()
Spark SQL function.
Let's start from creating data frames for our example. We will create two data frames with the same list of columns but in a different order.
val ny = Seq(
("1 Main St","Buffalo","NY",14200),
("1 Church St","Lockport","NY",14090),
("1 Madison St","Rochester","NY",14600));
val pa = Seq(
("Lancaster","PA","1 Main Ave",17600),
("Philadelphia","PA","1 Church Blvd",19100),
("Harrisburg","PA","1 Lincoln St",17100));
val nyDf = ny.toDF("Street","City","State","Zip");
val paDf = pa.toDF("City","State","Street","Zip");
nyDf.printSchema()
nyDf.show()
paDf.printSchema()
paDf.show()
The code above when executed yields the following output:
root
|-- Street: string (nullable = true)
|-- City: string (nullable = true)
|-- State: string (nullable = true)
|-- Zip: integer (nullable = false)
+------------+---------+-----+-----+
| Street| City|State| Zip|
+------------+---------+-----+-----+
| 1 Main St| Buffalo| NY|14200|
| 1 Church St| Lockport| NY|14090|
|1 Madison St|Rochester| NY|14600|
+------------+---------+-----+-----+
root
|-- City: string (nullable = true)
|-- State: string (nullable = true)
|-- Street: string (nullable = true)
|-- Zip: integer (nullable = false)
+------------+-----+-------------+-----+
| City|State| Street| Zip|
+------------+-----+-------------+-----+
| Lancaster| PA| 1 Main Ave|17600|
|Philadelphia| PA|1 Church Blvd|19100|
| Harrisburg| PA| 1 Lincoln St|17100|
+------------+-----+-------------+-----+
As you can see, both data frames, nyDf
and paDf
, contain similar data with the same column names but in a slightly different order. Let's combine our two data frames into a single one using a union()
function. If the schemas of two data frames are different the union()
function should produce an error.
val allDf = nyDf.union(paDf);
allDf.printSchema()
allDf.show()
After the code is executed we see the output below. The union()
executed successfully without an error, meaning that it considers the schemas of two data frames to be the same even though the columns are in a different order.
root
|-- Street: string (nullable = true)
|-- City: string (nullable = true)
|-- State: string (nullable = true)
|-- Zip: integer (nullable = false)
+------------+---------+-------------+-----+
| Street| City| State| Zip|
+------------+---------+-------------+-----+
| 1 Main St| Buffalo| NY|14200|
| 1 Church St| Lockport| NY|14090|
|1 Madison St|Rochester| NY|14600|
| Lancaster| PA| 1 Main Ave|17600|
|Philadelphia| PA|1 Church Blvd|19100|
| Harrisburg| PA| 1 Lincoln St|17100|
+------------+---------+-------------+-----+
But we have a problem. You can see that the schema of the combined data frame was inferred from the first source data frame, but the second data frame was not re-ordered and the data is placed incorrectly. We need to fix this problem so in the combined data frame, all the data is located in the correct columns. Our solution is to re-order columns in the second data frame so they match the order of the first data frame before the union. For this, we'll modify a union code, to reorder the columns in the second data frame using a select()
function:
val allDf = nyDf.union(paDf.select("Street","City","State","Zip"));
allDf.printSchema()
allDf.show()
After the updated code runs we see the new output below. The data in the combined data frame now appears in the correct columns.
root
|-- Street: string (nullable = true)
|-- City: string (nullable = true)
|-- State: string (nullable = true)
|-- Zip: integer (nullable = false)
+-------------+------------+-----+-----+
| Street| City|State| Zip|
+-------------+------------+-----+-----+
| 1 Main St| Buffalo| NY|14200|
| 1 Church St| Lockport| NY|14090|
| 1 Madison St| Rochester| NY|14600|
| 1 Main Ave| Lancaster| PA|17600|
|1 Church Blvd|Philadelphia| PA|19100|
| 1 Lincoln St| Harrisburg| PA|17100|
+-------------+------------+-----+-----+
If we want a more generic version of reordering the columns without hard-coding the column names we could use the columns
property of the DataFrame
to get the list of the column names in the correct order:
val allDf = nyDf.union(paDf.select(nyDf.columns.toSeq.map(c => col(c)):_*));
allDf.printSchema()
allDf.show()
Conclusion
We have learned how to use a select()
function to reorder columns in a data frame and use that technique to combine two or more data frames with the same schemas but different order of columns.
No comments:
Post a Comment