Merge multiple subject Excel grade files using Python + Pandas

Jun 23, 2022

Suppose you have several Excel sheets with grades for different subjects. Each file contains the information of the same classmates, but each Excel stores the grades of different subjects. Now we want to integrate the scores of each subject of the same class, we use Python's Pandas suite with the DataFrame Merge function to achieve it.

https://odysee.com/@MakClass:1/merge-excel-files-with-python-and-pandas:6?r=5tVg6uo6okaG49AYonwXvMrf5rrC1ooK

Development environment: We will use Anaconda with Jupyter Notebook, if you use other tools, such as Spyder, VSCode, PyCharm, etc., it is also possible.

And the Anaconda I used in the demonstration has Pandas installed. In addition, Jupyter Notebook can also provide a more readable visualization environment when dealing with tabular data, so I will use Anaconda with Jupyter Notebook to explain.

First, I prepared five Excel files. In each Excel file, there are graded grades and total grades of subjects.

For example, the following is the data of Chinese .xlsx, and the rest of the subjects have the same structure.

Our goal is to integrate the total scores of each subject in these Excel files into one file. as follows:

For the sake of convenience, we prepare a files list and record the names of each file, which will be read in the program for convenience.

 # A list of .xlsx files to merge
files = [
    "Chinese", 
    "English", 
    "math",
    "physics",
    "Chemical",
]

Note: If you want to read all Excel files more dynamically in the future, consider using glob .

load first file

With pd.read_excel, Excel files can be imported into DataFrames for further processing. The Excel of each subject has its own "Total Score" column, and the total score column of each subject can be renamed to each subject name through rename .

In the end, we only need the two fields of student number and subject total grade. So use df[["學生編號", files[0]]] to get it.

 df = pd.read_excel(files[0] + ".xlsx")
 
# Rename the total grade to the subject name, otherwise all subjects will be called "total grade"
df = df.rename(columns={
    "Total Score": files[0]
})
 
# We only take the student number and the original total score.
df = df[["student number", files[0]]]
df

load second file

Then we load the second file and do similar processing.

 # Load second file df2 = pd.read_excel(files[1] + ".xlsx")
df2 = df2.rename(columns={
    "Total Score": files[1]
})
df2 = df2[["student number", files[1]]]
df2

The code for each file will be similar in the future, so we will eventually use a for-loop to achieve this. But before that, we manually process these two files before copying to all files after success.

Integrate two files

We use the merge function of the DataFrame to integrate, which uses on to specify the field on which the two DataFrames are based. Since the amount of data on both sides is the same, they are records of the same class of students, so there is no need to use how to specify left/right/innter/outer.

 # Use DataFrame Merge to integrate df and df2 two sets of data df_result = df.merge(df2, on="student number")

After merging, we arrive at the following data.

Merge all Excel files

With the above experience of merging two files, we turn df2 of df and df2 into reading each file in turn and merging.

 for topic in files[1:]:
    df2 = pd.read_excel(topic + ".xlsx")
    df2 = df2.rename(columns={
        "Total Score": topic
    })
 
    df2 = df2[["student number", topic]]
    df = df.merge(df2, on="student number")
    
df = df.round()
df["student number"] = df["student number"].astype(int)
df = df.set_index("student number")
df

resulting in the following dataset.

final code

Putting all the code together gives the final code:

 import numpy as np
import pandas as pd

files = [ "Chinese", "English", "Math", "Physics", "Chemistry" ]

# The first file
topic = files[0]
df = pd.read_excel(topic + ".xlsx")

df = df.rename(columns={ "Total score": topic })

df = df[["student number", topic]]

# All other files

for topic in files[1:]:
    df2 = pd.read_excel(topic + ".xlsx")
    df2 = df2.rename(columns={
        "Total Score": topic
    })

    df2 = df2[["student number", topic]]
    df = df.merge(df2, on="student number")
    
df = df.round()
df["student number"] = df["student number"].astype(int)
df = df.set_index("student number")
df.to_excel("result.xlsx")

Finally, put on YouTube's demonstration demonstration: