I now have been handed a bunch of legacy excel files.
The author had the good idea to put two header rows, but both have column names i want.
So I do:
xls = panda.read_excel(exel,header = 1,sheet_name = 'Sheet1')
But: some headers take up the 2 rows (0 and 1), others have something in row zero, and something in row 1 , like this:
..... | Family Name and firstnames..... |
NR | Name | Firstname | second name | etc...
When I specify header = 0, I can't get the second row names , when header = 1 the "NR" field is not recognised.
This happens 4 times. in the header rows. All excels are like this.
Remedy ?
Or skip the headers?
It sounds like the headers in your Excel files have some inconsistencies in their formatting, which is causing issues when trying to read them into Pandas. One possible solution is to use the
skiprows
parameter in
pandas.read_excel()
to skip the first row of headers, and then manually add the second row of headers using the
columns
parameter.
Here's an example:
import pandas as pd
# Read in the Excel file, skipping the first row of headers
df = pd.read_excel(excel_file, sheet_name='Sheet1', skiprows=[0])
# Manually specify the second row of headers using the 'columns' parameter
df.columns = ['NR', 'Name', 'Firstname', 'Second Name', 'etc...']
# Do whatever data cleaning or analysis you need to do on the resulting DataFrame
This should allow you to read in the Excel files while handling the inconsistencies in the header formatting. If you encounter additional issues with the headers, you may need to adjust the
skiprows
and
columns
parameters accordingly.
Alternatively, if you don't need the header information for your analysis, you could simply skip the headers altogether by setting
header=None
in
pandas.read_excel()
. This will result in Pandas automatically generating column names for you (e.g. "Unnamed: 0", "Unnamed: 1", etc.), which you can then rename as needed using
df.rename()
.
Is your question that you have multi-indexed columns and you don't know how to use them? To index multi-indexed columns, provide a value for each index. Thes example has two index values per column.
import pandas as pd
import numpy as np
col = pd.MultiIndex.from_arrays(
[["one", "one", "one", "two", "two", "two"], ["a", "b", "c", "a", "b", "c"]]
data = np.array(range(2 * 3 * 4)).reshape((4, 2 * 3))
df = pd.DataFrame(data, columns=col)
print(df)
print(df[("two", "a")])
Output:
one two
a b c a b c
0 0 1 2 3 4 5
1 6 7 8 9 10 11
2 12 13 14 15 16 17
3 18 19 20 21 22 23
0 3
1 9
2 15
3 21
Name: (two, a), dtype: int32
To verify this works when reading from an excel file, I wrote the dataframe to excel and read the file using this progrram.
import pandas as pd
df = pd.read_excel("test.xlsx", header=(0, 1))
print(df)
print(df[("two", "a")])
Output:
one two
a b c a b c
0 0 1 2 3 4 5
1 6 7 8 9 10 11
2 12 13 14 15 16 17
3 18 19 20 21 22 23
0 3
1 9
2 15
3 21
Name: (two, a), dtype: int64
You could also collapse the columns to be single-indexed.
def collapse_header(columns):
"""Collapse multi-index column indices to a list of strings."""
return [" ".join(column).strip() for column in columns]
df = pd.read_excel("test.xlsx", header=(0, 1))
print(df)
df.columns = collapse_header(df.columns)
print(df)
print(df["two a"])
Output:
one two
a b c a b c
0 0 1 2 3 4 5
1 6 7 8 9 10 11
2 12 13 14 15 16 17
3 18 19 20 21 22 23
one a one b one c two a two b two c
0 0 1 2 3 4 5
1 6 7 8 9 10 11
2 12 13 14 15 16 17
3 18 19 20 21 22 23
0 3
1 9
2 15
3 21
Name: two a, dtype: int64
This is an old question, the problem was:
- The original author created "good looking" column headers: 2 rows with nice colors.
- But columns need only 1 header name:
So: sometimes it was in row 0, sometimes in row 1.
Sometimes in both rows, when the top row title "groups" a few columns from row 1. (Like "name", groups "first name, family name" below)
I had speculated you could explain to pandas which header name to pick alternatively from row 0 or 1, but that does not work.
As suggested manual correction is always possible, but not for a few hundred excels.
I skipped the headers, (re)named them programmatically, and did not look back.