A comparative approach using Excel and Python
Excel is the most commonly used data analysis software in the world. Why? It’s easy to get the hang of and powerful once you learn it. In contrast, Python’s reputation is that it’s more difficult to use, though what you can do with it is once you’ve mastered it is almost unlimited.
But how do these two tools actually compare? Their reputations don’t really reflect reality. In this article, we’ll take a look at some common data analysis tasks to demonstrate how accessible Python analysis can be.
Why use Python vs Excel?
Before we start, you might be wondering why Python is even worth considering. Why couldn’t you just keep using Excel?
Even though Excel is great, there are some areas that make a programming language like Python better for certain types of data analysis. Here are some reasons :
You can work with almost any kind of data.
Automated and repetitive tasks are easier.
Working with large data sets is much faster and easier.
It’s easier for others to reproduce and audit your work.
Finding and fixing errors is easier.
Advanced statistics, machine learning, and data visualization capabilities.
Cross-platform stability — your analysis can be run on any computer.
To be clear, we don’t advocate leaving Excel behind — it’s a powerful tool with many uses! But as an Excel user, being able to also leverage the power of Python can save you hours of time and open up career advancement opportunities.
It’s worth remembering the two tools can work well in tandem, and you may find that some tasks are best left in Excel, while others would benefit from the power, flexibility, and transparency that’s offered by Python.
Importing Our Data
Let’s start by familiarizing ourselves with the data we’ll be working with. We’ll use fictional data about a company with salespeople. Here’s what our data looks like in Excel:
Sales.csv — By Author
Our data is saved as a CSV file called sales.csv. In order to import our data in pandas, we need to start by importing the pandas library itself. To read our file we use pd.read_csv():
import pandas as pd
sales = pd.read_csv('Sales.csv')
Image By Author
We assigned the result of pd.read_csv() to a variable called sales, which we’ll use to refer to our data. We also put the variable name on its own in the last line of our code, which prints the data in a nicely formatted table.
Immediately, we can notice a few differences between how pandas represent the data vs what we saw in Excel:
In pandas, the row numbers start at 0 versus 1 in Excel.
The column names in pandas are taken from the data, versus Excel where columns are labeled using letters.
Where there is a missing value in the original data, pandas have the placeholder NaN which indicates that the value is missing, or null.
The sales data has a decimal point added to each value because pandas stores numeric values that include null (NaN) values as a numeric type known as a float (this doesn’t affect anything for us, but we just wanted to explain why this is).
Let’s use the type() function to look at the type of our sales variable:
type(sales)
pandas.core.frame.DataFrame
This output tells us that our sales a variable is a DataFrame object, which is a specific type of object in pandas. Most of the time in pandas when we want to modify a data frame, we’ll use a special syntax called a data frame method, which allows us to access specific functionality that relates to the data frame objects.
Sorting Data
Let’s learn how to sort our data in Excel and Python. Currently, our data isn’t sorted. In Excel, if we wanted to sort our data by the "Start Date" column, we would:
Select our data.
Click the ‘Sort’ button on the toolbar.
Select ‘Start Date’ in the dialog box that opens.
Sorting Data Using Excel — By Author
In pandas, we use the DataFrame.sort_values() method. We mentioned methods briefly a moment ago. In order to use them, we have to replace DataFrame with the name of the data frame we want to apply the method to — in this case sales. If you’ve worked with lists in Python, you will be familiar with this pattern from the list.append() method.
We provide the column name to the method to tell it which column to sort by:
sales = sales.sort_values("Start Date")
sales
Data Sorting using Python — By Author
The values have been sorted with one simple line of pandas code!
Summing the Sales Values
The last three columns of our data contain sales values for the first three months of the year, known as the first quarter. Our next task will be to sum those values in both Excel and Python.
Let’s start by looking at how we achieve this in Excel:
Enter a new column name "Sales Q1" in cell H1.
In cell H2, use the SUM() formula and specify the range of cells using their coordinates.
Drag the formula down to all rows.
Summing the Data using Excel — By Author
In pandas, when we perform an operation it automatically applies it to every row at once. We’ll start by selecting the three-column by using their names in a list:
q1_columns = sales[["Sales January", "Sales February", "Sales March"]]
q1_columns
Next, we’ll use the DataFrame.sum() method and specify axis=1, which tells pandas that we want to sum the rows and not the columns. We’ll specify the new column name by providing it inside brackets:
sales["Sales Q1"] = q1_columns.sum(axis=1)
sales
Summing the Data using Python — By Author
In pandas, the “formula” we used isn’t stored. Instead, the resulting values are added directly to our data frame. If we wanted to make an adjustment to the values in our new column, we’d need to write new code to do it.
Adding a Conditional Column
f we look at the "End Date" column, we can see that not all employees are still with the company — those with a missing value are still employed but the rest have left. Our next task will be to create a column that tells us if each salesperson is a current employee. We’ll perform this in Excel and Python.
Starting with Excel, to add this column we:
Add a new column name to the cell J1.
Use the IF() formula to check if cell D1 (End Date) is empty, and if so fill J2 with TRUE, otherwise FALSE.
Drag the formula down to the cells below.
Adding a conditional column using Excel — By Author
In pandas, we use the pandas.isnull() function to check for null values in the "End Date" column, and assign the result to a new column:
sales["Current Employee"] = pd.isnull(sales['End Date'])
sales
Adding a conditional column using Python — By Author
Pivot Tables
One of the most powerful Excel features is pivot tables, which facilitate data analysis using aggregation. We’re going to look at two different pivot table applications in Excel and Python.
We’ll start with a pivot table in Excel that counts the number of employees in each department:
Value counts using Excel — By Author
This operation — counting the number of times a value occurs in a column — is so common that in pandas it has its own syntax: Series.value_counts().
sales['Department'].value_counts()
Value counts using Python — By Author
The second pivot table example also aggregates by the department, but instead calculates the average Q1 sales:
Aggregates using Excel — By Author
In order to calculate this in pandas, we’ll use the DataFrame.pivot_table() method. We need to specify some arguments:
index: the column to aggregate by.
values: the column we want to use the values for.
aggfunc: the aggregation function we want to use, in this case, 'mean' average.
sales.pivot_table(index='Department', values='Sales Q1', aggfunc='mean')
Aggregates using Python— By Author
Summary :
In this article we learned Python equivalents for the following Excel functionality:
Sorting data
SUM()
IF()
Pivot Tables
For each example we looked at, the pandas syntax was of similar complexity to the formulas or menu options you would use in Excel. But Python offers some advantages, as much faster processing of large data sets, more customization and complexity, and more transparency for error-checking and auditing (since everything you’ve done is clearly laid out in code, rather than hidden in cells).
Someone proficient in Excel is more than capable of making the leap to working in Python. Adding Python skills to your skillset will make you a faster and more powerful analyst in the long run, and you’ll discover new workflows that take advantage of both Excel and Python for more efficient and powerful data analysis than you could do with Excel alone.
Source: Medium
The Tech Platform
ความคิดเห็น