Introduction
Creating dummy variables is a common transformation applied to categorical data when preparing it for analysis in Python.
In this article, I'll quickly describe how easy this is with Python and the pandas library.
Pandas
pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
I can't describe the library better than the project's own website does. I will add, though, that pandas is also, hands down, one of the most popular data libraries.
Pandas relies on another popular library, NumPy. If you're new to all of this, I recommend checking out these libraries or watching some YouTube tutorials (there are plenty of them) to familiarize yourself.
Dummy Variables
Dummy variables can numerically represent text data. Creating dummy variables is part of a one-hot data transformation.
A one-hot transformation takes a categorical column and turns it into a cross-tabulation: one row per item (keyed by the item's index), one column per category, and a 1 in the column that matches the item's category. I'll provide an example.
My blog has a series of articles, which all have one parent category.
The categories are tech, nature, and progress.
First, I convert the column into the category data type.
df.categories = pd.Categorical(df.categories)
Then, I can create a new data frame by passing the column to the pd.get_dummies() function. pd is the alias I gave the pandas library when I imported it.
dummies = pd.get_dummies(df.categories)
When I look at the new dummies data frame, I see the cross tab of articles and categories, denoted with a 1 for hot and a 0 for cold.
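To make the two steps above concrete, here is a minimal, self-contained sketch. The sample articles and the title column are hypothetical stand-ins for my real data. I pass dtype=int as a precaution: recent pandas versions return True/False from pd.get_dummies() by default, so the 1/0 display described above may need that argument.

```python
import pandas as pd

# Hypothetical sample data standing in for the blog's articles.
df = pd.DataFrame(
    {
        "title": ["Post A", "Post B", "Post C", "Post D"],
        "categories": ["tech", "nature", "progress", "tech"],
    }
)

# Convert the text column to the category data type.
df["categories"] = pd.Categorical(df["categories"])

# Build one column per category and one row per article.
# dtype=int asks for 1/0 values instead of True/False.
dummies = pd.get_dummies(df["categories"], dtype=int)

print(dummies)
```

Each row of the result should contain exactly one 1, in the column that matches that article's category.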
This is great for when I only have one text value in each cell. However, recently, I had an array of values in a single cell. Each article has tags, and each tag is technically its own category. I couldn't imagine how this would work.
Thankfully, it only took a few internet searches to find a quick answer to handle this scenario.
Handling Multiple Values In One Cell
Some cells had lists of string values or were null.
I thought the null values would throw an error, but this simple line calmed all my concerns.
df.tags.str.join('|').str.get_dummies()
The above line:
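- Accesses each cell's list of values through the .str accessor
- Joins the items in each list into a single string with a pipe (|) separator
- Accesses the string value of the newly joined cell
- Creates a dummy variable data frame by splitting each string on the pipe, which is the default separator for str.get_dummies()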
Once again, I have my data represented with a one-hot transformation.
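The line above can be tried end to end with a small sketch like the one below. The data is hypothetical (three made-up articles, one of them with a null tags cell), and I split the chain into two statements only to show the intermediate result.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: each cell holds a list of tags or is null.
df = pd.DataFrame(
    {
        "title": ["Post A", "Post B", "Post C"],
        "tags": [["python", "pandas"], np.nan, ["pandas"]],
    }
)

# Join each list into one pipe-delimited string; null cells stay null.
joined = df["tags"].str.join("|")

# Split on the pipe (the default separator) and build one column per tag.
tag_dummies = joined.str.get_dummies()

print(tag_dummies)
```

The article with the null cell ends up as a row of zeros rather than raising an error. One caveat with this approach: a tag that itself contains a pipe character would be split into two columns, so a different separator would be needed in that case.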
Conclusion
The more I work with data analysis, the more I realize how important it is to understand how to transform the raw data.
Converting categorical values to a numerical representation is one of those key transformations.