Data Analysis or Data Science with Python

Advanced Pandas documentation:

# Set Index

View a dataframe using a different index. This will not change the data frame.

Modify and change the data frame to use a new column as the index.
df.set_index("column", inplace=True)

Reset the index by renumbering the rows.

# Creating Columns

Create a new column with new values.
df["new_column"] = [1, 2, 3, 4, 5, 6]  

Create a new column using a function.
df["new_column"] = function(df["column1"], df["column2"])   

# Importing Data

Import data from a CSV file.
df = pd.read_csv (r"data.csv")

Remove max columns limitation and show all columns. 
pd.set_option("display.max_columns", None)   

# Data Cleaning # Dropping Data

Drop unnecessary columns
df = df.drop(["column1", "column2"], axis=1)   

Determine missing values in each column.

Drop missing values.
Drop rows that contain missing values

Drop columns that contain missing values

Fill in missing values.
Fill in with a specific value
df.fillna(0, inplace=True)

Fill in with the number in the row behind it.

Fill in with the number in the column before it.
df.fillna(method='ffill', axis=1)

Determine the number of duplicate rows.

Find the duplicated row(s).

Drop duplicate rows.

Change the data type of a column.
Change the data type to a specific data type

Change the data type to a float

# Grouping/Sorting


Groups and returns the count 

Groups and returns the maximum value in two columns
df.groupby("value_to_group_by")[["column1", "column2"]].max()

Groups and returns the min, max and sum of the values in a column 
df.groupby("value_to_group_by").column.agg([min, max, sum])

Groups and returns the sorted list of values in a column 


Sort values (increasing/ascending)

Sorts one column of values (decreasing/descending) and then by another (increasing/ascending)
df.sort_values(by=["sort1", "sort2"], ascending=[False, True])

# Combining Datasets

To concatenating or merge a dataset, make sure that column names match between the different datasets.

# Concatenating two datasets:
# add second data set on as new rows
# use the reset_index function to renumber the rows
combined_df = pd.concat([df1, df2]).reset_index()

# Merging/Joining two datasets:
# Merge everything from both data sets
pd.merge(df1, df2, on="name", how="outer")

# Merge only values that exist in BOTH data sets
pd.merge(df1, df2, on="name", how="inner")

# Keep everything in the first data set and 
# merge in matching values from the second 
pd.merge(df1, df2, on="name", how="left")

# Keep everything in the second data set and 
# merge in matching values from the first
pd.merge(df1, df2, on="name", how="right")

# end