Data Analysis or Data Science with Python

Advanced Pandas documentation:

# Set Index

View a dataframe using a different index. This will not change the data frame.
df.set_index("column")

Modify and change the data frame to use a new column as the index.
df.set_index("column", inplace=True)

Reset the index by renumbering the rows.
df.reset_index(inplace=True)


# Creating Columns

Create a new column with new values.
df["new_column"] = [1, 2, 3, 4, 5, 6]  

Create a new column using a function.
df["new_column"] = function(df["column1"], df["column2"])   


# Importing Data

Import data from a CSV file.
df = pd.read_csv (r"data.csv")

Remove max columns limitation and show all columns. 
pd.set_option("display.max_columns", None)   


# Data Cleaning # Dropping Data

Drop unnecessary columns
df = df.drop(["column1", "column2"], axis=1)   

Determine missing values in each column.
df.isnull().sum()    

Drop missing values.
Drop rows that contain missing values
df.dropna()

Drop columns that contain missing values
df.dropna(axis=1)

Fill in missing values.
Fill in with a specific value
df.fillna(0, inplace=True)

Fill in with the number in the row behind it.
df.fillna(method='bfill')

Fill in with the number in the column before it.
df.fillna(method='ffill', axis=1)

Determine the number of duplicate rows.
df.duplicated().sum()   

Find the duplicated row(s).
df.loc[df.duplicated()]  

Drop duplicate rows.
df.drop_duplicates(inplace=True)  

Change the data type of a column.
Change the data type to a specific data type
df.column.astype(data_type)

Change the data type to a float
pd.to_numeric(df.column)


# Grouping/Sorting

Grouping

Groups and returns the count 
df.groupby("value_to_group_by").column.count()

Groups and returns the maximum value in two columns
df.groupby("value_to_group_by")[["column1", "column2"]].max()

Groups and returns the min, max and sum of the values in a column 
df.groupby("value_to_group_by").column.agg([min, max, sum])

Groups and returns the sorted list of values in a column 
df.groupby("value_to_group_by").column.agg([sorted]

Sorting

Sort values (increasing/ascending)
df.sort_values(by="sorting_value")

Sorts one column of values (decreasing/descending) and then by another (increasing/ascending)
df.sort_values(by=["sort1", "sort2"], ascending=[False, True])


# Combining Datasets

To concatenating or merge a dataset, make sure that column names match between the different datasets.

# Concatenating two datasets:
# add second data set on as new rows
# use the reset_index function to renumber the rows
combined_df = pd.concat([df1, df2]).reset_index()

# Merging/Joining two datasets:
# Merge everything from both data sets
pd.merge(df1, df2, on="name", how="outer")

# Merge only values that exist in BOTH data sets
pd.merge(df1, df2, on="name", how="inner")

# Keep everything in the first data set and 
# merge in matching values from the second 
pd.merge(df1, df2, on="name", how="left")

# Keep everything in the second data set and 
# merge in matching values from the first
pd.merge(df1, df2, on="name", how="right")

# end