Data Analysis or Data Science with Python
Advanced Pandas documentation:
# Set Index View a dataframe using a different index. This will not change the data frame. df.set_index("column") Modify and change the data frame to use a new column as the index. df.set_index("column", inplace=True) Reset the index by renumbering the rows. df.reset_index(inplace=True) # Creating Columns Create a new column with new values. df["new_column"] = [1, 2, 3, 4, 5, 6] Create a new column using a function. df["new_column"] = function(df["column1"], df["column2"]) # Importing Data Import data from a CSV file. df = pd.read_csv (r"data.csv") Remove max columns limitation and show all columns. pd.set_option("display.max_columns", None) # Data Cleaning # Dropping Data Drop unnecessary columns df = df.drop(["column1", "column2"], axis=1) Determine missing values in each column. df.isnull().sum() Drop missing values. Drop rows that contain missing values df.dropna() Drop columns that contain missing values df.dropna(axis=1) Fill in missing values. Fill in with a specific value df.fillna(0, inplace=True) Fill in with the number in the row behind it. df.fillna(method='bfill') Fill in with the number in the column before it. df.fillna(method='ffill', axis=1) Determine the number of duplicate rows. df.duplicated().sum() Find the duplicated row(s). df.loc[df.duplicated()] Drop duplicate rows. df.drop_duplicates(inplace=True) Change the data type of a column. Change the data type to a specific data type df.column.astype(data_type) Change the data type to a float pd.to_numeric(df.column) # Grouping/Sorting Grouping Groups and returns the count df.groupby("value_to_group_by").column.count() Groups and returns the maximum value in two columns df.groupby("value_to_group_by")[["column1", "column2"]].max() Groups and returns the min, max and sum of the values in a column df.groupby("value_to_group_by").column.agg([min, max, sum]) Groups and returns the sorted list of values in a column df.groupby("value_to_group_by").column.agg([sorted] Sorting Sort values (increasing/ascending) df.sort_values(by="sorting_value") Sorts one column of values (decreasing/descending) and then by another (increasing/ascending) df.sort_values(by=["sort1", "sort2"], ascending=[False, True]) # Combining Datasets To concatenating or merge a dataset, make sure that column names match between the different datasets. # Concatenating two datasets: # add second data set on as new rows # use the reset_index function to renumber the rows combined_df = pd.concat([df1, df2]).reset_index() # Merging/Joining two datasets: # Merge everything from both data sets pd.merge(df1, df2, on="name", how="outer") # Merge only values that exist in BOTH data sets pd.merge(df1, df2, on="name", how="inner") # Keep everything in the first data set and # merge in matching values from the second pd.merge(df1, df2, on="name", how="left") # Keep everything in the second data set and # merge in matching values from the first pd.merge(df1, df2, on="name", how="right") # end