[Learning Record | This Week Summary] Week 23 - Practice practice and practice, data cleaning

coletangsy

Jun 15, 2021

NumPy, Pandas, and now it is time for data cleaning!

The cover image is still from my favorite Visual Artist @evieshaffer

Summary of the week

Python pandas exercises (done)
Python numpy exercises (in progress)
Need to know more about Web scraping, regex

Content and Reflection

Python exercises

As planned last week, I finished the entire set of Pandas exercises before writing this article. I say "completed" with some reservation, because the exercises themselves are done but I have not yet summarized the key points. Without that summary, what I learned can easily be set aside and forgotten along with the finished exercises. So tomorrow I need to pause everything else and make this my top priority.

After completing the Pandas exercises, I found two sets of NumPy exercises on GitHub to work through: rougier/numpy-100 and Kyubyong/numpy_exercises. The first is, as its title says, 100 NumPy questions; the second is organized as exercises that walk you through the NumPy documentation.

Speaking of which, my learning order seems backwards. Pandas is built on top of NumPy, so the logical order would be to learn NumPy first and Pandas second. However, I started with Pandas and only then moved on to NumPy, so I am treating them as two things to learn together. So far, I have not felt that this order has held me back much.

Data Cleaning

This week, I also read about the steps of data cleaning in Python and the code that goes with them. Here I want to collect the methods I found useful while doing the exercises, so that they can serve as a checklist of steps for future data cleaning work.

# 1. Import libraries

import pandas as pd
import numpy as np
import re

# 2. Import data (csv as example)

df = pd.read_csv("")  # put the path to the CSV file between the quotes

# 3. Check dataset size

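df.shape         # (number of rows, number of columns)
df.head()        # first five rows
df.duplicated()  # boolean Series marking rows that repeat an earlier row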
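As a quick illustration of step 3, here is a minimal sketch on a made-up table (the column names and values are hypothetical, not from a real dataset). Since `df.duplicated()` returns a boolean Series, summing it gives the number of duplicate rows, which is usually easier to read than the Series itself.

```python
import pandas as pd

# Hypothetical toy data; the last row repeats the first one.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann"],
    "score": [90, 85, 90],
})

n_rows, n_cols = df.shape              # (rows, columns)
n_duplicates = df.duplicated().sum()   # True counts as 1, so this counts repeated rows

print(n_rows, n_cols, n_duplicates)
```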

# 4. Drop unnecessary data

# 4-1 drop duplicate rows
df.drop_duplicates(inplace=True)

# 4-2 drop unnecessary columns
to_drop = ['column1','column2']
df.drop(to_drop, inplace=True, axis=1)
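Step 4 can also be written without `inplace=True`, by chaining the calls and keeping the result in a new variable. This is only a sketch on hypothetical data (the `notes` column is made up), but it shows that the original table stays untouched, which makes each step easier to check.

```python
import pandas as pd

# Hypothetical toy data with one duplicate row and one column we do not need.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann"],
    "score": [90, 85, 90],
    "notes": ["", "", ""],
})

# Each call returns a new DataFrame, so df itself is left unchanged.
cleaned = df.drop_duplicates().drop(columns=["notes"])

print(df.shape, cleaned.shape)
```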

# 5. Assign new columns name, index

# 5-1 Assign new columns name from another csv
header = pd.read_csv("")
d_new = header.set_index('Name').to_dict()
# note: column names missing from the mapping become NaN
df.columns = df.columns.to_series().map(d_new['Label'])

# 5-2 Assign new columns name from python dictionary
d_new = {"old": "new"}
df.rename(columns=d_new, inplace=True)

# 5-3 Assign column as new index 
df["column"].is_unique  # check that the values are unique first
df.set_index("column", inplace=True)
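To see step 5 end to end, here is a small sketch with hypothetical column codes. The lookup table uses the same `Name` and `Label` columns as 5-1, but it is built inline instead of being read from a CSV. One design choice differs from the checklist: `rename` leaves columns that are missing from the mapping as they are, whereas `map` would turn them into NaN.

```python
import pandas as pd

# Hypothetical data: short column codes that we want to make readable.
df = pd.DataFrame({
    "id": [101, 102, 103],
    "nm": ["Ann", "Bob", "Cy"],
    "sc": [90, 85, 70],
})

# 5-1: a lookup table of old names ("Name") and new labels ("Label").
# In practice this would come from pd.read_csv; it is built inline here.
header = pd.DataFrame({"Name": ["nm"], "Label": ["name"]})
labels = header.set_index("Name")["Label"].to_dict()
df = df.rename(columns=labels)  # columns missing from the mapping are kept as-is

# 5-2: rename from a plain dictionary.
df = df.rename(columns={"sc": "score"})

# 5-3: promote a column to the index only if its values are unique.
if df["id"].is_unique:
    df = df.set_index("id")

print(df.columns.tolist(), df.index.name)
```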

# 6. Clean the data format

# 6-1 Clean data through regex
# (assumes every value matches; re.search returns None when it does not)
df["columns"] = df["columns"].apply(lambda x: re.search(r"pattern", x).group())

# 6-2 Clean data through checking boolean values
df["columns"] = np.where(df["columns"] == "A","X","Y" ) 

# 6-3 Convert the column to correct types
df = df.astype({"columns1":"float64","columns2":"string"})
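Here is a small sketch of step 6 on made-up data (the column names, values, and the four-digit year pattern are all hypothetical). The helper `first_year` is my own addition: it returns `None` when the pattern does not match, so a single odd value does not stop the whole cleaning run.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical messy data.
df = pd.DataFrame({
    "published": ["1879 [1878]", "c. 1905", "unknown"],
    "grade": ["A", "B", "A"],
    "price": ["1.5", "2", "3.25"],
})

def first_year(text):
    """Return the first run of four digits in text, or None if there is none."""
    match = re.search(r"\d{4}", text)
    return match.group() if match else None

# 6-1: extract the part we want with a regex.
df["published"] = df["published"].apply(first_year)

# 6-2: recode values based on a boolean condition.
df["grade"] = np.where(df["grade"] == "A", "X", "Y")

# 6-3: convert columns to the intended types.
df = df.astype({"price": "float64", "grade": "string"})

print(df)
```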

# 7. Calculate statistical data for further plotting

# 7-1 Calculate statistical info
df.groupby("columns").agg(["mean","median"])
df.sort_values(by=["columns1","columns2"],ascending=False).head()
df["columns3"].value_counts()
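The three calls in step 7 can be tried on a tiny made-up table like the one below (team names and points are hypothetical). I select the numeric column before aggregating, because in recent pandas versions asking for the mean of a text column can raise an error.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "points": [10, 20, 1, 2, 6],
})

# Mean and median of points for each team.
stats = df.groupby("team")["points"].agg(["mean", "median"])

# Sort by team, then points, both descending, and keep the top two rows.
top = df.sort_values(by=["team", "points"], ascending=False).head(2)

# Number of rows per team.
counts = df["team"].value_counts()

print(stats)
```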

# 8. Plotting

# 8-1 Import libraries
import seaborn as sns
import matplotlib.pyplot as plt

# set plot style
sns.set_style('ticks')

# create boxplot
plt.figure(figsize=(10,15))
sns.boxplot(x="column1",y="column2",data=df)
plt.title("Title of the plot")
plt.show()

Further reading:

Pythonic Data Cleaning With Pandas and NumPy – Real Python

I also read the detailed regex tutorial written by @Coding, another resident of Matt City, and benefited a lot from it.

Supplementary statistical knowledge

Although I previously audited the Data Science Math Skills course on Coursera, I feel that my statistical knowledge is still insufficient. On my classmates' recommendation, I now watch several statistics videos from StatQuest with Josh Starmer on YouTube every day. The examples in this series are very detailed and the pacing is good, so you can take notes while listening. The next step is to apply this knowledge in practical analysis; only by practicing and using it will I avoid forgetting it.

New week goals (Week 24)

Complete the Python Numpy Exercise (the rest)
Complete the Web-scraping Project
Google Capstone Project (in R): complete the R coding part (unfinished task from before)
(If you have time, do the same in Python again.)

Off topic

I used to study with Lo-Fi beats on Spotify, but recently I prefer to open Twitch and practice with the voice discussions from Among Us streams playing in the background, without actually following what is being said.
This time, I sketched the outline of this article a day earlier, so I was not as tired as last week when I finished the record (or was I?)
I really like this Visual Artist and once used one of her works as material for a design that I am personally happy with (sort of). Ah, it looks like I need to learn a little more about Photoshop and Figma (I have already bookmarked several tutorials).
Recently I have been "returning to my old business" of sorts, translating some netizens' comments. It is interesting in itself, but it does not really serve any purpose.

CC BY-NC-ND 2.0

coletangsy: Learning Data Science and Machine Learning, and moving toward my goals step by step by keeping records.
