[Learning Record | This Week Summary] Week 23 - Practice, practice, and more practice; data cleaning
The cover image is once again by my favorite visual artist, @evieshaffer
Summary of the week
- Python pandas exercises (done)
- Python numpy exercises (in progress)
- Need to know more about Web scraping, regex
Content and Reflection
Python exercises
As planned last week, I finished the full set of Pandas exercises before writing this post. I say "finished" with some hesitation: the exercises themselves are done, but I have not yet organized the key points separately. Without that step, what I learned is easily set aside and forgotten along with the completed exercises. So tomorrow I need to make time, pause everything else, and give this top priority.
After finishing the Pandas exercises, I found two sets of NumPy exercises on GitHub to work through: rougier/numpy-100 and Kyubyong/numpy_exercises. The first is, as the title says, 100 NumPy questions; the second is organized as exercises that walk you through the NumPy documentation.
Speaking of which, my order seems backwards. Pandas is built on top of NumPy, so logically one should learn NumPy first and Pandas second. I started with Pandas and only then picked up NumPy, so I am treating the two as something to learn together. So far I have not felt that this order has hurt my learning much.
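To make the "Pandas is built on NumPy" point concrete for myself, here is a minimal sketch. The DataFrame `df_demo` and its values are made up purely for illustration; the only claim is that pandas objects can be converted to NumPy arrays and passed to NumPy functions.

```python
import numpy as np
import pandas as pd

# Made-up data, purely for illustration.
df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# A single column can be handed back as a NumPy array.
arr = df_demo["a"].to_numpy()

# The whole frame can be converted too, and NumPy functions then apply as usual.
col_means = np.mean(df_demo.to_numpy(), axis=0)

print(type(arr))
print(col_means)
```

Seeing the conversion go both ways is part of why learning the two libraries side by side has felt workable.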
Data Cleaning
This week I also read about the steps of data cleaning in Python and the code that goes with them. Here I want to collect the methods I found useful while doing the exercises, so they can serve as a checklist of steps for future data cleaning work.
# 1. Import libraries
import pandas as pd
import numpy as np
import re
# 2. Import data (csv as example)
df = pd.read_csv("")
# 3. Check dataset size
df.shape
df.head()
df.duplicated()
# 4. Drop unnecessary data
# 4-1 drop duplicate rows
df.drop_duplicates(inplace=True)
# 4-2 drop unnecessary columns
to_drop = ['column1', 'column2']
df.drop(to_drop, inplace=True, axis=1)
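A small sketch of step 4 on a made-up table, written without `inplace=True` so the original stays available for comparison. The frame `raw` and its column names (`id`, `name`, `notes`) are invented for this example.

```python
import pandas as pd

# Made-up data: the first two rows are exact duplicates.
raw = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "name": ["a", "a", "b", "c"],
    "notes": ["x", "x", "y", "z"],
})

# Count fully duplicated rows before removing anything.
n_dupes = int(raw.duplicated().sum())

# Drop duplicate rows, drop an unneeded column, and renumber the index.
cleaned = (
    raw.drop_duplicates()
       .drop(columns=["notes"])
       .reset_index(drop=True)
)

print(n_dupes)
print(cleaned)
```

Chaining the calls is only a style preference; the `inplace=True` form in the checklist does the same work on `df` directly.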
# 5. Assign new columns name, index
# 5-1 Assign new column names from another csv
header = pd.read_csv("")
d = header.set_index('Name').to_dict()
df.columns = df.columns.to_series().map(d['Label'])
# 5-2 Assign new column names from a python dictionary
d_new = {"old": "new"}
df.rename(columns=d_new, inplace=True)
# 5-3 Assign a column as the new index (check that it is unique first)
df["column"].is_unique
df.set_index("column", inplace=True)
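Here is how step 5 might look end to end, using an in-memory lookup table in place of the second csv. Everything in it is hypothetical: the frames `df_demo` and `header`, and the column names `old_id`, `old_val`, `id`, and `value`. Only the `Name` and `Label` columns follow the checklist.

```python
import pandas as pd

# Made-up data with the "old" column names.
df_demo = pd.DataFrame({"old_id": [10, 11, 12], "old_val": [1.5, 2.5, 3.5]})

# Stand-in for the second csv: one row per column, old name -> new label.
header = pd.DataFrame({
    "Name": ["old_id", "old_val"],
    "Label": ["id", "value"],
})

# Build a plain dict {old name: new label} and rename with it.
mapping = header.set_index("Name")["Label"].to_dict()
renamed = df_demo.rename(columns=mapping)

# Only promote a column to the index if its values are unique.
if renamed["id"].is_unique:
    renamed = renamed.set_index("id")

print(renamed)
```

Using `rename` with a dict leaves any column that is missing from the mapping untouched, which feels safer to me than overwriting `df.columns` wholesale.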
# 6. Clean the data format
# 6-1 Clean data through regex (note: raises AttributeError if a value has no match)
df["columns"] = df["columns"].apply(lambda x: re.search(r"pattern", x).group())
# 6-2 Clean data through checking boolean values
df["columns"] = np.where(df["columns"] == "A", "X", "Y")
# 6-3 Convert the columns to the correct types
df = df.astype({"columns1": "float64", "columns2": "string"})
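A sketch of step 6 on made-up values. It swaps `re.search` inside `apply` for the vectorized `Series.str.extract`, which leaves a missing value where the pattern does not match instead of raising. The frame `df_demo`, its columns, and the four-digit-year pattern are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up messy data.
df_demo = pd.DataFrame({
    "year": ["1879 [1878]", "c. 1868", "unknown"],
    "grade": ["A", "B", "A"],
})

# 6-1: keep the first four-digit run; rows without one become missing values.
df_demo["year_clean"] = df_demo["year"].str.extract(r"(\d{4})", expand=False)

# 6-2: map a boolean condition to two labels.
df_demo["flag"] = np.where(df_demo["grade"] == "A", "X", "Y")

# 6-3: convert the cleaned strings to numbers (missing values stay NaN).
df_demo["year_num"] = pd.to_numeric(df_demo["year_clean"])

print(df_demo)
```

Because of the missing value, `year_num` ends up as a float column rather than an integer one.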
# 7. Calculate statistical data for further plotting
df.groupby("columns").agg(["mean", "median"])
df.sort_values(by=["columns1", "columns2"], ascending=False).head()
df["columns3"].value_counts()
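The three calls in step 7 can be tried on a made-up frame. The column names `group` and `score` are invented for this sketch. It selects the numeric column before aggregating, since calling `mean` on non-numeric columns may raise an error in recent pandas versions.

```python
import pandas as pd

# Made-up data.
df_demo = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "score": [1.0, 3.0, 2.0, 4.0, 9.0],
})

# Mean and median of score within each group.
stats = df_demo.groupby("group")["score"].agg(["mean", "median"])

# The two highest-scoring rows.
top = df_demo.sort_values(by="score", ascending=False).head(2)

# How many rows each group has.
counts = df_demo["group"].value_counts()

print(stats)
print(top)
print(counts)
```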
# 8. Plotting
# 8-1 Import libraries
import seaborn as sns
import matplotlib.pyplot as plt
# set plot style
sns.set_style('ticks')
# create boxplot
plt.figure(figsize=(10, 15))
sns.boxplot(x="column1", y="column2", data=df)
plt.title("Title of the plot")
plt.show()
Further reading:
Pythonic Data Cleaning With Pandas and NumPy – Real Python
I also referred to the detailed Regex tutorial written by @Coding, another resident of Matt City, and benefited a lot from it.
Supplementary statistical knowledge
Although I audited the Data Science Math Skills course on Coursera before, I still feel my statistical knowledge is insufficient. On my classmates' recommendation, I now watch several statistics videos from StatQuest with Josh Starmer on YouTube every day. The examples in this series are very detailed and the pacing is good, so you can take notes while listening. The next step is to apply this knowledge to actual analysis; only by practicing and using it will I not forget it.
New week goals (Week 24)
- Finish the remaining Python NumPy exercises
- Complete the web-scraping project
- Finish the R coding part of the Google Capstone Project (carried over from a previous week)
(If there is time, do the same project again in Python.)
Off topic
- I used to play Lo-Fi Beats on Spotify while studying, but recently I prefer to open Twitch and leave an Among Us stream with voice discussion running while I practice, without actually listening to the content.
- This time I sketched the outline of this post a day early, so I was not as tired as last week when I finished writing it (or was I?)
- I really like this visual artist and once used one of her works as material for a design that I am personally happy with (sort of). Ah, it looks like I need to learn a little more about Photoshop and Figma (I have already bookmarked several tutorials).
- Recently I have, in a sense, "returned to my old business" by translating some netizens' comments. It is interesting in itself, but actually meaningless.