Making with Code

Unit 01 Data Science Project #

The final project for this unit will be a research project. Working with the data science tools we explored this unit, you are expected to tell a narrative story about a time in your life using data as evidence.

To be successful in this project, you should find a topic that is both interesting to you and answerable (at least in part) with data from your chosen dataset. Your teachers will help you make sure your project hypothesis achieves that.


[1] Timeline #

1️⃣ Plan your project and choose a research question.

2️⃣ Conduct the data analysis and create data visualizations.

3️⃣ Create a research poster to communicate your findings.

4️⃣ Present your findings to the class.

5️⃣ If interested, you may present your findings at Shuyuan Research Week in June. This is not required.


[2] Starter Materials #

💻 You can find your project in your Google Drive: Project: Data Science.ipynb

✏️ For the poster, you may use Canva or any other service. It will be A3 size. You will begin the poster after you finish the coding portion.


[3] Assessment #

The assessment is broken down into four criteria:

  • Project Planning
  • Data Analysis
  • Data Visualization
  • Data Communication

For each criterion you will be assigned a score from 0 to 3:

  • 0 - little to no evidence of the concept
  • 1 - limited evidence of the concept
  • 2 - adequate evidence of the concept
  • 3 - substantial evidence of the concept

[4] Success Claims #

💯 Successful computer scientists should be able to make the following claims:

  • Project Planning
    • I can choose a relevant research question and determine appropriate forms of evidence
    • I can write out the necessary steps for each form of evidence
    • I can sketch an appropriate visualization for each form of evidence
  • Data Analysis
    • I can prepare my dataset by adding and removing necessary columns or rows
    • I can combine and reorganize pieces of data to explore new relationships and support my research question
    • I can generate summary statistics (mean, median, mode, or frequency count) for describing the data
    • I can write readable code by using descriptive names for modules, functions, and variables
    • I can write descriptive comments to describe complex pieces of the code
  • Data Visualization
    • I can choose appropriate data visualizations to communicate my findings
    • I can write an accurate title, axis titles, and labels for my visualizations
    • I can display data visualizations that are thoughtfully sorted and easy to read
  • Data Communication: Poster
    • I can describe the dataset
    • I can explain the purpose and focus of the research question
    • I can reflect on the meaning of the data and provide context for each form of evidence
    • I can consider potential future areas of investigation if I were to continue this project

Keep the success claims in mind when coding your project.


[5] Tips & Tricks #

Detecting Chinese Characters #

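Here is an example of how to detect whether a string contains Chinese characters.

  • You will need to import regex. This is a third-party module, separate from the built-in re module; if it is not already installed, run pip install regex
  • Then use regex.search() - it returns a match object if it finds a Chinese character, or None if it does not
import regex

chinese_string = "上海 2025"

# \p{Han} matches any single Chinese (Han) character
if regex.search(r'\p{Han}', chinese_string):
  print("Has Chinese!")
else:
  print("No Chinese")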
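
If the regex module is not available, a rough alternative is the built-in re module with an explicit Unicode range. This is a minimal sketch, assuming the range U+4E00 to U+9FFF (the main block of common Chinese characters) is enough for your data; rarer characters outside that block will be missed. The function name has_chinese is made up for this example.

```python
import re

# Assumption: U+4E00 to U+9FFF covers the common Chinese characters
CHINESE_PATTERN = re.compile(r'[\u4e00-\u9fff]')


def has_chinese(text):
    """Return True if text contains at least one character in the range."""
    # search() returns a match object or None, so compare against None
    return CHINESE_PATTERN.search(text) is not None


print(has_chinese("上海 2025"))
print(has_chinese("Shanghai 2025"))
```

Wrapping the check in a function makes it easy to reuse, for example on every value in a dataframe column.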

Detecting Emoji #

Here is an example of how to detect emoji in a string.

import re

text = "hello world 💀 🌈"

# loop through each character
for char in text:
  # check whether this character falls in a common emoji range
  # (this range covers many emoji, but not every one)
  if re.search('[\U0001f300-\U0001f650]', char):
    print("found an emoji!")

print("end of string")
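
The loop above prints a message once for every emoji it finds. If you only need a yes/no answer or a count, you can search the whole string at once. This is a minimal sketch that reuses the same character range; the names has_emoji and count_emoji are made up for this example.

```python
import re

# Assumption: same range as the loop example; it misses some emoji
EMOJI_PATTERN = re.compile('[\U0001f300-\U0001f650]')


def has_emoji(text):
    """Return True if at least one character falls in the emoji range."""
    return EMOJI_PATTERN.search(text) is not None


def count_emoji(text):
    """Count how many characters fall in the emoji range."""
    return len(EMOJI_PATTERN.findall(text))


print(has_emoji("hello world 💀 🌈"))
print(count_emoji("hello world 💀 🌈"))
```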

Find the mode of a column for each unique value in another column #

📖 Here is a dataframe stored in the variable age_df. It stores names, ages, is_adult, and house.

age_df.head()
   name     age  is_adult  house
0  Alice     10  False     fire
1  Bob       15  False     metal
2  Charlie   25  True      metal
3  David     40  True      fire
4  Sally     80  True      fire

📖 We want to find the mode of is_adult for each house. For this we must use groupby.

import pandas as pd

mode_isAdult_by_house_df = age_df.groupby(['house'])['is_adult'].agg(pd.Series.mode).to_frame().reset_index()

📖 Here is the new dataframe mode_isAdult_by_house_df.

For metal, because True and False appear the same number of times, it returns both options.

mode_isAdult_by_house_df
   house  is_adult
0  fire   True
1  metal  [False, True]
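
If you would rather have exactly one value per house, one option is to keep only the first of the tied modes. This is a minimal sketch, assuming age_df holds just the five rows shown above; picking the first tied value is an arbitrary choice, and the name single_mode_df is made up for this example.

```python
import pandas as pd

age_df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Sally"],
    "age": [10, 15, 25, 40, 80],
    "is_adult": [False, False, True, True, True],
    "house": ["fire", "metal", "metal", "fire", "fire"],
})

# mode() returns every tied value in sorted order;
# .iloc[0] keeps only the first one (False sorts before True)
single_mode_df = (
    age_df.groupby("house")["is_adult"]
    .agg(lambda values: values.mode().iloc[0])
    .reset_index()
)

print(single_mode_df)
```

Because the tie is broken arbitrarily, it is worth mentioning on your poster when a group has more than one mode.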

This tutorial is based on this post.


Find the mean of a column for each unique value in another column #

📖 Here is a dataframe stored in the variable age_df. It stores names, ages, is_adult, and house.

age_df.head()
   name     age  is_adult  house
0  Alice     10  False     fire
1  Bob       15  False     metal
2  Charlie   25  True      metal
3  David     40  True      fire
4  Sally     80  True      fire

📖 We want to find the mean age for each value in is_adult. For this we must use groupby.

You could also replace .mean() with .max() or .min()

To round the mean, add .round(3) after .mean()

mean_isAdult_by_age_df = age_df.groupby(['is_adult'])['age'].mean().to_frame().reset_index()

📖 Here is the new dataframe mean_isAdult_by_age_df.

mean_isAdult_by_age_df
   is_adult  age
0  False     12.500000
1  True      48.333333
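
To see the mean, minimum, and maximum side by side instead of swapping one function for another, you can pass a list of function names to agg. This is a minimal sketch, assuming age_df holds just the five rows shown above; the name age_stats_df is made up for this example.

```python
import pandas as pd

age_df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Sally"],
    "age": [10, 15, 25, 40, 80],
    "is_adult": [False, False, True, True, True],
    "house": ["fire", "metal", "metal", "fire", "fire"],
})

# one row per is_adult value, one column per statistic
age_stats_df = (
    age_df.groupby("is_adult")["age"]
    .agg(["mean", "min", "max"])
    .round(3)  # round every statistic to 3 decimal places
    .reset_index()
)

print(age_stats_df)
```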

Compare a value to the value right above it #

This is particularly useful for finding out whether you listened to or watched the same piece of media multiple times in a row.

📖 Here is a dataframe stored in the variable df. It stores a single column, food.

df.head(6)
   food
0  Siu Mai
1  Siu Mai
2  Ice Cream
3  Kinder Bueno
4  Kinder Bueno
5  Kinder Bueno

📖 We want to add a column to see what the previous item was. For this we must use shift.

You could also put a number in the brackets to shift by a different number of rows.

For example, .shift(2) moves each value down 2 rows, and .shift(-1) moves each value up 1 row.

df['previous'] = df['food'].shift()
df.head(6)
   food          previous
0  Siu Mai       NaN
1  Siu Mai       Siu Mai
2  Ice Cream     Siu Mai
3  Kinder Bueno  Ice Cream
4  Kinder Bueno  Kinder Bueno
5  Kinder Bueno  Kinder Bueno

📖 Now we will add a new column to track if the food is the same as the previous one.

df["repeat"] = df["food"] == df["previous"]
df.head(6)
   food          previous      repeat
0  Siu Mai       NaN           False
1  Siu Mai       Siu Mai       True
2  Ice Cream     Siu Mai       False
3  Kinder Bueno  Ice Cream     False
4  Kinder Bueno  Kinder Bueno  True
5  Kinder Bueno  Kinder Bueno  True
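
Once you have the repeat column, you can count the repeats or measure how long each streak lasted. This is a minimal sketch, assuming the six rows of food data shown above; the names streak_id, streak_df, and length are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"food": ["Siu Mai", "Siu Mai", "Ice Cream",
                            "Kinder Bueno", "Kinder Bueno", "Kinder Bueno"]})
df["previous"] = df["food"].shift()
df["repeat"] = df["food"] == df["previous"]

# True counts as 1, so the sum is the number of repeated rows
repeat_count = int(df["repeat"].sum())

# a new streak starts on every row that is NOT a repeat,
# so a running total of those rows gives each streak its own id
df["streak_id"] = (~df["repeat"]).cumsum()

# one row per streak: which food it was and how many rows it lasted
streak_df = (
    df.groupby("streak_id")
    .agg(food=("food", "first"), length=("food", "size"))
    .reset_index(drop=True)
)

print(repeat_count)
print(streak_df)
```

Sorting streak_df by length would then show your longest run of the same item.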

Grouped Bar Chart #

A grouped bar chart allows you to compare multiple sets of similar data.

📖 In this chart, we compare two people and their number of cats and dogs. Their data is stored in person1df and person2df. The dataframes are identical, except for the count column.

person1df
   animal  count
0  dog     25
1  cat     30

person2df
   animal  count
0  dog     50
1  cat     23

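📖 This is how you create a grouped bar chart with two dataframes.

import plotly.graph_objects as go

# use ['count'] rather than .count, because .count is a built-in dataframe method
fig = go.Figure(data=[
    go.Bar(name='person 1', x=person1df['animal'], y=person1df['count'], text=person1df['count']),
    go.Bar(name='person 2', x=person2df['animal'], y=person2df['count'], text=person2df['count'])
])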


fig.update_layout(
    barmode='group',
    title="Plot Title",
    xaxis_title="x axis title",
    yaxis_title="y axis title",
    )

fig.show()

Line Charts with Many Lines #

A line chart with many lines allows you to compare how multiple sets of similar data change over time.

📖 In this chart, we compare two people and their heights over 5 years. Their data is stored in person1df and person2df. The dataframes are identical, except for the height column.

person1df
   year  height
0  2020  150
1  2021  155
2  2022  160
3  2023  164
4  2024  165

person2df
   year  height
0  2020  165
1  2021  168
2  2022  170
3  2023  173
4  2024  175

📖 This is how you create a line chart with one line per dataframe.

import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(go.Scatter(x=person1df.year, y=person1df.height, name='person 1'))

fig.add_trace(go.Scatter(x=person2df.year, y=person2df.height, name='person 2'))

fig.update_xaxes(type='category') # only existing values for ticks

fig.update_layout(
    title="Plot Title",
    xaxis_title="year",
    yaxis_title="height",
    )

fig.show()

Adding Other #

This can be helpful if you’d like to show your top items in relation to the other items in a dataset.

1️⃣ Make a new dataframe with just the top 3 items. For this we must use head. This assumes df is already sorted from most to least purchased.

top3_df = df.head(3)
   food          num_purchased
0  Siu Mai       20
2  Kinder Bueno  15
3  Ice Cream     10

2️⃣ Make a new dataframe with all data except the top 3 items. We use iloc to select specific rows.

other_df = df.iloc[3:]

3️⃣ We add up the totals for all the other items. To do this, we use sum.

other_sum = other_df["num_purchased"].sum()

4️⃣ We create a new row for the other sum and add it to top3_df.

# create the new row
new_row = {'food': 'Other', 'num_purchased': other_sum}

# make a new df that combines top3_df with the new row
# (df.append was removed in pandas 2.0, so we use pd.concat instead)
combo_df = pd.concat([top3_df, pd.DataFrame([new_row])], ignore_index=True)

📖 Here is the new dataframe combo_df.

   food          num_purchased
0  Siu Mai       20
1  Kinder Bueno  15
2  Ice Cream     10
3  Other         8
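
The four steps can be put together, starting from an unsorted dataframe. This is a minimal sketch: the top 3 rows match the tables above, but the two smaller items (Pocky and Mochi) are made up so that the "Other" total comes to 8.

```python
import pandas as pd

# Hypothetical purchase counts; Pocky and Mochi are invented for this example
df = pd.DataFrame({
    "food": ["Siu Mai", "Pocky", "Kinder Bueno", "Ice Cream", "Mochi"],
    "num_purchased": [20, 5, 15, 10, 3],
})

# sort from most to least purchased so head(3) really is the top 3
df = df.sort_values("num_purchased", ascending=False)

top3_df = df.head(3)                          # step 1: the top 3 rows
other_df = df.iloc[3:]                        # step 2: everything else
other_sum = other_df["num_purchased"].sum()   # step 3: total of the rest

# step 4: add the "Other" row to the top 3
new_row = {"food": "Other", "num_purchased": other_sum}
combo_df = pd.concat([top3_df, pd.DataFrame([new_row])], ignore_index=True)

print(combo_df)
```

Sorting first matters because head(3) only returns the first 3 rows, whatever order they happen to be in.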

Drop Duplicate Values #

This may be helpful with the YouTube dataset if you are interested in analyzing what types of channels you watch.

📖 drop_duplicates removes any row that is an exact copy of an earlier row.

df.head()
   name   age
0  John   20
1  Paul   23
2  Ringo  18
3  Ringo  18

no_duplicates_df = df.drop_duplicates()
no_duplicates_df.head()
   name   age
0  John   20
1  Paul   23
2  Ringo  18
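
By default, a row is only dropped when every column matches an earlier row. To keep one row per channel even when the videos differ, you can pass the subset parameter. This is a minimal sketch; the dataframe watch_df and its channel and video columns are made up for this example and will not match your dataset's real column names.

```python
import pandas as pd

# Hypothetical watch history: one row per video watched
watch_df = pd.DataFrame({
    "channel": ["Channel A", "Channel A", "Channel B", "Channel A"],
    "video": ["Video 1", "Video 2", "Video 3", "Video 1"],
})

# default: only exact copies are removed (the last row repeats the first)
no_duplicates_df = watch_df.drop_duplicates()

# subset: compare only the channel column, keeping the first row per channel
channels_df = watch_df.drop_duplicates(subset=["channel"]).reset_index(drop=True)

print(no_duplicates_df)
print(channels_df)
```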