Unit 01 Data Science Project #
The final project for this unit will be a research project. Working with the data science tools we explored this unit, you are expected to tell a narrative story about a time in your life using data as evidence.
To be successful in this project, you should find a topic that is both interesting to you and answerable (at least in part) with data from your chosen dataset. Your teachers will help you make sure your project hypothesis achieves that.
Extra Examples HERE
[1] Timeline #
1️⃣ Plan your project and choose a research question.
2️⃣ Conduct the data analysis and create data visualizations.
3️⃣ Create a research poster to communicate your findings.
4️⃣ Present your findings to the class.
5️⃣ If interested, you may present your findings at Shuyuan Research Week in May. This is not required.
You will have 6 in-class blocks to plan and complete the coding portion of this project.
section | code due date |
---|---|
cs9.1 | 4 March |
cs9.2 | 3 March |
cs9.3 | 27 Feb |
[2] Starter Materials #
💻 You can find your project in your Google Drive: Project: Data Science.ipynb
For the poster, you may use Canva or any other service. It will be A3 size. You will begin the poster after you finish the coding portion.
[3] Assessment #
The assessment is broken down into four criteria:
- Project Planning
- Data Analysis
- Data Visualization
- Data Communication
For each criterion you will be assigned a score from 0-3:
- 0 - little to no evidence of the concept
- 1 - limited evidence of the concept
- 2 - adequate evidence of the concept
- 3 - substantial evidence of the concept
[4] Success Claims #
🎯 Successful computer scientists should be able to make the following claims:
- Project Planning
  - I can choose a relevant research question and determine appropriate forms of evidence
  - I can consistently track my progress with specific comments
- Data Analysis
  - I can prepare my dataset by adding and removing necessary columns or rows
  - I can combine and reorganize pieces of data to explore new relationships and support my research question
  - I can generate summary statistics (mean, median, mode, frequency count) for describing the data
  - I can write readable code by using descriptive names for modules, functions, and variables
  - I can write descriptive comments to describe complex pieces of the code
- Data Visualization
  - I can choose appropriate data visualizations to communicate my findings
  - I can display data visualizations that are thoughtfully sorted and easy to read
  - I can include appropriate title, axis, and labels for my visualizations
- Data Communication: Poster
  - I can explain the purpose and focus of the research
  - I can explain my methodology, including action steps and reasoning for restructuring the data
  - I can reflect on the meaning of the data and provide context of the results
  - I can discuss the limitations and potential biases of my data and analysis
  - I can consider potential future areas of investigation if I were to continue this project
Keep the success claims in mind when coding your project.
[5] Tips & Tricks #
Detecting Chinese Characters #
Here is an example of how to detect if there are Chinese characters in a string.
- You will need to import regex
- Then use regex.search()
- It returns a match object if there are Chinese characters in the string, or None if there are not
import regex

chinese_string = "上海 2025"
# a match object counts as True in an if statement, and None counts as False
if regex.search(r'\p{Han}', chinese_string):
    print("Has Chinese!")
else:
    print("No Chinese")
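If the regex package is not installed, a similar check can be sketched with Python's built-in re module. This is only a rough alternative: the helper name has_chinese is made up for this example, and the character range below is an assumption that covers only the basic block of Chinese characters (U+4E00 to U+9FFF), so some rare characters may be missed.

```python
import re

# Basic CJK Unified Ideographs block only (an assumption; \p{Han} covers more)
HAN_PATTERN = re.compile(r"[\u4e00-\u9fff]")

def has_chinese(text):
    """Return True if text has at least one character in the range above."""
    return HAN_PATTERN.search(text) is not None

print(has_chinese("上海 2025"))      # True
print(has_chinese("Shanghai 2025"))  # False
```

Because has_chinese() returns a real True or False, it can also be used with the apply technique to make a new column.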
Creating a New Column Based on Another Column #
Here is a dataframe stored in the variable age_df. It stores names and ages.
| | name | age |
---|---|---|
0 | Alice | 10 |
1 | Bob | 15 |
2 | Charlie | 25 |
3 | David | 40 |
4 | Sally | 80 |
In one code block, you will write a function with your conditional statements. This function returns True or False, based on the age it is given.
def get_if_adult(age):
    if age < 18:
        return False
    else:
        return True
In another code block, you can apply the function to each row of the dataframe. Here it applies the function get_if_adult() to the age value in each row, and stores the return value in a new column called is_adult.
# Apply the get_if_adult function to the age column
age_df['is_adult'] = age_df.apply(lambda row: get_if_adult(row['age']), axis=1)
Here is the updated dataframe.
| | name | age | is_adult |
---|---|---|---|
0 | Alice | 10 | False |
1 | Bob | 15 | False |
2 | Charlie | 25 | True |
3 | David | 40 | True |
4 | Sally | 80 | True |
This tutorial is based on this guide.
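To try this from start to finish, here is a small sketch that builds the example dataframe by hand and adds the new column in two ways. The column name is_adult_vectorized is made up for this example, and the second option is just an alternative to apply, not a requirement.

```python
import pandas as pd

# Rebuild the example dataframe by hand
age_df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Sally"],
    "age": [10, 15, 25, 40, 80],
})

def get_if_adult(age):
    # Same rule as above: 18 and over counts as an adult
    return age >= 18

# Option 1: apply the function to every value in the age column
age_df["is_adult"] = age_df["age"].apply(get_if_adult)

# Option 2 (alternative): compare the whole column at once
age_df["is_adult_vectorized"] = age_df["age"] >= 18

print(age_df)
```

Option 1 is the better fit when your function has several if/elif branches. Option 2 is shorter when the rule is a single comparison.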
Find the top 1+ based on two columns #
For example, I want to find the top 1 show I watched each month.
Here is a dataframe stored in the variable watch_history_df.
| | month | show | episode | genre |
---|---|---|---|---|
0 | January | One Piece | 08 | Fantasy |
1 | January | One Piece | 09 | Fantasy |
2 | January | The Office | 15 | Comedy |
3 | February | Avatar | 01 | Animation |
4 | February | Avatar | 02 | Animation |
5 | February | Avatar | 03 | Animation |
1️⃣ Count how many times we watched each show during each month. For this we must use groupby.
# this counts up how many times I watched each show in each month
top_show_df = watch_history_df.groupby(by=["month", "show"]).size().to_frame("count")
2️⃣ Next, sort the values. Sort by month, then by count. We sort month in ascending order, since we want to start with the lowest number (1 for January). This assumes the month column stores month numbers, because month names would be sorted alphabetically. We sort count in descending order, since we want the most watched show at the top.
# this sorts the new df first by month, then by the count
top_show_df = top_show_df.sort_values(['month', 'count'], ascending=[True, False])
3️⃣ Last, get the top 1 show for each month. To do this, we use groupby combined with head. If you want the top 3, you could use .head(3), etc.
# this gets the top 1 for every month
top_show_df = top_show_df.groupby('month').head(1)
Here is the new dataframe top_show_df.
month | show | count |
---|---|---|
1 | One Piece | 2 |
2 | Avatar | 3 |
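The three steps can be sketched together as one runnable example. This sketch assumes the month column stores month numbers (1 and 2) and leaves out the episode and genre columns, since they are not needed for counting. The variable name counts_df is made up for this example.

```python
import pandas as pd

# Small version of the watch history, with months stored as numbers (an assumption)
watch_history_df = pd.DataFrame({
    "month": [1, 1, 1, 2, 2, 2],
    "show": ["One Piece", "One Piece", "The Office", "Avatar", "Avatar", "Avatar"],
})

# Step 1: count how many times each show was watched in each month
counts_df = watch_history_df.groupby(["month", "show"]).size().to_frame("count").reset_index()

# Step 2: sort by month (lowest first), then by count (highest first)
counts_df = counts_df.sort_values(["month", "count"], ascending=[True, False])

# Step 3: keep only the first row for each month
top_show_df = counts_df.groupby("month").head(1).reset_index(drop=True)

print(top_show_df)
```

Using reset_index() after counting turns month and show back into normal columns, which can make the result easier to chart later.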
Find the mode of a column for each unique value in another column #
Here is a dataframe stored in the variable age_df. It stores names, ages, is_adult, and house.
age_df.head()
| | name | age | is_adult | house |
---|---|---|---|---|
0 | Alice | 10 | False | fire |
1 | Bob | 15 | False | metal |
2 | Charlie | 25 | True | metal |
3 | David | 40 | True | fire |
4 | Sally | 80 | True | fire |
We want to see the mode of is_adult for each house. For this we must use groupby.
mode_isAdult_by_house_df = age_df.groupby(['house'])['is_adult'].agg(pd.Series.mode).to_frame().reset_index()
Here is the new dataframe mode_isAdult_by_house_df.
For metal, because True and False appear the same number of times, it returns both options.
mode_isAdult_by_house_df
| | house | is_adult |
---|---|---|
0 | fire | True |
1 | metal | [False, True] |
This tutorial is based on this post.
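If you would rather have exactly one value per house, even when there is a tie, one possible approach is to keep only the first mode. This is a sketch, not part of the original post: the helper name first_mode is made up, and choosing the first tied value is an arbitrary rule that you should mention in your limitations.

```python
import pandas as pd

age_df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Sally"],
    "is_adult": [False, False, True, True, True],
    "house": ["fire", "metal", "metal", "fire", "fire"],
})

def first_mode(values):
    # Series.mode() returns every tied value in sorted order; keep only the first
    return values.mode().iloc[0]

single_mode_df = (
    age_df.groupby("house")["is_adult"]
    .agg(first_mode)
    .to_frame()
    .reset_index()
)

print(single_mode_df)
```

For metal this returns False, because False is sorted before True when both appear the same number of times.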
Find the mean of a column for each unique value in another column #
Here is a dataframe stored in the variable age_df. It stores names, ages, is_adult, and house.
age_df.head()
| | name | age | is_adult | house |
---|---|---|---|---|
0 | Alice | 10 | False | fire |
1 | Bob | 15 | False | metal |
2 | Charlie | 25 | True | metal |
3 | David | 40 | True | fire |
4 | Sally | 80 | True | fire |
We want to see the mean age for each value in is_adult. For this we must use groupby.
You could also replace .mean() with .max() or .min().
To round the mean, add .round(3) after .mean().
mean_isAdult_by_age_df = age_df.groupby(['is_adult'])['age'].mean().to_frame().reset_index()
Here is the new dataframe mean_isAdult_by_age_df.
mean_isAdult_by_age_df
| | is_adult | age |
---|---|---|
0 | False | 12.5 |
1 | True | 48.333333 |
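If you want the mean, minimum, and maximum at the same time, one option is to pass a list of names to agg. This is a sketch using the same example data. The variable name age_summary_df is made up for this example.

```python
import pandas as pd

age_df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Sally"],
    "age": [10, 15, 25, 40, 80],
    "is_adult": [False, False, True, True, True],
})

# One row per is_adult value, one column per summary statistic
age_summary_df = (
    age_df.groupby("is_adult")["age"]
    .agg(["mean", "min", "max"])
    .round(3)
    .reset_index()
)

print(age_summary_df)
```

The adults are 25, 40, and 80 years old, so their mean is 145 / 3, which rounds to 48.333.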
Compare a value to the value right above it #
Here is a dataframe stored in the variable watch_history_df. It stores shows, episodes, and genre.
watch_history_df.head()
| | show | episode | genre |
---|---|---|---|
0 | One Piece | 08 | Fantasy |
1 | One Piece | 09 | Fantasy |
2 | The Office | 15 | Comedy |
3 | Avatar | 01 | Animation |
4 | Avatar | 02 | Animation |
5 | Avatar | 03 | Animation |
We want to add a column to see what the previous show was called. For this we must use shift.
You could also put a number in the brackets to shift by more than 1 row. For example, .shift(2) moves the values down 2 rows, and .shift(-1) moves them up 1 row.
watch_history_df['previous'] = watch_history_df['show'].shift()
watch_history_df.head()
| | show | episode | genre | previous |
---|---|---|---|---|
0 | One Piece | 08 | Fantasy | None |
1 | One Piece | 09 | Fantasy | One Piece |
2 | The Office | 15 | Comedy | One Piece |
3 | Avatar | 01 | Animation | The Office |
4 | Avatar | 02 | Animation | Avatar |
5 | Avatar | 03 | Animation | Avatar |
Now we will add a new column to track if the show is the same as the previous one.
watch_history_df["repeat"] = watch_history_df["show"] == watch_history_df["previous"]
watch_history_df.head()
| | show | episode | genre | previous | repeat |
---|---|---|---|---|---|
0 | One Piece | 08 | Fantasy | None | False |
1 | One Piece | 09 | Fantasy | One Piece | True |
2 | The Office | 15 | Comedy | One Piece | False |
3 | Avatar | 01 | Animation | The Office | False |
4 | Avatar | 02 | Animation | Avatar | True |
5 | Avatar | 03 | Animation | Avatar | True |
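Once the repeat column exists, you can count how often you watched the same show twice in a row. Here is a small sketch that rebuilds only the show column and counts the repeats. The variable name repeat_count is made up for this example.

```python
import pandas as pd

watch_history_df = pd.DataFrame({
    "show": ["One Piece", "One Piece", "The Office", "Avatar", "Avatar", "Avatar"],
})

# Same two steps as above: shift down one row, then compare
watch_history_df["previous"] = watch_history_df["show"].shift()
watch_history_df["repeat"] = watch_history_df["show"] == watch_history_df["previous"]

# True counts as 1 and False counts as 0, so sum() counts the repeats
repeat_count = int(watch_history_df["repeat"].sum())

print(repeat_count)  # 3
```

The first row has no previous show, so its comparison is False and it is not counted.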
Grouped Bar Chart #
A grouped bar chart allows you to compare multiple sets of similar data.
In this chart, we compare two people and their number of cats and dogs. Their data is stored in person1df and person2df. The dataframes are identical, except for the count column.
person1df
| | animal | count |
---|---|---|
0 | dog | 25 |
1 | cat | 30 |
person2df
| | animal | count |
---|---|---|
0 | dog | 50 |
1 | cat | 23 |
This is how you create a grouped bar chart with two dataframes.
import plotly.graph_objects as go

# use ['count'] instead of .count, because .count is a built-in dataframe method
fig = go.Figure(data=[
    go.Bar(name='person a', x=person1df['animal'], y=person1df['count'], text=person1df['count']),
    go.Bar(name='person b', x=person2df['animal'], y=person2df['count'], text=person2df['count'])
])
fig.update_layout(
    barmode='group',
    title="Plot Title",
    xaxis_title="x axis title",
    yaxis_title="y axis title",
)
fig.show()
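One thing to watch for when a column is named count: writing the column name after a dot does not give you the column, because every dataframe already has a method called count. This short sketch shows the difference, using the example data for person1df.

```python
import pandas as pd

person1df = pd.DataFrame({"animal": ["dog", "cat"], "count": [25, 30]})

# Dot access finds the built-in DataFrame.count method, not the column
print(callable(person1df.count))    # True

# Bracket access always returns the column
print(person1df["count"].tolist())  # [25, 30]
```

Bracket access is the safer habit for any column whose name might match a dataframe method, such as count, mean, or size.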

Line Charts with Many Lines #
A line chart with many lines allows you to compare how multiple sets of similar data change over time.
In this chart, we compare two people and their heights over 5 years. Their data is stored in person1df and person2df. The dataframes are identical, except for the height column.
person1df
| | year | height |
---|---|---|
0 | 2020 | 150 |
1 | 2021 | 155 |
2 | 2022 | 160 |
3 | 2023 | 164 |
4 | 2024 | 165 |
person2df
| | year | height |
---|---|---|
0 | 2020 | 165 |
1 | 2021 | 168 |
2 | 2022 | 170 |
3 | 2023 | 173 |
4 | 2024 | 175 |
This is how you create a line chart with two dataframes.
import plotly.graph_objects as go

fig = go.Figure()
# each add_trace call draws one line
fig.add_trace(go.Scatter(x=person1df.year, y=person1df.height, name='person 1'))
fig.add_trace(go.Scatter(x=person2df.year, y=person2df.height, name='person 2'))
fig.update_xaxes(type='category') # only existing values for ticks
fig.update_layout(
    title="Plot Title",
    xaxis_title="year",
    yaxis_title="height",
)
fig.show()
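If you have more than two people, adding one trace per dataframe by hand gets repetitive. One possible approach is to stack the dataframes into one long dataframe with a person column, then loop over the groups. This sketch stops before the plotting step and only prints the values each line would use. The names combined_df and lines are made up for this example.

```python
import pandas as pd

person1df = pd.DataFrame({"year": [2020, 2021, 2022, 2023, 2024],
                          "height": [150, 155, 160, 164, 165]})
person2df = pd.DataFrame({"year": [2020, 2021, 2022, 2023, 2024],
                          "height": [165, 168, 170, 173, 175]})

# Label each dataframe before stacking, so every row records whose height it is
combined_df = pd.concat(
    [person1df.assign(person="person 1"), person2df.assign(person="person 2")],
    ignore_index=True,
)

# Each group holds the x and y values for one line on the chart
lines = {}
for person, group in combined_df.groupby("person"):
    lines[person] = (group["year"].tolist(), group["height"].tolist())
    print(person, lines[person])
```

Inside the loop, each group could be passed to fig.add_trace(go.Scatter(...)) in the same way as the two traces above.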

Adding Other #
1️⃣ Make a new dataframe with just the top 5 artists. For this we must use head.
# make a new df with just the top 5
top5_df = artist_totals.head(5)
| | artist | total |
---|---|---|
0 | Florence + The Machine | 110 |
1 | Childish Gambino | 62 |
2 | J. Cole | 28 |
3 | Muse | 18 |
4 | Chance the Rapper | 13 |
2️⃣ Make a new dataframe with everyone except the top 5 artists. We use tail with a negative number, which skips the first 5 rows.
# get everything except the top 5
other_df = artist_totals.tail(-5)
3️⃣ We add up the totals for all the other artists. To do this, we use sum.
# add up the totals for all the other artists
other_total = other_df["total"].sum()
4️⃣ We add a new row for the other sum.
# create the new row
new_row = {'artist': 'Other', 'total': other_total}
# make a new df that combines the top5_df with the new row
# (DataFrame.append was removed in pandas 2.0, so we use pd.concat)
combo_df = pd.concat([top5_df, pd.DataFrame([new_row])], ignore_index=True)
Here is the new dataframe combo_df.
| | artist | total |
---|---|---|
0 | Florence + The Machine | 110 |
1 | Childish Gambino | 62 |
2 | J. Cole | 28 |
3 | Muse | 18 |
4 | Chance the Rapper | 13 |
5 | Other | 11 |
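The four steps can be sketched together as one runnable example. The full artist_totals dataframe is not shown above, so the last two rows here ("Artist F" and "Artist G") are made-up placeholders chosen to add up to 11. The sketch also assumes artist_totals is already sorted from highest to lowest total.

```python
import pandas as pd

artist_totals = pd.DataFrame({
    "artist": ["Florence + The Machine", "Childish Gambino", "J. Cole", "Muse",
               "Chance the Rapper", "Artist F", "Artist G"],  # last two are made up
    "total": [110, 62, 28, 18, 13, 6, 5],
})

# Step 1: the top 5 rows
top5_df = artist_totals.head(5)

# Steps 2 and 3: everything after the first 5 rows, added together
other_total = artist_totals.tail(-5)["total"].sum()

# Step 4: turn the new row into a one-row dataframe and stack it under the top 5
other_row_df = pd.DataFrame([{"artist": "Other", "total": other_total}])
combo_df = pd.concat([top5_df, other_row_df], ignore_index=True)

print(combo_df)
```

Grouping the small values into Other keeps a pie or bar chart readable when there are many categories with tiny totals.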