# Data Science 101 — CodingLab

Session 1

1st: `import pandas as pd`

2nd: Put the data into a list/dictionary

3rd: `df = pd.DataFrame(data, columns=['…', '…', '…'])`

Then type `df` to display the DataFrame.

*Note: a dictionary stores key (column name) and value (data) pairs.*
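A minimal runnable sketch of the three steps, using made-up data (the column names and values are placeholders):

```python
import pandas as pd

# Step 2: put the data into a dictionary — each key becomes a column name,
# each list of values becomes that column's data.
data = {
    "name": ["Ann", "Ben", "Cat"],
    "age": [25, 30, 35],
    "salary": [3000, 3500, 4000],
}

# Step 3: build the DataFrame.
df = pd.DataFrame(data, columns=["name", "age", "salary"])
print(df)
```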

**Uploading files (in Google Colab):**

`from google.colab import files`

`uploaded = files.upload()`

**Reading files:**

`import pandas as pd`

`df = pd.read_csv('unit05-data.csv')`

`df`

3. Data Processing

**Extracting Row from DataFrame:**

`df = pd.DataFrame(data, columns=['name', 'age', 'salary'])`

`df.loc[0]` → the `0` refers to the row's index label

**Extracting Column from DataFrame:**

`df = pd.DataFrame(data, columns=['name', 'age', 'salary'])`

`df['name']` → selects the 'name' column
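Putting both selections together on made-up data (names and values are placeholders):

```python
import pandas as pd

data = {"name": ["Ann", "Ben"], "age": [25, 30], "salary": [3000, 3500]}
df = pd.DataFrame(data, columns=["name", "age", "salary"])

row0 = df.loc[0]    # the row whose index label is 0
names = df["name"]  # the 'name' column, returned as a Series
print(row0)
print(names)
```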

**Grouping & Aggregation:**

`class_score = df.groupby('Class').agg({'Score': ['mean', 'min', 'max']})`

`class_score`

**Grouping by Multiple Columns:**

`class_score = df.groupby(['Class', 'Age']).agg({'Score': ['mean', 'min', 'max']})`

`class_score`
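Both groupings on a small made-up table (the 'Class', 'Age', and 'Score' values here are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Class": ["A", "A", "B", "B"],
    "Age": [12, 13, 12, 13],
    "Score": [80, 90, 70, 60],
})

# Group by a single column.
class_score = df.groupby("Class").agg({"Score": ["mean", "min", "max"]})
print(class_score)

# Group by multiple columns.
class_age_score = df.groupby(["Class", "Age"]).agg({"Score": ["mean", "min", "max"]})
print(class_age_score)
```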

**Filtering: Select Rows Based on Value of Column:**

`year_2002 = df[df['year'] == 2002]`

`year_2002`

Filtering returns a new DataFrame; the original is unchanged. To see it again, just type `print(df)`.

**Select Rows Whose Column Value Does NOT Equal a Specific Value:**

`year_not_2002 = df[df['year'] != 2002]`

`year_not_2002`

Rows whose year is not 2002 will be displayed.

There are two ways to select a column: `df['year']` and `df.year`.

`df['…']` is the better method if the header is long or contains spaces; `df.year` is quicker to type but only works for simple column names.

**Select Not Null:**

df_not_null = df[df.year.notnull()]

df_not_null

**Select rows based on a list:**

years = [2002,2008]

selection = df[df.year.isin(years)]

selection

**Select Rows Based on Values Not in List:**

The opposite uses the negation operator `~`.

years = [2002,2008]

selection = df[~df.year.isin(years)]

selection

**Select Rows Using Multiple Conditions:**

To use more than one condition, wrap each condition in parentheses and join them with `&` (and) or `|` (or).
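For example (the 'year' and 'gdp' columns are made up):

```python
import pandas as pd

df = pd.DataFrame({"year": [2000, 2002, 2008], "gdp": [1.0, 1.5, 2.0]})

# Each condition goes in its own parentheses; & means AND, | means OR.
subset = df[(df["year"] >= 2002) & (df["gdp"] > 1.4)]
print(subset)
```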

**Concatenate DataFrames by Rows:**

`pd.concat([df1, df2])` stacks the rows of one DataFrame under the other.

**Concatenate by Column:**

`pd.concat([df1, df2], axis=1)` places the DataFrames side by side.

`NaN` means empty/null. Because df2 only has 3 rows of data, the rest were displayed as `NaN`.
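A sketch with two tiny made-up frames, showing where the `NaN` padding comes from:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3, 4]})
df2 = pd.DataFrame({"b": [10, 20, 30]})

rows = pd.concat([df1, df1])          # stack by rows (the default, axis=0)
cols = pd.concat([df1, df2], axis=1)  # side by side; df2 is one row short,
print(cols)                           # so its missing cell becomes NaN
```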

**Merge DataFrames:**

Add a new column 'id' to each DataFrame, then join them on it, e.g. `pd.merge(df1, df2, on='id')`.
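A sketch with two made-up tables sharing an 'id' column:

```python
import pandas as pd

people = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Ben", "Cat"]})
pay = pd.DataFrame({"id": [1, 2, 3], "salary": [3000, 3500, 4000]})

# Join rows whose 'id' values match.
merged = pd.merge(people, pay, on="id")
print(merged)
```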

**Pivoting in Pandas:**

`aggfunc` → the aggregation function; by default it will calculate the mean/average

`index` → refers to the rows of the pivot table

`values` → what data you want to aggregate and display in the table

`fill_value` → will display all `NaN` cells as the given value (e.g. `0`) instead
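The parameters in action on a made-up table (class B has no Science score, so `fill_value=0` fills that cell):

```python
import pandas as pd

df = pd.DataFrame({
    "Class": ["A", "A", "B"],
    "Subject": ["Math", "Science", "Math"],
    "Score": [80, 90, 70],
})

# index → rows, columns → new columns, values → data to aggregate;
# aggfunc defaults to the mean, fill_value replaces NaN with 0.
table = pd.pivot_table(df, index="Class", columns="Subject",
                       values="Score", fill_value=0)
print(table)
```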

**Set Missing Values To 0:**

`df.fillna(0)`

**Output DataFrame to csv:**

`df.to_csv(…)`

The `r` prefix makes the path a raw string, so backslashes in it are read literally rather than as escape characters.
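A sketch combining the two steps on a made-up column. To keep it self-contained, `to_csv()` is called without a path, which returns the CSV text instead of writing a file; with a real Windows path you would write e.g. `df.to_csv(r'C:\folder\out.csv')` (that path is a placeholder):

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, None, 35]})
df = df.fillna(0)  # set missing values to 0

# With no path, to_csv() returns the CSV as a string instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```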

4. Data Cleaning

**Missing Data:** Either remove data or use imputation.

i) Remove data

By default, `dropna()` will only remove rows.

`dropna(axis=1)` will remove columns.

**Note:** to read Excel files using pandas: `pd.read_excel("file name.xls")`

ii) Data Imputation

Removing data may be convenient, but it may delete some important information in other columns.

Imputation means replacing the missing values with some other values.

`.fillna(0)` replaces every missing value with 0.

`df['Age'].mean() == df.Age.mean()` — both expressions give the same column mean.
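Both strategies on a made-up table with one missing age:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ann", "Ben", "Cat"], "Age": [20, None, 40]})

dropped_rows = df.dropna()        # default: drop rows that contain NaN
dropped_cols = df.dropna(axis=1)  # drop columns that contain NaN

# Imputation: replace the missing age with the column mean, (20 + 40) / 2 = 30.
df["Age"] = df["Age"].fillna(df["Age"].mean())
print(df)
```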

**Plotting Histogram:**

There are two broad ways to analyse a distribution: quantitative (t-test, z-test, chi-square, ANOVA test) and graphical, e.g. a histogram: `px.histogram(df, x= )`

**Plotting Boxplot:**

`px.box(df, x= , y= )`

**Plotting Bar Graphs:**

A bar graph has spaces between the bars, while a histogram's bars touch each other.

Bar graphs are usually used for data with different categories.

`px.bar(df, x= , y= )`

**Plotting Pie-Chart:**

`px.pie(df, values= , names= )`

`values` → the data values that determine each slice's size

`names` → the category labels for the slices

Session 2

Descriptive statistics summarise data collected from a population using numerical calculations, graphs, or tables.

Inferential statistics use sample data taken from a population to make inferences/predictions about that population.

A confidence interval tells you the chance that a value falls within a certain range, e.g. a 95% CI means there is a 5% chance it will not fall within the range.
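A sketch of a 95% confidence interval for a mean, using a made-up sample of heights and SciPy's normal distribution:

```python
import numpy as np
from scipy import stats

sample = np.array([1.5, 1.6, 1.7, 1.8, 1.6, 1.7])  # heights in metres
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean

# 95% CI under a normal approximation: mean ± 1.96 * standard error.
low, high = stats.norm.interval(0.95, loc=mean, scale=sem)
print(low, high)
```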

Hypothesis Testing (z-test)

The higher the confidence level, the higher the critical z-score. (E.g. you can be very confident that the mean height of people falls within the range 1.4 m to 1.8 m, which is highly probable, but that also means a higher margin of error and thus a higher z-score.)

one-tailed test, two-tailed test

level of significance (the rejection threshold, split across both tails in a two-tailed test) == alpha level

A z-test compares the mean of one sample against a hypothesised value.

A t-test compares the means of two samples.

An ANOVA test (Analysis of Variance) compares the means of more than two samples.

A chi-square test compares categorical variables.

Statistically significant means you should be able to get the same result when you repeat the test on another sample (it must produce the same results when tested over and over again).

The larger the sample size (i.e. 30 samples and above), the more likely you will get a normal distribution.

Outliers are normally kept in the data, unless you are really sure an outlier is due to an error (e.g. a value recorded wrongly).

To determine whether data is normal:

- central limit theorem (> 30 sample size)
- chi-square test for normality
- plot a graph and see if the shape resembles a normal distribution

The IQR (interquartile range) allows you to identify outliers: values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR are treated as outliers.
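A sketch of the IQR rule on made-up data:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 95])  # 95 looks suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)
```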

Session 3

Scipy — https://docs.scipy.org/doc/

Statsmodel z-test — https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.ztest.html?highlight=ztest

When do we use z-test?

Population standard deviation is known

Data should be normally distributed(ie. sample size >30 → central limit theorem)

statsmodels' `ztest` gives you the p-value and z-statistic for the entire sample.

`scipy.stats` (e.g. `zscore`) gives you a value for each individual observation in the sample.
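A sketch of both views on a made-up sample. SciPy has no built-in one-sample z-test, so the whole-sample statistic is computed by hand here (statsmodels' `ztest` does something similar, but estimates the SD from the sample); `scipy.stats.zscore` then standardises each individual value:

```python
import numpy as np
from scipy import stats

sample = np.array([104, 110, 98, 105, 112, 101, 99, 107])
pop_mean, pop_sd = 100, 15  # hypothesised mean and known population SD

# One z-statistic and p-value for the entire sample.
z = (sample.mean() - pop_mean) / (pop_sd / np.sqrt(len(sample)))
p = 2 * stats.norm.sf(abs(z))  # two-tailed p-value
print(z, p)

# One z-score per individual value.
print(stats.zscore(sample))
```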

When do we use t-test?

When population variance/SD is unknown.

When determining the difference between the means of two groups of samples.

Data should be normally distributed

For small sample sizes (< 30).
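A two-sample t-test on two small made-up groups, using `scipy.stats.ttest_ind`:

```python
import numpy as np
from scipy import stats

group_a = np.array([5.1, 4.9, 5.3, 5.0, 5.2])
group_b = np.array([4.6, 4.8, 4.5, 4.7, 4.9])

# Compares the means of the two groups; samples are small and assumed normal.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)
```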

Use an ANOVA test when comparing data from more than two groups.

ANOVA test == F-test

The test returns an F-value, e.g. F-value = 7.1210194… (it plays the same role as the t-score and z-score).
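An ANOVA on three made-up groups with `scipy.stats.f_oneway` (the scores are invented, so the F-value differs from the session's):

```python
from scipy import stats

g1 = [85, 86, 88, 75, 78]
g2 = [81, 83, 87, 90, 79]
g3 = [97, 91, 94, 92, 96]

# One F-value and one p-value for the comparison of all three group means.
f_value, p_value = stats.f_oneway(g1, g2, g3)
print(f_value, p_value)
```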

Chi-square is for data with categories.
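A chi-square test of independence on a made-up 2×2 contingency table, using `scipy.stats.chi2_contingency`:

```python
from scipy import stats

# Rows and columns are two categorical variables (e.g. group vs preference).
observed = [[30, 10],
            [20, 40]]

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(chi2, p, dof)
```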

The p-value is the probability of obtaining test results at least as extreme as those observed (i.e. at the tails of the bell curve), assuming the null hypothesis is true. A lower p-value means you reject the null hypothesis (i.e. < 0.05).

`rvalue` → the correlation coefficient of your values

`matplotlib` → a Python library for drawing graphs