“Information is the oil of the 21st century,
and analytics is the combustion engine.” — Peter Sondergaard
Experiential Learning (Badge)

Predicting Returns for PUMA
Tools: Python & Tableau | Sponsor: PUMA
A capstone project that I completed in graduate school.
To predict whether a transaction will be returned (a classification problem), we merged internal and external datasets, handled missing values and outliers, and balanced the training dataset. We then applied one-hot, label, and target encoding, trained tree-based algorithms, tuned hyperparameters, and performed feature selection. Finally, we evaluated model performance, built a dashboard and an interactive interface, and presented our conclusions and recommendations to the sponsor.
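As a rough illustration of one of the encoding steps mentioned above, target encoding replaces each category with a (smoothed) mean of the target within that category. The sketch below uses plain Python and made-up data; the function name, the `smoothing` parameter, and the example values are assumptions for illustration, not the project's actual code.

```python
from collections import defaultdict

def target_encode(categories, targets, smoothing=1.0):
    """Map each category to a smoothed mean of a binary target."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Shrink each category mean toward the global mean so that
    # rare categories do not get extreme encodings.
    return {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }

# Hypothetical data: sales channel and whether the order was returned (1) or not (0).
channels = ["web", "web", "store", "store", "store", "app"]
returned = [1, 0, 0, 0, 1, 1]

encoding = target_encode(channels, returned, smoothing=1.0)
encoded_column = [encoding[c] for c in channels]
```

In practice the encoding should be fitted on the training split only and then applied to validation and test data, otherwise the target leaks into the features.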
AI-Driven Finance Planning Platform

Tools: KNIME & Qualtrics & Wix.com | Sponsor: Dorval & Chorne Financial Advisors
An integrated experiential learning project that I completed in graduate school.
To develop an AI-driven financial planning platform, we preprocessed text data, performed NLP, conducted LDA topic modeling, and used the results to build predictive models. We found that the historical data was messy and time-consuming to clean, and that the resulting models were not precise. To avoid "garbage in, garbage out" and to help the sponsor reach the project's final goal in the future, we proposed a short-term goal, designed a process for collecting quality data, and built a website mockup.
Supervised ML
Predicting Bank Marketing Campaign
A classification project.
EDA | Dummy & One-hot encoding | Oversampling | Feature scaling | 5 models (LR, KNN, DT, RF, GB) | Hyperparameter tuning | Classification evaluation metrics | Future scope
Predicting Genre of Spotify Songs
A multiclass classification project.
EDA | Feature scaling | Oversampling & Undersampling | 5 models (Multiclass LR, KNN, DT, RF, XGB) | Hyperparameter tuning | Classification evaluation metrics | Future scope
Predicting Handwritten Digit
A multiclass logistic regression (softmax regression) project.
Image pixel dataset | 10 Logistic regression models (one per digit) | Softmax function | Classification evaluation metrics
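The idea of combining one linear model per class with a softmax function can be sketched as follows. This is a minimal illustration in plain Python, assuming already-trained weights; the toy weights, the three-class setup, and the function names are hypothetical and much smaller than a real 10-digit, 784-pixel model.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(weights, biases, features):
    """Score each class with its own linear model, then apply softmax."""
    scores = [
        sum(w * x for w, x in zip(class_weights, features)) + b
        for class_weights, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Toy example: 3 classes and 2 features instead of 10 digits and 784 pixels.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
biases = [0.0, 0.0, 0.0]
label, probs = predict_class(weights, biases, [2.0, 1.0])
```

The predicted label is the class with the highest probability; the probabilities themselves are useful for metrics such as log loss.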

Predicting Boston House Price
A regularized regression project.
Linear regression assumptions | Multiple linear regression (OLS) | Multicollinearity | Regularization | Ridge & Lasso regression | Regression evaluation metrics
The GitHub repository will be published soon. Stay tuned!
Predicting Search Volume in Excel
A K-Nearest Neighbors algorithm implemented in Excel, accessible even to nonprogrammers.
Two distance metrics (Euclidean & Manhattan) | Dimensionality of the model (n) | Number of nearest neighbors (k) | Regression evaluation metric (RMSE)
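For readers who prefer code to spreadsheets, the same ingredients (two distance metrics, k nearest neighbors, and RMSE) can be sketched in plain Python. The project itself was built in Excel; the function names and the tiny dataset below are assumptions for illustration only.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k, distance=euclidean):
    """Predict by averaging the targets of the k closest training points."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / k

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical one-feature dataset.
train_X = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [10.0, 20.0, 30.0, 100.0]
prediction = knn_predict(train_X, train_y, (2.4,), k=2, distance=manhattan)
```

Different values of k and different distance metrics can then be compared by computing `rmse` on held-out observations.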
Predicting Customer Churn in Telecom
My first classification project.
EDA | Feature selection | Information values | Multicollinearity problem | Logistic regressions & Decision trees | Optimal probability cutoff | Classification evaluation metrics | Feature importance | Data-driven recommendations
Unsupervised ML
Segmenting Mall Customer
A clustering project.
EDA | Data standardization | K-means & DBSCAN | Optimal clusters | Optimal radius | Clusters & 3D scatter plots | Data-driven targeted business strategies
Visualizing Dimensionality Reduction
A t-SNE (non-linear dimensionality reduction algorithm) project.
Image pixel dataset | Random samples | 784 dimensions to 2 dimensions | Scatter plot
Text Mining
The Big Bang Theory Episode Plot Descriptions
An NLP project.
Corpus & Dictionary | Convert letters to lower case | Remove white space & punctuation & stop words | Customized Stemmer & Stem Completion | Bag of Words | Correlation of frequent terms | Association of a given word | 2-gram analysis | Hierarchical clustering
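Several of the preprocessing steps listed above (lowercasing, removing punctuation, white space, and stop words, then building a bag of words and 2-grams) can be sketched in a few lines of plain Python. This is only an approximation for illustration: the stop-word list is a tiny made-up sample, the example sentence is invented, and stemming and stem completion are left out.

```python
import re
from collections import Counter

# A deliberately tiny, illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "is"}

def preprocess(text):
    """Lowercase, strip punctuation, collapse white space, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    tokens = text.split()                 # split() also collapses extra white space
    return [t for t in tokens if t not in STOP_WORDS]

def bag_of_words(tokens):
    """Count how often each term occurs."""
    return Counter(tokens)

def bigrams(tokens):
    """Return consecutive token pairs (2-grams)."""
    return list(zip(tokens, tokens[1:]))

# Invented example sentence.
tokens = preprocess("Sheldon and Leonard go to the comic store. Sheldon buys a comic!")
counts = bag_of_words(tokens)
pairs = bigrams(tokens)
```

Term counts like these are the starting point for the frequency, association, and clustering analyses named in the list.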

NLP - Text Detection and Correction System
Tools: MS SQL & C#
A capstone project that I completed during my undergraduate studies.
Like a woodpecker picking insects out of a tree, the system picks errors out of text. To diagnose text errors, we used an existing corpus obtained from a government website, processed new input data, conducted n-gram analysis, and built an interactive text detection and correction system.
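One simple way n-gram analysis can support error detection is to flag word pairs that never (or rarely) appear in a reference corpus. The sketch below shows that idea in Python with a made-up two-sentence corpus. The project itself was built with MS SQL and C#, so the function names, the whitespace tokenization, and the `min_count` threshold here are assumptions rather than the system's actual design.

```python
from collections import Counter

def build_bigram_counts(corpus_sentences):
    """Count every consecutive word pair in the reference corpus."""
    counts = Counter()
    for sentence in corpus_sentences:
        tokens = sentence.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts

def flag_suspicious(sentence, bigram_counts, min_count=1):
    """Return word pairs seen fewer than min_count times in the corpus."""
    tokens = sentence.lower().split()
    return [
        pair for pair in zip(tokens, tokens[1:])
        if bigram_counts[pair] < min_count
    ]

# Invented reference corpus and input sentence.
corpus = ["the weather is nice today", "the weather is cold"]
bigram_counts = build_bigram_counts(corpus)
suspicious = flag_suspicious("the whether is nice", bigram_counts)
```

Here both pairs containing "whether" are flagged, which points to the word that most likely needs correction.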
Statistical Inference
Hypothesis Testing and Confidence Intervals
A review of statistical knowledge for data science.
Hypothesis testing & Confidence intervals & their relationship | Other important statistical concepts (central limit theorem, sampling distribution, significance level, p-value)
Obesity in the USA
A hypothesis testing project.
One-sample t-test | Two proportion z-test | F-test | Two-sample t-test | Assumption checking | Null hypothesis & Alternative hypothesis | Critical value & Test statistic | Significance level & P-value | Hypothesis test graphs
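As an illustration of one of the tests listed above, a two-proportion z-test can be written with only the Python standard library. The counts in the example are invented and are not the obesity data analyzed in the project; the function name is likewise an assumption.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Invented counts: 60 of 200 in group 1 versus 45 of 200 in group 2.
z, p_value = two_proportion_z_test(60, 200, 45, 200)
reject_null = p_value < 0.05  # compare against the chosen significance level
```

The null hypothesis is rejected only when the p-value falls below the significance level, which is equivalent to the test statistic exceeding the critical value.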
Data Warehousing

Data Warehousing and Business Intelligence
An introduction to concepts of data warehouses and business intelligence with examples from my past working experience.
Entity-Relationship Diagram | Key terminologies | Normalization | SQL statements | Window functions | Joins | Star & Snowflake schema | Dimensions & Fact tables | Slowly changing dimensions | Referential integrity actions | OLTP & OLAP
The post will be published soon. Stay tuned!
Other Analytics
Enterprise Analytics
An introduction to advanced Excel skills.
Data Analysis (Descriptive Statistics & Regression) | What-If Analysis (Data Table & Goal Seek) | Solver (Linear programming & Non-linear programming) | IF & SUMPRODUCT | Business scenario problems | Data-driven decisions | Different industries

Leadership in Analytics & Risk Management Analytics
CRISP-DM | Qualitative risk assessment | Quantitative risk assessment | Risk treatment and response plan | Key risk indicators (KRIs)
The post will be published soon. Stay tuned!