Fei Han
hanfei1986
Senior AI Engineer @ Microsoft | Ex-Amazon | Ex-MIT | Machine Learning and Generative AI
Languages
Repos
46
Stars
14
Forks
1
Top Language
Jupyter Notebook
Top Repositories
When a significant amount of data in highly important features is missing, what can we do? Impute the missing data with the mean or median? In this Jupyter notebook, I demonstrate embedding an XGBoost model in the data transformer to do the imputation.
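A minimal sketch of the model-based imputation idea, not the notebook's actual code: it uses scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost, and the synthetic data and column layout are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Column 2 depends on columns 0 and 1, so a model can recover it.
X[:, 2] = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=500)
mask = rng.random(500) < 0.2
X_missing = X.copy()
X_missing[mask, 2] = np.nan

# Fit a boosted-tree model on rows where the important feature is observed...
obs = ~np.isnan(X_missing[:, 2])
model = GradientBoostingRegressor(random_state=0)
model.fit(X_missing[obs, :2], X_missing[obs, 2])

# ...then predict the missing entries instead of filling in a constant.
X_imputed = X_missing.copy()
X_imputed[~obs, 2] = model.predict(X_missing[~obs, :2])
```

Because the imputer exploits the relationship between columns, its fills track the true values far more closely than a mean or median would.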
When the training data is larger than memory, we can feed it to neural network training in multiple batches. This notebook demonstrates how to do it and visualizes the training and test losses.
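A minimal NumPy sketch of the batch-wise idea (the notebook itself uses a neural network; a linear model is substituted here to keep the example self-contained). The data, learning rate, and batch size are illustrative assumptions; in practice each batch would be read from disk rather than sliced from an in-memory array.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -3.0])
X = rng.normal(size=(10_000, 2))
y = X @ w_true + rng.normal(scale=0.1, size=10_000)

# Mini-batch gradient descent: only one batch is "in memory" at a time.
w = np.zeros(2)
lr, batch_size = 0.1, 256
losses = []
for epoch in range(5):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]  # in a real pipeline, load this batch from disk
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= lr * grad
    losses.append(float(np.mean((X @ w - y) ** 2)))
```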
This Jupyter notebook demonstrates a Recursive Feature Elimination with Cross-Validation (RFECV) feature selection process with a random forest model.
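A condensed sketch of the RFECV process on synthetic data (the dataset and hyperparameters here are assumptions, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
# RFECV drops the weakest feature at each step and scores every subset
# with cross-validation, keeping the best-scoring subset.
selector = RFECV(RandomForestClassifier(n_estimators=50, random_state=0),
                 step=1, cv=3)
selector.fit(X, y)
kept = selector.support_  # boolean mask of the selected features
```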
This Python program is used to pre-process images and recognize characters in them (OCR) with pytesseract in a batch-processing way.
Imbalanced data are common in the real world, especially in anomaly-detection tasks. Handling the imbalance is important; otherwise the predictions are biased towards the majority class. RandomOverSampler, SMOTE, and ADASYN are useful oversampling tools that fabricate data for minority classes and balance the dataset.
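To show the simplest of these strategies without pulling in imbalanced-learn, here is a NumPy sketch of what RandomOverSampler does: resample the minority class with replacement until the class counts match. The 10:1 toy dataset is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(110, 2))
y = np.array([0] * 100 + [1] * 10)  # 10:1 class imbalance

# Draw minority-class rows with replacement until both classes are equal.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=(y == 0).sum() - len(minority), replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
```

SMOTE and ADASYN go one step further and interpolate new synthetic points between minority neighbors rather than duplicating rows.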
https://chatbot-v2.streamlit.app/
Repositories
46
No description provided.
Monte Carlo simulation is a computational technique that uses random sampling and statistical methods to estimate the behavior of complex systems or solve problems. It is particularly useful when dealing with problems that involve a high degree of randomness or complexity.
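A classic minimal example of the technique (not taken from the repository): estimating π by sampling random points and checking how many fall inside the quarter circle.

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.random((200_000, 2))         # uniform points in the unit square
inside = (pts ** 2).sum(axis=1) < 1.0  # inside the quarter circle of radius 1
pi_est = 4 * inside.mean()             # area ratio is pi/4
```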
The travelling salesman problem (TSP) asks the following question: "Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?" In this notebook, I demonstrate the solution of this problem with the genetic algorithm.
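A compact sketch of a genetic algorithm for TSP; the population size, mutation rate, and ordered-crossover operator are illustrative assumptions, not necessarily what the notebook uses.

```python
import numpy as np

rng = np.random.default_rng(0)
cities = rng.random((12, 2))

def tour_length(order):
    pts = cities[order]
    return float(np.linalg.norm(pts - np.roll(pts, -1, axis=0), axis=1).sum())

def crossover(a, b):
    # Ordered crossover: copy a slice of parent a, fill the rest in b's order.
    i, j = sorted(rng.choice(len(a), size=2, replace=False))
    child = -np.ones(len(a), dtype=int)
    child[i:j] = a[i:j]
    child[child < 0] = [c for c in b if c not in a[i:j]]
    return child

pop = [rng.permutation(len(cities)) for _ in range(60)]
for _ in range(150):
    pop.sort(key=tour_length)
    parents = pop[:20]                # keep the fittest tours (elitism)
    children = []
    while len(children) < 40:
        a, b = rng.choice(len(parents), size=2, replace=False)
        child = crossover(parents[a], parents[b])
        if rng.random() < 0.3:        # swap mutation
            i, j = rng.choice(len(child), size=2, replace=False)
            child[i], child[j] = child[j], child[i]
        children.append(child)
    pop = parents + children
best = min(pop, key=tour_length)
```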
BERT is an NLP model developed by Google Research in 2018; since its inception it has achieved state-of-the-art accuracy on several NLP tasks. This notebook demonstrates fine-tuning BERT for sentiment analysis.
This is a CNN tutorial for beginners about a digits recognition model trained on the MNIST dataset. I built two models with TensorFlow/Keras and PyTorch/Skorch respectively.
The travelling salesman problem (TSP) asks the following question: "Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?" In this notebook, I demonstrate the solution of this problem with the simulated annealing algorithm.
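A minimal simulated-annealing sketch for the same problem; the cooling schedule and segment-reversal move are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
cities = rng.random((12, 2))

def tour_length(order):
    pts = cities[order]
    return float(np.linalg.norm(pts - np.roll(pts, -1, axis=0), axis=1).sum())

order = rng.permutation(len(cities))
initial_length = tour_length(order)
temp = 1.0
for step in range(5000):
    i, j = sorted(rng.choice(len(order), size=2, replace=False))
    candidate = order.copy()
    candidate[i:j + 1] = candidate[i:j + 1][::-1]  # 2-opt style reversal
    delta = tour_length(candidate) - tour_length(order)
    # Accept worse tours with probability exp(-delta / temp) so the search
    # can escape local minima; the temperature cools over time.
    if delta < 0 or rng.random() < np.exp(-delta / temp):
        order = candidate
    temp *= 0.999
```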
Calculating semiconductor chip yield against defect density using a Monte Carlo simulation is a common approach to assess the impact of defects on chip manufacturing. In this simulation, we'll randomly generate defect locations and evaluate chip yield based on specified criteria.
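One common way to frame this, sketched here under the standard Poisson yield model (the density and area values are illustrative, not taken from the repository): a chip is good if it has zero defects, so the simulated yield should approach exp(-D*A).

```python
import numpy as np

rng = np.random.default_rng(0)
defect_density = 0.5   # defects per cm^2
chip_area = 1.0        # cm^2
n_chips = 100_000

# Each chip's defect count is Poisson with mean D*A; zero defects = good chip.
defects = rng.poisson(defect_density * chip_area, size=n_chips)
sim_yield = (defects == 0).mean()
analytic_yield = np.exp(-defect_density * chip_area)
```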
This notebook demonstrates the charts I usually plot for exploratory data analysis for regression tasks.
SHAP is a powerful tool for interpreting feature importance in machine learning tasks. This Jupyter notebook gives a demonstration.
A histogram of an image provides valuable insights into the distribution of pixel intensities within that image. This notebook briefly shows how to plot the histogram. Furthermore, we can replot the picture as a heatmap based on its pixel intensities.
When a significant amount of data is missing, what can we do? Impute the missing data with the mean or median? Actually, Scikit-Learn provides two powerful imputers, KNNImputer and IterativeImputer, which can do this work effectively.
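A tiny KNNImputer example on hand-made data (the matrix is an assumption for illustration): the missing value is filled with the mean of its two nearest rows, measured with a NaN-aware Euclidean distance.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])
# The NaN in row 1 is replaced by the mean of the second column of its
# two nearest neighbors (rows 0 and 2): (2 + 6) / 2 = 4.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```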
Monte Carlo integration is particularly useful when dealing with high-dimensional integrals or integrals over complex, irregularly shaped domains where traditional methods may be impractical. It's widely used in various fields, including physics, finance, and engineering, for solving problems involving numerical integration.
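The one-dimensional version of the idea, as a minimal example (not the repository's code): the integral of f over [0, 1] is the expected value of f under uniform sampling, so averaging f at random points estimates it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Estimate the integral of x^2 over [0, 1], whose exact value is 1/3.
x = rng.random(200_000)
estimate = (x ** 2).mean()
```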
Word2Vec is a popular word embedding technique that converts words into vectors in a high-dimensional space, capturing semantic relationships between words. This notebook demonstrates embedding text data with Word2Vec for sentiment analysis.
PCA or truncated SVD reduces dimensionality of data by transforming the data into a lower-dimensional space. In this notebook a chart visualizes how much variance of the original data is picked up in the new components. The data transformation process is also explained.
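A small sketch of the explained-variance view (the correlated toy data is an assumption): when two columns are nearly collinear, the first component captures almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 1))
# Columns 0 and 1 are strongly correlated; column 2 is small noise.
X = np.hstack([base,
               base * 2 + rng.normal(scale=0.1, size=(300, 1)),
               rng.normal(scale=0.1, size=(300, 1))])
pca = PCA(n_components=3).fit(X)
ratios = pca.explained_variance_ratio_  # variance captured per component
```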
Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. Matrix factorization algorithms work by decomposing the user-item interaction matrix into the product of two lower-dimensional rectangular matrices.
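A gradient-descent sketch of the decomposition on a toy rating matrix (the matrix, rank, and learning rate are illustrative assumptions): user and item factor matrices are fitted so their product matches the observed ratings, while zeros (unobserved entries) are masked out of the loss.

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)  # 0 = unobserved
mask = R > 0
k = 2                                      # latent dimensions
P = rng.normal(scale=0.1, size=(5, k))     # user factors
Q = rng.normal(scale=0.1, size=(4, k))     # item factors

lr, reg = 0.01, 0.01
for _ in range(2000):
    E = mask * (R - P @ Q.T)               # error on observed entries only
    P += lr * (E @ Q - reg * P)
    Q += lr * (E.T @ P - reg * Q)
E = mask * (R - P @ Q.T)
rmse = float(np.sqrt((E ** 2).sum() / mask.sum()))
```

The unobserved entries of `P @ Q.T` then serve as rating predictions.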
TFIDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This notebook demonstrates how to embed text data with TFIDF and do sentiment analysis based on it.
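A toy sketch of the pipeline (the four documents and classifier choice are assumptions for illustration): TF-IDF turns each document into a weighted bag-of-words vector, which a linear classifier can then separate.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["great movie, loved it",
        "wonderful acting, great film",
        "terrible plot, awful movie",
        "awful acting, terrible film"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Words frequent in one document but rare across the corpus get high weight.
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)
pred = clf.predict(vec.transform(["loved this wonderful film"]))
```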
ResNet models are lightweight pre-trained computer vision models. This notebook demonstrates how to infer the object in a picture with ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152.
Increase the density of data by interpolation.
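A minimal example of the idea (the sample data is an assumption): `np.interp` fills in linearly interpolated values on a denser grid between sparse samples.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = x ** 2                           # sparse samples of a curve
x_dense = np.linspace(0, 3, 31)      # roughly 10x denser grid
y_dense = np.interp(x_dense, x, y)   # linear interpolation between samples
```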
Linear regression models are widely used in industry for regression tasks as they are straightforward and easy to interpret. To capture non-linear patterns in data, polynomial features need to be added. However, high-degree polynomial features lead to overfitting. To solve the problem, regularization can be added to the loss function.
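A sketch of the combination described above (the sine data, degree, and penalty strength are assumptions): polynomial features capture the non-linearity, while the Ridge (L2) penalty keeps the coefficients small to limit overfitting.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 100).reshape(-1, 1)
y = np.sin(2 * x).ravel() + rng.normal(scale=0.1, size=100)

# Degree-9 polynomial features plus an L2 penalty on the coefficients.
model = make_pipeline(PolynomialFeatures(degree=9), Ridge(alpha=0.1))
model.fit(x, y)
r2 = model.score(x, y)
```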
Usually tree-based and neural network regressors work better for regression tasks than linear regression models, because they can capture complex or subtle non-linear patterns in data.
Keras and Skorch provide wrappers that simplify building neural network models. However, the wrappers sacrifice some of the models' flexibility. In some scenarios, such as early stopping and batch reading, building neural network models from scratch is still very useful.
With the python-pptx library, we can automate the updating of PowerPoint slides.
Two-sample t-test is a statistical hypothesis test used to determine if there is a significant difference between two independent groups. If the p-value is less than the chosen significance level (for example 0.05), you reject the null hypothesis and conclude that there is a significant difference between the groups.
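A short example with SciPy (the group means and sizes are illustrative assumptions): two samples drawn from populations with different means should yield a p-value below 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=10.0, scale=2.0, size=100)
b = rng.normal(loc=12.0, scale=2.0, size=100)

# Null hypothesis: the two groups share the same mean.
t_stat, p_value = stats.ttest_ind(a, b)
significant = p_value < 0.05
```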
A bar chart race is an elegant animation that depicts the progress of multiple categories over time. We can create one in Python.