
Data Analytics (Python)

September 29, 2023
September 21, 2016

Home - PyData

An Overview of Python’s Datatable package – Towards Data Science
Algorithmic Trading with Python – Free 4-hour Course With Example Code Repos
Algorithmic Trading Using Python - Full Course - YouTube
Hello! — Practical Data Science

Python Data Analysis Library (pandas): massages data into a tabular state so it can be modeled
Spyder - Documentation Scientific PYthon Development EnviRonment

Python Data Transformation Tools for ETL - Towards Data Science
Data Preprocessing in Data Mining & Machine Learning
Data Preprocessing in Python - Towards Data Science
Assessing the Quality of Data - Towards Data Science
Five Command Line Tools for Data Science - Towards Data Science
Data Scientists, The 5 Graph Algorithms that you should know
Why and How to Use Pandas with Large Data - Towards Data Science

Top 10 Python Libraries for Data Science - Towards Data Science
5 essential Python programming tools for data science—now updated
Weekend Reading: Python | Linux Journal science and ML
Oktoberfest : Quick analysis using Pandas, Matplotlib, and Plotly

Python Data Science Handbook | Python Data Science Handbook ❗ important
pydata/pydata-cookbook: PyData Cookbook Project

Data Analysis with Dr Mike Pound - YouTube Computerphile
Data Analysis with Python - Full Course for Beginners (Numpy, Pandas, Matplotlib, Seaborn) - YouTube

Try Docker image for Intel® Distribution for Python* | Intel® Software

NumPy

NumPy — NumPy
NumPy - Wikiwand
Numpy and Scipy Documentation — Numpy and Scipy documentation
NumPy User Guide — NumPy Manual
Indexing — NumPy Manual
Quickstart tutorial — NumPy Manual

numpy.reshape — NumPy v1.21 Manual
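Axis 0: rows (first index; reducing along axis 0 gives one result per column)
Axis 1: columns (second index; reducing along axis 1 gives one result per row)
Axis 2: third index of a 3-D array (depth/height, depending on how the data is laid out)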
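
The axis numbering is easy to mix up, so here is a minimal sketch with made-up arrays (`m`, `a`, and the sums are illustrative names, not from the NumPy manual). In a 2-D array, axis 0 indexes rows, and summing along axis 0 collapses the rows to give one value per column, which may be why axis 0 is sometimes remembered as "columns".

```python
import numpy as np

# 2-D array with shape (2, 3): 2 rows (axis 0) and 3 columns (axis 1)
m = np.arange(6).reshape(2, 3)      # [[0, 1, 2], [3, 4, 5]]

col_sums = m.sum(axis=0)            # collapses the rows: one sum per column
row_sums = m.sum(axis=1)            # collapses the columns: one sum per row

# 3-D array with shape (2, 3, 4): axis 2 is the last (third) index
a = np.arange(24).reshape(2, 3, 4)
last_axis_sums = a.sum(axis=2)      # resulting shape is (2, 3)
```

The shape tuple is the most reliable guide: axis `k` always has length `shape[k]`.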

Welcome to numpy-ml — numpy-ml documentation
ddbourgin/numpy-ml: Machine learning, in numpy

Free Deep Learning Tutorial - Deep Learning Prerequisites: The Numpy Stack in Python V2 | Udemy
How to create NumPy arrays from scratch? - Towards Data Science
A Visual Intro to NumPy and Data Representation – Jay Alammar – Visualizing machine learning one concept at a time
Numpy Guide for People In a Hurry – Towards Data Science
Reshape numpy arrays—a visualization | Towards Data Science
The Easiest Python Numpy Tutorial Ever - Towards Data Science
A Complete Beginners Guide to Matrix Multiplication for Data Science with Python Numpy | by Chris I. | Towards Data Science
Array Oriented Programming with Python NumPy | by Semi Koen | Towards Data Science
NumPy Crash Course: Array Basics - Towards Data Science
27 NumPy Operations for beginners - Towards Data Science
10 quick Numpy tricks that will make life easier for a data scientist | by Harsh Maheshwari | Jun, 2021 | Towards Data Science
Look Ma, No For-Loops: Array Programming With NumPy – Real Python
np.linspace(): Create Evenly or Non-Evenly Spaced Arrays – Real Python
Python Numpy Array Tutorial (article) - DataCamp
NumPy indexing explained. NumPy is the universal standard for… | by Àlex Escolà Nixon | Towards Data Science

Deep Learning Prerequisites: The Numpy Stack in Python (V2+) | Udemy free
Deep Learning Prerequisites: The Numpy Stack in Python Extra Resources - Lazy Programmer

A NumPy affair: Broadcasting - Towards Data Science

Count value

python - Frequency counts for unique values in a NumPy array - Stack Overflow

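import numpy as np

def count_unique(keys):
    """Return the sorted unique values in keys and how often each occurs."""
    uniq_keys = np.unique(keys)
    bins = uniq_keys.searchsorted(keys)  # index of each key within uniq_keys
    return uniq_keys, np.bincount(bins)  # count occurrences of each index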
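
As a shorter alternative, `np.unique` can return the counts directly through its `return_counts` argument. A minimal sketch, with a made-up `keys` array for illustration:

```python
import numpy as np

# Hypothetical sample data
keys = np.array(["b", "a", "b", "c", "b", "a"])

# return_counts=True yields the sorted unique values and their frequencies
uniq_keys, counts = np.unique(keys, return_counts=True)

# Pair each unique value with its count
freq = dict(zip(uniq_keys.tolist(), counts.tolist()))
```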

Numpy on GPU

CuPy
Here’s how to use CuPy to make Numpy 700X faster - Towards Data Science

FilipeMaia/afnumpy: A GPU-ready drop-in replacement for numpy.

Apache MXNet (incubating) Documents — deepnumpy documentation
NDArray - Scientific computing on CPU and GPU — mxnet documentation

Shohei Hido - CuPy: A NumPy-compatible Library for GPU - PyCon 2018 - YouTube slide
CuPy: A NumPy Compatible Library for High Performance Computing with GPU | SciPy 2019 | SciPy 2019 | - YouTube

SciPy

SciPy.org — SciPy.org
SciPy - Wikiwand

SciPy is an open-source Python library for scientific and technical computing. It is built on NumPy and lets Python programmers manipulate and visualize data with a wide range of high-level commands. SciPy is popular in mathematics, science, and engineering.

Numpy and Scipy Documentation — Numpy and Scipy documentation
Optimizing complex simulations? Use Scipy interpolation | by Tirthajyoti Sarkar | Oct, 2021 | Towards Data Science

Linear Algebra in Python: Matrix Inverses and Least Squares – Real Python

Numba

Numba: A High Performance Python Compiler
numba/numba: NumPy aware dynamic Python compiler using LLVM
3. Numba for CUDA GPUs — Numba documentation
Numba: High-Performance Python with CUDA Acceleration | NVIDIA Developer Blog
Numba: Tell those C++ bullies to get lost | SciPy 2016 Tutorial | Gil Forsyth & Lorena Barba - YouTube

Make Python code 1000x Faster with Numba - YouTube
Accelerating Scientific Workloads with Numba - Siu Kwan Lam - YouTube
How to Accelerate an Existing Codebase with Numba | SciPy 2019 | Siu Kwan Lam, Stanley Seibert - YouTube

Numba: High-Performance Python with CUDA Acceleration | Svelte Hacker News
Python Numba or NumPy: understand the differences - Towards Data Science
Run Your Python User Defined Functions in Native CUDA Kernels with RAPIDS cuDF | by Jiqun Tu | RAPIDS AI | Medium

Dask: Scalable analytics in Python
Data Pre-Processing in Python: How I learned to love parallelized applies with Dask and Numba

xarray

xarray: N-D labeled arrays and datasets in Python — xarray documentation
better API to address columns, akin to pandas

Rapids

Open GPU Data Science | RAPIDS
Getting Started | RAPIDS
rapidsai/cudf: cuDF - GPU DataFrame Library

Python Pandas at Extreme Performance - Towards Data Science
GPU Accelerated Data Analytics & Machine Learning - Towards Data Science
Here’s how you can accelerate your Data Science on GPU
Here’s how you can speedup Pandas with cuDF and GPUs

XGBoost Documentation — xgboost documentation gradient boosting model
A Gentle Introduction to XGBoost for Applied Machine Learning
Introduction to XGBoost in Python
New Features and Optimizations for GPUs in XGBoost 1.1

abhishekkrthakur/autoxgb: XGBoost + Optuna

[QST] Can cuml and cudf installed on nvidia jetson tx1/tx2/nano? · Issue #665 · rapidsai/cuml

Pandas

Python Data Analysis Library — pandas: Python Data Analysis Library
User Guide — pandas documentation
API Reference — pandas documentation

Hannah Stepanek - Thinking like a Panda: Everything you need to know to use pandas the right way. - YouTube
Pandas Cheat Sheet: Data Science and Data Wrangling in Python - KDnuggets
Python-Pandas cheat sheet: 30 functions-methods | by Jyoti Kumar | Aug, 2022 | Medium
pandas - Getting started with pandas | pandas Tutorial

Learn Python, Data Science & Machine Learning with expert instruction
Pandas Tutorials – Dunder Data – Medium

Explore Your Dataset With Pandas – Real Python
Finding Temporal Patterns in Twitter Posts: Exploratory Data Analysis with Python | by Dmitrii Eliuseev | May, 2023 | Towards Data Science
Full Stack Pandas. Lesser known functionality of the… | by Sayar Banerjee | Towards Data Science
Pandas Makes Python Better. Something I’ve wanted to talk about for… | by Emma Boudreau | Towards Data Science
10 Python Skills They Don’t Teach in Bootcamp | Towards Data Science

Improve pandas performance with eval and query | Python in Plain English
Using numba to make pandas operations faster | Towards Data Science

Stylin’ with Pandas - Practical Business Python
Efficiently Cleaning Text with Pandas - Practical Business Python

pandas-profiling/pandas-profiling: Create HTML profiling reports from pandas DataFrame objects

Validation

pandera can use Pydantic model syntax for schema
How to Use Pandas With Pandera to Validate Your Data in Python - YouTube
ArjanCodes/2023-pandera

Modin

Scale your pandas workflow by changing a single line of code — Modin documentation
modin-project/modin: Modin: Speed up your Pandas workflows by changing a single line of code

Tutorials

Python Pandas Tutorial
Python - Data Science Tutorial - Tutorialspoint
Time Series Tutorial - Tutorialspoint
Examining Data Using Pandas | Linux Journal
Introduction to Pandas | Machine Learning, Deep Learning, and Computer Vision
Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects – Real Python
3 Excel Functions and How to Do Them in Python! | Towards Data Science
Top 3 Pandas Functions I Wish I Knew Earlier | by Dario Radečić | Towards Data Science

Pandas Foundations | DataCamp

Exploratory Data Analysis using Python | ActiveState
Quick and Dirty Data Analysis with Pandas

Pandas: The Swiss Army Knife for Your Data, Part 1
Pandas: The Swiss Army Knife for Your Data, Part 2

Video series: Easier data analysis in Python using the pandas library

Applying Statistics in Python — part I - Towards Data Science
Applying Statistics in Python — part II - Towards Data Science

Why Are We Teaching Pandas Instead of SQL? | HackerNoon compares Pandas and SQL

An End-to-End Project on Time Series Analysis and Forecasting with Python

The Easiest Data Cleaning Method using Python & Pandas
Seven Clean Steps To Reshape Your Data With Pandas Or How I Use Python Where Excel Fails

DataFrame

Intro to Data Structures — pandas documentation

Series:
1-D array with an index as its axis labels
similar to a dict whose keys are the index
mostly compatible with NumPy's ndarray

DataFrame:
2-D labeled data, like a table or a dict of Series keyed by column name
index holds the row labels, columns holds the column (field) labels
not intended to work as a 2-D ndarray
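
A minimal sketch of the two structures, using made-up values and labels (`s`, `doubled`, and `df` are illustrative names):

```python
import numpy as np
import pandas as pd

# Series: 1-D values with an index, addressable like a dict
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
b_value = s["b"]

# NumPy functions accept a Series and keep its index
doubled = np.multiply(s, 2)

# DataFrame: a dict of Series keyed by column name.
# Rows are aligned on the index; labels missing from a column become NaN.
df = pd.DataFrame({"x": s, "y": pd.Series([1.5, 2.5], index=["a", "b"])})
```

Here `df.index` holds the row labels `a`, `b`, `c`, and `df.columns` holds the column labels `x` and `y`.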

Show all rows

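import pandas as pd

# Globally: None removes the row limit; alternatively set it just above the row count
pd.set_option('display.max_rows', None)
pd.set_option('display.max_rows', df.shape[0] + 1)

# Temporarily, inside one block only
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(df)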

Intro to pandas data structures
Working with DataFrames
Using pandas on the MovieLens dataset

Creating Pandas DataFrames from Lists and Dictionaries - Practical Business Python
Python Dataclasses With Properties and Pandas | by Sebastian Ahmed | The Startup | Medium
SettingWithCopyWarning in Pandas: Views vs Copies – Real Python
Views and Copies in pandas — Practical Data Science
Python Pandas DataFrame: load, edit, view data | Shane Lynn
Combining Data in Pandas With merge(), .join(), and concat() – Real Python

Reshape pandas dataframe | Towards Data Science Convert long to wide with pd.pivot_table
Reshape pandas dataframe in Python | Towards Data Science Convert wide to long with pd.melt
Using the Pandas Data Frame as a Database. - Towards Data Science
Build pipelines with Pandas using “pdpipe” - Towards Data Science
Exploring your data with just 1 line of Python - Towards Data Science
Apply and Lambda usage in pandas - Towards Data Science
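
The long-to-wide and wide-to-long conversions mentioned in the two reshape articles above can be sketched as follows. The data and column names are invented for illustration and are not taken from those articles.

```python
import pandas as pd

# Long format: one row per (city, year) observation
long_df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome", "Rome"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10, 12, 7, 9],
})

# Long -> wide: one row per city, one column per year
# (pivot_table aggregates duplicates with the mean by default)
wide_df = long_df.pivot_table(index="city", columns="year", values="sales")

# Wide -> long: melt the year columns back into rows
back_to_long = wide_df.reset_index().melt(
    id_vars="city", var_name="year", value_name="sales"
)
```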

datas-frame – Modern Pandas (Part 1)
datas-frame – Modern Pandas (Part 2): Method Chaining
datas-frame – Modern Pandas (Part 3): Indexes
datas-frame – Modern Pandas (Part 4): Performance
datas-frame – Modern Pandas (Part 6): Visualization
datas-frame – Modern Pandas (Part 7): Timeseries
datas-frame – Modern Pandas (Part 8): Scaling

Pandas Tutorial 1: Pandas Basics (read_csv, DataFrame, Data Selection)
Pandas Tutorial 2: Aggregation and Grouping
Pandas Tutorial 3: Important Data Formatting Methods (merge, sort, reset_index, fillna)

Data types (dtype)

Overview of Pandas Data Types - Practical Business Python

Categorical data — pandas documentation
Pandas Category Type: Pros and Cons | by Arli | Jan, 2023 | Level Up Coding
Working with Large Data Sets Made Easy: Understanding Pandas Data Types - YouTube

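df.info()  # column dtypes, non-null counts, memory usage

df["A"] = df["A"].astype("category")  # astype returns a new object, so assign it back
df["A"].value_counts()
# timezone-aware datetimes use the dtype string "datetime64[ns, <tz>]", e.g. UTC
df = df.astype({"A": "category", "B": "boolean", "D": "datetime64[ns, UTC]"})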

Indexing/Selection

Tips for Selecting Columns in a DataFrame - Practical Business Python
Python : 10 Ways to Filter Pandas DataFrame
python - How to select rows from a DataFrame based on column values - Stack Overflow

pandas.DataFrame.query — pandas documentation

MultiIndex / advanced indexing — pandas documentation

python - How do I select rows from a DataFrame based on column values? - Stack Overflow

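import numpy as np
import pandas as pd

# Two-level row index (Name, Age); one Age is missing
idx = pd.MultiIndex.from_tuples([('Chris',48), ('Brian',np.nan), ('David',65),('Chris',34),('John',28)],
                                 names=['Name', 'Age'])
col = ['Salary']

df = pd.DataFrame([120000, 140000, 90000, 101000, 59000], index=idx, columns=col)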
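
A minimal sketch of two ways to select rows from a frame like this one: by a value in one index level with `xs`, or by flattening the index into ordinary columns and filtering. The names `chris`, `flat`, and `over_30` are illustrative.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("Chris", 48), ("Brian", np.nan), ("David", 65), ("Chris", 34), ("John", 28)],
    names=["Name", "Age"],
)
df = pd.DataFrame([120000, 140000, 90000, 101000, 59000], index=idx, columns=["Salary"])

# Select every row whose Name level equals "Chris"
chris = df.xs("Chris", level="Name")

# Alternatively, turn the index levels into columns and use a boolean filter
flat = df.reset_index()
over_30 = flat[flat["Age"] > 30]   # rows with a missing Age compare as False
```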

Serialization

IO tools (text, CSV, HDF5, …) — pandas documentation
Serializing pandas DataFrames | Pythontic.com

pandas-datareader — pandas-datareader documentation

How to use Pandas read_html to Scrape Data from HTML Tables

The Best Format to Save Pandas Data | by Ilia Zaitsev | Towards Data Science benchmarks

Tips and Tricks

Python Pandas: Tricks & Features You May Not Know – Real Python
Idiomatic Pandas: Tricks & Features You May Not Know – Real Python
25 Tricks for Pandas
10 Powerful Python Tricks for Data Science you Should Try Today

How to make your Pandas operation 100x faster | by Yifei Huang | Towards Data Science
Pandas tips and tricks. This post includes some useful tips for… | by Shir Meir Lador | Towards Data Science
5 lesser-known pandas tricks. 5 lesser-known pandas tricks that help… | by Roman Orac | Towards Data Science
Pandas and Python Tips and Tricks for Data Science and Data Analysis | by Zoumana Keita | Dec, 2022 | Towards Data Science
Display Customizations for pandas Power Users | by Roman Orac | Towards Data Science
My Python Pandas Cheat Sheet. The pandas functions I use everyday as… | by Chris I. | Towards Data Science
How To Make Your Pandas Loop 71803 Times Faster | by Benedikt Droste | Towards Data Science

Articles: Speed up your data science and scientific computing code

For the Love of God, Stop Using iterrows() – r y x, r df.itertuples(index=False)
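
A minimal sketch of the two usual replacements for `iterrows()`: `itertuples(index=False)` when a Python-level loop really is needed, and a vectorized column expression when it is not. The columns `price` and `qty` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"price": [2.0, 3.5, 1.25], "qty": [3, 2, 8]})

# Row-wise loop: itertuples yields lightweight namedtuples, one per row
totals_loop = [row.price * row.qty for row in df.itertuples(index=False)]

# Vectorized: operate on whole columns at once, no Python loop
df["total"] = df["price"] * df["qty"]
```

Both give the same numbers; the vectorized form is generally the one to reach for first.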


# label each column as "position:name"
col_mapping = [f"{i}:{name}" for i, name in enumerate(df.columns)]

python - Pretty-print an entire Pandas Series / DataFrame - Stack Overflow

Binning Data with Pandas qcut and cut - Practical Business Python

GUI/Visualizer

A GUI for pandas | bamboolib
Introducing Bamboolib — a GUI for Pandas - Towards Data Science
Bamboolib — Learn and use Pandas without Coding - Towards Data Science

man-group/dtale: Visualizer for pandas data structures
dtale · PyPI
D-Tale (house_data)

groupby

groupby — pandas documentation
pandas.core.groupby.DataFrameGroupBy.agg — pandas documentation

Pandas Grouper and Agg Functions Explained - Practical Business Python

Apply Operations To Groups In Pandas
Summarising, Aggregating, and Grouping data in Python Pandas | Shane Lynn

ZaxR/pandas_multiindex_tutorial: An in-depth introduction to Pandas' MultiIndexes and practical code snippets

How to use Pandas Count and Value_Counts | kanoki

python - Get statistics for each group (such as count, mean, etc) using pandas GroupBy? - Stack Overflow make groupby() result a dataframe

# group the dataframe by regiment
gb = df.groupby('regiment')
# for each regiment (reuse the GroupBy object instead of grouping again)
for name, group in gb:
    # print the name of the regiment
    print(name)
    # print the data of that regiment
    print(group)
# make count result a dataframe to add more statistics
counts = gb.size().to_frame(name='counts')
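
To get several statistics per group as one DataFrame in a single step, named aggregation with `agg` is one option. A minimal sketch, assuming a hypothetical `score` column alongside `regiment`:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "regiment": ["Nighthawks", "Nighthawks", "Dragoons", "Dragoons", "Dragoons"],
    "score": [4, 24, 31, 2, 3],
})

# Each keyword becomes an output column: name=(input column, aggregation)
stats = df.groupby("regiment").agg(
    counts=("score", "count"),
    mean_score=("score", "mean"),
    max_score=("score", "max"),
)
```

The result is indexed by `regiment`; call `stats.reset_index()` if the group key should be a regular column.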

filter

python - pandas: filter rows of DataFrame with operator chaining - Stack Overflow
How To Filter Pandas Dataframe By Values of Column? — Python, R, and Linux Tips

df.field == value creates a boolean Series (a mask) that is True for the matching rows
so df[df.field == value] returns a DataFrame containing only those rows

Or use:
pandas.DataFrame.query — pandas documentation
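
A minimal sketch of both filtering styles on made-up data. The column name `field` and the variable `value` mirror the note above; everything else is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"field": ["a", "b", "a", "c"], "n": [1, 2, 3, 4]})
value = "a"

# The comparison produces a boolean Series aligned with df's index
mask = df["field"] == value

# Indexing with the mask keeps only the rows where it is True
filtered = df[mask]

# query() expresses the same filter as a string; @ refers to a Python variable
via_query = df.query("field == @value")
```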

Pivot Tables

Pivot Tables | Python Data Science Handbook
Pandas Crosstab Explained - Practical Business Python
pbpython/crosstab_cheatsheet.pdf at master · chris1610/pbpython

Introducing PivotUI: Never Use Pandas To GroupBy and Pivot Your Data Again | by Avi Chawla | Nov, 2022 | Towards Data Science

Check memory usage

df.memory_usage(deep=True)
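
A minimal sketch of using `memory_usage(deep=True)` to compare a repetitive string column before and after converting it to the `category` dtype. The data is invented, and the exact byte counts depend on the pandas version and platform, so none are shown.

```python
import pandas as pd

# A column with many rows but only three distinct values
df = pd.DataFrame({"A": ["red", "green", "blue"] * 10_000})

# deep=True also counts the memory held by the string objects themselves
before = df["A"].memory_usage(deep=True)

# category stores each distinct value once plus small integer codes per row
after = df["A"].astype("category").memory_usage(deep=True)

print(f"before: {before} bytes, as category: {after} bytes")
```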

#perfmatters

Enhancing Performance — pandas documentation

Scale your pandas workflow by changing a single line of code. — Modin documentation

4 Methods to Optimize Python Code for Data Science

Parallel Pandas – KRSTN concurrent.futures.ProcessPoolExecutor faster than multiprocessing.Pool
How to use Pandas the RIGHT way to speed up your code
Parallelize Pandas map() and apply() while accounting for future records – Adeel's Corner

JAX

JAX: High-Performance Array Computing — JAX documentation
Successor to Autograd, built on XLA

PyTorch

Using PyTorch to accelerate analytics

GPU Accelerated Python - YouTube
Accelerate PyTorch across any distributed configuration

Kedro

Welcome to Kedro’s documentation!
kedro-org/kedro: A Python framework for creating reproducible, maintainable and modular data science code.

pingouin

Installation — pingouin documentation
The new kid on the statistics-in-Python block: pingouin

Input/Output

IO Tools (Text, CSV, HDF5, …) — pandas documentation

For reading: xlrd (XLS), openpyxl (XLSX)
For writing: openpyxl/xlsxwriter (XLSX), PyTables (HDF5)

Three Ways of Storing and Accessing Lots of Images in Python – Real Python

Intake — intake documentation
intake/intake: Intake is a lightweight package for finding, investigating, loading and disseminating data.

PySpark

Welcome to Spark Python API Docs! — PySpark master documentation
First Steps With PySpark and Big Data Processing – Real Python
PySpark and SparkSQL Basics - Towards Data Science