devops
database
elastic-stack
elastic-kibana
jupyter
Analytics - Wikiwand
Data Analytics Reference Stack | Clear Linux* Project
Data Science Timeline - Noteworthy - The Journal Blog
Data Analyst VS Data Scientist – What's the Difference?
Software analytics - Wikiwand
Web analytics - Wikiwand
IT operations analytics - Wikiwand
Session (web analytics) - Wikiwand
Behavioral analytics - Wikiwand
Not to be confused with User Behavioral Analytics, which is used in a security context for threat detection
Business intelligence - Wikiwand
Cohort analysis - Wikiwand
10 Steps To Get You Started With Behavioral Analytics
Six Ways to Create Better Customer Behavior Analytics | Datameer
What is Operational Analytics? - Definition from Techopedia
Operations Analytics | Coursera
First, data (logs or events emitted by applications and services) must be collected and stored in a data store.
How to Learn Math for Data Science: A Roadmap for Beginners - KDnuggets
4 free maths courses to do in quarantine and level up your Data Science skills | by Gonzalo Ferreiro Volpi | Towards Data Science
Machine Learning and Data Science free online courses to do in quarantine | Towards Data Science
Training for Data Analysts | Microsoft Learn
Training for Data Engineers | Microsoft Learn
Training for Data Scientists | Microsoft Learn
The Best Free Data Science eBooks - Towards Data Science
10 Free Data Science Books For 2025 - KDnuggets
Introducing Application Insights Analytics - Brian Harry's Blog
Apache Hadoop Ecosystem and Open Source Big Data Projects | Hortonworks ❗!important
Prefect Docs Workflow Orchestration For Resilient Data Platforms
101 Machine Learning Algorithms for Data Science with Cheat Sheets | R-bloggers
7 Open Source Data Science Projects | Machine Learning Projects
Use cases
Online transaction processing - Wikiwand OLTP
What is OLTP (online transaction processing)? - Definition from WhatIs.com
Online analytical processing - Wikiwand OLAP
What is OLAP (online analytical processing)? - Definition from WhatIs.com
Hybrid transactional/analytical processing - Wikiwand NoSQL/NewSQL databases can serve this purpose
RTA
Data warehouse - Wikiwand
Extract, transform, load - Wikiwand
ETL
ETLs vs ELTs: Why are ELTs Disrupting the Data Market? | by SeattleDataGuy | Coriers | Mar, 2021 | Medium
A good nudge trumps a good prediction - O'Reilly Radar
Whether prediction should be user friendly or business friendly
Stream Architecture
What is Stream Processing? - data Artisans
How a Stream Works - DZone Big Data
What is a Streaming Database?
The state can be built from events
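A minimal sketch of the note above, assuming a toy account-balance domain: the current state is never stored directly but is derived by folding the event log from the beginning. The `Event` type, `apply_event`, and `replay` are hypothetical names used for illustration only and do not come from any of the linked articles.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Event:
    """One immutable fact in the log (hypothetical schema)."""
    account: str
    kind: str    # "deposit" or "withdraw"
    amount: int


def apply_event(state: dict, event: Event) -> dict:
    """Return a new state with a single event applied; the input is not mutated."""
    delta = event.amount if event.kind == "deposit" else -event.amount
    new_state = dict(state)
    new_state[event.account] = new_state.get(event.account, 0) + delta
    return new_state


def replay(events) -> dict:
    """Rebuild the state by folding every event over an empty state."""
    return reduce(apply_event, events, {})


events = [
    Event("alice", "deposit", 100),
    Event("bob", "deposit", 50),
    Event("alice", "withdraw", 30),
]

print(replay(events))  # {'alice': 70, 'bob': 50}
```

Replaying a prefix of the log yields the state as of that point in time, which is one reason an append-only log can serve as the source of truth.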
"Turning the database inside out with Apache Samza" by Martin Kleppmann - YouTube
"Transactions: myths, surprises and opportunities" by Martin Kleppmann - YouTube
Streaming Architecture with Ted Dunning | Software Engineering Daily
Spark: Batch first, then stream; ELT job, working set in memory
Flink: Stream first, then batch; exactly-once event processing
Streaming pipeline:
Type | Example | Storage Media | Usage |
---|---|---|---|
Message bus | Redis, Kafka | RAM, Disk | low latency data ingest |
Datalake | S3/HDFS | Disk | high capacity low cost long term storage |
Data warehouse | Elasticsearch | RAM | data structuring and indexing, fast interactive query |
Database | MySQL, MongoDB | RAM, Disk | data access with indexing |
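The roles in the table can be sketched with in-memory stand-ins, assuming a toy log-record shape with a `level` field. Here a `deque` plays the message bus, a plain list plays the data lake (raw, append-only), and a dictionary keyed by `level` plays the indexed store. None of this uses a real Kafka, S3, or Elasticsearch client; `ingest` and `drain` are hypothetical helper names.

```python
import json
from collections import defaultdict, deque


def ingest(bus: deque, record: dict) -> None:
    """Producer side: serialize a record and put it on the message bus."""
    bus.append(json.dumps(record))


def drain(bus: deque, lake: list, index: dict, key: str = "level") -> int:
    """Consumer side: archive each raw message, then index the parsed record."""
    processed = 0
    while bus:
        raw = bus.popleft()
        lake.append(raw)                   # cheap long-term storage of the raw payload
        record = json.loads(raw)
        index[record[key]].append(record)  # structured copy for fast lookups
        processed += 1
    return processed


bus, lake, index = deque(), [], defaultdict(list)
ingest(bus, {"level": "error", "msg": "disk full"})
ingest(bus, {"level": "info", "msg": "started"})
ingest(bus, {"level": "error", "msg": "timeout"})
drain(bus, lake, index)

print(len(lake), [r["msg"] for r in index["error"]])  # 3 ['disk full', 'timeout']
```

Keeping the raw payload in the lake and only a derived, indexed copy in the query store means the index can be rebuilt later with a different key without re-collecting the data.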
Apache Flink vs. Apache Spark - DZone Big Data
Apache Flink: Does the world need another streaming engine? | ZDNet
Choose your real-time weapon: Storm or Spark? | InfoWorld
ksqlDB: The database purpose-built for stream processing applications.
Apache Flink: Scalable Stream and Batch Data Processing
How Netflix Optimized Flink for Massive Scale on AWS
Why Apache Flink - data Artisans
Apache Kafka
Apache Kafka - Hortonworks
Kafka Design Patterns with Gwen Shapira | Software Engineering Daily
Best Practices for Apache Kafka® in Production: Confluent Online Talk Series - Confluent
How to install Kafka using Docker - ITNEXT
Apache Kafka, Data Pipelines, and Functional Reactive Programming with Node.js | Heroku
Apache Kafka Crash Course - YouTube
Top 10 Problems When Using Apache Kafka - Pandio
Apache Pulsar Apache Pulsar is an open-source distributed pub-sub messaging system
Comparing Apache Kafka and Apache Pulsar | by Jaroslaw Kijanowski | SoftwareMill Tech Blog
7 Reasons We Chose Apache Pulsar over Apache Kafka | DataStax
5 More Reasons to Choose Apache Pulsar over Kafka | DataStax
Apache NiFi
Apache NiFi - Hortonworks
Apache Storm
Apache Storm - Hortonworks
Apache Storm: Architecture - DZone Big Data
Apache Spark™ - Unified Analytics Engine for Big Data
Apache Spark - Hortonworks
Spark and Streaming with Matei Zaharia | Software Engineering Daily
Apache Spark Tutorials - Frank Kane - YouTube
Apache Spark 2 using Python 3 - YouTube
Spark SQL: An Introductory Guide - DZone Big Data
We interrupt this revolution: Apache Spark changes the rules of the game | ZDNet
Apache Beam
Apache Beam - Wikiwand
A unified stream API that abstracts the underlying processing engine, such as Flink, Spark, or Dataflow
Beam is introducing a framework through which APIs in languages other than Java can be supported, and Python is the first one.
Cloud Dataflow - Stream & Batch Data Processing | Google Cloud
Hadoop and Spark: A tale of two cities | ZDNet
The Streaming Database | Materialize
Batch Architecture
Apache Hadoop
Big Data: What is Hadoop - An Easy Explanation For Absolutely Anyone
Is Hadoop Officially Dead?
Why is Hadoop dying? | Packt Hub
Big data
The Data Science Venn Diagram — Drew Conway
The Third Wave Data Scientist – Towards Data Science
Data Skeptic
A data cleaner's cookbook - About
Chris Albon
OpenRefine | OpenRefine
OpenRefine/OpenRefine: OpenRefine is a free, open source power tool for working with messy data and improving it
Pachyderm - Scalable, Reproducible Data Science
Containerized data analytics at scale, with Minio and Pachyderm
Data Science eBook by Analyticbridge - 2nd Edition - Data Science Central
Extracting value from the IoT - O'Reilly Radar
Collecting data and loading it into a data warehouse is not sufficient. You also need capabilities for accessing, modeling, and analyzing your data.
Awesome Data Science Repository - Data Science Central
Nyandwi/machine_learning_complete: A comprehensive machine learning repository containing 30+ notebooks on different concepts, algorithms and techniques.
PredictionIO Open Source Machine Learning Server
The Art and Science of Data-Driven Journalism
Kaggle: Your Home for Data Science
Introduction to Data Science
Explore Your Data: The Fundamentals of Network Analysis
Design vs. Data: Enemies or Friends? how to evolve and extend a code base.
Cathy O'Neil on Weapons of Math Destruction | EconTalk | Library of Economics and Liberty crucial decisions based on machine-learned statistics are unreliable because no one really knows how the algorithm works
An expert's guide to big data storage architecture
Big data tutorial: Everything you need to know
Apache
a49a/bigdata-sql-benchmark: Flink, Presto, Trino TPC-DS benchmark
Apache Iceberg The open table format for analytic datasets, supports SQL and Spark, Trino, Flink, Presto engine
Apache Airflow data pipeline in Python, SQL-like query
A Practical Guide to Modern Airflow - KDnuggets
Datasets
Fueling the Gold Rush: The Greatest Public Datasets for AI
Data Asset eXchange – IBM Developer
Open Data Kit
Computer Vision Datasets
Access Free Google Cloud Public Dataset with Python
Datasets – Google Research
Dataset Search
Find Open Datasets and Machine Learning Projects | Kaggle
Google just published 25 million free datasets - Towards Data Science
COCO - Common Objects in Context
An Introduction to the COCO Dataset
資料一線通 | DATA.GOV.HK
Open Data Hong Kong - 香港開放數據 | Hong Kong's Open Data community
g0vhk.io - Home | Facebook
70 Amazing Free Data Sources You Should Know
Datasets for Data Mining and Data Science
Downloading The Kinetics Dataset For Human Action Recognition in Deep Learning
Analysis of the MRNet Knee MRI dataset | The Startup
Label Studio Open-source data labeling, annotation and exploration tool
Business Analytics
Commercial
Big Data Integration and Analytics | Hitachi Vantara
Business Intelligence and Analytics | Tableau Software
Introduction to Tableau - Learn The Part - Medium
Data Visualization | Microsoft Power BI
Get started with Power BI in 15 minutes! Once I get serious, even I scare myself ~ - YouTube
The 5 best self-service BI tools compared | CIO
Open source
Apache Superset (incubating) — Apache Superset documentation
Redash helps you make sense of your data | Redash
Metabase
Easy analytics with Grafana, Postgres, and Kubernetes.
Data Generation
JavaScript
Chance
chancejs/chancejs: Chance - Random generator helper for JavaScript
aharris88/awesome-ipsum
dejavu1987/jabber: Simple random word / paragraph / lorem ipsum / dummy text generator.
Marak/faker.js
drewbrokke/chance-token-replacer
ngneat/falso: All the Fake Data for All Your Real Needs 🙂
adleroliveira/dreamjs: A lightweight json data generator.
json-schema-faker/json-schema-faker: JSON-Schema + Faker
danibram/mocker-data-generator
Python
Welcome to Faker’s documentation! — Faker documentation
joke2k/faker: Faker is a Python package that generates fake data for you.
Mimesis: Fake Data Generator — Mimesis 17.1.0 documentation
lk-geimfari/mimesis: Mimesis is a robust data generator for Python that can produce a wide range of fake data in multiple languages.
Getting Started with Mimesis: A Modern Approach to Synthetic Data Generation
chris1610/barnum-proj: Python application for generating pseudo-random data
Data Processing
Tabula: Extract Tables from PDFs
香港地址解析器 Hong Kong Address Parser
Data Analytics Reference Stack | Clear Linux* Project
AugLy: A new data augmentation library to help build more robust AI models
facebookresearch/AugLy: A data augmentations library for audio, image, text, and video.
Data Build Tool/dbt
What is dbt? | dbt Developer Hub
Transform Your Data Like a Pro With dbt (Data Build Tool) - DEV Community
DataStation
DataStation | The Data IDE for Developers
multiprocessio/datastation: Easily query, script, and visualize data from every database, file, and API.
multiprocessio/dsq: Commandline tool for running SQL queries against JSON, CSV, Excel, Parquet, and more.
Python
JavaScript
Danfo.js Documentation - Danfo.js Pandas for JavaScript
Hello from Scikit.js | Scikit.js Scikit Learn for JavaScript
JSdata
Crossfilter Pandas for JavaScript
How to Create an Interactive Dashboard with Crossfilter and Dc.Js
scijs
ndarray
Implementing Multidimensional Arrays in JavaScript | 0 FPS
tidy.js
tidy.js – Intro & Demo / Peter Beshai / Observable
C
Articles on Mathematics, Physics and Computer Science
muparser - fast math parser library
Go
DataFrames in Go with gota, qframe, and dataframe-go - MungingData
gonum
plot package - gonum.org/v1/plot - pkg.go.dev
tobgu/qframe: Immutable data frame for Go
go-gota/gota: Gota: DataFrames and data wrangling in Go (Golang)
rocketlaunchr/dataframe-go: DataFrames for Go: For statistics, machine-learning, and data manipulation/exploration
Rust
Vector | A lightweight, ultra-fast tool for building observability pipelines
Polars
Polars
Pandas vs. Polars: A Syntax and Speed Comparison | by Leonie Monigatti | Jan, 2023 | Towards Data Science
Why Polars uses less memory than Pandas
Polars for initial data analysis, Polars for production
Replacing Pandas with Polars. A Practical Guide. - Confessions of a Data Guy
Working With Python Polars – Real Python
Polars for Pandas Users: A Blazing Fast DataFrame Alternative - KDnuggets