
Data Analytics

June 24, 2025
September 21, 2016

devops
database
elastic-stack
elastic-kibana
jupyter

Analytics - Wikiwand
Data Analytics Reference Stack | Clear Linux* Project
Data Science Timeline - Noteworthy - The Journal Blog
Data Analyst VS Data Scientist – What's the Difference?

Software analytics - Wikiwand
Web analytics - Wikiwand
IT operations analytics - Wikiwand
Session (web analytics) - Wikiwand

Behavioral analytics - Wikiwand
not to be confused with User Behavioral Analytics, used in security context for threat detection
Business intelligence - Wikiwand
Cohort analysis - Wikiwand
10 Steps To Get You Started With Behavioral Analytics
Six Ways to Create Better Customer Behavior Analytics | Datameer

What is Operational Analytics? - Definition from Techopedia
Operations Analytics | Coursera
First, data (logs or events emitted by applications and services) must be collected and stored in a data store.

How to Learn Math for Data Science: A Roadmap for Beginners - KDnuggets
4 free maths courses to do in quarantine and level up your Data Science skills | by Gonzalo Ferreiro Volpi | Towards Data Science
Machine Learning and Data Science free online courses to do in quarantine | Towards Data Science
Training for Data Analysts | Microsoft Learn
Training for Data Engineers | Microsoft Learn
Training for Data Scientists | Microsoft Learn

From unstructured data to actionable intelligence: Using machine learning for threat intelligence - Microsoft Security

The Best Free Data Science eBooks - Towards Data Science
10 Free Data Science Books For 2025 - KDnuggets

Introducing Application Insights Analytics - Brian Harry's Blog

Apache Hadoop Ecosystem and Open Source Big Data Projects | Hortonworks ❗!important

Prefect Docs Workflow Orchestration For Resilient Data Platforms
101 Machine Learning Algorithms for Data Science with Cheat Sheets | R-bloggers
7 Open Source Data Science Projects | Machine Learning Projects

Use cases

OLTP vs. OLAP

Online transaction processing - Wikiwand OLTP
What is OLTP (online transaction processing)? - Definition from WhatIs.com
Online analytical processing - Wikiwand OLAP
What is OLAP (online analytical processing)? - Definition from WhatIs.com
Hybrid transactional/analytical processing - Wikiwand: a NoSQL/NewSQL database can serve this purpose
RTA
Data warehouse - Wikiwand
Extract, transform, load - Wikiwand
ETL
ETLs vs ELTs: Why are ELTs Disrupting the Data Market? | by SeattleDataGuy | Coriers | Mar, 2021 | Medium
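The difference between ETL and ELT can be sketched with Python's built-in sqlite3 module standing in for a warehouse. This is only an illustration under assumptions: the order rows, table names, and helper functions below are hypothetical and do not come from any of the linked articles.

```python
import sqlite3

# Hypothetical raw order events; the fields are illustrative only.
RAW_ORDERS = [
    ("2021-03-01", "alice", "19.99"),
    ("2021-03-01", "bob", "5.00"),
    ("2021-03-02", "alice", "7.50"),
]


def etl(conn):
    """ETL: transform in application code, then load only the result."""
    totals = {}
    for day, _customer, amount in RAW_ORDERS:
        totals[day] = totals.get(day, 0.0) + float(amount)
    conn.execute("CREATE TABLE daily_sales_etl (day TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO daily_sales_etl VALUES (?, ?)", sorted(totals.items())
    )


def elt(conn):
    """ELT: load the raw rows as-is, then transform inside the store with SQL."""
    conn.execute("CREATE TABLE raw_orders (day TEXT, customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ORDERS)
    conn.execute(
        "CREATE TABLE daily_sales_elt AS "
        "SELECT day, SUM(CAST(amount AS REAL)) AS total "
        "FROM raw_orders GROUP BY day"
    )


def daily_totals(conn, table):
    """Read back the per-day totals, rounded to cents for comparison."""
    rows = conn.execute(f"SELECT day, total FROM {table} ORDER BY day").fetchall()
    return [(day, round(total, 2)) for day, total in rows]


conn = sqlite3.connect(":memory:")
etl(conn)
elt(conn)
```

Both paths end with the same daily totals. The practical difference is that the ELT path keeps the raw rows in the store, so new transformations can be added later without re-extracting the data.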

A good nudge trumps a good prediction - O'Reilly Radar

Whether prediction should be user friendly or business friendly

Stream Architecture

What is Stream Processing? - data Artisans
How a Stream Works - DZone Big Data
What is a Streaming Database?

The state can be built from events

"Turning the database inside out with Apache Samza" by Martin Kleppmann - YouTube
"Transactions: myths, surprises and opportunities" by Martin Kleppmann - YouTube
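The idea that state can be built from events can be sketched as a fold over an append-only log. This is a minimal illustration, not code from the talks above; the `Event` type, the event kinds, and the account example are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str      # "deposited" or "withdrawn" (hypothetical event kinds)
    account: str
    amount: int


def apply_event(state, event):
    """Return a new state with one event applied; the input is not mutated."""
    balance = state.get(event.account, 0)
    if event.kind == "deposited":
        balance += event.amount
    elif event.kind == "withdrawn":
        balance -= event.amount
    else:
        raise ValueError(f"unknown event kind: {event.kind}")
    return {**state, event.account: balance}


def replay(events, initial=None):
    """Rebuild state by applying every event in order, starting from a snapshot."""
    state = dict(initial or {})
    for event in events:
        state = apply_event(state, event)
    return state


# Illustrative log: the events are the source of truth, the balances are derived.
LOG = [
    Event("deposited", "a", 100),
    Event("withdrawn", "a", 30),
    Event("deposited", "b", 5),
]
```

Replaying the whole log, or replaying only the tail on top of an earlier snapshot, yields the same state, which is what makes the derived state disposable and rebuildable.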

Streaming Architecture with Ted Dunning | Software Engineering Daily
Spark: Batch first, then stream; ELT job, working set in memory
Flink: Stream first, then batch; exactly-once event processing
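The two mental models can be sketched in plain Python: handling each event as it arrives versus grouping events into small batches first. This does not use the Spark or Flink APIs; the function names and the `sum` handler are hypothetical, chosen only to show the shape of each approach.

```python
from itertools import islice


def per_event(stream, handle):
    """Stream-first: invoke the handler once per event, as soon as it arrives."""
    return [handle([event]) for event in stream]


def micro_batches(stream, size):
    """Group an event stream into consecutive batches of at most `size` events."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def micro_batch(stream, handle, size):
    """Batch-first: invoke the handler once per small batch of events."""
    return [handle(batch) for batch in micro_batches(stream, size)]


EVENTS = [1, 2, 3, 4, 5]  # illustrative payloads
per_event_results = per_event(EVENTS, sum)
micro_batch_results = micro_batch(EVENTS, sum, size=2)
```

The per-event version produces one result per event, so latency is bounded by a single event. The micro-batch version produces fewer, larger results and trades some latency for throughput.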

Streaming pipeline:

Type | Example | Storage Media | Usage
Message bus | Redis, Kafka | RAM, Disk | low-latency data ingest
Datalake | S3/HDFS | Disk | high-capacity, low-cost, long-term storage
Data warehouse | Elasticsearch | RAM | data structuring and indexing, fast interactive query
Database | MySQL, MongoDB | RAM, Disk | data access with indexing

Apache Flink vs. Apache Spark - DZone Big Data
Apache Flink: Does the world need another streaming engine? | ZDNet
Choose your real-time weapon: Storm or Spark? | InfoWorld

ksqlDB: The database purpose-built for stream processing applications.

Apache Flink: Scalable Stream and Batch Data Processing
How Netflix Optimized Flink for Massive Scale on AWS
Why Apache Flink - data Artisans

Apache Kafka
Apache Kafka - Hortonworks
Kafka Design Patterns with Gwen Shapira | Software Engineering Daily
Best Practices for Apache Kafka® in Production: Confluent Online Talk Series - Confluent
How to install Kafka using Docker - ITNEXT
Apache Kafka, Data Pipelines, and Functional Reactive Programming with Node.js | Heroku
Apache Kafka Crash Course - YouTube
Top 10 Problems When Using Apache Kafka - Pandio

Apache Pulsar Apache Pulsar is an open-source distributed pub-sub messaging system
Comparing Apache Kafka and Apache Pulsar | by Jaroslaw Kijanowski | SoftwareMill Tech Blog
7 Reasons We Chose Apache Pulsar over Apache Kafka | DataStax
5 More Reasons to Choose Apache Pulsar over Kafka | DataStax

Apache NiFi
Apache NiFi - Hortonworks

Apache Storm
Apache Storm - Hortonworks
Apache Storm: Architecture - DZone Big Data

Apache Spark™ - Unified Analytics Engine for Big Data
Apache Spark - Hortonworks
Spark and Streaming with Matei Zaharia | Software Engineering Daily
Apache Spark Tutorials - Frank Kane - YouTube
Apache Spark 2 using Python 3 - YouTube
Spark SQL: An Introductory Guide - DZone Big Data
We interrupt this revolution: Apache Spark changes the rules of the game | ZDNet

Apache Beam
Apache Beam - Wikiwand
A stream API that abstracts the underlying streaming engine: the same pipeline can run on Flink, Spark, or Dataflow
Beam is introducing a framework through which APIs in languages other than Java can be supported, and Python is the first one.

Cloud Dataflow - Stream & Batch Data Processing | Google Cloud
Hadoop and Spark: A tale of two cities | ZDNet

Benthos | Benthos

The Streaming Database | Materialize

Batch Architecture

Apache Hadoop
Big Data: What is Hadoop - An Easy Explanation For Absolutely Anyone

Is Hadoop Officially Dead?
Why is Hadoop dying? | Packt Hub

Big data

onurakpolat/awesome-bigdata: A curated list of awesome big data frameworks, ressources and other awesomeness.

The Data Science Venn Diagram — Drew Conway
The Third Wave Data Scientist – Towards Data Science

Data Skeptic
A data cleaner's cookbook - About
Chris Albon

OpenRefine | OpenRefine
OpenRefine/OpenRefine: OpenRefine is a free, open source power tool for working with messy data and improving it

Pachyderm - Scalable, Reproducible Data Science
Containerized data analytics at scale, with Minio and Pachyderm

Data Science eBook by Analyticbridge - 2nd Edition - Data Science Central

Extracting value from the IoT - O'Reilly Radar

Collecting data and loading it into a data warehouse is not sufficient. You also need capabilities for accessing, modeling, and analyzing your data.

Awesome Data Science Repository - Data Science Central
Nyandwi/machine_learning_complete: A comprehensive machine learning repository containing 30+ notebooks on different concepts, algorithms and techniques.

PredictionIO Open Source Machine Learning Server

The Art and Science of Data-Driven Journalism

Comparison of top data science libraries for Python, R and Scala [Infographic] - Data Science Central

Kaggle: Your Home for Data Science
Introduction to Data Science
Explore Your Data: The Fundamentals of Network Analysis

Design vs. Data: Enemies or Friends? How to evolve and extend a code base.

Cathy O'Neil on Weapons of Math Destruction | EconTalk | Library of Economics and Liberty: crucial decisions based on machine-learning statistics are unreliable when no one really knows how the algorithm works

An expert's guide to big data storage architecture
Big data tutorial: Everything you need to know

Apache

a49a/bigdata-sql-benchmark: Flink, Presto, Trino TPC-DS benchmark
Apache Iceberg The open table format for analytic datasets, supports SQL and Spark, Trino, Flink, Presto engine

Apache Airflow data pipeline in Python, SQL-like query
A Practical Guide to Modern Airflow - KDnuggets

Datasets

Fueling the Gold Rush: The Greatest Public Datasets for AI
Data Asset eXchange – IBM Developer
Open Data Kit
Computer Vision Datasets

Access Free Google Cloud Public Dataset with Python

Datasets – Google Research
Dataset Search
Find Open Datasets and Machine Learning Projects | Kaggle
Google just published 25 million free datasets - Towards Data Science

COCO - Common Objects in Context
An Introduction to the COCO Dataset

資料一線通 | DATA.GOV.HK
Open Data Hong Kong - 香港開放數據 | Hong Kong's Open Data community
g0vhk.io - Home | Facebook

70 Amazing Free Data Sources You Should Know
Datasets for Data Mining and Data Science

Downloading The Kinetics Dataset For Human Action Recognition in Deep Learning
Analysis of the MRNet Knee MRI dataset | The Startup

Label Studio Open-source data labeling, annotation and exploration tool

Business Analytics

Commercial

Big Data Integration and Analytics | Hitachi Vantara
Business Intelligence and Analytics | Tableau Software
Introduction to Tableau - Learn The Part - Medium

Data Visualization | Microsoft Power BI
Get started with Power BI in 15 minutes! Once I get serious, I scare even myself ~ - YouTube

The 5 best self-service BI tools compared | CIO


Open source

Apache Superset (incubating) — Apache Superset documentation
Redash helps you make sense of your data | Redash
Metabase

Easy analytics with Grafana, Postgres, and Kubernetes.


Data Generation

JavaScript

Chance
chancejs/chancejs: Chance - Random generator helper for JavaScript

aharris88/awesome-ipsum
dejavu1987/jabber: Simple random word / paragraph / lorem ipsum / dummy text generator.

Marak/faker.js
drewbrokke/chance-token-replacer
ngneat/falso: All the Fake Data for All Your Real Needs 🙂
adleroliveira/dreamjs: A lightweight json data generator.
json-schema-faker/json-schema-faker: JSON-Schema + Faker
danibram/mocker-data-generator

Python

Welcome to Faker’s documentation! — Faker documentation
joke2k/faker: Faker is a Python package that generates fake data for you.

Mimesis: Fake Data Generator — Mimesis 17.1.0 documentation
lk-geimfari/mimesis: Mimesis is a robust data generator for Python that can produce a wide range of fake data in multiple languages.
Getting Started with Mimesis: A Modern Approach to Synthetic Data Generation

chris1610/barnum-proj: Python application for generating pseudo-random data
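What these generators do can be sketched with only the standard library. The sketch below is not the Faker, Mimesis, or barnum API; the word lists, field names, and the `fake_person` / `fake_people` helpers are hypothetical stand-ins showing seeded, reproducible fake records.

```python
import random

# Tiny illustrative vocabularies; real libraries ship far larger, localized ones.
FIRST_NAMES = ["Ada", "Grace", "Alan", "Edsger"]
LAST_NAMES = ["Lovelace", "Hopper", "Turing", "Dijkstra"]
DOMAINS = ["example.com", "example.org"]


def fake_person(rng):
    """Build one fake record from a caller-supplied random generator."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    domain = rng.choice(DOMAINS)
    return {
        "name": f"{first} {last}",
        "email": f"{first}.{last}@{domain}".lower(),
        "age": rng.randint(18, 90),
    }


def fake_people(count, seed=0):
    """Generate `count` records; the same seed always yields the same records."""
    rng = random.Random(seed)
    return [fake_person(rng) for _ in range(count)]


people = fake_people(3, seed=42)
```

Seeding a private `random.Random` instance keeps test fixtures reproducible without touching the global random state.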

Data Processing

Tabula: Extract Tables from PDFs
香港地址解析器 Hong Kong Address Parser
Data Analytics Reference Stack | Clear Linux* Project

AugLy: A new data augmentation library to help build more robust AI models
facebookresearch/AugLy: A data augmentations library for audio, image, text, and video.

Data Build Tool/dbt

What is dbt? | dbt Developer Hub

Transform Your Data Like a Pro With dbt (Data Build Tool) - DEV Community

DataStation

DataStation | The Data IDE for Developers
multiprocessio/datastation: Easily query, script, and visualize data from every database, file, and API.
multiprocessio/dsq: Commandline tool for running SQL queries against JSON, CSV, Excel, Parquet, and more.

Python

data-analytics-python

JavaScript

Danfo.js Documentation - Danfo.js Pandas for JavaScript
Hello from Scikit.js | Scikit.js Scikit Learn for JavaScript
JSdata

Crossfilter Pandas for JavaScript
How to Create an Interactive Dashboard with Crossfilter and Dc.Js

scijs
ndarray
Implementing Multidimensional Arrays in JavaScript | 0 FPS

tidy.js
tidy.js – Intro & Demo / Peter Beshai / Observable

C

Articles on Mathematics, Physics and Computer Science

muparser - fast math parser library

Go

DataFrames in Go with gota, qframe, and dataframe-go - MungingData

gonum
plot package - gonum.org/v1/plot - pkg.go.dev

tobgu/qframe: Immutable data frame for Go
go-gota/gota: Gota: DataFrames and data wrangling in Go (Golang)
rocketlaunchr/dataframe-go: DataFrames for Go: For statistics, machine-learning, and data manipulation/exploration

Rust

Vector | A lightweight, ultra-fast tool for building observability pipelines

Polars

Polars
Pandas vs. Polars: A Syntax and Speed Comparison | by Leonie Monigatti | Jan, 2023 | Towards Data Science
Why Polars uses less memory than Pandas
Polars for initial data analysis, Polars for production
Replacing Pandas with Polars. A Practical Guide. - Confessions of a Data Guy
Working With Python Polars – Real Python
Polars for Pandas Users: A Blazing Fast DataFrame Alternative - KDnuggets