Blog posts

2025

Level Up Your Data Science Workflow: Standardizing Projects with Cookiecutter and Git

9 minute read

Published:

I’m in job-hunting mode at the moment, and one thing has become crystal clear: presenting a portfolio of projects in a clean, professional, and industry-standard format is crucial. My own journey of wrangling personal projects – juggling data, code, notebooks, models, and results – highlighted the need for better organization and reproducibility. How do you transform scattered scripts and notebooks into something easily understandable and verifiable by potential employers or collaborators? The answer lies in standardized project structures and robust version control.

Automating Insight: Bash Scripting, Command-Line Power Tools, and Data Querying

6 minute read

Published:

Behind every robust data pipeline or analytics project lies a powerful foundation of automation and efficient data handling. While high-level tools like SQL engines and data visualization platforms get much of the spotlight, it’s often the low-level tools—like Bash scripts, rsync, find, and others—that keep the data world running smoothly.

My Dive into the Sepsis Challenge: Can Data Help Us Fight Back?

8 minute read

Published:

Sepsis. The word itself carries a weight of urgency. Learning that this condition, recognized as a global health priority by the World Health Assembly, is essentially our body’s own defense system going haywire in response to an infection – leading to potential widespread damage and even death [1] – really struck a chord with me. Millions affected globally each year, and the stark reality that every hour of delayed treatment increases mortality risk [2]… it’s a problem screaming for solutions.

Neural Network Force Fields for Molecular Dynamics Simulations: A Comprehensive Review

29 minute read

Published:

In recent years, there has been a surge of research applying advanced machine learning, such as Neural Networks, to classical Molecular Dynamics and force-field parameterization. Since this has been the topic of my PhD work, I wanted to explore the field and try to summarize recent advances. For this I wanted to test the Gemini Research feature. Thus, this post is written by Gemini, and the results are very interesting.

From Molecules to Manufacturing: Understanding Storage Protocols and Modern Data Architecture

6 minute read

Published:

In today’s data-driven world, choosing the right storage protocol and architecture is crucial for performance, scalability, and cost efficiency. Over the years, I’ve worked with various storage systems—from NoSQL databases during my PhD to CRM systems, data warehouses, and data lakes in manufacturing. In this post, I’ll break down key storage protocols (NFS, SMB, S3) and explain the differences between data warehouses, data lakes, and the emerging data lakehouse paradigm.

Beyond the SQL Basics - Mastering Advanced SQL Constructs

4 minute read

Published:

For data scientists and analysts, basic SQL queries are just the starting point. To truly unlock the power of databases and perform complex analyses, you need to delve into advanced constructs. This blog post explores five essential techniques: Subqueries, Common Table Expressions (CTEs), Views, Temporary Tables, and Create Table As Select (CTAS). These tools enable you to write more efficient, readable, and powerful SQL code.
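To give a flavor of two of those constructs, here is a minimal sketch using Python’s built-in sqlite3 module and a small hypothetical sales table (the table, column names, and values are made up for illustration): a CTE computes per-region totals that the outer query filters on, and the same query is then materialized with CTAS.

```python
import sqlite3

# In-memory database with a small hypothetical sales table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EU", 100.0), ("EU", 200.0), ("US", 50.0)])

# CTE: name an intermediate aggregation, then query it like a table
query = """
WITH region_totals AS (
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
)
SELECT region, total FROM region_totals WHERE total > 100
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('EU', 300.0)]

# CTAS: persist the same result as a new table in one statement
conn.execute("CREATE TABLE big_regions AS " + query)
```

The CTE keeps the aggregation logic readable and reusable within one statement, while CTAS turns that result into a table other queries can hit directly.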

Data Pipelines basics - the backbone of data apps

6 minute read

Published:

Data is everywhere, like a river flowing into a city. But raw data, like river water, isn’t always ready to use. We need to clean it, process it, and get it where it needs to go so it can be helpful. That’s why data pipelines are important.

Building a Siamese CNN for Fingerprint Recognition: A Journey from Concept to Implementation

5 minute read

Published:

The idea for this project stemmed from a collaboration with my friend Jovan on his Bachelor’s thesis. His concept was to use a Siamese Convolutional Neural Network (Siamese CNN) for fingerprint recognition. This blog outlines how we implemented it in Python and Keras, covering dataset augmentation, the Siamese architecture, and model validation. You can explore the project’s Git repo and Jupyter Notebooks.

Easy-to-Implement AI Voice and Video Agents with LiveKit: A Straightforward Approach

6 minute read

Published:

In this blog post, we will discuss the implementation of AI-powered voice and video agents using the LiveKit platform. Our experience demonstrates that setting up these agents is a straightforward process, especially with the comprehensive documentation and tutorials available on the LiveKit website. We have successfully implemented two versions of these agents: one focused solely on voice interaction and another that incorporates both voice and visual assistance.

2024

A/B testing - principles and practicalities on how to set up the experiment

12 minute read

Published:

One of the most common requirements for Data Scientists nowadays is A/B testing, also known as split testing. At its core an application of the scientific method, A/B testing is a fundamental technique in data-driven decision-making, widely used in marketing, product development, and user experience optimization. It involves comparing two versions of a webpage, email campaign, or app feature to determine which performs better based on key metrics such as conversion rates, engagement, or revenue. By systematically testing variations and analyzing statistical significance, businesses can make informed changes that improve user experience and drive growth.
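As a taste of the statistical side, here is a minimal sketch of a two-proportion z-test, a common way to check significance of a conversion-rate difference between variants A and B. The function and the conversion counts are hypothetical, and it uses only the standard library:

```python
from math import sqrt, erf

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical experiment: 10% vs 13% conversion on 1,000 users each
z, p = two_proportion_ztest(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")  # p < 0.05: significant at the usual threshold
```

In practice you would also pre-register the metric and compute the required sample size before running the test, rather than peeking at the p-value as data arrives.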

Crafting a Standout Data Analyst Portfolio

4 minute read

Published:

A well-constructed portfolio is your golden ticket to showcasing your data analysis skills and landing your dream job. It serves as a window into your expertise, showing potential employers not just what you’ve done but how you think, solve problems, and communicate results. Let’s break down the essential elements of a standout data analyst portfolio and explore how to build one.

Analyzing Manufacturing Data - Cpk and Six Sigma

16 minute read

Published:

I’ve been working in Automotive manufacturing for more than a year now, and there is one concept that is the holy grail in this industry: Six Sigma. It’s a methodology for achieving near-perfect quality in manufacturing. But how can you leverage Six Sigma tools right from your Python environment? That’s where the manufacturing package comes in.
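For readers unfamiliar with the metric, here is a minimal sketch of computing Cpk by hand with the standard library (the sample data and spec limits are hypothetical, and this is the textbook formula rather than the manufacturing package’s own API): Cpk is the distance from the process mean to the nearest specification limit, expressed in units of three standard deviations.

```python
import statistics

def cpk(samples, lsl, usl):
    """Process capability index: distance from the mean to the nearest
    spec limit, divided by three standard deviations."""
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)  # sample standard deviation
    return min(usl - mu, mu - lsl) / (3 * sigma)

# Hypothetical shaft diameters (mm) with spec limits 9.7-10.3
samples = [10.02, 9.98, 10.05, 9.95, 10.01, 9.99, 10.03, 9.97]
print(f"Cpk = {cpk(samples, lsl=9.7, usl=10.3):.2f}")
```

A Cpk comfortably above the commonly cited 1.33 threshold indicates a capable, well-centered process; the further below the spec window the variation sits, the higher the index.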

SQL - The most used tool among Data Scientists and Analysts

7 minute read

Published:

In the world of data analysis, SQL (Structured Query Language) is the bedrock. It’s the language that allows you to communicate with databases, extracting, manipulating, and analyzing data with precision. Whether you’re a seasoned analyst or just starting your journey, a solid grasp of SQL is essential for uncovering meaningful patterns and driving data-informed decisions. This blog post will cover the fundamental concepts of SQL, basic query structures, and the software tools that empower data analysts. But first, let’s answer what a database is and what the two main categories of databases are.

Mastering the Art of Data Cleaning

6 minute read

Published:

As a data analyst, I’m often asked where I spend most of my work time, besides scrolling through the internet. When people think about data analysis, they often imagine building predictive models or creating dazzling visualizations. But beneath the surface of every successful data project lies an essential yet often underestimated step: data cleaning. This critical process lays the foundation for trustworthy insights, making it one of the most valuable skills for any data professional.

Becoming a Data Scientist in 2024: Roadmap

3 minute read

Published:

As the data landscape continues to evolve in 2024, becoming a data scientist requires a multi-faceted approach, encompassing coding, mathematical proficiency, data analysis, and machine learning. This guide outlines a comprehensive roadmap to becoming a proficient data scientist, integrating essential skills and modern practices such as working with Large Language Models (LLMs) and prompt engineering.