Sitemap

A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.

Pages

Posts

The “Agentic” Shift: Moving Beyond Copy-Paste Analytics

8 minute read

Published:

We are entering the age of Agentic Analytics. As I wrote in my previous two posts, the introduction of the Model Context Protocol (MCP) has moved us from “chatting about data” to “deploying agents into our models.” Imagine an AI that doesn’t just suggest a formula, but actually opens your .pbix file, creates your measures, organizes your folders, hides your technical keys, and writes your documentation, all while you watch.

The New Era of Analytics: What MCP Servers Actually Mean for Analytics

14 minute read

Published:

Lately, it feels like we can’t talk about data without AI taking center stage. For Power BI developers, we’re at a turning point—one where AI can actually build your measures, set up relationships, and do real development work. Not by you copying DAX code into your model, but by AI directly modifying your semantic model while you watch.
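
To make this concrete, here is a minimal sketch of an MCP server using the official `mcp` Python SDK; the “semantic-model” server name and the `list_measures` tool are hypothetical placeholders, not a real Power BI integration.

```python
# A minimal MCP server sketch using the official `mcp` Python SDK.
# The "semantic-model" name and the list_measures tool are hypothetical
# placeholders, not a real Power BI integration.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("semantic-model")

@mcp.tool()
def list_measures(table: str) -> list[str]:
    """Return the measure names defined on a table (stubbed for the sketch)."""
    # A real server would introspect the semantic model here.
    return [f"{table}[Total Sales]", f"{table}[Sales YoY %]"]

if __name__ == "__main__":
    mcp.run()  # serves over stdio so an AI client can discover and call tools
```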

How I Prepared for Databricks Certification (And You Can Too)

7 minute read

Published:

I’ll be honest, I’ve always been a bit of a Databricks fanboy. Their approach to solving data problems is just elegant. The way they’ve pushed the lakehouse architecture forward, their genuinely thoughtful implementation of AI features, the speed of their platform, it all just clicks. So when my company asked me to get certified (quotas, you know how it goes), I was actually excited about it.

Business Intelligence Tools in 2026: Enterprise vs. Open-Source

8 minute read

Published:

Business Intelligence (BI) tools have become mission-critical for data-driven organizations. They enable analysts and decision-makers to transform raw data into actionable insights via dashboards, reports, and visual storytelling. The landscape spans from commercial platforms widely adopted in enterprises to open-source engines popular in technical environments. But in 2026, the BI landscape is no longer just about “making charts.” It is a battle between ecosystem-locked giants and the rising tide of open-source flexibility. Whether you are a startup founder in Berlin or a data head at a Fortune 500 in New York, choosing the right tool determines how quickly your data turns into a competitive advantage. Let’s dive into the most-used BI tools. Check out the interactive webpage that I created for this purpose.

DuckDB - Analytics for not-so-big data

7 minute read

Published:

In analytics engineering, tooling discussions are often presented as an either-or choice: use a transactional database like Postgres, or go all-in on a distributed engine like Spark. In practice, though, a huge share of analytical work lives somewhere in the middle. The data easily fits on a laptop or a single VM, but the queries themselves are anything but trivial—wide tables, joins across multiple fact datasets, window functions, and time-based aggregations are the norm.
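
As a rough illustration of that middle ground, here is a minimal sketch that runs a window-function query over a local Parquet file with DuckDB, entirely in-process; the `events.parquet` file and its columns are assumed for the example.

```python
# A window function over a local Parquet file, queried in-process with DuckDB.
# "events.parquet" and its columns are hypothetical.
import duckdb

con = duckdb.connect()  # in-memory database, no server to run
df = con.execute("""
    SELECT
        user_id,
        event_date,
        SUM(amount) OVER (
            PARTITION BY user_id
            ORDER BY event_date
        ) AS running_total
    FROM 'events.parquet'
    ORDER BY user_id, event_date
""").df()
print(df.head())
```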

The Lost Art of Testing Code in the Age of LLMs and Vibe-Coding

6 minute read

Published:

Santiago Valdarrama recently called out a trend many of us have felt firsthand: serious testing is quietly slipping out of software development. As he noted, we’re in an age of impressive demos, where slick presentations often matter more than whether the code is actually solid. This shows up especially when working with Large Language Models (LLMs) or doing what’s commonly called “vibe-coding”—an exploratory, trial-and-error style of writing code.
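
As a reminder of what that discipline looks like, here is a tiny pytest sketch of the kind of test that tends to get skipped; the `apply_discount` function is purely illustrative.

```python
# test_discount.py -- a tiny example of the kind of test that vibe-coding
# tends to skip. The apply_discount function is purely illustrative.
import pytest

def apply_discount(price: float, pct: float) -> float:
    """Return price reduced by pct percent."""
    if not 0 <= pct <= 100:
        raise ValueError("pct must be between 0 and 100")
    return round(price * (1 - pct / 100), 2)

def test_happy_path():
    assert apply_discount(100.0, 25) == 75.0

def test_invalid_percentage_rejected():
    with pytest.raises(ValueError):
        apply_discount(100.0, 150)
```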

Effective Data Management in the AI World

9 minute read

Published:

AI is rapidly transforming industries, but at its core lies a critical foundation: data. The quality, organization, and governance of this data directly impact the success and reliability of AI models. This blog post explores what I know and what I’ve learned about the essentials of data management and its integration with AI, from fundamental concepts to practical principles and crucial considerations around security and privacy.

I Always Forget Git Commands, so I Made This Cheat Sheet for Data Science Collaboration

5 minute read

Published:

As a data scientist who is supposed to be working closely with developers, I constantly find myself forgetting Git commands, especially when switching between feature branches, stashing changes, or pushing to remotes. While Git is integrated into VSCode and offers a visual module for staging, committing, and syncing, I still prefer the command line. It gives me more control and a clearer understanding of what’s happening under the hood.

Level Up Your Data Science Workflow: Standardizing Projects with Cookiecutter and Git

9 minute read

Published:

I’m in job-hunting mode at the moment, and one thing has become crystal clear: presenting a portfolio of projects in a clean, professional, and industry-standard format is crucial. My own journey of wrangling personal projects – juggling data, code, notebooks, models, and results – highlighted the need for better organization and reproducibility. How do you transform scattered scripts and notebooks into something easily understandable and verifiable by potential employers or collaborators? The answer lies in standardized project structures and robust version control.
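
As a minimal sketch of the scaffolding step, here is how a standardized project can be generated with Cookiecutter’s Python API, assuming the widely used cookiecutter-data-science template and a hypothetical project name.

```python
# Scaffolding a standardized project with Cookiecutter's Python API.
# The template URL is the widely used cookiecutter-data-science project;
# the project name is a hypothetical example.
from cookiecutter.main import cookiecutter

cookiecutter(
    "https://github.com/drivendataorg/cookiecutter-data-science",
    no_input=True,  # accept template defaults instead of prompting
    extra_context={"project_name": "churn-analysis"},
)
```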

A/B testing - principles and practicalities on how to set up the experiment

6 minute read

Published:

One of the most consistently expected skills for a Data Scientist today is A/B testing, often referred to as split testing. While it’s sometimes described as a simple optimization technique, in practice it’s much closer to applied science where you have to translate vague business questions into testable hypotheses, design robust experiments, and eventually, turn results into production-ready decisions.
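
To illustrate the analysis end of such an experiment, here is a minimal sketch of a two-proportion z-test with statsmodels; the visitor and conversion counts are made up for the example.

```python
# The final analysis step of a simple A/B test: a two-proportion z-test on
# conversion counts. The counts are invented illustration data.
from statsmodels.stats.proportion import proportions_ztest

conversions = [412, 468]   # control, treatment
visitors = [10000, 10000]

stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Difference is statistically significant at alpha = 0.05")
```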

Neural Network Force Fields for Molecular Dynamics Simulations: A Comprehensive Review

29 minute read

Published:

In recent years, there has been a surge of research in classical Molecular Dynamics and force-field parameterization using advanced machine learning such as neural networks. Since this was the topic of my PhD work, I wanted to explore the field and try to summarize its recent advances. I also wanted to test Gemini’s Research feature, so this post is written by Gemini, and the results are very interesting.

From Molecules to Manufacturing: Understanding Storage Protocols and Modern Data Architecture

6 minute read

Published:

In today’s data-driven world, choosing the right storage protocol and architecture is crucial for performance, scalability, and cost efficiency. Over the years, I’ve worked with various storage systems—from NoSQL databases during my PhD to CRM systems, data warehouses, and data lakes in manufacturing. In this post, I’ll break down key storage protocols (NFS, SMB, S3) and explain the differences between data warehouses, data lakes, and the emerging data lakehouse paradigm.
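
As a small taste of the S3 protocol in practice, here is a minimal boto3 sketch (assuming configured AWS credentials); the bucket and key names are hypothetical.

```python
# A minimal sketch of S3 object storage in practice, using boto3.
# Assumes AWS credentials are configured; bucket and keys are hypothetical.
import boto3

s3 = boto3.client("s3")

# Upload a local file as an object (no filesystem semantics, just key/value)
s3.upload_file("report.parquet", "my-data-lake", "sales/2024/report.parquet")

# List objects under a "prefix" -- S3 has no real directories
resp = s3.list_objects_v2(Bucket="my-data-lake", Prefix="sales/2024/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```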

Automating Insight: Bash Scripting, Command-Line Power Tools, and Data Querying

6 minute read

Published:

Behind every robust data pipeline or analytics project lies a powerful foundation of automation and efficient data handling. While high-level tools like SQL engines and data visualization platforms get much of the spotlight, it’s often the low-level tools—like Bash scripts, rsync, find, and others—that keep the data world running smoothly.

Beyond the SQL Basics - Mastering Advanced SQL Constructs

4 minute read

Published:

For data scientists and analysts, basic SQL queries are just the starting point. To truly unlock the power of databases and perform complex analyses, you need to delve into advanced constructs. This blog post explores five essential techniques: Subqueries, Common Table Expressions (CTEs), Views, Temporary Tables, and Create Table As Select (CTAS). These tools enable you to write more efficient, readable, and powerful SQL code.
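
As a compact illustration, here is a sketch that combines two of these constructs, a CTE and CTAS, using Python’s built-in sqlite3 module; the orders table and its rows are invented for the example.

```python
# A CTE feeding a CTAS, run through Python's built-in sqlite3.
# The orders table and its rows are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 120.0), ("bob", 80.0), ("alice", 45.0)],
)

con.execute("""
    CREATE TABLE customer_totals AS      -- CTAS: persist the result
    WITH totals AS (                     -- CTE: name an intermediate result
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
    )
    SELECT * FROM totals WHERE total > 100
""")
print(con.execute("SELECT * FROM customer_totals").fetchall())
```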

Data Pipelines basics - the backbone of data apps

6 minute read

Published:

Data is everywhere, like a river flowing into a city. But raw data, like river water, isn’t always ready to use. We need to clean it, process it, and get it where it needs to go so it can be helpful. That’s why data pipelines are important.
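
As a minimal sketch of the idea, here is a toy extract-transform-load pipeline in plain Python; all records and the output file are invented for illustration.

```python
# A toy extract-transform-load pipeline mirroring the river analogy:
# take raw records in, clean them, deliver them somewhere useful.
import csv

def extract():
    # Pretend these rows arrived from an upstream source
    return [
        {"city": " Berlin ", "temp_c": "21.5"},
        {"city": "Paris", "temp_c": ""},       # dirty record
        {"city": "Madrid", "temp_c": "28.1"},
    ]

def transform(rows):
    # Clean: strip whitespace, drop records with missing temperature
    return [
        {"city": r["city"].strip(), "temp_c": float(r["temp_c"])}
        for r in rows
        if r["temp_c"]
    ]

def load(rows, path="temperatures.csv"):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["city", "temp_c"])
        writer.writeheader()
        writer.writerows(rows)

load(transform(extract()))
```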

Building a Siamese CNN for Fingerprint Recognition: A Journey from Concept to Implementation

5 minute read

Published:

The idea for this project stemmed from a collaboration with my friend Jovan on his Bachelor’s thesis. His concept was to use a Siamese Convolutional Neural Network (Siamese CNN) for fingerprint recognition. This blog outlines how we implemented it in Python & Keras, covering dataset augmentation, the Siamese architecture, and model validation. You can explore the project’s Git repo and Jupyter Notebooks.
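
For a flavor of the architecture, here is a minimal Keras sketch of the Siamese idea: one shared CNN embeds both images, and the L1 distance between embeddings feeds a similarity score. The input size and layer choices are illustrative, not the exact thesis architecture.

```python
# A minimal Siamese network in Keras: one shared CNN embeds both fingerprint
# images; the L1 distance between embeddings feeds a sigmoid "same finger?"
# score. Shapes and layer sizes are illustrative only.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_embedder(shape=(96, 96, 1)):
    inp = layers.Input(shape=shape)
    x = layers.Conv2D(32, 3, activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    return Model(inp, layers.Dense(128)(x))

embedder = build_embedder()              # built once, shared by both inputs
left = layers.Input(shape=(96, 96, 1))
right = layers.Input(shape=(96, 96, 1))
distance = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))(
    [embedder(left), embedder(right)]
)
score = layers.Dense(1, activation="sigmoid")(distance)

siamese = Model(inputs=[left, right], outputs=score)
siamese.compile(optimizer="adam", loss="binary_crossentropy")
siamese.summary()
```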

Easy-to-Implement AI Voice and Video Agents with Livekit: A Straightforward Approach

6 minute read

Published:

In this blog post, we will discuss the implementation of AI-powered voice and video agents using the Livekit platform. Our experience demonstrates that setting up these agents is a straightforward process, especially with the comprehensive documentation and tutorials available on the Livekit website. We have successfully implemented two versions of these agents: one focused solely on voice interaction and another that incorporates both voice and visual assistance.

My Dive into the Sepsis Challenge: Can Data Help Us Fight Back?

8 minute read

Published:

Sepsis. The word itself carries a weight of urgency. Learning that this condition, recognized as a global health priority by the World Health Assembly, is essentially our body’s own defense system going haywire in response to an infection – leading to potential widespread damage and even death [1] – really struck a chord with me. Millions affected globally each year, and the stark reality that every hour of delayed treatment increases mortality risk [2]… it’s a problem screaming for solutions.

Crafting a Standout Data Analyst Portfolio

4 minute read

Published:

A well-constructed portfolio is your golden ticket to showcasing your data analysis skills and landing your dream job. It serves as a window into your expertise, showing potential employers not just what you’ve done but how you think, solve problems, and communicate results. Let’s break down the essential elements of a standout data analyst portfolio and explore how to build one.

Analyzing Manufacturing Data - Cpk and Six Sigma

16 minute read

Published:

I’ve been working in automotive manufacturing for more than a year now, and there is one concept that is the holy grail in this industry: Six Sigma. It’s a methodology for achieving near-perfect quality in manufacturing. But how can you leverage Six Sigma tools right from your Python environment? That’s where the manufacturing package comes in.
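
For reference, Cpk can be computed directly from its definition, min(USL − μ, μ − LSL) / 3σ; here is a NumPy sketch with invented spec limits and simulated measurements, independent of the manufacturing package.

```python
# Computing Cpk directly from its definition with NumPy. The spec limits
# and sample data are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)
measurements = rng.normal(loc=10.02, scale=0.05, size=500)  # simulated parts

usl, lsl = 10.20, 9.80          # upper/lower specification limits
mu = measurements.mean()
sigma = measurements.std(ddof=1)

cpk = min(usl - mu, mu - lsl) / (3 * sigma)
print(f"Cpk = {cpk:.2f}")        # ~1.33+ is a common acceptance threshold
```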

Anomaly Detection in HTTP Requests: A Machine Learning Approach

5 minute read

Published:

A while back, I was given an interesting assignment: build a model to detect anomalous HTTP requests. The goal was to identify malicious web traffic by analyzing patterns in normal and anomalous requests. This led me to explore the CSIC 2010 dataset, a well-known benchmark for HTTP anomaly detection.
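
One common baseline for this kind of task (not necessarily the exact model from the post) is character n-gram TF-IDF features over raw request strings, scored with an IsolationForest; here is a minimal sketch with invented requests.

```python
# A common anomaly-detection baseline: character n-gram TF-IDF over raw
# request strings, scored with an IsolationForest. Requests are invented
# examples, not CSIC 2010 data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import IsolationForest

normal_requests = [
    "GET /index.html HTTP/1.1",
    "GET /images/logo.png HTTP/1.1",
    "POST /login HTTP/1.1 user=alice",
]
suspicious = ["GET /index.html?id=1%27%20OR%20%271%27=%271 HTTP/1.1"]

vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
X_train = vec.fit_transform(normal_requests)

model = IsolationForest(contamination="auto", random_state=0).fit(X_train)
print(model.predict(vec.transform(suspicious)))  # -1 means flagged as anomaly
```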

SQL - The most used tool among Data Scientists and Analysts

7 minute read

Published:

In the world of data analysis, SQL (Structured Query Language) is the bedrock. It’s the language that allows you to communicate with databases, extracting, manipulating, and analyzing data with precision. Whether you’re a seasoned analyst or just starting your journey, a solid grasp of SQL is essential for uncovering meaningful patterns and driving data-informed decisions. This blog post will cover the fundamental concepts of SQL, basic query structures, and the software tools that empower data analysts. But first, let’s answer what a database is and what the two main categories of databases are.
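
To preview the basic query shape, here is a minimal sketch using Python’s built-in sqlite3 driver; the sales table and its rows are invented for the example.

```python
# The basic query shape (SELECT ... FROM ... WHERE ... GROUP BY), run through
# Python's built-in sqlite3 driver; table and rows are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU", 100.0), ("EU", 250.0), ("US", 300.0)],
)

for region, total in con.execute(
    "SELECT region, SUM(amount) FROM sales WHERE amount > 50 GROUP BY region"
):
    print(region, total)
```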

Mastering the Art of Data Cleaning

6 minute read

Published:

As a data analyst, I’m often asked where I spend most of my work time (besides scrolling through the internet). When people think about data analysis, they often imagine building predictive models or creating dazzling visualizations. But beneath the surface of every successful data project lies an essential yet often underestimated step: data cleaning. This critical process lays the foundation for trustworthy insights, making it one of the most valuable skills for any data professional.
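
As a minimal sketch of what this looks like day to day, here are a few bread-and-butter cleaning steps in pandas on an invented DataFrame.

```python
# A few bread-and-butter cleaning steps in pandas; the DataFrame is
# invented illustration data.
import pandas as pd

df = pd.DataFrame({
    "customer": ["Alice", "alice ", None, "Bob"],
    "signup":   ["2024-01-05", "2024-01-05", "2024-02-11", "not a date"],
    "spend":    ["100", "100", "55", "80"],
})

df["customer"] = df["customer"].str.strip().str.title()      # normalize text
df["signup"] = pd.to_datetime(df["signup"], errors="coerce") # bad dates -> NaT
df["spend"] = pd.to_numeric(df["spend"])                     # fix dtype
df = df.dropna(subset=["customer"]).drop_duplicates()        # gaps & dupes
print(df)
```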

Becoming a Data Scientist in 2024: Roadmap

3 minute read

Published:

As the data landscape continues to evolve in 2024, becoming a data scientist requires a multi-faceted approach, encompassing coding, mathematical proficiency, data analysis, and machine learning. This guide outlines a comprehensive roadmap to becoming a proficient data scientist, integrating essential skills and modern practices such as working with Large Language Models (LLMs) and prompt engineering.

Portfolio

Comparative analysis of most used BI tools

Published:

We looked at the most commonly used BI tools, analyzed their pros and cons, and examined their usage worldwide and by sector. Click the link in the post to visit the webpage.

Publications

Talks

Teaching
