A/B testing - principles and practicalities of setting up the experiment

12 minute read

Published:

One of the most common requirements for Data Scientists nowadays is A/B testing, also known as split testing. At its core a simple application of the scientific method, A/B testing is a fundamental technique in data-driven decision-making, widely used in marketing, product development, and user experience optimization. It involves comparing two versions of a webpage, email campaign, or app feature to determine which performs better on key metrics such as conversion rate, engagement, or revenue. By systematically testing variations and analyzing statistical significance, businesses can make informed changes that improve user experience and drive growth.

The use of A/B testing has expanded beyond simple webpage optimizations to more complex applications, such as dynamic pricing strategies, AI-driven personalization, and multi-armed bandit approaches that optimize in real time. Recent advancements in the field have introduced new methodologies that enhance the efficiency and reliability of experiments. For example, a Stanford article discusses innovations such as sequential testing, which reduces the risk of false positives and allows for faster decision-making, and Bayesian approaches that provide more nuanced insights than traditional frequentist methods. As A/B testing evolves, companies that embrace these techniques can refine their strategies with greater precision, ultimately gaining a competitive edge in their industries.

A/B Testing: Breaking It Down

Understanding Control and Treatment Groups

Every A/B test revolves around two key groups:

  • Control Group (A): This group experiences the existing version of your webpage, app, or feature. It serves as the benchmark for comparison.
  • Treatment Group (B): This group interacts with the modified version, incorporating the change that is hypothesized to improve performance.


Imagine a race:

  • The control group runs in their current shoes.
  • The treatment group wears a newly designed pair, believed to enhance speed.

By analyzing the performance difference between these groups, we can determine whether the change (e.g., new shoes) delivers a measurable impact.

Example: Suppose you’re testing a redesigned call-to-action (CTA) button on your website. Users are randomly assigned to either:

  • Group A (control), where they see the existing button.
  • Group B (treatment), where they see the new button.

By measuring the click-through rate in both groups, you can determine whether the new design significantly improves user engagement.
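To make this concrete, here is a minimal sketch of how the click-through rate per group might be computed with pandas; the events table, its column names, and the values in it are purely illustrative.

```python
import pandas as pd

# Hypothetical experiment log: one row per user exposure,
# with the assigned group and whether the CTA was clicked.
events = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "group":   ["A", "B", "A", "B", "A", "B"],
    "clicked": [0, 1, 0, 1, 1, 0],
})

# Click-through rate = clicks / users, computed per group.
ctr = events.groupby("group")["clicked"].agg(users="count", clicks="sum")
ctr["ctr"] = ctr["clicks"] / ctr["users"]
print(ctr)
```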

The Role of Hypothesis in A/B Testing

A strong A/B test begins with a clear hypothesis—a prediction that guides the experiment.

  • Null Hypothesis (H₀): Assumes no significant difference between the control and treatment groups. Example: “Changing the button color from blue to orange will not impact click-through rates.”
  • Alternative Hypothesis (H₁): Represents the expected change. Example: “Changing the button color from blue to orange will increase click-through rates.”

To validate or reject these hypotheses, statistical tests (such as t-tests, chi-square tests, or ANOVA) assess the significance of the observed differences. If the p-value falls below a predefined significance threshold (e.g., p < 0.05), the null hypothesis is rejected in favor of the alternative.
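As an illustration of this decision rule, here is a minimal sketch of a two-proportion z-test on the CTA example, using statsmodels; the click and user counts are made up for the example.

```python
from statsmodels.stats.proportion import proportions_ztest

# Illustrative numbers: clicks and total users in each group.
clicks = [420, 480]      # conversions in control (A) and treatment (B)
users = [10000, 10000]   # users exposed in each group

# Two-sided z-test for the difference between two proportions.
z_stat, p_value = proportions_ztest(count=clicks, nobs=users)

alpha = 0.05
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the difference in click-through rates is statistically significant.")
else:
    print("Fail to reject H0: no significant difference detected.")
```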


Choosing the Right Metrics

Metrics act as navigational tools, helping you evaluate the success of your experiment.

  • Primary Metric: The key performance indicator (KPI) that directly aligns with business goals. If the objective is increasing conversions, then conversion rate is the primary metric.
  • Secondary Metrics: Additional data points providing context. These might include bounce rate, time on page, or session duration, helping to uncover unintended effects.

By carefully selecting the right metrics and rigorously testing hypotheses, A/B testing enables data-driven decisions that enhance user experience and business outcomes.



Randomization and Sample Size: Building a Reliable Experiment

Think of A/B testing as a coin flip—you’re randomly assigning users to either the control group (A) or the treatment group (B). This ensures fairness, balancing external factors so that any observed differences can be directly attributed to the change being tested.

Why Randomization Is Crucial

Without proper randomization, hidden biases can skew your results. For example, if users self-select into groups, you might end up with early adopters in the treatment group, making a new feature appear more effective than it actually is. Randomization eliminates such biases, making your experiment more reliable.


Determining the Right Sample Size

Your sample size—the number of users in each group—directly affects the statistical power of your experiment, which is the ability to detect a real difference if one exists.

  • Larger sample sizes provide more confidence in your results but require more time and resources.
  • Smaller sample sizes can lead to inconclusive or misleading findings.

So how do you determine the optimal sample size?

Sample Size Calculation for Continuous Metrics

A commonly used formula for calculating sample size is:

$ n = 2 \times \frac{(Z_{\alpha/2} + Z_{\beta})^2 \times \sigma^2}{\Delta^2} $

Where:

  • $n$ = Sample size per group
  • $Z_{\alpha/2}$ = Z-score for the desired significance level (e.g., 1.96 for a 95% confidence level, α = 0.05)
  • $Z_{\beta}$ = Z-score for statistical power (typically 0.84 for 80% power)
  • $\sigma^2$ = Population variance of the metric (estimated from historical data or a pilot test)
  • $\Delta$ = Minimum detectable effect (the smallest meaningful difference you want to detect)

Breaking Down the Formula Components

  • Significance Level (α): The probability of concluding there is a difference when none exists (Type I error, or false positive). A common threshold is 0.05.
  • Statistical Power (1-β): The probability of detecting a true effect if it exists. Typically set at 0.80, meaning an 80% chance of identifying a real difference.
  • Variance (σ²): How much the metric fluctuates among users. Higher variance requires a larger sample size.
  • Minimum Detectable Effect (Δ): The smallest difference that would justify implementing the change.
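Putting the formula into code, here is a minimal sketch of the calculation; the values for σ and Δ are illustrative placeholders you would replace with estimates from your own data.

```python
import math
from scipy.stats import norm

def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Sample size per group for a two-sample comparison of means,
    following n = 2 * (z_alpha/2 + z_beta)^2 * sigma^2 / delta^2."""
    z_alpha = norm.ppf(1 - alpha / 2)   # e.g., 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # e.g., 0.84 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)

# Illustrative example: a metric with standard deviation 10,
# where we want to detect a difference of at least 2 units.
print(sample_size_per_group(sigma=10, delta=2))   # ≈ 393 users per group
```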

Choosing the Right Randomization Unit

Beyond randomization, selecting the right randomization unit ensures accurate results. The unit defines how users are split into control and treatment groups.

  • Users: Each user is assigned to a single version (most common for A/B testing).
  • Sessions: Each visit to the site/app is randomly assigned (users may experience both versions).
  • Pageviews: Each page load is randomized (useful for testing specific UI changes).

How to Choose the Right Randomization Unit

  • Consistency: If the change affects the user journey (e.g., a checkout redesign), randomize at the user level to maintain a consistent experience.
  • Sensitivity: If the experiment involves elements that change frequently (e.g., ads or headlines), session- or pageview-level randomization may be better.

Example: If you’re testing a new checkout process, user-level randomization ensures that each person sees either the old or new version throughout their purchase journey.
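A common way to implement user-level randomization is to hash the user ID together with an experiment name, so the same user always lands in the same group. Here is a minimal sketch; the experiment name and the 50/50 split are illustrative assumptions.

```python
import hashlib

def assign_group(user_id: str, experiment: str = "checkout_redesign_v1") -> str:
    """Deterministically assign a user to 'A' (control) or 'B' (treatment).

    Hashing user_id + experiment name keeps the assignment stable across
    sessions and independent across different experiments.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100          # map the hash to 0-99
    return "A" if bucket < 50 else "B"      # 50/50 split

print(assign_group("user_42"))   # the same user always gets the same answer
```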


Statistical Significance vs. Practical Significance

Once your test is complete, interpreting results correctly is crucial.

  • Statistical Significance: A low p-value (typically < 0.05) indicates that the observed difference is unlikely due to chance.
  • Practical Significance: Even if a result is statistically significant, is it meaningful for your business? Does the benefit outweigh the cost of implementing the change?

A statistically significant change may not always justify action, while a non-significant result with a large potential impact might warrant further exploration.

By balancing both statistical and practical significance, you can make data-driven decisions that maximize business impact.
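One practical way to weigh both kinds of significance is to look at a confidence interval for the lift rather than the p-value alone, and compare it against the smallest effect worth acting on. Here is a minimal sketch; the counts and the 1-percentage-point practical threshold are illustrative.

```python
import math
from scipy.stats import norm

# Illustrative counts: conversions and users per group.
conv_a, n_a = 420, 10000   # control
conv_b, n_b = 480, 10000   # treatment

p_a, p_b = conv_a / n_a, conv_b / n_b
lift = p_b - p_a

# Normal-approximation (Wald) 95% confidence interval for the lift.
se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z = norm.ppf(0.975)
low, high = lift - z * se, lift + z * se

practical_threshold = 0.01   # e.g., act only if the lift is >= 1 percentage point
print(f"Observed lift: {lift:.4f}, 95% CI: [{low:.4f}, {high:.4f}]")
if low > practical_threshold:
    print("Lift is both statistically and practically meaningful.")
elif high < practical_threshold:
    print("Even if real, the lift is too small to justify the change.")
else:
    print("Result is ambiguous: consider cost, risk, or collecting more data.")
```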


Experiment Duration: Finding the Right Timing for Reliable Insights

You’ve developed a hypothesis, identified key metrics, randomized your groups, and calculated the required sample size. Now comes a critical question: How long should your A/B test run?

Striking the Balance: Not Too Short, Not Too Long

An experiment’s duration should be long enough to collect sufficient data for statistical significance but not so long that you risk losing opportunities or harming business performance.

  • Ending too early can lead to misleading conclusions due to insufficient data.
  • Running too long could expose more users to an underperforming variation, leading to lost revenue or engagement.

Key Factors That Influence Experiment Duration

Several factors impact how long your A/B test should run:

1. Traffic Volume

  • High-traffic websites or apps can reach statistical significance faster, sometimes within days.
  • Low-traffic sites may need weeks or even months to gather enough data.

2. Confidence Level and Statistical Power

  • A higher confidence level (e.g., 95% vs. 90%) requires more data, increasing test duration.
  • Stronger statistical power ensures a higher chance of detecting real differences but demands larger sample sizes.

3. Minimum Detectable Effect (MDE)

  • The smaller the effect size you want to detect, the longer the test must run.
  • A 1% conversion rate improvement requires a longer test than detecting a 5% improvement.

4. Seasonality and External Factors

  • Holidays, sales events, or major industry changes can distort results.
  • If your business has seasonal trends, ensure the test runs during a stable period for accurate insights.

Customizing Experiment Duration for Your Needs

There’s no universal formula for test duration—it requires balancing statistical rigor with business priorities. Use sample size calculators and historical traffic data to estimate the ideal timeframe while staying mindful of external influences.
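As a rough planning aid, here is a minimal sketch that converts the required sample size into an estimated duration; the traffic figures and enrollment share are illustrative assumptions.

```python
import math

def estimated_duration_days(sample_size_per_group, daily_visitors,
                            traffic_share=1.0, n_groups=2):
    """Rough estimate of how many days an experiment needs to run.

    sample_size_per_group: output of the sample size formula above
    daily_visitors:        eligible visitors per day
    traffic_share:         fraction of traffic enrolled in the experiment
    n_groups:              number of variants (2 for a simple A/B test)
    """
    total_needed = sample_size_per_group * n_groups
    daily_enrolled = daily_visitors * traffic_share
    return math.ceil(total_needed / daily_enrolled)

# Illustrative example: 3,900 users needed per group, 500 eligible
# visitors per day, 80% of traffic enrolled in the test.
print(estimated_duration_days(3900, daily_visitors=500, traffic_share=0.8))  # ~20 days
```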


Designing Effective A/B Tests: Where Science Meets Strategy

With a strong foundation in A/B testing principles, let’s explore the strategic side—choosing the right test type, identifying impactful changes, and prioritizing experiments for maximum impact.

Types of A/B Tests: Picking the Right Approach

Different types of A/B tests serve different purposes. Here are the most common ones:

1. Simple A/B Tests

  • What it is: Compares a control version (A) with a single variation (B).
  • Best for: Testing straightforward changes like button colors, headlines, or call-to-action text.

2. Multivariate Tests (MVT)

  • What it is: Tests multiple variations of multiple elements simultaneously (e.g., headlines, images, and button colors).
  • Best for: Finding the optimal combination of multiple page elements.
  • Challenge: Requires a larger sample size and complex analysis.

3. Bandit Tests

  • What it is: Uses an adaptive algorithm to dynamically allocate more traffic to better-performing variations.
  • Best for: Rapidly optimizing a feature while reducing exposure to underperforming variations.
  • Challenge: Less statistical rigor compared to traditional A/B tests but ideal for real-time optimization.
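To illustrate the adaptive idea behind bandit tests, here is a minimal Thompson sampling sketch for two variants with binary conversions; the "true" conversion rates exist only to simulate user feedback in the example.

```python
import random

# True (unknown) conversion rates, used here only to simulate feedback.
true_rates = {"A": 0.10, "B": 0.12}

# Beta(1, 1) priors: successes and failures observed so far per variant.
wins = {"A": 0, "B": 0}
losses = {"A": 0, "B": 0}

for _ in range(10_000):
    # Sample a plausible conversion rate for each variant from its posterior
    # and serve the variant with the highest sampled rate.
    sampled = {v: random.betavariate(wins[v] + 1, losses[v] + 1) for v in true_rates}
    chosen = max(sampled, key=sampled.get)

    # Simulate the user's response and update that variant's posterior.
    if random.random() < true_rates[chosen]:
        wins[chosen] += 1
    else:
        losses[chosen] += 1

print(wins, losses)  # traffic drifts toward the better-performing variant
```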

What to Test: Identifying the Most Impactful Changes

A/B testing can be applied across various elements of a product or service. Some high-impact areas include:

  • Headlines & Copy: Test different headlines and calls to action to maximize engagement.
  • Page Layout & Design: Experiment with different layouts, colors, and UI elements to improve usability.
  • Pricing & Offers: Optimize pricing models, discounts, or promotions for revenue growth.
  • New Product Features: Validate new features before rolling them out widely.
  • Personalization Strategies: Test content recommendations, tailored user experiences, or targeted messaging.
  • Algorithm Performance: Compare machine learning models, ranking algorithms, or recommendation engines in live environments.

The key is to test what matters most—focus on changes that align with your business goals and user experience improvements.

Prioritizing A/B Tests: Making the Most of Your Resources

You probably have a long list of potential experiments. But with limited time and resources, which tests should come first?

The PIE Framework for Prioritization

One popular framework for deciding which tests to run is PIE, which scores tests based on three factors:

  1. Potential: How much impact could this test have on key metrics?
  2. Importance: How critical is this metric to overall business success?
  3. Ease: How easy is it to implement and analyze this test?

By assigning scores to each factor, you can rank your experiments and focus on the most valuable ones first.
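Here is a minimal sketch of PIE scoring in practice; the candidate tests and their scores are entirely made up for illustration.

```python
# Each candidate test is scored 1-10 on Potential, Importance, and Ease;
# the PIE score is the average of the three.
candidates = [
    {"test": "New CTA button",    "potential": 8, "importance": 7, "ease": 9},
    {"test": "Checkout redesign", "potential": 9, "importance": 9, "ease": 4},
    {"test": "Homepage headline", "potential": 5, "importance": 6, "ease": 10},
]

for c in candidates:
    c["pie"] = (c["potential"] + c["importance"] + c["ease"]) / 3

# Run the highest-scoring experiments first.
for c in sorted(candidates, key=lambda c: c["pie"], reverse=True):
    print(f'{c["test"]:20s} PIE = {c["pie"]:.1f}')
```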


Conclusion: Embrace the Experimentation Mindset

Congratulations! You now have a structured, data-driven approach to A/B testing—from hypothesis formulation to execution and analysis.

Key Takeaways:

✅ Randomization and sample size ensure statistical rigor.
✅ Test duration must balance statistical accuracy with business impact.
✅ Choosing the right A/B test type depends on your goals and data availability.
✅ Prioritizing tests strategically maximizes efficiency and impact.
✅ Statistical significance isn’t enough—practical significance matters too.

A/B testing is not a one-time process—it’s a continuous cycle of learning, iterating, and improving. By fostering a culture of experimentation, you empower your team to make smarter, data-driven decisions that drive real business impact.

Hopefully this will help you design your own A/B tests.