
Welcome to the second issue of the datum newsletter. Thank you so much for subscribing! Share it with a few friends if you feel like it.
An article on seizing the opportunity in data quality, published in MIT Sloan Management Review, says:
“The cost of bad data is an astonishing 15% to 25% of revenue for most companies. Two-thirds of these costs can be eliminated by getting in front on data quality.”
The situation is no different for real-time streaming data. I have developed a functional prototype for data quality column profiling using the Kafka Streams API, including test cases, that you can start using straight away in your projects. The project is available on GitHub as kafka-streams-dataquality: use, share, and contribute. Do let me know your feedback.
You can find the blog post here:
Streaming data quality — How to implement column profiling using Kafka Streams?— Data Quality Series
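To give a flavor of what column profiling means here, below is a minimal plain-Java sketch (not the kafka-streams-dataquality project's actual code) of the kind of per-column statistics — count, null count, min, max — that one could maintain inside a Kafka Streams aggregation; the class and column names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-column profile stats of the kind you would
// accumulate inside a Kafka Streams aggregate(). Not the project's real code.
public class ColumnProfile {
    long count = 0;      // total values seen
    long nullCount = 0;  // how many were null
    Double min = null;   // smallest non-null value
    Double max = null;   // largest non-null value

    void update(Double value) {
        count++;
        if (value == null) { nullCount++; return; }
        if (min == null || value < min) min = value;
        if (max == null || value > max) max = value;
    }

    public static void main(String[] args) {
        // Profile one numeric column from a stream of records.
        Map<String, ColumnProfile> profiles = new HashMap<>();
        Double[] priceColumn = {10.0, null, 3.5, 42.0};
        for (Double v : priceColumn) {
            profiles.computeIfAbsent("price", k -> new ColumnProfile()).update(v);
        }
        ColumnProfile p = profiles.get("price");
        System.out.println(p.count + " " + p.nullCount + " " + p.min + " " + p.max);
        // prints: 4 1 3.5 42.0
    }
}
```

In a real Kafka Streams topology, an accumulator like this would be the aggregate state keyed by column name, updated per record and emitted downstream.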
Archives and Recommendations
Understanding tradeoffs in designing real-time streaming analytical applications
There is no good or bad design; instead, there will be many tradeoffs to make and, hopefully, those tradeoffs are good for a particular use…
How do you explain distributing computing and Apache Spark with different levels of complexity
How do you explain Spark distributed computing to a 7-year-old kid, a 9th-grade student, a software engineer (Java), an ETL engineer, a machine learning engineer, and an executive?
Apache Spark performance recipe — Explicitly cache RDD when branching out from parent RDD
The word count example below illustrates the importance of caching the RDD when the RDD lineage breaks/branches out.
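The linked post covers the Spark word count in full; as a rough plain-Java analogy (this is not Spark code and not the post's example), here is the core idea: when two branches consume the same lazy parent without caching, the parent lineage is recomputed once per branch, whereas caching materializes it once.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical plain-Java analogy for Spark RDD caching, not Spark code:
// a lazy "parent" computation that two child branches both consume.
public class CacheDemo {
    public static void main(String[] args) {
        AtomicInteger parentRuns = new AtomicInteger(0);

        // "Parent RDD": an expensive lazy computation.
        Supplier<int[]> parent = () -> {
            parentRuns.incrementAndGet(); // count how often the parent is (re)computed
            return new int[]{1, 2, 3, 4};
        };

        // Without cache(): each branch re-triggers the parent computation.
        int sum = Arrays.stream(parent.get()).sum();              // branch 1
        int max = Arrays.stream(parent.get()).max().getAsInt();   // branch 2
        System.out.println("uncached parent runs: " + parentRuns.get()); // prints 2

        // With cache(): materialize once, reuse across both branches,
        // analogous to parentRDD.cache() followed by the two actions.
        parentRuns.set(0);
        int[] cached = parent.get();
        sum = Arrays.stream(cached).sum();
        max = Arrays.stream(cached).max().getAsInt();
        System.out.println("cached parent runs: " + parentRuns.get()); // prints 1
    }
}
```

The same accounting applies to Spark: without an explicit `cache()`, each action on a child RDD re-executes the parent lineage from the source.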
You are receiving this email because you have subscribed via our website. All the posts are available on the website.
Disclaimer: All the opinions expressed are personal independent thoughts and not to be attributed to my current or previous employers.