Things are starting to align within this renewed writing project as my content creation process gets back into some semblance of a proper routine. Given the current state of technology, we are getting pretty close to a place where my weekly podcast audio could be produced in a matter of seconds using a model based on my voice. That is not really something that I am considering. I have recorded the last two podcast episodes on my newly acquired MacBook Air with the freely supplied GarageBand software instead of using Audacity on my Windows-powered desktop computer. I’m still using the Yeti X microphone and a Marantz Sound Shield Live professional vocal reflection filter, but the operating system and software being used to record the audio are very different. For scientific purposes, you are welcome to go back and listen to a few of the previous recordings and then check out any episode from 157 forward to see if the audio quality is different. I think the overall quality of the recording is higher with the new setup.
We are going to jump into the deep end of featurization for machine learning this week. To put that into practice, a series of potential next-level featurization techniques will be evaluated. Yes, you guessed it, a new series is forming. Across 7 upcoming editions of this Substack newsletter, weeks 163 to 169, I’m going to try to pull together some solid coverage and include some academic articles to read related to these topics. You know that I strive to find the best open research papers to share. Things that reside behind a paywall, where practitioners and pracademics cannot easily read them, I tend to exclude from these missives. That is a deliberate choice made to favor open research. I’ll be digging into each of these topics in more detail in future missives. On a side note, it’s about time to refresh my open source intro to machine learning syllabus as well [1].
Here are some concepts that strike me as highly promising feature engineering strategies and good places to focus your understanding as the field moves forward:
Self-Supervised Learning: Leveraging large amounts of unlabeled data to automatically learn feature representations (a minimal code sketch follows this list).
Graph-Based Feature Engineering: Utilizing graph neural networks to capture relationships and dependencies in graph-structured data.
Federated Feature Engineering: Creating features in a decentralized manner to enhance privacy and security by keeping data distributed.
Explainable Feature Engineering: Developing features that improve model interpretability and explainability.
Adaptive Feature Engineering: Using dynamic techniques that evolve features based on real-time data and model feedback.
Synthetic Data Generation: Generating synthetic datasets to create new features and augment training data (also sketched in code after this list).
Transfer Learning for Features: Reusing feature representations learned on one domain or task in another, reducing the need for extensive feature engineering in new tasks (also sketched in code after this list).
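Since self-supervised learning kicks off the new series in week 163, here is a minimal sketch of the core idea using a tiny PyTorch autoencoder. The pretext task is simply reconstructing the input, no labels are involved, and the bottleneck activations become the learned feature representation. The random data, layer sizes, and training loop are illustrative assumptions rather than a recommended recipe.

```python
import torch
from torch import nn

# Stand-in for a pile of unlabeled tabular data: 1,000 rows, 20 raw features.
X = torch.randn(1000, 20)

# Tiny autoencoder; the 8-dimensional bottleneck is the learned feature vector.
encoder = nn.Sequential(nn.Linear(20, 8), nn.ReLU())
decoder = nn.Linear(8, 20)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)
loss_fn = nn.MSELoss()

# Pretext task: reconstruct the input from the bottleneck (no labels involved).
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(decoder(encoder(X)), X)
    loss.backward()
    optimizer.step()

# The encoder output can now feed a downstream model as engineered features.
with torch.no_grad():
    features = encoder(X)
print(features.shape)  # torch.Size([1000, 8])
```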
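Synthetic data generation can be sketched just as briefly. Assuming a small tabular classification problem, the snippet below fits a Gaussian mixture per class with scikit-learn and samples new rows to augment the training set; the dataset, component count, and sample sizes are placeholder choices made purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture

# Placeholder "real" dataset standing in for whatever tabular data you have.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

synthetic_X, synthetic_y = [], []
for label in np.unique(y):
    # Fit a simple density model to this class's rows, then sample from it.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X[y == label])
    sampled_rows, _ = gmm.sample(100)
    synthetic_X.append(sampled_rows)
    synthetic_y.append(np.full(100, label))

# Augmented training set: real rows plus synthetic rows.
X_aug = np.vstack([X] + synthetic_X)
y_aug = np.concatenate([y] + synthetic_y)
print(X_aug.shape, y_aug.shape)  # (500, 6) (500,)
```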
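Transfer learning for features is probably the technique most practitioners already use without thinking of it as feature engineering. The sketch below, assuming a reasonably recent torchvision install, drops the classification head from an ImageNet-pretrained ResNet-18 and reuses the frozen backbone as a 512-dimensional feature extractor; the random tensor simply stands in for a batch of preprocessed images.

```python
import torch
import torchvision

# Load an ImageNet-pretrained ResNet-18 and drop its classification head,
# leaving a backbone that maps each image to a 512-dimensional vector.
weights = torchvision.models.ResNet18_Weights.DEFAULT
backbone = torchvision.models.resnet18(weights=weights)
backbone.fc = torch.nn.Identity()  # replace the classifier with a pass-through
backbone.eval()
for param in backbone.parameters():
    param.requires_grad = False  # freeze the weights; we only reuse the features

# Stand-in for a batch of four preprocessed 224x224 RGB images.
batch = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    features = backbone(batch)
print(features.shape)  # torch.Size([4, 512])
```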
These strategies power feature engineering by providing more advanced, adaptive, and interpretable features for cutting-edge machine learning models. Feature engineering is crucial to machine learning for several reasons:
Improves Model Performance and Efficiency: Well-engineered features enhance the predictive power and efficiency of machine learning models, leading to better accuracy and faster convergence during training.
Simplifies Complexity and Enhances Interpretability: Effective feature engineering simplifies the problem space, making models easier to understand and interpret, thereby increasing stakeholder trust in the model's predictions.
Incorporates Domain Knowledge and Handles Diverse Data: Integrating domain-specific knowledge and transforming diverse data types into a consistent format ensures models can process information effectively and produce relevant results.
Addresses Data Quality and Robustness: Feature engineering helps clean and normalize data, handle missing values and outliers, and improve the model's robustness to changes in data distribution and external conditions (a short code sketch follows this list).
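To make that data quality point concrete, here is a minimal scikit-learn sketch that clips an obvious outlier, fills missing values, and normalizes a toy numeric table. The column names, clipping threshold, and imputation strategy are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame with a missing value in each column and one obvious outlier (age 250).
df = pd.DataFrame({
    "age": [34, 41, np.nan, 29, 250],
    "income": [52000, 61000, 58000, np.nan, 57000],
})

# Clip implausible ages before imputing and scaling.
df["age"] = df["age"].clip(upper=120)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # normalize each feature
])
X = pipeline.fit_transform(df)
print(X.round(2))
```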
Now that the foundation has been set for considering featurization within the machine learning space, you can sit back and relax as these topics receive even more evaluation in future editions of this newsletter.
Footnotes:
What’s next for The Lindahl Letter?
Week 160: Increasingly problematic knowledge graph updates
Week 161: Structuring really large knowledge graphs
Week 162: Indexing facts vs. graphing knowledge
Week 163: Self-Supervised Learning
Week 164: Graph-Based Feature Engineering
If you enjoyed this content, then please take a moment and share it with a friend. If you are new to The Lindahl Letter, then please consider subscribing. Stay curious, stay informed, and enjoy the week ahead!
The next level of featurization