Ankitsrihbti | Buzz Chronicles

Top 10 Data Science Projects with Python

✔️ 10 Datasets
✔️ 10 Projects with solution

👇🧵

1️⃣ Project: Detecting Spam

✔️ Big email dataset
✔️ 35,000+ spam and ham messages
✔️ Learn how to filter

https://t.co/wvNSeFSbmr

Solution 👇🧵

1️⃣ Solution: Detecting Spam

✔️ How to build a spam filter
✔️ Using Scikit-learn
✔️ Naive-Bayes and
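
A minimal sketch of what a Naive Bayes spam filter in scikit-learn could look like. The six toy messages are made up for illustration, and the linked solution may use a different vectorizer or preprocessing:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus; the real dataset has tens of thousands of messages.
messages = [
    "win free money now",
    "free prize claim now",
    "cheap loans win cash",
    "meeting at noon tomorrow",
    "lunch with the team",
    "project report attached",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Bag-of-words counts feed a multinomial Naive Bayes classifier.
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(messages, labels)

predictions = spam_filter.predict(["free money prize", "team meeting tomorrow"])
print(list(predictions))
```

On real data you would hold out a test split and check precision and recall before trusting the filter.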

2️⃣ Project: Music Recommendation

✔️ Million Song Dataset
✔️ Metadata for a million songs

https://t.co/QgdSdIYnVV

Solution 👇🧵

2️⃣ Solution: Music Recommendation

✔️ Using Tableau
✔️ Collaborative-filtering engine
✔️ Similar to YouTube
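
The linked solution uses Tableau; as a rough illustration of the collaborative-filtering idea itself, here is a small item-based sketch in NumPy. The play-count matrix and the function names are hypothetical:

```python
import numpy as np

# Hypothetical play counts: rows are users, columns are songs.
plays = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def item_similarity(matrix):
    """Cosine similarity between the columns (songs) of a user-song matrix."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for unplayed songs
    normalized = matrix / norms
    return normalized.T @ normalized

def recommend(user_index, matrix, top_n=1):
    """Rank songs the user has not played by similarity-weighted play counts."""
    scores = item_similarity(matrix) @ matrix[user_index]
    scores[matrix[user_index] > 0] = -np.inf  # hide songs already played
    return np.argsort(scores)[::-1][:top_n]

print(recommend(0, plays))
```

User 0 has only played songs 0 and 1, so the unplayed song that co-occurs most with those ranks first.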

Francesco Ciulla
@FrancescoCiull4

Docker run

The `docker run` command is one of the most used commands when we use the docker CLI (Command Line Interface).

What happens when we use it, and which options are used most? Check the docs!

But if you don't have time, let's see it in 2 minutes:

1/12

Under the hood, `docker run` is 2 commands:

docker create + docker start

- a container layer is created on the top of an image
- the container just created is started

It's important to understand this to avoid confusion between docker run, docker start, and docker create

2/12

Docker RUN Common Options:

--name: assign a name
--rm: remove it when it exits
-p: publish ports
-e: set environment variable
-d: run it in the background
--network: connect it to a network
-i: keep stdin open
--mount: set a volume or bind mount
--user: set a user

3/12
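
To see how these options combine on one command line, here is a small Python sketch that only assembles the arguments (it does not call Docker). The helper name `docker_run_command` and the example values are made up:

```python
import shlex

def docker_run_command(image, name=None, remove=False, detach=False,
                       ports=None, env=None, network=None, user=None):
    """Build the argument list for `docker run` from the common options."""
    args = ["docker", "run"]
    if name:
        args += ["--name", name]
    if remove:
        args.append("--rm")
    if detach:
        args.append("-d")
    for host_port, container_port in (ports or {}).items():
        args += ["-p", f"{host_port}:{container_port}"]
    for key, value in (env or {}).items():
        args += ["-e", f"{key}={value}"]
    if network:
        args += ["--network", network]
    if user:
        args += ["--user", user]
    args.append(image)  # the image always comes after the options
    return args

command = docker_run_command(
    "nginx", name="web", remove=True, detach=True,
    ports={8080: 80}, env={"APP_ENV": "dev"},
)
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run` on a machine where Docker is installed.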

Assign a name (--name)

By default, a random name is assigned, but in general it is a good idea to assign a custom name to our containers, as we do to our pets!

4/12

Remove the container when it exits (--rm)

Useful for disposable containers and examples. When a container is stopped, it’s automatically removed. Handy

5/12

Francesco Pochetti
@Fra_Pochetti

All ML projects which turned into a disaster in my career have a single common point:

🚨 I didn't understand the business context first, got over-excited about the tech, and jumped into coding too early.

When someone asks you for a model, always ask:

👉 why do you need it?
👉 what is your current solution (e.g. what is the baseline)?
👉 who is going to use the predictions and how?
👉 what is the impact of the model’s downtime or mistakes?
👉 which metrics do we care about?

Once you have your answers, back them up with a solid exploratory data analysis, and, when done, loop in the biz team again.

This is a critical moment as your results will translate into 3 potential outcomes:

💡 “Really? This is weird. Well, in this case, the ML model doesn’t make much sense anymore”. You are off the hook 🔴
💡 “Interesting. I guess we’ll have to change requirements/scope then.” Course-correct before moving forward 🟠
💡 “This is what I expected. Let’s go ahead”.🟢

Might seem silly, but skip the above and you are all set for failure.
Trust me, I learned it the hard way 😱

Also, always remember that the best model is no model.

Owain Evans
@OwainEvans_UK

Thread on @AnthropicAI's cool new paper on how large models are both predictable (scaling laws) and surprising (capability jumps).
1. That there’s a capability jump in 3-digit addition for GPT3 (left) is unsurprising. Good challenge to better predict when such a jump will occur.

2. The MMLU capability jump (center) is very different b/c it’s many diverse knowledge questions with no simple algorithm like addition.
This jump is surprising and I’d like to understand better why it happens at all.

3. Program Synthesis jump (right) feels like it should be in between 1 and 2. Less diversity than 2 and we can also imagine models grokking certain concepts in programming leading to a jump.

I’d love to see more work on this topic of predictability and surprise and how they relate to forecasting alignment/risk.
Related work:
1. @gwern's list of capability jumps and classic article on

2. @JacobSteinhardt's insightful blog series. https://t.co/0ckLgOgBiV
3. Lukas Finnveden's post on GPT-n extrapolation /scaling on different task

haltakov.eth 🌍 🇺🇦...
@haltakov

Machine Learning in the Real World 🧠 🤖

ML for real-world applications is much more than designing fancy networks and fine-tuning parameters.

In fact, you will spend most of your time curating a good dataset.

Let's go through the process together 👇

#RepostFriday

Collect Data 💽

We need to represent the real world as accurately as possible. If some situations are underrepresented we are introducing Sampling Bias.

Sampling Bias is nasty because we'll have high test accuracy, but our model will perform badly when deployed.

👇

Traffic Lights 🚦

Let's build a model to recognize traffic lights for a self-driving car. We need to collect data for different:

▪️ Lighting conditions
▪️ Weather conditions
▪️ Distances and viewpoints
▪️ Strange variants

And if we sample only 🚦 we won't detect 🚥 🤷‍♂️

👇

Data Cleaning 🧹

Now we need to clean all corrupted and irrelevant samples. We need to remove:

▪️ Overexposed or underexposed images
▪️ Images in irrelevant situations
▪️ Faulty images

Leaving them in the dataset will hurt our model's performance!
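
One of these checks can be automated. A rough sketch, assuming images are NumPy arrays scaled to [0, 1]; the brightness thresholds are arbitrary placeholders that would need tuning on real data:

```python
import numpy as np

def is_well_exposed(image, low=0.1, high=0.9):
    """Return True if the mean brightness of a [0, 1] image is within bounds."""
    mean_brightness = float(np.mean(image))
    return low <= mean_brightness <= high

# Stand-in images: nearly black, mid-gray, nearly white.
images = [np.full((4, 4), 0.02), np.full((4, 4), 0.5), np.full((4, 4), 0.98)]
kept = [image for image in images if is_well_exposed(image)]
print(len(kept), "of", len(images), "images kept")
```

Irrelevant or faulty images usually still need a manual pass or a dedicated classifier.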

👇

Preprocess Data ⚙️

Most ML models like their data nicely normalized and properly scaled. Bad normalization can also lead to worse performance (I have a nice story for another time...)

▪️ Crop and resize all images
▪️ Normalize all values (usually 0 mean and 1 std. dev.)
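
A minimal sketch of the normalization step, assuming a batch of images stored as a NumPy array of shape (N, H, W, C). The random batch is a stand-in for real data:

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.uniform(0, 255, size=(8, 32, 32, 3))  # stand-in batch (N, H, W, C)

# Per-channel statistics, computed over the batch and both spatial axes.
mean = images.mean(axis=(0, 1, 2), keepdims=True)
std = images.std(axis=(0, 1, 2), keepdims=True)

# The small epsilon guards against division by zero for constant channels.
normalized = (images - mean) / (std + 1e-8)

print(normalized.mean(axis=(0, 1, 2)), normalized.std(axis=(0, 1, 2)))
```

In practice the mean and std come from the training set only and are reused unchanged at validation and inference time.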

👇

Jesús López
@jsulopz

@svpino I understand ReLU this way:

1. a neuron is a number in the model

2. not every neuron is important for each label you want to predict

3. if you apply a “linear” activation, redundant neurons will influence the prediction a bit

4. but the ReLU activation won't consider a neuron relevant until it exceeds a threshold

5. therefore removing noise from the predictions

would you agree?

Santiago
@svpino

One of the most popular activation functions used in deep learning models is ReLU.

I asked: "Is ReLU continuous and differentiable?"

Surprisingly, a lot of people were confused about this.

Let's break this down step by step: ↓

Let's start by defining ReLU:

f(x) = max(0, x)

In English: if x <= 0, the function will return 0. Otherwise, the function will return x.

If you draw this function, you'll get the attached chart.

Notice there are no discontinuities in the function.

This should be enough to answer half of the original question: the ReLU function is continuous.
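
The definition translates directly into code. This short sketch also evaluates the function just left and right of 0 to show the two sides meet there:

```python
def relu(x):
    """ReLU: return 0 for x <= 0, otherwise return x."""
    return max(0.0, x)

# Approach 0 from both sides; both values shrink toward relu(0) == 0.
for step in (1e-1, 1e-3, 1e-6):
    print(step, relu(-step), relu(step))
```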

Let's now think about the differentiable part.

A necessary condition for a function to be differentiable: it must be continuous.

ReLU is continuous. That's good, but not enough.

Its derivative should also exist for every individual point.

Here is where things get interesting.

We can compute the derivative of a function using the attached formula.

(I'm not going to explain where this is coming from; you can trust me on this one.)

We can use this formula to see whether ReLU is differentiable.
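
The attached formula is not reproduced here. Assuming it is the usual limit definition, f'(x) = lim h→0 (f(x + h) - f(x)) / h, a quick numerical sketch evaluates the difference quotient at x = 0 from both sides:

```python
def relu(x):
    """ReLU: return 0 for x <= 0, otherwise return x."""
    return max(0.0, x)

def difference_quotient(f, x, h):
    """Slope of f between x and x + h (h may be negative)."""
    return (f(x + h) - f(x)) / h

for h in (1e-1, 1e-3, 1e-6):
    from_left = difference_quotient(relu, 0.0, -h)
    from_right = difference_quotient(relu, 0.0, h)
    print(h, from_left, from_right)
```

The quotient stays at 0 from the left and at 1 from the right no matter how small h gets, so the two one-sided limits disagree at x = 0.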

elvis
@omarsar0

The past month I've been writing detailed notes for the first 15 lectures of Stanford's NLP with Deep Learning. Notes contain code, equations, practical tips, references, etc.

As I tidy the notes, I need to figure out how to best publish them. Here are the topics covered so far:

I know there are a lot of you interested in these from what I gathered 1 month ago. I want to make sure they are high quality before publishing, so I will spend some time working on that. Stay

I've been writing notes for the latest Deep Learning for NLP course by Stanford.

For fun, I also started to add my own code snippets into the notes. I think this is a more efficient way to study: theory + code.

Plan to share these notes soon. Stay tuned! pic.twitter.com/hWzZDORbl6
— elvis (@omarsar0) January 14, 2022

Below is the course I've been auditing. My advice is you take it slow, there are some advanced concepts in the lectures. It took me 1 month (~3 hrs a day) to take rough notes for the first 15 lectures. Note that this is one semester of

I'm super excited about this project because my plan is to make the content more accessible so that a beginner can consume it more easily. It's tiring but I will keep at it because I know many of you will enjoy and find them useful. More announcements coming soon!

NLP is evolving so fast, so one idea with these notes is to create a live document that could be easily maintained by the community. Something like what we did before with NLP Overview: https://t.co/Y8Z1Svjn24

Let me know if you have any thoughts on this.

Mark Tenenholtz
@marktenenholtz

The worst taught skill in machine learning is model validation.

If you can’t validate your models well, you have no idea if they will actually work.

Here are 4 steps I’d take if I was relearning model validation from scratch 🧵

1. Learn the essential evaluation metrics

Think accuracy should be your primary metric? You’re sorely mistaken.

Most of the best metrics instead focus on how far you were from the correct answer. Think RMSE and MAE.

Others point to how well calibrated your model is, like F1.
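
For reference, RMSE and MAE are each a one-liner in NumPy. The numbers below are made up:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 7.0])

errors = y_pred - y_true

# MAE: average absolute distance from the correct answer.
mae = np.mean(np.abs(errors))

# RMSE: like MAE, but squaring penalizes large misses more heavily.
rmse = np.sqrt(np.mean(errors ** 2))

print("MAE:", mae, "RMSE:", rmse)
```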

2. Learn the common forms of cross validation

Before diving in too deep, make sure you understand the basics.

You can’t become an expert in validation in the classroom, but knowing what is out there (simple k-fold, stratified, grouped, roll forward, etc.) is crucial.
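
A quick way to see how these schemes differ is to print the folds they produce. A sketch using scikit-learn's splitters on a toy dataset; the labels and group ids are hypothetical, and "roll forward" is taken here to mean `TimeSeriesSplit`:

```python
import numpy as np
from sklearn.model_selection import (
    GroupKFold, KFold, StratifiedKFold, TimeSeriesSplit,
)

X = np.arange(12).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])  # hypothetical group ids

splitters = {
    "k-fold": (KFold(n_splits=3, shuffle=True, random_state=0), {}),
    "stratified": (StratifiedKFold(n_splits=3), {}),
    "grouped": (GroupKFold(n_splits=3), {"groups": groups}),
    "roll forward": (TimeSeriesSplit(n_splits=3), {}),
}

# Materialize the (train, validation) index pairs for each scheme.
folds = {
    name: list(splitter.split(X, y, **extra))
    for name, (splitter, extra) in splitters.items()
}

for name, pairs in folds.items():
    for train_idx, valid_idx in pairs:
        print(name, "train:", train_idx, "valid:", valid_idx)
```

Stratified folds keep the class balance, grouped folds never split a group across train and validation, and roll-forward folds only validate on samples that come after the training ones.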

3. Read old Kaggle competition solutions

Every day, or multiple times a week, pick an old Kaggle competition.

Read every solution that is posted and skip to their validation schemes.

There are nuances to every dataset, and this is the best way to see how pros navigate them.

4. Build simple models and try different CV schemes

Get a dataset and create a random test set.

Then, build some simple models and switch validation strategies in and out and see how well your models generalize for each scheme.

This will cement the importance of validation.

Max Vladymyrov
@mvladymyrov

I’m excited to share our new paper on HyperTransformers, a novel architecture for few-shot learning able to generate the weights of a CNN directly from a given support set. 🧵👇

📜: https://t.co/vcm67G6P6t with Andrey Zhmoginov and Mark Sandler.

2) We train a transformer model to `convert` a few-shot task description into a small CNN network specialized in solving it on new images.

3) This effectively decouples a high-capacity transformer generator from a much smaller inference model. It is different from most of the existing methods, e.g. MAML where the generator and the executing model share the same architecture.

4) CNN weights are generated layer-by-layer from a combination of layer embedding (features from the last generated layer), and image w/ class embeddings (features directly from the data). The final weights are extracted from output of self-attention (similar to [CLS] tokens).

5) What is cool is that we can also add unlabeled samples from the support set into the mix, effectively allowing for semi-supervised few-shot learning!

Learn Python With Rune
@PythonWithRune

2022 has started 🚀

If you want to
🐍 Learn Python
🧑‍💻 Master Data Science & ML
💰 Use Python for Financial Analysis

Then follow me ✔️

I will share a lot of great content this year 🔥

See my top 10 pages in 2021 👇🧵

No 1

Learn Python - an 8-hour video course

- Includes 17 lessons
- 34 prepared Notebooks
- a FREE eBook

No 2

Calculate the Relative Strength Index (RSI) with pandas
- Learn what RSI is
- Read stock prices with PDR
- How to calculate RSI using pandas
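
A sketch of one common way to compute RSI with pandas, using Wilder-style exponential smoothing. The linked page may use a different smoothing, and the prices below are made up instead of being read with PDR:

```python
import pandas as pd

def rsi(close, period=14):
    """Relative Strength Index with Wilder-style exponential smoothing."""
    delta = close.diff()
    gains = delta.clip(lower=0)        # positive moves, zero otherwise
    losses = (-delta).clip(lower=0)    # size of negative moves, zero otherwise
    avg_gain = gains.ewm(alpha=1 / period, min_periods=period, adjust=False).mean()
    avg_loss = losses.ewm(alpha=1 / period, min_periods=period, adjust=False).mean()
    relative_strength = avg_gain / avg_loss
    return 100 - 100 / (1 + relative_strength)

# Made-up closing prices; in practice they would come from a price feed.
close = pd.Series([44.0, 44.3, 44.1, 44.5, 43.9, 44.6, 45.1, 45.0, 45.4, 45.8,
                   45.6, 46.0, 46.3, 46.1, 46.5, 46.2, 46.8, 47.0, 46.7, 47.2])
rsi_values = rsi(close)
print(rsi_values.tail())
```

The first `period` values are NaN because there is not enough history yet.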

No 3

Calculate the MACD with pandas
- Get stock prices
- How to calculate MACD
- Make a plot with MACD lines
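
A sketch of the MACD calculation with pandas, assuming the conventional 12/26/9 spans. The function name and the toy price series are illustrative, and plotting is left out:

```python
import pandas as pd

def macd(close, fast=12, slow=26, signal=9):
    """Return the MACD line, signal line and histogram as a DataFrame."""
    fast_ema = close.ewm(span=fast, adjust=False).mean()
    slow_ema = close.ewm(span=slow, adjust=False).mean()
    macd_line = fast_ema - slow_ema
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    return pd.DataFrame({
        "macd": macd_line,
        "signal": signal_line,
        "histogram": macd_line - signal_line,
    })

# Toy prices that rise steadily, so the fast EMA ends above the slow EMA.
close = pd.Series(range(1, 101), dtype=float)
macd_frame = macd(close)
print(macd_frame.tail())
```

Calling `macd_frame[["macd", "signal"]].plot()` would draw the two lines if matplotlib is installed.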

No 4

FREE eBook
- Backtesting an investment strategy
- Use Python and pandas
- 82 pages with source code

No 5

Russell Kaplan
@russelljkaplan

Lessons learned debugging ML models:

1/ It pays to be paranoid. Bugs can take so long to find that it’s best to be really careful as you go. Add breakpoints to sanity check numpy tensors while you're coding; add visualizations just before your forward pass (it must be right before! otherwise errors will slip in).
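
A sketch of the kind of sanity check this suggests, written as plain assertions on NumPy arrays. The function name and the expected (N, H, W, C) layout are assumptions for illustration:

```python
import numpy as np

def check_batch(images, labels, num_classes):
    """Fail fast if a batch looks wrong before it reaches the model."""
    assert images.ndim == 4, f"expected (N, H, W, C), got {images.shape}"
    assert images.shape[0] == labels.shape[0], "images and labels differ in length"
    assert np.isfinite(images).all(), "batch contains NaN or infinite values"
    assert labels.min() >= 0 and labels.max() < num_classes, "label out of range"

good_images = np.zeros((2, 8, 8, 3))
good_labels = np.array([0, 2])
check_batch(good_images, good_labels, num_classes=3)  # passes silently
```

Running the check right before the forward pass catches errors introduced by any earlier preprocessing step.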

2/ It's not enough to be paranoid about code. The majority of issues are actually with the dataset. If you're lucky, the issue is so flagrant that you know something must be wrong after model training or evaluation. But most of the time you won't even notice.

3/ The antidote is obsessive data paranoia. Without this, data issues will silently take away a few percentage points of model accuracy.

4/ You can unit test ML models, but it's different from unit testing code. To prevent bugs from re-occurring, you have to curate scenarios of interest, then turn them into many small test sets ("unit tests") instead of one large one.
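
One way to picture these small test sets: each scenario gets its own examples and its own pass threshold. A sketch with a stand-in model; all names, examples and thresholds are hypothetical:

```python
def accuracy(model, examples):
    """Fraction of (input, expected) pairs the model gets right."""
    correct = sum(model(x) == expected for x, expected in examples)
    return correct / len(examples)

def run_scenarios(model, scenarios):
    """Return the names of scenarios where accuracy falls below its threshold."""
    failures = []
    for name, (examples, min_accuracy) in scenarios.items():
        if accuracy(model, examples) < min_accuracy:
            failures.append(name)
    return failures

def is_positive(x):
    """Stand-in for a trained model."""
    return x > 0

# Each scenario is a small curated test set plus the accuracy it must reach.
scenario_sets = {
    "clearly positive": ([(5, True), (10, True)], 1.0),
    "clearly negative": ([(-5, False), (-1, False)], 1.0),
    "boundary": ([(0, False), (1, True)], 1.0),
}

print(run_scenarios(is_positive, scenario_sets))
```

A failing scenario name points at the kind of input that regressed, which a single aggregate accuracy number would hide.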

Alex Strick van Linsch...
@strickvl

I recently switched what I spend the majority of my professional life doing: history -> software engineering. I'm currently working as an ML Engineer @zenml_io and really enjoying this new world of #MLOps, filled as it is with challenges and opportunities.

I wanted to get some context for the wider work of a data scientist to help me appreciate the problem we are trying to address @zenml_io, so looked around for a juicy machine learning problem to work on as a longer project.

I was also encouraged by @jeremyphoward's advice to "build one project and make it great" (https://t.co/Doo88EUhkN). This approach seems like it has really paid off for those who've studied the @fastdotai course and I wanted to really go deep on something myself.

Following some previous success working with @adyantalamadhya and another mentor via @SharpestMindsAI on a previous project, I settled on computer vision and was lucky to find @ai_fast_track to mentor me through the work. (We meet a couple of times per week).

In the last 6 weeks, I've made what feels like good progress on the problem. This image offers an overview of the pieces I've been working on, to the point where the 'solution' to my original problem feels on the verge of being practically within reach.

Alisa Liu
@alisawuffles

We introduce a new paradigm for dataset creation based on human 🧑‍💻 and machine 🤖 collaboration, which brings together the generative strength of LMs and the evaluative strength of humans. And we collect 🎉 WaNLI, a dataset of 108K NLI examples! 🧵

Paper: https://t.co/IUXcm9wIh2

Our pipeline starts with an existing dataset (MNLI), and uses data maps 📜 to automatically identify pockets of examples that demonstrate challenging 🧐 reasoning patterns relative to a trained model. Then we use GPT-3 to generate new examples likely to have the same pattern. 2/

Next we propose a new metric, also inspired by data maps, to automatically filter generations for those most likely to aid model learning. Finally, we validate ✅ the generated examples through crowdworkers, who assign a gold label 🟡 and (optionally) revise for quality ✍️. 3/

Remarkably, replacing MNLI with WaNLI (which is 4x smaller) for training improves performance📈 on seven OOD test sets🧪, including by 11% on HANS and 9% on ANLI. Under a data augmentation setting, combining MNLI with WaNLI is more effective than using other augmentation sets. 4/

Our method addresses limitations of crowdsourcing, where workers may resort to repetitive writing strategies 🤷, and leverages the great progress in text generation 📃. We get the best of both worlds: 🤖’s ability to produce diverse examples, and 🧑‍💻’s ability to evaluate them. 5/

Russell Kaplan
@russelljkaplan

@praveen0582 I use a combination of Jupyter notebooks and (just a bit biased :)) Scale Nucleus.

Oliver Jumpertz
@oliverjumpertz

With average salaries of $145,000 for remote positions and an open end to what you can earn, Solidity developers are in high demand.

Time to become one and enter an interesting field in the industry.

This is your roadmap to becoming a Solidity Developer in 2022. ↓

Before we get into it, one clarification:

This is the roadmap for a very specific tech stack, used to develop smart contracts on Ethereum-like blockchains.

This includes:

- Ethereum
- Polygon
- Binance Smart Chain
- etc.

It contains many of the fundamentals you need to branch out to other blockchains if you want to, but please don't expect to become a competent Solana smart contract developer by following this roadmap to the end.

Now that we got this out of the way, let's get into it! ↓

1. Learn The Basics Of Computer Science

Depending on where you currently stand in terms of your skills, it might be that you first need an introduction to CS overall.

Harvard offers its CS50 for free, and it'll take you a while, but it's worth

Learn Python With Rune
@PythonWithRune

Want to learn Django?

Check out these 8 free resources.

see 👇🧵

Get started with Django

Covers
- Installing and creating first project
- Working with templates
- Authentication frameworks
...and much more

More resources

Django Admin Cookbook

How to do things with Django admin
- Create two admin sites
- Bulk and custom actions
- Working with permissions
...and much more

https://t.co/NM0Nby5NwB

More 👇🧵

Django ORM cookbook

How to do things using Django ORM
- How to do OR/AND queries in ORM
- CRUD with ORM
- Database modelling
...and much more

https://t.co/tYjdbiYpWU

More 👇🧵

Building APIs with Django and Django Rest Framework

Learn about
- Simple API with pure Django
- Serializing and Deserializing data
- Access control
...and much more

https://t.co/gQTDxiG3wX

More 👇🧵

Jean de Nyandwi
@Jeande_d

Early last year, I wanted to learn about Machine Learning Operations (MLOps).

MLOps refers to all the processes involved in building and deploying machine learning models reliably.

A thread on the importance of MLOps and resources that I used 🧵

As you may have heard, models are a tiny part of any typical ML-powered application.

Nothing stresses that better than this picture:

Source: Hidden Technical Debt in Machine Learning Systems, https://t.co/JDyAr1s3kc

There are lots of critical processes that are involved in MLOps such as:

- Data processes: collection, labeling, exploration, preprocessing
- Modeling processes: building, training, evaluation, testing
- Production processes: serving, monitoring, and maintaining models

MLOps is a new topic for almost anyone. Maintaining models for a prolonged period of time is difficult.

Models are very prone to change. They drift over time. The world (that sources the data) changes, and so the data changes too.
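
A very rough sketch of how drift in a single numeric feature might be flagged: compare the mean of live data with the mean of the training data, measured in training standard deviations. The function name and the synthetic data are made up, and real monitoring usually relies on richer statistical tests:

```python
import numpy as np

def mean_shift(train_values, live_values):
    """Shift of the live mean from the training mean, in training std units."""
    train_std = np.std(train_values)
    shift = abs(np.mean(live_values) - np.mean(train_values))
    return shift / (train_std + 1e-12)

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5000)    # feature values seen during training
same = rng.normal(0.0, 1.0, 5000)     # live data from the same distribution
drifted = rng.normal(1.5, 1.0, 5000)  # live data after the world changed

print("no drift:", mean_shift(train, same))
print("drift:", mean_shift(train, drifted))
```

A score near 0 suggests the feature still looks like the training data, while a large score is a signal to investigate or retrain.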

MLOps is a huge topic. All I wanted was to have a reasonable understanding of it.

Here are 3 resources that I used:

- Machine Learning Engineering book by @burkov
- MLOps Specialization by @DeepLearningAI_
- Introducing MLOps book by O'Reilly