For those of you interested in learning more about causality, I just announced on my mailing list (https://data4sci.com/latest) that I'm working on a series of blog posts covering the contents of Judea Pearl's Causal Inference in Statistics - A Primer (https://amzn.to/39G5lWl, affiliate link) using Python.
I find that as well. I've also read The Book of Why, but it's difficult to know when and how to use them. Perhaps it just takes more practice writing causal diagrams, which Pearl seems to do with ease thanks to the years of practice he's had. Something like programming, I guess.
Note that the article subtly admits that you can subvert the objectivity of your model-based decisions through your subjective choices about how to build the graph.
This is always true: you're always making some strong assumptions when trying to make a causal claim. Using a causal framework, DAGs in this case, at least makes those assumptions explicit.
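To make "explicit" concrete, here's a minimal sketch (not from the article). The graph, the node names, and the helpers `parents` and `absent_edges` are all made up for illustration. The idea is that once the DAG is written down, every arrow you left out is an assumption you can list and argue about.

```python
from itertools import combinations

# Hypothetical DAG (an assumption for illustration, not from the article).
# Each key is a cause; its list holds the variables it directly affects.
dag = {
    "season": ["ice_cream_sales", "sunburns"],
    "ice_cream_sales": [],
    "sunburns": [],
}

def parents(dag, node):
    """Return the direct causes of `node` according to the graph."""
    return sorted(cause for cause, effects in dag.items() if node in effects)

def absent_edges(dag):
    """Return pairs of variables with no arrow in either direction.

    Each pair is a strong assumption: no direct causal effect between them.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(dag), 2)
        if b not in dag[a] and a not in dag[b]
    ]

print("Direct causes of sunburns:", parents(dag, "sunburns"))
print("Assumed to have no direct effect on each other:", absent_edges(dag))
```

Anyone who disagrees with the graph can point at a specific missing arrow instead of arguing about the conclusion in the abstract.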
I didn't downvote, but your comment came across as dismissive.
Personally, I think there's a world of difference between the aphorism of "correlation does not equal causation" and actually understanding how causation works. I understand the former quite well, but haven't much of a clue about the latter. (And I'm not alone!)
I left my comment for the benefit of those who have learned regression. I read the article expecting something new, seeing how highly upvoted the post was on HN, but realized it was basic information.
If they removed the symbols from the article, I bet the article would be accessible to way more people.
Regarding causation, I think the essential points are:
- there is never any true causation (e.g., even though it is always dark when I close my eyelids, I can never be sure that closing my eyelids causes the darkness. How can I know that the next time I close my eyelids, it won't be something other than dark?)
- you can see if a correlation may have a causal relationship by applying common sense to the chronology of events (if darkness correlates highly with closed eyelids, and if I notice that, chronologically, darkness has always followed closing my eyelids, I can be more confident that the relationship may be causal. Stats help formalize this process.)
Try not to make comments that an article is trivial and common sense when you admittedly missed 50% of the article, and of that 50% you weren't familiar with any of the concepts.
Essentially that's reading an introduction, saying you knew everything in the introduction and then claiming the article has no content.
The parts that I am unfamiliar with are the second paragraph in the "What is my data telling me?" section where they describe DAG, and the last paragraph in the "Can’t I just use XGBoost?" section, where they introduce TMLE.
Even then, DAG seems like a very straightforward concept that would be useful to have in your toolkit, but is probably something most thoughtful people do without explicitly thinking about it when doing regressions.
And the TMLE paragraph is essentially jargon that you would need to do your own research to actually understand. Why couldn't they spend the article describing this, as it seems that this is what their value add as a service is?
Overall, this piece seemed more intended to market themselves ("Look! We do more complicated regression than you know how to do, and we're not going to explain it in any depth.") than to be well-meaning teaching. Of course, I'm probably seeing bad intentions when there aren't any, so I am likely wrong.
> Even then, DAG seems like a very straightforward concept that would be useful to have in your toolkit, but is probably something most thoughtful people do without explicitly thinking about it when doing regressions.
I appreciate the honest response. To elaborate a bit, I think you may be mixing up the point of the article with the fact that it starts from a low baseline of expected familiarity / knowledge so that it can bring more readers along.
My read is that the purpose of the article is to explain how to perform more accurate blocking and be explicit about your model assumptions by using DAGs. It then explains why linear regression, the common tool everyone reaches for, is the wrong tool in this case, and why TMLE is better.
That's the meat of what the article builds to, and is 9 out of the 30 paragraphs. Roughly 1/3rd (not 1/2 as guessed).
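For anyone who wants to see what "blocking" buys you, here's a rough sketch on simulated data. Everything in it is an assumption of mine, not the article's setup: a single binary confounder Z, a true treatment effect of 1.0, and the function names. It is also not TMLE, just the simplest stratify-and-average adjustment.

```python
import random
from statistics import mean

def simulate(n=20000, true_effect=1.0, seed=0):
    """Simulate rows of (z, x, y) where Z confounds treatment X and outcome Y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0                   # confounder
        x = 1 if rng.random() < (0.8 if z else 0.2) else 0   # Z makes treatment likelier
        y = true_effect * x + 2.0 * z + rng.gauss(0, 1)      # Z also raises the outcome
        rows.append((z, x, y))
    return rows

def naive_effect(rows):
    """Difference in mean outcome between treated and untreated, ignoring Z."""
    treated = [y for _, x, y in rows if x == 1]
    control = [y for _, x, y in rows if x == 0]
    return mean(treated) - mean(control)

def adjusted_effect(rows):
    """Compare treated vs untreated within each level of Z, then average by P(Z)."""
    total = 0.0
    for z_val in (0, 1):
        stratum = [(x, y) for z, x, y in rows if z == z_val]
        treated = [y for x, y in stratum if x == 1]
        control = [y for x, y in stratum if x == 0]
        weight = len(stratum) / len(rows)
        total += weight * (mean(treated) - mean(control))
    return total

rows = simulate()
naive = naive_effect(rows)
adjusted = adjusted_effect(rows)
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}, truth: 1.00")
```

With these made-up numbers the naive comparison is biased toward roughly 2.2 in expectation, while the stratified estimate should land close to the true 1.0. The DAG is what tells you Z is the variable to stratify on; with a different graph, adjusting for the wrong variable can make things worse.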
> Why couldn't they spend the article describing this, as it seems that this is what their value add as a service is?
Agreed, but I think this is a different type of article. It's (in my opinion) an article saying, "Hey, there is a problem or shortcoming in what you're doing today." It gives justification as to why many people have this problem, and then says, "These are the tools to help you fix it."
A person who was unaware now knows what to look into to solve their problem. Presumably a treatment of the solutions doesn't fit in a blog post, and it doesn't have to. There is value in identifying a person's problem and telling them how to learn about the solution, even without fully detailing the solution. And I think your main criticism of the post is that it didn't just present the solution in full detail.
That said, I don't want this to seem like I'm beating you up, but rather to show that there are many different ways to get value and information on a topic without a detailed treatment of a solution.
And cycling back to your original point, I agree. I think the reason they don't go into full detail on the solution is that the amount they'd need to cover for a full treatment would make the cost outweigh the value, in a blog post that is motivated by bringing people to their company/service.
Upvoted you, because why not throw around internet points!
It's easy to forget that things that are basic to some are extremely advanced to others. As HN grows, it draws in people from more diverse backgrounds, many of whom don't have much statistics exposure.
I know data scientists who work at FAANGs, for whom this would be almost entirely new information (the causality stuff).
Thank you for taking the time to add more essential points. Do you think it's possible to make better decisions with partial causation, versus a more correlational approach? That is, do we really need true causation to improve our decision making? Or can we make better, albeit imperfect, decisions with imperfect causality? Is that better than what most statisticians do with correlation?
Perhaps you had an exceptional intro to regression class. But the ones I’m familiar with haven’t gone into the Rubin causal model, DAGs, or combining linear treatment effect models with non-parametric components.