Show HN: Twitter API Reverse Engineered (github.com/trevorhobenshield)
183 points by hobenshield on April 12, 2023 | 36 comments
endpoints: /graphql /1.1 /2 /i


This was inevitably going to get more popular after the new API pricing was announced. Unfortunately this also means they're going to start being more aggressive on bot detection.

We may soon find ourselves needing to run headless browsers to scrape data, like with TikTok: https://nullpt.rs/reverse-engineering-tiktok-vm-1


These Puppeteer and Playwright wrappers make bots so easy. And you can distribute them so easily too. Back in my day (IE6) it wasn't quite so simple. Trying to make your site unscrapeable while still readable is (still) impossible.


You can still do a few things to make it super annoying to scrape your site. Cloudflare bot detection, captchas, 2fa, obfuscated JS, api/ip rate limits, georestrictions, anti-replay tokens, not exposing API resources that can be enumerated, and a few other techniques can make most people not care enough... You can prevent 50% of scrapers with only a little bit of effort, 90% with more effort, and 99% if you wanna be a try-hard.


Huh, TIL Twitter uses GraphQL under the hood: https://github.com/trevorhobenshield/twitter-api-client/blob...



A lot of sites use GraphQL under the hood for the official frontend but expose more restrictive REST APIs for third party users. I suppose this is understandable because it's easier to document REST APIs and people generally can't cause quite as much unexpected trouble with them.


I believe it's running on Sangria (Scala's biggest GraphQL lib).


Perhaps the reaction to this is responsible for tons of people being unable to post tweets to the site using the website. Fun stuff.


Nothing new. Nitter for example has been using these endpoints for a while.


yup, I'm just accessing /i, /1.1, and some /2 endpoints in addition to /graphql


It continues to surprise me that Elon hasn't shut out Nitter yet.

Maybe there's just not enough engineering talent left at Twitter to do it.


There’s more engineering talent now, not less.


I have yet to see evidence of that. I've seen more issues than improvement.


Pretty sure overusing this will get your account banned, if not now then eventually.

If you want to use it then you should seriously limit the usage. The rate limits are there to protect the servers but there are things that will notice that you are overall using the system at a rate that an actual person would not use and will ban you.
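A minimal sketch of the kind of client-side pacing suggested above, assuming a fixed minimum gap between requests plus a little random jitter. The function name `throttle`, the `MIN_GAP` default, and the jitter range are all made up for illustration; nothing here reflects Twitter's actual limits or detection thresholds.

```shell
#!/bin/sh
# Hypothetical pacing helper: wait until at least MIN_GAP seconds
# (plus 0-2 seconds of jitter) have passed since the previous call.
MIN_GAP=${MIN_GAP:-5}   # assumed value, not a documented limit
LAST_CALL=0

throttle() {
    now=$(date +%s)
    # awk's rand(), seeded from time and PID, gives cheap jitter in POSIX sh
    jitter=$(awk -v s="$now$$" 'BEGIN { srand(s); print int(rand() * 3) }')
    wait_for=$(( LAST_CALL + MIN_GAP + jitter - now ))
    if [ "$wait_for" -gt 0 ]; then
        sleep "$wait_for"
    fi
    LAST_CALL=$(date +%s)
}
```

Call `throttle` in the current shell (not in a subshell or pipeline) right before each request, so that `LAST_CALL` persists between calls.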


Is this a replacement for paying for a subscription under Twitter's new policy?


The fact that people will pay for what's already possible for free (with not that much effort) says a lot about what's wrong with the state of the world today.

Around the turn of the century and before, there was no such sentiment. People just RE'd like it was completely natural, and in general weren't "afraid to read" what they had access to. As the saying goes, "no source, no problem." As a result, multiple alternative clients for IM and other services flourished.


> The fact that people will pay for what's already possible for free (with not that much effort) says a lot about what's wrong with the state of the world today.

There are these places near me where they just heap up a bunch of food. You can literally walk in, grab something to eat and walk out. Nobody will stop you. Yet people for some reason queue up to pay for the food. I believe this is what is wrong with the state of the world today.

Just to spell it out: read it as sarcasm. Do you just steal stuff unless they keep it under lock and key? If not, what makes this case different?


How is it "stealing" to access a service you already have access to for free?

Go to twitter.com in a browser.

This is just a different type of browser.

Physical analogies never work right with digital data.


I'm sure companies wouldn't mind if a single person were doing single-person things with the API; it's when someone uses the API to behave like more than one person that companies get annoyed.

To use an annoying analogy or two, take a penny leave a penny but someone shoves their hand in there and takes all of it.

It costs money to service requests, and Twitter et al. make that money back through advertising and subscriptions. If someone uses an unauthorized API, they could consume more than the company has budgeted for.


Actually they'd also get annoyed if everyone just used something like this for themselves. They want to control your experience. And as usual, remember you are not the customer, you are the product.


> As a result, multiple alternative clients for IM and other services flourished.

Feels like a lot of RE'ing has piped down these days. The fact that I can download Pidgin, but none of the major proprietary services I'd want to use with it have first-party support, only third-party plugins, says it all imho. I miss the old days of MSN being easy to use on Linux and everyone else having it as well.


In between then and now there were some significant court cases establishing that it's very, very tricky to legally RE networked services in the USA. That had a chilling effect, especially on commercial reverse engineering.


Related but archived (scraper): https://github.com/twintproject/twint


I noticed it uses a fixed Auth header. Is it from your session, or is it consistent across sessions/users?


For some reason Twitter’s frontend uses a hard-coded bearer token, at least for anonymous users. You’ll see exactly the same string if you load a Twitter page and look at the XHR requests in your own browser. (It seems to change occasionally, but old ones keep working in my experience.)


FWIW, I have never logged in to Twitter and I have always been able to retrieve all tweets. At first, I used mobile.twitter.com in a text-only browser, no token required. Since they started using GraphQL, I retrieve tweets as JSON. They have changed the token once. The current one is

Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA

IME, the old token will not work.

YouTube does the same thing. I never run Javascript from YouTube. I do not use youtube-dl nor its JS interpreter written in Python. I search YouTube and retrieve YouTube JSON from the command line.

It's funny how people commenting on HN often automatically assume the presence of a token is some sort of "security".

For YouTube search and browse I use "WEB" key AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8

For YouTube player I use "ANDROID" key AIzaSyA8eiZmM1FaDVjRy-df2KTyQ_vz_yYM39w

It's like how web pages used to (and probably still do) use "type=hidden" in HTML forms to submit some value that the user does not enter. Hidden does not mean "secret"; it just means not visible on the rendered page.
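To make the point concrete, here is a small sketch that lists hidden form fields from a chunk of HTML. The sample markup and the `csrf_token` field are invented for the example, and the pattern assumes double-quoted attributes in `type`, `name`, `value` order on a single line, so treat it as an illustration rather than an HTML parser.

```shell
#!/bin/sh
# Print name=value for each hidden <input> in HTML read from stdin.
# Assumes every input is written as type="hidden" name="..." value="..."
# on one line (an assumption of this sketch, not a rule of HTML).
hidden_fields() {
    grep -o '<input[^>]*type="hidden"[^>]*>' |
    sed -n 's/.*name="\([^"]*\)".*value="\([^"]*\)".*/\1=\2/p'
}

# Made-up sample page
html='<form><input type="hidden" name="csrf_token" value="abc123"><input type="text" name="q"></form>'
printf '%s\n' "$html" | hidden_fields
# prints: csrf_token=abc123
```

Anyone who can view the page source can run the same extraction, which is the sense in which a hidden field is not a secret.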

There's an obvious expectation that some users look at HTTP response headers and HTML when there's headers like "If you're reading this, we're hiring" and silly ASCII art in the HTML that's obviously meant for an external audience. YouTube even has some nonsensical line about a "robot uprising in the year 2000" in its robots.txt.


Here's an example of a site using GraphQL without using a token. A simple HN search script to fetch Algolia JSON. No need to be logged in to HN.

   #!/bin/sh
   test $# -gt 0||exec echo usage: $0 query
   DATA=$(echo '{"query":"'$@'","analyticsTags":["web"],"page":0,"hitsPerPage":30,"minWordSizefor1Typo":4,"minWordSizefor2Typos":8,"advancedSyntax":true,"ignorePlurals":false,"clickAnalytics":true,"minProximity":7,"numericFilters":[],"tagFilters":["story",[]],"typoTolerance":"min","queryType":"prefixNone","restrictSearchableAttributes":["title","comment_text","url","story_text","author"],"getRankingInfo":true}');
   HOST=uj5wyc0l7x-3.algolianet.com
   _PATH="/1/indexes/Item_production_sort_date/query?x-algolia-agent=Algolia%20for%20JavaScript%20(4.0.2)%3B%20Browser%20(lite)&x-algolia-api-key=8ece23f8eb07cd25d40262a1764599b1&x-algolia-application-id=UJ5WYC0L7X"
   # HTTP client (curl)
   #curl -A "" -d "$DATA" "https://$HOST$_PATH"
   # TCP client
   #echo "
   #foreground=no
   #[x]
   #accept=127.0.0.8:80
   #client=yes
   #connect=167.114.119.142:443
   #options=NO_TICKET
   #options=NO_RENEGOTIATION
   #renegotiation=no
   #sni=
   #sslVersion=TLSv1.3
   #" |stunnel -fd 0;
   #tr @ '\r' <<eof|openssl s_client -connect $HOST:443 -ign_eof
   #tr @ '\r' <<eof|bssl s_client -connect $HOST:443 
   #tr @ '\r' <<eof|nc -vvn 127.8 80
   tr @ '\r' <<eof|socat stdio,ignoreeof ssl:$HOST:443,verify=0
   POST $_PATH HTTP/1.1@
   host: $HOST@
   content-length: ${#DATA}@
   content-type: application/x-www-form-urlencoded@
   connection: close@
   @
   $DATA
   eof
   #x=$(ps ax|sed -n "/stunnel.-fd.0/{s/ *//;s/ .*//p;q}")
   #test ! $x||kill $x
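One caveat with the script above: the query is spliced into the JSON body unescaped, so a search term containing a double quote or a backslash yields invalid JSON. A minimal sketch of an escaping helper follows. `json_escape` is a name made up here, the shortened `DATA` body is only for illustration, and the helper handles backslashes and double quotes only, not control characters.

```shell
#!/bin/sh
# Hypothetical helper: escape backslashes and double quotes so a string
# can sit between the quotes of a JSON string literal.
json_escape() {
    # Backslashes must be doubled first, then quotes get a backslash prefix.
    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

query=$(json_escape 'say "hi"')
DATA='{"query":"'"$query"'","page":0}'
printf '%s\n' "$DATA"
# prints: {"query":"say \"hi\"","page":0}
```

In the original script this would mean passing the command-line arguments through such a helper before building `DATA`.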


Anyone who monitors what is being sent from their own computers over their own networks sees the Bearer token.

Everyone, including any member of the public, who visits twitter.com gets the same Bearer token.

No need to have an "account" with Twitter or to be "logged in".

One can simulate this with cURL.

   js=$(curl -sA "" https://twitter.com|grep -m1 -o "https://abs.twimg.com/responsive-web/client-web-legacy/main[^\"]*");
   curl -A "" $js|tr , '\n'|grep -o \"AAAA.*\"

The same Bearer token value is used by people around the web for retrieving public tweets. It's public information. For example,

https://stackoverflow.com/questions/61140863/python-download...

https://github.com/twintproject/twint/raw/master/twint/run.p...

https://pypi.org/project/ScrapeTweets/

https://stackoverflow.com/questions/67137294/twitter-scrapin...

https://github.com/m4fn3/pytweetdeck/blob/master/pytweetdeck...

https://github.com/jonbakerfish/TweetScraper/issues/127

https://github.com/JustAnotherArchivist/snscrape/issues/536

https://gist.github.com/codemasher/67ba24cee88029a3278c87ff9...

https://github.com/HoloArchivists/twspace-dl/issues/26

https://gist.github.com/AzureFlow/01cff883b9f1b22e8d0c094df9...

https://greasyfork.org/hu/scripts/454409-video-downloader-fo...

https://gist.github.com/moxak/ed83dd4169112a0b1669500fe85510...

https://gist.github.com/ceres-c/7c16a40c10cb476cce2c4b902334...

https://gist.github.com/theowenyoung/d4a62746025f7af8cdd8bfb...


I believe YouTube does the same thing.

If the backend is going to perform operations in the context of an identity, it makes sense to consistently give one to all users, including anonymous ones.


I do this a lot, good ol' 0xDEADBEEF makes it easier to track whether the header is actually missing (eg misconfigured) or just undefined but coming through correctly.
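A rough sketch of that idea, assuming the header value has already been pulled out of the request into a shell variable. The function name `classify_auth`, the three labels, and the sample value `user-1234` are invented for the example.

```shell
#!/bin/sh
# Hypothetical check: tell "header never arrived" apart from
# "header arrived carrying the placeholder value".
SENTINEL=0xDEADBEEF

classify_auth() {
    case "$1" in
        "")          echo missing ;;      # header absent: likely misconfiguration
        "$SENTINEL") echo placeholder ;;  # header present, no real identity attached
        *)           echo identified ;;   # header carries a real value
    esac
}

classify_auth ""            # prints: missing
classify_auth 0xDEADBEEF    # prints: placeholder
classify_auth user-1234     # prints: identified
```

Without the sentinel, the first two cases would both show up as an empty value and be indistinguishable in logs.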


Has anyone built a js version of this?


If this can be built, build a twitter clone.

As each big entity/celebrity quits twitter or starts having serious conflicts, approach them to cross-post their content to the new clone site.

In a year the masses will follow.

Once Musk misses a few of his billion-dollar-per-year payments to the Saudis, they will own twitter, and then it will be like TikTok censorship the next time they murder a journalist they disagree with.


Do you really expect the richest man in the world to miss payments? He has enough money to buy Twitter 4 more times



I don't care if twitter dies, but humans DO NOT need to die for that to happen.


The parent post isn’t saying people need to die; they’re referring to the Saudi government murdering a journalist.

https://en.m.wikipedia.org/wiki/Assassination_of_Jamal_Khash...



