Looks promising. I found a music generator called heymusic.ai and was considering subscribing; the songs were fun to make with the kids, but then it disappeared off the face of the earth with no updates anywhere online. I'm cautious about subscribing to AI services: lots of gen-AI startups are popping up, and it's not easy for them to turn a profit.
It doesn't sound like the approaches are incompatible. You can use MinHash LSH to search a large set and get a top-k candidate list for any individual, then use a weighted average with penalty rules to decide which of those qualify as dupes. Weighted MinHash can also be used to efficiently add repeats of a token and give it extra weight.
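A rough stdlib-only Python sketch of that two-stage pipeline (the band/row split, the 0.8/0.2 weights, and the year penalty in `is_dupe` are made-up illustration values, not a tuned configuration):

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64   # hash functions per MinHash signature
BANDS = 16      # LSH bands (4 rows each)

def minhash(tokens):
    """MinHash signature: for each seeded hash, keep the min over the token set."""
    return [min(int(hashlib.md5(f"{seed}:{tok}".encode()).hexdigest(), 16)
                for tok in tokens)
            for seed in range(NUM_PERM)]

def build_lsh(sigs):
    """Band each signature; records landing in the same bucket become candidates."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(set)
    for rec_id, sig in sigs.items():
        for b in range(BANDS):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(rec_id)
    return buckets

def candidates(rec_id, sigs, buckets):
    """Stage 1: retrieve everything sharing at least one band bucket."""
    rows = NUM_PERM // BANDS
    found = set()
    for b in range(BANDS):
        found |= buckets[(b, tuple(sigs[rec_id][b * rows:(b + 1) * rows]))]
    found.discard(rec_id)
    return found

def est_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity from the fraction of agreeing signature slots."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def is_dupe(sim, same_year, threshold=0.6):
    """Stage 2 (hypothetical rule): weighted similarity plus a penalty field."""
    score = 0.8 * sim + (0.2 if same_year else -0.3)
    return score >= threshold
```

The weighted-MinHash idea from the comment amounts to repeating (or virtually repeating) the tokens of a high-importance field before hashing, so matches on that field move the estimated similarity more.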
Pretty amusing that the old AI revolution was purely logic/reasoning/inference based. People knew that to be a believable AI, a system needed some level of believable reasoning and logic capability. But nobody wanted to decompose a business problem into disjunctive logic statements, and any additional logic can have implications across the whole universe of existing logic, making the system hard to predict and maintain.
LLMs brought this new revolution where it's not immediately obvious you're chatting with a machine, but, just like most humans, they still severely lack the ability to decompose unstructured data into logic statements and prove anything out. It would be amazing if they could write some Datalog or Prolog to approximate their more complex neural-network-based understanding of a problem, as logic-based systems are more explainable.
One of the reasons why word vectors, sentence embeddings, and LLMs won (for now) is that text, especially text found on the web, does not necessarily follow strict grammar and lexical rules: sentences can be incorrect but still understandable.
If you then include leetspeak, acronyms, and short-form writing (SMS/tweets), it quickly becomes unmanageable.
I am not a linguist, but I don't think that many linguists would agree with your assessment that dialects, leet speak, short form writing, slang, creoles, or vernaculars are necessarily ungrammatical.
From what I understand, the modern understanding is that these point to the failure of grammar as a prescriptive exercise ("This is how thou shalt speak"). Human speech is too complex for simple grammar rules to fully capture its variety. Strict grammar and lexical rules were always fantasies of the grammar teacher anyway.
I am a linguist, and I agree. But it does complicate the grammar to allow for these other options. (I haven't studied leet speak, but my impression is that it's more a matter of vocabulary than grammar, and vocabulary is relatively easy to add.)
For the record, the parser I worked on ended up having the "interesting" rules removed, leaving it as a tool for finding sentences that didn't conform to a Basic English grammar with a controlled vocabulary. It was used to QC aircraft repair manuals, which need to be readable by non-native English speakers.
There are languages that have a fully codified grammar which completely covers everything people actually use (and more). But we spend 10 years learning the grammar itself, 1-2 hours every day at school (then you have literature etc. on top of that)...
I see it as a complete waste of my youth, BTW. Today I speak English that I learned through listening, reading and watching, and all of this mother tongue grammar nonsense that used to stress me out daily at school and during homework is absolutely useless to me.
I wonder if people approach NLP as a sea of semes rather than as semi-rigid grammatical structures that are then infused with meaning. (Probably, but I'm not monitoring the field.)
Looking forward to an 8-bit instruct version on llama.cpp, to try out problems that need the insane context length.
It would be interesting if all these models were fine-tuned on basic Datalog, which is a very simple language. That way they could demonstrate their logic/reasoning capabilities, as well as their ability to learn from mistakes and iterate.
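For a sense of how simple the core of Datalog is, here is a toy bottom-up evaluation of the classic ancestor program, sketched in Python (a naive fixpoint loop, not an efficient engine):

```python
def transitive_closure(parent):
    """Naive bottom-up evaluation of the classic Datalog program:

        ancestor(X, Y) :- parent(X, Y).
        ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).

    Keep applying the rules until no new facts are derived (a fixpoint)."""
    ancestor = set(parent)            # first rule: every parent fact
    changed = True
    while changed:
        changed = False
        for (x, y) in parent:         # second rule: join parent with ancestor
            for (y2, z) in list(ancestor):
                if y == y2 and (x, z) not in ancestor:
                    ancestor.add((x, z))
                    changed = True
    return ancestor
```

The whole semantics fits in a dozen lines, which is what would make a model's derivations easy to check: every conclusion is either a given fact or follows from one rule application.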
If they add f-strings, some kind of easier map get/put syntax (e.g. #map.value = 1), and maybe a shorter-hand fun syntax, then Erlang will have gotten all the conveniences of Python. Amazing how far things have come.
Sounds like it's just a Haskell thunk, but as a probability wave.
Lazily evaluated until there's a probability it has to interact with something. Since you can never really see the value of the actual function, only what it looks like when it's forced to evaluate in some context (an interaction), you can never get a precise definition of the function.
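The thunk half of that analogy is easy to make concrete. Here is a minimal Python analogue of a Haskell thunk: a delayed, memoized computation that runs only when a result is demanded (the `Thunk` class and `force` method are illustrative names, not any library's API):

```python
class Thunk:
    """A delayed computation: nothing runs until the value is demanded,
    and the result is cached so the work happens at most once."""
    _UNSET = object()

    def __init__(self, fn):
        self._fn = fn
        self._val = Thunk._UNSET

    def force(self):
        if self._val is Thunk._UNSET:
            self._val = self._fn()  # evaluate on first demand only
            self._fn = None         # drop the closure, keep only the result
        return self._val
```

Where the analogy breaks down, of course, is that forcing a thunk is deterministic: observe it twice and you get the same cached value both times.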