Google Books reportedly indexing bad AI-written works https://www.theverge.com/2024/4/5/24122077/google-books-ai-indexing-ngram
#HowToThing #030 — Procedural, rule-based & stochastic text generation using a custom DSL, parse grammar (via https://thi.ng/parse) and abstract syntax tree transformation (via https://thi.ng/defmulti).
Since it's #NaNoWriMo & #NaNoGenMo [1], I'm closing out this first season of 30 #HowToThing's with a related topic & maybe someone even finds it useful/interesting...
This example is loosely inspired by @galaxykate's oldie & goodie #Tracery, but uses a super simple custom text format instead of JSON to define variables and template text. Variables are expanded recursively, and I've also added features like dynamic, indirect, pointer-like variable lookups to derive variables based on current values (useful for conditionals & context-specific expansions), hidden assignments, chainable modifiers... I've included 5 different "story" templates (incl. comments) showing various features. Just press "regenerate" to create new random variations...
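To give a rough idea of the core mechanic, here's a minimal TypeScript sketch of recursive, rule-based expansion with random choice — not the actual example code (which parses a proper grammar via thi.ng/parse and transforms the AST via thi.ng/defmulti); all names & rules below are made up:

```typescript
// Minimal sketch of recursive, rule-based & stochastic text expansion
// (Tracery-style). Illustrative only — the real example uses a parsed
// grammar + AST transformation; the rule set here is invented.

type Rules = Record<string, string[]>;

const pick = <T>(xs: T[]): T => xs[Math.floor(Math.random() * xs.length)];

// expand all `<var>` references in a template, recursively
const expand = (rules: Rules, template: string): string =>
    template.replace(/<([a-z]+)>/g, (_, id: string) =>
        rules[id] ? expand(rules, pick(rules[id])) : `<${id}>`
    );

const rules: Rules = {
    story: ["The <animal> <verb> over the <thing>."],
    animal: ["fox", "heron", "badger"],
    verb: ["jumps", "tiptoes", "stumbles"],
    thing: ["fence", "river", "old piano"],
};

console.log(expand(rules, "<story>"));
// e.g. "The heron tiptoes over the old piano."
```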
Similar to the previous #HowToThing, I'm hoping this example also shows that approaching use cases like this via small domain-specific languages with proper grammar rules doesn't require much ceremony and is often more amenable to change during prototyping (and later also more maintainable!) than regex-bashing approaches...
The parser grammar itself is explained in the https://thi.ng/parse readme. As usual, the grammar was created/prototyped with the Parser Playground[2], which we developed from scratch during the first thi.ng livestream[3] (2.5h video)...
Demo (example project #145):
https://demo.thi.ng/umbrella/procedural-text/
Source code:
https://github.com/thi-ng/umbrella/tree/develop/examples/procedural-text/src
If you have any questions about this topic or the packages used here, please reply in thread or use the discussion forum (or issue tracker):
https://github.com/thi-ng/umbrella/discussions
[1] https://github.com/NaNoGenMo/2023/
[2] https://demo.thi.ng/umbrella/parse-playground/
[3] https://www.youtube.com/watch?v=mXp92s_VP40
Catching up on the #SPP2023 #preconference on #memory:
Felipe De Brigard introduced us to the topic and some recent trends before the series of talks began.
Find Felipe's work on gScholar: https://scholar.google.com/citations?user=l9gS2joAAAAJ&hl=en&oi=ao
Google #ngram Viewer seems like a great tool for #writers who need to research what people were calling things during a specific era or span of years. https://books.google.com/ngrams/
One of the basic questions we tackle when working towards statistical language models is "Can we predict a word?"
This was also one of the intro questions we posed to the students last Wednesday in our #ise2023 lecture no. 4, when we introduced simple n-gram language models.
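To make the prediction question concrete, here's a toy bigram sketch in TypeScript (my own illustration, not the lecture material): count word-pair frequencies in a tiny corpus, then predict the most frequent continuation.

```typescript
// Toy bigram "language model": count word pairs in a tiny corpus and
// predict the most likely next word. Purely illustrative.

const corpus = "the cat sat on the mat the cat ate the fish";
const words = corpus.split(/\s+/);

// count bigram frequencies
const bigrams = new Map<string, Map<string, number>>();
for (let i = 0; i < words.length - 1; i++) {
    const [w1, w2] = [words[i], words[i + 1]];
    const next = bigrams.get(w1) ?? new Map<string, number>();
    next.set(w2, (next.get(w2) ?? 0) + 1);
    bigrams.set(w1, next);
}

// predict: argmax over observed continuations of `w`
const predict = (w: string): string | undefined => {
    const next = bigrams.get(w);
    if (!next) return undefined;
    return [...next.entries()].sort((a, b) => b[1] - a[1])[0][0];
};

console.log(predict("the")); // "cat" (seen 2x) beats "mat" / "fish" (1x each)
```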
#nlp #lecture #ngram #languagemodels #language #aiart #stablediffusion #creativeAI @fizise @KIT_Karlsruhe @nfdi4ds @nfdi4culture
A fun thing to do is enter your name in the amazing Ngram Viewer. Mine looks like this. The peak is explained, I presume, by the misfortunes of a gentleman who ran into a spot of bother in Spain. My other namesakes have failed in much less heroic ways. Don’t like the look of the graph, though. Seems we’re dying off as fast as butterflies in the UK
(Google Ngram charts word frequencies from a large corpus of books that were printed between 1500 and 2019) #history #language #Ngram
In the next episode we'll be building out our Hashtag grain in #MicrosoftOrleans.
It'll be responsible for taking in raw input, breaking it apart into many n-gram combinations, and then returning possible solutions, ranked by some metrics associated with the #Google #ngram #dataset.
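Roughly the idea, sketched in TypeScript — not the actual Orleans grain, and the tiny frequency table is just a stand-in for the real n-gram-backed lookup:

```typescript
// Sketch: split a raw hashtag into every possible word segmentation and
// rank candidates by word frequency. `freq` is a placeholder for the real
// n-gram dataset lookup; the ranking metric is deliberately naive.

const freq: Record<string, number> = {
    dog: 900, dogs: 700, so: 3000, of: 5000, war: 800,
};

// toy metric: only segmentations made entirely of known words score > 0
const score = (parts: string[]): number =>
    parts.every((w) => w in freq)
        ? parts.reduce((acc, w) => acc + freq[w], 0)
        : 0;

// enumerate all ways to split `s` into contiguous chunks
const segmentations = (s: string): string[][] => {
    if (!s.length) return [[]];
    const out: string[][] = [];
    for (let i = 1; i <= s.length; i++) {
        for (const rest of segmentations(s.slice(i))) {
            out.push([s.slice(0, i), ...rest]);
        }
    }
    return out;
};

const ranked = segmentations("dogsofwar")
    .map((parts) => ({ parts, score: score(parts) }))
    .sort((a, b) => b.score - a.score);

console.log(ranked[0].parts); // ["dogs", "of", "war"]
```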
Importing the #Google #ngram #dataset into #PostgreSQL.
I'm almost done with the bi-grams.
I've got about 900 GB more to import; then it's on to the tri-grams.
This is the entire, unfiltered set, which I'm going to back up first and put in cold storage.
Then I'm going to filter out rows that have characters that aren't allowed in #HashTags. This is the dataset that will power #FediMod's hashtag #accessibility service.
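For illustration only, the filtering step could look roughly like this with node-postgres — the table/column names are made up, not the actual FediMod schema:

```typescript
// Sketch of the hashtag-filtering step using node-postgres ("pg").
// The table/column names (ngrams, gram) are invented for illustration.

import { Client } from "pg";

const filterNgrams = async () => {
    const client = new Client({ connectionString: process.env.DATABASE_URL });
    await client.connect();
    try {
        // drop rows whose n-gram contains characters that can't appear in a
        // hashtag (keep letters, digits, underscores + the spaces separating
        // the words of a bi-/tri-gram)
        const res = await client.query(
            "DELETE FROM ngrams WHERE gram !~ '^[[:alnum:]_ ]+$'"
        );
        console.log(`removed ${res.rowCount} rows`);
    } finally {
        await client.end();
    }
};

filterNgrams().catch(console.error);
```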
The Google corpus of edited text shows a big pre-COVID spike in the one-word form starting in 2012. But by 2019, "hand washing" and "handwashing" were equally likely.
My son told me today that he loves the word "peckish", so we talked about #word frequency and tried to find a common (non-specialized) word that is less frequent than "peckish". Mostly failed.
But what drew my attention is how all of the words we thought of rose in popularity after 2000 on Google #ngram. Why? What happened in 1960-1980 that drove them down? Or is the GN corpus skewed (towards patents and academic papers?) during this era?