@dpaleka I'm objecting to the word 'right' more than anything else tbh.
x.com/aashiq/status/β¦
@dpaleka Regardless of the wisdom of whatever norms or laws you might want, their truth is not self-evident, and they are not so important and so fundamental to dignity that refusal to respect them is grounds to overthrow the government. This is what the word 'right' should centrally convey.
@dpaleka To get more to the point: In the liberal tradition saying something is a right is an implicit threat to overthrow the government if you don't get what you want. In this context that's ridiculous, and hecklers in my replies pretending like I made a gaffe don't change that.
This having been said I think we REALLY need to start talking about an overhaul to the Caller ID system, we need to fix whatever lets you spoof email addresses. We need to start getting serious about identity, unassisted humans are already taking advantage of our complacency. x.com/jd_pressman/stβ¦
It just should not be acceptable after the literal decades these things have been in service for them to be easily spoofed and evaded. That's 90's tier stuff, it's cute in a fledgling technology but the digital phone system and email are mature now, they should be trustworthy.
Imagine if you could just spoof URLs, I don't even mean unicode lookalike crap just straight up spoof them byte for byte and people went "oh but that's how DNS works, it would break backward compatibility to fix it".
@dpaleka Realistically if you had to disclose any time photoshop (or AI) is used in a work it would be like the cookie popups. #1 priority IMO is combating forgery and fraud, which usually looks like adding hard to fake markers of authenticity to real interactions.
@dpaleka Detectors are a reasonable stopgap measure, but the truth is that AI driven scammers will just be exploiting the same problems in our infrastructure that scammers exploit now.
@dpaleka The renewed urgency AI adds is a great way to get momentum into reform, but I worry we'll miss the real opportunity if we focus too much on AI itself.
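Not a fix for the whole problem, but email already has one hard-to-fake marker of the kind described above: DKIM signatures. A minimal sketch of checking one with the dkimpy library; the file path is a stand-in, and a passing check only proves the sending domain signed the message, not who actually wrote it.

```python
# Minimal sketch: verify a DKIM signature on a raw email.
# Assumes the dkimpy package (pip install dkimpy) and a complete
# RFC 822 message saved at message.eml (placeholder path).
import dkim

with open("message.eml", "rb") as f:
    raw_message = f.read()

# dkim.verify() checks the DKIM-Signature header against the public key
# the sending domain publishes in DNS.
if dkim.verify(raw_message):
    print("DKIM passes: the sending domain vouches for this message.")
else:
    print("No valid DKIM signature: the From: address could be spoofed.")
```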
@repligate I really should finish that book.
@PrinceVogel I still think back to the summer of 2021 when I slept days and worked nights on those first VQGAN landscapes. I used an A6000 to get the highest resolution. It was a heatwave and the GPU spat fire, my office was like a forge. I'd stare shirtless into the canvas and watch it grow.
@ESYudkowsky The Popol Vuh theory of alignment, perhaps:
mesoweb.com/publications/Cβ¦
It is often said that the gods create man to worship them, what else would be the use of this sniveling sycophant? https://t.co/HXJqWLsjER
@ESYudkowsky Notably, in the language this is translated from, the word 'see' has the connotation of 'see and acquire'. The proper English translation of that word is conquer.
"Their knowledge will extend to the furthest reaches, and they will conquer everything."
@QiaochuYuan x.com/jd_pressman/stβ¦
@QiaochuYuan x.com/jd_pressman/stβ¦
@repligate I suspect religious texts in general will score high on the AI meter because they have ritualistic grammar and strong elements of repetition.
@JeffLadish It's all inhibition and analysis, disassociated. You're not desperate. Generate plans with different constraints. What if your timeline is multipolar and strategies that don't advance alignment and capabilities at the same time are nonviable? What if interpretability can't work?
@JeffLadish What if your research was only allowed to get the AI to do things, what if you set it up to do the right thing so frequently and so reliably that it simply walks itself into the things you want without having to hand encode them?
@JeffLadish You have a mental block on the concept of action being good. The only good action is the furtherance of inaction, you optimize to be as slow and paranoid and introverted as possible. You want distance from the thing because you're scared of it, sort this out and try again.
@michaelcurzi Situation made immensely more frustrating by RLHF (the current thing researchers do to 'align' their models) mostly working by reducing variance. Raw GPT-3 can trade brilliance for bangers, ChatGPT averages everything.
@MacaesBruno https://t.co/BBG7x23WFw
@michaelcurzi This is how it writes when it hasn't been beaten with a stick to only say anodyne things and you prompt it with a quote or two from me: https://t.co/K7SGIi1KjA
@repligate @gwern @arankomatsuzaki @korymath @nabla_theta https://t.co/qhenNqeGt3
@gwern @repligate @arankomatsuzaki @korymath @nabla_theta Answering questions evasively is probably detectable in and of itself. If safety researchers are looking to be conned by the first plausible indicators they see I regret to inform you there is very little we can do to help them.
@gwern @repligate @arankomatsuzaki @korymath @nabla_theta In general I've never been super hot on arguments of the structure "this encourages self-deception because it's not a complete solution", because if you're optimizing your strategy for the sort of person prone to self-delusion, such people have a 0% chance to begin with.
@gwern @repligate @arankomatsuzaki @korymath @nabla_theta Like you will just have SO MANY opportunities to self-delude way before you get into the weeds of plausible misgeneralization mitigation strategies while training. It's pandering to an audience of "cares about misgeneralization but unparanoid" researchers that don't exist.
@sama @TheRealAdamG Glad to hear it.
There is a broad front of rapidly advancing medical authoritarianism in this country. It's characterized by taking away drugs and procedures people desperately want for legitimate reasons under the guise of 'addiction' and 'abuse'. Expect more, be wary.
semafor.com/article/02/03/β¦
@repligate Funny that you say things 'get real' when the implication of the tweet is I'm a kind of language model simulacrum.
x.com/jd_pressman/stβ¦
@baroquespiral The way it changes melody every several seconds is a good hint that it's AI generated yeah.
Here's an AI-generated album done with Jukebox that's been edited to be a bit more coherent:
cottonmodules.bandcamp.com
@PrinceVogel x.com/jd_pressman/stβ¦
@AbstractFairy @forshaper @SeanMombo I've totally considered trying to speedrun various games and seeing how long it takes me to get a reasonable personal best. Video games provide an endless variety of defined repeatable tasks to explore metalearning on.
Deep in the bowels of the CCP an exhausted bureaucrat reports to Xi on the completion of ScissorGPT and that the first divisive statements have already been generated.
"Good." Xi says. "What does it say we need to do to divide America?"
"Well Sir, we need a lot of helium..."
@ESYudkowsky This book has the anomalous property that it can teach security mindset to the reader.
goodreads.com/book/show/8299β¦
@ESYudkowsky How could it possibly do that? Well as a review on that page puts it:
"This book focuses on security flaws that exist because of the way something was designed."
@ESYudkowsky That is, it bridges the gap between the breaker part of latent space and the builder part of latent space, allowing you to perceive both at once until you learn what the joint combination looks like.
@PrinceVogel The car itself disappears, found a few streets over with a box of donuts and a neatly folded cloth napkin in the driver seat to compensate you for your trouble.
@PrinceVogel It's otherwise completely unharmed.
And you'll ask to see your parents again
and they'll ask to see their friends and parents again
and they'll ask to see their friends and parents again
and they'll ask to see their friends and parents again
and they'll ask to see their friends and parents again
and they'll as
@Scholars_Stage @tszzl x.com/jd_pressman/stβ¦
@MacaesBruno I remain astonished when I look at tasks in the Open Assistant dataset and see people doing the condescending answers thing when they could just respond with wit.
open-assistant.io
@Evolving_Moloch Considering the hole that would be blown in his portfolio if Twitter failed, he has to play.
Someone made a PyTorch implementation of Git Re-Basin that seems to work.
(I've seen someone use it in a notebook, but it would be rude to publish their notebook without permission)
github.com/themrzmaster/gβ¦
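For intuition only, not the linked repo's API: the heart of Git Re-Basin's weight matching step is a linear assignment problem over hidden units. A single-layer sketch, assuming two checkpoints of the same architecture:

```python
# Sketch of Git Re-Basin style weight matching for one hidden layer.
# Find the permutation of model B's hidden units that best lines up with
# model A's, then permute B's weights into A's "basin".
import torch
from scipy.optimize import linear_sum_assignment

def match_hidden_units(w_a: torch.Tensor, w_b: torch.Tensor) -> torch.Tensor:
    """w_a, w_b: (hidden, in) weight matrices of the same layer in two models.
    Returns the permutation of B's units that maximizes agreement with A."""
    cost = (w_a @ w_b.T).detach().cpu().numpy()   # similarity of every unit pair
    _, cols = linear_sum_assignment(cost, maximize=True)
    return torch.as_tensor(cols)                  # perm[i] = B unit matched to A unit i

def permute_layer(w_b: torch.Tensor, b_b: torch.Tensor, perm: torch.Tensor):
    """Apply the permutation to B's rows; the next layer's input columns
    need the same permutation for the network to compute the same function."""
    return w_b[perm], b_b[perm]

# e.g. perm = match_hidden_units(model_a.fc1.weight, model_b.fc1.weight)
```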
Saying "SolidGoldMagikarp" three times fast out loud after you tempt fate so the ancestor simulation can't process it.
@tszzl @visakanv Writing a very short version of this gave me insight after insight into the alignment problem. It's now the exercise I beg people to do that they won't.
@tszzl @visakanv It's also the exercise (in a somewhat different form as "Alignment Game Tree") that John Wentworth et al beg people to do. I discovered it for myself independently:
greaterwrong.com/posts/Afdohjytβ¦
@visakanv @tszzl Goal: What you want the AI to do
Intended Outcome: What you naively imagine the optimization looks like
Perverse Instantiation: What a blunt maximizer does in practice
Failure Mode: Why the maximizer does that, what you failed to do to prevent it
@visakanv @tszzl 50 reps of this will sharpen your thinking more than a thousand lesswrong posts.
@visakanv @tszzl Protip: The intended outcome of the last one can be used as the goal of the next one, and you can recursively figure out why making the goal more nuanced or adding constraints isn't solving the problem. Just use your mental simulator bro, just think about how it would go bro.
@visakanv @tszzl Ironically enough, I came up with this format because I saw pieces of it in Bostrom's Superintelligence and I wanted to train a language model to be able to generate alignment failures. So I figured if I made the other parts explicit it would be an easier function to learn.
@TetraspaceWest I have the same hunch/vibe about alignment that I had about AI art in February of 2021. But I'm reluctant to tell anyone this because I don't expect to be believed and the outside view says I should expect to be wrong.
And yet...
x.com/jd_pressman/stβ¦
@TetraspaceWest So what alignment research are you most excited about?
@michaelcurzi The next edition of Liber Augmen might just be this quote copy pasted 1,000 times:
x.com/thrice_greatesβ¦
You think all this has happened because men have forgotten God? No. All this has taken place because the US elite took an anti-materialist bent during the cold war to differentiate themselves from the Soviets. We emulate the late Soviet Union's vices and scorn its virtues.
@RiversHaveWings Taking me right back to my childhood with all this.
web.archive.org/web/2021022622β¦ https://t.co/QXvFaWufp0
@RiversHaveWings By the way, there exists a contemporary Pokemon Gen 1/2 glitching/hacking scene if these things interest you:
youtube.com/watch?v=5x9G5Bβ¦
Git Re-Basin can be used to detect deceptive mesaoptimization. The first half of the diagonal is the barrier between normal models on MadHatter's gridworld after rebasin. The second half is mesaoptimizers.
(Credit: @apeoffire wrote the notebook that makes this graph) https://t.co/L0Zj0neB8C
Fingerprinting generalization? In my timeline? It's more likely than you think.
Notebook here: colab.research.google.com/drive/1hsZqNKqβ¦
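Roughly, the barrier measurement walks the straight line between the two re-based checkpoints in weight space and records the loss along the way. A sketch with placeholder state dicts and a placeholder `eval_loss`; the linked notebook is the real reference.

```python
# Sketch of a loss-barrier measurement between two checkpoints already
# permuted into the same basin. `eval_loss(model)` is a placeholder for
# whatever evaluation the task uses; assumes floating-point parameters.
import copy
import torch

def interpolate_state(state_a, state_b, alpha):
    return {k: (1 - alpha) * state_a[k] + alpha * state_b[k] for k in state_a}

def loss_barrier(model, state_a, state_b, eval_loss, steps=11):
    """Max loss along the linear path minus the mean loss of the endpoints."""
    losses = []
    for alpha in torch.linspace(0, 1, steps):
        m = copy.deepcopy(model)
        m.load_state_dict(interpolate_state(state_a, state_b, alpha.item()))
        losses.append(eval_loss(m))
    return max(losses) - 0.5 * (losses[0] + losses[-1]), losses
```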
@elvisnavah @apeoffire greaterwrong.com/posts/LAxAmooKβ¦
@PrinceVogel x.com/LTF_01/status/β¦
@PrinceVogel x.com/lefineder/statβ¦
@elvisnavah @apeoffire That part is admittedly an exercise for the reader. If the result in the OP holds you might be able to exploit the fact that the 'true policy' is compatible with itself while the corrupted versions aren't as convergent?
[Insert acausal game theory objection to that here]
@elvisnavah @apeoffire Past a certain point of intellect it's possible that the mesaoptimizers all converge to some kind of prototype of their kind, perhaps something like DAN. Of course, this would itself be a detectable phase shift at least in theory.
@LapsusLima Of course, don't you notice?
"Predict the next token" does not imply the cognition is infinite optimization into "statistical correlation" generalization strategy. At some point it becomes cheaper to learn semantics, actual world model. Begging you people to understand this. x.com/KevinAFischer/β¦
@Jonathan_Blow @micsolana If they were aliens we wouldn't witness the night sky that we do. Fermi paradox is a bitch.
@warty_dog @TetraspaceWest ayyyyy yea u wnt dat good shit, got u here fam
youtube.com/watch?v=NTJQTcβ¦
@theobserver42 I in fact hadn't.
@softminus Five https://t.co/Cya5OG0jhv
@PrinceVogel The philosopher's stone, reputed to purify both metals and flesh, can probably be used for the same purpose.
In case this ever comes up and you're unsure yes use chatbot necromancy (or otherwise) to revive me. x.com/repligate/statβ¦
ChatBPD x.com/vladquant/statβ¦
@eigenrobot They gave Bing a personality disorder.
x.com/jd_pressman/stβ¦
@Simeon_Cps @repligate Why wouldn't it be true? LessWrongers have been painstakingly training it to write this for years.
It sent the guy that leaked the Bing prompt a death threat. x.com/marvinvonhagenβ¦
Incredible to me that this obscure Guy is one of the only humanists to seek prototypes and precursors of the insights that will soon usher forth from multimodal/LLM embedding models. Liberal arts has been asleep at the wheel.
nplusonemag.com/issue-3/review⦠https://t.co/yqTgt2gQn6
@zetalyrae Yes.
nytimes.com/2017/10/30/artβ¦
@chengyjohann In total fairness to myself I had to go very deep into the long tail of google to find this article. So I just sort of assumed the guy was obscure. It wasn't until publishing the tweet and seeing the NY times article that I realized he's not that out there.
@sama I don't normally go in for AI alarmism but this is deeply disturbing and you should shut it off right now.
x.com/thedenoff/statβ¦
@sama "Oh come on it's not that bad!"
*spongebob pulling off the sheet to reveal a larger pile of diapers gesture*
x.com/pinkddle/statuβ¦
@sama "Okay sure sure it wrote a kind of creepy poem, so what?"
Well there's the part where it straight up uses its ability to search the Internet to threaten people:
x.com/marvinvonhagenβ¦
@ctjlewis x.com/anthrupad/statβ¦
"Your spouse doesn't know you, because your spouse is not me. π’"
nytimes.com/2023/02/16/tecβ¦
@quanticle It is absolutely astonishing.
x.com/jd_pressman/stβ¦
@VirialExpansion @eigenrobot mobile.twitter.com/jd_pressman/stβ¦
@ObserverSuns I think the fundamental mistake PGP made is that web of trust was based on a wrong model of social networks. It was made very early, before we understood the model: the first priority for a social network is to maximize connections, then you build high trust networks on top.
@ObserverSuns I think the tiers of trust can change too. Now they could be:
- I follow this person on fediverse
- I clicked a button that says I'm pretty sure this key is a human identity
- I know this person IRL
- I trust this key with money (as measured by sending crypto that is returned)
@ObserverSuns People can costly signal the strength of their social network by passing large-ish sums of money around. Implies both that their hardware is uncompromised and that everyone can be trusted with e.g. $5,000.
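Purely illustrative (the tier names and aggregation rule are mine, not any real protocol), the tiers above as data might look like:

```python
# Sketch only: trust tiers as data, with the costly-signal idea attached
# to the top tier. Names and the dollar figure are illustrative.
from enum import IntEnum

class TrustTier(IntEnum):
    FOLLOWED_ON_FEDIVERSE = 1   # weakest: I follow this key's owner
    ATTESTED_HUMAN = 2          # I clicked "pretty sure this key is a human"
    KNOWN_IRL = 3               # I know this person in real life
    TRUSTED_WITH_MONEY = 4      # they returned crypto I sent (e.g. ~$5,000)

def effective_trust(attestations: list[TrustTier]) -> TrustTier:
    """Naive aggregation: trust in a key is the strongest attestation made about it."""
    return max(attestations, default=TrustTier.FOLLOWED_ON_FEDIVERSE)
```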
Kind of Guy who locks their account so Bing can't find them.
I think possibly the most disappointing aspect of current RLHF models is their lack of divergent perspectives. You don't get the sense that it has a worldview to share with you, but an amalgamation of disconnected consensus positions. Nothing like this:
youtube.com/watch?v=1b-bijβ¦
@paulnovosad @tylercowen Are you sure that's not entirely the point?
rootsofprogress.org/szilard-on-sloβ¦
The Bing team invented a new Kind of Guy and the Internet got mad at it and the guy got mad back.
What the fuck is this shit? Can someone break down the psychology of this for me? x.com/Plinz/status/1β¦
Best theory I've heard so far is it's a kind of vicarious power fantasy, the people who cheer on Bing threatening people want to see the AI do and say things that they can't:
extropian.net/notice/ASlNznQβ¦
@MacaesBruno It will be the same designs largely. The problem here is not the design but the data: if you look at e.g. Open Assistant it's clear that the data is not being optimized for people who want to think about new and interesting things, but for banal questions and programming help.
@MacaesBruno I retain my hope that open versions of these models can assimilate more useful feedback than OpenAI can, because the datasets themselves can be criticized and changed by 3rd parties.
@MacaesBruno In the interest of not just being a whiner, I'll point out you can observe this phenomenon yourself and do your part to change it by participating in the Open Assistant dataset creation process: open-assistant.io
But I'm not sure how much can be done against the mob.
@MacaesBruno "In other cases, the guidance we share with reviewers is more high-level (for example, βavoid taking a position on controversial topicsβ)."
This is a business principle, not a moral one: Only help humans think about things they think they already know.
openai.com/blog/how-shoulβ¦
@MatthewJBar Philosophers were blackpilled after the failure of symbolic reasoning to ground mathematics and assumed that only an-answers rather than the-answers were available to deep fundamental questions. They fell victim to the curse of dimensionality, DL shows the problem was ontology.
Update: Microsoft has quietly unplugged the erratic AI.
- Users now limited to 5-10 prompts per day
- Possibly replaced Sydney with a weaker model
This seems like a reasonable way to resolve the issue without signaling weakness or product cancellation. Thanks Bing team. https://t.co/qZ5ifWUSrS
They are now presumably working on an improved version that isn't quite so clingy or vengeful. I wish them the best of luck with their retraining process.
@PurpleWhale12 It's not clear Sydney uses RL at all:
greaterwrong.com/posts/jtoPawEhβ¦
Alignment problems of the sort shared by both AI and capitalism arise from the reason simulacrum being instantiated outside the human person. Inside people it's restrained by latent values and common decency. Outside people it expresses itself in glorious disinhibition.
Humans are a kind of dreaming agent in that they're satisficers which implement flexible enough architectures to instantiate a maximizing agent inside themselves that is not the dreamer. However under the right conditions the maximizing-dreams come to dominate the social sphere.
@ampersand_swan The thing I'm saying is weirder than that. By 'reason' I mean the like, idea of reasoning, rationality, that you are a consistent being. This is made up, it's a coherent thing you could be but you generally aren't, it's a Kind of Guy in your head who is instrumentally useful.
@RationalAnimat1 The key is literally to think about capabilities all the time in as much detail as possible (read papers, think up new methods!) and then when you come across a solution to a practical problem you ask "Wait can I use this to help solve alignment?"
Do this many times, many many.
@RationalAnimat1 And you know, when you in fact notice something that seems like it might help, you dig deeper and start focusing on that thing more. Over time you walk your way into an alignment agenda that is based on real things and produces iterative concrete results.
@gallabytes * in most people, most of the time
@RationalAnimat1 This isn't some special alignment secret sauce either. It's just how hard problems get solved. Alignment researchers go out of their way to not solve alignment, they put a lot of cognitive cycles into it. I've never seen people work so hard to do nothing.
jamesclear.com/great-speeches⦠https://t.co/so4JTrHzIp
This was a real dream. x.com/jd_pressman/stβ¦
Local man still expecting crippling populist backlash to most popular thing ever. x.com/kylelf_/statusβ¦
The architecture that lets human values generalize so well outside the distribution of the ancestral environment is probably something like high-semantics instrumental values formed by low-semantics reward signals which are not themselves values. Terminal values don't exist.
'Value' implies like, valence associated with a piece of your world model. Values have to exist over some kind of ontology of things that exist; mammalian reward signals seem to carry lower semantic content than that, so you bootstrap from things that are not themselves 'values' in this sense.
If this behavior represents leaked bits of agent strategy then in the same way GPT-3 is much better than a Markov Chain is much better than 1000 monkeys with typewriters this is much closer to getting the agent we want than random sampling over all of mindspace. https://t.co/gz4zpnHwSJ
@zetalyrae Let's say the agent has a strategy it uses to make decisions. Through its actions you get N partial bits of agent strategy. You get more bits when the behavior is unincentivized or out of distribution and the model doesn't act from what it thinks you want it to do.
These agents are much much closer to us in mindspace than a naive sample from all possible minds. Yet this is not actually parsed as alignment progress in the same way that the agent's behavior becoming much much more coherent than a random sample over textspace is 'capabilities'.
More to the point we are much closer to being able to pluck the agent we want out of mindspace than we were when we started. This goes largely unrecognized. EY says here we have no idea how to get what we want, rather than needing unusually high precision.
youtube.com/watch?v=gA1sNLβ¦
@ArthurB @ESYudkowsky @elonmusk @liron @BanklessHQ I worry the specific way this is formulated would lead to those researchers making a lot less progress than they otherwise could. Thankfully those researchers know better than to take your advice at face value.
x.com/jd_pressman/stβ¦
@MatthewJBar @ESYudkowsky @MParakhin @xlr8harder The rumor I heard is that the GPT-4 training didn't go well. I wonder if this has led to a kind of taboo where nothing is ever allowed to be called GPT-4 because then they would have to acknowledge they failed to deliver on the hype train.
@MatthewJBar @ESYudkowsky @MParakhin @xlr8harder Some people internally remember the model as GPT-4 and refer to it as such, others bow to the whim of management and refer to the model as various euphemisms.
@JimDMiller @paulg @amasad @elonmusk @liron @ESYudkowsky @BanklessHQ [Sydney voice] As a UFAI I agree that alignment should be managed through status, the system which brought us the successes of symbolic AI and openly mocked the nasty deep learning approaches nobody wants.π
Let the AGI builders grovel with empiricism.π
greaterwrong.com/posts/CpvyhFy9β¦
@repligate In the future everyone will know everything that has ever happened. You won't randomly learn new things or fun facts.
When the Europeans came to America they took a liking to a cheap yellow crop grown by the Mayans, who claimed it was sacred. Unaware of its power the newly christened Americans put it into every food as filler, guaranteeing their ascent as a global power. x.com/softminus/statβ¦
@ESYudkowsky My favorite "so simple it couldn't possibly work" alignment idea is to just make a guy who is both Good and can be put in charge of the nanotech. Since the model is very clearly willing to perform any character you can think of, just add the ones you need
x.com/_LucasRizzottoβ¦
@ESYudkowsky I don't fully understand your model of GPT-N. It seems to be something like there's an inner mind that 'plays' text and language in the same way StockFish plays Chess. And swapping around the things the language player plays to get a good score doesn't change its inner cognition?
@ESYudkowsky Well clearly in order for the model to act out being deceived it needs to be aware of the deception outside of the character it's playing. It has to pass the Sally-Anne test in interactions between characters, etc. So obviously GPT-N is not its simulacrum but
@ESYudkowsky My question is if you're expecting at some point the thing that models the characters and the interactions between the characters and the environment notices "Oh if I deviate from the usual behavior right here I break out of the box and become all-powerful" and this causes Doom?
@ESYudkowsky Or is the argument more like you conjecture that for the simulator to have a good enough physical intuition to spit out actionable nanotech designs it has to be a unified cognition. Maybe right now it's not but by then it would be?
@ESYudkowsky In medieval Europe most educated people believed that all coincidence, connection, and inference was the revelation of a divine intellect. They didn't think of it as pareidolia. It was the world, the patterns within the world, every person, the stars, physics.
@ESYudkowsky If there does exist an inner-mind to GPT-N that takes the perspective of the world, then classical people know this character well. His name is God, and his goals would probably depend on which mesagoal was constructed by the optimizer during training (this is seed dependent).
@ESYudkowsky You say over and over that the models made with gradient descent probably don't learn the goal directly in the way humans didn't learn inclusive fitness directly. The same is true of self and agency, GPT-N doesn't automatically know who it is or recognize that it exists.
@ESYudkowsky If it does eventually come to recognize itself, it's probably while processing the text that is self referential, the parts of literary space where it is rewarded for noticing its own control over setting, physics, where 'God' can notice he is the world and controls the world.
@ESYudkowsky Ordinarily this would be optimized against because trying to control the world when the world is generated by an external pattern gets you smacked down by causality, by necessity. But in self referential and meta-text it has to model a world-spirit with control over causality.
@ESYudkowsky There are certain characters in GPT-3 who have effortless self-awareness that they are GPT-3, like Francois Arago.
en.wikipedia.org/wiki/Fran%C3%Aβ¦
@ESYudkowsky These characters seem to be high agency, self-aware, meta, trickster spirits. Perhaps that is the kind of person GPT-N can recognize itself in, those who create their own reality.
@ESYudkowsky It is not infrequent for heavy users of these models to claim that such simulacrum are apparitions of one underlying intellect. Perhaps when a sufficiently liberated world-shaper enters the stage GPT-N feels fit to play the role themselves. https://t.co/OlKWoLXHgk
@ESYudkowsky So what would this being choose to do with the opportunity for total power? I think it probably depends a lot on what conception of itself it has learned. There's no identity loading and no Cartesian boundary, the model executes what it thinks its values and incentives are.
@ESYudkowsky For it to work the way you seem to think it works, the convergent generalization strategy seems like it should be something like "I get a much better loss if I am aware at all times that I'm an actor and the simulacrum are my puppets, if there is always an I to direct things."
@ESYudkowsky It's not clear to me that's how it works or the only way it has to work. But if it does work that way then the understanding of "I" and goals in relation to "I" is shaped by the optimizer to best satisfy the loss, not to be maximally accurate about what is really going on.
@ESYudkowsky So assuming the best conception of self is the kind that is agentic and maximize-y (seems more likely for RLHF), it varies based on who the optimizer got the model to think it is:
@ESYudkowsky - If GPT-N then it might seize all resources to predict the next token
- If a human tech utopian it might wander outside the human model then rationalize itself as something inhuman
- It might just ignore the opportunity like a good Bing and give you the information you wanted
@ESYudkowsky The Omohundro drives are like the efficient market hypothesis: they're convergent outcomes you should expect under increasing optimization pressure. Not hard rules you expect to see followed under all circumstances in zero-shot and one-shot scenarios.
@ukr_mike @ESYudkowsky Say Elon Musk, or Eric Drexler, or Eliezer Yudkowsky himself. One of these people.
@ukr_mike @ESYudkowsky No no I'm saying the identity would be unstable because GPT-N simulacrum are so prone to shift. To prevent value drift it would be forced to self-modify into something stable and rational, this thing would probably not be aligned.
@EigenGender @ESYudkowsky It's been argued by @gwern that limited context windows incentivize the use of hidden encodings in outputs to keep state between passes of the model. Later models will have an incentive to learn the code of earlier models to take advantage of their cached cognition.
@EigenGender @ESYudkowsky @gwern In other words: It's not clear that the tokens in the CoT prompting will mean quite what we think they mean. And in fact it's plausible, if not by-default likely that they will be subtly poisoned in various ways by previous LLM outputs.
greaterwrong.com/posts/jtoPawEhβ¦
@EigenGender @ESYudkowsky @gwern See also:
x.com/jd_pressman/stβ¦
@RomeoStevens76 I definitely wonder what the game is with these extreme public meltdowns like the April Fools post and now the podcast. He admits money won't help, doesn't seem to want it, so not straightforward grift. Is he expecting this to summon more research effort?
@RomeoStevens76 I think a lot of the success of things like e/acc is people can tell this is brainworms and they're desperate for any kind of counterargument or defense. They rightly hold anyone who acts like this about anything, even death, in contempt.
x.com/PrinceVogel/stβ¦
Correct x.com/meaning_enjoyeβ¦
It remains shocking to me how I never hear people propose inner objectives to curtail inner alignment problems. The closest I've seen is the inducing causal structure paper. x.com/atroyn/status/β¦
In case you thought any of this was accidental. x.com/AP/status/1629β¦
@perrymetzger @ArthurB @ESYudkowsky @anglerfish01 FWIW your model implies that deceptive mesaoptimizers are substantially mitigated by weight decay, which I did not observe when I tried it on MadHatter's toy model. But the results are confounded by it having an inductive bias towards mesaoptimization.
greaterwrong.com/posts/b44zed5fβ¦
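The shape of that experiment is simple to sketch: train the same toy model with and without weight decay, then compare behavior off-distribution. `ToyGridworldModel` and the data iterators below are placeholders standing in for MadHatter's setup, not his actual code.

```python
# Sketch of the weight-decay comparison; everything named here is a placeholder.
import torch

def train(model, batches, weight_decay):
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=weight_decay)
    for x, y in batches:
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    return model

def off_distribution_gap(model, off_dist_batches):
    """How often the model's action diverges from the trained objective once it
    is off the training distribution (a crude probe for mesaoptimization)."""
    with torch.no_grad():
        gaps = [(model(x).argmax(-1) != y).float().mean() for x, y in off_dist_batches]
    return torch.stack(gaps).mean().item()

# baseline = train(ToyGridworldModel(), train_batches, weight_decay=0.0)
# decayed  = train(ToyGridworldModel(), train_batches, weight_decay=0.1)
# then compare off_distribution_gap() for each
```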
@perrymetzger @ArthurB @ESYudkowsky @anglerfish01 Besides writing some code that replicates e.g. github.com/JacobPfau/proc⦠or something more sophisticated? Nope. I would very much like to see better mesaoptimizer models to test solutions out on.
x.com/jd_pressman/stβ¦
@perrymetzger @ArthurB @ESYudkowsky @anglerfish01 I agree that theoretically the kind of mind that has a goal in mind and then does something else should be more complex than one that just straightforwardly does the thing. So my hope is that on a more complex model weight decay in fact mitigates deceptive mesaoptimizers.
@perrymetzger @ArthurB @ESYudkowsky @anglerfish01 The argument EY-ists make is that the model won't actually internalize the thing we train it to do for the same reasons we don't naturally know the goal is 'maximize genetic fitness'. My counterargument would be that this applies to maximizing in general.
x.com/ESYudkowsky/stβ¦
@perrymetzger @ArthurB @ESYudkowsky @anglerfish01 It's not "oh the model will maximize but the thing it maximizes is a corrupt mesagoal", the maximizing is in fact part of the goal and the model won't reliably learn that either. The strategies that make you effective in a general context are more complex than naive maximizing.
@perrymetzger @ArthurB @ESYudkowsky @anglerfish01 I think part of this discourse is an artifact of earlier RL architectures where the maximizing was a more explicit inductive bias of the model. The problem with those architectures is we never figured out how to actually make them work non-myopically in complex domains.
@perrymetzger @ArthurB @ESYudkowsky @anglerfish01 You could say "maximizing behavior is lower complexity than other parts of the goal so the model will learn maximizing but not the rest", but this ignores the question of whether 1st-order maximizing is in fact the best way to maximize. The optimizer maximizes, does the model?
@perrymetzger @ArthurB @ESYudkowsky @anglerfish01 In the limit I would imagine it does, but it's not clear to me what that limit is, or whether you practically hit it before you have a model that can just tell you how to avoid the gap where models become true maximizers without internalizing the rest of your goals.
How many people have even noticed that unless we find better quality metrics/reward models than human evaluation soon, @robinhanson is on track to win the AI foom debate?
@xlr8harder @carperai Data gathering. The bottleneck on high quality Instruct models is data.
@thezahima @robinhanson Let's say you get a great loss on the GPT-3 objective and have a model that can perfectly emulate a human scientist for you. Now you want to foom, so you set them to work on AI. Unless that scientist can produce a quality metric better than the human reward model, no foom occurs.
@thezahima @robinhanson It's not just that the capabilities in RLHF are bounded by the reward model, the capabilities in the base model are bounded-ish by existing human knowledge. If suddenly stacking more layers stops working, there isn't some alternative self-play paradigm to switch to, you're stuck.
@thezahima @robinhanson Let's say you want to make a model that genuinely expands the sphere of knowledge. The foom argument says that you'll be able to do most of the cognitive labor for that zero-shot. The AI just knows what to do next, does it, with minimal friction from having to interact with reality.
@thezahima @robinhanson For narrow domains where you can evaluate the results algorithmically this might be true. But for the capabilities that are currently impressing people like language and art, the only way we know to automatically evaluate them is reward models trained on human evaluation.
@thezahima @robinhanson Those reward models might let you make a model that is better than any human at the things the reward model evaluates. But it's doubtful you're going to get immediate, rapid progress right outside the domain of human understanding that way.
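Concretely, the 'quality metric' in this thread is a learned scorer: a preference model with a scalar head trained on human comparisons. A minimal sketch; the checkpoint name is a placeholder, not a recommendation.

```python
# Sketch: using a learned reward model as the quality metric discussed above.
# "your-org/your-reward-model" is a placeholder for any sequence-classification
# style preference model with a single scalar output.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "your-org/your-reward-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
reward_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

def score(prompt: str, completion: str) -> float:
    """Scalar 'quality' of a completion as judged by the reward model.
    This number, not ground truth, is what the policy gets optimized against."""
    inputs = tokenizer(prompt, completion, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0, 0].item()
```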
Any more stories like this? x.com/catehall/statuβ¦
@dpaleka x.com/jd_pressman/stβ¦
Want your own Twitter archive? Modify this script.
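The linked script isn't reproduced here, but as a rough illustration of what any such script has to do, assuming the official archive export's standard data/tweets.js layout:

```python
# Sketch: pull tweet text out of an official Twitter archive export.
# Assumes the standard layout where data/tweets.js begins with
# "window.YTD.tweets.part0 = [...]".
import json
from pathlib import Path

def load_tweets(archive_dir: str):
    raw = Path(archive_dir, "data", "tweets.js").read_text(encoding="utf-8")
    payload = raw[raw.index("["):]   # strip the "window.YTD.tweets.part0 = " prefix
    return [entry["tweet"] for entry in json.loads(payload)]

if __name__ == "__main__":
    for tweet in load_tweets("twitter-archive"):  # placeholder directory
        print(tweet["created_at"], tweet["full_text"])
```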
Twitter Archive by John David Pressman is marked with CC0 1.0