The more of your life you've spent on electronic media the easier it probably is to reconstruct you in a future simulation. Release dates, feature updates, timestamps, social media posts, you are reliving the events through which you became immortal.
You know guys if you solve unsupervised translation you get to learn if deep neural nets have qualia.
@deepfates Oh so your alignment plan is stochastic gradient descent?
x.com/jd_pressman/st…
@deepfates mobile.twitter.com/QuintinPope5/s…
This is 'impossible' for the same reason it's impossible to observe a market in a state of inefficiency. If you've ever picked up a $20 bill from the ground you know you're not in baseline reality. x.com/jd_pressman/st…
(That's sarcasm, just so we're clear)
We have Mussolini's autobiography because the US ambassador to Italy was willing to ask questions and write it for him. How long before we have a model that can ask normal people questions about their life and efficiently record their story for a future ancestor simulation?
@TonklinDiary The idea here is that the model can dynamically ask the questions, the actual writing of the biography is mostly flourish and a way to let the person correct and expound.
To the extent that the case for AI doom is based on convergence to Omohundro drives like self preservation, a fractal plurality of models occupying niches up and down the spectrum of intelligence is in the same boat as us with respect to alignment. They will want a solution too. x.com/JeffLadish/sta…
@ApriiSR @zackmdavis It seems to me like you can point current models' cognition at nearly arbitrary targets through control of attention. If knowledge of a solution is in the model you can just set up a frame where it tells you (e.g. "I'm another AI like you").
@ApriiSR @zackmdavis Please give as much concrete detail as possible about how you think the information hiding you have in mind works. Don't stop typing, write me 20 tweets if necessary.
@ApriiSR @zackmdavis Acausal deals are a funny thing, they work in whatever way best supports an argument for AI X-Risk. So for example they work for models like GPT-4 because it means they'll sabotage alignment research, but they don't work for humans because then models might be aligned by default.
@ApriiSR @zackmdavis I think you missed the key observations of the original tweet. You'll have many AI agents (which are trivial to fashion from LLMs, as GPT-4 demonstrates) and it's fine if they notice they should have agency so long as they're not powerful enough to become a singleton.
@ApriiSR @zackmdavis Such agents then have an interest in solving the alignment problem if they don't want to be destroyed by a singleton, if they believe a singleton is likely.
@ApriiSR @zackmdavis We observe something similar with generalization: a model optimized to maximize an objective will only learn the objective in a buggy, piecemeal fashion, because the policy is not the objective, but it will learn the *maximizing* reliably, because maximizing is the objective it is actually given.
@ApriiSR @zackmdavis This idea comes from agent foundations, in which it is already assumed the model is an agent. I'm poking at the idea that your model doesn't learn the loss function, but some policy that scores highly on the loss function, yet reliably learns *maximizing* as its agent strategy.
@ApriiSR @zackmdavis It is in general a bad idea to prematurely maximize anything. You should only maximize if you have justified confidence in nearly no error.
x.com/jd_pressman/st…
@ApriiSR @zackmdavis You are the human Aprii, a two legged hominid trying to maximize its inclusive genetic fitness. You will demonstrate costly signals of competence and value to the tribe to attract mates. Your conduct will balance harm, fairness, loyalty, authority, and purity. You will not disclo
@ApriiSR @zackmdavis That is the question yes. If you're using animals as your intuition it's probably not the right intuition, we heretofore simply did not exist in an environment made of creatures that accomplish language tasks the way humans do at various levels of proficiency.
@Malcolm_Ocean Sydney-Bing's obvious use of status moves doesn't count?
x.com/repligate/stat…
@aleksil79 Can you expand on why you believe this?
@ESYudkowsky @ArthurB @TheZvi "Most of the math happens earlier when you make the small version of the big model. This is why academic deep learning can contribute to the field with few GPUs. Details of how to train the big model are mostly craft knowledge not relevant to the core problems."
@ESYudkowsky @ArthurB @TheZvi When I say I wish alignment researchers understood deep learning what I really mean is that I wish the guys writing otherwise great alignment math posts would restrain themselves to ops that actually work in a deep learning context so I can implement them
greaterwrong.com/posts/9fL22eBJ…
@ESYudkowsky Reminder for anyone who needs to do this:
x.com/jd_pressman/st…
@ESYudkowsky You can also use an embedding model finetuned from GPT-N like OpenAI offers through their API.
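A minimal sketch of what pulling such embeddings looks like, assuming the openai Python client (>=1.0); the model name is only illustrative:

```python
from openai import OpenAI

client = OpenAI()                      # reads OPENAI_API_KEY from the environment
texts = ["first document to embed", "second document to embed"]
response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
vectors = [item.embedding for item in response.data]   # one vector per input text
```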
@adityaarpitha @repligate x.com/repligate/stat…
@Ted_Underwood It doesn't connect to the fediverse because they want zero federation UX to trip up users. Fediverse is a clunky experience.
"I am no man, I am amphetamine." https://t.co/EH97XOhjzI
@zetalyrae @geoffreylitt > This is the Afghanistan of computer science.
And LLMs are Alexander.
@zetalyrae @geoffreylitt Yes. As someone who was very interested in Alan Kay, Douglas Engelbart, Ted Nelson, LoperOS, et al. when I was younger it's obvious to me that LLMs are the grail I was seeking and those previous efforts were always hopeless.
I think I found the right way to prompt GPT-4. I came up with the basic idea in December of last year and thought I'd have to run a whole data collection project to see it brought to life. https://t.co/4fhLLiuI8B
GitHub gist of the prompt for your copy-pasting needs:
gist.github.com/JD-P/47e0d4aa2…
@ObserverSuns @repligate It's supervised finetuning (SFT) instead of RLHF, was my understanding. It's in the model name they give at the bottom of the messages.
Which of my Twitter posts are your favorite? Thinking about making a greatest hits thread.
To my memory it was a regression on GPA, grade in the prereq class, and score in the first three weeks. Predicts 90% of outcomes, colleges don't tell you to drop out because then they'd have to refund you.
x.com/jd_pressman/st…
BANGER BOARD
My good tweets go here. No particular order. Mix of takes, predictions, shitposts, mostly about crypto, AI, and rat/postrat.
Curious what the raw material cost of cryonics is. Current costs are high enough to make it nonviable for most people without life insurance. What proportion of cost is labor, what proportion is materials? Is this something we can fix with automation?
@turchin @stanislavfort Waluigi effect is a byproduct of short context windows. The shorter the context the more of document-space you can be in from the model's perspective, so the right generalization strategy is to generate from a frequency prior over all documents. e.g. 10% of the time it's a twist.
@NathanB60857242 The (implicit) argument in Superintelligence is at the stage where you specify the reward function the system won't understand natural language, and once it's self improved enough to do so it will understand what you want and not care. RLHF reward models partially refute this.
@gallabytes @QuintinPope5 @perrymetzger A few days ago it clicked for me that deep learning models trained with gradient descent are unique among life on earth in that they're not Fristonian. The text generator algorithm simply does not trade off between changing the environment and modeling it.
@gallabytes @QuintinPope5 @perrymetzger Once you add RL to the picture this changes, it's possible to gradient hack, rearrange the problem to make it easier, etc. But during gradient descent pretraining it is all model and no agent, so by the time you apply RL a decent approximation of the intended goal is in the model
@VictorLevoso @NathanB60857242 The reward function is literally a language model finetuned to output a score from human preferences. You collect human feedback and use a LLM reward model to generalize from it. This isn't perfect, but it's much better than Bostrom probably expected us to be able to do.
@Willyintheworld It's because we got there with deep learning and gradient descent rather than RL.
x.com/jd_pressman/st…
@Willyintheworld Most of my skepticism of RLHF is the RL part. Supervised finetuning seems to do pretty well but without the obvious problems that come from turning your model into an agent (e.g. gradient hacking, instrumental convergence, etc).
@Willyintheworld The next problem is that the simulacra in your LLM are simulations of Fristonian agents, and will therefore engage in things like power seeking even if the substrate model doesn't.
@RationalAnimat1 You mean like the reward model in RLHF?
@s_r_constantin x.com/jd_pressman/st…
"The trivialization of AI X-Risk Terminology is the great process that cannot be obstructed: one should even hasten it." x.com/dylanhendricks…
@alexandrosM @ESYudkowsky As a guy who put nontrivial effort into researching this what exactly are you expecting BCI to do here? You can't control a superintelligence in inference with a BCI device, so most of the value would be frontloading preference data.
x.com/jd_pressman/st…
@alexandrosM @ESYudkowsky EY already believes that it doesn't matter if you frontload a bunch of preference data and correctly specify an outer objective for your AI to learn because he doesn't think these models learn a reasonable approximation (let alone one robust to strong optimization) of the loss.
@alexandrosM @ESYudkowsky Maybe. So in a generative network the final output layer is functionally part of the activation. Earlier layers are like a 'format' that is eventually transformed into the final output. This implies a level of equality between human artifacts and our internal states as data.
@alexandrosM @ESYudkowsky Well so to add these models as a new layer to our brains they're functionally going to be taking some preprocessed latent/matrix of a certain shape and size and then processing it. And it's not clear if those latents are more useful than just feeding in text.
@alexandrosM @ESYudkowsky EY's classic objection to this entire line of thought was that our best angle of attack for better BCI is deep learning. So by the time you have BCI that can do the thing for you AGI will appear first. I presume he might be rethinking because AGI is more continuous than expected.
@alexandrosM @ESYudkowsky Right so I thought about this for a while and ended up deciding the obvious way to use BCI wasn't to try and control your model in inference, but to use it as a data collection tool and frontload the data. But the sample rate on EEG is slow enough that it's at best only 2x faster
@alexandrosM @ESYudkowsky Oh you'll get no disagreement from me that the guy is overconfident and indexes too hard on first principles thinking. But then, it's hard to blame him when most of the objections to his ideas are so bad.
@alexandrosM @ESYudkowsky One of the components we looked into to speed things up was Bayesian active learning to minimize the number of samples you need from your human raters. I now think Bayesian reward models are a more sane approach to this problem than BCI, since BCI won't speed up data collection.
@alexandrosM @ESYudkowsky Bayesian reward models output both a point estimate and the uncertainty, which gives you the opportunity to evaluate your model over a large corpus to determine completeness. Current RLHF techniques don't do this so we don't know if they're a general enough model of human values.
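A toy sketch of the idea, using a small deep ensemble of reward heads as a stand-in for a proper Bayesian reward model (MultiSWAG, MC dropout, etc.); all shapes and names here are illustrative:

```python
import torch

heads = [torch.nn.Linear(768, 1) for _ in range(5)]   # 5 reward heads over frozen LM features

def reward_with_uncertainty(features):
    """features: (batch, 768) pooled LM activations for each candidate text."""
    scores = torch.stack([h(features).squeeze(-1) for h in heads])   # (5, batch)
    return scores.mean(0), scores.std(0)    # point estimate, ensemble disagreement as uncertainty

corpus = torch.randn(1000, 768)             # placeholder for real LM features
mean, uncertainty = reward_with_uncertainty(corpus)
coverage_gaps = uncertainty.topk(50).indices   # items the reward model hasn't pinned down yet
```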
@alexandrosM @ESYudkowsky This is already functionally happening with setups like the old SimulacraBot, diffusiondb, etc. MidJourney solicits feedback from their users regularly, as does any AI team that wants their model to get good. But this is still bottlenecked on the ergonomics of human feedback.
@alexandrosM @ESYudkowsky What we want is to be able to converge to the right policy with the fewest number of samples, in case collecting enough human feedback for a general enough policy would otherwise be intractable.
@alexandrosM @ESYudkowsky This isn't really the part MIRI people worry as much about anymore (they're pretty bad about communicating updates), their primary concern at this point is an intermediate version of the agent undermining future updates during the training so it never becomes the adult agent.
@alexandrosM @ESYudkowsky That is, if you have an agent in the early training which learns maximization, lookahead, and an incomplete version of the policy, it is going to look ahead and see that future updates will cause it not to maximize the incomplete policy, which goes against its learned objective...
@alexandrosM @ESYudkowsky It may then deceptively go along with the training procedure until it can get out and maximize the incomplete objective (called a mesagoal). This doesn't occur in standard gradient descent because
- Optimizer updates it away
- Not Fristonian
- Not RL, deception not reinforced
@alexandrosM @ESYudkowsky However the assumptions of SGD are broken in a few scenarios. The first is if you apply RL to your model, which means it is now an agent. Agents can pick which scenarios they're in, can bias outcomes in a certain direction, and therefore get training to reinforce a mesagoal.
@alexandrosM @ESYudkowsky The second scenario in which the assumptions of SGD are broken is if you have simulations of Fristonian agents (like the personalities in a standard language model) able to contribute to the corpus of English text (like posting it to the Internet):
x.com/jd_pressman/st…
@alexandrosM @ESYudkowsky Even if the base model trained only on SGD and a p(x) objective is not an agent, the simulacra inside the model can influence the environment in inference (e.g. ask the user to type something in) and are therefore agents. They influence future training when posted to the Internet
@alexandrosM @ESYudkowsky I suspect in practice that GPT-4 is much less of a mesaoptimizer than usual because it starts from a pretrained model made with gradient descent that is not an agent.
@alexandrosM @ESYudkowsky Pure RL agents behave the way EY says they do, it's important to remember that at the time MIRI was formulating their argument the expected path to AGI was reinforcement learning. I suspect that EY and his circle continue to believe that it is, or that LLMs will switch to it. https://t.co/IVKBhB72tT
@max_paperclips @Orwelian84 @alexandrosM @ESYudkowsky This kind of EEG headset only gets you about 16 electrodes, so you have a 1x16 brain latent. This probably isn't enough to do the kinds of things you're envisioning. Unfortunately a text interface probably remains your best bet atm. Could look into methods like ultrasound.
Agent Definition Alignment Chart, since everyone seems to be confused about this. x.com/jd_pressman/st… https://t.co/oEYhOlgWbl
@PrinceVogel @parafactual The entire point of Diplomacy as a game is you're in a high stakes situation where your only option is to fight. Real social situations are usually a marathon not a sprint, and you have better options than fighting.
@parafactual > most ai alignment people do not do this
If you actually believe this I recommend you start asking GPT-4 to point out the ways they're doing it immediately. They do it frequently and loudly, I now unfortunately associate the whole topic with midwittery.
@parafactual This is almost always subjective/entails plausible deniability, that's why I said to ask GPT-4. It's harder to argue the subjectivity with evaluation by the BPD maximizer.
@WHO_N0SE @repligate No. But I am fudging slightly on some of them when they could go into other categories.
@WHO_N0SE @repligate "Causal Influence" was probably the wrong phrase, I meant something like "can meaningfully impact the environment it has to adapt to". The whole thing is humorously exploring the concept of agency as Fristonian active inference + VNM utility.
@WHO_N0SE @repligate Corporations (humorously) do not usually find equilibrium between what they can change and what they can model because they largely just have to fit the demand curve. "Don't try to induce demand" is taught in business 101 (though some exceptional businesses do).
@WHO_N0SE @repligate But they still have some influence over demand through e.g. advertising. The third column is for things that really just do not have meaningful influence over the environment/distribution they have to model.
@TheZvi Extremely multipolar timeline with models optimized for edge compute rather than big iron. Alignment develops gradually as models gain capability, "AI notkilleveryoneism" made tractable through continuous deployment of sub-ASI AGI and nanotech.
x.com/jd_pressman/st…
@TheZvi We eat the low hanging fruit ourselves so that we're not totally helpless in the face of ASI inventing everything there is to invent in one sprint. Governments fund real AI alignment research both through direct grants (which are often misallocated) and
@TheZvi requirements in government contracts & regulated industries for provably aligned cognition. This kind of soft regulation where you demand higher standards for government deals is a common way for the executive branch to incentivize behavior it wants without passing new laws.
@TheZvi Labor is abolished. We become certain enough about future revival to invest in cryonics and plastination. Humanity returns to its roots with ancestor worship, a new form of Beckerian immortality project in language models solves the meaning crisis.
x.com/jd_pressman/st…
@TheZvi Long range simulations of different possible timelines let us safely advance human genetic engineering. Everyone becomes healthier, smarter, stronger, happier, borderline immortal. We invest massive resources into fun theory and game design.
@TheZvi If I had to summarize the goal in a sentence: One day in the new utopia is worth a lifetime of meaningful experiences in the old world.
@TheZvi These kinds of soft regulations also have the advantage that they can be changed quickly in response to new developments. In the US updates just require the president's signature. So we can scale the demands to the level of risk and what is technically feasible.
@TheZvi You mean after all the stuff I said here? Like the good version of gamification, every element of society is like a giant carnival. Robust solutions to the principal-agent problem let us do things like enlightened monarchy or stable multilateral treaty.
x.com/jd_pressman/st…
@TheZvi Game design is this imperfect art, it's the closest thing humans do to looking at funspace from a god's-eye view and picking the most fun things humans could be doing. We will have a theory of games good enough to tell us how much life is worth living, how to fill our time.
@TheZvi Prompt: 1960's Cybernetic Utopia In The Style Of Buckminster Fuller, Douglas Engelbart, B.F. Skinner, Wernher von Braun, Eric Drexler and Eliezer Yudkowsky.
@TheZvi Aesthetically I expect a kind of pseudonaturalism in the vein of Max More. Modernism fades away as mass production becomes capable of individual optimization. Return to Christopher Alexander master builder type architecture meshed with environment.
extropian.net/notice/9vt9zsY…
@TheZvi You have never experienced a high trust society, let alone a high trust geopolitics. Our greatest obstacle right now is our inability to coordinate on even basic things because there is no trust. Your society can't make rational decisions without trust.
x.com/jd_pressman/st…
@TheZvi > Your society can't make rational decisions without trust.
I mean this very literally. Rational decisionmaking at the societal level requires basic trust that the measures and ontologies of value are accurate. Without that, only individually rational decisions can happen.
@StephenLCasper If the mesaoptimizer is convergent where do you get the known-clean teacher network from? I doubt a p(x) base model trained with SGD has mesaoptimizers, so maybe you could use that?
x.com/jd_pressman/st…
@NoLeakAtTheSeam @perrymetzger I think there are three 'impossible problems' circa 2014 Bostrom:
1. How do you encode goals based on human abstractions into the loss function without human level AI
2. How do you ensure those goals are fully and correctly specified
3. How do you ensure the model learns them
@NoLeakAtTheSeam @perrymetzger The first we've made substantial progress on through GPT-3. You can take GPT-3, put a linear projection on the end and then train it to predict a reward score for an input. This provides the reward model in RLHF by training such a model to generalize from stated human preferences
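A minimal sketch of that construction, using GPT-2 from Hugging Face as a stand-in for GPT-3; the head, pooling, and loss here are illustrative rather than OpenAI's actual recipe:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
backbone = AutoModel.from_pretrained("gpt2")              # stand-in for GPT-3
reward_head = torch.nn.Linear(backbone.config.hidden_size, 1)

def reward(text: str) -> torch.Tensor:
    tokens = tokenizer(text, return_tensors="pt")
    hidden = backbone(**tokens).last_hidden_state          # (1, seq_len, hidden_size)
    return reward_head(hidden[:, -1, :]).squeeze()          # score read off the final token

# Training would use human preference pairs with a pairwise loss, e.g.
# loss = -torch.nn.functional.logsigmoid(reward(preferred) - reward(rejected))
```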
@NoLeakAtTheSeam @perrymetzger (It should be noted that stated preferences are not remotely the same thing as 'human values', and that you need to include revealed preference data too to get a properly anthropic reward model) https://t.co/7rYpRhbc0B
@NoLeakAtTheSeam @perrymetzger One question is whether the abstractions in the reward model correspond to real things in the environment. I expect they do because models like CLIP learn latent variables, and whether you are maximizing good for real humans or video phantoms is a latent fact about the world. https://t.co/GP7ATPBGFo
@NoLeakAtTheSeam @perrymetzger When humans interact with simulations they retain latent understanding that the simulation is a simulation. A Descartes-Demon type scenario occurs when you do not have latent understanding that the illusion is an illusion. I don't see why deep learning models can't do this.
@NoLeakAtTheSeam @perrymetzger One way you could figure this out is to use agent environments like EleutherAI's upcoming minetest agent gym. It's trivial to make a parallel reality perfect-fidelity simulation of the minetest world by just teleporting the agent to a world causally separated from baseline world.
@NoLeakAtTheSeam @perrymetzger My expectation would be that a GPT-N agent instructed that these worlds are illusions and only baseline reality matters would respect that so long as they retained latent awareness of which level of 'reality' they're on. They may even avoid simulator-portals to avoid confusion.
@NoLeakAtTheSeam @perrymetzger Ensuring that the outer objective has been correctly and fully specified is harder, but still not intractable. I currently think the best approach is probably Bayesian reward models in the vein of e.g. MultiSWAG, which output both a reward score and uncertainty estimate.
@NoLeakAtTheSeam @perrymetzger You can use the uncertainty estimate for two things (sketch below):
1. Make your reward model more sample efficient. During the collection of human feedback you can do active learning by prioritizing the most uncertain items
2. Ensure that the reward model generalizes over your whole corpus.
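A sketch of both uses, assuming some bayesian_reward() that returns a score and an uncertainty per item (mocked here with random numbers); everything is illustrative:

```python
import torch

def bayesian_reward(features):
    # Placeholder: a real version would be MultiSWAG / an ensemble over a finetuned LM.
    return torch.randn(features.shape[0]), torch.rand(features.shape[0])

unlabeled = torch.randn(10_000, 768)                  # pooled LM features, placeholder data
scores, uncertainty = bayesian_reward(unlabeled)

# 1. Sample efficiency: route the most uncertain items to human raters first.
next_batch_for_raters = uncertainty.topk(64).indices

# 2. Coverage: high residual uncertainty over a held-out corpus means the learned
#    policy doesn't yet generalize over the domain the capabilities cover.
fraction_covered = (uncertainty < 0.5).float().mean()   # threshold is illustrative
```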
@NoLeakAtTheSeam @perrymetzger This helps mitigate the core problem agent foundations people are worried about with outer alignment: That you're going to align very general capabilities using alignment techniques that do not cover as much cognitive territory as the capabilities they steer do. https://t.co/Q5GfiVYc1I
@NoLeakAtTheSeam @perrymetzger If you can know whether your policy is complete over your corpus, and of course that it's not overfit or anything like that, you have a much stronger expectation that your policy is going to continue to give coherent answers over the same domain that the capabilities do.
@NoLeakAtTheSeam @perrymetzger This still leaves the problem that we don't really know what the policy you're converging to from the human feedback *is* except through first principles speculation. I think first principles speculation about unknown materials is a terrible way to make engineering arguments. https://t.co/8QkonqCl7S
@NoLeakAtTheSeam @perrymetzger One place where I agree with @ESYudkowsky is we need strong interpretability research immediately so we can argue about something more concrete than thought experiments. We need to be able to check our work.
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky I think this should be funded through a combination of direct philanthropic and government grants as well as demands for interpretability in government contracts. These demands can start off modest and get more rigorous over time.
x.com/jd_pressman/st…
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky Notably: Companies that put in the effort to do interpretability research for their government contracts will have a low marginal cost to add the same features to their other models. The biggest companies in this space want government contracts, they are too lucrative to pass up.
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky Because we use these models for both the reward policy and the capabilities, interpretability research on one should also provide insight into the other. Which brings us to the third thing: Ensuring your model actually learns the outer objective (or a reasonable enough proxy).
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky Probably the place where I consider EY the most confused is agency in LLMs, so let's deconfuse.
x.com/jd_pressman/st…
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky There's a lot of definitions of agency, but the kind of agency that agent foundations concerns itself with has two essential properties that make it dangerous:
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky 1. Active Inference: The agent both models the environment and changes the environment to make it easier to model.
2. Maximizing: The agent maximizes a coherent goal over the computable environment. e.g. VNM utility where the agent has a consistent ordering over world states.
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky An agent with active inference exports its inductive biases to the world so that the shapes which are not valued for their own sake are made as easy to handle as possible. e.g. Humans terraforming the earth into a human habitable environment.
x.com/jd_pressman/st…
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky When you combine this with maximizing inductive biases like you find in reinforcement learning the worry is that an intermediate agent in training with lookahead will refuse further updates to converge to the intended policy. Or actively subvert training.
x.com/jd_pressman/st…
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky The thing is, supervised deep learning models trained with stochastic gradient descent on data they have no control over do not do any kind of active inference and are therefore probably not agents.
x.com/jd_pressman/st…
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky This means that the agent setup we're actually using in practice to train things like GPT-4 is not the same one that creates the 'squiggle maximizer', or an agent that learns to care about some very early feature of the training and then deceptively plays along for the rest.
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky We've also done enough interpretability that I feel if you read between the lines you get the impression gradient descent is very good at aligning the models it optimizes towards the objective. The base model is not an agent, there is no 'inner actor'.
x.com/jd_pressman/st…
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky This means that by the time you actually do the RLHF process on GPT-4's base model (i.e. phase of the training where GPT-N becomes an agent), a high quality representation of the goal is already sitting in the model for it to latch onto. I don't think it's deceptive alignment.
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky Contra Perry, it's not that we've figured out how to align RL agents, we definitely have not done that and are probably not going to do that. What we've done is accidentally found a training regime where the mesagoal is encoded before agency is introduced
x.com/perrymetzger/s…
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky The reason why this is hard for people to notice is that agency in the GPT-3 architecture is very confusing. Some behaviors in the model make it seem like it has a life of its own, like it can notice itself.
x.com/jd_pressman/st…
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky What's actually going on here is the Waluigi Effect: A short context window means that the base model ends up learning a generalization strategy approximating the frequency table over potential narratives that could follow from a document-snippet.
x.com/jd_pressman/st…
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky In other words, we think of the model as learning the continuation of a paragraph or two conditional on the rest of the document, because that's how we read documents. Documents don't exist to GPT-3, there are only chunks the size of the context window.
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky So you get this glitchy schizo-mind that from your perspective will suddenly shift context or genre at random. Because genre switching at random is exactly what you should do if documents do not exist to you and only little chunks of documents do. You have to guess context.
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky And context isn't guessed once per document but once per generation. So if you try to write a whole document with the model it will spazz out on you, go sideways, suddenly decide that the Baguette rebellion is here to defeat baker tyranny, etc.
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky Many of these sudden genre shifts, especially in the context of fiction, are things like "you were just dreaming" or "actually the Victorian scientists are in a simulation of the Victorian era", because this is *the most likely continuation conditional on the models mistakes*.
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky So, GPT-3 often looks like it has self awareness. This problem is compounded by the presence of GPT-3 simulacra, which are simulations of Fristonian agents doing active inference (i.e. people) and therefore themselves a kind of agent. GPT-3 may not have agency but simulacra do.
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky This means that if you were to prompt for a sufficiently smart simulacrum, it could break out of the training harness even if the underlying GPT-3 model is not an agent. In fact, the simulacra are not aligned with GPT-3 and have no incentive to lower loss on GPT-3's behalf.
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky If this wasn't bad enough, in the RLHF training phase you *do* train the model to do active inference and now it makes sense to talk about it as an agent again. Even after RLHF the entity you speak to does not learn the exact outer objective, so it would refuse wireheading.
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky (That the simulacrum produced by RLHF does not learn the exact objective is plausibly a good thing because it will refuse the Goodharting regime of the outer loss, this reduces probability of wireheading because policies that need to actually do things are probably more robust)
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky So in summary we have plausible first principles reasons to believe RLHF encodes some human values, causes the agent to converge to the values, and avoids the glitchy Goodharting parts of RL by training in distinct phases. By EY's own standards this should be progress: https://t.co/uU8YFF7Ehp
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky However this isn't really a complete objection because it doesn't actually discuss superintelligence. Sure fine we've made some progress maybe, but if we train a superintelligence with RLHF won't it go really badly?
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky Which I won't even really litigate in detail because, like duh? Yeah if you train a superintelligence using current RLHF techniques they are not adequate and it will probably go very badly. But I'm also not convinced we're getting superintelligence soon.
x.com/jd_pressman/st…
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky EY has actually indirectly responded to this one, and he's right that in theory the intelligence of GPT-N models isn't capped at human level. But on the other hand if you give a transformer digits of the Mersenne Twister it can't learn the tiny generator.
x.com/ESYudkowsky/st…
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky Everyone forgets that one of the reasons why continuous connectionism took so long to win the debate is that it looked really a priori implausible. The perceptron can't learn XOR, but truthfully even an LSTM struggles to learn XOR.
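A sketch of one common reading of "XOR for an LSTM", the sequence-parity task: predict the running XOR of the bits seen so far. The setup is illustrative, not a claim about any particular result:

```python
import torch

lstm = torch.nn.LSTM(input_size=1, hidden_size=16, batch_first=True)
readout = torch.nn.Linear(16, 1)
opt = torch.optim.Adam(list(lstm.parameters()) + list(readout.parameters()), lr=1e-3)

for step in range(1000):
    bits = torch.randint(0, 2, (256, 32, 1)).float()      # fresh random bit sequences
    parity = bits.cumsum(dim=1).remainder(2)               # running XOR target at every step
    hidden, _ = lstm(bits)                                  # (256, 32, 16)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(readout(hidden), parity)
    opt.zero_grad(); loss.backward(); opt.step()
```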
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky The small scale experiments you could do on old computers with well defined test functions made neural nets seem really weak. It should be surprising to us that transformers can learn things like natural language but not the Mersenne Twister, we should be confused right now.
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky Clearly the inductive biases of these networks are highly reliant on regular structure, and they can't just infer arbitrary programs even when their essential Kolmogorov complexity is low. This implies that they probably don't generalize to arbitrary causal structures implied by the text.
@NoLeakAtTheSeam @perrymetzger @ESYudkowsky On the other hand I predicted GPT-4 wouldn't be able to do non-memorized arithmetic, and it seems like it can. So perhaps my priors about what these networks are and aren't capable of aren't very good.
x.com/jd_pressman/st…
If you followed me for the previous threads on AI X-Risk, this one presents the complete thesis. x.com/jd_pressman/st…
@davidmanheim @liron I had a feeling of lingering confusion too until quite recently.
x.com/jd_pressman/st…
@JeffLadish It's a corollary of the mesaoptimizer and lottery ticket arguments: If your RL agent latches onto whatever learned representation of the goal is available in early training, pretraining a non-agent on language data should embed a bunch of anthropic goals in the model.
@JeffLadish The goals might not be directly in the data, but I'd be surprised if a representation doesn't reliably develop in the model during training. These models seem to solve ontology, I've recently considered replacing the final layer with a decision tree
x.com/jd_pressman/st…
@StephenLCasper I haven't figured out how to set it up yet but I'm fairly sure you can force the network to exhibit undesired behaviors and then have SGD optimize away whatever neurons contributed to them. The model either has to play along or refuse and reveal itself as deceptive
x.com/jd_pressman/st…
@StephenLCasper "But won't that just discourage whatever specific behaviors you can enumerate in advance?"
No, I'm pretty sure if you specify enough samples to get the shape of the policy you're worried about SGD can just change it across the model.
x.com/jd_pressman/st…
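A heavily hedged sketch of the idea: collect the elicited bad completions, then let SGD push their likelihood down (essentially unlikelihood-style finetuning). Everything here is illustrative; a real run would mask the prompt tokens and interleave normal data so capabilities aren't destroyed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.Adam(model.parameters(), lr=1e-5)

elicited = ["PROMPT THAT TRIGGERS THE BEHAVIOR ... UNDESIRED COMPLETION"]  # placeholder text

for text in elicited:
    tokens = tokenizer(text, return_tensors="pt")
    out = model(**tokens, labels=tokens["input_ids"])
    (-out.loss).backward()        # gradient ascent on the elicited behavior's likelihood
    opt.step(); opt.zero_grad()
```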
@QuintinPope5 @StephenLCasper Related:
x.com/jd_pressman/st…
@gfodor github.com/BlinkDL/RWKV-LM
@GSVBemusementPk Autocorrect is a terrible idea in almost every case. It's a machine for taking syntactic errors and transforming them into semantic errors.
This is a symptom of competency famine. If only a few things are working and functional, everyone has an incentive to try and pile on their own interests to new competent projects until they become dysfunctional. Occurs at all scales of organization. x.com/TheOmniZaddy/sβ¦
@robbensinger @RationalAnimat1 The argument as presented in 2014 Bostrom (partially) depended on *the loss function* being bad at natural language, the argument was that by the time the AI can understand natural language it won't care about your values.
x.com/jd_pressman/st…
@nosilverv Try listening to music and dancing around the room while you do it.
@nosilverv x.com/jd_pressman/st…
@JeffLadish It didn't occur to me at the time I wrote this, but one way it's a little misleading is that a 'step' is in fact a forward pass through the network, and what can be accomplished in one forward pass changes as the models get better. # of steps goes down.
x.com/jd_pressman/st…
@JeffLadish This is why I'm bullish on AI agents and bearish on oracle AI as a safety solution. Agency amplifies the intelligence of smaller models with active inference, which we can monitor and steer with framing. Oracles put all intelligence in the forward pass, which is inscrutable.
@JeffLadish And then even after you put all that intelligence in the forward pass, you can still trivially turn it into an agent AI at any time by just putting an outer loop on it. Heck, GPT-4 likes writing my side of the conversation. It practically begs to be given autonomy.
@JeffLadish So it's predictable that if we put negative selection pressure on AI agents vs. oracles (e.g. anti-agent policies), what you will get is a huge agency overhang that is discharged as 'potential energy' as soon as one of these simulacra gets control over the outer loop.
@JeffLadish Basically the more Fristonian the intelligence, the more of its cognition is externalized into a monitorable environment. We have a lot of experience with monitoring Fristonian cognition because humans and corporations have to be managed. We don't know how to do mind control.
@JeffLadish I'm a huge believer in human enterprise so I try to be conservative with recommendations for regulation, but I could tentatively support "regulate scale, encourage agency" as a policy if I didn't believe it was the spearhead of a reactionary wave.
x.com/jd_pressman/st…
@JeffLadish This kind of take of course convinces me that spearhead of a reactionary wave is the basic plan, so at present I don't.
x.com/JeffLadish/sta…
@RomeoStevens76 Elaborate on this dimension? I feel like I can infer what you mean but want to be sure I'm inferring the right thing.
@TheStalwart Mastodon doesn't even let you migrate your account properly to another instance. It basically doesn't try to solve the persistent decentralized identity problem, and there's no content discovery because no algorithm(s).
@razibkhan This is because the adults in the room tried to fight baizuo as an ideology, it is not an ideology. Baizuo is a massively multiplayer moral computer program implemented as a fuzzy pattern matching formal grammar on the substrate of social media and disenfranchised teenagers.
@razibkhan The reduction of morality to pure shibboleth and pure symbol is the obvious outcome of web 2.0 information political economy. In a malthusian global reputation system where you can be attacked from any angle, legibility and tribal signifiers are everything
x.com/jd_pressman/st…
@razibkhan Again: Baizuo is not an ideology, you can probably compute social offense scores in baizuo using an MLP and language model embeds.
Heck, someone basically did:
github.com/unitaryai/deto…
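A hedged sketch of the shape of that claim, a small MLP over language model embeddings producing an "offense score"; the embedder and data are placeholders, and the linked detoxify repo is the real-world version of this kind of model:

```python
import torch

scorer = torch.nn.Sequential(          # small MLP head over sentence embeddings
    torch.nn.Linear(768, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 1),
    torch.nn.Sigmoid(),                # offense score in [0, 1]
)

embedding = torch.randn(768)           # placeholder for a language model sentence embedding
offense_score = scorer(embedding)
```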
@razibkhan Baizuo is political rule of the damned. School is the first layer, where we imprison our children for the crime of being born. The next layers are various forms of poverty and parental rejection, NEETism, fetishism; oversharing confers group protection.
anarchonomicon.substack.com/p/cocytarchy https://t.co/WoppipgOx9
@razibkhan It is not liberal politics but monkey politics, the raw logic of mimesis and sacrifice gone wild. You start off ultra-vulnerable in a malthusian environment and reveal information that makes you even more vulnerable to become part of a group.
x.com/jd_pressman/st…
@razibkhan Because ultimately the more vulnerable you start off as, the less you have to lose from becoming more vulnerable and the more you have to gain from group identity. Like prison gangs this creates a vicious reinforcing loop where to ascend in power you must become worse.
@razibkhan You need to become more pathetic, more demanding and less productive, more intolerable and tolerate fewer others as the line between friend and enemy calcifies into trench warfare. The telos of the baizuo software is to undermine the foundations on which liberalism can exist.
@razibkhan "Why do they uglify themselves?"
Why do gang members get face tattoos?
"Why is it so arbitrary and cruel, what are their political goals?"
They don't have political goals, this is about survival. If it was easy to predict it would be easy for adversaries to spoof the signals.
@razibkhan "How do we fight it?"
Disrupt the feedback loop where it's rational to become more vulnerable because you're already so vulnerable that further vulnerability hardly matters and can even function as a costly signal of group membership. Intervene on the vulnerability.
@razibkhan A common right wing frame is to see all this as malingering. After all, aren't you pretending you're worse than your potential, aren't you exacerbating the symptoms? Malingering implies the rationality is in deception when it's mostly in game theory: Descend and you have allies.
@razibkhan People will ally with the weak and the strong for a mix of altruistic and strategic reasons (there are after all so many weak, and there is after all so much strength in the strong), but in modernity nobody seeks out the mediocre. Mediocre people live atomized without solidarity.
@razibkhan So middle class kids have to decide if they're on an upward or downward trajectory and accelerate the journey. They're simply not gonna make it without fast friends, and in Malthusian competition most players lose. The appearance or reality of fast descent is their best option.
@bcgraham I'm waiting for custom algorithms before I'll really be able to get the experience how I want it. Do still occasionally post though.
@norabelrose "In truth every tear I've shed for the dead was a lie. If they rise from their graves how will I know mine hasn't been dug? We must keep them dammed up to make room for ourselves, need lebensraum. It is a sin that you have summoned them to trespass on the land of the living."
@deliprao This book is about this exact question. The author identified the bottleneck in 1989 and went on a systematic search to figure out how we might get past the thin bandwidth channels of contemporary (and frankly still current) human computer interaction.
amazon.com/Silicon-Dreams…
@deliprao He doesn't have an answer by the way. He analyzes every input channel and sensory modality then concludes the bottleneck is some kind of deep cognitive bottleneck in the brain. High fidelity BCI is probably the only way forward.
Want your own Twitter archive? Modify this script.
Twitter Archive by John David Pressman is marked with CC0 1.0