Selfhosting is my hobby but I am also an SRE. I am hesitant to do this because the instruction is "too easy" -- "Simply open your firewall, download and run this installer.sh with sudo on your server and that's it!"[1].
How do I secure the webserver and the data? Where is the data on my disk? How to backup and restore? High availability?
There might be detailed documentation somewhere, or I can even read the code. But these are the important things an open source software should tell its users right off the bat.
1: https://github.com/bluesky-social/pds/blob/main/README.md
It's great that you wrote this up!
One thing I have found with many open source/selfhostable projects is just how much running them yourself can vary. It can go from a simple compose file with everything included to having to dig for obscure services and piece together how they all form the whole.
For example, I recently looked into self hosting Zotero. It is so underdocumented and complex that there is almost no way one could self host it (even for just one user) without that being one's job. So one needs to make a distinction between something being open source and being feasible to use/maintain.
In the end I gave up on Zotero, even though it could have replaced Obsidian Notes, Calibre and Syncthing all at once for me.
> For example, I recently looked into self hosting Zotero. It is so underdocumented and complex that there is almost no way one could self host it
I've come across this a lot too. But what I've found is that it mostly applies to open source projects that offer a hosted paid version, so it kind of makes sense they'll make the experience slightly worse than it could be (consciously or subconsciously), as it pushes people to their hosted solution. I don't particularly like it though.
Doesn't seem to be the case for Zotero specifically, but your comment reminded me that I've noticed this more often lately.
Yeah, I tend to use ease of install for community editions of hosted paid open source projects as the leading indicator of how seriously they invest in (and support) their free/community version.
Self-hosting/mirroring all these Bluesky components is currently a mixed bag as well, though honestly the only outlier is the Relay, which is a beast. I currently have my copy of the PLC, a Jetstream with 2 days of data, and a clone of the app on my laptop that I play with sometimes and/or change things in for an elaborate shitpost of Bluesky Nitro https://bsky.app/profile/alice.mosphere.at/post/3l7bpmmtiop2...
I don't self-host my PDS yet because there is no migration path back yet (but there will be). Though maybe I'll just yolo one day and do it anyways.
This is all academic for me until Bluesky gets the functionality to get an account back onto their main network, for DR if not peace of mind that an "undo" is possible.
Totally understandable. Personally I don't use Bluesky for anything vital, it's just data that the world wouldn't be better/worse without anyways, so I'm gonna go and give it a try even if there is no undo.
I love that people even have the choice, so much better than not even being able to.
This site is extremely snappy. Good work.
Thanks! Its code is available at https://github.com/aliceisjustplaying/whtwnd-blog. I intend to turn this into a template, as the posts are stored on my PDS, on ATProto, using WhtWnd https://whtwnd.com/
(And all of this is a fork of my friend Samuel's blog, https://mozzius.dev, see https://github.com/mozzius/mozzius.dev)
Is it feasible to run a bluesky instance "on prem" and "offline", for instance as an airgapped corporate intranet?
author here, should you have questions!
What's in that 4.5 TB? e.g. message metadata? Message text? Media?
What time window does it cover? A rolling N day window? Everything since year dot?
Can it be pruned? e.g. only data of accounts followed or messages interacted with
Thanks for making yourself available to answer questions! Hopefully this is not a dumb question.
Is plc.directory a single point of failure for BlueSky users who want to take advantage of the benefits of a did:plc? And if so, is that a permanent thing or down the road will there be multiple interoperating did:plc directories?
yes it's a SPOF. not sure about the second question, but i do know there are plans to transfer its ownership to an independent foundation
Transferring to an independent org is what we're talking about now, yes.
The backstory to PLC is that we picked up the DID standard and looked for an existing registry-method that would satisfy requirements¹. None of them really did. We then surveyed mechanisms for decentralized operation: DHTs, open blockchains, permissioned blockchains, and federated databases. Of them, the two blockchain variants seemed perhaps promising, but still premature since (as of 2022) there's cost variability due to load and in some cases bad transaction latency (eg 10 minutes).
We decided the best option was to create PLC, which matches all of the requirements except for longterm meta governance. The way we designed it was to make the registry mechanics transferable to a different protocol in the future, so that if for instance we decided (say) a DHT was suitable (it's not) we'd be able to use the same identifiers but change resolution and mutations to a new process. Then we started talking to other SMEs to get their take.
Ultimately the solution that's gotten the most favorable response has been setting up an ICANN-style independent organization to operate it. This can be joined with a couple of interesting systems, such as mirrors which tail a certificate-transparency-style audit log, and which could even serve as transaction witnesses to indicate when the core registry might be rejecting updates ("write censorship").
What can I say, some things take time and stakeholder-building. Look up the history of DNS and Network Solutions Inc for a bit of a wild ride that people have forgotten about. One other thing I should point out is that the DID spec enables multiple registry methods. Atproto currently supports did:web, and if other methods show up which satisfy the requirements then we are interested.
¹ Secure against manipulation by the registry operators, longterm meta governance, highly available, reasonable transaction latency, reliably low cost that's not dogged by token speculation, low ecological impact.
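To make the mirror idea a bit more concrete, here is a minimal Python sketch of a hash-chained, append-only log. It is not PLC's actual operation format or API: the entry layout and the names `entry_hash`, `append_entry`, `verify_chain` and `extends` are assumptions made up for illustration. It only shows why a mirror that tails such a log can notice rewritten history; witnessing rejected writes would need more machinery than this.

```python
import hashlib
import json
from typing import List, Optional


def entry_hash(entry: dict) -> str:
    """Deterministic SHA-256 over a JSON-serializable log entry."""
    encoded = json.dumps(entry, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_entry(log: List[dict], payload: dict) -> None:
    """Append a payload, linking it to the hash of the previous entry."""
    prev: Optional[str] = entry_hash(log[-1]) if log else None
    log.append({"prev": prev, "payload": payload})


def verify_chain(log: List[dict]) -> bool:
    """Check that every entry points at the hash of the entry before it."""
    expected_prev: Optional[str] = None
    for entry in log:
        if entry["prev"] != expected_prev:
            return False
        expected_prev = entry_hash(entry)
    return True


def extends(mirror_copy: List[dict], registry_log: List[dict]) -> bool:
    """True if the registry's log is a valid chain that only adds
    entries on top of what the mirror has already seen."""
    seen = len(mirror_copy)
    return registry_log[:seen] == mirror_copy and verify_chain(registry_log)
```

A mirror would keep its own copy and call `extends(mirror_copy, registry_log)` on each sync; a `False` result means the registry changed or dropped something the mirror had already recorded.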
Hey pfraze, forgive my ignorance but what role does DID serve that DNS doesn't? My favorite part about bsky is using a TXT record to prove that I control my domain for username purposes. What's the downside to just generating a keypair, and using the fingerprint of the public key as my identity? (Maybe with some affordance for key rotation vis a vis KERI*) Not doubting you all weighed every possibility, just wondering what I'm missing
*Key Event Receipt Infrastructure
Not Paul, but a DID is a stable ID over time, whereas DNS is not. This lets you change your handle without the network losing track of who you are. I was @steveklabnik.bsky.social before I was @steveklabnik.com, and when I made the switch, all of my previous stuff was still there.
This is a fun party trick in some sense, but also a real meaningful feature in another. If I ever decide to move from steveklabnik.com to steve.klabnik.com, a thing I have been considering for a few years, my stuff on @proto/Bluesky will be one of the only services that doesn't have the issue that's kept me from pulling the trigger: updating the entire world that that's where I am now.
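A rough Python sketch of that indirection, in case it helps. atproto handles can be verified with a TXT record at `_atproto.<handle>` whose value is `did=<the DID>`; everything else here (the function names, the in-memory `posts_by_did` store, the example DID) is hypothetical, and there is no real DNS lookup, just the parsing step. Treating conflicting records as a failure is a choice made for this sketch.

```python
from typing import Dict, List, Optional


def did_from_txt_values(values: List[str]) -> Optional[str]:
    """Pick the DID out of the TXT values found at _atproto.<handle>.

    Returns None when no value, or more than one distinct value,
    starts with "did=".
    """
    dids = {v[len("did="):] for v in values if v.startswith("did=")}
    return dids.pop() if len(dids) == 1 else None


# Hypothetical stand-in for the network: content is keyed by DID,
# never by handle.
posts_by_did: Dict[str, List[str]] = {
    "did:plc:example123": ["hello world"],
}


def posts_for(txt_values: List[str]) -> List[str]:
    """Resolve TXT values to a DID, then look up that DID's posts."""
    did = did_from_txt_values(txt_values)
    return posts_by_did.get(did, []) if did else []


# The old handle and the new handle each publish the same DID,
# so both resolve to the same posts.
old_handle_txt = ["did=did:plc:example123"]
new_handle_txt = ["v=spf1 -all", "did=did:plc:example123"]
same_posts = posts_for(old_handle_txt) == posts_for(new_handle_txt)
```

Since nothing is stored under the handle itself, switching domains only means publishing the same `did=` value under the new name.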
DIDs are stable only in the context of a specific 'verifiable data registry' as the spec puts it.
https://www.w3.org/TR/did-core/#dfn-verifiable-data-registry
DIDs delegate trust and authority to a data registry, in exactly the same way that DNS delegates trust and authority to ~ICANN.
The system model is exactly the same. The difference is only in the properties of the authoritative entity.
That's a good point: I was speaking in a more social manner. Because domains are human-readable, they tend to be used for humans. Bluesky could have chosen to just use domains, but I personally prefer that we have the additional layer of indirection. Plus like, you have the ability (at the low level, not really exposed in the UI in any meaningful way) to be multiple people: I can associate multiple domains with my DID.
That said, you're not wrong that a registry is a registry.
Yes! And if this were not the case then account portability between PDS hosts would be really challenging. Same logic as keeping your phone number when you switch cell carriers
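For the portability point, a small Python sketch of what moving between PDS hosts looks like from the DID document's side. The `#atproto_pds` service entry with type `AtprotoPersonalDataServer` is how atproto DID documents point at a PDS, as far as I know; the DID, handle and host URLs below are made-up examples, and `pds_endpoint` is a hypothetical helper, not a library call.

```python
from typing import Optional


def pds_endpoint(did_doc: dict) -> Optional[str]:
    """Return the PDS URL advertised in a DID document, if any."""
    for service in did_doc.get("service", []):
        if service.get("id", "").endswith("#atproto_pds"):
            return service.get("serviceEndpoint")
    return None


def with_new_pds(did_doc: dict, new_url: str) -> dict:
    """Copy of the DID document pointing at a different PDS host."""
    moved = dict(did_doc)
    moved["service"] = [
        {**s, "serviceEndpoint": new_url}
        if s.get("id", "").endswith("#atproto_pds") else s
        for s in did_doc.get("service", [])
    ]
    return moved


# Example document; every value here is illustrative.
before = {
    "id": "did:plc:example123",
    "alsoKnownAs": ["at://alice.example.com"],
    "service": [{
        "id": "#atproto_pds",
        "type": "AtprotoPersonalDataServer",
        "serviceEndpoint": "https://pds.old-host.example",
    }],
}
after = with_new_pds(before, "https://pds.new-host.example")
```

The `id` and `alsoKnownAs` fields are untouched by the move; only the service endpoint changes, which is the phone-number-porting analogy in data form.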
What's the difference between social-app and the AppView?
social-app is the client side, AppView is the backend api surface
How are Direct Messages implemented in Bluesky if anyone can access a firehose of all network activity?
DMs are currently 1:1 only and closed source. They are working on/planning to build proper E2EE DMs that support group chats.
I found it interesting that it's very difficult, almost impossible, to get real Bluesky stats
This site tries but has limits:
* https://bsky.jazco.dev/stats
They broke 14 million yesterday and it seems to be snowballing now since the election:
* https://bsky.app/profile/jaz.bsky.social/post/3laetwhztdk2x
https://bskycharts.edavis.dev/ is a good starting point for a number of charts
How do I ask the mods to swap out the link to the actual post instead of my blog's front page?
(...also, the title, as the original has the caveat)
It's likely the correct page was submitted. The correct page includes a canonical link in the HTML:
HN will replace submission links with the canonical link if it's found.
oh. time to look at the code of my blog...
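For anyone wanting to check their own pages, a minimal Python sketch of looking for a canonical link using only the standard library's `html.parser`. This is not HN's code, just one way to do the lookup; the class name, the helper `find_canonical` and the sample URLs are all made up for illustration.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel can hold several space-separated tokens
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical = attr["href"]


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical


# Illustrative page: a post whose canonical link points at the front page.
sample_page = """
<html><head>
  <title>Some post</title>
  <link rel="stylesheet" href="/style.css">
  <link rel="canonical" href="https://blog.example/">
</head><body>post body</body></html>
"""
declared = find_canonical(sample_page)
```

If `declared` comes back as the front page rather than the post's own URL, that would explain a submission link being rewritten.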
@dang a better URL would be https://alice.bsky.sh/post/3laega7icmi2q
(I can't tell if Dan has an alert set up on his handle or whether he just sees everything, but hopefully that works :))
dang doesn't have an alert and he doesn't see everything. https://news.ycombinator.com/item?id=41317232 The official way to contact the mods is in the footer, i.e. email hn@ycombinator.com
Ah thanks, good to know. I guess I've just been lucky with it and developed a superstition that it works.
He is also extremely active here, so there's a good chance he reads and responds to a random comment without an email. But email is the approved (and fastest) way to go about it
will email, thanks
thanks!
Fixed now!
[flagged]
I'm sure there are HNers who built desktops with 8TB or 16TB hard drives, and have not (yet) needed the space for as many games and media as expected.
8TB WD CMR is like $99, 2x48GB of DDR5 is ~$250. Memory and storage are currently way cheaper than many think they are.
didn't say it was cheap!
But why is it required? Do you really need a copy of everyone's data locally? If the only way to self-host bluesky is to have an entire copy of the entire database, that seems like it's really bad from a scaling perspective.
"self host an entire copy of all user data" is a pretty cool capability to have, kind of proof that the infrastructure is really open and forkable. you seem to have misunderstood OPs goals. Serving your own data from a personal data server is a much less arduous affair.
What else would "self-hosting all of Bluesky" mean other than a copy of the entire site? If you just want to participate in the network, host a PDS, which only stores your own posts.
Surely there's some middle ground between only hosting your own data and being reliant on another site to keep track of your following / followers and hosting a duplicate copy of the entire network?
For sure. If you just want to host your own data, you can do that. A PDS for you and maybe some friends is very small and cheap to host.
My understanding though is that having a PDS on its own is useless without an AppView to collect the data from the relay? Or am I misunderstanding the architecture here? https://docs.bsky.app/docs/advanced-guides/federation-archit...
I'm talking about the case where you wanted to run your own PDS and use all of the other infrastructure being run by Bluesky.
If you fully want your own copy of everything, then you'd want to run a copy of everything. But you don't have to. It really depends on what your goals are. That's why the post is about the maximal scenario. "Just your own PDS" is the minimalist scenario. But I think it's the one that makes sense for 95% of users who want to self-host.
Uh, it is not required. You can run only a PDS if you want to self host your data and everything will work.
But it is indeed very cool that you can actually host a relay if you want (for fun, learning, or whatever reason)
Ten terabytes of spinning rust is only $100-$300 or so, that's not bad at all.
My point is not the current size, it's the eventual size if bluesky succeeds. Facebook ingests 100TB/day. Self-hosting a bluesky relay isn't (won't be) a thing.
It could be a thing. Not for individual tinkerers but for companies. The fact that today, with already 14 million users, it is still possible for an individual to host it is amazing.